
Attribute Amazon EMR on EC2 costs to your end-users

2024-08-27 Raj Patel

Post Syndicated from Raj Patel original https://aws.amazon.com/blogs/big-data/attribute-amazon-emr-on-ec2-costs-to-your-end-users/

Amazon EMR on EC2 is a managed service that makes it straightforward to run big data processing and analytics workloads on AWS. It simplifies the setup and management of popular open source frameworks like Apache Hadoop and Apache Spark, allowing you to focus on extracting insights from large datasets rather than the underlying infrastructure. With Amazon EMR, you can take advantage of the power of these big data tools to process, analyze, and gain valuable business intelligence from vast amounts of data.

Cost optimization is one of the pillars of the Well-Architected Framework. It focuses on avoiding unnecessary costs, selecting the most appropriate resource types, analyzing spend over time, and scaling in and out to meet business needs without overspending. An optimized workload maximizes the use of all available resources, delivers the desired outcome at the most cost-effective price point, and meets your functional needs.

The current Amazon EMR pricing page shows the estimated cost of the cluster. You can also use AWS Cost Explorer to get more detailed information about your costs. These views give you an overall picture of your Amazon EMR costs. However, you may need to attribute costs at the individual Spark job level. For example, you might want to know the usage cost in Amazon EMR for the finance business unit. Or, for chargeback purposes, you might need to aggregate the cost of Spark applications by functional area. After you have allocated costs to individual Spark jobs, this data can help you make informed decisions to optimize your costs. For instance, you could choose to restructure your applications to utilize fewer resources. Alternatively, you might opt to explore different pricing models like Amazon EMR on EKS or Amazon EMR Serverless.

In this post, we share a chargeback model that you can use to track and allocate the costs of Spark workloads running on Amazon EMR on EC2 clusters. We describe an approach that assigns Amazon EMR costs to different jobs, teams, or lines of business. You can use this feature to distribute costs across various business units. This can assist you in monitoring the return on investment for your Spark-based workloads.

Solution overview

The solution is designed to help you track the cost of your Spark applications running on EMR on EC2. It can help you identify cost optimizations and improve the cost-efficiency of your EMR clusters.

The proposed solution uses a scheduled AWS Lambda function that operates on a daily basis. The function captures usage and cost metrics, which are subsequently stored in Amazon Relational Database Service (Amazon RDS) tables. The data stored in the RDS tables is then queried to derive chargeback figures and generate reporting trends using Amazon QuickSight. The utilization of these AWS services incurs additional costs for implementing this solution. Alternatively, you can consider an approach that involves a cron-based agent script installed on your existing EMR cluster, if you want to avoid the use of additional AWS services and associated costs for building your chargeback solution. This script stores the relevant metrics in an Amazon Simple Storage Service (Amazon S3) bucket, and uses Python Jupyter notebooks to generate chargeback numbers based on the data files stored in Amazon S3, using AWS Glue tables.

The following diagram shows the current solution architecture.

The workflow consists of the following steps:

A Lambda function gets the following parameters from Parameter Store, a capability of AWS Systems Manager:

{
  "yarn_url": "http://dummy.compute-1.amazonaws.com:8088/ws/v1/cluster/apps",
  "tbl_applicationlogs_lz": "public.emr_applications_execution_log_lz",
  "tbl_applicationlogs": "public.emr_applications_execution_log",
  "tbl_emrcost": "public.emr_cluster_usage_cost",
  "tbl_emrinstance_usage": "public.emr_cluster_instances_usage",
  "emrcluster_id": "j-xxxxxxxxxx",
  "emrcluster_name": "EMR_Cost_Measure",
  "emrcluster_role": "dt-dna-shared",
  "emrcluster_linkedaccount": "xxxxxxxxxxx",
  "postgres_rds": {
    "host": "xxxxxxxxx.amazonaws.com",
    "dbname": "postgres",
    "user": "postgresadmin",
    "secretid": "postgressecretid"
  }
}

The Lambda function extracts Spark application run logs from the EMR cluster using the Resource Manager API. The following metrics are extracted as part of the process: vcore-seconds, memory MB-seconds, and storage GB-seconds.
The Lambda function captures the daily cost of EMR clusters from Cost Explorer.
The Lambda function also extracts EMR On-Demand and Spot Instance usage data using the Amazon Elastic Compute Cloud (Amazon EC2) Boto3 APIs.
The Lambda function loads these datasets into an RDS database.
The cost of running a Spark application is determined by the amount of CPU resources it uses, compared to the total CPU usage of all Spark applications. This information is used to distribute the overall cost among different teams, business lines, or EMR queues.
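The extraction step above can be sketched as a mapping from one record of the YARN ResourceManager `/ws/v1/cluster/apps` response to the columns of the application log table. This is a minimal sketch, not the repository's actual code: the function names, the sample record, and the choice to pass the EMR cluster ID in from the parameters are assumptions made for illustration.

```python
from datetime import datetime, timezone


def ms_to_utc(epoch_ms):
    """YARN reports times as epoch milliseconds; 0 means the app has not finished."""
    if not epoch_ms:
        return None
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


def to_log_row(app, collect_date, emr_cluster_id):
    """Map one YARN app record to the log-table columns (hypothetical helper)."""
    return {
        "appdatecollect": collect_date,
        "app_id": app["id"],
        "app_name": app["name"],
        "queue": app["queue"],
        "job_state": app["state"],
        "job_status": app["finalStatus"],
        "starttime": ms_to_utc(app["startedTime"]),
        "endtime": ms_to_utc(app["finishedTime"]),
        "runtime_seconds": app["elapsedTime"] // 1000,  # elapsedTime is in ms
        "vcore_seconds": app["vcoreSeconds"],
        "memory_seconds": app["memorySeconds"],
        "running_containers": app.get("runningContainers", -1),
        "rm_clusterid": emr_cluster_id,
    }


# Illustrative response body; in the Lambda function this would come from an
# HTTP GET against the yarn_url parameter.
sample_response = {
    "apps": {
        "app": [
            {
                "id": "application_1724630000000_0001",
                "name": "FIN_CASHRECEIPT_GL_GLDB_MAIN_DLY_LD",
                "queue": "default",
                "state": "FINISHED",
                "finalStatus": "SUCCEEDED",
                "startedTime": 1724630400000,
                "finishedTime": 1724630410000,
                "elapsedTime": 10000,
                "vcoreSeconds": 120,
                "memorySeconds": 491520,
                "runningContainers": -1,
            }
        ]
    }
}

rows = [
    to_log_row(app, "2024-08-26", "j-xxxxxxxxxx")
    for app in sample_response["apps"]["app"]
]
```

Keeping the mapping in a pure function makes it possible to test the column logic without a running cluster.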

The extraction process runs daily, extracting the previous day’s data and storing it in an Amazon RDS for PostgreSQL table. The historical data in the table needs to be purged based on your use case.
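As a rough illustration of the "previous day" window, the following sketch builds the request arguments for the Cost Explorer `GetCostAndUsage` API, filtered by the cluster's cost allocation tag. The helper name and the tag value `emr-cost-demo` are hypothetical, and the choice of metrics and grouping is an assumption rather than the repository's exact query.

```python
from datetime import date, timedelta


def build_cost_request(run_date, tag_key, tag_value):
    """Build GetCostAndUsage arguments covering the day before run_date."""
    start = run_date - timedelta(days=1)
    return {
        # Cost Explorer treats End as exclusive, so this covers exactly one day.
        "TimePeriod": {"Start": start.isoformat(), "End": run_date.isoformat()},
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost", "NetUnblendedCost"],
        "Filter": {"Tags": {"Key": tag_key, "Values": [tag_value]}},
        # Split the result by service (for example, Amazon EMR and Amazon EC2).
        "GroupBy": [{"Type": "DIMENSION", "Key": "SERVICE"}],
    }


request = build_cost_request(date(2024, 8, 27), "cost-center", "emr-cost-demo")
# With AWS credentials available, the call would look like:
#   boto3.client("ce").get_cost_and_usage(**request)
```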

The solution is open source and available on GitHub.

You can use the AWS Cloud Development Kit (AWS CDK) to deploy the Lambda function, RDS for PostgreSQL data model tables, and a QuickSight dashboard to track EMR cluster cost at the job, team, or business unit level.

The following schemas show the tables used in the solution, which QuickSight queries to populate the dashboard.

emr_applications_execution_log_lz or public.emr_applications_execution_log – Storage for daily run metrics for all jobs run on the EMR cluster:
- appdatecollect – Log collection date
- app_id – Spark job run ID
- app_name – Run name
- queue – EMR queue in which job was run
- job_state – Job running state
- job_status – Job run final status (Succeeded or Failed)
- starttime – Job start time
- endtime – Job end time
- runtime_seconds – Runtime in seconds
- vcore_seconds – Consumed vCore CPU in seconds
- memory_seconds – Memory consumed
- running_containers – Containers used
- rm_clusterid – EMR cluster ID
emr_cluster_usage_cost – Captures Amazon EMR and Amazon EC2 daily cost consumption from Cost Explorer and loads the data into the RDS table:
- costdatecollect – Cost collection date
- startdate – Cost start date
- enddate – Cost end date
- emr_unique_tag – EMR cluster associated tag
- net_unblendedcost – Total net unblended daily dollar cost
- unblendedcost – Total unblended daily dollar cost
- cost_type – Daily cost
- service_name – AWS service for which the cost incurred (Amazon EMR and Amazon EC2)
- emr_clusterid – EMR cluster ID
- emr_clustername – EMR cluster name
- loadtime – Table load date/time
emr_cluster_instances_usage – Captures the aggregated resource usage (vCores) and allocated resources for each EMR cluster node, and helps identify the idle time of the cluster:
- instancedatecollect – Instance usage collect date
- emr_instance_day_run_seconds – EMR instance active seconds in the day
- emr_region – EMR cluster AWS Region
- emr_clusterid – EMR cluster ID
- emr_clustername – EMR cluster name
- emr_cluster_fleet_type – EMR cluster fleet type
- emr_node_type – Instance node type
- emr_market – Market type (On-Demand or Spot)
- emr_instance_type – Instance size
- emr_ec2_instance_id – Corresponding EC2 instance ID
- emr_ec2_status – Running status
- emr_ec2_default_vcpus – Allocated vCPU
- emr_ec2_memory – EC2 instance memory
- emr_ec2_creation_datetime – EC2 instance creation date/time
- emr_ec2_end_datetime – EC2 instance end date/time
- emr_ec2_ready_datetime – EC2 instance ready date/time
- loadtime – Table load date/time
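One way to read the idle-time idea is to compare the vCPU-seconds the instances were available for with the vcore-seconds the applications consumed. The sketch below assumes that one YARN vcore corresponds to one default EC2 vCPU, which depends on the cluster's YARN configuration and may not hold for your setup; the function name and sample figures are illustrative only.

```python
def idle_fraction(instances, app_vcore_seconds):
    """Estimate the share of allocated vCPU time that no application used.

    instances: iterable of (emr_instance_day_run_seconds, emr_ec2_default_vcpus)
    app_vcore_seconds: iterable of vcore_seconds values for the same day
    """
    allocated = sum(run_seconds * vcpus for run_seconds, vcpus in instances)
    if allocated == 0:
        return 0.0
    used = sum(app_vcore_seconds)
    # Clamp at zero in case YARN vcores are oversubscribed relative to vCPUs.
    return max(0.0, 1 - used / allocated)


# Two nodes: a 4-vCPU node up all day and an 8-vCPU node up for half a day.
day_instances = [(86400, 4), (43200, 8)]
day_apps = [300000, 45600]
idle = idle_fraction(day_instances, day_apps)
```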

Prerequisites

You must have the following prerequisites before implementing the solution:

An EMR on EC2 cluster.
The EMR cluster must have a unique tag value defined. You can assign the tag directly on the Amazon EMR console or using Tag Editor. The recommended tag key is cost-center along with a unique value for your EMR cluster. After you create and apply user-defined tags, it can take up to 24 hours for the tag keys to appear on your cost allocation tags page for activation.
Activate the tag in AWS Billing. It takes about 24 hours to activate the tag if not done before. To activate the tag, follow these steps:
- On the AWS Billing and Cost Management console, choose Cost allocation tags in the navigation pane.
- Select the tag key that you want to activate.
- Choose Activate.
The Spark application’s name should follow the standardized naming convention. It consists of seven components separated by underscores: <business_unit>_<program>_<application>_<source>_<job_name>_<frequency>_<job_type>. These components are used to summarize the resource consumption and cost in the final report. For example: HR_PAYROLL_PS_PSPROD_TAXDUDUCTION_DLY_LD, FIN_CASHRECEIPT_GL_GLDB_MAIN_DLY_LD, or MKT_CAMPAIGN_CRM_CRMDB_TOPRATEDCAMPAIGN_DLY_LD. The application name must be supplied to the spark-submit command through the --name parameter, following the standardized naming convention. If any of these components don’t have a value, hardcode the values with the following suggested names:
- frequency
- job_type
- Business_unit
The Lambda function should be able to connect to Cost Explorer, connect to the EMR cluster through the Resource Manager APIs, and load data into the RDS for PostgreSQL database. To do this, you need to configure the Lambda function as follows:
- VPC configuration – The Lambda function should be able to access the EMR cluster, Cost Explorer, AWS Secrets Manager, and Parameter Store. If access is not already in place, create a virtual private cloud (VPC) that includes the EMR cluster, then create VPC endpoints for Parameter Store and Secrets Manager and attach them to the VPC. Because there is no VPC endpoint available for Cost Explorer, Lambda needs a private subnet and a route table that send VPC traffic to a public NAT gateway in order to reach Cost Explorer. If your EMR cluster is in a public subnet, you must create a private subnet with a custom route table and a public NAT gateway, which allows the Cost Explorer connection to flow from the VPC private subnet. Refer to How do I set up a NAT gateway for a private subnet in Amazon VPC? for setup instructions, and attach the newly created private subnet to the Lambda function explicitly.
- IAM role – The Lambda function needs to have an AWS Identity and Access Management (IAM) role with the following permissions: AmazonEC2ReadOnlyAccess, AWSCostExplorerFullAccess, and AmazonRDSDataFullAccess. This role will be created automatically during AWS CDK stack deployment; you don’t need to set it up separately.
The AWS CDK should be installed on AWS Cloud9 (preferred) or another development environment such as VS Code or PyCharm. For more information, refer to Prerequisites.
The RDS for PostgreSQL database (v10 or higher) credentials should be stored in Secrets Manager. For more information, refer to Storing database credentials in AWS Secrets Manager.
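The seven-part application naming convention listed in the prerequisites can be checked and split with a few lines of Python. This is a minimal sketch; the `AppName` type and `parse_app_name` function are hypothetical names introduced here, not part of the solution's repository.

```python
from typing import NamedTuple


class AppName(NamedTuple):
    business_unit: str
    program: str
    application: str
    source: str
    job_name: str
    frequency: str
    job_type: str


def parse_app_name(name):
    """Split a Spark application name into its seven underscore-separated parts."""
    parts = name.split("_")
    if len(parts) != len(AppName._fields):
        raise ValueError(
            f"expected {len(AppName._fields)} components, got {len(parts)}: {name!r}"
        )
    return AppName(*parts)


parsed = parse_app_name("FIN_CASHRECEIPT_GL_GLDB_MAIN_DLY_LD")
# parsed.business_unit is "FIN" and parsed.frequency is "DLY"
```

Because the underscore is the separator, an individual component cannot itself contain an underscore; a name such as `FIN_CASH_RECEIPT_...` would be rejected by this check.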

Create RDS tables

Log in to the RDS for PostgreSQL database and manually create the data model tables defined in emr-cost-rds-tables-ddl.sql in the public schema.

Use DBeaver or any compatible SQL client to connect to the RDS instance and validate that the tables have been created.

Deploy AWS CDK stacks

Complete the steps in this section to deploy the following resources using the AWS CDK:

Parameter Store to store required parameter values
IAM role for the Lambda function to help connect to Amazon EMR and underlying EC2 instances, Cost Explorer, CloudWatch, and Parameter Store
Lambda function

Clone the GitHub repo:

git clone git@github.com:aws-samples/attribute-amazon-emr-costs-to-your-end-users.git

Update the following environment parameters in cdk.context.json (this file can be found in the main directory):
1. yarn_url – YARN ResourceManager URL to read job run logs and metrics. This URL should be accessible within the VPC where Lambda would be deployed.
2. tbl_applicationlogs_lz – RDS temp table to store EMR application run logs.
3. tbl_applicationlogs – RDS table to store EMR application run logs.
4. tbl_emrcost – RDS table to capture daily EMR cluster usage cost.
5. tbl_emrinstance_usage – RDS table to store EMR cluster instance usage info.
6. emrcluster_id – EMR cluster instance ID.
7. emrcluster_name – EMR cluster name.
8. emrcluster_tag – Tag key assigned to EMR cluster.
9. emrcluster_tag_value – Unique value for EMR cluster tag.
10. emrcluster_role – Service role for Amazon EMR (EMR role).
11. emrcluster_linkedaccount – Account ID under which the EMR cluster is running.
12. postgres_rds – RDS for PostgreSQL connection details.
13. vpc_id – VPC ID in which the EMR cluster is configured and the cost metering Lambda function would be deployed.
14. vpc_subnets – Comma-separated private subnet IDs associated with the VPC.
15. sg_id – EMR security group ID.

The following is a sample cdk.context.json file after being populated with the parameters:

{
  "yarn_url": "http://dummy.compute-1.amazonaws.com:8088/ws/v1/cluster/apps",
  "tbl_applicationlogs_lz": "public.emr_applications_execution_log_lz",
  "tbl_applicationlogs": "public.emr_applications_execution_log",
  "tbl_emrcost": "public.emr_cluster_usage_cost",
  "tbl_emrinstance_usage": "public.emr_cluster_instances_usage",
  "emrcluster_id": "j-xxxxxxxxxx",
  "emrcluster_name": "EMRClusterName",
  "emrcluster_tag": "EMRClusterTag",
  "emrcluster_tag_value": "EMRClusterUniqueTagValue",
  "emrcluster_role": "EMRClusterServiceRole",
  "emrcluster_linkedaccount": "xxxxxxxxxxx",
  "postgres_rds": {
    "host": "xxxxxxxxx.amazonaws.com",
    "dbname": "dbname",
    "user": "username",
    "secretid": "DatabaseUserSecretID"
  },
  "vpc_id": "xxxxxxxxx",
  "vpc_subnets": "subnet-xxxxxxxxxxx",
  "sg_id": "xxxxxxxxxx"
}

You can choose to deploy the AWS CDK stack using AWS Cloud9 or any other development environment according to your needs. For instructions to set up AWS Cloud9, refer to Getting started: basic tutorials for AWS Cloud9.

In AWS Cloud9, choose File, then Upload Local Files, and upload the project folder.

Deploy the AWS CDK stack with the following code:

cd attribute-amazon-emr-costs-to-your-end-users/
pip install -r requirements.txt
cdk deploy --all

The deployed Lambda function requires two external libraries: psycopg2 and requests. The corresponding layer needs to be created and assigned to the Lambda function. For instructions to create a Lambda layer for the requests module, refer to Step-by-Step Guide to Creating an AWS Lambda Function Layer.

Creation of the psycopg2 package and layer is tied to the Python runtime version of the Lambda function. Provided that the Lambda function uses the Python 3.9 runtime, complete the following steps to create the corresponding layer package for psycopg2:

Download psycopg2_binary-2.9.9-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl from https://pypi.org/project/psycopg2-binary/#files.
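Unzip the wheel into a directory named python, then zip that directory (the archive name python.zip is only an example):
```
unzip psycopg2_binary-2.9.9-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl -d python
zip -r python.zip python
```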
Create a Lambda layer for psycopg2 using the zip file.
Assign the layer to the Lambda function by choosing Add a layer in the deployed function properties.
Validate the AWS CDK deployment.

Your Lambda function details should look similar to the following screenshot.

On the Systems Manager console, validate the Parameter Store content for actual values.

The IAM role details should look similar to the following code, which allows the Lambda function access to Amazon EMR and underlying EC2 instances, Cost Explorer, CloudWatch, Secrets Manager, and Parameter Store:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "ce:GetCostAndUsage",
        "ce:ListCostAllocationTags",
        "ec2:AttachNetworkInterface",
        "ec2:CreateNetworkInterface",
        "ec2:DeleteNetworkInterface",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeInstances",
        "ec2:DescribeNetworkInterfaces",
        "elasticmapreduce:Describe*",
        "elasticmapreduce:List*",
        "ssm:Describe*",
        "ssm:Get*",
        "ssm:List*"
      ],
      "Resource": "*",
      "Effect": "Allow"
    },
    {
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:DescribeLogStreams",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*",
      "Effect": "Allow"
    },
    {
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "arn:aws:secretsmanager:*:*:*",
      "Effect": "Allow"
    }
  ]
}

Test the solution

To test the solution, you can run a Spark job that combines multiple files in the EMR cluster by adding it as separate steps on the cluster. Refer to Optimize Amazon EMR costs for legacy and Spark workloads for more details on how to add the jobs as steps to the EMR cluster.

Use the following sample command to submit the Spark job (emr_union_job.py).
It takes in three arguments:
1. <input_full_path> – The Amazon S3 location of the data file that is read in by the Spark job. The path should not be changed. The input_full_path is s3://aws-blogs-artifacts-public/artifacts/BDB-2997/sample-data/input/part-00000-a0885743-e0cb-48b1-bc2b-05eb748ab898-c000.snappy.parquet
2. <output_path> – The S3 folder where the results are written to.
3. <number of copies to be unioned> – By changing the input to the Spark job, you can make sure the job runs for different amounts of time and also change the number of Spot nodes used.

spark-submit --deploy-mode cluster --name HR_PAYROLL_PS_PSPROD_TAXDUDUCTION_DLY_LD s3://aws-blogs-artifacts-public/artifacts/BDB-2997/scripts/emr_union_job.py s3://aws-blogs-artifacts-public/artifacts/BDB-2997/sample-data/input/part-00000-a0885743-e0cb-48b1-bc2b-05eb748ab898-c000.snappy.parquet s3://<output_bucket>/<output_path>/ 6

spark-submit --deploy-mode cluster --name FIN_CASHRECEIPT_GL_GLDB_MAIN_DLY_LD s3://aws-blogs-artifacts-public/artifacts/BDB-2997/scripts/emr_union_job.py s3://aws-blogs-artifacts-public/artifacts/BDB-2997/sample-data/input/part-00000-a0885743-e0cb-48b1-bc2b-05eb748ab898-c000.snappy.parquet s3://<output_bucket>/<output_path>/ 12

The following screenshot shows the log of the steps run on the Amazon EMR console.

Run the deployed Lambda function from the Lambda console. This loads the daily application log, EMR dollar usage, and EMR instance usage details into their respective RDS tables.

The following screenshot of the Amazon RDS query editor shows the results for public.emr_applications_execution_log.

The following screenshot shows the results for public.emr_cluster_usage_cost.

The following screenshot shows the results for public.emr_cluster_instances_usage.

Cost can be calculated using the preceding three tables based on your requirements. In the following SQL query, you calculate the cost based on relative usage of all applications in a day. You first identify the total vcore-seconds CPU consumed in a day and then find out the percentage share of an application. This drives the cost based on overall cluster cost in a day.

Consider the following example scenario, where 10 applications ran on the cluster for a given day. You would use the following sequence of steps to calculate the chargeback cost:

Calculate the relative percentage usage of each application (consumed vcore-seconds CPU by app/total vcore-seconds CPU consumed).
Now that you have the relative resource consumption of each application, distribute the cluster cost to each application. Let’s assume that the total EMR cluster cost for that date is $400.

app_id	app_name	runtime_seconds	vcore_seconds	% Relative Usage	Amazon EMR Cost ($)
application_00001	app1	10	120	5%	19.83
application_00002	app2	5	60	2%	9.91
application_00003	app3	4	45	2%	7.43
application_00004	app4	70	840	35%	138.79
application_00005	app5	21	300	12%	49.57
application_00006	app6	4	48	2%	7.93
application_00007	app7	12	150	6%	24.78
application_00008	app8	52	620	26%	102.44
application_00009	app9	12	130	5%	21.48
application_00010	app10	9	108	4%	17.84
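The two steps above can be reproduced with a short Python sketch that uses the vcore-seconds from the table and the assumed $400 daily cluster cost. The function name `allocate_cost` is hypothetical; the SQL query in the repository remains the reference implementation.

```python
# vcore_seconds per application, taken from the example table.
vcore_seconds = {
    "application_00001": 120,
    "application_00002": 60,
    "application_00003": 45,
    "application_00004": 840,
    "application_00005": 300,
    "application_00006": 48,
    "application_00007": 150,
    "application_00008": 620,
    "application_00009": 130,
    "application_00010": 108,
}


def allocate_cost(usage, total_cost):
    """Split total_cost across applications in proportion to their vcore-seconds."""
    total_usage = sum(usage.values())
    return {app: total_cost * used / total_usage for app, used in usage.items()}


costs = allocate_cost(vcore_seconds, 400.0)
for app, cost in costs.items():
    share = vcore_seconds[app] / sum(vcore_seconds.values())
    print(f"{app}\t{share:.0%}\t{cost:.2f}")
```

The total is 2,421 vcore-seconds, so application_00004 with 840 vcore-seconds receives 840 / 2,421 of the $400, which is about $138.79.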

A sample chargeback cost calculation SQL query is available on the GitHub repo.

You can use the SQL query to create a report dashboard to plot multiple charts for the insights. The following are two examples created using QuickSight.

The following is a daily bar chart.

The following shows total dollars consumed.

Solution cost

Let’s assume we’re calculating for an environment that runs 1,000 jobs daily, and we run this solution daily:

Lambda costs – One run per day results in 30 Lambda function invocations per month.
Amazon RDS cost – The total number of records in the public.emr_applications_execution_log table for a 30-day month would be 30,000 records, which translates to 5.72 MB of storage. If we consider the other two smaller tables and storage overhead, the overall monthly storage requirement would be approximately 12 MB.

In summary, the solution cost according to the AWS Pricing Calculator is $34.20/year, which is negligible.

Clean up

To avoid ongoing charges for the resources that you created, complete the following steps:

Delete the AWS CDK stacks:
```
cdk destroy --all
```
Delete the QuickSight report and dashboard, if created.

Run the following SQL to drop the tables:

drop table public.emr_applications_execution_log_lz;
drop table public.emr_applications_execution_log;
drop table public.emr_cluster_usage_cost;
drop table public.emr_cluster_instances_usage;

Conclusion

With this solution, you can deploy a chargeback model to attribute costs to users and groups using the EMR cluster. You can also identify options for optimization, scaling, and separation of workloads to different clusters based on usage and growth needs.

You can collect the metrics for a longer duration to observe trends on the usage of Amazon EMR resources and use that for forecasting purposes.

If you have any thoughts or questions, leave them in the comments section.

About the Authors

Raj Patel is an AWS Lead Consultant for Data Analytics solutions based out of India. He specializes in building and modernizing analytical solutions. His background is in data warehouse and data lake architecture, development, and administration. He has worked in the data and analytics field for over 14 years.

Ramesh Raghupathy is a Senior Data Architect with WWCO ProServe at AWS. He works with AWS customers to architect, deploy, and migrate to data warehouses and data lakes on the AWS Cloud. While not at work, Ramesh enjoys traveling, spending time with family, and yoga.

Gaurav Jain is a Sr Data Architect with AWS Professional Services, specialized in big data and helps customers modernize their data platforms on the cloud. He is passionate about building the right analytics solutions to gain timely insights and make critical business decisions. Outside of work, he loves to spend time with his family and likes watching movies and sports.

Dipal Mahajan is a Lead Consultant with Amazon Web Services based out of India, where he guides global customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. He brings extensive experience in software development, architecture, and analytics from industries like finance, telecom, retail, and healthcare.

Copy and mask PII between Amazon RDS databases using visual ETL jobs in AWS Glue Studio

2024-08-26 Monica Alcalde Angel

Post Syndicated from Monica Alcalde Angel original https://aws.amazon.com/blogs/big-data/copy-and-mask-pii-between-amazon-rds-databases-using-visual-etl-jobs-in-aws-glue-studio/

Moving and transforming data between databases is a common need for many organizations. Duplicating data from a production database to a lower or lateral environment and masking personally identifiable information (PII) to comply with regulations enables development, testing, and reporting without impacting critical systems or exposing sensitive customer data. However, manually anonymizing cloned information can be taxing for security and database teams.

You can use AWS Glue Studio to set up data replication and mask PII with no coding required. AWS Glue Studio visual editor provides a low-code graphic environment to build, run, and monitor extract, transform, and load (ETL) scripts. Behind the scenes, AWS Glue handles underlying resource provisioning, job monitoring, and retries. There’s no infrastructure to manage, so you can focus on rapidly building compliant data flows between key systems.

In this post, I’ll walk you through how to copy data from one Amazon Relational Database Service (Amazon RDS) for PostgreSQL database to another, while scrubbing PII along the way using AWS Glue. You will learn how to prepare a multi-account environment to access the databases from AWS Glue, and how to model an ETL data flow that automatically masks PII as part of the transfer process, so that no sensitive information will be copied to the target database in its original form. By the end, you’ll be able to rapidly build data movement pipelines between data sources and targets, that can hide PII in order to protect individual identities, without needing to write code.

Solution overview

The following diagram illustrates the solution architecture:

The solution uses AWS Glue as an ETL engine to extract data from the source Amazon RDS database. Built-in data transformations then scrub columns containing PII using pre-defined masking functions. Finally, the AWS Glue ETL job inserts privacy-protected data into the target Amazon RDS database.

This solution employs multiple AWS accounts. Having multi-account environments is an AWS best practice to help isolate and manage your applications and data. The AWS Glue account shown in the diagram is a dedicated account that facilitates the creation and management of all necessary AWS Glue resources. This solution works across a broad array of connections that AWS Glue supports, so you can centralize the orchestration in one dedicated AWS account.

It is important to highlight the following notes about this solution:

Following AWS best practices, the three AWS accounts discussed are part of an organization, but this is not mandatory for this solution to work.
This solution is suitable for use cases that don’t require real-time replication and can run on a schedule or be initiated through events.

Walkthrough

To implement this solution, this guide walks you through the following steps:

Enable connectivity from the AWS Glue account to the source and target accounts
Create AWS Glue components for the ETL job
Create and run the AWS Glue ETL job
Verify results

Prerequisites

For this walkthrough, we’re using Amazon RDS for PostgreSQL 13.14-R1. Note that the solution will work with other versions and database engines that support the same JDBC driver versions as AWS Glue. See JDBC connections for further details.

To follow along with this post, you should have the following prerequisites:

Three AWS accounts as follows:
1. Source account: Hosts the source Amazon RDS for PostgreSQL database. The database contains a table with sensitive information and resides within a private subnet. For future reference, record the associated virtual private cloud (VPC) ID, security group, and private subnets associated to the Amazon RDS database.
2. Target account: Contains the target Amazon RDS for PostgreSQL database, with the same table structure as the source table, initially empty. The database resides within a private subnet. Similarly, write down the associated VPC ID, security group ID and private subnets.
3. AWS Glue account: This dedicated account holds a VPC, a private subnet, and a security group. As mentioned in the AWS Glue documentation, the security group includes a self-referencing inbound rule for All TCP and TCP ports (0-65535) to allow AWS Glue to communicate with its components.

The following figure shows a self-referencing inbound rule needed on the AWS Glue account security group.

Make sure the three VPC CIDRs do not overlap with each other, as shown in the following table:

	VPC	Private subnet
Source account	10.2.0.0/16	10.2.10.0/24
AWS Glue account	10.1.0.0/16	10.1.10.0/24
Target account	10.3.0.0/16	10.3.10.0/24

The VPC network attributes enableDnsHostnames and enableDnsSupport are set to true on each VPC. For details, see Using DNS with your VPC.
An AWS Identity and Access Management (IAM) role is used for AWS Glue. For instructions, see Create IAM role for AWS Glue.
A user on the AWS Glue account with access to the AWS Management Console and permissions for AWS Glue Studio. See Set up IAM permissions for AWS Glue Studio for instructions.
An Amazon Simple Storage Service (Amazon S3) endpoint on the AWS Glue account. AWS Glue requires this endpoint to store the ETL script. During the S3 endpoint set up, make sure you associate the endpoint with the route table assigned to the private subnet on the AWS Glue account. For details on creating an S3 endpoint, see Amazon VPC Endpoints for Amazon S3.
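The requirement that the three VPC CIDRs do not overlap can be verified with Python's standard `ipaddress` module. This is a small sketch using the CIDR values from the table above; the helper name `overlapping_pairs` is an assumption introduced for illustration.

```python
from ipaddress import ip_network
from itertools import combinations

# VPC CIDRs from the prerequisites table.
vpc_cidrs = {
    "Source account": "10.2.0.0/16",
    "AWS Glue account": "10.1.0.0/16",
    "Target account": "10.3.0.0/16",
}


def overlapping_pairs(cidrs):
    """Return every pair of names whose CIDR ranges overlap."""
    networks = {name: ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (a, b)
        for a, b in combinations(networks, 2)
        if networks[a].overlaps(networks[b])
    ]


conflicts = overlapping_pairs(vpc_cidrs)
if conflicts:
    print("Overlapping VPC CIDRs:", conflicts)
else:
    print("No overlapping VPC CIDRs")
```

VPC peering cannot be established between VPCs with overlapping CIDR blocks, so running a check like this before creating the peering connections can save a failed attempt.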

The following diagram illustrates the environment with all prerequisites:

To streamline the process of setting up the prerequisites, you can follow the directions in the README file on this GitHub repository.

Database tables

For this example, both source and target databases contain a customer table with the exact same structure. The former is prepopulated with data as shown in the following figure:

The AWS Glue ETL job you will create focuses on masking sensitive information within specific columns. These are last_name, email, phone_number, ssn and notes.

If you want to use the same table structure and data, the SQL statements are provided in the GitHub repository.

Step 1 – Enable connectivity from the AWS Glue account to the source and target accounts

When creating an AWS Glue ETL job, provide the AWS IAM role, VPC ID, subnet ID, and security groups needed for AWS Glue to access the JDBC databases. See AWS Glue: How it works for further details.

In our example, the role, groups, and other information are in the dedicated AWS Glue account. However, for AWS Glue to connect to the databases, you need to enable access to source and target databases from your AWS Glue account’s subnet and security group.

To enable access, first you inter-connect the VPCs. This can be done using VPC peering or AWS Transit Gateway. For this example, we use VPC peering. Alternatively, you can use an S3 bucket as an intermediary storage location. See Setting up network access to data stores for further details.

Follow these steps:

Peer AWS Glue account VPC with the database VPCs
Update the route tables
Update the database security groups

Peer AWS Glue account VPC with database VPCs

Complete the following steps in the AWS VPC console:

On the AWS Glue account, create two VPC peering connections as described in Create VPC peering connection, one for the source account VPC, and one for the target account VPC.
On the source account, accept the VPC peering request. For instructions, see Accept VPC peering connection.
On the target account, accept the VPC peering request as well.
On the AWS Glue account, enable DNS Settings on each peering connection. This allows AWS Glue to resolve the private IP address of your databases. For instructions, follow Enable DNS resolution for VPC peering connection.

After completing the preceding steps, the list of peering connections on the AWS Glue account should look like the following figure:
Note that source and target account VPCs are not peered together. Connectivity between the two accounts isn’t needed.

Update subnet route tables

This step enables traffic from the AWS Glue account VPC to the VPC subnets associated with the databases in the source and target accounts.

Complete the following steps in the AWS VPC console:

On the AWS Glue account’s route table, for each VPC peering connection, add one route to each private subnet associated with the database. These routes enable AWS Glue to establish a connection to the databases and limit traffic from the AWS Glue account to only the subnets associated with the databases.
On the source account, in the route table of the private subnets associated with the database, add one route for the VPC peering connection with the AWS Glue account. This route allows return traffic to the AWS Glue account.
Repeat step 2 on the target account’s route table.

For instructions on how to update route tables, see Work with route tables.
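As a rough illustration of the routes described in this section, the sketch below builds one route request per destination subnet rather than one route for the whole peer VPC. The helper name, route table IDs, peering connection ID, and CIDR ranges are all hypothetical placeholders; the dictionary keys match the parameters of the boto3 EC2 `create_route` call.

```python
import ipaddress


def peering_routes(route_table_id, peering_connection_id, destination_cidrs):
    """Build one create_route request per destination CIDR block."""
    routes = []
    for cidr in destination_cidrs:
        # ip_network raises ValueError for malformed CIDR blocks
        # (including blocks with host bits set, such as 10.1.10.5/24).
        network = ipaddress.ip_network(cidr)
        routes.append(
            {
                "RouteTableId": route_table_id,
                "DestinationCidrBlock": str(network),
                "VpcPeeringConnectionId": peering_connection_id,
            }
        )
    return routes


# AWS Glue side: one route per private database subnet (placeholder values).
glue_routes = peering_routes("rtb-glue", "pcx-source", ["10.1.10.0/24", "10.1.11.0/24"])

# Source account side: a return route to the AWS Glue subnet (placeholder values).
source_routes = peering_routes("rtb-source-db", "pcx-source", ["10.0.1.0/24"])
```

Each dictionary could then be passed to an EC2 client in the matching account, for example `ec2.create_route(**route)`.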

Update database security groups

This step is required to allow traffic from the AWS Glue account’s security group to the source and target security groups associated with the databases.

For instructions on how to update security groups, see Work with security groups.

Complete the following steps in the AWS VPC console:

On the source account’s database security group, add an inbound rule with Type PostgreSQL and Source, the AWS Glue account security group.
Repeat step 1 on the target account’s database security group.
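The inbound rule above can be sketched as a request payload. In the console, Type PostgreSQL corresponds to TCP port 5432, and the source is the security group of the AWS Glue account, referenced across accounts by group ID and account ID. The helper name and the IDs used with it are hypothetical; the dictionary keys follow the `IpPermissions` structure of the boto3 EC2 `authorize_security_group_ingress` call.

```python
def postgres_ingress_rule(glue_security_group_id, glue_account_id, port=5432):
    """Build an IpPermissions list that admits PostgreSQL traffic from one security group."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "UserIdGroupPairs": [
                {
                    "GroupId": glue_security_group_id,  # AWS Glue account security group
                    "UserId": glue_account_id,          # owner of that security group
                    "Description": "PostgreSQL from the AWS Glue account",
                }
            ],
        }
    ]
```

A call such as `ec2.authorize_security_group_ingress(GroupId=database_sg_id, IpPermissions=postgres_ingress_rule(...))`, run in the source and then the target account, would apply the rule; `database_sg_id` is a placeholder.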

The following diagram shows the environment with connectivity enabled from the AWS Glue account to the source and target accounts:

Step 2 – Create AWS Glue components for the ETL job

The next task is to create the AWS Glue components to synchronize the source and target database schemas with the AWS Glue Data Catalog.

Follow these steps:

Create an AWS Glue Connection for each Amazon RDS database.
Create AWS Glue Crawlers to populate the Data Catalog.
Run the crawlers.

Create AWS Glue connections

Connections enable AWS Glue to access your databases. Their main benefit is reuse: you define the connection details once and then select the connection when creating jobs in AWS Glue Studio, instead of entering the details manually each time. This makes job creation faster and more consistent.

Complete these steps on the AWS Glue account:

On the AWS Glue console, choose the Data connections link on the navigation pane.
Choose Create connection and follow the instructions in the Create connection wizard:
1. In Choose data source, choose JDBC as data source.
2. In Configure connection:
  - For JDBC URL, enter the JDBC URL for the source database. For PostgreSQL, the syntax is:
```
jdbc:postgresql://database-endpoint:5432/database-name
```
    You can find the database-endpoint on the Amazon RDS console on the source account.
  - Expand Network options. For VPC, Subnet and Security group, select the ones in the centralized AWS Glue account, as shown in the following figure:
3. In Set Properties, for Name enter Source DB connection-Postgresql.
Repeat steps 1 and 2 to create the connection to the target Amazon RDS database. Name the connection Target DB connection-Postgresql.

Now you have two connections, one for each Amazon RDS database.
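For readers who script their setup, here is a minimal sketch of the same connection definition as a `ConnectionInput` payload for the boto3 AWS Glue `create_connection` call. The helper names and every argument value are assumptions for illustration; in particular, passing a plaintext password is shown only for brevity, and you would normally keep credentials in a secret store.

```python
def jdbc_postgres_url(endpoint, database, port=5432):
    """Build a PostgreSQL JDBC URL in the form shown in the wizard step above."""
    return f"jdbc:postgresql://{endpoint}:{port}/{database}"


def connection_input(name, endpoint, database, username, password,
                     subnet_id, security_group_ids, availability_zone):
    """Build a ConnectionInput dict for a JDBC connection."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_postgres_url(endpoint, database),
            "USERNAME": username,
            "PASSWORD": password,  # placeholder; prefer a managed secret in practice
        },
        # Network settings come from the centralized AWS Glue account.
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }
```

With a boto3 Glue client, the payload would be used as `glue.create_connection(ConnectionInput=connection_input(...))`, once for the source database and once for the target.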

Create AWS Glue crawlers

AWS Glue crawlers allow you to automate data discovery and cataloging from data sources and targets. Crawlers explore data stores and auto-generate metadata to populate the Data Catalog, registering discovered tables in the Data Catalog. This helps you to discover and work with the data to build ETL jobs.

To create a crawler for each Amazon RDS database, complete the following steps on the AWS Glue account:

On the AWS Glue console, choose Crawlers in the navigation pane.
Choose Create crawler and follow the instructions in the Add crawler wizard:
1. In Set crawler properties, for Name enter Source PostgreSQL database crawler.
2. In Choose data sources and classifiers, choose Not yet.
3. In Add data source, for Data source choose JDBC, as shown in the following figure:
4. For Connection, choose Source DB connection-Postgresql.
5. For Include path, enter the path of your database including the schema. For our example, the path is sourcedb/cx/% where sourcedb is the name of the database, and cx the schema with the customer table.
6. In Configure security settings, choose the AWS IAM service role created as part of the prerequisites.
7. In Set output and scheduling, since we don’t have a database yet in the Data Catalog to store the source database metadata, choose Add database and create a database named sourcedb-postgresql.
Repeat steps 1 and 2 to create a crawler for the target database:
1. In Set crawler properties, for Name enter Target PostgreSQL database crawler.
2. In Add data source, for Connection, choose Target DB connection-Postgresql, and for Include path enter targetdb/cx/%.
3. In Add database, for Name enter targetdb-postgresql.
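The crawler settings entered in the wizard map onto a small request payload. The sketch below builds the arguments for the boto3 AWS Glue `create_crawler` call using the names from this walkthrough; the helper name and the role name passed to it are hypothetical, and the sketch assumes the Data Catalog database already exists.

```python
def crawler_request(name, role, connection_name, include_path, catalog_database):
    """Build create_crawler arguments for a crawler with a single JDBC target."""
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": catalog_database,  # Data Catalog database that receives the tables
        "Targets": {
            "JdbcTargets": [
                {"ConnectionName": connection_name, "Path": include_path}
            ]
        },
    }


source_crawler = crawler_request(
    name="Source PostgreSQL database crawler",
    role="example-glue-service-role",  # hypothetical role name
    connection_name="Source DB connection-Postgresql",
    include_path="sourcedb/cx/%",
    catalog_database="sourcedb-postgresql",
)
```

The result could be passed to a Glue client as `glue.create_crawler(**source_crawler)`; the target crawler differs only in its name, connection, include path, and catalog database.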

Now you have two crawlers, one for each Amazon RDS database, as shown in the following figure:

Run the crawlers

Next, run the crawlers. When you run a crawler, the crawler connects to the designated data store and automatically populates the Data Catalog with metadata table definitions (columns, data types, partitions, and so on). This saves time over manually defining schemas.

From the Crawlers list, select both Source PostgreSQL database crawler and Target PostgreSQL database crawler, and choose Run.

When finished, each crawler creates a table in the Data Catalog. These tables are the metadata representation of the customer tables.
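If you would rather start the crawlers from a script, the following sketch starts each one and polls until it returns to the READY state. It assumes `glue` behaves like a boto3 AWS Glue client; the helper name, the polling interval, and the injectable `sleep` argument are illustrative choices, not part of the walkthrough.

```python
import time


def run_crawlers(glue, names, poll_seconds=15, sleep=time.sleep):
    """Start the named crawlers and wait until each one is READY again.

    Returns a dict that maps each crawler name to the status of its last
    crawl (for example SUCCEEDED), or None if no status is reported.
    """
    for name in names:
        glue.start_crawler(Name=name)

    results = {}
    pending = set(names)
    while pending:
        sleep(poll_seconds)
        for name in sorted(pending):
            crawler = glue.get_crawler(Name=name)["Crawler"]
            if crawler["State"] == "READY":  # the crawler has finished running
                results[name] = crawler.get("LastCrawl", {}).get("Status")
                pending.discard(name)
    return results
```

A production script would likely add a timeout and error handling; they are omitted here to keep the sketch short.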

You now have all the resources to start creating AWS Glue ETL jobs!

Step 3 – Create and run the AWS Glue ETL Job

The proposed ETL job runs four tasks:

Source data extraction – Establishes a connection to the Amazon RDS source database and extracts the data to replicate.
PII detection and scrubbing.
Data transformation – Adjusts and removes unnecessary fields.
Target data loading – Establishes a connection to the target Amazon RDS database and inserts data with masked PII.

Let’s jump into AWS Glue Studio to create the AWS Glue ETL job.

Sign in to the AWS Glue console with your AWS Glue account.
Choose ETL jobs in the navigation pane.
Choose Visual ETL as shown in the following figure:

Task 1 – Source data extraction

Add a node to connect to the Amazon RDS source database:

Choose AWS Glue Data Catalog from the Sources. This adds a data source node to the canvas.
On the Data source properties panel, select sourcedb-postgresql database and source_cx_customer table from the Data Catalog as shown in the following figure:

Task 2 – PII detection and scrubbing

To detect and mask PII, select Detect Sensitive Data node from the Transforms tab.

Let’s take a deeper look into the Transform options on the properties panel for the Detect Sensitive Data node:

First, you can choose how you want the data to be scanned. You can select Find sensitive data in each row or Find columns that contain sensitive data as shown in the following figure. Choosing the former scans all rows for comprehensive PII identification, while the latter scans a sample for PII location at lower cost.

Selecting Find sensitive data in each row allows you to specify fine-grained action overrides. If you know your data, with fine-grained actions you can exclude certain columns from detection. You can also customize the entities to detect for every column in your dataset and skip entities that you know aren’t in specific columns. This makes your jobs more performant by eliminating unnecessary detection calls for those entities, and lets you perform actions unique to each column and entity combination.

In our example, we know our data and we want to apply fine-grained actions to specific columns, so let’s select Find sensitive data in each row. We’ll explore fine-grained actions further below.

Next, you select the types of sensitive information to detect. Take some time to explore the three different options.

In our example, again because we know the data, let’s select Select specific patterns. For Selected patterns, choose Person’s name, Email Address, Credit Card, Social Security Number (SSN) and US Phone as shown in the following figure. Note that some patterns, such as SSNs, apply specifically to the United States and might not detect PII data for other countries. However, categories applicable to other countries are available, and you can also use regular expressions in AWS Glue Studio to create detection entities to help meet your needs.

Next, select the level of detection sensitivity. Leave the default value (High).
Next, choose the global action to take on detected entities. Select REDACT and enter **** as the Redaction Text.
Next, you can specify fine-grained actions (overrides). Overrides are optional, but in our example, we want to exclude certain columns from detection, scan certain PII entity types on specific columns only, and specify different redaction text settings for different entity types.

Choose Add to specify the fine-grained action for each entity as shown in the following figure:
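To make the interplay between the global action and the fine-grained overrides more concrete, here is a small plain-Python sketch of the idea: every detected entity is replaced with the global redaction text unless a column is excluded or a column and entity pair has its own override. This is only an illustration of the semantics, not how the Detect Sensitive Data transform is implemented; the two regular expressions are deliberately simplified and the entity labels are chosen for this example.

```python
import re

# Simplified, hypothetical patterns; the managed detectors are more thorough.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "USA_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact_row(row, global_text="****", excluded_columns=(), overrides=None):
    """Redact detected entities in each string column of a row.

    overrides maps (column, entity) to the redaction text for that pair.
    """
    overrides = overrides or {}
    redacted = {}
    for column, value in row.items():
        # Excluded columns and non-string values pass through unchanged.
        if column in excluded_columns or not isinstance(value, str):
            redacted[column] = value
            continue
        for entity, pattern in PATTERNS.items():
            replacement = overrides.get((column, entity), global_text)
            value = pattern.sub(replacement, value)
        redacted[column] = value
    return redacted


row = {
    "id": 1,
    "notes": "mail jane@example.com, ssn 123-45-6789",
    "email": "jane@example.com",
    "comment": "call 123-45-6789",
}
masked = redact_row(
    row,
    excluded_columns=("comment",),
    overrides={("email", "EMAIL"): "[email]"},
)
# masked["notes"] is "mail ****, ssn ****" and masked["email"] is "[email]";
# the excluded "comment" column is left untouched.
```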

Task 3 – Data transformation

When the Detect Sensitive Data node runs, it converts the id column to string type and it adds a column named DetectedEntities with PII detection metadata to the output. We don’t need to store such metadata information in the target table, and we need to convert the id column back to integer, so let’s add a Change Schema transform node to the ETL job, as shown in the following figure. This will make these changes for us.

Note: You must select the DetectedEntities Drop checkbox for the transform node to drop the added field.
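Expressed on a single record, the effect of this Change Schema step can be sketched as follows. The function name is hypothetical and the record is a plain dictionary standing in for a row of the job's output.

```python
def change_schema(row):
    """Drop the DetectedEntities metadata and cast id back to an integer."""
    cleaned = {key: value for key, value in row.items() if key != "DetectedEntities"}
    cleaned["id"] = int(cleaned["id"])
    return cleaned


record = {"id": "42", "name": "****", "DetectedEntities": {"name": ["PERSON_NAME"]}}
cleaned_record = change_schema(record)  # {"id": 42, "name": "****"}
```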

Task 4 – Target data loading

The last task for the ETL job is to establish a connection to the target database and insert the data with PII masked:

Choose AWS Glue Data Catalog from the Targets. This adds a data target node to the canvas.
On the Data target properties panel, choose targetdb-postgresql and target_cx_customer, as shown in the following figure.

Save and run the ETL job

From the Job details tab, for Name, enter ETL - Replicate customer data.
For IAM Role, choose the AWS Glue role created as part of the prerequisites.
Choose Save, then choose Run.

Monitor the job until it successfully finishes from Job run monitoring on the navigation pane.

Step 4 – Verify the results

Connect to the Amazon RDS target database and verify that the replicated rows contain the scrubbed PII data, confirming sensitive information was masked properly in transit between databases as shown in the following figure:

And that’s it! With AWS Glue Studio, you can create ETL jobs to copy data between databases and transform it along the way without any coding. Try other types of sensitive information for securing your sensitive data during replication. Also try adding and combining multiple and heterogeneous data sources and targets.

Clean up

To clean up the resources created:

Delete the AWS Glue ETL job, crawlers, Data Catalog databases, and connections.
Delete the VPC peering connections.
Delete the routes added to the route tables, and inbound rules added to the security groups on the three AWS accounts.
On the AWS Glue account, delete associated Amazon S3 objects. These are in the S3 bucket with aws-glue-assets-account-id-region in its name, where account-id is your AWS Glue account ID, and region is the AWS Region you used.
Delete the Amazon RDS databases you created if you no longer need them. If you used the GitHub repository, then delete the AWS CloudFormation stacks.

Conclusion

In this post, you learned how to use AWS Glue Studio to build an ETL job that copies data from one Amazon RDS database to another and automatically detects PII data and masks the data in-flight, without writing code.

By using AWS Glue for database replication, organizations can eliminate manual processes to find hidden PII and bespoke scripting to transform it by building centralized, visible data sanitization pipelines. This improves security and compliance, and speeds time-to-market for test or analytics data provisioning.

About the Author

Monica Alcalde Angel is a Senior Solutions Architect in the Financial Services, Fintech team at AWS. She works with Blockchain and Crypto AWS customers, helping them accelerate their time to value when using AWS. She lives in New York City, and outside of work, she is passionate about traveling.

Encryption in transit over external networks: AWS guidance for NYDFS and beyond

2024-08-21 Aravind Gopaluni

Post Syndicated from Aravind Gopaluni original https://aws.amazon.com/blogs/security/encryption-in-transit-over-external-networks-aws-guidance-for-nydfs-and-beyond/

On November 1, 2023, the New York State Department of Financial Services (NYDFS) issued its Second Amendment (the Amendment) to its Cybersecurity Requirements for Financial Services Companies adopted in 2017, published within Section 500 of 23 NYCRR 500 (the Cybersecurity Requirements; the Cybersecurity Requirements as amended by the Amendment, the Amended Cybersecurity Requirements). In the introduction to its Cybersecurity Resource Center, the Department explains that the revisions are aimed at addressing the changes in the increasing sophistication of threat actors, the prevalence of and relative ease in running cyberattacks, and the availability of additional controls to manage cyber risks.

This blog post focuses on the revision to the encryption in transit requirement under section 500.15(a). It outlines the encryption capabilities and secure connectivity options offered by Amazon Web Services (AWS) to help customers demonstrate compliance with this updated requirement. The post also provides best practices guidance, emphasizing the shared responsibility model. This enables organizations to design robust data protection strategies that address not only the updated NYDFS encryption requirements but potentially also other security standards and regulatory requirements.

The target audience for this information includes security leaders, architects, engineers, and security operations team members, as well as risk, compliance, and audit professionals.

Note that the information provided here is for informational purposes only; it is not legal or compliance advice and should not be relied on as legal or compliance advice. Customers are responsible for making their own independent assessments and should obtain appropriate advice from their own legal and compliance advisors regarding compliance with applicable NYDFS regulations.

500.15 Encryption of nonpublic information

The updated requirement in the Amendment states that:

As part of its cybersecurity program, each covered entity shall implement a written policy requiring encryption that meets industry standards, to protect nonpublic information held or transmitted by the covered entity both in transit over external networks and at rest.
To the extent a covered entity determines that encryption of nonpublic information at rest is infeasible, the covered entity may instead secure such nonpublic information using effective alternative compensating controls that have been reviewed and approved by the covered entity’s CISO in writing. The feasibility of encryption and effectiveness of the compensating controls shall be reviewed by the CISO at least annually.

This section of the Amendment removes the covered entity’s chief information security officer’s (CISO) discretion to approve compensating controls when encryption of nonpublic information in transit over external networks is deemed infeasible. The Amendment mandates that, effective November 2024, organizations must encrypt nonpublic information transmitted over external networks without the option of implementing alternative compensating controls. While the use of security best practices such as network segmentation, multi-factor authentication (MFA), and intrusion detection and prevention systems (IDS/IPS) can provide defense in depth, these compensating controls are no longer sufficient to replace encryption in transit over external networks for nonpublic information.

However, the Amendment still allows for the CISO to approve the use of alternative compensating controls where encryption of nonpublic information at rest is deemed infeasible. AWS is committed to providing industry-standard encryption services and capabilities to help protect customer data at rest in the cloud, offering customers the ability to add layers of security to their data at rest, providing scalable and efficient encryption features. This includes the following services:

Data encryption capabilities available in AWS storage and database services, such as Amazon Elastic Block Store (Amazon EBS), Amazon Simple Storage Service (Amazon S3), Amazon Relational Database Service (Amazon RDS), and Amazon Redshift.
Flexible key management options, including AWS Key Management Service (AWS KMS), which allow you to choose whether to have AWS manage the encryption keys or keep complete control over your keys.
Dedicated, hardware-based cryptographic key storage using AWS CloudHSM, to help you adhere to compliance requirements.

While the above highlights encryption-at-rest capabilities offered by AWS, the focus of this blog post is to provide guidance and best practice recommendations for encryption in transit.

AWS guidance and best practice recommendations

Cloud network traffic encompasses connections to and from the cloud and traffic between cloud service provider (CSP) services. From an organization’s perspective, CSP networks and data centers are deemed external because they aren’t under the organization’s direct control. The connection between the organization and a CSP, typically established over the internet or dedicated links, is considered an external network. Encrypting data in transit over these external networks is crucial and should be an integral part of an organization’s cybersecurity program.

AWS implements multiple mechanisms to help ensure the confidentiality and integrity of customer data during transit and at rest across various points within its environment. While AWS employs transparent encryption at various transit points, we strongly recommend incorporating encryption by design into your architecture. AWS provides robust encryption-in-transit capabilities to help you adhere to compliance requirements and mitigate the risks of unauthorized disclosure and modification of nonpublic information in transit over external networks.

Additionally, AWS recommends that financial services institutions adopt a secure by design (SbD) approach to implement architectures that are pre-tested from a security perspective. SbD helps establish control objectives, security baselines, security configurations, and audit capabilities for workloads running on AWS.

Security and Compliance is a shared responsibility between AWS and the customer. Shared responsibility can vary depending on the security configuration options for each service. You should carefully consider the services you choose because your organization’s responsibilities vary depending on the services used, the integration of those services into your IT environment, and applicable laws and regulations. AWS provides resources such as service user guides and AWS Customer Compliance Guides, which map security best practices for individual services to leading compliance frameworks, including NYDFS.

Protecting connections to and from AWS

We understand that customers place a high priority on privacy and data security. That’s why AWS gives you ownership and control over your data through services that allow you to determine where your content will be stored, secure your content in transit and at rest, and manage access to AWS services and resources for your users. When architecting workloads on AWS, classifying data based on its sensitivity, criticality, and compliance requirements is essential. Proper data classification allows you to implement appropriate security controls and data protection mechanisms, such as Transport Layer Security (TLS) at the application layer, access control measures, and secure network connectivity options for nonpublic information over external networks. When it comes to transmitting nonpublic information over external networks, it’s a recommended practice to identify network segments traversed by this data based on your network architecture. While AWS employs transparent encryption at various transit points, it’s advisable to implement encryption solutions at multiple layers of the OSI model to establish defense in depth and enhance end-to-end encryption capabilities. Although requirement 500.15 of the Amendment doesn’t mandate end-to-end encryption, implementing such controls can provide an added layer of security and can help demonstrate that nonpublic information is consistently encrypted during transit.

AWS offers several options to achieve this. While not every option provides end-to-end encryption on its own, using them in combination helps to ensure that nonpublic information doesn’t traverse open, public networks unprotected. These options include:

Using AWS Direct Connect with IEEE 802.1AE MAC Security Standard (MACsec) encryption
VPN connections
Secure API endpoints
Client-side encryption of data before sending it to AWS

AWS Direct Connect with MACsec encryption

AWS Direct Connect provides direct connectivity to the AWS network through third-party colocation facilities, using a cross-connect between an AWS owned device and either a customer- or partner-owned device. Direct Connect can reduce network costs, increase bandwidth throughput, and provide a more consistent network experience than internet-based connections. Within Direct Connect connections (a physical construct) there will be one or more virtual interfaces (VIFs). These are logical entities and are reflected as industry-standard 802.1Q VLANs on the customer equipment terminating the Direct Connect connection. Depending on the type of VIF, they will use either public or private IP addressing. There are three different types of VIFs:

Public virtual interface – Establish connectivity between AWS public endpoints and your data center, office, or colocation environment.
Transit virtual interface – Establish private connectivity between AWS Transit Gateways and your data center, office, or colocation environment. AWS Transit Gateway is an AWS managed, highly available, and scalable regional network transit hub used to interconnect Amazon Virtual Private Cloud (Amazon VPC) and customer networks.
Private virtual interface – Establish private connectivity between Amazon VPC resources and your data center, office, or colocation environment.

By default, a Direct Connect connection isn’t encrypted from your premises to the Direct Connect location because AWS cannot assume your on-premises device supports the MACsec protocol. With MACsec, Direct Connect delivers native, near line-rate, point-to-point encryption, ensuring that data communications between AWS and your corporate network remain protected. MACsec is supported on 10 Gbps and 100 Gbps dedicated Direct Connect connections at selected points of presence. Using Direct Connect with MACsec-enabled connections and combining it with the transparent physical network encryption offered by AWS from the Direct Connect location through the AWS backbone not only benefits you by allowing you to securely exchange data with AWS, but also enables you to use the highest available bandwidth. For additional information on MACsec support and cipher suites, see the MACsec section in the Direct Connect FAQs.

Figure 1 illustrates a sample reference architecture for securing traffic from corporate network to your VPCs over Direct Connect with MACsec and AWS Transit Gateways.

Figure 1: Sample architecture for using Direct Connect with MACsec encryption

In the sample architecture, you can see that Layer 2 encryption through MACsec only encrypts the traffic from your on-premises systems to the AWS device in the Direct Connect location, and therefore you need to consider additional encryption solutions at Layer 3, 4, or 7 to get closer to end-to-end encryption to the device where you’re comfortable for the packets to be decrypted. In the next section, let’s review an option for using network layer encryption using AWS Site-to-Site VPN.

Direct Connect with Site-to-Site VPN

AWS Site-to-Site VPN is a fully managed service that creates a secure connection between your corporate network and your Amazon VPC using IP security (IPsec) tunnels over the internet. Data transferred between your VPC and the remote network routes over an encrypted VPN connection to help maintain the confidentiality and integrity of data in transit. Each VPN connection consists of two tunnels between a virtual private gateway or transit gateway on the AWS side and a customer gateway on the on-premises side. Each tunnel supports a maximum throughput of up to 1.25 Gbps. See Site-to-Site VPN quotas for more information.

You can use Site-to-Site VPN over Direct Connect to achieve secure IPsec connection with the low latency and consistent network experience of Direct Connect when reaching resources in your Amazon VPCs.

Figure 2 illustrates a sample reference architecture for establishing end-to-end IPsec-encrypted connections between your networks and Transit Gateway over a private dedicated connection.

Figure 2: Encrypted connections between the AWS Cloud and a customer’s network using VPN

While Direct Connect with MACsec and Site-to-Site VPN with IPsec can provide encryption at the data link and network layers respectively, they primarily secure the data in transit between your on-premises network and the AWS network boundary. To further enhance the coverage for end-to-end encryption, it is advisable to use TLS encryption. In the next section, let’s review mechanisms for securing API endpoints on AWS using TLS encryption.

Secure API endpoints

APIs act as the front door for applications to access data, business logic, or functionality from other applications and backend services.

AWS enables you to establish secure, encrypted connections to its services using public AWS service API endpoints. Public AWS owned service API endpoints (AWS managed services like Amazon Simple Queue Service (Amazon SQS), AWS Identity and Access Management (IAM), AWS Key Management Service (AWS KMS), and others) have certificates that are owned and deployed by AWS. By default, requests to these public endpoints use HTTPS. To align with evolving technology and regulatory standards for TLS, as of February 27, 2024, AWS has updated its TLS policy to require a minimum of TLS 1.2, thereby deprecating support for TLS 1.0 and 1.1 versions on AWS service API endpoints across each of our AWS Regions and Availability Zones.

Additionally, to enhance connection performance, AWS has begun enabling TLS version 1.3 globally for its service API endpoints. If you’re using the AWS SDKs or AWS Command Line Interface (AWS CLI), you will automatically benefit from TLS 1.3 after a service enables it.
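If you maintain your own HTTPS clients rather than relying on the AWS SDKs, you may want to pin the same TLS 1.2 floor on the client side. The following minimal sketch uses only the Python standard library; the function name is hypothetical.

```python
import ssl


def tls12_minimum_context():
    """Create a client-side TLS context that refuses TLS 1.0 and 1.1."""
    # create_default_context enables certificate and hostname verification.
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = tls12_minimum_context()
```

The context can be passed to standard library clients, for example `urllib.request.urlopen(url, context=context)`. Because only the minimum version is pinned, TLS 1.3 can still be negotiated when both sides support it.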

While requests to public AWS service API endpoints use HTTPS by default, a few services, such as Amazon S3 and Amazon DynamoDB, allow using either HTTP or HTTPS. If the client or application chooses HTTP, the communication isn’t encrypted. Customers are responsible for enforcing HTTPS connections when using such AWS services. To help ensure secure communication, you can establish an identity perimeter by using the IAM policy condition key aws:SecureTransport in your IAM roles to evaluate the connection and mandate HTTPS usage.
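As an illustration of the `aws:SecureTransport` condition key, the sketch below assembles an identity-based policy document that denies requests sent without TLS. The statement ID, the choice of `s3:*` as the action, and the wildcard resource are assumptions made for this example; scope them to your own services and resources.

```python
import json


def deny_insecure_transport_statement(actions=("s3:*",), resources=("*",)):
    """Build a policy statement that denies requests made over plain HTTP."""
    return {
        "Sid": "DenyInsecureTransport",
        "Effect": "Deny",
        "Action": list(actions),
        "Resource": list(resources),
        # aws:SecureTransport is "false" when the request was not sent over TLS.
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }


policy = {
    "Version": "2012-10-17",
    "Statement": [deny_insecure_transport_statement()],
}
policy_json = json.dumps(policy, indent=2)
```

Because an explicit deny takes precedence over any allow, a statement like this blocks HTTP requests even when another policy grants the same action.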

As enterprises increasingly adopt cloud computing and microservices architectures, teams frequently build and manage internal applications exposed as private API endpoints. Customers are responsible for managing the certificates on private customer-owned endpoints. AWS helps you deploy private customer-owned identities (that is, TLS certificates) through the use of AWS Certificate Manager (ACM) private certificate authorities (PCA) and the integration with AWS services that offer private customer-owned TLS termination endpoints.

ACM is a fully managed service that lets you provision, manage, and deploy public and private TLS certificates for use with AWS services and internal connected resources. ACM minimizes the time-consuming manual process of purchasing, uploading, and renewing TLS certificates. You can provide certificates for your integrated AWS services either by issuing them directly using ACM or by importing third-party certificates into the ACM management system. ACM offers two options for deploying managed X.509 certificates. You can choose the best one for your needs.

AWS Certificate Manager (ACM) – This service is for enterprise customers who need a secure web presence using TLS. ACM certificates are deployed through Elastic Load Balancing (ELB), Amazon CloudFront, Amazon API Gateway, and other integrated AWS services. The most common application of this type is a secure public website with significant traffic requirements. ACM also helps to simplify security management by automating the renewal of expiring certificates.
AWS Private Certificate Authority (Private CA) – This service is for enterprise customers building a public key infrastructure (PKI) inside the AWS Cloud and is intended for private use within an organization. With AWS Private CA, you can create your own certificate authority (CA) hierarchy and issue certificates with it for authenticating users, computers, applications, services, servers, and other devices. Certificates issued by a private CA cannot be used on the internet. For more information, see the AWS Private CA User Guide.

You can use a centralized API gateway service, such as Amazon API Gateway, to securely expose customer-owned private API endpoints. API Gateway is a fully managed service that allows developers to create, publish, maintain, monitor, and secure APIs at scale. With API Gateway, you can create RESTful APIs and WebSocket APIs, enabling near real-time, two-way communication applications. API Gateway operations must be encrypted in-transit using TLS, and require the use of HTTPS endpoints. You can use API Gateway to configure custom domains for your APIs using TLS certificates provisioned and managed by ACM. Developers can optionally choose a specific TLS version for their custom domain names. For use cases that require mutual TLS (mTLS) authentication, you can configure certificate-based mTLS authentication on your custom domains.

Pre-encryption of data to be sent to AWS

Depending on the risk profile and sensitivity of the data that’s being transferred to AWS, you might choose to encrypt data in an application running on your corporate network before sending it to AWS (client-side encryption). AWS offers a variety of SDKs and client-side encryption libraries to help you encrypt and decrypt data in your applications. You can use these libraries with the cryptographic service provider of your choice, including AWS Key Management Service or AWS CloudHSM, but the libraries do not require an AWS service.

The AWS Encryption SDK is a client-side encryption library that you can use to encrypt and decrypt data in your application and is available in several programming languages, including a command-line interface. You can use the SDK to encrypt your data before you send it to an AWS service. The SDK offers advanced data protection features, including envelope encryption and additional authenticated data (AAD). It also offers secure, authenticated, symmetric key algorithm suites, such as 256-bit AES-GCM with key derivation and signing.
The AWS Database Encryption SDK is a set of software libraries developed in open source that enable you to include client-side encryption in your database design. The SDK provides record-level encryption solutions. You specify which fields are encrypted and which fields are included in the signatures that help ensure the authenticity of your data. Encrypting your sensitive data in transit and at rest helps ensure that your plaintext data isn’t available to a third party, including AWS. The AWS Database Encryption SDK for DynamoDB is designed especially for DynamoDB applications. It encrypts the attribute values in each table item using a unique encryption key. It then signs the item to protect it against unauthorized changes, such as adding or deleting attributes or swapping encrypted values. After you create and configure the required components, the SDK transparently encrypts and signs your table items when you add them to a table. It also verifies and decrypts them when you retrieve them. Searchable encryption in the AWS Database Encryption SDK enables you to search encrypted records without decrypting the entire database. This is accomplished by using beacons, which create a map between the plaintext value written to a field and the encrypted value that is stored in your database. For more information, see the AWS Database Encryption SDK Developer Guide.
The Amazon S3 Encryption Client is a client-side encryption library that enables you to encrypt an object locally to help ensure its security before passing it to Amazon S3. It integrates seamlessly with the Amazon S3 APIs to provide a straightforward solution for client-side encryption of data before uploading to Amazon S3. After you instantiate the Amazon S3 Encryption Client, your objects are automatically encrypted and decrypted as part of your Amazon S3 PutObject and GetObject requests. Your objects are encrypted with a unique data key. You can use both the Amazon S3 Encryption Client and server-side encryption to encrypt your data. The Amazon S3 Encryption Client is supported in a variety of programming languages and supports industry-standard algorithms for encrypting objects and data keys. For more information, see the Amazon S3 Encryption Client developer guide.

Encryption in-transit inside AWS

AWS implements responsible and sophisticated technical and physical controls that are designed to help prevent unauthorized access to or disclosure of your content. To protect data in transit, traffic traversing through the AWS network that is outside of AWS physical control is transparently encrypted by AWS at the physical layer. This includes traffic between AWS Regions (except China Regions), traffic between Availability Zones, and between Direct Connect locations and Regions through the AWS backbone network.

Network segmentation

When you create an AWS account, AWS offers a virtual networking option to launch resources in a logically isolated virtual network, Amazon Virtual Private Cloud (Amazon VPC). A VPC is limited to a single AWS Region and every VPC has one or more subnets. VPCs can be connected externally using an internet gateway (IGW), VPC peering connection, VPN, Direct Connect, or Transit Gateways. Traffic within your VPC is considered internal because you have complete control over your virtual networking environment, including selection of your own IP address range, creation of subnets, and configuration of route tables and network gateways.

As a customer, you maintain ownership of your data, you select which AWS services can process, store, and host your data, and you choose the Regions in which your data is stored. AWS doesn’t automatically replicate data across Regions unless you choose to do so. Data transmitted over the AWS global network between Regions and Availability Zones is automatically encrypted at the physical layer before leaving AWS secured facilities. Cross-Region traffic that uses Amazon VPC and Transit Gateway peering is automatically bulk-encrypted when it exits a Region.

Encryption between instances

AWS provides secure and private connectivity between Amazon Elastic Compute Cloud (Amazon EC2) instances of all types. The Nitro System is the underlying foundation for modern Amazon EC2 instances. It’s a combination of purpose-built server designs, data processors, system management components, and specialized firmware that provides the underlying foundation for EC2 instances launched since the beginning of 2018. Instance types that use the offload capabilities of the underlying Nitro System hardware automatically encrypt in-transit traffic between instances. This encryption uses Authenticated Encryption with Associated Data (AEAD) algorithms, with 256-bit encryption and has no impact on network performance. To support this additional in-transit traffic encryption between instances, instances must be of supported instance types, in the same Region, and in the same VPC or peered VPCs. For a list of supported instance types and additional requirements, see Encryption in transit.

Conclusion

The Second Amendment to the NYDFS Cybersecurity Regulation underscores the criticality of safeguarding nonpublic information during transmission over external networks. By mandating encryption for data in transit and eliminating the option for compensating controls, the Amendment reinforces the need for robust, industry-standard encryption measures to protect the confidentiality and integrity of sensitive information.

AWS provides a comprehensive suite of encryption services and secure connectivity options that enable you to design and implement robust data protection strategies. The transparent encryption mechanisms that AWS has built into services across its global network infrastructure, secure API endpoints with TLS encryption, and services such as Direct Connect with MACsec encryption and Site-to-Site VPN, can help you establish secure, encrypted pathways for transmitting nonpublic information over external networks.

By embracing the principles outlined in this blog post, financial services organizations can address not only the updated NYDFS encryption requirements for section 500.15(a) but can also potentially demonstrate their commitment to data security across other security standards and regulatory requirements.

For further reading on considerations for AWS customers regarding adherence to the Second Amendment to the NYDFS Cybersecurity Regulation, see the AWS Compliance Guide to NYDFS Cybersecurity Regulation.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Financial Services re:Post and AWS Security, Identity, & Compliance re:Post, or contact AWS Support.

Using Amazon GuardDuty Malware Protection to scan uploads to Amazon S3

2024-08-16 Luke Notley

Post Syndicated from Luke Notley original https://aws.amazon.com/blogs/security/using-amazon-guardduty-malware-protection-to-scan-uploads-to-amazon-s3/

Amazon Simple Storage Service (Amazon S3) is a widely used object storage service known for its scalability, availability, durability, security, and performance. When sharing data between organizations, customers need to treat incoming data as untrusted and assess it for malicious files before ingesting it into their downstream processes. This traditionally requires setting up secure staging buckets, deploying third-party anti-virus and anti-malware scanning software, and managing a complex data pipeline and processing architecture.

To address the need for malware protection in Amazon S3, Amazon Web Services (AWS) has launched Amazon GuardDuty Malware Protection for Amazon S3. This new feature provides malicious object scanning for objects uploaded to S3 buckets, using multiple AWS-developed and industry-leading third-party malware scanning engines. It eliminates the need for customers to manage their own isolated data pipelines, compute infrastructure, and anti-virus software across accounts and AWS Regions, providing malware detection without compromising the scale, latency, and resiliency of S3 usage.

In this blog post, we share a solution that uses Amazon EventBridge, AWS Lambda, and Amazon S3 to copy scanned S3 objects to a destination S3 bucket. EventBridge is a serverless event bus that you can use to build event-driven architectures and automate your business workflows. In this solution, we allow events to be invoked from an object that is being placed in an S3 bucket. The events can be processed by a serverless function in Lambda to invoke a malware scan. We then show you how to extend this solution for other use cases specific to your organization.

Feature overview

GuardDuty Malware Protection for Amazon S3 provides a malware and anti-virus detection service for new objects uploaded to an S3 bucket. Malware Protection for S3 is enabled from within the AWS Management Console for GuardDuty and GuardDuty threat detection is not required to be enabled to use this feature. If GuardDuty threat detection is enabled, security findings for detected malware are also sent to GuardDuty. This allows customer development or application teams and security teams to work together and oversee malware protection for S3 buckets throughout the organization.

When your AWS account has GuardDuty enabled in an AWS Region, your account is associated to a unique regional entity called a detector ID. All findings that GuardDuty generates and API operations that are performed are associated with this detector ID. If you don’t want to use GuardDuty with your AWS account, Malware Protection for S3 is available as an independent feature. Used independently, Malware Protection for S3 will not create an associated detector ID.

When a malware scan identifies a potentially malicious object and you don’t have a detector ID, no GuardDuty finding will be generated in your AWS account. GuardDuty will publish the malware scan results to your default EventBridge event bus and metrics to an Amazon CloudWatch namespace for you to use for automating additional tasks.

GuardDuty manages error handling and reprocessing of event creation and publication as needed to make sure that each object is properly evaluated before being accessed by downstream resources. GuardDuty supports configuring Amazon S3 object tagging actions to be performed throughout the process.

Figure 1 shows the high-level overview of the S3 object scanning process.

Figure 1: S3 object scanning process

The object scanning process is the following:

An object is uploaded to an S3 bucket that has been configured for malware detection. If the object is uploaded as a multi-part upload, then a new object notification will be generated on completion of the upload.
The malware scan service receives a notification that a new object has been detected in the bucket.
The malware scan service downloads the object by using AWS PrivateLink. This will be automatically created when malware detection is enabled on an S3 bucket. No additional configuration is required.
The malware detection service then reads, decrypts, and scans this object in an isolated VPC with no internet access within the GuardDuty service account. Encryption at rest is used for customer data that is scanned during this process. After the malware detection scan is complete, the object is deleted from the malware scanning environment.
The malware scan result event is sent to the EventBridge default event bus in your AWS account and Region where malware detection has been enabled. When malware is detected, an EventBridge notification is generated that includes details of which S3 object was flagged as malicious and supporting information such as the malware variant and known use cases for the malicious software.
Scan metrics such as number of objects scanned and bytes scanned are sent to Amazon CloudWatch.
If malware is detected, the service sends a finding to the GuardDuty detector ID in the current Region.
If you have configured object tagging, GuardDuty adds a predefined tag with the key GuardDutyMalwareScanStatus and the scan result as its value to your scanned S3 object.
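
As a rough illustration of how a consumer of these scan result events might decide what to do next, the following Python sketch extracts the fields of interest from an event. The field names bucketName, scanStatus, and scanResultStatus follow the event pattern used later in this post; the objectKey field name, the ScanResult type, and the sample event are assumptions made for illustration and should be checked against a real event in your account.

```python
from typing import Any, Dict, NamedTuple, Optional


class ScanResult(NamedTuple):
    """Hypothetical container for the parts of a scan result event we care about."""
    bucket: str
    key: Optional[str]
    scan_status: str
    result_status: Optional[str]


def parse_scan_result(event: Dict[str, Any]) -> ScanResult:
    """Pull the bucket, key, and scan outcome out of a scan result event."""
    detail = event.get("detail", {})
    s3_details = detail.get("s3ObjectDetails", {})
    result_details = detail.get("scanResultDetails", {})
    return ScanResult(
        bucket=s3_details.get("bucketName", ""),
        key=s3_details.get("objectKey"),  # assumed field name
        scan_status=detail.get("scanStatus", ""),
        result_status=result_details.get("scanResultStatus"),
    )


def is_clean(result: ScanResult) -> bool:
    """True only when the scan finished and reported no threats."""
    return (
        result.scan_status == "COMPLETED"
        and result.result_status == "NO_THREATS_FOUND"
    )


# Illustrative event, trimmed to the fields used above.
sample_event = {
    "source": "aws.guardduty",
    "detail-type": "GuardDuty Malware Protection Object Scan Result",
    "detail": {
        "scanStatus": "COMPLETED",
        "resourceType": "S3_OBJECT",
        "s3ObjectDetails": {
            "bucketName": "DOC-EXAMPLE-BUCKET",
            "objectKey": "incoming/report.csv",
        },
        "scanResultDetails": {"scanResultStatus": "NO_THREATS_FOUND"},
    },
}

parsed = parse_scan_result(sample_event)
print(parsed.bucket, parsed.key, is_clean(parsed))
```

Treating anything other than a completed scan with NO_THREATS_FOUND as not clean is a deliberately conservative choice, so failed or skipped scans are never mistaken for safe objects.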

IAM permissions

Enabling and using GuardDuty Malware Protection for S3 requires you to add AWS Identity and Access Management (IAM) role permissions and a specific trust policy for GuardDuty to perform the malware scan on your behalf. GuardDuty gives you the flexibility to enable this feature for your entire bucket or to limit the scope of the malware scan to up to five object prefixes, in which case GuardDuty scans each uploaded object whose key starts with one of the selected prefixes.

To allow GuardDuty Malware Protection for S3 to scan and add tags to your S3 objects, you need an IAM role that includes permissions to perform the following tasks:

A trust policy to allow Malware Protection to assume the IAM role.
Allow EventBridge actions to create and manage the EventBridge managed rule to allow Malware Protection for S3 to listen to your S3 object notifications.
Allow Amazon S3 and EventBridge actions to send notification to EventBridge for events in the S3 bucket.
Allow Amazon S3 actions to access the uploaded S3 object and add a predefined tag GuardDutyMalwareScanStatus to the scanned S3 object.
If you’re encrypting S3 buckets with AWS Key Management Service (AWS KMS) keys, you must allow AWS KMS key actions to access the object before scanning and putting a test object in S3 buckets with the supported encryption.

This IAM policy is required each time you enable Malware Protection for S3 for a new bucket in your account. Alternatively, update an existing IAM PassRole policy to include the details of another S3 bucket resource each time you enable Malware Protection. See the AWS documentation for example policies and permissions required.
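
To make the first item in the list above more concrete, here is a minimal Python sketch that builds a trust policy document for the scanning role. The service principal name used here is an assumption based on the GuardDuty documentation and should be verified before use; the function name is hypothetical, and the permissions policy itself is not shown.

```python
import json
from typing import Any, Dict

# Assumption: service principal for Malware Protection for S3.
# Confirm the exact value in the GuardDuty documentation.
MALWARE_PROTECTION_PRINCIPAL = "malware-protection-plan.guardduty.amazonaws.com"


def build_trust_policy(
    principal: str = MALWARE_PROTECTION_PRINCIPAL,
) -> Dict[str, Any]:
    """Return a trust policy that lets the given service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }


print(json.dumps(build_trust_policy(), indent=2))
```

The resulting JSON string is the kind of document you would supply as the role's trust relationship when creating the IAM role.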

S3 object tagging and access control

When you enable S3 object tagging, GuardDuty adds a predefined tag with the key GuardDutyMalwareScanStatus and the scan result as its value to your scanned S3 object. These tags enable the implementation of a tag-based access control (TBAC) policy for the objects, halting access to an S3 object until a malware scan has been completed.

The example S3 bucket policy in the AWS GuardDuty user guide stops anyone other than the GuardDuty Malware scan service principal from reading objects from the specific S3 bucket that aren’t tagged GuardDutyMalwareScanStatus with a value NO_THREATS_FOUND. The policy also helps prevent other roles or users other than GuardDuty from adding the GuardDutyMalwareScanStatus tag.

Configure optional access for other IAM roles that are allowed to override the GuardDutyMalwareScanStatus tag after an object is tagged. Achieve this by replacing <IAM-role-name> in the following example S3 bucket policy.

{
            "Sid": "OnlyGuardDutyCanTag",
            "Effect": "Deny",
            "NotPrincipal": {
                "AWS": [
                    "arn:aws:iam::555555555555:root",
                    "arn:aws:iam::555555555555:role/<IAM-role-name>",
                    "arn:aws:iam::555555555555:assumed-role/<IAM-role-name>/GuardDutyMalwareProtection"
                ]
            },

Change the policy if you are required to allow certain principals or roles to read failed or skipped objects. You can permit a special role to read the malicious object if needed as part of your existing incident response process. Do this by adding an additional statement into the S3 bucket policy and replacing the <IAM-role-name> value in the following example.

{
            "Sid": "AllowSecurityTeamReadMalicious",
            "Effect": "Deny",
            "NotPrincipal": {
                "AWS": [
                    "arn:aws:iam::555555555555:role/<IAM-role-name>"
                ]
            },
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": [
                "arn:aws:s3:::DOC-EXAMPLE-BUCKET",
                "arn:aws:s3:::DOC-EXAMPLE-BUCKET/*"
            ],
            "Condition": {
                "StringNotEquals": {
                    "s3:ExistingObjectTag/GuardDutyMalwareScanStatus": "THREATS_FOUND"
                }
            }
        },
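
The bucket policy enforces the tag check on the S3 side. As a complementary sketch, application code can read the same tag before processing an object. The example below assumes a boto3-style S3 client is passed in (only its get_object_tagging call is used); the helper names are hypothetical.

```python
from typing import Any, Dict, List, Optional

TAG_KEY = "GuardDutyMalwareScanStatus"


def scan_status_from_tags(tag_set: List[Dict[str, str]]) -> Optional[str]:
    """Return the scan status tag value, or None if the object is untagged."""
    for tag in tag_set:
        if tag.get("Key") == TAG_KEY:
            return tag.get("Value")
    return None


def get_scan_status(s3_client: Any, bucket: str, key: str) -> Optional[str]:
    """Look up the scan status tag of one object.

    s3_client is expected to behave like boto3.client("s3").
    """
    response = s3_client.get_object_tagging(Bucket=bucket, Key=key)
    return scan_status_from_tags(response.get("TagSet", []))


def is_safe_to_read(s3_client: Any, bucket: str, key: str) -> bool:
    """Treat only objects explicitly tagged NO_THREATS_FOUND as safe."""
    return get_scan_status(s3_client, bucket, key) == "NO_THREATS_FOUND"


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   s3 = boto3.client("s3")
#   if is_safe_to_read(s3, "DOC-EXAMPLE-BUCKET", "incoming/report.csv"):
#       ...process the object...
```

An untagged object is treated as unsafe, which mirrors the policy's intent of blocking access until a scan has completed.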

Solution overview

This solution is designed to streamline the deployment of GuardDuty Malware Protection for S3, helping you to maintain a secure and reliable S3 storage environment while minimizing the risk of malware infections and their potential consequences. The solution provides several configuration options, allowing you to create a new S3 bucket or use an existing one, enable encryption with a new or existing AWS KMS key, and optionally set up a function to copy objects with a defined tag to a destination S3 bucket. The copy function feature offers an additional layer of protection by separating potentially malicious files from clean ones, allowing you to maintain a separate repository of safe data for continued business operations or further analysis.

Figure 2 shows the solution architecture.

Figure 2: Amazon GuardDuty copy S3 object solution overview

The high-level workflow of the solution is as follows:

An object is uploaded to an S3 bucket that has been configured for malware detection.
The malware scan service receives a notification that a new object has been detected in the bucket and then GuardDuty reads, decrypts, and scans the object in an isolated environment.
An EventBridge rule is configured to listen for events that match the pattern of completed scans for the monitored bucket that have a scan result of NO_THREATS_FOUND.
When the matched event pattern occurs, the copy object Lambda function is invoked.
The Lambda copy object function copies the object from the monitored S3 bucket to the target bucket.
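
The last two steps of this workflow can be sketched in Python as follows. This is not the function shipped with the solution; it is a minimal illustration that assumes the event carries the object key in an objectKey field and that the destination bucket name is supplied through a hypothetical DESTINATION_BUCKET environment variable.

```python
import os
from typing import Any, Dict


def copy_scanned_object(
    s3_client: Any, event: Dict[str, Any], destination_bucket: str
) -> Dict[str, str]:
    """Copy the object named in a scan result event to the destination bucket.

    s3_client is expected to behave like boto3.client("s3").
    """
    details = event["detail"]["s3ObjectDetails"]
    source_bucket = details["bucketName"]
    key = details["objectKey"]  # assumed field name; includes any prefix

    s3_client.copy_object(
        Bucket=destination_bucket,
        Key=key,  # same key, so the prefix is preserved
        CopySource={"Bucket": source_bucket, "Key": key},
        TaggingDirective="COPY",  # carry the existing object tags over
    )
    return {"source": source_bucket, "destination": destination_bucket, "key": key}


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, str]:
    """Lambda entry point; boto3 is available in the Lambda Python runtime."""
    import boto3

    return copy_scanned_object(
        boto3.client("s3"), event, os.environ["DESTINATION_BUCKET"]
    )
```

Passing the client into copy_scanned_object keeps the copy logic testable without AWS access. Note that a single copy_object call has an object size limit, so very large objects would need a multipart copy instead.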

In this solution, you will use the following AWS services and features:

Event tracking: This solution uses an EventBridge rule to listen for completed malware scan result events for a specific S3 bucket, which has been enabled for malware scanning. When the EventBridge rule finds a matched event, the rule passes the required parameters and invokes the Lambda function required to copy the S3 object from the source malware protected bucket to a destination clean bucket. The event pattern used in this solution uses the following format:
```
{
  "source": ["aws.guardduty"],
  "detail-type": ["GuardDuty Malware Protection Object Scan Result"],
  "detail": {
    "scanStatus": ["COMPLETED"],
    "resourceType": ["S3_OBJECT"],
    "s3ObjectDetails": {
      "bucketName": ["<DOC-EXAMPLE-BUCKET-111122223333>"]
    },
    "scanResultDetails": {
      "scanResultStatus": ["NO_THREATS_FOUND"]
    }
  }
}
```
Note: Replace the value of the bucketName attribute with the bucket in your account.
Task orchestration: A Lambda function handles the logic for copying the S3 object that has just been scanned by GuardDuty from the source bucket to the destination bucket. If the object was created within a new S3 prefix, the prefix and the object will be copied. If the object was tagged by GuardDuty, then the object tag will be copied.
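
To show how the event pattern above selects events, the following Python sketch builds the same pattern for a given bucket and checks a sample event against it. The matches function implements only a simplified subset of EventBridge matching (exact values inside nested objects), and both function names are hypothetical.

```python
from typing import Any, Dict, List


def build_event_pattern(bucket_name: str, statuses: List[str]) -> Dict[str, Any]:
    """Build the scan result event pattern for one bucket and a list of statuses."""
    return {
        "source": ["aws.guardduty"],
        "detail-type": ["GuardDuty Malware Protection Object Scan Result"],
        "detail": {
            "scanStatus": ["COMPLETED"],
            "resourceType": ["S3_OBJECT"],
            "s3ObjectDetails": {"bucketName": [bucket_name]},
            "scanResultDetails": {"scanResultStatus": list(statuses)},
        },
    }


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Simplified matcher: every pattern key must exist in the event, and the
    event value must be one of the listed values (recursing into objects)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


pattern = build_event_pattern("DOC-EXAMPLE-BUCKET", ["NO_THREATS_FOUND"])
event = {
    "source": "aws.guardduty",
    "detail-type": "GuardDuty Malware Protection Object Scan Result",
    "detail": {
        "scanStatus": "COMPLETED",
        "resourceType": "S3_OBJECT",
        "s3ObjectDetails": {"bucketName": "DOC-EXAMPLE-BUCKET"},
        "scanResultDetails": {"scanResultStatus": "NO_THREATS_FOUND"},
    },
}
print(matches(pattern, event))
```

Extra fields in the event that the pattern doesn't mention are ignored, which is why the pattern only needs to list the attributes it filters on.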

Deploy the solution

The solution CloudFormation template provides you with multiple deployment scenarios so you can choose which best applies to your use case.

Deploy the CloudFormation template

For this next step, make sure that you deploy the CloudFormation template provided in the AWS account and Region where you want to test this solution.

To deploy the CloudFormation template

Choose the Launch Stack button to launch a CloudFormation stack in your account. Note that the stack will launch in the N. Virginia (us-east-1) Region. To deploy this solution in other Regions, download the solution’s CloudFormation template, modify it, and deploy it to the selected Regions.
Choose the appropriate scenario and complete the parameters information questions as shown in Figure 3.

Figure 3: CloudFormation template parameters

Each of the following scenarios and their parameter information (from Figure 3) can be evaluated to make sure that the CloudFormation template deploys successfully:

Deployment scenario
- Create a new bucket or use an existing bucket?
- If "new", should a KMS key be created for the new bucket?
- Would you like to create the copy function to a destination bucket? Create the Lambda copy function from the protected bucket to the clean bucket.
Post scan file copy function
- This will be used as the basis for the copy function and EventBridge rule to invoke the function: Copy files to the clean bucket with either the THREATS_FOUND or NO_THREATS_FOUND tagged value.
Existing S3 bucket configuration – not used for new S3 buckets
- Enter the bucket name that you would like to be your scanned bucket: Enter the existing S3 bucket name that will be enabled for GuardDuty Malware Protection for S3.
- Enter the bucket name that you would like to be your scanned bucket: Enter the S3 bucket name to be used as the copy destination for S3 objects.
- Is the existing bucket using a KMS key? Is the existing S3 bucket encrypted with an existing KMS key?
- ARN of the existing KMS key to be used: Provide the existing KMS key Amazon Resource Name (ARN) to be used for KMS encryption. IAM policies will be configured for this KMS key name.
- Lambda Copy Function clean bucket: Create a new S3 bucket with the Lambda copy function from the protected bucket to the clean bucket.
Review the stack name and the parameters for the template.
On the Quick create stack screen, scroll to the bottom and select I acknowledge that AWS CloudFormation will create IAM resources.
Choose Create stack. The deployment of the CloudFormation stack will take 3–4 minutes.

After the CloudFormation stack has deployed successfully, the solution will be available for use in the same Region where you deployed the CloudFormation stack. The solution deploys a specific Lambda function and EventBridge rule to match the name of the source S3 bucket.

Deploy the AWS CDK template

Alternatively, if you prefer to use the AWS CDK, download the CDK code from the GitHub repository.

Follow the readme contained within the repository to deploy the solution or individual components, depending on your requirements.

Extend the solution

In this section, you’ll find options for extending the solution.

Copy alternative status results

The solution can be extended to copy S3 objects with a scan result status that you define. To change the scan result used to invoke the copy function, update the scanResultStatus value in the event pattern of the EventBridge rule that was created as part of the solution, named S3Malware-CopyS3Object-<DOC-EXAMPLE-BUCKET-111122223333>.

"scanResultDetails": {
      "scanResultStatus": ["<scan result status>"]

Delete source S3 objects

To delete the object from the source after the copy was successful, you will need to update the Lambda function code and the IAM role used by the Lambda function.

The IAM role used by the Lambda function requires a new statement added to the existing role. The JSON formatted statement is provided in the following example.

{
    "Action": [
        "s3:DeleteObject"
    ],
    "Resource": [
        "arn:aws:s3:::<DOC-EXAMPLE-BUCKET-111122223333>/*"
    ],
    "Effect": "Allow",
    "Sid": "AllowDeleteObjectSourceBucket"
},

The copy Lambda function requires the following line to be added at the end of the function code to delete the object:

s3.delete_object(Bucket=SOURCE_BUCKET,Key=SOURCE_KEY)
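
One point worth making explicit is that the delete should only run after the copy has succeeded. The following Python sketch shows that ordering in isolation; move_object is a hypothetical helper, not part of the solution's code, and s3_client is assumed to behave like boto3.client("s3").

```python
from typing import Any


def move_object(
    s3_client: Any, source_bucket: str, key: str, destination_bucket: str
) -> None:
    """Copy an object to the destination bucket, then delete the source copy.

    If copy_object raises, the exception propagates and the delete is never
    attempted, so the source object is kept.
    """
    s3_client.copy_object(
        Bucket=destination_bucket,
        Key=key,
        CopySource={"Bucket": source_bucket, "Key": key},
    )
    # Reached only when the copy did not raise.
    s3_client.delete_object(Bucket=source_bucket, Key=key)
```

Letting the exception propagate also means a failed Lambda invocation is visible in your function's error metrics and can be retried.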

Scan existing S3 objects

When GuardDuty Malware Protection for S3 is enabled, it scans only new objects put into the bucket. To scan existing objects in an S3 bucket for malware, set up bucket replication to replicate all objects from a source bucket to a destination bucket with Malware Protection enabled.

Automate tagged object deletion

To remove malicious objects from the S3 bucket to help prevent accidental download or access, implement a tag-based lifecycle rule to delete the object after a specific number of days. To achieve this, follow the steps in Setting a lifecycle configuration on a bucket to configure a lifecycle rule, and make sure that the tag key is GuardDutyMalwareScanStatus and the tag value is THREATS_FOUND.

Figure 4: Tag based S3 lifecycle rule

Align the lifecycle policy with your organization’s current S3 object malware investigation procedures. Deleting objects prematurely might hinder security teams’ ability to analyze potentially malicious content. When bucket versioning is enabled, Amazon S3 doesn’t permanently delete the object; instead, it inserts a delete marker that becomes the current version of the object.
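
For readers who prefer to define the rule programmatically, the sketch below builds a lifecycle configuration with the same tag filter. The rule ID and the retention period are illustrative assumptions; choose a period that fits your investigation procedures.

```python
from typing import Any, Dict


def build_lifecycle_configuration(days: int) -> Dict[str, Any]:
    """Build a lifecycle configuration that expires objects tagged as malicious."""
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "Rules": [
            {
                "ID": "ExpireMalwareFlaggedObjects",  # hypothetical rule name
                "Filter": {
                    "Tag": {
                        "Key": "GuardDutyMalwareScanStatus",
                        "Value": "THREATS_FOUND",
                    }
                },
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


config = build_lifecycle_configuration(days=30)

# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="DOC-EXAMPLE-BUCKET", LifecycleConfiguration=config
#   )
```

Keep in mind that applying a lifecycle configuration this way replaces the bucket's existing lifecycle configuration, so include any rules you want to keep.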

AWS Transfer Family integration

If you’re using the AWS Transfer Family service with Secure File Transfer Protocol (SFTP) connector for S3, it’s recommended to scan external uploads for malware before using the received files. This helps ensure the security and integrity of data transferred into your S3 buckets using SFTP.

Figure 5: AWS Transfer Family S3 workflow

To implement malware scanning, configure a file processing workflow to copy the uploaded objects into an S3 bucket that has GuardDuty Malware Protection for S3 enabled.

Figure 6: Transfer Family configuration workflow

Summary

Amazon GuardDuty Malware Protection for S3 is now available to assess untrusted objects for malicious files before being ingested by downstream processes within your organization. Customers can automatically scan their S3 objects for malware and take appropriate actions, such as quarantining or remediating infected files. This proactive approach helps mitigate the risks associated with malware infections, data breaches, and potential financial losses. The solution provided offers an additional layer of protection by separating potentially malicious files from clean ones, allowing customers to maintain a separate repository of safe data for continued business operations or further analysis. Visit the 2024 re:Inforce session or the what’s new blog post to understand additional service details.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Use AWS Glue to streamline SFTP data processing

2024-08-13 Seun Akinyosoye

Post Syndicated from Seun Akinyosoye original https://aws.amazon.com/blogs/big-data/use-aws-glue-to-streamline-sftp-data-processing/

In today’s data-driven world, seamless integration and transformation of data across diverse sources into actionable insights is paramount. AWS Glue is a serverless data integration service that helps analytics users to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. With AWS Glue, you can discover and connect to hundreds of diverse data sources and manage your data in a centralized data catalog. It enables you to visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes.

In this blog post, we explore how to use the SFTP Connector for AWS Glue from the AWS Marketplace to efficiently process data from Secure File Transfer Protocol (SFTP) servers into Amazon Simple Storage Service (Amazon S3), further empowering your data analytics and insights.

Introducing the SFTP connector for AWS Glue

The SFTP connector for AWS Glue simplifies the process of connecting AWS Glue jobs to extract data from SFTP storage and to load data into SFTP storage. This connector provides comprehensive access to SFTP storage, facilitating cloud ETL processes for operational reporting, backup and disaster recovery, data governance, and more.

Solution overview

In this example, you use AWS Glue Studio to connect to an SFTP server, then enrich that data and upload it to Amazon S3. The SFTP connector is used to manage the connection to the SFTP server. You will load the event data from the SFTP site, join it to the venue data stored on Amazon S3, apply transformations, and store the data in Amazon S3. The event and venue files are from the TICKIT dataset.

The TICKIT dataset tracks sales activity for the fictional TICKIT website, where users buy and sell tickets online for sporting events, shows, and concerts. In this dataset, analysts can identify ticket movement over time, success rates for sellers, and best-selling events, venues, and seasons.

For this example, you use AWS Glue Studio to develop a visual ETL pipeline. This pipeline will read data from an SFTP server, perform transformations, and then load the transformed data into Amazon S3. The following diagram illustrates this architecture.

solution overview

By the end of this post, your visual ETL job will resemble the following screenshot.

final solution

Prerequisites

For this solution, you need the following:

A subscription to the SFTP Connector for AWS Glue in the AWS Marketplace.
Access to an SFTP server with permissions to upload and download data.
- If the SFTP server is hosted on Amazon Elastic Compute Cloud (Amazon EC2), we recommend that the network communication between the SFTP server and the AWS Glue job happens within the virtual private cloud (VPC) as pictured in the preceding architecture diagram. Running your Glue job within a VPC and security group will be discussed further in the steps to create the AWS Glue job.
- If the SFTP server is hosted within your on-premises network, we recommend that the network communication between the SFTP server and the AWS Glue job happens through VPN or AWS Direct Connect.
Access to an S3 bucket or the permissions to create an S3 bucket. We recommend that you connect to that bucket using a gateway endpoint. This will allow you to connect to your S3 bucket directly from your VPC. If you need to create an S3 bucket to store the results, complete the following steps:
1. On the Amazon S3 console, choose Buckets in the navigation pane.
2. Choose Create bucket.
3. For Name, enter a globally unique name for your bucket; for example, tickit-use1-<accountnumber>.
4. Choose Create bucket.
5. For this demonstration, create a folder with the name tickit in your S3 bucket.
6. Create the gateway endpoint.
An AWS Identity and Access Management (IAM) role for the AWS Glue ETL job. You must specify an IAM role for the job to use. The role must grant access to all resources used by the job, including Amazon S3 (for any sources, targets, scripts, and temporary directories) and AWS Secrets Manager. For instructions, see Configure an IAM role for your ETL job.

Load dataset to SFTP site

Load the allevents_pipe.txt file and venue_pipe.txt file from the TICKIT dataset to your SFTP server.

Store SFTP server sign-in credentials

An AWS Glue connection is a Data Catalog object that stores connection information, such as URI strings and location to credentials that are stored in a Secrets Manager secret.

To store the SFTP server username and password in Secrets Manager, complete the following steps:

On the Secrets Manager console, choose Secrets in the navigation pane.
Choose Store a new secret.
Select Other type of secret.
Enter host as Secret key and your SFTP server’s IP address (for example, 153.47.122) as the Secret value, then choose Add row.
Enter username as Secret key and your SFTP username as Secret value, then choose Add row.
Enter password as Secret key and your SFTP password as Secret value, then choose Add row.
Enter keyS3Uri as Secret key and the Amazon S3 location of your SFTP secret key file as Secret value.

Note: Secret value is the full S3 path where the SFTP server key file is stored. For example: s3://sftp-bucket-johndoe123/id_rsa.

For Secret name, enter a descriptive name, then choose Next.
Choose Next to move to the review step, then choose Store.
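
The console steps above produce a secret with four key-value pairs. As a sketch of the same structure in code, the following Python helper assembles the secret string; the function name and the sample values are illustrative, and only the key names (host, username, password, keyS3Uri) come from the steps above.

```python
import json


def build_sftp_secret(
    host: str, username: str, password: str, key_s3_uri: str
) -> str:
    """Return the JSON secret string with the keys used in the steps above."""
    if not key_s3_uri.startswith("s3://"):
        raise ValueError("keyS3Uri must be a full S3 path starting with s3://")
    return json.dumps(
        {
            "host": host,
            "username": username,
            "password": password,
            "keyS3Uri": key_s3_uri,
        }
    )


# Placeholder values for illustration only.
secret_string = build_sftp_secret(
    host="sftp.example.com",
    username="example-user",
    password="example-password",
    key_s3_uri="s3://sftp-bucket-johndoe123/id_rsa",
)

# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   boto3.client("secretsmanager").create_secret(
#       Name="my-sftp-secret", SecretString=secret_string
#   )
```

Avoid hardcoding real credentials in scripts; read them from a secure source at run time instead.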

secret value

Create a connection to the SFTP server in AWS Glue

Complete the following steps to create your connection to the SFTP server.

On the AWS Glue console, under Data Catalog in the navigation pane, choose Connections.

creating sftp connection from marketplace

Select the SFTP connector for AWS Glue 4.0. Then choose Create connection.

using sftp connector

Enter a name for the connection and then, under Connection access, choose the Secrets Manager secret you created for your SFTP server credentials.

Create a connection to the VPC in AWS Glue

A data connection is used to establish network connectivity between the VPC and the AWS Glue job. To create the VPC connection, complete the following steps.

On the AWS Glue console, choose Data connections in the navigation pane.
Choose Create connection in the Connections panel.

creating connection for VPC

Select Network

choosing network option

Select the VPC, Subnet, and Security Group that your SFTP server resides in. Click Next.

choosing vpc, subnet, sg for connection

Name the connection SFTP VPC Connect and then click

Deploy the solution

Now that you have completed the prerequisites, you can set up the AWS Glue Studio job for this solution. You will create an AWS Glue Studio job, add the event and venue data from the SFTP server, carry out data transformations, and load the transformed data to Amazon S3.

Create your AWS Glue Studio job:

On the AWS Glue console, under ETL Jobs in the navigation pane, choose Visual ETL.
Select Visual ETL in the central pane.
Choose the pencil icon to enter a name for your job.
Choose the Job details tab.

Scroll down to Advanced properties and expand the section.
Scroll to Connections and select SFTP VPC Connect.

Choose Visual to go back to the workflow editor page.

Add the events data from the SFTP server as your first data set:

Choose Add nodes and select SFTP Connector for AWS Glue 4.0 on the Sources tab.
Enter the following for Data source properties:
1. Connection: Select the connection to the SFTP server that you created in Create a connection to the SFTP server in AWS Glue.
2. Enter the following key-value pairs:

Key	Value
header	false
path	/files (this should be the path to the event file in your SFTP server)
fileFormat	csv
delimiter	|

Rename the columns of the Event dataset:

Choose Add nodes and choose Change Schema on the Transforms tab.
Enter the following transform properties:
1. For Name, enter Rename Event data.
2. For Node parents, select SFTP Connector for AWS Glue 4.0.
3. In the Change Schema section, map the source keys to the target keys:
  1. col0: eventid
  2. col1: e_venueid
  3. col2: catid
  4. col3: dateid
  5. col4: eventname
  6. col5: starttime
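To make the connector options and the rename step concrete, the following is a minimal plain-Python sketch (not AWS Glue code) of what they amount to: a pipe-delimited file with no header row is read into columns named col0, col1, and so on, and those columns are then renamed using the mapping from the Change Schema step. The function names and the sample row are made up for illustration.

```python
import csv
import io

# Mapping taken from the Change Schema step above.
EVENT_COLUMNS = {
    "col0": "eventid",
    "col1": "e_venueid",
    "col2": "catid",
    "col3": "dateid",
    "col4": "eventname",
    "col5": "starttime",
}


def read_headerless(text, delimiter="|"):
    """Parse delimited text with no header row into dicts keyed col0, col1, ..."""
    rows = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [{f"col{i}": value for i, value in enumerate(row)} for row in rows]


def rename_columns(records, mapping):
    """Rename keys according to mapping, leaving unmapped keys unchanged."""
    return [{mapping.get(key, key): value for key, value in rec.items()} for rec in records]


# Illustrative sample row (invented for this sketch, not real event data).
sample = "1|10|2|100|Sample Event|2024-01-01 19:00:00\n"
events = rename_columns(read_headerless(sample), EVENT_COLUMNS)
print(events[0])
```

The venue dataset follows the same pattern with its own five-column mapping.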

Add the venue_pipe.txt file from the SFTP site:

Choose Add nodes and choose SFTP Connector for AWS Glue 4.0 on the Sources tab.
Enter the following for Data source properties:
1. Connection: Select the connection to the SFTP server that you created in Create a connection to the SFTP server in AWS Glue.
2. Enter the following key-value pairs:

Key	Value
header	false
path	/files (this should be the path to the venue file in your SFTP site)
fileFormat	csv
delimiter	|

Rename the columns of the venue dataset:

Choose Add nodes and choose Change Schema on the Transforms tab.
Enter the following transform properties:
1. For Name, enter Rename Venue data.
2. For Node parents, select Venue.
3. In the Change Schema section, map the source keys to the target keys:
  1. col0: venueid
  2. col1: venuename
  3. col2: venuecity
  4. col3: venuestate
  5. col4: venueseats

Join the venue and event datasets:

Choose Add nodes and choose Join on the Transforms tab.
Enter the following transform properties:
1. For Name, enter Join.
2. For Node parents, select Rename Venue data and Rename Event data.
3. For Join type, select Inner join.
4. For Join conditions, select venueid for Rename Venue data and e_venueid for Rename Event data.

Drop the duplicate field:

Choose Add nodes and choose Drop Fields on the Transforms tab.
Enter the following transform properties:
1. For Name, enter Drop Fields.
2. For Node parents, select Join.
3. In the DropFields section, select e_venueid.
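The join and drop steps can be pictured with a small plain-Python sketch (again, not AWS Glue code). It performs an inner join on venueid = e_venueid and then removes the now-redundant e_venueid column. The function names and sample records are invented for illustration.

```python
def inner_join(left, right, left_key, right_key):
    """Inner join two lists of dicts where left[left_key] == right[right_key]."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    joined = []
    for row in left:
        # Rows without a match on the other side are dropped (inner join).
        for match in index.get(row[left_key], []):
            joined.append({**row, **match})
    return joined


def drop_fields(records, fields):
    """Return copies of the records without the named fields."""
    return [{k: v for k, v in rec.items() if k not in fields} for rec in records]


# Invented sample records using the renamed columns from the earlier steps.
sample_venues = [{"venueid": "10", "venuename": "Sample Hall"}]
sample_events = [
    {"eventid": "1", "e_venueid": "10", "eventname": "Sample Event"},
    {"eventid": "2", "e_venueid": "99", "eventname": "No Matching Venue"},
]
result = drop_fields(
    inner_join(sample_venues, sample_events, "venueid", "e_venueid"),
    {"e_venueid"},
)
print(result)
```

Because the join condition guarantees that venueid and e_venueid hold the same value in every output row, dropping e_venueid loses no information.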

Load the data into your S3 bucket:

Choose Add nodes and choose Amazon S3 on the Targets tab.
Enter the following transform properties:
1. For Node parents, select Drop Fields.
2. For Format, select CSV.
3. For Compression Type, select None.
4. For S3 Target Location, choose your S3 bucket and enter your desired file name followed by a slash (/).

You can now save and run your AWS Glue visual ETL Job. Run the job and then go to the Runs tab to monitor its progress. After the job has completed, the Run status will change to Succeeded. The data will be in the target S3 bucket.

Clean up

To avoid incurring additional charges caused by resources created as part of this post, make sure you delete the items created in the AWS Account for this post:

Delete the Secrets Manager secret created for the SFTP connector credentials.
Delete the SFTP connector.
Unsubscribe from the SFTP Connector in AWS Marketplace.
Delete the data loaded to the Amazon S3 bucket and the bucket.
Delete the AWS Glue visual ETL job.

Conclusion

In this blog post, we demonstrated how to use the SFTP connector for AWS Glue to streamline the processing of data from SFTP servers into Amazon S3. This integration plays a pivotal role in enhancing your data analytics capabilities by offering an efficient and straightforward method to bring together disparate data sources. Whether your goal is to analyze SFTP server data for actionable insights, bolster your reporting mechanisms, or enrich your business intelligence tools, this connector ensures a more streamlined and cost-effective approach to achieving your data objectives.

For further details on the SFTP connector, see the SFTP Connector for Glue documentation.

About the Authors

Sean Bjurstrom is a Technical Account Manager in ISV accounts at Amazon Web Services, where he specializes in Analytics technologies and draws on his background in consulting to support customers on their analytics and cloud journeys. Sean is passionate about helping businesses harness the power of data to drive innovation and growth. Outside of work, he enjoys running and has participated in several marathons.

Seun Akinyosoye is a Sr. Technical Account Manager supporting public sector customers at Amazon Web Services. Seun has a background in analytics and data engineering, which he uses to help customers achieve their outcomes and goals. Outside of work, Seun enjoys spending time with his family, reading, traveling, and supporting his favorite sports teams.

Vinod Jayendra is an Enterprise Support Lead in ISV accounts at Amazon Web Services, where he helps customers solve their architectural, operational, and cost optimization challenges. With a particular focus on serverless technologies, he draws from his extensive background in application development to deliver top-tier solutions. Beyond work, he finds joy in quality family time, embarking on biking adventures, and coaching youth sports teams.

Kamen Sharlandjiev is a Sr. Big Data and ETL Solutions Architect, MWAA and AWS Glue ETL expert. He’s on a mission to make life easier for customers who are facing complex data integration and orchestration challenges. His secret weapon? Fully managed AWS services that can get the job done with minimal effort. Follow Kamen on LinkedIn to keep up to date with the latest MWAA and AWS Glue features and news!

Chris Scull is a Solutions Architect dealing in orchestration tools and modern cloud technologies. With two years of experience at AWS, Chris has developed an interest in Amazon Managed Workflows for Apache Airflow, which allows for efficient data processing and workflow management. Additionally, he is passionate about exploring the capabilities of GenAI with Bedrock, a platform for building generative AI applications on AWS.

Shengjie Luo is a Big Data Architect on the Amazon Cloud Technology professional services team. Shengjie is responsible for solutions consulting, architecture, and delivery of AWS-based data warehouses and data lakes, and is skilled in serverless computing, data migration, cloud data integration, data warehouse planning, and data service architecture design and implementation.

Qiushuang Feng is a Solutions Architect at AWS, responsible for Enterprise customers’ technical architecture design, consulting, and design optimization on AWS Cloud services. Before joining AWS, Qiushuang worked in IT companies such as IBM and Oracle, and accumulated rich practical experience in development and analytics.

Automate Amazon Redshift Advisor recommendations with email alerts using an API

2024-08-12 Ranjan Burman

Post Syndicated from Ranjan Burman original https://aws.amazon.com/blogs/big-data/automate-amazon-redshift-advisor-recommendations-with-email-alerts-using-an-api/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that enables you to analyze your data at scale. Amazon Redshift now allows you to programmatically access Amazon Redshift Advisor recommendations through an API, enabling you to integrate recommendations about how to improve your provisioned cluster performance into your own applications.

Amazon Redshift Advisor offers recommendations about optimizing your Redshift cluster performance and helps you save on operating costs. Advisor develops its customized recommendations by analyzing performance and usage metrics for your cluster and displays recommendations that should have a significant impact on performance and operations. Now, with the ability to programmatically access these recommendations through the ListRecommendations API, you can make recommendations available to implement on-demand or automatically through your own internal applications and tools without the need to access the Amazon Redshift console.

In this post, we show you how to use the ListRecommendations API to set up email notifications for Advisor recommendations on your Redshift cluster. These recommendations, such as identifying tables that should be vacuumed to sort the data or finding table columns that are candidates for compression, can help improve performance and save costs.

How to access Redshift Advisor recommendations

To access Advisor recommendations on the Amazon Redshift console, choose Advisor in the navigation pane. You can expand each recommendation to see more details, and sort and group recommendations.

You can also use the ListRecommendations API to automate receiving the Advisor recommendations and programmatically implement them. The API returns a list of recommended actions that can be parsed and implemented. The API and SDKs also enable you to set up workflows to use Advisor programmatically for automated optimizations. These automated periodic checks of Advisor using cron scheduling along with implementing the changes can help you keep Redshift clusters optimized automatically without manual intervention.

You can also use the list-recommendations command in the AWS Command Line Interface (AWS CLI) to invoke the Advisor recommendations from the command line and automate the workflow through scripts.
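As an illustration of the programmatic route, the following is a minimal sketch that collects every recommendation by following the pagination marker. It assumes the client passed in is `boto3.client("redshift")`, whose `list_recommendations` operation corresponds to the ListRecommendations API; the helper name `fetch_recommendations` is hypothetical.

```python
def fetch_recommendations(client, cluster_identifier=None):
    """Collect all Advisor recommendations, following the pagination marker.

    client is assumed to be boto3.client("redshift"). When
    cluster_identifier is empty, recommendations for all provisioned
    clusters in the account and Region are requested.
    """
    kwargs = {}
    if cluster_identifier:
        kwargs["ClusterIdentifier"] = cluster_identifier
    recommendations = []
    while True:
        page = client.list_recommendations(**kwargs)
        recommendations.extend(page.get("Recommendations", []))
        marker = page.get("Marker")
        if not marker:
            return recommendations
        # Ask for the next page on the following iteration.
        kwargs["Marker"] = marker


# Example usage (the cluster name is a placeholder):
# import boto3
# recs = fetch_recommendations(boto3.client("redshift"), "my-redshift-cluster")
# for rec in recs:
#     print(rec["ImpactRanking"], rec["Title"])
```

Passing the client in as an argument keeps the function easy to exercise with a stub before pointing it at a real account.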

Solution overview

The following diagram illustrates the solution architecture.

The solution workflow consists of the following steps:

An Amazon EventBridge schedule invokes an AWS Lambda function to retrieve Advisor recommendations.
Advisor generates recommendations that are accessible through an API.
Optionally, this solution stores the recommendations in an Amazon Simple Storage Service (Amazon S3) bucket.
Amazon Simple Notification Service (Amazon SNS) automatically sends notifications to end-users.

Prerequisites

To deploy this solution, you should have the following:

An AWS account
A Redshift provisioned cluster
An SNS topic with an email subscription
Administrator access to launch the AWS CloudFormation stack
Optionally, an S3 bucket

Deploy the solution

Complete the following steps to deploy the solution:

Choose Launch Stack.

For Stack name, enter a name for the stack, for example, blog-redshift-advisor-recommendations.
For SnsTopicArn, enter the SNS topic Amazon Resource Name (ARN) for receiving the email alerts.
For ClusterIdentifier, enter your Redshift cluster name if you want to receive Advisor notifications for a particular cluster. If you leave it blank, you will receive notifications for all Redshift provisioned clusters in your account.
For S3Bucket, enter the S3 bucket name to store the detailed Advisor recommendations in a JSON file. If you leave it blank, this step will be skipped.
For ScheduleExpression, enter the frequency in cron format to receive Advisor recommendation alerts. For this post, we want to receive alerts every Sunday at 14:00 UTC, so we enter cron(0 14 ? * SUN *).

Make sure to provide the correct cron time expression when deploying the CloudFormation stack to avoid any failures.

Keep all options as default under Configure Stack options and choose Next.
Review the settings, select the acknowledge check box, and create the stack.

If the CloudFormation stack fails for any reason, refer to Troubleshooting CloudFormation.
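One common cause of failure is a malformed schedule expression. The following is a small, hypothetical pre-flight check, not part of the solution itself. It assumes the EventBridge cron format of six fields (minutes, hours, day-of-month, month, day-of-week, year) in which day-of-month and day-of-week cannot both be specified, so one of them must be `?`. It checks structure only and does not validate the value ranges of each field.

```python
import re

_CRON_RE = re.compile(r"^cron\((.+)\)$")


def check_schedule_expression(expression):
    """Return the six cron fields, or raise ValueError on a structural problem."""
    match = _CRON_RE.match(expression.strip())
    if not match:
        raise ValueError("expected the form cron(<fields>)")
    fields = match.group(1).split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    day_of_month, day_of_week = fields[2], fields[4]
    if day_of_month != "?" and day_of_week != "?":
        raise ValueError("one of day-of-month and day-of-week must be '?'")
    return fields


# The expression used in this post: every Sunday at 14:00 UTC.
print(check_schedule_expression("cron(0 14 ? * SUN *)"))
```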

After the CloudFormation template is deployed, it will create the following resources:

A Lambda function
An EventBridge scheduled rule
AWS Identity and Access Management (IAM) roles and policies for the services to communicate with each other

Workflow details

Let’s take a closer look at the Lambda function and the complete workflow.

The input values provided for SnsTopicArn, ClusterIdentifier, and S3Bucket during CloudFormation stack creation are set as environment variables in the Lambda function. If the ClusterIdentifier parameter is None, the function invokes the ListRecommendations API to retrieve Advisor recommendations for all the clusters within the account (in the same AWS Region). Otherwise, it passes the ClusterIdentifier value and retrieves Advisor recommendations only for the given cluster. If the input parameter S3Bucket is provided, the solution creates a folder named RedshiftAdvisorRecommendations and writes the Advisor recommendations file in JSON format within it. If a value for S3Bucket isn’t provided, this step is skipped.

Next, the function will summarize recommendations by each provisioned cluster (for all clusters in the account or a single cluster, depending on your settings) based on the impact on performance and cost as HIGH, MEDIUM, and LOW categories. An SNS notification email will be sent to the subscribers with the summarized recommendations.

SQL commands are included as part of the Advisor’s recommended action. RecommendedActionType-SQL summarizes the number of SQL actions that can be applied using SQL commands.

If there are no recommendations available for any cluster, the SNS notification email states that there are no Advisor recommendations.
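The summarization step can be sketched as follows. This is an illustrative approximation, not the code deployed by the CloudFormation template. It assumes each recommendation is a dictionary shaped like the ListRecommendations response, with ClusterIdentifier, ImpactRanking, and a RecommendedActions list whose entries carry a Type; the function names and the message layout are hypothetical.

```python
from collections import Counter, defaultdict


def summarize(recommendations):
    """Count recommendations per cluster by impact ranking, plus SQL actions."""
    summary = defaultdict(Counter)
    for rec in recommendations:
        cluster = rec.get("ClusterIdentifier", "unknown")
        summary[cluster][rec.get("ImpactRanking", "UNKNOWN")] += 1
        for action in rec.get("RecommendedActions", []):
            if action.get("Type") == "SQL":
                summary[cluster]["RecommendedActionType-SQL"] += 1
    return {cluster: dict(counts) for cluster, counts in summary.items()}


def build_message(summary):
    """Render the summary as the plain-text body of a notification."""
    if not summary:
        return "No Advisor recommendations are available."
    keys = ("HIGH", "MEDIUM", "LOW", "RecommendedActionType-SQL")
    lines = []
    for cluster in sorted(summary):
        counts = summary[cluster]
        parts = [f"{key}={counts.get(key, 0)}" for key in keys]
        lines.append(f"{cluster}: " + ", ".join(parts))
    return "\n".join(lines)


# The resulting text could then be published with an SNS client, for example:
# sns.publish(TopicArn=topic_arn, Subject="Redshift Advisor summary", Message=text)
```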

An EventBridge rule is created to invoke the Lambda function based on the frequency you provided in the stack parameters. By default, it’s scheduled to run weekly each Sunday at 14:00 UTC.

The following is a screenshot of a sample SNS notification email.

Clean up

We recommend deleting the CloudFormation stack if you aren’t going to continue using the solution. This will avoid incurring any additional costs from the resources created as part of the solution.

Conclusion

In this post, we discussed how Redshift Advisor offers you specific recommendations to improve the performance of and decrease the operating costs for your Redshift cluster. We also showed you how to programmatically access these recommendations through an API and implement them on-demand or automatically using your own internal tools without having access to the Amazon Redshift console.

By integrating these recommendations into your workflows, you can make informed decisions and implement best practices to optimize the performance and costs of your Redshift clusters, ultimately enhancing the overall efficiency and productivity of your data processing operations.

We encourage you to try out this automated solution to access Advisor recommendations programmatically. If you have any feedback or questions, please leave them in the comments.

About the authors

Ranjan Burman is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 16 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with cloud solutions.

Nita Shah is a Senior Analytics Specialist Solutions Architect at AWS based out of New York. She has been building data warehouse solutions for over 20 years and specializes in Amazon Redshift. She is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.

Vamsi Bhadriraju is a Data Architect at AWS. He works closely with enterprise customers to build data lakes and analytical applications on the AWS Cloud.

Sumant Nemmani is a Senior Technical Product Manager at AWS. He is focused on helping customers of Amazon Redshift benefit from features that use machine learning and intelligent mechanisms to enable the service to self-tune and optimize itself, ensuring Redshift remains price-performant as they scale their usage.

Migrate Amazon Redshift from DC2 to RA3 to accommodate increasing data volumes and analytics demands

2024-08-09 Valdiney Gomes

Post Syndicated from Valdiney Gomes original https://aws.amazon.com/blogs/big-data/migrate-amazon-redshift-from-dc2-to-ra3-to-accommodate-increasing-data-volumes-and-analytics-demands/

This is a guest post by Valdiney Gomes, Hélio Leal, Flávia Lima, and Fernando Saga from Dafiti.

As businesses strive to make informed decisions, the amount of data being generated and required for analysis is growing exponentially. This trend is no exception for Dafiti, an ecommerce company that recognizes the importance of using data to drive strategic decision-making processes. With the ever-increasing volume of data available, Dafiti faces the challenge of effectively managing and extracting valuable insights from this vast pool of information to gain a competitive edge and make data-driven decisions that align with company business objectives.

Amazon Redshift is widely used for Dafiti’s data analytics, supporting approximately 100,000 daily queries from over 400 users across three countries. These queries include both extract, transform, and load (ETL) and extract, load, and transform (ELT) processes and one-time analytics. Dafiti’s data infrastructure relies heavily on ETL and ELT processes, with approximately 2,500 unique processes run daily. These processes retrieve data from around 90 different data sources, resulting in updating roughly 2,000 tables in the data warehouse and 3,000 external tables in Parquet format, accessed through Amazon Redshift Spectrum and a data lake on Amazon Simple Storage Service (Amazon S3).

The growing need for storage space to maintain data from over 90 sources and the functionality available on the new Amazon Redshift node types, including managed storage, data sharing, and zero-ETL integrations, led us to migrate from DC2 to RA3 nodes.

In this post, we share how we handled the migration process and provide further impressions of our experience.

Amazon Redshift at Dafiti

Amazon Redshift is a fully managed data warehouse service that Dafiti adopted in 2017. Since then, we’ve had the opportunity to follow many innovations and have gone through three different node types. We started with 115 dc2.large nodes. With the launch of Redshift Spectrum and the migration of our cold data to the data lake, we considerably improved our architecture and migrated to four dc2.8xlarge nodes. RA3 introduced many features, allowing us to scale and pay for compute and storage independently. This is what brought us to the current moment, where we have eight ra3.4xlarge nodes in the production environment and a single-node ra3.xlplus cluster for development.

Given our scenario, where we have many data sources and a lot of new data being generated every moment, we came across a problem: the 10 TB we had available in our cluster was insufficient for our needs. Although most of our data is currently in the data lake, more storage space was needed in the data warehouse. This was solved by RA3, which scales compute and storage independently. Also, with zero-ETL, we simplified our data pipelines, ingesting tons of data in near real time from our Amazon Relational Database Service (Amazon RDS) instances, while data sharing enables a data mesh approach.

Migration process to RA3

Our first step towards migration was to understand how the new cluster should be sized; for this, AWS provides a recommendation table.

Given the configuration of our cluster, consisting of four dc2.8xlarge nodes, the recommendation was to switch to ra3.4xlarge.

At this point, one concern we had was regarding reducing the amount of vCPU and memory. With DC2, our four nodes provided a total of 128 vCPUs and 976 GiB; in RA3, even with eight nodes, these values were reduced to 96 vCPUs and 768 GiB. However, the performance was improved, with processing of workloads 40% faster in general.
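The totals quoted above can be reproduced with a few lines of arithmetic. In this sketch, the per-node figures are assumptions inferred from the totals in the text (total divided by node count); check them against the current Amazon Redshift node type documentation before relying on them.

```python
# Per-node figures inferred from the totals quoted in the text (assumption).
NODE_SPECS = {
    "dc2.8xlarge": {"vcpu": 32, "memory_gib": 244},
    "ra3.4xlarge": {"vcpu": 12, "memory_gib": 96},
}


def cluster_totals(node_type, node_count):
    """Return (total vCPUs, total memory in GiB) for a cluster."""
    spec = NODE_SPECS[node_type]
    return spec["vcpu"] * node_count, spec["memory_gib"] * node_count


old_vcpu, old_mem = cluster_totals("dc2.8xlarge", 4)
new_vcpu, new_mem = cluster_totals("ra3.4xlarge", 8)
print(f"vCPU: {old_vcpu} -> {new_vcpu} ({1 - new_vcpu / old_vcpu:.0%} fewer)")
print(f"Memory: {old_mem} GiB -> {new_mem} GiB ({1 - new_mem / old_mem:.0%} less)")
```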

AWS offers Redshift Test Drive to validate whether the configuration chosen for Amazon Redshift is ideal for your workload before migrating the production environment. At Dafiti, given the particularities of our workload, which gives us some flexibility to make changes to specific windows without affecting the business, it wasn’t necessary to use Redshift Test Drive.

We carried out the migration as follows:

We created a new cluster with eight ra3.4xlarge nodes from the snapshot of our four-node dc2.8xlarge cluster. This process took around 10 minutes to create the new cluster with 8.75 TB of data.
We turned off our internal ETL and ELT orchestrator, to prevent our data from being updated during the migration period.
We changed the DNS to point to the new cluster, in a way that was transparent to our users. At this point, only one-time queries and those made by Amazon QuickSight reached the new cluster.
After the read query validation stage was complete and we were satisfied with the performance, we reconnected our orchestrator so that the data transformation queries could be run in the new cluster.
We removed the DC2 cluster and completed the migration.
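The first step, creating the RA3 cluster from a DC2 snapshot, can also be scripted. The following is one possible sketch and not necessarily how the migration was performed. It assumes the client is `boto3.client("redshift")` and uses its `restore_from_cluster_snapshot` operation; the helper names and the identifiers in the usage comment are placeholders.

```python
def restore_params(snapshot_id, new_cluster_id, node_type="ra3.4xlarge", number_of_nodes=8):
    """Build the arguments for restoring a snapshot onto a different node type."""
    return {
        "ClusterIdentifier": new_cluster_id,
        "SnapshotIdentifier": snapshot_id,
        "NodeType": node_type,
        "NumberOfNodes": number_of_nodes,
    }


def restore_to_ra3(client, snapshot_id, new_cluster_id):
    """Start the restore; client is assumed to be boto3.client('redshift')."""
    return client.restore_from_cluster_snapshot(
        **restore_params(snapshot_id, new_cluster_id)
    )


# Example usage (identifiers are placeholders):
# import boto3
# restore_to_ra3(boto3.client("redshift"), "dc2-final-snapshot", "analytics-ra3")
```

Keeping the DC2 cluster running until the final step is what makes the DNS-based rollback possible.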

The following diagram illustrates the migration architecture.

During the migration, we defined some checkpoints at which a rollback would be performed if something unwanted happened. The first checkpoint was in Step 3, where the reduction in performance in user queries would lead to a rollback. The second checkpoint was in Step 4, if the ETL and ELT processes presented errors or there was a loss of performance compared to the metrics collected from the processes run in DC2. In both cases, the rollback would simply occur by changing the DNS to point to DC2 again, because it would still be possible to rebuild all processes within the defined maintenance window.

Results

The RA3 family introduced many features and allowed us to scale and pay for compute and storage independently, which changed the game at Dafiti. Before, we had a cluster that performed as expected but limited us in terms of storage, requiring daily maintenance to keep disk space under control.

The RA3 nodes performed better, and workloads ran 40% faster in general. This represents a significant decrease in the delivery time of our critical data analytics processes.

This improvement became even more pronounced in the days following the migration, due to the ability in Amazon Redshift to optimize caching, statistics, and apply performance recommendations. Additionally, Amazon Redshift is able to provide recommendations for optimizing our cluster based on our workload demands through Amazon Redshift Advisor recommendations, and offers automatic table optimization, which played a key role in achieving a seamless transition.

Moreover, the storage capacity leap from 10 TB to multiple PB solved Dafiti’s primary challenge of accommodating growing data volumes. This substantial increase in storage capabilities, combined with the unexpected performance enhancements, demonstrated that the migration to RA3 nodes was a successful strategic decision that addressed Dafiti’s evolving data infrastructure requirements.

Data sharing has been used since the moment of migration, to share data between the production and development environment, but the natural evolution is to enable the data mesh at Dafiti through this resource. The limitation we had was the need to activate case sensitivity, which is a prerequisite for data sharing, and which forced us to change some broken processes. But that was nothing compared to the benefits we’re seeing from migrating to RA3.

Conclusion

In this post, we discussed how Dafiti handled migrating to Redshift RA3 nodes, and the benefits of this migration.

Do you want to know more about what we’re doing in the data area at Dafiti? Check out the following resources:

The content and opinions in this post are those of Dafiti’s authors and AWS is not responsible for the content or accuracy of this post.

About the Authors

Valdiney Gomes is Data Engineering Coordinator at Dafiti. He worked for many years in software engineering, migrated to data engineering, and currently leads an amazing team responsible for the data platform for Dafiti in Latin America.

Hélio Leal is a Data Engineering Specialist at Dafiti, responsible for maintaining and evolving the entire data platform at Dafiti using AWS solutions.

Flávia Lima is a Data Engineer at Dafiti, responsible for sustaining the data platform and providing data from many sources to internal customers.

Fernando Saga is a data engineer at Dafiti, responsible for maintaining Dafiti’s data platform using AWS solutions.

OpenSearch optimized instance (OR1) is game changing for indexing performance and cost

2024-08-07 Cedric Pelvet

Post Syndicated from Cedric Pelvet original https://aws.amazon.com/blogs/big-data/opensearch-optimized-instance-or1-is-game-changing-for-indexing-performance-and-cost/

Amazon OpenSearch Service securely unlocks real-time search, monitoring, and analysis of business and operational data for use cases like application monitoring, log analytics, observability, and website search.

In this post, we examine the OR1 instance type, an OpenSearch optimized instance introduced on November 29, 2023.

OR1 is an instance type for Amazon OpenSearch Service that provides a cost-effective way to store large amounts of data. A domain with OR1 instances uses Amazon Elastic Block Store (Amazon EBS) volumes for primary storage, with data copied synchronously to Amazon Simple Storage Service (Amazon S3) as it arrives. OR1 instances provide increased indexing throughput with high durability.

To learn more about OR1, see the introductory blog post.

While actively writing to an index, we recommend that you keep one replica. However, you can switch to zero replicas after a rollover, when the index is no longer being actively written to.

This can be done safely because the data is persisted in Amazon S3 for durability.

Note that in case of a node failure and replacement, your data will be automatically restored from Amazon S3, but it will be partially unavailable during the repair operation. You should therefore not use zero replicas where searches on indices that are no longer actively written to require high availability.
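Changing the replica count is an index settings update. The following is a minimal sketch, assuming an opensearch-py client object is passed in; the helper names and the index name in the usage comment are placeholders.

```python
def replica_settings(actively_written):
    """Return the index settings body: one replica while writing, zero afterwards."""
    return {"index": {"number_of_replicas": 1 if actively_written else 0}}


def apply_replica_settings(client, index, actively_written):
    """Update the replica count; client is assumed to be an opensearch-py client."""
    return client.indices.put_settings(
        index=index, body=replica_settings(actively_written)
    )


# Example usage after a rollover (the index name is a placeholder):
# apply_replica_settings(client, "my-rolled-over-index", actively_written=False)
```

This is equivalent to sending the same body to the index's `_settings` endpoint with a PUT request.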

Goal

In this blog post, we’ll explore how OR1 impacts the performance of OpenSearch workloads.

By providing segment replication, OR1 instances save CPU cycles by indexing only on the primary shards. By doing that, the nodes are able to index more data with the same amount of compute, or to use fewer resources for indexing and thus have more available for search and other operations.

For this post, we’re going to consider an indexing-heavy workload and do some performance testing.

Traditionally, Amazon Elastic Compute Cloud (Amazon EC2) R6g instances are a high-performing choice for indexing-heavy workloads, relying on Amazon EBS storage. Im4gn instances provide local NVMe SSD for high-throughput, low-latency disk writes.

We compare OR1 against these two instance types, focusing only on indexing performance for the scope of this post.

Setup

For our performance testing, we set up multiple components, as shown in the following figure:

For the testing process:

AWS Step Functions orchestrates an initialization step, which cleans up the environment and sets up the index mapping, and then runs the batch testing.
AWS Batch runs parallel jobs to index log data in OpenTelemetry JSON format.
The jobs run a custom Rust program that generates randomized logs using the OpenSearch Rust Client with AWS Identity and Access Management (IAM) authentication.
The OpenSearch Service domain is set up with OpenSearch 2.11, two availability zones, fine-grained access control, encryption at rest using AWS Key Management Service (AWS KMS), and encryption in transit using TLS.

The index mapping, which is part of our initialization step, is as follows:

{
  "index_patterns": [
    "logs-*"
  ],
  "data_stream": {
    "timestamp_field": {
      "name": "time"
    }
  },
  "template": {
    "settings": {
      "number_of_shards": <VARYING>,
      "number_of_replicas": 1,
      "refresh_interval": "20s"
    },
    "mappings": {
      "dynamic": false,
      "properties": {
        "traceId": {
          "type": "keyword"
        },
        "spanId": {
          "type": "keyword"
        },
        "severityText": {
          "type": "keyword"
        },
        "flags": {
          "type": "long"
        },
        "time": {
          "type": "date",
          "format": "date_time"
        },
        "severityNumber": {
          "type": "long"
        },
        "droppedAttributesCount": {
          "type": "long"
        },
        "serviceName": {
          "type": "keyword"
        },
        "body": {
          "type": "text"
        },
        "observedTime": {
          "type": "date",
          "format": "date_time"
        },
        "schemaUrl": {
          "type": "keyword"
        },
        "resource": {
          "type": "flat_object"
        },
        "instrumentationScope": {
          "type": "flat_object"
        }
      }
    }
  }
}

As you can see, we’re using a data stream to simplify the rollover configuration and keep the maximum primary shard size under 50 GiB, as per best practices.

We optimized the mapping to avoid any unnecessary indexing activity and use the flat_object field type to avoid field mapping explosion.

For reference, the Index State Management (ISM) policy we used is as follows:

{
  "policy": {
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          {
            "rollover": {
              "min_primary_shard_size": "50gb"
            }
          }
        ],
        "transitions": []
      }
    ],
    "ism_template": [
      {
        "index_patterns": [
          "logs-*"
        ]
      }
    ]
  }
}

Our average document size is 1.6 KiB and each bulk request contains 4,000 documents, which makes approximately 6.26 MiB per bulk request (uncompressed).
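The bulk size follows directly from the document size and count. A quick check, using the rounded 1.6 KiB average quoted above (which is why the result differs slightly from the stated figure):

```python
AVG_DOC_KIB = 1.6       # rounded average document size from the text
DOCS_PER_BULK = 4_000   # documents per bulk request


def bulk_size_mib(avg_doc_kib, docs_per_bulk):
    """Uncompressed size of one bulk request in MiB (1 MiB = 1,024 KiB)."""
    return avg_doc_kib * docs_per_bulk / 1024


print(f"{bulk_size_mib(AVG_DOC_KIB, DOCS_PER_BULK):.2f} MiB per bulk request")
```

With the rounded inputs, this gives about 6.25 MiB, in line with the approximately 6.26 MiB quoted.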

Testing protocol

The protocol parameters are as follows:

Number of data nodes: 6 or 12
Jobs parallelism: 75, 40
Primary shard count: 12, 48, 96 (for 12 nodes)
Number of replicas: 1 (total of 2 copies)
Instance types (each with 16 vCPUs):
- or1.4xlarge.search
- r6g.4xlarge.search
- im4gn.4xlarge.search

Cluster	Instance type	vCPU	RAM (GiB)	JVM size (GiB)
or1-target	or1.4xlarge.search	16	128	32
im4gn-target	im4gn.4xlarge.search	16	64	32
r6g-target	r6g.4xlarge.search	16	128	32

Note that the im4gn cluster has half the memory of the other two, but still each environment has the same JVM heap size of approximately 32 GiB.

Performance testing results

For the performance testing, we started with 75 parallel jobs and 750 batches of 4,000 documents per client (a total of 225 million documents). We then adjusted the number of shards, data nodes, replicas, and jobs.

Configuration 1: 6 data nodes, 12 primary shards, 1 replica

With 6 data nodes, 12 primary shards, and 1 replica, we observed the following performance:

Cluster	CPU usage	Time taken	Indexing speed (documents)	Indexing speed (data)
or1-target	65-80%	24 min	156 kdoc/s	243 MiB/s
im4gn-target	89-97%	34 min	110 kdoc/s	172 MiB/s
r6g-target	88-95%	34 min	110 kdoc/s	172 MiB/s

As the table shows, the Im4gn and R6g clusters have very high CPU usage, which triggers admission control and causes documents to be rejected.

The OR1 cluster sustains CPU usage below 80 percent, which is a very good target.
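The columns of the results table are related by simple arithmetic, which makes it possible to cross-check them. The following sketch uses the rounded 1.6 KiB average document size from the setup section, so its results agree with the table only to within rounding; the function names are ours.

```python
def total_documents(jobs, batches_per_job, docs_per_batch):
    """Total number of documents sent by all parallel jobs."""
    return jobs * batches_per_job * docs_per_batch


def minutes_to_index(total_docs, docs_per_second):
    """Elapsed time in minutes at a sustained indexing rate."""
    return total_docs / docs_per_second / 60


def throughput_mib_s(docs_per_second, avg_doc_kib=1.6):
    """Convert documents per second to MiB per second."""
    return docs_per_second * avg_doc_kib / 1024


docs = total_documents(jobs=75, batches_per_job=750, docs_per_batch=4_000)
for cluster, rate in (("or1-target", 156_000), ("im4gn-target", 110_000)):
    print(
        cluster,
        f"{minutes_to_index(docs, rate):.0f} min",
        f"{throughput_mib_s(rate):.0f} MiB/s",
    )
```

The same functions apply to the 40-job runs by changing `jobs` to 40, which yields 120 million documents.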

Things to keep in mind:

In production, don’t forget to retry indexing with exponential backoff to avoid dropping unindexed documents because of intermittent rejections.
The bulk indexing operation returns 200 OK but can have partial failures. The body of the response must be checked to validate that all the documents were indexed successfully.
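Both points can be combined in a small retry loop. The following is a minimal sketch under stated assumptions: `send_bulk` is a hypothetical callable that sends a list of documents and returns the parsed `_bulk` response body, and a per-item status of 429 or 5xx is treated as retryable. It is not the Rust client code used in our tests.

```python
import random
import time


def failed_items(bulk_response):
    """Return (position, status) for every item that was not indexed."""
    failures = []
    if not bulk_response.get("errors"):
        return failures
    for position, item in enumerate(bulk_response.get("items", [])):
        # Each item has a single key naming the action (index, create, ...).
        result = next(iter(item.values()))
        if result.get("status", 0) >= 300:
            failures.append((position, result["status"]))
    return failures


def bulk_with_retry(send_bulk, documents, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Send documents and retry only the rejected ones with exponential backoff.

    Returns the documents that could not be indexed (an empty list on success).
    """
    pending = list(documents)
    dead = []  # non-retryable failures, such as mapping errors
    for attempt in range(max_attempts):
        response = send_bulk(pending)
        retry = []
        for position, status in failed_items(response):
            target = retry if status == 429 or status >= 500 else dead
            target.append(pending[position])
        if not retry:
            return dead
        pending = retry
        if attempt < max_attempts - 1:
            # Exponential backoff with jitter: roughly base, 2*base, 4*base, ...
            sleep(base_delay * (2 ** attempt) * (1 + random.random()))
    return dead + pending
```

Retrying only the rejected documents, rather than the whole bulk request, avoids adding load to a cluster that is already under pressure.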

By reducing the number of parallel jobs from 75 to 40, while maintaining 750 batches of 4,000 documents per client (total 120M documents), we get the following:

Cluster	CPU usage	Time taken	Indexing speed (documents)	Indexing speed (data)
or1-target	25-60%	20 min	100 kdoc/s	156 MiB/s
im4gn-target	75-93%	19 min	105 kdoc/s	164 MiB/s
r6g-target	77-90%	20 min	100 kdoc/s	156 MiB/s

The throughput and CPU usage decreased, but the CPU remains high on Im4gn and R6g, while the OR1 is showing more CPU capacity to spare.

Configuration 2: 6 data nodes, 48 primary shards, 1 replica

For this configuration, we increased the number of primary shards from 12 to 48, which provides more parallelism for indexing:

Cluster	CPU usage	Time taken	Indexing speed (documents)	Indexing speed (data)
or1-target	60-80%	21 min	178 kdoc/s	278 MiB/s
im4gn-target	67-95%	34 min	110 kdoc/s	172 MiB/s
r6g-target	70-88%	37 min	101 kdoc/s	158 MiB/s

The indexing throughput increased for the OR1, but the Im4gn and R6g didn’t see an improvement because their CPU utilization is still very high.

Reducing the parallel jobs to 40 and keeping 48 primary shards, we can see that the OR1 gets a little more pressure: its minimum CPU usage is higher than with 12 primary shards (40 percent versus 25 percent). The CPU for R6g looks much better. For the Im4gn, however, the CPU is still high.

Cluster	CPU usage	Time taken	Indexing speed (documents)	Indexing speed (data)
or1-target	40-60%	16 min	125 kdoc/s	195 MiB/s
im4gn-target	80-94%	18 min	111 kdoc/s	173 MiB/s
r6g-target	70-80%	21 min	95 kdoc/s	148 MiB/s

Configuration 3: 12 data nodes, 96 primary shards, 1 replica

For this configuration, we started with the original configuration and added more compute capacity, moving from 6 nodes to 12 and increasing the number of primary shards to 96.

Cluster	CPU usage	Time taken	Indexing speed (documents)	Indexing speed (data)
or1-target	40-60%	18 min	208 kdoc/s	325 MiB/s
im4gn-target	74-90%	20 min	187 kdoc/s	293 MiB/s
r6g-target	60-78%	24 min	156 kdoc/s	244 MiB/s

The OR1 and the R6g are performing well with CPU usage below 80 percent, with OR1 giving 33 percent better performance with 30 percent less CPU usage compared to R6g.

The Im4gn is still at 90 percent CPU, but the performance is also very good.

Reducing the number of parallel jobs from 75 to 40, we get:

Cluster	CPU usage	Time taken	Indexing speed (documents)	Indexing speed (data)
or1-target	40-60%	11 min	182 kdoc/s	284 MiB/s
im4gn-target	70-90%	11 min	182 kdoc/s	284 MiB/s
r6g-target	60-77%	12 min	167 kdoc/s	260 MiB/s

Reducing the number of parallel jobs to 40 from 75 brought the OR1 and Im4gn instances on par and the R6g very close.

Interpretation

The OR1 instances speed up indexing because only the primary shards need to be written, while the replica is produced by copying segments. In addition to outperforming the Im4gn and R6g instances, OR1 uses less CPU, which leaves room for additional load (search) or a reduction in cluster size.

We can compare a 6-node OR1 cluster with 48 primary shards, indexing at 178 thousand documents per second, to a 12-node Im4gn cluster with 96 primary shards, indexing at 187 thousand documents per second or to a 12-node R6g cluster with 96 primary shards, indexing at 156 thousand documents per second.

The OR1 performs almost as well as the larger Im4gn cluster, and better than the larger R6g cluster.
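
One way to read this comparison is to normalize throughput by node count. The sketch below does that with the 75-job figures quoted above; per-node throughput is an illustrative metric introduced here, not one reported by the benchmark.

```python
# (nodes, kdoc/s) taken from the 75-job results quoted above
runs = {
    "or1, 6 nodes, 48 shards": (6, 178),
    "im4gn, 12 nodes, 96 shards": (12, 187),
    "r6g, 12 nodes, 96 shards": (12, 156),
}

per_node = {name: kdocs / nodes for name, (nodes, kdocs) in runs.items()}
for name, rate in per_node.items():
    print(f"{name}: {rate:.1f} kdoc/s per node")
```

On these numbers, each OR1 node indexes roughly twice as many documents per second as a node in either of the larger clusters.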

How to size when using OR1 instances

As you can see in the results, OR1 instances can process more data at higher throughput rates. However, when increasing the number of primary shards, they don’t perform as well because of the remote-backed storage.

To get the best throughput from the OR1 instance type, you can use larger batch sizes than usual, and use an Index State Management (ISM) policy to roll over your index based on size so that you can effectively limit the number of primary shards per index. You can also increase the number of connections because the OR1 instance type can handle more parallelism.
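
As an illustration of size-based rollover, the following sketch builds an ISM policy document as a Python dictionary. The 200 GB threshold, the state name, and the logs-* index pattern are placeholder assumptions rather than recommendations from the tests; choose values that fit your own shard sizing.

```python
import json

# Placeholder values: the threshold, state name, and index pattern are
# assumptions for illustration only.
rollover_policy = {
    "policy": {
        "description": "Roll over the write index once it reaches a size threshold",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [{"rollover": {"min_size": "200gb"}}],
                "transitions": [],
            }
        ],
        "ism_template": {"index_patterns": ["logs-*"], "priority": 100},
    }
}

print(json.dumps(rollover_policy, indent=2))
```

The resulting JSON can be sent as the body of a PUT request to _plugins/_ism/policies/<policy_id> on the domain.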

For search, OR1 doesn’t directly impact the search performance. However, as you can see, the CPU usage is lower on OR1 instances than on Im4gn and R6g instances. That enables either more activity (search and ingest), or the possibility to reduce the instance size or count, which would result in a cost reduction.

Conclusion and recommendations for OR1

The new OR1 instance type gives you more indexing power than the other instance types. This is important for indexing-heavy workloads, where you index in batch every day or have a high sustained throughput.

The OR1 instance type also enables cost reduction because its price-performance is 30 percent better than that of existing instance types. With more than one replica, this advantage widens because the CPU is barely impacted on an OR1 instance, while other instance types would see their indexing throughput decrease.

Check out the complete instructions for optimizing your workload for indexing using this repost article.

About the author

Cédric Pelvet is a Principal AWS Specialist Solutions Architect. He helps customers design scalable solutions for real-time data and search workloads. In his free time, his activities are learning new languages and practicing the violin.

AWS Glue mutual TLS authentication for Amazon MSK

2024-08-07 Edward Ondari

Post Syndicated from Edward Ondari original https://aws.amazon.com/blogs/big-data/aws-glue-mutual-tls-authentication-for-amazon-msk/

In today’s landscape, data streams continuously from countless sources, from social media interactions to Internet of Things (IoT) device readings. This torrent of real-time information presents both a challenge and an opportunity for businesses. To harness the power of this data effectively, organizations need robust systems for ingesting, processing, and analyzing streaming data at scale. Enter Apache Kafka: a distributed streaming platform that has revolutionized how companies handle real-time data pipelines and build responsive, event-driven applications. AWS Glue is used to process and analyze large volumes of real-time data and perform complex transformations on the streaming data from Apache Kafka.

Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed Apache Kafka service. You can activate a combination of authentication modes on new or existing MSK clusters. The supported authentication modes are AWS Identity and Access Management (IAM) access control, mutual Transport Layer Security (TLS), and Simple Authentication and Security Layer/Salted Challenge Response Authentication Mechanism (SASL/SCRAM). For more information about using IAM authentication, refer to Securely process near-real-time data from Amazon MSK Serverless using an AWS Glue streaming ETL job with IAM authentication.

Mutual TLS authentication requires both the server and the client to present certificates to prove their identity. It’s ideal for hybrid applications that need a common authentication model. It’s also a commonly used authentication mechanism for business-to-business applications and is used in standards such as open banking, which enables secure open API integrations for financial institutions. For Amazon MSK, AWS Private Certificate Authority (AWS Private CA) is used to issue the X.509 certificates and for authenticating clients.
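
To make the client side of mutual TLS concrete, here is a minimal sketch using Python’s standard ssl module. It is independent of Kafka and AWS Glue; the function name and file paths are illustrative assumptions. The context verifies the server’s certificate and hostname, and presents the client’s own certificate when one is supplied.

```python
import ssl
from typing import Optional


def build_mtls_client_context(certfile: Optional[str] = None,
                              keyfile: Optional[str] = None,
                              cafile: Optional[str] = None) -> ssl.SSLContext:
    """Client-side TLS context that verifies the server and can present a client cert."""
    # Server certificate and hostname verification are enabled by default.
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=cafile)
    if certfile:
        # The client's half of mutual TLS: its certificate plus private key.
        context.load_cert_chain(certfile=certfile, keyfile=keyfile)
    return context


# Example call with hypothetical file names:
# context = build_mtls_client_context("client-cert.pem", "private-key.pem")
```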

This post describes how to set up AWS Glue jobs to produce, consume, and process messages on an MSK cluster using mutual TLS authentication. AWS Glue will automatically infer the schema from the streaming data and store the metadata in the AWS Glue Data Catalog for analysis using analytics tools such as Amazon Athena.

Example use case

In our example use case, a hospital facility regularly monitors the body temperatures for patients admitted in the emergency ward using smart thermometers. Each device automatically records the patients’ temperature readings and posts the records to a central monitoring application API. Each posted record is a JSON formatted message that contains the deviceId that uniquely identifies the thermometer, a patientId to identify the patient, the patient’s temperature reading, and the eventTime when the temperature was recorded.

The central monitoring application checks the hourly average temperature readings for each patient and notifies the hospital’s healthcare workers when a patient’s average temperature falls outside the accepted range (36.1–37.2°C). In our case, we use the Athena console to analyze the readings.
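
The check described here can be sketched in a few lines of Python. This is only an illustration of the logic on in-memory records, not part of the solution; the sample values and the flag_patients helper are made up for the example, while the field names match the message format described above.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

LOW, HIGH = 36.1, 37.2  # accepted range in degrees Celsius, from the use case


def flag_patients(records):
    """Return {(patientId, hour): average} for hourly averages outside the range."""
    buckets = defaultdict(list)
    for record in records:
        hour = datetime.fromisoformat(record["eventTime"]).replace(
            minute=0, second=0, microsecond=0)
        buckets[(record["patientId"], hour)].append(record["temperature"])
    return {
        key: round(mean(values), 1)
        for key, values in buckets.items()
        if not LOW <= mean(values) <= HIGH
    }


# Made-up sample readings in the JSON shape described above.
sample = [
    {"deviceId": "d-1", "patientId": "PI00001", "temperature": 38.4,
     "eventTime": "2024-08-07 10:05:00.000000"},
    {"deviceId": "d-1", "patientId": "PI00001", "temperature": 38.8,
     "eventTime": "2024-08-07 10:35:00.000000"},
    {"deviceId": "d-2", "patientId": "PI00002", "temperature": 36.6,
     "eventTime": "2024-08-07 10:10:00.000000"},
]
print(flag_patients(sample))
```

Only the first patient is flagged, because the second patient’s hourly average lies inside the accepted range.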

Overview of the solution

In this post, we use an AWS Glue Python shell job to simulate incoming data from the hospital thermometers. This job produces messages that are securely written to an MSK cluster using mutual TLS authentication.

To process the streaming data from the MSK cluster, we deploy an AWS Glue Streaming extract, transform, and load (ETL) job. This job automatically infers the schema from the incoming data, stores the schema metadata in the Data Catalog, and then stores the processed data as efficient Parquet files in Amazon Simple Storage Service (Amazon S3). We use Athena to query the output table in the Data Catalog and uncover insights.

The following diagram illustrates the architecture of the solution.

Solution architecture

The solution workflow consists of the following steps:

Create a private certificate authority (CA) using AWS Certificate Manager (ACM).
Set up an MSK cluster with mutual TLS authentication.
Create a Java keystore (JKS) file and generate a client certificate and private key.
Create a Kafka connection in AWS Glue.
Create a Python shell job in AWS Glue to create a topic and push messages to Kafka.
Create an AWS Glue Streaming job to consume and process the messages.
Analyze the processed data in Athena.

Prerequisites

You should have the following prerequisites:

Access to AWS CloudShell or the AWS Command Line Interface (AWS CLI).
A VPC with a minimum of two subnets in two Availability Zones and a NAT gateway with a route to a public subnet. You can use the following AWS CloudFormation stack to set up the VPC:

This template creates two NAT gateways as shown in the following diagram. However, it’s possible to route the traffic to a single NAT gateway in one Availability Zone for test and development workloads. For redundancy in production workloads, it’s recommended that there is one NAT gateway available in each Availability Zone.

VPC setup

The stack also creates a security group with a self-referencing rule to allow communication between AWS Glue components.

Create a private CA using ACM

Complete the following steps to create a root CA. For more details, refer to Creating a private CA.

On the AWS Private CA console, choose Create a private CA.
For Mode options, select either General-purpose or Short-lived certificate for lower pricing.
For CA type options, select Root.
Provide certificate details by providing at least one distinguished name.

Create private CA

Leave the remaining default options and select the acknowledge checkbox.
Choose Create CA.
On the Actions menu, choose Install CA certificate and choose Confirm and install.

Install certificate

Set up an MSK cluster with mutual TLS authentication

Before setting up the MSK cluster, make sure you have a VPC with at least two private subnets in different Availability Zones and a NAT gateway with a route to the internet. A CloudFormation template is provided in the prerequisites section.

Complete the following steps to set up your cluster:

On the Amazon MSK console, choose Create cluster.
For Creation method, select Custom create.
For Cluster type, select Provisioned.
For Broker size, you can choose kafka.t3.small for the purpose of this post.
For Number of zones, choose 2.
Choose Next.
In the Networking section, select the VPC, private subnets, and security group you created in the prerequisites section.
In the Security settings section, under Access control methods, select TLS client authentication through AWS Certificate Manager (ACM).
For AWS Private CAs, choose the AWS private CA you created earlier.

The MSK cluster creation can take up to 30 minutes to complete.

Create a JKS file and generate a client certificate and private key

Using the root CA, you generate client certificates to use for authentication. The following instructions are for CloudShell, but can also be adapted for a client machine with Java and the AWS CLI installed.

Open a new CloudShell session and run the following commands to create the certs directory and install Java:

mkdir certs
cd certs
sudo yum -y install java-11-amazon-corretto-headless

Run the following command to create a keystore file with a private key in JKS format. Replace Distinguished-Name, Example-Alias, Your-Store-Pass, and Your-Key-Pass with strings of your choice:

keytool -genkey -keystore kafka.client.keystore.jks -validity 300 -storepass Your-Store-Pass -keypass Your-Key-Pass -dname "CN=Distinguished-Name" -alias Example-Alias -storetype pkcs12

Generate a certificate signing request (CSR) with the private key created in the preceding step:

keytool -keystore kafka.client.keystore.jks -certreq -file csr.pem -alias Example-Alias -storepass Your-Store-Pass -keypass Your-Key-Pass

Run the following command to remove the word NEW (and the single space that follows it) from the beginning and end of the file:

sed -i -E '1,$ s/NEW //' csr.pem

The file should start with -----BEGIN CERTIFICATE REQUEST----- and end with -----END CERTIFICATE REQUEST-----
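
If you are adapting these steps for a machine without sed, the same cleanup can be done in Python. The following sketch mirrors the substitution on an in-memory string; the function name is illustrative and the CSR body shown is a placeholder, not real data.

```python
def strip_new_marker(csr_text: str) -> str:
    """Drop the first 'NEW ' on each line, mirroring sed 's/NEW //'."""
    return "\n".join(line.replace("NEW ", "", 1) for line in csr_text.split("\n"))


# Placeholder CSR: only the header and footer lines matter for this example.
csr = (
    "-----BEGIN NEW CERTIFICATE REQUEST-----\n"
    "PLACEHOLDERBASE64BODY\n"
    "-----END NEW CERTIFICATE REQUEST-----\n"
)
cleaned = strip_new_marker(csr)
print(cleaned)
```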

Using the CSR file, create a client certificate using the following command. Replace Private-CA-ARN with the ARN of the private CA you created.

aws acm-pca issue-certificate --certificate-authority-arn Private-CA-ARN --csr fileb://csr.pem --signing-algorithm "SHA256WITHRSA" --validity Value=300,Type="DAYS"

The command should print out the ARN of the issued certificate. Save the CertificateArn value for use in the next step.

{
"CertificateArn": "arn:aws:acm-pca:region:account:certificate-authority/CA_ID/certificate/certificate_ID"
}

Use the Private-CA-ARN together with the CertificateArn (arn:aws:acm-pca:<region>:...) generated in the preceding step to retrieve the signed client certificate. This will create a client-cert.pem file.

aws acm-pca get-certificate --certificate-authority-arn Private-CA-ARN --certificate-arn Certificate-ARN | jq -r '.Certificate + "\n" + .CertificateChain' >> client-cert.pem
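
The same retrieval can be scripted with an SDK. The sketch below assumes a client object that behaves like boto3.client("acm-pca") and is passed in by the caller; the function name is illustrative. It joins the certificate and its chain in the same order as the jq expression above.

```python
def fetch_client_cert_bundle(pca_client, ca_arn: str, cert_arn: str) -> str:
    """Return the signed certificate followed by its chain, as PEM text."""
    # pca_client is assumed to behave like boto3.client("acm-pca").
    response = pca_client.get_certificate(
        CertificateAuthorityArn=ca_arn,
        CertificateArn=cert_arn,
    )
    return response["Certificate"] + "\n" + response["CertificateChain"]


# Example usage, with placeholder ARNs:
# import boto3
# bundle = fetch_client_cert_bundle(boto3.client("acm-pca"), "Private-CA-ARN", "Certificate-ARN")
# with open("client-cert.pem", "w") as handle:
#     handle.write(bundle)
```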

Add the certificate into the Java keystore so you can present it when you talk to the MSK brokers:

keytool -keystore kafka.client.keystore.jks -import -file client-cert.pem -alias Example-Alias -storepass Your-Store-Pass -keypass Your-Key-Pass -noprompt

Extract the private key from the JKS file. Provide the same destkeypass and deststorepass and enter the keystore password when prompted.

keytool -importkeystore -srckeystore kafka.client.keystore.jks -destkeystore keystore.p12 -srcalias Example-Alias -deststorepass Your-Store-Pass -destkeypass Your-Key-Pass -deststoretype PKCS12

Convert the private key to PEM format. Enter the keystore password you provided in the previous step when prompted.

openssl pkcs12 -in keystore.p12 -nodes -nocerts -out private-key.pem

Remove the lines that begin with Bag Attributes from the top of the file, keeping only the private key block:

sed -i -ne '/-BEGIN PRIVATE KEY-/,/-END PRIVATE KEY-/p' private-key.pem

Upload the client-cert.pem, kafka.client.keystore.jks, and private-key.pem files to Amazon S3. You can either create a new S3 bucket or use an existing bucket to store the following objects. Replace <s3://aws-glue-assets-11111111222222-us-east-1/certs/> with your S3 location.

aws s3 sync ~/certs s3://aws-glue-assets-11111111222222-us-east-1/certs/ --exclude '*' --include 'client-cert.pem' --include 'private-key.pem' --include 'kafka.client.keystore.jks'

Create a Kafka connection in AWS Glue

Complete the following steps to create a Kafka connection:

On the AWS Glue console, choose Data connections in the navigation pane.
Choose Create connection.
Select Apache Kafka and choose Next.
For Amazon Managed Streaming for Apache Kafka Cluster, choose the MSK cluster you created earlier.

Create Glue Kafka connection

Choose TLS client authentication for Authentication method.
Enter the S3 path to the keystore you created earlier and provide the keystore and client key passwords you used for the -storepass and -keypass options.

Add authentication method to connection

Under Networking options, choose your VPC, a private subnet, and a security group. The security group should contain a self-referencing rule.
On the next page, provide a name for the connection (for example, Kafka-connection) and choose Create connection.

Create a Python shell job in AWS Glue to create a topic and push messages to Kafka

In this section, you create a Python shell job to create a new Kafka topic and push JSON messages to the topic. Complete the following steps:

On the AWS Glue console, choose ETL jobs.
In the Script section, for Engine, choose Python shell.
Choose Create script.

Create Python shell job

Enter the following script in the editor:

import sys
from awsglue.utils import getResolvedOptions
from kafka.admin import KafkaAdminClient, NewTopic
from kafka import KafkaProducer
from kafka.errors import TopicAlreadyExistsError
from urllib.parse import urlparse

import json
import uuid
import datetime
import boto3
import time
import random

# Fetch job parameters (hyphens in parameter names become underscores in args)
args = getResolvedOptions(sys.argv, ['connection-names', 'client-cert', 'private-key'])

TOPIC = 'example_topic'

# Download the client certificate and private key files from S3
client_cert = urlparse(args['client_cert'])
private_key = urlparse(args['private_key'])

s3 = boto3.client('s3')
s3.download_file(client_cert.netloc, client_cert.path.lstrip('/'), client_cert.path.split('/')[-1])
s3.download_file(private_key.netloc, private_key.path.lstrip('/'), private_key.path.split('/')[-1])

# Fetch bootstrap servers from the AWS Glue connection
if ',' in args['connection_names']:
    raise ValueError("Choose only one connection name in the job details tab!")
glue_client = boto3.client('glue')
response = glue_client.get_connection(Name=args['connection_names'], HidePassword=True)
bootstrapServers = response['Connection']['ConnectionProperties']['KAFKA_BOOTSTRAP_SERVERS']

# Create topic and push messages 
admin_client = KafkaAdminClient(bootstrap_servers= bootstrapServers, security_protocol= 'SSL', ssl_certfile= client_cert.path.split('/')[-1], ssl_keyfile= private_key.path.split('/')[-1])
try:
    admin_client.create_topics(new_topics=[NewTopic(name=TOPIC, num_partitions=1, replication_factor=1)], validate_only=False)
except TopicAlreadyExistsError:
    # Topic already exists
    pass
admin_client.close()

# Generate JSON messages for the new topic
producer = KafkaProducer(value_serializer=lambda m: json.dumps(m).encode('ascii'), bootstrap_servers=bootstrapServers, security_protocol='SSL', 
                         ssl_check_hostname=True, ssl_certfile= client_cert.path.split('/')[-1], ssl_keyfile= private_key.path.split('/')[-1])
                         
for i in range(1200):
    _event = {
        "deviceId": str(uuid.uuid4()),
        "patientId": "PI" + str(random.randint(1,15)).rjust(5, '0'),
        "temperature": round(random.uniform(32.1, 40.9), 1),
        "eventTime": str(datetime.datetime.now())
    }
    producer.send(TOPIC, _event)
    time.sleep(3)
    
producer.close()

On the Job details tab, provide a name for your job, such as Kafka-msk-producer.
Choose an IAM role. If you don’t have one, create one following the instructions in Configuring IAM permissions for AWS Glue.
Under Advanced properties, for Connections, choose the Kafka-connection connection you created.
Under Job parameters, add the following parameters and values:
1. Key: --additional-python-modules, value: kafka-python.
2. Key: --client-cert, value: s3://aws-glue-assets-11111111222222-us-east-1/certs/client-cert.pem. Replace with your client-cert.pem Amazon S3 location from earlier.
3. Key: --private-key, value: s3://aws-glue-assets-11111111222222-us-east-1/certs/private-key.pem. Replace with your private-key.pem Amazon S3 location from earlier.
Save and run the job.

You can confirm that the job run status is Running on the Runs tab.

At this point, we have successfully created a Python shell job to simulate the thermometers sending temperature readings to the monitoring application. The job will run for approximately 1 hour and push 1,200 records to Amazon MSK.
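
The script locates the certificate and key by splitting the S3 URIs passed in the --client-cert and --private-key parameters. The following standalone sketch shows that split with the example bucket from this post; split_s3_uri is an illustrative helper, not a function in the job script.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str):
    """Return (bucket, key, local_filename) for an s3:// URI."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    key = parsed.path.lstrip("/")
    return parsed.netloc, key, key.split("/")[-1]


bucket, key, filename = split_s3_uri(
    "s3://aws-glue-assets-11111111222222-us-east-1/certs/client-cert.pem")
print(bucket)    # aws-glue-assets-11111111222222-us-east-1
print(key)       # certs/client-cert.pem
print(filename)  # client-cert.pem
```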

Alternatively, you can replace the Python shell job with a Scala ETL job to act as a producer to send messages to the MSK cluster. In this case, use the JKS file for authentication using ssl.keystore.type=JKS. As of this writing, the version of the Kafka client libraries (2.4.1) installed in AWS Glue version 4 doesn’t support authentication through certificates in PEM format.

Create an AWS Glue Streaming job to consume and process the messages

You can now create an AWS Glue ETL job to consume and process the messages in the Kafka topic. AWS Glue will automatically infer the schema from the files. Complete the following steps:

On the AWS Glue console, choose Visual ETL in the navigation pane.
Choose Visual ETL to author a new job.
For Sources, choose Apache Kafka.
For Connection name, choose the node and connection name you created earlier.
For Topic name, enter the topic name (example_topic) you created earlier.
Leave the rest of the options as default.

Kafka data source

Add a new target node called Amazon S3 to store the output Parquet files generated from the streaming data.
Choose Parquet as the data format and provide an S3 output location for the generated files.
Select the option to allow AWS Glue to create a table in the Data Catalog and provide the database and table names.

S3 Output node

On the job details tab, provide the following options:
1. For the requested number of workers, enter 2.
2. For IAM Role, choose an IAM role with permissions to read and write to the S3 output location.
3. For Job timeout, enter 60 (for the job to stop after 60 minutes).
4. Under Advanced properties, for Connections, choose the connection you created.
Save and run the job.

You can check the S3 output location for new Parquet files, which are created under prefixes of the form s3://<output-location>/ingest_year=XXXX/ingest_month=XX/ingest_day=XX/ingest_hour=XX/.

At this point, you have created a streaming job to process events from Amazon MSK and store the JSON formatted records as Parquet files in Amazon S3. AWS Glue streaming jobs are meant to be running continuously to process streaming data. We have set the timeout to stop the job after 60 minutes. You can also stop the job manually after the records have been processed to Amazon S3.

Analyze the data in Athena

Going back to our example use case, you can run the following query in Athena to track patients whose hourly average temperature readings fall outside the normal range (36.1–37.2°C):

SELECT
date_format(parse_datetime(eventTime, 'yyyy-MM-dd HH:mm:ss.SSSSSS'), '%h %p') hour,
patientId,
round(avg(temperature), 1) average_temperature,
count(temperature) readings
FROM "default"."devices_data"
GROUP BY 1, 2
HAVING avg(temperature) > 37.2 or avg(temperature) < 36.1
ORDER BY 2, 1 DESC

Amazon Athena Console

Run the query multiple times and observe how the average_temperature and the number of readings change with new incoming data from the AWS Glue Streaming job. In our example scenario, healthcare workers can use this information to identify patients who are experiencing consistently high or low body temperatures and give them the required attention.

At this point, we have successfully created and ingested streaming data to our MSK cluster using mutual TLS authentication. We only needed the certificates generated by AWS Private CA to authenticate our AWS Glue clients to the MSK cluster and process the streaming data with an AWS Glue Streaming job. Finally, we used Athena to visualize the data and observed how the data changes in near real time.

Clean up

To clean up the resources created in this post, complete the following steps:

Delete the private CA you created.
Delete the MSK cluster you created.
Delete the AWS Glue connection you created.
Stop the jobs if they are still running and delete the jobs you created.
If you used the CloudFormation stack provided in the prerequisites, delete the CloudFormation stack to delete the VPC and other networking components.

Conclusion

This post demonstrated how you can use AWS Glue to consume, process, and store streaming data from Amazon MSK using mutual TLS authentication. AWS Glue Streaming automatically infers the schema and creates a table in the Data Catalog. You can then query the table using other data analysis tools like Athena, Amazon Redshift, and Amazon QuickSight to provide insights into the streaming data.

Try out the solution for yourself, and let us know your questions and feedback in the comments section.

About the Authors

Edward Okemwa is a Big Data Cloud Support Engineer (ETL) at AWS Nairobi specializing in AWS Glue and Amazon Athena. He is dedicated to providing customers with technical guidance and resolving issues related to processing and analyzing large volumes of data. In his free time, he enjoys singing choral music and playing football.

Emmanuel Mashandudze is a Senior Big Data Cloud Engineer specializing in AWS Glue. He collaborates with product teams to help customers efficiently transform data in the cloud. He helps customers design and implement robust data pipelines. Outside of work, Emmanuel is an avid marathon runner and sports enthusiast who enjoys creating memories with his family.

Network perimeter security protections for generative AI

2024-08-07 Riggs Goodman III

Post Syndicated from Riggs Goodman III original https://aws.amazon.com/blogs/security/network-perimeter-security-protections-for-generative-ai/

Generative AI–based applications have grown in popularity in the last couple of years. Applications built with large language models (LLMs) have the potential to increase the value companies bring to their customers. In this blog post, we dive deep into network perimeter protection for generative AI applications. We’ll walk through the different areas of network perimeter protection you should consider, discuss how those apply to generative AI–based applications, and provide architecture patterns. By implementing network perimeter protection for your generative AI–based applications, you gain controls to help protect from unauthorized use, cost overruns, distributed denial of service (DDoS), and other threat actors or curious users.

Perimeter protection for LLMs

Network perimeter protection for web applications helps answer important questions, for example:

Who can access the app?
What kind of data is sent to the app?
How much data is the app allowed to use?

For the most part, the same network protection methods used for other web apps also work for generative AI apps. The main focus of these methods is controlling network traffic that is trying to access the app, not the specific requests and responses the app creates. We’ll focus on three key areas of network perimeter protection:

Authentication and authorization for the app’s frontend
Using a web application firewall
Protection against DDoS attacks

The security concerns of using LLMs in these apps, including issues with prompt injections, sensitive information leaks, or excessive agency, are beyond the scope of this post.

Frontend authentication and authorization

When designing network perimeter protection, you first need to decide whether you will allow certain users to access the application, based on whether they are authenticated (AuthN) and whether they are authorized (AuthZ) to ask certain questions of the generative AI–based applications. Many generative AI–based applications sit behind an authentication layer so that a user must sign in to their identity provider before accessing the application. For public applications that are not behind any authentication (a chatbot, for example), additional considerations are required with regard to AWS WAF and DDoS protection, which we discuss in the next two sections.

Let’s look at an example. Amazon API Gateway is an option for customers for the application frontend, providing metering of users or APIs with authentication and authorization. It’s a fully managed service that makes it convenient for developers to publish, maintain, monitor, and secure APIs at scale. With API Gateway, you create AWS Lambda authorizers to control access to APIs within your application. Figure 1 shows how access works for this example.

Figure 1: An API Gateway, Lambda authorizer, and basic filter in the signal path between client and LLM

The workflow in Figure 1 is as follows:

A client makes a request to your API that is fronted by the API Gateway.
When the API Gateway receives the request, it sends the request to a Lambda authorizer that authenticates the request through OAuth, SAML, or another mechanism. The Lambda authorizer returns an AWS Identity and Access Management (IAM) policy to the API Gateway, which will permit or deny the request.
If permitted, the API Gateway sends the API request to the backend application. In Figure 1, this is a Lambda function that provides additional capabilities in the area of LLM security, standing in for more complex filtering. In addition to the Lambda authorizer, you can configure throttling on the API Gateway on a per-client basis or on the application methods clients are accessing before traffic makes it to the backend application. Throttling can provide some mitigation against not only DDoS attacks but also model cloning and inversion attacks.
Finally, the application sends requests to your LLM that is deployed on AWS. In this example, the LLM is deployed on Amazon Bedrock.
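
Step 2 of this workflow can be sketched as a token-based Lambda authorizer. The policy shape returned below is the standard authorizer response; the is_valid_token check and the principal identifier are hypothetical placeholders standing in for real OAuth or SAML validation.

```python
def is_valid_token(token: str) -> bool:
    """Hypothetical check; a real authorizer would validate an OAuth or SAML assertion."""
    return token == "allow-me"  # placeholder logic for illustration only


def lambda_handler(event, context):
    """Token-based Lambda authorizer: return an IAM policy for API Gateway."""
    token = event.get("authorizationToken", "")
    effect = "Allow" if is_valid_token(token) else "Deny"
    return {
        "principalId": "user",  # placeholder principal identifier
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": effect,
                    "Resource": event["methodArn"],
                }
            ],
        },
    }
```

API Gateway evaluates the returned policy against the requested method and forwards the request to the backend only when the effect is Allow.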

The combination of Lambda authorizers and throttling helps support a number of perimeter protection mechanisms. First, only authorized users gain access to the application, helping to prevent bots and the public from accessing the application. Second, for authorized users, you limit the rate at which they can invoke the LLM to prevent excessive costs related to requests and responses to the LLM. Third, after users have been authenticated and authorized by the application, the application can pass identity information to the backend data access layer in order to restrict the data available to the LLM, aligning with what the user is authorized to access.
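
To illustrate how rate limiting caps LLM invocations per client, here is a minimal token bucket, the algorithm API Gateway documents for its throttling. This is a conceptual sketch only; the class, the rate, and the burst values are assumptions for the example and do not reflect how the service is implemented internally.

```python
import time


class TokenBucket:
    """Allow bursts up to `burst` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.tokens = float(burst)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# One bucket per client keeps a single heavy user from exhausting the budget.
buckets = {}


def allow_request(client_id: str) -> bool:
    bucket = buckets.setdefault(client_id, TokenBucket(rate=2, burst=5))
    return bucket.allow()
```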

Besides API Gateway, AWS provides other options you can use to provide frontend authentication and authorization. AWS Application Load Balancer (ALB) supports OpenID Connect (OIDC) capabilities to require authentication to your OIDC provider prior to access. For internal applications, AWS Verified Access combines both identity and device trust signals to permit or deny access to your generative AI application.

AWS WAF

Once the authentication or authorization decision is made, the next consideration for network perimeter protection is on the application side. New security risks are being identified for generative AI–based applications, as described in the OWASP Top 10 for Large Language Model Applications. These risks include insecure output handling, insecure plugin design, and other mechanisms that cause the application to provide responses that are outside the desired norm. For example, a threat actor could craft a direct prompt injection to the LLM, which causes the LLM to behave improperly. Some of these risks (such as insecure plugin design) can be addressed by passing identity information to the plugins and data sources. However, many of those protections fall outside the network perimeter protection and into the realm of security within the application. For network perimeter protection, the focus is on validating the users who have access to the application and supporting rules that allow, block, or monitor web requests based on network rules and patterns at the application level prior to application access.

In addition, bot traffic is an important consideration for web-based applications. According to Security Today, 47% of all internet traffic originates from bots. Bots that send requests to public applications drive up the cost of using generative AI–based applications by causing higher request loads.

To protect against bot traffic before the user gains access to the application, you can implement AWS WAF as part of the perimeter protection. Using AWS WAF, you can deploy a firewall to monitor and block the HTTP(S) requests that are forwarded to your protected web application resources. These resources exist behind Amazon API Gateway, ALB, AWS Verified Access, and other resources. From a web application point of view, AWS WAF is used to prevent or limit access to your application before invocation of your LLM takes place. This is an important area to consider because, in addition to protecting the prompts and completions going to and from the LLM itself, you want to make sure only legitimate traffic can access your application. AWS Managed Rules or AWS Marketplace managed rule groups provide you with predefined rules as part of a rule group.

Let’s expand the previous example. As your application shown in Figure 1 begins to scale, you decide to move it behind Amazon CloudFront. CloudFront is a web service that gives you a distributed ingress into AWS by using a global network of edge locations. Besides providing distributed ingress, CloudFront gives you the option to deploy AWS WAF in a distributed fashion to help protect against SQL injection, bot traffic, and other threats through your AWS WAF rules. Let’s walk through the new architecture in Figure 2.

Figure 2: Adding AWS WAF and CloudFront to the client-to-model signal path

The workflow shown in Figure 2 is as follows:

A client makes a request to your API. DNS directs the client to a CloudFront location, where AWS WAF is deployed.
CloudFront sends the request through an AWS WAF rule to determine whether to block, monitor, or allow the traffic. If AWS WAF does not block the traffic, AWS WAF sends it to the CloudFront routing rules.

Note: It is recommended that you restrict access to the API Gateway so users cannot bypass the CloudFront distribution to access the API Gateway. An example of how to accomplish this goal can be found in the Restricting access on HTTP API Gateway Endpoint with Lambda Authorizer blog post.
CloudFront sends the traffic to the API Gateway, where it runs through the same traffic path as discussed in Figure 1.

To dive into more detail, let’s focus on bot traffic. With AWS WAF Bot Control, you can monitor, block, or rate limit bots such as scrapers, scanners, crawlers, status monitors, and search engines. Bot Control provides multiple options in terms of configured rules and inspection levels. For example, if you use the targeted inspection level of the rule group, you can challenge bots that don’t self-identify, making it harder and more expensive for malicious bots to operate against your generative AI–based application. You can use the Bot Control managed rule group alone or in combination with other AWS Managed Rules rule groups and your own custom AWS WAF rules. Bot Control also provides granular visibility on the number of bots that are targeting your application, as shown in Figure 3.

Figure 3: Bot control dashboard for bot requests and non-bot requests

How does this functionality help you? For your generative AI–based application, you gain visibility into how bots and other traffic are targeting your application. AWS WAF provides options to monitor and customize the web request handling of bot traffic, including allowing specific bots or blocking bot traffic to your application. In addition to bot control, AWS WAF provides a number of different managed rule groups, including baseline rule groups, use-case specific rule groups, IP reputation rules groups, and others. For more information, take a look at the documentation on both AWS Managed Rules rule groups and AWS Marketplace managed rule groups.
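As an illustration of the Bot Control configuration discussed above, the following sketch builds a rule entry for a web ACL that references the Bot Control managed rule group at the targeted inspection level. The dictionary follows the AWS WAF (WAFV2) rule structure as we understand it; the rule name and metric name are hypothetical, and you should confirm field names against the API reference before passing the result to an SDK call.

```python
def bot_control_rule(priority, inspection_level="TARGETED"):
    """Build a web ACL rule entry that references the Bot Control rule group."""
    if inspection_level not in ("COMMON", "TARGETED"):
        raise ValueError("inspection_level must be COMMON or TARGETED")
    return {
        "Name": "bot-control",  # hypothetical rule name
        "Priority": priority,
        "Statement": {
            "ManagedRuleGroupStatement": {
                "VendorName": "AWS",
                "Name": "AWSManagedRulesBotControlRuleSet",
                "ManagedRuleGroupConfigs": [
                    {
                        "AWSManagedRulesBotControlRuleSet": {
                            "InspectionLevel": inspection_level
                        }
                    }
                ],
            }
        },
        # Rules that reference a rule group use OverrideAction instead of
        # Action; "None" keeps the actions defined inside the rule group.
        "OverrideAction": {"None": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": "bot-control",  # hypothetical metric name
        },
    }
```

Building the rule as plain data keeps it easy to review and to combine with other managed rule groups or custom rules in the same web ACL.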

DDoS protection

The last topic we’ll cover in this post is DDoS with LLMs. Similar to threats against other Layer 7 applications, threat actors can send requests that consume an exceptionally high amount of resources, which results in a decline in the service’s responsiveness or an increase in the cost to run the LLMs that are handling the high number of requests. Although throttling can help support a per-user or per-method rate limit, DDoS attacks use more advanced threat vectors that are difficult to protect against with throttling.

AWS Shield helps to provide protection against DDoS for your internet-facing applications, both at Layers 3 and 4 with Shield Standard and at Layer 7 with Shield Advanced. For example, Shield Advanced responds automatically to mitigate application threats by counting or blocking web requests that are part of the exploit by using web access control lists (ACLs) that are part of your already deployed AWS WAF. Depending on your requirements, Shield can provide multiple layers of protection against DDoS attacks.

Figure 4 shows how your deployment might look after Shield is added to the architecture.

Figure 4: Adding Shield Advanced to the client-to-model signal path

The workflow in Figure 4 is as follows:

A client makes a request to your API. DNS directs the client to a CloudFront location, where AWS WAF and Shield are deployed.
CloudFront sends the request through an AWS WAF rule to determine whether to block, monitor, or allow the traffic. AWS Shield can mitigate a wide range of known DDoS attack vectors and zero-day attack vectors. Depending on the configuration, Shield Advanced and AWS WAF work together to rate-limit traffic coming from individual IP addresses. If AWS WAF or Shield Advanced don’t block the traffic, the services will send it to the CloudFront routing rules.
CloudFront sends the traffic to the API Gateway, where it will run through the same traffic path as discussed in Figure 1.
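The per-IP rate limiting mentioned in the workflow above can be expressed as an AWS WAF rate-based rule. The following is a minimal sketch of such a rule entry, assuming the WAFV2 rule structure; the rule name, metric name, and the limit of 1,000 requests are hypothetical values that you would tune for your own traffic.

```python
def ip_rate_limit_rule(priority, limit=1000):
    """Build a web ACL rule entry that blocks IPs exceeding a request limit."""
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return {
        "Name": "per-ip-rate-limit",  # hypothetical rule name
        "Priority": priority,
        "Statement": {
            "RateBasedStatement": {
                # Maximum requests allowed per source IP in the evaluation window.
                "Limit": limit,
                "AggregateKeyType": "IP",
            }
        },
        # Unlike a managed rule group reference, a rate-based rule sets Action.
        "Action": {"Block": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": "per-ip-rate-limit",  # hypothetical metric name
        },
    }
```

A rule like this complements API Gateway throttling: it acts at the edge, before the request reaches the API or invokes the LLM.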

When you implement AWS Shield and Shield Advanced, you gain protection against security events and visibility into both global and account-level events. For example, at the account level, you get information on the total number of events seen on your account, the largest bit rate and packet rate for each resource, and the largest request rate for CloudFront. With Shield Advanced, you also get access to notifications of events that are detected by Shield Advanced and additional information about detected events and mitigations. These metrics and data, along with AWS WAF, provide you with visibility into the traffic that is trying to access your generative AI–based applications. This provides mitigation capabilities before the traffic accesses your application and before invocation of the LLM.

Considerations

When deploying network perimeter protection with generative AI applications, consider the following:

AWS provides multiple options, on both the frontend authentication and authorization side and the AWS WAF side, for how to configure perimeter protections. Depending on your application architecture and traffic patterns, multiple resources can provide the perimeter protection with AWS WAF and integrate with identity providers for authentication and authorization decisions.
You can also deploy more advanced LLM-specific prompt and completion filters by using Lambda functions and other AWS services as part of your deployment architecture. Perimeter protection capabilities are focused on preventing undesired traffic from reaching the end application.
Most of the network perimeter protections used for LLMs are similar to network perimeter protection mechanisms for other web applications. The difference is that additional threat vectors come into play compared to regular web applications. For more information on the threat vectors, see OWASP Top 10 for Large Language Model Applications and Mitre ATLAS.

Conclusion

In this blog post, we discussed how traditional network perimeter protection strategies can provide defense in depth for generative AI–based applications. We discussed the similarities and differences between LLM workloads and other web applications. We walked through why authentication and authorization protection is important, showing how you can use Amazon API Gateway to throttle through usage plans and to provide authentication through Lambda authorizers. Then, we discussed how you can use AWS WAF to help protect applications from bots. Lastly, we talked about how AWS Shield can provide advanced protection against different types of DDoS attacks at scale. For additional information on network perimeter protection and generative AI security, take a look at other blog posts in the AWS Security Blog Channel.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

How Amazon GTTS runs large-scale ETL jobs on AWS using Amazon MWAA

2024-08-06 Louis Hourcade

Post Syndicated from Louis Hourcade original https://aws.amazon.com/blogs/big-data/how-amazon-gtts-runs-large-scale-etl-jobs-on-aws-using-amazon-mwaa/

The Amazon Global Transportation Technology Services (GTTS) team owns a set of products called INSITE (Insights Into Transportation Everywhere). These products are user-facing applications that solve specific business problems across different transportation domains: network topology management, capacity management, and network monitoring. As of this writing, GTTS serves around 10,000 customers globally on a monthly basis, managing the outbound transportation network.

INSITE applications are in general data intensive. They ingest and transform large volumes of data in different formats and processing patterns (such as batch and near real time) from various sources internal and external to Amazon. Datasets are often shared between applications both within domains and across domains, and are consumed in complex data pipelines that run under tight SLAs. To enable and meet these requirements, GTTS built its own data platform.

A critical component of the data platform is the data pipeline orchestrator. GTTS built its own orchestrator named Langley in 2018, and used it to schedule and monitor extract, transform, and load (ETL) jobs on a variety of compute platforms, such as Amazon EMR, Amazon Redshift, and Amazon Relational Database Service (Amazon RDS).

As the Langley user base grew, GTTS engineers faced several challenges on key dimensions, such as maintainability, scalability, multi-tenancy, observability, and interoperability.

Amazon GTTS partnered with AWS Professional Services to modernize their orchestration platform, relying as much as possible on managed services with auto scaling capabilities. After analyzing candidate solutions, the team decided to build a target solution relying on Amazon Managed Workflows for Apache Airflow (Amazon MWAA). This post elaborates on the drivers of the migration and its achieved benefits.

Legacy platform

Amazon GTTS works with diverse and distributed data stores, storing petabytes of data. Data engineers need a tool to define ETL jobs which run on various compute environments, as illustrated in the following diagram.

Amazon GTTS orchestration platform - high-level diagram

GTTS built Langley as their custom orchestrator in 2018, and have been operating it ever since. At a high level, the core of Langley’s architecture is based on a set of Amazon Simple Queue Service (Amazon SQS) queues and AWS Lambda functions, and a dedicated RDS database to store ETL job data and metadata. It also uses AWS Data Pipeline to run SQL-based workloads, Amazon Simple Storage Service (Amazon S3) to store configuration files, and Amazon CloudWatch for alarming on failures. Every day, Langley handles the lifecycle of more than 17,000 ETL jobs in Europe and 5,000 ETL jobs in North America.

The following diagram illustrates the Langley architecture.

Business challenges

Langley started as a simple solution to a team-internal problem, but its growth over the years surfaced key issues:

The maintenance of this custom solution requires considerable time from engineers, which increased over the years with the release of new features, increasing the overall complexity.
The Langley user base grew continuously, and Langley eventually became a key orchestration platform for multiple teams and products across Amazon. However, it wasn’t created with multi-tenancy in mind and therefore it didn’t provide the robustness and the appropriate level of isolation to guard each tenant from impacting others on the shared platform.
In 2023, AWS announced the upcoming deprecation of Data Pipeline, one of the core services used by Langley.

GTTS partnered with AWS to design and implement a solution to overcome those challenges. AWS used the following evaluation matrix to build a durable solution:

Maintainability	The level of effort required to maintain the orchestrating system in a functional state, encompassing updates, patches, bug fixes, and routine checks for optimal performance.
Costs	The overall expenditure associated with the orchestrator, including infrastructure costs, licensing fees, personnel expenses, and other relevant costs. This criterion particularly assesses the system’s ability to effectively control and reduce costs.
Scheduling	The capabilities related to running and scheduling jobs, including the ability to resume an ETL job from a failed step.
User experience	The overall satisfaction and usability of a system from the end-users’ perspective, considering factors such as responsiveness, accessibility, interoperability, and ease of use.
Security	Mechanisms in place to safeguard data and applications from unauthorized access at all times.
Monitoring and alerting	The continuous observation and analysis of system components and performance metrics to detect and address issues, optimize resource usage, and provide overall health and reliability.
Scalability	The orchestrator’s capacity to efficiently adapt its resources to handle increased workload or demand, providing sustained performance.

Among the explored solutions, Amazon MWAA was finally determined as the best overall performer across this matrix.

The next section is a deep dive into the rationale that led GTTS and AWS Professional Services to choose Amazon MWAA as the best performer.

Benefits of migrating to Amazon MWAA

Amazon GTTS and AWS Professional Services worked together to release a Minimum Viable Product (MVP) of the solution described earlier, which showcases the benefits on the agreed decision criteria.

Maintainability

With their legacy system, Amazon GTTS had to manage the orchestrator database, web servers, activity queue, dispatch functions, and worker nodes.

Amazon MWAA eliminates the need for underlying infrastructure management. It takes care of provisioning and maintenance of the Apache Airflow web server, scheduler, worker nodes, and relational database, allowing GTTS teams to focus on building their ETL jobs.

Amazon MWAA offers one-click updates of the infrastructure for minor versions, like moving from Airflow version x.4.z to x.5.z. During the upgrade process, Amazon MWAA captures a snapshot of your environment metadata; upgrades the workers, schedulers, and web server to the new Airflow version; and finally restores the metadata database using the snapshot, backing it with an automated rollback mechanism.

Costs

Amazon MWAA contributes to a more cost-effective solution by automatically scaling workers depending on the workload. This dynamic scaling in and out avoids over-provisioning and allows the organization to pay for the compute they actually use, without the risk of downtime during activity spikes. Because this is an AWS-managed solution, it also reduced GTTS’s Total Cost of Ownership (TCO) by freeing up time from engineers that were managing the legacy system.

Scheduling

Amazon MWAA supports all the trigger mechanisms that the Amazon orchestrator needed:

Manual trigger – Users can invoke a Directed Acyclic Graph (DAG) using the Airflow API or, even more simply, through the user interface (UI).
Scheduler – A scheduler can be defined as code, together with the DAG definition, to make sure it will run at specific rates (from hourly to yearly) or on specific cron schedules.
Event-driven trigger – Airflow provides native operators that enable invoking a downstream DAG from another DAG or from a dataset update (push approach). It also includes sensors that listen for the completion of a task external to the DAG (pull approach).
Partial runs on DAG failures – Another key feature for GTTS was the ability to recover from partial DAG failures without having to rerun the whole DAG. Airflow provides task-level controls that make this operation straightforward to implement.
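To illustrate the idea behind partial reruns, the following plain-Python sketch computes which tasks need to run again after a failure: the failed task plus everything downstream of it. This is a conceptual illustration and not Airflow code; the task names and the `tasks_to_rerun` helper are hypothetical. In Airflow itself, you get a comparable result by clearing the failed task together with its downstream tasks.

```python
from collections import deque


def tasks_to_rerun(downstream, failed_task):
    """Return the failed task plus every task that depends on it."""
    to_rerun = {failed_task}
    queue = deque([failed_task])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, ()):
            if child not in to_rerun:
                to_rerun.add(child)
                queue.append(child)
    return to_rerun


# Hypothetical ETL DAG, expressed as task -> list of downstream tasks.
etl_dag = {
    "extract_orders": ["transform"],
    "extract_customers": ["transform"],
    "transform": ["load", "quality_check"],
    "load": ["publish"],
    "quality_check": ["publish"],
}

print(sorted(tasks_to_rerun(etl_dag, "transform")))
# ['load', 'publish', 'quality_check', 'transform']
```

Both extract tasks are left untouched, which is the point: work that already succeeded upstream of the failure does not have to be repeated.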

User experience

In this section, we discuss three aspects of the user experience: the web UI, the interoperability, and the programming interface.

Web UI

Amazon MWAA comes with a managed web server that hosts the Airflow UI. As a result, and without any maintenance needed, you can use it to quickly run DAGs, check run history, visualize dependencies between DAGs, troubleshoot with direct access to task logs, manage variables and database connections, and define granular permissions. The following screenshot shows an example of the UI.

Amazon MWAA User Interface - console screenshot

Interoperability

One of the most important features evaluated was the ability for the new orchestrator to effortlessly integrate with GTTS’s multiple data storage services, compute components, and monitoring services.

Amazon MWAA comes with a wide variety of providers preinstalled, such as apache-airflow-providers-amazon, apache-airflow-providers-postgres, and apache-airflow-providers-common-sql. This allowed GTTS to connect with those services using multiple connection methodologies, including AWS IAM Identity Center or AWS Secrets Manager password-based authentications, without having to write a single custom Airflow operator.

Amazon MWAA also makes it straightforward to upgrade provider versions and install new ones. By providing a requirements.txt file, GTTS was able to change the major version of apache-airflow-providers-amazon and install the apache-airflow-providers-mysql provider.

Programming interface

Airflow is an orchestrator with a low barrier to entry, especially for those familiar with the Python programming language. Its workflow management is defined in Python scripts, with a well-documented set of native operators and external providers, making it straightforward for Python developers to get started with Airflow and create complex data pipelines.

The following are two key Airflow features:

TaskFlow API – The TaskFlow API uses Python decorators to remove much of the boilerplate code required by traditional operators, which simplifies DAG editing and results in cleaner, more concise DAG files.
Dynamic DAG generation – The dynamic DAG generation capability allowed us to generate DAGs from the original legacy orchestrator’s configuration files. This enabled the platform team to build a centralized framework consumed by multiple teams to keep the code DRY (Don’t Repeat Yourself), providing a seamless migration journey from the legacy orchestrator.

The following screenshot shows an example of these features.

Airflow dynamic DAG definition - code sample

Security

The new Amazon MWAA-based architecture improves GTTS’s security posture by introducing granular access control. Amazon MWAA integrates with AWS services such as AWS Key Management Service (AWS KMS), Secrets Manager, and IAM Identity Center to keep data safely encrypted at all times, both at rest and in transit using TLS-based communications. Airflow also includes a role-based access control (RBAC) model to determine what users can do on the platform and enforce the principle of least privilege. Amazon MWAA also natively integrates with AWS CloudTrail for auditing purposes.

The Airflow RBAC model enables administrators to define roles with specific privileges to access Airflow system settings and DAGs themselves. This granular access control reduces the risk of data breaches and malicious activities by limiting access to critical DAGs and sensitive Airflow environment variables. Airflow includes five default roles with different sets of permissions (as shown in the following screenshot), but it is possible to create new roles depending on your security requirements.

Airflow roles - console screenshot

GTTS used the Airflow RBAC model to restrict permissions of certain teams and consumers of the application. They also used priority weights and Airflow pools to prioritize tasks and control run concurrency. However, if you want to run a multi-tenant orchestration platform, it’s recommended to use a separate environment for each team. You can assume that everything accessible by the Amazon MWAA role is also accessible to users who can write DAGs to the environment.

To ease authentication in Amazon MWAA, GTTS federated their identity provider (IdP) through Amazon Cognito and SAML. With this integration, users log in to the Amazon MWAA UI using the same identity as in other internal systems, which removes the need for new credentials. The user’s group membership is retrieved from the IdP through Amazon Cognito, and a Lambda function redirects the user to Amazon MWAA with the appropriate Airflow role. This process is illustrated in the following architecture, and is abstracted from the user and attached to a public Application Load Balancer that redirects at the end of the process to an Amazon MWAA private cluster, making the authentication workflow seamless and secure. Refer to Accessing a private Amazon MWAA environment using federated identities to implement it using your own IdP.

Amazon MWAA federation - architecture diagram
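The role-selection step performed by the Lambda function can be sketched as follows. This is a simplified illustration: the group names and the mapping are hypothetical, and the code covers only the decision of which Airflow role a user receives, not the redirect to Amazon MWAA. The role names are Airflow's five default roles.

```python
# Airflow's five default roles, ordered from most to least privileged.
ROLE_PRECEDENCE = ["Admin", "Op", "User", "Viewer", "Public"]

# Hypothetical mapping from IdP group names to Airflow roles.
GROUP_TO_ROLE = {
    "platform-admins": "Admin",
    "data-engineers": "User",
    "analysts": "Viewer",
}


def resolve_airflow_role(groups):
    """Return the most privileged role granted by the groups, or None."""
    granted = {GROUP_TO_ROLE[group] for group in groups if group in GROUP_TO_ROLE}
    for role in ROLE_PRECEDENCE:
        if role in granted:
            return role
    # No mapped group: the caller should deny access instead of defaulting.
    return None


print(resolve_airflow_role(["analysts", "data-engineers"]))  # User
```

Returning `None` for unmapped users, instead of falling back to a default role, keeps the mapping aligned with the principle of least privilege.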

Monitoring and alerting

Amazon MWAA integrates with CloudWatch, which manages all infrastructure logs for you. When creating an Amazon MWAA environment, you can configure what level of logs should be saved. GTTS enabled CloudWatch logging for all of the five types of components: Airflow task logs, Airflow web server logs, Airflow scheduler logs, Airflow worker logs, and Airflow DAG processing logs.

Amazon MWAA logging configuration - console screenshot

These logs are all accessible in CloudWatch for continuous monitoring, but Amazon MWAA users can also access task logs directly from the Airflow UI by looking at the DAG run history. The following screenshot shows an example of task-level logs in Airflow 2.5.1.

Amazon MWAA task-level logs - console screenshot

You can also build CloudWatch monitoring dashboards to keep an eye on the state of your environment and alert administrators when required. Amazon MWAA natively provides Airflow environment metrics and Amazon MWAA infrastructure-related metrics.

Scalability

Each Amazon MWAA environment includes the schedulers, web server, and worker nodes. Scheduler nodes are responsible for the overall orchestration and for parsing DAG files. The tasks themselves run on worker nodes, which Amazon MWAA automatically scales out and in according to system load. When creating a new Amazon MWAA environment, you need to specify the type of worker nodes, the minimum and maximum number of worker nodes, and the scheduler count, as shown in the following screenshot.

Amazon MWAA environment classes - console screenshot

There are notably two ways GTTS controlled how Amazon MWAA scales to handle the load:

Minimum and maximum worker count – Amazon MWAA automatically adds or deletes workers within the boundaries you set, depending on the number of tasks that are waiting to be processed. As indicated in the AWS documentation, it is possible to request a quota increase to run up to 50 workers in a single environment.
Size of the node – Larger worker nodes can run more concurrent tasks. For example, mw1.small instances run 5 concurrent tasks by default, whereas mw1.large instances run 20 concurrent tasks by default. The following figure shows the specification for each instance type.

Amazon MWAA environment sizes - console screenshot

With Amazon MWAA, GTTS can therefore run up to 4,000 concurrent tasks in a single Amazon MWAA environment (50 worker nodes x 80 tasks per node with mw1.2xlarge). This figure is an order-of-magnitude estimate of the load that fits into the workers’ vCPUs and RAM, and it is possible to edit the default configuration to allow even more tasks per worker. For more information regarding Amazon MWAA automatic scaling, see Configuring Amazon MWAA automatic scaling.
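The capacity arithmetic above can be written as a small helper. The per-worker defaults are the values quoted in this post; the function name is ours, and environment classes not mentioned here are deliberately left out of the table.

```python
# Default concurrent tasks per worker, as quoted in this post.
DEFAULT_TASKS_PER_WORKER = {
    "mw1.small": 5,
    "mw1.large": 20,
    "mw1.2xlarge": 80,
}


def max_concurrent_tasks(environment_class, max_workers):
    """Upper bound on concurrent tasks with default per-worker concurrency."""
    return DEFAULT_TASKS_PER_WORKER[environment_class] * max_workers


print(max_concurrent_tasks("mw1.2xlarge", 50))  # 4000
print(max_concurrent_tasks("mw1.small", 50))    # 250
```

The same 50-worker ceiling yields a 16-fold difference in capacity between the smallest and largest classes listed, which is why node size and worker count are best chosen together.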

The Amazon MWAA based orchestration platform

After selecting Amazon MWAA as the core service for their orchestrating system, Amazon GTTS and AWS worked together to develop an end-to-end data platform with automation capabilities, access management, monitoring, and integration with downstream systems. The following diagram illustrates the solution architecture.

MWAA-based platform - architecture diagram

The following are notable components of the architecture:

DAG update – GTTS developers manage the creation, update, and deletion of Amazon MWAA DAGs through a dedicated code repository. When a developer edits DAG definitions and commits changes to the code repository, a CI/CD pipeline automatically packages the DAG definition and stores it in Amazon S3, which automatically updates DAGs in Amazon MWAA.
Infrastructure as code – The entire stack is defined as IaC with the AWS CDK, which eases the process of updating components, and makes it repeatable if GTTS wants to extend the solution and redeploy the stack in multiple AWS Regions.
Authentication, authorization, and permissions – Permissions are centrally managed with AWS Identity and Access Management (IAM) together with Airflow roles. GTTS integrated their identity provider with Amazon Cognito and Amazon MWAA, so Amazon employees can connect to the Amazon MWAA UI with the same authentication tool they are used to, and see only the DAGs they are allowed to access.
UI and DAG runs – Amazon MWAA includes an AWS-managed web server that exposes the Airflow UI. Amazon employees can connect to this UI to list DAGs, run DAGs, and track their status. In addition, GTTS used the native Amazon MWAA scheduler to automatically invoke DAGs at a specific time.
Airflow workers – Users can use Airflow native providers to run custom shell or Python code directly on the worker nodes. For compute-intensive jobs, the Amazon MWAA worker can delegate the compute to a more suitable AWS service, such as Apache Spark running on Amazon EMR on Amazon EKS, which provides compute resources only for the duration of the job, helping to optimize costs.
Data stores and external compute services – Amazon MWAA also comes with the AWS provider preloaded, allowing seamless connectivity with more than 23 AWS compute and data services. GTTS can extend the connectivity to other AWS or external services by using Boto3 with the PythonOperator or by creating dedicated custom operators.
Logging and alerting – Amazon MWAA is seamlessly integrated with CloudWatch and CloudTrail to publish DAG logs, audit logs, and metrics. This enables GTTS to track completion, troubleshoot, and create an automated alerting and notification system so DAG owners can take remediation actions as fast as possible.

Conclusion

Amazon GTTS partnered with AWS Professional Services to overcome the challenges faced by their legacy custom orchestrator against various dimensions such as maintainability, cost efficiency, security, scalability, and observability.

The new Amazon MWAA-based architecture offers significant improvements in the context of the AWS Well-Architected Framework compared to their former system. In terms of operational excellence, the new orchestration platform is built with evolvability in mind and enables the GTTS team to use the most suitable ETL service to run their jobs. Regarding performance efficiency, GTTS observed up to 70% improvement in end-to-end runtime on their jobs running in Amazon MWAA. In terms of security, the new solution implements best practices such as the deployment in private subnets, authentication of users through Amazon internal federation systems, and data encryption at rest and in transit. Reliability is achieved with Multi-AZ failover and built-in auto scaling to meet the workload demand at all times. Finally, cost is reduced because Amazon MWAA is an AWS-managed service, which decreases the human effort from GTTS to maintain the orchestration platform.

Amazon GTTS is now bringing the MVP into production, where it is planned to handle petabytes of data and host more than 2,000 jobs migrated from the legacy system. Additionally, the migration to Amazon MWAA has empowered GTTS to enhance its operational scalability, paving the way for the integration of new jobs and further expansion with greater efficiency and confidence.

To learn more, refer to the following resources:

About the Authors

Béntor Bautista is a Senior Data Engineer at Amazon GTTS
Louis Hourcade is a Solutions Architect at AWS
Raphael Ducay is a Senior DataOps Architect at AWS
Konstantin Zarudaev is a DevOps Consultant at AWS
Dorra Elboukari is a DevOps Architect at AWS
Marcin Zapal is an Engagement Manager at AWS
Grigorios Pikoulas is a Strategic Program Lead at AWS
Antonio Cennamo is a Senior Customer Practice Manager at AWS

SaaS authentication: Identity management with Amazon Cognito user pools

2024-08-05 Shubhankar Sumar

Post Syndicated from Shubhankar Sumar original https://aws.amazon.com/blogs/security/saas-authentication-identity-management-with-amazon-cognito-user-pools/

Amazon Cognito is a customer identity and access management (CIAM) service that can scale to millions of users. Although the Cognito documentation details which multi-tenancy models are available, determining when to use each model can sometimes be challenging. In this blog post, we’ll provide guidance on when to use each model and review their pros and cons to help inform your decision.

Cognito overview

Amazon Cognito handles user identity management and access control for web and mobile apps. With Cognito user pools, you can add sign-up, sign-in, and access control to your apps. A Cognito user pool is a user directory within a specific AWS Region where users can authenticate and register for applications. In addition, a Cognito user pool is an OpenID Connect (OIDC) identity provider (IdP). App users can either sign in directly through a user pool or federate through a third-party IdP. Cognito issues a user pool token after successful authentication, which can be used to securely access backend APIs and resources.

Cognito issues three types of tokens:

ID token – Contains user identity claims like name, email, and phone number. This token type authenticates users and enables authorization decisions in apps and API gateways.
Access token – Includes user claims, groups, and authorized scopes. This token type grants access to API operations based on the authenticated user and application permissions. It also enables fine-grained, user-based access control within the application or service.
Refresh token – Retrieves new ID and access tokens when these are expired. Access and ID tokens are short-lived, while the refresh token is long-lived. By default, refresh tokens expire 30 days after the user signs in, but this can be configured to a value between 60 minutes and 10 years.

You can find more information on using tokens and their contents in the Cognito documentation.

Multi-tenancy approaches

Software as a service (SaaS) architectures often use silo, pool, or bridge deployment models, which also apply to CIAM services like Cognito. The silo model isolates tenants in dedicated resources. The pool model shares resources between tenants. The bridge model connects siloed and pooled components. This post compares the Cognito silo and pool models for SaaS identity management.

It’s also possible to combine the silo and pool models by having multiple tiers of resources. For example, you could have a siloed tier for sensitive tenant data along with a pooled tier for shared functionality. This is similar to the silo model but with added routing complexity to connect the tiers. When you have multiple pools or silos, this is a similar approach to the pure silo model but with more components to manage.

More detail on these models is included in the AWS SaaS Lens.

We’ve detailed five possible patterns in the following sections and explored the scenarios where each of the patterns can be used, along with the advantages and disadvantages for each. The rest of the post delves deeper into the details of these different patterns, enabling you to make an informed decision that best aligns with your unique requirements and constraints.

Pattern 1: Representing SaaS identity with custom attributes

To implement multi-tenancy in a SaaS application, tenant context needs to be associated with user identity. This allows implementation of the multi-tenant policies and strategies that comprise our SaaS application. Cognito has user pool attributes, which are pieces of information to represent identity. There are standard attributes, such as name and email, that describe the user identity. Cognito also supports custom attributes that can be used to hold information about the user’s relationship to a tenant, such as tenantId.

By using custom attributes for multi-tenancy in Amazon Cognito, the tenant context for each user can be stored in their user profile.

To enable multi-tenancy, you can add a custom attribute like tenantId to the user profile. When a new user signs up, this tenantId attribute can be set to a value indicating which tenant the user belongs to. For example, users with tenantId “1234” belong to Tenant A, while users with tenantId “5678” belong to Tenant B.

The tenantId attribute value gets returned in the ID token after a successful user authentication. (This value can also be added to the access token through customization by using a pre-token generation Lambda trigger.) The application can then inspect this claim to determine which tenant the user belongs to. The tenantId attribute is typically managed at the SaaS platform level and is read-only to users and the application layer. (Note: SaaS providers need to configure the tenantId attribute to be read-only.)
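As a minimal sketch of how an application might read the tenant claim, the following Python decodes the payload segment of an ID token and extracts the tenant ID. It assumes the custom attribute is named tenantId, which Cognito surfaces in the token as the custom:tenantId claim. The helper names are illustrative, and the decoding step deliberately skips signature verification, so it is only suitable for inspecting a token that has already been verified elsewhere.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated segments")
    payload_b64 = parts[1]
    # JWT segments are base64url-encoded with padding stripped; restore it.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def tenant_id_from_claims(claims: dict) -> str:
    """Return the tenant ID claim (custom attributes carry a 'custom:' prefix)."""
    tenant_id = claims.get("custom:tenantId")
    if not tenant_id:
        raise PermissionError("token carries no tenant context")
    return tenant_id
```

In a real service, the signature, issuer, audience, and expiry checks would run before any claim is trusted.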

In addition to storing a tenant ID, you can use custom attributes to model additional tenant context. For instance, attributes like tenantName, tenantTier, or tenantRegion could be defined and set appropriately for each user to provide relevant informational context for the application. However, make sure not to use custom attributes as a database. They are meant to represent identity, not store application data. Custom attributes should contain only information that is relevant for authorization decisions, should be kept small so that the JSON Web Token (JWT) stays compact, and should be relatively static because their values are stored in the Cognito directory. Updating frequently changing data requires modifying the directory, which can be cumbersome.

The custom attributes themselves need to be defined at the time of creating the Amazon Cognito user pool, and there is a maximum of 50 custom attributes that you can create. Once the pool is created, these custom attribute fields will be present on every user profile in that user pool. However, they won’t have values populated yet. The actual tenant attribute values get populated only when a new user is created in the user pool. This can be done in two ways:

During user sign-up, a post confirmation AWS Lambda trigger can be used to set the appropriate tenant attribute values based on the user’s input.
An admin user can provision a new user through the AdminCreateUser API operation and specify the tenant attribute values at that time.
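The first option could look roughly like the following post confirmation trigger, written in Python with boto3. This is a sketch under assumptions: the tenant is resolved from the user's email domain through a hypothetical in-memory mapping (TENANT_BY_EMAIL_DOMAIN), and the custom attribute is named tenantId. The admin_update_user_attributes call is the boto3 counterpart of the AdminUpdateUserAttributes API operation.

```python
# Hypothetical mapping from email domain to tenant ID; a real system would
# look this up in its tenant-management store.
TENANT_BY_EMAIL_DOMAIN = {"tenant-a.example": "1234", "tenant-b.example": "5678"}


def resolve_tenant_id(user_attributes: dict) -> str:
    """Pick the tenant for a newly confirmed user from their email domain."""
    domain = user_attributes.get("email", "").rpartition("@")[2].lower()
    if domain not in TENANT_BY_EMAIL_DOMAIN:
        raise ValueError(f"no tenant registered for domain {domain!r}")
    return TENANT_BY_EMAIL_DOMAIN[domain]


def handle_post_confirmation(event: dict, cognito_client) -> dict:
    """Write the tenant ID onto the user profile, then hand the event back."""
    tenant_id = resolve_tenant_id(event["request"]["userAttributes"])
    cognito_client.admin_update_user_attributes(
        UserPoolId=event["userPoolId"],
        Username=event["userName"],
        UserAttributes=[{"Name": "custom:tenantId", "Value": tenant_id}],
    )
    return event  # Cognito triggers must return the event object


def lambda_handler(event, context):
    import boto3  # imported lazily so the helpers above can be tested offline

    return handle_post_confirmation(event, boto3.client("cognito-idp"))
```

Passing the client in as a parameter keeps the tenant-resolution logic testable without AWS credentials.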

After user creation, the custom tenant attribute values can still be updated by an administrator through the AdminUpdateUserAttributes API operation or by a user with the UpdateUserAttributes API operation, if needed. But the key point is that the custom attributes themselves must be predefined at user pool creation, while the values get set later during user creation and provisioning flows. Figure 1 shows how custom attributes are associated with an ID token and used subsequently in downstream applications.

Figure 1: Associating tenant context with custom attributes

As shown in Figure 1:

The custom tenant attribute values from the user profile are included in the Cognito ID token that is generated after a successful user authentication. These values can be used for access control for other AWS services, such as Amazon API Gateway.
You can configure Amazon API Gateway with a Lambda authorizer function that validates the ID token signature (the aws-jwt-verify library can be used for this purpose) and inspects the tenant ID claim in each request.
Based on the tenant ID value extracted from the ID token, the Lambda authorizer can determine which backend resources and services each authenticated user is authorized to access.
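The steps above can be sketched as a token-based Lambda authorizer in Python. Signature validation is delegated to a verify_token callable that is an assumption of this sketch (the post mentions the aws-jwt-verify library, which is a JavaScript library, so a Python deployment would need its own verifier). The response shape follows the standard API Gateway Lambda authorizer output: a principal ID, an IAM policy document, and an optional context map.

```python
from typing import Callable


def build_authorizer_response(claims: dict, method_arn: str) -> dict:
    """Turn verified token claims into an API Gateway authorizer response."""
    tenant_id = claims.get("custom:tenantId")
    # Requests without tenant context are denied outright.
    effect = "Allow" if tenant_id else "Deny"
    response = {
        "principalId": claims.get("sub", "unknown"),
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {"Action": "execute-api:Invoke", "Effect": effect, "Resource": method_arn}
            ],
        },
    }
    if tenant_id:
        # Context values are forwarded to the backend integration as strings.
        response["context"] = {"tenantId": str(tenant_id)}
    return response


def authorize(event: dict, verify_token: Callable[[str], dict]) -> dict:
    """Entry point: verify the bearer token, then build the policy."""
    token = event["authorizationToken"].removeprefix("Bearer ").strip()
    claims = verify_token(token)  # must raise if the signature or expiry is invalid
    return build_authorizer_response(claims, event["methodArn"])
```

Backend services can then read tenantId from the authorizer context rather than parsing the token again.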

You can use this method to provide fine-grained access control, as described in this blog post, by using tenant claims as context in addition to the user claims embedded within the token. This pattern of embedding information about the user’s identity, along with details on their associated tenant, in a single token is what AWS refers to as SaaS identity.

The multi-tenancy approaches of using siloed user pools, shared pools, or custom attributes rely on embedding tenant context within the user identity. This is accomplished by having Cognito include claims with tenant information in the JWTs issued after authentication.

The JWT encodes user identity information like the username, email address, and so on. By adding custom claims that contain tenant identifiers or metadata, the tenant context gets tightly coupled to the user identity. The embedded tenant context in the JWT allows applications to implement access control and authorization based on the associated tenant for each user.

This combination of user identity information and tenant context in the issued JWT represents the SaaS identity—a unified identity spanning both user and tenant dimensions. The application uses this SaaS identity for implementing multi-tenant logic and policies.

Pattern 2: Shared user pool (pool model)

A single, shared Amazon Cognito user pool simplifies identity management for multi-tenant SaaS applications. With one consolidated pool, changes and configurations apply across tenants in one place, which can reduce overhead.

For example, you can define password complexity rules and other settings once at the user pool level, and then these settings are shared across tenants. Adding new tenants is streamlined by using the settings in the existing shared pool, without duplicating setup per tenant. This avoids deploying isolated pools when onboarding new tenants.

Additionally, the tokens issued from the shared pool are signed by the same issuer. There is no tenant-specific issuer in the tokens when using a shared pool. For SaaS apps with common identity needs, a shared multi-tenant pool minimizes friction for rapid onboarding despite that loss of per-tenant customization.

Advantages of the pool model:

This model uses a single shared user pool for tenants. This simplifies onboarding by setting user attributes rather than configuring multiple user pools.
Tenants authenticate using the same application client and user pool, which keeps the SaaS client configuration simple.

Disadvantages of the pool model:

Sharing one pool means that settings like password policies and MFA apply uniformly, without customization per tenant.
Some resource quotas are managed at a user pool level (for example, the number of application clients or custom attributes), so you need to consider quotas carefully when adopting this model.

Pattern 3: Group-based multi-tenancy (pool model)

Amazon Cognito user pools give an administrator the capability to add groups and associate users with groups. Doing so introduces specific attributes (cognito:groups and cognito:roles) that are managed and maintained by Cognito and available within the ID tokens. (Access tokens only have the cognito:groups attribute.) These groups can be used to enable multi-tenancy by creating a separate group for each tenant. Users can be assigned to the appropriate tenant group based on the value of a custom tenantId attribute. The application can then implement authorization logic to limit access to resources and data based on the user’s tenant group membership that is encoded in the tokens. This provides isolation and access control across tenants, making use of the native group constructs in Cognito rather than relying entirely on custom attributes.

The group information contained in the tokens can then be used by downstream services to make authorization decisions. Groups are often combined with custom attributes for more granular access control. For example, in the SaaS Factory Serverless SaaS – Reference Solution developed by the AWS SaaS Factory team, roles are specified by using Cognito groups, but tenant identity relies on a custom tenantId attribute. The tenant ID attribute provides isolation between tenants, while the groups define individual user roles and access privileges that apply within a tenant.

Figure 2 shows how groups are associated with the user and then the Lambda authorizer can determine which backend resources and services each authenticated user is authorized to access.

Figure 2: Group-based multi-tenancy

In this model, groups can provide role-based controls, while custom attributes like tenant ID provide the contextual information needed to enforce tenant isolation. The authorization decisions are then made by evaluating a user’s group memberships and attribute values in order to provide fine-grained access tailored to each tenant and user. So groups directly enable role-based checks, while custom attributes provide broader context for conditional access across tenants. Together they can provide the data that is needed to implement granular authorization in a multi-tenant application.
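A minimal Python sketch of that combined check might look like the following. It assumes the token claims have already been verified, that the tenant attribute is named tenantId (so it appears as custom:tenantId), and that the group name passed in, such as TenantAdmin, is a role defined by the application; none of these names come from Cognito itself apart from the cognito:groups claim.

```python
def is_authorized(claims: dict, resource_tenant_id: str, required_group: str) -> bool:
    """Allow access only if the user is in the right tenant AND holds the role."""
    # Tenant isolation: the resource must belong to the caller's tenant.
    same_tenant = claims.get("custom:tenantId") == resource_tenant_id
    # Role check: cognito:groups is a list of group names in the token.
    in_group = required_group in claims.get("cognito:groups", [])
    return same_tenant and in_group
```

Keeping the two conditions separate makes it clear that a role never grants access across tenants.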

Advantages of group-based multi-tenancy:

This model uses a single shared user pool for tenants, so that onboarding requires setting user attributes rather than configuring multiple pools.
Tenants authenticate through the same application client and pool, keeping SaaS client configuration straightforward.

Disadvantages of group-based multi-tenancy:

Sharing one pool means that settings like password policies and MFA apply uniformly without per-tenant customization.
There is a limit of 10,000 groups per user pool.

Pattern 4: Dedicated user pool per tenant (silo model)

Another common approach for multi-tenant identity with Cognito is to provision a separate user pool for each tenant. A Cognito user pool is a user directory, so using distinct pools provides maximum isolation. However, this approach requires that you implement tenant routing logic in the application to determine which user pool a user should authenticate against, based on their tenant.

Tenant routing

With separate user pools per tenant (or application clients, as we’ll discuss later), the application needs logic to route each user to the appropriate pool (or client) for authentication. There are a few options that you can use for this approach:

Use a subdomain in the URL that maps to the tenant—for example, tenant1.myapp.com routes to Tenant 1’s user pool. This requires mapping subdomains to tenant pools.
Rely on unique email domains per tenant—for example, @tenant1.com goes to Tenant 1’s pool. This requires mapping email domains to pools.
Have the user select their tenant from a dropdown list. This requires the tenant choices to be configured.
Prompt the user to enter a tenant ID code that maps to pools. This requires mapping codes to pools.

No matter the approach you choose, the key requirements are the following:

A data point to identify the tenant (such as subdomain, email, selection, or code).
A mapping dataset that takes tenant identifying information from the user and looks up the corresponding user pool to route to for authentication.
Routing logic to redirect to the appropriate user pool.
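These three requirements can be illustrated with a small Python sketch that uses the subdomain option. The TENANT_CONFIG mapping, the pool and client IDs, and the myapp.com base domain are placeholders for illustration; a real deployment would typically keep the mapping in a tenant-management service or database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantAuthConfig:
    user_pool_id: str
    app_client_id: str
    region: str


# Hypothetical mapping dataset: tenant identifier -> Cognito resources.
TENANT_CONFIG = {
    "tenant1": TenantAuthConfig("us-east-1_EXAMPLE1", "client-id-1", "us-east-1"),
    "tenant2": TenantAuthConfig("us-east-1_EXAMPLE2", "client-id-2", "us-east-1"),
}


def tenant_from_host(host: str, base_domain: str = "myapp.com") -> str:
    """Extract the tenant label from a host such as tenant1.myapp.com."""
    hostname = host.split(":", 1)[0].lower()  # drop any port, normalize case
    suffix = "." + base_domain
    if not hostname.endswith(suffix):
        raise LookupError(f"{hostname!r} is not a subdomain of {base_domain}")
    label = hostname[: -len(suffix)]
    if not label or "." in label:
        raise LookupError(f"cannot derive a tenant from {hostname!r}")
    return label


def route_to_user_pool(host: str) -> TenantAuthConfig:
    """Look up which user pool and app client should handle this request."""
    tenant = tenant_from_host(host)
    if tenant not in TENANT_CONFIG:
        raise LookupError(f"unknown tenant {tenant!r}")
    return TENANT_CONFIG[tenant]
```

The same structure applies to the other options; only the function that derives the tenant identifier changes.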

For example, the AWS SaaS Factory Serverless Reference Architecture uses the approach shown in Figure 3.

Figure 3: Dedicated user pool per tenant

The workflow is as follows:

The user enters their tenant name during sign-in.
The tenant name retrieves tenant-specific information like the user pool ID, application client ID, and API URLs.
Tenant-specific information is passed to the SaaS app to initialize authentication to the correct user pool and app client, and this is used to initialize an authorization code flow.
The app redirects to the Cognito hosted UI for authentication.
User credentials are validated, and Cognito issues an OAuth code.
The OAuth code is exchanged for a JWT token from Cognito.
The JWT token is used to authenticate the user to access microservices.
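To make the redirect to the hosted UI more concrete, here is a hedged Python sketch that builds the authorization request URL for the authorization code flow. The /oauth2/authorize path and its query parameters are the standard Cognito hosted UI authorization endpoint; the domain, client ID, and redirect URI passed in are assumed to come from the tenant-specific lookup in the earlier steps, and the function name is illustrative.

```python
import secrets
from urllib.parse import urlencode


def build_authorize_url(domain, client_id, redirect_uri, scopes=("openid",), state=None):
    """Return (url, state) for starting an authorization code flow."""
    # A random state value lets the app match the callback to this request.
    state = state or secrets.token_urlsafe(16)
    query = urlencode(
        {
            "response_type": "code",
            "client_id": client_id,
            "redirect_uri": redirect_uri,
            "scope": " ".join(scopes),
            "state": state,
        }
    )
    return f"https://{domain}/oauth2/authorize?{query}", state
```

After the user signs in, Cognito redirects to redirect_uri with a code parameter, which the app exchanges for tokens at the /oauth2/token endpoint of the same domain.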

Advantages of the one pool per tenant model:

Users exist in a single directory with no cross-tenant visibility. Tokens are issued and signed with keys that are unique to that pool.
Each pool can have customized security policies, like password rules or MFA requirements per tenant.
Pools can be hosted in different AWS Regions to meet data residency needs.

Potential disadvantages of the one pool per tenant model:

There are limits on the number of pools per account. (The default is 1,000 pools, and the maximum is 10,000.)
Additional automation is required to create multiple pools, especially with customized configurations.
Applications must implement tenant routing to direct authentication requests to the correct user pool.
Troubleshooting can be more difficult, because configuration of each pool is managed separately and tenant routing functionality is added.

In summary, separate user pools maximize tenant isolation but require more complex provisioning and routing. You might also need to consider limits on the pool count for large multi-tenant deployments.

Pattern 5: Application client per tenant (bridge model)

You can achieve some extra tenant isolation by using separate application clients per tenant in a single user pool, in addition to using groups and custom attributes. Cognito configurations from the application client, such as OAuth scopes, hosted UI customization, and security policies can be specific to each tenant. The application client also enables external IdP federation per tenant. However, user pool–level settings, such as password policy, remain shared.

Figure 4 shows how a single user pool can be configured with multiple application clients. Each of those application clients is assigned to a tenant. However, this approach requires that you implement tenant routing logic in the application to determine which application client a tenant should be mapped to (similar to the tenant routing approach we discussed for the dedicated user pool per tenant). Once the user is authenticated, you can configure Amazon API Gateway with a Lambda authorizer function that validates the ID token signature. Subsequently, the Lambda authorizer can determine which backend resources and services each authenticated user is authorized to access.

Figure 4: Application client based multi-tenancy

For tenants that want to use their own IdP through SAML or OpenID Connect federation, you can create a dedicated application client that will redirect users to authenticate with the tenant’s federated IdP. This has some key benefits:

If a single external IdP is enabled on the application client, the hosted UI automatically redirects users without presenting Cognito sign-in screens. This provides a familiar sign-in experience for tenants and is frictionless if users have existing sessions with the tenant IdP.
Management of user activities like joining and leaving, passwords, and other tasks is entirely handled by the tenant in their own IdP. The SaaS provider doesn’t need to get involved in these processes.

Importantly, even with federation, Cognito still issues tokens after successful external authentication. So the SaaS provider gets consistent tokens from Cognito to validate during authorization, regardless of the IdP.

Attribute mapping

When federating with an external IdP, Amazon Cognito can dynamically map attributes to populate the tokens it issues. This allows attributes like groups, email addresses, and roles created in the IdP to be passed to Cognito during authentication and added to the tokens.

The mapping occurs upon every sign-in, overwriting the existing mapped attributes to stay in sync with the latest IdP values. Therefore, changes made in the external IdP related to mapped attributes are reflected in Cognito after signing in. If a mapped attribute is required in the Cognito user pool, like email for sign-in, it must have an equivalent in the IdP to map. The target attributes in Cognito must be configured as mutable, since immutable attributes cannot be overwritten after creation, even through mapping.

Important: For SaaS identity, tenant attributes should be defined in Cognito rather than mapped from an external IdP. This helps to prevent tenants from tampering with values and maintains isolation. However, user attributes like groups and roles can be mapped from the tenant’s IdP to manage permissions. This allows tenants to configure application roles by using their own IdP groups.

Advantages of the bridge model:

This model enables tenant-specific configuration like OAuth scopes, UI, and IdPs.
Tenant users access familiar workflows through external IdPs, and when using external IdPs, tenant user management is handled externally.
No custom claim mappings are needed, but can be used optionally.
Cognito still issues tokens for authorization.

Disadvantages of the bridge model:

Requires routing users to the correct app client per tenant.
There is a limit on the number of app clients per user pool.
Some user pool settings remain shared, such as password policy.
There is no dynamic group claim modification.

Conclusion

In this blog post, we explored various ways Amazon Cognito user pools can enable multi-tenant identity for SaaS solutions. A single shared user pool simplifies management but limits the option to customize user pool–level policies, while separate pools maximize isolation and configurability at the cost of complexity. If you use multiple application clients, you can balance tailored options like external IdPs and OAuth scopes with centralized policies in the user pool. Custom claim mappings provide flexibility but require additional logic.

These approaches can also be combined. For example, you can have dedicated user pools for select high-tier tenants while others share a multi-tenant pool. The optimal choice depends on the specific tenant needs and on the customization that is required.

In this blog post, we have mainly focused on a static approach. You can also use a pre-token generation Lambda trigger to modify tokens by adding, changing, or removing claims dynamically. The trigger can also override the group membership in both the identity and access tokens. Other claim changes only apply to the ID token. A common use case for this trigger is injecting tenant attributes into the token dynamically.
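As a rough illustration of that use case, the following Python handler adds a tenantTier claim during token generation. It assumes the original (version 1) trigger event format, in which claim changes are returned under claimsOverrideDetails, and it uses a hypothetical in-memory TENANT_TIER lookup keyed by the user's custom:tenantId attribute; a real implementation would read this from its tenant-management store.

```python
# Hypothetical tenant metadata; stands in for a tenant-management lookup.
TENANT_TIER = {"1234": "premium", "5678": "basic"}


def lambda_handler(event, context):
    """Pre-token generation trigger: inject the tenant tier as a claim."""
    attributes = event["request"]["userAttributes"]
    tenant_id = attributes.get("custom:tenantId", "")
    event["response"]["claimsOverrideDetails"] = {
        "claimsToAddOrOverride": {"tenantTier": TENANT_TIER.get(tenant_id, "unknown")}
    }
    return event  # Cognito reads the modified response from the returned event
```

Because the tier is resolved at sign-in time, it can change without updating every user profile in the directory.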

Evaluate the pros and cons of each approach against the requirements of the SaaS architecture and tenants. Often a hybrid model works best. Cognito constructs like user pools, IdPs, and triggers provide various levers that you can use to fine-tune authentication and authorization across tenants.

For further reading on these topics, see the Common Amazon Cognito scenarios topic in the Cognito Developer Guide and the related blog post How to Use Cognito Pre-Token Generation trigger to Customize Claims in ID Tokens.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on Amazon Cognito re:Post.

Federated access to Amazon Athena using AWS IAM Identity Center

2024-08-01 Ajay Rawat

Post Syndicated from Ajay Rawat original https://aws.amazon.com/blogs/security/federated-access-to-amazon-athena-using-aws-iam-identity-center/

Managing Amazon Athena through identity federation allows you to manage authentication and authorization procedures centrally. Athena is a serverless, interactive analytics service that provides a simplified and flexible way to analyze petabytes of data.

In this blog post, we show you how you can use the Athena JDBC driver (which includes a browser Security Assertion Markup Language (SAML) plugin) to connect to Athena from third-party SQL client tools, which helps you quickly implement identity federation capabilities and multi-factor authentication (MFA). This enables automation and enforcement of data access policies across your organization.

You can use AWS IAM Identity Center to federate user access to AWS accounts. IAM Identity Center integrates with AWS Organizations to manage access to the AWS accounts under your organization. In this post, you will learn how to configure the Athena driver to use the AWS configuration profile credentials, so that credentials are resolved from IAM Identity Center. You will also learn how to integrate the Athena browser-based SAML plugin to add single sign-on (SSO) and MFA capability with your federation identity provider (IdP).

Prerequisites

To implement this solution, you must have the following prerequisites:

An AWS account.
Install or update to the latest version of the AWS Command Line Interface (AWS CLI).
Configure IAM Identity Center authentication using the AWS CLI.
IAM Identity Center enabled. See Enabling AWS IAM Identity Center.
Access to SQL client tools (such as SQL Workbench/J, PyCharm, and so on) that support JDBC connections.
An Amazon Simple Storage Service (Amazon S3) bucket to store Athena query results.
Knowledge of using AWS Lake Formation and enabling Lake Formation to manage permissions to a set of tables.
A Lake Formation administrator role. See Lake Formation personas and IAM permissions reference for information on creating a data lake administrator.
Tables and databases are populated in your AWS Glue Data Catalog.
Create two Athena workgroups (for example, sensitive and non-sensitive).

Note: Lake Formation only supports a single role in the SAML assertion. Multiple roles cannot be used.

Solution overview

Figure 1: Solution architecture

To implement the solution, complete the steps below as shown in Figure 1:

An IAM Identity Center delegated administrator creates two custom permission sets within Identity Center.
An IAM Identity Center delegated administrator assigns permission sets to AWS accounts and to users and groups. The user has permissions to single sign-on roles that are provisioned in the data lake account. The role created by Identity Center has a name that begins with AWSReservedSSO.
A Lake Formation administrator grants single sign-on roles permissions to the corresponding database and tables.

The solution workflow consists of the following high-level steps as shown in Figure 1:

The user configures IAM Identity Center authentication using the AWS CLI.
The AWS CLI redirects the user to the AWS access portal URL. The user enters workforce identity credentials (username and password) and then chooses Sign in.
The AWS access portal verifies the user’s identity. IAM Identity Center redirects the request to the Identity Center authentication service to validate the user’s credentials.
If MFA is enabled for the user, then they are prompted to authenticate their MFA device.
The user enters or approves the MFA details. The user’s MFA is successfully completed.
The user selects the AWS account to use from the displayed list and then selects the IAM single sign-on role to use from the displayed list.
The user tests the SQL client connection and then uses the client to run a SQL query.
The client makes a call to Athena to retrieve the table and associated metadata from the Data Catalog.
Athena requests access to the data from Lake Formation.
Lake Formation invokes the AWS Security Token Service (AWS STS):
1. Lake Formation obtains temporary AWS credentials with the permissions of the defined IAM role (sensitive or non-sensitive) associated with the data lake location.
2. Lake Formation returns temporary credentials to Athena.
Athena uses the temporary credentials to retrieve data objects from Amazon S3.
The Athena engine successfully runs the query and returns the results to the client.
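The post itself connects through a JDBC client. Purely as an illustration of the final workflow steps, the following Python sketch runs the same kind of query through boto3, using a named AWS CLI profile that has been signed in through IAM Identity Center. The profile name, workgroup, and output location are placeholders you would supply, and the helper names are illustrative; the Athena calls are the standard boto3 client operations.

```python
import time


def run_query(athena, sql, workgroup, output_location, poll_seconds=1.0):
    """Start an Athena query, wait for it to finish, and return result rows."""
    query_id = athena.start_query_execution(
        QueryString=sql,
        WorkGroup=workgroup,
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"
        ]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "no reason given")
        raise RuntimeError(f"query {query_id} ended as {status['State']}: {reason}")
    # Returns only the first page of results; larger results need pagination.
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]


def make_athena_client(profile_name, region):
    """Create an Athena client from a named AWS CLI profile."""
    import boto3  # imported lazily so run_query can be tested offline

    return boto3.Session(profile_name=profile_name, region_name=region).client("athena")
```

If the signed-in role lacks Lake Formation permissions on the table, the query is expected to fail with an access error rather than return data.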

Solution walkthrough

The walkthrough includes five sections that will guide you through the process of creating permission sets, assigning permission sets to AWS Accounts, managing permission sets access using Lake Formation, and setting up third-party SQL clients such as SQL Workbench to connect to your data store and query your data through Athena.

Step 1: Federate onboarding

Federated onboarding is done within the IAM Identity Center account. As part of federated onboarding, you need to create IAM Identity Center users and groups. Groups are a collection of people who have the same security rights and permissions. You can create groups and add users to the groups. Create one IAM Identity Center group for sensitive data and another for non-sensitive data to provide distinct access to different classes of data sets. You can assign access to IAM Identity Center permission sets to a user or group.

To federate onboarding:

Open the AWS Management Console using the IAM Identity Center account and go to IAM Identity Center.
Choose Groups.
Choose Create group.
Enter a Group name and Description.
Choose Create group.

To add a user as a member of a group:

Open the IAM Identity Center console.
Choose Groups.
Select the group name that you want to update.
On the group details page, under Users in this group, choose Add users to group.
On the Add users to group page, under Other users, locate the users you want to add as members and select the check box next to each of them.
Choose Add users to group.

Figure 2: Assigning users to a group

Step 2: Create permission sets

For this step, create two permission sets (sensitive-iam-role and non-sensitive-iam-role). These permission sets can be assigned to users or groups in IAM Identity Center, granting them specific access to AWS account resources.

To create custom permission sets:

In the IAM Identity Center administrator account, under Multi-Account permissions, choose Permission sets.
Choose Create permission set.
On the Select permission set type page, under Permission set type, choose Custom permission set.

Figure 3: Selecting a permission set
Choose Next.
On the Specify policies and permission boundary page, expand Inline policy to add custom JSON-formatted policy text.

Insert the following policy and update the S3 bucket name (<s3-bucket-name>), AWS Region (<region>), account ID (<account-id>), CloudWatch alarm name (<AlarmName>), Athena workgroup name (sensitive or non-sensitive) (<WorkGroupName>), KMS key alias name (<KMS-key-alias-name>), and organization ID (<aws-PrincipalOrgID>).

{
  "Statement": [
    {
      "Action": [
        "lakeformation:SearchTablesByLFTags",
        "lakeformation:SearchDatabasesByLFTags",
        "lakeformation:ListLFTags",
        "lakeformation:GetResourceLFTags",
        "lakeformation:GetLFTag",
        "lakeformation:GetDataAccess",
        "glue:SearchTables",
        "glue:GetTables",
        "glue:GetTable",
        "glue:GetPartitions",
        "glue:GetDatabases",
        "glue:GetDatabase"
      ],
      "Effect": "Allow",
      "Resource": "*",
      "Sid": "LakeformationAccess"
    },
    {
      "Action": [
        "s3:PutObject",
        "s3:ListMultipartUploadParts",
        "s3:ListBucketMultipartUploads",
        "s3:ListBucket",
        "s3:GetObject",
        "s3:GetBucketLocation",
        "s3:CreateBucket",
        "s3:AbortMultipartUpload"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::<s3-bucket-name>/*",
        "arn:aws:s3:::<s3-bucket-name>"
      ],
      "Sid": "S3Access"
    },
    {
      "Action": "s3:ListAllMyBuckets",
      "Effect": "Allow",
      "Resource": "*",
      "Sid": "AthenaS3ListAllBucket"
    },
    {
      "Action": [
        "cloudwatch:PutMetricAlarm",
        "cloudwatch:DescribeAlarms"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:cloudwatch:<region>:<account-id>:alarm:<AlarmName>"
      ],
      "Sid": "CloudWatchLogs"
    },
    {
      "Action": [
        "athena:UpdatePreparedStatement",
        "athena:StopQueryExecution",
        "athena:StartQueryExecution",
        "athena:ListWorkGroups",
        "athena:ListTableMetadata",
        "athena:ListQueryExecutions",
        "athena:ListPreparedStatements",
        "athena:ListNamedQueries",
        "athena:ListEngineVersions",
        "athena:ListDatabases",
        "athena:ListDataCatalogs",
        "athena:GetWorkGroup",
        "athena:GetTableMetadata",
        "athena:GetQueryResultsStream",
        "athena:GetQueryResults",
        "athena:GetQueryExecution",
        "athena:GetPreparedStatement",
        "athena:GetNamedQuery",
        "athena:GetDatabase",
        "athena:GetDataCatalog",
        "athena:DeletePreparedStatement",
        "athena:DeleteNamedQuery",
        "athena:CreatePreparedStatement",
        "athena:CreateNamedQuery",
        "athena:BatchGetQueryExecution",
        "athena:BatchGetNamedQuery"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:athena:<region>:<account-id>:workgroup/<WorkGroupName>",
        "arn:aws:athena:{Region}:{Account}:datacatalog/{DataCatalogName}"
      ],
      "Sid": "AthenaAllow"
    },
    {
      "Action": [
        "kms:GenerateDataKey",
        "kms:DescribeKey",
        "kms:Decrypt"
      ],
      "Condition": {
        "ForAnyValue:StringLike": {
          "kms:ResourceAliases": "<KMS-key-alias-name>"
        }
      },
      "Effect": "Allow",
      "Resource": "*",
      "Sid": "kms"
    },
    {
      "Action": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalOrgID": "<aws-PrincipalOrgID>"
        }
      },
      "Effect": "Deny",
      "Resource": "*",
      "Sid": "denyRule"
    }
  ],
  "Version": "2012-10-17"
}

Update the custom policy to add the corresponding Athena workgroup ARN for the sensitive and non-sensitive IAM roles.

Note: See the documentation for information about AWS global condition context keys.
Choose Next.
On the Specify permission set details page, enter a name to identify this permission set in IAM Identity Center. The name that you specify for this permission set appears in the AWS access portal as an available role. Users sign in to the AWS access portal, choose an AWS account, and then choose the role.
Choose Next.
On the Review and create page, review the selections that you made, and then choose Create.

Step 3: Assign permission sets to AWS accounts

You can grant or revoke access for an IAM Identity Center user or group by assigning or removing permission sets. Permission sets define what actions an identity can perform on which AWS resources.

To assign permission sets to AWS accounts:

In the IAM Identity Center administrator account, under Multi-account permissions, choose AWS accounts.
On the AWS accounts page, select one or more AWS accounts that you want to assign single sign-on access to.
Choose Assign users or groups.

Figure 4: Selecting users and groups
On the Assign users and groups to “<AWS account name>” page, for Selected users and groups, choose the users and groups that you want to assign permission sets to. Choose Next.
Select permission sets: On the Assign permission sets to “AWS-account-name” page, select one or more permission sets.
On the Review and submit assignments to AWS-account-name page, for Review and submit, choose Submit.

Step 4: Grant permissions to IAM (single sign-on) roles

A data lake administrator has the broad ability to grant a principal (including themselves) permissions on Data Catalog resources. This includes the ability to manage access controls and permissions for the data lake. When you grant Lake Formation permissions on a specific Data Catalog table, you can also include data filtering specifications. This allows you to further restrict access to certain data within the table, limiting what users can see in their query results based on those filtering rules.

To grant permissions to IAM roles:

In the Lake Formation console, under Permissions in the navigation pane, select Data Lake permissions, and then choose Grant.

To grant Database permissions to IAM roles:

Under Principals, select the IAM role name (for example, Sensitive-IAM-Role).
Under Named Data Catalog resources, go to Databases and select a database (for example, demo).

Figure 5: Select an IAM role and database
Under Database permissions, select Describe and then choose Grant.

Figure 6: Grant database permissions to an IAM role

To grant table permissions to IAM roles:

Repeat steps 1 and 2 of the preceding procedure.
Under Tables – optional, select a table name (for example, demo2).

Figure 7: Select tables within a database to grant access
Select the desired Table permissions (for example, Select and Describe), and then choose Grant.

Figure 8: Grant access to tables within the database
Repeat steps 1 through 4 to grant access for the respective database and tables for the non-sensitive IAM role.
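The console grants above can also be scripted. The following is a minimal sketch, assuming the Lake Formation `grant_permissions` call in boto3 and a hypothetical role ARN and account ID; the database (demo) and table (demo2) names are the examples used in this step. It only builds the request bodies unless you call `apply_grants` with valid AWS credentials.

```python
from typing import Any, Dict, List


def database_grant(role_arn: str, database: str) -> Dict[str, Any]:
    """Build a request that grants DESCRIBE on a database to an IAM role."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"Database": {"Name": database}},
        "Permissions": ["DESCRIBE"],
    }


def table_grant(role_arn: str, database: str, table: str) -> Dict[str, Any]:
    """Build a request that grants SELECT and DESCRIBE on one table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["SELECT", "DESCRIBE"],
    }


def apply_grants(grants: List[Dict[str, Any]]) -> None:
    """Send each request to Lake Formation (requires AWS credentials)."""
    import boto3  # imported lazily so the builders work without AWS access

    client = boto3.client("lakeformation")
    for grant in grants:
        client.grant_permissions(**grant)


# Hypothetical ARN for illustration; use the ARN of your single sign-on role.
ROLE_ARN = "arn:aws:iam::111122223333:role/Sensitive-IAM-Role"
grants = [
    database_grant(ROLE_ARN, "demo"),
    table_grant(ROLE_ARN, "demo", "demo2"),
]
```

Keeping the request construction separate from the API call makes it possible to review the grants for the sensitive and non-sensitive roles before anything is applied.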

Step 5: Client-side setup using JDBC

You can use a JDBC connection to connect Athena and SQL client applications (for example, PyCharm or SQL Workbench) to enable analytics and reporting on the data that Athena returns from Amazon S3 databases. To use the Athena JDBC driver, you must specify the driver class from the JAR file. Additionally, you must pass in some parameters to change the authentication mechanism so the athena-sts-auth libraries are used:

S3 output location – Where in S3 the Athena service can write its output. For example, s3://path/to/query/bucket/.
The IAM Identity Center administrator can configure the session duration for the AWS access portal. The session duration can be set from a minimum of 15 minutes to a maximum of 90 days.

To set up PyCharm

Download the Athena JDBC 3.x driver from the Athena JDBC 3.x driver page.
1. In the left navigation pane, select JDBC 3.x and then Getting started. Select Uber jar to download a .jar file, which contains the driver and its dependencies.
  
  Figure 9: Download Athena JDBC jar
Open PyCharm and create a new project.
1. Enter a Name for your project.
2. Select the desired project Location.
3. Choose Create.
Figure 10: Create a new project in PyCharm
Configure Data Source and drivers. Select Data Source, and then choose the plus sign or New to configure new data sources and drivers.

Figure 11: Add database source properties
Configure the Athena driver by selecting the Drivers tab, and then choose the plus sign to add a new driver.

Figure 12: Add database drivers
Under Driver Files, upload the custom JAR file that you downloaded in step 1. Select the Athena class dropdown. Enter the driver’s name (for example, Athena JDBC Driver). Then choose Apply.

Figure 13: Add database driver files
Configure a new data source. Choose the plus sign and select your driver’s name from the driver dropdown.
Enter the data source name (for example, Athena Demo). For the authentication method, select User & Password. Then choose Apply.

Figure 14: Create a project data source profile
Select the SSH/SSL tab and select Use SSL. Verify that the Use truststore options for IDE, JAVA, and system are all selected. Then choose Apply.

Figure 15: Enable data source profile SSL
Select the Options tab and then select Single Session Mode. Then choose Apply.

Figure 16: Configure single session mode in PyCharm
Select the General tab and enter the JDBC and single sign-on URL. The following is a sample JDBC URL based on the SAML application:
```
jdbc:athena://;CredentialsProvider=ProfileCredentials;ProfileName=<name-of-the-profile>;WorkGroup=<name-of-the-WorkGroup>;
```
1. Choose Apply.
2. Choose Test Connection. If the profile has expired, refresh the single sign-on session by running aws sso login --profile <profile-name> with the corresponding profile.
Figure 17: Test the data source connection
After the connection is successful, select the Schemas tab and select All databases and All schemas.

Figure 18: Select data source databases and schemas
Run a sample test query: SELECT * FROM <database-name>.<table-name> LIMIT 10;
Verify that the credentials and permissions are working as expected.

To set up SQL Workbench

Open SQL Workbench.
Configure an Athena driver by selecting File and then Manage Drivers.
Enter Athena JDBC Driver as the name, and for the library, browse to the location where you downloaded the driver. Enter amazonaws.athena.jdbc.AthenaDriver as the Classname.

Enter the following URL, replacing <name-of-the-profile> and <name-of-the-WorkGroup> with your profile and workgroup names.

jdbc:athena://;CredentialsProvider=ProfileCredentials;ProfileName=<name-of-the-profile>;WorkGroup=<name-of-the-WorkGroup>;

Choose OK.
Run a test query, replacing <database-name> and <table-name> with your database and table names:
```
SELECT * FROM <database-name>.<table-name> LIMIT 10;
```
Verify that the credentials and permissions are working as expected.
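Because stray spaces or a missing semicolon in the connection string are easy to introduce by hand, it can help to assemble the URL programmatically. The following is a minimal sketch that uses only the three parameters shown in this post (CredentialsProvider, ProfileName, and WorkGroup); the function name and the validation rule are assumptions made for illustration.

```python
def athena_jdbc_url(profile_name: str, workgroup: str) -> str:
    """Assemble the profile-based Athena JDBC URL used in this post."""
    params = {
        "CredentialsProvider": "ProfileCredentials",
        "ProfileName": profile_name,
        "WorkGroup": workgroup,
    }
    for key, value in params.items():
        # Assumption of this sketch: reject empty values and values that
        # contain a semicolon or whitespace, which would corrupt the URL.
        if not value or any(ch in value for ch in "; \t\n"):
            raise ValueError(f"Invalid value for {key}: {value!r}")
    return "jdbc:athena://;" + "".join(f"{k}={v};" for k, v in params.items())


# Hypothetical profile and workgroup names.
print(athena_jdbc_url("dev-sso", "primary"))
```

You can paste the printed string into the URL field of either PyCharm or SQL Workbench.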

Conclusion

In this post, we covered how to use JDBC drivers to connect to Athena from third-party SQL client tools. You were able to set this up without creating IAM users or any type of long-lived credentials that would need to be stored on your developers’ workstations. You learned how to configure IAM Identity Center users and groups, create permission sets, and assign permission sets to AWS accounts. You also learned how to grant permissions to single sign-on roles using Lake Formation to create distinct access to different classes of data sets and connect to Athena through an SQL client tool (such as PyCharm). This setup can also work with other supported identity sources such as IAM Identity Center, self-managed or on-premises Active Directory, or an external IdP.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Get started with the new Amazon DataZone enhancements for Amazon Redshift

2024-07-29 Carmen Manzulli

Post Syndicated from Carmen Manzulli original https://aws.amazon.com/blogs/big-data/get-started-with-the-new-amazon-datazone-enhancements-for-amazon-redshift/

In today’s data-driven landscape, organizations are seeking ways to streamline their data management processes and unlock the full potential of their data assets, while controlling access and enforcing governance. That’s why we introduced Amazon DataZone.

Amazon DataZone is a powerful data management service that empowers data engineers, data scientists, product managers, analysts, and business users to seamlessly catalog, discover, analyze, and govern data across organizational boundaries, AWS accounts, data lakes, and data warehouses.

On March 21, 2024, Amazon DataZone introduced several enhancements to its Amazon Redshift integration that simplify the process of publishing and subscribing to data warehouse assets like tables and views, while enabling Amazon Redshift customers to take advantage of the data management and governance capabilities of Amazon DataZone.

These updates improve the experience for both data users and administrators.

Data producers and consumers can now quickly create data warehouse environments using preconfigured credentials and connection parameters provided by their Amazon DataZone administrators.

Additionally, these enhancements grant administrators greater control over who can access and use the resources within their AWS accounts and Redshift clusters, and for what purpose.

As an administrator, you can now create parameter sets on top of DefaultDataWarehouseBlueprint by providing parameters such as cluster, database, and an AWS secret. You can use these parameter sets to create environment profiles and authorize Amazon DataZone projects to use these environment profiles for creating environments.

In turn, data producers and data consumers can now select an environment profile to create environments without having to provide the parameters themselves, saving time and reducing the risk of issues.

In this post, we explain how you can use these enhancements to the Amazon Redshift integration to publish your Redshift tables to the Amazon DataZone data catalog, and enable users across the organization to discover and access them in a self-service fashion. We present a sample end-to-end customer workflow that covers the core functionalities of Amazon DataZone, and include a step-by-step guide of how you can implement this workflow.

The same workflow is available as a video demonstration on the official Amazon DataZone YouTube channel.

Solution overview

To get started with the new Amazon Redshift integration enhancements, consider the following scenario:

A sales team acts as the data producer, owning and publishing product sales data (a single table in a Redshift cluster called catalog_sales)
A marketing team acts as the data consumer, needing access to the sales data in order to analyze it and build product adoption campaigns

At a high level, the steps we walk you through in the following sections include tasks for the Amazon DataZone administrator, Sales team, and Marketing team.

Prerequisites

For the workflow described in this post, we assume a single AWS account, a single AWS Region, and a single AWS Identity and Access Management (IAM) user, who will act as Amazon DataZone administrator, Sales team (producer), and Marketing team (consumer).

To follow along, you need an AWS account. If you don’t have an account, you can create one.

In addition, you must have the following resources configured in your account:

An Amazon DataZone domain with admin, sales, and marketing projects
A Redshift namespace and workgroup

If you don’t have these resources already configured, you can create them by deploying an AWS CloudFormation stack:

Choose Launch Stack to deploy the provided CloudFormation template.
For AdminUserPassword, enter a password, and take note of this password to use in later steps.
Leave the remaining settings as default.
Select I acknowledge that AWS CloudFormation might create IAM resources, then choose Submit.
When the stack deployment is complete, on the Amazon DataZone console, choose View domains in the navigation pane to see the newly created Amazon DataZone domain.
On the Amazon Redshift Serverless console, in the navigation pane, choose Workgroup configuration to see the newly created resource.

Make sure that you’re logged in with the same role that you used to deploy the CloudFormation stack and that you’re in the same Region.

As a final prerequisite, you need to create a catalog_sales table in the default Redshift database (dev).

On the Amazon Redshift Serverless console, select your workgroup and choose Query data to open the Amazon Redshift query editor.
In the query editor, choose your workgroup and select Database user name and password as the type of connection, then provide your admin database user name and password.

Use the following query to create the catalog_sales table, which the Sales team will publish in the workflow:

CREATE TABLE catalog_sales AS 
SELECT 146776932 AS order_number, 23 AS quantity, 23.4 AS wholesale_cost, 45.0 as list_price, 43.0 as sales_price, 2.0 as discount, 12 as ship_mode_sk,13 as warehouse_sk, 23 as item_sk, 34 as catalog_page_sk, 232 as ship_customer_sk, 4556 as bill_customer_sk
UNION ALL SELECT 46776931, 24, 24.4, 46, 44, 1, 14, 15, 24, 35, 222, 4551
UNION ALL SELECT 46777394, 42, 43.4, 60, 50, 10, 30, 20, 27, 43, 241, 4565
UNION ALL SELECT 46777831, 33, 40.4, 51, 46, 15, 16, 26, 33, 40, 234, 4563
UNION ALL SELECT 46779160, 29, 26.4, 50, 61, 8, 31, 15, 36, 40, 242, 4562
UNION ALL SELECT 46778595, 43, 28.4, 49, 47, 7, 28, 22, 27, 43, 224, 4555
UNION ALL SELECT 46779482, 34, 33.4, 64, 44, 10, 17, 27, 43, 52, 222, 4556
UNION ALL SELECT 46779650, 39, 37.4, 51, 62, 13, 31, 25, 31, 52, 224, 4551
UNION ALL SELECT 46780524, 33, 40.4, 60, 53, 18, 32, 31, 31, 39, 232, 4563
UNION ALL SELECT 46780634, 39, 35.4, 46, 44, 16, 33, 19, 31, 52, 242, 4557
UNION ALL SELECT 46781887, 24, 30.4, 54, 62, 13, 18, 29, 24, 52, 223, 4561

Now you’re ready to get started with the new Amazon Redshift integration enhancements.

Amazon DataZone administrator tasks

As the Amazon DataZone administrator, you perform the following tasks:

Configure the DefaultDataWarehouseBlueprint.
- Authorize the Amazon DataZone admin project to use the blueprint to create environment profiles.
- Create a parameter set on top of DefaultDataWarehouseBlueprint by providing parameters such as cluster, database, and AWS secret.
Set up environment profiles for the Sales and Marketing teams.

Configure the DefaultDataWarehouseBlueprint

Amazon DataZone blueprints define what AWS tools and services are provisioned to be used within an Amazon DataZone environment. Enabling the data warehouse blueprint will allow data consumers and data producers to use Amazon Redshift and the Query Editor for data sharing, accessing, and consuming.

On the Amazon DataZone console, choose View domains in the navigation pane.
Choose your Amazon DataZone domain.
Choose Default Data Warehouse.

If you used the CloudFormation template, the blueprint is already enabled.

Part of the new Amazon Redshift experience involves the Managing projects and Parameter sets tabs. The Managing projects tab lists the projects that are allowed to create environment profiles using the data warehouse blueprint. By default, this is set to all projects. For our purpose, let’s grant only the admin project.

On the Managing projects tab, choose Edit.

Select Restrict to only managing projects and choose the AdminPRJ project.
Choose Save changes.

With this enhancement, the administrator can control which projects can use default blueprints in their account to create environment profiles.

The Parameter sets tab lists the parameter sets that you can create on top of DefaultDataWarehouseBlueprint by providing parameters such as Redshift cluster or Redshift Serverless workgroup name, database name, and the credentials that allow Amazon DataZone to connect to your cluster or workgroup. You can also create AWS secrets on the Amazon DataZone console. Before these enhancements, AWS secrets had to be managed separately using AWS Secrets Manager, making sure to include the proper tags (key-value) for Amazon Redshift Serverless.

For our scenario, we need to create a parameter set to connect a Redshift Serverless workgroup containing sales data.

On the Parameter sets tab, choose Create parameter set.
Enter a name and optional description for the parameter set.
Choose the Region containing the resource you want to connect to (for example, our workgroup is in us-east-1).
In the Environment parameters section, select Amazon Redshift Serverless.

If you already have an AWS secret with credentials to your Redshift Serverless workgroup, you can provide the existing AWS secret ARN. In this case, the secret must be tagged with the following (key-value): AmazonDataZoneDomain: <Amazon DataZone domain ID>.
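If you bring an existing secret, you can check its tags before referencing it in a parameter set. The following is a minimal sketch of the tag requirement described above, using the Key/Value list shape that the Secrets Manager tagging API accepts; the helper names and the domain ID are hypothetical.

```python
from typing import Dict, List

DOMAIN_TAG_KEY = "AmazonDataZoneDomain"


def datazone_domain_tag(domain_id: str) -> List[Dict[str, str]]:
    """Return the tag list that links a secret to an Amazon DataZone domain."""
    return [{"Key": DOMAIN_TAG_KEY, "Value": domain_id}]


def has_domain_tag(tags: List[Dict[str, str]], domain_id: str) -> bool:
    """Check whether an existing secret's tags include the required domain tag."""
    return any(
        tag.get("Key") == DOMAIN_TAG_KEY and tag.get("Value") == domain_id
        for tag in tags
    )


# Hypothetical domain ID; copy the real one from your domain's details page.
DOMAIN_ID = "dzd_example123"
existing_tags = [{"Key": "team", "Value": "sales"}]

if not has_domain_tag(existing_tags, DOMAIN_ID):
    # These tags could then be passed to the Secrets Manager tag_resource call.
    print("Missing tag, add:", datazone_domain_tag(DOMAIN_ID))
```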

Because we don’t have an existing AWS secret, we create a new one by choosing Create new AWS Secret.
In the pop-up, enter a secret name and your Amazon Redshift credentials, then choose Create new AWS Secret.

Amazon DataZone creates a new secret using Secrets Manager and makes sure the secret is tagged with the domain in which you’re creating the parameter set.

Enter the Redshift Serverless workgroup name and database name to complete the parameters list. If you used the provided CloudFormation template, use sales-workgroup for the workgroup name and dev for the database name.
Choose Create parameter set.

You can see the parameter set created for your Redshift environment and the blueprint enabled with a single managing project configured.

Set up environment profiles for the Sales and Marketing teams

Environment profiles are predefined templates that encapsulate technical details required to create an environment, such as the AWS account, Region, and resources and tools to be added to projects. The next Amazon DataZone administrator task consists of setting up environment profiles, based on the default enabled blueprint, for the Sales and Marketing teams.

This task will be performed from the admin project in the Amazon DataZone data portal, so let’s follow the data portal URL and start creating an environment profile for the Sales team to publish their data.

On the details page of your Amazon DataZone domain, in the Summary section, choose the link for your data portal URL.

When you open the data portal for the first time, you’re prompted to create a project. If you used the provided CloudFormation template, the projects are already created.

Choose the AdminPRJ project.
On the Environments page, choose Create environment profile.
Enter a name (for example, SalesEnvProfile) and optional description (for example, Sales DWH Environment Profile) for the new environment profile.
For Owner, choose AdminPRJ.
For Blueprint, select the DefaultDataWarehouse blueprint (you’ll only see blueprints where the admin project is listed as a managing project).
Choose the current enabled account and the parameter set you previously created.

You then see the prefilled values for Redshift Serverless. Under Authorized projects, you can pick the authorized projects allowed to use this environment profile to create an environment. By default, this is set to All projects.

Select Authorized projects only.
Choose Add projects and choose the SalesPRJ project.
Configure the publishing permissions for this environment profile. Because the Sales team is our data producer, we select Publish from any schema.
Choose Create environment profile.

Next, you create a second environment profile for the Marketing team to consume data. To do this, repeat steps similar to those you completed for the Sales team.

Choose the AdminPRJ project.
On the Environments page, choose Create environment profile.
Enter a name (for example, MarketingEnvProfile) and optional description (for example, Marketing DWH Environment Profile).
For Owner, choose AdminPRJ.
For Blueprint, select the DefaultDataWarehouse blueprint.
Select the parameter set you created earlier.
This time, keep All projects as the default (alternatively, you could select Authorized projects only and add MarketingPRJ).
Configure the publishing permissions for this environment profile. Because the Marketing team is our data consumer, we select Don’t allow publishing.
Choose Create environment profile.

With these two environment profiles in place, the Sales and Marketing teams can work independently in their projects to create their own environments (resources and tools) with less configuration and a lower risk of errors, and publish and consume data securely and efficiently within these environments.

To recap, the new enhancements offer the following features:

When creating an environment profile, you can choose to provide your own Amazon Redshift parameters or use one of the parameter sets from the blueprint configuration. If you choose to use the parameter set created in the blueprint configuration, the AWS secret only requires the AmazonDataZoneDomain tag (the AmazonDataZoneProject tag is only required if you choose to provide your own parameter sets in the environment profile).
In the environment profile, you can specify a list of authorized projects, so that only authorized projects can use this environment profile to create data warehouse environments.
You can also specify what data authorized projects are allowed to publish. You can choose one of the following options: Publish from any schema, Publish from the default environment schema, and Don’t allow publishing.

These enhancements grant administrators more control over Amazon DataZone resources and projects and facilitate the common activities of all roles involved.

Sales team tasks

As a data producer, the Sales team performs the following tasks:

Create a sales environment.
Create a data source.
Publish sales data to the Amazon DataZone data catalog.

Create a sales environment

Now that you have an environment profile, you need to create an environment in order to work with data and analytics tools in this project.

Choose the SalesPRJ project.
On the Environments page, choose Create environment.
Enter a name (for example, SalesDwhEnv) and optional description (for example, Environment DWH for Sales) for the new environment.
For Environment profile, choose SalesEnvProfile.

Data producers can now select an environment profile to create environments, without the need to provide their own Amazon Redshift parameters. The AWS secret, Region, workgroup, and database are ported over to the environment from the environment profile, streamlining and simplifying the experience for Amazon DataZone users.

Review your data warehouse parameters to confirm everything is correct.
Choose Create environment.

The environment will be automatically provisioned by Amazon DataZone with the preconfigured credentials and connection parameters, allowing the Sales team to publish Amazon Redshift tables seamlessly.

Create a data source

Now, let’s create a new data source for our sales data.

Choose the SalesPRJ project.
On the Data page, choose Create data source.
Enter a name (for example, SalesDataSource) and optional description.
For Data source type, select Amazon Redshift.
For Environment, choose SalesDwhEnv.
For Redshift credentials, you can use the same credentials you provided during environment creation, because you’re still using the same Redshift Serverless workgroup.
Under Data Selection, enter the schema name where your data is located (for example, public) and then specify a table selection criterion (for example, *).

Here, the * indicates that this data source will bring into Amazon DataZone all the technical metadata from the database tables of your schema (in this case, a single table called catalog_sales).
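To build intuition for how a selection criterion narrows the set of tables, the following sketch applies shell-style wildcard matching to a list of table names. This post doesn't describe the exact matching rules that Amazon DataZone applies, so treat the glob semantics of Python's `fnmatch` module as an approximation; the second table name is hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List


def select_tables(table_names: List[str], criterion: str) -> List[str]:
    """Return the table names that match a wildcard selection criterion."""
    return [name for name in table_names if fnmatchcase(name, criterion)]


# catalog_sales comes from this post; web_returns is a hypothetical extra table.
tables = ["catalog_sales", "web_returns"]

print(select_tables(tables, "*"))          # every table in the schema
print(select_tables(tables, "catalog_*"))  # only tables with the catalog_ prefix
```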

Choose Next.

On the next page, automated metadata generation is enabled. This means that Amazon DataZone will automatically generate the business names of the table and columns for that asset.

Leave the settings as default and choose Next.
For Run preference, select when to run the data source. Amazon DataZone can automatically publish these assets to the data catalog, but let’s select Run on demand so we can curate the metadata before publishing.
Choose Next.
Review all settings and choose Create data source.
After the data source has been created, you can manually pull technical metadata from the Redshift Serverless workgroup by choosing Run.

When the data source has finished running, you can see the catalog_sales asset correctly added to the inventory.

Publish sales data to the Amazon DataZone data catalog

Open the catalog_sales asset to see details of the new asset (business metadata, technical metadata, and so on).

In a real-world scenario, this pre-publishing phase is when you can enrich the asset by providing more business context and information, such as a readme, glossaries, or metadata forms. For example, you can accept some of the automatically generated metadata recommendations and rename the asset or its columns to make them more readable, descriptive, and easy for a business user to search and understand.

For this post, simply choose Publish asset to complete the Sales team tasks.

Marketing team tasks

Let’s switch to the Marketing team and subscribe to the catalog_sales asset published by the Sales team. As a consumer team, the Marketing team will complete the following tasks:

Create a marketing environment.
Discover and subscribe to sales data.
Query the data in Amazon Redshift.

Create a marketing environment

To subscribe to and access Amazon DataZone assets, the Marketing team needs to create an environment.

Choose the MarketingPRJ project.
On the Environments page, choose Create environment.
Enter a name (for example, MarketingDwhEnv) and optional description (for example, Environment DWH for Marketing).
For Environment profile, choose MarketingEnvProfile.

As with data producers, data consumers can also benefit from a preconfigured profile (created and managed by the administrator) to speed up the environment creation process and reduce the risk of errors.

Review your data warehouse parameters to confirm everything is correct.
Choose Create environment.

Discover and subscribe to sales data

Now that we have a consumer environment, let’s search for the catalog_sales table in the Amazon DataZone data catalog.

Enter sales in the search bar.
Choose the catalog_sales table.
Choose Subscribe.
In the pop-up window, choose your marketing consumer project, provide a reason for the subscription request, and choose Subscribe.

When you get a subscription request as a data producer, Amazon DataZone will notify you through a task in the sales producer project. Because you’re acting as both subscriber and publisher here, you will see a notification.

Choose the notification, which will open the subscription request.

You can see details including which project has requested access, who is the requestor, and why access is needed.

To approve, enter a message for approval and choose Approve.

Now that the subscription has been approved, let’s go back to the MarketingPRJ project. On the Subscribed data page, catalog_sales is listed as an approved asset, but access hasn’t been granted yet. If you choose the asset, you can see that Amazon DataZone is working on the backend to automatically grant the access. When it’s complete, you’ll see the subscription as granted and the message “Asset added to 1 environment.”

Query data in Amazon Redshift

Now that the marketing project has access to the sales data, we can use the Amazon Redshift Query Editor V2 to analyze the sales data.

Under MarketingPRJ, go to the Environments page and select the marketing environment.
Under the analytics tools, choose Query data with Amazon Redshift, which redirects you to the query editor within the environment of the project.
To connect to Amazon Redshift, choose your workgroup and select Federated user as the connection type.

When you’re connected, you will see the catalog_sales table under the public schema.

To make sure that you have access to this table, run the following query:

SELECT * FROM catalog_sales LIMIT 10

As a consumer, you’re now able to explore data and create reports, or you can aggregate data and create new assets to publish in Amazon DataZone, becoming a producer of a new data product to share with other users and departments.
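As an illustration of the kind of aggregation a consumer might run, the following sketch totals quantity per warehouse. It uses an in-memory SQLite database as a local stand-in for Amazon Redshift (an assumption made so that the example runs anywhere) and loads three rows and four columns taken from the sample catalog_sales data in this post.

```python
import sqlite3
from typing import List, Tuple

# (order_number, quantity, sales_price, warehouse_sk) from the sample data.
SAMPLE_ROWS = [
    (146776932, 23, 43.0, 13),
    (46776931, 24, 44.0, 15),
    (46779160, 29, 61.0, 15),
]


def quantity_by_warehouse(rows: List[Tuple[int, int, float, int]]) -> List[Tuple[int, int]]:
    """Load the rows into a local table and sum quantity per warehouse."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE catalog_sales ("
            "order_number INTEGER, quantity INTEGER, "
            "sales_price REAL, warehouse_sk INTEGER)"
        )
        conn.executemany("INSERT INTO catalog_sales VALUES (?, ?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT warehouse_sk, SUM(quantity) AS total_quantity "
            "FROM catalog_sales GROUP BY warehouse_sk ORDER BY warehouse_sk"
        )
        return cursor.fetchall()
    finally:
        conn.close()


print(quantity_by_warehouse(SAMPLE_ROWS))
```

The same GROUP BY statement can be run in the Amazon Redshift query editor against the subscribed table.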

Clean up

To clean up your resources, complete the following steps:

On the Amazon DataZone console, delete the projects used in this post. This will delete most project-related objects like data assets and environments.
Clean up all Amazon Redshift resources (workgroup and namespace) to avoid incurring additional charges.

Conclusion

In this post, we demonstrated how you can get started with the new Amazon Redshift integration in Amazon DataZone. We showed how to streamline the experience for data producers and consumers and how to grant administrators control over data resources.

Embrace these enhancements and unlock the full potential of Amazon DataZone and Amazon Redshift for your data management needs.

Resources

For more information, refer to the following resources:

See the Amazon DataZone documentation
Check out the YouTube playlist for some of the latest demos of Amazon DataZone and short descriptions of the capabilities available
Check out How Amazon DataZone helps customers find value in oceans of data

About the author

Carmen is a Solutions Architect at AWS, based in Milan (Italy). She is a data lover who enjoys helping companies adopt cloud technologies, especially data analytics and data governance. Outside of work, she is a creative person who loves being in contact with nature and sometimes practicing adrenaline activities.

How to build a CA hierarchy across multiple AWS accounts and Regions for global organization

2024-07-26 Jiaqing Xue

Post Syndicated from Jiaqing Xue original https://aws.amazon.com/blogs/security/how-to-build-a-ca-hierarchy-across-multiple-aws-accounts-and-regions-for-global-organization/

Building a certificate authority (CA) hierarchy using AWS Private Certificate Authority has been made simple in Amazon Web Services (AWS); however, the CA tree will often reside in one AWS Region in one account. Many AWS customers run their businesses in multiple Regions using multiple AWS accounts and have described the process of creating a CA hierarchy to reflect their organizational structure as complex. This blog post guides you through building a two-level CA hierarchy with CAs in multiple Regions and in multiple accounts.

Certificates, like those provided through AWS Private CA, are usually used to establish encrypted network connections to meet security requirements of an organization’s data, such as database connections. For example, consider a software as a service (SaaS) company headquartered in Singapore that offers managed database services to users across multiple Regions, such as Oregon (us-west-2), Singapore (ap-southeast-1), and Tokyo (ap-northeast-1). To comply with the regulatory requirements of various countries and Regions, the company needs to build a CA hierarchy with a root CA located in Singapore and subordinate CAs distributed globally.

Another example is a large manufacturing company selling millions of Internet of Things (IoT) devices globally that needs to issue certificates to those devices for identity, authentication, and secure network connections. Different device types or business requirements are handled by different teams that have their own independent AWS accounts within the organization. To meet the certificate policy requirements (such as key length, validity period, and so on) for different types of devices, preserve business flexibility, and control the blast radius of risks, these teams use their own CAs to issue certificates. A customer like this needs a CA hierarchy that aligns with its organizational structure. Subordinate CAs should be deployed to the AWS accounts in different business departments, with each department managing its own subordinate CAs. Each device, or end-entity, typically requires a unique certificate, but the number of private certificates that each CA can issue is limited to 1,000,000 by default. Because this number is easy to exceed in a mass-production scenario, a CA hierarchy with multiple subordinate CAs can help mitigate this limitation.
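As a rough sizing aid, the arithmetic behind this limit can be sketched as a ceiling division. The function name is hypothetical, and the calculation assumes one certificate per device and the default quota of 1,000,000 certificates per CA mentioned above; it ignores renewals and revocations.

```python
DEFAULT_CERTS_PER_CA = 1_000_000


def subordinate_cas_needed(device_count: int, certs_per_ca: int = DEFAULT_CERTS_PER_CA) -> int:
    """Return the minimum number of issuing CAs for one certificate per device."""
    if device_count < 0 or certs_per_ca <= 0:
        raise ValueError("device_count must be >= 0 and certs_per_ca must be > 0")
    # Ceiling division without floating point.
    return -(-device_count // certs_per_ca)


# 2.5 million devices need at least three subordinate CAs at the default quota.
print(subordinate_cas_needed(2_500_000))
```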

In a CA hierarchy, the root CA is responsible for issuing certificates to subordinate CAs, and subordinate CAs are responsible for issuing end-entity certificates. Because the root CA is the root of the trust chain, loss of trust in this CA can mean that the end-entity certificates it issued will become invalid. Therefore, it’s recommended to place the root CA in a dedicated account and apply strict access control and auditing on its usage. We also recommend you strictly limit the scope of use of the root CA. For more best practices, see AWS Private CA best practices and Designing a CA hierarchy.

Key concepts: Certificate template

In AWS Private CA, when you use AWS Command Line Interface (AWS CLI) or AWS APIs to issue a certificate, you need to supply the certificate template Amazon Resource Name (ARN) to specify the X509v3 parameters of the certificate. A certificate template defines what type of certificates can be issued and allows CA administrators to control what certificates are issued and with what attributes. If you provide no ARN, AWS Private CA will automatically apply the EndEntityCertificate/V1 template to issue an end-entity certificate.

AWS Private CA offers certificate templates that encapsulate best practices for the basic constraint values in X509v3 parameters of certificates. For example, for an end-entity certificate, the cA value of Basic constraints is False; for a CA certificate, you must set the cA value to True and set the pathLenConstraint according to the level that the CA using this certificate occupies in the CA hierarchy.
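The relationship between a certificate's role and its basic constraints can be sketched as follows. This is an illustrative model of the X.509 semantics rather than the AWS Private CA API: the class and function names are hypothetical, and pathLenConstraint is modeled as the number of CA levels allowed beneath the certificate's holder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BasicConstraints:
    ca: bool                 # the cA value of the Basic constraints extension
    path_len: Optional[int]  # pathLenConstraint; None when it does not apply


def basic_constraints(is_ca: bool, ca_levels_below: int = 0) -> BasicConstraints:
    """Derive basic constraints from a certificate's place in the hierarchy."""
    if not is_ca:
        # End-entity certificates cannot sign other certificates.
        return BasicConstraints(ca=False, path_len=None)
    if ca_levels_below < 0:
        raise ValueError("ca_levels_below must be >= 0")
    return BasicConstraints(ca=True, path_len=ca_levels_below)


# In a two-level hierarchy, the subordinate CA issues only end-entity
# certificates, so no further CA levels are allowed beneath it.
subordinate = basic_constraints(is_ca=True, ca_levels_below=0)
end_entity = basic_constraints(is_ca=False)
print(subordinate, end_entity)
```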

When issuing certificates, you must select the appropriate certificate template based on the certificate’s intended purpose. For more information about certificate templates, see Understanding certificate templates.
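As an illustration of how the template choice follows from a certificate's purpose, the following sketch maps a certificate kind to a template ARN. The helper name and its arguments are hypothetical. The template names follow the pattern used by the SubordinateCACertificate_PathLen0/V1 and EndEntityCertificate/V1 templates referenced in this post, and the sketch assumes that path lengths 0 through 3 are the ones available as built-in templates.

```python
def template_arn(kind: str, ca_levels_below: int = 0) -> str:
    """Return the AWS Private CA template ARN for a certificate kind.

    kind: "root", "subordinate", or "end-entity".
    ca_levels_below: for a subordinate CA, how many further CA levels it may
    sign beneath itself (this becomes the pathLenConstraint).
    """
    if kind == "end-entity":
        name = "EndEntityCertificate/V1"
    elif kind == "root":
        name = "RootCACertificate/V1"
    elif kind == "subordinate":
        # Assumption: built-in templates cover path lengths 0 to 3.
        if not 0 <= ca_levels_below <= 3:
            raise ValueError("ca_levels_below must be between 0 and 3")
        name = f"SubordinateCACertificate_PathLen{ca_levels_below}/V1"
    else:
        raise ValueError(f"unknown certificate kind: {kind!r}")
    return f"arn:aws:acm-pca:::template/{name}"


# A subordinate CA that only issues end-entity certificates has path length 0.
print(template_arn("subordinate", 0))
```

A path length of 0 means the subordinate CA cannot sign a certificate for another CA, which is the constraint used in the two-level hierarchy in this post.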

Journey of a two-level CA hierarchy

Now that you understand the core concepts of a CA hierarchy, we will show you how to use two CAs located in different accounts and different Regions to establish a CA hierarchy and use the subordinate CA to issue an end-entity certificate associated with a web application. In the example, you will create two CAs: one in the Oregon (us-west-2) Region acting as the root CA and the other one in the Singapore (ap-southeast-1) Region acting as the subordinate CA.

The following figure shows what you will build:

Figure 1: Architecture of a two-level CA hierarchy

Prerequisites

You need the following prerequisites to build and implement the solution in this post.

Designate two different AWS accounts for creating the hierarchy. In this example, we use one called Root CA account and another called Subordinate CA account.
Verify that you can sign in to AWS CloudShell in the Root CA account, and verify that your console user or role has the AWS Identity and Access Management (IAM) policy AWSCertificateManagerPrivateCAFullAccess associated or has permissions to create CAs and issue certificates in AWS Private CA.
Verify that you have access to a role in the Subordinate CA account that has appropriate permissions to create CAs in AWS Private CA, issue certificates using AWS Certificate Manager (ACM), and create an Application Load Balancer (ALB) for a sample web application.

Build the hierarchy

With the prerequisites in place, you can start building the hierarchy, including the:

Root CA
Subordinate CA
Subordinate CA certificate
End-entity certificate

To build the root CA

In the AWS Management Console of the root CA account, go to AWS Private CA and make sure that you are in the Oregon (us-west-2) Region.
Create a CA according to this guide. Select Root in the CA type options to indicate that it’s a root CA.

Figure 2: Create root CA
After creation, the CA will be in Pending certificate status. Finish configuring your CA by choosing Actions and selecting Install CA certificate.

Figure 3: Root CA in pending certificate status
The root CA will change to Active status.

Figure 4: Root CA in active status

To build the subordinate CA

To build the subordinate CA, sign in to the second account designated for this purpose and make sure your Region is different from that of the root CA. This example uses the Singapore (ap-southeast-1) Region. As with the root CA, create a CA in the console, but select Subordinate in the CA type options.

Figure 5: Create a subordinate CA
As with the root CA, the subordinate CA is in the Pending certificate state, so you need to install the certificate for it.

Figure 6: Subordinate CA in pending certificate status

To issue the subordinate CA certificate

In a CA hierarchy, the certificate of the subordinate CA must be issued by the higher-level CA, which in this case is the root CA. To configure the subordinate CA to be Active, you need to get the certificate signing request (CSR) of the subordinate CA and use the root CA to sign it and issue a certificate for it.

Remaining within the Subordinate CA account, get the CSR of the subordinate CA for signing. In the details page of the subordinate CA, open the Install subordinate CA certificate page by choosing Actions and selecting Install CA certificate. The CSR is displayed when you select External private CA as the CA type. Export this to a file called sub-ca.csr on your local machine.
Although the subordinate CA is managed by AWS Private CA, it is treated as an external private CA because it’s in a different account and a different Region from the root CA.

Figure 7: Details page of Install subordinate CA certificate
In the example scenario, you want the subordinate CA to only issue certificates directly to end-entities, so its certificate must be signed using the SubordinateCACertificate_PathLen0/V1 template on the root CA. This encodes a limitation into the CA that declares that the subordinate CA cannot sign a CSR for another CA.
To do this, switch to the Root CA account and use CloudShell to run commands with the AWS CLI. Recreate the file sub-ca.csr using a text editor within CloudShell, and then use the following command to issue the subordinate CA certificate. Replace <root-ca-arn> with the ARN of the root CA you created previously:
```
aws acm-pca issue-certificate --region us-west-2 \
  --certificate-authority-arn <root-ca-arn> \
  --csr file://sub-ca.csr \
  --signing-algorithm SHA256WITHRSA \
  --template-arn arn:aws:acm-pca:::template/SubordinateCACertificate_PathLen0/V1 \
  --validity Value=5,Type=YEARS
```
The output looks like the following. Note the CertificateArn that is returned, because it will be used as <sub-ca-certificate-arn> in the following steps:
```
{
    "CertificateArn": "<sub-ca-certificate-arn>"
}
```
Remaining in the Root CA account, export the subordinate CA’s certificate and certificate chain. Store the text that is returned from the CLI in a file on your local machine for later use in the Subordinate CA account. Replace the <root-ca-arn> and <sub-ca-certificate-arn> with the actual values.
1. Export the subordinate CA’s certificate:
```
aws acm-pca get-certificate --region us-west-2 --certificate-authority-arn <root-ca-arn> --certificate-arn <sub-ca-certificate-arn> --query "Certificate" --output text
```
2. Export the subordinate CA’s certificate chain:
```
aws acm-pca get-certificate --region us-west-2 --certificate-authority-arn <root-ca-arn> --certificate-arn <sub-ca-certificate-arn> --query "CertificateChain" --output text
```
Switching back to the Subordinate CA account, import the subordinate CA’s certificate and certificate chain to activate the subordinate CA. On the details page of the subordinate CA, choose Actions and select Install CA certificate.
1. Use the exported certificate from the previous step to fill in the certificate body.
2. Use the exported certificate chain from the previous step to fill in the certificate chain.
3. Choose Confirm and install.
  
  Figure 8: Import certificate and certificate chain to the subordinate CA
Now, the subordinate CA is in Active status.

Figure 9: Subordinate CA in active status

To validate the end-entity certificate

Now that you have established a two-level CA hierarchy, you can issue an end-entity certificate and deploy a web application in the Subordinate CA account. Use a web browser to view the web application and validate the CA hierarchy.

Go to the AWS Certificate Manager console of the subordinate CA account and request a private certificate. In the Certificate authority details section, select the subordinate CA that was previously built.

Figure 10: Issue certificate using subordinate CA in ACM
Deploy a sample web application behind an Application Load Balancer (ALB) in the Subordinate CA account. This sample web application uses a feature of the ALB listener that returns a fixed response without needing to configure backend compute resources. Configure the listener to use the certificate issued in the previous step to handle HTTPS traffic.

Figure 11: Details of the ALB
Check the hierarchical relationship of CAs by visiting the web application and viewing the certificate details in your web browser.

Figure 12: Certificate details in the web browser

From this image, you can see that the certificate trust chain of the CA hierarchy has been established successfully. The end-entity certificate (on the left) was issued by the CA Example Corp SG L2 CA, which you can identify from its Issuer Name. In the middle is the certificate of Example Corp SG L2 CA, and its issuer is Example Corp Root CA. You can see the certificate of Example Corp Root CA on the right, which is the root CA, and its Issuer Name is itself. When the subject and issuer attributes match, this is indicative of a root CA.
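The check you perform by eye in the browser can be sketched in a few lines. The following example models each certificate only by its subject and issuer names and confirms that every certificate is issued by the next one in the chain, ending in a certificate whose subject and issuer match. The CA names come from this post; the end-entity subject app.example.com and the helper names are hypothetical. This is a name-matching illustration only and does not verify signatures, validity periods, or revocation.

```python
from typing import NamedTuple, Sequence


class Cert(NamedTuple):
    subject: str
    issuer: str


def is_self_issued(cert: Cert) -> bool:
    """A certificate whose subject equals its issuer is indicative of a root CA."""
    return cert.subject == cert.issuer


def chain_is_linked(chain: Sequence[Cert]) -> bool:
    """Check that each certificate's issuer is the subject of the next one,
    and that the last certificate in the chain is self-issued."""
    if not chain:
        return False
    for child, parent in zip(chain, chain[1:]):
        if child.issuer != parent.subject:
            return False
    return is_self_issued(chain[-1])


chain = [
    Cert(subject="app.example.com", issuer="Example Corp SG L2 CA"),      # end-entity (hypothetical subject)
    Cert(subject="Example Corp SG L2 CA", issuer="Example Corp Root CA"),  # subordinate CA
    Cert(subject="Example Corp Root CA", issuer="Example Corp Root CA"),   # root CA
]
print(chain_is_linked(chain))  # prints True
```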

Conclusion

In this post, we showed you how to build a CA hierarchy solution across AWS accounts and Regions, using AWS Private CA. Using this as a guide, you can build a complex CA hierarchy to help meet your global business security and compliance requirements.


Manage Amazon Redshift provisioned clusters with Terraform

2024-07-25 Amit Ghodke

Post Syndicated from Amit Ghodke original https://aws.amazon.com/blogs/big-data/manage-amazon-redshift-provisioned-clusters-with-terraform/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it straightforward and cost-effective to analyze all your data using standard SQL and your existing extract, transform, and load (ETL); business intelligence (BI); and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics.

HashiCorp Terraform is an infrastructure as code (IaC) tool that lets you define cloud resources in human-readable configuration files that you can version, reuse, and share. You can then use a consistent workflow to provision and manage your infrastructure throughout its lifecycle.

In this post, we demonstrate how to use Terraform to manage common Redshift cluster operations, such as:

Creating a new provisioned Redshift cluster using Terraform code and adding an AWS Identity and Access Management (IAM) role to it
Scheduling pause, resume, and resize operations for the Redshift cluster

Solution overview

The following diagram illustrates the solution architecture for provisioning a Redshift cluster using Terraform.

In addition to Amazon Redshift, the solution uses the following AWS services:

Amazon Elastic Compute Cloud (Amazon EC2) offers the broadest and deepest compute platform, with over 750 instances and choice of the latest processors, storage, networking, operating system (OS), and purchase model to help you best match the needs of your workload. For this post, we use an m5.xlarge instance with the Windows Server 2022 Datacenter Edition. The choice of instance type and Windows OS is flexible; you can choose a configuration that suits your use case.
IAM allows you to securely manage identities and access to AWS services and resources. We use IAM roles and policies to securely access services and perform relevant operations. An IAM role is an AWS identity that you can assume to gain temporary access to AWS services and resources. Each IAM role has a set of permissions defined by IAM policies. These policies determine the actions and resources the role can access.
AWS Secrets Manager allows you to securely store the user name and password needed to log in to Amazon Redshift.

In this post, we demonstrate how to set up an environment that connects AWS and Terraform. The following are the high-level tasks involved:

Set up an EC2 instance with Windows OS in AWS.
Install Terraform on the instance.
Configure your environment variables (Windows OS).
Define an IAM policy to have minimum access to perform activities on a Redshift cluster, including pause, resume, and resize.
Establish an IAM role using the policy you created.
Create a provisioned Redshift cluster using Terraform code.
Attach the IAM role you created to the Redshift cluster.
Write the Terraform code to schedule cluster operations like pause, resume, and resize.

Prerequisites

To complete the activities described in this post, you need an AWS account and administrator privileges on the account to use the key AWS services and create the necessary IAM roles.

Create an EC2 instance

We begin with creating an EC2 instance. Complete the following steps to create a Windows OS EC2 instance:

On the Amazon EC2 console, choose Launch Instance.
Choose a Windows Server Amazon Machine Image (AMI) that suits your requirements.
Select an appropriate instance type for your use case.
Configure the instance details:
1. Choose the VPC and subnet where you want to launch the instance.
2. Enable Auto-assign Public IP.
3. For Add storage, configure the desired storage options for your instance.
4. Add any necessary tags to the instance.
For Configure security group, select or create a security group that allows the necessary inbound and outbound traffic to your instance.
Review the instance configuration and choose Launch to start the instance creation process.
For Select an existing key pair or create a new key pair, choose an existing key pair or create a new one.
Choose Launch instance.
When the instance is running, you can connect to it using the Remote Desktop Protocol (RDP) and the administrator password obtained from the Get Windows password option.

Install Terraform on the EC2 instance

Install Terraform on the Windows EC2 instance using the following steps:

RDP into the EC2 instance you created.
Install Terraform on the EC2 instance.

You need to update the environment variables to point to the directory where the Terraform executable is available.

Under System Properties, on the Advanced tab, choose Environment Variables.

Choose the path variable.

Choose New and enter the path where Terraform is installed. For this post, it’s in the C:\ directory.

Confirm Terraform is installed by entering the following command:

terraform -v

Optionally, you can use an editor like Visual Studio Code (VS Code) and add the Terraform extension to it.

Create a user for accessing AWS through code (AWS CLI and Terraform)

Next, we create an administrator user in IAM, which performs the operations on AWS through Terraform and the AWS Command Line Interface (AWS CLI). Complete the following steps:

Create a new IAM user.
On the IAM console, download and save the access key ID and secret access key.

Install the AWS CLI.
Launch the AWS CLI and run aws configure and pass the access key ID, secret access key, and default AWS Region.

This keeps the AWS access keys out of the Terraform code, so they aren’t visible in plain text and can’t be shared accidentally when the code is committed to a code repository.

Create a user for accessing Redshift through code (Terraform)

Because we create a Redshift cluster and run subsequent operations on it, the code needs the cluster’s administrator user name and password (these are different from the IAM administrator user we created earlier for accessing AWS). To handle them securely, we store the user name and password in Secrets Manager and write Terraform code that reads these credentials during the cluster create operation. Complete the following steps:

On the Secrets Manager console, choose Secrets in the navigation pane.
Choose Store a new secret.

For Secret type, select Credentials for Amazon Redshift data warehouse.
Enter your credentials.

Set up Terraform

Complete the following steps to set up Terraform:

Create a folder or directory for storing all your Terraform code.
Open the VS Code editor and browse to your folder.
Choose New File and enter a name for the file using the .tf extension.

Now we’re ready to start writing our code starting with defining providers. The providers definition is a way for Terraform to get the necessary APIs to interact with AWS.

Configure a provider for Terraform:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "5.53.0"
    }
  }
}

# Configure the AWS Provider
provider "aws" {
  region = "us-east-1"
}

Access the admin credentials for the Amazon Redshift admin user:

data "aws_secretsmanager_secret_version" "creds" {
  # Fill in the name you gave to your secret
  secret_id = "terraform-creds"
}

/*json decode to parse the secret*/
locals {
  terraform-creds = jsondecode(
    data.aws_secretsmanager_secret_version.creds.secret_string
  )
}
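To see what the jsondecode step produces, the following Python sketch performs the equivalent parsing outside Terraform. It assumes the secret string is a JSON object with username and password keys, which is what the cluster definition later reads as local.terraform-creds.username and local.terraform-creds.password. The sample values and the helper name are hypothetical, and a real secret may carry additional keys.

```python
import json

# Hypothetical secret payload; in practice the string is returned by Secrets Manager.
secret_string = '{"username": "awsuser", "password": "example-only"}'


def decode_creds(raw: str) -> dict:
    """Parse the secret string and confirm that the keys the Terraform code reads exist."""
    creds = json.loads(raw)
    missing = {"username", "password"} - creds.keys()
    if missing:
        raise KeyError(f"secret is missing keys: {sorted(missing)}")
    return creds


creds = decode_creds(secret_string)
print(creds["username"])  # prints awsuser
```

If the secret is missing either key, the Terraform reference to that attribute fails at plan time, so checking the secret's shape first can save a debugging round trip.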

Create a Redshift cluster

To create a Redshift cluster, use the aws_redshift_cluster resource:

# Create an encrypted Amazon Redshift cluster

resource "aws_redshift_cluster" "dw_cluster" {
  cluster_identifier        = "tf-example-redshift-cluster"
  database_name             = "dev"
  master_username           = local.terraform-creds.username
  master_password           = local.terraform-creds.password
  node_type                 = "ra3.xlplus"
  cluster_type              = "multi-node"
  publicly_accessible       = false
  number_of_nodes           = 2
  encrypted                 = true
  # KMS key ARN read from a second secret; defined in the complete listing under Test the solution
  kms_key_id                = local.RedshiftClusterEncryptionKeySecret.arn
  enhanced_vpc_routing      = true
  cluster_subnet_group_name = "<<your-cluster-subnet-groupname>>"
}

In this example, we create a two-node Redshift cluster called tf-example-redshift-cluster that uses the ra3.xlplus node type. We use jsondecode to read the credentials stored in Secrets Manager. This makes sure the user name and password aren’t written in plain text in the code.

Add an IAM role to the cluster

Because we didn’t have the option to associate an IAM role during cluster creation, we do so now with the following code:

resource "aws_redshift_cluster_iam_roles" "cluster_iam_role" {
  cluster_identifier = aws_redshift_cluster.dw_cluster.cluster_identifier
  iam_role_arns      = ["arn:aws:iam::yourawsaccountId:role/service-role/yourIAMrolename"]
}

Enable Redshift cluster operations

Running operations such as resize, pause, and resume on a schedule is more practical than triggering them manually. Therefore, we create two policies: a trust policy that lets the Amazon Redshift scheduler service assume a role, and a permissions policy that allows the cluster pause, resume, and resize operations. Then we create a role that has both policies attached to it.

You can also perform these steps directly on the console and then reference the resulting resources in Terraform code. The following example demonstrates the code snippets to create policies and a role, and then to attach the policy to the role.

Create the trust policy document for the Amazon Redshift scheduler and create the role that uses it:

#define policy document to establish the Trust Relationship between the role and the entity (Redshift scheduler)

data "aws_iam_policy_document" "assume_role_scheduling" {
  statement {
    effect = "Allow"
    principals {
      type        = "Service"
      identifiers = ["scheduler.redshift.amazonaws.com"]
    }
    actions = ["sts:AssumeRole"]
  }
}

#create a role that has the above trust relationship attached to it, so that it can invoke the redshift scheduling service
resource "aws_iam_role" "scheduling_role" {
  name               = "redshift_scheduled_action_role"
  assume_role_policy = data.aws_iam_policy_document.assume_role_scheduling.json
}

Create a policy document and policy for Amazon Redshift operations:

/*define the policy document for other redshift operations*/

data "aws_iam_policy_document" "redshift_operations_policy_definition" {
  statement {
    effect = "Allow"
    actions = [
      "redshift:PauseCluster",
      "redshift:ResumeCluster",
      "redshift:ResizeCluster",
    ]
    resources = ["arn:aws:redshift:*:youraccountid:cluster:*"]
  }
}

/*create the policy and add the above data (json) to the policy*/
resource "aws_iam_policy" "scheduling_actions_policy" {
  name   = "redshift_scheduled_action_policy"
  policy = data.aws_iam_policy_document.redshift_operations_policy_definition.json
}
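For readers who want to see the JSON that a policy document like the one above corresponds to, the following sketch builds an equivalent structure in Python. It also shows one way to narrow the resource from every cluster in the account to a single cluster, in keeping with the minimum-access goal stated earlier. The helper names, the Region, and the account ID 123456789012 are placeholders and assumptions, not values from this post, and the exact JSON that Terraform renders may differ in formatting.

```python
import json


def cluster_arn(region: str, account_id: str, cluster_id: str) -> str:
    """Build the ARN of a single Redshift cluster."""
    return f"arn:aws:redshift:{region}:{account_id}:cluster:{cluster_id}"


def scheduler_policy(resource_arn: str) -> dict:
    """Return an IAM policy allowing pause, resume, and resize on one resource."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "redshift:PauseCluster",
                    "redshift:ResumeCluster",
                    "redshift:ResizeCluster",
                ],
                "Resource": [resource_arn],
            }
        ],
    }


# Placeholder Region and account ID; the cluster name matches the example in this post.
arn = cluster_arn("us-east-1", "123456789012", "tf-example-redshift-cluster")
print(json.dumps(scheduler_policy(arn), indent=2))
```

In the Terraform code, the same narrowing would mean replacing the wildcard in resources with the ARN of the specific cluster.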

Attach the policy to the IAM role:

/*connect the policy and the role*/
resource "aws_iam_role_policy_attachment" "role_policy_attach" {
  policy_arn = aws_iam_policy.scheduling_actions_policy.arn
  role       = aws_iam_role.scheduling_role.name
}

Pause the Redshift cluster:

#pause a cluster
resource "aws_redshift_scheduled_action" "pause_operation" {
  name     = "tf-redshift-scheduled-action-pause"
  schedule = "cron(00 22 * * ? *)"
  iam_role = aws_iam_role.scheduling_role.arn
  target_action {
    pause_cluster {
      cluster_identifier = aws_redshift_cluster.dw_cluster.cluster_identifier
    }
  }
}

In the preceding example, we created a scheduled action called tf-redshift-scheduled-action-pause that pauses the cluster at 10:00 PM (UTC) every day as a cost-saving action.

Resume the Redshift cluster:

#resume a cluster
resource "aws_redshift_scheduled_action" "resume_operation" {
  name     = "tf-redshift-scheduled-action-resume"
  schedule = "cron(15 07 * * ? *)"
  iam_role = aws_iam_role.scheduling_role.arn
  target_action {
    resume_cluster {
      cluster_identifier = aws_redshift_cluster.dw_cluster.cluster_identifier
    }
  }
}

In the preceding example, we created a scheduled action called tf-redshift-scheduled-action-resume that resumes the cluster at 7:15 AM (UTC) every day in time for business operations to start using the Redshift cluster.
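The schedule strings in the pause and resume examples can be hard to read at a glance. The following sketch splits one into named fields, assuming the six-field order of minutes, hours, day of month, month, day of week, and year. The helper name and field labels are hypothetical, and the sketch only splits the expression; it does not validate the values.

```python
import re

# Assumed field order for a cron(...) schedule expression.
FIELDS = ("minutes", "hours", "day_of_month", "month", "day_of_week", "year")


def parse_schedule(expr: str) -> dict:
    """Split a cron(...) schedule expression into its named fields."""
    match = re.fullmatch(r"cron\((.+)\)", expr.strip())
    if match is None:
        raise ValueError(f"not a cron(...) expression: {expr!r}")
    parts = match.group(1).split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


pause = parse_schedule("cron(00 22 * * ? *)")
resume = parse_schedule("cron(15 07 * * ? *)")
print(pause["hours"], pause["minutes"])    # prints 22 00
print(resume["hours"], resume["minutes"])  # prints 07 15
```

Reading the fields this way makes it easier to confirm that the pause time falls after business hours and the resume time falls before them.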

Resize the Redshift cluster:

#resize a cluster
resource "aws_redshift_scheduled_action" "resize_operation" {
  name     = "tf-redshift-scheduled-action-resize"
  schedule = "cron(15 14 * * ? *)"
  iam_role = aws_iam_role.scheduling_role.arn
  target_action {
    resize_cluster {
      cluster_identifier = aws_redshift_cluster.dw_cluster.cluster_identifier
      cluster_type       = "multi-node"
      node_type          = "ra3.xlplus"
      number_of_nodes    = 4 /*increase the number of nodes using resize operation*/
      classic            = true /*default behavior is elastic resize; set this boolean to true to use classic resize*/
    }
  }
}

In the preceding example, we created a scheduled action called tf-redshift-scheduled-action-resize that increases the number of nodes from 2 to 4. You can do other operations, like changing the node type, as well. By default, elastic resize is used; if you want to use classic resize, you have to pass the parameter classic = true as shown in the preceding code. You can schedule this action ahead of peak periods so that the cluster is sized appropriately for their duration. You can then downsize using similar code during non-peak times.

Test the solution

We apply the following code to test the solution. Change the resource details accordingly, such as account ID and Region name.

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "5.53.0"
    }
  }
}

# Configure the AWS Provider
provider "aws" {
  region = "us-east-1"
}

# access secrets stored in secret manager
data "aws_secretsmanager_secret_version" "creds" {
  # Fill in the name you gave to your secret
  secret_id = "terraform-creds"
}

/*json decode to parse the secret*/
locals {
  terraform-creds = jsondecode(
    data.aws_secretsmanager_secret_version.creds.secret_string
  )
}

#Store the arn of the KMS key to be used for encrypting the redshift cluster

data "aws_secretsmanager_secret_version" "encryptioncreds" {
  secret_id = "RedshiftClusterEncryptionKeySecret"
}
locals {
  RedshiftClusterEncryptionKeySecret = jsondecode(
    data.aws_secretsmanager_secret_version.encryptioncreds.secret_string
  )
}

# Create an encrypted Amazon Redshift cluster
resource "aws_redshift_cluster" "dw_cluster" {
  cluster_identifier = "tf-example-redshift-cluster"
  database_name      = "dev"
  master_username    = local.terraform-creds.username
  master_password    = local.terraform-creds.password
  node_type          = "ra3.xlplus"
  cluster_type       = "multi-node"
  publicly_accessible = false
  number_of_nodes    = 2
  encrypted         = true
  kms_key_id        = local.RedshiftClusterEncryptionKeySecret.arn
  enhanced_vpc_routing = true
  cluster_subnet_group_name="redshiftclustersubnetgroup-yuu4sywme0bk"
}

#add IAM Role to the Redshift cluster

resource "aws_redshift_cluster_iam_roles" "cluster_iam_role" {
  cluster_identifier = aws_redshift_cluster.dw_cluster.cluster_identifier
  iam_role_arns      = ["arn:aws:iam::youraccountid:role/service-role/yourrolename"]
}

#for audit logging please create an S3 bucket which has read write privileges for Redshift service, this example does not include S3 bucket creation.

resource "aws_redshift_logging" "redshiftauditlogging" {
  cluster_identifier   = aws_redshift_cluster.dw_cluster.cluster_identifier
  log_destination_type = "s3"
  bucket_name          = "your-s3-bucket-name"
}

#to do operations like pause, resume, resize on a schedule we need to first create a role that has permissions to perform these operations on the cluster

#define policy document to establish the Trust Relationship between the role and the entity (Redshift scheduler)

data "aws_iam_policy_document" "assume_role_scheduling" {
  statement {
    effect = "Allow"
    principals {
      type        = "Service"
      identifiers = ["scheduler.redshift.amazonaws.com"]
    }

    actions = ["sts:AssumeRole"]
  }
}

#create a role that has the above trust relationship attached to it, so that it can invoke the redshift scheduling service
resource "aws_iam_role" "scheduling_role" {
  name               = "redshift_scheduled_action_role"
  assume_role_policy = data.aws_iam_policy_document.assume_role_scheduling.json
}

/*define the policy document for other redshift operations*/

data "aws_iam_policy_document" "redshift_operations_policy_definition" {
  statement {
    effect = "Allow"
    actions = [
      "redshift:PauseCluster",
      "redshift:ResumeCluster",
      "redshift:ResizeCluster",
    ]

    resources =  ["arn:aws:redshift:*:youraccountid:cluster:*"]
  }
}

/*create the policy and add the above data (json) to the policy*/

resource "aws_iam_policy" "scheduling_actions_policy" {
  name   = "redshift_scheduled_action_policy"
  policy = data.aws_iam_policy_document.redshift_operations_policy_definition.json
}

/*connect the policy and the role*/

resource "aws_iam_role_policy_attachment" "role_policy_attach" {
  policy_arn = aws_iam_policy.scheduling_actions_policy.arn
  role       = aws_iam_role.scheduling_role.name
}

#pause a cluster

resource "aws_redshift_scheduled_action" "pause_operation" {
  name     = "tf-redshift-scheduled-action-pause"
  schedule = "cron(00 14 * * ? *)"
  iam_role = aws_iam_role.scheduling_role.arn
  target_action {
    pause_cluster {
      cluster_identifier = aws_redshift_cluster.dw_cluster.cluster_identifier
    }
  }
}

#resume a cluster

resource "aws_redshift_scheduled_action" "resume_operation" {
  name     = "tf-redshift-scheduled-action-resume"
  schedule = "cron(15 14 * * ? *)"
  iam_role = aws_iam_role.scheduling_role.arn
  target_action {
    resume_cluster {
      cluster_identifier = aws_redshift_cluster.dw_cluster.cluster_identifier
    }
  }
}

#resize a cluster

resource "aws_redshift_scheduled_action" "resize_operation" {
  name     = "tf-redshift-scheduled-action-resize"
  schedule = "cron(15 14 * * ? *)"
  iam_role = aws_iam_role.scheduling_role.arn
  target_action {
    resize_cluster {
      cluster_identifier = aws_redshift_cluster.dw_cluster.cluster_identifier
      cluster_type = "multi-node"
      node_type = "ra3.xlplus"
      number_of_nodes = 4 /*increase the number of nodes using resize operation*/
      classic = true /*default behavior is elastic resize; set this boolean to true to use classic resize*/
    }
  }
}

Run terraform plan to see a list of changes that will be made, as shown in the following screenshot.

After you have reviewed the changes, use terraform apply to create the resources you defined.

You will be asked to enter yes or no before Terraform starts creating the resources.

You can confirm that the cluster is being created on the Amazon Redshift console.

After the cluster is created, the IAM roles and schedules for pause, resume, and resize operations are added, as shown in the following screenshot.

You can also view these scheduled operations on the Amazon Redshift console.

Clean up

If you deployed resources such as the Redshift cluster, IAM roles, or any of the other associated resources by running terraform apply, run terraform destroy to tear them down and avoid incurring charges on your AWS account.

Conclusion

Terraform offers a powerful and flexible solution for managing your infrastructure as code using a declarative approach, with a cloud-agnostic nature, resource orchestration capabilities, and strong community support. This post provided a guide to using Terraform to deploy a Redshift cluster and perform important operations such as resize, resume, and pause on the cluster. Embracing IaC and using the right tools, such as VS Code and Terraform, will enable you to build scalable and maintainable infrastructure and automate processes.

About the Authors

Amit Ghodke is an Analytics Specialist Solutions Architect based out of Austin. He has worked with databases, data warehouses and analytical applications for the past 16 years. He loves to help customers implement analytical solutions at scale to derive maximum business value.

Ritesh Kumar Sinha is an Analytics Specialist Solutions Architect based out of San Francisco. He has helped customers build scalable data warehousing and big data solutions for over 16 years. He loves to design and build efficient end-to-end solutions on AWS. In his spare time, he loves reading, walking, and doing yoga.

How to migrate from AWS Cloud9 to AWS IDE Toolkits or AWS CloudShell

2024-07-25 Rodney Bozo

Post Syndicated from Rodney Bozo original https://aws.amazon.com/blogs/devops/how-to-migrate-from-aws-cloud9-to-aws-ide-toolkits-or-aws-cloudshell/

Building with AWS requires you to interact with and manipulate your AWS resources, whether it’s to manage infrastructure, deploy applications, or troubleshoot issues. Many AWS customers use AWS Cloud9 to do so today. However, developers want the ability to work with AWS resources within their own Integrated Development Environment (IDE) because it allows them to streamline their workflows and use familiar tools. Other customers still want the security and flexibility of working with their resources in the AWS Management Console, but with quicker access and portability across different pages. In this post, we discuss two solutions, the AWS IDE Toolkits and AWS CloudShell, and why you may want to migrate from AWS Cloud9 to one of these solutions.

Overview

The AWS IDE Toolkits are a set of open-source plugins that integrate AWS services directly into popular IDEs like Visual Studio Code, IntelliJ, and PyCharm. With these toolkits, you can manage AWS resources, deploy applications, and debug code without leaving your familiar development environment. Key features of the AWS IDE Toolkits include access to AWS services, resource exploration and management, local debugging capabilities, and integration with AWS deployment tools like AWS CloudFormation and AWS SAM. The AWS IDE Toolkits save you the hassle of deploying and managing an AWS Cloud9 EC2 instance in your account and allow you to interact with AWS services in the context of your IDE’s source code.

AWS CloudShell is a browser-based shell available directly in the AWS Management Console that provides a pre-authenticated and pre-configured environment for interacting with AWS resources. The AWS CLI is pre-installed in the AWS CloudShell environment, which eliminates the need to install and configure the AWS CLI locally and makes it easier to interact with AWS resources from anywhere. You can use AWS CloudShell to check or adjust a configuration file, make a quick fix to a production environment, or even experiment with new AWS services or features. Best of all, usage of AWS CloudShell is free. Because CloudShell is accessible from anywhere in the AWS Management Console, it’s an ideal alternative when you want to interact with AWS resources from the command line over the web and face limitations doing so on your local desktop.

Getting started

If you’re interested in using the AWS IDE Toolkits, the onboarding process is straightforward. In many popular IDEs like Visual Studio Code, you can install the AWS Toolkits extension from the IDE’s extension marketplace and authenticate with your AWS credentials to begin taking advantage of the AWS Toolkits features. For more detailed information about installation, see the onboarding steps for each supported IDE. To begin using AWS CloudShell, choose the CloudShell icon in the AWS Management Console and follow the prompts to launch your shell environment. CloudShell uses the credentials from your AWS Management Console session to provide a pre-authenticated shell environment. You can also explore detailed user guides and sample use cases to help you get familiar with the tool.

Figure 1: Click on the AWS CloudShell icon

Summary

Both the AWS IDE Toolkits and AWS CloudShell offer powerful capabilities for interacting with AWS resources. Whether you prefer working within your local IDE or a web-based terminal directly in the AWS Management Console, these solutions provide a seamless and efficient way to manage your AWS infrastructure and applications. Take the time to explore these options and see how they can enhance your development workflows. Finally, don’t forget to delete your AWS Cloud9 EC2 instances once you migrate to avoid incurring unnecessary future costs.

Configure SAML federation with Amazon OpenSearch Serverless and Keycloak

2024-07-24 Arpad Csoke

Post Syndicated from Arpad Csoke original https://aws.amazon.com/blogs/big-data/configure-saml-federation-with-amazon-opensearch-serverless-and-keycloak/

Amazon OpenSearch Serverless is a serverless version of Amazon OpenSearch Service, a fully managed open search and analytics platform. With Amazon OpenSearch Service, you can run petabyte-scale search and analytics workloads without the heavy lifting of managing the underlying OpenSearch Service clusters. Amazon OpenSearch Serverless supports workloads of up to 30 TB of data for time-series collections and provides an installation of OpenSearch Dashboards with every collection created.

The network configuration for an OpenSearch Serverless collection controls how the collection can be accessed over the network. You have the option to make the collection publicly accessible over the internet from any network, or to restrict access to the collection only privately through OpenSearch Serverless-managed virtual private cloud (VPC) endpoints. This network access setting can be defined separately for the collection’s OpenSearch endpoint (used for data operations) and its corresponding OpenSearch Dashboards endpoint (used for visualizing and analyzing data). In this post, we work with a publicly accessible OpenSearch Serverless collection.

SAML enables users to access multiple applications or services with a single set of credentials, eliminating the need for separate logins for each application or service. This improves the user experience and reduces the overhead of managing multiple credentials. We provide SAML authentication for OpenSearch Serverless. With this you can use your existing identity provider (IdP) to offer single sign-on (SSO) for the OpenSearch Dashboards endpoints of serverless collections. OpenSearch Serverless supports IdPs that adhere to the SAML 2.0 standard, including services like AWS IAM Identity Center, Okta, Keycloak, Active Directory Federation Services (AD FS), and Auth0. This SAML authentication mechanism is solely intended for accessing the OpenSearch Dashboards interface through a web browser.

In this post, we show you how to configure SAML authentication for controlling access to public OpenSearch Dashboards using Keycloak as an IdP.

Solution overview

The following diagram illustrates a sample architecture of a solution that allows users to authenticate to OpenSearch Dashboards using SSO with Keycloak.

The sign-in flow includes the following steps:

A user accesses OpenSearch Dashboards in a browser and chooses an IdP from the list.
OpenSearch Serverless generates a SAML authentication request.
OpenSearch Serverless redirects the request back to the browser.
The browser redirects the user to the selected IdP (Keycloak). Keycloak provides a login page, where users can provide their login credentials.
If authentication was successful, Keycloak returns the SAML response to the browser.
The SAML assertion is sent back to OpenSearch Serverless.
OpenSearch Serverless validates the SAML assertion and logs the user in to OpenSearch Dashboards.

Prerequisites

To get started, you should have the following prerequisites:

An active OpenSearch Serverless collection
A working Keycloak server (on premises or in the cloud)
The following AWS Identity and Access Management (IAM) permissions to configure SAML authentication in OpenSearch Serverless:
- aoss:CreateSecurityConfig – Create a SAML provider.
- aoss:ListSecurityConfigs – List all SAML providers in the current account.
- aoss:GetSecurityConfig – View SAML provider information.
- aoss:UpdateSecurityConfig – Modify a given SAML provider configuration, including the XML metadata.
- aoss:DeleteSecurityConfig – Delete a SAML provider.
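
As a rough sketch, the permissions above could be collected into a single IAM identity policy. The Python below only builds and prints the JSON document. The `Sid` value and the use of `"Resource": "*"` are assumptions made for illustration, so review them against your own security requirements before attaching anything.

```python
import json

# Actions from the prerequisites list for managing SAML providers.
# Note: the list action is written in its plural API form here.
SAML_CONFIG_ACTIONS = [
    "aoss:CreateSecurityConfig",
    "aoss:ListSecurityConfigs",
    "aoss:GetSecurityConfig",
    "aoss:UpdateSecurityConfig",
    "aoss:DeleteSecurityConfig",
]


def build_saml_admin_policy() -> dict:
    """Return an IAM policy document that allows the five actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ManageServerlessSamlProviders",  # hypothetical label
                "Effect": "Allow",
                "Action": SAML_CONFIG_ACTIONS,
                "Resource": "*",  # assumption: not scoped to specific resources
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_saml_admin_policy(), indent=2))
```

The printed JSON could then be pasted into the IAM console or saved to a file for use with the AWS CLI.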

Create and configure a client in Keycloak

Complete the following steps to create your Keycloak client:

Log in to your Keycloak admin page.
In the navigation pane, choose Clients.
Choose Create client.
For Client type, choose SAML.
For Client ID, enter aws:opensearch:AWS_ACCOUNT_ID, where AWS_ACCOUNT_ID is your AWS account ID.
Enter a name and description for your client.
Choose Next.
For Valid redirect URIs, enter the address of the assertion consumer service (ACS), where REGION is the AWS Region in which you have created the OpenSearch Serverless collection.
For Master SAML Processing URL, also enter the preceding ACS address.
Complete your client creation.
After you create the client, you have to disable the Signing keys config setting, because OpenSearch Serverless doesn’t support signed and encrypted requests. For more details, refer to Considerations.
After you have created the client and disabled the client signature, you can export the SAML 2.0 IdP metadata by choosing the link on the Realm settings page. You need this metadata when you create the SAML provider in OpenSearch Serverless.
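
The client ID format used in the steps above is easy to mistype, so a small helper can build it for you. This is a minimal sketch: the function name is hypothetical, and the 12-digit check reflects the standard AWS account ID format.

```python
import re


def keycloak_client_id(aws_account_id: str) -> str:
    """Build the SAML client ID in the form aws:opensearch:AWS_ACCOUNT_ID."""
    # AWS account IDs are 12-digit numbers; reject anything else early.
    if not re.fullmatch(r"[0-9]{12}", aws_account_id):
        raise ValueError("expected a 12-digit AWS account ID")
    return f"aws:opensearch:{aws_account_id}"


if __name__ == "__main__":
    # 123456789012 is a placeholder account ID, not a real one.
    print(keycloak_client_id("123456789012"))
```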

Create a SAML provider

When your OpenSearch Serverless collection is active, you then create a SAML provider. This SAML provider can be assigned to any collection in the same Region. Complete the following steps:

On the OpenSearch Service console, under Serverless in the navigation pane, choose SAML authentication under Security.
Choose Create SAML provider.
Enter a name and description for your SAML provider.
Enter the IdP metadata you downloaded earlier from Keycloak.
Under Additional settings, you can optionally add custom user ID and group attributes (for this example, we leave this empty).
Choose Create a SAML provider.

You have now configured a SAML provider for OpenSearch Serverless. Next, you configure the data access policy for accessing collections.

Create a data access policy

After you have configured the SAML provider, you have to create data access policies for OpenSearch Serverless to grant users access.

On the OpenSearch Service console, under Serverless in the navigation pane, choose Data access policies under Security.
Choose Create access policy.
Enter a name and optional description for your access policy.
For Policy definition method, select Visual editor.
For Rule name, enter a name.
Under Select principals, for Add principals, choose SAML users and groups.
For SAML provider name, choose the provider you created before.
Choose Save.
Specify the user or group in the format user/USERNAME or group/GROUPNAME. The value of USERNAME or GROUPNAME should match the user name or group name you specified in Keycloak.
Choose Save.
Choose Grant to grant permissions to resources.
In the Grant resources and permissions section, you can specify access you want to provide for a given user at the collection level, and also at the index pattern level.
For more information about how to set up more granular access for your users, refer to Supported OpenSearch API operations and permissions and Supported policy permissions.
Choose Save.
You can create additional rules if needed.
Choose Create to create the data access policy.

You now have a data access policy that allows users to access OpenSearch Dashboards and perform the allowed actions there.
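
If you prefer the JSON policy editor over the visual editor, the same policy can be expressed as a document. The sketch below shows one possible shape, assuming the `saml/ACCOUNT_ID/PROVIDER_NAME/user/USERNAME` principal format and a read-only permission set. The collection name, provider name, user name, and description are placeholders, and the chosen permissions are only an example.

```python
import json


def saml_principal(account_id: str, provider: str, kind: str, name: str) -> str:
    """Format a SAML principal such as saml/<account>/<provider>/user/<name>."""
    if kind not in ("user", "group"):
        raise ValueError("kind must be 'user' or 'group'")
    return f"saml/{account_id}/{provider}/{kind}/{name}"


def build_data_access_policy(collection: str, principals: list) -> list:
    """Return a data access policy granting read-only access to one collection."""
    return [
        {
            "Description": "Read-only access for SAML users",  # placeholder text
            "Rules": [
                {
                    "ResourceType": "collection",
                    "Resource": [f"collection/{collection}"],
                    "Permission": ["aoss:DescribeCollectionItems"],
                },
                {
                    "ResourceType": "index",
                    "Resource": [f"index/{collection}/*"],
                    "Permission": ["aoss:DescribeIndex", "aoss:ReadDocument"],
                },
            ],
            "Principal": principals,
        }
    ]


if __name__ == "__main__":
    # All identifiers below are placeholders for illustration.
    user = saml_principal("123456789012", "keycloak-provider", "user", "alice")
    print(json.dumps(build_data_access_policy("my-collection", [user]), indent=2))
```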

Access the OpenSearch Dashboards

Complete the following steps to sign in to the OpenSearch Dashboards:

On the OpenSearch Service console, under Serverless in the navigation pane, choose Dashboard.
In the Collection section, locate your collection and choose Dashboard.

The OpenSearch login page will open in a new browser tab.
Choose your IdP from the dropdown menu and choose Login.

You will be redirected to the Keycloak sign-in page.
Log in with your SSO credentials.

After a successful login, you will be redirected to OpenSearch Dashboards, and you can perform the actions allowed by the data access policy.

You have successfully federated OpenSearch Dashboards with Keycloak as an IdP.

Cleaning up

When you’re done with this solution, delete the resources you created if you no longer need them.

Delete your OpenSearch Serverless collection.
Delete your data access policy.
Delete the SAML provider.

Conclusion

In this post, we demonstrated how to set up Keycloak as an IdP to access an OpenSearch Serverless dashboard using SAML authentication. For more details, refer to SAML authentication for Amazon OpenSearch Serverless.

About the Author

Arpad Csoke is a Solutions Architect at Amazon Web Services. His responsibilities include helping large enterprise customers understand and utilize the AWS environment, acting as a technical consultant to contribute to solving their issues.

Streamline your data governance by deploying Amazon DataZone with the AWS CDK

2024-07-23 Bandana Das

Post Syndicated from Bandana Das original https://aws.amazon.com/blogs/big-data/streamline-your-data-governance-by-deploying-amazon-datazone-with-the-aws-cdk/

Managing data across diverse environments can be a complex and daunting task. Amazon DataZone simplifies this so you can catalog, discover, share, and govern data stored across AWS, on premises, and third-party sources.

Many organizations manage vast amounts of data assets owned by various teams, creating a complex landscape that poses challenges for scalable data management. These organizations require a robust infrastructure as code (IaC) approach to deploy and manage their data governance solutions. In this post, we explore how to deploy Amazon DataZone using the AWS Cloud Development Kit (AWS CDK) to achieve seamless, scalable, and secure data governance.

Overview of solution

By using IaC with the AWS CDK, organizations can efficiently deploy and manage their data governance solutions. This approach provides scalability, security, and seamless integration across all teams, allowing for consistent and automated deployments.

The AWS CDK is a framework for defining cloud IaC and provisioning it through AWS CloudFormation. Developers can use any of the supported programming languages to define reusable cloud components known as constructs. A construct is a reusable and programmable component that represents AWS resources. The AWS CDK translates the high-level constructs defined by you into equivalent CloudFormation templates. AWS CloudFormation provisions the resources specified in the template, streamlining the usage of IaC on AWS.

Amazon DataZone core components are the building blocks to create a comprehensive end-to-end solution for data management and data governance. The following are the Amazon DataZone core components. For more details, see Amazon DataZone terminology and concepts.

Amazon DataZone domain – You can use an Amazon DataZone domain to organize your assets, users, and their projects. By associating additional AWS accounts with your Amazon DataZone domains, you can bring together your data sources.
Data portal – The data portal is outside the AWS Management Console. This is a browser-based web application where different users can catalog, discover, govern, share, and analyze data in a self-service fashion.
Business data catalog – You can use this component to catalog data across your organization with business context and enable everyone in your organization to find and understand data quickly.
Projects – In Amazon DataZone, projects are business use case-based groupings of people, assets (data), and tools used to simplify access to AWS analytics.
Environments – Within Amazon DataZone projects, environments are collections of zero or more configured resources on which a given set of AWS Identity and Access Management (IAM) principals (for example, users with contributor permissions) can operate.
Amazon DataZone data source – In Amazon DataZone, you can publish an AWS Glue Data Catalog data source or Amazon Redshift data source.
Publish and subscribe workflows – You can use these automated workflows to secure data between producers and consumers in a self-service manner and make sure that everyone in your organization has access to the right data for the right purpose.

We use an AWS CDK app to demonstrate how to create and deploy core components of Amazon DataZone in an AWS account. The following diagram illustrates the primary core components that we create.

In addition to the core components deployed with the AWS CDK, we provide a custom resource module to create Amazon DataZone components such as glossaries, glossary terms, and metadata forms, which are not supported by AWS CDK constructs (at the time of writing).

Prerequisites

The following local machine prerequisites are required before starting:

An AWS account (with AWS IAM Identity Center enabled).
Either a Bash or Zsh terminal.
The AWS Command Line Interface (AWS CLI) v2 installed.
Python version 3.10 or higher.
The AWS SDK for Python version 1.34.87 or higher.
Node version v18.17.* or higher.
NPM version v10.2.* or higher.
An AWS Glue table to be registered as a sample data source in an Amazon DataZone project.
As part of this post, we want to publish AWS Glue tables from an AWS Glue database that already exists. For this, you must explicitly provide Amazon DataZone with the permissions to access tables in this existing AWS Glue database. For more information, refer to Configure Lake Formation permissions for Amazon DataZone.
- Remove the IAMAllowedPrincipals permissions from the AWS Lake Formation tables for which Amazon DataZone handles permissions.
- Make sure you have disabled the default permissions under the Data Catalog settings in Lake Formation (see the following screenshot).

Deploy the solution

Complete the following steps to deploy the solution:

Clone the GitHub repository and go to the root of your downloaded repository folder:

git clone https://github.com/aws-samples/amazon-datazone-cdk-example.git
cd amazon-datazone-cdk-example

Install local dependencies:

$ npm ci ### this will install the packages configured in package-lock.json

Sign in to your AWS account using the AWS CLI by configuring your credential file (replace <PROFILE_NAME> with the profile name of your deployment AWS account):
```
$ export AWS_PROFILE=<PROFILE_NAME>
```
Bootstrap the AWS CDK environment (this is a one-time activity and not needed if your AWS account is already bootstrapped):
```
$ npm run cdk bootstrap
```
Run the script to replace the placeholders for your AWS account and AWS Region in the config files:
```
$ ./scripts/prepare.sh <<YOUR_AWS_ACCOUNT_ID>> <<YOUR_AWS_REGION>>
```

The preceding command will replace the AWS_ACCOUNT_ID_PLACEHOLDER and AWS_REGION_PLACEHOLDER values in the following config files:

lib/config/project_config.json
lib/config/project_environment_config.json
lib/constants.ts

Next, you configure your Amazon DataZone domain, project, business glossary, metadata forms, and environments with your data source.

Go to the file lib/constants.ts. You can keep the DOMAIN_NAME provided or update it as needed.
Go to the file lib/config/project_config.json. You can keep the example values for projectName and projectDescription or update them. An example value for projectMembers has also been provided (as shown in the following code snippet). Update the value of the memberIdentifier parameter with an IAM role ARN of your choice that you would like to be the owner of this project.
```
"projectMembers": [
            {
                "memberIdentifier": "arn:aws:iam::AWS_ACCOUNT_ID_PLACEHOLDER:role/Admin",
                "memberIdentifierType": "UserIdentifier"
            }
        ]
```
Go to the file lib/config/project_glossary_config.json. An example business glossary and glossary terms are provided for the projects; you can keep them as is or update them with your project name, business glossary, and glossary terms.
Go to the lib/config/project_form_config.json file. You can keep the example metadata forms provided for the projects or update your project name and metadata forms.
Go to the lib/config/project_environment_config.json file. Update EXISTING_GLUE_DB_NAME_PLACEHOLDER with the existing AWS Glue database name in the same AWS account where you are deploying the Amazon DataZone core components with the AWS CDK. Make sure you have at least one existing AWS Glue table in this AWS Glue database to publish as a data source within Amazon DataZone. Replace DATA_SOURCE_NAME_PLACEHOLDER and DATA_SOURCE_DESCRIPTION_PLACEHOLDER with your choice of Amazon DataZone data source name and description. An example of a cron schedule has been provided (see the following code snippet). This is the schedule for your data source run; you can keep it or update it.
```
"Schedule":{
   "schedule":"cron(0 7 * * ? *)"
}
```
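
To read a schedule like the one above, it helps to know that AWS-style cron expressions have six fields: minutes, hours, day of month, month, day of week, and year. The following sketch splits an expression into those named fields so you can check a value before editing the config. The function name is hypothetical, and the parser does no validation beyond counting fields.

```python
import re

# Field order used by AWS-style cron expressions.
FIELD_NAMES = ("minutes", "hours", "day_of_month", "month", "day_of_week", "year")


def parse_aws_cron(expression: str) -> dict:
    """Split a cron(...) expression into its six named fields."""
    match = re.fullmatch(r"cron\((.+)\)", expression.strip())
    if match is None:
        raise ValueError("expected an expression of the form cron(...)")
    fields = match.group(1).split()
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    return dict(zip(FIELD_NAMES, fields))


if __name__ == "__main__":
    print(parse_aws_cron("cron(0 7 * * ? *)"))
```

For the example schedule, the minutes field is 0 and the hours field is 7, so the data source run starts at 07:00 every day.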

Next, you update the trust policy of the AWS CDK deployment IAM role to deploy a custom resource module.

On the IAM console, update the trust policy of the IAM role for your AWS CDK deployment that starts with cdk-hnb659fds-cfn-exec-role- by adding the following permissions. Replace ${ACCOUNT_ID} and ${REGION} with your specific AWS account and Region.

     {
         "Effect": "Allow",
         "Principal": {
             "Service": "lambda.amazonaws.com"
         },
         "Action": "sts:AssumeRole",
         "Condition": {
             "ArnLike": {
                 "aws:SourceArn": [
                     "arn:aws:lambda:${REGION}:${ACCOUNT_ID}:function:DataZonePreqStack-GlossaryLambda*",
                     "arn:aws:lambda:${REGION}:${ACCOUNT_ID}:function:DataZonePreqStack-GlossaryTermLambda*",
                     "arn:aws:lambda:${REGION}:${ACCOUNT_ID}:function:DataZonePreqStack-FormLambda*"
                 ]
             }
         }
     }
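
Replacing the Region and account placeholders by hand in three ARNs is error prone. As a minimal sketch, the statement can be rendered with Python's `string.Template`, treating both REGION and ACCOUNT_ID as `${...}` placeholders. The function name and the sample values are assumptions for illustration.

```python
import json
from string import Template

# ARN patterns from the trust policy statement, with both values as placeholders.
ARN_TEMPLATES = [
    "arn:aws:lambda:${REGION}:${ACCOUNT_ID}:function:DataZonePreqStack-GlossaryLambda*",
    "arn:aws:lambda:${REGION}:${ACCOUNT_ID}:function:DataZonePreqStack-GlossaryTermLambda*",
    "arn:aws:lambda:${REGION}:${ACCOUNT_ID}:function:DataZonePreqStack-FormLambda*",
]


def render_trust_statement(region: str, account_id: str) -> dict:
    """Return the trust policy statement with placeholders filled in."""
    arns = [
        Template(t).substitute(REGION=region, ACCOUNT_ID=account_id)
        for t in ARN_TEMPLATES
    ]
    return {
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole",
        "Condition": {"ArnLike": {"aws:SourceArn": arns}},
    }


if __name__ == "__main__":
    # us-east-1 and 123456789012 are placeholder values.
    print(json.dumps(render_trust_statement("us-east-1", "123456789012"), indent=4))
```

The rendered statement is meant to be added to the `Statement` array of the role's existing trust policy, not to replace it.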

Now you can configure data lake administrators in Lake Formation.

On the Lake Formation console, choose Administrative roles and tasks in the navigation pane.
Under Data lake administrators, choose Add and add the IAM role for AWS CDK deployment that starts with cdk-hnb659fds-cfn-exec-role- as an administrator.

This IAM role needs permissions in Lake Formation to create resources, such as an AWS Glue database. Without these permissions, the AWS CDK stack deployment will fail.

Deploy the solution:
```
$ npm run cdk deploy --all
```
During deployment, enter y if you want to deploy the changes for some stacks when you see the prompt Do you wish to deploy these changes (y/n)?.
After the deployment is complete, sign in to your AWS account and navigate to the AWS CloudFormation console to verify that the infrastructure deployed.

You should see a list of the deployed CloudFormation stacks, as shown in the following screenshot.

Open the Amazon DataZone console in your AWS account and open your domain.
Open the data portal URL available in the Summary section.
Find your project in the data portal and run the data source job.

This is a one-time activity if you want to publish and search the data source immediately within Amazon DataZone. Otherwise, wait for the data source to run according to the cron schedule mentioned in the preceding steps.

Troubleshooting

If you get the message "Domain name already exists under this account, please use another one (Service: DataZone, Status Code: 409, Request ID: 2d054cb0-0fb7-466f-ae04-c53ff3c57c9a)" (RequestToken: 85ab4aa7-9e22-c7e6-8f00-80b5871e4bf7, HandlerErrorCode: AlreadyExists), change the domain name under lib/constants.ts and try to deploy again.

If you get the message "Resource of type 'AWS::IAM::Role' with identifier 'CustomResourceProviderRole1' already exists." (RequestToken: 17a6384e-7b0f-03b3-1161-198fb044464d, HandlerErrorCode: AlreadyExists), this means you’re accidentally trying to deploy everything in the same account but a different Region. Make sure to use the Region you configured in your initial deployment. For the sake of simplicity, the DataZonePreReqStack is in one Region in the same account.

If you see an “Unmanaged asset” warning on the data asset in your Amazon DataZone project, you must explicitly provide Amazon DataZone with Lake Formation permissions to access tables in this external AWS Glue database. For instructions, refer to Configure Lake Formation permissions for Amazon DataZone.

Clean up

To avoid incurring future charges, delete the resources. If you have already shared the data source using Amazon DataZone, you first have to remove the shared data manually in the Amazon DataZone data portal, because the AWS CDK can’t do that automatically.

Unpublish the data within the Amazon DataZone data portal.
Delete the data asset from the Amazon DataZone data portal.
From the root of your repository folder, run the following command:
```
$ npm run cdk destroy --all
```
Delete the databases that Amazon DataZone created in AWS Glue. Refer to the tips to troubleshoot Lake Formation permission errors in AWS Glue if needed.
Remove the created IAM roles from Lake Formation administrative roles and tasks.

Conclusion

Amazon DataZone offers a comprehensive solution for implementing a data mesh architecture, enabling organizations to address advanced data governance challenges effectively. Using the AWS CDK for IaC streamlines the deployment and management of Amazon DataZone resources, promoting consistency, reproducibility, and automation. This approach enhances data organization and sharing across your organization.

Ready to streamline your data governance? Dive deeper into Amazon DataZone by visiting the Amazon DataZone User Guide. To learn more about the AWS CDK, explore the AWS CDK Developer Guide.

About the Authors

Bandana Das is a Senior Data Architect at Amazon Web Services and specializes in data and analytics. She builds event-driven data architectures to support customers in data management and data-driven decision-making. She is also passionate about enabling customers on their data management journey to the cloud.

Gezim Musliaj is a Senior DevOps Consultant with AWS Professional Services. He is interested in all things CI/CD and data, and in their application to IoT, massive data ingestion, and recently MLOps and GenAI.

Sameer Ranjha is a Software Development Engineer on the Amazon DataZone team. He works in the domain of modern data architectures and software engineering, developing scalable and efficient solutions.

Sindi Cali is an Associate Consultant with AWS Professional Services. She supports customers in building data-driven applications in AWS.

Bhaskar Singh is a Software Development Engineer on the Amazon DataZone team. He has contributed to implementing AWS CloudFormation support for Amazon DataZone. He is passionate about distributed systems and dedicated to solving customers’ problems.

How to use the AWS Secrets Manager Agent

2024-07-22 Eduardo Patrocinio

Post Syndicated from Eduardo Patrocinio original https://aws.amazon.com/blogs/security/how-to-use-the-aws-secrets-manager-agent/

AWS Secrets Manager is a service that helps you manage, retrieve, and rotate database credentials, application credentials, OAuth tokens, API keys, and other secrets throughout their lifecycles. You can use Secrets Manager to replace hard-coded credentials in application source code with a runtime call to the Secrets Manager service to retrieve credentials dynamically when you need them. Storing the credentials in Secrets Manager helps to avoid unintended access by anyone who inspects your application’s source code, configuration, or components.

In this blog post, we introduce a new feature, the Secrets Manager Agent, and walk through how you can use it to retrieve Secrets Manager secrets.

New approach: Secrets Manager Agent

Previously, if you had an application that used Secrets Manager and needed to retrieve secrets, you had to use the AWS SDK or one of our existing caching libraries. Both these options are specific to a certain coding language and allow only limited scope for customization.

The Secrets Manager Agent is a client-side agent that allows you to standardize consumption of secrets from Secrets Manager across your AWS compute environments. (AWS has published the code for the agent as open source code.) Secrets Manager Agent pulls and caches secrets in your compute environment and allows your applications to consume secrets directly from the in-memory cache. The Secrets Manager Agent opens a localhost port inside your application environment. With this port, you fetch the secret value from the local agent instead of making network calls to the service. This allows you to improve the overall availability of your application while reducing your API calls. Because the Secrets Manager Agent is language agnostic, you can install the binary file of the agent on many types of AWS compute environments.

Although you can use this feature to retrieve and cache secrets in your application’s compute environment, the access controls for Secrets Manager secrets remain unchanged. This means that AWS Identity and Access Management (IAM) principals need the same permissions as if they were to retrieve each of the secrets. You will need to provide GetSecretValue and DescribeSecret permissions to the secrets that you want to consume by using the Secrets Manager Agent.

The Secrets Manager Agent offers protection against server-side request forgery (SSRF). When you install the Secrets Manager Agent, the script generates a random SSRF token on startup and stores it in the file /var/run/awssmatoken. The token is readable by the awssmatokenreader group that the install script creates. The Secrets Manager Agent denies requests that don’t have an SSRF token in the header or that have an invalid SSRF token.

Solution overview

The Secrets Manager Agent provides a language-agnostic way to consume secrets in your application code. It supports various AWS compute services, such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), and AWS Lambda functions. In this solution, we share how you can install the Secrets Manager Agent on an EC2 instance and retrieve secrets in your application code by using curl commands. See the AWS Secrets Manager Agent documentation to learn how you can use this agent with other types of compute services.

Prerequisites

You need to have the following:

An AWS account
The AWS Command Line Interface (AWS CLI) version 2
jq

Follow the steps on the Install or update to the latest version of the AWS CLI page to install the AWS CLI and the Configure the AWS CLI page to configure it.

Create the secret

The first step will be to create a secret in Secrets Manager by using the AWS CLI.

To create a secret

Enter the following command in a terminal to create a secret:

aws secretsmanager create-secret --name MySecret --description "My Secret" \
  --secret-string "{\"user\": \"my_user\", \"password:\": \"my-password\"}"

You will see an output like the following:

% aws secretsmanager create-secret --name MySecret --description "My Secret" \
 --secret-string "{\"user\": \"my_user\", \"password:\": \"my-password\"}"
{
 "ARN": "arn:aws:secretsmanager:us-east-1:XXXXXXXXXXXX:secret:MySecret-LrBlpm",
 "Name": "MySecret",
 "VersionId": "b5e73e9b-6ec5-4144-a176-3648304b2d60"
}

Record the secret ARN as <SECRET_ARN>, because you will use it in the next section.
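
Hand-escaping the quotes in the secret string is easy to get wrong. As a sketch, the command line can be generated from a Python dictionary with `json.dumps` and `shlex.join`, which take care of JSON and shell quoting. The function name is hypothetical, and the key is written as `password` here, without the trailing colon that appears in the command above.

```python
import json
import shlex


def build_create_secret_command(name: str, description: str, secret: dict) -> str:
    """Return an `aws secretsmanager create-secret` command line as a string."""
    secret_string = json.dumps(secret)  # JSON quoting and escaping
    args = [
        "aws", "secretsmanager", "create-secret",
        "--name", name,
        "--description", description,
        "--secret-string", secret_string,
    ]
    return shlex.join(args)  # shell quoting for each argument


if __name__ == "__main__":
    print(build_create_secret_command(
        "MySecret",
        "My Secret",
        {"user": "my_user", "password": "my-password"},
    ))
```

Keep in mind that a secret value passed on the command line can end up in your shell history, so treat this as a convenience for test values only.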

Create the IAM role

The Lambda function, the EC2 instance, and the ECS task definition need an IAM role that grants permission to retrieve the secret you just created.

To create the IAM role

Using an editor, create a file named ec2_iam_policy.json with the following content:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "ec2.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        } 
    ]
}

Type the following command in a terminal to create the IAM role:

aws iam create-role --role-name ec2-secret-execution-role \
  --assume-role-policy-document file://ec2_iam_policy.json

Create a file named iam_permission.json with the following content, replacing <SECRET_ARN> with the secret ARN you noted earlier:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret"
            ],
            "Resource": "<SECRET_ARN>"
        }
    ]
}

Type the following command to create a policy:
```
aws iam create-policy \
  --policy-name get-secret-policy \
  --policy-document file://iam_permission.json
```
Record the Arn as <POLICY_ARN>, because you will need that value next.
Type the following command to add this policy to the IAM role, replacing <POLICY_ARN> with the value you just noted:
```
aws iam attach-role-policy \
  --role-name ec2-secret-execution-role \
  --policy-arn <POLICY_ARN>
```

Type the following command to add the AWS Systems Manager policy to the role:

aws iam attach-role-policy \
  --role-name ec2-secret-execution-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
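
The contents of iam_permission.json from the earlier step can also be generated instead of edited by hand, which avoids leaving the <SECRET_ARN> placeholder in place by mistake. This is a minimal sketch: the function name and the basic ARN shape check are assumptions, and the sample ARN uses a placeholder account ID.

```python
import json


def build_permission_policy(secret_arn: str) -> dict:
    """Return a policy allowing read access to a single secret."""
    parts = secret_arn.split(":")
    # Expect arn:<partition>:secretsmanager:<region>:<account>:secret:<name>
    if (
        len(parts) < 7
        or parts[0] != "arn"
        or parts[2] != "secretsmanager"
        or parts[5] != "secret"
    ):
        raise ValueError("not a Secrets Manager secret ARN")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:DescribeSecret",
                ],
                "Resource": secret_arn,
            }
        ],
    }


if __name__ == "__main__":
    arn = "arn:aws:secretsmanager:us-east-1:123456789012:secret:MySecret-LrBlpm"
    print(json.dumps(build_permission_policy(arn), indent=4))
```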

Launch an EC2 instance

Use the steps in this section to launch an EC2 instance.

To create an instance profile

Type the following command to create an instance profile:

aws iam create-instance-profile --instance-profile-name secret-profile

Type the following command to associate this instance profile with the role you just created:

aws iam add-role-to-instance-profile --instance-profile-name secret-profile \
  --role-name ec2-secret-execution-role

To create a security group

Type the following command to create a security group:
```
aws ec2 create-security-group --group-name secret-security-group \
  --description "Secrets Manager Security Group"
```
Record the group ID as <GROUP_ID>, because you will need this value in the next step.

To launch an EC2 instance

Run the following command to launch an EC2 instance, replacing <GROUP_ID> with the security group ID:

aws ec2 run-instances \
  --image-id resolve:ssm:/aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2 \
  --instance-type t3.micro \
  --security-group-ids <GROUP_ID> \
  --iam-instance-profile Name=secret-profile \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=secret-instance}]'

Record the InstanceId value as <INSTANCE_ID>.

Check the status of this launch by running the following command:

aws ec2 describe-instances --filters Name=tag:Name,Values=secret-instance | \
  jq ".Reservations[0].Instances[0].State"

You will see a response like the following, which shows that the instance is running:

% aws ec2 describe-instances --filters Name=tag:Name,Values=secret-instance | jq ".Reservations[0].Instances[0].State"
{
 "Code": 16,
 "Name": "running"
}

After the instance is in running state, type the following command to connect to the EC2 instance, replacing <INSTANCE_ID> with the value you noted earlier:
```
aws ssm start-session --target <INSTANCE_ID>
```

Leave the session open, because you will use it in the next step.
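
If jq is not available on your local machine, the instance-state check from earlier in this section can be approximated with a short Python script that reads the `describe-instances` output from standard input. This is a sketch only: the function names are hypothetical, and it looks at the first instance of the first reservation, just as the jq filter does.

```python
import json
import sys


def first_instance_state(describe_instances_json: str) -> dict:
    """Return the State object of the first instance in the response."""
    data = json.loads(describe_instances_json)
    reservations = data.get("Reservations", [])
    if not reservations or not reservations[0].get("Instances"):
        raise LookupError("no instances found in the response")
    return reservations[0]["Instances"][0]["State"]


def is_running(state: dict) -> bool:
    """True when the state name reported by EC2 is 'running'."""
    return state.get("Name") == "running"


if __name__ == "__main__":
    # Example: aws ec2 describe-instances ... | python3 this_script.py
    state = first_instance_state(sys.stdin.read())
    print(json.dumps(state, indent=1))
    sys.exit(0 if is_running(state) else 1)
```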

Install the Secrets Manager Agent to the EC2 instance

Use the steps in this section to install the Secrets Manager Agent in the EC2 instance. You will run these commands in the EC2 instance you created earlier.

To download the Secrets Manager Agent code

Type the following command to install git in the EC2 instance:
```
sudo yum install -y git 
```
Type the following command to download the Secrets Manager Agent code:
```
cd ~;git clone https://github.com/awslabs/aws-secretsmanager-agent
```

To install the Secrets Manager Agent

Type the following command to install the Secrets Manager Agent:
```
cd aws-secretsmanager-agent/release
sudo ./install
```

To grant permission to read the token file

Type the following commands to copy the token file and grant the current user (ssm-user) permission to read it:
```
sudo cp /var/run/awssmatoken /tmp
sudo chown ssm-user /tmp/awssmatoken
```

Retrieve the secret

Now you can use the agent's local web server to retrieve the secret. Processes running in this EC2 instance can retrieve the secret with a REST API call to the web server.

To retrieve a secret

Run the following command to retrieve the secret through the local agent, passing the token in the X-Aws-Parameters-Secrets-Token header:

```
curl -H "X-Aws-Parameters-Secrets-Token: $(</tmp/awssmatoken)" localhost:2773/secretsmanager/get?secretId=MySecret
```

You will see the following output:

```
$ curl -H "X-Aws-Parameters-Secrets-Token: $(</tmp/awssmatoken)" localhost:2773/secretsmanager/get?secretId=MySecret
{"ARN":"arn:aws:secretsmanager:us-east-1:XXXXXXXXXXXX:secret:MySecret-3z00LH","Name":"MySecret","VersionId":"e7b07d00-a0e8-41b9-b76e-45bdd8daca4f","SecretString":"{\"user\": \"my_user\", \"password:\": \"my-password\"}","VersionStages":["AWSCURRENT"],"CreatedDate":"1716912317.961"}
```
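
The response wraps the secret value in a JSON document, so a script usually needs one more step to extract a field. The following is a minimal sketch, assuming the token file path, port, and header shown above; the function names `fetch_secret` and `secret_field` and the two environment variables are hypothetical, and `secret_field` assumes that `jq` is installed on the instance and that the SecretString is itself JSON.

```shell
# Sketch: read a secret from the local Secrets Manager Agent.
# Defaults match this walkthrough; override them if your setup differs.
AGENT_TOKEN_FILE="${AGENT_TOKEN_FILE:-/tmp/awssmatoken}"
AGENT_ENDPOINT="${AGENT_ENDPOINT:-http://localhost:2773}"

# Print the full JSON response for the given secret ID.
# --fail makes curl return a non-zero status on HTTP errors.
fetch_secret() {
  secret_id="$1"
  curl --silent --fail \
    -H "X-Aws-Parameters-Secrets-Token: $(cat "$AGENT_TOKEN_FILE")" \
    "${AGENT_ENDPOINT}/secretsmanager/get?secretId=${secret_id}"
}

# Print a single field of a JSON-encoded SecretString.
# Arguments: secret ID, field name (for example: MySecret user).
secret_field() {
  fetch_secret "$1" | jq -r --arg key "$2" '.SecretString | fromjson | .[$key]'
}
```

With the sample secret above, `secret_field MySecret user` would print `my_user`. The sketch passes the secret ID into the URL unchanged, so an ID containing characters that are special in URLs would need to be URL-encoded first.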

Exit from the EC2 instance by typing exit.

Clean up

Follow the steps in this section to clean up the resources created by the solution.

To terminate the EC2 instance and associated resources

Type the following command to terminate the EC2 instance, replacing <INSTANCE_ID> with the InstanceId value you recorded when you launched the instance:
```
aws ec2 terminate-instances --instance-ids <INSTANCE_ID>
```
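
Termination is not instantaneous, and deleting a security group that is still attached to an instance may fail with a dependency error. The following is a small sketch of one way to wait before continuing with the cleanup; the function name `wait_for_termination` is hypothetical, and it relies on the `aws ec2 wait instance-terminated` waiter.

```shell
# Sketch: block until the given instance reaches the terminated state.
# Argument: the instance ID recorded at launch.
wait_for_termination() {
  if [ -z "$1" ]; then
    echo "usage: wait_for_termination <INSTANCE_ID>" >&2
    return 1
  fi
  aws ec2 wait instance-terminated --instance-ids "$1"
}
```

Call it with the same instance ID you passed to `terminate-instances`. When it returns successfully, the remaining cleanup commands can run.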

Run the following command to delete the security group:

```
aws ec2 delete-security-group --group-name secret-security-group
```

Run the following command to delete the IAM role from the instance profile:

```
aws iam remove-role-from-instance-profile --instance-profile-name secret-profile \
  --role-name ec2-secret-execution-role
```

Run the following command to delete the instance profile:

```
aws iam delete-instance-profile --instance-profile-name secret-profile
```

To clean up the IAM role

Run the following command to detach the secret access policy from the role, replacing <POLICY_ARN> with the value you noted earlier:
```
aws iam detach-role-policy --role-name ec2-secret-execution-role \
  --policy-arn <POLICY_ARN>
```

Run the following command to detach the AmazonSSMManagedInstanceCore managed policy from the role:

```
aws iam detach-role-policy --role-name ec2-secret-execution-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
```

Run the following command to delete the IAM role:

```
aws iam delete-role --role-name ec2-secret-execution-role
```

To clean up the secret

Run the following command to delete the secret:

```
aws secretsmanager delete-secret --secret-id MySecret
```

Conclusion

In this post, we introduced the Secrets Manager Agent and showed how to install it in an EC2 instance so that processes on the instance can retrieve secrets from Secrets Manager. An application can call the agent's local web server to retrieve secrets without using the AWS SDK. To learn how to use the Secrets Manager Agent in other compute environments, see the AWS Secrets Manager Agent documentation.

To learn more about AWS Secrets Manager, see the AWS Secrets Manager documentation.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.