Tag Archives: Technical How-to

Harmonize data using AWS Glue and AWS Lake Formation FindMatches ML to build a customer 360 view

Post Syndicated from Nishchai JM original https://aws.amazon.com/blogs/big-data/harmonize-data-using-aws-glue-and-aws-lake-formation-findmatches-ml-to-build-a-customer-360-view/

In today’s digital world, data is generated by a large number of disparate sources and growing at an exponential rate. Companies are faced with the daunting task of ingesting all this data, cleansing it, and using it to provide outstanding customer experience.

Typically, companies ingest data from multiple sources into their data lake to derive valuable insights from the data. These sources are often related but use different naming conventions, which prolongs cleansing and slows down the data processing and analytics cycle. This problem particularly impacts companies trying to build accurate, unified customer 360 profiles. Some customer records in this data are semantic duplicates: they represent the same user entity but have different labels or values. This is commonly referred to as a data harmonization or deduplication problem. The underlying schemas were implemented independently and don’t adhere to common keys that can be used for joins to deduplicate records using deterministic techniques. This has led to so-called fuzzy deduplication techniques, which use various machine learning (ML) based approaches to address the problem.

In this post, we look at how we can use AWS Glue and the AWS Lake Formation ML transform FindMatches to harmonize (deduplicate) customer data coming from different sources to get a complete customer profile to be able to provide better customer experience. We use Amazon Neptune to visualize the customer data before and after the merge and harmonization.

Overview of solution

In this post, we go through the various steps to apply ML-based fuzzy matching to harmonize customer data across two different datasets for auto and property insurance. These datasets are synthetically generated and represent a common problem for entity records stored in multiple, disparate data sources with their own lineage that appear similar and semantically represent the same entity but don’t have matching keys (or keys that work consistently) for deterministic, rule-based matching. The following diagram shows our solution architecture.

We use an AWS Glue job to transform the auto insurance and property insurance customer source data to create a merged dataset containing fields that are common to both datasets (identifiers) that a human expert (data steward) would use to determine semantic matches. The merged dataset is then used to deduplicate customer records using an AWS Glue ML transform to create a harmonized dataset. We use Neptune to visualize the customer data before and after the merge and harmonization to see how the FindMatches transform can bring all related customer data together to get a complete customer 360 view.

To demonstrate the solution, we use two separate data sources: one for property insurance customers and another for auto insurance customers, as illustrated in the following diagram.

The data is stored in an Amazon Simple Storage Service (Amazon S3) bucket, labeled as Raw Property and Auto Insurance data in the following architecture diagram. The diagram also describes detailed steps to process the raw insurance data into harmonized insurance data to avoid duplicates and build logical relations with related property and auto insurance data for the same customer.

The workflow includes the following steps:

  1. Catalog the raw property and auto insurance data, using an AWS Glue crawler, as tables in the AWS Glue Data Catalog.
  2. Transform raw insurance data into CSV format acceptable to Neptune Bulk Loader, using an AWS Glue extract, transform, and load (ETL) job.
  3. When the data is in CSV format, use an Amazon SageMaker Jupyter notebook to run a PySpark script to load the raw data into Neptune and visualize it in a Jupyter notebook.
  4. Run an AWS Glue ETL job to merge the raw property and auto insurance data into one dataset and catalog the merged dataset. This dataset will have duplicates and no relations are built between the auto and property insurance data.
  5. Create and train an AWS Glue ML transform to harmonize the merged data to remove duplicates and build relations between the related data.
  6. Run the AWS Glue ML transform job. The job also catalogs the harmonized data in the Data Catalog and transforms the harmonized insurance data into CSV format acceptable to Neptune Bulk Loader.
  7. When the data is in CSV format, use a Jupyter notebook to run a PySpark script to load the harmonized data into Neptune and visualize it in a Jupyter notebook.

Prerequisites

To follow along with this walkthrough, you must have an AWS account. Your account should have permission to provision and run an AWS CloudFormation script to deploy the AWS services mentioned in the architecture diagram of the solution.

Provision required resources using AWS CloudFormation:

To launch the CloudFormation stack that configures the required resources for this solution in your AWS account, complete the following steps:

  1. Log in to your AWS account and choose Launch Stack:

  2. Follow the prompts on the AWS CloudFormation console to create the stack.
  3. When the launch is complete, navigate to the Outputs tab of the launched stack and note all the key-value pairs of the resources provisioned by the stack.

Verify the raw data and script files S3 bucket

On the CloudFormation stack’s Outputs tab, choose the value for S3BucketName. The S3 bucket name should be cloud360-s3bucketstack-xxxxxxxxxxxxxxxxxxxxxxxx and should contain folders similar to the following screenshot.

The following are some important folders:

  • auto_property_inputs – Contains raw auto and property data
  • merged_auto_property – Contains the merged data for auto and property insurance
  • output – Contains the delimited files (separate subdirectories)

Catalog the raw data

To help walk through the solution, the CloudFormation stack created and ran an AWS Glue crawler to catalog the property and auto insurance data. To learn more about creating and running AWS Glue crawlers, refer to Working with crawlers on the AWS Glue console. You should see the following tables created by the crawler in the c360_workshop_db AWS Glue database:

  • source_auto_address – Contains address data of customers with auto insurance
  • source_auto_customer – Contains auto insurance details of customers
  • source_auto_vehicles – Contains vehicle details of customers
  • source_property_addresses – Contains address data of customers with property insurance
  • source_property_customers – Contains property insurance details of customers

You can review the data using Amazon Athena. For more information about using Athena to query an AWS Glue table, refer to Running SQL queries using Amazon Athena. For example, you can run the following SQL query:

SELECT * FROM "c360_workshop_db"."source_auto_address" limit 10;

Convert the raw data into CSV files for Neptune

The CloudFormation stack created and ran the AWS Glue ETL job prep_neptune_data to convert the raw data into CSV format acceptable to Neptune Bulk Loader. To learn more about building an AWS Glue ETL job using AWS Glue Studio and to review the job created for this solution, refer to Creating ETL jobs with AWS Glue Studio.

Verify the completion of job run by navigating to the Runs tab and checking the status of most recent run.

Verify the CSV files created by the AWS Glue job in the S3 bucket under the output folder.
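
To make the target format concrete, the following is a minimal sketch of the Gremlin CSV layout that the Neptune bulk loader reads: vertex files start with the system columns ~id and ~label, and edge files start with ~id, ~from, ~to, and ~label. The vertex label customer, the edge label has_address, the property columns, and the sample rows are hypothetical; the columns that the prep_neptune_data job actually writes are not shown in this post.

```python
import csv
import io

def vertices_csv(customers):
    """Render customer dicts as a Gremlin-format vertex file."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # ~id and ~label are system columns; the others are typed property columns.
    writer.writerow(["~id", "~label", "first_name:String", "last_name:String"])
    for c in customers:
        writer.writerow([c["id"], "customer", c["first_name"], c["last_name"]])
    return buf.getvalue()

def edges_csv(edges):
    """Render (edge_id, from_id, to_id, label) tuples as a Gremlin-format edge file."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["~id", "~from", "~to", "~label"])
    writer.writerows(edges)
    return buf.getvalue()

# Hypothetical sample data, only to show the shape of the files.
vertex_text = vertices_csv(
    [{"id": "c1", "first_name": "James", "last_name": "Sanchez"}]
)
edge_text = edges_csv([("e1", "c1", "a1", "has_address")])
print(vertex_text)
print(edge_text)
```

Keeping vertices and edges in separate files mirrors how the loader distinguishes them: it decides which kind of file it is reading from the header row.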

Load and visualize the raw data in Neptune

This section uses SageMaker Jupyter notebooks to load, query, explore, and visualize the raw property and auto insurance data in Neptune. Jupyter notebooks are web-based interactive platforms. We use Python scripts to analyze the data in a Jupyter notebook. A Jupyter notebook with the required Python scripts has already been provisioned by the CloudFormation stack.

  1. Start Jupyter Notebook.
  2. Choose the Neptune folder on the Files tab.

  3. Under the Customer360 folder, open the notebook explore_raw_insurance_data.ipynb.
  4. Run Steps 1–5 in the notebook to analyze and visualize the raw insurance data.

The rest of the instructions are inside the notebook itself. The following is a summary of the tasks for each step in the notebook:

  • Step 1: Retrieve Config – Run this cell to run the commands to connect to Neptune for Bulk Loader.
  • Step 2: Load Source Auto Data – Load the auto insurance data into Neptune as vertices and edges.
  • Step 3: Load Source Property Data – Load the property insurance data into Neptune as vertices and edges.
  • Step 4: UI Configuration – This block sets up the UI config and provides UI hints.
  • Step 5: Explore entire graph – The first block builds and displays a graph for all customers with more than four coverages of auto or property insurance policies. The second block displays the graph for four different records for a customer with the name James.

These are all records for the same customer, but because they’re not linked in any way, they appear as different customer records. The AWS Glue FindMatches ML transform job will identify these records as customer James, and the records provide complete visibility on all policies owned by James. The Neptune graph looks like the following example. The vertex covers represents the coverage of auto or property insurance by the owner (James in this case) and the vertex locatedAt represents the address of the property or vehicle that is covered by the owner’s insurance.

Merge the raw data and crawl the merged dataset

The CloudFormation stack created and ran the AWS Glue ETL job merge_auto_property to merge the raw property and auto insurance data into one dataset and catalog the resultant dataset in the Data Catalog. The AWS Glue ETL job does the following transforms on the raw data and merges the transformed data into one dataset:

  • Changes the following fields on the source table source_auto_customer:
    1. Changes policyid to id and data type to string
    2. Changes fname to first_name
    3. Changes lname to last_name
    4. Changes work to company
    5. Changes dob to date_of_birth
    6. Changes phone to home_phone
    7. Drops the fields birthdate, priority, policysince, and createddate
  • Changes the following fields on the source_property_customers:
    1. Changes customer_id to id and data type to string
    2. Changes social to ssn
    3. Drops the fields job, email, industry, city, state, zipcode, netnew, sales_rounded, sales_decimal, priority, and industry2
  • After converting the unique ID field in each table to string type and renaming it to id, the AWS Glue job appends the suffix -auto to all id fields in the source_auto_customer table and the suffix -property to all id fields in the source_property_customers table before copying all the data from both tables into the merged_auto_property table.
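
The transforms listed above can be sketched in plain Python as follows. This is an illustration on dictionaries, not the code of the merge_auto_property job (which runs on AWS Glue); the rename and drop lists are taken from the bullets above, while the helper name normalize and the sample rows are hypothetical.

```python
AUTO_RENAMES = {
    "policyid": "id", "fname": "first_name", "lname": "last_name",
    "work": "company", "dob": "date_of_birth", "phone": "home_phone",
}
AUTO_DROPS = {"birthdate", "priority", "policysince", "createddate"}

PROPERTY_RENAMES = {"customer_id": "id", "social": "ssn"}
PROPERTY_DROPS = {
    "job", "email", "industry", "city", "state", "zipcode", "netnew",
    "sales_rounded", "sales_decimal", "priority", "industry2",
}

def normalize(row, renames, drops, suffix):
    """Drop unwanted fields, rename the rest, and suffix the id as a string."""
    out = {renames.get(key, key): value
           for key, value in row.items() if key not in drops}
    out["id"] = f"{out['id']}{suffix}"  # also converts a numeric id to string
    return out

# Hypothetical sample rows with only a few of the source fields.
auto_rows = [{"policyid": 29801, "fname": "James", "lname": "Sanchez",
              "priority": "low"}]
property_rows = [{"customer_id": 19801, "first_name": "James",
                  "last_name": "Sanchez", "social": "123-45-6789",
                  "job": "n/a"}]

merged = (
    [normalize(r, AUTO_RENAMES, AUTO_DROPS, "-auto") for r in auto_rows]
    + [normalize(r, PROPERTY_RENAMES, PROPERTY_DROPS, "-property")
       for r in property_rows]
)
for row in merged:
    print(row)
```

The suffix keeps IDs from the two sources distinct after the merge, so a record can always be traced back to the table it came from.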

Verify the new table created by the job in the Data Catalog, and review the merged dataset in Athena with the following SQL query:

SELECT * FROM "c360_workshop_db"."merged_auto_property" limit 10

For more information about how to review the data in the merged_auto_property table, refer to Running SQL queries using Amazon Athena.

Create, teach, and tune the Lake Formation ML transform

The merge AWS Glue job created a Data Catalog table called merged_auto_property. Preview the table in the Athena query editor and download the dataset as a CSV file from the Athena console. You can open the CSV file for a quick comparison of duplicates.

The rows with IDs 11376-property and 11377-property are mostly the same except for the last two digits of their SSN; differences like these are usually human errors. The fuzzy matches are easy to spot by a human expert or data steward with domain knowledge of how this data was generated, cleansed, and processed in the various source systems. Although a human expert can identify those duplicates on a small dataset, it becomes tedious when dealing with thousands of records. The AWS Glue ML transform builds on this intuition and provides an easy-to-use ML-based algorithm to automatically apply this approach to large datasets efficiently.
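
To illustrate the intuition (not the FindMatches algorithm itself, which is a trained ML model), the following sketch scores record pairs with the simple string similarity from Python’s difflib module. The records, the SSN values, and the choice of fields are made up for this example.

```python
from difflib import SequenceMatcher

# Made-up records: a and b differ only in the last two SSN digits.
record_a = {"first_name": "James", "last_name": "Sanchez", "ssn": "123-45-6789"}
record_b = {"first_name": "James", "last_name": "Sanchez", "ssn": "123-45-6798"}
record_c = {"first_name": "Maria", "last_name": "Lopez", "ssn": "987-65-4321"}

FIELDS = ("first_name", "last_name", "ssn")

def field_similarity(left, right):
    """Similarity in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, left.lower(), right.lower()).ratio()

def record_similarity(a, b):
    """Average the per-field similarities of two records."""
    scores = [field_similarity(a[f], b[f]) for f in FIELDS]
    return sum(scores) / len(scores)

near_duplicate_score = record_similarity(record_a, record_b)
unrelated_score = record_similarity(record_a, record_c)
print(f"a vs b: {near_duplicate_score:.2f}")
print(f"a vs c: {unrelated_score:.2f}")
```

A fixed similarity threshold like this is brittle on real data, which is why the post relies on a transform that learns from labeled examples instead.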

Create the FindMatches ML transform

  1. On the AWS Glue console, expand Data Integration and ETL in the navigation pane.
  2. Under Data classification tools, choose Record Matching.

This will open the ML transforms page.

  3. Choose Create transform.
  4. For Name, enter c360-ml-transform.
  5. For Existing IAM role, choose GlueServiceRoleLab.
  6. For Worker type, choose G.2X (Recommended).
  7. For Number of workers, enter 10.
  8. For Glue version, choose Spark 2.4 (Glue Version 2.0).
  9. Keep the other values as default and choose Next.
  10. For Database, choose c360_workshop_db.
  11. For Table, choose merged_auto_property.
  12. For Primary key, select id.
  13. Choose Next.
  14. In the Choose tuning options section, you can tune the performance and cost metrics available for the ML transform. We keep the default trade-offs for a balanced approach. If needed, you can adjust these values later by selecting the transform and using the Tune menu.
  15. Review the values and choose Create ML transform.

The ML transform is now created with the status Needs training.

Teach the transform to identify the duplicates

In this step, we teach the transform by providing labeled examples of matching and non-matching records. You can create your labeling set yourself or allow AWS Glue to generate the labeling set based on heuristics. AWS Glue extracts records from your source data and suggests potential matching records. The file will contain approximately 100 data samples for you to work with.
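
As a rough sketch of what a labeling file encodes: AWS Glue labeling files carry a labeling_set_id column and a label column in front of the data columns, and records that share both values within a set are treated as matches. The rows below are hypothetical and are not taken from Label-1-iteration.csv or Label-2-iteration.csv.

```python
import csv
import io
from collections import defaultdict
from itertools import combinations

# Hypothetical labeling file content with a reduced set of data columns.
LABEL_FILE = """labeling_set_id,label,id,first_name,last_name
set-1,A,1-property,James,Sanchez
set-1,A,2-auto,JamE,Sanchez
set-1,B,3-property,Maria,Lopez
set-2,A,4-auto,Maria,Lopez
"""

def matched_pairs(label_csv):
    """Return the record-id pairs that the labels mark as matches."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(label_csv)):
        groups[(row["labeling_set_id"], row["label"])].append(row["id"])
    pairs = []
    for ids in groups.values():
        pairs.extend(combinations(sorted(ids), 2))
    return pairs

pairs = matched_pairs(LABEL_FILE)
print(pairs)
```

Labels are only compared inside one labeling set, so the two Maria Lopez rows in different sets above say nothing about each other.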

  1. On the AWS Glue console, navigate to the ML transforms page.
  2. Select the transform c360-ml-transform and choose Train model.

  3. Select I have labels and choose Browse S3 to upload labels from Amazon S3.

Two labeled files have been created for this example. We upload these files to teach the ML transform.

  4. Navigate to the label folder in your S3 bucket, select the labeled file (Label-1-iteration.csv), choose Choose, and then choose Upload labeling file from S3.
  5. A green banner appears for successful uploads.
  6. Upload the second label file (Label-2-iteration.csv) and select Append to my existing labels.
  7. Wait for the successful upload, then choose Next.
  8. Review the details in the Estimate quality metrics section and choose Close.

Verify that the ML transform status is Ready for use. Note that the label count is 200 because we successfully uploaded two labeled files to teach the transform. Now we can use it in an AWS Glue ETL job for fuzzy matching of the full dataset.

Before proceeding to the next steps, note the transform ID (tfm-xxxxxxx) for the created ML transform.

Harmonize the data, catalog the harmonized data, and convert the data into CSV files for Neptune

In this step, we run an AWS Glue ML transform job to find matches in the merged data. The job also catalogs the harmonized dataset in the Data Catalog and converts the merged dataset into CSV files for Neptune to show the relations in the matched records.

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose the job perform_ml_dedup.

  3. On the job details page, expand Additional properties.
  4. Under Job parameters, enter the transform ID you saved earlier and save the settings.
  5. Choose Run and monitor the job status for completion.
  6. Run the following query in Athena to review the data in the new table ml_matched_auto_property, created and cataloged by the AWS Glue job, and observe the results:
SELECT * FROM c360_workshop_db.ml_matched_auto_property WHERE first_name like 'Jam%' and last_name like 'Sanchez%';

The job has added a new column called match_id. If multiple records follow the match criteria, then all matching records have the same match_id.

Match IDs play a crucial role in data harmonization using Lake Formation FindMatches. Each row is assigned an integer match ID, and rows that the transform considers to be the same entity receive the same value. The transform learns which fields matter, such as first_name, last_name, SSN, or date_of_birth, from the uploaded label files. For instance, match ID 25769803941 is assigned to rows 1, 2, 4, and 5, which share the same last_name, SSN, and date_of_birth. Consequently, the records with IDs 19801-property, 29801-auto, 19800-property, and 29800-auto all share the same match ID. Take note of the match ID because it is used in the Neptune Gremlin queries.
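
The grouping that match_id enables can be sketched in a few lines of Python. The four IDs and the match ID 25769803941 come from the example above; the fifth row and the reduced column set are hypothetical, added only to show a record without a match.

```python
from collections import defaultdict

# Rows reduced to the two columns that matter for grouping.
rows = [
    {"id": "19801-property", "match_id": 25769803941},
    {"id": "29801-auto", "match_id": 25769803941},
    {"id": "19800-property", "match_id": 25769803941},
    {"id": "29800-auto", "match_id": 25769803941},
    {"id": "1-auto", "match_id": 1},  # hypothetical unmatched record
]

def group_by_match_id(records):
    """Collect record ids under the match_id they were assigned."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["match_id"]].append(record["id"])
    return dict(grouped)

grouped = group_by_match_id(rows)
# Keep only groups with more than one member: these are the duplicates.
clusters = {mid: ids for mid, ids in grouped.items() if len(ids) > 1}
print(clusters)
```

Each cluster corresponds to one real-world customer, which is what the harmonized graph in Neptune later connects.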

The AWS Glue job also created two files, master_vertex.csv and master_edge.csv, under output/master_data in the S3 bucket. We use these files to load data into the Neptune database to find the relationships among different entities.

Load and visualize the harmonized data in Neptune

This section uses Jupyter notebooks to load, query, explore, and visualize the ML matched auto and property insurance data in Neptune. Complete the following steps:

  1. Start Jupyter Notebook.
  2. Choose the Neptune folder on the Files tab.
  3. Under the Customer360 folder, choose the notebook explore_harmonized_insurance_data.ipynb.
  4. Run Steps 1–4 in the notebook to analyze and visualize the harmonized insurance data.

The rest of the instructions are inside the notebook itself. The following is a summary of the tasks for each step in the notebook:

  • Step 1. Retrieve Config – Run this cell to run the commands to connect to Neptune for Bulk Loader.
  • Step 2. Load Harmonized Customer Data – Load the final vertex and edge files into Neptune.
  • Step 3. Initialize Neptune node traversals – This block sets up the UI config and provides UI hints.
  • Step 4. Exploring Customer 360 graph – Replace REPLACE_ME in g.V('REPLACE_ME') with the match_id copied from the previous step (25769803941), if it hasn’t been replaced already, and run the cell.

This displays the graph for four different records for a customer with the first name James; the records for James and JamE are now connected with the SameAs vertex. The Neptune graph helps connect different entities with match criteria; the AWS Glue FindMatches ML transform job has identified these records as customer James and the records show the Match_id is the same for them. The following diagram shows an example of the Neptune graph. The vertex covers represents the coverage of auto or property insurance by the owner (James in this case) and the vertex locatedAt represents the address of the property or vehicle that is covered by the owner’s insurance.

Clean up

To avoid incurring additional charges to your account, on the AWS CloudFormation console, select the stack that you provisioned as part of this post and delete it.

Conclusion

In this post, we showed how to use the AWS Lake Formation FindMatches transform for fuzzy matching of data on a data lake, to link records when there are no join keys and to group records that share a match ID. You can use Amazon Neptune to establish the relationships between records and visualize the connected graph to derive insights.

We encourage you to explore our range of services and see how they can help you achieve your goals. For more data and analytics blog posts, check out AWS Blogs.


About the Authors

Nishchai JM is an Analytics Specialist Solutions Architect at Amazon Web Services. He specializes in building big data applications and helping customers modernize their applications on the cloud. He thinks data is the new oil and spends most of his time deriving insights from data.

Varad Ram is a Senior Solutions Architect at Amazon Web Services. He likes to help customers adopt cloud technologies and is particularly interested in artificial intelligence. He believes deep learning will power future technology growth. In his spare time, he likes to be outdoors with his daughter and son.

Narendra Gupta is a Specialist Solutions Architect at AWS, helping customers on their cloud journey with a focus on AWS analytics services. Outside of work, Narendra enjoys learning new technologies, watching movies, and visiting new places.

Arun A K is a Big Data Solutions Architect with AWS. He works with customers to provide architectural guidance for running analytics solutions on the cloud. In his free time, Arun loves to enjoy quality time with his family.

Enforce boundaries on AWS Glue interactive sessions

Post Syndicated from Nicolas Jacob Baer original https://aws.amazon.com/blogs/big-data/enforce-boundaries-on-aws-glue-interactive-sessions/

AWS Glue interactive sessions allow engineers to build, test, and run data preparation and analytics workloads in an interactive notebook. Interactive sessions provide isolated development environments, take care of the underlying compute cluster, and allow for configuration to stop idling resources.

AWS Glue interactive sessions provide default recommended configurations and also allow users to customize the session to meet their needs. For example, you can provision more workers to experiment on a larger dataset or set the idle timeout for long-running workloads. Because these options can be changed freely depending on the workload, you may need to ensure that they are changed only within specific boundaries, and apply a control mechanism to enforce this.

In this post, we present the process of deploying a reusable solution to enforce AWS Glue interactive session limits on three options: connection, number of workers, and maximum idle time. The first option addresses the need for applying custom inspection and controls on traffic, for example by enforcing an interactive session to only be run inside a VPC. The other two enforce limits on costs and usage of AWS Glue resources by enforcing an upper boundary on the number of workers and idle time per session. You can further extend the solution for other properties or services within AWS Glue.

Overview of solution

The proposed architecture is built on serverless components and runs whenever a new AWS Glue interactive session is created.

Architecture Diagram of the Solution

The workflow steps are as follows:

  1. A data engineer creates a new AWS Glue interactive session either through the AWS Management Console or in a Jupyter notebook locally.
  2. The interactive session produces a new event to AWS CloudTrail for the CreateSession event with all relevant information to identify and inspect a session as soon as the session is initiated.
  3. An Amazon EventBridge rule filters the CloudTrail events and invokes an AWS Lambda function to inspect the CreateSession event.
  4. The Lambda function inspects the CreateSession event and checks for all defined boundary conditions. Currently, the boundaries configurable with this solution are limited to maximum number of workers, idle timeout in minutes, and deployment with connection enforced.
  5. If any of the defined boundary conditions are not met, for example too many workers are provisioned for the session, depending on the provided configuration, the function ends the interactive session immediately and sends an email via Amazon Simple Notification Service (Amazon SNS). If the session hasn’t started yet, the function will wait for it to start before taking any action.
  6. If the session was stopped, an email is sent to an SNS topic. There is no information available in the interactive session notebook on the reason for the ending of the session. Therefore, additional context information is provided through the SNS topic to the data engineers.
  7. If the function fails, the sessions are logged in a dead-letter queue inside Amazon Simple Queue Service (Amazon SQS). Furthermore, the queue is monitored and in case of a message, it will trigger an Amazon CloudWatch alarm.
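
The boundary check in step 4 can be sketched as a pure function, shown below. This is an assumption-laden illustration, not the code from the GitHub repo: the request keys numberOfWorkers, idleTimeout, and connections are assumed names for the CreateSession request parameters as they appear in the CloudTrail event, and the default limits are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Limits:
    # Hypothetical defaults; in the solution these come from the deployment settings.
    max_workers: int = 10
    max_idle_timeout_minutes: int = 60
    enforce_vpc_connection: bool = True

def find_violations(request: dict, limits: Limits) -> List[str]:
    """Return a human-readable message for each boundary the request breaks."""
    violations = []

    workers = request.get("numberOfWorkers")  # assumed key name
    if workers is not None and workers > limits.max_workers:
        violations.append(f"{workers} workers exceed the limit of {limits.max_workers}")

    idle = request.get("idleTimeout")  # assumed key name, in minutes
    if idle is not None and idle > limits.max_idle_timeout_minutes:
        violations.append(
            f"idle timeout of {idle} minutes exceeds the limit of "
            f"{limits.max_idle_timeout_minutes}"
        )

    # Assumed nesting: {"connections": {"connections": ["name", ...]}}
    connections = (request.get("connections") or {}).get("connections") or []
    if limits.enforce_vpc_connection and not connections:
        violations.append("session was created without a connection")

    return violations

sample_request = {
    "numberOfWorkers": 50,
    "idleTimeout": 30,
    "connections": {"connections": ["my-vpc-connection"]},
}
print(find_violations(sample_request, Limits()))
```

Returning the list of violations, rather than a single boolean, gives the notification email in steps 5 and 6 the context it needs to explain why a session was stopped.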

The following steps walk you through how to build and deploy the solution. The code is available in the GitHub repo.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Overview of the deployed resources

All the necessary resources are defined in an AWS CloudFormation file located under cfn/template.yaml. To deploy those resources, we use AWS Serverless Application Model (AWS SAM), which enables us to conveniently build and package all the dependencies and also manages the AWS CloudFormation steps for us.

The CloudFormation stack deploys the following resources:

  • A Lambda function with its library, both defined under the directory src/functions. The function is the control. It will validate that the session is started within the limits defined.
  • An EventBridge rule. This event listens to CloudTrail and in case of a new interactive session, will trigger the control Lambda function.
  • An SQS dead-letter queue (DLQ) attached to the Lambda function. This keeps a record of events that triggered a Lambda function failure.
  • Two CloudWatch alarms monitoring the Lambda function failures and the messages in the DLQ.
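
For orientation, an EventBridge event pattern that selects CreateSession calls recorded by CloudTrail typically looks like the dictionary below. The pattern in cfn/template.yaml may differ; the small matcher is a simplified stand-in for EventBridge’s matching that only handles exact values and nested objects.

```python
import json

# A typical pattern for Glue CreateSession API calls delivered via CloudTrail.
EVENT_PATTERN = {
    "source": ["aws.glue"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["glue.amazonaws.com"],
        "eventName": ["CreateSession"],
    },
}

def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: every pattern key must be present with an allowed value."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True

create_event = {
    "source": "aws.glue",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {"eventSource": "glue.amazonaws.com", "eventName": "CreateSession"},
}
other_event = {
    "source": "aws.glue",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {"eventSource": "glue.amazonaws.com", "eventName": "GetSession"},
}

print(json.dumps(EVENT_PATTERN, indent=2))
print(matches(EVENT_PATTERN, create_event), matches(EVENT_PATTERN, other_event))
```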

If notification via email is enabled, two more resources are deployed:

Additionally, AWS CloudFormation deploys all the necessary AWS Identity and Access Management (IAM) roles and policies, and an AWS Key Management Service (AWS KMS) key to ensure that the exchanged data is encrypted.

Deploy the solution

To facilitate the deployment lifecycle, including the setup of the user local environment, we provide a Makefile that describes all the necessary steps. Make sure you have your AWS credentials renewed and have access to your account. For more information, refer to Configuration and credential file settings.

  • Explore the Makefile and adjust the Region and stack name as needed by modifying the values of the variables AWS_REGION and STACK_NAME.
  • Set KILL_SESSION = "True" if you want to immediately stop the interactive session that has been found out of boundaries. Allowed values are True or False; the default is True.
  • Set NOTIFICATION_EMAIL_ADDRESS = <your email address> in the Makefile if you want to get notified when a session has been found out of boundaries.
  • Set values for your controls:
    • ENFORCE_VPC_CONNECTION to stop sessions not running inside a VPC (true or false).
    • MAX_WORKERS to set the maximum number of workers for a session (numeric).
    • MAX_IDLE_TIMEOUT_MINUTES to define the maximum idle time for sessions in minutes (numeric).
  • Install all the prerequisite libraries:
    make install-pre-requisites

    These will be installed under a newly created Python virtual environment inside this repository in the directory .venv.

  • Deploy the new stack:
    make deploy

    This command will complete the following tasks:

    • Check if the prerequisites are met.
    • Run the pytest unit tests on the Python files.
    • Validate the CloudFormation template.
    • Build the artifacts (Lambda function and Lambda layers).
    • Deploy the resources via AWS SAM.
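
The settings from the Makefile are plain strings, so whatever consumes them has to convert them to typed values. The following sketch assumes they reach the function as environment variables with the same names, which this post does not confirm; only the KILL_SESSION default of True comes from the text above, and the other defaults are hypothetical.

```python
import os

def parse_bool(raw: str) -> bool:
    """Accept True/true/TRUE as true; anything else is false."""
    return raw.strip().lower() == "true"

def load_settings(env=None) -> dict:
    """Read the control settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    return {
        "kill_session": parse_bool(env.get("KILL_SESSION", "True")),
        # The defaults below are hypothetical placeholders.
        "enforce_vpc_connection": parse_bool(env.get("ENFORCE_VPC_CONNECTION", "false")),
        "max_workers": int(env.get("MAX_WORKERS", "10")),
        "max_idle_timeout_minutes": int(env.get("MAX_IDLE_TIMEOUT_MINUTES", "60")),
    }

settings = load_settings({"KILL_SESSION": "False", "MAX_WORKERS": "5"})
print(settings)
```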

Test the solution

Refer to Introducing AWS Glue interactive sessions for Jupyter for information about running an interactive session. If you follow the instructions in the post (see the section Run your first code cell and author your AWS Glue notebook), the initialization of the interactive session should fail with an error similar to the following.

Example of code in the cell:

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
sc3 = SparkContext.getOrCreate()
glueContext1 = GlueContext(sc3)
spark = glueContext1.spark_session
job = Job(glueContext1)

Received output:

Authenticating with profile=XXXXXXXX
glue_role_arn defined by user: arn:aws:iam::XXXXXXXXXX:role/XXXXXXXX
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 5
Session ID: XXXXXXXXXXXXX
Applying the following default arguments:
--glue_kernel_version 0.35
--enable-glue-datacatalog true
Waiting for session xxxxxxxxx to get into ready status...
Session xxxxxxxxx has been created
Exception encountered while running statement: An error occurred (EntityNotFoundException) when calling the GetStatement operation: Session ID xxxxxxxxx not found

If you enabled the email feature, you should also get an email notification.

You can also check on the AWS Glue console that your session ID isn’t listed.

Clean up

Clean up the deployed resources by running the following command:

make clean-up

Note that the resources deployed from following the recommended post, Introducing AWS Glue interactive sessions for Jupyter, will not be removed with the previous command.

Limitations

The delivery guarantee for CloudTrail events to EventBridge is best effort. This means CloudTrail will attempt to deliver all events to EventBridge, but in some rare cases, an event might not be delivered. For more information, refer to Events from AWS services.

Conclusion

This post described how to build, deploy, and test a solution to enforce boundary conditions on AWS Glue interactive sessions in order to enforce constraints on the number of workers, idle timeouts, and AWS Glue connection.

You can adapt this solution based on your needs and further extend it to allow controls on other options.

To learn more about how to use AWS Glue interactive sessions, refer to Introducing AWS Glue interactive sessions for Jupyter and Author AWS Glue jobs with PyCharm using AWS Glue interactive sessions.


About the Authors

Nicolas Jacob Baer is a Senior Cloud Application Architect with a strong focus on data engineering and machine learning, based in Switzerland. He works closely with enterprise customers to design data platforms and build advanced analytics/ml use-cases.

Luca Mazzaferro is a Senior DevOps Architect at Amazon Web Services. He likes to have infrastructure automated, reproducible and secured. In his free time he likes to cook, especially pizza.

Kemeng Zhang is a Cloud Application Architect with a strong focus on machine learning and UX, based in Switzerland. She works closely with customers to design user experiences and build advanced analytics/ml use-cases.

Mark Walser, a Senior Global Data Architect at Amazon Web Services, collaborates with customers to develop innovative Big Data solutions that solve business problems and speed up the adoption of AWS services. Outside of work, he finds pleasure in running, swimming, and all things related to technology.

Gal Heyne is a Product Manager for AWS Glue with a strong focus on AI/ML, data engineering and BI, based in California. She is passionate about developing a deep understanding of customers’ business needs and collaborating with engineers to design easy-to-use data products.

Amazon SES – How to track email deliverability to domain level with CloudWatch

Post Syndicated from Alaa Hammad original https://aws.amazon.com/blogs/messaging-and-targeting/amazon-ses-how-to-track-email-deliverability-to-domain-level-with-cloudwatch/

Why is it important to track email deliverability per domain with Amazon Simple Email Service (SES)?

Amazon Simple Email Service (Amazon SES) is a scalable cloud email service provider that enables businesses to build a large-scale email solution and host multiple domains from the same SES account for different purposes. For example, you can use one domain for sending marketing emails such as special offers, another domain for transactional emails such as order confirmations, and another for other types of correspondence such as newsletters.

As your product, service, or solution built on Amazon SES grows and you require multiple verified domains, it is important to track the deliverability of the emails you send from each domain for business continuity, billing purposes, or incident investigations. This helps you identify whether your business domain has low email deliverability, or whether a domain is generating high bounce or complaint rates, so that you can take proactive action before it impacts the account’s ability to send emails from any other domain.

SES offers features that automatically manage deliverability per domain through Virtual Deliverability Manager. Virtual Deliverability Manager helps enhance email deliverability and provides insights into sending and delivery data, as well as offering solutions to fix negative email sending reputation. You can learn more about Virtual Deliverability Manager here.

Solution Walkthrough

Amazon SES provides a way to monitor sender reputation metrics such as bounce and complaint rates per account or configuration sets using event publishing. This blog will discuss how you can use Amazon SES message auto-tags to monitor and publish email deliverability events (Send, Delivery, Bounce, Complaints) to CloudWatch custom metrics per domain. In addition, you will see how to create a custom CloudWatch dashboard that’s easy to access in a single view to monitor your domain metrics. This CloudWatch dashboard can help to provide guidance for your team members during operational events about how to respond to specific incidents for your sending domain.

What are Amazon SES Auto-Tags:

Message tags are name/value pairs that categorize the email you are sending. For example, if you advertise books, you could name a message tag genre, and assign a value of sci-fi or western, when you send an email for the associated campaign. Depending on which email sending interface you use, you can provide the message tag as a parameter to the API call (SendEmail, SendRawEmail) or as an Amazon SES-specific email header.
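As a rough sketch of the API-parameter route, the following Python builds the keyword arguments for the SES SendEmail operation with one message tag attached. The helper name build_tagged_send_request, the tag name and value, and the example.com addresses are illustrative assumptions; the dictionary keys follow the parameters of the boto3 "ses" client's send_email method.

```python
def build_tagged_send_request(sender, recipient, subject, body, tags):
    """Return keyword arguments for the SES SendEmail operation.

    `tags` is a plain dict of message tag names to values. SES expects
    them as a list of {"Name": ..., "Value": ...} pairs.
    """
    return {
        "Source": sender,
        "Destination": {"ToAddresses": [recipient]},
        "Message": {
            "Subject": {"Data": subject, "Charset": "UTF-8"},
            "Body": {"Text": {"Data": body, "Charset": "UTF-8"}},
        },
        "Tags": [{"Name": k, "Value": v} for k, v in sorted(tags.items())],
    }


request = build_tagged_send_request(
    sender="sender@example.com",      # placeholder address
    recipient="reader@example.com",   # placeholder address
    subject="New arrivals",
    body="This month's science fiction releases.",
    tags={"genre": "sci-fi"},         # hypothetical tag name and value
)

# With AWS credentials configured, the request could be sent like this:
#   import boto3
#   boto3.client("ses").send_email(**request)
```

Building the request separately from sending it keeps the tag handling easy to inspect before any email leaves your account.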

In addition to the message tags you add to any emails you send, Amazon SES adds a set of Auto-Tags that are automatically included in any emails you send. You don’t need to pass the parameters of the auto-tags to the API call or email headers since SES does this automatically.

The auto-tags in the list below are used to track email deliverability for specific events (for example, Send, Delivery, Bounce, Complaint). SES does this by using the name/value pair of the auto-tag as a dimension in a CloudWatch metric that counts the events for that auto-tag. This blog post uses the “ses:from-domain” auto-tag to configure event publishing so that the email deliverability events (Send, Delivery, Bounce, Complaints) you receive per domain are published to CloudWatch metrics and shown on a CloudWatch dashboard.

Amazon SES auto-tags added to messages you send

Prerequisites:

For this walkthrough, you should have the following prerequisites:

Configure Amazon SES to publish email deliverability events to CloudWatch destination:

To configure event publishing for tracking email deliverability events, you first need to create a configuration set. Configuration sets in SES are groups of rules that you can apply to your verified identities. When you apply a configuration set to an email, all of the rules in that configuration set are applied to the email.

After your configuration set is created, you need to create Amazon SES event destination. Amazon SES will send all email deliverability events you intend to track to this event destination. In this blog the event destination is Amazon CloudWatch.

    1. Sign in to the Amazon SES console.
    2. In the navigation pane, under Configuration, choose Configuration sets. Choose Create set.
    3. Enter a Configuration set name, leave the remaining fields at their defaults, scroll to the end, and choose Create set.
    4. On the configuration set home page, choose the Event destinations tab and select Add destination.
    5. Under Select event types, check the Sends, Deliveries, Hard bounces, and Complaints boxes and choose Next.
    6. Under Specify destination, select Amazon CloudWatch.
    7. Name – enter the name of the destination for this configuration set. The name can include letters, numbers, dashes, and hyphens. (example: Tracking_per_Domain)
    8. Under Amazon CloudWatch dimensions, select Value source: Message tag, Dimension name: ses:from-domain, and Default value: example.com (replace this with the verified domain name you want to track).
    9. Review your entries. When you are satisfied that they are correct, choose Add destination to add your event destination.
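For readers who prefer to script this setup, the console steps above correspond roughly to a single CreateConfigurationSetEventDestination call. The sketch below only builds the request. The helper name is hypothetical, and mapping the console's "Hard bounces" checkbox to the API event type "bounce" is an assumption. The keys follow the boto3 "ses" client's create_configuration_set_event_destination parameters.

```python
def build_event_destination_request(configuration_set, destination_name, domain):
    """Return keyword arguments describing a CloudWatch event destination
    that uses the ses:from-domain auto-tag as its metric dimension."""
    return {
        "ConfigurationSetName": configuration_set,
        "EventDestination": {
            "Name": destination_name,
            "Enabled": True,
            # Assumed API equivalents of Sends, Deliveries, Hard bounces, Complaints.
            "MatchingEventTypes": ["send", "delivery", "bounce", "complaint"],
            "CloudWatchDestination": {
                "DimensionConfigurations": [
                    {
                        "DimensionName": "ses:from-domain",
                        "DimensionValueSource": "messageTag",
                        "DefaultDimensionValue": domain,
                    }
                ]
            },
        },
    }


destination_request = build_event_destination_request(
    configuration_set="SES_Config_Set",
    destination_name="Tracking_per_Domain",
    domain="example.com",
)

# With AWS credentials configured, the destination could be created like this:
#   import boto3
#   boto3.client("ses").create_configuration_set_event_destination(**destination_request)
```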

Send a test email via Amazon SES mailbox simulator to trigger events in CloudWatch custom metric.

After you select Amazon CloudWatch as the event destination, CloudWatch creates a custom metric with the auto-tag dimension and value you chose. For this custom metric to appear in the CloudWatch console, you must send an email that triggers each selected event. We recommend using the Amazon SES Mailbox Simulator to avoid generating real bounces or complaints that could impact your account’s reputation.

The following section shows how to send those test emails to the recipients below manually using the AWS CLI. If you prefer to send them from the console, you will need to send three separate test emails, because the console allows only one recipient per message:

Amazon SES Mailbox Simulator recipients to trigger the events in CloudWatch metrics:
[email protected]
[email protected]
[email protected]

Note: You must pass the name of the configuration set when sending an email. You can do this either by specifying the configuration set name in the headers of your emails or by assigning it as the default configuration set for the identity. You can assign a default configuration set when you create the identity, or later by editing the verified identity.
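As a minimal sketch of the header option, the following Python uses the standard library to build a message that carries the X-SES-CONFIGURATION-SET header, which SES reads when the message is submitted through SendRawEmail or the SMTP interface. The helper name and the example.com addresses are placeholders chosen for illustration.

```python
from email.message import EmailMessage


def build_message_with_configuration_set(sender, recipient, subject, body,
                                         configuration_set):
    """Build an email whose headers name the SES configuration set to apply."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    # SES-specific header that selects the configuration set for this message.
    msg["X-SES-CONFIGURATION-SET"] = configuration_set
    msg.set_content(body)
    return msg


message = build_message_with_configuration_set(
    sender="sender@example.com",        # placeholder address
    recipient="recipient@example.com",  # placeholder address
    subject="Testing CW events",
    body="Test message body.",
    configuration_set="SES_Config_Set",
)
raw_bytes = message.as_bytes()  # suitable as the raw message data for SendRawEmail
```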

The following example uses the send-email CLI command to send a formatted email to the Amazon SES simulator recipients:

Before you run any commands, set your default credentials by following Configuring the AWS CLI. The IAM user must have the “ses:SendEmail” permission to send email.

  1. Navigate to your terminal where the AWS CLI is installed and configured. Create a message.json file for the message to send and add the following content:

     {
       "Subject": {
         "Data": "Testing CW events with email simulator",
         "Charset": "UTF-8"
       },
       "Body": {
         "Text": {
           "Data": "This is the message body of testing CW events with email simulator.",
           "Charset": "UTF-8"
         }
       }
     }

  2. Create a destination.json file to add the Amazon SES simulator recipients for bounce, complaint, and delivery events as shown below:

     {
       "ToAddresses": ["[email protected]", "[email protected]", "[email protected]"]
     }

  3. Use the send-email CLI command to send a formatted email to the Amazon SES simulator recipients:

     aws ses send-email --from [email protected] --destination file://destination.json --message file://message.json --configuration-set-name SES_Config_Set --region <AWS Region>

  4. After the message is sent, you should see output similar to the following:

     {
       "MessageId": "EXAMPLEf3a5efcd1-51adec81-d2a4-4e3f-9fe2-5d85c1b23783-000000"
     }

You have now sent a test email to trigger the events you want to track in CloudWatch custom metrics. Let’s create the CloudWatch dashboard to see those metrics.

Create CloudWatch dashboard to track the email deliverability events for my domain.

  1. Sign in to the Amazon CloudWatch console.
  2. In the navigation pane, choose Dashboards, and then choose Create dashboard.
  3. In the Create new dashboard dialog box, enter a name like ‘CW_Domain_Tracking’ for the dashboard, and then choose Create dashboard.
  4. In the Add Widget dialog box, choose Number to add a number displaying a metric to the dashboard, and then choose Next.
  5. Under Add metric graph, choose the edit icon to rename the graph with your domain, example.com. This makes it easy to find the widget for a domain if you have multiple domains.
  6. In the Browse tab, select the AWS Region where you are running your SES account, and in the search bar, search for “ses:from-domain”.
  7. Four metrics are returned with your domain name “example.com”. Select the checkbox beside each of the four metrics and choose Create widget.
  8. Choose Save dashboard in the top right corner of the dashboard page to save the widget settings.

After the CloudWatch dashboard is created, the deliverability events for any email you send from the example.com domain with the configuration set name passed in the email header are counted in your CloudWatch metrics, and you can see them in the CloudWatch dashboard.
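If you would rather define the dashboard in code, a sketch along the following lines generates a dashboard body with one single-value widget for the four per-domain metrics. The "AWS/SES" namespace and the metric names Send, Delivery, Bounce, and Complaint are assumptions here; confirm them against what the Browse tab shows in your account. The helper name and widget size are illustrative.

```python
import json

# Assumed metric names for the four tracked event types.
EVENT_METRICS = ("Send", "Delivery", "Bounce", "Complaint")


def build_dashboard_body(domain, region, namespace="AWS/SES"):
    """Return a CloudWatch dashboard body (JSON string) with one number widget
    showing the per-domain event counts."""
    metrics = [[namespace, name, "ses:from-domain", domain] for name in EVENT_METRICS]
    widget = {
        "type": "metric",
        "x": 0,
        "y": 0,
        "width": 12,
        "height": 6,
        "properties": {
            "title": domain,
            "view": "singleValue",
            "region": region,
            "stat": "Sum",
            "period": 300,
            "metrics": metrics,
        },
    }
    return json.dumps({"widgets": [widget]})


dashboard_body = build_dashboard_body("example.com", "us-east-1")

# With AWS credentials configured, the dashboard could be created like this:
#   import boto3
#   boto3.client("cloudwatch").put_dashboard(
#       DashboardName="CW_Domain_Tracking", DashboardBody=dashboard_body)
```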

As an additional step, you can also set up CloudWatch alarms for these custom metrics and add a threshold for each metric. When a metric breaches its threshold, the alarm is triggered and sends an Amazon SNS notification so that you can take the necessary actions.
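A minimal sketch of such an alarm, assuming the bounce events land in the "AWS/SES" namespace under the metric name Bounce, might look like the following. The alarm name, period, threshold, and SNS topic ARN are placeholders to adapt; the keys follow the boto3 "cloudwatch" client's put_metric_alarm parameters.

```python
def build_bounce_alarm_request(domain, threshold, sns_topic_arn, namespace="AWS/SES"):
    """Return keyword arguments for an alarm on the per-domain bounce count."""
    return {
        "AlarmName": f"ses-bounces-{domain}",  # hypothetical naming scheme
        "Namespace": namespace,                # assumed namespace
        "MetricName": "Bounce",                # assumed metric name
        "Dimensions": [{"Name": "ses:from-domain", "Value": domain}],
        "Statistic": "Sum",
        "Period": 300,                         # evaluate 5-minute totals
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",    # no events means no bounces
        "AlarmActions": [sns_topic_arn],
    }


alarm_request = build_bounce_alarm_request(
    domain="example.com",
    threshold=10,
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:ses-alerts",  # placeholder ARN
)

# With AWS credentials configured, the alarm could be created like this:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm_request)
```

An absolute count is the simplest threshold to start with. A rate-based alarm (bounces divided by sends) would need CloudWatch metric math and is left out of this sketch.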

Cleaning Up:

This setup includes Amazon CloudWatch and Amazon SES service charges. To avoid incurring any extra charges, remember to delete any resources created manually if you no longer need them for monitoring.

Resources to delete from Amazon SES console.

  1. In the navigation pane, under Configuration, choose Configuration sets.
  2. Check the box beside Configuration set you created and select Delete.

Resources to delete from Amazon CloudWatch console.

  1. In the navigation pane, choose Dashboards, and then choose the dashboard you created.
  2. In the upper-right corner of the graph that you want to remove, choose Actions, and then choose Delete Dashboard.
  3. Save dashboard.

Conclusion:

You have now seen how to configure Amazon SES to track email deliverability at the domain level with a CloudWatch dashboard. Tracking deliverability for the emails you send from each domain is essential for business continuity, billing purposes, or incident investigations. Using SES message auto-tags and CloudWatch metrics, you can quickly identify the domains that have low email deliverability and take proactive action to maximize deliverability before the account’s ability to send emails from any other domain is affected.

About the author:

Alaa Hammad

Alaa Hammad is a Senior Cloud Support Engineer at AWS and a subject matter expert in Amazon Simple Email Service and the AWS Backup service. She has 10 years of diverse experience supporting enterprise customers across different industries. She enjoys cooking and trying new recipes from different cuisines.

How to send your first email on SES

Post Syndicated from Dustin Taylor original https://aws.amazon.com/blogs/messaging-and-targeting/how-to-send-your-first-email-on-ses/

Introduction

Sending your first email on any service can be complicated. In this blog we will walk you through how to send your first email on Amazon Simple Email Service (SES) through the SES Console and direct you to examples of how you can send email through the AWS SDK. Our public documentation includes additional information on how you can configure SES. We encourage you to read through these documents to learn about these other mechanisms in the future.

Getting Started

Getting started with sending an email on SES requires three actions which are: 1) verifying a domain or email address 2) requesting production access to SES and 3) sending your first email. Let’s walk through each of these steps and send our first email.

Verifying an Identity

To start, you will configure the email address or domain your customers will receive emails from. As part of this verification, you will need to be able to either receive a confirmation email at the email address you are trying to set up, or publish CNAME records for your intended domain. Generally, we recommend using a domain for your email sending, as this gives you the ability to set up SPF, DKIM, and DMARC alignment, which will increase recipient trust in your emails. Email addresses can be used for account-specific email sending where a customer may not own a domain, but with this type of use case, receiving entities tend to have low trust in the sender, and the probability of inbox placement is lower. For more in-depth instructions please review our public documentation, as I will only briefly touch on the most important parts of verifying a domain or email address.

To verify an identity, go to the SES Console and click the ‘Verified identities’ link on the left-hand side of the screen. You will then see a list of the domains or email addresses already verified in your account, if any. Click the yellow ‘Create identity’ button, and you will be presented with a screen to choose whether to verify an email address or a domain.

Email Address Verification

To verify an email address, you will be prompted with the following dialog:

The dialog presented to a sender when they choose to verify an email address in the SES console.

To verify an email address to use as your sending identity, you will include the address in the ‘Email address’ field and then click the ‘Create identity’ button. This will trigger an automated email to the address with a verification link that will need to be clicked to verify ownership of the email address. Once verified, you can begin sending emails from your new email address identity.

Domain Verification

To start verifying a domain you will click the ‘Verified identities’ option from the ‘Configuration’ dropdown which can be found on the left side of the screen. When choosing to verify a domain, you will be presented with a series of dialogs which include:

The dialog presented to a sender that prompts a decision to verify a domain or email address.

Here you will need to include the domain you intend to use for email sending. If you are keeping to a basic configuration on SES this will be the only data you need to add to this dialog. However, it is recommended to also use a custom mail-from. A custom mail-from is a way for you to remove the amazonses.com domain from your mail-from header to ensure domain alignment throughout your headers. You can find more information about the custom mail-from addresses in our documentation.

After finishing your changes in the first dialog you will then be presented with a second dialog that looks like the following:

The dialog which allows a sender to verify the domain they intend to use to send email.

To verify the domain, you will need to utilize either the Easy DKIM feature, or to provide a DKIM authentication token if you plan to DKIM sign your own messages. In selecting the ‘Easy DKIM’ option, you will be presented with the option to use either 1024 bit or 2048 bit signing key length. We would recommend utilizing the 2048 bit signing key length for most customers as this is the more secure key.

If you use Amazon Route53 as your DNS provider, SES can automatically publish DNS records for your domain. If not, this step will require you to edit your DNS records to include three CNAME records which are used for the DKIM signature process and as a mechanism to prove domain ownership. An example of the CNAME records is as follows:

An example dialog of the CNAME records that are generated when attempting to verify an identity.

Once you have placed these DNS records SES will periodically attempt to look-up the records to change the status of your domain verification. If SES doesn’t automatically update the status, you are presented with the option to force another check to verify the records are present.
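To make the shape of these records concrete, the sketch below builds a request for the SES API v2 CreateEmailIdentity operation with a 2048-bit Easy DKIM key and derives the three CNAME records from the DKIM tokens that SES returns. The helper names and the example tokens are hypothetical. The "dkim.amazonses.com" suffix is assumed as a default, and some Regions use a Region-specific DKIM domain, so check the records shown in your console.

```python
def build_domain_identity_request(domain):
    """Return keyword arguments for creating a domain identity with Easy DKIM."""
    return {
        "EmailIdentity": domain,
        "DkimSigningAttributes": {"NextSigningKeyLength": "RSA_2048_BIT"},
    }


def dkim_cname_records(domain, tokens, dkim_domain="dkim.amazonses.com"):
    """Return (record name, record value) pairs for each Easy DKIM token."""
    return [
        (f"{token}._domainkey.{domain}", f"{token}.{dkim_domain}")
        for token in tokens
    ]


identity_request = build_domain_identity_request("example.com")

# With AWS credentials configured, the tokens would come from the API response:
#   import boto3
#   response = boto3.client("sesv2").create_email_identity(**identity_request)
#   tokens = response["DkimAttributes"]["Tokens"]
tokens = ["token1example", "token2example", "token3example"]  # made-up tokens

for name, value in dkim_cname_records("example.com", tokens):
    print(f"{name} CNAME {value}")
```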

After your domain verification is successful, you are now ready to send emails from any email address for your domain.

Requesting Production Access

Now that you’ve verified an identity, the next step is to request production access, which you need before you can send email to unverified identities. If you only want to test sending to your own domain or email address, you can skip this step until you are ready to send to unverified recipients.

Note: Production access is Region-based. A request for production access applies only to the Region in which you make it.

To begin this process, you will navigate to the SES Console and the ‘Account dashboard’ section. Once you are on this page you will be presented with the following dialog at the top of your screen.

Clicking the ‘Request production access’ button will then navigate to the ‘Request details’ page which you can reference below.

The dialog from the SES console showing that the SES account is still in the sandbox.

Fill out each section with the details of your mail type, website URL, and use case description, and then acknowledge that you have read and agree to the AWS Service Terms and Acceptable Use Policy (AUP). When filling out the use case description, provide as much detail as you can, because our teams review the request to determine whether they need more information before approving or denying it. An example of a good use case description would look like the following:

“Example.com is the domain my company intends to use to send our transactional emails. Our recipients are all customers who have either signed up for an account, requested a new password, or have made purchases through our website. We require confirmation of opt-in for all our new accounts and if no confirmation is received, we do not attempt to send an email to that address.”

Note: SES will review your production access request and will provide feedback on your use case and whether it could pose a risk to the sending reputation of SES, our customers, or your own sending domain.

Finally, click the ‘Submit request’ button to submit your request for production access. This will create an AWS Support case and will be reviewed by our team. These requests are reviewed with a 24-hour Service Level Agreement (SLA). While you are waiting for production access you can send test emails to any of the Mailbox Simulator endpoints or to your own verified domain(s) or email address(es).
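The same request can, as far as I am aware, also be submitted programmatically through the SES API v2 PutAccountDetails operation. The sketch below only assembles the request. The helper name is hypothetical, and the parameter names should be confirmed against the current API reference before use.

```python
VALID_MAIL_TYPES = ("MARKETING", "TRANSACTIONAL")


def build_production_access_request(mail_type, website_url, use_case_description):
    """Return keyword arguments describing a production access request."""
    if mail_type not in VALID_MAIL_TYPES:
        raise ValueError(f"mail_type must be one of {VALID_MAIL_TYPES}")
    return {
        "MailType": mail_type,
        "WebsiteURL": website_url,
        "UseCaseDescription": use_case_description,
        "ProductionAccessEnabled": True,
    }


access_request = build_production_access_request(
    mail_type="TRANSACTIONAL",
    website_url="https://example.com",  # placeholder URL
    use_case_description=(
        "Example.com sends transactional emails to customers who signed up "
        "for an account, requested a new password, or made a purchase."
    ),
)

# With AWS credentials configured, the request could be submitted like this:
#   import boto3
#   boto3.client("sesv2").put_account_details(**access_request)
```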

Sending Your First Email

From the Console

To send your first email from the SES Console you will need to start by clicking the ‘Verified identities’ option from the ‘Configuration’ dropdown which can be found on the left side of the screen. From here you will select the domain and/or email address you want to send your email from and then click the ‘Send test email’ button, which will open the following screen:

The message details dialog where a sender can send an email from the SES console

From here you will fill out the ‘From-address’ box with the local name (anything before the @ sign) that you want to use to send the email. If you want to test SES functionality you can choose any of the dropdown events present, or you can choose the ‘Custom’ option which will allow you to set a ‘Custom recipient’ address of your choosing. Then you will fill out the ‘Subject’ and ‘Body’ fields with the content you will use for this first test email and then click the ‘Send test email’ button.

Congratulations, you’ve sent your first email from the SES Console! Now, utilizing SES to send single emails from the console isn’t the most scalable way to send email. In the next section, I will provide you links to our documentation for five of the programming languages supported by the AWS SDK so that you can begin building your integration with SES.

From Code

The AWS Documentation includes some code snippets on how to send an email with SES via the AWS SDK. You can find examples of how to send an email from languages such as: .NET, Java, PHP, Ruby, and Python. We highly recommend reviewing our documentation to see these introductory code snippets to get you started.
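To give a feel for what such a snippet looks like, here is a minimal Python sketch using the SES API v2 SendEmail operation through boto3. It is not the snippet from the documentation. The function names, addresses, and Region are placeholders, and the sender is assumed to be an identity you have already verified.

```python
def build_first_email_request(sender, recipient, subject, body):
    """Return keyword arguments for a simple plain-text SES v2 SendEmail call."""
    return {
        "FromEmailAddress": sender,
        "Destination": {"ToAddresses": [recipient]},
        "Content": {
            "Simple": {
                "Subject": {"Data": subject, "Charset": "UTF-8"},
                "Body": {"Text": {"Data": body, "Charset": "UTF-8"}},
            }
        },
    }


def send_first_email(request, region):
    """Send the request with boto3 and return the SES message ID."""
    import boto3  # imported lazily so building the request needs no AWS SDK

    client = boto3.client("sesv2", region_name=region)
    return client.send_email(**request)["MessageId"]


first_request = build_first_email_request(
    sender="hello@example.com",         # placeholder: a verified identity
    recipient="recipient@example.com",  # placeholder recipient
    subject="My first SES email",
    body="Sent with the AWS SDK for Python.",
)

# With AWS credentials configured:
#   message_id = send_first_email(first_request, region="us-east-1")
```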

Conclusion

Hopefully this blog post has aided you in your journey to send your first email through SES, from verifying a domain and requesting production access to finally sending an email through the console. Take this knowledge and build upon it for future success in sending email through SES. Happy sending!

How to send messages to multiple recipients with Amazon Simple Email Service (SES)

Post Syndicated from Joydeep Dutta original https://aws.amazon.com/blogs/messaging-and-targeting/how-to-send-messages-to-multiple-recipients-with-amazon-simple-email-service-ses/

Introduction

Customers frequently ask what the best way is to send messages to multiple recipients using Amazon Simple Email Service (SES) with the best deliverability and without exceeding the maximum recipients-per-message limit. In this blog, we will show you how to determine the best approach for sending a message to multiple recipients based on different use cases. We will also discuss why in most situations sending messages to a single recipient at a time is the best approach.

Difference between message Header addresses and Envelope addresses

Before we dive into the use cases, let’s discuss how message addressing works in SES. When a client makes a request, SES constructs an email message compliant with the RFC 5322 Internet Message Format specification. An email comprises a header, a body, and an RFC 5321 envelope, as described in the Email format in Amazon SES document.

The email addresses in the RFC 5322 To, Cc and Bcc headers are for display. These headers enable your email client interface to display to whom the message was addressed. These addresses do not control which recipients receive the messages; the envelope addresses do. The sending mail client provides the envelope recipient addresses to a mail server using the RFC 5321 RCPT TO commands. RCPT is an abbreviation for recipient.

An apt analogy (see diagram below) is how a physical letter within an envelope can address a person whose address is not on the envelope. The address on the envelope is what the mail carrier uses to deliver the envelope. The postal worker should not need to open the envelope to know which address to deliver the mail to.

Analogy showing how physical mail compares to electronic mail

As an example, a school district may send letters informing residents of enrollment details for their children, but they do not know all of the names of the people who live at each address. The envelope may only list the address, and the letter may just be addressed “To Resident” if the school district doesn’t have a name to address the letter. The message is delivered to the resident’s address regardless of the accuracy of the information on the letter.

To simplify, let’s summarize the differences between To & Cc header and envelope addresses:

Header To & Cc Addresses | Envelope Addresses (RCPT)
Used by email clients to display the list of recipients | Used by mail servers to deliver the email message
Not used for mail delivery | Used for mail delivery
Displayed to recipients | Not displayed to recipients

The Bcc address is different than the To and Cc headers because it is used to send a copy of the message to an additional set of recipients that are “blind” to the other recipients. Bcc addresses are only defined by envelope addresses, not as a header address. Mail servers will commonly remove a Bcc header when handling a message, but delivery to the envelope recipient address still occurs.
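The distinction can be seen in a few lines of standard-library Python. In this sketch the header addresses live inside the message object, while the envelope recipients are a separate list that would be handed to the mail server. All addresses and the helper name are placeholders.

```python
from email.message import EmailMessage


def build_message(sender, display_to, display_cc, subject, body):
    """Build a message whose To and Cc headers are for display only."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(display_to)
    if display_cc:
        msg["Cc"] = ", ".join(display_cc)
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


message = build_message(
    sender="sender@example.com",
    display_to=["team@example.com"],  # what recipients see in their client
    display_cc=[],
    subject="Enrollment details",
    body="To Resident: enrollment opens next week.",
)

# The envelope recipients decide who actually receives the message.
# They are passed to the mail server separately (as RCPT TO commands).
envelope_recipients = ["alice@example.com", "bob@example.com"]

# With an SMTP connection, smtplib keeps the two apart in the same way:
#   smtp.send_message(message, to_addrs=envelope_recipients)
```

Note that neither envelope address appears anywhere in the message itself, which mirrors the "To Resident" letter in the analogy.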

When to use multiple recipients in a Destination

SES supports sending messages to multiple recipients in a single SendEmail operation. The Destination argument of the SendEmail operation represents the destination of the message. A Destination consists of To, Cc, and Bcc fields which represent both the header addresses and the envelope addresses.

When multiple recipients are defined in the Destination argument to the SendEmail operation, the defining characteristic is that every recipient receives the exact same message with the same message-id. A message-id is used for event handling (bounces, complaints, etc) among other purposes. A message-id pertains to exactly one version of a particular message.

Did you know: The use cases for recipients having a message with the same message-id are limited to situations in which the recipients are expected to interact with the message as a group. For example, recipients may reply-all to the email and have a resulting email conversation. The original message-id is used by email client applications to display a “conversation” view using the References and In-Reply-To headers. This behavior may be a good fit if the use case is a mailing list or internal announcement to employees within a company.

The recipient limit in the Destination argument is 50 because that is a reasonable break-point when the “conversational” use case runs the risk of the “reply-all storm” described in the next section. Consider using a robust mailing list solution or hosted service with capabilities similar to GNU Mailman to facilitate large group email conversations.

Why bulk mail recipients should not see other recipients

For bulk sending purposes, and most transactional sending, the recipients don’t need to know that other recipients also received the message:

  • The recipients likely gain no value from seeing the other recipient addresses, as they may be arbitrarily segmented into batches of 50 or less, and most email client interfaces have trouble displaying more than 50 addresses.
  • There is a risk of a “reply-all storm”, which is when a recipient replies to all of the To and Cc addresses from the original message, and then those people reply back asking everyone to stop replying. This scenario is fun to talk about around the water cooler, but should be avoided.
  • If recipients are defined as Bcc recipients in the Destination argument of the SendEmail operation, the message would not contain a To address, and that can look suspicious when read by the recipients.

Note: There is no authentication mechanism protecting the To or Cc headers from spoofing, so be careful about placing any trust in the values of those headers. It is possible for an attacker to spoof the To or Cc headers in an email message. Therefore, the only meaningful address to include in the To header is the recipient’s own address, which they know isn’t spoofed because they are reading the message.

For bulk mail it is best practice to have each recipient see only their own name and email address in the To header of the messages they receive. This makes the messages look more personal and can improve deliverability and recipient engagement.

This approach can be achieved by sending the message to each recipient individually via the SendEmail operation. You would use a single address in the “ToAddresses” field of the “Destination” argument.

Use the ToAddresses field to address an individual message in the SendEmail API

How email event notifications are associated with recipients

If you need email event notifications to be associated to each recipient, then you will need each recipient to receive a message with a unique message-id; one recipient per Destination.

The following event types will be associated with every recipient in the Destination:

  • asynchronous bounces
  • complaints
  • opens
  • clicks

Learn more about Amazon SES events in the documentation: how email sending works in Amazon SES

For example, if one of the recipients triggers an open engagement event, and if that recipient was in a group of 50 recipients within the Destination argument to the SendEmail operation, then all 50 of those recipients will be registered as having opened the message.
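The effect can be illustrated with a toy model in Python. This is not SES code; it only models the rule that an event is recorded against a message-id and therefore against every recipient who shares it. The message-ids and addresses are made up.

```python
def recipients_for_event(recipients_by_message_id, message_id):
    """Return the recipients an event on `message_id` would be attributed to."""
    return sorted(recipients_by_message_id[message_id])


# One SendEmail call with three recipients: a single message-id shared by all.
shared = {"msg-1": {"a@example.com", "b@example.com", "c@example.com"}}

# Three SendEmail calls with one recipient each: one message-id per recipient.
individual = {
    "msg-1": {"a@example.com"},
    "msg-2": {"b@example.com"},
    "msg-3": {"c@example.com"},
}

# An "open" reported for msg-1 is attributed very differently in the two cases.
print(recipients_for_event(shared, "msg-1"))      # all three addresses
print(recipients_for_event(individual, "msg-1"))  # only a@example.com
```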

Other considerations:

  • If the recipients are defined by ToAddresses and CcAddresses they will all appear in the message headers, but the To and Cc headers will be truncated in the event notifications if the headers are over 10 KB. Multi-recipient Destinations may cause you to lose observability needed to troubleshoot deliverability issues.
  • SES Virtual Deliverability Manager only tracks metrics from emails that have one recipient. Multi-recipient Destinations are not counted in any of the Virtual Deliverability Manager dashboard metrics.
  • SES counts the number of envelope recipients in an email toward the account’s sending quotas. Multi-recipient Destinations are not a way to achieve higher sending limits.
  • SES charges for each recipient receiving a message regardless of how many recipients are included in the Destination for each API invocation. Multi-recipient Destinations are not a way to reduce costs.

For bulk sending use cases, it is best practice to have each recipient have a copy of the message with a unique message-id to achieve the highest level of observability of your email sending program. High observability leads to high deliverability. This can be achieved by sending the message to each recipient individually.

How to send Emails to multiple recipients with SES

At this point, you should understand why it is a best practice to send a message to multiple recipients by iteratively using a single recipient in the Destination argument of the SendEmail operation.

Sending a message to a single recipient at a time is the best way to get started delivering messages to multiple recipients. Sending email in this fashion ensures that your deliverability metrics are giving you the observability needed to achieve the highest engagement with your recipients.

The following example uses the SES version 2 command line interface (CLI) to send a message to a list of recipients. If you do not want to use the CLI, use SES with an AWS SDK and adapt the commands into the syntax of the SDK of your choice.

#!/bin/bash

# Replace these variables with your own values
# sender 
# - Consider not using no-reply@, and instead use SES Inbound to receive replies
# - Consider a descriptive username@; some mobile clients will display it prominently, so it should make sense to the recipient.
# - Consider using a subdomain for bulk and transactional mail. Don't use the domain used by your users.
# - Consider using a verified domain identity. Don't use an email address identity within a domain that has a DMARC policy.
sender="[email protected]"
subject="Email subject"
body="Email body"
region="us-east-1"

# List of recipients, one per line. Defaults to SES mailbox simulator addresses (https://docs.aws.amazon.com/ses/latest/dg/send-an-email-from-console.html#send-email-simulator-how-to-use)
recipients=(
  "[email protected]"
# ... 
  "[email protected]"
)

# Send an email to each recipient
# Iterate through the list of recipients.
# Invoke the AWS SES SendEmail operation with a single recipient defined in the Destination
for recipient in "${recipients[@]}"; do
  aws sesv2 send-email \
    --from-email-address "$sender" \
    --destination "ToAddresses=$recipient" \
    --content "Simple={Subject={Data='$subject',Charset='UTF-8'},Body={Text={Data='$body',Charset='UTF-8'}}}" \
    --region "$region"
done

# The output will look similar to this, with a unique MessageId associated with each recipient.
# {
#    "MessageId": "010001874edd1765-be4ea5c2-d2b1-4ffb-bfb9-46461d18d80c-000000"
# }
# ... 51 total message-ids
# {
#    "MessageId": "010001874edd1b94-468ecee9-9198-4356-9f53-a108097777e5-000000"
# }

In this example script, the SendEmail operation is invoked multiple times using the CLI to deliver the message individually to each recipient, and each recipient only sees their own address in the To header. We called the SendEmail operation 51 times and a total of 51 Message Ids were returned in the response.
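If you adapt the script to the AWS SDK for Python, one possible shape is sketched below. It keeps the same one-recipient-per-call pattern and additionally records which message-id belongs to which recipient, which is what makes later event notifications attributable. The function names are illustrative, and the client is passed in so that the loop can be exercised without AWS access.

```python
def make_ses_sender(client, sender, subject, body):
    """Return a function that sends one message to one recipient via SES v2."""
    def send_one(recipient):
        response = client.send_email(
            FromEmailAddress=sender,
            # A single address in ToAddresses: each recipient sees only themselves.
            Destination={"ToAddresses": [recipient]},
            Content={
                "Simple": {
                    "Subject": {"Data": subject, "Charset": "UTF-8"},
                    "Body": {"Text": {"Data": body, "Charset": "UTF-8"}},
                }
            },
        )
        return response["MessageId"]
    return send_one


def send_individually(recipients, send_one):
    """Call `send_one` once per recipient and return {recipient: message_id}."""
    return {recipient: send_one(recipient) for recipient in recipients}


# With AWS credentials configured, usage could look like this:
#   import boto3
#   client = boto3.client("sesv2", region_name="us-east-1")
#   send_one = make_ses_sender(client, "sender@example.com", "Email subject", "Email body")
#   message_ids = send_individually(["recipient@example.com"], send_one)
```

A production version would also need to respect the account's maximum send rate and handle per-recipient errors, both of which this sketch leaves out.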

How to use SendEmail for multiple recipient advanced use cases

Consider a scenario where a memo needs to be sent to an entire team, the team is large, and only a few of the recipients need to be displayed in the headers. In this use case, it is desirable to send multiple copies of an email to many recipients who all receive the same To and Cc headers.

To customize the headers, you must use the Raw field of the Content argument instead of the Simple field.

The example below will reference another internet standard called Multipurpose Internet Mail Extensions (MIME): Format of Internet Message Bodies.

What’s in a MIME object:

  • Headers (such as From, Subject, and Reply-to)
  • Body – Plain text and HTML
  • Attachments – Files and images

MIME extends the capabilities of RFC 5322 and is used to format most email messages to this day. A variety of packages can help you create MIME-structured messages; you can find them by searching the relevant package managers.

This is an example in Python to create a MIME formatted message for the next script.

#!/usr/bin/env python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
import base64

# You must change the 'fromAddress' variable for this example to work in your environment.
#
# When choosing a From header address:
# - Consider not using no-reply@, and instead use SES Inbound to receive replies
# - Consider a descriptive username@; some mobile clients will display it prominently, so it should make sense to the recipient.
# - Consider using a subdomain for bulk and transactional mail. Don't use the domain used by your users.
# - Consider using a verified domain identity. Don't use an email address identity within a domain that has a DMARC policy.

fromAddress = "Descriptive Name <[email protected]>"

# The To and Cc addresses here are for the email header. They are what will be displayed to the recipient.
# The actual recipient, or envelope recipient, will be set later.
toAddresses = ['Founder Name <[email protected]>']
ccAddresses = ['President <[email protected]>', 'Director <[email protected]>']
subjectTxt = "Success and Scale Bring Broad Responsibility"
messageTxt = "We started in a garage, but we’re not there anymore. We are big, we impact the world, and we are far from perfect. We must be humble and thoughtful about even the secondary effects of our actions. Our local communities, planet, and future generations need us to be better every day. We must begin each day with a determination to make better, do better, and be better for our customers, our employees, our partners, and the world at large. And we must end every day knowing we can do even more tomorrow. Leaders create more than they consume and always leave things better than how they found them."
messageHtml = "<html><body><p>" + messageTxt + "</p></body></html>"
CHARSET = "utf-8"

multiPartEmail = MIMEMultipart()
multiPartEmail['From'] = fromAddress
toAddressesJoined = ",".join(toAddresses)
multiPartEmail['To'] = toAddressesJoined
ccAddressesJoined = ",".join(ccAddresses)
multiPartEmail['Cc'] = ccAddressesJoined
multiPartEmail['Subject'] = subjectTxt
msg_body = MIMEMultipart('alternative')
# MIMEText accepts str and applies the charset encoding itself
textpart = MIMEText(messageTxt, 'plain', CHARSET)
htmlpart = MIMEText(messageHtml, 'html', CHARSET)
msg_body.attach(textpart)
msg_body.attach(htmlpart)
multiPartEmail.attach(msg_body)

print("Human readable blob:")
print((multiPartEmail.as_string()))
print("Base64 Encoded Blob:")
print(base64.b64encode(multiPartEmail.as_bytes()))

Running this script will produce output similar to the following:

Human readable blob:
Content-Type: multipart/mixed; boundary="===============0865862865556646150=="
MIME-Version: 1.0
From: [email protected]
To: [email protected]
Cc: [email protected], [email protected]
Subject: Success and Scale Bring Broad Responsibility

--===============0865862865556646150==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit

We started in a garage, but we’re not there anymore. We are big, we impact the world, and we are far from perfect. We must be humble and thoughtful about even the secondary effects of our actions. Our local communities, planet, and future generations need us to be better every day. We must begin each day with a determination to make better, do better, and be better for our customers, our employees, our partners, and the world at large. And we must end every day knowing we can do even more tomorrow. Leaders create more than they consume and always leave things better than how they found them.

--===============0865862865556646150==--

Base64 Encoded Blob:
b'Q29udGVudC1U...TRIMMED...'
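
Before pasting the blob into the next script, it can help to decode it and confirm which headers the recipients will actually see. The following is a minimal sketch that uses only the Python standard library; the function names summarize_mime and build_sample, and all addresses in it, are hypothetical placeholders rather than part of the original post.

```python
import base64
from email import message_from_bytes
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def summarize_mime(raw_bytes):
    """Parse a MIME message; return its display headers and leaf part types."""
    msg = message_from_bytes(raw_bytes)
    headers = {name: msg[name] for name in ("From", "To", "Cc", "Subject")}
    part_types = [
        part.get_content_type()
        for part in msg.walk()
        if not part.is_multipart()
    ]
    return headers, part_types


def build_sample():
    """Build a small message shaped like the one above (placeholder addresses)."""
    outer = MIMEMultipart()
    outer["From"] = "Descriptive Name <sender@example.com>"
    outer["To"] = "Founder Name <founder@example.com>"
    outer["Cc"] = "President <president@example.com>"
    outer["Subject"] = "Sample subject"
    body = MIMEMultipart("alternative")
    body.attach(MIMEText("Hello", "plain", "utf-8"))
    body.attach(MIMEText("<html><body><p>Hello</p></body></html>", "html", "utf-8"))
    outer.attach(body)
    return outer.as_bytes()


if __name__ == "__main__":
    blob = base64.b64encode(build_sample())  # same form the script above prints
    headers, part_types = summarize_mime(base64.b64decode(blob))
    print(headers)
    print(part_types)
```

The headers returned here are the ones displayed by the email client; the envelope recipients are not part of the MIME object at all.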

The following script has an option to divide the list into batches of 50 or fewer for each SendEmail operation and sends a Base64-encoded MIME object to a list of recipients. The headers of the message are always the same for every recipient because the headers are defined within the MIME object, which is obtained by running the previous script. With SendEmail, the Destination argument does not define the To or Cc headers.

#!/bin/bash

# Replace these variables with your own values
region="us-east-1"

# List of recipients, one per line. Defaults to SES mailbox simulator addresses (https://docs.aws.amazon.com/ses/latest/dg/send-an-email-from-console.html#send-email-simulator-how-to-use)
# These are the actual envelope recipients who will get the above email in their inbox. The To and Cc addresses set above will be displayed, not these.
recipients=(
"[email protected]"
# ...
"[email protected]"
)

# Raw message content
# Paste the base64 encoded message blob that is returned from the Python script (the string within b'')
content=''

# Maximum number of recipients per batch
# Increase batch_size up to 50 if your use case requires every recipient have the same message-id. This sacrifices observability into deliverability metrics.
batch_size=1

# Send an email to batch size of 1 to 50 recipients
recipients_count=${#recipients[@]}
echo $recipients_count
for ((i=0; i<$recipients_count; i+=batch_size)); do
to_addresses="${recipients[@]:${i}:${batch_size}}"
to_addresses="${to_addresses// /,}"
aws sesv2 send-email \
--destination "ToAddresses=$to_addresses" \
--content "Raw={Data='$content'}" \
--region "$region"
done

# The output will look similar to this, with a unique MessageID associated with each send-email.
#{
# "MessageId": "010001874ee5cdca-3fe4fb4b-4d36-4ae7-b4e4-cc7fae988a42-000000"
#}
#... 51 total message-ids
#{
# "MessageId": "010001874ee5d210-9225f471-e330-4f01-9044-63a941358477-000000"
#}

Screenshot of email client, viewing email sent by the above code. The sender of the email is “Descriptive Name”, the To recipient is Founder Name, and the President and Director are displayed as Cc addresses.

Remember: If you increase the batch size to greater than 1, every recipient in each batch receives a message with the same message-id, and all of them are treated the same for event processing.

Running these scripts results in each team member receiving an identical-looking message, regardless of how many recipients were defined in each SendEmail Destination. The To and Cc addresses were set in the email headers, but the actual envelope recipients were set in the API operation.
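
The batching step in the script above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the helper names batch_recipients and build_raw_requests are hypothetical, and the request dictionaries are assumed to be passed to the boto3 sesv2 client's send_email method, which takes the raw MIME bytes directly (not the base64 text that the CLI script pastes in).

```python
MAX_RECIPIENTS_PER_CALL = 50  # SendEmail limit per Destination, as described above


def batch_recipients(recipients, batch_size=1):
    """Split the envelope recipients into consecutive batches of batch_size."""
    if not 1 <= batch_size <= MAX_RECIPIENTS_PER_CALL:
        raise ValueError("batch_size must be between 1 and 50")
    return [
        recipients[i:i + batch_size]
        for i in range(0, len(recipients), batch_size)
    ]


def build_raw_requests(recipients, raw_mime_bytes, batch_size=1):
    """Build one SendEmail request per batch, all sharing the same MIME object."""
    return [
        {
            # Envelope recipients only; the displayed To and Cc headers live
            # inside raw_mime_bytes and are identical for every batch.
            "Destination": {"ToAddresses": batch},
            "Content": {"Raw": {"Data": raw_mime_bytes}},
        }
        for batch in batch_recipients(recipients, batch_size)
    ]


if __name__ == "__main__":
    # Hypothetical recipient list for illustration.
    team = [f"member{n}@example.com" for n in range(51)]
    print(len(build_raw_requests(team, b"placeholder MIME bytes", batch_size=1)))
    print(len(build_raw_requests(team, b"placeholder MIME bytes", batch_size=50)))
```

With a batch size of 1, each recipient maps to its own request and therefore its own message ID; larger batches trade that per-recipient visibility for fewer API calls.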

SES SendEmail and SendBulkEmail APIs

The latest version of SES API (version 2) offers SendEmail and SendBulkEmail APIs.

With SendBulkEmail, you can only use a pre-defined SES template while, with SendEmail, you can send any email format including raw, text, HTML and templates.

The SendEmail operation can send a single email to one Destination (up to 50 recipients across the To:, Cc:, and Bcc: fields), while the SendBulkEmail operation can send 50 unique emails to 50 Destinations by using an SES template.

Both operations have the capability to send templated emails, but the SendBulkEmail operation requires less computational resources. This is due to its ability to send emails to 50 Destinations using just a single API call.
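
To make the difference concrete, the following sketch builds a single SendBulkEmail request that addresses several Destinations at once. It is an illustration under assumptions: the helper name, template name, addresses, and template variables are hypothetical, and the field names reflect the SES API v2 request shape as understood here, so check them against the SES API v2 reference before relying on them.

```python
import json

MAX_DESTINATIONS = 50  # per SendBulkEmail call, as described above


def build_send_bulk_email_request(sender, template_name, default_data, per_recipient_data):
    """Build one SendBulkEmail request with one entry per recipient.

    per_recipient_data maps each recipient address to the template variables
    that personalize that recipient's copy of the message.
    """
    if len(per_recipient_data) > MAX_DESTINATIONS:
        raise ValueError("SendBulkEmail accepts at most 50 destinations per call")
    entries = [
        {
            "Destination": {"ToAddresses": [address]},
            "ReplacementEmailContent": {
                "ReplacementTemplate": {
                    "ReplacementTemplateData": json.dumps(data)
                }
            },
        }
        for address, data in per_recipient_data.items()
    ]
    return {
        "FromEmailAddress": sender,
        "DefaultContent": {
            "Template": {
                "TemplateName": template_name,
                "TemplateData": json.dumps(default_data),
            }
        },
        "BulkEmailEntries": entries,
    }


if __name__ == "__main__":
    request = build_send_bulk_email_request(
        "sender@example.com",          # hypothetical sender
        "team-memo-template",          # hypothetical, pre-created SES template
        {"name": "colleague"},
        {"a@example.com": {"name": "Ana"}, "b@example.com": {"name": "Bo"}},
    )
    print(json.dumps(request, indent=2))
```

Each entry still has a single recipient, so the one-recipient-per-message practice is preserved while the number of API calls drops.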

Conclusion

In this blog post, we discussed how message recipient addresses are displayed by email clients, how message delivery is defined by envelope recipients, and how email sending events are associated with the recipients. Defining multiple recipients in a message destination can lead to poor observability and therefore poor deliverability, and should not be used unless your use case specifically requires it.

Sending messages to one recipient at a time is a best practice and leads to the highest engagement with your recipients.

About the authors

Jesse Thompson is an Email Deliverability Manager with the Amazon Simple Email Service team. His background is in enterprise IT development and operations, with a focus on email abuse mitigation and encouragement of authenticity practices with open standard protocols. Jesse’s favorite activity outside of technology is recreational curling.
Samuel Wallan is a Software Development Engineer at AWS Simple Email Service. Within SES, Sam works on the Digital User Experience Deliverability team. In his free time, he enjoys hanging out with friends and staying fit.
Farnam Farshneshani is a Technical Account Manager at AWS. He specializes in AWS Simple Email Service and helps customers with operational and architectural issues. In his free time, he enjoys traveling and participating in various outdoor activities.
Joydeep Dutta is a Senior Solutions Architect at AWS. Joydeep enjoys working with AWS customers to migrate their workloads to the AWS Cloud, optimize for cost, and help with architectural best practices. He is passionate about enterprise architecture to help reduce cost and complexity in the enterprise. He lives in New Jersey and enjoys listening to music and spending time outdoors in his spare time.

Stream VPC Flow Logs to Datadog via Amazon Kinesis Data Firehose

Post Syndicated from Chaitanya Shah original https://aws.amazon.com/blogs/big-data/stream-vpc-flow-logs-to-datadog-via-amazon-kinesis-data-firehose/

It’s common to store the logs generated by customers’ applications and services in various tools. These logs are important for compliance, audits, troubleshooting, security incident responses, meeting security policies, and many other purposes. You can perform log analysis on these logs to understand users’ application behavior and patterns to make informed decisions.

When running workloads on Amazon Web Services (AWS), you need to analyze Amazon Virtual Private Cloud (Amazon VPC) Flow Logs to track the IP traffic going to and from the network interfaces for the workloads in your VPC. Analyzing VPC flow logs helps you understand how your applications communicate over the VPC network and serves as a main source of information about the network in your VPC.

You can easily deliver data to supported destinations using the Amazon Kinesis Data Firehose integration with VPC flow logs. Kinesis Data Firehose is a fully managed service for delivering near-real-time streaming data to various destinations for storage and performing near-real-time analytics. With its extensible data transformation capabilities, you can also streamline log processing and log delivery pipelines into a single Kinesis Data Firehose delivery stream. You can perform analytics on VPC flow logs delivered from your VPC using the Kinesis Data Firehose integration with Datadog as a destination.

Datadog is a monitoring and security platform and AWS Partner Network (APN) Advanced Technology Partner with AWS Competencies in AWS Cloud Operations, DevOps, Migration, Security, Networking, Containers, and Microsoft Workloads, along with many others.

Datadog enables you to easily explore and analyze logs to gain deeper insights into the state of your applications and AWS infrastructure. You can analyze all your AWS service logs while storing only the ones you need, and generate metrics from aggregated logs to uncover and send alerts about trends in your AWS services.

In this post, you learn how to integrate VPC flow logs with Kinesis Data Firehose and deliver them to Datadog.

Solution overview

This solution uses native integration of VPC flow logs streaming to Kinesis Data Firehose. We use a Kinesis Data Firehose delivery stream to buffer the streamed VPC flow logs to a Datadog destination endpoint in your Datadog account. You can use these logs with Datadog Log Management and Datadog Cloud SIEM to analyze the health, performance, and security of your cloud resources.

The following diagram illustrates the solution architecture.

We walk you through the following high-level steps:

  1. Link your AWS account with your Datadog account.
  2. Create the Kinesis Data Firehose delivery stream to which the VPC service streams the flow logs.
  3. Create the VPC flow log subscription to Kinesis Data Firehose.
  4. Visualize VPC flow logs in the Datadog dashboard.

The account ID 123456781234 used in this post is a dummy account. It is used only for demonstration purposes.

Prerequisites

You should have the following prerequisites:

Link your AWS account with your Datadog account for AWS integration

Follow the instructions provided on the Datadog website for AWS Integration. To configure log archiving and enrich the log data sent from your AWS account with useful context, link the accounts. When you complete the linking setup, proceed to the following step.

Create a Kinesis Data Firehose stream

Now that your Datadog integration with AWS is complete, you can create a Kinesis Data Firehose delivery stream where VPC Flow Logs are streamed by following these steps:

  1. On the Amazon Kinesis console, choose Kinesis Data Firehose in the navigation pane.
  2. Choose Create delivery stream.
  3. Choose Direct PUT as the source.
  4. Set Destination as Datadog.
    Create delivery stream
  5. For Delivery stream name, enter PUT-DATADOG-DEMO.
  6. Keep Data transformation set to Disabled under Transform records.
  7. In Destination settings, for HTTP endpoint URL, choose the desired log’s HTTP endpoint based on your Region and Datadog account configuration.
    Kinesis delivery stream configuration
  8. For API key, enter your Datadog API key.

This allows your delivery stream to publish VPC flow logs to the Datadog endpoint. API keys are unique to your organization. An API key is required by the Datadog Agent to submit metrics and events to Datadog.

  9. Set Content encoding to GZIP to reduce the size of data transferred.
  10. Set the Retry duration to 60. You can change the Retry duration value if you need to. This depends on the request handling capacity of the Datadog endpoint.
    Kinesis destination settings
    Under Buffer hints, Buffer size and Buffer interval are set with default values for Datadog integration.
    Kinesis buffer settings
  11. Under Backup settings, as mentioned in the prerequisites, choose the S3 bucket that you created to store failed logs and backups, with a specific prefix.
  12. Under S3 buffer hints section, set Buffer size to 5 and Buffer interval to 300.

You can change the S3 buffer size and interval based on your requirements.

  13. Under S3 compression and encryption, select GZIP for Compression for data records or another compression method of your choice.

Compressing data reduces the required storage space.

  14. Select Disabled for Encryption of the data records. You can enable encryption of the data records to secure access to your logs.
    Kinesis stream backup settings
  15. Optionally, in Advanced settings, select Enable server-side encryption for source records in delivery stream.
    You can use AWS managed keys or a CMK managed by you for the encryption type.
  16. Enable CloudWatch error logging.
  17. Choose Create or update IAM role, which is created by Kinesis Data Firehose as part of this stream.
    Kinesis stream Advanced settings
  18. Choose Next.
  19. Review your settings.
  20. Choose Create delivery stream.
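
For readers who prefer to script this setup, the console choices above can be sketched as a request for the Kinesis Data Firehose CreateDeliveryStream API. This is a sketch under assumptions, not the method used in this post: it assumes the Datadog destination maps to the API's HTTP endpoint destination configuration, the helper name and every ARN, URL, and key value are hypothetical placeholders, and the backup mode default is a guess at the console setting. Verify the field names against the Firehose API reference.

```python
import json


def build_datadog_delivery_stream_request(endpoint_url, api_key, role_arn,
                                          backup_bucket_arn,
                                          stream_name="PUT-DATADOG-DEMO",
                                          backup_prefix="",
                                          s3_backup_mode="FailedDataOnly"):
    """Mirror the console settings above as a CreateDeliveryStream request."""
    return {
        "DeliveryStreamName": stream_name,
        "DeliveryStreamType": "DirectPut",            # Direct PUT source
        "HttpEndpointDestinationConfiguration": {
            "EndpointConfiguration": {
                "Url": endpoint_url,                  # Datadog logs endpoint for your Region
                "Name": "Datadog",
                "AccessKey": api_key,                 # Datadog API key
            },
            "RequestConfiguration": {"ContentEncoding": "GZIP"},
            "RetryOptions": {"DurationInSeconds": 60},
            "RoleARN": role_arn,
            "S3BackupMode": s3_backup_mode,           # assumption: back up failed data only
            "S3Configuration": {
                "RoleARN": role_arn,
                "BucketARN": backup_bucket_arn,
                "Prefix": backup_prefix,
                "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
                "CompressionFormat": "GZIP",
            },
        },
    }


if __name__ == "__main__":
    request = build_datadog_delivery_stream_request(
        "https://example.com/datadog-logs-endpoint",        # placeholder URL
        "REPLACE_WITH_DATADOG_API_KEY",
        "arn:aws:iam::123456781234:role/firehose-demo-role",  # hypothetical role
        "arn:aws:s3:::my-firehose-backup-bucket",             # hypothetical bucket
    )
    print(json.dumps(request, indent=2))
    # With boto3 installed and credentials configured, this could be sent with:
    #   boto3.client("firehose").create_delivery_stream(**request)
```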

Create a VPC flow logs subscription

Create a VPC flow logs subscription for the Kinesis Data Firehose delivery stream you created in the previous step:

  1. On the Amazon VPC console, choose Your VPCs.
  2. Select the VPC that you want to create the flow log for.
  3. On the Actions menu, choose Create flow log.
    AWS VPCs
  4. Select All to send all flow log records to the Firehose destination.

If you want to filter the flow logs, you could alternatively select Accept or Reject.

  5. For Maximum aggregation interval, select 10 minutes or the minimum setting of 1 minute if you need the flow log data to be available for near-real-time analysis in Datadog.
  6. For Destination, select Send to Kinesis Data Firehose in the same account if the delivery stream is set up on the same account where you create the VPC flow logs.

If you want to send the data to a different account, refer to Publish flow logs to Kinesis Data Firehose.

  7. Choose an option for Log record format:
    • If you leave Log record format as the AWS default format, the flow logs are sent as version 2 format.
    • Alternatively, you can specify the custom fields for flow logs to capture and send to Datadog.

For more information on log format and available fields, refer to Flow log records.

  8. Choose Create flow log.
    Create VPC Flow log
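
The same flow log subscription can be sketched as a request for the Amazon EC2 CreateFlowLogs API. This is a minimal sketch, assuming the delivery stream lives in the same account; the helper name, VPC ID, and delivery stream ARN are hypothetical placeholders, and the dictionary is assumed to be passed to the boto3 ec2 client's create_flow_logs method.

```python
VALID_TRAFFIC_TYPES = {"ALL", "ACCEPT", "REJECT"}
VALID_AGGREGATION_INTERVALS = {60, 600}  # seconds: 1 minute or 10 minutes


def build_create_flow_logs_request(vpc_id, delivery_stream_arn,
                                   traffic_type="ALL",
                                   max_aggregation_interval=60,
                                   log_format=None):
    """Mirror the console steps above as a CreateFlowLogs request."""
    if traffic_type not in VALID_TRAFFIC_TYPES:
        raise ValueError(f"traffic_type must be one of {sorted(VALID_TRAFFIC_TYPES)}")
    if max_aggregation_interval not in VALID_AGGREGATION_INTERVALS:
        raise ValueError("max_aggregation_interval must be 60 or 600 seconds")
    request = {
        "ResourceIds": [vpc_id],
        "ResourceType": "VPC",
        "TrafficType": traffic_type,
        "LogDestinationType": "kinesis-data-firehose",
        "LogDestination": delivery_stream_arn,
        "MaxAggregationInterval": max_aggregation_interval,
    }
    if log_format is not None:
        # Omit LogFormat to keep the AWS default (version 2) format.
        request["LogFormat"] = log_format
    return request


if __name__ == "__main__":
    print(build_create_flow_logs_request(
        "vpc-0123456789abcdef0",  # hypothetical VPC ID
        "arn:aws:firehose:us-east-1:123456781234:deliverystream/PUT-DATADOG-DEMO",
    ))
    # With boto3 installed and credentials configured, this could be sent with:
    #   boto3.client("ec2").create_flow_logs(**request)
```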

Now let’s explore the VPC flow logs in Datadog.

Visualize VPC flow logs in the Datadog dashboard

In the Logs Search option in the navigation pane, filter to source:vpc. The VPC flow logs from your VPC are in the Datadog Log Explorer and are automatically parsed so you can analyze your logs by source, destination, action, or other attributes.
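
To illustrate what that parsing involves, the following sketch splits a record in the AWS default (version 2) format into named fields. It is only an illustration of the record layout, not Datadog's parser; the function name is hypothetical and the sample record is made up, using the dummy account ID from this post.

```python
# Field order of the AWS default (version 2) flow log record format.
DEFAULT_V2_FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]
NUMERIC_FIELDS = {
    "version", "srcport", "dstport", "protocol",
    "packets", "bytes", "start", "end",
}


def parse_flow_log_record(line):
    """Split one space-separated default-format record into a dict."""
    values = line.split()
    if len(values) != len(DEFAULT_V2_FIELDS):
        raise ValueError(f"expected {len(DEFAULT_V2_FIELDS)} fields, got {len(values)}")
    record = dict(zip(DEFAULT_V2_FIELDS, values))
    for name in NUMERIC_FIELDS:
        # Records without traffic data carry "-" in place of a value.
        if record[name] != "-":
            record[name] = int(record[name])
    return record


if __name__ == "__main__":
    sample = ("2 123456781234 eni-0123456789abcdef0 10.0.1.5 10.0.2.9 "
              "49152 443 6 10 840 1700000000 1700000060 ACCEPT OK")
    parsed = parse_flow_log_record(sample)
    print(parsed["srcaddr"], parsed["dstaddr"], parsed["action"])
```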

Datadog Logs Dashboard

Clean up

After you test this solution, delete all the resources you created to avoid incurring future charges. Refer to the following links for instructions for deleting the resources:

Conclusion

In this post, we walked through how to integrate VPC flow logs with a Kinesis Data Firehose delivery stream, deliver them to a Datadog destination with no code, and visualize them in a Datadog dashboard. With Datadog, you can easily explore and analyze logs to gain deeper insights into the state of your applications and AWS infrastructure.

Try this new, quick, and hassle-free way of sending your VPC flow logs to a Datadog destination using Kinesis Data Firehose.


About the Author

Chaitanya Shah is a Sr. Technical Account Manager (TAM) with AWS, based out of New York. He has over 22 years of experience working with enterprise customers. He loves to code and actively contributes to the AWS solutions labs to help customers solve complex problems. He provides guidance to AWS customers on best practices for their AWS Cloud migrations. He also specializes in AWS data transfer and the data and analytics domain.

Multi-tenancy Apache Kafka clusters in Amazon MSK with IAM access control and Kafka Quotas – Part 1

Post Syndicated from Vikas Bajaj original https://aws.amazon.com/blogs/big-data/multi-tenancy-apache-kafka-clusters-in-amazon-msk-with-iam-access-control-and-kafka-quotas-part-1/

With Amazon Managed Streaming for Apache Kafka (Amazon MSK), you can build and run applications that use Apache Kafka to process streaming data. To process streaming data, organizations either use multiple Kafka clusters based on their application groupings, usage scenarios, compliance requirements, and other factors, or a dedicated Kafka cluster for the entire organization. Regardless of the pattern used, Kafka clusters are typically multi-tenant, allowing multiple producer and consumer applications to produce and consume streaming data simultaneously.

With multi-tenant Kafka clusters, however, one of the challenges is to make sure that data consumer and producer applications don’t overuse cluster resources. There is a possibility that a few poorly behaved applications may overuse cluster resources, affecting the well-behaved applications as a result. Therefore, teams who manage multi-tenant Kafka clusters need a mechanism to prevent applications from overconsuming cluster resources in order to avoid issues. This is where Kafka quotas come into play. Kafka quotas control the amount of resources client applications can use within a Kafka cluster.

In Part 1 of this two-part series, we explain the concepts of how to enforce Kafka quotas in MSK multi-tenant Kafka clusters while using AWS Identity and Access Management (IAM) access control for authentication and authorization. In Part 2, we cover detailed implementation steps along with sample Kafka client applications.

Brief introduction to Kafka quotas

Kafka quotas control the amount of resources client applications can use within a Kafka cluster. A multi-tenant Kafka cluster can experience performance degradation or a complete outage due to resource constraints if one or more client applications produce or consume large volumes of data or generate requests at a very high rate for a continuous period of time, monopolizing the Kafka cluster’s resources.

To prevent applications from overwhelming the cluster, Apache Kafka allows configuring quotas that determine how much traffic each client application produces and consumes per Kafka broker in a cluster. Kafka brokers throttle the client applications’ requests in accordance with their allocated quotas. Kafka quotas can be configured for specific users, or specific client IDs, or both. The client ID is a logical name defined in the application code that Kafka brokers use to identify which application sent messages. The user represents the authenticated user principal of a client application in a secure Kafka cluster with authentication enabled.

There are two types of quotas supported in Kafka:

  • Network bandwidth quotas – The byte-rate thresholds define how much data client applications can produce to and consume from each individual broker in a Kafka cluster measured in bytes per second.
  • Request rate quotas – This limits the percentage of time each individual broker spends processing client applications’ requests.

Depending on the business requirements, you can use either of these quota configurations. However, the use of network bandwidth quotas is common because it allows organizations to cap platform resources consumption according to the amount of data produced and consumed by applications per second.

Because this post uses an MSK cluster with IAM access control, we specifically discuss configuring network bandwidth quotas based on the applications’ client IDs and authenticated user principals.

Considerations for Kafka quotas

Keep the following in mind when working with Kafka quotas:

  • Enforcement level – Quotas are enforced at the broker level rather than at the cluster level. Suppose there are six brokers in a Kafka cluster and you specify a 12 MB/sec produce quota for a client ID and user. The producer application using the client ID and user can produce a maximum of 12 MB/sec on each broker at the same time, for a total maximum of 72 MB/sec across all six brokers. However, if leadership for every partition of a topic resides on one broker, the same producer application can only produce a maximum of 12 MB/sec. Because throttling occurs per broker, it’s essential to maintain an even balance of topics’ partition leadership across all the brokers.
  • Throttling – When an application reaches its quota, it is throttled, not failed, meaning the broker doesn’t throw an exception. Clients who reach their quota on a broker will begin to have their requests throttled by the broker to prevent exceeding the quota. Instead of sending an error when a client exceeds a quota, the broker attempts to slow it down. Brokers calculate the amount of delay necessary to bring clients under quotas and delay responses accordingly. As a result of this approach, quota violations are transparent to clients, and clients don’t have to implement any special backoff or retry policies. However, when using an asynchronous producer and sending messages at a rate greater than the broker can accept due to quota, the messages will be queued in the client application memory first. The client will eventually run out of buffer space if the rate of sending messages continues to exceed the rate of accepting messages, causing the next Producer.send() call to be blocked. Producer.send() will eventually throw a TimeoutException if the timeout delay isn’t sufficient to allow the broker to catch up to the producer application.
  • Shared quotas – If more than one client application has the same client ID and user, the quota configured for the client ID and user will be shared among all those applications. Suppose you configure a produce quota of 5 MB/sec for the combination of client-id="marketing-producer-client" and user="marketing-app-user". In this case, all producer applications that have marketing-producer-client as a client ID and marketing-app-user as an authenticated user principal will share the 5 MB/sec produce quota, impacting each other’s throughput.
  • Produce throttling – The produce throttling behavior is exposed to producer clients via client metrics such as produce-throttle-time-avg and produce-throttle-time-max. If these are non-zero, it indicates that the destination brokers are slowing the producer down and the quotas configuration should be reviewed.
  • Consume throttling – The consume throttling behavior is exposed to consumer clients via client metrics such as fetch-throttle-time-avg and fetch-throttle-time-max. If these are non-zero, it indicates that the origin brokers are slowing the consumer down and the quotas configuration should be reviewed.

Note that client metrics are metrics exposed by clients connecting to Kafka clusters.

  • Quota configuration – It’s possible to configure Kafka quotas either statically through the Kafka configuration file or dynamically through kafka-configs.sh or the Kafka Admin API. The dynamic configuration mechanism is much more convenient and manageable because it allows quotas for the new producer and consumer applications to be configured at any time without having to restart brokers. Even while application clients are producing or consuming data, dynamic configuration changes take effect in real time.
  • Configuration keys – With the kafka-configs.sh command-line tool, you can set dynamic consume, produce, and request quotas using the following three configuration keys, respectively: consumer_byte_rate, producer_byte_rate, and request_percentage.
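
The enforcement-level arithmetic from the first consideration above can be checked with a few lines of Python. This is a simplified sketch: the function name is hypothetical, and the result is only an upper bound that assumes the application can spread its traffic freely across every broker that leads at least one partition.

```python
def max_aggregate_rate(per_broker_quota, partition_leaders):
    """Upper bound on a client's total rate across the cluster.

    per_broker_quota: quota enforced independently on each broker (e.g. MB/sec).
    partition_leaders: the broker ID that leads each partition of the topic.
    """
    if per_broker_quota < 0:
        raise ValueError("quota must be non-negative")
    # The quota applies once per broker that the client actually talks to.
    return per_broker_quota * len(set(partition_leaders))


if __name__ == "__main__":
    # Six partitions led by six different brokers: 12 MB/sec on each broker.
    print(max_aggregate_rate(12, [1, 2, 3, 4, 5, 6]))
    # Six partitions all led by broker 3: capped by that one broker's quota.
    print(max_aggregate_rate(12, [3, 3, 3, 3, 3, 3]))
```

The two calls reproduce the 72 MB/sec and 12 MB/sec cases described above, which is why balanced partition leadership matters when sizing quotas.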

For more information about Kafka quotas, refer to Kafka documentation.

Enforce network bandwidth quotas with IAM access control

Following our understanding of Kafka quotas, let’s look at how to enforce them in an MSK cluster while using IAM access control for authentication and authorization. IAM access control in Amazon MSK eliminates the need for two separate mechanisms for authentication and authorization.

The following figure shows an MSK cluster that is configured to use IAM access control in the demo account. Each producer and consumer application has a quota that determines how much data it can produce or consume in bytes per second. For example, ProducerApp-1 has a produce quota of 1024 bytes/sec, and ConsumerApp-1 and ConsumerApp-2 have consume quotas of 5120 and 1024 bytes/sec, respectively. It’s important to note that Kafka quotas are set on the Kafka cluster rather than in the client applications.

The preceding figure illustrates how Kafka client applications (ProducerApp-1, ConsumerApp-1, and ConsumerApp-2) access Topic-B in the MSK cluster by assuming write and read IAM roles. The workflow is as follows:

  • P1 – ProducerApp-1 (via its ProducerApp-1-Role IAM role) assumes the Topic-B-Write-Role IAM role to send messages to Topic-B in the MSK cluster.
  • P2 – With the Topic-B-Write-Role IAM role assumed, ProducerApp-1 begins sending messages to Topic-B.
  • C1 – ConsumerApp-1 (via its ConsumerApp-1-Role IAM role) and ConsumerApp-2 (via its ConsumerApp-2-Role IAM role) assume the Topic-B-Read-Role IAM role to read messages from Topic-B in the MSK cluster.
  • C2 – With the Topic-B-Read-Role IAM role assumed, ConsumerApp-1 and ConsumerApp-2 start consuming messages from Topic-B.

ConsumerApp-1 and ConsumerApp-2 are two separate consumer applications. They do not belong to the same consumer group.

Configuring client IDs and understanding authenticated user principal

As explained earlier, Kafka quotas can be configured for specific users, specific client IDs, or both. Let’s explore client ID and user concepts and configurations required for Kafka quota allocation.

Client ID

A client ID representing an application’s logical name can be configured within an application’s code. In Java applications, for example, you can set the producer’s and consumer’s client IDs using ProducerConfig.CLIENT_ID_CONFIG and ConsumerConfig.CLIENT_ID_CONFIG configurations, respectively. The following code snippet illustrates how ProducerApp-1 sets the client ID to this-is-me-producerapp-1 using ProducerConfig.CLIENT_ID_CONFIG:

Properties props = new Properties();
props.put(ProducerConfig.CLIENT_ID_CONFIG,"this-is-me-producerapp-1");

User

The user refers to an authenticated user principal of the client application in the Kafka cluster with authentication enabled. As shown in the solution architecture, producer and consumer applications assume the Topic-B-Write-Role and Topic-B-Read-Role IAM roles, respectively, to perform write and read operations on Topic-B. Therefore, their authenticated user principal will look like the following IAM identifier:

arn:aws:sts::<AWS Account Id>:assumed-role/<assumed Role Name>/<role session name>

For more information, refer to IAM identifiers.

The role session name is a string identifier that uniquely identifies a session when IAM principals, federated identities, or applications assume an IAM role. In our case, ProducerApp-1, ConsumerApp-1, and ConsumerApp-2 applications assume an IAM role using the AWS Security Token Service (AWS STS) SDK, and provide a role session name in the AWS STS SDK call. For example, if ProducerApp-1 assumes the Topic-B-Write-Role IAM role and uses this-is-producerapp-1-role-session as its role session name, its authenticated user principal will be as follows:

arn:aws:sts::<AWS Account Id>:assumed-role/Topic-B-Write-Role/this-is-producerapp-1-role-session
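
Because the quota is keyed on this exact string, it can be useful to build and take apart the principal programmatically. The following is a small illustrative sketch; the function names are hypothetical, and the account ID is the dummy value used for demonstration purposes.

```python
def assumed_role_principal(account_id, role_name, session_name):
    """Build the authenticated user principal for an assumed-role session."""
    return f"arn:aws:sts::{account_id}:assumed-role/{role_name}/{session_name}"


def split_assumed_role_principal(principal):
    """Return (account_id, role_name, session_name) from the principal string."""
    parts = principal.split(":", 5)
    if len(parts) != 6 or parts[:3] != ["arn", "aws", "sts"]:
        raise ValueError("not an AWS STS ARN")
    resource = parts[5].split("/", 2)
    if len(resource) != 3 or resource[0] != "assumed-role":
        raise ValueError("not an assumed-role principal")
    return parts[4], resource[1], resource[2]


if __name__ == "__main__":
    principal = assumed_role_principal(
        "123456781234",  # dummy account ID
        "Topic-B-Write-Role",
        "this-is-producerapp-1-role-session",
    )
    print(principal)
    print(split_assumed_role_principal(principal))
```

Note that a different role session name yields a different principal, so applications that should share or keep separate quotas need to choose their session names accordingly.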

The following is an example code snippet from the ProducerApp-1 application using this-is-producerapp-1-role-session as the role session name while assuming the Topic-B-Write-Role IAM role using the AWS STS SDK:

StsClient stsClient = StsClient.builder().region(region).build();
AssumeRoleRequest roleRequest = AssumeRoleRequest.builder()
          .roleArn("<Topic-B-Write-Role ARN>")
          .roleSessionName("this-is-producerapp-1-role-session") //role-session-name string literal
          .build();

Configure network bandwidth (produce and consume) quotas

The following commands configure the produce and consume quotas dynamically for client applications based on their client ID and authenticated user principal in the MSK cluster configured with IAM access control.

The following code configures the produce quota:

kafka-configs.sh --bootstrap-server <MSK cluster bootstrap servers IAM endpoint> \
--command-config config_iam.properties \
--alter --add-config "producer_byte_rate=<number of bytes per second>" \
--entity-type clients --entity-name <ProducerApp client Id> \
--entity-type users --entity-name <ProducerApp user principal>

The producer_byte_rate refers to the amount of data, in bytes, that a producer client identified by client ID and user is allowed to produce to a single broker per second. The option --command-config points to config_iam.properties, which contains the properties required for IAM access control.

The following code configures the consume quota:

kafka-configs.sh --bootstrap-server <MSK cluster bootstrap servers IAM endpoint> \
--command-config config_iam.properties \
--alter --add-config "consumer_byte_rate=<number of bytes per second>" \
--entity-type clients --entity-name <ConsumerApp client Id> \
--entity-type users --entity-name <ConsumerApp user principal>

The consumer_byte_rate refers to the amount of data, in bytes, that a consumer client identified by client ID and user is allowed to consume from a single broker per second.
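
When quotas need to be set for many applications, the two commands above can be generated rather than typed by hand. The following sketch assembles the same kafka-configs.sh argument list; the helper name is hypothetical, the bootstrap value is left as a placeholder, and the script only prints the command instead of running it.

```python
import shlex

QUOTA_KEYS = {"producer_byte_rate", "consumer_byte_rate"}


def build_quota_command(bootstrap_servers, quota_key, bytes_per_second,
                        client_id, user_principal,
                        command_config="config_iam.properties"):
    """Assemble the kafka-configs.sh arguments for a client ID + user quota."""
    if quota_key not in QUOTA_KEYS:
        raise ValueError(f"quota_key must be one of {sorted(QUOTA_KEYS)}")
    if bytes_per_second <= 0:
        raise ValueError("bytes_per_second must be positive")
    return [
        "kafka-configs.sh",
        "--bootstrap-server", bootstrap_servers,
        "--command-config", command_config,
        "--alter",
        "--add-config", f"{quota_key}={bytes_per_second}",
        "--entity-type", "clients", "--entity-name", client_id,
        "--entity-type", "users", "--entity-name", user_principal,
    ]


if __name__ == "__main__":
    command = build_quota_command(
        "<MSK cluster bootstrap servers IAM endpoint>",  # placeholder, as above
        "producer_byte_rate",
        1024,
        "this-is-me-producerapp-1",
        "arn:aws:sts::123456781234:assumed-role/Topic-B-Write-Role/"
        "this-is-producerapp-1-role-session",
    )
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(command))
```

Passing the arguments as a list, for example to subprocess.run, avoids shell quoting problems with the colons and slashes in the principal.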

Let’s look at some example quota configuration commands for ProducerApp-1, ConsumerApp-1, and ConsumerApp-2 client applications:

  • ProducerApp-1 produce quota configuration – Let’s assume ProducerApp-1 has this-is-me-producerapp-1 configured as the client ID in the application code and uses this-is-producerapp-1-role-session as the role session name when assuming the Topic-B-Write-Role IAM role. The following command sets the produce quota for ProducerApp-1 to 1024 bytes per second:
kafka-configs.sh --bootstrap-server <MSK Cluster Bootstrap servers IAM endpoint> \
--command-config config_iam.properties \
--alter --add-config "producer_byte_rate=1024" \
--entity-type clients --entity-name this-is-me-producerapp-1 \
--entity-type users --entity-name arn:aws:sts::<AWS Account Id>:assumed-role/Topic-B-Write-Role/this-is-producerapp-1-role-session
  • ConsumerApp-1 consume quota configuration – Let’s assume ConsumerApp-1 has this-is-me-consumerapp-1 configured as the client ID in the application code and uses this-is-consumerapp-1-role-session as the role session name when assuming the Topic-B-Read-Role IAM role. The following command sets the consume quota for ConsumerApp-1 to 5120 bytes per second:
kafka-configs.sh --bootstrap-server <MSK Cluster Bootstrap servers IAM endpoint> \
--command-config config_iam.properties \
--alter --add-config "consumer_byte_rate=5120" \
--entity-type clients --entity-name this-is-me-consumerapp-1 \
--entity-type users --entity-name arn:aws:sts::<AWS Account Id>:assumed-role/Topic-B-Read-Role/this-is-consumerapp-1-role-session
  • ConsumerApp-2 consume quota configuration – Let’s assume ConsumerApp-2 has this-is-me-consumerapp-2 configured as the client ID in the application code and uses this-is-consumerapp-2-role-session as the role session name when assuming the Topic-B-Read-Role IAM role. The following command sets the consume quota for ConsumerApp-2 to 1024 bytes per second per broker:
kafka-configs.sh --bootstrap-server <MSK Cluster Bootstrap servers IAM endpoint> \
--command-config config_iam.properties \
--alter --add-config "consumer_byte_rate=1024" \
--entity-type clients --entity-name this-is-me-consumerapp-2 \
--entity-type users --entity-name arn:aws:sts::<AWS Account Id>:assumed-role/Topic-B-Read-Role/this-is-consumerapp-2-role-session

As a result of the preceding commands, the ProducerApp-1, ConsumerApp-1, and ConsumerApp-2 client applications will be throttled by the MSK cluster using IAM access control if they exceed their assigned produce and consume quotas, respectively.
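The user entity in each of the preceding commands follows one pattern: the STS assumed-role ARN built from the account ID, the IAM role name, and the role session name. As a minimal sketch, the helper below assembles that string. The function name assumed_role_principal and the account ID 111111111111 are placeholders for illustration.

```shell
# Build the authenticated user principal used as the "users" entity name
# when an application assumes an IAM role.
assumed_role_principal() {
  account_id=$1     # 12-digit AWS account ID (placeholder in the example below)
  role_name=$2      # IAM role the application assumes
  session_name=$3   # role session name passed to the assume-role call
  printf 'arn:aws:sts::%s:assumed-role/%s/%s\n' "$account_id" "$role_name" "$session_name"
}

assumed_role_principal 111111111111 Topic-B-Write-Role this-is-producerapp-1-role-session
```

The quota matches only when both the client ID and this principal match what the application actually presents, so a different role session name results in a different user entity.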

Implement the solution

Part 2 of this series showcases the step-by-step detailed implementation of Kafka quotas configuration with IAM access control along with the sample producer and consumer client applications.

Conclusion

Kafka quotas offer teams the ability to set limits for producer and consumer applications. With Amazon MSK, Kafka quotas serve two important purposes: eliminating guesswork and preventing issues caused by poorly designed producer or consumer applications by limiting their quota, and allocating operational costs of a central streaming data platform across different cost centers and tenants (application and product teams).

In this post, we learned how to configure network bandwidth quotas within Amazon MSK while using IAM access control. We also covered some sample commands and code snippets to clarify how the client ID and authenticated principal are used when configuring quotas. Although we only demonstrated Kafka quotas using IAM access control, you can also configure them using other Amazon MSK-supported authentication mechanisms.

In Part 2 of this series, we demonstrate how to configure network bandwidth quotas with IAM access control in Amazon MSK and provide you with example producer and consumer applications so that you can see them in action.


About the Author

Vikas Bajaj is a Senior Manager, Solutions Architects, Financial Services at Amazon Web Services. Having worked with financial services organizations and digital native customers, he advises financial services customers in Australia on technology decisions, architectures, and product roadmaps.

Multi-tenancy Apache Kafka clusters in Amazon MSK with IAM access control and Kafka quotas – Part 2

Post Syndicated from Vikas Bajaj original https://aws.amazon.com/blogs/big-data/multi-tenancy-apache-kafka-clusters-in-amazon-msk-with-iam-access-control-and-kafka-quotas-part-2/

Kafka quotas are integral to multi-tenant Kafka clusters. They prevent Kafka cluster performance from being negatively affected by poorly behaved applications overconsuming cluster resources. Furthermore, they enable the central streaming data platform to be operated as a multi-tenant platform and used by downstream and upstream applications across multiple business lines. Kafka supports two types of quotas: network bandwidth quotas and request rate quotas. Network bandwidth quotas define byte-rate thresholds, that is, how much data client applications can produce to and consume from each individual broker in a Kafka cluster, measured in bytes per second. Request rate quotas limit the percentage of time each individual broker spends processing client application requests. Depending on your configuration, Kafka quotas can be set for specific users, specific client IDs, or both.

In Part 1 of this two-part series, we discussed the concepts of how to enforce Kafka quotas in Amazon Managed Streaming for Apache Kafka (Amazon MSK) clusters while using AWS Identity and Access Management (IAM) access control.

In this post, we walk you through the step-by-step implementation of setting up Kafka quotas in an MSK cluster while using IAM access control and testing them through sample client applications.

Solution overview

The following figure, which we first introduced in Part 1, illustrates how Kafka client applications (ProducerApp-1, ConsumerApp-1, and ConsumerApp-2) access Topic-B in the MSK cluster by assuming write and read IAM roles. Each producer and consumer client application has a quota that determines how much data they can produce or consume in bytes/second. The ProducerApp-1 quota allows it to produce up to 1024 bytes/second per broker. Similarly, the ConsumerApp-1 and ConsumerApp-2 quotas allow them to consume 5120 and 1024 bytes/second per broker, respectively. The following is a brief explanation of the flow shown in the architecture diagram:

  • P1 – ProducerApp-1 (via its ProducerApp-1-Role IAM role) assumes the Topic-B-Write-Role IAM role to send messages to Topic-B
  • P2 – With the Topic-B-Write-Role IAM role assumed, ProducerApp-1 begins sending messages to Topic-B
  • C1 – ConsumerApp-1 (via its ConsumerApp-1-Role IAM role) and ConsumerApp-2 (via its ConsumerApp-2-Role IAM role) assume the Topic-B-Read-Role IAM role to read messages from Topic-B
  • C2 – With the Topic-B-Read-Role IAM role assumed, ConsumerApp-1 and ConsumerApp-2 start consuming messages from Topic-B

Note that this post uses the AWS Command Line Interface (AWS CLI), AWS CloudFormation templates, and the AWS Management Console for provisioning and modifying AWS resources, and resources provisioned will be billed to your AWS account.

The high-level steps are as follows:

  1. Provision an MSK cluster with IAM access control and Amazon Elastic Compute Cloud (Amazon EC2) instances for client applications.
  2. Create Topic-B on the MSK cluster.
  3. Create IAM roles for the client applications to access Topic-B.
  4. Run the producer and consumer applications without setting quotas.
  5. Configure the produce and consume quotas for the client applications.
  6. Rerun the applications after setting the quotas.

Prerequisites

It is recommended that you read Part 1 of this series before continuing. In order to get started, you need the following:

  • An AWS account, referred to as the demo account in this post, with an assumed account ID of 111111111111
  • Permissions to create, delete, and modify AWS resources in the demo account

Provision an MSK cluster with IAM access control and EC2 instances

This step involves provisioning an MSK cluster with IAM access control in a VPC in the demo account. Additionally, we create four EC2 instances to make configuration changes to the MSK cluster and host producer and consumer client applications.

Deploy CloudFormation stack

  1. Clone the GitHub repository to download the CloudFormation template files and sample client applications:
git clone https://github.com/aws-samples/amazon-msk-kafka-quotas.git
  2. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  3. Choose Create stack.
  4. For Prepare template, select Template is ready.
  5. For Template source, select Upload a template file.
  6. Upload the cfn-msk-stack-1.yaml file from the amazon-msk-kafka-quotas/cfn-templates directory, then choose Next.
  7. For Stack name, enter MSKStack.
  8. Leave the parameters as default and choose Next.
  9. Scroll to the bottom of the Configure stack options page and choose Next to continue.
  10. Scroll to the bottom of the Review page, select the check box I acknowledge that CloudFormation may create IAM resources, and choose Submit.

It will take approximately 30 minutes for the stack to complete. The stack creates the following resources:

  • A VPC with three private subnets and one public subnet
  • An MSK cluster with three brokers with IAM access control enabled
  • An EC2 instance called MSKAdminInstance for modifying MSK cluster settings as well as creating and modifying AWS resources
  • EC2 instances for ProducerApp-1, ConsumerApp-1, and ConsumerApp-2, one for each client application
  • A separate IAM role for each EC2 instance that hosts the client application, as shown in the architecture diagram
  11. From the stack’s Outputs tab, note the MSKClusterArn value.

Create a topic on the MSK cluster

To create Topic-B on the MSK cluster, complete the following steps:

  1. On the Amazon EC2 console, navigate to your list of running EC2 instances.
  2. Select the MSKAdminInstance EC2 instance and choose Connect.
  3. On the Session Manager tab, choose Connect.
  4. Run the following commands on the new tab that opens in your browser:
sudo su - ec2-user

# Add Kafka binaries to the path
sed -i 's|HOME/bin|HOME/bin:~/kafka/bin|' .bash_profile

# Set your AWS region
aws configure set region <AWS Region>
  5. Set the environment variable to point to the MSK cluster brokers IAM endpoint:
MSK_CLUSTER_ARN=<Use the value of MSKClusterArn that you noted earlier>
echo "export BOOTSTRAP_BROKERS_IAM=$(aws kafka get-bootstrap-brokers --cluster-arn $MSK_CLUSTER_ARN | jq -r .BootstrapBrokerStringSaslIam)" >> .bash_profile
source .bash_profile
echo $BOOTSTRAP_BROKERS_IAM
  6. Take note of the value of BOOTSTRAP_BROKERS_IAM.
  7. Run the following Kafka CLI command to create Topic-B on the MSK cluster:
kafka-topics.sh --bootstrap-server $BOOTSTRAP_BROKERS_IAM \
--create --topic Topic-B \
--partitions 3 --replication-factor 3 \
--command-config config_iam.properties

Because the MSK cluster is provisioned with IAM access control, the option --command-config points to config_iam.properties, which contains the properties required for IAM access control, created by the MSKStack CloudFormation stack.

The following warnings may appear when you run the Kafka CLI commands, but you may ignore them:

The configuration 'sasl.jaas.config' was supplied but isn't a known config. 
The configuration 'sasl.client.callback.handler.class' was supplied but isn't a known config.
  8. To verify that Topic-B has been created, list all the topics:
kafka-topics.sh --bootstrap-server $BOOTSTRAP_BROKERS_IAM \
--command-config config_iam.properties --list
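For reference, a client properties file for IAM access control typically contains the four settings documented for the Amazon MSK IAM authentication library. The sketch below writes them to a separate file; the file name config_iam_example.properties is hypothetical so that it does not overwrite the config_iam.properties file created by the stack, whose exact contents may differ.

```shell
# Write an example client properties file for IAM access control.
# The file name is hypothetical; the stack already provides config_iam.properties.
cat > config_iam_example.properties <<'EOF'
security.protocol=SASL_SSL
sasl.mechanism=AWS_MSK_IAM
sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
EOF
```

The sasl.jaas.config and sasl.client.callback.handler.class entries are the ones named in the warnings mentioned earlier.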

Create IAM roles for client applications to access Topic-B

This step involves creating Topic-B-Write-Role and Topic-B-Read-Role as shown in the architecture diagram. Topic-B-Write-Role enables write operations on Topic-B and can be assumed by ProducerApp-1. Similarly, ConsumerApp-1 and ConsumerApp-2 can assume Topic-B-Read-Role to perform read operations on Topic-B. To perform read operations on Topic-B, ConsumerApp-1 and ConsumerApp-2 must also belong to the consumer groups specified during the MSKStack stack update in the subsequent step.

Create the roles with the following steps:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Select MSKStack and choose Update.
  3. For Prepare template, select Replace current template.
  4. For Template source, select Upload a template file.
  5. Upload the cfn-msk-stack-2.yaml file from amazon-msk-kafka-quotas/cfn-templates directory, then choose Next.
  6. Provide the following additional stack parameters:
    • For Topic B ARN, enter the Topic-B ARN.

The ARN must be formatted as arn:aws:kafka:region:account-id:topic/msk-cluster-name/msk-cluster-uuid/Topic-B. Use the cluster name and cluster UUID from the MSK cluster ARN you noted earlier and provide your AWS Region. For more information, refer to IAM access control for Amazon MSK.

    • For ConsumerApp-1 Consumer Group name, enter the ConsumerApp-1 consumer group ARN.

It must be formatted as arn:aws:kafka:region:account-id:group/msk-cluster-name/msk-cluster-uuid/consumer-group-name

    • For ConsumerApp-2 Consumer Group name, enter the ConsumerApp-2 consumer group ARN.

Use a similar format as the previous ARN.

  7. Choose Next to continue.
  8. Scroll to the bottom of the Configure stack options page and choose Next to continue.
  9. Scroll to the bottom of the Review page, select the check box I acknowledge that CloudFormation may create IAM resources, and choose Update stack.

It will take approximately 3 minutes for the stack to update. After the stack has been successfully updated, the following resources will be created:

  • Topic-B-Write-Role – An IAM role with permission to perform write operations on Topic-B. Its trust policy allows the ProducerApp-1-Role IAM role to assume it.
  • Topic-B-Read-Role – An IAM role with permission to perform read operations on Topic-B. Its trust policy allows the ConsumerApp-1-Role and ConsumerApp-2-Role IAM roles to assume it. Furthermore, ConsumerApp-1 and ConsumerApp-2 must also belong to the consumer groups you specified when updating the stack to perform read operations on Topic-B.
  10. From the stack’s Outputs tab, note the TopicBReadRoleARN and TopicBWriteRoleARN values.
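The topic and consumer group ARNs requested as stack parameters share the Region, account ID, cluster name, and cluster UUID of the MSK cluster ARN, so they can be derived from it. The following is a minimal sketch, assuming the cluster ARN has the form arn:aws:kafka:region:account-id:cluster/msk-cluster-name/msk-cluster-uuid. The function name derive_msk_arn and the example cluster name and UUID are hypothetical.

```shell
# Derive a topic or consumer group ARN from an MSK cluster ARN.
derive_msk_arn() {
  cluster_arn=$1     # e.g. the MSKClusterArn stack output
  resource_type=$2   # "topic" or "group"
  resource_name=$3   # topic name or consumer group name
  prefix=${cluster_arn%%:cluster/*}       # arn:aws:kafka:region:account-id
  name_and_uuid=${cluster_arn#*:cluster/} # msk-cluster-name/msk-cluster-uuid
  printf '%s:%s/%s/%s\n' "$prefix" "$resource_type" "$name_and_uuid" "$resource_name"
}

# Hypothetical cluster ARN used only for illustration.
EXAMPLE_CLUSTER_ARN=arn:aws:kafka:us-east-1:111111111111:cluster/MSKCluster/11111111-2222-3333-4444-555555555555-1

derive_msk_arn "$EXAMPLE_CLUSTER_ARN" topic Topic-B
derive_msk_arn "$EXAMPLE_CLUSTER_ARN" group consumerapp-1-cg
```

The consumer group names consumerapp-1-cg and consumerapp-2-cg match the values used when running the consumer applications later in this post.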

Run the producer and consumer applications without setting quotas

Here, we run ProducerApp-1, ConsumerApp-1, and ConsumerApp-2 without setting their quotas. From the previous steps, you will need the BOOTSTRAP_BROKERS_IAM value, the Topic-B-Write-Role ARN, and the Topic-B-Read-Role ARN. The source code of the client applications and their packaged versions are available in the GitHub repository.

Run the ConsumerApp-1 application

To run the ConsumerApp-1 application, complete the following steps:

  1. On the Amazon EC2 console, select the ConsumerApp-1 EC2 instance and choose Connect.
  2. On the Session Manager tab, choose Connect.
  3. Run the following commands on the new tab that opens in your browser:
sudo su - ec2-user

# Set your AWS region
aws configure set region <aws region>

# Set BOOTSTRAP_BROKERS_IAM variable to MSK cluster's IAM endpoint
BOOTSTRAP_BROKERS_IAM=<Use the value of BOOTSTRAP_BROKERS_IAM that you noted earlier> 

echo "export BOOTSTRAP_BROKERS_IAM=$(echo $BOOTSTRAP_BROKERS_IAM)" >> .bash_profile

# Clone GitHub repository containing source code for client applications
git clone https://github.com/aws-samples/amazon-msk-kafka-quotas.git

cd amazon-msk-kafka-quotas/uber-jars/
  4. Run the ConsumerApp-1 application to start consuming messages from Topic-B:
java -jar kafka-consumer.jar --bootstrap-servers $BOOTSTRAP_BROKERS_IAM \
--assume-role-arn <Topic-B-Read-Role-ARN> \
--topic-name <Topic-Name> \
--region <AWS Region> \
--consumer-group <ConsumerApp-1 consumer group name> \
--role-session-name <role session name for ConsumerApp-1 to use during STS assume role call> \
--client-id <ConsumerApp-1 client.id> \
--print-consumer-quota-metrics Y \
--cw-dimension-name <CloudWatch Metrics Dimension Name> \
--cw-dimension-value <CloudWatch Metrics Dimension Value> \
--cw-namespace <CloudWatch Metrics Namespace>

You can find the source code on GitHub for your reference. The command line parameter details are as follows:

  • --bootstrap-servers – MSK cluster bootstrap brokers IAM endpoint.
  • --assume-role-arn – Topic-B-Read-Role IAM role ARN. Assuming this role, ConsumerApp-1 will read messages from the topic.
  • --region – Region you’re using.
  • --topic-name – Topic name from which ConsumerApp-1 will read messages. The default is Topic-B.
  • --consumer-group – Consumer group name for ConsumerApp-1, as specified during the stack update.
  • --role-session-name – ConsumerApp-1 assumes the Topic-B-Read-Role using the AWS Security Token Service (AWS STS) SDK. ConsumerApp-1 will use this role session name when calling the assumeRole function.
  • --client-id – Client ID for ConsumerApp-1.
  • --print-consumer-quota-metrics – Flag indicating whether client metrics should be printed on the terminal by ConsumerApp-1.
  • --cw-dimension-name – Amazon CloudWatch dimension name that will be used to publish client throttling metrics from ConsumerApp-1.
  • --cw-dimension-value – CloudWatch dimension value that will be used to publish client throttling metrics from ConsumerApp-1.
  • --cw-namespace – Namespace where ConsumerApp-1 will publish CloudWatch metrics in order to monitor throttling.
  5. If you’re satisfied with the rest of the parameters, use the following command and change --assume-role-arn and --region as per your environment:
java -jar kafka-consumer.jar --bootstrap-servers $BOOTSTRAP_BROKERS_IAM \
--assume-role-arn arn:aws:iam::111111111111:role/MSKStack-TopicBReadRole-xxxxxxxxxxx \
--topic-name Topic-B \
--region <AWS Region> \
--consumer-group consumerapp-1-cg \
--role-session-name consumerapp-1-role-session \
--client-id consumerapp-1-client-id \
--print-consumer-quota-metrics Y \
--cw-dimension-name ConsumerApp \
--cw-dimension-value ConsumerApp-1 \
--cw-namespace ConsumerApps

The fetch-throttle-time-avg and fetch-throttle-time-max client metrics should display 0.0, indicating no throttling is occurring for ConsumerApp-1. Remember that we haven’t set the consume quota for ConsumerApp-1 yet. Let it run for a while.

Run the ConsumerApp-2 application

To run the ConsumerApp-2 application, complete the following steps:

  1. On the Amazon EC2 console, select the ConsumerApp-2 EC2 instance and choose Connect.
  2. On the Session Manager tab, choose Connect.
  3. Run the following commands on the new tab that opens in your browser:
sudo su - ec2-user

# Set your AWS region
aws configure set region <aws region>

# Set BOOTSTRAP_BROKERS_IAM variable to MSK cluster's IAM endpoint
BOOTSTRAP_BROKERS_IAM=<Use the value of BOOTSTRAP_BROKERS_IAM that you noted earlier> 

echo "export BOOTSTRAP_BROKERS_IAM=$(echo $BOOTSTRAP_BROKERS_IAM)" >> .bash_profile

# Clone GitHub repository containing source code for client applications
git clone https://github.com/aws-samples/amazon-msk-kafka-quotas.git

cd amazon-msk-kafka-quotas/uber-jars/
  4. Run the ConsumerApp-2 application to start consuming messages from Topic-B:
java -jar kafka-consumer.jar --bootstrap-servers $BOOTSTRAP_BROKERS_IAM \
--assume-role-arn <Topic-B-Read-Role-ARN> \
--topic-name <Topic-Name> \
--region <AWS Region> \
--consumer-group <ConsumerApp-2 consumer group name> \
--role-session-name <role session name for ConsumerApp-2 to use during STS assume role call> \
--client-id <ConsumerApp-2 client.id> \
--print-consumer-quota-metrics Y \
--cw-dimension-name <CloudWatch Metrics Dimension Name> \
--cw-dimension-value <CloudWatch Metrics Dimension Value> \
--cw-namespace <CloudWatch Metrics Namespace>

The command line parameters are the same as those discussed previously for ConsumerApp-1, except for the following:

  • --consumer-group – Consumer group name for ConsumerApp-2, as specified during the stack update.
  • --role-session-name – ConsumerApp-2 assumes the Topic-B-Read-Role using the AWS STS SDK. ConsumerApp-2 will use this role session name when calling the assumeRole function.
  • --client-id – Client ID for ConsumerApp-2.
  5. If you’re satisfied with the rest of the parameters, use the following command and change --assume-role-arn and --region as per your environment:
java -jar kafka-consumer.jar --bootstrap-servers $BOOTSTRAP_BROKERS_IAM \
--assume-role-arn arn:aws:iam::111111111111:role/MSKStack-TopicBReadRole-xxxxxxxxxxx \
--topic-name Topic-B \
--region <AWS Region> \
--consumer-group consumerapp-2-cg \
--role-session-name consumerapp-2-role-session \
--client-id consumerapp-2-client-id \
--print-consumer-quota-metrics Y \
--cw-dimension-name ConsumerApp \
--cw-dimension-value ConsumerApp-2 \
--cw-namespace ConsumerApps

The fetch-throttle-time-avg and fetch-throttle-time-max client metrics should display 0.0, indicating no throttling is occurring for ConsumerApp-2. Remember that we haven’t set the consume quota for ConsumerApp-2 yet. Let it run for a while.

Run the ProducerApp-1 application

To run the ProducerApp-1 application, complete the following steps:

  1. On the Amazon EC2 console, select the ProducerApp-1 EC2 instance and choose Connect.
  2. On the Session Manager tab, choose Connect.
  3. Run the following commands on the new tab that opens in your browser:
sudo su - ec2-user

# Set your AWS region
aws configure set region <aws region>

# Set BOOTSTRAP_BROKERS_IAM variable to MSK cluster's IAM endpoint
BOOTSTRAP_BROKERS_IAM=<Use the value of BOOTSTRAP_BROKERS_IAM that you noted earlier> 

echo "export BOOTSTRAP_BROKERS_IAM=$(echo $BOOTSTRAP_BROKERS_IAM)" >> .bash_profile

# Clone GitHub repository containing source code for client applications
git clone https://github.com/aws-samples/amazon-msk-kafka-quotas.git

cd amazon-msk-kafka-quotas/uber-jars/
  4. Run the ProducerApp-1 application to start sending messages to Topic-B:
java -jar kafka-producer.jar --bootstrap-servers $BOOTSTRAP_BROKERS_IAM \
--assume-role-arn <Topic-B-Write-Role-ARN> \
--topic-name <Topic-Name> \
--region <AWS Region> \
--num-messages <Number of events> \
--role-session-name <role session name for ProducerApp-1 to use during STS assume role call> \
--client-id <ProducerApp-1 client.id> \
--producer-type <Producer Type, options are sync or async> \
--print-producer-quota-metrics Y \
--cw-dimension-name <CloudWatch Metrics Dimension Name> \
--cw-dimension-value <CloudWatch Metrics Dimension Value> \
--cw-namespace <CloudWatch Metrics Namespace>

You can find the source code on GitHub for your reference. The command line parameter details are as follows:

  • --bootstrap-servers – MSK cluster bootstrap brokers IAM endpoint.
  • --assume-role-arn – Topic-B-Write-Role IAM role ARN. Assuming this role, ProducerApp-1 will write messages to the topic.
  • --topic-name – ProducerApp-1 will send messages to this topic. The default is Topic-B.
  • --region – AWS Region you’re using.
  • --num-messages – Number of messages the ProducerApp-1 application will send to the topic.
  • --role-session-name – ProducerApp-1 assumes the Topic-B-Write-Role using the AWS STS SDK. ProducerApp-1 will use this role session name when calling the assumeRole function.
  • --client-id – Client ID of ProducerApp-1.
  • --producer-type – ProducerApp-1 can be run either synchronously or asynchronously. Options are sync or async.
  • --print-producer-quota-metrics – Flag indicating whether the client metrics should be printed on the terminal by ProducerApp-1.
  • --cw-dimension-name – CloudWatch dimension name that will be used to publish client throttling metrics from ProducerApp-1.
  • --cw-dimension-value – CloudWatch dimension value that will be used to publish client throttling metrics from ProducerApp-1.
  • --cw-namespace – The namespace where ProducerApp-1 will publish CloudWatch metrics in order to monitor throttling.
  5. If you’re satisfied with the rest of the parameters, use the following command and change --assume-role-arn and --region as per your environment. The option --producer-type sync runs a synchronous Kafka producer:
java -jar kafka-producer.jar --bootstrap-servers $BOOTSTRAP_BROKERS_IAM \
--assume-role-arn arn:aws:iam::111111111111:role/MSKStack-TopicBWriteRole-xxxxxxxxxxxx \
--topic-name Topic-B \
--region <AWS Region> \
--num-messages 10000000 \
--role-session-name producerapp-1-role-session \
--client-id producerapp-1-client-id \
--producer-type sync \
--print-producer-quota-metrics Y \
--cw-dimension-name ProducerApp \
--cw-dimension-value ProducerApp-1 \
--cw-namespace ProducerApps

Alternatively, use --producer-type async to run an asynchronous producer. For more details, refer to Asynchronous send.

The produce-throttle-time-avg and produce-throttle-time-max client metrics should display 0.0, indicating no throttling is occurring for ProducerApp-1. Remember that we haven’t set the produce quota for ProducerApp-1 yet. Check that ConsumerApp-1 and ConsumerApp-2 can consume messages and notice they are not throttled. Stop the consumer and producer client applications by pressing Ctrl+C in their respective browser tabs.

Set produce and consume quotas for client applications

Now that we have run the producer and consumer applications without quotas, we set their quotas and rerun them.

Open the Session Manager terminal for the MSKAdminInstance EC2 instance as described earlier and run the following commands to find the default configuration of one of the brokers in the MSK cluster. MSK clusters are provisioned with the default Kafka quotas configuration.

# Describe Broker-1 default configurations
kafka-configs.sh --bootstrap-server $BOOTSTRAP_BROKERS_IAM \
--command-config config_iam.properties \
--entity-type brokers \
--entity-name 1 \
--all --describe > broker1_default_configurations.txt
cat broker1_default_configurations.txt | grep quota.consumer.default
cat broker1_default_configurations.txt | grep quota.producer.default

The following screenshot shows the Broker-1 default values for quota.consumer.default and quota.producer.default.
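To read the values without relying on the screenshot, you can extract them from the saved describe output. The sketch below assumes each configuration is printed as a name=value token followed by other attributes, and that an unset default appears as 9223372036854775807 (the largest 64-bit signed integer, which Kafka uses to mean no limit). The function names and the sample line are illustrative, and the exact output format may vary by Kafka version.

```shell
# Value Kafka uses for "no limit" on the default quota settings (Long.MAX_VALUE).
UNLIMITED=9223372036854775807

# Read describe output on stdin and print the value of the named configuration.
quota_default() {
  awk -v key="$1" '{
    for (i = 1; i <= NF; i++) {
      split($i, kv, "=")
      if (kv[1] == key) { print kv[2]; exit }
    }
  }'
}

# Succeeds when the given value means that no default quota is enforced.
is_unlimited() {
  [ "$1" = "$UNLIMITED" ]
}

# Illustrative sample line; on the admin instance, read broker1_default_configurations.txt instead.
SAMPLE_LINE='  quota.consumer.default=9223372036854775807 sensitive=false synonyms={DEFAULT_CONFIG:quota.consumer.default=9223372036854775807}'
value=$(printf '%s\n' "$SAMPLE_LINE" | quota_default quota.consumer.default)
if is_unlimited "$value"; then echo "no default consume quota"; fi
```

On the admin instance, the equivalent call would be quota_default quota.consumer.default < broker1_default_configurations.txt.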

ProducerApp-1 quota configuration

Replace placeholders in all the commands in this section with values that correspond to your account.

According to the architecture diagram discussed earlier, set the ProducerApp-1 produce quota to 1024 bytes/second. For <ProducerApp-1 Client Id> and <ProducerApp-1 Role Session>, make sure you use the same values that you used while running ProducerApp-1 earlier (producerapp-1-client-id and producerapp-1-role-session, respectively):

kafka-configs.sh --bootstrap-server $BOOTSTRAP_BROKERS_IAM \
--command-config config_iam.properties \
--alter --add-config 'producer_byte_rate=1024' \
--entity-type clients --entity-name <ProducerApp-1 Client Id> \
--entity-type users --entity-name arn:aws:sts::<AWS Account Id>:assumed-role/MSKStack-TopicBWriteRole-xxxxxxxxxxx/<ProducerApp-1 Role Session>

Verify the ProducerApp-1 produce quota using the following command:

kafka-configs.sh --bootstrap-server $BOOTSTRAP_BROKERS_IAM \
--command-config config_iam.properties \
--describe \
--entity-type clients --entity-name <ProducerApp-1 Client Id> \
--entity-type users --entity-name arn:aws:sts::<AWS Account Id>:assumed-role/MSKStack-TopicBWriteRole-xxxxxxxxxxx/<ProducerApp-1 Role Session>

You can remove the ProducerApp-1 produce quota by using the following command, but don’t run the command as we’ll test the quotas next.

kafka-configs.sh --bootstrap-server $BOOTSTRAP_BROKERS_IAM \
--command-config config_iam.properties \
--alter --delete-config producer_byte_rate \
--entity-type clients --entity-name <ProducerApp-1 Client Id> \
--entity-type users --entity-name arn:aws:sts::<AWS Account Id>:assumed-role/MSKStack-TopicBWriteRole-xxxxxxxxxxx/<ProducerApp-1 Role Session>

ConsumerApp-1 quota configuration

Replace placeholders in all the commands in this section with values that correspond to your account.

Let’s set a consume quota of 5120 bytes/second for ConsumerApp-1. For <ConsumerApp-1 Client Id> and <ConsumerApp-1 Role Session>, make sure you use the same values that you used while running ConsumerApp-1 earlier (consumerapp-1-client-id and consumerapp-1-role-session, respectively):

kafka-configs.sh --bootstrap-server $BOOTSTRAP_BROKERS_IAM \
--command-config config_iam.properties \
--alter --add-config 'consumer_byte_rate=5120' \
--entity-type clients --entity-name <ConsumerApp-1 Client Id> \
--entity-type users --entity-name arn:aws:sts::<AWS Account Id>:assumed-role/MSKStack-TopicBReadRole-xxxxxxxxxxx/<ConsumerApp-1 Role Session>

Verify the ConsumerApp-1 consume quota using the following command:

kafka-configs.sh --bootstrap-server $BOOTSTRAP_BROKERS_IAM \
--command-config config_iam.properties \
--describe \
--entity-type clients --entity-name <ConsumerApp-1 Client Id> \
--entity-type users --entity-name arn:aws:sts::<AWS Account Id>:assumed-role/MSKStack-TopicBReadRole-xxxxxxxxxxx/<ConsumerApp-1 Role Session>

You can remove the ConsumerApp-1 consume quota by using the following command, but don’t run the command as we’ll test the quotas next.

kafka-configs.sh --bootstrap-server $BOOTSTRAP_BROKERS_IAM \
--command-config config_iam.properties \
--alter --delete-config consumer_byte_rate \
--entity-type clients --entity-name <ConsumerApp-1 Client Id> \
--entity-type users --entity-name arn:aws:sts::<AWS Account Id>:assumed-role/MSKStack-TopicBReadRole-xxxxxxxxxxx/<ConsumerApp-1 Role Session>

ConsumerApp-2 quota configuration

Replace placeholders in all the commands in this section with values that correspond to your account.

Let’s set a consume quota of 1024 bytes/second for ConsumerApp-2. For <ConsumerApp-2 Client Id> and <ConsumerApp-2 Role Session>, make sure you use the same values that you used while running ConsumerApp-2 earlier (consumerapp-2-client-id and consumerapp-2-role-session, respectively):

kafka-configs.sh --bootstrap-server $BOOTSTRAP_BROKERS_IAM \
--command-config config_iam.properties \
--alter --add-config 'consumer_byte_rate=1024' \
--entity-type clients --entity-name <ConsumerApp-2 Client Id> \
--entity-type users --entity-name arn:aws:sts::<AWS Account Id>:assumed-role/MSKStack-TopicBReadRole-xxxxxxxxxxx/<ConsumerApp-2 Role Session>

Verify the ConsumerApp-2 consume quota using the following command:

kafka-configs.sh --bootstrap-server $BOOTSTRAP_BROKERS_IAM \
--command-config config_iam.properties \
--describe \
--entity-type clients --entity-name <ConsumerApp-2 Client Id> \
--entity-type users --entity-name arn:aws:sts::<AWS Account Id>:assumed-role/MSKStack-TopicBReadRole-xxxxxxxxxxx/<ConsumerApp-2 Role Session>

As with ConsumerApp-1, you can remove the ConsumerApp-2 consume quota using the same command with ConsumerApp-2 client and user details.
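As a sketch of what that removal looks like, the helper below prints the command for a given client ID and user principal so you can review it before running it. It only prints text and doesn't contact the cluster. The function name print_delete_consume_quota is hypothetical, and the role name suffix and account ID remain placeholders.

```shell
# Print (do not run) the kafka-configs.sh command that removes a consume quota.
print_delete_consume_quota() {
  client_id=$1   # client ID used by the consumer application
  principal=$2   # assumed-role ARN used as the user entity
  printf '%s\n' "kafka-configs.sh --bootstrap-server \$BOOTSTRAP_BROKERS_IAM --command-config config_iam.properties --alter --delete-config consumer_byte_rate --entity-type clients --entity-name $client_id --entity-type users --entity-name $principal"
}

print_delete_consume_quota consumerapp-2-client-id \
  "arn:aws:sts::<AWS Account Id>:assumed-role/MSKStack-TopicBReadRole-xxxxxxxxxxx/consumerapp-2-role-session"
```

As with the other removal commands in this section, don’t run the printed command yet, because we test the quotas next.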

Rerun the producer and consumer applications after setting quotas

Let’s rerun the applications to verify the effect of the quotas.

Rerun ProducerApp-1

Rerun ProducerApp-1 in synchronous mode with the same command that you used earlier. The following screenshot illustrates that when ProducerApp-1 reaches its quota on any of the brokers, the produce-throttle-time-avg and produce-throttle-time-max client metric values will be above 0.0. A value above 0.0 indicates that ProducerApp-1 is throttled. Allow ProducerApp-1 to run for a few seconds and then stop it by using Ctrl+C.

You can also test the effect of the produce quota by rerunning ProducerApp-1 in asynchronous mode (--producer-type async). Similar to a synchronous run, the following screenshot illustrates that when ProducerApp-1 reaches its quota on any of the brokers, the produce-throttle-time-avg and produce-throttle-time-max client metric values will be above 0.0. A value above 0.0 indicates that ProducerApp-1 is throttled. Allow asynchronous ProducerApp-1 to run for a while.

You will eventually see a TimeoutException stating org.apache.kafka.common.errors.TimeoutException: Expiring xxxxx record(s) for Topic-B-2:xxxxxxx ms has passed since batch creation

When using an asynchronous producer and sending messages at a rate greater than the broker can accept due to the quota, the messages will be queued in the client application memory first. The client will eventually run out of buffer space if the rate of sending messages continues to exceed the rate of accepting messages, causing the next Producer.send() call to be blocked. Producer.send() will eventually throw a TimeoutException if the timeout delay is not sufficient to allow the broker to catch up to the producer application. Stop ProducerApp-1 by using Ctrl+C.
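To get a feel for how quickly the client buffer fills, the following rough estimate divides the buffer size by the difference between the send rate and the rate the brokers accept. It is a simplified sketch that ignores batching and compression. The function name is hypothetical, 33554432 bytes is the Kafka producer's default buffer.memory, the 1 MiB per second send rate is an assumed figure, and 3072 bytes per second assumes the 1024 bytes/second quota is reached on all three brokers.

```shell
# Estimate how many seconds pass before the producer's buffer is full.
seconds_until_buffer_full() {
  buffer_bytes=$1   # producer buffer.memory in bytes
  send_rate=$2      # bytes per second the application hands to send() (assumed figure)
  accept_rate=$3    # bytes per second the brokers accept under the quota
  if [ "$send_rate" -le "$accept_rate" ]; then
    echo "never"    # the brokers keep up, so the buffer does not fill
    return 0
  fi
  echo $(( buffer_bytes / (send_rate - accept_rate) ))
}

seconds_until_buffer_full 33554432 1048576 3072   # prints 32
```

After the buffer is full, each send() call can block for up to the producer's max.block.ms setting before failing.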

Rerun ConsumerApp-1

Rerun ConsumerApp-1 with the same command that you used earlier. The following screenshot illustrates that when ConsumerApp-1 reaches its quota, the fetch-throttle-time-avg and fetch-throttle-time-max client metric values rise above 0.0, which indicates that ConsumerApp-1 is throttled.

Allow ConsumerApp-1 to run for a few seconds and then stop it by using Ctrl+C.

Rerun ConsumerApp-2

Rerun ConsumerApp-2 with the same command that you used earlier. Similarly, when ConsumerApp-2 reaches its quota, the fetch-throttle-time-avg and fetch-throttle-time-max client metric values rise above 0.0, which indicates that ConsumerApp-2 is throttled. Allow ConsumerApp-2 to run for a few seconds and then stop it by pressing Ctrl+C.

Client quota metrics in Amazon CloudWatch

In Part 1, we explained that client metrics are metrics exposed by clients connecting to Kafka clusters. Let’s examine the client metrics in CloudWatch.

  1. On the CloudWatch console, choose All metrics.
  2. Under Custom Namespaces, choose the namespace you provided while running the client applications.
  3. Choose the dimension name and select produce-throttle-time-max, produce-throttle-time-avg, fetch-throttle-time-max, and fetch-throttle-time-avg metrics for all the applications.

These metrics indicate throttling behavior for ProducerApp-1, ConsumerApp-1, and ConsumerApp-2 applications tested with the quota configurations in the previous section. The following screenshots indicate the throttling of ProducerApp-1, ConsumerApp-1, and ConsumerApp-2 based on network bandwidth quotas. ProducerApp-1, ConsumerApp-1, and ConsumerApp-2 applications feed their respective client metrics to CloudWatch. You can find the source code on GitHub for your reference.
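The rule used throughout this section (a throttle-time value above 0.0 means the client is throttled) can be applied to exported metric samples with a few lines of code. The following is a minimal sketch; the input shape (a dictionary of client name to metric samples) and the function name are assumptions, and the code does not call CloudWatch.

```python
THROTTLE_METRICS = (
    "produce-throttle-time-avg",
    "produce-throttle-time-max",
    "fetch-throttle-time-avg",
    "fetch-throttle-time-max",
)


def throttled_clients(samples):
    """Return the sorted names of clients with any throttle-time sample above 0.0.

    samples maps a client name to {metric_name: [values, ...]}.
    """
    throttled = []
    for client, metrics in samples.items():
        for name in THROTTLE_METRICS:
            if any(value > 0.0 for value in metrics.get(name, [])):
                throttled.append(client)
                break  # one throttled metric is enough for this client
    return sorted(throttled)
```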

Secure client ID and role session name

We discussed how to configure Kafka quotas using an application’s client ID and authenticated user principal. When a client application assumes an IAM role to access Kafka topics on an MSK cluster with IAM authentication enabled, its authenticated user principal is represented in the following format (for more information, refer to IAM identifiers):

arn:aws:sts::111111111111:assumed-role/Topic-B-Write-Role/producerapp-1-role-session

It contains the role session name (in this case, producerapp-1-role-session) used in the client application while assuming an IAM role through the AWS STS SDK. The client application source code is available for your reference. The client ID is a logical name string (for example, producerapp-1-client-id) that is configured in the application code by the application team. Therefore, an application can impersonate another application if it obtains the client ID and role session name of the other application, and if it has permission to assume the same IAM role.
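To make the structure of the principal concrete, the following sketch splits an assumed-role ARN of the form shown above into its account ID, role name, and role session name. The function name and the returned dictionary keys are assumptions chosen for illustration.

```python
def parse_assumed_role_arn(arn):
    """Split arn:aws:sts::<account>:assumed-role/<role>/<session> into its parts."""
    prefix, marker, resource = arn.partition(":assumed-role/")
    parts = prefix.split(":")
    if not marker or len(parts) != 5 or parts[0] != "arn" or parts[2] != "sts":
        raise ValueError(f"not an assumed-role ARN: {arn!r}")
    role_name, slash, session_name = resource.partition("/")
    if not slash or not role_name or not session_name:
        raise ValueError(f"missing role or session name: {arn!r}")
    return {
        "account_id": parts[4],
        "role_name": role_name,
        "role_session_name": session_name,
    }
```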

As shown in the architecture diagram, ConsumerApp-1 and ConsumerApp-2 are two separate client applications with their respective quota allocations. Because both have permission to assume the same IAM role (Topic-B-Read-Role) in the demo account, they are allowed to consume messages from Topic-B. Thus, MSK cluster brokers distinguish them based on their client IDs and users (which contain their respective role session name values). If ConsumerApp-2 somehow obtains the ConsumerApp-1 role session name and client ID, it can impersonate ConsumerApp-1 by specifying the ConsumerApp-1 role session name and client ID in the application code.

Let’s assume ConsumerApp-1 uses consumerapp-1-client-id and consumerapp-1-role-session as its client ID and role session name, respectively. Therefore, ConsumerApp-1's authenticated user principal will appear as follows when it assumes the Topic-B-Read-Role IAM role:

arn:aws:sts::<AWS Account Id>:assumed-role/Topic-B-Read-Role/consumerapp-1-role-session

Similarly, ConsumerApp-2 uses consumerapp-2-client-id and consumerapp-2-role-session as its client ID and role session name, respectively. Therefore, ConsumerApp-2's authenticated user principal will appear as follows when it assumes the Topic-B-Read-Role IAM role:

arn:aws:sts::<AWS Account Id>:assumed-role/Topic-B-Read-Role/consumerapp-2-role-session

If ConsumerApp-2 obtains ConsumerApp-1's client ID and role session name and specifies them in its application code, MSK cluster brokers will treat it as ConsumerApp-1 and view its client ID as consumerapp-1-client-id, and the authenticated user principal as follows:

arn:aws:sts::<AWS Account Id>:assumed-role/Topic-B-Read-Role/consumerapp-1-role-session

This allows ConsumerApp-2 to consume data from the MSK cluster at a maximum rate of 5120 bytes per second rather than 1024 bytes per second as per its original quota allocation. Consequently, ConsumerApp-1's throughput will be negatively impacted if ConsumerApp-2 runs concurrently.
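The impersonation risk can be illustrated with a toy lookup keyed by user principal and client ID. This is a sketch only: it models an exact (user, client ID) match and ignores the broader precedence rules Kafka applies when resolving quotas, and the account ID 111111111111 is a placeholder.

```python
READ_ROLE = "arn:aws:sts::111111111111:assumed-role/Topic-B-Read-Role/"

# Consumer byte-rate quotas (bytes per second) keyed by (user principal, client ID).
CONSUMER_QUOTAS = {
    (READ_ROLE + "consumerapp-1-role-session", "consumerapp-1-client-id"): 5120,
    (READ_ROLE + "consumerapp-2-role-session", "consumerapp-2-client-id"): 1024,
}


def consumer_byte_rate(user_principal, client_id):
    """Return the quota for the identity a client presents, or None if no quota matches."""
    return CONSUMER_QUOTAS.get((user_principal, client_id))


# ConsumerApp-2 presenting its own identity.
own_quota = consumer_byte_rate(
    READ_ROLE + "consumerapp-2-role-session", "consumerapp-2-client-id"
)

# ConsumerApp-2 presenting ConsumerApp-1's role session name and client ID.
# The lookup cannot tell the two applications apart.
impersonated_quota = consumer_byte_rate(
    READ_ROLE + "consumerapp-1-role-session", "consumerapp-1-client-id"
)
```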

Enhanced architecture

You can introduce AWS Secrets Manager and AWS Key Management Service (AWS KMS) in the architecture to secure applications’ client IDs and role session names. To provide stronger governance, each application’s client ID and role session name must be stored as encrypted secrets in Secrets Manager. The IAM resource policies associated with the encrypted secrets and a KMS customer managed key (CMK) allow each application to access and decrypt only its own client ID and role session name. In this way, applications can’t access each other’s client ID and role session name and impersonate one another. The following image shows the enhanced architecture.

The updated flow has the following stages:

  • P1 – ProducerApp-1 retrieves its client-id and role-session-name secrets from Secrets Manager
  • P2 – ProducerApp-1 configures the secret client-id as CLIENT_ID_CONFIG in the application code, and assumes Topic-B-Write-Role (via its ProducerApp-1-Role IAM role) by passing the secret role-session-name to the AWS STS SDK assumeRole function call
  • P3 – With the Topic-B-Write-Role IAM role assumed, ProducerApp-1 begins sending messages to Topic-B
  • C1 – ConsumerApp-1 and ConsumerApp-2 retrieve their respective client-id and role-session-name secrets from Secrets Manager
  • C2 – ConsumerApp-1 and ConsumerApp-2 configure their respective secret client-id as CLIENT_ID_CONFIG in their application code, and assume Topic-B-Read-Role (via ConsumerApp-1-Role and ConsumerApp-2-Role IAM roles, respectively) by passing their secret role-session-name in the AWS STS SDK assumeRole function call
  • C3 – With the Topic-B-Read-Role IAM role assumed, ConsumerApp-1 and ConsumerApp-2 start consuming messages from Topic-B
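The retrieve-then-assume stages above can be sketched with boto3 as follows. This is an illustration under stated assumptions, not the post's sample application (which is written against the AWS SDK for the client applications): the secret ID, the JSON layout of the secret (keys client-id and role-session-name), and the role ARN are all hypothetical. The helper functions accept client objects so they can be exercised without AWS access.

```python
import json


def load_identity(secrets_client, secret_id):
    """Fetch the client ID and role session name from one Secrets Manager secret.

    Assumes the secret value is a JSON object with the keys
    "client-id" and "role-session-name" (a hypothetical layout).
    """
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    return secret["client-id"], secret["role-session-name"]


def assume_topic_role(sts_client, role_arn, role_session_name):
    """Assume the topic role using the secret role session name."""
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=role_session_name,
    )
    return response["Credentials"]


def main():
    # Requires boto3 and AWS credentials for the application's own role.
    import boto3

    client_id, session_name = load_identity(
        boto3.client("secretsmanager"),
        "producerapp-1/identity",  # hypothetical secret ID
    )
    credentials = assume_topic_role(
        boto3.client("sts"),
        "arn:aws:iam::111111111111:role/Topic-B-Write-Role",  # placeholder account ID
        session_name,
    )
    # client_id would be set as the Kafka client ID; credentials are used for IAM auth.
    return client_id, credentials
```

Because each application can read only its own secret, the role session name passed to assume_role is no longer a value that another application can copy from source code or configuration.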

Refer to the documentation for AWS Secrets Manager and AWS KMS to get a better understanding of how they fit into the architecture.

Clean up resources

Navigate to the CloudFormation console and delete the MSKStack stack. All resources created during this post will be deleted.

Conclusion

In this post, we covered detailed steps to configure Amazon MSK quotas and demonstrated their effect through sample client applications. In addition, we discussed how you can use client metrics to determine if a client application is throttled. We also highlighted a potential issue with plaintext client IDs and role session names. We recommend implementing Kafka quotas with Amazon MSK using Secrets Manager and AWS KMS as per the revised architecture diagram to ensure a zero-trust architecture.

If you have feedback or questions about this post, including the revised architecture, we’d be happy to hear from you. We hope you enjoyed reading this post.


About the Author

Vikas Bajaj is a Senior Manager, Solutions Architects, Financial Services at Amazon Web Services. With over two decades of experience in financial services and working with digital-native businesses, he advises customers on product design, technology roadmaps, and application architectures.

Optimize queries using dataset parameters in Amazon QuickSight

Post Syndicated from Anwar Ali original https://aws.amazon.com/blogs/big-data/optimize-queries-using-dataset-parameters-in-amazon-quicksight/

Amazon QuickSight powers data-driven organizations with unified business intelligence (BI) at hyperscale. With QuickSight, all users can meet varying analytic needs from the same source of truth through modern interactive dashboards, paginated reports, embedded analytics and natural language queries.

We have introduced dataset parameters, a new kind of parameter in QuickSight that can help you create interactive experiences in your dashboards. In this post, we dive deeper into what dataset parameters are, explain the key differences between dataset and analysis parameters, and discuss different use cases for dataset parameters along with their benefits.

Introduction to dataset parameters

Before going deep into dataset parameters, let’s first discuss QuickSight analysis parameters. QuickSight analysis parameters are named variables that can transfer a value for use by an action or an object. Parameters help users create interactive experiences in their dashboards. You can tie parameters with other features in the QuickSight analysis. For example, a dashboard user can reference a parameter value in multiple places, using controls, filters, and actions, and also within calculated fields, narratives, and dynamic titles. Then the visuals in the dashboard react to the user’s selection of parameter value. Parameters can also help connect one dashboard to another, allowing a dashboard user to drill down into data that’s in a different analysis.

Dataset parameters, on the other hand, are defined at the dataset level. With dataset parameters, authors can optimize the experience and load time of dashboards that are connected live to external SQL-based sources. When readers interact with their data, the selection and actions they make in controls, filters, and visuals can be propagated to the data sources via live, custom, parameterized SQL queries. By mapping multiple dataset parameters to analysis parameters, users can create a wide variety of experiences using controls, user actions, parameterized URLs, and calculated fields, as well as dynamic visuals’ titles and insights.

In the following example, dataset owners connect via direct query to a table containing data about taxi rides in New York. They can add a WHERE clause to their custom SQL to filter the dataset by a specific pickup date that dashboard readers provide later. The query keeps only the rows whose pickupdate column matches the date in the dataset parameter <<$pPickupDate>>. This way, the dataset can be significantly smaller for users who are only interested in data for a specific taxi ride date. See the following code:

SELECT *
FROM nytaxidata
WHERE pickupdate = <<$pPickupDate>>

To allow users to provide multiple values in the parameter, you can create a multi-value parameter (for example, pPickupDates), and insert the parameter into an IN clause as follows:

SELECT *
FROM nytaxidata
WHERE pickupdate in (<<$pPickupDates>>)
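As an analogy for how a multi-value parameter widens the filter, the following sketch runs the same IN query against an in-memory SQLite table, expanding a Python list into bound placeholders. It does not show how QuickSight substitutes <<$pPickupDates>> internally; the sample rows, the fare_amount values, and the function names are assumptions.

```python
import sqlite3


def build_sample_db():
    """Create a tiny in-memory stand-in for the nytaxidata table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nytaxidata (pickupdate TEXT, fare_amount REAL)")
    conn.executemany(
        "INSERT INTO nytaxidata VALUES (?, ?)",
        [
            ("2023-01-01", 12.5),
            ("2023-01-02", 30.0),
            ("2023-01-03", 8.0),
        ],
    )
    return conn


def query_pickup_dates(conn, pickup_dates):
    """Return rows whose pickupdate is one of the supplied values."""
    if not pickup_dates:
        return []
    # One "?" per value keeps the values bound rather than concatenated into the SQL.
    placeholders = ", ".join("?" for _ in pickup_dates)
    sql = (
        "SELECT pickupdate, fare_amount FROM nytaxidata "
        f"WHERE pickupdate IN ({placeholders}) ORDER BY pickupdate"
    )
    return conn.execute(sql, list(pickup_dates)).fetchall()
```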

Use cases for dataset parameters

In this section, we discuss common use cases using dataset parameters and their benefits.

Optimized custom SQL in direct queries

With dataset parameters, you no longer have to trade off between the flexibility of using custom SQL logic and the performance of an optimized SQL query. Parameterized datasets can be filtered to a relatively smaller result set when loaded. Authors and readers benefit from faster loading of analyses and dashboards, both on first load using default values and on later queries when data is sliced and diced using filter controls on the dashboard. Data owners also benefit because their datasets put less load on backend database resources, making them more scalable and performant when serving higher user concurrency.

The performance gains will be evident when you work with direct query datasets that have complex custom SQL, such as nested queries that have to filter the data in the inner sections of the query.

Generic datasets reusable across analyses

Dataset parameters enable datasets to be reused across many analyses, reducing the effort for data owners to prepare and maintain them. Whether you have a SPICE dataset or a direct query dataset, with dataset parameters you can port calculated fields that reference parameters from the analysis to the dataset. Authors can then reuse the calculated fields that dataset owners created in the dataset, rather than recreating these fields across multiple analyses.

With the option to port parameter-dependent calculated fields from the analysis to the underlying datasets, dataset parameters can help you create the same calculated fields in the dataset and reuse them across multiple analyses. This is important for governance use cases as well: dataset owners can move the parameter-dependent calculated fields from the analysis to protect the business logic, ensuring that their calculated fields can’t be modified by analyses’ authors.

Simpler dataset maintenance with repeatable variables

When you have a dataset that refers to a static value (placeholder) in multiple places in custom SQL and calculated fields, you can now create a dataset parameter and reuse it in multiple places. This will help in better code maintainability. (Note that inserting parameters in custom SQL is only available in direct query.)

Solution overview

In this scenario, we create a custom SQL direct query dataset to observe unoptimized SQL queries that are generated without dataset parameters, and demonstrate how your current custom SQL queries run if you don’t use dataset parameters. Then we modify the custom SQL, add the dataset parameter, and show the optimized query generated for the same dataset if we use dataset parameters.

In this example, we use an Amazon RDS for PostgreSQL database. However, this feature will work with any SQL-based data source in QuickSight.

Query your data with analysis parameters

To set up your data source, dataset, and analysis, complete the following steps. If you’re using real data, you can skip to the next section.

  1. Create a QuickSight data source.

The following screenshot shows sample connection details.

create a datasource

  2. Create a new direct query custom SQL dataset.

We are using sample data from NYC OpenData for New York taxi rides with a subset of approximately 1 million records. The data is loaded in an RDS for PostgreSQL database table called nytaxidata.

create a sample dataset nytaxidata

  3. Create a sample analysis using the dataset you just created. Choose the table visual and add a few columns from the Fields list.

create a sample analysis using nytaxidata dataset

  4. Reload the analysis and observe the query generated on the PostgreSQL database.

You will notice it loads the full dataset (select * from nytaxidata), as shown in the following screenshot from Amazon RDS Performance Insights.

SQL from performance insight, unoptimized SQL inner query without where clause

  5. Add an analysis parameter-based filter control to the QuickSight analysis. Change the value of this filter control (analysis parameter in this case).

creating analysis parameter with a control

The inner query over the dataset still uses custom SQL without using the filter in the WHERE clause. This filter control parameter is still part of the WHERE clause of the outer query, so the custom SQL fetches the complete result set as part of the inner query. This may not be the case if you use database tables as a dataset rather than a custom SQL query as a dataset. With a dataset based directly on tables, parameter values are passed to the database in the WHERE clause.

SQL from performance insight, unoptimized SQL inner query without where clause with analysis parameter

So how do we overcome the challenge of being able to include the parameter in the WHERE clause in custom SQL datasets? With dataset parameters!

Optimize your query with dataset parameters

Let’s look at a few scenarios where we can use dataset parameters to send more optimized queries to the database.

  1. Create a dataset parameter (for example, pDSfareamount) and add it to the WHERE clause with an equality predicate in the custom SQL. Observe whether the SQL query passed to the database changes.

creating dataset parameter

This time, you will see optimized SQL generated using the default parameter value in the WHERE clause of the inner query (select * from nytaxidata where fare_amount=0). This results in better query performance for direct query datasets.

optimized sql generated with dataset parameter
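The difference between the two query shapes can be reproduced on a small in-memory SQLite table. This sketch is intended only to show the shapes and that they select the same rows; it says nothing about how PostgreSQL plans either query, and the sample rows and helper names are assumptions.

```python
import sqlite3

# Filter applied by the outer query only: the custom SQL (inner query) is unfiltered.
OUTER_FILTER = (
    "SELECT pickupdate, fare_amount FROM (SELECT * FROM nytaxidata) AS t "
    "WHERE fare_amount = ? ORDER BY pickupdate"
)

# Filter applied inside the custom SQL, as happens with a dataset parameter.
INNER_FILTER = (
    "SELECT pickupdate, fare_amount FROM "
    "(SELECT * FROM nytaxidata WHERE fare_amount = ?) AS t ORDER BY pickupdate"
)


def build_sample_db():
    """Create a tiny in-memory stand-in for the nytaxidata table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nytaxidata (pickupdate TEXT, fare_amount REAL)")
    conn.executemany(
        "INSERT INTO nytaxidata VALUES (?, ?)",
        [
            ("2023-01-01", 0.0),
            ("2023-01-01", 12.5),
            ("2023-01-02", 0.0),
            ("2023-01-03", 30.0),
        ],
    )
    return conn


def inner_row_count(conn, inner_sql, params=()):
    """Count the rows the inner (custom SQL) query produces on its own."""
    return conn.execute(f"SELECT COUNT(*) FROM ({inner_sql})", params).fetchone()[0]
```

Both statements return the same rows for a given fare amount, but the inner query of the second shape produces only the matching rows, which is the effect observed in the screenshot for the default value fare_amount=0.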

Map dataset parameters with analysis parameters

Dataset parameters can be mapped to analysis parameters and user-selected values can pass to the dataset parameters from the interactions on the dashboard at run time.

You can use a single analysis parameter and map it to multiple dataset parameters. The parent analysis parameter can now be linked with a filter control or an action, and can help you filter multiple datasets based on custom SQL.

In this section, we map a dataset parameter with an analysis parameter and bind it with a filter control at runtime.

  1. First, we create an analysis parameter and map it to a dataset parameter (we use the dataset parameter we created earlier).

mapping analysis parameter with a dataset parameter

  2. Now the analysis parameter (for this example, pAfareamount) is created. You can create the control object Fare Amount to dynamically change the dataset parameter value from the analysis or dashboard using a parameter control. You can bind pAfareamount with a QuickSight filter to pass values to the dataset parameter dynamically. When you change values in the parameter control, the backend database receives optimized SQL with the WHERE predicate in the inner query.

changing value of analysis parameter mapped to dataset parameter via filter control
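The fan-out from one analysis parameter to several dataset parameters can be pictured with a small conceptual model. This is not the QuickSight API: the function name, the data structures, and the second dataset name (nytaxi_summary) are hypothetical, while pAfareamount and pDSfareamount are the parameter names used in this post.

```python
def resolve_dataset_parameters(analysis_values, mappings, dataset_defaults):
    """Return the effective value of every dataset parameter.

    analysis_values:  {analysis_parameter: value chosen on the dashboard}
    mappings:         {analysis_parameter: [(dataset, dataset_parameter), ...]}
    dataset_defaults: {(dataset, dataset_parameter): default value}
    """
    resolved = dict(dataset_defaults)  # start from the dataset-level defaults
    for analysis_parameter, targets in mappings.items():
        if analysis_parameter in analysis_values:
            for target in targets:
                resolved[target] = analysis_values[analysis_parameter]
    return resolved


MAPPINGS = {
    "pAfareamount": [
        ("nytaxidata", "pDSfareamount"),
        ("nytaxi_summary", "pDSfareamount"),  # hypothetical second dataset
    ]
}
DEFAULTS = {
    ("nytaxidata", "pDSfareamount"): 0,
    ("nytaxi_summary", "pDSfareamount"): 0,
}
```

With no selection on the dashboard, every dataset parameter keeps its default; once a reader picks a value for pAfareamount, the same value reaches each mapped dataset parameter.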

Additional examples using dataset parameters

So far, we have used dataset parameters with an equality predicate. Let’s look at a few more scenarios using dataset parameters.

  1. The following screenshot demonstrates using a dataset parameter with a range predicate of custom SQL.

dataset parameter with non equality predicate

  2. The following example illustrates using two dataset parameters with a between operator.

two dataset parameters with between operator

  3. The following example shows using a dataset parameter within a calculation.

dataset parameter used in calculated field based on ifelse condition

  4. We can also use a dataset parameter with a scalar user-defined function (UDF). In the following example, we have a scalar function is_holiday(pickupdate), which takes a pickupdate as a parameter and returns a flag of 0 or 1 based on whether pickupdate is a public holiday.

dataset parameter used with scalar user defined function

  5. Additionally, we can use a dataset parameter to derive a calculated field. In the following example, we need to calculate the surcharge_amount dynamically based on a value specified at runtime and the number of passengers. We use a dataset parameter along with a case statement to calculate the desired surcharge_amount.

dataset parameter with calculated field case statement

  6. The final example illustrates how to move calculations using parameters in the analysis to the dataset for reusability.

porting dataset parameter from analysis to dataset
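Several of the scenarios in the list above (range predicate, between operator, scalar UDF, and case statement) can be tried locally with SQLite named parameters standing in for dataset parameters. This is a sketch under assumptions: the parameter names, the sample rows, the holiday list, and the surcharge rule (a per-passenger amount when more than two passengers ride) are all hypothetical, and is_holiday is registered as a Python function rather than a database UDF.

```python
import sqlite3

HOLIDAYS = {"2023-01-01", "2023-07-04"}  # hypothetical holiday calendar


def is_holiday(pickupdate):
    """Return 1 if pickupdate is in the holiday calendar, otherwise 0."""
    return 1 if pickupdate in HOLIDAYS else 0


def build_db():
    conn = sqlite3.connect(":memory:")
    conn.create_function("is_holiday", 1, is_holiday)  # scalar UDF with one argument
    conn.execute(
        "CREATE TABLE nytaxidata "
        "(pickupdate TEXT, fare_amount REAL, passenger_count INTEGER)"
    )
    conn.executemany(
        "INSERT INTO nytaxidata VALUES (?, ?, ?)",
        [
            ("2023-01-01", 12.5, 1),
            ("2023-01-02", 30.0, 3),
            ("2023-07-04", 8.0, 2),
            ("2023-07-05", 55.0, 4),
        ],
    )
    return conn


# Range predicate with one parameter.
RANGE_SQL = (
    "SELECT pickupdate FROM nytaxidata "
    "WHERE fare_amount >= :pMinFare ORDER BY pickupdate"
)

# Two parameters with BETWEEN.
BETWEEN_SQL = (
    "SELECT pickupdate FROM nytaxidata "
    "WHERE fare_amount BETWEEN :pMinFare AND :pMaxFare ORDER BY pickupdate"
)

# Parameter compared with the result of a scalar UDF.
UDF_SQL = (
    "SELECT pickupdate FROM nytaxidata "
    "WHERE is_holiday(pickupdate) = :pHolidayFlag ORDER BY pickupdate"
)

# Parameter used inside a CASE expression to derive surcharge_amount.
CASE_SQL = (
    "SELECT pickupdate, "
    "CASE WHEN passenger_count > 2 THEN :pSurcharge * passenger_count ELSE 0 END "
    "AS surcharge_amount FROM nytaxidata ORDER BY pickupdate"
)


def run(conn, sql, params):
    """Execute a parameterized statement and return all rows."""
    return conn.execute(sql, params).fetchall()
```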

Dataset parameter limitations

The following are the known limitations (as of this writing) that you may encounter when working with dataset parameters in QuickSight:

  • Dataset parameters can’t be inserted into custom SQL of datasets stored in SPICE.
  • Dynamic defaults can only be configured on the analysis page of the analysis that is using the dataset. You can’t configure a dynamic default at the dataset level.
  • The Select all option is not supported on multi-value controls of analysis parameters that are mapped to dataset parameters (but there is a workaround that you can follow).
  • Cascading controls are not supported for dataset parameters.
  • Dataset parameters can only be used by dataset filters when the dataset is using a direct query.
  • When dashboard readers schedule emailed reports, selected controls don’t propagate to the dataset parameters that are included in the report that is attached to the email. Instead, the default values of the parameters are used.

Refer to Using dataset parameters in Amazon QuickSight for more information.

Conclusion

In this post, we showed you how to create QuickSight dataset parameters and map them to analysis parameters. Dataset parameters help improve your QuickSight dashboard performance for direct query custom SQL datasets by generating optimized SQL queries. We also showed a few examples of how to use dataset parameters in SQL range predicates, calculated fields, scalar UDFs, and case statements.

Dataset parameters enable dataset owners to centrally create and govern parameter-dependent calculated fields at the dataset level. Such calculated fields can be reused across multiple analyses, and cannot be tampered with by analysis authors.

We hope you will find dataset parameters in QuickSight useful. We have already seen how the feature is creatively used in a wide range of use cases. We recommend that you review your existing direct query custom SQL datasets in your QuickSight deployment to look for candidates for optimization, or take advantage of the other benefits of dataset parameters. For example, BI teams can benefit from dataset parameters by reusing the same dataset with different values in the parameter to analyze different slices of the same data, such as different regions, products, or customers by industry segments.

Are you considering migrating legacy reports to QuickSight? Dataset parameters can help enterprise BI developers reduce the effort of migrating legacy reports that already use parameterized SQL queries. These SQL queries can be passed along with their parameters to QuickSight datasets via automations with the help of QuickSight APIs (and a few adjustments to the queries if the parameters are marked differently).

For more information on dataset parameters, refer to Using dataset parameters in Amazon QuickSight.


About the authors

Anwar Ali is a Specialist Solutions Architect for Amazon QuickSight. Anwar has over 18 years of experience implementing enterprise business intelligence (BI), data analytics, and database solutions. He specializes in integration of BI solutions with business applications, helping customers in BI architecture design patterns and best practices.

Salim Khan is a Specialist Solutions Architect for Amazon QuickSight. Salim has over 16 years of experience implementing enterprise business intelligence (BI) solutions. Prior to AWS, Salim worked as a BI consultant catering to industry verticals like Automotive, Healthcare, Entertainment, Consumer, Publishing and Financial Services. He has delivered business intelligence, data warehousing, data integration and master data management solutions across enterprises.

Gil Raviv is a Principal Product Manager for Amazon QuickSight, AWS’ cloud-native, fully managed SaaS BI service. As a thought-leader in BI, Gil accelerated the growth of global BI practices at AWS and Avanade, and has guided Fortune 1000 enterprises in their Data & AI journey. As a passionate evangelist, author and blogger of low-code/no-code data prep and analytic tools, Gil was awarded 5 times as a Microsoft MVP (Most Valuable Professional).

Best practices for enabling business users to answer questions about data using natural language in Amazon QuickSight

Post Syndicated from Amy Laresch original https://aws.amazon.com/blogs/big-data/best-practices-for-enabling-business-users-to-answer-questions-about-data-using-natural-language-in-amazon-quicksight/

In this post, we explain how you can enable business users to ask and answer questions about data using their everyday business language by using the Amazon QuickSight natural language query function, Amazon QuickSight Q.

QuickSight is a unified BI service providing modern interactive dashboards, natural language querying, paginated reports, machine learning (ML) insights, and embedded analytics at scale. Powered by ML, Q uses natural language processing (NLP) to answer your business questions quickly. Q empowers any user in an organization to start asking questions using their own language. Q uses the same QuickSight datasets you use for your dashboards and reports so your data is governed and secured. Just as data is prepared visually using dashboards and reports, it can be readied for language-based interactions using a topic. Topics are collections of one or more datasets that represent a subject area that your business users can ask questions about. To learn how to create a topic, refer to Creating Amazon QuickSight Q topics.

With automated data preparation in QuickSight Q, the model will do a lot of the topic setup for you, but there is some context that is specific to your business that you need to provide. To learn more about the initial setup work that Q does behind the scenes, check out New – Announcing Automated Data Preparation for Amazon QuickSight Q.

Business users can access Q from the QuickSight console or embedded in your website or application. To learn how to embed the Q bar, refer to Embedding the Amazon QuickSight Q search bar for registered users or anonymous (unregistered) users. To see examples of embedded dashboards with Q, refer to the QuickSight DemoCentral.

Once you have a topic shared with your business users, they can ask their own questions and save questions to their pinboard as seen in GIF 1.

QuickSight authors can also add their Q visuals straight to an analysis to speed up dashboard creation, as seen in GIF 2.

This post assumes you’re familiar with building visual analytics in dashboards or reports, and shares new and different strategies needed to build natural language interfaces that are simple to use.

In this post, we discuss the following:

  • The importance of starting with a narrow and focused use case
  • Why and how to teach the system your unique business language
  • How to get success by providing support and having a feedback loop

If you don’t have Q enabled yet, refer to Getting started with Amazon QuickSight Q or watch the following video.

Follow along

In the following examples, we often refer to two out-of-the-box sample topics, Product Sales and Student Enrollment Statistics, so you can follow along as you go. We recommend creating the topics now before continuing with this post, because they take a few minutes to be ready.

Understand your users

Before we jump into solutions, let’s talk about when natural language query (NLQ) capabilities are right for your use case. NLQ is a fast way for a business user who is an expert in their business area to flexibly answer a large variety of questions from a scoped data domain. NLQ doesn’t replace the need for dashboards. Instead, when designed to augment a dashboard or reporting use case, NLQ helps business users get customized answers about specific details without asking a business analyst for help.

It’s critical to have a well-understood use case because language is inherently complex. There are many ways to refer to the same concept. For example, a university might refer to “classes” several ways, such as “courses,” “programs,” or “enrollments.” Language also has inherent ambiguity—“top students” might mean by highest GPA to one person and highest number of extracurriculars to another. By understanding the use case up front, you can uncover areas of potential ambiguity and build that knowledge directly into the topic.

For example, the AWS Analytics sales leadership team uses QuickSight and Q to track key metrics for their region as part of their monthly business review. When I worked with the sales leaders, I learned their preferred terminology and business language through our usability sessions. One observation I made was that they referred to the data field Sales Amortized Revenue as “adrr”. With these learnings, I could easily add this context to the topic using synonyms, which I cover in detail below. One of the sales leaders shared, “This will be awesome for next month when I write my MBR. What previously took a couple of hours, I can now do in a few minutes. Now I can spend more time working to deliver my customer’s outcomes.” If the sales leader asked a question about “adrr” but that connection was not included in their Q topic, then the leader would feel misunderstood and revert back to their original, but slower, ways of finding the answer. Check out more QuickSight use cases and success stories on the AWS Big Data Blog.

Start small

In this section, we share a few common challenges and considerations when getting started with Q.

Data can contain overlapping words

One pitfall to look out for is any fields with long strings, like survey write-in responses, product descriptions, and so on. This type of data introduces additional lexical complexity for readers to navigate. In other words, when an end-user asks a question, there is a higher chance that a word in one of the strings will overlap with other relevant fields, such as a survey write-in that mentions a product name in your Product field. Other non-descriptor fields can also contain overlaps. You can have two or more field names with lexical overlap, and the same across values, and even between fields and values. For example, let’s say you have a topic with a Product Order Status field with the values Open and Closed and a Customer Complaint Status field also with the values Open and Closed. To help avoid this overlap, consider alternate names that would be natural to your end-users to avoid the potential ambiguity. In our example, I’d keep the Product Order Status values and change the Customer Complaint Status to Resolved and Unresolved.
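A quick way to spot this kind of overlap before publishing a topic is to compare the distinct values of each field. The following sketch assumes you have already exported the distinct values per field into a dictionary; the function name and input shape are assumptions, and the check is case-insensitive exact matching only.

```python
from collections import defaultdict


def find_value_overlaps(field_values):
    """Return {value: [fields...]} for values that appear in more than one field.

    field_values maps a field name to an iterable of its distinct string values.
    Values are compared case-insensitively after trimming whitespace.
    """
    owners = defaultdict(set)
    for field, values in field_values.items():
        for value in values:
            owners[value.strip().lower()].add(field)
    return {
        value: sorted(fields)
        for value, fields in owners.items()
        if len(fields) > 1
    }


before = {
    "Product Order Status": ["Open", "Closed"],
    "Customer Complaint Status": ["Open", "Closed"],
}
after = {
    "Product Order Status": ["Open", "Closed"],
    "Customer Complaint Status": ["Resolved", "Unresolved"],
}
```

Running the check on the original values flags both Open and Closed as shared between the two status fields, while the renamed values produce no overlaps.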

Avoid including aggregation names in your fields and values

Another common pitfall that introduces unnecessary ambiguity is including calculated fields for basic aggregations that Q can do on the fly. For example, business users might track average clickthrough rates for a website or month-to-date free to paid conversions. Although these types of calculations are necessary in a dashboard, with Q, these calculated fields are not needed. Q can aggregate metrics using natural language, like simply asking “year over year sales” or “top customers by sales” or “average product discount,” as you can see in Figure 1. Defining a field with the name YoY Sales adds an additional potential answer choice to your topic, leaving end-users to select between the pre-defined YoY Sales field, or using Q’s built-in YoY aggregation capability, whereas you may already know which of these choices is likely to bring them the best outcome. If you have complex business logic built into calculated fields, those are still relevant to include (and if you create the topic from your existing analysis, then Q will bring them over.)

Figure 1: Q visual showing MoM sales for EMEA

Start with a single use case

For this post, we recommend defining a use case as a well-defined set of questions that actual business users will ask. Q gives the ability to answer questions not already answered in dashboards and reports, so simply having a dashboard or a dataset doesn’t mean you necessarily have a Q-ready use case. These questions are the real words and phrases used by business users, like “how are my customers performing?” where the word “performing” might map in the data to “sales amortized revenue,” but a business user might not ask questions using the precise data names.

Start with a single use case and the minimum number of fields to meet it. Then incrementally layer in more as needed. To help users feel confident, it’s better to introduce a topic with, for example, 10 fields and a 100% success rate of answering questions as expected than to start with 30 fields and a 70% success rate.

To help you start small, Q enables you to create your topic in one click from your existing analysis (Figure 2).

Enable a Q topic from a QuickSight analysis

Figure 2: Enable a Q topic from a QuickSight analysis

Q will scan the underlying metadata in your analysis and automatically select high-value columns based on how they are used in the analysis. You’ll also get all your existing calculated fields ported over to the new topic so you don’t have to re-create them.

Add lexical context

Q knows English well. It understands a variety of phrases and different forms of the same word. What it doesn’t know is the unique terms from your business, and only you can teach it.

There are some key ways to provide Q this context, including adding synonyms, semantic types, default aggregations, primary date, named filters, and named entities. If you created your Q topic as described in the previous section, you will be a few steps ahead, but it’s always good to check the model’s work.

Add synonyms

In a dashboard, authors use visual titles, text boxes, and filter names to help business users navigate and find their answers. With NLQ, language is the interface. NLQ empowers business users to ask their questions in their own words. The author needs to make those business lexicon connections for Q using synonyms. Your business users might refer to revenue as “gross sales,” “amortized revenue,” or any number of terms specific to your business. From the topic authoring page, you can add relevant terms (Figure 3).

Add Q synonyms

Figure 3: Adding relevant synonyms

If your business users refer to the data values in multiple ways, you can use value synonyms to create those connections for Q (Figure 4). For example, in the Student Enrollment topic, let’s say your business users sometimes use First Years to map to Freshmen and so on for each classification type. If you don’t have that data directly in your dataset, you can create those mappings using value synonyms (Figure 5).

Configure Q value synonyms

Figure 4: Configure field value synonyms

Add Q value synonyms

Figure 5: Example value synonyms for Student Enrollment topic
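Conceptually, a value synonym is a lookup from the words a business user types to the value stored in the data. The following is a minimal sketch of that idea only; it is not how Q implements synonyms, and the `VALUE_SYNONYMS` table and `resolve_value` helper are hypothetical names used for illustration.

```python
from typing import Optional, Set

# Hypothetical value-synonym table for a Classification field.
# Only the "First Years" -> "Freshmen" pair comes from the example above;
# the hyphenated variant is an assumption added for illustration.
VALUE_SYNONYMS = {
    "first years": "Freshmen",
    "first-years": "Freshmen",
}


def resolve_value(term: str, known_values: Set[str]) -> Optional[str]:
    """Map a user's term to the value stored in the data, if one is known."""
    normalized = term.strip().lower()
    # An exact (case-insensitive) match on a stored value wins.
    for value in known_values:
        if value.lower() == normalized:
            return value
    # Otherwise fall back to the author-defined synonyms.
    return VALUE_SYNONYMS.get(normalized)


classifications = {"Freshmen", "Sophomore", "Junior", "Senior", "Graduate"}
print(resolve_value("First Years", classifications))
```

A term with no stored value and no synonym resolves to `None`, which mirrors the situation where a business user's phrase has not yet been taught to the topic.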

Check semantic types

When you create a topic using automatic data prep, Q will automatically select relevant semantic types that it can detect. Q uses semantic types to understand which specific fields to use to answer vague questions like who, where, when, and how many. For example, in the student enrollment statistics example, Q already set Home of Origin as Location so if someone asks “where,” Q knows to use this field (Figure 6). Another example is adding Person for the Student Name and Professor fields so Q knows what fields to use when your business users ask for “who.”

Home of origin semantic type

Figure 6: Semantic Type set to “Location”

Another important semantic type is the Identifier. This tells Q what to count when your business users ask questions like “How many were enrolled in biology in 2021?” (Figure 7). In this example, Student ID is set as the Identifier.

Q answer showing a KPI of 3

Figure 7: Q visual showing a “how many” question

Here is a list of semantic types that map to implicit question phrases:

  • Location: Where?
  • Person or Organization: Who?
    • If there are no person or organization fields, then Q will use the identifier
  • Identifier: How many? What is the number of?
  • Duration: How long?
  • Date Part: When?
  • Age: How old?
  • Distance: How far?
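The list above can be read as a lookup table from question phrase to semantic type. The sketch below illustrates that reading, including the fallback to the identifier for “who”; it is not Q’s actual resolution logic, and `pick_field` and the sample field names are assumptions for illustration.

```python
from typing import Dict, Optional

# Question phrases and the semantic types they imply, taken from the list above.
QUESTION_TO_SEMANTIC_TYPES = {
    "where": ["Location"],
    "who": ["Person", "Organization"],
    "how many": ["Identifier"],
    "how long": ["Duration"],
    "when": ["Date Part"],
    "how old": ["Age"],
    "how far": ["Distance"],
}


def pick_field(question: str, field_types: Dict[str, str]) -> Optional[str]:
    """Return the first field whose semantic type answers the question phrase."""
    phrase = question.strip().lower()
    wanted = QUESTION_TO_SEMANTIC_TYPES.get(phrase, [])
    for name, semantic_type in field_types.items():
        if semantic_type in wanted:
            return name
    # With no person or organization field, "who" falls back to the identifier.
    if phrase == "who":
        for name, semantic_type in field_types.items():
            if semantic_type == "Identifier":
                return name
    return None


# Hypothetical topic with a location and an identifier, but no person field.
fields = {"Home of Origin": "Location", "Student ID": "Identifier"}
print(pick_field("where", fields))
print(pick_field("who", fields))
```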

Semantic types also help the model in several other ways, including mapping terms like “most expensive” or “cheapest” to Currency. There is not always a relevant semantic type, so it’s okay to leave those empty.

Set default aggregations

Q will always aggregate measure values a business user asks for, so it’s important to use measures that retain their meaning when brought together with other values. As of this writing, Q works best with underlying data that is summative, for example, a currency value or a count. Examples of metrics that are not summative are percentages, percentiles, and medians. Measures of this type can produce misleading or statistically inaccurate results when added with one another. Q can be used to produce averages, percentiles, and medians by end-users without first performing those calculations in underlying data.

Help Q understand the business logic behind your data by setting default aggregations. For example, in the Student Enrollment topic, we have student test scores for every course, which should be averaged and not summed, because it’s a percentage. Therefore, we set Average as the default and set Sum as a not allowed aggregation type (Figure 8).

Percentage semantic type

Figure 8: Setting “Sum” as a “Not allowed aggregation” for a percentage data field
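To see why Sum is disallowed for a percentage field, consider the following small sketch. The scores are made up, and the `aggregate` helper is a hypothetical stand-in for the default and not-allowed aggregation settings, not part of QuickSight.

```python
from statistics import mean

# Made-up test scores (percentages) for one student across three courses.
scores = [82.0, 91.0, 67.0]


def aggregate(values, aggregation, not_allowed=("sum",)):
    """Aggregate values, refusing any aggregation the author has disallowed."""
    if aggregation in not_allowed:
        raise ValueError(f"'{aggregation}' is not allowed for this field")
    if aggregation == "average":
        return mean(values)
    if aggregation == "sum":
        return sum(values)
    raise ValueError(f"unknown aggregation: {aggregation}")


# Averaging keeps the result on the 0-100 scale.
print(aggregate(scores, "average"))
# Summing the same values gives a number that is no longer a valid percentage.
print(sum(scores))
```

Blocking the aggregation outright, rather than only changing the default, prevents a user from getting the misleading total even when they explicitly ask for one.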

To ensure end-users get a correct count, consider whether the default aggregation type for each dimensional field should be Distinct Count or Count and set accordingly. For example, if we wanted to ask “how many courses do we offer,” we would want to set Courses to Distinct Count because the underlying data contains multiple records for the same course to track each student enrolled.

If we have a count, we get over 6,000 courses, which is a count of all rows that have data in the Courses field, covering every student in the dataset (Figure 9).

Q KPI showing 6,277

Figure 9: Q visual showing a count of courses

If we set the default aggregation to Distinct Count, we get the count of unique course names, which is more likely to be what the end-user expects (Figure 10).

Q KPI showing 15

Figure 10: Q visual showing the unique count of courses
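The difference between the two aggregations is easy to reproduce on a few rows. The following sketch uses made-up enrollment records (one row per student per course); the field names are assumptions for illustration.

```python
# Made-up enrollment rows: one row per student per course.
rows = [
    {"student_id": "S1", "course": "Biology"},
    {"student_id": "S2", "course": "Biology"},
    {"student_id": "S3", "course": "Chemistry"},
    {"student_id": "S1", "course": "Chemistry"},
    {"student_id": "S4", "course": None},  # no course recorded
]


def count(rows, field):
    """Count every row that has a value in the field."""
    return sum(1 for row in rows if row[field] is not None)


def distinct_count(rows, field):
    """Count the unique values in the field."""
    return len({row[field] for row in rows if row[field] is not None})


print(count(rows, "course"))           # one per enrollment row
print(distinct_count(rows, "course"))  # one per unique course name
```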

Review the primary date field

Q will automatically select a primary date field for answering time-related questions like “when” or “yoy.” If your data includes more than one date field, you may want to choose a different date than Q’s default choice. End-users can also ask about additional date fields by explicitly naming them (Figure 12). To review or change the primary date, go to the topic page, navigate to the Data section, and choose the Datasets tab. Expand the dataset and review the value for Default date (Figure 11).

Reviewing the Q default date

Figure 11: Reviewing the default date

You can change the date as needed.

Changing the date field

Figure 12: Asking about non-default dates

Add named filters

In a dashboard, filters are critical to allow users to focus in on their area of interest. With Q, traditional filters aren’t required because users can automatically ask to filter any field values included in the Q topic. For example, you could ask “What were sales last week for Acme Inc. for returning shoppers?” Instead of building the filters in a dashboard (date, customer name, and returning vs. new customer), Q does the filtering on the fly to instantly provide the answer.

With Q, a filter is a specific word or phrase your business users will use to instruct Q to filter returned results. For example, you have student test scores but you want a way for your users to ask about failing test scores. You can set up a filter for “Failing” defined as test scores less than 70% (Figure 13).

Q filter configuration

Figure 13: Filter configuration example using a measure

Additionally, maybe you have a field for Student Classification, which includes Freshmen, Sophomore, Junior, Senior, and Graduate, and you want to let users ask about “undergrads” vs. “graduates” (Figure 14). You can make a filter that includes the relevant values.

Q undergrad named filter example

Figure 14: Filter configuration example using a dimension
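A named filter is essentially a phrase bound to a condition on a field. The sketch below expresses the two examples above as predicates; the `NAMED_FILTERS` table, the field names, and the sample rows are hypothetical and only illustrate the concept.

```python
UNDERGRAD_CLASSES = {"Freshmen", "Sophomore", "Junior", "Senior"}

# Each named filter maps a phrase to a condition on a row.
NAMED_FILTERS = {
    # Measure-based filter: test scores less than 70%.
    "failing": lambda row: row["test_score"] < 70,
    # Dimension-based filters: groups of Student Classification values.
    "undergrads": lambda row: row["classification"] in UNDERGRAD_CLASSES,
    "graduates": lambda row: row["classification"] == "Graduate",
}


def apply_named_filter(rows, phrase):
    """Return the rows that satisfy the named filter for the given phrase."""
    predicate = NAMED_FILTERS[phrase.strip().lower()]
    return [row for row in rows if predicate(row)]


# Made-up student records.
students = [
    {"name": "A", "classification": "Freshmen", "test_score": 65},
    {"name": "B", "classification": "Senior", "test_score": 88},
    {"name": "C", "classification": "Graduate", "test_score": 69},
]

print([s["name"] for s in apply_named_filter(students, "failing")])
print([s["name"] for s in apply_named_filter(students, "undergrads")])
```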

Add named entities

Named entities are a way to get Q to return a set of fields as a table visual when a user asks for a specific word or phrase. If someone wanted to know “sales for retail december” and they get a KPI saying $6,169 without any extra context, it is hard to understand all the data this number includes (Figure 15).

Q KPI showing $6,169

Figure 15: A Q visual showing “sales for retail december”

By presenting the KPI in a table view with other relevant dimensions, the data includes additional context, making it easier to understand its meaning (Figure 16).

Q named entity as a table visual

Figure 16: A Q visual showing “sales details for retail december”

By building these table views, you can happily surprise your business users by anticipating the information they want to see without having to explicitly ask for each piece of data. The best part is your business users can easily filter the table using language to answer their own data questions. For example, in the Student Enrollment topic, we created a Student information named entity with some important student details like their name, major, email, and test scores per course.

Q named entity for student information

Figure 17: Named entity example

If a university administrator wanted to reach out to students who are failing biology, they can simply ask for “student information for failing biology majors.” In one step, they get a filtered list that already includes their emails and test scores so they can reach out (Figure 18).

Q named entity filtered down for failing biology students

Figure 18: Filtering a named entity

If the university administrator wanted to also see the phone numbers of the students to send texts offering free tutoring, they could simply ask Q “Student information for failing biology majors with phone numbers.” Now, Mobile is added as the first column (Figure 19).

Q named entity adding phone number

Figure 19: Adding a column to a named entity

Entities can also be referenced using synonyms in order to capture all the ways your business users might refer to this group of data. In our example, we could also add “student contact info” and “academic details” based on the common terminology the university admins use.

Besides looking for patterns in the data fields, ask yourself about what your business users care about. For example, let’s assume we have data for our HR specialists, and we know they care about job postings, candidates, and recruiters. Each author might think of the groups slightly differently, but as long as it’s rooted in your business jobs to be done, then your groupings are providing value. With those three groups in mind, we can sort all the data into one of those buckets. For this use case, our Candidate bucket is pretty large, with about 20 fields. We can scan the list and notice that we track information for rejected and accepted candidates, so we start splitting the metrics into two groups: Successful Candidates and Rejected Candidates. Now information like Offer Letter Date, Accept Date, and Final Salary are all in the Successful Candidate group, and related fields about Rejected Candidates are clearly grouped together.

If you’re curious about strategies for how to create entities, check out card sorting techniques.

In the Product Sales sample topic, after scanning the data, we would start with Sales, Product, and Customer as three key groupings of information to analyze. Try out the exercise on your own data and feel free to ask any questions on the QuickSight Community. To learn how to create named entities, refer to Adding named entities to a topic dataset.

Drive NLQ adoption

After you have refined your topic, tested it out with some readers, and made it available for a larger audience, it’s important to follow two strategies to drive adoption.

First, provide your business users with support. Support might look like a short tutorial video or newsletter announcement. Consider keeping an open channel like a Slack or Teams chat where active users can post questions or enhancements.

Here at Amazon, the Prime team has a dedicated Product Manager (PM) for their embedded Q application that they call PrimeQ. The PM hosts regular demo and training sessions where the Prime team can ask them any questions and get ideas about what types of answers they can get. The PM also sends out a monthly newsletter to announce the availability of new data and topics along with sample questions, FAQs, and quotes from Prime team members who get value out of Q. The PM also has an active Slack channel where every single question gets answered within 24 hours, either by the PM or a data engineer on the Prime team.

Pro tip: Make sure your business users know who they can reach out to if they get stuck. Avoid the black box of “reach out to your author” so readers feel confident their questions will be answered by a known person. For embedded applications, be sure to build an easy way to get support.

Second, maintain a healthy feedback loop. Look at the usage data directly in the product and schedule 1-on-1 sessions with your readers. Use the usage data to track adoption and identify readers who are asking unanswerable questions (Figure 20). Engage with both your successful and struggling readers to learn how to continue to iterate and improve the experience. Talking to business users is especially important to uncover the implicit ambiguity of language.

In another example here at Amazon, after first launching the Revenue Insights topic for the AWS Analytics sales team, a QuickSight Solution Architect (SA) and I checked the usage tab on a daily basis to track unanswerable questions and reach out directly to the sales team members who asked them, either to let them know how to adjust their question or to tell them that we had made a change so their question would now work. For example, we initially had a field turned off for Market Segment and noticed a question from a sales leader asking about sales by segment. We turned the field on and let him know those questions would now work. The SA and I have a Slack channel with other stakeholders so we can troubleshoot asynchronously with ease. Now that the topic has been available for several months, we check the usage tab on a weekly basis.

Q user activity tab

Figure 20: User Activity tab in Q

Conclusion

In this post, we discussed how language is inherently complex and what context you need to provide Q to teach the system about your unique business language. Q’s automated data prep will get you started, but you need to add the context that is specific to your business users’ language. As we mentioned at the start of the post, consider the following:

  • Start with a narrow and focused use case
  • Teach the system your unique business language
  • Get success by providing support and having a feedback loop

Follow this post to enable your business users to answer questions about their data using natural language in QuickSight.

Ready to get started with Q? Watch our quick tutorial on enabling QuickSight Q.

Want some tutorial videos to share with your team? Check out the following:

To see how Q can answer the “Why” behind data changes and forecast future business performance, refer to New analytical questions available in Amazon QuickSight Q: “Why” and “Forecast”.


About the Author

Amy Laresch is a product manager for Amazon QuickSight Q. She is passionate about analytics and is focused on delivering the best experience for every QuickSight Q reader. Check out her videos on the @AmazonQuickSight YouTube channel for best practices and to see what’s new for QuickSight Q.

Enable data collaboration among public health agencies with AWS Clean Rooms – Part 1

Post Syndicated from Venkata Kampana original https://aws.amazon.com/blogs/big-data/part-1-enable-data-collaboration-among-public-health-agencies-with-aws-clean-rooms/

In this post, we show how you can use AWS Clean Rooms to enable data collaboration between public health agencies. Public health governmental agencies need to understand trends related to a variety of health conditions and care across populations in order to create policies and treatments with the goal of improving the well-being of the various communities they serve.

In order to do this, these agencies need to analyze data from many sources, such as clinical organizations, non-clinical community organizations, and administrative data from other government agencies, so they can identify trends around health conditions and treatments across populations. Public health agencies need to understand what is happening to populations within the communities they serve.

Because they are looking at populations at risk, they need the flexibility of a line list of cases, stripped of personally identifiable information (PII). With this information, they can assess risk based on a variety of demographic and social factors available in the data sources without divulging PII. The list gives them flexibility to apply more complex analyses, such as regression, on the linked data as well. Programs like MENDS, MDPHnet, and CODI have for years explored using clinical data in distributed networks to understand the burden of chronic diseases in communities. Challenges facing these programs include complex data sharing rules and distributed analytics approaches across networks of data providers. MENDS and MDPHnet, for example, run analytics at the organization level without deduplicating across sites. Individual queries are pushed to each site, where they are processed and reviewed by humans, and the combined output is sent to the public health agency.

AWS Clean Rooms offers an opportunity to reduce the burden on data providers in programs like these, while enabling public health agencies to analyze data using their own queries and mitigate risks to data privacy by preventing access to the underlying raw data.

Overview of AWS Clean Rooms

AWS Clean Rooms was first announced at AWS re:Invent 2022, and is now generally available. AWS Clean Rooms allows customers and their partners to more easily and securely collaborate on their collective datasets—without sharing or copying the underlying data with each other. AWS Clean Rooms provides a broad set of privacy-enhancing controls that help protect sensitive data, including query controls, query output restrictions, query logging, and cryptographic computing tools.

With AWS Clean Rooms, you can collaborate and analyze data with other parties in the collaboration without either party having to share or copy the raw data. AWS Clean Rooms is a stateless service; it doesn’t store the data. Instead, it reads the data from where it lives, applies restrictions that protect each participant’s underlying data at query runtime, and returns the results. Queries can be written to intersect and analyze data sources using common metadata elements (for example, geography, shared identifiers, or other demographic factors), generating row-level lists of the overlap between the data sources or aggregated counts by population, condition, or other strata.

AWS Clean Rooms helps public health agencies analyze collective data to gain a more complete view of the health and well-being of their communities, while maintaining the security and privacy of the data.

Solution overview

Before we get started with AWS Clean Rooms, let’s first talk about some of the service’s key concepts:

  • Collaborations – This is a secure logical boundary in AWS Clean Rooms created by the collaboration creator. When creating the collaboration, the creator can invite additional members to join the collaboration. Invited participants can see the list of collaboration members before they accept the invitation to join the collaboration.
  • Members – This refers to AWS customers who are participants in a collaboration. All collaboration members can join data; however, only one member can query and receive results per collaboration, and that member is immutable.
  • Analysis rules – AWS Clean Rooms supports two types of analysis rules:
    • Aggregation – Members can run queries that aggregate statistics using COUNT, SUM, or AVG functions along optional dimensions. Aggregation queries won’t reveal row-level data.
    • List – Members can run queries that output row-level data of the overlap between two tables.
  • Configured tables – Members can configure existing AWS Glue tables for use in AWS Clean Rooms. This data is stored in Amazon Simple Storage Service (Amazon S3) in open data formats and cataloged in the AWS Glue Data Catalog. Each configured table contains an analysis rule that determines how the data can be queried. After it’s configured, members can associate the configured table to one or more collaborations.

Getting started with AWS Clean Rooms is a four-step process:

  1. The creator configures a collaboration and invites one or more members to the collaboration.
  2. The invited member joins the collaboration.
  3. Members can configure the existing AWS Glue tables for use in AWS Clean Rooms.
  4. Members with permission to do so can run queries in the collaboration.

Prerequisites

For this walkthrough, you need the following:

Create a collaboration and invite one or more members

You must define your collaboration configuration on the AWS Clean Rooms console, via the AWS Command Line Interface (AWS CLI), or with an AWS SDK. We demonstrate how to configure this on the console.

  1. On the AWS Clean Rooms console, choose Create collaboration.

  2. For Name, enter a name (for example, Demo collaboration).
  3. For Description, add an optional description.
  4. In the Members section, add the following members:
    1. Member 1 – Enter a member display name (your AWS account ID is automatically populated).
    2. Member 2 – Enter a member display name and the AWS account ID for the member you want to invite.
    3. Choose Add another member to add more members.
  5. In the Member abilities section, choose one member who will query and receive results.
  6. In the Query logging section, select Support query logging for this collaboration to log the queries in Amazon CloudWatch logs.
  7. Choose Next.
  8. In the Collaboration membership section, select the storage option you prefer for CloudWatch.
  9. Choose Next.
  10. On the Review and create page, choose Create collaboration and membership after reviewing the details to ensure accuracy.

Congratulations on creating your first collaboration! You can see the collaboration details on the Collaborations page.
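The same settings can also be supplied through an AWS SDK instead of the console. The following is a minimal sketch that only assembles the request as a plain dictionary. The key names and the `CAN_QUERY` and `CAN_RECEIVE_RESULTS` values reflect our understanding of the AWS Clean Rooms CreateCollaboration API and should be verified against the API reference. The `build_collaboration_request` helper, the display names, and the account ID are hypothetical.

```python
QUERY_ABILITIES = ["CAN_QUERY", "CAN_RECEIVE_RESULTS"]


def build_collaboration_request(name, description, creator_name,
                                creator_can_query, invited_members,
                                log_queries=True):
    """Assemble CreateCollaboration parameters (assumed shape; verify in the docs).

    invited_members is a list of (display_name, account_id, can_query) tuples.
    Exactly one member, creator included, may query and receive results.
    """
    queriers = int(creator_can_query) + sum(1 for _, _, q in invited_members if q)
    if queriers != 1:
        raise ValueError("exactly one member must be able to query and receive results")
    return {
        "name": name,
        "description": description,
        "creatorDisplayName": creator_name,
        "creatorMemberAbilities": list(QUERY_ABILITIES) if creator_can_query else [],
        "members": [
            {
                "accountId": account_id,
                "displayName": display_name,
                "memberAbilities": list(QUERY_ABILITIES) if can_query else [],
            }
            for display_name, account_id, can_query in invited_members
        ],
        "queryLogStatus": "ENABLED" if log_queries else "DISABLED",
    }


# Hypothetical members; 111122223333 is a placeholder account ID.
request = build_collaboration_request(
    name="Demo collaboration",
    description="Immunization data collaboration",
    creator_name="Public health agency",
    creator_can_query=True,
    invited_members=[("Clinical partner", "111122223333", False)],
)
print(request["queryLogStatus"])
```

With boto3 installed and credentials configured, a dictionary like this could be passed as keyword arguments to the `create_collaboration` method of the `cleanrooms` client; that call is omitted here so the sketch runs without AWS access.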

Join the collaboration

Each collaboration member can log in to the AWS Clean Rooms console, review the invitation, and decide to join the collaboration by following these steps:

  1. On the AWS Clean Rooms console, choose Collaborations in the navigation pane.
  2. On the Available to join tab, choose the collaboration you were invited to.

On the details page, you can review the member abilities.

  3. Select your preferred log storage option and choose Create membership.
  4. On the confirmation page, verify that the members listed align with your data sharing agreements, then choose Create membership.

After you create your membership, your member status is changed to Active on the collaboration dashboard.

Configure existing AWS Glue tables for use in AWS Clean Rooms

AWS Clean Rooms doesn’t require you to make a copy of the data because it reads the data from Amazon S3. This eliminates the need to copy and load your data into destinations outside your respective AWS account, or use third-party services to facilitate data sharing.

Each collaboration member can create configured tables. A configured table is an AWS Clean Rooms resource that references a table in the AWS Glue Data Catalog and defines how the underlying data can be used. The configured table can be used across many collaborations.

  1. On the AWS Clean Rooms console, choose Configured tables in the navigation pane.
  2. Choose Configure new table.
  3. Choose the database to populate the list of AWS Glue tables, and choose the table you want to associate with the collaboration.

For each selected table, you can determine which columns can be accessed in the collaboration.

  4. Select All columns or select Custom list to choose a subset of columns to be available in the collaboration.
  5. Enter a name for the configured table.
  6. Choose Configure new table.

In addition to column-level access controls, AWS Clean Rooms provides fine-grained query controls called analysis rules. With built-in and flexible analysis rules, you can tailor queries to specific business needs. As discussed earlier, AWS Clean Rooms provides two types of analysis rules:

  • Aggregation analysis rules – These allow queries that aggregate data without revealing row-level information. Available functions include COUNT, SUM, and AVG, along optional dimensions.
  • List analysis rules – These allow queries that output row-level attribute analyses of the overlap between the tables in the collaboration space.

Both rule types allow data owners to mandate a join between their datasets and the datasets of the collaborator running the query. This limits the results to just the intersection of the collaborators’ datasets.

  1. On the configured table, choose Configure analysis rule to configure the analysis rules.
  2. For this post, we select List because we want to query patients’ immunization status by joining with immunization data from other contributors.
  3. Select the creation method and choose Next.
  4. To define the criteria for the table joins, in the Join controls section, choose the column names appropriate for the join.
  5. To specify which columns will be outputted, identify those in the List controls section.
  6. Choose Next.
  7. Choose Configure analysis rule on the Review and configure page.

You will see the message Successfully configured list analysis rule on the configured tables page.

  1. Choose Associate to collaboration to link this table to the collaboration you created.
  2. Review the details on the Associate table page and choose Associate table.

The collaboration page will display a list of tables that are associated by you to the collaboration.

Each member of the collaboration must repeat the aforementioned steps to associate their AWS Glue Data Catalog tables to the collaboration. For this post, the other members of the collaboration follow these same steps to associate their data to the collaboration. Then the collaboration will list all tables associated by other members.

After defining the analysis rules on the configured tables and associating them to the collaboration, the members who can query and receive results can start writing queries according to the restrictions defined by each participating collaboration member. The following section includes example collaboration queries.

Run queries in the collaboration

The following screenshot is an example of a query that won’t be successful because * is not supported. Column names must be specified in the query.

The following screenshot is an example of a query that won’t be successful because you can’t link columns that members restricted in your joins.

The following screenshot is an example of a query that will be successful because it uses permitted columns (columns that are part of the list analysis rule) in the select clause and join condition.
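The three cases above follow directly from the list analysis rule: every selected column must be a permitted list column, every join column must be a permitted join column, and `*` is never accepted. The following sketch mimics those checks on column names only. It is a simplified illustration, not how AWS Clean Rooms validates SQL, and the `check_list_query` helper, the rule layout, and the column names are hypothetical.

```python
# Hypothetical list analysis rule for a configured Patient table.
patient_rule = {
    "joinColumns": ["patient_id"],
    "listColumns": ["immunization_status", "age_group"],
}


def check_list_query(select_columns, join_columns, rule):
    """Return the reasons a query would be rejected (empty list if none)."""
    problems = []
    for column in select_columns:
        if column == "*":
            problems.append("* is not supported; column names must be specified")
        elif column not in rule["listColumns"]:
            problems.append(f"column '{column}' is not permitted in the select clause")
    if not join_columns:
        problems.append("a join on a permitted column is required")
    for column in join_columns:
        if column not in rule["joinColumns"]:
            problems.append(f"column '{column}' is not permitted in a join")
    return problems


# Case 1: SELECT * is rejected.
print(check_list_query(["*"], ["patient_id"], patient_rule))
# Case 2: joining on a column the owner did not allow is rejected.
print(check_list_query(["immunization_status"], ["age_group"], patient_rule))
# Case 3: permitted columns in both the select clause and the join succeed.
print(check_list_query(["immunization_status"], ["patient_id"], patient_rule))
```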

The sample datasets (Patient and Immunization) used in this post include a unique identifier (patient ID). However, in a real-world scenario, this might not be the case. In those situations, you may consider using privacy-preserving record linkage (PPRL) to create a unique deidentified token. For example, the CDC’s CODI program deduplicates across data owners by obfuscating PII behind each organization’s firewall in a standardized way. That obfuscated information is joined to create a unique deidentified token for each individual that is analyzed across data sources. If public health agencies want to conduct analyses based on individually linked longitudinal data, they could apply PPRL to each data source and use that metadata element to link the data sources in AWS Clean Rooms before conducting their analytics.
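To make the idea of a deidentified token concrete, the following sketch standardizes a few PII fields and derives a keyed hash from them. This is a deliberately simplified illustration and not the method used by CODI or any specific PPRL product; the choice of fields, the standardization rules, and the use of HMAC-SHA256 with a shared secret are all assumptions.

```python
import hashlib
import hmac
import unicodedata


def standardize(value: str) -> str:
    """Lowercase, strip accents, and keep only letters and digits."""
    decomposed = unicodedata.normalize("NFKD", value).lower()
    return "".join(ch for ch in decomposed if ch.isalnum())


def make_token(first_name: str, last_name: str, birth_date: str,
               secret_key: bytes) -> str:
    """Derive a deidentified token from standardized PII with a keyed hash.

    Every data owner must use the same standardization and the same secret
    key, and the raw PII never needs to leave the owner's environment.
    """
    message = "|".join(standardize(v) for v in (first_name, last_name, birth_date))
    return hmac.new(secret_key, message.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"example-shared-secret"  # placeholder; manage real keys securely
token_a = make_token(" Jane ", "DOE", "1990-01-02", key)
token_b = make_token("jane", "Doe", "1990/01/02", key)
print(token_a == token_b)  # the same person yields the same token at both sites
```

Each site would store the token in place of the PII, and the token column could then serve as the join column in the analysis rule.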

Clean up

As part of this walkthrough, you provisioned an AWS Clean Rooms collaboration, invited other members to join the collaboration, and configured tables. To delete these resources, refer to Leaving the collaboration and Disassociating configured tables.

Conclusion

In this post, we showed you how to create a collaboration, invite other members to the collaboration, configure existing AWS Glue Data Catalog tables, apply analysis rules, and run sample queries on the AWS Clean Rooms console. In Part 2 of this series, we demonstrate how to automate query runs using AWS Lambda, query the results using Amazon Athena, and publish dashboards using Amazon QuickSight.


About the Authors

Venkata Kampana is a Senior Solutions Architect in the AWS Health and Human Services team and is based in Sacramento, CA. In that role, he helps public sector customers achieve their mission objectives with well-architected solutions on AWS.

Dr. Dawn Heisey-Grove is the public health analytics leader for Amazon Web Services’ state and local government team. In this role, she’s responsible for helping state and local public health agencies think creatively about how to achieve their analytics challenges and long-term goals. She’s spent her career finding new ways to use existing or new data to support public health surveillance and research.

Jim Daniel is the Public Health lead at Amazon Web Services. Previously, he held positions with the United States Department of Health and Human Services for nearly a decade, including Director of Public Health Innovation and Public Health Coordinator. Before his government service, Jim served as the Chief Information Officer for the Massachusetts Department of Public Health.

Enable remote reads from Azure ADLS with SAS tokens using Spark in Amazon EMR

Post Syndicated from Kiran Anand original https://aws.amazon.com/blogs/big-data/enable-remote-reads-from-azure-adls-with-sas-tokens-using-spark-in-amazon-emr/

Organizations use data from many sources to understand, analyze, and grow their business. These data sources are often spread across various public cloud providers. Enterprises may also expand their footprint by mergers and acquisitions, and during such events they often end up with data spread across different public cloud providers. These scenarios can create the need for AWS services to remotely access, in an ad hoc and temporary fashion, data stored in another public cloud provider such as Microsoft Azure to enable business as usual or facilitate a transition.

In such scenarios, data scientists and analysts are presented with a unique challenge when working to complete a quick data analysis because data typically has to be duplicated or migrated to a centralized location. Doing so introduces time delays, increased cost, and higher complexity as pipelines or replication processes are stood up by data engineering teams. In the end, the data may not even be needed, resulting in further loss of resources and time. Having quick, secure, and constrained access to the maximum amount of data is critical for enterprises to improve decision-making. Amazon EMR, with its open-source Hadoop modules and support for Apache Spark and Jupyter and JupyterLab notebooks, is a good choice to solve this multi-cloud data access problem.

Amazon EMR is a top-tier cloud big data solution for petabyte-scale data processing, interactive analytics, and machine learning using open-source frameworks such as Apache Spark, Apache Hive, and Presto. Amazon EMR Notebooks, a managed environment based on Jupyter and JupyterLab notebooks, enables you to interactively analyze and visualize data, collaborate with peers, and build applications using EMR clusters running Apache Spark.

In this post, we demonstrate how to set up quick, constrained, and time-bound authentication and authorization to remote data sources in Azure Data Lake Storage (ADLS) using a shared access signature (SAS) when running Apache Spark jobs via EMR Notebooks attached to an EMR cluster. This access enables data scientists and data analysts to access data directly when operating in multi-cloud environments and join datasets in Amazon Simple Storage Service (Amazon S3) with datasets in ADLS using AWS services.

Overview of solution

Amazon EMR inherently includes Apache Hadoop at its core and integrates other related open-source modules. The hadoop-aws and hadoop-azure modules provide support for AWS and Azure integration, respectively. For ADLS Gen2, the integration is done through the abfs connector, which supports reading and writing data in ADLS. Azure provides various options to authorize and authenticate requests to storage, including SAS. With SAS, you can grant restricted access to ADLS resources over a specified time interval (maximum of 7 days). For more information about SAS, refer to Delegate access by using a shared access signature.

Out of the box, Amazon EMR doesn’t have the required libraries and configurations to connect to ADLS directly. There are different methods to connect Amazon EMR to ADLS, and they all require custom configurations. In this post, we focus specifically on connecting from Apache Spark in Amazon EMR using SAS tokens generated for ADLS. The SAS connectivity is possible in Amazon EMR version 6.9.0 and above, which bundles hadoop-common 3.3.0 where support for HADOOP-16730 has been implemented. However, although the hadoop-azure module provides a SASTokenProvider interface, it is not yet implemented as a class. For accessing ADLS using SAS tokens, this interface should be implemented as a custom class JAR and presented as a configuration within the EMR cluster.

You can find a sample implementation of the SASTokenProvider interface on GitHub. In this post, we use this sample implementation of the SASTokenProvider interface and package it as a JAR file that can be added directly to an EMR environment on version 6.9.0 and above. To enable the JAR, a set of custom configurations are required on Amazon EMR that enable the SAS token access to ADLS. The provided JAR needs to be added to the HADOOP_CLASSPATH, and then the HADOOP_CLASSPATH needs to be added to the SPARK_DIST_CLASSPATH. This is all handled in the sample AWS CloudFormation template provided with this post. At a high level, the CloudFormation template deploys the Amazon EMR cluster with the custom configurations and has a bootstrapping script that stages the JAR on the nodes of the EMR cluster. The CloudFormation template also stages a sample Jupyter notebook and datasets into an S3 bucket. When the EMR cluster is ready, the EMR notebook needs to be attached to it and the sample Jupyter notebook loaded. After the SAS token configurations are done in the notebook, we can start reading data remotely from ADLS by running the cells within the notebook. The following diagram provides a high-level view of the solution architecture.

Architecture Overview

We walk through the following high-level steps to implement the solution:

  1. Create resources using AWS CloudFormation.
  2. Set up sample data on ADLS and create a delegated access with an SAS token.
  3. Store the SAS token securely in AWS Secrets Manager.
  4. Deploy an EMR cluster with the required configurations to securely connect and read data from ADLS via the SAS token.
  5. Create an EMR notebook and attach it to the launched EMR cluster.
  6. Read data via Spark from ADLS within the JupyterLab notebook.

For this setup, data travels over the public internet, which is neither a best practice nor an AWS recommendation, but it’s sufficient to showcase the Amazon EMR configurations that enable remote reads from ADLS. In enterprise deployments, use solutions such as AWS Direct Connect or AWS Site-to-Site VPN to secure data traffic.

For an AWS Command Line Interface (AWS CLI)-based deployment example, refer to the appendix at the end of this post.

Prerequisites

To get this solution working, we have a set of prerequisites for both AWS and Microsoft Azure:

  • An AWS account that can create AWS Identity and Access Management (IAM) resources with custom names and has access enabled for Amazon EMR, Amazon S3, AWS CloudFormation, and Secrets Manager.
  • The old Amazon EMR console enabled.
  • An Azure account with a storage account and container.
  • Access to blob data in ADLS with Azure AD credentials. The user must have the required role assignments in Azure. Refer to Assign an Azure role for more details.

We are following the best practice of using Azure AD credentials to create a user delegation SAS when applications need access to data storage using shared access signature tokens. In this post, we create and use a user delegation SAS with read and list permissions for ADLS access. For more information about creating SAS tokens using the Azure portal, refer to Use the Azure portal.

Before we generate the user delegation SAS token, we need to ensure the credential that will be used to generate the token has appropriate permissions to access data on the ADLS storage account. Requests submitted to the ADLS account using the user delegation SAS token are authorized with the Azure AD credentials that were used to create the SAS token.

The following minimum Azure role-based access control is required at the storage account level to access the data on ADLS storage:

  • Reader – Allow viewing resources such as listing the Azure storage account and its configuration
  • Storage Blob Data Reader – Allow reading and listing Azure storage containers and blobs
  • Storage Blob Delegator – In addition to the permissions to access the data on the ADLS account, you also need this role to generate a user delegation SAS token

Create an EMR cluster and S3 artifacts with AWS CloudFormation

To create the supported version of an EMR cluster with the required SAS configurations and stage all the required artifacts in Amazon S3, complete the following steps:

  1. Sign in to the AWS Management Console in your Region (for this post, we use us-east-1).
  2. Choose Launch Stack to deploy the CloudFormation template:
  3. Choose Next.

Create Stack

  1. For Stack name, enter an appropriate lowercase name.
  2. For EmrRelease, leave as default.

As of this writing, the stack has been tested against 6.9.0 and 6.10.0.

  1. Choose Next.
    Stack Details
  2. On the next page, choose Next.
  3. Review the details and select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  4. Choose Create stack.
  5. Monitor the progress of the stack creation until it’s complete (about 15–20 minutes).
  6. When the stack creation is complete, navigate to the stack detail page.
  7. On the Resources tab, find the logical ID with the name S3ArtifactsBucket.
  8. Choose the link for the physical ID that starts with emr-spark-on-adls-<GUID> to be redirected to the bucket on the Amazon S3 console.
  9. On the Objects tab, open the EMR/ folder.
    S3 Bucket
  10. Open the artifacts/ folder.

There are five artifacts staged by the CloudFormation stack in this path:

  • azsastknprovider-1.0-SNAPSHOT.jar – The custom implementation of the SASTokenProvider interface.
  • EMR-Direct-Read-From-ADLS.ipynb – The Jupyter notebook that we’ll use with the EMR cluster to read data from ADLS.
  • env-staging.sh – A bash script that the Amazon EMR bootstrap process runs to stage azsastknprovider-1.0-SNAPSHOT.jar across cluster nodes.
  • Medallion_Drivers_-_Active.csv – A sample dataset that needs to be staged in the ADLS container from which we will read.
  • self-signed-certs.zip – The openSSL self-signed certificates used by AWS CloudFormation to encrypt data in transit. This example is a proof of concept demonstration only. Using self-signed certificates is not recommended and presents a potential security risk. For production systems, use a trusted certification authority (CA) to issue certificates.
  1. Select Medallion_Drivers_-_Active.csv and choose Download.
  2. Select EMR-Direct-Read-From-ADLS.ipynb and choose Download.

Create the SAS token and stage sample data

To generate the user delegation SAS token from the Azure portal, log in to the Azure portal with your account credentials and complete the following steps:

  1. Navigate to Storage account, Access Control, and choose Add role assignment.
  2. Add the following roles to your user: Reader, Storage Blob Data Reader, and Storage Blob Delegator.
  3. Navigate to Storage account, Containers, and choose the container you want to use.
  4. Under Settings in the navigation pane, choose Shared access tokens.
  5. Select User delegation key for Signing method.
  6. On the Permissions menu, select Read and List.
    ADLS Container
  7. For Start and expiry date/time, define the start and expiry times for the SAS token.
  8. Choose Generate SAS token and URL.
  9. Copy the token under Blob SAS token and save this value.
    SAS Token
  10. Choose Overview in the navigation pane.
  11. Choose Upload and upload the Medallion_Drivers_-_Active.csv file downloaded earlier.
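
Before storing the token, it can be worth checking that it carries only the permissions and lifetime we intended. The following is a minimal sketch rather than part of the provided notebook: it assumes the token is the usual query-string form with the sp (signed permissions) and se (signed expiry) fields, and that the expiry uses the YYYY-MM-DDThh:mm:ssZ format. The function names and the sample token are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import parse_qs


def inspect_sas_token(token: str) -> dict:
    """Return the permissions and expiry time carried in a SAS token."""
    # Tolerate a leading "?" in case the token was copied from a full URL.
    fields = parse_qs(token.lstrip("?"))
    permissions = fields["sp"][0]   # signed permissions, for example "rl"
    expiry_text = fields["se"][0]   # signed expiry, assumed YYYY-MM-DDThh:mm:ssZ
    expiry = datetime.strptime(expiry_text, "%Y-%m-%dT%H:%M:%SZ")
    return {"permissions": permissions, "expiry": expiry.replace(tzinfo=timezone.utc)}


def is_valid_read_list_token(token: str, now: Optional[datetime] = None) -> bool:
    """True if the token grants nothing beyond read/list and has not expired."""
    info = inspect_sas_token(token)
    now = now or datetime.now(timezone.utc)
    return set(info["permissions"]) <= {"r", "l"} and now < info["expiry"]


# Placeholder token with a fake signature, for illustration only.
SAMPLE_TOKEN = (
    "sp=rl&st=2030-01-01T00:00:00Z&se=2030-01-02T00:00:00Z"
    "&spr=https&sv=2022-11-02&sr=c&sig=FAKE"
)
```

A check like this runs locally and never contacts Azure, so it only confirms what the token claims, not whether Azure will accept it.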

Store the SAS token in Secrets Manager

Next, we secure the SAS token in Secrets Manager so it can be programmatically pulled from the Jupyter notebook.

  1. Open the Secrets Manager console in the same Region you have been working in (in this case, us-east-1).
  2. Choose Store a new secret.
  3. For Secret type, select Other type of secret.
  4. In the Key/value pairs section, enter a name for the key and enter the blob SAS token for the value.
  5. For Encryption key, choose the default AWS managed key.
  6. Choose Next.
    Secret Type
  7. For Secret name, enter a name for your secret.
  8. Leave the optional fields as default and choose Next.
  9. On the next page, leave the settings as default and choose Next.

Setting up a secret rotation is a best practice but out of scope for this post. You can do so via Azure RM PowerShell, which can be integrated with the Lambda rotation function from Secrets Manager.

  1. Choose Store.
  2. Refresh the Secrets section and choose your secret.
  3. In the Secret details section, copy the value for Secret ARN to use in the Jupyter notebook.
    Secret ARN

Configure an EMR notebook with the SAS token and read ADLS data

Finally, we create the EMR notebook environment, integrate the SAS token into the downloaded Jupyter notebook, and perform a read against ADLS.

  1. Open the Amazon EMR console in the same Region you have been working in (in this case, us-east-1).
  2. Under EMR on EC2 in the navigation pane, choose Clusters.
  3. In the cluster table, choose Cluster with ADLS SAS Access.
    EMR Cluster

On the Summary tab, you will find the applications deployed on the cluster.

EMR Summary Tab

On the Configurations tab, you can see the configurations deployed by the CloudFormation stack, which load the custom JAR in the appropriate classpaths.

EMR Configurations Tab

  1. Under EMR on EC2 in the navigation pane, choose Notebooks.
  2. Choose Create notebook.
  3. Enter an appropriate name for the notebook for Notebook name.
  4. For Cluster, select Choose an existing cluster, then choose the cluster you created earlier.
  5. Leave all other settings as default and choose Create notebook.
    Create Notebook
  6. When the notebook environment is set up, choose Open in JupyterLab.
  7. On your local machine, navigate to where you saved the EMR-Direct-Read-From-ADLS.ipynb notebook.
  8. Drag and drop it into the left pane of the JupyterLab environment.
  9. Choose EMR-Direct-Read-From-ADLS.ipynb from the left pane and ensure that the interpreter selected for the notebook in the top-right corner is PySpark.
    Open Notebook
  10. In the notebook, under the SAS TOKEN SETUP markup cell, replace <AWS_REGION> with the Region you are using (in this case, us-east-1).
  11. In the same code cell, replace <ADLS_SECRET_MANAGER_SECRET_ARN> with your secret ARN and <SECRET_KEY> with your secret key.

You can get the secret key from Secrets Manager in the Secret value section for the secret you created earlier.

Secret Key
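
To illustrate what the SAS TOKEN SETUP cell needs to accomplish, here is a minimal sketch of retrieving the token with Boto3. It is not the code from the provided notebook, and it assumes Boto3 is available to the notebook kernel and that the cluster role may read the secret. The function names are hypothetical. The three arguments correspond to the <AWS_REGION>, <ADLS_SECRET_MANAGER_SECRET_ARN>, and <SECRET_KEY> placeholders.

```python
import json


def extract_sas_token(secret_string: str, secret_key: str) -> str:
    """Pull one value out of the JSON document that Secrets Manager
    returns for a key/value secret."""
    return json.loads(secret_string)[secret_key]


def fetch_sas_token(region: str, secret_arn: str, secret_key: str) -> str:
    """Read the secret from Secrets Manager and return the SAS token."""
    # Imported here so the parsing helper above has no AWS dependency.
    import boto3

    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_arn)
    return extract_sas_token(response["SecretString"], secret_key)
```

Keeping the parsing step separate from the AWS call makes it easy to test the key lookup without credentials.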

  1. In the code cell below the HADOOP CONFIGURATIONS markup cell, replace <YOUR_STORAGE_ACCOUNT> with your Azure storage account where the SAS token was set up earlier.
  2. In the code cell below the READ TEST DATA markup cell, replace <YOUR_CONTAINER> and <YOUR_STORAGE_ACCOUNT> with your Azure container and storage account name, respectively.
  3. On the Run menu, choose Run All Cells.
    Run All Cells

After all notebook cells run, you should see 10 rows in a tabular format containing the data coming from ADLS, which now can be used directly in the notebook or can be written to Amazon S3 for further use.

Results

Clean up

Deploying a CloudFormation template incurs cost based on the resources deployed. The EMR cluster is configured to stop after an hour of inactivity, but to avoid incurring ongoing charges and to fully clean up the environment, complete the following steps:

  1. On the Amazon EMR console, choose Notebooks in the navigation pane.
  2. Select the notebook you created and choose Delete, and wait for the delete to complete before proceeding to the next step.
  3. On the Amazon EMR console, choose Clusters in the navigation pane.
  4. Select the cluster Cluster With ADLS SAS Access and choose Terminate.
  5. On the Amazon VPC console, choose Security groups in the navigation pane.
  6. Find the ElasticMapReduce-Master-Private, ElasticMapReduce-Slave-Private, ElasticMapReduce-ServiceAccess, ElasticMapReduceEditors-Livy, and ElasticMapReduceEditors-Editor security groups attached to the VPC created by the CloudFormation stack and delete their inbound and outbound rules.
  7. Select these five security groups and on the Actions menu, choose Delete security groups.
  8. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  9. Select the stack and choose Delete.
  10. On the Secrets Manager console, choose Secrets in the navigation pane.
  11. Select the stored SAS secret and on the Actions menu, choose Delete secret.
  12. On the IAM console, choose Roles in the navigation pane.
  13. Select the role EMR_Notebooks_DefaultRole and choose Delete.

Conclusion

In this post, we used AWS CloudFormation to deploy an EMR cluster with the appropriate configurations to connect to Azure Data Lake Storage using SAS tokens over the public internet. We provided an implementation of the SASTokenProvider interface to enable the SAS token-based connectivity to ADLS. We also provided relevant information on the SAS token creation steps on the Azure side. Furthermore, we showed how data scientists and analysts can use EMR notebooks connected to an EMR cluster to read data directly from ADLS with a minimum set of configurations. Finally, we used Secrets Manager to secure the storage of the SAS token and integrated it within the EMR notebook.

We encourage you to review the CloudFormation stack YAML template and test the setup on your own. If you implement the example and run into any issues or just have feedback, please leave a comment.


Appendix

AWS CLI-based deployment model

If you prefer to use command line options, this section provides AWS CLI commands to deploy this solution. Note that this is an alternative deployment model different from the CloudFormation template provided in the previous sections. Sample scripts and commands provided here include placeholders for values that need to be updated to suit your environment. The AWS CLI commands provided in this section should be used as guidance to understand the deployment model. Update the commands as needed to follow all the security procedures required by your organization.

Prerequisites for an AWS CLI-based deployment

The following are the assumptions made while using this AWS CLI-based deployment:

  • You will be deploying this solution in an existing AWS environment that has all the necessary security configurations enabled
  • You already have an Azure environment where you have staged the data that needs to be accessed through AWS services

You must also complete additional requirements on the AWS and Azure sides.

For AWS, complete the following prerequisites:

  1. Install the AWS CLI on your local computer or server. For instructions, see Installing, updating, and uninstalling the AWS CLI.
  2. Create an Amazon Elastic Compute Cloud (Amazon EC2) key pair for SSH access to your Amazon EMR nodes. For instructions, see Create a key pair using Amazon EC2.
  3. Create an S3 bucket to store the EMR configuration files, bootstrap shell script, and custom JAR file. Make sure that you create a bucket in the same Region as where you plan to launch your EMR cluster.
  4. Copy and save the SAS token from ADLS to use in Amazon EMR.

For Azure, complete the following prerequisites:

  1. Generate the user delegation SAS token on the ADLS container where your files are present, with the required levels of access granted. In this post, we use SAS tokens with only read and list access.
  2. Copy and save the generated SAS token to use with Amazon EMR.

Update configurations

We have created a custom class that implements the SASTokenProvider interface and created a JAR file called azsastknprovider-1.0-SNAPSHOT.jar, which is provided as a public artifact for this post. A set of configurations are required on the Amazon EMR side to use the SAS tokens to access ADLS. A sample configuration file in JSON format called EMR-HadoopSpark-ADLS-SASConfig.json is also provided as a public artifact for this post. Download the JAR and sample config files.

While copying the code or commands from this post, make sure to remove any control characters or extra newlines that may get added.

  1. Create a shell script called env-staging-hadoopspark.sh to copy the custom JAR file azsastknprovider-1.0-SNAPSHOT.jar (provided in this post) to the EMR cluster nodes’ local storage during the bootstrap phase. The following code is a sample bootstrap shell script:
    #!/bin/bash
    # Stage the SASTokenProvider interface implementation JAR on the local filesystem
    # -p avoids a failure if the directory already exists
    sudo mkdir -p /lib/customjars
    sudo aws s3 cp s3://<s3 bucket>/<path>/azsastknprovider-1.0-SNAPSHOT.jar /lib/customjars/
    sudo chmod 755 /lib/customjars

  2. Update the bootstrap shell script to include your S3 bucket and the proper path where the custom JAR file is uploaded in your AWS environment.
  3. Upload the JAR file, config file, and the bootstrap shell script to your S3 bucket.
  4. Keep a copy of the updated configuration file EMR-HadoopSpark-ADLS-SASConfig.json locally in the same directory from where you plan to run the AWS CLI commands.

Launch the EMR cluster using the AWS CLI

We use the create-cluster command in the AWS CLI to deploy an EMR cluster. We need a bootstrap action at cluster creation to copy the custom JAR file to the EMR cluster nodes’ local storage. We also need to add a few custom configurations on Amazon EMR to connect to ADLS. For this, we need to supply a configuration file in JSON format. The following code launches and configures an EMR cluster that can connect to your Azure account and read objects in ADLS through Hadoop and Spark:

aws emr create-cluster \
--name "Spark Cluster with ADLS Access" \
--release-label emr-6.10.0 \
--applications Name=Hadoop Name=Spark \
--ec2-attributes KeyName=<key pair name> \
--instance-type m5.xlarge \
--instance-count 3 \
--use-default-roles \
--enable-debugging \
--log-uri s3://<s3 bucket>/<logs path>/ \
--configurations file://EMR-HadoopSpark-ADLS-SASConfig.json \
--bootstrap-actions Path="s3://<s3 bucket>/<path>/env-staging-hadoopspark.sh"

Additional configurations for Spark jobs

The following additional properties should be set inside your Spark application code to access data in ADLS through Amazon EMR. These should be set on the Spark session object used within your Spark application code.

spark.conf.set("fs.azure.account.auth.type.<azure storage account>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<azure storage account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<azure storage account>.dfs.core.windows.net", "<Your SAS token from ADLS - remove the first character if it is &>")

These additional configurations can be set in the core-site.xml file for the EMR cluster. However, setting these in the application code is more secure and recommended because it won’t expose the SAS token in the Amazon EMR configurations.
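
Because the three property names differ only in their prefix, a small helper can keep them consistent. The following is a sketch under the assumption that the token has already been loaded into a variable. The function names adls_sas_conf and apply_conf are hypothetical. The property names and the provider class are the ones listed above.

```python
def adls_sas_conf(storage_account: str, sas_token: str) -> dict:
    """Build the Spark properties needed for SAS access to one ADLS account."""
    host = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "SAS",
        f"fs.azure.sas.token.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider",
        f"fs.azure.sas.fixed.token.{host}": sas_token,
    }


def apply_conf(spark, conf: dict) -> None:
    """Set each property on an existing Spark session object."""
    for key, value in conf.items():
        spark.conf.set(key, value)
```

In a notebook, this would be used as apply_conf(spark, adls_sas_conf("<azure storage account>", token)), where spark is the session object the kernel provides.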

Submit the Spark application in Amazon EMR using the AWS CLI

You can run a Spark application on Amazon EMR in different ways:

  • Log in to an EMR cluster node through SSH using the EC2 key pair you created earlier and then run the application using spark-submit
  • Submit a step via the console while creating the cluster or after the cluster is running
  • Use the AWS CLI to submit a step to a running cluster:
aws emr add-steps \
--cluster-id <cluster-id> \
--steps 'Type=Spark,Name=Read-ADLS-Data,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://<s3 bucket>/<path>/<spark-application>]'

To read files in ADLS within a Spark application that is running on an EMR cluster, you need to use the abfs driver and refer to the file in the following format, just as you would have done in your Azure environment:

abfs://<azure container name>@<azure storage account>.dfs.core.windows.net/<path>/<file-name>
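
If you build these paths in code, a one-line helper avoids mistakes with the separator characters. This is a small sketch, and the function name abfs_uri is hypothetical.

```python
def abfs_uri(container: str, storage_account: str, path: str) -> str:
    """Compose an abfs URI from its container, account, and object path."""
    # Strip a leading slash so the result never contains "//" after the host.
    return f"abfs://{container}@{storage_account}.dfs.core.windows.net/{path.lstrip('/')}"
```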

The following sample PySpark script reads a CSV file in ADLS using SAS tokens and writes it to Amazon S3, and can be run from the EMR cluster you created:

from pyspark.sql import SparkSession

# Create (or reuse) the Spark session for this application
spark = SparkSession.builder.appName('Read data from ADLS').getOrCreate()

# Authenticate to the ADLS account with a fixed SAS token
spark.conf.set("fs.azure.account.auth.type.<azure storage account>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<azure storage account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<azure storage account>.dfs.core.windows.net", "<Your SAS token from ADLS>")

# Read the CSV file from ADLS and write it to Amazon S3
adlsDF = spark.read.csv("abfs://<azure container name>@<azure storage account>.dfs.core.windows.net/<path>/<file.csv>")
adlsDF.write.csv("s3://<s3 bucket>/<path>/")

Clean up using the AWS CLI

Terminate the EMR cluster you created using the terminate-clusters command.


About the authors

Kiran Anand is a Principal Solutions Architect with the AWS Data Lab. He is a seasoned professional with more than 20 years of experience in information technology. His areas of expertise are databases and big data solutions for data engineering and analytics. He enjoys music, watching movies, and traveling with his family.

Andre Hass is a Sr. Solutions Architect with the AWS Data Lab. He has more than 20 years of experience in the databases and data analytics field. Andre enjoys camping, hiking, and exploring new places with his family on the weekends, or whenever he gets a chance. He also loves technology and electronic gadgets.

Stefan Marinov is a Sr. Solutions Architecture Manager with the AWS Data Lab. He is passionate about big data solutions and distributed computing. Outside of work, he loves spending active time outdoors with his family.

Hari Thatavarthy is a Senior Solutions Architect on the AWS Data Lab team. He helps customers design and build solutions in the data and analytics space. He believes in data democratization and loves to solve complex data processing-related problems.

Hao Wang is a Senior Big Data Architect at AWS. Hao actively works with customers building large scale data platforms on AWS. He has a background as a software architect on implementing distributed software systems. In his spare time, he enjoys reading and outdoor activities with his family.

Deploying an automated Amazon CloudWatch dashboard for AWS Outposts using AWS CDK

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/deploying-an-automated-amazon-cloudwatch-dashboard-for-aws-outposts-using-aws-cdk/

This post is written by Enrico Liguori, Networking Solutions Architect, Hybrid Cloud and Sumeeth Siriyur, Sr. Hybrid Cloud Solutions Architect.

AWS Outposts is a fully managed service that brings the same AWS infrastructure, services, APIs, and tools to virtually any data center, colocation space, manufacturing floor, or on-premises facility where it might be needed. With Outposts, you can run some AWS services on-premises and connect to a broad range of services available in the local AWS Region. Outposts supports workloads requiring low latency, local data processing, data residency, and application migration.

Outposts capacity is sized according to the compute and storage requirements of your workloads. You can monitor Outposts resources using metrics gathered by Amazon CloudWatch. Using these metrics, you can monitor and manage the Outposts resources as you would in the Region, leveraging cloud-native tools such as CloudWatch dashboards. Refer to the Monitoring best practices for AWS Outposts blog post for a deep dive into the available monitoring options for Outposts.

CloudWatch dashboards are customizable home pages in the CloudWatch console that can be used to monitor resources running on Outposts in a single view. For example, you can monitor in a single pane the number of Amazon EC2 instances used per EC2 instance type, the available capacity of Amazon EBS volumes and Amazon S3 buckets, and the operational status of the Outposts service link.

As you deploy additional Outposts resources as part of your capacity expansion, they must all be integrated and visualized within CloudWatch in an automated way. Traditionally, CloudWatch dashboards are built manually and can be time-consuming to tune. This post also provides an overview of building CloudWatch dashboards in an automated way using AWS Cloud Development Kit (AWS CDK).

Overview

CloudWatch metrics available to monitor Outposts resources and capacity

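CloudWatch metrics for Outposts are available to customers in all public AWS Regions and AWS GovCloud (US) at no additional cost. We can classify the available metrics in two main categories:

  • Capacity utilization metrics, published under the AWS/Outposts namespace
  • Service specific metrics, published under the namespace of each service whose resources run on Outposts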

To identify the metrics published under the service specific namespaces, we can leverage metadata in the form of tags. A tag is a label that you assign to an AWS resource and consists of a key and an optional value. For the purpose of the monitoring strategy described in this post, we use a tag that contains the OutpostID of the Outpost where the resource is deployed. In this way, we can easily filter the CloudWatch metrics that we would like to show in our dashboard.

To enforce the assignment of tags to our resources, we can implement a tagging strategy using AWS tag policies and service control policies (SCPs).

The following sections describe two different methods to build a CloudWatch dashboard that includes the different types of metrics described so far. In both cases, we see how particularly useful the presence of tags is to identify the service-specific metrics.

Manual approach to building a CloudWatch dashboard for Outposts

This section describes a manual (i.e., non-automated) approach to building a dashboard that could summarize both the capacity utilization metrics and the service specific metrics for your resources running on Outposts.

The benefit of this approach is that we can implement a fully operational dashboard directly from the CloudWatch console. However, it will simultaneously require more effort to properly tune the dashboard to satisfy your monitoring requirements.

You can start creating the dashboard by opening the CloudWatch console and following the steps listed in the public documentation.

To display a metric under the AWS/Outposts namespace, we can choose any of the available widgets. Depending on the nature of the data, we can choose widget types such as Number, Line, Gauge, or Explorer, or even build our own custom widget.

Together with the widget type, we must select the Outposts namespace in the metric graph dialog box and then navigate to the specific metric of interest.

If we are creating the dashboard in a different account than the Outposts owner, we must select the right account in the View data drop-down menu to see the Outposts metric in which we are interested.

View data drop-down menu

After selecting one or more metrics, we can choose Create widget.

For the service specific metrics, we recommend using the explorer widget. In this way, we can utilize the tagging strategy described earlier to automatically identify the metrics belonging to the resources running on Outposts. Check the documentation page for a step-by-step guide for creating an explorer widget based on tags.

Automated Outposts dashboard

Now that we’ve seen how to build a dashboard manually from the console, in this section we describe an automated approach to deploy a dashboard for Outposts through AWS CDK.

AWS CDK is an open source software development framework to model and provision your cloud application resources using familiar programming languages, including TypeScript, JavaScript, Python, C#, and Java. For the solution in this post, we use Python.

Architecture overview

The AWS CDK stack described in this post assumes that the resources running on Outposts (EC2 instances, S3 buckets, Application Load Balancers (ALBs), and RDS instances) are tagged using the tagging strategy described earlier.

When we specify a tag name and a tag value in a configuration file, the stack automatically discovers the resources with that tag and adds the related metrics to the CloudWatch dashboard.

Together with the service specific metrics, the stack creates a series of widgets that we can use to monitor the capacity available and utilized in each Outpost that belongs to the account where the script is running.

The workflow is made of the following phases:

  1. The AWS CDK stack creates an AWS CodeCommit repository and uploads its own code into it. The code contains a series of modules, one for each section of the CloudWatch dashboard. A section of the dashboard contains one or more widgets showing the metrics of a specific service.
  2. To keep the CloudWatch dashboard up to date with the resources matching the tag, the stack creates a pipeline in AWS CodePipeline that can dynamically create or update the dashboard. The pipeline runs the code in the CodeCommit repository and consists of two stages. The first, the Build stage, builds the dependencies needed by the AWS CDK stack. The second, the Deploy stage, loads and runs the modules used to build the dashboard.
  3. Each module contains the code to automatically discover the tagged resources of a specific service. This discovery phase uses standard AWS APIs called through the Python SDK Boto3.
  4. Based on the results of the discovery phase, AWS CDK produces an AWS CloudFormation template containing the definition of the CloudWatch dashboard sections. The template is submitted to CloudFormation.
  5. CloudFormation creates or, if already defined, updates the CloudWatch dashboard.
  6. Together with the dashboard, the AWS CDK script also contains the definition of a CloudWatch Event that, once deployed, triggers the pipeline each time a resource tagged with the specified tag is created or destroyed.
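
The discovery phase in step 3 can be pictured with a short Boto3 sketch. This is an illustration under stated assumptions, not the code from the repository: it covers only EC2 instances, and the function name find_tagged_instance_ids is hypothetical. The describe_instances call and its tag:<key> filter are standard Amazon EC2 API features.

```python
def find_tagged_instance_ids(ec2_client, tag_name: str, tag_values: list) -> list:
    """Return the IDs of EC2 instances whose tag matches one of the values."""
    # A paginator follows NextToken automatically for large result sets.
    paginator = ec2_client.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[{"Name": f"tag:{tag_name}", "Values": list(tag_values)}]
    )
    instance_ids = []
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                instance_ids.append(instance["InstanceId"])
    return instance_ids
```

With credentials configured, a call could look like find_tagged_instance_ids(boto3.client("ec2"), "OutpostID", ["op-1234567890abcdefg"]). Passing the client in as an argument keeps the function easy to exercise with a stub.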

Prerequisites

To implement the solution presented in this post, you must configure:

  1. git as distributed version control system.
  2. If this is the first time that you’re using AWS CDK in this account and Region, you must:

a. Install the AWS CDK, and its prerequisites, following these instructions.

b. Go through the AWS CDK bootstrapping process. This is required only for the first time that we use AWS CDK in a specific AWS environment (an AWS environment is a combination of an AWS account and Region).

How to install

Step 1: Clone the AWS CDK code hosted on GitHub with:

$ git clone https://github.com/aws-samples/automated-cloudwatch-dashboard.git

Step 2: Enter the directory using the following:

$ cd automated-cloudwatch-dashboard/

Step 3: Install the needed Python dependencies with:

$ pip install -r requirements.txt

Step 4: Modify the configuration file

Before deploying the stack, we must modify the configuration file to specify the tag we use for identifying our resources running on Outposts. Open the file with the name config.yaml with your preferred text editor and specify:

      • A name for the dashboard. The default name used is Automated-CloudWatch-Dashboard.
      • The tag name. Replace the <tag_name> placeholder after the tag_name variable with the name of the tag applied to the resources that you want to include in the dashboard.
      • The tag value. Replace the <tag_value> placeholder under the tag_values variable with the tag value that you used.

Here is an example config.yaml configuration file:

dashboard_name: Automated-CloudWatch-Dashboard
tag_name: OutpostID
tag_values:
  - op-1234567890abcdefg 
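
Before deploying, it can help to confirm that no placeholder was left in the file. The following is a minimal sketch that checks an already-parsed configuration (for example, the dictionary returned by a YAML parser). It assumes the three keys shown above are the only required ones, and `validate_config` is a hypothetical helper, not part of the repository.

```python
from typing import Any, Dict

# Keys taken from the example config.yaml above (assumed to be the full set).
REQUIRED_KEYS = ("dashboard_name", "tag_name", "tag_values")


def _is_placeholder(value: str) -> bool:
    """Detect values such as "<tag_name>" that were never replaced."""
    return value.startswith("<") and value.endswith(">")


def validate_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Raise ValueError on an incomplete config; return it unchanged if valid."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"missing keys: {', '.join(missing)}")

    for key in ("dashboard_name", "tag_name"):
        value = config[key]
        if not isinstance(value, str) or not value or _is_placeholder(value):
            raise ValueError(f"{key} must be a non-empty, non-placeholder string")

    values = config["tag_values"]
    if not isinstance(values, list) or not values:
        raise ValueError("tag_values must be a non-empty list")
    for value in values:
        if not isinstance(value, str) or not value or _is_placeholder(value):
            raise ValueError("tag_values entries must be non-placeholder strings")
    return config
```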

Stack deployment

We can deploy the stack with the following:

$ cdk deploy

At the end of the deployment process, the pipeline that creates the dashboard is provisioned. You can now go to your CloudWatch console to view it.

Automated Outposts dashboard overview

Now that we have built our dashboard, let’s review each section:

  1. Outpost capacity

Outpost Capacity diagram

The AWS CDK stacks define a capacity section for each Outpost available to the AWS account where the script runs.

In this section, we find four widgets showing metrics published under the AWS/Outposts namespace. The first widget shows, for each EC2 instance type available on the Outpost, the number of instances used and available. In the second row, we can visualize the available capacity for Amazon EBS volumes and for S3 buckets. The last widget shows the operational status of the Outpost’s service link.
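
To make the widget structure concrete, here is a minimal sketch of a dashboard body containing one service link widget. It is an illustration only: the stack in this post generates the dashboard through the AWS CDK rather than raw JSON, and the helper names, widget size, and title are assumptions. The sketch uses the ConnectedStatus metric with the OutpostId dimension from the AWS/Outposts namespace.

```python
import json


def connected_status_widget(outpost_id: str, region: str) -> dict:
    """Describe one metric widget for the service link status (hypothetical helper)."""
    return {
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 6,  # layout values are arbitrary
        "properties": {
            "title": f"Service link status: {outpost_id}",
            "region": region,
            "metrics": [
                ["AWS/Outposts", "ConnectedStatus", "OutpostId", outpost_id],
            ],
            "stat": "Average",
            "period": 300,
        },
    }


def dashboard_body(widgets: list) -> str:
    """Serialize widgets into the JSON string expected as a dashboard body."""
    return json.dumps({"widgets": widgets})


# Example usage (requires AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_dashboard(
#       DashboardName="Automated-CloudWatch-Dashboard",
#       DashboardBody=dashboard_body(
#           [connected_status_widget("op-1234567890abcdefg", "us-east-1")]),
#   )
```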

2. EC2 instances

CPU, Network, and Disk Utilization for an EC2 instance diagram

In this section of the dashboard, we find the metrics showing the CPU, network, and disk utilization for an EC2 instance. A section of this type is defined for each EC2 instance whose tag matches the name and value specified in the configuration file.

3. Application Load Balancer

The ALB section aggregates metrics showing the operational status of a load balancer hosted on Outposts. A section of this type is defined for each ALB with an assigned tag matching the one specified in the configuration file.

4. S3 buckets

The S3 buckets section diagram

The S3 buckets section is defined only once and aggregates the utilization metrics for all S3 buckets with an assigned tag.

5. AutoScaling group

The AutoScaling group section diagram

The AutoScaling group section can be used to monitor the number of instances in service in a specific AS group with a tag assigned. This section is defined once and can aggregate the metrics for multiple AutoScaling groups.

Clean up

To terminate the resources that we created in this post, run the following:

$ cdk destroy

Then, go to the CloudFormation console and delete the stack with the name “Deploy-AutomatedCloudWatchDashboard”.

Conclusion

In conclusion, this post demonstrates a manual way of creating a CloudWatch metrics dashboard using the CloudWatch console and an automated way using the AWS CDK. The automated approach also scales: it automatically discovers any new resources added to the existing Outposts in your environment, without any changes to the code.

Secure Connectivity from Public to Private: Introducing EC2 Instance Connect Endpoint

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/secure-connectivity-from-public-to-private-introducing-ec2-instance-connect-endpoint-june-13-2023/

This blog post is written by Ariana Rahgozar, Solutions Architect, and Kenneth Kitts, Sr. Technical Account Manager, AWS.

Imagine trying to connect to an Amazon Elastic Compute Cloud (Amazon EC2) instance within your Amazon Virtual Private Cloud (Amazon VPC) over the Internet. Typically, you’d first have to connect to a bastion host with a public IP address that your administrator set up over an Internet Gateway (IGW) in your VPC, and then use port forwarding to reach your destination.

Today we launched Amazon EC2 Instance Connect (EIC) Endpoint, a new feature that allows you to connect securely to your instances and other VPC resources from the Internet. With EIC Endpoint, you no longer need an IGW in your VPC, a public IP address on your resource, a bastion host, or any agent to connect to your resources. EIC Endpoint combines identity-based and network-based access controls, providing the isolation, control, and logging needed to meet your organization’s security requirements. As a bonus, your organization administrator is also relieved of the operational overhead of maintaining and patching bastion hosts for connectivity. EIC Endpoint works with the AWS Management Console and AWS Command Line Interface (AWS CLI). Furthermore, it gives you the flexibility to continue using your favorite tools, such as PuTTY and OpenSSH.

In this post, we provide an overview of how the EIC Endpoint works and its security controls, guide you through your first EIC Endpoint creation, and demonstrate how to SSH to an instance from the Internet over the EIC Endpoint.

EIC Endpoint product overview

EIC Endpoint is an identity-aware TCP proxy. It has two modes. In the first, the AWS CLI client creates a secure WebSocket tunnel from your workstation to the endpoint using your AWS Identity and Access Management (IAM) credentials. Once you’ve established a tunnel, you point your preferred client at your loopback address (127.0.0.1 or localhost) and connect as usual. In the second, when not using the AWS CLI, the Console gives you secure and seamless access to resources inside your VPC. Authentication and authorization are evaluated before traffic reaches the VPC. The following figure shows a user connecting via an EIC Endpoint:

Figure 1. User connecting to private EC2 instances through an EIC Endpoint

EIC Endpoints provide a high degree of flexibility. First, they don’t require your VPC to have direct Internet connectivity using an IGW or NAT Gateway. Second, no agent is needed on the resource you wish to connect to, allowing for easy remote administration of resources which may not support agents, like third-party appliances. Third, they preserve existing workflows, enabling you to continue using your preferred client software on your local workstation to connect and manage your resources. And finally, IAM and Security Groups can be used to control access, which we discuss in more detail in the next section.

Prior to the launch of EIC Endpoints, AWS offered two key services to help manage access from public address space into a VPC more carefully. First is EC2 Instance Connect, which provides a mechanism that uses IAM credentials to push ephemeral SSH keys to an instance, making long-lived keys unnecessary. However, until now EC2 Instance Connect required a public IP address on your instance when connecting over the Internet. With this launch, you can use EC2 Instance Connect with EIC Endpoints, combining the two capabilities to give you ephemeral-key-based SSH to your instances without exposure to the public Internet. As an alternative to EC2 Instance Connect and EIC Endpoint based connectivity, AWS also offers Systems Manager Session Manager (SSM), which provides agent-based connectivity to instances. SSM uses IAM for authentication and authorization, and is ideal for environments where an agent can be configured to run.

Given that EIC Endpoint enables access to private resources from public IP space, let’s review the security controls and capabilities in more detail before discussing creating your first EIC Endpoint.

Security capabilities and controls

Many AWS customers remotely managing resources inside their VPCs from the Internet still use either public IP addresses on the relevant resources, or at best a bastion host approach combined with long-lived SSH keys. Using public IPs can be locked down somewhat using IGW routes and/or security groups. However, in a dynamic environment those controls can be hard to manage. As a result, careful management of long-lived SSH keys remains the only layer of defense, which isn’t great since we all know that these controls sometimes fail, and so defense-in-depth is important. Although bastion hosts can help, they increase the operational overhead of managing, patching, and maintaining infrastructure significantly.

IAM authorization is required to create the EIC Endpoint and also to establish a connection via the endpoint’s secure tunneling technology. Along with identity-based access controls governing who, how, when, and how long users can connect, more traditional network access controls like security groups can also be used. Security groups associated with your VPC resources can be used to grant/deny access. Whether it’s IAM policies or security groups, the default behavior is to deny traffic unless it is explicitly allowed.

EIC Endpoint meets important security requirements in terms of separation of privileges for the control plane and data plane. An administrator with full EC2 IAM privileges can create and control EIC Endpoints (the control plane). However, they cannot use those endpoints without also having EC2 Instance Connect IAM privileges (the data plane). Conversely, DevOps engineers who may need to use EIC Endpoint to tunnel into VPC resources do not require control-plane privileges to do so. In all cases, IAM principals using an EIC Endpoint must be part of the same AWS account (either directly or by cross-account role assumption). Security administrators and auditors have a centralized view of endpoint activity as all API calls for configuring and connecting via the EIC Endpoint API are recorded in AWS CloudTrail. Records of data-plane connections include the IAM principal making the request, their source IP address, the requested destination IP address, and the destination port. See the following figure for an example CloudTrail entry.

Figure 2. Partial CloudTrail entry for an SSH data-plane connection

EIC Endpoint supports the optional use of Client IP Preservation (a.k.a Source IP Preservation), which is an important security consideration for certain organizations. For example, suppose the resource you are connecting to has network access controls that are scoped to your specific public IP address, or your instance access logs must contain the client’s “true” IP address. Although you may choose to enable this feature when you create an endpoint, the default setting is off. When off, connections proxied through the endpoint use the endpoint’s private IP address in the network packets’ source IP field. This default behavior allows connections proxied through the endpoint to reach as far as your route tables permit. Remember, no matter how you configure this setting, CloudTrail records the client’s true IP address.

EIC Endpoints strengthen security by combining identity-based authentication and authorization with traditional network-perimeter controls, and provide fine-grained access control, logging, monitoring, and more defense in depth. Moreover, they do all this without requiring Internet-enabling infrastructure in your VPC, minimizing the possibility of unintended access to private VPC resources.

Getting started

Creating your EIC Endpoint

Only one endpoint is required per VPC. To create or modify an endpoint and connect to a resource, a user must have the required IAM permissions, and any security groups associated with your VPC resources must have a rule to allow connectivity. Refer to the following resources for more details on configuring security groups and sample IAM permissions.

The AWS CLI or Console can be used to create an EIC Endpoint, and we demonstrate the AWS CLI in the following. To create an EIC Endpoint using the Console, refer to the documentation.

Creating an EIC Endpoint with the AWS CLI

To create an EIC Endpoint with the AWS CLI, run the following command, replacing [SUBNET] with your subnet ID and [SG-ID] with your security group ID:

aws ec2 create-instance-connect-endpoint \
    --subnet-id [SUBNET] \
    --security-group-ids [SG-ID]
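
The same call can also be made from Python. The following is a minimal sketch using the Boto3 EC2 client method create_instance_connect_endpoint; the wrapper name `create_eic_endpoint` is hypothetical, and Client IP Preservation is left at its default of off.

```python
from typing import Iterable


def create_eic_endpoint(ec2_client, subnet_id: str,
                        security_group_ids: Iterable[str],
                        preserve_client_ip: bool = False) -> str:
    """Create an EIC Endpoint and return its ID.

    `ec2_client` is expected to be boto3.client("ec2").
    (Hypothetical wrapper for illustration.)
    """
    response = ec2_client.create_instance_connect_endpoint(
        SubnetId=subnet_id,
        SecurityGroupIds=list(security_group_ids),
        PreserveClientIp=preserve_client_ip,
    )
    return response["InstanceConnectEndpoint"]["InstanceConnectEndpointId"]


# Example usage (requires AWS credentials and IAM permission to create endpoints):
#   import boto3
#   endpoint_id = create_eic_endpoint(boto3.client("ec2"), "[SUBNET]", ["[SG-ID]"])
```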

After creating an EIC Endpoint using the AWS CLI or Console, and granting the user IAM permission to create a tunnel, a connection can be established. Now we discuss how to connect to Linux instances using SSH. However, note that you can also use the OpenTunnel API to connect to instances via RDP.

Connecting to your Linux Instance using SSH

With your EIC Endpoint set up in your VPC subnet, you can connect using SSH. Traditionally, access to an EC2 instance using SSH was controlled by key pairs and network access controls. With EIC Endpoint, an additional layer of control is enabled through IAM policy, leading to an enhanced security posture for remote access. We describe two methods to connect via SSH in the following.

One-click command

To further reduce the operational burden of creating and rotating SSH keys, you can use the new ec2-instance-connect ssh command from the AWS CLI. With this new command, we generate ephemeral keys for you to connect to your instance. Note that this command requires use of the OpenSSH client. To use this command and connect, you need IAM permissions as detailed here.

Once configured, you can connect using the new AWS CLI command, shown in the following figure:

Figure 3. AWS CLI view upon successful SSH connection to your instance

To test connecting to your instance from the AWS CLI, you can run the following command where [INSTANCE] is the instance ID of your EC2 instance:

aws ec2-instance-connect ssh --instance-id [INSTANCE]

Note that you can still use long-lived SSH credentials to connect if you must maintain existing workflows, which we will show in the following. However, note that dynamic, frequently rotated credentials are generally safer.

Open-tunnel command

You can also connect using SSH with standard tooling or using the proxy command. To establish a private tunnel (TCP proxy) to the instance, you must run one AWS CLI command, which you can see in the following figure:

Figure 4. AWS CLI view after running new SSH open-tunnel command, creating a private tunnel to connect to our EC2 instance

You can run the following command to test connectivity, where [INSTANCE] is the instance ID of your EC2 instance and [SSH-KEY] is the location and name of your SSH key. For guidance on the use of SSH keys, refer to our documentation on Amazon EC2 key pairs and Linux instances.

ssh ec2-user@[INSTANCE] \
    -i [SSH-KEY] \
    -o ProxyCommand='aws ec2-instance-connect open-tunnel \
    --instance-id %h'
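
If you launch SSH sessions from a script, the command above can be assembled as an argument list, which avoids shell quoting problems around the ProxyCommand option. This is a minimal sketch; `build_ssh_command` is a hypothetical helper, and the default user name ec2-user is taken from the example above.

```python
from typing import List


def build_ssh_command(instance_id: str, key_path: str,
                      user: str = "ec2-user") -> List[str]:
    """Return the ssh argv that tunnels through the EIC Endpoint.

    ssh replaces %h with the host name, which here is the instance ID.
    (Hypothetical helper for illustration.)
    """
    proxy = "aws ec2-instance-connect open-tunnel --instance-id %h"
    return [
        "ssh", f"{user}@{instance_id}",
        "-i", key_path,
        "-o", f"ProxyCommand={proxy}",
    ]


# Example usage:
#   import subprocess
#   subprocess.run(build_ssh_command("[INSTANCE]", "[SSH-KEY]"), check=True)
```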

Once we have our EIC Endpoint configured, we can SSH into our EC2 instances without a public IP or IGW using the AWS CLI.

Conclusion

EIC Endpoint provides a secure solution to connect to your instances via SSH or RDP in private subnets without IGWs, public IPs, agents, and bastion hosts. By configuring an EIC Endpoint for your VPC, you can securely connect using your existing client tools or the Console/AWS CLI. To learn more, visit the EIC Endpoint documentation.

Disaster Recovery for Oracle Database on Amazon EC2 with Fast-Start Failover

Post Syndicated from Harshad Gohil original https://aws.amazon.com/blogs/architecture/disaster-recovery-for-oracle-database-on-amazon-ec2-with-fast-start-failover/

High availability is non-negotiable for organizations today to prevent business-critical application disruptions. Enterprises must prioritize database scalability and availability to avoid downtime in their databases, network, servers, or storage environments.

For organizations that want to avoid required application changes, Oracle Real Application Clusters (RAC) is an option for providing high availability and scalability to the Oracle database. While the RAC feature is not supported by Oracle databases on Amazon Elastic Compute Cloud (Amazon EC2), Oracle Active Data Guard helps achieve high availability on AWS cloud.

The Oracle Data Guard feature helps customers survive disasters and data corruption by creating, maintaining, and managing one or more synchronized standby databases. Configuring Oracle Data Guard Fast-Start Failover (FSFO) on top of it automates failover and further improves availability.

In this blog post, we provide an architectural solution to achieve database high availability when running Oracle Database on Amazon EC2 with Oracle Data Guard along with Fast-Start Failover to address Availability Zones (AZs) or Amazon EC2 instance failures. We also introduce the steps you can take to make database failover happen without manual intervention, and offer recommendations for cross-Region disaster recovery.

Solution overview

Let’s explore this solution by discussing the architecture and two alternate options for securing high availability using Oracle Data Guard, along with the advantages and limitations of each. We will then offer a walkthrough of steps to make database failover happen without manual intervention.

Oracle high availability using Oracle Data Guard with multi-AZ and multi-Region with multi-AZ setup

This architecture is recommended to maintain high availability for Oracle databases on Amazon EC2 with protection against Amazon EC2 service outages in a Region. It provides a disaster recovery environment, and the multi-AZ setup in the secondary Region maintains resiliency after a failover.

In this architecture, Oracle Data Guard Fast Sync replication exists between the Primary database in AZ 1 in Region A, with standbys in AZ 2 Region A (Fast Sync), AZ1 in Region B (ASYNC), and AZ2 in Region B (ASYNC). There is an asynchronous cascading replication setup between standby databases to avoid network latency issues across regions.

Should Region A experience an Amazon EC2 service outage, the Oracle observer, client software that monitors Oracle Data Guard, initiates failover to the Standby database in Region B. Applications can continue to connect to the database, resulting in high availability with minimal data loss that depends on the data change rate, as in Figure 1.

Oracle with cascading standby databases across regions

Figure 1. Oracle with cascading standby databases across regions

The default behavior of Data Guard can be controlled with the Oracle RedoRoutes property, which can be set during setup as in the following example.

Oracle RedoRoutes setup example:

dgmgrl > edit database DB_1A set property RedoRoutes= '(LOCAL: (DB_1B FASTSYNC PRIORITY=1, DB_2A ASYNC PRIORITY=2, DB_2B ASYNC PRIORITY=3)) (DB_1B: (DB_2A ASYNC PRIORITY=1, DB_2B ASYNC PRIORITY=2)) (DB_2A: DB_1B ASYNC) (DB_2B: DB_1B ASYNC)'

dgmgrl > edit database DB_1B set property RedoRoutes= '(LOCAL: (DB_1A FASTSYNC PRIORITY=1, DB_2A ASYNC PRIORITY=2, DB_2B ASYNC PRIORITY=3)) (DB_1A: (DB_2A ASYNC PRIORITY=1, DB_2B ASYNC PRIORITY=2))'

dgmgrl > edit database DB_2A set property RedoRoutes= '(LOCAL: (DB_2B FASTSYNC PRIORITY=1, DB_1A ASYNC PRIORITY=2, DB_1B ASYNC PRIORITY=3)) (DB_2B: (DB_1A ASYNC PRIORITY=1, DB_1B ASYNC PRIORITY=2)) (DB_1A: DB_2B ASYNC) (DB_1B: DB_2B ASYNC)'

dgmgrl > edit database DB_2B set property RedoRoutes= '(LOCAL: (DB_2A FASTSYNC PRIORITY=1, DB_1A ASYNC PRIORITY=2, DB_1B ASYNC PRIORITY=3)) (DB_2A: (DB_1A ASYNC PRIORITY=1, DB_1B ASYNC PRIORITY=2))'

For more information on Oracle RedoRoutes setup for Oracle Cascading Standby, refer to this step-by-step configuration documentation.
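
RedoRoutes values nest parentheses several levels deep, so a quick balance check before pasting a value into DGMGRL can catch typos. The following is a minimal sketch of such a check. It verifies only that parentheses are properly nested, not that the rules are valid for Data Guard, and `parentheses_balanced` is a hypothetical helper.

```python
def parentheses_balanced(value: str) -> bool:
    """Return True if every "(" in the string has a matching ")"."""
    depth = 0
    for char in value:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:  # a ")" appeared before its "("
                return False
    return depth == 0


# Example usage with a two-database rule:
#   parentheses_balanced("(LOCAL: (DB_1B FASTSYNC))")
```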

Database failover with Amazon Route 53 and Oracle Data Guard

The following walkthrough defines the steps you can take to make database failover happen without manual intervention using Amazon Route 53 and Oracle Data Guard.

Prerequisites

Before getting started, review the following prerequisites for this solution:

Walkthrough

Step 1. Create Oracle Database Service

For applications to connect without manual intervention in the event of a failure, we recommend creating an Oracle database service using the Oracle PL/SQL package DBMS_SERVICE.

exec dbms_service.CREATE_SERVICE(SERVICE_NAME=>'DB_SERVICE_FOR_APP', NETWORK_NAME=>'DB_SERVICE_FOR_APP');

exec dbms_service.START_SERVICE('DB_SERVICE_FOR_APP');

Step 2. Network configuration

Applications can connect to the database seamlessly without manual intervention in the event of a failover from the Primary database to the Standby using the Oracle Transparent Application Failover (TAF) approach, though TAF requires updating application connection strings in case of a host IP change.

The following approach using Amazon Route 53 is recommended for added flexibility and scalability. Route 53 has DNS A records that map to the database instance IPs and CNAME records that can redirect DNS queries to A records. The following depicts the DNS mapping. The CNAME, along with the database service name, can be used by the application in its network configuration.

Database_Name =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = <db_cname>)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = <db_service_name>)
    )
  )

To update the CNAME in Route 53 to map to the Primary host automatically in the event of failure, follow these steps.

Step 3. Route 53 setup

Create a script named route53update.sh and place it on the database hosts using the following code.

#!/bin/bash

# Oracle environment. Replace each <<change>> placeholder with your own value.
export ORACLE_HOME="<<change>>"
export LD_LIBRARY_PATH=$ORACLE_HOME/lib
export PATH=$ORACLE_HOME/bin:$PATH:/usr/local/bin:/usr/bin

LOG_FILE="/tmp/switch_dns_$$.log"
DNS_DOMAIN="<<change>>"
ACTIVE_DB_CNAME="<<change>>"
HOSTED_ZONE_ID="<<change>>"
TTL="<<change>>"

# Upsert the CNAME record ${1}.${DNS_DOMAIN} so that it points to host $2.
update_dns () {
  TMPFILE="/tmp/route53_dns_$$.log"
  cat > ${TMPFILE} << EOF
    {
      "Comment":"Updating DNS of record ${1}.${DNS_DOMAIN}",
      "Changes":[
        {
          "Action":"UPSERT",
          "ResourceRecordSet":{
            "ResourceRecords":[
              {
                "Value":"$2"
              }
            ],
            "Name":"${1}.${DNS_DOMAIN}.",
            "Type":"CNAME",
            "TTL":$TTL
          }
        }
      ]
    }
EOF
  /usr/local/bin/aws route53 change-resource-record-sets \
        --hosted-zone-id $HOSTED_ZONE_ID \
        --change-batch file://"$TMPFILE" >> "$LOG_FILE"
}

# Look up the unique name of the current primary database.
prim_uniq_sid=`$ORACLE_HOME/bin/sqlplus -s  / as sysdba <<EOF
set feedback off echo off lines 2000 head off
select upper(db_unique_name) from  v\\$dataguard_config where DEST_ROLE='PRIMARY DATABASE';
EOF
`
prim_uniq_sid=`echo $prim_uniq_sid| sed 's/^[ \t]*//;s/[ \t]*$//'`

# Resolve the host of the primary from its TNS entry.
host_current=`$ORACLE_HOME/bin/tnsping ${prim_uniq_sid}|sed -n 's/\(.*Host\)\([^)]*\)\(.*\)/\2/pi' |sed 's/=//g'|sed 's/^[ \t]*//;s/[ \t]*$//'`

# Read the host that the CNAME currently points to.
dns_current_host=`/usr/local/bin/aws route53 list-resource-record-sets --hosted-zone-id $HOSTED_ZONE_ID --query  "ResourceRecordSets[?Name == '${ACTIVE_DB_CNAME}.${DNS_DOMAIN}.'].ResourceRecords" --output text`

# Update the CNAME only when it does not already point to the primary.
if [ "$host_current" != "$dns_current_host" ]; then
        update_dns ${ACTIVE_DB_CNAME} $host_current
fi
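
For readers who prefer Python, the record update performed by update_dns can be sketched with Boto3 as follows. This is a minimal sketch of the same UPSERT request, not a replacement for the script above: the function names are hypothetical, and the default TTL of 60 seconds is an assumption.

```python
def build_cname_change_batch(record_name: str, target_host: str, ttl: int) -> dict:
    """Build the change batch that points a CNAME record at target_host."""
    return {
        "Comment": f"Updating DNS of record {record_name}",
        "Changes": [
            {
                "Action": "UPSERT",  # create the record, or update it if present
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": int(ttl),
                    "ResourceRecords": [{"Value": target_host}],
                },
            }
        ],
    }


def update_cname(route53_client, hosted_zone_id: str, record_name: str,
                 target_host: str, ttl: int = 60) -> dict:
    """Apply the change; `route53_client` is expected to be boto3.client("route53")."""
    return route53_client.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch=build_cname_change_batch(record_name, target_host, ttl),
    )
```

Keeping the request construction separate from the API call makes it possible to inspect or log the change batch before it is applied.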

Step 4. Database job setup

Create a job in the Oracle Primary database that runs the shell script above, so that the script can be started in the event of a failover. Use the following code.

begin
  dbms_scheduler.create_job
  (
    job_name             => 'route53update',
    job_type             => 'executable',
    number_of_arguments  => 0,
    job_action           => '/<<location of script>>/route53update.sh',
    auto_drop            => false
  );

  dbms_scheduler.enable('route53update');
end;
/

Step 5. Database trigger setup

In the event of a failure, the Primary fails over and the Standby starts up as the new Primary. A trigger needs to be created on the Primary database to execute the job on any failover and update the Route 53 CNAME, using the following code.

create or replace trigger SYS.Update_Route53_Record
AFTER STARTUP ON DATABASE
DECLARE
  db_role varchar2(16);
  db_mode varchar2(20);
BEGIN
  select database_role, open_mode into db_role, db_mode from v$database;
  if db_role = 'PRIMARY' then
    dbms_scheduler.run_job('route53update');
  END IF;
END;
/

Alternate Option 1: Single Region with multi-AZ

This option is a minimum recommended configuration to maintain high availability for Oracle databases on Amazon EC2 for customers who do not have a multi-region setup.

  • Advantage: Protects against Amazon EC2 service outage in a single AZ.
  • Limitation: Does not protect against Amazon EC2 service outages in a single Region.

In this architecture, Oracle Data Guard Fast Sync replication exists between the Oracle database instance in a multi-AZ setup with the Primary database (Read Write) in AZ 1 and the Standby database (Read Only) in AZ 2.

If the primary database is unreachable due to any failure, the observer fails over to the standby database in a different AZ. Applications can continue to connect to the database with zero data loss due to synchronous replication between AZs using the Maximum Availability/Maximum Protection mode setup in Oracle Data Guard. If the primary database is in us-east-1a and the standby is in us-east-1b, the RedoRoutes property can be defined as follows.

Oracle RedoRoutes setup example:

dgmgrl> edit database DB_1A set property RedoRoutes= '(LOCAL: (DB_1B FASTSYNC))'

dgmgrl> edit database DB_1B set property RedoRoutes= '(LOCAL: (DB_1A FASTSYNC))'

For more information on how disaster recovery works in the AWS Cloud, visit the Disaster recovery is different in the cloud section of the AWS Well-Architected Framework. For more on Oracle RedoRoutes setup, refer to the Oracle Redo Routing Rules documentation.

Alternate Option 2: Multi-AZ with multi-Region with single AZ

This option is recommended to maintain high availability for an Oracle database on Amazon EC2 for customers who need multi-region availability. It provides protection against the rare unavailability of Amazon EC2 instances in the primary Region, in which case a disaster recovery environment is provided.

  • Advantage: Protects against Amazon EC2 service outages in an AZ or an AWS Region.
  • Limitation: Decreased resiliency, without high availability, after an Amazon EC2 service outage in an entire Region.

In this architecture, Oracle Data Guard Fast Sync replication exists between the Oracle database instance in multi-AZ within the single Region, with the Primary database in AZ 1 in Region A and Standby database in AZ 2 in Region A. There is an asynchronous replication setup between the Standby database cross-Region.

Asynchronous replication is recommended between Regions to avoid network latency issues. A cascading standby setup ensures there is no additional performance impact on the primary database to send data to multiple standbys.

If the primary database is unreachable, failover happens between AZs in Region A. In the event of an Amazon EC2 service outage in a Region, failover occurs to Region B, resulting in high availability with minimal data loss based on the data change rate amount. If the primary database is in us-east-1a and standby in us-east-1b (Fast Sync) and us-east-2a (Async), the RedoRoutes property can be defined as follows.

Oracle RedoRoutes setup example:

dgmgrl > edit database DB_1A set property RedoRoutes= '(LOCAL: (DB_1B FASTSYNC PRIORITY=1, DB_2A ASYNC PRIORITY=2))(DB_1B: DB_2A ASYNC)(DB_2A: DB_1B ASYNC)'

dgmgrl > edit database DB_1B set property RedoRoutes= '(LOCAL: (DB_1A FASTSYNC PRIORITY=1, DB_2A ASYNC  PRIORITY=2)) (DB_1A: DB_2A ASYNC)'

dgmgrl > edit database DB_2A set property RedoRoutes= '(LOCAL: (DB_1A FASTSYNC PRIORITY=1, DB_1B ASYNC  PRIORITY=2))'

Cleaning up

The services involved in this solution incur costs. When you’re done using this solution, clean up the following resources:

  • Amazon EC2 instances – Stop or delete (terminate) the Amazon EC2 instances that you provisioned.
  • Route 53 – Delete the hosted zone and the A records/CNAMEs created.

Conclusion

This blog post demonstrates how high availability and disaster recovery can be achieved for an Oracle database on an Amazon EC2 instance using Oracle Data Guard. Using the architectures in this post, you can achieve zero data loss with the Oracle Fast-Start Failover option within the same Region or cross-Region on Amazon EC2.

You can also use this architecture to replicate data from an Oracle database on Amazon EC2 to an Oracle database hosted outside of the AWS cloud. With Oracle Cascading Standby and Oracle RedoRoutes, you can remove high dependency on the Primary database to improve overall performance.

Simplify fine-grained authorization with Amazon Verified Permissions and Amazon Cognito

Post Syndicated from Phil Windley original https://aws.amazon.com/blogs/security/simplify-fine-grained-authorization-with-amazon-verified-permissions-and-amazon-cognito/

AWS customers already use Amazon Cognito for simple, fast authentication. With the launch of Amazon Verified Permissions, many will also want to add simple, fast authorization to their applications by using the user attributes that they have in Amazon Cognito. In this post, I will show you how to use Amazon Cognito and Verified Permissions together to add fine-grained authorization to your applications.

Fine-grained authorization

With Verified Permissions, you can write policies to use fine-grained authorization in your applications. Policy-based access control helps you secure your applications without embedding complicated access control code in the application logic. Instead, you write policies that say who can take what actions on which resources, and you evaluate the policies by using the Verified Permissions API. The API evaluates access control policies in the context of an access request: Who’s making the request? What do they want to do? What do they want to access? Under what conditions are they making the request?

AWS has removed the work needed to create context about who is making the request by using data from Amazon Cognito. By using Amazon Cognito as your identity store and integrating Verified Permissions with Amazon Cognito, you can use the ID and access tokens that Amazon Cognito returns in the authorization decisions in your applications. You provide Amazon Cognito tokens to Verified Permissions, which uses the attributes that the tokens contain to represent the principal and identify the principal’s entitlements.

Make access requests

To write policies for Verified Permissions, you use a policy language called Cedar. The examples in this post are modified from those in the Cedar language tutorial; I recommend that you review this tutorial for a comprehensive introduction on how to build and use Cedar policies. The following is an example of what a Cedar policy in a photo-sharing application might look like:

permit(
  principal == User::"alice", 
  action    == Action::"update", 
  resource  == Photo::"VacationPhoto94.jpg"
);

This policy states that Alice—the principal—is permitted to update the resource named VacationPhoto94.jpg. You place the policies for your application in a dedicated policy store. To ask Verified Permissions for an authorization decision, use the isAuthorized operation from the API. For example, the following request asks if user alice is permitted to update VacationPhoto94.jpg. Note that I’ve left out details like the policy store identifier for clarity.

// JS pseudocode
const authResult = await avp.isAuthorized({
    principal: 'User::"alice"',
    action: 'Action::"update"',
    resource: 'Photo::"VacationPhoto94.jpg"'
});

This request returns a decision of allow because the principal, action, and resource all match those in the policy. If a user named bob tries to update the photo, the request returns a decision of deny because the policy only allows user alice. For requests that only need values for the principal, action, and resource, you can generally use values from the data in the application.
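To make the allow and deny outcomes concrete, the following is a minimal sketch of the matching logic for this one policy. It is not how Cedar or Verified Permissions is implemented; the decide function and the policy object shape are hypothetical, and the sketch only handles exact matches on principal, action, and resource.

```javascript
// Hypothetical, simplified stand-in for policy evaluation: a request is
// allowed only if some permit policy matches all three parts exactly.
const policies = [
  {
    effect: "permit",
    principal: 'User::"alice"',
    action: 'Action::"update"',
    resource: 'Photo::"VacationPhoto94.jpg"',
  },
];

function decide(request, policyList = policies) {
  const matched = policyList.some(
    (p) =>
      p.effect === "permit" &&
      p.principal === request.principal &&
      p.action === request.action &&
      p.resource === request.resource
  );
  // Without a matching permit policy, the request is denied by default.
  return matched ? "allow" : "deny";
}

const photo = 'Photo::"VacationPhoto94.jpg"';
const update = 'Action::"update"';

console.log(decide({ principal: 'User::"alice"', action: update, resource: photo })); // allow
console.log(decide({ principal: 'User::"bob"', action: update, resource: photo })); // deny
```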

Things get more interesting when you build policies that use attributes of the principal to make the decision. In these cases, it can be challenging to assemble a complete and accurate authorization context. Policies might refer to attributes of the principal, such as their geographic location or whether they are paid subscribers. Fine-grained access control requires that you write good policies, properly format the access request, and provide the needed attributes for the policies to be evaluated. To get needed attributes, you must often make inline requests to other systems and then transform the results to meet the policy requirements.

The following policy uses an attribute requirement to allow any principal to view any resource as long as they are located in the United States.

permit(
  principal,
  action == Action::"view",
  resource
)
when {principal.location == "USA"};

To make an authorization request, you must supply the needed attributes for the principal. The following code shows how to do that by using the isAuthorized operation:

const authResult = await avp.isAuthorized({
    principal: 'User::"alice"',
    action: 'Action::"view"',
    resource: 'Photo::"VacationPhoto94.jpg"',
    // whenever our policy references attributes of the entity,
    // isAuthorized needs an entity argument that provides    
    // those attributes
    entities: [
        {
            "identifier": {
                "entityType": "User",
                "entityId": "alice"
            },
            "attributes": {
                "location": {
                    "String": "USA"
                }
            }
        }
    ]
});

The authorization request now includes the context that Alice is located in the USA. According to the preceding policy, she should be allowed to view VacationPhoto94.jpg. To make this request, you must gather this information so that you can include it in the request.
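Gathering that information usually means turning application data into the entities shape used in the request. A minimal sketch, assuming a hypothetical helper named buildUserEntity and attributes that are all strings:

```javascript
// Hypothetical helper: converts a plain object of string attributes into
// the entity structure used in the isAuthorized request.
function buildUserEntity(userId, attributes) {
  const typed = {};
  for (const [name, value] of Object.entries(attributes)) {
    // This sketch only handles string values.
    typed[name] = { String: String(value) };
  }
  return {
    identifier: { entityType: "User", entityId: userId },
    attributes: typed,
  };
}

// The application would look up Alice's location in its own data store.
const entities = [buildUserEntity("alice", { location: "USA" })];
console.log(JSON.stringify(entities, null, 2));
```

The resulting array can be passed as the entities argument of the request.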

Use Amazon Cognito

If the photo-sharing application uses Amazon Cognito for user authentication, then Amazon Cognito will pass an ID token to it when the authentication is successful. For more information about the format and encoding of ID tokens, see the OpenID Connect Core specification. For our purposes, it’s enough to know that after the ID token is verified and decoded, it contains a JSON structure with named attributes.

When you configure Amazon Cognito, you can specify the fields that are contained in the ID token. These might be user editable, but they can also be programmatically generated from other sources. Suppose that you have configured your Amazon Cognito user pool to also include a custom location attribute, custom:location. When Alice logs in to the photo-sharing application, the ID token that Amazon Cognito provides for her contains the following fields. Note that I’ve made the sub (subject) field human readable, but it would actually be an Amazon Cognito entity identifier.

{
 ...
 "iss": "User",
 "sub": "alice",
 "custom:location": "USA",
 ...
}

If the needed attributes are in Amazon Cognito, you can use them to create the attributes that isAuthorized needs to render an access decision.

Figure 1: Steps to use Amazon Cognito with isAuthorized

As shown in Figure 1, you need to complete the following six steps to use isAuthorized with Amazon Cognito:

  1. Get the token from Amazon Cognito when the user authenticates
  2. Verify the token to authenticate the user
  3. Decode the token to retrieve attribute information
  4. Create the policy principal from information in the token or from other sources
  5. Create the entities structure by using information in the token or from other sources
  6. Call Amazon Verified Permissions by using isAuthorized
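Steps 3 through 5 can be sketched as follows. This is a minimal illustration that runs on Node.js, not a production implementation: the function names are hypothetical, the demo token is built locally and is unsigned, and the sketch assumes that step 2 (signature verification) has already happened, because decoding a token proves nothing about who issued it.

```javascript
// Step 3: decode the payload (middle segment) of an already verified JWT.
function decodeJwtPayload(token) {
  const parts = token.split(".");
  if (parts.length !== 3) {
    throw new Error("expected a JWT with three dot-separated parts");
  }
  // JWT segments are base64url encoded; convert to standard base64 first.
  const base64 = parts[1].replace(/-/g, "+").replace(/_/g, "/");
  return JSON.parse(Buffer.from(base64, "base64").toString("utf8"));
}

// Steps 4 and 5: build the principal and the entities structure from claims.
function buildRequest(claims, action, resource) {
  return {
    principal: `User::"${claims.sub}"`,
    action,
    resource,
    entities: [
      {
        identifier: { entityType: "User", entityId: claims.sub },
        attributes: { location: { String: claims["custom:location"] } },
      },
    ],
  };
}

// Unsigned demo token so that the example runs without Amazon Cognito.
const encode = (obj) =>
  Buffer.from(JSON.stringify(obj))
    .toString("base64")
    .replace(/\+/g, "-")
    .replace(/\//g, "_")
    .replace(/=+$/, "");
const demoToken = [
  encode({ alg: "none" }),
  encode({ sub: "alice", "custom:location": "USA" }),
  "signature-placeholder",
].join(".");

const request = buildRequest(
  decodeJwtPayload(demoToken),
  'Action::"view"',
  'Photo::"VacationPhoto94.jpg"'
);
console.log(JSON.stringify(request, null, 2));
```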

Now that you understand how to make a request by manually creating the context using data provided by an identity provider, I’ll share how the AWS integration of Amazon Cognito and Verified Permissions can help reduce your workload.

Use isAuthorizedWithToken

In addition to isAuthorized, the Verified Permissions API provides the isAuthorizedWithToken operation that accepts Amazon Cognito tokens. If your application uses Amazon Cognito for authentication, then Amazon Cognito provides the ID token after the user logs in. Let’s assume that you have stored this token in a variable named cognito_id_token. Because you are using an attribute from Amazon Cognito, you modify the previous policy to accommodate the namespace that the Amazon Cognito attributes are in, as shown in the following:

permit(
  principal,
  action == Action::"view",
  resource
)
when {principal.custom.location == "USA"};

With this policy, the photo-sharing application can use the ID token to make an authorization request using isAuthorizedWithToken:
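A minimal sketch of that call, in the same simplified style as the earlier isAuthorized examples. The identityToken parameter name mirrors the field of the same name in the IsAuthorizedWithToken API request; the avp stub and the placeholder token value are hypothetical and exist only so that the example runs without AWS credentials.

```javascript
// Hypothetical stand-in for a Verified Permissions client. It records the
// request and returns a fixed decision; a real client would call the service.
const avp = {
  lastRequest: null,
  async isAuthorizedWithToken(request) {
    this.lastRequest = request;
    return { decision: "allow" };
  },
};

// Placeholder for the ID token that Amazon Cognito returns after sign-in.
const cognito_id_token = "<ID token from Amazon Cognito>";

async function canViewPhoto() {
  // No principal and no principal entities: both are derived from the token.
  const authResult = await avp.isAuthorizedWithToken({
    identityToken: cognito_id_token,
    action: 'Action::"view"',
    resource: 'Photo::"VacationPhoto94.jpg"',
  });
  return authResult.decision;
}

canViewPhoto().then((decision) => console.log(decision));
```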

Using this call, you don’t have to supply the principal or construct the entities argument for the principal. Instead, you just pass in the ID token that Amazon Cognito provides. When you call isAuthorizedWithToken, it constructs the principal by using information in the token and creates an entity context that includes Alice’s location from the attributes in the ID token.

Figure 2: Use isAuthorizedWithToken

As shown in Figure 2, when you use isAuthorizedWithToken, you only need to complete two steps:

  1. Get the token from Amazon Cognito when the user authenticates
  2. Call Verified Permissions using isAuthorizedWithToken, passing in the token

Optionally, your application might need to verify the token to authenticate the user and avoid unnecessary work, or it can rely on isAuthorizedWithToken to do that. Similarly, you might also decode the token if you need its values for the application logic.

Configure isAuthorizedWithToken

You can use the Verified Permissions console to tell the API which Amazon Cognito user pool you’re using for your application. This is called an identity source.

To create an identity source for use with Verified Permissions (console):

  1. Sign in to the AWS Management Console and open the Verified Permissions console.
  2. Choose Identity source. You will see a list showing the sources that you have already configured.
  3. To create a new identity source, choose Create identity source.
  4. Enter the User pool ID.
  5. Enter the Principal type.
  6. Select whether or not to validate Client application IDs for this source.
  7. (Optional) Enter Tags.
  8. Choose Create identity source to add a new identity source to Verified Permissions with the given client ID.

The isAuthorizedWithToken operation uses the configuration information to get the public keys from Amazon Cognito to validate and decode the token. It also uses the configuration information to validate that the user pool for the token is associated with the policy store that the API is using.

Add resource entity attributes

ID tokens provide attributes about the principal, but they don’t provide information about the resource. Policies will often reference resource attributes, as shown in the following policy:

permit(
  principal,
  action == Action::"view",
  resource
)
when {resource.accessLevel == "public" && 
      principal.custom.location == "USA"};

Like the previous example, this policy requires that photo viewers are located in the US, but it also requires that the resource has an accessLevel attribute with the value public. The isAuthorizedWithToken operation can augment the entity information in the token by using the same entities argument that isAuthorized uses. You can add the resource entity information to the call, as shown in the following call to isAuthorizedWithToken:

// JS pseudocode; identityToken carries the Amazon Cognito ID token
const authResult = await avp.isAuthorizedWithToken({
    identityToken: cognito_id_token,
    action: 'Action::"view"',
    resource: 'Photo::"VacationPhoto94.jpg"',
    // the policy references an attribute of the resource, which the
    // ID token can't supply, so the entities argument provides it
    entities: [
        {
            "identifier": {
                "entityType": "Photo",
                "entityId": "VacationPhoto94.jpg"
            },
            "attributes": {
                "accessLevel": {
                    "String": "public"
                }
            }
        }
    ]
});

The isAuthorizedWithToken operation combines the entity information that it gleans from the ID token with the explicit entities information provided in the call to construct the context needed for the authorization request.
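Conceptually, the combination might look like the following sketch. It is an illustration only, not the service's implementation: the function names are hypothetical, the token claims are passed in already decoded, and the condition is hard-coded to the example policy that checks resource.accessLevel and principal.custom.location.

```javascript
// Hypothetical: derive the principal entity from decoded ID token claims.
function principalFromClaims(claims) {
  return {
    identifier: { entityType: "User", entityId: claims.sub },
    // Mirrors the policy's principal.custom.location reference.
    attributes: { custom: { location: { String: claims["custom:location"] } } },
  };
}

function findEntity(entities, entityType, entityId) {
  return entities.find(
    (e) =>
      e.identifier.entityType === entityType &&
      e.identifier.entityId === entityId
  );
}

function decideView(claims, resourceId, explicitEntities) {
  // Token-derived principal data plus the entities supplied in the call.
  const entities = [principalFromClaims(claims), ...explicitEntities];
  const principal = findEntity(entities, "User", claims.sub);
  const resource = findEntity(entities, "Photo", resourceId);
  // A missing entity or attribute means the condition cannot be satisfied.
  const accessLevel = resource?.attributes?.accessLevel?.String;
  const location = principal?.attributes?.custom?.location?.String;
  return accessLevel === "public" && location === "USA" ? "allow" : "deny";
}

const photoEntity = {
  identifier: { entityType: "Photo", entityId: "VacationPhoto94.jpg" },
  attributes: { accessLevel: { String: "public" } },
};
const aliceClaims = { sub: "alice", "custom:location": "USA" };

console.log(decideView(aliceClaims, "VacationPhoto94.jpg", [photoEntity])); // allow
console.log(decideView(aliceClaims, "VacationPhoto94.jpg", [])); // deny
```

The second call is denied because nothing in the ID token describes the photo, which is why the resource entity has to be supplied explicitly.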

Conclusion

Discussions about access control tend to focus on the policies: how you can use them to make the code cleaner, and the value of moving the authorization logic out of the code. But that often ignores the work needed to create meaningful authorization requests. Assembling the information about the entities is a big part of making the request. If you use both Amazon Cognito and Verified Permissions, the integration discussed in this blog post can help relieve you of the work needed to build entity information about principals and help provide you with assurance that the assembly is happening consistently and securely.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Phil Windley

Phil is a Senior Software Development Manager in AWS Identity. He and his team work to make access management both easier to use and easier to understand. Phil has been working in digital identity for many years, and recently wrote Learning Digital Identity from O’Reilly Media. Outside of work, Phil loves to bike, read, and spend time with grandkids.

Prevent account creation fraud with AWS WAF Fraud Control – Account Creation Fraud Prevention

Post Syndicated from David MacDonald original https://aws.amazon.com/blogs/security/prevent-account-creation-fraud-with-aws-waf-fraud-control-account-creation-fraud-prevention/

Threat actors use sign-up pages and login pages to carry out account fraud, including taking unfair advantage of promotional and sign-up bonuses, publishing fake reviews, and spreading malware.

In 2022, AWS released AWS WAF Fraud Control – Account Takeover Prevention (ATP) to help protect your application’s login page against credential stuffing attacks, brute force attempts, and other anomalous login activities.

Today, we introduce AWS WAF Fraud Control – Account Creation Fraud Prevention (ACFP) to help protect your application’s sign-up pages against fake account creation by detecting and blocking fake account creation requests.

You can now get comprehensive account fraud prevention by combining AWS WAF Account Creation Fraud Prevention and Account Takeover Prevention in your AWS WAF web access control list (web ACL). In this post, we will show you how to set up AWS WAF with ACFP for your application sign-up pages.

Overview of Account Creation Fraud Prevention for AWS WAF

ACFP helps protect your account sign-up pages by continuously monitoring requests for anomalous digital activity and automatically blocking suspicious requests based on request identifiers, behavioral analysis, and machine learning.

ACFP uses multiple capabilities to help detect and block fake account creation requests at the network edge before they reach your application. An automated vetting process for account creation requests uses rules based on reputation and risk to protect your registration pages against use of stolen credentials and disposable email domains. ACFP uses silent challenges and CAPTCHA challenges to identify and respond to sophisticated bots that are designed to actively evade detection.

ACFP is an AWS Managed Rules rule group. If you already use AWS WAF, you can configure ACFP without making architectural changes. On a single configuration page, you specify the registration page request inspection parameters that ACFP uses to detect fake account creation requests, including user identity, address, and phone number.

ACFP uses session tokens to separate legitimate client sessions from those that are not. These tokens allow ACFP to verify that the client applications that sign up for an account are legitimate. The AWS WAF JavaScript SDK automatically generates these tokens during the frontend application load. We recommend that you integrate the AWS WAF JavaScript SDK into your application, particularly for single-page applications where you don’t want page refreshes.

Walkthrough

In this walkthrough, we will show you how to set up ACFP for AWS WAF to help protect your account sign-up pages against account creation fraud. This walkthrough has two main steps:

  1. Set up an AWS managed rule group for ACFP in the AWS WAF console.
  2. Add the AWS WAF JavaScript SDK to your application pages.

Set up Account Creation Fraud Prevention

The first step is to set up ACFP by creating a web ACL or editing an existing one. You will add the ACFP rule group to this web ACL.

The ACFP rule group requires that you provide your registration page path, account creation path, and optionally the sign-up request fields that map to user identity, address, and phone number. ACFP uses this configuration to detect fraudulent sign-up requests and then decide on an appropriate action, such as blocking the request, presenting a challenge interstitial during the frontend application load, or requiring a CAPTCHA.

To set up ACFP

  1. Open the AWS WAF console, and then do one of the following:
    • To create a new web ACL, choose Create web ACL.
    • To edit an existing web ACL, choose the name of the ACL.
  2. On the Rules tab, for the Add Rules dropdown, select Add managed rule groups.
  3. Add the Account creation fraud prevention rule set to the web ACL. Then, choose Edit to edit the rule configuration.
  4. For Rule group configuration, provide the following information that the ACFP rule group requires to inspect account creation requests, as shown in Figure 1.
    • For Registration page path, enter the path of your application’s registration page.
    • For Account creation path, enter the path of the endpoint that accepts the completed registration form.
    • For Request inspection, select whether the endpoint that you specified in Account creation path accepts JSON or FORM_ENCODED payload types.
    Figure 1: Account creation fraud prevention - Add account creation paths

  5. (Optional): Provide Field names used in submitted registration forms, as shown in Figure 2. This helps ACFP more accurately identify requests that contain information that is considered stolen, or with a bad reputation. For each field, provide the relevant information that was included in your account creation request. For this walkthrough, we use JSON pointer syntax.
     
    Figure 2: Account creation fraud prevention - Add optional field names

  6. For Account creation fraud prevention rules, review the actions taken on each category of account creation fraud, and optionally customize them for your web applications. For this walkthrough, we leave each category set to its default action, as shown in Figure 3. If you want to customize the rules, you can select different actions for each category based on your application security needs:
    • Allow – Allows the request to be sent to the protected resource.
    • Block – Blocks the request, returning an HTTP 403 (Forbidden) response.
    • Count – Allows the request to be sent to the protected resource while counting detections. The count shows you bot activity that is occurring without blocking or challenging. When you turn on rules for the first time, this information can help you see what the detections are, before you change the actions.
    • CAPTCHA and Challenge – Use CAPTCHA puzzles and silent challenges with tokens to track successful client responses.
    Figure 3: Account creation fraud prevention - Select actions for each category

  7. To save the configuration, choose Save.
  8. To add the ACFP rule group to your web ACL, choose Add rules.
  9. (Optional) Include additional rules in your web ACL, as described in the Best practices section that follows.
  10. To create or edit your web ACL, proceed through the remaining configuration pages.
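Step 5 identifies form fields with JSON pointer syntax. As a rough illustration of how a pointer such as /form/email selects a value from a JSON sign-up payload, here is a minimal resolver that follows RFC 6901. The payload shape and field names are hypothetical, and this is not code that AWS WAF exposes; it only shows what the pointer strings mean.

```javascript
// Minimal RFC 6901 JSON pointer resolver (illustrative only).
function resolvePointer(doc, pointer) {
  if (pointer === "") return doc; // the empty pointer selects the whole document
  if (!pointer.startsWith("/")) {
    throw new Error("a non-empty JSON pointer must start with '/'");
  }
  let current = doc;
  for (const rawToken of pointer.slice(1).split("/")) {
    // Unescape in this order: ~1 becomes "/", then ~0 becomes "~".
    const token = rawToken.replace(/~1/g, "/").replace(/~0/g, "~");
    if (current === null || typeof current !== "object" || !(token in current)) {
      return undefined; // the pointer does not match this payload
    }
    current = current[token];
  }
  return current;
}

// Hypothetical sign-up request body.
const signupBody = {
  form: {
    username: "alice",
    email: "alice@example.com",
    phone: { primary: "+1-555-0100" },
  },
};

console.log(resolvePointer(signupBody, "/form/email")); // alice@example.com
console.log(resolvePointer(signupBody, "/form/phone/primary")); // +1-555-0100
console.log(resolvePointer(signupBody, "/form/address")); // undefined
```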

Add the AWS WAF JavaScript SDK to your application pages

The next step is to find the AWS WAF JavaScript SDK and add it to your application pages.

The SDK injects a token in the requests that you send to your protected resources. You must use the SDK integration to fully enable ACFP detections.

To add the SDK to your application pages

  1. In the AWS WAF console, in the left navigation pane, choose Application integration.
  2. Under Web ACLs that are enabled for application integration, choose the name of the web ACL that you created previously.
  3. Under JavaScript SDK, copy the provided code snippet. This code snippet allows for creation of the cryptographic token in the background when the application loads for the first time. Figure 4 shows the SDK link.
    Figure 4: Application integration – Add JavaScript SDK link to application pages

  4. Add the code snippet to your pages. For example, paste the provided script code within the <head> section of the HTML. For ACFP, you only need to add the code snippet to the registration page, but if you are using other AWS WAF managed rules such as Account Takeover Prevention or Targeted Bots on other pages, you will also need to add the code snippet to those pages.
  5. To validate that your application obtains tokens correctly, load your application in a browser and verify that a cookie named aws-waf-token has been set during page load.
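The check in step 5 can also be scripted. A minimal sketch, assuming a hypothetical helper named hasWafToken that receives a cookie string in the name=value; name=value form (in a browser, document.cookie, provided the cookie is visible to page scripts):

```javascript
// Hypothetical helper: returns true when a non-empty cookie named
// aws-waf-token is present in the given cookie string.
function hasWafToken(cookieString) {
  return cookieString
    .split(";")
    .map((pair) => pair.trim())
    .some((pair) => {
      const separator = pair.indexOf("=");
      if (separator === -1) return false;
      const name = pair.slice(0, separator);
      const value = pair.slice(separator + 1);
      return name === "aws-waf-token" && value.length > 0;
    });
}

// In a browser console you would pass document.cookie instead.
console.log(hasWafToken("theme=dark; aws-waf-token=abc123")); // true
console.log(hasWafToken("theme=dark")); // false
```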

Review metrics

Now that you’ve set up the web ACL and integrated the SDK with the application, you can use the bot visualization dashboard in AWS WAF to review fraudulent account creation traffic patterns. ACFP rules emit metrics that correspond to their labels, helping you identify which rule within the ACFP rule group initiated an action. You can also use labels and rule actions to filter AWS WAF logs so that you can further examine a request.

To view AWS WAF metrics for the distribution

  1. In the AWS WAF console, in the left navigation pane, select Web ACLs.
  2. Select the web ACL for which ACFP is enabled, and then choose the Bot Control tab to view the metrics.
  3. In the Filter metrics by dropdown, select Account creation fraud prevention to see the ACFP metrics for your web ACL.
Figure 5: Account creation fraud prevention – Review web ACL metrics

Best practices

In this section, we share best practices for your ACFP rule group setup.

Limit the requests that ACFP evaluates to help lower costs

AWS WAF evaluates web ACL rules in priority order and takes the action associated with the first rule that a request matches. Requests that match and are blocked by a rule will not be evaluated against lower priority rules. AWS WAF only evaluates the ACFP rule group if a request matches the registration and account creation URI paths that are specified in the configuration.

You will incur additional fees for requests that ACFP evaluates. To help reduce ACFP costs, use higher priority rules to block requests before the ACFP rule group evaluates them. For example, you can add a higher priority AWS Managed Rules IP reputation rule group to block account creation requests from bots and other threats before ACFP evaluates them. Rate-based rules with a higher priority than the ACFP rule group can help mitigate volumetric account creation attempts by limiting the number of requests that a single IP can make in a five-minute period. For further guidance on rate-based rules, see The three most important AWS WAF rate-based rules.

If you are using the AWS WAF Bot Control rule group, give it a higher priority than the ACFP rule group because it’s less expensive to evaluate.
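The cost effect of rule ordering can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not AWS WAF's implementation: rules are plain objects with hypothetical names and paths, a lower priority number is evaluated first (as in AWS WAF), Block ends evaluation while Count does not, and the ACFP rule group is modeled as a single rule that only inspects requests to the configured paths.

```javascript
// Simplified model: evaluate rules in ascending priority number and stop at
// the first terminating action. Returns which rules inspected the request.
function evaluate(rules, request) {
  const inspectedBy = [];
  const ordered = [...rules].sort((a, b) => a.priority - b.priority);
  for (const rule of ordered) {
    if (rule.scope && !rule.scope(request)) continue; // out of scope, not inspected
    inspectedBy.push(rule.name);
    if (rule.matches(request) && rule.action !== "Count") {
      return { action: rule.action, inspectedBy };
    }
  }
  return { action: "Allow", inspectedBy }; // assumed web ACL default action
}

const acfpPaths = ["/signup", "/api/accounts"]; // hypothetical paths
const rules = [
  { name: "ip-reputation", priority: 0, action: "Block", matches: (r) => r.badIp },
  {
    name: "acfp",
    priority: 10,
    action: "Block",
    scope: (r) => acfpPaths.includes(r.path),
    matches: (r) => r.suspiciousSignup,
  },
];

const fromBadIp = evaluate(rules, { path: "/api/accounts", badIp: true, suspiciousSignup: true });
const otherPage = evaluate(rules, { path: "/home", badIp: false, suspiciousSignup: false });
const signup = evaluate(rules, { path: "/api/accounts", badIp: false, suspiciousSignup: false });

console.log(fromBadIp.action, fromBadIp.inspectedBy); // blocked before ACFP inspects it
console.log(otherPage.action, otherPage.inspectedBy); // outside the ACFP paths
console.log(signup.action, signup.inspectedBy); // inspected by both rules
```

In this model, only the third request would count toward ACFP evaluation fees.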

Use SDK integration

ACFP requires the tokens that the SDK generates. The SDK can generate these tokens silently rather than requiring a redirect or CAPTCHA. Both AWS WAF Bot Control and AWS WAF Fraud Control use the same SDK if both rule groups are in the same web ACL.

These tokens have a default immunity time (otherwise known as a timeout) of 5 minutes, after which AWS WAF requires the client to be challenged again. You can use the AWS WAF integration fetch wrapper in your single-page application to help ensure that the token retrieval completes before the client sends requests to your account creation API without requiring a page refresh. Alternatively, you can use the getToken operation if you are not using fetch.

You can continue to use the CAPTCHA JavaScript API instead if you’ve already integrated this into your application.

Use both ACFP and ATP for comprehensive account fraud prevention

You can help prevent account fraud for both sign-up and login pages by enabling the ATP rule group in the same web ACL as ACFP.

Test ACFP before you deploy it to production

Test and tune your ACFP implementation in a staging or testing environment to help avoid negatively impacting legitimate users. We recommend that you start by deploying your rules in count mode in production to understand potential impact to your traffic before switching them back to the default rule actions. Use the default ACFP rule group actions when you deploy the web ACL to production. For further guidance, see Testing and Deploying ACFP.

Pricing and availability

ACFP is available today on Amazon CloudFront and in 22 AWS Regions. For information on availability and pricing, see AWS WAF Pricing.

Conclusion

In this post, we showed you how to use ACFP to protect your application’s sign-up pages against fake account creation. You can now combine ACFP with ATP managed rules in a single web ACL for comprehensive account fraud prevention. For more information and to get started today, see the AWS WAF Developer Guide.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Security, Identity, & Compliance re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

David MacDonald

David is a Senior Solutions Architect focused on helping New Zealand startups build secure and scalable solutions. He has spent most of his career building and operating SaaS products that serve a variety of industries. Outside of work, David is an amateur farmer, and tends to a small herd of alpacas and goats.

Geary Scherer

Geary is a Solutions Architect focused on Travel and Hospitality customers in the Southeast US. He holds all 12 current AWS certifications and loves to dive into complex Edge Services use cases to help AWS customers, especially around Bot Mitigation. Outside of work, Geary enjoys playing soccer and cheering his daughters on at dance and softball competitions.

Federate Amazon QuickSight access with open-source identity provider Keycloak

Post Syndicated from Ayah Chamseddin original https://aws.amazon.com/blogs/big-data/federate-amazon-quicksight-access-with-open-source-identity-provider-keycloak/

Amazon QuickSight is a scalable, serverless, embeddable, machine learning (ML) powered business intelligence (BI) service built for the cloud that supports identity federation in both Standard and Enterprise editions. Organizations are working toward centralizing their identity and access strategy across all their applications, including on-premises and third-party. Many organizations use Keycloak as their identity provider (IdP) to control and manage user authentication and authorization centrally. You can enable role-based access control to make sure users get appropriate role permissions in QuickSight based on their entitlement stored in Keycloak attributes.

In this post, we walk through the steps you need to configure federated single sign-on (SSO) between QuickSight and open-source IdP Keycloak. We also demonstrate ways to assign QuickSight roles based on Keycloak membership. Administrators can publish QuickSight applications on the Keycloak Admin console. This enables you to sign in to QuickSight through SSO using your Keycloak credentials.

Prerequisites

To complete the walkthrough, you need the following prerequisites:

Solution overview

The walkthrough includes the following steps:

  1. Register a client application in Keycloak.
  2. Configure the application in Keycloak.
  3. Add Keycloak as your SAML IdP in AWS.
  4. Configure IAM policies.
  5. Configure IAM roles.
  6. Assign the newly created roles in IAM to users and groups in Keycloak.

Register a client application in Keycloak

To configure the integration of an SSO application in Keycloak, you need to create a Keycloak client application.

  1. Sign in to your Keycloak admin dashboard.
    For instructions on installing Keycloak, refer to Keycloak Downloads. For the Keycloak admin dashboard, use http://localhost:8080/.
  2. Create a new realm by choosing Create realm on the default realm master page.
    Create realm in Keycloak user interface
  3. Assign a name for this new realm. For this example, we assign the name aws-realm.
    Add realm name in Keycloak user interface
  4. When the new realm has been created, choose Clients.
  5. Choose Create client to create a new Keycloak application for SSO Federation to QuickSight.
    Create client in Keycloak user interface

Configure the application in Keycloak

Follow the steps to configure the application in Keycloak.

  1. Download the SAML metadata file.
  2. Save the full contents of saml-metadata.xml to your local machine.
  3. In the navigation pane under Clients, import the SAML metadata file.
  4. Choose Import client.
  5. Choose Browse.
  6. Leave the rest of the fields blank. The metadata.xml file that you import later automatically populates them.
  7. After the file is imported, choose Save.
    Import client in Keycloak user interface
  8. On the Clients Application Setting page, choose the recently added client.
    Selecting client on Client Application Setting page
  9. Update the properties of the client ID:
    1. Change Home URL to /realms/aws-realm/protocol/saml/clients/amazon-qs.
    2. Change the IdP Initiated SSO URL to amazon-qs.
    3. Change the IdP initiated SSO Relay State to https://quicksight.aws.amazon.com.
  10. On the Client scopes tab, choose the client ID.
  11. On the Scope tab, make sure the Full scope allowed toggle is set to off.
  12. Insert your specific host domain name where the Keycloak application resides in the following URL: https://<your_host_domain>/realms/aws-realm/protocol/saml/descriptor.
    1. Download the Keycloak IdP SAML metadata file from that URL location.

You now have Keycloak installed on your local machine, a new client added, AWS federation properties updated, and the Keycloak SAML metadata downloaded for AWS use in the following section.

Add Keycloak as your SAML IdP in AWS

To configure Keycloak as your SAML IdP, complete the following steps:

  1. Open a new tab in your browser.
  2. Sign in to the IAM console in your AWS account with admin permissions.
  3. On the IAM console, under Access Management in the navigation pane, choose Identity providers.
  4. Choose Add provider.
  5. For Provider type, select SAML.
  6. For Provider name, enter keycloak.
  7. For Metadata document, upload the Keycloak IdP SAML metadata XML file you downloaded and saved to your local machine earlier.
  8. Choose Add provider.
  9. Verify Keycloak has been added as an IAM IdP and copy the ARN assigned.

The ARN is used in a later step for federated users and IdP Keycloak advanced configuration.

Configure IAM policies

Create three IAM policies for mapping to three different roles with permissions in QuickSight (admin, author, and reader):

  • Admin – Uses QuickSight for authoring and for performing administrative tasks such as managing users or purchasing SPICE capacity
  • Author – Authors analyses and dashboards in QuickSight but doesn’t perform any administrative tasks
  • Reader – Interacts with shared dashboards, but doesn’t author analyses or dashboards or perform any administrative tasks

Use the following steps to set up the QuickSight-Admin policy. This policy grants admin privileges in QuickSight to the federated user.

  1. On the IAM console, choose Policies.
  2. Choose Create policy.
  3. Choose JSON and replace the existing text with the code from the following table for QuickSight-Admin.

    Policy Name JSON Text
    QuickSight-Admin
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "quicksight:CreateAdmin",
                "Resource": "*"
            }
        ]
    }

    QuickSight-Author
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "quicksight:CreateUser",
                "Resource": "*"
            }
        ]
    }

    QuickSight-Reader
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "quicksight:CreateReader",
                "Resource": "*"
            }
        ]
    }

  4. Choose Review policy.
  5. For Name, enter QuickSight-Admin.
  6. Choose Create policy.
  7. Repeat the steps for QuickSight-Reader and QuickSight-Author.
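The three policy documents differ only in the QuickSight action that they allow, so they can also be generated rather than pasted one by one. A minimal sketch; the helper name quickSightPolicy and the lookup table are hypothetical, while the policy names and action names are the ones used in the steps above.

```javascript
// Policy name to QuickSight action, as used for the three policies.
const actionsByPolicy = {
  "QuickSight-Admin": "quicksight:CreateAdmin",
  "QuickSight-Author": "quicksight:CreateUser",
  "QuickSight-Reader": "quicksight:CreateReader",
};

// Hypothetical helper that builds the IAM policy document for one policy name.
function quickSightPolicy(policyName) {
  const action = actionsByPolicy[policyName];
  if (!action) throw new Error(`unknown policy name: ${policyName}`);
  return {
    Version: "2012-10-17",
    Statement: [{ Effect: "Allow", Action: action, Resource: "*" }],
  };
}

for (const name of Object.keys(actionsByPolicy)) {
  console.log(name);
  console.log(JSON.stringify(quickSightPolicy(name), null, 2));
}
```

The printed JSON can be pasted into the JSON editor in step 3.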


Configure IAM roles

Create the roles that your Keycloak users assume when federating into QuickSight. Use the following steps to set up the admin role:

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose Create role.
  3. For Select type of trusted entity, choose SAML 2.0 federation.
  4. For SAML provider, choose the IdP you created earlier (keycloak).
  5. Select Allow programmatic and AWS Management Console access.
  6. Choose Next: Permissions.
  7. Choose the QuickSight-Admin IAM policy you created in the previous step.
  8. Choose Next: Name, review, and create.
  9. For Role name, enter QuickSight-Admin-Role.
  10. For Role description, enter a description.
  11. Choose Create role.
  12. Repeat these steps to create your author and reader roles and attach the appropriate policies:
    1. For QuickSight-Author-Role, use the policy QuickSight-Author
    2. For QuickSight-Reader-Role, use the policy QuickSight-Reader

With the completion of these steps, you have created an IdP in AWS, created policies, and created roles for the Keycloak IdP.
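Choosing SAML 2.0 federation with console access for a role results in a trust policy along the following lines. The sketch below builds that document; the helper name samlTrustPolicy and the account ID 111122223333 are placeholders, and the exact policy that the console generates may differ in detail.

```javascript
// Hypothetical helper: trust policy that lets users federated through the
// given SAML provider assume the role for AWS Management Console sign-in.
function samlTrustPolicy(accountId, providerName) {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Principal: {
          Federated: `arn:aws:iam::${accountId}:saml-provider/${providerName}`,
        },
        Action: "sts:AssumeRoleWithSAML",
        Condition: {
          StringEquals: { "SAML:aud": "https://signin.aws.amazon.com/saml" },
        },
      },
    ],
  };
}

// 111122223333 is a placeholder account ID.
const trustPolicy = samlTrustPolicy("111122223333", "keycloak");
console.log(JSON.stringify(trustPolicy, null, 2));
```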

Assign the newly created roles in IAM to users and groups in Keycloak

To create a role for the client, complete the following steps:

  1. Log back in to the Keycloak admin console.
  2. Select aws-realm and client amazon:webservices.
  3. Choose Create Role.
    1. Provide a comma-separated string using the ARN for the IAM role and the ARN for the Keycloak IdP, as in the following example:
      arn:aws:iam::<AWS account>:role/QuickSight-Admin-Role,arn:aws:iam::<AWS account>:saml-provider/keycloak
  4. When the role has been added successfully, choose Save.
  5. Repeat the steps to add QuickSight-Author-Role and QuickSight-Reader-Role.
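Because the same string pattern is needed for all three roles, it can help to generate the values instead of typing them. A minimal sketch, assuming a hypothetical helper named keycloakRoleName and the placeholder account ID 111122223333:

```javascript
// Hypothetical helper: builds the comma-separated "role ARN,provider ARN"
// string that is entered as the Keycloak role name.
function keycloakRoleName(accountId, roleName, providerName = "keycloak") {
  if (!/^\d{12}$/.test(accountId)) {
    throw new Error("an AWS account ID has 12 digits");
  }
  const roleArn = `arn:aws:iam::${accountId}:role/${roleName}`;
  const providerArn = `arn:aws:iam::${accountId}:saml-provider/${providerName}`;
  return `${roleArn},${providerArn}`;
}

const iamRoles = [
  "QuickSight-Admin-Role",
  "QuickSight-Author-Role",
  "QuickSight-Reader-Role",
];

// 111122223333 is a placeholder account ID.
for (const role of iamRoles) {
  console.log(keycloakRoleName("111122223333", role));
}
```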

Create mappers

To create a mapper for the client, complete the following steps:

  1. On the Client scopes tab, select the client amazon:webservices for aws-realm.
  2. On the Mappers tab, choose Add mapper.
  3. Choose By configuration to generate mappers for Session Role, Session Duration, and Session Name.
  4. Add the values needed for the Session Role mapper:
    1. Name: Session Role
    2. Mapper type: Role list
    3. Friendly Name: Session Role
    4. Role attribute name: https://aws.amazon.com/SAML/Attributes/Role
    5. SAML Attribute NameFormat: Basic
  5. Add the values needed for the Session Duration mapper:
    1. Name: Session Duration
    2. Mapper Type: Hardcoded attribute
    3. Friendly Name: Session Duration
    4. SAML Attribute Name: https://aws.amazon.com/SAML/Attributes/SessionDuration
    5. SAML Attribute NameFormat: Basic
    6. Attribute Value: 28800

      You can automatically sync user email mapping. To perform these steps, refer to Configure an automated email sync for federated SSO users to access Amazon QuickSight.

To manually add the values needed for the Session Name mapper, provide the following information:

  1. Name: Session Name
  2. Mapper Type: User Property
  3. Property: username
  4. Friendly Name: Session Name
  5. SAML Attribute Name: https://aws.amazon.com/SAML/Attributes/SessionName
  6. SAML Attribute NameFormat: Basic
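
The three mappers can be summarized as data, which also makes it easy to sanity-check the session duration. The sketch below assumes the range AWS documents for the SessionDuration attribute (900 to 43200 seconds); the dictionary layout and function name are illustrative, not a Keycloak export format.

```python
AWS_SAML_ATTRIBUTES = "https://aws.amazon.com/SAML/Attributes"

# Illustrative summary of the mappers configured above (not a Keycloak file format).
MAPPERS = [
    {"name": "Session Role", "mapper_type": "Role list",
     "attribute": f"{AWS_SAML_ATTRIBUTES}/Role"},
    {"name": "Session Duration", "mapper_type": "Hardcoded attribute",
     "attribute": f"{AWS_SAML_ATTRIBUTES}/SessionDuration", "value": "28800"},
    {"name": "Session Name", "mapper_type": "User Property",
     "attribute": f"{AWS_SAML_ATTRIBUTES}/SessionName", "property": "username"},
]


def check_session_duration(value: str) -> int:
    """Return the duration in seconds, or raise if it is outside 900..43200."""
    seconds = int(value)
    if not 900 <= seconds <= 43200:
        raise ValueError(f"SessionDuration {seconds} is outside 900..43200 seconds")
    return seconds


duration = check_session_duration(MAPPERS[1]["value"])
print(f"Session lasts {duration / 3600:g} hours")
```

With the value 28800 used in this post, the federated console session lasts 8 hours.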

Create a sample group for Keycloak users

To create groups and users for the Keycloak IdP, complete the following steps:

  1. Choose Group in the navigation pane.
  2. Create a new group named READ_ONLY_AWS_USERS.
  3. Choose the Role mapping tab and Assign role.
  4. Add the role created for the client.
  5. Choose Assign.

Create a sample user

Complete these steps to create a sample user with credentials:

  1. Choose Users in the navigation pane.
  2. Choose Create new user.
  3. Create a sample user, such as John.
  4. Set the credentials for the created user.
  5. Add the sample user you created earlier to the group READ_ONLY_AWS_USERS.

You now have a Keycloak role for the realm and client, and Keycloak mappers, groups, and users in your groups.

Test the application

Let’s invoke the application you have created to seamlessly sign in to QuickSight using the following URL. Make sure you enter your domain for Keycloak.

http://<your domain>/realms/aws-realm/protocol/saml/clients/amazon-qs

When prompted for your user ID and password, enter the credentials that you created earlier.

Keycloak successfully validates the credentials and federates access to the QuickSight console by assuming the role.

Conclusion

In this post, we provided step-by-step instructions to configure federated SSO between Keycloak IdP and QuickSight. We also discussed how to create new roles and map users and groups in Keycloak to IAM for secure access to QuickSight.

If you have any questions or feedback, please leave a comment.


About the Authors

Ayah Chamseddin is a Sr. Engagement Manager at AWS. She has a deep understanding of cloud technologies and has successfully overseen and led strategic projects, partnering with clients to define business objectives, develop implementation strategies, and drive the successful delivery of solutions.


Vamsi Bhadriraju is a Data Architect at AWS. He works closely with enterprise customers to build data lakes and analytical applications on the AWS Cloud.


Srikanth Baheti is a Specialized World Wide Principal Solutions Architect for Amazon QuickSight. He started his career as a consultant and worked for multiple private and government organizations. Later he worked for PerkinElmer Health and Sciences & eResearch Technology Inc, where he was responsible for designing and developing high-traffic web applications and highly scalable, maintainable data pipelines for reporting platforms using AWS services and serverless computing.


Raji Sivasubramaniam is a Sr. Solutions Architect at AWS, focusing on Analytics. Raji specializes in architecting end-to-end Enterprise Data Management, Business Intelligence, and Analytics solutions for Fortune 500 and Fortune 100 companies across the globe. She has in-depth experience in integrated healthcare data and analytics with a wide variety of healthcare datasets, including managed market, physician targeting, and patient analytics.

Amazon SES – Set up notifications for bounces and complaints

Post Syndicated from Vinay Ujjini original https://aws.amazon.com/blogs/messaging-and-targeting/amazon-ses-set-up-notifications-for-bounces-and-complaints/

Why is it important to monitor bounces and complaints when using Amazon Simple Email Service?

Amazon Simple Email Service (Amazon SES) is a scalable, cost-effective, and flexible cloud email service. Amazon SES allows businesses and individuals to send bulk emails to their customers and subscribers. However, as with any email service, there is always a risk of emails bouncing or being marked as spam by recipients. These bounces and complaints can have serious consequences for your email deliverability and can even lead to your email account being suspended or blocked. That’s why it’s important to monitor bounces and complaints when you send email with Amazon SES. With Amazon Simple Notification Service (Amazon SNS) notifications, you can proactively address issues and help ensure that your emails are delivered to your intended recipients. In this post, we show how to set up notifications for bounces and complaints in Amazon SES, so you can stay on top of your email deliverability and maintain a positive sender reputation.

Understanding bounces and complaints:

Understanding bounces and complaints is crucial when it comes to email marketing. In simple terms, a bounce occurs when an email is undeliverable and is returned to the sender. There are two types of bounces: soft bounces and hard bounces. A soft bounce is a temporary issue, such as a full inbox or a server error, and the email may be delivered successfully on a subsequent attempt. A hard bounce is a permanent issue, such as an invalid email address, and the email will never be delivered. A complaint, by contrast, occurs when a recipient marks an email as spam or unwanted. Complaints can be particularly damaging to your email deliverability and can lead to your emails being blocked or sent to the recipient’s spam folder. By monitoring bounces and complaints and taking appropriate action, you can maintain a positive sender reputation and help ensure that your emails are delivered to your intended recipients.

Amazon SES provides tools like Virtual Deliverability Manager (VDM) to manage the deliverability at the ISP, sub-domain or configuration set level. You can see the details in this blog.

Solution walkthrough:

This post gives detailed instructions on how to use Amazon Simple Notification Service (Amazon SNS) to monitor and receive notifications on bounces and complaints in Amazon SES. It also includes FAQs and troubleshooting tips in case you don’t receive notifications after completing the setup. The steps below provide detailed instructions.

Prerequisites:

For this walkthrough, you should have the following prerequisites:

  1. An active AWS account.
  2. A verified identity (Email address or Domain) in Amazon SES.
  3. Administrative Access to Amazon SES Console and Amazon SNS Console.

Step 1: Create an Amazon SNS topic and subscription:

  1. Sign in to the Amazon SNS console.
  2. On the Amazon SNS home page, enter a topic name and choose Next step.
  3. For Type, choose Standard.
    Note: Standard topics are better suited to use cases that require high message publish and delivery throughput, which fits monitoring SES bounces and complaints.
  4. (Optional) Expand the Encryption section if you want to encrypt the SNS topic.
    • Choose Enable encryption.
    • Specify the AWS KMS key. For more information, see Key terms.
    • For each KMS type, the Description, Account, and KMS ARN are displayed.
  5. Scroll to the end of the form and choose Create topic. The topic is created and the console opens the new topic’s Details page.
  6. To create the subscription, on the Subscriptions page, choose Create subscription.
  7. On the Create subscription page, choose the Topic ARN of the topic you created in the previous steps.
  8. For Protocol, choose Email. Multiple protocols are available; which one you choose depends on where you want to receive the SNS notifications for bounces and complaints. Refer to the list of available protocols.
  9. For Endpoint, enter an email address that can receive notifications.
    Note: This should be an existing email address with an accessible mailbox.
  10. Scroll to the bottom and choose Create subscription. The console opens the new subscription’s Details page.
  11. After your subscription is created, confirm it through the email address you provided.
  12. Check the inbox of the email address you entered as the endpoint and choose Confirm subscription in the email from AWS Notifications. The sender ID is usually “[email protected]”.
  13. Amazon SNS opens your web browser and displays a subscription confirmation with your subscription ID.
  14. After the subscription is confirmed, refresh the subscription’s Details page. The subscription status changes from Pending to Confirmed.
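
The console steps above can also be scripted. The following sketch assumes an SNS client object such as the one returned by `boto3.client("sns")` and uses its `create_topic` and `subscribe` operations; the helper names, topic name, and email address are illustrative. The email subscription still has to be confirmed from the inbox, exactly as in the console flow.

```python
def subscribe_request(topic_arn: str, email: str) -> dict:
    """Keyword arguments for the SNS Subscribe operation using the email protocol."""
    return {
        "TopicArn": topic_arn,
        "Protocol": "email",
        "Endpoint": email,
        "ReturnSubscriptionArn": True,
    }


def create_topic_and_subscribe(sns, topic_name: str, email: str) -> str:
    """Create a standard topic and an email subscription; return the topic ARN."""
    topic_arn = sns.create_topic(Name=topic_name)["TopicArn"]
    sns.subscribe(**subscribe_request(topic_arn, email))
    return topic_arn


# Usage with boto3 (requires AWS credentials, so it is not executed here):
#   import boto3
#   arn = create_topic_and_subscribe(boto3.client("sns"), "ses-bounces", "ops@example.com")
```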

Step 2: Configure Amazon SES to send bounces and complaints to the Amazon SNS topic created:

This step presents two methods to monitor your bounces and complaints. Follow Demo 1 if you want a simple way to monitor bounce and complaint events for a specific email identity. Follow Demo 2 if you have many email identities and want to monitor bounces and complaints along with other events using configuration sets (groups of rules that you can apply to all your verified identities).

Demo 1: Configure Amazon SES to monitor bounces and complaints for specific email identity (Email, Domain):

The domain/sub-domain/email identity must have a Verified status. If the identity is not in verified status, refer to steps to verify identity with Amazon SES before continuing further.

Before starting this demo, it is important to know whether you have a verified domain, subdomain, or email address that shares the root domain. The identity settings (such as SNS and feedback notifications) apply at the most granular level at which you have set up verification. The hierarchy is as follows:

  • Verified email address identity settings override verified domain identity settings.
  • Verified subdomain identity settings override verified domain identity settings. (lower-level subdomain settings override higher-level subdomain settings).

Hence, if you want to monitor bounces and complaints for all email addresses under one domain, verify the domain identity with SES and apply these settings at the domain identity level. If you want to monitor bounces and complaints for a specific email address under a verified domain identity, verify that email address explicitly with SES and apply these settings at the email identity level.
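
To make the hierarchy concrete, the following sketch picks the identity whose settings would apply to a given sender address: an exact email identity first, then the most specific verified subdomain, then the parent domain. The function name and the sample identities are hypothetical, and this only illustrates the precedence rule described above; it is not how SES is implemented.

```python
from typing import Optional, Set


def effective_identity(sender: str, verified: Set[str]) -> Optional[str]:
    """Return the most specific verified identity that covers the sender address."""
    sender = sender.lower()
    if sender in verified:
        return sender  # A verified email address overrides domain settings.
    labels = sender.split("@", 1)[1].split(".")
    # Walk from the full domain toward the root, stopping before the bare TLD.
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in verified:
            return candidate
    return None


verified = {"example.com", "mail.example.com", "alerts@example.com"}
print(effective_identity("alerts@example.com", verified))
print(effective_identity("news@mail.example.com", verified))
print(effective_identity("info@example.com", verified))
```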

  1. Sign in to the Amazon SES console.
  2. In the navigation pane, under Configuration, choose Verified identities.
  3. Select the verified identity that you want to monitor for bounce and complaint notifications.
  4. On the details screen of the verified identity you selected, choose the Notifications tab and select Edit in the Feedback notifications container.
  5. Expand the SNS topic list box for the bounce and complaint feedback types and select the SNS topic you created in Step 1.
    (Optional) If you want your topic notifications to include the headers from the original email, select the Include original email headers box directly underneath the SNS topic name of each feedback type.
  6. Choose Save changes.
  7. After you configure the SNS topic for bounces and complaints, you can disable Email Feedback Forwarding to avoid receiving duplicate notifications through both Email Feedback Forwarding and SNS.
  8. To disable it, under the Notifications tab on the details screen of the verified identity, in the Email Feedback Forwarding container, choose Edit, clear the Enabled box, and choose Save changes.
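
The same identity-level settings are exposed through the classic SES API. The sketch below assumes a client such as `boto3.client("ses")` and its `set_identity_notification_topic`, `set_identity_headers_in_notifications_enabled`, and `set_identity_feedback_forwarding_enabled` operations. It first builds the list of calls as plain data so the plan can be reviewed before anything is applied; the helper names and sample values are illustrative.

```python
def identity_notification_calls(identity: str, topic_arn: str, include_headers: bool = False) -> list:
    """Plan the SES calls that route bounces and complaints for one identity to SNS."""
    calls = []
    for kind in ("Bounce", "Complaint"):
        calls.append(("set_identity_notification_topic",
                      {"Identity": identity, "NotificationType": kind, "SnsTopic": topic_arn}))
        if include_headers:
            calls.append(("set_identity_headers_in_notifications_enabled",
                          {"Identity": identity, "NotificationType": kind, "Enabled": True}))
    # Disable email forwarding last, once both topics are in place, to avoid duplicates.
    calls.append(("set_identity_feedback_forwarding_enabled",
                  {"Identity": identity, "ForwardingEnabled": False}))
    return calls


def apply_calls(ses, calls: list) -> None:
    """Run each planned call against an SES client."""
    for operation, kwargs in calls:
        getattr(ses, operation)(**kwargs)


plan = identity_notification_calls("example.com", "arn:aws:sns:us-east-1:111122223333:ses-bounces")
for operation, kwargs in plan:
    print(operation, kwargs)

# Usage with boto3 (requires AWS credentials, so it is not executed here):
#   import boto3
#   apply_calls(boto3.client("ses"), plan)
```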

Demo 2: Configure Amazon SES to monitor bounces and complaints for emails sent with a configuration set using Amazon SES event publishing.

Configuration sets in SES are groups of rules that you can apply to your verified identities. When you apply a configuration set to an email, all of the rules in that configuration set are applied to the email. You can use different types of rules with a configuration set. This demo uses an event destination, which allows you to publish bounces and complaints to the SNS topic.

Note: You must pass the name of the configuration set when sending an email. You can do this either by specifying the configuration set name in the email headers or by setting it as the default configuration set for the identity. You can set the default at the time of identity creation, or later by editing a verified identity.
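
For the header option, SES reads the configuration set name from the `X-SES-CONFIGURATION-SET` header when a raw message is submitted, for example through the SMTP interface. The following sketch builds such a message with Python’s standard library only; the addresses and the configuration set name `my-config-set` are placeholders, and sending the message is left out.

```python
from email.message import EmailMessage


def build_message(sender: str, recipient: str, subject: str, body: str,
                  configuration_set: str) -> EmailMessage:
    """Build an email that asks SES to apply the given configuration set."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    # SES applies the named configuration set when it sees this header.
    msg["X-SES-CONFIGURATION-SET"] = configuration_set
    msg.set_content(body)
    return msg


msg = build_message(
    "sender@example.com",
    "bounce@simulator.amazonses.com",  # SES mailbox simulator bounce address
    "Bounce test",
    "This message should bounce.",
    "my-config-set",
)
print(msg.as_string())
```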

  1. Sign in to the Amazon SES console.
  2. In the navigation pane, under Configuration, choose Configuration sets. Choose Create set.
  3. Enter a configuration set name, leave the remaining fields at their defaults, scroll to the end, and choose Create set.
  4. After the configuration set is created, create an Amazon SES event destination. Amazon SES sends all bounce and complaint notifications to the event destination. In this post, the event destination is an Amazon SNS topic.
  5. Navigate to the configuration set you just created. On the configuration set home page, choose Event destinations and select Add destination.
  6. Under Select event types, select the Hard bounces and Complaints boxes and choose Next.
  7. Specify the destination for receiving bounce and complaint notifications. There are several destination types to choose from; in this demo, we use Amazon SNS.
  8. For Name, enter the name of the destination for this configuration set. The name can include letters, numbers, dashes, and hyphens.
  9. For Event publishing, select the Enabled check box to turn on event publishing for this destination.
  10. Under Amazon Simple Notification Service (SNS) topic, expand the SNS topic list box, select the SNS topic you created in Step 1, and choose Next.
  11. Review your entries. When you are satisfied that they are correct, choose Add destination to add your event destination.
  12. After you choose Add destination, the event destination summary is shown and you receive a “Successfully validated SNS topic for Amazon SES event publishing” email.
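
The configuration set and its SNS event destination can also be created through the SES v2 API. This sketch assumes a client such as `boto3.client("sesv2")` and its `create_configuration_set` and `create_configuration_set_event_destination` operations; the helper names, configuration set name, and topic ARN are illustrative.

```python
def sns_event_destination(topic_arn: str) -> dict:
    """EventDestination structure that publishes bounce and complaint events to SNS."""
    return {
        "Enabled": True,
        "MatchingEventTypes": ["BOUNCE", "COMPLAINT"],
        "SnsDestination": {"TopicArn": topic_arn},
    }


def create_config_set_with_destination(sesv2, config_set: str, destination_name: str,
                                       topic_arn: str) -> None:
    """Create a configuration set and attach an SNS event destination to it."""
    sesv2.create_configuration_set(ConfigurationSetName=config_set)
    sesv2.create_configuration_set_event_destination(
        ConfigurationSetName=config_set,
        EventDestinationName=destination_name,
        EventDestination=sns_event_destination(topic_arn),
    )


destination = sns_event_destination("arn:aws:sns:us-east-1:111122223333:ses-bounces")
print(destination)

# Usage with boto3 (requires AWS credentials, so it is not executed here):
#   import boto3
#   create_config_set_with_destination(
#       boto3.client("sesv2"), "my-config-set", "bounces-to-sns",
#       "arn:aws:sns:us-east-1:111122223333:ses-bounces")
```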

Step 3: Using Amazon SES Mailbox Simulator to test send and receive a bounce notification via SNS topic:

Test 1: Send a test email to test Demo 1 “Configure Amazon SES to monitor bounces and complaints for specific email identity (Email, Domain)” in the previous step

In this test, I send a test message from my verified identity, which is configured to send any bounce and complaint notifications it receives to the SNS topic and the email address subscribed to the topic. I use the SES mailbox simulator to simulate a bounce message to test this setup.

  1. Sign in to the Amazon SES console.
  2. In the navigation pane, under Configuration, choose Verified identities.
  3. Select the verified identity for which you configured SNS notifications for bounces and complaints in Demo 1. In this test, I selected a verified domain identity.
  4. Choose Send test email in the upper right corner.
  5. Under Message details, for From-address, enter the first part of an email address under this verified domain (the from-address may be pre-populated).
  6. For Scenario, expand the simulated scenarios and select the Bounce scenario to send a test bounce message.
  7. For Subject, enter the desired email subject. For Body, optionally enter body text, and leave the remaining options at their defaults. Choose Send test email to send the email.
  8. You should receive an email from AWS Notifications with the bounce notification and details on the bounce.
  9. The bounce message includes notificationType (Bounce or Complaint), bouncedRecipients, diagnosticCode (the reason the message bounced), remoteMtaIp (the IP address of the recipient MTA that rejected the message), sourceIp (the IP address of the sending application), and callerIdentity (the IAM user that sent the message). These details help you identify why the email was not delivered and avoid such bounces in the future. Refer to this document for additional content on bounce events.
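
The notification body is JSON, so the fields listed above can be pulled out programmatically. The sample below is hand-written for illustration, with made-up IP addresses and diagnostic text, and contains only the fields named in this post; a real notification carries more. The function name is hypothetical.

```python
import json

# Illustrative sample only; values are placeholders, not real SES output.
SAMPLE_NOTIFICATION = json.dumps({
    "notificationType": "Bounce",
    "bounce": {
        "bounceType": "Permanent",
        "bouncedRecipients": [
            {"emailAddress": "bounce@simulator.amazonses.com",
             "diagnosticCode": "smtp; 550 5.1.1 user unknown"}
        ],
        "remoteMtaIp": "192.0.2.10",
    },
    "mail": {"source": "sender@example.com", "sourceIp": "198.51.100.7",
             "callerIdentity": "ses-sender"},
})


def summarize_bounce(message: str) -> list:
    """Return one summary dict per bounced recipient, or [] if this is not a bounce."""
    data = json.loads(message)
    if data.get("notificationType") != "Bounce":
        return []
    bounce = data["bounce"]
    return [
        {
            "recipient": recipient["emailAddress"],
            "bounce_type": bounce.get("bounceType"),
            "reason": recipient.get("diagnosticCode", ""),
            "sender": data["mail"].get("source"),
        }
        for recipient in bounce.get("bouncedRecipients", [])
    ]


summary = summarize_bounce(SAMPLE_NOTIFICATION)
print(summary)
```

A permanent bounce type corresponds to the hard bounces described earlier, so addresses reported this way are candidates for removal from your mailing list.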

Test 2: Send a test email to test Demo 2 “Configure Amazon SES to monitor bounces and complaints for emails sent with a configuration set using Amazon SES event publishing” in the previous step

In this test, you can send a test message from any verified identity using the configuration set created in Step 2, which is configured to send any bounce and complaint notifications to the SNS topic and the email address subscribed to the topic. You can use the SES mailbox simulator to simulate a bounce message to test this setup.

  1. Sign in to the Amazon SES console.
  2. In the navigation pane, under Configuration, choose Verified identities.
  3. Select any verified identity you want to send emails from. In this test, I selected a verified domain identity.
  4. Choose Send test email in the upper right corner.
  5. Under Message details, for From-address, enter the first part of an email address under this verified domain.
  6. For Scenario, expand the simulated scenarios and select the Bounce scenario to send a test bounce message.
  7. For Subject, enter the desired email subject. For Body, optionally enter body text.
  8. For Configuration set, expand the drop-down list and choose the configuration set you created in Demo 2.
  9. Choose Send test email to send the email.
  10. You should receive an email from AWS Notifications with the bounce notification and all details of the bounce.
  11. The bounce message includes eventType (Bounce or Complaint), bouncedRecipients, diagnosticCode (the reason the message bounced), remoteMTA (the IP address of the recipient MTA that rejected the message), sourceIp (the IP address of the sending application), callerIdentity (the IAM user that sent the message), and ses:configuration-set (the name of the configuration set used when sending the email). These details help you identify why the email was not delivered and avoid such bounces in the future. Refer to this document for more details on the contents of bounce events.
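
Event publishing records differ slightly from identity-level notifications: the type is carried in `eventType`, and the configuration set name appears under the mail tags. The sketch below handles both shapes. The sample record is hand-written for illustration and assumes that tag values arrive as lists of strings; the function name is hypothetical.

```python
import json

# Illustrative sample only; a real event contains many more fields.
SAMPLE_EVENT = json.dumps({
    "eventType": "Bounce",
    "mail": {
        "source": "sender@example.com",
        "tags": {"ses:configuration-set": ["my-config-set"]},
    },
})


def event_kind_and_config_set(message: str) -> tuple:
    """Return (event kind, configuration set name or None) for either record shape."""
    data = json.loads(message)
    # Event publishing uses eventType; identity notifications use notificationType.
    kind = data.get("eventType") or data.get("notificationType")
    tags = data.get("mail", {}).get("tags", {})
    config_sets = tags.get("ses:configuration-set", [])
    return kind, (config_sets[0] if config_sets else None)


result = event_kind_and_config_set(SAMPLE_EVENT)
print(result)
```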

FAQ on this set up:

I configured the SNS topic with KMS encryption and I am not receiving bounce or complaint notifications for emails:
If your Amazon SNS topic uses AWS Key Management Service (AWS KMS) for server-side encryption, you have to add permissions to the AWS KMS key policy that allow the SES service to access the KMS key. An example policy can be found here.
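
As a rough sketch of what that permission looks like, the statement below grants the SES service principal the two KMS actions it needs to publish to an encrypted topic. The `Sid` is an arbitrary label and the helper name is hypothetical; merge the statement into your existing key policy rather than replacing the policy, and check it against the linked example.

```python
import json


def ses_kms_statement() -> dict:
    """Key policy statement that lets Amazon SES use the KMS key for SNS publishing."""
    return {
        "Sid": "AllowSESToUseKMSKey",  # Arbitrary label.
        "Effect": "Allow",
        "Principal": {"Service": "ses.amazonaws.com"},
        "Action": ["kms:GenerateDataKey", "kms:Decrypt"],
        "Resource": "*",
    }


statement = ses_kms_statement()
print(json.dumps(statement, indent=2))
```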

I followed Demo 2. However, when I try to send emails from any verified identity, I don’t receive bounce or complaint notifications for emails:
When sending the email, make sure to select the configuration set you configured for bounce and complaint notifications. If you followed Demo 2 and sent the email without specifying the configuration set, either in the email headers or as the identity’s default configuration set, bounce and complaint events are not tracked.

I am testing the setup. After I sent an email to the bounce simulator, I don’t receive any bounce notification emails:
Check whether the SNS topic subscription is in Pending status, and make sure you confirm the subscription through the confirmation email sent to you. If the subscription is confirmed, make sure you have access to the inbox of the subscribed email address and that no email filters are hiding the notifications.

Cleaning up:

You should now have successfully set up SNS notifications to monitor bounces and complaints for your Amazon SES emails. To avoid incurring any extra charges, remember to delete any resources you created manually if you no longer need them for monitoring.

Resources to delete from SES console:

  1. In the navigation pane, under Configuration, choose Verified identities, then select the verified identity you configured for SNS notifications.
  2. In the details screen of the verified identity you selected, choose the Notifications tab and select Edit in the Feedback notifications container.
  3. Choose No SNS topic from bounce and complaint feedback dropdown menu and click Save changes.
  4. Under the same Notifications tab on the details screen of the verified identity, in the Email Feedback Forwarding container, choose Edit, check the Enabled box, and choose Save changes.
  5. In the navigation pane, under Configuration, choose Configuration sets.
  6. Check the box beside Configuration set you created and select Delete.

Resources to delete from SNS console:

  1. In the navigation pane, from the left side menu, choose Topics.
  2. Check the radio button beside the SNS topic you created and select Delete.
  3. Confirm the topic deletion by writing “delete me”.

Conclusion:

Monitoring bounces and complaints is an essential part of maintaining a successful email sending system with Amazon SES. By setting up SNS notifications for bounces and complaints, you can quickly identify issues and take appropriate action to help ensure that your emails are delivered to your subscribers. By proactively managing your email deliverability, you can maintain a positive sender reputation and avoid any negative impact on your email marketing efforts.

About the authors:

Alaa Hammad is a Senior Cloud Support Engineer at AWS and a subject matter expert in Amazon Simple Email Service and AWS Backup. She has 10 years of diverse experience supporting enterprise customers across different industries. She enjoys cooking and trying new recipes from different cuisines.

Vinay Ujjini is an Amazon Pinpoint and Amazon Simple Email Service Worldwide Principal Specialist Solutions Architect at AWS. He has been solving customers’ omni-channel challenges for over 15 years. He is an avid sports enthusiast and, in his spare time, enjoys playing tennis and cricket.