Tag Archives: Technical How-to

Clean up your Excel and CSV files without writing code using AWS Glue DataBrew

Post Syndicated from Ismail Makhlouf original https://aws.amazon.com/blogs/big-data/clean-up-your-excel-and-csv-files-without-writing-code-using-aws-glue-databrew/

Managing data within an organization is complex. Handling data from outside the organization adds even more complexity. As the organization receives data from multiple external vendors, it often arrives in different formats, typically Excel or CSV files, with each vendor using their own unique data layout and structure. In this blog post, we’ll explore a solution that streamlines this process by leveraging the capabilities of AWS Glue DataBrew.

DataBrew is an excellent tool for data quality and preprocessing. You can use its built-in transformations and recipes, as well as its integrations with the AWS Glue Data Catalog and Amazon Simple Storage Service (Amazon S3), to preprocess the data in your landing zone, clean it up, and send it downstream for analytical processing.

In this post, we demonstrate the following:

  • Extracting non-transactional metadata from the top rows of a file and merging it with transactional data
  • Combining multi-line rows into single-line rows
  • Extracting unique identifiers from within strings or text

Solution overview

For this use case, imagine you’re a data analyst working at your organization. The sales leadership has requested a consolidated view of the net sales they are making from each of the organization’s suppliers. Unfortunately, this information is not available in a database. The sales data comes from each supplier in layouts like the following example.

However, with hundreds of resellers, manually extracting the information at the top is not feasible. Your goal is to clean up and flatten the data into the following output layout.

To achieve this, you can use pre-built transformations in DataBrew to quickly get the data in the layout you want.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Connect to the dataset

The first thing we need to do is upload the input dataset to Amazon S3. Create an S3 bucket for the project and create a folder to upload the raw input data. The output data will be stored in another folder in a later step.
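If you prefer to script this setup, the following boto3 sketch creates the bucket and uploads the raw file. The bucket name, Region, and file name are placeholders, not values from this post:

import boto3

# Placeholder names; S3 bucket names must be globally unique
bucket = "databrew-foodmart-example-bucket"
region = "eu-west-1"

s3 = boto3.client("s3", region_name=region)

# Create the project bucket (omit CreateBucketConfiguration for us-east-1)
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": region},
)

# Upload the raw input file under a raw/ prefix; job output will go to clean/ later
s3.upload_file("FoodMartSales-AllUp.csv", bucket, "raw/FoodMartSales-AllUp.csv")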

Next, we need to connect DataBrew to our CSV file. We do this by creating a dataset, which is an artifact that points to the data source we will be using. Navigate to "Datasets" on the left-hand menu.

Ensure the Column header values field is set to Add default header. The input CSV has an irregular format, so the first row will not have the needed column values.
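The same dataset can also be registered with the AWS SDK. Here is a hedged boto3 sketch; the bucket and key are the placeholders from earlier, and HeaderRow=False should correspond to the Add default header choice, so DataBrew generates Column_1, Column_2, and so on:

import boto3

databrew = boto3.client("databrew")

databrew.create_dataset(
    Name="FoodMartSales-AllUp",
    Format="CSV",
    FormatOptions={
        # Treat the first row as data, not headers, and add default column names
        "Csv": {"Delimiter": ",", "HeaderRow": False}
    },
    Input={
        "S3InputDefinition": {
            "Bucket": "databrew-foodmart-example-bucket",  # placeholder bucket
            "Key": "raw/FoodMartSales-AllUp.csv",
        }
    },
)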

Create a project

To create a new project, complete the following steps:

  1. On the DataBrew console, choose Projects in the navigation pane.
  2. Choose Create project.
  3. For Project name, enter FoodMartSales-AllUpProject.
  4. For Attached recipe, choose Create new recipe.
  5. For Recipe name, enter FoodMartSales-AllUpProject-recipe.
  6. For Select a dataset, select My datasets.
  7. Select the FoodMartSales-AllUp dataset.
  8. Under Permissions, for Role name, choose the IAM role you created as a prerequisite or create a new role.
  9. Choose Create project.

After the project is opened, an interactive session is created where you can author transformations on a sample of the data.

Extract non-transactional metadata from within the contents of the file and merge it with transactional data

In this section, we consider data that has metadata on the first few rows of the file, followed by transactional data. We walk through how to extract data relevant to the whole file from the top of the document and combine it with the transactional data into one flat table.

Extract metadata from the header and remove invalid rows

Complete the following steps to extract metadata from the header:

  1. Choose Conditions and then choose IF.
  2. For Matching conditions, choose Match all conditions.
  3. For Source, choose Value of and Column_1.
  4. For Logical condition, choose Is exactly.
  5. For Enter a value, choose Enter custom value and enter RESELLER NAME.
  6. For Flag result value as, choose Custom value.
  7. For Value if true, choose Select source column and set Value of to Column_2.
  8. For Value if false, choose Enter custom value and enter INVALID.
  9. Choose Apply.

Your dataset should now look like the following screenshot, with the Reseller Name value extracted to a column by itself.

Next, you remove invalid rows and fill the rows with the Reseller Name value.

  1. Choose Clean and then choose Custom values.
  2. For Source column, choose ResellerName.
  3. For Specify values to remove, choose Custom value.
  4. For Values to remove, choose Invalid.
  5. For Apply transform to, choose All rows.
  6. Choose Apply.
  7. Choose Missing and then choose Fill with most frequent value.
  8. For Source column, choose FirstTransactionDate.
  9. For Missing value action, choose Fill with most frequent value.
  10. For Apply transform to, choose All rows.
  11. Choose Apply.

Your dataset should now look like the following screenshot, with the Reseller Name value extracted to a column by itself.

Repeat the same steps in this section for the rest of the metadata, including Reseller Email Address, Reseller ID, and First Transaction Date.
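If you want to prototype the intent of these steps outside DataBrew, the following pandas sketch does roughly the same thing on a made-up miniature of the input; the labels and layout are assumptions based on the sample shown earlier, not the real file:

import pandas as pd

# Tiny stand-in for the raw file: metadata rows at the top, then transactions.
df = pd.DataFrame(
    {
        "Column_1": ["RESELLER NAME", "RESELLER ID", "1001", "1002"],
        "Column_2": ["Acme Foods", "R-42", "2023-01-05", "2023-01-09"],
        "Column_3": [None, None, 120.0, 80.0],
    }
)

for label, new_col in [("RESELLER NAME", "Reseller_Name"), ("RESELLER ID", "Reseller_ID")]:
    # Copy the value next to the label (Column_2) into a new column, then
    # spread it to every row (same effect as DataBrew's fill step here).
    df[new_col] = df["Column_2"].where(df["Column_1"] == label).ffill().bfill()

# Drop the metadata rows, keeping only transactions (rows with an amount).
df = df[df["Column_3"].notna()]
print(df)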

Promote column headers and clean up data

To promote column headers, complete the following steps:

  1. Reorder the columns to put the metadata columns to the left of the dataset by choosing Column, Move column, and Start of the table.
  2. Rename the columns with the appropriate names.

Now you can clean up some columns and rows.

  1. Delete unnecessary columns, such as Column_7.

You can also delete invalid rows by filtering out records that don’t have a transaction date value.

  1. Choose the ABC icon on the menu of the Transaction_Date column and choose date.

  2. For Handle invalid values, select Delete rows, then choose Apply.

The dataset should now have the metadata extracted and the column headers promoted.

Combine multi-line rows into single-line rows

The next issue to address is data belonging to the same transaction that is split across multiple lines. In the following steps, we extract the needed data from the rows and merge it into single-line transactions. For this example specifically, the Reseller Margin data is split across two lines.


Complete the following steps to get the Reseller Margin value on the same line as the corresponding transaction. First, we identify the Reseller Margin rows and store them in a temporary column.

  1. Choose Conditions and then choose IF.
  2. For Matching conditions, choose Match all conditions.
  3. For Source, choose Value of and Transaction_ID.
  4. For Logical condition, choose Contains.
  5. For Enter a value, choose Enter custom value and enter Reseller Margin.
  6. For Flag result value as, choose Custom value.
  7. For Value if true, choose Select source column and set Value of to TransactionAmount.
  8. For Value if false, choose Enter custom value and enter Invalid.
  9. For Destination column, choose ResellerMargin_Temp.
  10. Choose Apply.

Next, you shift the Reseller Margin value up one row.

  1. Choose Functions and then choose NEXT.
  2. For Source column, choose ResellerMargin_Temp.
  3. For Number of rows, enter 1.
  4. For Destination column, choose ResellerMargin.
  5. For Apply transform to, choose All rows.
  6. Choose Apply.

Next, delete the invalid rows.

  1. Choose Missing and then choose Remove missing rows.
  2. For Source column, choose TransactionDate.
  3. For Missing value action, choose Delete rows with missing values.
  4. For Apply transform to, choose All rows.
  5. Choose Apply.

Your dataset should now look like the following screenshot, with the Reseller Margin value extracted to a column by itself.
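Conceptually, the IF, NEXT, and remove-missing-rows steps above amount to a flag, a one-row shift, and a filter. The following pandas sketch reproduces that logic on a made-up miniature of the data (column names follow the ones used in the steps):

import pandas as pd

# Small stand-in for the dataset at this point in the recipe: each transaction
# is followed by a "Reseller Margin" line carrying the margin amount.
df = pd.DataFrame(
    {
        "Transaction_ID": ["TX-1", "Reseller Margin", "TX-2", "Reseller Margin"],
        "TransactionDate": ["2023-01-05", None, "2023-01-09", None],
        "TransactionAmount": [120.0, 12.0, 80.0, 8.0],
    }
)

# IF step: keep the amount only on the "Reseller Margin" rows.
is_margin_row = df["Transaction_ID"].astype(str).str.contains("Reseller Margin")
df["ResellerMargin_Temp"] = df["TransactionAmount"].where(is_margin_row)

# NEXT step: shift the value up one row onto the transaction itself.
df["ResellerMargin"] = df["ResellerMargin_Temp"].shift(-1)

# Remove-missing-rows step: drop the margin label rows (they have no date).
df = df[df["TransactionDate"].notna()].drop(columns=["ResellerMargin_Temp"])
print(df)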

With the data structured properly, we can move on to mining the cleaned data.

Extract unique identifiers from within strings and text

Many types of data contain important information stored as unstructured text in a cell. In this section, we look at how to extract this data. Within the sample dataset, the BankTransferText column has valuable information around our resellers’ registered bank account numbers as well as the currency of the transaction, namely IBAN, SWIFT Code, and Currency.

Complete the following steps to extract IBAN, SWIFT code, and Currency into separate columns. First, you extract the IBAN number from the text using a regular expression (regex).

  1. Choose Extract and then choose Custom value or pattern.
  2. For Create column options, choose Extract values.
  3. For Source column, choose BankTransferText.
  4. For Extract options, choose Custom value or pattern.
  5. For Values to extract, enter [a-zA-Z][a-zA-Z][0-9]{2}[A-Z0-9]{1,30}.
  6. For Destination column, choose IBAN.
  7. For Apply transform to, choose All rows.
  8. Choose Apply.
  9. Extract the SWIFT code from the text using a regex following the same steps used to extract the IBAN number, but using the following regex instead: (?!^)(SWIFT Code: )([A-Z]{2}[A-Z0-9]+).

Next, remove the SWIFT Code: label from the extracted text.

  1. Choose Remove and then choose Custom values.
  2. For Source column, choose SWIFT Code.
  3. For Specify values to remove, choose Custom value.
  4. For Values to remove, enter SWIFT Code:.
  5. For Apply transform to, choose All rows.
  6. Choose Apply.
  7. Extract the currency from the text using a regex following the same steps used to extract the IBAN number, but using the following regex instead: (?!^)(Currency: )([A-Z]{3}).
  8. Remove the Currency: label from the extracted text following the same steps used to remove the SWIFT Code: label.

You can clean up by deleting any unnecessary columns.

  1. Choose Column and then choose Delete.
  2. For Source columns, choose BankTransferText.
  3. Choose Apply.
  4. Repeat for any remaining columns.

Your dataset should now look like the following screenshot, with IBAN, SWIFT Code, and Currency extracted to separate columns.
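Before relying on the three regular expressions in DataBrew, you can sanity-check them in plain Python. The sample BankTransferText value below is made up for illustration:

import re

text = "IBAN: DE89370400440532013000 SWIFT Code: DEUTDEFF Currency: EUR"  # made-up sample

iban = re.search(r"[a-zA-Z][a-zA-Z][0-9]{2}[A-Z0-9]{1,30}", text)
swift = re.search(r"(?!^)(SWIFT Code: )([A-Z]{2}[A-Z0-9]+)", text)
currency = re.search(r"(?!^)(Currency: )([A-Z]{3})", text)

print(iban.group(0))      # DE89370400440532013000
print(swift.group(2))     # DEUTDEFF (group 2 drops the "SWIFT Code: " label)
print(currency.group(2))  # EUR (group 2 drops the "Currency: " label)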

Write the transformed data to Amazon S3

With all the steps captured in the recipe, the last step is to write the transformed data to Amazon S3.

  1. On the DataBrew console, choose Run job.
  2. For Job name, enter FoodMartSalesToDataLake.
  3. For Output to, choose Amazon S3.
  4. For File type, choose CSV.
  5. For Delimiter, choose Comma (,).
  6. For Compression, choose None.
  7. For S3 bucket owners’ account, select Current AWS account.
  8. For S3 location, enter s3://{name of S3 bucket}/clean/.
  9. For Role name, choose the IAM role created as a prerequisite or create a new role.
  10. Choose Create and run job.
  11. Go to the Jobs tab and wait for the job to complete.
  12. Navigate to the job output folder on the Amazon S3 console.
  13. Download the CSV file and view the transformed output.

Your dataset should look similar to the following screenshot.
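If you later want to run the same recipe from automation instead of the console, a hedged boto3 sketch looks like the following; the role ARN and bucket are placeholders, and it assumes the project and recipe created earlier in this post:

import boto3

databrew = boto3.client("databrew")

databrew.create_recipe_job(
    Name="FoodMartSalesToDataLake",
    ProjectName="FoodMartSales-AllUpProject",
    RoleArn="arn:aws:iam::111122223333:role/DataBrewJobRole",  # placeholder role
    Outputs=[
        {
            "Location": {"Bucket": "databrew-foodmart-example-bucket", "Key": "clean/"},
            "Format": "CSV",
            "FormatOptions": {"Csv": {"Delimiter": ","}},
        }
    ],
)

# Start a run and print its identifier
run = databrew.start_job_run(Name="FoodMartSalesToDataLake")
print(run["RunId"])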

Clean up

To optimize cost, make sure to clean up the resources deployed for this project by completing the following steps:

  1. Delete the DataBrew projects along with their linked recipes.
  2. Delete all the DataBrew datasets.
  3. Delete the contents in your S3 bucket.
  4. Delete the S3 bucket.

Conclusion

The reality of exchanging data with suppliers is that we can’t always control the shape of the input data. With DataBrew, we can use a list of pre-built transformations and repeatable steps to transform incoming data into a desired layout and extract relevant data and insights from Excel or CSV files. Start using DataBrew today and transform third-party files into structured datasets ready for consumption by your business.


About the Author

Ismail Makhlouf is a Senior Specialist Solutions Architect for Data Analytics at AWS. Ismail focuses on architecting solutions for organizations across their end-to-end data analytics estate, including batch and real-time streaming, big data, data warehousing, and data lake workloads. He primarily works with direct-to-consumer platform companies in the ecommerce, FinTech, PropTech, and HealthTech space to achieve their business objectives with well-architected data platforms.

Writing IAM Policies: Grant Access to User-Specific Folders in an Amazon S3 Bucket

Post Syndicated from Dylan Souvage original https://aws.amazon.com/blogs/security/writing-iam-policies-grant-access-to-user-specific-folders-in-an-amazon-s3-bucket/

November 14, 2023: We’ve updated this post to use IAM Identity Center and follow updated IAM best practices.

In this post, we discuss the concept of folders in Amazon Simple Storage Service (Amazon S3) and how to use policies to restrict access to these folders. The idea is that by properly managing permissions, you can allow federated users to have full access to their respective folders and no access to the rest of the folders.

Overview

Imagine you have a team of developers named Adele, Bob, and David. Each of them has a dedicated folder in a shared S3 bucket, and they should only have access to their respective folders. These users are authenticated through AWS IAM Identity Center (successor to AWS Single Sign-On).

In this post, you’ll focus on David. You’ll walk through the process of setting up these permissions for David using IAM Identity Center and Amazon S3. Before you get started, let’s first discuss what is meant by folders in Amazon S3, because it’s not as straightforward as it might seem. To learn how to create a policy with folder-level permissions, you’ll walk through a scenario similar to what many people have done on existing file shares, where every IAM Identity Center user has access to only their own home folder. With folder-level permissions, you can granularly control who has access to which objects in a specific bucket.

You’ll be shown a policy that grants IAM Identity Center users access to the same Amazon S3 bucket so that they can use the AWS Management Console to store their information. The policy allows users in the company to upload or download files from their department’s folder, but not to access any other department’s folder in the bucket.

After the policy is explained, you’ll see how to create an individual policy for each IAM Identity Center user.

Throughout the rest of this post, you will use a policy, which will be associated with an IAM Identity Center user named David. Also, you must have already created an S3 bucket.

Note: S3 bucket names are globally unique, so you must change the bucket name to a unique name of your own.

For this blog post, you will need an S3 bucket with the following structure (the example bucket name for the rest of the blog is “my-new-company-123456789”):

/home/Adele/
/home/Bob/
/home/David/
/confidential/
/root-file.txt

Figure 1: Screenshot of the root of the my-new-company-123456789 bucket

Your S3 bucket structure should have two folders, home and confidential, with a file root-file.txt in the main bucket directory. Inside confidential you will have no items or folders. Inside home there should be three sub-folders: Adele, Bob, and David.

Figure 2: Screenshot of the home/ directory of the my-new-company-123456789 bucket

A brief lesson about Amazon S3 objects

Before explaining the policy, it’s important to review how Amazon S3 objects are named. This brief description isn’t comprehensive, but will help you understand how the policy works. If you already know about Amazon S3 objects and prefixes, skip ahead to Creating David in Identity Center.

Amazon S3 stores data in a flat structure; you create a bucket, and the bucket stores objects. S3 doesn’t have a hierarchy of sub-buckets or folders; however, tools like the console can emulate a folder hierarchy to present folders in a bucket by using the names of objects (also known as keys). When you create a folder in S3, S3 creates a 0-byte object with a key that references the folder name that you provided. For example, if you create a folder named photos in your bucket, the S3 console creates a 0-byte object with the key photos/. The console creates this object to support the idea of folders. The S3 console treats all objects that have a forward slash (/) character as the last (trailing) character in the key name as a folder (for example, examplekeyname/).

To give you an example, for an object that’s named home/common/shared.txt, the console will show the shared.txt file in the common folder in the home folder. The names of these folders (such as home/ or home/common/) are called prefixes, and prefixes like these are what you use to specify David’s department folder in his policy. By the way, the slash (/) in a prefix like home/ isn’t a reserved character — you could name an object (using the Amazon S3 API) with prefixes such as home:common:shared.txt or home-common-shared.txt. However, the convention is to use a slash as the delimiter, and the Amazon S3 console (but not S3 itself) treats the slash as a special character for showing objects in folders. For more information on organizing objects in the S3 console using folders, see Organizing objects in the Amazon S3 console by using folders.
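You can see this behavior directly with the SDK. The following boto3 sketch creates the same kind of 0-byte folder object the console creates (using the example bucket name from this post):

import boto3

s3 = boto3.client("s3")
bucket = "my-new-company-123456789"  # use your own unique bucket name

# Creating a "folder" is just writing a 0-byte object whose key ends with a slash.
s3.put_object(Bucket=bucket, Key="home/David/")

# An object "inside" that folder is simply an object whose key shares the prefix.
s3.put_object(Bucket=bucket, Key="home/David/readme.txt", Body=b"hello")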

Creating David in Identity Center

IAM Identity Center helps you securely create or connect your workforce identities and manage their access centrally across AWS accounts and applications. Identity Center is the recommended approach for workforce authentication and authorization on AWS for organizations of any size and type. Using Identity Center, you can create and manage user identities in AWS, or connect your existing identity source, including Microsoft Active Directory, Okta, Ping Identity, JumpCloud, Google Workspace, and Azure Active Directory (Azure AD). For further reading on IAM Identity Center, see the Identity Center getting started page.

Begin by setting up David as an IAM Identity Center user. To start, open the AWS Management Console and go to IAM Identity Center and create a user.

Note: The following steps are for Identity Center without System for Cross-domain Identity Management (SCIM) turned on; the Add user option won’t be available if SCIM is turned on.

  1. From the left pane of the Identity Center console, select Users, and then choose Add user.
    Figure 3: Screenshot of IAM Identity Center Users page.

  2. Enter David as the Username, enter an email address that you have access to (you will need it later to confirm your user), and then enter a First name, Last name, and Display name.
  3. Leave the rest as default and choose Add user.
  4. Select Users from the left navigation pane and verify you’ve created the user David.
    Figure 4: Screenshot of adding users to group in Identity Center.

  5. Now that you’ve verified that the user David has been created, use the left pane to navigate to Permission sets, then choose Create permission set.
    Figure 5: Screenshot of permission sets in Identity Center.

  6. Select Custom permission set as your Permission set type, then choose Next.
    Figure 6: Screenshot of permission set types in Identity Center.

David’s policy

This is David’s complete policy, which will be associated with an IAM Identity Center federated user named David by using the console. This policy grants David full console access to only his folder (/home/David) and no one else’s. While you could grant each user access to their own bucket, keep in mind that an AWS account can have up to 100 buckets by default. By creating home folders and granting the appropriate permissions, you can instead allow thousands of users to share a single bucket.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowUserToSeeBucketListInTheConsole",
      "Action": ["s3:ListAllMyBuckets", "s3:GetBucketLocation"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::*"]
    },
    {
      "Sid": "AllowRootAndHomeListingOfCompanyBucket",
      "Action": ["s3:ListBucket"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::my-new-company-123456789"],
      "Condition": {"StringEquals": {"s3:prefix": ["", "home/", "home/David"], "s3:delimiter": ["/"]}}
    },
    {
      "Sid": "AllowListingOfUserFolder",
      "Action": ["s3:ListBucket"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::my-new-company-123456789"],
      "Condition": {"StringLike": {"s3:prefix": ["home/David/*"]}}
    },
    {
      "Sid": "AllowAllS3ActionsInUserFolder",
      "Effect": "Allow",
      "Action": ["s3:*"],
      "Resource": ["arn:aws:s3:::my-new-company-123456789/home/David/*"]
    }
  ]
}
  1. Now, copy and paste the preceding IAM Policy into the inline policy editor. In this case, you use the JSON editor. For information on creating policies, see Creating IAM policies.
    Figure 7: Screenshot of the inline policy inside the permissions set in Identity Center.

  2. Give your permission set a name and a description, then leave the rest at the default settings and choose Next.
  3. Verify that you’ve modified the policy to use the bucket name you created earlier.
  4. After your permission set has been created, navigate to AWS accounts on the left navigation pane, then select Assign users or groups.
    Figure 8: Screenshot of the AWS accounts in Identity Center.

  5. Select the user David and choose Next.
    Figure 9: Screenshot of the AWS accounts in Identity Center.

  6. Select the permission set you created earlier, choose Next, leave the rest at the default settings and choose Submit.
    Figure 10: Screenshot of the permission sets in Identity Center.

    You’ve now created and attached the permissions required for David to view his S3 bucket folder, but not to view the objects in other users’ folders. You can verify this by signing in as David through the AWS access portal.

    Figure 11: Screenshot of the settings summary in Identity Center.

  7. Navigate to the dashboard in IAM Identity Center and go to the Settings summary, then choose the AWS access portal URL.
    Figure 12: Screenshot of David signing into the console via the Identity Center dashboard URL.

  8. Sign in as the user David with the one-time password you received earlier when creating David.
    Figure 13: Second screenshot of David signing into the console through the Identity Center dashboard URL.

  9. Open the Amazon S3 console.
  10. Search for the bucket you created earlier.
    Figure 14: Screenshot of my-new-company-123456789 bucket in the AWS console.

  11. Navigate to David’s folder and verify that you have read and write access to the folder. If you navigate to other users’ folders, you’ll find that you don’t have access to the objects inside their folders.
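You can also check this outside the console. The following boto3 sketch assumes you’ve obtained temporary credentials for David from the AWS access portal and stored them under a hypothetical profile named david:

import boto3
from botocore.exceptions import ClientError

session = boto3.Session(profile_name="david")  # hypothetical profile for David's credentials
s3 = session.client("s3")
bucket = "my-new-company-123456789"

# Allowed: listing and writing inside home/David/
s3.put_object(Bucket=bucket, Key="home/David/test.txt", Body=b"hello")
print(s3.list_objects_v2(Bucket=bucket, Prefix="home/David/").get("KeyCount"))

# Denied: listing another user's folder
try:
    s3.list_objects_v2(Bucket=bucket, Prefix="home/Bob/")
except ClientError as err:
    print(err.response["Error"]["Code"])  # expect AccessDenied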

David’s policy consists of four blocks; let’s look at each individually.

Block 1: Allow required Amazon S3 console permissions

Before you begin identifying the specific folders David can have access to, you must give him two permissions that are required for Amazon S3 console access: ListAllMyBuckets and GetBucketLocation.

   {
      "Sid": "AllowUserToSeeBucketListInTheConsole",
      "Action": ["s3:GetBucketLocation", "s3:ListAllMyBuckets"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::*"]
   }

The ListAllMyBuckets action grants David permission to list all the buckets in the AWS account, which is required for navigating to buckets in the Amazon S3 console (and as an aside, you currently can’t selectively filter out certain buckets, so users must have permission to list all buckets for console access). The console also does a GetBucketLocation call when users initially navigate to the Amazon S3 console, which is why David also requires permission for that action. Without these two actions, David will get an access denied error in the console.

Block 2: Allow listing objects in root and home folders

Although David should have access to only his home folder, he requires additional permissions so that he can navigate to his folder in the Amazon S3 console. David needs permission to list objects at the root level of the my-new-company-123456789 bucket and to the home/ folder. The following policy grants these permissions to David:

   {
      "Sid": "AllowRootAndHomeListingOfCompanyBucket",
      "Action": ["s3:ListBucket"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::my-new-company-123456789"],
      "Condition":{"StringEquals":{"s3:prefix":["","home/", "home/David"],"s3:delimiter":["/"]}}
   }

Without the ListBucket permission, David can’t navigate to his folder because he won’t have permissions to view the contents of the root and home folders. When David tries to use the console to view the contents of the my-new-company-123456789 bucket, the console will return an access denied error. Although this policy grants David permission to list all objects in the root and home folders, he won’t be able to view the contents of any files or folders except his own (you specify these permissions in the next block).

This block includes conditions, which let you limit under what conditions a request to AWS is valid. In this case, David can list objects in the my-new-company-123456789 bucket only when he requests objects without a prefix (objects at the root level) and objects with the home/ prefix (objects in the home folder). If David tries to navigate to other folders, such as confidential/, David is denied access. Additionally, David needs permissions to list prefix home/David to be able to use the search functionality of the console instead of scrolling down the list of users’ folders.

To set these root and home folder permissions, you use two conditions: s3:prefix and s3:delimiter. The s3:prefix condition specifies the folders that David has ListBucket permissions for. For example, David can list the following files and folders in the my-new-company-123456789 bucket:

/root-file.txt
/confidential/
/home/Adele/
/home/Bob/
/home/David/

But David cannot list files or subfolders in the confidential/, home/Adele/, or home/Bob/ folders.

Although the s3:delimiter condition isn’t required for console access, it’s still a good practice to include it in case David makes requests by using the API. As previously noted, the delimiter is a character—such as a slash (/)—that identifies the folder that an object is in. The delimiter is useful when you want to list objects as if they were in a file system. For example, let’s assume the my-new-company-123456789 bucket stored thousands of objects. If David includes the delimiter in his requests, he can limit the number of returned objects to just the names of files and subfolders in the folder he specified. Without the delimiter, in addition to every file in the folder he specified, David would get a list of all files in any subfolders.
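To see the effect of the delimiter yourself, you can compare the two list calls below (a boto3 sketch, run with credentials that are allowed to list the bucket):

import boto3

s3 = boto3.client("s3")
bucket = "my-new-company-123456789"

# With a delimiter, keys under home/ roll up into one "folder" entry per user.
resp = s3.list_objects_v2(Bucket=bucket, Prefix="home/", Delimiter="/")
print([p["Prefix"] for p in resp.get("CommonPrefixes", [])])
# e.g. ['home/Adele/', 'home/Bob/', 'home/David/']

# Without a delimiter, every object under home/ is returned, however deep.
resp = s3.list_objects_v2(Bucket=bucket, Prefix="home/")
print([o["Key"] for o in resp.get("Contents", [])])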

Block 3: Allow listing objects in David’s folder

In addition to the root and home folders, David requires access to all objects in the home/David/ folder and any subfolders that he might create. Here’s a policy that allows this:

    {
      "Sid": "AllowListingOfUserFolder",
      "Action": ["s3:ListBucket"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::my-new-company-123456789"],
      "Condition": {"StringLike": {"s3:prefix": ["home/David/*"]}}
    }

In the condition above, you use a StringLike expression in combination with the asterisk (*) to represent an object in David’s folder, where the asterisk acts as a wildcard. That way, David can list files and folders in his folder (home/David/). You couldn’t include this condition in the previous block (AllowRootAndHomeListingOfCompanyBucket) because it used the StringEquals expression, which would interpret the asterisk (*) as an asterisk, not as a wildcard.

In the next section, the AllowAllS3ActionsInUserFolder block, you’ll see that the Resource element specifies my-new-company-123456789/home/David/*, which looks like the condition specified in this section. You might think that you can similarly use the Resource element to specify David’s folder in this block. However, the ListBucket action is a bucket-level operation, meaning the Resource element for the ListBucket action applies only to bucket names and doesn’t take folder names into account. So, to limit actions at the object level (files and folders), you must use conditions.

Block 4: Allow all Amazon S3 actions in David’s folder

Finally, you specify David’s actions (such as read, write, and delete permissions) and limit them to just his home folder, as shown in the following policy:

    {
      "Sid": "AllowAllS3ActionsInUserFolder",
      "Effect": "Allow",
      "Action": ["s3:*"],
      "Resource": ["arn:aws:s3:::my-new-company-123456789/home/David/*"]
    }

For the Action element, you specified s3:*, which means David has permission to do all Amazon S3 actions. In the Resource element, you specified David’s folder with an asterisk (*) (a wildcard) so that David can perform actions on the folder and inside the folder. For example, David has permission to change his folder’s storage class. David also has permission to upload files, delete files, and create subfolders in his folder (perform actions in the folder).

An easier way to manage policies with policy variables

In David’s folder-level policy you specified David’s home folder. If you wanted a similar policy for users like Bob and Adele, you’d have to create separate policies that specify their home folders. Instead of creating individual policies for each IAM Identity Center user, you can use policy variables and create a single policy that applies to multiple users (a group policy). Policy variables act as placeholders. When you make a request to a service in AWS, the placeholder is replaced by a value from the request when the policy is evaluated.

For example, you can use the previous policy and replace David’s user name with a variable that uses the requester’s user name through attributes and PrincipalTag as shown in the following policy (copy this policy to use in the procedure that follows):

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "AllowUserToSeeBucketListInTheConsole",
			"Action": [
				"s3:ListAllMyBuckets",
				"s3:GetBucketLocation"
			],
			"Effect": "Allow",
			"Resource": [
				"arn:aws:s3:::*"
			]
		},
		{
			"Sid": "AllowRootAndHomeListingOfCompanyBucket",
			"Action": [
				"s3:ListBucket"
			],
			"Effect": "Allow",
			"Resource": [
				"arn:aws:s3:::my-new-company-123456789"
			],
			"Condition": {
				"StringEquals": {
					"s3:prefix": [
						"",
						"home/",
						"home/${aws:PrincipalTag/userName}"
					],
					"s3:delimiter": [
						"/"
					]
				}
			}
		},
		{
			"Sid": "AllowListingOfUserFolder",
			"Action": [
				"s3:ListBucket"
			],
			"Effect": "Allow",
			"Resource": [
				"arn:aws:s3:::my-new-company-123456789"
			],
			"Condition": {
				"StringLike": {
					"s3:prefix": [
						"home/${aws:PrincipalTag/userName}/*"
					]
				}
			}
		},
		{
			"Sid": "AllowAllS3ActionsInUserFolder",
			"Effect": "Allow",
			"Action": [
				"s3:*"
			],
			"Resource": [
				"arn:aws:s3:::my-new-company-123456789/home/${aws:PrincipalTag/userName}/*"
			]
		}
	]
}
  1. To implement this policy with variables, begin by opening the IAM Identity Center console using the main AWS admin account (ensuring you’re not signed in as David).
  2. Select Settings on the left-hand side, then select the Attributes for access control tab.
    Figure 15: Screenshot of Settings inside Identity Center.

  3. Create a new attribute for access control, entering userName as the Key and ${path:userName} as the Value, then choose Save changes. This will add a session tag to your Identity Center user and allow you to use that tag in an IAM policy.
    Figure 16: Screenshot of managing attributes inside Identity Center settings.

  4. To edit David’s permissions, go back to the IAM Identity Center console and select Permission sets.
    Figure 17: Screenshot of permission sets inside Identity Center with Davids-Permissions selected.

  5. Select David’s permission set that you created previously.
  6. Select Inline policy and then choose Edit to update David’s policy by replacing it with the modified policy that you copied at the beginning of this section, which will resolve to David’s username.
    Figure 18: Screenshot of David’s policy inside his permission set inside Identity Center.

You can validate that this is set up correctly by signing in as David through the Identity Center dashboard as you did before and verifying that you have access to the David folder but not to the Bob or Adele folders.

Figure 19: Screenshot of David’s S3 folder with access to a .jpg file inside.

Whenever a user makes a request to AWS, the variable is replaced by the user name of whoever made the request. For example, when David makes a request, ${aws:PrincipalTag/userName} resolves to David; when Adele makes the request, ${aws:PrincipalTag/userName} resolves to Adele.

Note that if this is the route you use to grant access, you must control and limit who can set this userName tag on an IAM principal, because anyone who can set the tag can effectively read and write to any of these bucket prefixes. For more information, see What is ABAC for AWS, and the Attribute-based access control User Guide.

Conclusion

By using Amazon S3 folders, you can follow the principle of least privilege and verify that the right users have access to what they need, and only to what they need.

See the following example policy that only allows API access to the buckets, and only allows for adding, deleting, restoring, and listing objects inside the folders:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowAllS3ActionsInUserFolder",
            "Effect": "Allow",
            "Action": [
                "s3:DeleteObject",
                "s3:DeleteObjectTagging",
                "s3:DeleteObjectVersion",
                "s3:DeleteObjectVersionTagging",
                "s3:GetObject",
                "s3:GetObjectTagging",
                "s3:GetObjectVersion",
                "s3:GetObjectVersionTagging",
                "s3:ListBucket",
                "s3:PutObject",
                "s3:PutObjectTagging",
                "s3:PutObjectVersionTagging",
                "s3:RestoreObject"
            ],
            "Resource": [
		   "arn:aws:s3:::my-new-company-123456789",
                "arn:aws:s3:::my-new-company-123456789/home/${aws:PrincipalTag/userName}/*"
            ],
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "home/${aws:PrincipalTag/userName}/*"
                    ]
                }
            }
        }
    ]
}

We encourage you to think about what policies your users might need and restrict the access by only explicitly allowing what is needed.

Here are some additional resources for learning about Amazon S3 folders and about IAM policies, and be sure to get involved at the community forums:

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Dylan Souvage

Dylan is a Solutions Architect based in Toronto, Canada. Dylan loves working with customers to understand their business needs and enable them in their cloud journey. In his spare time, he enjoys going out in nature, going on long road trips, and traveling to warm, sunny places.

Abhra Sinha

Abhra is a Toronto-based Senior Solutions Architect at AWS. Abhra enjoys being a trusted advisor to customers, working closely with them to solve their technical challenges and help build a secure scalable architecture on AWS. In his spare time, he enjoys Photography and exploring new restaurants.

Divyajeet Singh

Divyajeet (DJ) is a Sr. Solutions Architect at AWS Canada. He loves working with customers to help them solve their unique business challenges using the cloud. In his free time, he enjoys spending time with family and friends, and exploring new places.

How Wallapop improved performance of analytics workloads with Amazon Redshift Serverless and data sharing

Post Syndicated from Eduard Lopez original https://aws.amazon.com/blogs/big-data/how-wallapop-improved-performance-of-analytics-workloads-with-amazon-redshift-serverless-and-data-sharing/

Amazon Redshift is a fast, fully managed cloud data warehouse that makes it straightforward and cost-effective to analyze all your data at petabyte scale, using standard SQL and your existing business intelligence (BI) tools. Today, tens of thousands of customers run business-critical workloads on Amazon Redshift.

Amazon Redshift Serverless makes it effortless to run and scale analytics workloads without having to manage any data warehouse infrastructure.

Redshift Serverless automatically provisions and intelligently scales data warehouse capacity to deliver fast performance for even the most demanding and unpredictable workloads, and you pay only for what you use.

This is ideal when it’s difficult to predict compute needs such as variable workloads, periodic workloads with idle time, and steady-state workloads with spikes. As your demand evolves with new workloads and more concurrent users, Redshift Serverless automatically provisions the right compute resources, and your data warehouse scales seamlessly and automatically.

Amazon Redshift data sharing allows you to securely share live, transactionally consistent data in one Redshift data warehouse with another Redshift data warehouse (provisioned or serverless) across accounts and Regions without needing to copy, replicate, or move data from one data warehouse to another.

Amazon Redshift data sharing enables you to evolve your Amazon Redshift deployment architectures into a hub-and-spoke or data mesh model to better meet performance SLAs, provide workload isolation, perform cross-group analytics, and onboard new use cases, all without the complexity of data movement and data copies.

In this post, we show how Wallapop adopted Redshift Serverless and data sharing to modernize their data warehouse architecture.

Wallapop’s initial data architecture platform

Wallapop is a Spanish ecommerce marketplace company focused on second-hand items, founded in 2013. Every day, they receive around 300,000 new items from buyers to be added to their catalog. The marketplace can be accessed via mobile app or website.

The average monthly traffic is around 15 million active users. Since its creation in 2013, it has reached more than 40 million downloads and more than 700 million products have been listed.

Amazon Redshift plays a central role in their data platform on AWS for ingestion, ETL (extract, transform, and load), machine learning (ML), and consumption workloads that run their insight consumption to drive decision-making.

The initial architecture was composed of one main provisioned Redshift cluster that handled all the workloads, as illustrated in the following diagram. The cluster was deployed with eight ra3.4xlarge nodes and had concurrency scaling enabled.

Wallapop had three main areas to improve in their initial data architecture platform:

  • Workload isolation challenges with growing data volumes and new workloads running in parallel
  • Administrative burden on data engineering teams to manage the concurrent workloads, especially at peak times
  • Cost-performance ratio while scaling during peak periods

The improvements mainly focused on the performance of the data consumption workloads and the BI and analytics tool, where high query concurrency was impacting final analytics preparation and insight consumption.

Solution overview

To improve their data platform architecture, Wallapop designed and built a new distributed approach with Amazon Redshift with the support of AWS.

The cluster size of the provisioned data warehouse didn’t change. What changed was lowering concurrency scaling usage to 1 hour, which stays within the free concurrency scaling credits accrued for every 24 hours that the main cluster runs. The following diagram illustrates the target architecture.

Solution details

The new data platform architecture combines Redshift Serverless and provisioned data warehouses with Amazon Redshift data sharing, helping Wallapop improve their overall Amazon Redshift experience with improved ease of use, performance, and optimized costs.

Redshift Serverless measures data warehouse capacity in Redshift Processing Units (RPUs). RPUs are resources used to handle workloads. You can adjust the base capacity setting from 8 RPUs to 512 RPUs in units of 8 (8, 16, 24, and so on).
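For reference, base capacity is simply a workgroup setting. A hedged example of adjusting it with the AWS SDK, where the workgroup name is a placeholder rather than Wallapop’s actual configuration:

import boto3

serverless = boto3.client("redshift-serverless")

# Raise or lower the base capacity of an existing workgroup in units of 8 RPUs.
serverless.update_workgroup(
    workgroupName="bi-consumer-workgroup",  # placeholder workgroup name
    baseCapacity=32,
)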

The new architecture uses a Redshift provisioned cluster with RA3 nodes to run their constant and write workloads (data ingestion and transformation jobs). For cost-efficiency, Wallapop is also benefiting from Redshift reserved instances to optimize on costs for these known, predictable, and steady workloads. This cluster acts as the producer cluster in their distributed architecture using data sharing, meaning the data is ingested into the storage layer of Amazon Redshift—Redshift Managed Storage (RMS).

For the consumption part of the data platform architecture, the data is shared with different Redshift Serverless endpoints to meet the needs for different consumption workloads.

Data sharing provides workload isolation. With this architecture, Wallapop achieves better workload isolation and ensures that only the right data is shared with the different consumption applications. Additionally, this approach avoids data duplication on the consumer side, which optimizes costs and allows better governance processes, because they only have to manage a single version of the data warehouse data instead of different copies or versions of it.

Redshift Serverless is used as a consumer part of the data platform architecture to meet those predictable and unpredictable, non-steady, and often demanding analytics workloads, such as their CI/CD jobs and BI and analytics consumption workloads coming from their data visualization application. Redshift Serverless also helps them achieve better workload isolation due to its managed auto scaling feature that makes sure performance is consistently good for these unpredictable workloads, even at peak times. It also provides a better user experience for the Wallapop data platform team, thanks to the autonomics capabilities that Redshift Serverless provides.
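As an illustration of how the producer and consumer sides of such a setup can be wired together (not Wallapop’s actual code), the data sharing statements could be issued through the Redshift Data API; the cluster, workgroup, and namespace identifiers below are placeholders:

import boto3

rsd = boto3.client("redshift-data")

# Producer side: create a datashare on the provisioned (RA3) cluster and grant it
# to the consumer namespace of a Redshift Serverless workgroup.
producer_sql = [
    "CREATE DATASHARE analytics_share;",
    "ALTER DATASHARE analytics_share ADD SCHEMA public;",
    "ALTER DATASHARE analytics_share ADD ALL TABLES IN SCHEMA public;",
    "GRANT USAGE ON DATASHARE analytics_share TO NAMESPACE '<consumer-namespace-id>';",
]
for sql in producer_sql:
    rsd.execute_statement(
        ClusterIdentifier="producer-ra3-cluster",  # placeholder cluster
        Database="dev",
        DbUser="awsuser",                          # placeholder database user
        Sql=sql,
    )

# Consumer side: create a database from the datashare on the serverless workgroup.
rsd.execute_statement(
    WorkgroupName="bi-consumer-workgroup",         # placeholder workgroup
    Database="dev",
    Sql="CREATE DATABASE analytics FROM DATASHARE analytics_share OF NAMESPACE '<producer-namespace-id>';",
)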

The new solution combining Redshift Serverless and data sharing allowed Wallapop to achieve better performance, cost, and ease of use.

Eduard Lopez, Wallapop Data Engineering Manager, shared the improved experience of analytics users: “Our analyst users are telling us that now ‘Looker flies.’ Insights consumption went up as a result of it without increasing costs.”

Evaluation of outcome

Wallapop started this re-architecture effort by first testing the isolation of their BI consumption workload with Amazon Redshift data sharing and Redshift Serverless with the support of AWS. The workload was tested using different base RPU configurations to measure the base capacity and resources in Redshift Serverless. The base RPU setting for Redshift Serverless ranges from 8 to 512. Wallapop tested their BI workload with two configurations: 32 base RPU and 64 base RPU, after enabling data sharing from their Redshift provisioned cluster to ensure the serverless endpoints have access to the necessary datasets.

Based on the results measured 1 week before testing, the main area for improvement was the queries that took longer than 10 seconds to complete (52%), represented by the yellow, orange, and red areas of the following chart, as well as the long-running queries represented by the red area (over 600 seconds, 9%).

The first test of this workload with Redshift Serverless using a 64 base RPU configuration immediately showed performance improvement results: the queries running longer than 10 seconds were reduced by 38% and the long-running queries (over 120 seconds) were almost completely eliminated.

Javier Carbajo, Wallapop Data Engineer, says, “Providing a service without downtime or loading times that are too long was one of our main requirements since we couldn’t have analysts or stakeholders without being able to consult the data.”

Following the first set of results, Wallapop also tested with a Redshift Serverless configuration using 32 base RPU to compare the results and select the configuration that could offer them the best price-performance for this workload. With this configuration, the results were similar to the previous test run on Redshift Serverless with 64 base RPU (still showing significant performance improvement from the original results). Based on the tests, this configuration was selected for the new architecture.

Gergely Kajtár, Wallapop Data Engineer, says, “We noticed a significant increase in the daily workflows’ stability after the change to the new Redshift architecture.”

Following this first workload, Wallapop has continued expanding their Amazon Redshift distributed architecture with CI/CD workloads running on a separate Redshift Serverless endpoint using data sharing with their Redshift provisioned (RA3) cluster.

“With the new Redshift architecture, we have noticed remarkable improvements both in speed and stability. That has translated into an increase of 2 times in analytical queries, not only by analysts and data scientists but from other roles as well such as marketing, engineering, C-level, etc. That proves that investing in a scalable architecture like Redshift Serverless has a direct consequence on accelerating the adoption of data as decision-making driver in the organization.”

– Nicolás Herrero, Wallapop Director of Data & Analytics.

Conclusion

In this post, we showed you how this platform can help Wallapop scale in the future by adding new consumers when new needs or applications require access to the data.

If you’re new to Amazon Redshift, you can explore demos, other customer stories, and the latest features at Amazon Redshift. If you’re already using Amazon Redshift, reach out to your AWS account team for support, and learn more about what’s new with Amazon Redshift.


About the Authors

Eduard Lopez is the Data Engineering Manager at Wallapop. He is a software engineer with over 6 years of experience in data engineering, machine learning, and data science.

Daniel Martinez is a Solutions Architect in Iberia Digital Native Businesses (DNB), part of the worldwide commercial sales organization (WWCS) at AWS.

Jordi Montoliu is a Sr. Redshift Specialist in EMEA, part of the worldwide specialist organization (WWSO) at AWS.

Ziad Wali is an Acceleration Lab Solutions Architect at Amazon Web Services. He has over 10 years of experience in databases and data warehousing, where he enjoys building reliable, scalable, and efficient solutions. Outside of work, he enjoys sports and spending time in nature.

Semir Naffati is a Sr. Redshift Specialist Solutions Architect in EMEA, part of the worldwide specialist organization (WWSO) at AWS.

Use private key JWT authentication between Amazon Cognito user pools and an OIDC IdP

Post Syndicated from Martin Pagel original https://aws.amazon.com/blogs/security/use-private-key-jwt-authentication-between-amazon-cognito-user-pools-and-an-oidc-idp/

With Amazon Cognito user pools, you can add user sign-up and sign-in features and control access to your web and mobile applications. You can enable your users who already have accounts with other identity providers (IdPs) to skip the sign-up step and sign in to your application by using an existing account through SAML 2.0 or OpenID Connect (OIDC). In this blog post, you will learn how to extend the authorization code grant between Cognito and an external OIDC IdP with private key JSON Web Token (JWT) client authentication.

For OIDC, Cognito uses the OAuth 2.0 authorization code grant flow as defined by the IETF in RFC 6749 Section 1.3.1. This flow can be broken down into two steps: user authentication and token request. When a user needs to authenticate through an external IdP, the Cognito user pool forwards the user to the IdP’s login endpoint. After successful authentication, the IdP sends back a response that includes an authorization code, which concludes the authentication step. The Cognito user pool now uses this code, together with a client secret for client authentication, to retrieve a JWT from the IdP. The JWT consists of an access token and an identity token. Cognito ingests that JWT, creates or updates the user in the user pool, and returns to the client a JWT that it has created for the client’s session. You can find a more detailed description of this flow in the Amazon Cognito documentation.

Although this flow sufficiently secures the requests between Cognito and the IdP for most customers, those in the public sector, healthcare, and finance sometimes need to integrate with IdPs that enforce additional security measures as part of their security requirements. In the past, this has come up in conversations at AWS when our customers needed to integrate Cognito with, for example, the HelseID (healthcare sector, Norway), login.gov (public sector, USA), or GOV.UK One Login (public sector, UK) IdPs. Customers who are using Okta, PingFederate, or similar IdPs and want additional security measures as part of their internal security requirements, might also find adding further security requirements desirable as part of their own policies.

The most common additional requirement is to replace the client secret with an assertion that consists of a private key JWT as a means of client authentication during token requests. This method is defined through a combination of RFC 7521 and RFC 7523. Instead of a symmetric key (the client secret), this method uses an asymmetric key-pair to sign a JWT with a private key. The IdP can then verify the token request by validating the signature of that JWT using the corresponding public key. This helps to eliminate the exposure of the client secret with every request, thereby reducing the risk of request forgery, depending on the quality of the key material that was used and how access to the private key is secured. Additionally, the JWT has an expiry time, which further constrains the risk of replay attacks to a narrow time window.

A Cognito user pool does not natively support private key JWT client authentication when integrating with an external IdP. However, you can still integrate Cognito user pools with IdPs that support or require private key JWT authentication by using Amazon API Gateway and AWS Lambda.

This blog post presents a high-level overview of how you can implement this solution. To learn more about the underlying code, how to configure the included services, and what the detailed request flow looks like, check out the Deploy a demo section later in this post. Keep in mind that this solution does not cover the request flow between your own application and a Cognito user pool, but only the communication between Cognito and the IdP.

Solution overview

Following the technical implementation details of the previously mentioned RFCs, the required request flow between a Cognito user pool and the external OIDC IdP can be broken down into five simplified steps, shown in Figure 1.

Figure 1: Simplified UML diagram of the target implementation for using a private key JWT during the authorization code grant

In this example, we’re using the Cognito user pool hosted UI—because it already provides OAuth 2.0-aligned IdP integration—and extending it with the private key JWT. Figure 1 illustrates the following steps:

  1. The hosted UI forwards the user client to the /authorize endpoint of the external OIDC IdP with an HTTP GET request.
  2. After the user successfully logs into the IdP, the IdP‘s response includes an authorization code.
  3. The hosted UI sends this code in an HTTP POST request to the IdP’s /token endpoint. By default, the hosted UI also adds a client secret for client authentication. To align with the private key JWT authentication method, you need to replace the client secret with a client assertion and specify the client assertion type, as highlighted in the diagram and further described later.
  4. The IdP validates the client assertion by using a pre-shared public key.
  5. The IdP issues the user’s JWT, which Cognito ingests to create or update the user in the user pool.

As mentioned earlier, token requests between a Cognito user pool and an external IdP do not natively support the required client assertion. However, you can redirect the token requests to, for example, an Amazon API Gateway, which invokes a Lambda function to extend the request with the new parameters. Because you need to sign the client assertion with a private key, you also need a secure location to store this key. For this, you can use AWS Secrets Manager, which helps you to secure the key from unauthorized use. With the required flow and additional services in mind, you can create the following architecture.

Figure 2: Architecture diagram with Amazon API Gateway and Lambda to process token requests between Cognito and the OIDC identity provider

Let’s have a closer look at the individual components and the request flow that are shown in Figure 2.

When adding an OIDC IdP to a Cognito user pool, you configure endpoints for Authorization, UserInfo, Jwks_uri, and Token. Because the private key is required only for the token request flow, you can configure resources to redirect and process requests, as follows (the step numbers correspond to the step numbering in Figure 2):

  1. Configure the endpoints for Authorization, UserInfo, and Jwks_Uri with the ones from the IdP.
  2. Create an API Gateway with a dedicated route for token requests (for example, /token) and add it as the Token endpoint in the IdP configuration in Cognito.
  3. Integrate this route with a Lambda function: When Cognito calls the API endpoint, it will automatically invoke the function.

Together with the original request parameters, which include the authorization code, this function does the following:

  1. Retrieves the private key from Secrets Manager.
  2. Creates and signs the client assertion.
  3. Makes the token request to the IdP token endpoint.
  4. Receives the response from the IdP.
  5. Returns the response to the Cognito IdP response endpoint.

The details of the function logic can be broken down into the following:

  • Decode the body of the original request—this includes the authorization code that was acquired during the authorize flow.
    import base64

    # The token request from Cognito arrives as a base64-encoded,
    # URL-encoded form body; decode it to recover the original parameters.
    encoded_message = event["body"]
    decoded_message = base64.b64decode(encoded_message)
    decoded_message = decoded_message.decode("utf-8")


  • Retrieve the private key from Secrets Manager by using the GetSecretValue API or SDK equivalent, or by using the AWS Parameters and Secrets Lambda Extension (a minimal sketch follows this list).
  • Create and sign the JWT.
    import time
    
    import jwt # third party library – requires a Lambda Layer
    
    # private_key_dict is the JWK retrieved from Secrets Manager in the previous step;
    # convert it to a key object the library can sign with
    signing_key = jwt.jwk_from_dict(private_key_dict)
    
    instance = jwt.JWT()
    private_key_jwt = instance.encode({
        "iss": <Issuer. Contains IdP client ID>,
        "sub": <Subject. Contains IdP client ID>,
        "aud": <Audience. IdP token endpoint>,
        "iat": int(time.time()),
        "exp": int(time.time()) + 300
    },
        signing_key,
        alg='RS256',
        optional_headers={"kid": private_key_dict["kid"]}
    )
    

  • Modify the original body and make the token request, including the original parameters for grant_type, code, and client_id, with added client_assertion_type and the client_assertion. (The following example HTTP request has line breaks and placeholders in angle brackets for better readability.)
    POST /token HTTP/1.1
      Scheme: https
      Host: your-idp.example.com
      Content-Type: application/x-www-form-urlencoded
    
      grant_type=authorization_code&
        code=<the authorization code>&
        client_id=<IdP client ID>&
        client_assertion_type=
        urn%3Aietf%3Aparams%3Aoauth%3Aclient-assertion-type%3Ajwt-bearer&
        client_assertion=<signed JWT>
    

  • Return the IdP’s response.

Note that there is no client secret needed in this request. Instead, you add a client assertion type as urn:ietf:params:oauth:client-assertion-type:jwt-bearer, and the client assertion with the signed JWT.
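
Putting the individual steps together, the following combined sketch shows what the remaining function logic might look like. The secret name and IdP host are assumptions, decoded_message is the body decoded earlier, and client_assertion corresponds to the signed JWT (private_key_jwt) from the previous step.

    import json
    
    import boto3
    import urllib3  # bundled with the AWS SDK in the Lambda Python runtime
    
    http = urllib3.PoolManager()
    secrets_client = boto3.client("secretsmanager")
    
    # Assumption: the JWK is stored as a JSON string in a secret with this name
    private_key_dict = json.loads(
        secrets_client.get_secret_value(SecretId="idp-private-key-jwk")["SecretString"]
    )
    
    def make_token_request(decoded_message: str, client_assertion: str) -> dict:
        # Append the assertion parameters to the body Cognito sent and call the IdP
        body = (
            decoded_message
            + "&client_assertion_type="
            + "urn%3Aietf%3Aparams%3Aoauth%3Aclient-assertion-type%3Ajwt-bearer"
            + "&client_assertion=" + client_assertion
        )
        response = http.request(
            "POST",
            "https://your-idp.example.com/token",  # assumption: your IdP token endpoint
            body=body,
            headers={"Content-Type": "application/x-www-form-urlencoded"},
        )
        # Return an API Gateway proxy response so Cognito receives the IdP's JSON unchanged
        return {
            "statusCode": response.status,
            "headers": {"Content-Type": "application/json"},
            "body": response.data.decode("utf-8"),
        }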

If the request is successful, the IdP’s response includes a JWT with the access token and identity token. On returning the response via the Lambda function, Cognito ingests the JWT and creates or updates the user in the user pool. It then responds to the original authorize request of the user client by sending its own authorization code, which can be exchanged for a Cognito issued JWT in your own application.

Deploy a demo

To deploy an example of this solution, see our GitHub repository. You will find the prerequisites and deployment steps there, as well as additional in-depth information.

Additional considerations

To further optimize this solution, you should consider checking the event details in the Lambda function before fully processing the requests. This way, you can, for example, check that all required parameters are present and valid. One option to do that is to define a client secret when you create the IdP integration for the user pool. When Cognito sends the token request, it adds the client secret in the encoded body, so you can retrieve it and validate its value. If the validation fails, requests can be dropped early to improve exception handling and to prevent invalid requests from causing unnecessary function charges.
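
A minimal sketch of such a check, assuming the request body is base64 encoded as in the earlier snippet and that the expected secret is available to the function (for example, from Secrets Manager), could look like the following:

    import base64
    import hmac
    from urllib.parse import parse_qs
    
    # Assumption: in practice, load the expected secret from a secure store such as Secrets Manager
    EXPECTED_CLIENT_SECRET = "<client secret configured for the IdP in the user pool>"
    
    def is_valid_token_request(event: dict) -> bool:
        # Decode the urlencoded body that Cognito sent to the token route
        body = base64.b64decode(event["body"]).decode("utf-8")
        params = parse_qs(body)
    
        # All of these parameters are expected in the token request from Cognito
        if not all(key in params for key in ("grant_type", "code", "client_id")):
            return False
    
        # Compare the client secret Cognito added to the body with the expected value
        provided_secret = params.get("client_secret", [""])[0]
        return hmac.compare_digest(provided_secret, EXPECTED_CLIENT_SECRET)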

In this example, we used Secrets Manager to store the private key. You can explore other alternatives, like AWS Systems Manager Parameter Store or AWS Key Management Service (AWS KMS). To retrieve the key from the Parameter Store, you can use the SDK or the AWS Parameter and Secrets Lambda Extension. With AWS KMS, you can both create and store the private key as well as derive a public key through the service’s APIs, and you can also use the signing API to sign the JWT in the Lambda function.
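
If you choose AWS KMS, a rough sketch of signing the client assertion with an asymmetric KMS key might look like the following; the key ARN and kid value are assumptions, and RS256 maps to the RSASSA-PKCS1-v1_5 with SHA-256 signing algorithm in KMS.

    import base64
    import json
    import time
    
    import boto3
    
    kms = boto3.client("kms")
    
    def b64url(data: bytes) -> str:
        # JWTs use unpadded base64url encoding
        return base64.urlsafe_b64encode(data).rstrip(b"=").decode()
    
    header = {"alg": "RS256", "typ": "JWT", "kid": "<key ID published in your JWKS>"}
    payload = {
        "iss": "<IdP client ID>",
        "sub": "<IdP client ID>",
        "aud": "<IdP token endpoint>",
        "iat": int(time.time()),
        "exp": int(time.time()) + 300,
    }
    
    signing_input = b64url(json.dumps(header).encode()) + "." + b64url(json.dumps(payload).encode())
    signature = kms.sign(
        KeyId="<ARN of your asymmetric KMS signing key>",
        Message=signing_input.encode(),
        MessageType="RAW",
        SigningAlgorithm="RSASSA_PKCS1_V1_5_SHA_256",
    )["Signature"]
    
    client_assertion = signing_input + "." + b64url(signature)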

Conclusion

By redirecting the IdP token endpoint in the Cognito user pool’s external OIDC IdP configuration to a route in an API Gateway, you can use Lambda functions to customize the request flow between Cognito and the IdP. In the example in this post, we showed how to change the client authentication mechanism during the token request from a client secret to a client assertion with a signed JWT (private key JWT). You can also apply the same proxy-like approach to customize the request flow even further—for example, by adding a Proof Key for Code Exchange (PKCE), for which you can find an example in the aws-samples GitHub repository.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on AWS re:Post for Amazon Cognito User Pools or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Martin Pagel

Martin is a Solutions Architect in Norway who specializes in security and compliance, with a focus on identity. Before joining AWS, he walked in the shoes of a systems administrator and engineer, designing and implementing the technical side of a wide variety of compliance controls at companies in automotive, fintech, and healthcare. Today, he helps our customers across the Nordic region to achieve their security and compliance objectives.

Simplifying data processing at Capitec with Amazon Redshift integration for Apache Spark

Post Syndicated from Preshen Goobiah original https://aws.amazon.com/blogs/big-data/simplifying-data-processing-at-capitec-with-amazon-redshift-integration-for-apache-spark/

This post is cowritten with Preshen Goobiah and Johan Olivier from Capitec.

Apache Spark is a widely used open source distributed processing system renowned for handling large-scale data workloads. It is frequently used by Spark developers working with Amazon EMR, Amazon SageMaker, AWS Glue, and custom Spark applications.

Amazon Redshift offers seamless integration with Apache Spark, allowing you to easily access your Redshift data on both Amazon Redshift provisioned clusters and Amazon Redshift Serverless. This integration expands the possibilities for AWS analytics and machine learning (ML) solutions, making the data warehouse accessible to a broader range of applications.

With the Amazon Redshift integration for Apache Spark, you can quickly get started and effortlessly develop Spark applications using popular languages like Java, Scala, Python, SQL, and R. Your applications can seamlessly read from and write to your Amazon Redshift data warehouse while maintaining optimal performance and transactional consistency. Additionally, you’ll benefit from performance improvements through pushdown optimizations, further enhancing the efficiency of your operations.

Capitec, South Africa’s biggest retail bank with over 21 million retail banking clients, aims to provide simple, affordable, and accessible financial services in order to help South Africans bank better so that they can live better. In this post, we discuss the successful integration of the open source Amazon Redshift connector by Capitec’s shared services Feature Platform team. As a result of utilizing the Amazon Redshift integration for Apache Spark, developer productivity increased by a factor of 10, feature generation pipelines were streamlined, and data duplication was reduced to zero.

The business opportunity

There are 19 predictive models in scope for utilizing 93 features built with AWS Glue across Capitec’s Retail Credit divisions. Feature records are enriched with facts and dimensions stored in Amazon Redshift. PySpark was selected to create features because it offers a fast, distributed, and scalable mechanism to wrangle data from diverse sources.

These production features play a crucial role in enabling real-time fixed-term loan applications, credit card applications, batch monthly credit behavior monitoring, and batch daily salary identification within the business.

The data sourcing problem

To ensure the reliability of PySpark data pipelines, it’s essential to have consistent record-level data from both dimensional and fact tables stored in the Enterprise Data Warehouse (EDW). These tables are then joined with tables from the Enterprise Data Lake (EDL) at runtime.

During feature development, data engineers require a seamless interface to the EDW. This interface allows them to access and integrate the necessary data from the EDW into the data pipelines, enabling efficient development and testing of features.

Previous solution process

In the previous solution, product team data engineers spent 30 minutes per run to manually expose Redshift data to Spark. The steps included the following:

  1. Construct a query with the required predicates in Python.
  2. Submit an UNLOAD query via the Amazon Redshift Data API.
  3. Catalog data in the AWS Glue Data Catalog via the AWS SDK for Pandas using sampling.

This approach posed issues for large datasets, required recurring maintenance from the platform team, and was complex to automate.
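
For context, a rough sketch of that manual path, submitting the UNLOAD through the Amazon Redshift Data API, might have looked like the following; the cluster, secret, role, and bucket names are placeholders.

import boto3

redshift_data = boto3.client("redshift-data")

# Submit the predicated query as an UNLOAD to Amazon S3
unload_sql = """
UNLOAD ('select PartyIdentifierNumber, ClientCreateDate from edw_core.dimclient where RowIsCurrent = 1')
TO 's3://<staging-bucket>/dimclient/'
IAM_ROLE '<redshift-unload-role-arn>'
FORMAT PARQUET
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="<redshift-cluster-id>",
    Database="<database>",
    SecretArn="<redshift-credentials-secret-arn>",
    Sql=unload_sql,
)

# The engineer would then poll describe_statement(Id=response["Id"]) until the UNLOAD
# completed, and catalog a sample of the unloaded files with the AWS SDK for pandas.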

Current solution overview

Capitec was able to resolve these problems with the Amazon Redshift integration for Apache Spark within feature generation pipelines. The architecture is defined in the following diagram.

The workflow includes the following steps:

  1. Internal libraries are installed into the AWS Glue PySpark job via AWS CodeArtifact.
  2. An AWS Glue job retrieves Redshift cluster credentials from AWS Secrets Manager and sets up the Amazon Redshift connection (injects cluster credentials, unload locations, file formats) via the shared internal library. The Amazon Redshift integration for Apache Spark also supports using AWS Identity and Access Management (IAM) to retrieve credentials and connect to Amazon Redshift.
  3. The Spark query is translated to an Amazon Redshift optimized query and submitted to the EDW. This is accomplished by the Amazon Redshift integration for Apache Spark.
  4. The EDW dataset is unloaded into a temporary prefix in an Amazon Simple Storage Service (Amazon S3) bucket.
  5. The EDW dataset from the S3 bucket is loaded into Spark executors via the Amazon Redshift integration for Apache Spark.
  6. The EDL dataset is loaded into Spark executors via the AWS Glue Data Catalog.

These components work together to ensure that data engineers and production data pipelines have the necessary tools to implement the Amazon Redshift integration for Apache Spark, run queries, and facilitate the unloading of data from Amazon Redshift to the EDL.

Using the Amazon Redshift integration for Apache Spark in AWS Glue 4.0

In this section, we demonstrate the utility of the Amazon Redshift integration for Apache Spark by enriching a loan application table residing in the S3 data lake with client information from the Redshift data warehouse in PySpark.

The dimclient table in Amazon Redshift contains the following columns:

  • ClientKey – INT8
  • ClientAltKey – VARCHAR(50)
  • PartyIdentifierNumber – VARCHAR(20)
  • ClientCreateDate – DATE
  • IsCancelled – INT2
  • RowIsCurrent – INT2

The loanapplication table in the AWS Glue Data Catalog contains the following columns:

  • RecordID – BIGINT
  • LogDate – TIMESTAMP
  • PartyIdentifierNumber – STRING

The Redshift table is read via the Amazon Redshift integration for Apache Spark and cached. See the following code:

import json

import boto3
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Retrieve the Redshift credentials stored in AWS Secrets Manager
secretsmanager_client = boto3.client('secretsmanager')
secret_manager_response = secretsmanager_client.get_secret_value(
    SecretId='<your Redshift credentials secret>'
)
# Assumes the secret stores a JSON object with username and password keys
secret = json.loads(secret_manager_response['SecretString'])
username = secret['username']
password = secret['password']
url = "jdbc:redshift://redshifthost:5439/database?user=" + username + "&password=" + password

# Connection options for the Amazon Redshift integration for Apache Spark
read_config = {
    "url": url,
    "tempdir": "s3://capitec-redshift-temp-bucket/<uuid>/",
    "unload_s3_format": "PARQUET"
}

# Read dimclient, keep only current, non-cancelled rows, project the required columns, and cache
d_client = (
    spark.read.format("io.github.spark_redshift_community.spark.redshift")
    .options(**read_config)
    .option("query", "select * from edw_core.dimclient")
    .load()
    .where((F.col("RowIsCurrent") == 1) & (F.col("IsCancelled") == 0))
    .select(
        F.col("PartyIdentifierNumber"),
        F.col("ClientCreateDate")
    )
    .cache()
)

Loan application records are read from the S3 data lake and enriched with information from the dimclient table in Amazon Redshift:

import pyspark.sql.functions as F
from awsglue.context import GlueContext
from pyspark import SparkContext

glue_ctx = GlueContext(SparkContext.getOrCreate())

push_down_predicate = (
    "meta_extract_start_utc_ms between "
    "'2023-07-12 18:00:00.000000' and "
    "'2023-07-13 06:00:00.000000'"
)

database_name = "loan_application_system"
table_name = "dbo_view_loan_applications"
catalog_id = "<AWS account ID that owns the Glue Data Catalog>"

# Selecting only the following columns
initial_select_cols=[
            "RecordID",
            "LogDate",
            "PartyIdentifierNumber"
        ]

d_controller = (glue_ctx.create_dynamic_frame.from_catalog(catalog_id=catalog_id,
                                            database=database_name,
                                            table_name=table_name,
                                            push_down_predicate=push_down_predicate)
                .toDF()
                .select(*initial_select_cols)
                .withColumn("LogDate", F.date_format("LogDate", "yyyy-MM-dd").cast("string"))
                .dropDuplicates())

# Left Join on PartyIdentifierNumber and enriching the loan application record
d_controller_enriched = d_controller.join(d_client, on=["PartyIdentifierNumber"], how="left").cache()

As a result, the loan application record (from the S3 data lake) is enriched with the ClientCreateDate column (from Amazon Redshift).
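
As a quick sanity check (the bucket name below is a placeholder), you might preview the enriched records and, if needed, persist them for downstream feature generation:

# Preview a few enriched loan application records
d_controller_enriched.select("RecordID", "PartyIdentifierNumber", "ClientCreateDate").show(5, truncate=False)

# Optionally persist the enriched dataset back to the data lake
d_controller_enriched.write.mode("overwrite").parquet("s3://<feature-bucket>/loan_applications_enriched/")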

How the Amazon Redshift integration for Apache Spark solves the data sourcing problem

The Amazon Redshift integration for Apache Spark effectively addresses the data sourcing problem through the following mechanisms:

  • Just-in-time reading – The Amazon Redshift integration for Apache Spark connector reads Redshift tables in a just-in-time manner, ensuring the consistency of data and schema. This is particularly valuable for Type 2 slowly changing dimension (SCD) and timespan accumulating snapshot facts. By combining these Redshift tables with the source system AWS Glue Data Catalog tables from the EDL within production PySpark pipelines, the connector enables seamless integration of data from multiple sources while maintaining data integrity.
  • Optimized Redshift queries – The Amazon Redshift integration for Apache Spark plays a crucial role in converting the Spark query plan into an optimized Redshift query. This conversion process simplifies the development experience for the product team by adhering to the data locality principle. The optimized queries use the capabilities and performance optimizations of Amazon Redshift, ensuring efficient data retrieval and processing from Amazon Redshift for the PySpark pipelines. This helps streamline the development process while enhancing the overall performance of the data sourcing operations.

Gaining the best performance

The Amazon Redshift integration for Apache Spark automatically applies predicate and query pushdown to optimize performance. You can gain performance improvements by using the default Parquet format used for unloading with this integration.

For additional details and code samples, refer to New – Amazon Redshift Integration with Apache Spark.

Solution benefits

The adoption of the integration yielded several significant benefits for the team:

  • Enhanced developer productivity – The PySpark interface provided by the integration boosted developer productivity by a factor of 10, enabling smoother interaction with Amazon Redshift.
  • Elimination of data duplication – Duplicate and AWS Glue cataloged Redshift tables in the data lake were eliminated, resulting in a more streamlined data environment.
  • Reduced EDW load – The integration facilitated selective data unloading, minimizing the load on the EDW by extracting only the necessary data.

By using the Amazon Redshift integration for Apache Spark, Capitec has paved the way for improved data processing, increased productivity, and a more efficient feature engineering ecosystem.

Conclusion

In this post, we discussed how the Capitec team successfully implemented the Amazon Redshift integration for Apache Spark to simplify their feature computation workflows. They emphasized the importance of utilizing decentralized and modular PySpark data pipelines for creating predictive model features.

Currently, the Amazon Redshift integration for Apache Spark is utilized by 7 production data pipelines and 20 development pipelines, showcasing its effectiveness within Capitec’s environment.

Moving forward, the shared services Feature Platform team at Capitec plans to expand the adoption of the Amazon Redshift integration for Apache Spark in different business areas, aiming to further enhance data processing capabilities and promote efficient feature engineering practices.

For additional information on using the Amazon Redshift integration for Apache Spark, refer to the following resources:


About the Authors

Preshen Goobiah is the Lead Machine Learning Engineer for the Feature Platform at Capitec. He is focused on designing and building Feature Store components for enterprise use. In his spare time, he enjoys reading and traveling.

Johan Olivier is a Senior Machine Learning Engineer for Capitec’s Model Platform. He is an entrepreneur and problem-solving enthusiast. He enjoys music and socializing in his spare time.

Sudipta Bagchi is a Senior Specialist Solutions Architect at Amazon Web Services. He has over 12 years of experience in data and analytics, and helps customers design and build scalable and high-performant analytics solutions. Outside of work, he loves running, traveling, and playing cricket. Connect with him on LinkedIn.

Syed Humair is a Senior Analytics Specialist Solutions Architect at Amazon Web Services (AWS). He has over 17 years of experience in enterprise architecture focusing on Data and AI/ML, helping AWS customers globally to address their business and technical requirements. You can connect with him on LinkedIn.

Vuyisa Maswana is a Senior Solutions Architect at AWS, based in Cape Town. Vuyisa has a strong focus on helping customers build technical solutions to solve business problems. He has supported Capitec in their AWS journey since 2019.

Create a modern data platform using the Data Build Tool (dbt) in the AWS Cloud

Post Syndicated from Prantik Gachhayat original https://aws.amazon.com/blogs/big-data/create-a-modern-data-platform-using-the-data-build-tool-dbt-in-the-aws-cloud/

Building a data platform involves various approaches, each with its unique blend of complexities and solutions. A modern data platform entails maintaining data across multiple layers, targeting diverse platform capabilities like high performance, ease of development, cost-effectiveness, and DataOps features such as CI/CD, lineage, and unit testing. In this post, we delve into a case study for a retail use case, exploring how the Data Build Tool (dbt) was used effectively within an AWS environment to build a high-performing, efficient, and modern data platform.

dbt is an open-source command line tool that enables data analysts and engineers to transform data in their warehouses more effectively. It does this by helping teams handle the T in ETL (extract, transform, and load) processes. It allows users to write data transformation code, run it, and test the output, all within the framework it provides. dbt enables you to write SQL select statements, and then it manages turning these select statements into tables or views in Amazon Redshift.

Use case

The Enterprise Data Analytics group of a large jewelry retailer embarked on their cloud journey with AWS in 2021. As part of their cloud modernization initiative, they sought to migrate and modernize their legacy data platform. The aim was to bolster their analytical capabilities and improve data accessibility while ensuring a quick time to market and high data quality, all with low total cost of ownership (TCO) and no need for additional tools or licenses.

dbt emerged as the perfect choice for this transformation within their existing AWS environment. This popular open-source tool for data warehouse transformations won out over other ETL tools for several reasons. dbt’s SQL-based framework made it straightforward to learn and allowed the existing development team to scale up quickly. The tool also offered desirable out-of-the-box features like data lineage, documentation, and unit testing. A crucial advantage of dbt over stored procedures was the separation of code from data—unlike stored procedures, dbt doesn’t store the code in the database itself. This separation further simplifies data management and enhances the system’s overall performance.

Let’s explore the architecture and learn how to build this use case using AWS Cloud services.

Solution overview

The following architecture demonstrates the data pipeline built on dbt to manage the Redshift data warehouse ETL process.

        Figure 1: Modern data platform using AWS Data Services and dbt

This architecture consists of the following key services and tools:

  • Amazon Redshift was utilized as the data warehouse for the data platform, storing and processing vast amounts of structured and semi-structured data
  • Amazon QuickSight served as the business intelligence (BI) tool, allowing the business team to create analytical reports and dashboards for various business insights
  • AWS Database Migration Service (AWS DMS) was employed to perform change data capture (CDC) replication from various source transactional databases
  • AWS Glue was put to work, loading files from the SFTP location to the Amazon Simple Storage Service (Amazon S3) landing bucket and subsequently to the Redshift landing schema
  • AWS Lambda functioned as a client program, calling third-party APIs and loading the data into Redshift tables
  • AWS Fargate, a serverless container management service, was used to deploy the consumer application for source queues and topics
  • Amazon Managed Workflows for Apache Airflow (Amazon MWAA) was used to orchestrate different tasks of dbt pipelines
  • dbt, an open-source tool, was employed to write SQL-based data pipelines for data stored in Amazon Redshift, facilitating complex transformations and enhancing data modeling capabilities

Let’s take a closer look at each component and how they interact in the overall architecture to transform raw data into insightful information.

Data sources

As part of this data platform, we are ingesting data from diverse and varied data sources, including:

  • Transactional databases – These are active databases that store real-time data from various applications. The data typically encompasses all transactions and operations that the business engages in.
  • Queues and topics – Queues and topics come from various integration applications that generate data in real time. They represent an instantaneous stream of information that can be used for real-time analytics and decision-making.
  • Third-party APIs – These provide analytics and survey data related to ecommerce websites. This could include details like traffic metrics, user behavior, conversion rates, customer feedback, and more.
  • Flat files – Other systems supply data in the form of flat files of different formats. These files, stored in an SFTP location, might contain records, reports, logs, or other kinds of raw data that can be further processed and analyzed.

Data ingestion

Data from various sources are grouped into two major categories: real-time ingestion and batch ingestion.

Real-time ingestion uses the following services:

  • AWS DMS – AWS DMS is used to create CDC replication pipelines from OLTP (Online Transaction Processing) databases. The data is loaded into Amazon Redshift in near-real time to ensure that the most recent information is available for analysis. You can also use Amazon Aurora zero-ETL integration with Amazon Redshift to ingest data directly from OLTP databases to Amazon Redshift.
  • Fargate – Fargate is used to deploy Java consumer applications that ingest data from source topics and queues in real time. This real-time data consumption can help the business make immediate and data-informed decisions. You can also use Amazon Redshift Streaming Ingestion to ingest data from streaming engines like Amazon Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka (Amazon MSK) into Amazon Redshift.

Batch ingestion uses the following services:

  • Lambda – Lambda is used as a client for calling third-party APIs and loading the resultant data into Redshift tables. This process has been scheduled to run daily, ensuring a consistent batch of fresh data for analysis.
  • AWS Glue – AWS Glue is used to load files into Amazon Redshift through the S3 data lake. You can also use features like auto-copy from Amazon S3 (feature under preview) to ingest data from Amazon S3 to Amazon Redshift. However, the focus of this post is more on data processing within Amazon Redshift, rather than on the data loading process.

Data ingestion, whether real time or batch, forms the basis of any effective data analysis, enabling organizations to gather information from diverse sources and use it for insightful decision-making.

Data warehousing using Amazon Redshift

In Amazon Redshift, we’ve established three schemas, each serving as a different layer in the data architecture:

  • Landing layer – This is where all data ingested by our services initially lands. It’s raw, unprocessed data straight from the source.
  • Certified dataset (CDS) layer – This is the next stage, where data from the landing layer undergoes cleaning, normalization, and aggregation. The cleansed and processed data is stored in this certified dataset schema. It serves as a reliable, organized source for downstream data analysis.
  • User-friendly data mart (UFDM) layer – This final layer uses data from the CDS layer to create data mart tables. These are specifically tailored to support BI reports and dashboards as per the business requirements. The goal of this layer is to present the data in a way that is most useful and accessible for end-users.

This layered approach to data management allows for efficient and organized data processing, leading to more accurate and meaningful insights.

Data pipeline

dbt, an open-source tool, can be installed in the AWS environment and set up to work with Amazon MWAA. We store our code in an S3 bucket and orchestrate it using Airflow’s Directed Acyclic Graphs (DAGs). This setup facilitates our data transformation processes in Amazon Redshift after the data is ingested into the landing schema.

To maintain modularity and handle specific domains, we create individual dbt projects. The nature of the data reporting—real-time or batch—affects how we define our dbt materialization. For real-time reporting, we define materialization as a view, loading data into the landing schema using AWS DMS from database updates or from topic or queue consumers. For batch pipelines, we define materialization as a table, allowing data to be loaded from various types of sources.

In some instances, we have had to build data pipelines that extend from the source system all the way to the UFDM layer. This can be accomplished using Airflow DAGs, which we discuss further in the next section.

To wrap up, it’s worth mentioning that we deploy a dbt webpage using a Lambda function and enable a URL for this function. This webpage serves as a hub for documentation and data lineage, further bolstering the transparency and understanding of our data processes.
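
As an illustration, a minimal handler for such a function could look like the following sketch. The bucket and key are assumptions that correspond to the location where the DAG shown later in this post copies the generated target/ directory.

    import boto3
    
    s3 = boto3.client("s3")
    
    DOCS_BUCKET = "<your_S3_bucket_name>"
    DOCS_KEY = "airflow_home/dags/dbt_project_home/<your_project_name>/target/index.html"
    
    def lambda_handler(event, context):
        # Serve the dbt-generated documentation page through the Lambda function URL
        obj = s3.get_object(Bucket=DOCS_BUCKET, Key=DOCS_KEY)
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "text/html"},
            "body": obj["Body"].read().decode("utf-8"),
        }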

ETL job orchestration

In our data pipeline, we follow these steps for job orchestration:

  1. Establish a new Amazon MWAA environment. This environment serves as the central hub for orchestrating our data pipelines.
  2. Install dbt in the new Airflow environment by adding the following dependency to your requirements.txt:
    boto3>=1.17.54
    botocore>=1.20.54
    dbt-redshift>=1.3.0
    dbt-postgres>=1.3.0

  3. Develop DAGs with specific tasks that call upon dbt commands to carry out the necessary transformations. This step involves structuring our workflows in a way that captures dependencies among tasks and ensures that tasks run in the correct order. The following code shows how to define the tasks in the DAG:
    #imports..
    ...
    
    #Define the begin_exec tasks
    start = DummyOperator(
        task_id='begin_exec',
        dag=dag 
    )
    
    #Define 'verify_dbt_install' task to check if dbt was installed properly
    verify = BashOperator(
        task_id='verify_dbt_install',
        dag=dag,
        bash_command='''
            echo "checking dbt version....";             
            /usr/local/airflow/.local/bin/dbt --version;
            if [ $? -gt 0 ]; then
                pip install "dbt-redshift>=1.3.0";
            else
                echo "dbt already installed";
            fi
            python --version;
            echo "listing dbt...";      
            rm -r /tmp/dbt_project_home;
            cp -R /usr/local/airflow/dags/dbt_project_home /tmp;
            ls /tmp/dbt_project_home/<your_dbt_project_name>;
        '''
    )
    
    #Define ‘landing_to_cds_task’ task to copy from landing schema to cds schema
    landing_to_cds_task = BashOperator(
        task_id='landing_to_cds_task', 
        dag = dag,
        bash_command='''        
            /usr/local/airflow/.local/bin/dbt run --project-dir /tmp/dbt_project_home/<your_dbt_project_name> --profiles-dir /tmp/dbt_project_home/ --select <model_folder_name>.*;
        '''
    )
    
    ...
    #Define data quality check task to test a package, generate docs and copy the docs to required S3 location
    data_quality_check = BashOperator(
        task_id='data_quality_check',
        dag=dag,
        bash_command='''    
            /usr/local/airflow/.local/bin/dbt test --select your_package.* --project-dir /tmp/dbt_project_home/<your_project_name> --profiles-dir /tmp/dbt_project_home/;
            /usr/local/airflow/.local/bin/dbt docs generate --project-dir /tmp/dbt_project_home/<your_project_name> --profiles-dir /tmp/dbt_project_home/;        
            aws s3 cp /tmp/dbt_project_home/<your_project_name>/target/ s3://<your_S3_bucket_name>/airflow_home/dags/dbt_project_home/<your_project_name>/target --recursive;
        '''
    )

  4. Create DAGs that solely focus on dbt transformation. These DAGs handle the transformation process within our data pipelines, harnessing the power of dbt to convert raw data into valuable insights.
    #This is how we define the flow 
    start >> verify >> landing_to_cds_task >> cds_to_ufdm_task >> data_quality_check >> end_exec

The following image shows how this workflow would be seen on the Airflow UI.

  5. Create DAGs with AWS Glue for ingestion. These DAGs use AWS Glue for data ingestion tasks. AWS Glue is a fully managed ETL service that makes it easy to prepare and load data for analysis. We create DAGs that orchestrate AWS Glue jobs for extracting data from various sources, transforming it, and loading it into our data warehouse.
    #Import boto3 to interact with AWS Glue
    import boto3

    #Create boto3 client for Glue
    glue_client = boto3.client('glue', region_name='us-east-1')

    #Define callback function to start the Glue job using boto3 client
    def run_glue_ingestion_job():
        glue_client.start_job_run(JobName='glue_ingestion_job')

    #Define the task for the Glue ingestion job
    glue_task_for_source_to_landing = PythonOperator(
        task_id='glue_task_for_source_to_landing',
        python_callable=run_glue_ingestion_job,
        dag=dag
    )

    #This is how we define the flow
    start >> verify >> glue_task_for_source_to_landing >> landing_to_cds_task >> cds_to_ufdm_task >> data_quality_check >> end_exec
    

The following image shows how this workflow would be seen on the Airflow UI.

  6. Create DAGs with Lambda for ingestion. Lambda lets us run code without provisioning or managing servers. These DAGs use Lambda functions to call third-party APIs and load data into our Redshift tables, which can be scheduled to run at certain intervals or in response to specific events.
    #Import boto3 to interact with AWS Lambda
    import boto3

    #Create boto3 client for Lambda
    lambda_client = boto3.client('lambda')

    #Define callback function to invoke the Lambda function using boto3 client
    def run_lambda_ingestion_job():
        lambda_client.invoke(FunctionName='<function_arn>')

    #Define the task for the Lambda ingestion job
    lambda_task_for_api_to_landing = PythonOperator(
        task_id='lambda_task_for_api_to_landing',
        python_callable=run_lambda_ingestion_job,
        dag=dag
    )

The following image shows how this workflow would be seen on the Airflow UI.

We now have a comprehensive, well-orchestrated process that uses a variety of AWS services to handle different stages of our data pipeline, from ingestion to transformation.

Conclusion

The combination of AWS services and the dbt open-source project provides a powerful, flexible, and scalable solution for building modern data platforms. It’s a perfect blend of manageability and functionality, with its easy-to-use, SQL-based framework and features like data quality checks, configurable load types, and detailed documentation and lineage. Its principles of “code separate from data” and reusability make it a convenient and efficient tool for a wide range of users. This practical use case of building a data platform for a retail organization demonstrates the immense potential of AWS and dbt for transforming data management and analytics, paving the way for faster insights and informed business decisions.

For more information about using dbt with Amazon Redshift, see Manage data transformations with dbt in Amazon Redshift.


About the Authors

Prantik Gachhayat is an Enterprise Architect at Infosys with experience in various technology fields and business domains. He has a proven track record of helping large enterprises modernize digital platforms and deliver complex transformation programs. Prantik specializes in architecting modern data and analytics platforms in AWS. Prantik loves exploring new tech trends and enjoys cooking.

Ashutosh Dubey is a Senior Partner Solutions Architect and Global Tech leader at Amazon Web Services based out of New Jersey, USA. He has extensive experience in the data and analytics and AI/ML fields, including generative AI, has contributed to the community by writing various technical content, and has helped Fortune 500 companies in their cloud journey to AWS.

Implement fine-grained access control in Amazon SageMaker Studio and Amazon EMR using Apache Ranger and Microsoft Active Directory

Post Syndicated from Rahul Sarda original https://aws.amazon.com/blogs/big-data/implement-fine-grained-access-control-in-amazon-sagemaker-studio-and-amazon-emr-using-apache-ranger-and-microsoft-active-directory/

Amazon SageMaker Studio is a fully integrated development environment (IDE) for machine learning (ML) that enables data scientists and developers to perform every step of the ML workflow, from preparing data to building, training, tuning, and deploying models. SageMaker Studio comes with built-in integration with Amazon EMR, enabling data scientists to interactively prepare data at petabyte scale using frameworks such as Apache Spark, Hive, and Presto right from SageMaker Studio notebooks. With Amazon SageMaker, developers, data scientists, and SageMaker Studio users can access both raw data stored in Amazon Simple Storage Service (Amazon S3), and cataloged tabular data stored in a Hive metastore easily. SageMaker Studio’s support for Apache Ranger creates a simple mechanism for applying fine-grained access control to the raw and cataloged data with grant and revoke policies administered from a friendly web interface.

In this post, we show how you can authenticate into SageMaker Studio using an existing Active Directory (AD), with authorized access to both Amazon S3 and Hive cataloged data using AD entitlements via Apache Ranger integration and AWS IAM Identity Center (successor to AWS Single Sign-On). With this solution, you can manage access to multiple SageMaker environments and SageMaker Studio notebooks using a single set of credentials. Subsequently, Apache Spark jobs created from SageMaker Studio notebooks will access only the data and resources permitted by Apache Ranger policies attached to the AD credentials, inclusive of table and column-level access.

With this capability, multiple SageMaker Studio users can connect to the same EMR cluster, gaining access only to data granted to their user or group, with audit records captured and visible in Amazon CloudWatch. This multi-tenant environment is possible through user session isolation that prevents users from accessing datasets and cluster resources allocated to other users. Ultimately, organizations can provision fewer clusters, reduce administrative overhead, and increase cluster utilization, saving staff time and cloud costs.

Solution overview

We demonstrate this solution with an end-to-end use case using a sample ecommerce dataset. The dataset is available within provided AWS CloudFormation templates and consists of transactional ecommerce data (products, orders, customers) cataloged in a Hive metastore.

The solution utilizes two data analyst personas, Alex and Tina, each tasked with different analysis requiring fine-grained limitations on dataset access:

  • Tina, a data scientist on the marketing team, is tasked with building a model for customer lifetime value. Data access should only be permitted to non-sensitive customer, product, and orders data.
  • Alex, a data scientist on the sales team, is tasked with generating a product demand forecast, requiring access to product and orders data. No customer data is required.

The following figure illustrates our desired fine-grained access.

The following diagram illustrates the solution architecture.

The architecture is implemented as follows:

  • Microsoft Active Directory – Used to manage user authentication, select AWS application access, and user and group membership for Apache Ranger secured data authorization
  • Apache Ranger – Used to monitor and manage comprehensive data security across the Hadoop and Amazon EMR platform
  • Amazon EMR – Used to retrieve, prepare, and analyze data from the Hive metastore using Spark
  • SageMaker Studio – An integrated IDE with purpose-built tools to build AI/ML models.

The following sections walk through the setup of the architectural components for this solution using the CloudFormation stack.

Prerequisites

Before you get started, make sure you have the following prerequisites:

Create resources with AWS CloudFormation

To build the solution within your environment, use the provided CloudFormation templates to create the required AWS resources.

Note that running these CloudFormation templates and the following configuration steps will create AWS resources that may incur charges. Additionally, all the steps should be run in the same Region.

Template 1

This first template creates the following resources and takes approximately 15 minutes to complete:

  • A Multi-AZ, multi-subnet VPC infrastructure, with managed NAT gateways in the public subnet for each Availability Zone
  • S3 VPC endpoints and Elastic Network Interfaces
  • A Windows Active Directory domain controller using Amazon Elastic Compute Cloud (Amazon EC2) with cross-realm trust
  • A Linux Bastion host (Amazon EC2) in an auto scaling group

To deploy this template, complete the following steps:

  1. Sign in to the AWS Management Console.
  2. On the Amazon EC2 console, create an EC2 key pair.
  3. Choose Launch Stack.
  4. Select the target Region.
  5. Verify the stack name and provide the following parameters:
    1. The name of the key pair you created.
    2. Passwords for cross-realm trust, the Windows domain admin, LDAP bind, and default AD user. Be sure to record these passwords to use in future steps.
    3. Select a minimum of three Availability Zones based on the chosen Region.
  6. Review the remaining parameters. No changes are required for the solution, but you may change parameter values if desired.
  7. Choose Next and then choose Next again.
  8. Review the parameters.
  9. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names and I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND.
  10. Choose Submit.
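
If you prefer to script the deployment instead of clicking through the console, a minimal boto3 sketch is shown below; the stack name, template URL, and parameter key are assumptions and must match the template you downloaded.

    import boto3
    
    cloudformation = boto3.client("cloudformation")
    
    cloudformation.create_stack(
        StackName="ranger-blog-template1",                             # assumption
        TemplateURL="https://<your-bucket>.s3.amazonaws.com/template1.yaml",  # assumption
        Parameters=[
            {"ParameterKey": "KeyPairName", "ParameterValue": "<your EC2 key pair>"},  # assumption
        ],
        # Mirrors the acknowledgements selected in the console
        Capabilities=["CAPABILITY_NAMED_IAM", "CAPABILITY_AUTO_EXPAND"],
    )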

Template 2

The second template creates the following resources and takes approximately 30–60 minutes to complete:

To deploy this template, complete the following steps:

  1. Choose Launch Stack.
  2. Select the target Region.
  3. Verify the stack name and provide the following parameters:
    1. Key pair name (created earlier).
    2. LDAPHostPrivateIP address, which can be found in the output section of the Windows AD CloudFormation stack.
    3. Passwords for the Windows domain admin, cross-realm trust, AD domain user, and LDAP bind. Use the same passwords as you did for the first CloudFormation template.
    4. Passwords for the RDS for MySQL database and KDC admin. Record these passwords; they may be needed in future steps.
    5. Log directory for the EMR cluster.
    6. VPC (it contains the name of the CloudFormation stack)
    7. Subnet details (align the subnet name with the parameter name).
    8. Set AppsEMR to Hadoop, Spark, Hive, Livy, Hue, and Trino.
    9. Leave RangerAdminPassword as is.
  4. Review the remaining parameters. No changes are required beyond what is mentioned, but you may change parameter values if desired.
  5. Choose Next, then choose Next again.
  6. Review the parameters.
  7. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names and I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND.
  8. Choose Submit.

Integrate Active Directory with AWS accounts using IAM Identity Center

To enable users to sign in to SageMaker with Active Directory credentials, a connection between IAM Identity Center and Active Directory must be established.

To connect to Microsoft Active Directory, we set up AWS Directory Service using AD Connector.

  1. On the Directory Service console, choose Directories in the navigation pane.
  2. Choose Set up directory.
  3. For Directory types, select AD Connector.
  4. Choose Next.
  5. For Directory size, select the appropriate size for AD Connector. For this post, we select Small.
  6. Choose Next.
  7. Choose the VPC and private subnets where the Windows AD domain controller resides.
  8. Choose Next.
  9. In the Active Directory information section, enter the following details (this information can be retrieved on the Outputs tab of the first CloudFormation template):
    1. For Directory DNS Name, enter awsemr.com.
    2. For Directory NetBIOS name, enter awsemr.
    3. For DNS IP addresses, enter the IPv4 private IP address from AD Controller.
    4. Enter the service account user name and password that you provided during stack creation.
  10. Choose Next.
  11. Review the settings and choose Create directory.

After the directory is created, you will see its status as Active on the Directory Services console.

Set up AWS Organizations

IAM Identity Center requires AWS Organizations and can be enabled in only one Region at a time. To enable IAM Identity Center in this Region, you must first delete the IAM Identity Center configuration if it was created in another Region. Do not delete an existing IAM Identity Center configuration unless you are sure it will not negatively impact existing workloads.

  1. Navigate to the IAM Identity Center console.
  2. If IAM Identity Center has not been activated previously, choose Enable. If an organization does not exist, an alert appears to create one.
  3. Choose Create AWS organization.
  4. Choose Settings in the navigation pane.
  5. On the Identity source tab, on the Actions menu, choose Change identity source.
  6. For Choose identity source, select Active Directory.
  7. Choose Next.
  8. For Existing Directories, choose AWSEMR.COM.
  9. Choose Next.
  10. To confirm the change, enter ACCEPT in the confirmation input box, then choose Change identity source. Upon completion, you will be redirected to Settings, where you receive the alert Configurable AD sync paused.
  11. Choose Resume sync.
  12. Choose Settings in the navigation pane.
  13. On the Identity source tab, on the Actions menu, choose Manage sync.
  14. Choose Add users and groups to specify the users and groups to sync from Active Directory to IAM Identity Center.
  15. On the Users tab, enter tina and choose Add.
  16. Enter alex and choose Add.
  17. Choose Submit.
  18. On the Groups tab, enter datascience and choose Add.
  19. Choose Submit.

After your users and groups are synced to IAM Identity Center, you can see them by choosing Users or Groups in the navigation pane on the IAM Identity Center console. When they’re available, you can assign them access to AWS accounts and cloud applications. The initial sync may take up to 5 minutes.
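
If you prefer to confirm the sync programmatically, a small sketch using the Identity Store API could look like the following; the identity store ID is an assumption that you can copy from the IAM Identity Center Settings page.

    import boto3
    
    identitystore = boto3.client("identitystore")
    
    IDENTITY_STORE_ID = "d-xxxxxxxxxx"  # assumption: copy this from IAM Identity Center settings
    
    response = identitystore.list_users(IdentityStoreId=IDENTITY_STORE_ID)
    print([user["UserName"] for user in response["Users"]])  # should include tina and alex once synced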

Set up a SageMaker domain using IAM Identity Center

To set up a SageMaker domain, complete the following steps:

  1. On the SageMaker console, choose Domains in the navigation pane.
  2. Choose Create domain.
  3. Choose Standard setup, then choose Configure.
  4. For Domain Name, enter a unique name for your domain.
  5. For Authentication, choose AWS IAM Identity Center.
  6. Choose Create a new role for the default execution role.
  7. In the Create an IAM Role popup, choose Any S3 bucket.
  8. Choose Create role.
  9. Copy the role details to be used in next section for adding a policy for EMR cluster access.
  10. In the Network and storage section, specify the following:
    1. Choose the VPC that you created using the first CloudFormation template.
    2. Choose a private subnet in an Availability Zone supported by SageMaker.
    3. Use the default security group (sg-XXXX).
    4. Choose VPC only.

Note that there is a public domain called AWSEMR.COM that will conflict with the one created for this solution if Public internet only is selected.

  1. Leave all other options as default and choose Next.
  2. In the Studio settings section, accept the defaults and choose Next.
  3. In the RStudio settings section, accept the defaults and choose Next.
  4. In the Canvas setting section, accept the defaults and choose Submit.

Add a policy to provide SageMaker Studio access to the EMR cluster

Complete the following steps to give SageMaker Studio access to the EMR cluster:

  1. On the IAM console, choose Roles in the navigation pane.
  2. Search for and choose the role you copied earlier (<AmazonSageMaker-ExecutionRole-XXXXXXXXXXXXXXX>).
  3. On the Permissions tab, choose Add permissions and Attach policy.
  4. Search for and choose the policy AmazonEMRFullAccessPolicy_v2.
  5. Choose Add permissions.
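
Equivalently, you could attach the managed policy from code; a boto3 sketch is shown below, where the role name is an assumption that must match your Studio execution role.

    import boto3
    
    iam = boto3.client("iam")
    
    iam.attach_role_policy(
        RoleName="AmazonSageMaker-ExecutionRole-XXXXXXXXXXXXXXX",  # assumption: your Studio execution role
        PolicyArn="arn:aws:iam::aws:policy/AmazonEMRFullAccessPolicy_v2",
    )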

Add users and groups to access the domain

Complete the following steps to give users and groups access to the domain:

  1. On the SageMaker console, choose Domains in the navigation pane.
  2. Choose the domain you created earlier.
  3. On the Domain details page, choose Assign users and groups.
  4. On the Users tab, select the users tina and alex.
  5. On the Groups tab, select the group datascience.
  6. Choose Assign users and groups.

Configure Spark data access rights in Apache Ranger

Now that the AWS environment is set up, we configure Hive dataset security using Apache Ranger.

To start, collect the Apache Ranger URL details to access the Ranger admin console:

  1. On the Amazon EC2 console, choose Resources in the navigation pane, then Instance (running).
  2. Choose the Ranger server EC2 instance and copy the private IP DNS name (IPv4 only).
    Next, connect to the Windows domain controller to use the connected VPC to access the Ranger admin console. This is done by logging in to the Windows server and launching a web browser.
  3. Install the Remote Desktop Services client on your computer to connect with Windows Server.
  4. Authorize inbound traffic from your computer to the Windows AD domain controller EC2 instance.
  5. On the Amazon EC2 console, choose Resources in the navigation pane, then Instance (running).
  6. Choose the Windows Domain Controller (DC1) EC2 instance ID and copy the public IP DNS name (IPv4 only).
  7. Use Microsoft Remote Desktop to log in to the Windows domain controller:
    1. Computer – Use the public IP DNS name (IPv4 only).
    2. Username – Enter awsadmin.
    3. Password – Use the password you set during the first CloudFormation template setup.
  8. Disable the Enhanced Security Configuration for Internet Explorer.
  9. Launch Internet Explorer and navigate to the Ranger admin console using the private IP DNS name (IPv4 only) associated with the Ranger server noted earlier and port 6182 (for example, https://<RangerServer Private IP DNS name>:6182).
  10. Choose Continue to this website (not recommended) if you receive a security alert.
  11. Log in using the default user name and password. During the first logon, you should modify your password and store it securely.
  12. In the top Ranger banner, choose Settings and Users/Groups/Roles.
  13. Confirm Tina and Alex are listed as users with a User Source of External.
  14. Confirm the datascience group is listed as a group with Group Source of External.

If the Tina or Alex users aren’t listed, follow the Apache Ranger troubleshooting instructions in the appendix at the end of this post.

Dataset policies

The Apache Ranger access policy model consists of two major components: specification of the resources a policy is applied to, such as files and directories, databases, tables, and columns, services, and so on, and the specification of access conditions (permissions) for specific users and groups.

Configure your dataset policy with the following steps:

  1. On the Ranger admin console, choose the Ranger icon in the top banner to return to the main page.
  2. Choose the service name amazonemrspark inside AMAZON-EMR-SPARK.
  3. Choose Add New Policy and add a new policy with the following parameters:
    1. For Policy Name, enter Data Science Policy.
    2. For Database, enter staging and default.
    3. For EMR Spark Table, enter products and orders.
    4. For EMR Spark Column, enter *.
    5. In the Allow Conditions section, for Select User, enter tina and alex, and for Permissions, enter select and read.
  4. Choose Add.
    When using Internet Explorer and adding a new policy, you may receive the error SCRIPT438: Object doesn't support property or method 'assign'. In this case, install and use an alternate browser such as Firefox or Chrome.
  5. Choose Add New Policy and add a new policy for tina:
    1. For Policy Name, enter Customer Demographics Policy.
    2. For Database, enter staging.
    3. For EMR Spark Table, enter Customers.
    4. For EMR Spark Column, choose customer_id, first_name, last_name, region, and state.
    5. In the Allow Conditions section, for Select User, enter tina, and for Permissions, enter select and read.
  6. Choose Add.

Configure Amazon S3 data access rights in Apache Ranger

Complete the following steps to configure Amazon S3 data access rights:

  1. On the Ranger admin console, choose the Ranger icon in the top banner to return to the main page.
  2. Choose the service name amazonemrs3 inside AMAZON-EMR-EMRFS.
  3. Choose Add New Policy and add a policy for the datascience group as follows:
    1. For Policy Name, enter Data Science S3 Policy.
    2. For S3 resource, enter the following:
      • aws-bigdata-blog/artifacts/aws-blog-emr-ranger/data/staging/products
      • aws-bigdata-blog/artifacts/aws-blog-emr-ranger/data/staging/orders
    3. In the Allow Conditions section, for Select User, enter tina and alex, and for Permissions, enter GetObject and ListObjects.
  4. Choose Add.
  5. Choose Add New Policy and add a new policy for tina:
    1. For Policy Name, enter Customer Demographics S3 Policy.
    2. For S3 resource, enter aws-bigdata-blog/artifacts/aws-blog-emr-ranger/data/staging/customers.
    3. In the Allow Conditions section, for Select User, enter tina, and for Permissions, enter GetObject and ListObjects.
  6. Choose Add.

Configure Amazon S3 user working folders

While working with data, users often require data storage for interim results. To provide each user with a private working directory, complete the following steps:

  1. On the Ranger admin console, choose Ranger icon in the top banner to return to the main page.
  2. Choose the service name amazonemrs3 inside AMAZON-EMR-EMRFS.
  3. Choose Add New Policy and add a policy for {USER} as follows:
    1. For Policy Name, enter User Directory S3 Policy.
    2. For S3 resource, enter <Bucket Name>/data/{USER} (use a bucket within the account).
    3. Enable Recursive.
    4. In the Allow Conditions section, for Select User, enter {USER} and for Permissions, enter GetObject, ListObjects, PutObject, and DeleteObject.
  4. Choose Add.

Use the user access login URL

Users attempting to access shared AWS applications via IAM Identity Center need to first log in to the AWS environment with a custom link using their Active Directory user name and password. The link needed can be found on the IAM Identity Center console.

  1. On the IAM Identity Center console, choose Settings in the navigation pane.
  2. On the Identity source tab, locate the user login link under AWS access portal URL.

Test role-based data access

To review, data scientist Tina needs to build a customer lifetime value model, which requires access to orders, product, and non-sensitive customer data. Data scientist Alex only needs access to orders and product data to build a product demand model.

In this section, we test the data access levels for each role.

Data scientist Tina

Complete the following steps:

  1. Log in using the URL you located in the previous step.
  2. Enter Microsoft AD user tina@awsemr.com and your password.
  3. Choose the Amazon SageMaker Studio tile.
  4. In the SageMaker Studio UI, start a notebook:
    1. Choose File, New, and Notebook.
    2. For Image, choose SparkMagic.
    3. For Kernel, choose PySpark.
    4. For Instance Type, choose ml.t3.medium.
    5. Choose Select.
  5. When the notebook kernel starts, connect to the EMR cluster by running the following code:
    %load_ext sagemaker_studio_analytics_extension.magics
    %sm_analytics emr connect --cluster-id <EMR Cluster ID> --auth-type Kerberos --language python

The EMR cluster ID details can be found on the Outputs tab of the EMR cluster CloudFormation stack created with the second template.

  1. Enter Microsoft AD tina@awsemr.com and your password. (Note that tina@awsemr.com is case-sensitive.)
  2. Choose Connect.

Now we can test Tina’s data access.

  1. In a new cell, enter the following query and run the cell:
    %%sql
    show tables from staging

Returned data will indicate the table objects accessible to Tina.

  1. In a new cell, run the following:
    %%sql
    select * from staging.customers limit 5

Returned data will include columns Tina has been granted access.

Let’s test Tina’s access to customer data.

  1. In a new cell, run the following:
    %%sql
    select customer_id, education_level, first_name, last_name, marital_status, region, state from staging.customers limit 15

The preceding query will result in an Access Denied error due to the inclusion of sensitive data columns.

During ad hoc analysis and model building, it’s common for users to create temporary datasets that need to be persisted for a short period. Let’s test Tina’s ability to create a working dataset and store results in a private working directory.

  1. In a new cell, run the following:
    join_order_to_customer = spark.sql("select orders.*, first_name, last_name, region, state from staging.orders, staging.customers where orders.customer_id = customers.customer_id")

  2. Before running the following code, update the S3 path variable <bucket name> to correspond to an S3 location within your local account:
    join_order_to_customer.write.mode("overwrite").format("parquet").option("path", "s3://<bucket name>/data/tina/order_and_product/").save()

The preceding query writes the created dataset as Parquet files in the S3 bucket specified.

Data scientist: Alex

Complete the following steps:

  1. Log in using the URL you located in the previous step.
  2. Enter Microsoft AD user alex@awsemr.com and your password.
  3. Choose the Amazon SageMaker Studio tile.
  4. In the SageMaker Studio UI, start a notebook:
    1. Choose File, New, and Notebook.
    2. For Image, choose SparkMagic.
    3. For Kernel, choose PySpark.
    4. For Instance Type, choose ml.t3.medium.
    5. Choose Select.
  5. When the notebook kernel starts, connect to the EMR cluster by running the following code:
    %load_ext sagemaker_studio_analytics_extension.magics
    %sm_analytics emr connect --cluster-id <EMR Cluster ID> --auth-type Kerberos --language python

  6. Enter Microsoft AD [email protected] and your password (note that [email protected] is case-sensitive).
  7. Choose Connect.

Now we can test Alex's data access.

  8. In a new cell, enter the following query and run the cell:
    %%sql
    show tables from staging

    Returned data will indicate the table objects accessible to Alex. Note that the customers table is missing.

  9. In a new cell, run the following:
    %%sql
    select * from staging.orders limit 5

Returned data will include the columns Alex has been granted access to.

Let’s test Alex’s access to customer data.

  1. In a new cell, run the following:
    %%sql
    select * from staging.customers limit 5

The preceding query will result in an Access Denied error because Alex doesn’t have access to customers.

We can verify Ranger is responsible for the denial by looking at the CloudWatch logs.
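
To check this from a notebook or a local Python shell, you can search the Ranger audit log group for denied events with a few lines of boto3. This is a minimal sketch only; the log group name below is an assumption, so substitute the Ranger audit log group created by your CloudFormation stack, and adjust the filter terms to the user you're investigating.

    import time
    import boto3

    logs = boto3.client("logs")

    # Assumption: replace with the Ranger audit log group created by your stack
    LOG_GROUP = "/ranger/audit"

    # Look for access-denied audit events for Alex over the last hour
    response = logs.filter_log_events(
        logGroupName=LOG_GROUP,
        startTime=int((time.time() - 3600) * 1000),
        filterPattern='"alex" "DENIED"',
    )

    for event in response.get("events", []):
        print(event["message"])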

Now that you can successfully access data, feel free to interactively explore, visualize, prepare, and model the data using the different user personas.

Clean up

When you’re finished experimenting with this solution, clean up your resources:

  1. Shut down and update SageMaker Studio and Studio apps. Ensure that all apps created as part of this post are deleted before deleting the stack.
  2. Change the identity source for IAM Identity Center back to Identity Center Directory.
  3. Delete the directory AWSEMR.COM from Directory Services.
  4. Empty the S3 buckets created by the CloudFormation stacks.
  5. Delete the stacks via the AWS CloudFormation console for the non-nested stacks starting in reverse order.

Conclusion

This post showed how you can implement fine-grained access control in SageMaker Studio and Amazon EMR using Apache Ranger and Microsoft Active Directory. We also demonstrated how multiple SageMaker Studio users can connect to the same EMR cluster and access different tables and columns using Apache Ranger, wherein each user is scoped with permissions matching their individual level of access to data. In addition, we demonstrated how the individual users can access separate S3 folders for storing their intermediate data. We detailed the steps required to set up the integration and provided CloudFormation templates to set up the base infrastructure from end to end.

To learn more about using Amazon EMR with SageMaker Studio, refer to Prepare Data using Amazon EMR. We encourage you to try out this new functionality, and connect with the Machine Learning & AI community if you have any questions or feedback!

Appendix: Apache Ranger troubleshooting

The sync between Active Directory and Apache Ranger is set for every 24 hours. To force a sync, complete the following steps:

  1. Connect to the Apache Ranger server using SSH. You can connect directly, use Session Manager, a capability of AWS Systems Manager, or go through AWS Cloud9.
  2. Once connected, issue the following commands:
    sudo /usr/bin/ranger-usersync stop || true
    sudo /usr/bin/ranger-usersync start
    sudo chkconfig ranger-usersync on

  3. To confirm the sync, open the Ranger console as an admin.
  4. Choose Audit in the top banner.
  5. Choose the User Sync tab and confirm the event time.

About the Authors

Rahul Sarda is a Senior Analytics & ML Specialist at AWS. He is a seasoned leader with over 20 years of experience, who is passionate about helping customers build scalable data and analytics solutions to gain timely insights and make critical business decisions. In his spare time, he enjoys spending time with his family, staying healthy, running, and road cycling.

Varun Rao Bhamidimarri is a Senior Manager on the AWS Analytics Specialist Solutions Architect team. His focus is helping customers with adoption of cloud-enabled analytics solutions to meet their business requirements. Outside of work, he loves spending time with his wife and two kids, staying healthy, meditating, and gardening, which he recently picked up during the lockdown.

Set up AWS Private Certificate Authority to issue certificates for use with IAM Roles Anywhere

Post Syndicated from Chris Sciarrino original https://aws.amazon.com/blogs/security/set-up-aws-private-certificate-authority-to-issue-certificates-for-use-with-iam-roles-anywhere/

Traditionally, applications or systems—defined as pieces of autonomous logic functioning without direct user interaction—have faced challenges associated with long-lived credentials such as access keys. In certain circumstances, long-lived credentials can increase operational overhead and the scope of impact in the event of an inadvertent disclosure.

To help mitigate these risks and follow the best practice of using short-term credentials, Amazon Web Services (AWS) introduced IAM Roles Anywhere, a feature of AWS Identity and Access Management (IAM). With the introduction of IAM Roles Anywhere, systems running outside of AWS can exchange X.509 certificates to assume an IAM role and receive temporary IAM credentials from AWS Security Token Service (AWS STS).

You can use IAM Roles Anywhere to help you implement a secure and manageable authentication method. It uses the same IAM policies and roles as within AWS, simplifying governance and policy management across hybrid cloud environments. Additionally, the certificates used in this process come with a built-in validity period defined when the certificate request is created, enhancing the security by providing a time-limited trust for the identities. Furthermore, customers in high security environments can optionally keep private keys for the certificates stored in PKCS #11-compatible hardware security modules for extra protection.

For organizations that lack an existing public key infrastructure (PKI), AWS Private Certificate Authority allows for the creation of a certificate hierarchy without the complexity of self-hosting a PKI.

With the introduction of IAM Roles Anywhere, there is now an accompanying requirement to manage certificates and their lifecycle. AWS Private CA is an AWS managed service that can issue X.509 certificates for hosts. This makes it ideal for use with IAM Roles Anywhere. However, AWS Private CA doesn’t natively deploy certificates to hosts.

Certificate deployment is an essential part of managing the certificate lifecycle for IAM Roles Anywhere, the absence of which can lead to operational inefficiencies. Fortunately, there is a solution. By using AWS Systems Manager with its Run Command capability, you can automate issuing and renewing certificates from AWS Private CA. This simplifies the management process of IAM Roles Anywhere on a large scale.

In this blog post, we walk you through an architectural pattern that uses AWS Private CA and Systems Manager to automate issuing and renewing X.509 certificates. This pattern smooths the integration of non-AWS hosts with IAM Roles Anywhere. It can help you replace long-term credentials while reducing the operational complexity of IAM Roles Anywhere with certificate vending automation.

While IAM Roles Anywhere supports both Windows and Linux, this solution is designed for a Linux environment. Windows users integrating with Active Directory should check out the AWS Private CA Connector for Active Directory. By implementing this architectural pattern, you can distribute certificates to your non-AWS Linux hosts, thereby enabling them to use IAM Roles Anywhere. This approach can help you simplify certificate management tasks.

Architecture overview

Figure 1: Architecture overview

Figure 1: Architecture overview

The architectural pattern we propose (Figure 1) is composed of multiple stages, involving AWS services including Amazon EventBridge, AWS Lambda, Amazon DynamoDB, and Systems Manager.

  1. Amazon EventBridge Scheduler invokes a Lambda function called CertCheck twice daily.
  2. The Lambda function scans a DynamoDB table to identify instances that require certificate management. It specifically targets instances managed by Systems Manager, which the administrator populates into the table.
  3. CertCheck receives the list of instances that have no certificate and instances that need new certificates because their existing ones are close to expiring.
  4. Depending on the certificate’s expiration date for a particular instance, a second Lambda function called CertIssue is launched.
  5. CertIssue instructs Systems Manager to apply a run command on the instance.
  6. Run Command generates a certificate signing request (CSR) and a private key on the instance.
  7. The CSR is retrieved by Systems Manager, the private key remains securely on the instance.
  8. CertIssue then retrieves the CSR from Systems Manager.
  9. CertIssue uses the CSR to request a signed certificate from AWS Private CA.
  10. On successful certificate issuance, AWS Private CA creates an event through EventBridge that contains the ID of the newly issued certificate.
  11. This event subsequently invokes a third Lambda function called CertDeploy.
  12. CertDeploy retrieves the certificate from AWS Private CA and invokes Systems Manager to launch Run Command with the certificate data and updates the certificate’s expiration date in the DynamoDB table for future reference.
  13. Run Command conducts a brief test to verify the certificate’s functionality, and upon success, stores the signed certificate on the instance.
  14. The instance can then exchange the certificate for AWS credentials through IAM Roles Anywhere.

Additionally, on a certificate rotation failure, an Amazon Simple Notification Service (Amazon SNS) notification is delivered to an email address specified during the AWS CloudFormation deployment.

The solution enables periodic certificate rotation. If a certificate is nearing expiration, the process initiates the generation of a new private key and CSR, thus issuing a new certificate. Newly generated certificates, private keys, and CSRs replace the existing ones.

With certificates in place, they can be used by IAM Roles Anywhere to obtain short-term IAM credentials. For more details on setting up IAM Roles Anywhere, see the IAM Roles Anywhere User Guide.

Costs

Although this solution offers significant benefits, it’s important to consider the associated costs before you deploy. To provide a cost estimate, managing certificates for 100 hosts would cost approximately $85 per month. However, for a larger deployment of 1,100 hosts with the Systems Manager advanced tier, the cost would be around $5,937 per month. These pricing estimates include the rotation of certificates six times a month.

AWS Private CA in short-lived mode incurs a monthly charge of $50, and each certificate issuance costs $0.058. Systems Manager Hybrid Activation standard has no additional cost for managing fewer than 1,000 hosts. If you have more than 1,000 hosts, the advanced tier must be used at an approximate cost of $5 per host per month. DynamoDB, Amazon SNS, and Lambda costs should be under $5 per month per service for under 1,000 hosts. For environments with over 1,000 hosts, it might be worthwhile to explore other options for machine-to-machine authentication or another approach to distributing certificates.

Please note that the estimated pricing mentioned here is specific to the us-east-1 AWS Region and can be calculated for other regions using the AWS Pricing Calculator.
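
To see roughly how those estimates compose from the per-unit prices quoted above, here is a back-of-the-envelope calculation. It uses only the numbers from this section, so treat the result as an approximation rather than a quote.

    # Rough monthly cost estimate from the per-unit prices quoted above (us-east-1)
    private_ca_monthly = 50.00        # AWS Private CA in short-lived mode
    cert_issuance = 0.058             # per certificate issued
    rotations_per_month = 6
    ssm_advanced_per_host = 5.00      # approximate, only when managing more than 1,000 hosts

    def estimate_monthly_cost(hosts):
        certificates = hosts * rotations_per_month * cert_issuance
        ssm = hosts * ssm_advanced_per_host if hosts > 1000 else 0
        return private_ca_monthly + certificates + ssm

    print(estimate_monthly_cost(100))    # roughly $85
    print(estimate_monthly_cost(1100))   # roughly $5,900, before the smaller per-service charges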

Prerequisites

You should have several items already set up to make it easier to follow along with the blog.

Enabling Systems Manager hybrid activation

To create a hybrid activation, follow these steps:

  1. Open the AWS Management Console for Systems Manager, go to Hybrid activations and choose Create an Activation.
    Figure 2: Hybrid activation page

    Figure 2: Hybrid activation page

  2. Enter a description [optional] for the activation and adjust the Instance limit value to the maximum you need, then choose Create activation.
    Figure 3: Create hybrid activation

    Figure 3: Create hybrid activation

  3. This gives you a green banner with an Activation Code and Activation ID. Make a note of these.
    Figure 4: Successful hybrid activation with activation code and ID

    Figure 4: Successful hybrid activation with activation code and ID

  4. Install the AWS Systems Manager Agent (SSM Agent) on the hosts to be managed. Follow the instructions for the appropriate operating system. In the example commands, replace <activation-code>, <activation-id>, and <region> with the activation code and ID from the previous step and your Region. Here is an example of commands to run for an Ubuntu host:
    mkdir /tmp/ssm
    
    curl https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/debian_amd64/amazon-ssm-agent.deb -o /tmp/ssm/amazon-ssm-agent.deb
    
    sudo dpkg -i /tmp/ssm/amazon-ssm-agent.deb
    
    sudo service amazon-ssm-agent stop
    
    sudo -E amazon-ssm-agent -register -code "<activation-code>" -id "<activation-id>" -region <region> 
    
    sudo service amazon-ssm-agent start
    

You should see a message confirming the instance was successfully registered with Systems Manager.

Note: If you receive errors during Systems Manager registration about the Region having invalid characters, verify that the Region is not in quotation marks.

Deploy with CloudFormation

We’ve created a Git repository with a CloudFormation template that sets up the aforementioned architecture. An existing S3 bucket is required for CloudFormation to upload the Lambda package.

To launch the CloudFormation stack:

  1. Clone the Git repository that contains the CloudFormation template and the Lambda function code.
    git clone https://github.com/aws-samples/aws-privateca-certificate-deployment-automator.git
    

  2. cd into the directory created by Git.
    cd aws-privateca-certificate-deployment-automator
    

  3. Launch the CloudFormation stack within the cloned Git directory using the cf_template.yaml file, replacing <DOC-EXAMPLE-BUCKET> with the name of your S3 bucket from the prerequisites.
    aws cloudformation package --template-file cf_template.yaml --output-template-file packaged.yaml --s3-bucket <DOC-EXAMPLE-BUCKET>
    

Note: These commands should be run on the system you plan to use to deploy the CloudFormation and have the Git and AWS CLI installed.

After successfully running the CloudFormation package command, run the CloudFormation deploy command. The template supports various parameters to change the path where the certificates and keys will be generated. Adjust the paths as needed with the parameter-overrides flag, but verify that they exist on the hosts. Replace the <email> placeholder with one that you want to receive alerts for failures. The stack name must be in lower case.

aws cloudformation deploy --template packaged.yaml --stack-name ssm-pca-stack --capabilities CAPABILITY_NAMED_IAM --parameter-overrides SNSSubscriberEmail=<email>

The available CloudFormation parameters are listed in the following table:

Parameter Default value Use
AWSSigningHelperPath /root Default path on the host for the AWS Signing Helper binary
CACertPath /tmp Default path on the host the CA certificate will be created in
CACertValidity 10 Default CA certificate length in years
CACommonName ca.example.com Default CA certificate common name
CACountry US Default CA certificate country code
CertPath /tmp Default path on the host the certificates will be created in
CSRPath /tmp Default path on the host the CSRs will be created in
KeyAlgorithm RSA_2048 Default algorithm used to create the CA private key
KeyPath /tmp Default path on the host the private keys will be created in
OrgName Example Corp Default CA certificate organization name
SigningAlgorithm SHA256WITHRSA Default CA signing algorithm for issued certificates

After the CloudFormation stack is ready, manually add the hosts requiring certificate management into the DynamoDB table.

You will also receive an email at the email address specified to accept the SNS topic subscription. Make sure to choose the Confirm Subscription link as shown in Figure 5.

Figure 5: SNS topic subscription confirmation

Figure 5: SNS topic subscription confirmation

Add data to the DynamoDB table

  1. Open the AWS Systems Manager console and select Fleet Manager.
  2. Choose Managed Nodes and copy the Node ID value. The node ID value in the Fleet Manager as shown in Figure 6 will be the host ID to be used in a subsequent step.
    Figure 6: Systems Manager Node ID

    Figure 6: Systems Manager Node ID

  3. Open the DynamoDB console and select Dashboard and then Tables in the left navigation pane.
    Figure 7: DynamoDB menu

    Figure 7: DynamoDB menu

  4. Select the certificates table.
    Figure 8: DynamoDB tables

    Figure 8: DynamoDB tables

  5. Choose Explore table items and then choose Create item.
  6. Enter the node ID as a value for the hostID attribute as copied in step 2.
    Figure 9: DynamoDB table hostID attribute creation

    Figure 9: DynamoDB table hostID attribute creation

Additional string attributes listed in the following table can be added to the item to specify paths for the certificates on a per host basis. If these attributes aren’t created, either the default paths or overrides in the CloudFormation parameters will be used.

Additional supported attributes Use
certPath Path on the host the certificate will be created in
keyPath Path on the host the private key will be created in
signinghelperPath Path on the host for the AWS Signing Helper binary
cacertPath Path on the host the CA certificate will be created in
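
If you prefer to script this step instead of using the DynamoDB console, a boto3 put_item call is enough. The following is a minimal sketch; the table name, node ID, and paths are placeholders, so substitute the table created by the CloudFormation stack and the Node ID you copied from Fleet Manager.

    import boto3

    dynamodb = boto3.resource("dynamodb")

    # Assumption: substitute the actual certificates table name created by the stack
    table = dynamodb.Table("certificates")

    table.put_item(
        Item={
            "hostID": "mi-0123456789abcdef0",   # Node ID from Fleet Manager
            # Optional per-host overrides; omit them to use the CloudFormation defaults
            "certPath": "/etc/pki/tls/certs",
            "keyPath": "/etc/pki/tls/private",
        }
    )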

The CertCheck Lambda function created by the CloudFormation template runs twice daily to verify that the certificates for these hosts are kept up to date. If necessary, you can use the Lambda invoke command to run the Lambda function on-demand.

aws lambda invoke --function-name CertCheck-Trigger --cli-binary-format raw-in-base64-out response.json

The certificate expiration and serial number metadata are stored in the certificates DynamoDB table. Select the certificates table and choose Explore table items to view the data.

Figure 10: DynamoDB table item with certificate expiration and serial for a host

Figure 10: DynamoDB table item with certificate expiration and serial for a host

Validation

To validate successful certificate deployment, you should find four files in the location specified in the CloudFormation parameter or DynamoDB table attribute, as shown in the following table.

File Use Location
{host}.crt The certificate containing the public key, signed by AWS Private CA. certPath attribute in DynamoDB. Otherwise, default specified by the certPath CF parameter.
ca_chain_certificate.crt The certificate chain including intermediates from AWS Private CA. cacertPath attribute in DynamoDB. Otherwise, default specified by the CACertPath CF parameter.
{host}.key The private key for the certificate. keyPath attribute in DynamoDB. Otherwise, default specified by the KeyPath CF parameter.
{host}.csr The CSR used to generate the signed certificate. Default specified by the CSRPath CF parameter.

These certificates can now be used to configure the host for IAM Roles Anywhere. See Obtaining temporary security credentials from AWS Identity and Access Management Roles Anywhere for using the signing helper tool provided by IAM Roles Anywhere. The signing helper must be installed on the instance for the validation to work. You can pass the location of the signing helper as a parameter to the CloudFormation template.

Note: As a security best practice, it’s important to use permissions and ACLs to keep the private key secure and restrict access to it. The automation creates the private key and sets it with chmod 400 permissions. The chmod command changes the permissions of a file or directory; mode 400 allows the owner to read the file while preventing everyone else from reading, writing, or running it.

Revoke a certificate

AWS Private CA also supports generating a certificate revocation list (CRL), which can be imported to IAM Roles Anywhere. The CloudFormation template automatically sets up the CRL process between AWS Private CA and IAM Roles Anywhere.

Figure 11: Certificate revocation process

Figure 11: Certificate revocation process

Within 30 minutes after revocation, AWS Private CA generates a CRL file and uploads it to the CRL S3 bucket that was created by the CloudFormation template. Then, the CRLProcessor Lambda function receives a notification through EventBridge of the new CRL file and passes it to the IAM Roles Anywhere API.

To revoke a certificate, use the AWS CLI. In the following example, replace <certificate-authority-arn>, <certificate-serial>, and <revocation-reason> with your own information.

aws acm-pca revoke-certificate --certificate-authority-arn <certificate-authority-arn> --certificate-serial <certificate-serial> --revocation-reason <revocation-reason>

The AWS Private CA ARN can be found in the CloudFormation stack outputs under the name PCAARN. The certificate serial numbers are listed in the DynamoDB table for each host, as previously mentioned. The revocation reason can be one of these possible values:

  • UNSPECIFIED
  • KEY_COMPROMISE
  • CERTIFICATE_AUTHORITY_COMPROMISE
  • AFFILIATION_CHANGED
  • SUPERSEDED
  • CESSATION_OF_OPERATION
  • PRIVILEGE_WITHDRAWN
  • A_A_COMPROMISE

Revoking a certificate won’t automatically generate a new certificate for the host. See Manually rotate certificates.

Manually rotate certificates

The certificates are set to expire weekly and are rotated the day of expiration. If you need to manually replace a certificate sooner, remove the expiration date for the host’s record in the DynamoDB table (see Figure 12). On the next run of the Lambda function, the lack of an expiration date will cause the certificate for that host to be replaced. To immediately renew a certificate or test the rotation function, remove the expiration date from the DynamoDB table and run the following Lambda invoke command. After the certificates have been rotated, the new expiration date will be listed in the table.

aws lambda invoke --function-name CertCheck-Trigger --cli-binary-format raw-in-base64-out response.json
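
If you want to script the manual rotation, the same two actions can be performed with boto3. This is a sketch only; the table name and node ID are placeholders, and the name of the expiration attribute is an assumption, so check the item in your table (see Figure 10) and adjust it if your attribute is named differently.

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("certificates")          # substitute your table name

    # Remove the stored expiration date so the next run treats the certificate as due
    table.update_item(
        Key={"hostID": "mi-0123456789abcdef0"},     # Node ID from Fleet Manager
        UpdateExpression="REMOVE #exp",
        ExpressionAttributeNames={"#exp": "expiration"},  # assumed attribute name
    )

    # Trigger the rotation immediately instead of waiting for the next scheduled run
    boto3.client("lambda").invoke(FunctionName="CertCheck-Trigger")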

Conclusion

By using AWS IAM Roles Anywhere, systems outside of AWS can use short-term credentials in the form of X.509 certificates in exchange for AWS STS credentials. This can help you improve your security in a hybrid environment by reducing the use of long-term access keys as credentials.

For organizations without an existing enterprise PKI, the solution described in this post provides an automated method of generating and rotating certificates using AWS Private CA and AWS Systems Manager. We showed you how you can use Systems Manager to set up a non-AWS host with certificates for use with IAM Roles Anywhere and ensure they’re rotated regularly.

Deploy this solution today and move towards IAM Roles Anywhere to remove long-term credentials for programmatic access. For more information, see the IAM Roles Anywhere blog article or post your queries on AWS re:Post.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on IAM re:Post or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Chris Sciarrino

Chris Sciarrino

Chris is a Senior Solutions Architect and a member of the AWS security field community based in Toronto, Canada. He works with enterprise customers helping them design solutions on AWS. Outside of work, Chris enjoys spending his time hiking and skiing with friends and listening to audiobooks.

Ravikant Sharma

Ravikant Sharma

Ravikant is a Solutions Architect based in London. He specializes in cloud security and financial services. He helps Fintech and Web3 startups build and scale their business using AWS Cloud. Prior to AWS, he worked at Citi Singapore as Vice President – API management, where he played a pivotal role in implementing Open API security frameworks and Open Banking regulations.

Rahul Gautam

Rahul Gautam

Rahul is a Security and Compliance Specialist Solutions Architect based in London. He helps customers in adopting AWS security services to meet and improve their security posture in the cloud. Before joining the SSA team, Rahul spent 5 years as a Cloud Support Engineer in AWS Premium Support. Outside of work, Rahul enjoys travelling as much as he can.

How to improve your security incident response processes with Jupyter notebooks

Post Syndicated from Tim Manik original https://aws.amazon.com/blogs/security/how-to-improve-your-security-incident-response-processes-with-jupyter-notebooks/

Customers face a number of challenges to quickly and effectively respond to a security event. To start, it can be difficult to standardize how to respond to a particular security event, such as an Amazon GuardDuty finding. Additionally, silos can form with reliance on one security analyst who is designated to perform certain tasks, such as investigating all GuardDuty findings. Jupyter notebooks can help you address these challenges by simplifying both standardization and collaboration.

Jupyter Notebook is an open-source, web-based application to run and document code. Although Jupyter notebooks are most frequently used for data science and machine learning, you can also use them to more efficiently and effectively investigate and respond to security events.

In this blog post, we will show you how to use Jupyter Notebook to investigate a security event. With this solution, you can automate the tasks of gathering data, presenting the data, and providing procedures and next steps for the findings.

Benefits of using Jupyter notebooks for security incident response

The following are some ways that you can use Jupyter notebooks for security incident response:

  • Develop readable code for analysts – Within a notebook, you can combine markdown text and code cells to improve readability. Analysts can read context around the code cell, run the code cell, and analyze the results within the notebook.
  • Standardize analysis and response – You can reuse notebooks after the initial creation. This makes it simpler for you to standardize your incident response processes for how to respond to a certain type of security event. Additionally, you can use notebooks to achieve repeatable responses. You can rerun an entire notebook or a specific cell.
  • Collaborate and share incident response knowledge – After you create a Jupyter notebook, you can share it with peers to more seamlessly collaborate and share knowledge, which helps reduce silos and reliance on certain analysts.
  • Iterate on your incident response playbooks – Developing a security incident response program involves continuous iteration. With Jupyter notebooks, you can start small and iterate on what you have developed. You can keep Jupyter notebooks under source code control by using services such as AWS CodeCommit. This allows you to approve and track changes to your notebooks.

Architecture overview

Figure 1: Architecture for incident response analysis

Figure 1: Architecture for incident response analysis

The architecture shown in Figure 1 consists of the foundational services required to analyze and contain security incidents on AWS. You create and access the playbooks through the Jupyter console that is hosted on Amazon SageMaker. Within the playbooks, you run several Amazon Athena queries against AWS CloudTrail logs hosted in Amazon Simple Storage Service (Amazon S3).

Solution implementation

To deploy the solution, you will complete the following steps:

  1. Deploy a SageMaker notebook instance
  2. Create an Athena table for your CloudTrail trail
  3. Grant AWS Lake Formation access
  4. Access the Credential Compromise playbooks by using JupyterLab

Step 1: Deploy a SageMaker notebook instance

You will host your Jupyter notebooks on a SageMaker notebook instance. We chose to use SageMaker instead of running the notebooks locally because SageMaker provides flexible compute, seamless integration with CodeCommit and GitHub, temporary credentials through AWS Identity and Access Management (IAM) roles, and lower latency for Athena queries.

You can deploy the SageMaker notebook instance by using the AWS CloudFormation template from our jupyter-notebook-for-incident-response GitHub repository. We recommend that you deploy SageMaker in your security tooling account or an equivalent.

The CloudFormation template deploys the following resources:

  • A SageMaker notebook instance to run the analysis notebooks. Because this is a proof of concept (POC), the deployed SageMaker instance is the smallest instance type available. However, within an enterprise environment, you will likely need a larger instance type.
  • An AWS Key Management Service (AWS KMS) key to encrypt the SageMaker notebook instance and protect sensitive data.
  • An IAM role that grants the SageMaker notebook permissions to query CloudTrail, VPC Flow Logs, and other log sources.
  • An IAM role that allows access to the pre-signed URL of the SageMaker notebook from only an allowlisted IP range.
  • A VPC configured for SageMaker with an internet gateway, NAT gateway, and VPC endpoints to access required AWS services securely. The internet gateway and NAT gateway provide internet access to install external packages.
  • An S3 bucket to store results for your Athena log queries—you will reference the S3 bucket in the next step.

Step 2: Create an Athena table for your CloudTrail trail

The solution uses Athena to query CloudTrail logs, so you need to create an Athena table for CloudTrail.

There are two main ways to create an Athena table for CloudTrail:

  • Deploy the AWS Security Analytics Bootstrap, which creates the database and the CloudTrail table for you
  • Use the CloudTrail console to generate and run the CREATE TABLE statement for your trail

For either of these methods to create an Athena table, you need to provide the URI of an S3 bucket. For this blog post, use the URI of the S3 bucket that the CloudFormation template created in Step 1. To find the URI of the S3 bucket, see the Output section of the CloudFormation stack.

Step 3: Grant AWS Lake Formation access

If you don’t use AWS Lake Formation in your AWS environment, skip to Step 4. Otherwise, continue with the following instructions. Lake Formation is how data access control for your Athena tables is managed.

To grant permission to the Security Log database

  1. Open the Lake Formation console.
  2. Select the database that you created in Step 2 for your security logs. If you used the Security Analytics Bootstrap, then the table name is either security_analysis or a custom name that you provided—you can find the name in the CloudFormation stack. If you created the Athena table by using the CloudTrail console, then the database is named default.
  3. From the Actions dropdown, select Grant.
  4. In Grant data permissions, select IAM users and roles.
  5. Find the IAM role used by the SageMaker Notebook instance.
  6. In Database permissions, select Describe and then Grant.

To grant permission to the Security Log CloudTrail table

  1. Open the Lake Formation console.
  2. Select the database that you created in Step 2.
  3. Choose View Tables.
  4. Select CloudTrail. If you created VPC flow log and DNS log tables, select those, too.
  5. From the Actions dropdown, select Grant.
  6. In Grant data permissions, select IAM users and roles.
  7. Find the IAM role used by the SageMaker notebook instance.
  8. In Table permissions, select Describe and then Grant.

Step 4: Access the Credential Compromise playbooks by using JupyterLab

The CloudFormation template clones the jupyter-notebook-for-incident-response GitHub repo into your Jupyter workspace.

You can access JupyterLab hosted on your SageMaker notebook instance by following the steps in the Access Notebook Instances documentation.

Your folder structure should match that shown in Figure 2. The parent folder should be jupyter-notebook-for-incident-response, and the child folders should be playbooks and cfn-templates.

Figure 2: Folder structure after GitHub repo is cloned to the environment

Figure 2: Folder structure after GitHub repo is cloned to the environment

Sample investigation of a spike in failed login attempts

In the following sections, you will use the Jupyter notebook that we created to investigate a scenario where failed login attempts have spiked. We designed this notebook to guide you through the process of gathering more information about the spike.

We discuss the important components of these notebooks so that you can use the framework to create your own playbooks. We encourage you to build on top of the playbook, and add additional queries and steps in the playbook to customize it for your organization’s specific business and security goals.

For this blog post, we will focus primarily on the analysis phase of incident response and walk you through how you can use Jupyter notebooks to help with this phase.

Before you get started with the following steps, open the credential-compromise-analysis.ipynb notebook in your JupyterLab environment.

How to import Python libraries and set environment variables

The notebooks require that you have the following Python libraries:

  • Boto3 – to interact with AWS services through API calls
  • Pandas – to visualize the data
  • PyAthena – to simplify the code to connect to Athena

To install the required Python libraries, in the Setup section of the notebook, under Load libraries, edit the variables in the two code cells as follows:

  • region – specify the AWS Region that you want your AWS API commands to run in (for example, us-east-1).
  • athena_bucket – specify the S3 bucket URI that is configured to store your Athena queries. You can find this information at Athena > Query Editor > Settings > Query result location.
  • db_name – specify the database used by Athena that contains your Athena table for CloudTrail.
Figure 3: Load the Python libraries in the notebook

Figure 3: Load the Python libraries in the notebook

This helps ensure that subsequent code cells that run are configured to run in your environment.

Run each code cell by choosing the cell and pressing SHIFT+ENTER or by choosing the play button (▶) in the toolbar at the top of the console.

How to set up the helper function for Athena

The Python query_results function, shown in the following figure, helps you query Athena tables. Run this code cell. You will use the query_results function later in the 2.0 IAM Investigation section of the notebook.

Figure 4: Code cell for the helper function to query with Athena

Figure 4: Code cell for the helper function to query with Athena
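
The exact implementation ships with the notebook, but the general shape of such a helper is a thin wrapper around PyAthena that returns a pandas DataFrame. The following is a minimal sketch under that assumption, reusing the region, athena_bucket, and db_name values described in the Setup section; the real function in the repository may differ in its details.

    import pandas as pd
    from pyathena import connect

    # Values described in the Setup section of the notebook
    region = "us-east-1"
    athena_bucket = "s3://<athena-query-results-bucket>/"
    db_name = "default"

    def query_results(sql):
        """Run an Athena query and return the results as a pandas DataFrame."""
        cursor = connect(
            s3_staging_dir=athena_bucket,
            region_name=region,
            schema_name=db_name,
        ).cursor()
        cursor.execute(sql)
        columns = [col[0] for col in cursor.description]
        return pd.DataFrame(cursor.fetchall(), columns=columns)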

Credential Compromise Analysis Notebook

The credential-compromise-analysis.ipynb notebook includes several prebuilt queries to help you start your investigation of a potentially compromised credential. In this post, we discuss three of these queries:

  • The first query provides a broad view by retrieving the CloudTrail events related to authorization failures. By reviewing these results, you get baseline information about where users and roles are attempting to access resources or take actions without having the proper permissions.
  • The second query narrows the focus by identifying the top five IAM entities (such as users, roles, and identities) that are causing most of the authorization failures. Frequent failures from specific entities often indicate that their credentials are compromised.
  • The third query zooms in on one of the suspicious entities from the previous query. It retrieves API activity and events initiated by that entity across AWS services or resources. Analyzing actions performed by a suspicious entity can reveal if valid permissions are being misused or if the entity is systematically trying to access resources it doesn’t have access to.

Investigate authorization failures

The notebook has markdown cells that provide a description of the expected result of the query. The next cell contains the query statement. The final cell calls the query_result function to run your query by using Athena and display your results in tabular format.

In query 2.1, you query for specific error codes such as AccessDenied, and filter for anything that is an IAM entity by looking for useridentity.arn like ‘%iam%’. The notebook orders the entries by eventTime. If you want to look for specific IAM Identity Center entities, update the query to filter by useridentity.sessioncontext.sessionissuer.arn like ‘%sso.amazonaws.com%’.

This query retrieves a list of failed API calls to AWS services. From this list, you can gain additional insight into the context surrounding the spike in failed login attempts.
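
To make the structure concrete, a query cell in that style looks roughly like the following. This is a sketch, not the notebook's exact query: it assumes the Setup cells above have already defined db_name and query_results, and it assumes your CloudTrail table is named cloudtrail, so adjust the table name to whatever you created in Step 2.

    # Sketch of a query 2.1-style cell: recent authorization failures for IAM entities
    query_authz_failures = f"""
    SELECT eventtime, useridentity.arn, eventsource, eventname, sourceipaddress, errorcode
    FROM {db_name}.cloudtrail
    WHERE errorcode IN ('AccessDenied', 'Client.UnauthorizedOperation')
      AND useridentity.arn LIKE '%iam%'
    ORDER BY eventtime DESC
    LIMIT 50
    """

    query_results(query_authz_failures)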

When you investigate denied API access requests, carefully examine details such as the user identity, timestamp, source IP address, and other metadata. This information helps you determine if the event is a legitimate threat or a false positive. Here are some specific questions to ask:

  • Does the IP address originate from within your network, or is it external? Internal addresses might be less concerning.
  • Is the access attempt occurring during normal working hours for that user? Requests outside of normal times might warrant more scrutiny.
  • What resources or changes is the user trying to access or make? Attempts to modify sensitive data or systems might indicate malicious intent.

By thoroughly evaluating the context around denied API calls, you can more accurately assess the risk they pose and whether you need to take further action. You can use the specifics in the logs to go beyond just the fact that access was denied, and learn the story of who, when, and why.

As shown in the following figure, the queries in the notebook use the following structure.

  1. Markdown cell to explain the purpose of the query (the query statement).
  2. Code cell to run the query and display the query results.

In the figure, the first code cell that runs stores the input for the query statement. After that finishes, the next code block displays the query results.

Figure 5: Run predefined Athena queries in JupyterLab

Figure 5: Run predefined Athena queries in JupyterLab

Figure 6 shows the output of the query that you ran in the 2.1 Investigation Authorization Failures section. It contains critical details for understanding the context around a denied API call:

  • The eventtime field shows the date and time that the request was completed.
  • The useridentity field reveals which IAM identity made a request.
  • The sourceipaddress field provides the IP address that the request was made from.
  • The useragent shows which client or app was used to make the call.
Figure 6: Results from the first investigative query

Figure 6: Results from the first investigative query

Figure 6 only shows a subset of the many details captured in CloudTrail logs. By scrolling to the right in the query output, you can view additional attributes that provide further context around the event. The CloudTrail record contents guide contains a comprehensive list of the fields included in the logs, along with descriptions of each attribute.

Often, you will need to search for more information to determine if remediation is necessary. For this reason, we have included additional queries to help you further examine the sequence of events leading up to the failed login attempt spike and after the spike occurred.

Triaging suspicious entities (Queries 2.2 and 2.3)

By running the second and third queries you can dig deeper into anomalous authorization failures. As shown in Figure 7, Query 2.2 provides the top five IAM entities with the most frequent access denials. This highlights the specific users, roles, and identities causing the most failures, which indicates potentially compromised credentials.

Query 2.3 takes the investigation further by isolating the activity from one suspicious entity. Retrieving the actions attempted by a single problematic user or role reveals useful context to determine if you need to revoke credentials. For example, is the entity probing resources that it shouldn’t have access to? Are there unusual API calls outside of normal hours? By scrutinizing an entity’s full history, you can make an informed decision on remediation.

Figure 7: Overview of queries 2.2 and 2.3

Figure 7: Overview of queries 2.2 and 2.3
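
For reference, an aggregation in the spirit of query 2.2 can be expressed as a simple GROUP BY over the same table. As with the earlier sketch, this is illustrative only and assumes the db_name, query_results, and CloudTrail table name from before.

    # Sketch of a query 2.2-style cell: top five IAM entities by authorization failures
    query_top_entities = f"""
    SELECT useridentity.arn AS entity, count(*) AS failure_count
    FROM {db_name}.cloudtrail
    WHERE errorcode = 'AccessDenied'
    GROUP BY useridentity.arn
    ORDER BY failure_count DESC
    LIMIT 5
    """

    query_results(query_top_entities)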

You can use these two queries together to triage authorization failures: query 2.2 identifies high-risk entities, and query 2.3 gathers intelligence to drive your response. This progression from a macro view to a micro view is crucial for transforming signals into action.

Although log analysis relies on automation and queries to facilitate insights, human judgment is essential to interpret these signals and determine the appropriate response. You should discuss flagged events with stakeholders and resource owners to benefit from their domain expertise. You can export the results of your analysis by exporting your Jupyter notebook.

By collaborating with other people, you can gather contextual clues that might not be captured in the raw data. For example, an owner might confirm that a suspicious login time is expected for users in a certain time zone. By pairing automated detection with human perspectives, you can accurately assess risk and decide if credential revocation or other remediation is truly warranted. Uptime or downtime technical issues alone can’t dictate if remediation is necessary—the human element provides pivotal context.

Build your own queries

In addition to the existing queries, you can run your own queries and include them in your copy of the credential-compromise-analysis.ipynb notebook. The AWS Security Analytics Bootstrap contains a library of common Athena queries for CloudTrail. We recommend that you review these queries before you start to build your own. The key takeaway is that these notebooks are highly customizable. You can use the Jupyter Notebook application to help meet the specific incident response requirements of your organization.

Contain compromised IAM entities

If the investigation reveals that a compromised IAM entity requires containment, follow these steps to revoke access:

  • For federated users, revoke their active AWS sessions according to the guidance in How to revoke federated users’ active AWS sessions. This uses IAM policies and AWS Organizations service control policies (SCPs) to revoke access to assumed roles.
  • Avoid using long-lived IAM credentials such as access keys. Instead, use temporary credentials through IAM roles. However, if you detect a compromised access key, immediately rotate or deactivate it by following the guidance in What to Do If You Inadvertently Expose an AWS Access Key. Review the permissions granted to the compromised IAM entity and consider if these permissions should be reduced after access is restored. Overly permissive policies might have enabled broader access for the threat actor.

Going forward, implement least privilege access and monitor authorization activity to detect suspicious behavior. By quickly containing compromised entities and proactively improving IAM hygiene, you can minimize the adversaries’ access duration and prevent further unauthorized access.
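
As an illustration of the access key containment step above, deactivating a key is a single IAM API call. The user name and key ID in this sketch are placeholders; deactivation keeps the key available for later forensics, and you can rotate or delete it once the investigation is complete.

    import boto3

    iam = boto3.client("iam")

    # Deactivate a compromised access key immediately; the key can be rotated
    # or deleted after the investigation wraps up.
    iam.update_access_key(
        UserName="compromised-user",        # placeholder
        AccessKeyId="AKIAEXAMPLEKEYID",     # placeholder
        Status="Inactive",
    )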

Additional considerations

In addition to querying CloudTrail, you can use Athena to query other logs, such as VPC Flow Logs and Amazon Route 53 DNS logs. You can also use Amazon Security Lake, which is generally available, to automatically centralize security data from AWS environments, SaaS providers, on-premises environments, and cloud sources into a purpose-built data lake stored in your account. To better understand which logs to collect and analyze as part of your incident response process, see Logging strategies for security incident response.

We recommended that you understand the playbook implementation described in this blog post before you expand the scope of your incident response solution. The running of queries and automation of containment are two elements to consider as you think about the next steps to evolve your incident response processes.

Conclusion

In this blog post, we showed how you can use Jupyter notebooks to simplify and standardize your incident response processes. You reviewed how to respond to a potential credential compromise incident using a Jupyter notebook style playbook. You also saw how this helps reduce the time to resolution and standardize the analysis and response. Finally, we presented several artifacts and recommendations showing how you can tailor this solution to meet your organization’s specific security needs. You can use this framework to evolve your incident response process.

Further resources

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on AWS re:Post or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Tim Manik

Tim Manik

Tim is a Solutions Architect at AWS working with enterprise customers in North America. He specializes in cybersecurity and AI/ML. When he’s not working, you can find Tim exploring new hiking trails, swimming, or playing the guitar.

Daria Pshonkina

Daria Pshonkina

Daria is a Solutions Architect at AWS supporting enterprise customers that started their business journey in the cloud. Her specialization is in security. Outside of work, she enjoys outdoor activities, travelling, and spending quality time with friends and family.

Bryant Pickford

Bryant Pickford

Bryant is a Security Specialist Solutions Architect within the Prototyping Security team under the Worldwide Specialist Organization. He has experience in incident response, threat detection, and perimeter security. In his free time, he likes to DJ, make music, and skydive.

Use IAM runtime roles with Amazon EMR Studio Workspaces and AWS Lake Formation for cross-account fine-grained access control

Post Syndicated from Ashley Zhou original https://aws.amazon.com/blogs/big-data/use-iam-runtime-roles-with-amazon-emr-studio-workspaces-and-aws-lake-formation-for-cross-account-fine-grained-access-control/

Amazon EMR Studio is an integrated development environment (IDE) that makes it straightforward for data scientists and data engineers to develop, visualize, and debug data engineering and data science applications written in R, Python, Scala, and PySpark. EMR Studio provides fully managed Jupyter notebooks and tools such as Spark UI and YARN Timeline Server via EMR Studio Workspaces. You can attach an EMR Studio Workspace to an EMR cluster, and use the compute power of the EMR cluster and run data science jobs on the cluster. Data is often stored in data lakes managed by AWS Lake Formation, enabling you to apply fine-grained access control through a simple grant or revoke mechanism.

We’re happy to introduce runtime roles for EMR Studio Workspaces. You can now define a runtime role and assign it to an EMR cluster when attaching an EMR Studio Workspace. The jobs on the EMR cluster will use this runtime role to access AWS resources. After configuring a runtime role, you can also use Lake Formation and apply fine-grained data access control for the jobs submitted by the EMR Studio Workspace.

Previously, when attaching EMR Studio Workspaces to EMR clusters, all Workspaces had to use the same AWS Identity and Access Management (IAM) role—namely, the cluster’s Amazon Elastic Compute Cloud (Amazon EC2) instance profile. Therefore, all Workspaces attached to the same EMR cluster had the same data access. To control access to data sources, each EMR Studio Workspace had to use a different EMR cluster, and multiple EMR instance profiles were needed.

Starting with the release of Amazon EMR 6.11, you can now choose a runtime role when attaching an EMR Studio Workspace to an EMR cluster. This runtime role scopes down access at the Workspace level. Your Apache Livy and Apache Spark jobs that run from the EMR Studio Workspaces will have permission to access only the data and resources permitted by policies attached to the runtime role. Also, when data is accessed from data lakes managed with Lake Formation, you can enforce fine-grained data access control using Lake Formation permissions. This helps you reduce operational overhead.

In this post, we demonstrate how to configure runtime roles for EMR Studio Workspaces and attach a Workspace to an EMR cluster with runtime roles. Because large enterprises typically use multiple AWS accounts, and many of those accounts might need access to a data lake managed by a single AWS account, our example uses two AWS accounts. We explain how to control access to EMR Studio runtime roles, manage data access across accounts in a data lake via Lake Formation, and enforce table-level and column-level permissions to the EMR runtime roles.

Solution overview

To demonstrate fine-grained access control, we create a sample AWS Glue database named company and manage the database permission in Lake Formation. The database consists of two separate tables:

  • employees – This table stores information about the company’s employees, including employee ID, name, department, and salary
  • products – This table stores information about the products sold by the company, including product ID, name, category, and price

To demonstrate data access control, we consider the following data users:

  • Alice, a data scientist in the sales team – She should have read-only access to all columns in the products table and selected columns, including uID, name, and department in the employees table
  • Bob, a data scientist in the human resources team – He should have read-only access to all columns in employees table and should not have access to the products table

To demonstrate cross-account data sharing, we consider two accounts:

  • Data producer account – We refer to this account as 123456789012 in this post. This account manages the raw data in Amazon Simple Storage Service (Amazon S3) and writes data to the data lake. The company database and tables should be in this account.
  • Data consumer account – We refer to this account as 111122223333 in this post. This account is accessed directly by the users for data analysis and doesn’t have write access to the data. This account should be accessible by Alice and Bob.

The architecture is implemented as follows:

  • The data producer account manages a data lake. Raw data is stored in S3 buckets and catalogued in the AWS Glue Data Catalog.
  • Lake Formation in the data producer account governs the data access via the Data Catalog, and provides cross-account data sharing with the data consumer account.
  • Lake Formation in the data consumer account governs cross-account access to the data lake at the table level and applies fine-grained Lake Formation permissions. For more information, refer to Methods for fine-grained access control.
  • EMR Studio Workspaces in the data consumer account use runtime roles when running jobs on an EMR cluster.
  • The EMR cluster connects to Glue Data Catalog in the data consumer account and queries the data from the data lake through cross-account data sharing.

The following diagram illustrates this architecture.

In the following sections, we go through the steps to share data across accounts via Lake Formation, run an EMR Studio Workspace with runtime roles, and demonstrate fine-grained access control.

Prerequisites

You should have the following prerequisites:

Create the infrastructure in the data producer account

Complete the following steps to create the infrastructure resources:

  1. Log in to the data producer AWS account (123456789012).
  2. Choose Launch Stack to deploy a CloudFormation template to create the necessary resources.
  3. For DataLakeBucketSuffix, enter the suffix for the S3 bucket used by the data lake. The whole S3 bucket name to be created will be {AwsAccountId}-{AwsRegion}-{DataLakeBucketSuffix}.
  4. After the CloudFormation stack is created, navigate to the Outputs tab of the stack and capture the value of DataLakeS3Bucket to use in the next step.

Create data files and upload them to Amazon S3 in the data producer account

Configure your AWS CLI to use the IAM identity with permission to upload to DataLakeS3BucketName in the data producer AWS account (123456789012), or you can sign in to CloudShell using the AWS Management Console. Complete the following steps:

  1. On your local machine, move to a directory of your choice with the cd command, for example, cd ~.
  2. Run the script with chmod 744 create_sample_data.sh && ./create_sample_data.sh <DataLakeS3BucketName>.

The script will create a subdirectory tmp in your current working directory, create the test data in CSV files, and upload the files to the DataLakeS3BucketName S3 bucket.

Set up Lake Formation in the data producer account

In this section, we walk through the steps to set up Lake Formation in the data producer account.

Set up Lake Formation cross-account data sharing version settings

Lake Formation supports multiple data sharing versions. For this post, we use version 3. To learn more about the differences between data sharing versions, refer to Updating cross-account data sharing version settings. To change the data sharing version, see To enable the new version.

Register the Amazon S3 location as the data lake location

When you register an Amazon S3 location with Lake Formation, you specify an IAM role with read/write permissions on that location. After registering, when EMR clusters request access to this Amazon S3 location, Lake Formation will supply temporary credentials of the provided role to access the data. We already created the role LakeFormationCompanyDatabaseDataAccessRole for this purpose in the previous step. To register the Amazon S3 location as the data lake location, complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data producer account (123456789012).
  2. In the navigation pane, choose Data lake locations under Administration.
  3. Choose Register location.
  4. For Amazon S3 path, enter s3://<DataLakeS3BucketName>/company-database.
  5. For IAM role, enter LakeFormationCompanyDatabaseDataAccessRole.
  6. For Permission mode, select Lake Formation.
  7. Choose Register location.

Register data location
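
The same registration can be scripted if you prefer. Here is a boto3 sketch under the assumptions that the role sits at the default IAM path and that you substitute your own bucket name; it mirrors the console steps above rather than replacing them.

    import boto3

    lf = boto3.client("lakeformation")

    # Register the data lake location with the data access role created earlier.
    # Replace <DataLakeS3BucketName> with the bucket created in the producer account.
    lf.register_resource(
        ResourceArn="arn:aws:s3:::<DataLakeS3BucketName>/company-database",
        RoleArn="arn:aws:iam::123456789012:role/LakeFormationCompanyDatabaseDataAccessRole",
        UseServiceLinkedRole=False,
    )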

Revoke permissions granted to IAMAllowedPrincipals

The IAMAllowedPrincipals group includes any IAM users and roles that are allowed access to your Data Catalog resources by your IAM policies. To enforce the Lake Formation model, we need to revoke permission from IAMAllowedPrincipals using the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data producer account.
  2. In the navigation pane, choose Data lake permissions under Permissions.
  3. Filter permissions by Database = company and Principal = IAMAllowedPrincipals.
  4. Select all the permissions given to the principal IAMAllowedPrincipals and choose Revoke.

Revoke permissions granted to IAMAllowedPrincipals

Set up application integration settings

To enforce permissions for the EMR cluster, you need to register a session tag value with Lake Formation. Lake Formation uses this session tag to authorize callers and provide access to the data lake. We register Amazon EMR as the session tag value. This value will be referenced in the security configuration when creating the EMR cluster.

Set up the session tag using the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data producer account.
  2. Choose Application integration settings under Administration in the navigation pane.
  3. Select Allow external engines to filter data in Amazon S3 locations registered with Lake Formation.
  4. For Session tag values, enter Amazon EMR.
  5. For AWS account IDs, enter the data consumer AWS account ID (111122223333).
  6. Choose Save.

Set up application integration settings in data producer account

Share the database and tables to the data consumer account

We now grant permissions to the data consumer AWS account, including grantable permissions. This allows the Lake Formation data lake administrator in the data consumer account to control access to the data within the account.

Grant database permissions to the data consumer account

Complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data producer account.
  2. In the navigation pane, choose Databases.
  3. Select the database company, and on the Actions menu, under Permissions, choose Grant.
  4. In the Principals section, select External accounts and enter the data consumer AWS account (111122223333).
  5. In the LF-Tags or catalog resources section, choose company for Databases.
  6. In the Database permissions section, select Describe for both Database permissions and Grantable permissions.

This allows the data lake administrator in the data consumer account to describe the database and grant describe permissions to other principals in the data consumer account.

  7. Choose Grant.

Grant database permissions to the data consumer account

Grant table permissions to the data consumer account

Complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data producer account.
  2. In the navigation pane, choose Tables.
  3. Select the products table, which belongs to the company database, and on the Actions menu, under Permissions, choose Grant.
  4. In the Principals section, select External accounts and enter the data consumer AWS account ID (111122223333).
  5. In the LF-Tags or catalog resources section, select Named data catalog resources and specify the following:
    1. For Databases, choose company.
    2. For Tables, choose products and employees.
  6. In the Table permissions section, choose Select and Describe for both Table permissions and Grantable permissions.

This allows the data lake administrator in the data consumer account to select and describe the tables, and grant select and describe table permissions to other principals in the data consumer account.

  7. In the Data permissions section, select All data access.
  8. Choose Grant.
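
If you manage these grants with code rather than the console, a boto3 sketch equivalent to the table grant above might look like the following. It assumes the sample account IDs and table names from this walkthrough.

import boto3

lakeformation = boto3.client("lakeformation", region_name="us-east-1")

# Grant SELECT and DESCRIBE on both tables to the data consumer account,
# including grantable permissions so its data lake administrator can
# re-grant them to principals in that account.
for table_name in ["products", "employees"]:
    lakeformation.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": "111122223333"},
        Resource={"Table": {"DatabaseName": "company", "Name": table_name}},
        Permissions=["SELECT", "DESCRIBE"],
        PermissionsWithGrantOption=["SELECT", "DESCRIBE"],
    )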

Grant table permissions to the data consumer account
Now we have finished setting up the data producer account.

Set up the infrastructure in the data consumer account

Complete the following steps to create the infrastructure resources:

  1. Log in to the data consumer account (111122223333).
  2. Choose Launch stack to deploy a CloudFormation template to create the necessary resources.
  3. For Release Label, enter the Amazon EMR release label to use, which must be emr-6.11 or higher.
  4. For InstanceType, choose the instance type for the EMR cluster, such as r4.4xlarge.
  5. For EMRS3BucketNameSuffix, enter the S3 bucket suffix to store EMR cluster logs and EMR notebook files. The full name of the S3 bucket to be created will be {AWSAccountId}-{AWSRegion}-{EMRS3BucketNameSuffix}.
  6. For S3PathToInTransitCertificate, enter the S3 path for the .zip file that contains the .pem files used for in-transit encryption.

For instructions on creating the .zip file that contains the .pem files and uploading them to your S3 bucket, refer to Providing certificates for encrypting data in transit with Amazon EMR encryption.

  7. After the CloudFormation stack is created, navigate to the Outputs tab of the stack.
  8. Capture the value of EMRStudioLink to use to sign in to EMR Studio.

Accept the resource share in the data consumer account

To access shared resources, you must accept the invitation first.

  1. Open the AWS RAM console of the data consumer account with the IAM identity that has AWS RAM access.
  2. In the navigation pane, choose Resource shares under Shared with me.

You should see two pending resource shares from the data producer account.

  3. Accept both resource shares.

You should see the company database, employees table, and products table in the Data Catalog.
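
If you prefer to accept the invitations programmatically, a minimal boto3 sketch might look like the following; it accepts every pending resource share invitation in the data consumer account.

import boto3

ram = boto3.client("ram", region_name="us-east-1")

# List resource share invitations and accept any that are still pending.
invitations = ram.get_resource_share_invitations()["resourceShareInvitations"]

for invitation in invitations:
    if invitation["status"] == "PENDING":
        ram.accept_resource_share_invitation(
            resourceShareInvitationArn=invitation["resourceShareInvitationArn"]
        )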

Set up Lake Formation in the data consumer account

In this section, we walk through the steps to set up Lake Formation in the data consumer account.

Set up application integration settings

Similar to the setup in the data producer account, you need to register Amazon EMR as a session tag value. This value is referenced in the security configuration when creating the EMR cluster in the CloudFormation stack.

To do that, complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data consumer account (111122223333).
  2. Choose Application integration settings under Administration in the navigation pane.
  3. Select Allow external engines to filter data in Amazon S3 locations registered with Lake Formation.
  4. For Session tag values, enter Amazon EMR.
  5. For AWS account IDs, enter the data consumer AWS account ID (111122223333).
  6. Choose Save.

Set up application integration settings in data consumer account

Grant describe permissions to runtime roles on the default database

If you don’t have a default database in Lake Formation, or your default database already has permissions granted to IAMAllowedPrincipals, you can skip this step.

Amazon EMR checks the default database by default. If you already have a default database in Lake Formation, grant the describe permission to the runtime roles on the default database by completing the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator user in the data consumer account.
  2. In the navigation pane, choose Databases.
  3. Select the default database, verify that the owner account ID is the data consumer account (111122223333), and on the Actions menu, choose Grant.
  4. In the Principals section, select IAM users and roles.
  5. For IAM users and roles, choose sales-runtime-role and human-resource-runtime-role.
  6. For LF-Tags or catalog resources, select Named data catalog resources and choose default for Databases.
  7. In the Database permissions section, for Database permissions, choose Describe.
  8. Choose Grant.

Grant describe permissions to runtime roles on the default database

Create a resource link for the shared database

To access the database and table resources that were shared by the data producer AWS account, you need to create a resource link in the data consumer AWS account. A resource link is a Data Catalog object that is a link to a local or shared database or table. After you create a resource link to a database or table, you can use the resource link name wherever you would use the database or table name. In this step, you grant permissions on the resource link to the runtime role principals. The runtime roles then access the data in the shared database and underlying tables through the resource link.

To create a resource link, complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data consumer account.
  2. In the navigation pane, choose Databases.
  3. Select the company database, verify that the owner account ID is the data producer account (123456789012), and on the Actions menu, choose Create Resource links.
  4. For Resource link name, enter the name of the resource link (for example, company-shared).
  5. For Shared database’s region, choose the Region of the company database.
  6. For Shared database, choose the company database.
  7. For Shared database’s owner ID, enter the account ID of the data producer account (123456789012).
  8. Choose Create.
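
Because a resource link is a Data Catalog object, you can also create it with the AWS Glue API. The following boto3 sketch creates the same company-shared link; the account ID is this post's sample data producer account.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a resource link named "company-shared" that points to the "company"
# database shared by the data producer account.
glue.create_database(
    DatabaseInput={
        "Name": "company-shared",
        "TargetDatabase": {
            "CatalogId": "123456789012",  # data producer account
            "DatabaseName": "company",
        },
    }
)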

Create a resource link for the shared database

Grant permissions on the resource link to the runtime role principals

Grant permissions on the resource link to sales-runtime-role and human-resource-runtime-role using the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data consumer account.
  2. In the navigation pane, choose Databases.
  3. Select the resource link (company-shared) and on the Actions menu, choose Grant.
  4. In the Principals section, select IAM users and roles, and choose sales-runtime-role and human-resource-runtime-role.
  5. In the LF-Tags or catalog resources section, for Databases, choose company-shared.
  6. In the Resource link permissions section, select Describe.

This allows the runtime roles to describe the resource link. We don’t make any selections for grantable permissions because runtime roles shouldn’t be able to grant permissions to other principals.

  7. Choose Grant.

Grant permissions on the resource link to the runtime role principals

Grant permissions on the tables to the runtime role principals

You need to grant permissions on the tables to sales-runtime-role and human-resource-runtime-role to allow data access:

  • The human-resource-runtime-role should have describe and select permissions on all columns in the employees table, and no permissions on the products table.
  • The sales-runtime-role should have select permissions on the uid, name, and department columns in the employees table, and describe and select permissions on all columns in the products table.

Grant permission on the employees table to human-resource-runtime-role

Complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data consumer account.
  2. In the navigation pane, choose Databases.
  3. Select the resource link (company-shared) and on the Actions menu, choose Grant on Target.
  4. In the Principals section, select IAM users and roles, then choose human-resource-runtime-role.
  5. In the LF-Tags or catalog resources section, select Named data catalog resources and specify the following:
    1. For Databases, choose company.
    2. For Tables, choose employees.
  6. In the Table permissions section, for Table permissions, select Describe and Select.
  7. In the Data permissions section, select All data access.
  8. Choose Grant.

Grant permission on the employees table to human-resource-runtime-role

Grant permission on the employees table to sales-runtime-role

Complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data consumer account.
  2. In the navigation pane, choose Databases.
  3. Select the resource link (company-shared) and on the Actions menu, choose Grant on Target.
  4. In the Principals section, select IAM users and roles, then choose sales-runtime-role.
  5. In the LF-Tags or catalog resources section, select Named data catalog resources and specify the following:
    1. For Databases, choose company.
    2. For Tables, choose employees.
  6. In the Table permissions section, for Table permissions, select Select.
  7. In the Data permissions section, select Column-based access.
  8. Select Include columns and choose the uid, name, and department columns.
  9. Choose Grant.
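
The column-level grant above can also be expressed with the Lake Formation API. The following boto3 sketch is one way to do it, using a TableWithColumns resource scoped to the shared table owned by the data producer account; the ARNs and names are this post's sample values.

import boto3

lakeformation = boto3.client("lakeformation", region_name="us-east-1")

# Column-level grant: sales-runtime-role may only select the uid, name, and
# department columns of the employees table owned by the producer account.
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/sales-runtime-role"
    },
    Resource={
        "TableWithColumns": {
            "CatalogId": "123456789012",  # owner (data producer) account
            "DatabaseName": "company",
            "Name": "employees",
            "ColumnNames": ["uid", "name", "department"],
        }
    },
    Permissions=["SELECT"],
)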

 Grant permission on the employees table to sales-runtime-role

Grant permission on the products table to sales-runtime-role

Complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data consumer account.
  2. In the navigation pane, choose Databases.
  3. Select the resource link (company-shared) and on the Actions menu, choose Grant on Target.
  4. In the Principals section, select IAM users and roles, then choose sales-runtime-role.
  5. In the LF-Tags or catalog resources section, select Named data catalog resources and specify the following:
    1. For Databases, choose company.
    2. For Tables, choose products.
  6. In the Table permissions section, for Table permissions, select Select and Describe.
  7. In the Data permissions section, select All data access.
  8. Choose Grant.

Grant permission on the products table to sales-runtime-role

Log in to EMR Studio and use the EMR Studio Workspace

Switch your role to alice-role or bob-role on the console using different web browsers to test access. Open the EMRStudioLink URL from the CloudFormation stack output to sign in to the EMR Studio with each role, then complete the following steps:

  1. Choose Workspaces in the navigation pane and choose Create Workspace.
  2. Enter a name and a description for the Workspace.
  3. Choose Create Workspace.

A new tab containing JupyterLab will open automatically when the Workspace is ready. Enable pop-ups in your browser if necessary.

  4. Choose the Compute icon in the navigation pane to attach the EMR Studio Workspace to a compute engine.
  5. Select EMR cluster on EC2 for Compute type.
  6. Choose the EMR cluster ID you created with AWS CloudFormation.
  7. For Runtime role, choose sales-runtime-role if signed in as alice-role. Choose human-resource-runtime-role if signed in as bob-role.
  8. Choose Attach.

attach EMR Studio Workspace to cluster

Run code in the EMR Studio Workspace and verify data access

Run the following code in the EMR Studio Workspace with a PySpark kernel after signing in with alice-role or bob-role:

%%sql -o result -n -1
select * from `company-shared`.products limit 5;

%%sql -o result -n -1
select * from `company-shared`.employees limit 5;

You should see different results when using different roles.

According to our data access configuration in Lake Formation, Alice has full data access to the products table. In the employees table, she can view all the columns except salary.

Alice (sales) query result

According to our data access configuration in Lake Formation, Bob has full data access to the employees table, but no access to the products table.

Bob (human resource) query result

Clean up

When you’re finished experimenting with this solution, clean up your resources:

  1. Stop and delete the EMR Studio Workspaces created in the data consumer AWS account.
  2. Delete all the content in the S3 bucket EMRS3Bucket in the data consumer AWS account.
  3. Delete the CloudFormation stack in the data consumer AWS account.
  4. Delete all the content in the S3 bucket DataLakeS3Bucket in the data producer AWS account.
  5. Delete the CloudFormation stack in the data producer AWS account.

Conclusion

This post showed how you can use runtime roles to connect to an EMR Studio Workspace with Amazon EMR to apply cross-account fine-grained data access control with Lake Formation. We also demonstrated how multiple EMR Studio users can connect to the same EMR cluster, each using a runtime role scoped with permissions matching their individual level of access to data.

To learn more about using EMR Studio Workspaces with Lake Formation, refer to Run an EMR Studio Workspace with a runtime role. We encourage you to try out this new functionality, and connect with us if you have any questions or feedback!


About the Authors

Ashley Zhou is a Software Development Engineer at AWS. She is interested in data analytics and distributed systems.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building analytics and data mesh solutions on AWS and sharing them with the community.

Build an entitlement service for business applications using Amazon Verified Permissions

Post Syndicated from Abdul Qadir original https://aws.amazon.com/blogs/security/build-an-entitlement-service-for-business-applications-using-amazon-verified-permissions/

Amazon Verified Permissions is designed to simplify the process of managing permissions within an application. In this blog post, we aim to help customers understand how this service can be applied to several business use cases.

Companies typically use custom entitlement logic embedded in their business applications. This is the most common approach, and it involves writing custom code to manage user access permissions. We’ll explore the common challenges faced by application developers and access administrators when handling user access permissions in an application and how Verified Permissions can help you solve these challenges. We’ll provide an integration guide for incorporating Verified Permissions into an entitlement service, specifically for use cases such as payment management. Finally, we’ll discuss the advantages of using a granular, adaptable, and externally managed access control system.

This blog post will provide a comprehensive and centralized approach to managing access policies, reducing administrative overhead, and empowering line-of-business users to define, administer, and enforce application entitlement policies.

Challenges of building an entitlement system

Entitlements refer to the rules that determine what each user can or cannot do within an application. Figure 1 shows the architecture of a common entitlement system, with components embedded in applications and entitlements stored in multiple data stores.

Figure 1: Typical entitlement system

Creating your own permissions management system can be resource-intensive, requiring time and expertise to ensure its effectiveness. Enterprises face many issues when building a custom entitlement management system, such as complexity, security risks, performance, and lack of scalability. Let’s delve into these issues in detail.

  • Data complexity – Entitlement decisions are often based on complex data relationships, such as user roles, group membership, and product permissions. Managing this complexity can be challenging, especially in a large organization with a lot of users, groups, and products.
  • Compliance and security – Building an entitlement system requires careful consideration of compliance regulations and security best practices. You need to protect user data, implement secure communication protocols, and handle potential security vulnerabilities.
  • Scalability – Permissions management systems must scale to handle large number of users and transactions. This can be a challenge, especially if the service is used to control access to critical resources.
  • Performance and availability – Entitlement services need to be performant, because they are often used to make real-time decisions. Additionally, they need to be reliable and consistent, so that users can be confident that their entitlements are accurate.

Architecting an entitlement service using Amazon Verified Permissions

Amazon Verified Permissions is a scalable permissions management and fine-grained authorization service that helps you build and modernize applications without relying heavily on coding authorization within your applications.

Let’s discuss how you can use Verified Permissions to manage entitlements.

Creating and deploying policies

Verified Permissions uses Cedar, a policy language that allows developers to express permissions as policies that permit or forbid specific user actions. A central policy-based authorization system gives developers a consistent way to define and manage fine-grained authorization across applications, simplifies changing permission rules without a need to change code, and improves visibility by moving permissions out of the code.

By using Verified Permissions, you can create specific permission policies that incorporate characteristics of role-based access control (RBAC) and attribute-based access control (ABAC). This approach enables you to implement granular controls while prioritizing the principle of least privilege.

Use case 1: Mary, who works as a clerk, can submit, update, and list payments. Her role within the payment management system allows for multiple actions, and the policy for this role can be defined as follows.

permit (
    principal,
    action in [
        PaymentManager::Action::"SubmitPayment",
        PaymentManager::Action::"UpdatePayment",
        PaymentManager::Action::"ListPayment"
    ],
    resource
)
when { principal.role == "clerk" };

In contrast, Shirley is an auditor, with access that only allows her to list payments. The policy for this role is as follows.

permit (
    principal,
    action in [PaymentManager::Action::"ListPayment"],
    resource
)
when { principal.role == "auditor" };

The payment system will pass the principal, action, resource, and the entity data to Verified Permissions. If the user information is not explicitly defined within the application, the payment system must retrieve it from data stores such as an identity provider or database.

Following that, Verified Permissions evaluates relevant policies by assembling policies that affect the calling principal and the resource in question to make a decision on whether the action should be permitted or denied. Once a decision is made, it is conveyed back to the application, which can then enforce the decision.
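
A minimal sketch of that request/response flow using the AWS SDK for Python (boto3) might look like the following. The policy store ID is a placeholder, and the entity types and attribute names assume a schema similar to the Cedar policies shown above rather than values from this post.

import boto3

avp = boto3.client("verifiedpermissions")

# Ask Verified Permissions whether Mary (role: clerk) may submit a payment.
# The application passes the principal, action, resource, and entity data;
# Verified Permissions evaluates the relevant Cedar policies and returns a decision.
response = avp.is_authorized(
    policyStoreId="PSEXAMPLEabcdefg111111",  # placeholder policy store ID
    principal={"entityType": "PaymentManager::User", "entityId": "Mary"},
    action={"actionType": "PaymentManager::Action", "actionId": "SubmitPayment"},
    resource={"entityType": "PaymentManager::Payment", "entityId": "payment-001"},
    entities={
        "entityList": [
            {
                "identifier": {"entityType": "PaymentManager::User", "entityId": "Mary"},
                "attributes": {"role": {"string": "clerk"}},
            }
        ]
    },
)

print(response["decision"])  # "ALLOW" or "DENY"

The application then enforces the returned decision, for example by rejecting the request when the decision is "DENY".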

As you can see in Figure 2, Mary has access to submit a payment because she has the role of “clerk” and the policy shown earlier permits this action.

Figure 2: Using the test bench to test if Mary can submit payment

Shirley can’t submit a payment based on her role as an “auditor” and the action is denied, as shown in Figure 3.

Figure 3: Using the test bench to test if Shirley can submit payment

However, she can list the payments, as the policy shown earlier permits this action, as shown in Figure 4.

Figure 4: Using the test bench to test if Shirley can list payments

Use case 2: Using the payment system application, CFO Jane delegates access for a high-value account, 111222333, to John, VP of Finance, during her vacation by creating a policy from a template. This gives John permission to approve payments on the account without Jane’s direct presence.

Policy template for approving payment: Figure 5 shows a sample policy template to approve payment. Policies created by using this template, like the one following, will provide the principal with the ability to approve payments for the resource.

permit (
    principal == ?principal,
    action in [PaymentManager::Action::"ApprovePayment"],
    resource == ?resource
);
Figure 5: Creating a policy template

Create the policy from the template: Figure 6 shows the policy created by using the preceding template. The parameters that you have to pass are the principal and resource information. For this use case, the principal is “John” and the resource is the account “111222333”, enabling John to approve payment for the account. (AWS recommends using a universally unique identifier (UUID) for the principal, but “John” is used in this blog post to make it more readable.)
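
The same template-linked policy can also be created with the API. The following boto3 sketch is one possible example; the policy store ID and policy template ID are placeholders, and the entity types assume the same hypothetical schema as the earlier sketch.

import boto3

avp = boto3.client("verifiedpermissions")

# Instantiate the "approve payment" template for John on account 111222333.
avp.create_policy(
    policyStoreId="PSEXAMPLEabcdefg111111",  # placeholder policy store ID
    definition={
        "templateLinked": {
            "policyTemplateId": "PTEXAMPLEabcdefg111111",  # placeholder template ID
            "principal": {"entityType": "PaymentManager::User", "entityId": "John"},
            "resource": {"entityType": "PaymentManager::Account", "entityId": "111222333"},
        }
    },
)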

Figure 6: Creating a policy from template

Evaluate the policy: As expected, John is granted access to approve payment for the account 111222333, as shown in Figure 7.

Figure 7: Using the test bench to test if John can approve payment

Building an entitlement service with Verified Permissions

Verified Permissions enables you to build an entitlement service by externalizing authorization and centralizing policy management and administration. It allows you to tailor access control to your specific application requirements while leveraging the underlying entitlement management provided by Verified Permissions.

Integrating an existing entitlement service with Verified Permissions

Let’s look at how you can integrate an existing entitlement service with Verified Permissions, as shown in Figure 8. In this diagram, the underlying implementation of the entitlement service uses the standard enterprise technology stack. Amazon DynamoDB is used to store the user and role information.

Figure 8: Integrating an entitlement service with Verified Permissions

Here’s an approach you can use to seamlessly integrate your existing entitlement service with Verified Permissions:

  1. Identify permissions: Begin by assessing your existing entitlement service to identify the permissions it currently uses, different roles, actions, and resources. Compile a detailed list of the permissions along with their respective purposes.
  2. Formulate policies: Map the permissions identified for each use case in the previous step into policies. You can use both inline policies and policy templates. In the AWS Management Console, use the Verified Permissions test bench to evaluate the policies you’ve drafted.
  3. Create policies: Depending on your business needs, create one or more policy stores within Verified Permissions. Create the policies within these policy stores. This is a one-time task and we recommend using automation to accomplish it.
  4. Update entitlement service: Use your entitlement service’s existing interface to create logic that transforms the current request payload into the format that Verified Permissions’ authorization request expects. You might need to identify and incorporate missing parameters into the existing interfaces. Apply the same transformation logic to the response payload. Refer to this documentation for the Verified Permissions authorization request and response format.
  5. Integrate with Verified Permissions: Use the Verified Permission API or AWS SDK to integrate the entitlements service with Verified Permissions. This involves tasks such as fetching the user role from Amazon DynamoDB, making authorization requests to Verified Permissions, and processing the resulting responses.
  6. Testing: Thoroughly test your service after making the permission changes. Verify that all functionalities are working as expected and that the policies in Verified Permissions are being utilized correctly.
  7. Deployment: After your service passes the review process, roll out the updated entitlement service along with the integrated Verified Permissions functionality.
  8. Monitor and maintain: Following deployment, continuously monitor the performance and gather feedback. Be prepared to make further adjustments if necessary.
  9. Documentation and support: Provide comprehensive documentation for developers who will use your entitlement service. Clearly explain the available endpoints, the request and response formats, and the authorization requirements.

You can use a similar approach to integrate your existing entitlement service with other third-party permission management systems.

Building a new entitlement service in AWS using Amazon Verified Permissions

The reference architecture in Figure 9 shows how to build a new entitlement service using Verified Permissions. AWS customers already use Amazon Cognito for simple, fast authentication. With Amazon Verified Permissions, customers can also add simple, fast authorization to their applications by adding user profile attributes to the identity token generated by Amazon Cognito.

Figure 9: Entitlement service using Verified Permissions

The workflow in the diagram is as follows:

  1. The user signs in to the application by using Amazon Cognito.
  2. If the authentication is successful, the pre-token generation Lambda function will be invoked.
  3. You can use the pre-token generation Lambda function to customize an identity token before Amazon Cognito generates it. In this case, the trigger is used to add the user profile attributes as new claims in the identity token (see the sketch after this list).
    1. The user profile attributes are retrieved from Amazon DynamoDB.
    2. The attributes are then added as new claims in the identity token.
  4. After the user is signed in, they request access to the protected resource in the application through Amazon API Gateway.
  5. Amazon API Gateway initiates an authorization check using a Lambda authorizer. A Lambda authorizer is a feature of API Gateway that allows you to implement a custom authorization scheme using the identity token generated by Amazon Cognito.
  6. The Lambda authorizer validates, decodes, and retrieves the user profile attributes from the identity token.
  7. The Lambda authorizer calls the Verified Permission authorization API and passes the principal, action, resource, and user profile attributes as entities.
  8. Based on the decision returned by Verified Permissions, the user is permitted or denied access to the resource.
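
As an illustration of step 3, a minimal pre-token generation Lambda function might look like the following sketch. The DynamoDB table name, key, and attribute names are hypothetical, and the claims shape follows the version 1 pre-token generation trigger event.

import boto3

dynamodb = boto3.resource("dynamodb")
profile_table = dynamodb.Table("UserProfiles")  # hypothetical table name


def lambda_handler(event, context):
    # Look up the signed-in user's profile attributes in DynamoDB.
    username = event["userName"]
    profile = profile_table.get_item(Key={"username": username}).get("Item", {})

    # Add the attributes as new claims in the identity token that
    # Amazon Cognito is about to generate.
    event["response"] = {
        "claimsOverrideDetails": {
            "claimsToAddOrOverride": {
                "role": profile.get("role", ""),
                "department": profile.get("department", ""),
            }
        }
    }
    return event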

Common pitfalls of using an entitlement service

Entitlement services can be tricky, but there are a few common mistakes you can avoid to make them more secure and simpler to use:

  • Entitlement service misconfigurations can create security vulnerabilities and lead to data breaches. It is important to carefully configure the entitlement service and to regularly review policies to verify that they are correct and up-to-date.
  • When you first start using an entitlement service, it’s easy to give users too many permissions. This can make your application less secure and harder to manage. It’s important to give users only the permissions they need to do their jobs.
  • Users need to be trained on how to use the entitlement service correctly, especially when it comes to requesting and managing permissions. If users don’t know how to do these tasks appropriately, they could make mistakes that could leave your system vulnerable.

Conclusion

Amazon Verified Permissions is a comprehensive solution for businesses looking to manage granular access control, flexible authorization, and externalized access control. With this service, organizations can quickly and conveniently apply new policies across their environment, streamlining user management processes and helping to improve overall security. This post has highlighted the many benefits of using Verified Permissions for entitlement management within an application. We hope it has been helpful in understanding how you can apply this service to your business use cases.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Abdul Qadir

Abdul is a Solutions Architect based in New York. He designs and architects solutions for independent software vendor (ISV) customers and helps customers in their cloud journeys. He’s been working in the financial and insurance industries and helping companies with digital transformation and modernizing their legacy systems.

Arun Sivaraman

Arun is a Boston-based Solutions Architect. He enjoys working with customers to create innovative solutions and supporting their digital transformation.

How to use chaos engineering in incident response

Post Syndicated from Kevin Low original https://aws.amazon.com/blogs/security/how-to-use-chaos-engineering-in-incident-response/

Simulations, tests, and game days are critical parts of preparing and verifying incident response processes. Customers often face challenges getting started and building their incident response function as the applications they build become increasingly complex. In this post, we will introduce the concept of chaos engineering and how you can use it to accelerate your incident response preparation and testing processes.

Why chaos engineering?

Chaos engineering is a formalized approach that uses fault injection experiments to create real-world conditions needed to understand how your system will react to unknowns and build confidence in the system’s resiliency and security.

Modern applications can have multiple components, including web, API, application, and data persistence layers. To respond to potential security events, you must understand the failure scenarios across each component and their downstream impacts. One challenge is that creating incident response processes and playbooks for components in a silo doesn’t consider known unknowns—how these components interact with each other—and can’t reveal unknown unknowns such as second-order effects during a security event.

As an example, consider the personalization microservice shown in Figure 1.

The microservice relies on two Amazon Elastic Compute Cloud (Amazon EC2) instances that are deployed in an auto scaling group across two Availability Zones. An upstream data collection microservice sends data for the personalization microservice to process. In addition, a downstream website microservice takes the personalized data and displays it to customers.

Figure 1: Architecture of the personalization microservice

Now imagine that unexpected activity occurred on an EC2 instance. The instance started to query a domain name that’s associated with cryptocurrency-related activity. A first set of unknowns already emerges:

  • Can your detective controls detect the activity on the instance?
  • How long do they take to do so?
  • How long does it take your security team to be notified?
  • Does the security team know what to do?
  • Does the notification have all the information that the team needs to respond?
  • Is there an existing automated response to other stakeholders?

Security professionals may not consider all of these questions when building and designing their threat detection and incident response capabilities.

In our example, Amazon GuardDuty is able to detect the unexpected activity and generates the CryptoCurrency:EC2/BitcoinTool.B!DNS finding within 15 minutes. The security team takes a snapshot for further forensics before the instance is isolated, as shown in Figure 2.

Figure 2: Architecture after GuardDuty detects unexpected activity and the security team isolates the EC2 instance

Although this might seem like an adequate response in isolation, it leads to more questions.

From a security perspective:

  • What other logs do we need for further investigation?
  • Do we know if the credentials need to be rotated and what impact that will have on the workload?
  • Should other parts of the system be replaced or restarted?

From an operational perspective:

  • Do any of the (manual or automated) incident response processes impact the performance of the workload?
  • Can the remaining instance handle the traffic before the auto scaling group creates another instance?
  • If there is increased latency or failure of the microservice, how will the data collection and website microservices react to it?

Creating detection and incident response plans in isolation doesn’t consider the second-order effects that could have an impact on the integrity and availability of the system.

How can chaos engineering help?

Chaos engineering is a formalized process that can help solve this problem. It creates failure in a controlled environment with well-defined experiments to generate data on system behavior during a simulated event. You can use this data to improve incident response processes and make proactive changes that improve the security of your workloads. By using chaos engineering, developer and security teams can reveal additional unknowns and understand areas of opportunity to improve incident response processes and workload availability.

Chaos engineering has five phases—steady state, hypothesis, design and run the experiment, learn and verify, and improve and fix—which we’ll discuss in more detail next.

Steady state 

The first phase involves an understanding of the behavior and configuration of the system under normal conditions. Instead of focusing on the internal attributes of the system, you should focus on an output metric or indicator that ties operational metrics and customer experience together. Including these output metrics in your hypothesis helps you collect data on security events and understand how these events and your response to them impact business outcomes.

Returning to our earlier example, this could be the latency when a user attempts to retrieve personalized information. This output is critical to the customer experience and relies on multiple operational metrics.

In addition, two key metrics in incident response are time to detect (TTD) and time to remediate (TTR). These metrics help capture how effectively your team has responded to the security event.

By defining your steady state, you can detect deviations from that state and determine if your system has fully returned to the known good state. You should identify the relevant metrics to measure your system and make these metrics simple for engineers to consume.

Using AWS, you can collect logs from the different services that you use in a workload, such as Amazon VPC Flow Logs, Amazon CloudWatch log groups, and AWS CloudTrail. For more details about the different log sources, see Logging strategies for security incident response.

Hypothesis

After you understand the steady state behavior, you can write a hypothesis about it. Security hypotheses can take the following form:

When _________ happens, ________ system will notify the team within _______ and the application’s metric _________ will remain at ________.

It can be challenging to decide what should happen. Chaos engineering recommends that you choose real-world events that are likely to occur and that will impact the user experience. Get your team to brainstorm. For security issues, this is an ideal time to use your threat model as the starting point for discussions. Starting with one of your identified threats and then running experiments based on that threat can help you test both your processes and automation.

After you’ve chosen your component, decide which variable to influence or what could happen in your complex system. For example, a misconfigured Amazon Simple Storage Service (Amazon S3) bucket or an open database port could lead to unintended exposure of customer data. A software flaw in your application could lead to the misuse of resources by an unauthorized user.

Here are a few examples of hypotheses:

  • If port 22 permits unrestricted access on a security group, AWS Config will detect it, run an automation to remove the security group rule, and notify the security team through Slack within 5 minutes, and the application’s latency will remain at 0.005 seconds.
  • If malware is run on an EC2 instance, Amazon GuardDuty will detect it within 15 minutes and notify the security team. Remediation playbooks will not affect the application’s error rate of 1 error for every thousand requests.

Design and run the experiment 

The next phase is to run the experiment. You don’t need to run experiments in production right away. A great place to get started with chaos engineering is the staging environment. One benefit of the AWS Cloud is that you can configure your staging environment to be identical to production. This increases the value of using an approach like chaos engineering before you get to production. By running experiments in staging, you can see how your system will likely react in production while earning trust within your organization.

As you gain confidence, you can begin running experiments in production. Because you configured staging to be identical to production, the risk of this transition is mitigated.

You can use AWS Fault Injection Simulator (FIS), our fully managed service for running fault injection experiments. FIS supports multiple fault injection actions, such as injecting API errors, restarting instances, running scripts on instances, disrupting network connectivity, and more. For the full list, see the FIS actions reference.

Although FIS doesn’t support security-related actions out of the box, you can use FIS to run AWS Systems Manager Automation documents that can run AWS APIs and scripts to simulate security events. To learn how to set up FIS to run a Systems Manager document that turns off bucket-level block public access for a randomly-selected S3 bucket, see the workshop Chaos Kitty – Gamifying Incident Response with Chaos Engineering. To learn how to set up FIS to run experiments that simulate events such as an RDP brute force event, lateral movement, cryptocurrency mining, and DNS data exfiltration, see the workshop Validating security guardrails with Chaos Engineering.

During this phase, you must understand the scope of impact of your experiment and work to minimize it. If an Amazon CloudWatch alarm goes into an alarm state, FIS can automatically stop the experiment. You should have a plan to return the environment to the steady state if the experiment has an unintended impact.
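
To make this concrete, the following boto3 sketch shows one possible experiment template that runs a Systems Manager Automation document to simulate a security event and stops automatically if a CloudWatch alarm fires. The role ARN, Automation document ARN, and alarm ARN are hypothetical placeholders, not values from this post.

import boto3

fis = boto3.client("fis")

# An experiment that runs an SSM Automation document simulating the event,
# with a CloudWatch alarm as a stop condition to limit the scope of impact.
fis.create_experiment_template(
    clientToken="simulate-public-s3-bucket-1",
    description="Simulate a misconfigured S3 bucket and observe detection and response",
    roleArn="arn:aws:iam::111122223333:role/fis-experiment-role",  # hypothetical role
    actions={
        "runSimulation": {
            "actionId": "aws:ssm:start-automation-execution",
            "parameters": {
                "documentArn": "arn:aws:ssm:us-east-1:111122223333:document/SimulatePublicS3Bucket",
                "maxDuration": "PT10M",
            },
        }
    },
    stopConditions=[
        {
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:AppLatencyHigh",
        }
    ],
    tags={"purpose": "security-chaos-experiment"},
)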

As you run your experiment, remember to document the key metrics and human responses, such as whether incident responders were confident, knew where to find the correct resources, or were aware of the escalation points.

Learn and verify 

The next step is to analyze and document the data to understand what happened. Lessons learned during the experiment are critical and should promote a culture of support instead of blame.

Here are some questions that you should address:

  1. What happened?
  2. What was the impact on our customers?
  3. What did we learn? Did we have enough information in the notification to investigate?
  4. What could have reduced our time to detect or time to remediate by 50 percent?
  5. Can we apply this to other similar systems?
  6. How can we improve our incident response processes?
  7. What human steps in the process can we automate?

Here are a few examples of things you might learn from your chaos engineering experiments:

  • After port 22 was opened, AWS Config detected the misconfigured security group within 2 minutes. However, the notification system was misconfigured, and the security team wasn’t notified. During the five minutes that port 22 was opened, the EC2 instance received 22 attempts to connect to it from unknown IP addresses.
  • After a cryptocurrency mining script was run on an EC2 instance, GuardDuty detected the activity, generated a finding within 10 minutes, and notified the security team.
  • The security team’s remediation actions—terminating the instance—led to increased application latency beyond the SLA of 0.05 seconds.

Improve and fix it

Use your learnings to improve the workload. It’s vital that you get leadership alignment and support to prioritize the remediation of findings from chaos experiments or other testing scenarios. Examples include improving incident response playbooks, creating new forms of automation, or creating preventative controls to prevent the event from happening. For more guidance on playbooks, see these sample incident response playbooks and the workshop Building an AWS incident response runbook using Jupyter notebooks and CloudTrail Lake.

Preparation is important in incident response. As you improve your processes, run more experiments to collect additional data and continue to iteratively improve. Automate chaos experiments on new environments or applications with minimal user traffic before directing the majority of traffic to it. As you use chaos engineering approaches to prepare incident response processes, your detection and incident response capabilities should improve.

Conclusion

It’s vital to be prepared when a security event happens. In this blog post, you learned about the five phases of chaos engineering—steady state, hypothesis, design and run the experiment, learn and verify, and improve and fix—and how you can use them to accelerate your incident response preparation and testing processes. For more information on chaos engineering, see the following resources. Choose a workload and run an experiment on it to verify and improve your incident response processes today.

Additional resources

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Kevin Low

Kevin is a Security Solutions Architect who helps customers of all sizes across ASEAN build securely. He is passionate about integrating resilience and security and has a keen interest in chaos engineering. Outside of work, he loves spending time with his wife and dog, a poodle called Noodle.

Aggregating, searching, and visualizing log data from distributed sources with Amazon Athena and Amazon QuickSight

Post Syndicated from Pratima Singh original https://aws.amazon.com/blogs/security/aggregating-searching-and-visualizing-log-data-from-distributed-sources-with-amazon-athena-and-amazon-quicksight/

Customers using Amazon Web Services (AWS) can use a range of native and third-party tools to build workloads based on their specific use cases. Logs and metrics are foundational components in building effective insights into the health of your IT environment. In a distributed and agile AWS environment, customers need a centralized and holistic solution to visualize the health and security posture of their infrastructure.

You can effectively categorize the members of the teams involved using the following roles:

  1. Executive stakeholder: Owns and operates with their support staff and has total financial and risk accountability.
  2. Data custodian: Aggregates related data sources while managing cost, access, and compliance.
  3. Operator or analyst: Uses security tooling to monitor, assess, and respond to related events such as service disruptions.

In this blog post, we focus on the data custodian role. We show you how you can visualize metrics and logs centrally with Amazon QuickSight irrespective of the service or tool generating them. We use Amazon Simple Storage Service (Amazon S3) for storage, AWS Glue for cataloguing, and Amazon Athena for querying the data and creating structured query language (SQL) views for QuickSight to consume.

Target architecture

This post guides you towards building a target architecture in line with the AWS Well-Architected Framework. The tiered and multi-account target architecture, shown in Figure 1, uses account-level isolation to separate responsibilities across the various roles identified above and makes access management more defined and specific to those roles. The workload accounts generate the telemetry around the applications and infrastructure. The data custodian account is where the data lake is deployed and collects the telemetry. The operator account is where the queries and visualizations are created.

Throughout the post, I mention AWS services that reduce the operational overhead in one or more stages of the architecture.

Figure 1: Data visualization architecture

Ingestion

Irrespective of the technology choices, applications and infrastructure configurations should generate metrics and logs that report on resource health and security. The format of the logs depends on which tool and which part of the stack is generating the logs. For example, the format of log data generated by application code can capture bespoke and additional metadata deemed useful from a workload perspective as compared to access logs generated by proxies or load balancers. For more information on types of logs and effective logging strategies, see Logging strategies for security incident response.

Amazon S3 is a scalable, highly available, durable, and secure object storage service that you will use as the storage layer. To build a solution that captures events regardless of the source, you must forward data as a stream to the S3 bucket. Based on the architecture, there are multiple tools you can use to capture and stream data into S3 buckets. Some tools support integration with S3 and stream data to S3 directly. Resources like servers and virtual machines need forwarding agents such as Amazon Kinesis Agent, Amazon CloudWatch agent, or Fluent Bit.

Amazon Kinesis Data Streams provides a scalable data streaming environment. Using on-demand capacity mode eliminates the need for capacity provisioning and capacity management for streaming workloads. For log data and metric collection, you should use on-demand capacity mode, because log data generation can be unpredictable depending on the requests that are being handled by the environment. Amazon Kinesis Data Firehose can convert the format of your input data from JSON to Apache Parquet before storing the data in Amazon S3. Parquet is naturally compressed, and using Parquet native partitioning and compression allows for faster queries compared to JSON formatted objects.
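
As a simple illustration of the streaming path, the following boto3 sketch forwards a JSON log event to a Firehose delivery stream that is configured to convert records to Parquet and deliver them to the S3 landing bucket. The delivery stream name and event fields are hypothetical.

import json

import boto3

firehose = boto3.client("firehose")

# A structured log event emitted by an application component.
log_event = {
    "timestamp": "2023-11-01T10:15:30Z",
    "source": "payments-api",
    "status_code": 502,
    "latency_ms": 1240,
}

# Send the event to the delivery stream; Firehose handles batching,
# format conversion, and delivery to Amazon S3.
firehose.put_record(
    DeliveryStreamName="central-logs-stream",  # hypothetical stream name
    Record={"Data": (json.dumps(log_event) + "\n").encode("utf-8")},
)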

Scalable data lake

Use AWS Lake Formation to build, secure, and manage the data lake that stores log and metric data in S3 buckets. We recommend using tag-based access control and named resources to share data across accounts for building visualizations. Data custodians should configure access to the relevant datasets for the operators, who can use Athena to perform complex queries and build compelling data visualizations with QuickSight, as shown in Figure 2. For cross-account permissions, see Use Amazon Athena and Amazon QuickSight in a cross-account environment. You can also use Amazon DataZone to build additional governance and share data at scale within your organization. Note that the data lake is distinct from the Log Archive bucket and account described in Organizing Your AWS Environment Using Multiple Accounts.

Figure 2: Account structure

Amazon Security Lake

Amazon Security Lake is a fully managed security data lake service. You can use Security Lake to automatically centralize security data from AWS environments, SaaS providers, on-premises, and third-party sources into a purpose-built data lake that’s stored in your AWS account. Using Security Lake reduces the operational effort involved in building a scalable data lake, as the service automates the configuration and orchestration for the data lake with Lake Formation. Security Lake automatically transforms logs into a standard schema—the Open Cybersecurity Schema Framework (OCSF) — and parses them into a standard directory structure, which allows for faster queries. For more information, see How to visualize Amazon Security Lake findings with Amazon QuickSight.

Querying and visualization

Figure 3: Data sharing overview

After you’ve configured cross-account permissions, you can use Athena as the data source to create a dataset in QuickSight, as shown in Figure 3. You start by signing up for a QuickSight subscription. There are multiple ways to sign in to QuickSight; this post uses AWS Identity and Access Management (IAM) for access. To use QuickSight with Athena and Lake Formation, you first must authorize connections through Lake Formation. After permissions are in place, you can add datasets. You should verify that you’re using QuickSight in the same AWS Region as the Region where Lake Formation is sharing the data. You can do this by checking the Region in the QuickSight URL.

You can start with basic queries and visualizations as described in Query logs in S3 with Athena and Create a QuickSight visualization. Depending on the nature and origin of the logs and metrics that you want to query, you can use the examples published in Running SQL queries using Amazon Athena. To build custom analytics, you can create views with Athena. Views in Athena are logical tables that you can use to query a subset of data. Views help you to hide complexity and minimize maintenance when querying large tables. Use views as a source for new datasets to build specific health analytics and dashboards.
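
As an example of using a view as a dataset source, the following boto3 sketch creates a view with Athena that narrows a large raw log table down to the fields a dashboard needs. The database, table, column, and result bucket names are hypothetical and should be replaced with your own.

import boto3

athena = boto3.client("athena")

# A view that summarizes server-side errors per hour and per source,
# which QuickSight can then consume as a dataset.
create_view_sql = """
CREATE OR REPLACE VIEW central_logs.error_summary AS
SELECT date_trunc('hour', from_iso8601_timestamp(event_time)) AS event_hour,
       source,
       count(*) AS error_count
FROM central_logs.application_logs
WHERE status_code >= 500
GROUP BY 1, 2
"""

athena.start_query_execution(
    QueryString=create_view_sql,
    QueryExecutionContext={"Database": "central_logs"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-query-results/"},
)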

You can also use Amazon QuickSight Q to get started on your analytics journey. Powered by machine learning, Q uses natural language processing to provide insights into the datasets. After the dataset is configured, you can use Q to give you suggestions for questions to ask about the data. Q understands business language and generates results based on relevant phrases detected in the questions. For more information, see Working with Amazon QuickSight Q topics.

Conclusion

Logs and metrics offer insights into the health of your applications and infrastructure. It’s essential to build visibility into the health of your IT environment so that you can understand what good health looks like and identify outliers in your data. These outliers can be used to identify thresholds and feed into your incident response workflow to help identify security issues. This post helps you build out a scalable centralized visualization environment irrespective of the source of log and metric data.

This post is part 1 of a series that helps you dive deeper into the security analytics use case. In part 2, How to visualize Amazon Security Lake findings with Amazon QuickSight, you will learn how you can use Security Lake to reduce the operational overhead involved in building a scalable data lake and centralizing log data from SaaS providers, on-premises, AWS, and third-party sources into a purpose-built data lake. You will also learn how you can integrate Athena with Security Lake and create visualizations with QuickSight of the data and events captured by Security Lake.

Part 3, How to share security telemetry per Organizational Unit using Amazon Security Lake and AWS Lake Formation, dives deeper into how you can query security posture using AWS Security Hub findings integrated with Security Lake. You will also use the capabilities of Athena and QuickSight to visualize security posture in a distributed environment.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Pratima Singh

Pratima is a Security Specialist Solutions Architect with Amazon Web Services based out of Sydney, Australia. She is a security enthusiast who enjoys helping customers find innovative solutions to complex business challenges. Outside of work, Pratima enjoys going on long drives and spending time with her family at the beach.

How to visualize Amazon Security Lake findings with Amazon QuickSight

Post Syndicated from Mark Keating original https://aws.amazon.com/blogs/security/how-to-visualize-amazon-security-lake-findings-with-amazon-quicksight/

In this post, we expand on the earlier blog post Ingest, transform, and deliver events published by Amazon Security Lake to Amazon OpenSearch Service, and show you how to query and visualize data from Amazon Security Lake using Amazon Athena and Amazon QuickSight. We also provide examples that you can use in your own environment to visualize your data. This post is the second in a multi-part blog series on visualizing data in QuickSight and provides an introduction to visualizing Security Lake data using QuickSight. The first post in the series is Aggregating, searching, and visualizing log data from distributed sources with Amazon Athena and Amazon QuickSight.

With the launch of Amazon Security Lake, it’s now simpler and more convenient to access security-related data in a single place. Security Lake automatically centralizes security data from cloud, on-premises, and custom sources into a purpose-built data lake stored in your account, and removes the overhead related to building and scaling your infrastructure as your data volumes increase. With Security Lake, you can get a more complete understanding of your security data across your entire organization. You can also improve the protection of your workloads, applications, and data.

Security Lake has adopted the Open Cybersecurity Schema Framework (OCSF), an open standard. With OCSF support, the service can normalize and combine security data from AWS and a broad range of enterprise security data sources. Whether you use the native ingestion capabilities of the service to pull in AWS CloudTrail, Amazon Route 53, VPC Flow Logs, or AWS Security Hub findings, ingest supported third-party partner findings, or ingest your own security-related logs, Security Lake provides an environment in which you can correlate events and findings by using a broad range of tools from the AWS and APN partner community.

Many customers have already deployed and maintain a centralized logging solution using services such as Amazon OpenSearch Service or a third-party security information and event management (SIEM) tool, and often use business intelligence (BI) tools such as Amazon QuickSight to gain insights into their data. With Security Lake, you have the freedom to choose how you analyze this data. In some cases, it may be from a centralized team using OpenSearch or a SIEM tool, and in other cases it may be that you want the ability to give your teams access to QuickSight dashboards or provide specific teams access to a single data source with Amazon Athena.

Before you get started

To follow along with this post, you must have:

  • A basic understanding of Security Lake, Athena, and QuickSight
  • Security Lake already deployed and accepting data sources
  • An existing QuickSight deployment that can be used to visualize Security Lake data, or an account where you can sign up for QuickSight to create visualizations

Accessing data

Security Lake uses the concept of data subscribers when it comes to accessing your data. A subscriber consumes logs and events from Security Lake, and supports two types of access:

  • Data access — Subscribers can directly access Amazon Simple Storage Service (Amazon S3) objects and receive notifications of new objects through a subscription endpoint or by polling an Amazon Simple Queue Service (Amazon SQS) queue. This is the architecture typically used by tools such as OpenSearch Service and partner SIEM solutions.
  • Query access — Subscribers with query access can directly query AWS Lake Formation tables in your S3 bucket by using services like Athena. Although the primary query engine for Security Lake is Athena, you can also use other services that integrate with AWS Glue, such as Amazon Redshift Spectrum and Spark SQL.

In the sections that follow, we walk through how to configure cross-account sharing from Security Lake to visualize your data with QuickSight, and the associated Athena queries that are used. It’s a best practice to isolate log data from visualization workloads, and we recommend using a separate AWS account for QuickSight visualizations. A high-level overview of the architecture is shown in Figure 1.

Figure 1: Security Lake visualization architecture overview

In Figure 1, Security Lake data is being cataloged by AWS Glue in account A. This catalog is then shared to account B by using AWS Resource Access Manager. Users in account B are then able to directly query the cataloged Security Lake data using Athena, or get visualizations by accessing QuickSight dashboards that use Athena to query the data.

Configure a Security Lake subscriber

The following steps guide you through configuring a Security Lake subscriber using the delegated administrator account.

To configure a Security Lake subscriber

  1. Sign in to the AWS Management Console and navigate to the Amazon Security Lake console in the Security Lake delegated administrator account. In this post, we’ll call this Account A.
  2. Go to Subscribers and choose Create subscriber.
  3. On the Subscriber details page, enter a Subscriber name. For example, cross-account-visualization.
  4. For Log and event sources, select All log and event sources. For Data access method, select Lake Formation.
  5. Add the Account ID for the AWS account that you’ll use for visualizations. In this post, we’ll call this Account B.
  6. Add an External ID to configure secure cross-account access. For more information, see How to use an external ID when granting access to your AWS resources to a third party.
  7. Choose Create. (A scripted alternative to these steps follows this procedure.)

Security Lake creates a resource share in your visualizations account using AWS Resource Access Manager (AWS RAM). You can view the configuration of the subscriber from Security Lake by selecting the subscriber you just created from the main Subscribers page. It should look like Figure 2.

Figure 2: Subscriber configuration

Note: your configuration might be slightly different, based on what you’ve named your subscriber, the AWS Region you’re using, the logs being ingested, and the external ID that you created.
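
If you prefer to script the subscriber creation instead of using the console, the following sketch uses the AWS SDK for Python (Boto3) and the securitylake client's create_subscriber API. The account ID, external ID, and source list are placeholders, and parameter names can differ between SDK versions, so treat this as a starting point rather than a definitive implementation.

import boto3

# Run this in the Security Lake delegated administrator account (Account A).
securitylake = boto3.client("securitylake", region_name="ap-southeast-2")

# Placeholder values: replace with your visualization account ID and your own external ID.
response = securitylake.create_subscriber(
    subscriberName="cross-account-visualization",
    accessTypes=["LAKEFORMATION"],            # query access through Lake Formation
    subscriberIdentity={
        "principal": "111122223333",          # Account B (visualization account)
        "externalId": "my-external-id",       # the external ID you generated
    },
    sources=[
        {"awsLogSource": {"sourceName": "SH_FINDINGS"}},  # Security Hub findings
    ],
)
print(response)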

Configure Athena to visualize your data

Now that the subscriber is configured, you can move on to the next stage, where you configure Athena and QuickSight to visualize your data.

Note: In the following example, queries will be against Security Hub findings, using the Security Lake table in the ap-southeast-2 Region. If necessary, change the table name in your queries to match the Security Lake Region you use in the following configuration steps.

To configure Athena

  1. Sign in to your QuickSight visualization account (Account B).
  2. Navigate to the AWS Resource Access Manager (AWS RAM) console. You’ll see a Resource share invitation under Shared with me in the menu on the left-hand side of the screen. Choose Resource shares to go to the invitation.
    Figure 3: RAM menu

  3. On the Resource shares page, select the name of the resource share starting with LakeFormation-V3, and then choose Accept resource share. The Security Lake Glue catalog is now available to Account B to query.
  4. For cross-account access, you should create a database to link the shared tables. Navigate to Lake Formation, and then under the Data catalog menu option, select Databases, then select Create database.
  5. Enter a name, for example security_lake_visualization, and keep the defaults for all other settings. Choose Create database.
    Figure 4: Create database

  6. After you’ve created the database, you need to create resource links from the shared tables into the database. Select Tables under the Data catalog menu option. Select one of the tables shared by Security Lake by selecting the table’s name. You can identify the shared tables by looking for the ones that start with amazon_security_lake_table_.
  7. From the Actions dropdown list, select Create resource link.
    Figure 5: Creating a resource link

  8. Enter the name for the resource link, for example amazon_security_lake_table_ap_southeast_2_sh_findings_1_0, and then select the security_lake_visualization database created in the previous steps.
  9. Choose Create. After the links have been created, the names of the resource links will appear in italics in the list of tables.
  10. You can now select the radio button next to the resource link, select Actions, and then select View data under Table. This takes you to the Athena query editor, where you can now run queries on the shared Security Lake tables.
    Figure 6: Viewing data to query

    To use Athena for queries, you must configure an S3 bucket to store query results. If this is the first time Athena is being used in your account, you’ll receive a message saying that you need to configure an S3 bucket. To do this, choose Edit settings in the information notice and follow the instructions.

  11. In the Editor configuration, select AwsDataCatalog from the Data source options. The Database should be the database you created in the previous steps, for example security_lake_visualization.
  12. After selecting the database, copy the query that follows and paste it into your Athena query editor, and then choose Run. This runs your first query to list 10 Security Hub findings:
    Figure 7: Athena data query editor

    SELECT * FROM 
    "AwsDataCatalog"."security_lake_visualization"."amazon_security_lake_table_ap_southeast_2_sh_findings_1_0" limit 10;

    This queries Security Hub data in Security Lake from the Region you specified, and outputs the results in the Query results section on the page. For a list of example Security Lake specific queries, see the AWS Security Analytics Bootstrap project, where you can find example queries specific to each of the Security Lake natively ingested data sources.

  13. To build advanced dashboards, you can create views using Athena. The following is an example of a view that lists 100 findings with failed checks sorted by created_time of the findings.
    CREATE VIEW security_hub_new_failed_findings_by_created_time AS
    SELECT
    finding.title, cloud.account_uid, compliance.status, metadata.product.name
    FROM "security_lake_visualization"."amazon_security_lake_table_ap_southeast_2_sh_findings_1_0"
    WHERE compliance.status = 'FAILED'
    ORDER BY finding.created_time
    limit 100;

  14. You can now query the view to list the first 10 rows using the following query. (A scripted version of this procedure using the AWS SDK for Python follows this list.)
    SELECT * FROM 
    "security_lake_visualization"."security_hub_new_failed_findings_by_created_time" limit 10;

Create a QuickSight dataset

Now that you’ve done a sample query and created a view, you can use Athena as the data source to create a dataset in QuickSight.

To create a QuickSight dataset

  1. Sign in to your QuickSight visualization account (also known as Account B), and open the QuickSight console.
  2. If this is the first time you’re using QuickSight, you need to sign up for a QuickSight subscription.
  3. Although there are multiple ways to sign in to QuickSight, we used AWS Identity and Access Management (IAM) based access to build the dashboards. To use QuickSight with Athena and Lake Formation, you first need to authorize connections through Lake Formation.
  4. When using cross-account configuration with AWS Glue Catalog, you also need to configure permissions on tables that are shared through Lake Formation. For a detailed deep dive, see Use Amazon Athena and Amazon QuickSight in a cross-account environment. For the use case highlighted in this post, use the following steps to grant access on the cross-account tables in the Glue Catalog. (A scripted sketch of this grant appears at the end of this section.)
    1. In the AWS Lake Formation console, navigate to the Tables section and select the resource link for the table, for example amazon_security_lake_table_ap_southeast_2_sh_findings_1_0.
    2. Select Actions. Under Permissions, select Grant on target.
    3. For Principals, select SAML users and groups, and then add the QuickSight user’s ARN captured in step 2 of the topic Authorize connections through Lake Formation.
    4. For the LF-Tags or catalog resources section, use the default settings.
    5. For Table permissions, choose Select for both Table Permissions and Grantable Permissions.
    6. Choose Grant.
    Figure 8: Granting permissions in Lake Formation

  5. After permissions are in place, you can create datasets. You should also verify that you’re using QuickSight in the same Region where Lake Formation is sharing the data. The simplest way to determine your Region is to check the QuickSight URL in your web browser. The Region will be at the beginning of the URL. To change the Region, select the settings icon in the top right of the QuickSight screen and select the correct Region from the list of available Regions in the drop-down menu.
  6. Select Datasets, and then select New dataset. Select Athena from the list of available data sources.
  7. Enter a Data source name, for example security_lake_visualizations, and leave the Athena workgroup as [primary]. Then select Create data source.
  8. Select the tables to build your dashboards. On the Choose your table prompt, for Catalog, select AwsDataCatalog. For Database, select the database you created in the previous steps, for example security_lake_visualization. For Table, select the table with the name starting with amazon_security_lake_table_. Choose Select.
    Figure 9: Selecting the table for a new dataset

  9. On the Finish dataset creation prompt, select Import to SPICE for quicker analytics. Choose Visualize.
  10. In the left-hand menu in QuickSight, you can choose attributes from the data set to add analytics and widgets.

After you’re familiar with how to use QuickSight to visualize data from Security Lake, you can create additional datasets and add other widgets to create dashboards that are specific to your needs.
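
The Lake Formation grant from step 4 of the dataset procedure can also be performed with the AWS SDK for Python (Boto3). The following sketch grants SELECT and DESCRIBE on the shared table to a QuickSight user; the QuickSight user ARN, account IDs, and database and table names are placeholders for this example, and you should confirm the exact permissions your dashboards need.

import boto3

lakeformation = boto3.client("lakeformation", region_name="ap-southeast-2")

# Placeholder ARN of the QuickSight user captured from "Authorize connections through Lake Formation".
quicksight_user_arn = "arn:aws:quicksight:ap-southeast-2:111122223333:user/default/my-quicksight-user"

lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": quicksight_user_arn},
    Resource={
        "Table": {
            "CatalogId": "444455556666",  # placeholder: account that owns the shared table (Account A)
            "DatabaseName": "amazon_security_lake_glue_db_ap_southeast_2",  # assumed shared database name
            "Name": "amazon_security_lake_table_ap_southeast_2_sh_findings_1_0",
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
)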

AWS pre-built QuickSight dashboards

So far, you’ve seen how to use Athena manually to query your data and how to use QuickSight to visualize it. AWS Professional Services is excited to announce the publication of the Data Visualization framework to help customers quickly visualize their data using QuickSight. The repository contains a combination of CDK tools and scripts that can be used to create the required AWS objects and deploy basic data sources, datasets, analysis, dashboards, and the required user groups to QuickSight with respect to Security Lake. The framework includes three pre-built dashboards based on the following personas.

Persona: CISO/Executive Stakeholder
Role description: Owns and operates, with their support staff, all security-related activities within a business; total financial and risk accountability
Challenges:
  • Difficult to assess organizational aggregated security risk
  • Burdened by license costs of security tooling
  • Less agility in security programs due to mounting cost and complexity
Goals:
  • Reduce risk
  • Reduce cost
  • Improve metrics (MTTD/MTTR and others)

Persona: Security Data Custodian
Role description: Aggregates all security-related data sources while managing cost, access, and compliance
Challenges:
  • Writes new custom extract, transform, and load (ETL) code every time a new data source shows up; difficult to maintain
  • Manually provisions access for users to view security data
  • Constrained by cost and performance limitations in tools depending on licenses and hardware
Goals:
  • Reduce overhead to integrate new data
  • Improve data governance
  • Streamline access

Persona: Security Operator/Analyst
Role description: Uses security tooling to monitor, assess, and respond to security-related events. Might perform incident response (IR), threat hunting, and other activities.
Challenges:
  • Moves between many security tools to answer questions about data
  • Lacks substantive automated analytics; manually processes and analyzes data
  • Can’t look historically due to limitations in tools (licensing, storage, scalability)
Goals:
  • Reduce total number of tools
  • Increase observability
  • Decrease time to detect and respond
  • Decrease “alert fatigue”

After deploying through the CDK, you will have three pre-built dashboards configured and available to view, and each of them can be customized according to your requirements. The Data Lake Executive dashboard provides a high-level overview of security findings, as shown in Figure 10.

Figure 10: Example QuickSight dashboard showing an overview of findings in Security Lake

The Security Lake custodian role will have visibility of security-related data sources, as shown in Figure 11.

Figure 11: Security Lake custodian dashboard

The Security Lake operator will have a view of security-related events, as shown in Figure 12.

Figure 12: Security Operator dashboard

Conclusion

In this post, you learned about Security Lake, and how you can use Athena to query your data and QuickSight to gain visibility of your security findings stored within Security Lake. When using QuickSight to visualize your data, it’s important to remember that the data remains in your S3 bucket within your own environment. However, if you have other use cases or wish to use other analytics tools such as OpenSearch, Security Lake gives you the freedom to choose how you want to interact with your data.

We also introduced the Data Visualization framework that was created by AWS Professional Services. The framework uses the CDK to deploy a set of pre-built dashboards to help get you up and running quickly.

With the announcement of AWS AppFabric, we’re making it even simpler to ingest data directly into Security Lake from leading SaaS applications without building and managing custom code or point-to-point integrations, enabling quick visualization of your data from a single place, in a common format.

For additional information on using Athena to query Security Lake, have a look at the AWS Security Analytics Bootstrap project, where you can find queries specific to each of the Security Lake natively ingested data sources. If you want to learn more about how to configure and use QuickSight to visualize findings, we have hands-on QuickSight workshops to help you configure and build QuickSight dashboards for visualizing your data.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Mark Keating

Mark is an AWS Security Solutions Architect based out of the U.K. who works with Global Healthcare & Life Sciences and Automotive customers to solve their security and compliance challenges and help them reduce risk. He has over 20 years of experience working with technology, within operations, solution, and enterprise architecture roles.

Pratima Singh

Pratima is a Security Specialist Solutions Architect with Amazon Web Services based out of Sydney, Australia. She is a security enthusiast who enjoys helping customers find innovative solutions to complex business challenges. Outside of work, Pratima enjoys going on long drives and spending time with her family at the beach.

David Hoang

David is a Shared Delivery Team Senior Security Consultant at AWS. His background is in AWS security, with a focus on automation. David designs, builds, and implements scalable enterprise solutions with security guardrails that use AWS services.

Ajay Rawat

Ajay is a Security Consultant in a shared delivery team at AWS. He is a technology enthusiast who enjoys working with customers to solve their technical challenges and to improve their security posture in the cloud.

Transforming transactions: Streamlining PCI compliance using AWS serverless architecture

Post Syndicated from Abdul Javid original https://aws.amazon.com/blogs/security/transforming-transactions-streamlining-pci-compliance-using-aws-serverless-architecture/

Compliance with the Payment Card Industry Data Security Standard (PCI DSS) is critical for organizations that handle cardholder data. Achieving and maintaining PCI DSS compliance can be a complex and challenging endeavor. Serverless technology has transformed application development, offering benefits in agility, performance, cost, and security.

In this blog post, we examine the benefits of using AWS serverless services and highlight how you can use them to help align with your PCI DSS compliance responsibilities. You can remove additional undifferentiated compliance heavy lifting by building modern applications with abstracted AWS services. We review an example payment application and workflow that uses AWS serverless services and showcases the potential reduction in effort and responsibility that a serverless architecture could provide to help align with your compliance requirements. We present the review through the lens of a merchant that has an ecommerce website and include key topics such as access control, data encryption, monitoring, and auditing—all within the context of the example payment application. We don’t discuss additional service provider requirements from the PCI DSS in this post.

This example will help you navigate the intricate landscape of PCI DSS compliance. This can help you focus on building robust and secure payment solutions without getting lost in the complexities of compliance. This can also help reduce your compliance burden and empower you to develop your own secure, scalable applications. Join us in this journey as we explore how AWS serverless services can help you meet your PCI DSS compliance objectives.

Disclaimer

This document is provided for the purposes of information only; it is not legal advice, and should not be relied on as legal advice. Customers are responsible for making their own independent assessment of the information in this document. This document: (a) is for informational purposes only, (b) represents current AWS product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided “as is” without warranties, representations, or conditions of any kind, whether express or implied. The responsibilities and liabilities of AWS to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any agreement between AWS and its customers.

AWS encourages its customers to obtain appropriate advice on their implementation of privacy and data protection environments, and more generally, applicable laws and other obligations relevant to their business.

PCI DSS v4.0 and serverless

In April 2022, the Payment Card Industry Security Standards Council (PCI SSC) updated the security payment standard to “address emerging threats and technologies and enable innovative methods to combat new threats.” Two of the high-level goals of these updates are enhancing validation methods and procedures and promoting security as a continuous process. Adopting serverless architectures can help meet some of the new and updated requirements in version 4.0, such as enhanced software and encryption inventories. If a customer has access to change a configuration, it’s the customer’s responsibility to verify that the configuration meets PCI DSS requirements. There are more than 20 PCI DSS requirements applicable to Amazon Elastic Compute Cloud (Amazon EC2). To fulfill these requirements, customer organizations must implement controls such as file integrity monitoring, operating system level access management, system logging, and asset inventories. Using AWS abstracted services in this scenario can remove undifferentiated heavy lifting from your environment. With abstracted AWS services, because there is no operating system to manage, AWS becomes responsible for maintaining consistent time settings for an abstracted service to meet Requirement 10.6. This will also shift your compliance focus more towards your application code and data.

This makes more of your PCI DSS responsibility addressable through the AWS PCI DSS Attestation of Compliance (AOC) and Responsibility Summary. This attestation package is available to AWS customers through AWS Artifact.

Reduction in compliance burden

You can use three common architectural patterns within AWS to design payment applications and meet PCI DSS requirements: infrastructure, containerized, and abstracted. We look into EC2 instance-based architecture (infrastructure or containerized patterns) and modernized architectures using serverless services (abstracted patterns). While both approaches can help align with PCI DSS requirements, there are notable differences in how they handle certain elements. EC2 instances provide more control and flexibility over the underlying infrastructure and operating system, assisting you in customizing security measures based on your organization’s operational and security requirements. However, this also means that you bear more responsibility for configuring and maintaining security controls applicable to the operating systems, such as network security controls, patching, file integrity monitoring, and vulnerability scanning.

On the other hand, serverless architectures similar to the preceding example can reduce much of the infrastructure management requirements. This can relieve you, the application owner or cloud service consumer, of the burden of configuring and securing those underlying virtual servers. This can streamline meeting certain PCI requirements, such as file integrity monitoring, patch management, and vulnerability management, because AWS handles these responsibilities.

Using serverless architecture on AWS can significantly reduce the PCI compliance burden. Approximately 43 percent of the overall PCI compliance requirements, encompassing both technical and non-technical tests, are addressed by the AWS PCI DSS Attestation of Compliance.

Customer responsible: 52% | AWS responsible: 43% | N/A: 5%

The following table provides an analysis of each PCI DSS requirement against the serverless architecture in Figure 1, which shows a sample payment application workflow. You must evaluate your own use and secure configuration of AWS workload and architectures for a successful audit.

PCI DSS 4.0 requirement | Test cases | Customer responsible | AWS responsible | N/A
Requirement 1: Install and maintain network security controls | 35 | 13 | 22 | 0
Requirement 2: Apply secure configurations to all system components | 27 | 16 | 11 | 0
Requirement 3: Protect stored account data | 55 | 24 | 29 | 2
Requirement 4: Protect cardholder data with strong cryptography during transmission over open, public networks | 12 | 7 | 5 | 0
Requirement 5: Protect all systems and networks from malicious software | 25 | 4 | 21 | 0
Requirement 6: Develop and maintain secure systems and software | 35 | 31 | 4 | 0
Requirement 7: Restrict access to system components and cardholder data by business need-to-know | 22 | 19 | 3 | 0
Requirement 8: Identify users and authenticate access to system components | 52 | 43 | 6 | 3
Requirement 9: Restrict physical access to cardholder data | 56 | 3 | 53 | 0
Requirement 10: Log and monitor all access to system components and cardholder data | 38 | 17 | 19 | 2
Requirement 11: Test security of systems and networks regularly | 51 | 22 | 23 | 6
Requirement 12: Support information security with organizational policies | 56 | 44 | 2 | 10
Total | 464 | 243 | 198 | 23
Percentage |  | 52% | 43% | 5%

Note: The preceding table is based on the example reference architecture that follows. The actual extent of PCI DSS requirements reduction can vary significantly depending on your cardholder data environment (CDE) scope, implementation, and configurations.

Sample payment application and workflow

This example serverless payment application and workflow in Figure 1 consists of several interconnected steps, each using different AWS services. The steps are listed in the following text and include brief descriptions. They cover two use cases within this example application — consumers making a payment and a business analyst generating a report.

The example outlines a basic serverless payment application workflow using AWS serverless services. However, it’s important to note that the actual implementation and behavior of the workflow may vary based on specific configurations, dependencies, and external factors. The example serves as a general guide and may require adjustments to suit the unique requirements of your application or infrastructure.

Several factors, including but not limited to, AWS service configurations, network settings, security policies, and third-party integrations, can influence the behavior of the system. Before deploying a similar solution in a production environment, we recommend thoroughly reviewing and adapting the example to align with your specific use case and requirements.

Keep in mind that AWS services and features may evolve over time, and new updates or changes may impact the behavior of the components described in this example. Regularly consult the AWS documentation and ensure that your configurations adhere to best practices and compliance standards.

This example is intended to provide a starting point and should be considered as a reference rather than an exhaustive solution. Always conduct thorough testing and validation in your specific environment to ensure the desired functionality and security.

Figure 1: Serverless payment architecture and workflow

  • Use case 1: Consumers make a payment
    1. Consumers visit the e-commerce payment page to make a payment.
    2. The request is routed to the payment application’s domain using Amazon Route 53, which acts as a DNS service.
    3. The payment page is protected by AWS WAF to inspect the initial incoming request for any malicious patterns, web-based attacks (such as cross-site scripting (XSS) attacks), and unwanted bots.
    4. An HTTPS GET request (over TLS) is sent to the public target IP. Amazon CloudFront, a content delivery network (CDN), acts as a front-end proxy and caches and fetches static content from an Amazon Simple Storage Service (Amazon S3) bucket.
    5. AWS WAF inspects the incoming request for any malicious patterns. If the request is blocked, static content isn't returned from the S3 bucket.
    6. User authentication and authorization are handled by Amazon Cognito, providing a secure login and scalable customer identity and access management system (CIAM)
    7. AWS WAF processes the request to protect against web exploits, then Amazon API Gateway forwards it to the payment application API endpoint.
    8. API Gateway launches AWS Lambda functions to handle payment requests. AWS Step Functions state machine oversees the entire process, directing the running of multiple Lambda functions to communicate with the payment processor, initiate the payment transaction, and process the response.
    9. The cardholder data (CHD) is temporarily cached in Amazon DynamoDB for troubleshooting and retry attempts in the event of transaction failures. (A sketch of this caching and validation step follows the workflow.)
    10. A Lambda function validates the transaction details and performs necessary checks against the data stored in DynamoDB. A web notification is sent to the consumer for any invalid data.
    11. A Lambda function calculates the transaction fees.
    12. A Lambda function authenticates the transaction and initiates the payment transaction with the third-party payment provider.
    13. A Lambda function is initiated when a payment transaction with the third-party payment provider is completed. It receives the transaction status from the provider and performs multiple actions.
    14. Consumers receive real-time notifications through a web browser and email. The notifications, such as order confirmations or payment receipts, are initiated by a step function and can be integrated with external payment processors through an Amazon Simple Notification Service (Amazon SNS) or Amazon Simple Email Service (Amazon SES) webhook.
    15. A separate Lambda function clears the DynamoDB cache.
    16. The Lambda function makes entries into the Amazon Simple Queue Service (Amazon SQS) dead-letter queue for failed transactions to retry at a later time.
  • Use case 2: An admin or analyst generates the report for non-PCI data
    1. An admin accesses the web-based reporting dashboard using their browser to generate a report.
    2. The request is routed to AWS WAF to verify the source that initiated the request.
    3. An HTTPS GET request (over TLS) is sent to the public target IP. CloudFront fetches static content from an S3 bucket.
    4. AWS WAF inspects incoming requests for any malicious patterns. If the request is blocked, static content isn't returned from the S3 bucket. The validated traffic is sent to Amazon S3 to retrieve the reporting page.
    5. The backend requests of the reporting page pass through AWS WAF again to provide protection against common web exploits before being forwarded to the reporting API endpoint through API Gateway.
    6. API Gateway launches a Lambda function for report generation. The Lambda function retrieves data from DynamoDB storage for the reporting mechanism.
    7. The AWS Security Token Service (AWS STS) issues temporary credentials to the Lambda service in the non-PCI serverless account, allowing it to launch the Lambda function in the PCI serverless account. The Lambda function retrieves non-PCI data and writes it into DynamoDB.
    8. The Lambda function fetches the non-PCI data based on the report criteria from the DynamoDB table from the same account.

Additional AWS security and governance services that would be implemented throughout the architecture are shown in Figure 1, Label-25. For example, Amazon CloudWatch monitors and alerts on all the Lambda functions within the environment.

Label-26 demonstrates frameworks that can be used to build the serverless applications.
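
To make the workflow more concrete, the following sketch shows what the Lambda function behind steps 9 and 10 of use case 1 might look like: it caches the transaction record in DynamoDB with a TTL attribute and performs a simple validation check. The table name, attribute names, event shape, and validation logic are hypothetical and shown only to illustrate the pattern; in a real cardholder data environment you would also apply the field-level protection and tokenization appropriate to your PCI DSS scope.

import os
import time
import boto3

# Hypothetical table name supplied through the function's environment variables.
TABLE_NAME = os.environ.get("TRANSACTION_CACHE_TABLE", "payment-transaction-cache")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(TABLE_NAME)


def handler(event, context):
    """Cache the incoming transaction and run basic validation checks."""
    transaction = event["transaction"]  # hypothetical event shape

    # Step 9: cache the record temporarily; the ttl attribute lets DynamoDB
    # delete it automatically once the retry window has passed.
    item = {
        "transaction_id": transaction["id"],
        "amount": str(transaction["amount"]),
        "status": "PENDING",
        "ttl": int(time.time()) + 900,  # expire after 15 minutes
    }
    table.put_item(Item=item)

    # Step 10: validate the transaction details before initiating payment.
    errors = []
    if float(transaction["amount"]) <= 0:
        errors.append("Amount must be greater than zero.")
    if not transaction.get("currency"):
        errors.append("Currency is required.")

    if errors:
        # In the example architecture, a web notification would be sent to the consumer here.
        return {"valid": False, "errors": errors}
    return {"valid": True, "transaction_id": transaction["id"]}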

Scoping and requirements

Now that we’ve established the reference architecture and workflow, lets delve into how it aligns with PCI DSS scope and requirements.

PCI scoping

Serverless services are inherently segmented by AWS, but they can be used within the context of an AWS account hierarchy to provide various levels of isolation as described in the reference architecture example.

Segregating PCI data and non-PCI data into separate AWS accounts can help in de-scoping non-PCI environments and reducing the complexity and audit requirements for components that don’t handle cardholder data.

PCI serverless production account

  • This AWS account is dedicated to handling PCI data and applications that directly process, transmit, or store cardholder data.
  • Services such as Amazon Cognito, DynamoDB, API Gateway, CloudFront, Amazon SNS, Amazon SES, Amazon SQS, and Step Functions are provisioned in this account to support the PCI data workflow.
  • Security controls, logging, monitoring, and access controls in this account are specifically designed to meet PCI DSS requirements.

Non-PCI serverless production account

  • This separate AWS account is used to host applications that don’t handle PCI data.
  • Since this account doesn’t handle cardholder data, the scope of PCI DSS compliance is reduced, simplifying the compliance process.

Note: You can use AWS Organizations to centrally manage multiple AWS accounts.

AWS IAM Identity Center (successor to AWS Single Sign-On) is used to manage user access to each account and is integrated with your existing identity provider. This helps to ensure you're meeting PCI requirements for identity and access control of cardholder data and the environment.

Now, let’s look at the PCI DSS requirements that this architectural pattern can help address.

Requirement 1: Install and maintain network security controls

  • Network security controls are limited to AWS Identity and Access Management (IAM) and application permissions because there is no customer-controlled or defined network. VPC-centric requirements aren't applicable because there is no VPC. The configuration settings for serverless services can be covered under Requirement 6 for secure configuration standards. This supports compliance with Requirements 1.2 and 1.3.

Requirement 2: Apply secure configurations to all system components

  • AWS services are single function by default and exist with only the necessary functionality enabled for the functioning of that service. This supports compliance with much of Requirement 2.2.
  • Access to AWS services is considered non-console access and is available only over HTTPS through the service API. This supports compliance with Requirement 2.2.7.
  • The wireless requirements under Requirement 2.3 are not applicable, because wireless environments don’t exist in AWS environments.

Requirement 3: Protect stored account data

  • AWS is responsible for destruction of account data configured for deletion based on DynamoDB Time to Live (TTL) values (a sketch of enabling TTL follows this list). This supports compliance with Requirement 3.2.
  • DynamoDB and Amazon S3 offer secure storage of account data, encryption by default in transit and at rest, and integration with AWS Key Management Service (AWS KMS). This supports compliance with Requirements 3.5 and 4.2.
  • AWS is responsible for the generation, distribution, storage, rotation, destruction, and overall protection of encryption keys within AWS KMS. This supports compliance with Requirements 3.6 and 3.7.
  • Manual cleartext cryptographic keys aren't used in this solution, so Requirement 3.7.6 is not applicable.
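
As a concrete example of the TTL configuration referenced under Requirement 3.2 (and again under Requirement 9.4.7 later in this section), the following sketch enables TTL on a DynamoDB table with the AWS SDK for Python (Boto3); the table and attribute names are placeholders.

import boto3

dynamodb = boto3.client("dynamodb")

# Enable TTL so that DynamoDB deletes items once the epoch timestamp stored
# in the "ttl" attribute has passed.
dynamodb.update_time_to_live(
    TableName="payment-transaction-cache",  # placeholder table name
    TimeToLiveSpecification={
        "Enabled": True,
        "AttributeName": "ttl",
    },
)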

Requirement 4: Protect cardholder data with strong cryptography during transmission over open, public networks

  • AWS Certificate Manager (ACM) integrates with API Gateway and enables the use of trusted certificates and HTTPS (TLS) for secure communication between clients and the API. This supports compliance with Requirement 4.2.
  • Requirement 4.2.1.2 is not applicable because there are no wireless technologies in use in this solution. Customers are responsible for ensuring strong cryptography exists for authentication and transmission over other wireless networks they manage outside of AWS.
  • Requirement 4.2.2 is not applicable because no end-user technologies exist in this solution. Customers are responsible for ensuring the use of strong cryptography if primary account numbers (PAN) are sent through end-user messaging technologies in other environments.

Requirement 5: Protect all systems and networks from malicious software

  • Because there are no customer-managed compute resources in this example payment environment, Requirements 5.2 and 5.3 are the responsibility of AWS.

Requirement 6: Develop and maintain secure systems and software

  • Amazon Inspector now supports Lambda functions, adding continual, automated vulnerability assessments for serverless compute. This supports compliance with Requirement 6.2.
  • Amazon Inspector helps identify vulnerabilities and security weaknesses in the payment application’s code, dependencies, and configuration. This supports compliance with Requirement 6.3.
  • AWS WAF is designed to protect applications from common attacks, such as SQL injections, cross-site scripting, and other web exploits. AWS WAF can filter and block malicious traffic before it reaches the application. This supports compliance with Requirement 6.4.2.

Requirement 7: Restrict access to system components and cardholder data by business need to know

  • IAM and Amazon Cognito allow for fine-grained role- and job-based permissions and access control. Customers can use these capabilities to configure access following the principles of least privilege and need-to-know. IAM and Cognito support the use of strong identification, authentication, authorization, and multi-factor authentication (MFA). This supports compliance with much of Requirement 7.

Requirement 8: Identify users and authenticate access to system components

  • IAM and Amazon Cognito also support compliance with much of Requirement 8.
  • Some of the controls in this requirement are usually met by the identity provider for internal access to the cardholder data environment (CDE).

Requirement 9: Restrict physical access to cardholder data

  • AWS is responsible for the destruction of data in DynamoDB based on the customer configuration of content TTL values for Requirement 9.4.7. Customers are responsible for ensuring their DynamoDB tables are configured for appropriate removal of data by enabling TTL on the relevant attributes.
  • Requirement 9 is otherwise not applicable for this serverless example environment because there are no physical media, electronic media not already addressed under Requirement 3.2, or hard-copy materials with cardholder data. AWS is responsible for the physical infrastructure under the Shared Responsibility Model.

Requirement 10: Log and monitor all access to system components and cardholder data

  • AWS CloudTrail provides detailed logs of API activity for auditing and monitoring purposes. This supports compliance with Requirement 10.2 and contains all of the events and data elements listed.
  • CloudWatch can be used for monitoring and alerting on system events and performance metrics. This supports compliance with Requirement 10.4.
  • AWS Security Hub provides a comprehensive view of security alerts and compliance status, consolidating findings from various security services, which helps in ongoing security monitoring and testing. Customers must enable the PCI DSS security standard (see the sketch after this list), which supports compliance with Requirement 10.4.2.
  • AWS is responsible for maintaining accurate system time for AWS services. In this example, there are no compute resources for which customers can configure time. Requirement 10.6 is addressable through the AWS Attestation of Compliance and Responsibility Summary available in AWS Artifact.
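
The following sketch shows one way to enable the PCI DSS standard in Security Hub programmatically with the AWS SDK for Python (Boto3). It looks up the standard ARN with describe_standards rather than hard-coding it, because standard ARNs vary by Region and version; treat the matching logic as an assumption to verify against your own account.

import boto3

securityhub = boto3.client("securityhub")

# Find the PCI DSS standard ARN available in this Region.
standards = securityhub.describe_standards()["Standards"]
pci_standards = [s for s in standards if "pci-dss" in s["StandardsArn"]]

if pci_standards:
    securityhub.batch_enable_standards(
        StandardsSubscriptionRequests=[
            {"StandardsArn": pci_standards[0]["StandardsArn"]}
        ]
    )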

Requirement 11: Regularly test security systems and processes

  • Testing for rogue wireless activity within the AWS-based CDE is the responsibility of AWS. AWS is responsible for the management of the physical infrastructure under Requirement 11.2. Customers are still responsible for wireless testing for their environments outside of AWS, such as where administrative workstations exist.
  • AWS is responsible for internal vulnerability testing of AWS services, and supports compliance with Requirement 11.3.1.
  • Amazon GuardDuty is a threat detection service that continuously monitors for malicious activity and unauthorized access, providing continuous security monitoring. This supports the IDS requirements under Requirement 11.5.1 and covers the entire AWS-based CDE.
  • AWS Config allows customers to catalog, monitor and manage configuration changes for their AWS resources. This supports compliance with Requirement 11.5.2.
  • Customers can use AWS Config to monitor the configuration of the S3 bucket hosting the static website (a sketch follows this list). This supports compliance with Requirement 11.6.1.
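
As an illustration of how AWS Config could watch the S3 bucket hosting the static website, the following sketch creates an AWS managed Config rule scoped to S3 buckets. The rule shown (S3_BUCKET_PUBLIC_READ_PROHIBITED) is only one example of a control you might apply, and Requirement 11.6.1 may call for additional change-detection mechanisms on payment pages.

import boto3

config = boto3.client("config")

config.put_config_rule(
    ConfigRule={
        "ConfigRuleName": "static-site-bucket-no-public-read",  # placeholder name
        "Source": {
            "Owner": "AWS",
            "SourceIdentifier": "S3_BUCKET_PUBLIC_READ_PROHIBITED",  # AWS managed rule
        },
        "Scope": {"ComplianceResourceTypes": ["AWS::S3::Bucket"]},
    }
)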

Requirement 12: Support information security with organizational policies and programs

  • Customers can download the AWS AOC and Responsibility Summary package from Artifact to support Requirement 12.8.5 and the identification of which PCI DSS requirements are managed by the third-party service provider (TSPS) and which by the customer.

Conclusion

Using AWS serverless services when developing your payment application can significantly help reduce the number of PCI DSS requirements you need to meet by yourself. By offloading infrastructure management to AWS and using serverless services such as Lambda, API Gateway, DynamoDB, Amazon S3, and others, you can benefit from built-in security features and help align with your PCI DSS compliance requirements.

Contact us to help design an architecture that works for your organization. AWS Security Assurance Services is a Payment Card Industry-Qualified Security Assessor company (PCI-QSAC) and HITRUST External Assessor firm. We are a team of industry-certified assessors who help you to achieve, maintain, and automate compliance in the cloud by tying together applicable audit standards to AWS service-specific features and functionality. We help you build on frameworks such as PCI DSS, HITRUST CSF, NIST, SOC 2, HIPAA, ISO 27001, GDPR, and CCPA.

More information on how to build applications using AWS serverless technologies can be found at Serverless on AWS.

Want more AWS Security news? Follow us on Twitter.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Serverless re:Post, Security, Identity, & Compliance re:Post or contact AWS Support.

Abdul Javid

Abdul is a Senior Security Assurance Consultant and PCI DSS Qualified Security Assessor with AWS Security Assurance Services, and has more than 25 years of IT governance, operations, security, risk, and compliance experience. Abdul leverages his experience and knowledge to provide AWS customers with guidance and advice on their compliance journey. Abdul earned an M.S. in Computer Science from IIT, Chicago and holds various industry-recognized, sought-after certifications in security and program and risk management from prominent organizations like AWS, HITRUST, ISACA, PMI, PCI DSS, and ISC2.

Ted Tanner

Ted is a Principal Assurance Consultant and PCI DSS Qualified Security Assessor with AWS Security Assurance Services, and has more than 25 years of IT and security experience. He uses this experience to provide AWS customers with guidance on compliance and security, and on building and optimizing their cloud compliance programs. He is co-author of the Payment Card Industry Data Security Standard (PCI DSS) v3.2.1 on AWS Compliance Guide and the soon-to-be-released v4.0 edition.

Tristan Watty

Dr. Watty is a Senior Security Consultant within the Professional Services team of Amazon Web Services based in Queens, New York. He is a passionate Tech Enthusiast, Influencer, and Amazonian with 15+ years of professional and educational experience with a specialization in Security, Risk, and Compliance. His zeal lies in empowering customers to develop and put into action secure mechanisms that steer them towards achieving their security goals. Dr. Watty also created and hosts an AWS Security Show named “Security SideQuest!” that airs on the AWS Twitch Channel.

Padmakar Bhosale

Padmakar is a Sr. Technical Account Manager with over 25 years of experience in the Financial, Banking, and Cloud Services. He provides AWS customers with guidance and advice on Payment Services, Core Banking Ecosystem, Credit Union Banking Technologies, Resiliency on AWS Cloud, AWS Accounts & Network levels PCI Segmentations, and Optimization of the Customer’s Cloud Journey experience on AWS Cloud.

Scaling national identity schemes with itsme and Amazon Cognito

Post Syndicated from Guillaume Neau original https://aws.amazon.com/blogs/security/scaling-national-identity-schemes-with-itsme-and-amazon-cognito/

In this post, we demonstrate how you can use identity federation and integration between the identity provider itsme® and Amazon Cognito to quickly consume and build digital services for citizens on Amazon Web Services (AWS) using available national digital identities. We also provide code examples and integration proofs of concept to get you started quickly.

National digital identities refer to a system or framework that a government establishes to uniquely and securely identify its citizens or residents in the digital realm.

These national digital identities are built on a rigorous process of identity verification and enforce the use of high security standards when it comes to authentication mechanisms. Their adoption by both citizens and businesses helps to fight identity theft, most notably by removing the need to send printed copies of identity documents.

National certified secure digital identities are suitable for both businesses and public services and can improve the onboarding experience by reducing the need to create new credentials.

About itsme

itsme is a trusted identity provider (certified and notified for all 27 member states of EU at Level of Assurance HIGH of the eiDAS regulation) that can be used on over 800 government and company platforms to identify yourself online, log in, confirm transactions, or sign documents. It allows partners to use its verified identities for authentication and authorization on web, desktop, mobile web, and mobile applications.

As of this writing, itsme is accessible for all residents in Belgium, The Netherlands, and Luxembourg. However, since there are no limitations on the geographic usage of the identity and electronic signature APIs, itsme has the potential to expand to additional countries in the future. (Source: itsme, 2023)

Architecture overview

To demonstrate the integration, you're going to build a minimalistic application made of the components shown in Figure 1 that follows.

Figure 1: Architectural diagram

After deployment, you can log in and interact with the application:

  1. Visit the frontend deployed locally, and you're presented with the option to authenticate with itsme by using a blue-colored button. Choose the button to proceed.
  2. After being redirected to itsme, you’re asked to either create a new account or to use an existing one for authentication. After you’re successfully authenticated with itsme, the associated Amazon Cognito user pool is populated with the requested data in the scope of the federation. Specifically in this example, the national registration number is made available.
  3. When authenticated, you’re redirected to the frontend, and you can read and write messages to and from the database behind an Amazon API Gateway.
  4. The Amazon API Gateway uses Amazon Cognito to check the validity of your authentication token.
  5. The Lambda function reads and writes messages to and from DynamoDB.

Prerequisites to deploy the identity federation with itsme

While setting up the Amazon Cognito user pool, you’re asked for the following information:

  • An itsme client ID – itsmeClientId
  • An itsme client secret – itsmeClientSecret
  • An itsme service code – itsmeServiceCode
  • An itsme issuer URL – itsmeIssuerUrl

To retrieve this information, you must be an itsme partner and have your sandbox requested and available. The sandbox should be made available three business days after you submit the dedicated request form to itsme.

After the sandbox is provisioned, you must contact the itsme support desk and ask to switch the sandbox authentication to the client secret – itsmeClientSecret flow. Include the link to this post and specify that it’s for establishing a federation with Amazon Cognito.

Implement the proof of concept

To implement this proof of concept, you need to follow these steps:

  1. Create an Amazon Cognito user pool.
  2. Configure the Amazon Cognito user pool.
  3. Deploy a sample API.
  4. Configure your application.

To create and configure an Amazon Cognito user pool

  1. Sign in to the AWS Management Console and enter cognito in the search bar at the top. Select Cognito from the Services results.
    Figure 2: Select Cognito service

  2. In the Amazon Cognito console, select User pools, and then choose Create user pool.
    Figure 3: Cognito user pool creation

  3. To configure the sign-in experience section, select Federated identity providers as the authentication providers.
  4. In the Cognito user pool sign-in options area, select User name, Email, and Phone number.
  5. In the Federated sign-in options area, select OpenID Connect (OIDC).
    Figure 4: Sign-in configuration

  6. Choose Next to continue to security requirements.

Note: In this post, account management and authentication are restricted to itsme. Because of this, the password length, multi-factor authentication, and recovery procedures are delegated to itsme. If you don’t restrict your Cognito user pool to itsme only, configure it according to your security requirements.

To configure the security requirements

  1. For Password policy, select Cognito defaults.
  2. Select Require MFA – Recommended in the Multi-factor authentication area, and select Authenticator apps.

    Note: Although the activation of multi-factor authentication is recommended, it's important to understand that users of this pool will be created and authenticated through the federation with itsme. In the next procedure, you disable the self-service sign-up feature to prevent users from creating accounts. As itsme is compliant with the substantial level of assurance of the eIDAS regulation, itsme users must log in using a second factor of authentication.

  3. Clear Enable self-service account recovery in the User account recovery area.
Figure 5: Security requirements configuration

To configure the sign-up experience

  1. Clear Enable self-registration.
  2. Clear Allow Cognito to automatically send messages to verify and confirm.
    Figure 6: Sign-up configuration

  3. Skip the configuration of required attributes and configure custom attributes. Expand the drop-down menu and add the following custom attribute:
    1. Name: eid.
    2. Type: String.
    3. Leave Min and Max length blank.
    4. Mutable: Select.

    This custom attribute is used to map and store the national registration number.

  4. Choose Next to configure message delivery.

Note: In this post, account management and authentication are going to be restricted to itsme. As a result, Amazon Cognito doesn’t send email or SMS, and the prescribed configuration is minimal. If you don’t limit your user pool to itsme, configure message delivery parameters according to your corporate policy.

To configure message delivery

  1. For Email, select Send email with Cognito and leave the other fields with their default configuration.
  2. To configure the SMS, select Create a new IAM Role if you don’t already have one provisioned.
  3. Choose Next to configure the federated identity provider.
    Figure 7: Message delivery configuration

  4. Choose Next to configure identity provider.

To configure the federated identity provider

  1. For Provider name, enter itsme.
  2. For Client ID, enter the client ID provided by itsme.
  3. For Client secret, enter the client secret provided by itsme.
  4. For Authorized scopes, start with the mandatory service:itsmeServiceCode.
  5. With a space between each scope, enter openid profile eid email address.
  6. For Retrieve OIDC endpoints, enter the issuer URL provided by itsme.
    Figure 8: OIDC federation configuration

    Configure the attribute mapping according to the documentation provided by itsme (a scripted sketch of this identity provider configuration follows this procedure).

    An example mapping is provided in Figure 9 that follows. Note the differences needed to retrieve and map the eid value and the unique itsme username (sub).

    More specifically, to retrieve the National Registration Number, the eid field needs to be set to http://itsme.services/v2/claim/BENationalNumber.

    Figure 9: Attributes mapping

  7. Choose Next to configure an app client.
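
The identity provider and attribute mapping configured in this procedure can also be created with the AWS SDK for Python (Boto3), as sketched below. The user pool ID, client ID, client secret, service code, and issuer URL are placeholders for the values itsme provides; the provider detail keys follow the Amazon Cognito OIDC provider settings, but verify them against the current API documentation before relying on them.

import boto3

cognito_idp = boto3.client("cognito-idp")

cognito_idp.create_identity_provider(
    UserPoolId="eu-west-1_EXAMPLE",            # placeholder user pool ID
    ProviderName="itsme",
    ProviderType="OIDC",
    ProviderDetails={
        "client_id": "itsmeClientId",          # placeholder values from itsme
        "client_secret": "itsmeClientSecret",
        "authorize_scopes": "service:itsmeServiceCode openid profile eid email address",
        "oidc_issuer": "https://oidc.example.itsme.services",  # placeholder issuer URL
        "attributes_request_method": "GET",
    },
    AttributeMapping={
        "username": "sub",
        "email": "email",
        "custom:eid": "http://itsme.services/v2/claim/BENationalNumber",
    },
)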

To configure an app client

  1. Configure both your user pool name and domain by opening the Amazon Cognito console.
    Figure 10: User pool and domain name

  2. In the Initial app client area, select Public client.
    1. Enter your application client name.
    2. Select Don’t generate a client secret.
    3. Enter the application callback URL that’s used by itsme at the end of the authenticating flow. This URL is the one your end user is going to land on after authenticating.
    Figure 11: Configuring app client

To finish, review and create the user pool

When the user pool is created, send your Amazon Cognito domain name to itsme support for them to activate your authentication endpoints. That URL has the following composition:

https://<Your user pool domain>.auth.<your region>.amazoncognito.com/oauth2/idpresponse

After the user pool is created, you can retrieve your userPoolWebClientId, which is required to create a consuming application.

To retrieve your userPoolWebClientId

  1. From the Amazon Cognito Console, select User pools on the left menu.
  2. Select the user pool that you created.
    Figure 12: User pool app integration

In the App integration area, your userPoolWebClientId is displayed at the bottom of the window.

Figure 13: Client ID

To create a consuming application

When the setup of the user pool is done, you can integrate the authentication flow in your application. The integration can be done using the AWS Amplify SDK or by calling the relevant APIs directly. Depending on the framework you used when building the application, you can find documentation about doing so in AWS Prescriptive Guidance Patterns.

You can use Amazon API Gateway to quickly build a secure API that uses the authentication made through Amazon Cognito and the federation to build services. We encourage you to review the Amazon API Gateway documentation to learn more. The next section provides you with examples that you can deploy to get an idea of the integration steps.

Additionally, you can use an Amazon Cognito identity pool to exchange Amazon Cognito issued tokens for AWS credentials (in other words, assuming AWS Identity and Access Management (IAM) roles) to access other AWS services. As an example, this could allow users to upload files to an Amazon Simple Storage Service (Amazon S3) bucket.
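
If you aren't using the Amplify SDK, the authentication flow is standard OpenID Connect against the Cognito hosted UI endpoints. The following sketch, using Python and the third-party requests library, builds the authorization URL and exchanges the returned code for tokens at the /oauth2/token endpoint. The domain, client ID, and redirect URI are placeholders, and a production application should also validate the returned ID token before trusting it.

import urllib.parse
import requests  # pip install requests

# Placeholder values: replace with your Cognito domain, app client ID, and callback URL.
COGNITO_DOMAIN = "https://my-user-pool-domain.auth.eu-west-1.amazoncognito.com"
CLIENT_ID = "userPoolWebClientId"
REDIRECT_URI = "https://localhost:3000/callback"

# 1. Send the user to the hosted UI, which redirects to itsme for authentication.
authorize_url = (
    f"{COGNITO_DOMAIN}/oauth2/authorize?"
    + urllib.parse.urlencode(
        {
            "response_type": "code",
            "client_id": CLIENT_ID,
            "redirect_uri": REDIRECT_URI,
            "identity_provider": "itsme",
            "scope": "openid profile email",
        }
    )
)
print("Open this URL in a browser:", authorize_url)

# 2. After the redirect, exchange the authorization code for tokens.
def exchange_code(code: str) -> dict:
    response = requests.post(
        f"{COGNITO_DOMAIN}/oauth2/token",
        data={
            "grant_type": "authorization_code",
            "client_id": CLIENT_ID,
            "code": code,
            "redirect_uri": REDIRECT_URI,
        },
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    response.raise_for_status()
    return response.json()  # contains id_token, access_token, and refresh_token

The id_token returned by the token endpoint is what the sample backend's API Gateway authorizer validates against the Amazon Cognito user pool.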

About the examples provided

The public GitHub repository that is provided contains code examples and associated documentation to help you automatically go through the setup steps detailed in this post. Specifically, the following are available:

  1. An AWS CloudFormation template that can help you provision a properly configured user pool after you have the required information from itsme.
  2. An AWS CloudFormation template that deploys the backend for the test application.
  3. A React frontend that you can run locally to interact with the backend and to consume identities from itsme.

To deploy the provided examples

  1. Clone the repository on your local machine.
  2. Install the dependencies.
  3. If you haven’t created your user pool following the instructions in this post, you can use the CognitoItsmeStack provided as an example.
  4. Deploy the associated backend stack BackendItsmeStack.cfn.yaml.
  5. Rename the frontend/src/config.json.template file to frontend/src/config.json and replace the following:
    1. region with the AWS Region associated with your Amazon Cognito user pool.
    2. userPoolId with the assigned ID of the user pool that you created.
    3. userPoolWebClientId with the client ID that you retrieved.
    4. domain with your Amazon Cognito domain in the form of <your user pool name>.auth.<your region>.amazoncognito.com
    Figure 14: Frontend configuration file

  6. After modifications are done, start the application on your local machine with the provided command.

Following authentication, the associated collected data is displayed, as shown in Figure 15 that follows.

Figure 15: User information

In the My Data section, you can access a form to input a value (shown in Figure 16). Each time you go back to this page, the previous value entered is shown in the input box. This input is associated with your NRN (custom:eid), and only you can access it.

Figure 16: Database interaction

Conclusion

You can now consume digital identities through identity federation between Amazon Cognito and itsme. We hope that it helps you build secure digital services that improve the lives of Benelux users.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Guillaume Neau

Guillaume is a Solutions Architect from France with expertise in information security that focuses on building solutions that improve customers’ lives.

Julien Martin

Julien is a Solutions Architect at Amazon Web Services (AWS) supporting public institutions (such as local governments and cities) in Benelux. He has over 20 years of industry experience in helping customers design, build, implement, and operate enterprise applications.

How to use Amazon CodeWhisperer using Okta as an external IdP

Post Syndicated from Chetan Makvana original https://aws.amazon.com/blogs/devops/how-to-use-amazon-codewhisperer-using-okta-as-an-external-idp/

Customers using Amazon CodeWhisperer often want to enable their developers to sign in using existing identity providers (IdPs), such as Okta. CodeWhisperer provides support for authentication through either AWS Builder ID or AWS IAM Identity Center. AWS Builder ID is a personal profile for builders. It is designed for individual developers, particularly when working on personal projects or in cases where the organization does not authenticate to AWS using IAM Identity Center. IAM Identity Center is better suited for enterprise developers who use CodeWhisperer as employees of organizations that have an AWS account. The IAM Identity Center authentication method expands the capabilities of IAM by centralizing user administration and access control. Many customers prefer using Okta as their external IdP for Single Sign-On (SSO). They aim to leverage their existing Okta credentials to seamlessly access CodeWhisperer. To achieve this, customers utilize the IAM Identity Center authentication method.

Trained on billions of lines of Amazon and open-source code, CodeWhisperer is an AI coding companion that helps developers write code by generating real-time whole-line and full-function code suggestions in their Integrated Development Environments (IDEs). CodeWhisperer comes in two tiers—the Individual Tier is free for individual use, and the Professional Tier offers administrative features, like SSO and IAM Identity Center integration, policy control for referenced code suggestions, and higher limits on security scanning. Customers enable the Professional Tier of CodeWhisperer for their developers for business use. When using CodeWhisperer with the Professional Tier, developers authenticate with IAM Identity Center. We will also soon introduce the Enterprise Tier, offering the additional capability to customize CodeWhisperer to generate more relevant recommendations by making it aware of your internal libraries, APIs, best practices, and architectural patterns.

In this blog, we will show you how to set up Okta as an external IdP for IAM Identity Center and enable access to CodeWhisperer using existing Okta credentials.

How it works

The flow for accessing CodeWhisperer through the IAM Identity Center involves the authentication of Okta users using Security Assertion Markup Language (SAML) 2.0 authentication.

Figure 1: Architecture diagram for sign-in process

The sign-in process follows these steps:

  1. IAM Identity Center synchronizes user and group information from Okta into IAM Identity Center using the System for Cross-domain Identity Management (SCIM) v2.0 protocol.
  2. A developer with an Okta account connects to CodeWhisperer through IAM Identity Center.
  3. If the developer isn’t already authenticated, they will be redirected to the Okta account login. The developer will sign in using their Okta credentials.
  4. If the sign-in is successful, Okta processes the request and sends a SAML response containing the developer’s identity and authentication status to IAM Identity Center.
  5. If the SAML response is valid and the developer is authenticated, IAM Identity Center grants access to CodeWhisperer.
  6. The developer can now securely access and use CodeWhisperer.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Configure Okta as an external IdP with IAM Identity Center

Step 1: Enable IAM Identity Center in Okta

  1. In the Okta admin console, navigate to Applications > Applications.
  2. Browse the App Catalog for the AWS IAM Identity Center app and select Add Integration.

Figure 2: Okta’s App Integration Catalog

  3. Provide a name for the app integration, using the Application label. Example: “Demo_AWS IAM Identity Center”.
  4. Navigate to the Sign On tab for the “Demo_AWS IAM Identity Center” app.
  5. Under SAML Signing Certificates, select View IdP Metadata from the Actions drop-down menu and save the contents as metadata.xml.

Figure 3: SAML Signing Certificates Page

Step 2: Configure Okta as an external IdP in IAM Identity Center

  1. In the IAM Identity Center console, navigate to the Dashboard.
  2. Select Choose your identity source.

Figure 4: IAM Identity Center Console

  3. Under Identity source, select Change identity source from the Actions drop-down menu.
  4. On the next page, select External identity provider, then click Next.
  5. Configure the external identity provider:
    • IdP SAML metadata: Click Choose file to upload the Okta IdP SAML metadata (metadata.xml) that you saved in the previous section.
    • Make a copy of the AWS access portal sign-in URL, IAM Identity Center ACS URL, and IAM Identity Center issuer URL values. You’ll need these values later on.
    • Click Next.
  6. Review the list of changes. Once you are ready to proceed, type ACCEPT, then click Change identity source.

Step 3: Configure Okta with IAM Identity Center Sign On details

  1. In Okta, select the Sign On tab of the IAM Identity Center SAML app, then click Edit:
    • Enter the AWS IAM Identity Center SSO ACS URL and AWS IAM Identity Center SSO issuer URL values that you copied in the previous section (step 5) into the corresponding fields.
    • Application username format: Select one of the options from the drop-down menu.
  2. Click Save.

Step 4: Enable provisioning in IAM Identity Center

  1. In IAM Identity Center Console, choose Settings in the left navigation pane.
  2. On the Settings page, locate the Automatic provisioning information box, and then choose Enable. This immediately enables automatic provisioning in IAM Identity Center and displays the necessary SCIM endpoint and access token information.

Figure 5: IAM Identity Center Settings for Automatic provisioning

  3. In the Inbound automatic provisioning dialog box, copy each of the values for the following options. You will need to paste these in later when you configure provisioning in Okta.
    • SCIM endpoint
    • Access token
  4. Choose Close.

Step 5: Configure provisioning in Okta

  1. In a separate browser window, log in to the Okta admin portal and navigate to the IAM Identity Center app.
  2. On the IAM Identity Center app page, choose the Provisioning tab, and then choose Integration.
  3. Choose Configure API Integration and then select the check box next to Enable API integration to enable provisioning.
  4. Paste the SCIM endpoint that you copied in the previous step into the Base URL field. Make sure that you remove the trailing forward slash at the end of the URL.
  5. Paste the access token that you copied in the previous step into the API Token field.
  6. Choose Test API Credentials to verify that the credentials entered are valid (for a programmatic way to check the SCIM integration, see the sketch after this list).
  7. Choose Save.
  8. Under Settings, choose To App, choose Edit, and then select the Enable check box for each of the Provisioning Features you want to enable.
  9. Choose Save.
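If you want to verify the SCIM integration outside of Okta, the following is a minimal sketch that calls the SCIM endpoint directly; the endpoint and access token values are placeholders taken from the Inbound automatic provisioning dialog, and /Users is the standard SCIM v2 resource (some SCIM implementations require a filter, as noted in the comments).

import requests

# Placeholders from the Inbound automatic provisioning dialog in IAM Identity Center.
SCIM_ENDPOINT = "https://scim.us-east-1.amazonaws.com/<tenant-id>/scim/v2"  # no trailing slash
ACCESS_TOKEN = "<SCIM access token>"

# List users that have been provisioned into IAM Identity Center.
# If the unfiltered call is rejected, retry with a filter such as:
# params={"filter": 'userName eq "user@example.com"'}
response = requests.get(
    f"{SCIM_ENDPOINT}/Users",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    timeout=10,
)
response.raise_for_status()
for user in response.json().get("Resources", []):
    print(user.get("userName"))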

Step 6: Assign access for groups in Okta

  1. In the Okta admin console, navigate to the IAM Identity Center app page and choose the Assignments tab.
  2. On the Assignments page, choose Assign, and then choose Assign to Groups.
  3. Choose the Okta group or groups that you want to give access to the IAM Identity Center app. Choose Assign, choose Save and Go Back, and then choose Done. This starts provisioning the users in the group into IAM Identity Center.
  4. Repeat step 3 for all groups to which you want to provide access.
  5. Choose the Push Groups tab. Choose the Okta group or groups that you chose in the previous step.
  6. Then choose Save. The group status changes to Active after the group and its members have successfully been pushed to IAM Identity Center. To confirm programmatically that the users and groups arrived in IAM Identity Center, see the sketch after this list.
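The following is a minimal boto3 sketch that lists the users and groups now present in IAM Identity Center; the identity store ID is a placeholder that you can find on the IAM Identity Center Settings page.

import boto3

# Placeholder identity store ID from the IAM Identity Center Settings page.
IDENTITY_STORE_ID = "d-0000000000"

identitystore = boto3.client("identitystore")

# Groups pushed from Okta through SCIM.
for group in identitystore.list_groups(IdentityStoreId=IDENTITY_STORE_ID)["Groups"]:
    print("group:", group["DisplayName"], group["GroupId"])

# Users synchronized from Okta.
for user in identitystore.list_users(IdentityStoreId=IDENTITY_STORE_ID)["Users"]:
    print("user:", user["UserName"], user["UserId"])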


Step 7: Provide access to CodeWhisperer

  1. In the CodeWhisperer console, under Settings, add the groups that require access to CodeWhisperer.

    Figure 7: Amazon CodeWhisperer Settings Page

Set up AWS Toolkit with IAM Identity Center

To use CodeWhisperer, you will now set up the AWS Toolkit within your integrated development environment (IDE) to authenticate with IAM Identity Center.

Figure 8: Set up AWS Toolkit with IAM Identity Center

  1. In the IDE, open the AWS extension panel and click the Start button under Developer Tools > CodeWhisperer.
  2. In the resulting pane, expand the section titled Have a Professional Tier subscription? Sign in with IAM Identity Center.
  3. Enter the IAM Identity Center URL you previously copied into the Start URL field.
  4. Set the region to us-east-1 and click the Sign in button.
  5. Click Copy Code and Proceed to copy the code from the resulting pop-up.
  6. When prompted by the Do you want Code to open the external website? pop-up, click Open.
  7. Paste the code copied in Step 5 and click Next.
  8. Enter your Okta credentials and click Sign in.
  9. Click Allow to grant the AWS Toolkit access to your data.
  10. When the connection is complete, a notification indicates that it is safe to close your browser. Close the browser tab and return to the IDE.
  11. Depending on your preference, select Yes if you wish to continue using IAM Identity Center with CodeWhisperer while using your current AWS profile; otherwise, select No.
  12. You are now all set to use CodeWhisperer from within the IDE, authenticated with your Okta credentials.

Test Configuration

If you have successfully completed the previous steps, you will see code suggestions from CodeWhisperer in your IDE.
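For example, typing a comment prompt like the one below in a Python file typically triggers a whole-function suggestion; the generated body shown here is only illustrative of the kind of code CodeWhisperer may propose.

# Prompt typed by the developer:
# upload a file to an Amazon S3 bucket

# A suggestion similar to what CodeWhisperer may generate:
import boto3


def upload_file_to_s3(file_path: str, bucket_name: str, object_key: str) -> None:
    """Upload a local file to the given S3 bucket and key."""
    s3 = boto3.client("s3")
    s3.upload_file(file_path, bucket_name, object_key)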

Figure 9: Test Step Configurations

Conclusion

In this post, you learned how to leverage your existing Okta credentials to access Amazon CodeWhisperer through the IAM Identity Center integration. The post walked through the process of setting up Okta as an external IdP with IAM Identity Center. You then configured the AWS Toolkit to establish an authenticated connection to AWS using Okta credentials, enabling you to access the CodeWhisperer Professional Tier.

About the authors:

Chetan Makvana

Chetan Makvana is a senior solutions architect working with global systems integrators at AWS. He works with AWS partners and customers to provide them with architectural guidance for building scalable architectures and to execute strategies to drive adoption of AWS services. He is a technology enthusiast and a builder with a core area of interest in serverless and DevOps. Outside of work, he enjoys binge-watching, traveling, and music.

Balachander Keelapudi

Balachander Keelapudi is a senior solutions architect working with AWS customers to provide them with architectural guidance for building scalable architectures and to execute strategies to drive adoption of AWS services. He is a technology enthusiast and a builder with a core area of interest in DevSecOps and observability. Outside of work, he enjoys biking and hiking with his family.

An automated approach to perform an in-place engine upgrade in Amazon OpenSearch Service

Post Syndicated from Prashant Agrawal original https://aws.amazon.com/blogs/big-data/an-automated-approach-to-perform-an-in-place-engine-upgrade-in-amazon-opensearch-service/

Software upgrades bring new features and better performance, and keep you current with the software provider. However, upgrades for software services can be difficult to complete successfully, especially when you can’t tolerate downtime and when the new version’s APIs introduce breaking changes and deprecations that you must remediate. This post shows you how to upgrade from the Elasticsearch engine to the OpenSearch engine on Amazon OpenSearch Service without needing an intermediate upgrade to Elasticsearch 7.10.

OpenSearch Service supports OpenSearch as an engine, with versions in the 1.x through 2.x series. The service also supports legacy versions of Elasticsearch, versions 1.x through 7.10. Although OpenSearch brings many improvements over earlier engines, it can feel daunting to consider not only upgrading versions, but also changing engines in the process. The good news is that OpenSearch 1.0 is wire compatible with Elasticsearch 7.10, making engine changes straightforward. If you’re running a version of Elasticsearch in the 6.x or early 7.x series on OpenSearch Service, you might think you need to upgrade to Elasticsearch 7.10, and then upgrade to OpenSearch 1.3. However, you can easily upgrade your existing Elasticsearch engine running 6.8, 7.1, 7.2, 7.4, 7.9, and 7.10 in OpenSearch Service to the OpenSearch 1.3 engine.

OpenSearch Service performs the following steps when running an upgrade:

  • Validation before starting an upgrade
  • Preparing the setup configuration for the desired version
  • Provisioning new nodes with the same hardware configuration
  • Moving shards from old nodes to newly provisioned nodes
  • Removing older nodes and old node references from OpenSearch endpoints

During an upgrade, AWS takes care of the undifferentiated heavy lifting of provisioning, deploying, and moving the data to the new domain. You are responsible for making sure there are no breaking changes that affect the data migration and movement to the newer version of the OpenSearch domain. In this post, we discuss what you must modify and verify before and after upgrading from Elasticsearch versions 6.8, 7.1, 7.2, 7.4, 7.9, and 7.10 to OpenSearch 1.3 on OpenSearch Service.

Pre-upgrade breaking changes

The following are pre-upgrade breaking changes:

  • Dependency check for language clients and libraries – If you’re using the open-source high-level language clients from Elastic, for example the Java, Go, or Python client libraries, AWS recommends moving to the open-source OpenSearch versions of these clients. (If you don’t use a high-level language client, you can skip this step.) The following are a few steps to perform a dependency check:
    • Determine the client library – Choose an appropriate client library compatible with your programing language. Refer to OpenSearch language clients for a list of all supported client libraries.
    • Add dependencies and resolve conflicts – Update your project’s dependency management system with the necessary dependencies specified by the client library. If your project already has dependencies that conflict with the OpenSearch client library dependencies, you may encounter dependency conflicts. In such cases, you need to resolve the conflicts manually.
    • Test and verify the client – Test the OpenSearch client functionality by establishing a connection, performing some basic operations (like indexing and searching), and verifying the results.
  • Removal of mapping types – Multiple types within an index were deprecated in Elasticsearch version 6.x, and completely removed in version 7.0 or later. OpenSearch indexes can only contain one mapping type. From OpenSearch version 2.x onward, the mapping _type must be _doc. You must check and fix the mapping before upgrading to OpenSearch 1.3.

Complete the following steps to identify and fix mapping issues:

  1. Navigate to Dev Tools and use the GET mapping API to fetch the mapping information for your indexes:
GET /index-name/_mapping

The mapping response will contain a JSON structure that represents the mapping for your index.

  2. Look for the top-level keys in the response JSON; each key represents a custom type within the index.

The _doc type is used for the default type in Elasticsearch 7.x and OpenSearch Service 1.x, but you may see additional types that you defined in earlier versions of Elasticsearch. The following is an example response for an index with two custom types, type1 and type2.

Note that indexes created in 5.x will continue to function in 6.x as they did in 5.x, but indexes created in 6.x only allow a single type per index.

{
  "myindex": {
    "mappings": {
      "type1": {
        "properties": {
          "field1_type1": {
            "type": "text"
          },
          "field2_type1": {
            "type": "integer"
          }
        }
      },
      "type2": {
        "properties": {
          "field1_type2": {
            "type": "keyword"
          },
          "field2_type2": {
            "type": "date"
          }
        }
      }
    }
  }
}

To fix the multiple mapping types in your existing domain, you need to reindex the data, creating one index for each mapping type. This is a crucial step in the migration process because OpenSearch doesn’t support multiple types within a single index. In the next steps, we convert an index that has multiple mapping types into two separate indexes, each using the _doc type.

  3. You can unify the mapping by using your existing index name as a root and adding the type as a suffix. For example, the following code creates two indexes with myindex as the root name and type1 and type2 as the suffix:
    # Create an index for "type1"
    PUT /myindex_type1
    
    # Create an index for "type2"
    PUT /myindex_type2

  4. Use the _reindex API to reindex the data from the original index into the two new indexes. Alternatively, you can reload the data from its source, if you’re keeping it in another system.
    POST _reindex
    {
      "source": {
        "index": "myindex",
        "type": "type1"  
      },
      "dest": {
        "index": "myindex_type1",
        "type": "_doc"  
      }
    }
    POST _reindex
    {
      "source": {
        "index": "myindex",
        "type": "type2"  
      },
      "dest": {
        "index": "myindex_type2",
        "type": "_doc"  
      }
    }

  5. If your application was previously querying the original index with multiple types, you’ll need to update your queries to specify the new indexes with _doc as the type. For example, if your client was querying using myindex, which has been reindexed to myindex_type1 and myindex_type2, then change your clients to point to myindex*, which will query across both indexes.
  6. After you have verified that the data is successfully reindexed and your application is working as expected with the new indexes (see the sketch after this list for a quick way to spot-check), you should delete the original index before starting the upgrade, because it won’t be supported in the new version. Be cautious when deleting data and make sure you have backups if necessary.
  7. As part of this upgrade, Kibana will be replaced with OpenSearch Dashboards. When you’re done with the upgrade in the next step, you should encourage your users to use the new endpoint, which will be _dashboards. If you use a custom endpoint, be sure to update it to point to /_dashboards.
  8. We recommend that you update your AWS Identity and Access Management (IAM) policies to use the renamed API operations. However, OpenSearch Service will continue to respect existing policies by internally replicating the old API permissions. Service control policies (SCPs) introduce an additional layer of complexity compared to standard IAM. To prevent your SCP policies from breaking, you need to add both the old and the new API operations to each of your SCP policies.
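The following is a minimal sketch for spot-checking the reindexed data; the domain endpoint and credentials are placeholders, and it assumes the domain accepts HTTP basic authentication (adjust the auth if your domain requires IAM-signed requests).

import requests
from requests.auth import HTTPBasicAuth

# Placeholder domain endpoint and credentials.
DOMAIN = "https://search-mydomain.us-east-1.es.amazonaws.com"
AUTH = HTTPBasicAuth("admin-user", "admin-password")

# Count documents in each new index to confirm the reindex completed.
for index in ["myindex_type1", "myindex_type2"]:
    count = requests.get(f"{DOMAIN}/{index}/_count", auth=AUTH, timeout=10).json()["count"]
    print(index, count)

# A wildcard pattern such as myindex* queries both indexes at once.
hits = requests.get(
    f"{DOMAIN}/myindex*/_search",
    json={"query": {"match_all": {}}, "size": 1},
    auth=AUTH,
    timeout=10,
).json()["hits"]["total"]
print(hits)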

Start the upgrade

The upgrade process is irreversible and can’t be paused or cancelled. You should make sure that everything will go smoothly by running a proof of concept (POC) check. By building a POC, you ensure data preservation, avoid compatibility issues, prevent unwanted bugs, and mitigate risks. You can run a POC with a small domain on the new version, and with a small subset of your data. Deploy and run any front-end code that communicates with OpenSearch Service via API against your test domain. Use your ingest pipeline to send data to your test domain as well, ensuring nothing breaks. Import or rebuild dashboards, alerts, anomaly detectors, and so on. This simple precaution can make your upgrade experience smoother and trouble-free while minimizing potential disruptions.

When OpenSearch Service starts the upgrade, it can take from 15 minutes to several hours to complete. OpenSearch Dashboards might be unavailable during some or all of the duration of the upgrade.

In this section, we show how you can start an upgrade using the AWS Management Console. You can also run the upgrade using the AWS Command Line Interface (AWS CLI) or AWS SDK. For more information, refer to Starting an upgrade (CLI) and Starting an upgrade (SDK).
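For reference, the following is a minimal boto3 sketch of the SDK path; the domain name is a placeholder, and it assumes the domain is eligible for the OpenSearch_1.3 engine version.

import boto3

DOMAIN_NAME = "my-domain"  # placeholder

opensearch = boto3.client("opensearch")

# List the engine versions this domain can be upgraded to.
print(opensearch.get_compatible_versions(DomainName=DOMAIN_NAME)["CompatibleVersions"])

# Run only the eligibility checks first (the SDK equivalent of Check Upgrade Eligibility).
opensearch.upgrade_domain(
    DomainName=DOMAIN_NAME, TargetVersion="OpenSearch_1.3", PerformCheckOnly=True
)

# When the check succeeds, start the actual upgrade and poll its status.
opensearch.upgrade_domain(DomainName=DOMAIN_NAME, TargetVersion="OpenSearch_1.3")
print(opensearch.get_upgrade_status(DomainName=DOMAIN_NAME))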

  1. Take a manual snapshot of your domain.

This snapshot serves as a backup that you can restore on a new domain if you want to return to using the prior version.

  2. On the OpenSearch Service console, choose the domain that you want to upgrade.
  3. On the Actions menu, choose Upgrade.
  4. Select the target version as OpenSearch 1.3. If you’re upgrading to an OpenSearch version, then you’ll be able to enable compatibility mode. If you enable this setting, OpenSearch reports its version as 7.10 to allow Elastic’s open-source clients and plugins like Logstash to continue working with your OpenSearch Service domain.
  5. Choose Check Upgrade Eligibility, which helps you identify if there are any breaking changes you still need to fix before running an upgrade.
  6. Choose Upgrade.
  7. Check the status on the domain dashboard to monitor the status of the upgrade.

The following graphic gives a quick demonstration of running and monitoring the upgrade via the preceding steps.

Post-upgrade changes and validations

Now that you have successfully upgraded your domain to OpenSearch Service 1.3, be sure to make the following changes:

  • Custom endpoint – A custom endpoint for your OpenSearch Service domain makes it straightforward for you to refer to your OpenSearch and OpenSearch Dashboards URLs. You can include your company’s branding or just use a shorter, easier-to-remember endpoint than the standard one. In OpenSearch Service, Kibana has been renamed to OpenSearch Dashboards. After you upgrade a domain from the Elasticsearch engine to the OpenSearch engine, the /_plugin/kibana endpoint changes to /_dashboards. If you use the Kibana endpoint to set up a custom domain, you need to update it to include the new /_dashboards endpoint.
  • SAML authentication for OpenSearch Dashboards – After you upgrade your domain to OpenSearch Service, you need to change all Kibana URLs configured in your identity provider (IdP) from /_plugin/kibana to /_dashboards. The most common URLs are assertion consumer service (ACS) URLs and recipient URLs.

Conclusion

This post discussed what to consider when planning to upgrade your existing Elasticsearch engine in your OpenSearch Service domain to the OpenSearch engine. OpenSearch Service continues to support older engine versions, including open-source versions of Elasticsearch from 1.x through 7.10. By moving to the OpenSearch engine, you can take advantage of bug fixes, performance improvements, new service features, and expanded instance types for your data nodes, like AWS Graviton2 instances.

If you have feedback about this post, share it in the comments section. If you have questions about this post, start a new thread on the Amazon OpenSearch Service forum or contact AWS Support.


About the Authors

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Harsh Bansal is a Solutions Architect at AWS with analytics as his area of specialty. He has been building solutions to help organizations make data-driven decisions.

Create, train, and deploy Amazon Redshift ML model integrating features from Amazon SageMaker Feature Store

Post Syndicated from Anirban Sinha original https://aws.amazon.com/blogs/big-data/create-train-and-deploy-amazon-redshift-ml-model-integrating-features-from-amazon-sagemaker-feature-store/

Amazon Redshift is a fast, petabyte-scale, cloud data warehouse that tens of thousands of customers rely on to power their analytics workloads. Data analysts and database developers want to use this data to train machine learning (ML) models, which can then be used to generate insights on new data for use cases such as forecasting revenue, predicting customer churn, and detecting anomalies. Amazon Redshift ML makes it easy for SQL users to create, train, and deploy ML models using SQL commands familiar to many roles such as executives, business analysts, and data analysts. We covered in a previous post how you can use data in Amazon Redshift to train models in Amazon SageMaker, a fully managed ML service, and then make predictions within your Redshift data warehouse.

Redshift ML currently supports ML algorithms such as XGBoost, multilayer perceptron (MLP), KMEANS, and Linear Learner. Additionally, you can import existing SageMaker models into Amazon Redshift for in-database inference or remotely invoke a SageMaker endpoint.

Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, share, and manage features for ML models. However, one challenge in training a production-ready ML model using SageMaker Feature Store is access to a diverse set of features that aren’t always owned and maintained by the team that is building the model. For example, an ML model to identify fraudulent financial transactions needs access to both identifying (device type, browser) and transaction (amount, credit or debit, and so on) related features. As a data scientist building an ML model, you may have access to the identifying information but not the transaction information, and having access to a feature store solves this.

In this post, we discuss the combined feature store pattern, which allows teams to maintain their own local feature stores using a local Redshift table while still being able to access shared features from the centralized feature store. In a local feature store, you can store sensitive data that can’t be shared across the organization for regulatory and compliance reasons.

We also show you how to use familiar SQL statements to create and train ML models by combining shared features from the centralized store with local features and use these models to make in-database predictions on new data for use cases such as fraud risk scoring.

Overview of solution

For this post, we create an ML model to predict if a transaction is fraudulent or not, given the transaction record. To build this, we need to engineer features that describe an individual credit card’s spending pattern, such as the number of transactions or the average transaction amount, and also information about the merchant, the cardholder, the device used to make the payment, and any other data that may be relevant to detecting fraud.

To get started, we need an Amazon Redshift Serverless data warehouse with the Redshift ML feature enabled and an Amazon SageMaker Studio environment with access to SageMaker Feature Store. For an introduction to Redshift ML and instructions on setting it up, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.

We also need an offline feature store to store features in feature groups. The offline store uses an Amazon Simple Storage Service (Amazon S3) bucket for storage and can also fetch data using Amazon Athena queries. For an introduction to SageMaker Feature Store and instructions on setting it up, see Getting started with Amazon SageMaker Feature Store.

The following diagram illustrates the solution architecture.

The workflow contains the following steps:

  1. Create the offline feature group in SageMaker Feature Store and ingest data into the feature group.
  2. Create a Redshift table and load local feature data into the table.
  3. Create an external schema for Amazon Redshift Spectrum to access the offline store data stored in Amazon S3 using the AWS Glue Data Catalog.
  4. Train and validate a fraud risk scoring ML model using local feature data and external offline feature store data.
  5. Use the offline feature store and local store for inference.

Dataset

To demonstrate this use case, we use a synthetic dataset with two tables: identity and transactions. They can both be joined by the TransactionID column. The transaction table contains information about a particular transaction, such as amount, credit or debit card, and so on, and the identity table contains information about the user, such as device type and browser. The transaction must exist in the transaction table, but might not always be available in the identity table.

The following is an example of the transactions dataset.

The following is an example of the identity dataset.

Let’s assume that across the organization, data science teams centrally manage the identity data and process it to extract features in a centralized offline feature store. The data warehouse team ingests and analyzes transaction data in a Redshift table, owned by them.

We work through this use case to understand how the data warehouse team can securely retrieve the latest features from the identity feature group and join it with transaction data in Amazon Redshift to create a feature set for training and inferencing a fraud detection model.

Create the offline feature group and ingest data

To start, we set up SageMaker Feature Store, create a feature group for the identity dataset, inspect and process the dataset, and ingest some sample data. We then prepare the transaction features from the transaction data and store it in Amazon S3 for further loading into the Redshift table.

Alternatively, you can author features using Amazon SageMaker Data Wrangler, create feature groups in SageMaker Feature Store, and ingest features in batches using an Amazon SageMaker Processing job with a notebook exported from SageMaker Data Wrangler. This mode allows for batch ingestion into the offline store.

Let’s explore some of the key steps in this section.

  1. Download the sample notebook.
  2. On the SageMaker console, under Notebook in the navigation pane, choose Notebook instances.
  3. Locate your notebook instance and choose Open Jupyter.
  4. Choose Upload and upload the notebook you just downloaded.
  5. Open the notebook sagemaker_featurestore_fraud_redshiftml_python_sdk.ipynb.
  6. Follow the instructions and run all the cells up to the Cleanup Resources section.

The following are key steps from the notebook:

  1. We create a Pandas DataFrame with the initial CSV data. We apply feature transformations for this dataset.
    identity_data = pd.read_csv(io.BytesIO(identity_data_object["Body"].read()))
    transaction_data = pd.read_csv(io.BytesIO(transaction_data_object["Body"].read()))
    
    identity_data = identity_data.round(5)
    transaction_data = transaction_data.round(5)
    
    identity_data = identity_data.fillna(0)
    transaction_data = transaction_data.fillna(0)
    
    # Feature transformations for this dataset are applied 
    # One hot encode card4, card6
    encoded_card_bank = pd.get_dummies(transaction_data["card4"], prefix="card_bank")
    encoded_card_type = pd.get_dummies(transaction_data["card6"], prefix="card_type")
    
    transformed_transaction_data = pd.concat(
        [transaction_data, encoded_card_type, encoded_card_bank], axis=1
    )

  2. We store the processed and transformed transaction dataset in an S3 bucket. This transaction data will be loaded later in the Redshift table for building the local feature store.
    transformed_transaction_data.to_csv("transformed_transaction_data.csv", header=False, index=False)
    s3_client.upload_file("transformed_transaction_data.csv", default_s3_bucket_name, prefix + "/training_input/transformed_transaction_data.csv")

  3. Next, we need a record identifier name and an event time feature name. In our fraud detection example, the column of interest is TransactionID. EventTime can be appended to your data when no timestamp is available. In the following code, you can see how these variables are set, and then EventTime is appended to both features’ data.
    # record identifier and event time feature names
    record_identifier_feature_name = "TransactionID"
    event_time_feature_name = "EventTime"
    
    # append EventTime feature
    identity_data[event_time_feature_name] = pd.Series(
        [current_time_sec] * len(identity_data), dtype="float64"
    )

  4. We then create and ingest the data into the feature group using the SageMaker SDK FeatureGroup.ingest API. This is a small dataset and therefore can be loaded into a Pandas DataFrame. When we work with large amounts of data and millions of rows, there are other scalable mechanisms to ingest data into SageMaker Feature Store, such as batch ingestion with Apache Spark.
    identity_feature_group.create(
        s3_uri=<S3_Path_Feature_Store>,
        record_identifier_name=record_identifier_feature_name,
        event_time_feature_name=event_time_feature_name,
        role_arn=<role_arn>,
        enable_online_store=False,
    )
    
    identity_feature_group_name = "identity-feature-group"
    
    # load feature definitions to the feature group. SageMaker FeatureStore Python SDK will auto-detect the data schema based on input data.
    identity_feature_group.load_feature_definitions(data_frame=identity_data)
    identity_feature_group.ingest(data_frame=identity_data, max_workers=3, wait=True)
    

  5. We can verify that data has been ingested into the feature group by running Athena queries in the notebook or running queries on the Athena console.

At this point, the identity feature group is created in an offline feature store with historical data persisted in Amazon S3. SageMaker Feature Store automatically creates an AWS Glue Data Catalog for the offline store, which enables us to run SQL queries against the offline data using Athena or Redshift Spectrum.
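As a quick check outside the notebook, the following sketch submits an Athena query against the table that SageMaker Feature Store registered in the Data Catalog; the database, table, and results location are placeholders (the table name carries a timestamp suffix, so check the Data Catalog for the exact name).

import boto3

# Placeholders: Glue database and table created by SageMaker Feature Store, plus an Athena results location.
DATABASE = "sagemaker_featurestore"
TABLE = "identity_feature_group_1680208535"
RESULTS = "s3://my-athena-results-bucket/queries/"

athena = boto3.client("athena")

# Count the records ingested into the offline store; the call is asynchronous,
# so fetch the results with get_query_results once the execution succeeds.
execution = athena.start_query_execution(
    QueryString=f"SELECT COUNT(*) FROM {TABLE}",
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": RESULTS},
)
print(execution["QueryExecutionId"])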

Create a Redshift table and load local feature data

To build a Redshift ML model, we build a training dataset joining the identity data and transaction data using SQL queries. The identity data is in a centralized feature store where the historical set of records are persisted in Amazon S3. The transaction data is local feature data for training that needs to be made available in the Redshift table.

Let’s explore how to create the schema and load the processed transaction data from Amazon S3 into a Redshift table.

  1. Create the customer_transaction table and load daily transaction data into the table, which you’ll use to train the ML model:
    DROP TABLE customer_transaction;
    CREATE TABLE customer_transaction (
      TransactionID INT,    
      isFraud INT,  
      TransactionDT INT,    
      TransactionAmt decimal(10,2), 
      card1 INT,    
      card2 decimal(10,2),card3 decimal(10,2),  
      card4 VARCHAR(20),card5 decimal(10,2),    
      card6 VARCHAR(20),    
      B1 INT,B2 INT,B3 INT,B4 INT,B5 INT,B6 INT,
      B7 INT,B8 INT,B9 INT,B10 INT,B11 INT,B12 INT,
      F1 INT,F2 INT,F3 INT,F4 INT,F5 INT,F6 INT,
      F7 INT,F8 INT,F9 INT,F10 INT,F11 INT,F12 INT,
      F13 INT,F14 INT,F15 INT,F16 INT,F17 INT,  
      N1 VARCHAR(20),N2 VARCHAR(20),N3 VARCHAR(20), 
      N4 VARCHAR(20),N5 VARCHAR(20),N6 VARCHAR(20), 
      N7 VARCHAR(20),N8 VARCHAR(20),N9 VARCHAR(20), 
      card_type_0  boolean,
      card_type_credit boolean,
      card_type_debit  boolean,
      card_bank_0  boolean,
      card_bank_american_express boolean,
      card_bank_discover  boolean,
      card_bank_mastercard  boolean,
      card_bank_visa boolean  
    );

  2. Load the sample data by using the following command. Replace the Region and S3 path as appropriate. You will find the S3 path in the S3 Bucket Setup For The OfflineStore section of the notebook or by checking the dataset_uri_prefix variable in the notebook.
    COPY customer_transaction
    FROM '<s3path>/transformed_transaction_data.csv' 
    IAM_ROLE default delimiter ',' 
    region 'your-region';

Now that we have created a local feature store for the transaction data, we focus on integrating a centralized feature store with Amazon Redshift to access the identity data.

Create an external schema for Redshift Spectrum to access the offline store data

We have created a centralized feature store for identity features, and we can access this offline feature store using services such as Redshift Spectrum. When the identity data is available through the Redshift Spectrum table, we can create a training dataset with feature values from the identity feature group and customer_transaction, joining on the TransactionId column.

This section provides an overview of how to enable Redshift Spectrum to query data directly from files on Amazon S3 through an external database in an AWS Glue Data Catalog.

  1. First, check that the identity-feature-group table is present in the Data Catalog under the sagemaker_featurestore database.
  2. Using Redshift Query Editor V2, create an external schema using the following command:
    CREATE EXTERNAL SCHEMA sagemaker_featurestore
    FROM DATA CATALOG
    DATABASE 'sagemaker_featurestore'
    IAM_ROLE default
    create external database if not exists;

All the tables, including identity-feature-group external tables, are visible under the sagemaker_featurestore external schema. In Redshift Query Editor v2, you can check the contents of the external schema.

  3. Run the following query to sample a few records—note that your table name may be different:
    Select * from sagemaker_featurestore.identity_feature_group_1680208535 limit 10;

  4. Create a view to join the latest data from identity-feature-group and customer_transaction on the TransactionId column. Be sure to change the external table name to match the one in your environment:
    create or replace view public.credit_fraud_detection_v
    AS select  "isfraud",
            "transactiondt",
            "transactionamt",
            "card1","card2","card3","card5",
             case when "card_type_credit" = 'False' then 0 else 1 end as card_type_credit,
             case when "card_type_debit" = 'False' then 0 else 1 end as card_type_debit,
             case when "card_bank_american_express" = 'False' then 0 else 1 end as card_bank_american_express,
             case when "card_bank_discover" = 'False' then 0 else 1 end as card_bank_discover,
             case when "card_bank_mastercard" = 'False' then 0 else 1 end as card_bank_mastercard,
             case when "card_bank_visa" = 'False' then 0 else 1 end as card_bank_visa,
            "id_01","id_02","id_03","id_04","id_05"
    from public.customer_transaction ct left join sagemaker_featurestore.identity_feature_group_1680208535 id
    on id.transactionid = ct.transactionid with no schema binding;

Train and validate the fraud risk scoring ML model

Redshift ML gives you the flexibility to specify your own algorithms and model types and also to provide your own advanced parameters, which can include preprocessors, problem type, and hyperparameters. In this post, we create a custom model by specifying AUTO OFF and the model type XGBOOST. By turning AUTO OFF and using XGBoost, we are providing the necessary inputs for SageMaker to train the model. A benefit of this can be faster training times. XGBoost is an open-source implementation of the gradient boosted trees algorithm. For more details on XGBoost, refer to Build XGBoost models with Amazon Redshift ML.

We train the model using 80% of the dataset by filtering on transactiondt < 12517618. The other 20% will be used for inference. A centralized feature store is useful in providing the latest supplemental data for training. Note that you will need to provide an S3 bucket name in the create model statement. It will take approximately 10 minutes to create the model.

CREATE MODEL frauddetection_xgboost
FROM (select  "isfraud",
        "transactiondt",
        "transactionamt",
        "card1","card2","card3","card5",
        "card_type_credit",
        "card_type_debit",
        "card_bank_american_express",
        "card_bank_discover",
        "card_bank_mastercard",
        "card_bank_visa",
        "id_01","id_02","id_03","id_04","id_05"
from credit_fraud_detection_v where transactiondt < 12517618
)
TARGET isfraud
FUNCTION ml_fn_frauddetection_xgboost
IAM_ROLE default
AUTO OFF
MODEL_TYPE XGBOOST
OBJECTIVE 'binary:logistic'
PREPROCESSORS 'none'
HYPERPARAMETERS DEFAULT EXCEPT(NUM_ROUND '100')
SETTINGS (S3_BUCKET <s3_bucket>);

When you run the create model command, it will complete quickly in Amazon Redshift while the model training is happening in the background using SageMaker. You can check the status of the model by running a show model command:

show model frauddetection_xgboost;

The output of the show model command shows that the model state is TRAINING. It also shows other information such as the model type and the training job name that SageMaker assigned.

After a few minutes, we run the show model command again:

show model frauddetection_xgboost;

Now the output shows the model state is READY. We can also see the train:error score here, which at 0 tells us we have a good model. Now that the model is trained, we can use it for running inference queries.

Use the offline feature store and local store for inference

We can use the SQL function to apply the ML model to data in queries, reports, and dashboards. Let’s use the function ml_fn_frauddetection_xgboost created by our model against our test dataset by filtering where transactiondt >=12517618, to predict whether a transaction is fraudulent or not. SageMaker Feature Store can be useful in supplementing data for inference requests.

Run the following query to predict whether transactions are fraudulent or not:

select  "isfraud" as "Actual",
        ml_fn_frauddetection_xgboost(
        "transactiondt",
        "transactionamt",
        "card1","card2","card3","card5",
        "card_type_credit",
        "card_type_debit",
        "card_bank_american_express",
        "card_bank_discover",
        "card_bank_mastercard",
        "card_bank_visa",
        "id_01","id_02","id_03","id_04","id_05") as "Predicted"
from credit_fraud_detection_v where transactiondt >= 12517618;

For binary and multi-class classification problems, we compute the accuracy as the model metric. Accuracy can be calculated based on the following:

accuracy = (sum (actual == predicted)/total) *100

Let’s apply the preceding formula to our use case to find the accuracy of the model. We use the test data (transactiondt >= 12517618) to test the accuracy, and use the newly created function ml_fn_frauddetection_xgboost to predict, taking the columns other than the target label as the input:

-- check accuracy 
WITH infer_data AS (
SELECT "isfraud" AS label,
ml_fn_frauddetection_xgboost(
        "transactiondt",
        "transactionamt",
        "card1","card2","card3","card5",
        "card_type_credit",
        "card_type_debit",
        "card_bank_american_express",
        "card_bank_discover",
        "card_bank_mastercard",
        "card_bank_visa",
        "id_01","id_02","id_03","id_04","id_05") AS predicted,
CASE 
   WHEN label IS NULL
       THEN 0
   ELSE label
   END AS actual,
CASE 
   WHEN actual = predicted
       THEN 1::INT
   ELSE 0::INT
   END AS correct
FROM credit_fraud_detection_v where transactiondt >= 12517618),
aggr_data AS (
SELECT SUM(correct) AS num_correct,
COUNT(*) AS total
FROM infer_data) 

SELECT (num_correct::FLOAT / total::FLOAT) AS accuracy FROM aggr_data;

Clean up

As a final step, clean up the resources:

  1. Delete the Redshift cluster.
  2. Run the Cleanup Resources section of your notebook.

Conclusion

Redshift ML enables you to bring machine learning to your data, powering fast and informed decision-making. SageMaker Feature Store provides a purpose-built feature management solution to help organizations scale ML development across business units and data science teams.

In this post, we showed how you can train an XGBoost model using Redshift ML with data spread across SageMaker Feature Store and a Redshift table. Additionally, we showed how you can make inferences on a trained model to detect fraud using Amazon Redshift SQL commands.


About the authors

Anirban Sinha is a Senior Technical Account Manager at AWS. He is passionate about building scalable data warehouses and big data solutions working closely with customers. He works with large ISV customers, helping them build and operate secure, resilient, scalable, and high-performance SaaS applications in the cloud.

Phil Bates is a Senior Analytics Specialist Solutions Architect at AWS. He has more than 25 years of experience implementing large-scale data warehouse solutions. He is passionate about helping customers through their cloud journey and using the power of ML within their data warehouse.

Gaurav Singh is a Senior Solutions Architect at AWS, specializing in AI/ML and Generative AI. Based in Pune, India, he focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. In his spare time, Gaurav loves to explore nature, read, and run.

Enable cost-efficient operational analytics with Amazon OpenSearch Ingestion

Post Syndicated from Mikhail Vaynshteyn original https://aws.amazon.com/blogs/big-data/enable-cost-efficient-operational-analytics-with-amazon-opensearch-ingestion/

As the scale and complexity of microservices and distributed applications continues to expand, customers are seeking guidance for building cost-efficient infrastructure supporting operational analytics use cases. Operational analytics is a popular use case with Amazon OpenSearch Service. A few of the defining characteristics of these use cases are ingesting a high volume of time series data and a relatively low volume of querying, alerting, and running analytics on ingested data for real-time insights. Although OpenSearch Service is capable of ingesting petabytes of data across storage tiers, you still have to provision capacity to migrate between hot and warm tiers. This adds to the cost of provisioned OpenSearch Service domains.

The time series data often contains logs or telemetry data from various sources with different values and needs. That is, logs from some sources need to be available in a hot storage tier longer, whereas logs from other sources can tolerate delayed querying and have less stringent requirements. Until now, customers were building external ingestion systems with the Amazon Kinesis family of services, Amazon Simple Queue Service (Amazon SQS), AWS Lambda, custom code, and other similar solutions. Although these solutions enable ingestion of operational data with various requirements, they add to the cost of ingestion.

In general, operational analytics workloads use anomaly detection to aid domain operations. This assumes that the data is already present in OpenSearch Service and the cost of ingestion is already borne.

With the addition of a few recent features of Amazon OpenSearch Ingestion, a fully managed serverless pipeline for OpenSearch Service, you can effectively address each of these cost points and build a cost-effective solution. In this post, we outline a solution that does the following:

  • Uses conditional routing of Amazon OpenSearch Ingestion to separate logs with specific attributes and store those, for example, in Amazon OpenSearch Service and archive all events in Amazon S3 to query with Amazon Athena
  • Uses in-stream anomaly detection with OpenSearch Ingestion, thereby removing the cost associated with compute needed for anomaly detection after ingestion

In this post, we use a VPC flow logs use case to demonstrate the solution. The solution and pattern presented in this post is equally applicable to larger operational analytics and observability use cases.

Solution overview

We use VPC flow logs to capture IP traffic and trigger processing notifications to the OpenSearch Ingestion pipeline. The pipeline filters the data, routes the data, and detects anomalies. The raw data will be stored in Amazon S3 for archival purposes, then the pipeline will detect anomalies in the data in near-real time using the Random Cut Forest (RCF) algorithm and send those data records to OpenSearch Service. The raw data stored in Amazon S3 can be inexpensively retained for an extended period of time using tiered storage and queried using the Athena query engine, and also visualized using Amazon QuickSight or other data visualization services. Although this walkthrough uses VPC flow log data, the same pattern applies for use with AWS CloudTrail, Amazon CloudWatch, any log files as well as any OpenTelemetry events, and custom producers.

The following is a diagram of the solution that we configure in this post.

In the following sections, we provide a walkthrough for configuring this solution.

The patterns and procedures presented in this post have been validated with the current version of OpenSearch Ingestion and the Data Prepper open-source project version 2.4.

Prerequisites

Complete the following prerequisite steps:

  1. We use a VPC to generate demonstration data. Set up the VPC flow logs to publish logs to an S3 bucket in text format. To optimize S3 storage costs, create a lifecycle configuration on the S3 bucket to transition the VPC flow logs to different tiers or expire processed logs. Make a note of the S3 bucket name you configured to use in later steps. (For one way to script these steps, see the sketch after this list.)
  2. Set up an OpenSearch Service domain. Make a note of the domain URL. The domain can be either public or VPC based, which is the preferred configuration.
  3. Create an S3 bucket for storing archived events, and make a note of the bucket name. Configure a resource-based policy allowing OpenSearch Ingestion to archive logs and Athena to read the logs.
  4. Configure an AWS Identity and Access Management (IAM) role or separate IAM roles allowing OpenSearch Ingestion to interact with Amazon SQS and Amazon S3. For instructions, refer to Configure the pipeline role.
  5. Configure Athena or validate that Athena is configured on your account. For instructions, refer to Getting started.
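The following is a minimal boto3 sketch of the flow log and lifecycle setup described in step 1; the VPC ID, bucket names, and 30-day expiration are placeholders to adapt to your environment.

import boto3

# Placeholders: your VPC and the bucket that receives the raw flow logs.
VPC_ID = "vpc-0123456789abcdef0"
LOG_BUCKET = "my-vpc-flow-logs-bucket"
LOG_BUCKET_ARN = f"arn:aws:s3:::{LOG_BUCKET}"

ec2 = boto3.client("ec2")
s3 = boto3.client("s3")

# Publish VPC flow logs to Amazon S3 (plain-text format is the default).
ec2.create_flow_logs(
    ResourceIds=[VPC_ID],
    ResourceType="VPC",
    TrafficType="ALL",
    LogDestinationType="s3",
    LogDestination=LOG_BUCKET_ARN,
)

# Expire processed log objects after 30 days to keep storage costs down.
s3.put_bucket_lifecycle_configuration(
    Bucket=LOG_BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-raw-flow-logs",
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)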

Configure an SQS notification

VPC flow logs will write data to Amazon S3. After each file is written, Amazon S3 will send an SQS notification to notify the OpenSearch Ingestion pipeline that the file is ready for processing.

If the data is already stored in Amazon S3, you can use the S3 scan capability for a one-time or scheduled loading of data through the OpenSearch Ingestion pipeline.

Use AWS CloudShell to issue the following commands to create the SQS queues VpcFlowLogsNotification and VpcFlowLogsNotifications-DLQ that we use for this walkthrough.

Create a dead-letter queue with the following code:

export SQS_DLQ_URL=$(aws sqs create-queue --queue-name VpcFlowLogsNotifications-DLQ | jq -r '.QueueUrl')

echo $SQS_DLQ_URL 

export SQS_DLQ_ARN=$(aws sqs get-queue-attributes --queue-url $SQS_DLQ_URL --attribute-names QueueArn | jq -r '.Attributes.QueueArn') 

echo $SQS_DLQ_ARN

Create an SQS queue to receive events from Amazon S3 with the following code:

export SQS_URL=$(aws sqs create-queue --queue-name VpcFlowLogsNotification --attributes '{
"RedrivePolicy": 
"{\"deadLetterTargetArn\":\"'$SQS_DLQ_ARN'\",\"maxReceiveCount\":\"2\"}", 
"Policy": 
  "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"s3.amazonaws.com\"}, \"Action\":\"SQS:SendMessage\",\"Resource\":\"*\"}]}" 
}' | jq -r '.QueueUrl')

echo $SQS_URL

# Retrieve the queue ARN, which the S3 notification configuration requires
export SQS_ARN=$(aws sqs get-queue-attributes --queue-url $SQS_URL --attribute-names QueueArn | jq -r '.Attributes.QueueArn')

echo $SQS_ARN

To configure the S3 bucket to send events to the SQS queue, use the following code (provide the name of your S3 bucket used for storing VPC flow logs):

aws s3api put-bucket-notification-configuration --bucket __BUCKET_NAME__ --notification-configuration '{
     "QueueConfigurations": [
         {
             "QueueArn": "'$SQS_URL'",
             "Events": [
                 "s3:ObjectCreated:*"
             ]
         }
     ]
}'

Create the OpenSearch Ingestion pipeline

Now that you have configured Amazon SQS and the S3 bucket notifications, you can configure the OpenSearch Ingestion pipeline.

  1. On the OpenSearch Service console, choose Pipelines under Ingestion in the navigation pane.
  2. Choose Create pipeline.

  3. For Pipeline name, enter a name (for this post, we use stream-analytics-pipeline).
  4. For Pipeline configuration, enter the following code:
version: "2"
entry-pipeline:
  source:
     s3:
       notification_type: sqs
       compression: gzip
       codec:
         newline:
       sqs:
          queue_url: "__SQS_QUEUE_URL__"
         visibility_timeout: 180s
       aws:
         region: "__REGION__"
         sts_role_arn: "__STS_ROLE_ARN__"
  
  processor:
  sink:
    - pipeline:
        name: "archive-pipeline"
    - pipeline:
        name: "data-processing-pipeline"

data-processing-pipeline:
    source: 
        pipeline:
            name: "entry-pipeline"
    processor:
    - grok:
        tags_on_match_failure: [ "grok_match_failure" ]
        match:
          message: [ "%{VPC_FLOW_LOG}" ]
    route:
        - icmp_traffic: '/protocol == 1'
    sink:
        - pipeline:
            name : "icmp-pipeline"
            routes:
                - "icmp_traffic"
        - pipeline:
            name: "analytics-pipeline"
    

archive-pipeline:
  source:
    pipeline:
      name: entry-pipeline
  processor:
  sink:
    - s3:
        aws:
          region: "__REGION__"
          sts_role_arn: "__STS_ROLE_ARN__"
        max_retries: 16
        bucket: "__AWS_S3_BUCKET_ARCHIVE__"
        object_key:
          path_prefix: "vpc-flow-logs-archive/year=%{yyyy}/month=%{MM}/day=%{dd}/"
        threshold:
          maximum_size: 50mb
          event_collect_timeout: 300s
        codec:
          parquet:
            auto_schema: true
      
analytics-pipeline:
  source:
    pipeline:
      name: "data-processing-pipeline"
  processor:
    - drop_events:
        drop_when: "hasTags(\"grok_match_failure\") or \"/log-status\" == \"NODATA\""
    - date:
        from_time_received: true
        destination: "@timestamp"
    - aggregate:
        identification_keys: ["srcaddr", "dstaddr"]
        action:
          tail_sampler:
            percent: 20.0
            wait_period: "60s"
            condition: '/action != "ACCEPT"'
    - anomaly_detector:
        identification_keys: ["srcaddr","dstaddr"]
        keys: ["bytes"]
        verbose: true
        mode:
          random_cut_forest:
  sink:
    - opensearch:
        hosts: [ "__AMAZON_OPENSEARCH_DOMAIN_URL__" ]
        index: "flow-logs-anomalies"
        aws:
          sts_role_arn: "__STS_ROLE_ARN__"
          region: "__REGION__"
          
icmp-pipeline:
  source:
    pipeline:
      name: "data-processing-pipeline"
  processor:
  sink:
    - opensearch:
        hosts: [ "<strong>__AMAZON_OPENSEARCH_DOMAIN_URL__</strong>" ]
        index: "sensitive-icmp-traffic"
        aws:
          sts_role_arn: "<strong>__STS_ROLE_ARN__</strong>"
          region: "<strong>__REGION__</strong>"</code>

Replace the variables in the preceding code with resources in your account (a substitution helper sketch follows the list):

    • __SQS_QUEUE_URL__ – URL of Amazon SQS for Amazon S3 events
    • __STS_ROLE_ARN__ – AWS Security Token Service (AWS STS) role for the pipeline to assume when accessing resources
    • __AWS_S3_BUCKET_ARCHIVE__ – S3 bucket for archiving processed events
    • __AMAZON_OPENSEARCH_DOMAIN_URL__ – URL of OpenSearch Service domain
    • __REGION__ – Region (for example, us-east-1)
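
If you keep the pipeline configuration in a local file before pasting it into the console, a small shell helper can fill in the placeholders. This is a minimal sketch, assuming a local file named pipeline.yaml; the example values shown are placeholders for illustration only:

# Substitute the placeholders in a local copy of the configuration
# (pipeline.yaml is an assumed file name; replace the values with your own)
SQS_QUEUE_URL="https://sqs.us-east-1.amazonaws.com/111122223333/vpc-flow-logs-queue"
STS_ROLE_ARN="arn:aws:iam::111122223333:role/osis-pipeline-role"
AWS_S3_BUCKET_ARCHIVE="example-vpc-flow-logs-archive"
AMAZON_OPENSEARCH_DOMAIN_URL="https://vpc-example-domain.us-east-1.es.amazonaws.com"
REGION="us-east-1"

sed -e "s|__SQS_QUEUE_URL__|$SQS_QUEUE_URL|g" \
    -e "s|__STS_ROLE_ARN__|$STS_ROLE_ARN|g" \
    -e "s|__AWS_S3_BUCKET_ARCHIVE__|$AWS_S3_BUCKET_ARCHIVE|g" \
    -e "s|__AMAZON_OPENSEARCH_DOMAIN_URL__|$AMAZON_OPENSEARCH_DOMAIN_URL|g" \
    -e "s|__REGION__|$REGION|g" pipeline.yaml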
  5. In the Network settings section, specify your network access. For this walkthrough, we use VPC access, providing the VPC, private subnets, and security groups that have connectivity with the OpenSearch Service domain.
  6. Leave the other settings at their default values and choose Next.
  7. Review the configuration and choose Create pipeline.

It takes a few minutes for OpenSearch Service to provision the environment. While the environment is being provisioned, we’ll walk you through the pipeline configuration. The entry-pipeline sub-pipeline listens for SQS notifications about newly arrived files and triggers reading of the compressed VPC flow log files:

…
entry-pipeline:
  source:
     s3:
…

The pipeline branches into two sub-pipelines. The first stores the original messages in Amazon S3 in read-optimized Parquet format for archival purposes; the second applies analytics and routes events to the OpenSearch Service domain for fast querying and alerting:

…
  sink:
    - pipeline:
        name: "archive-pipeline"
    - pipeline:
        name: "data-processing-pipeline"
… 

The archive-pipeline aggregates messages into 50 MB chunks, flushing every 300 seconds if the size threshold isn’t reached, and writes a Parquet file to Amazon S3 with the schema inferred from the messages. A date-based prefix is also added to help with partitioning and query optimization when reading a collection of files using Athena.

…
sink:
    - s3:
…
        object_key:
          path_prefix: " vpc-flow-logs-archive/year=%{yyyy}/month=%{MM}/day=%{dd}/"
        threshold:
          maximum_size: 50mb
          event_collect_timeout: 300s
        codec:
          parquet:
            auto_schema: true
…
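
Once the size or time threshold is reached, the pipeline flushes Parquet objects into the archive bucket. You can confirm the date-based layout with a quick listing; the bucket placeholder is the same one used in the configuration:

# List archived Parquet objects; the year/month/day prefixes come from path_prefix
aws s3 ls s3://__AWS_S3_BUCKET_ARCHIVE__/vpc-flow-logs-archive/ --recursive | head -20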

Now that we have reviewed the basics, we focus on the pipeline that detects anomalies and sends only high-value messages that deviate from the norm to OpenSearch Service. It also stores Internet Control Message Protocol (ICMP) messages in OpenSearch Service.

We applied a grok processor to parse each message using a predefined pattern for VPC flow logs, and tagged all unparsable messages with the grok_match_failure tag, which we later use to drop headers and other records that can’t be parsed:

…
    processor:
    - grok:
        tags_on_match_failure: [ "grok_match_failure" ]
        match:
          message: [ "%{VPC_FLOW_LOG}" ]
…

We then routed all messages with protocol identifier 1 (ICMP) to icmp-pipeline, while all messages, including ICMP, continue to analytics-pipeline for anomaly detection:

…
   route:
        - icmp_traffic: '/protocol == 1'
    sink:
        - pipeline:
            name : "icmp-pipeline"
            routes:
                - "icmp_traffic"
        - pipeline:
            name: "analytics-pipeline"
…

In the analytics pipeline, we dropped all records that couldn’t be parsed, using the hasTags method with the tag we assigned at parse time. We also removed records whose log-status is NODATA, because they don’t contain useful data for anomaly detection.

…
  - drop_events:
        drop_when: "hasTags(\"grok_match_failure\") or \"/log-status\" == \"NODATA\""		
…

We then applied probabilistic sampling using the tail_sampler processor on messages grouped by source and destination address: traffic that wasn’t accepted is forwarded to the sink, while only a 20% sample of accepted traffic is kept. This reduces the volume of messages within the selected cardinality keys, with a focus on messages that weren’t accepted, while keeping a representative sample of those that were.

…
aggregate:
        identification_keys: ["srcaddr", "dstaddr"]
        action:
          tail_sampler:
            percent: 20.0
            wait_period: "60s"
            condition: '/action != "ACCEPT"'
…

Then we used the anomaly detector processor to identify anomalies within the cardinality key pairs (source and destination addresses in our example). The anomaly detector processor creates and trains Random Cut Forest (RCF) models for a hashed value of the keys, then uses those models to determine whether newly arriving messages are anomalous relative to the trained data. In our demonstration, we use the bytes field to detect anomalies:

…
anomaly_detector:
        identification_keys: ["srcaddr","dstaddr"]
        keys: ["bytes"]
        verbose: true
        mode:
          random_cut_forest:
…

We set verbose: true to instruct the detector to emit the message every time an anomaly is detected. For this walkthrough, we kept the default sample_size for training the model, because the configuration doesn’t override it.

When an anomaly is detected, the anomaly detector returns the complete record and adds the "deviation_from_expected" and "grade" attributes, which signify the deviation value and the severity of the anomaly. These values can be used to route such messages to OpenSearch Service and, with the per-document monitoring capabilities in OpenSearch Service, to alert on specific conditions.

Currently, OpenSearch Ingestion creates up to 5,000 distinct models per compute unit, based on cardinality key values. This limit can be observed using the anomaly_detector.RCFInstances.value CloudWatch metric. It’s important to choose cardinality keys whose number of distinct value pairs stays within this constraint. As development of the Data Prepper open-source project and OpenSearch Ingestion continues, more configuration options will be added to offer greater flexibility around model training and memory management.

The OpenSearch Ingestion pipeline exposes the anomaly_detector.cardinalityOverflow.count metric through CloudWatch. This metric indicates the number of key-value pairs that weren’t processed by the anomaly detector during a period of time because the maximum number of RCF instances per compute unit had been reached. To avoid this constraint, you can scale out the number of compute units to provide capacity for additional RCF instances.
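
You can also check both metrics from the command line. The following is a minimal sketch; the AWS/OSIS namespace is an assumption on our part, so verify the exact namespace and metric names on the pipeline’s monitoring tab or in the CloudWatch console:

# List anomaly detector metrics emitted by the ingestion pipeline
# (the AWS/OSIS namespace is an assumption; confirm it in the CloudWatch console)
aws cloudwatch list-metrics --namespace "AWS/OSIS" \
  --query 'Metrics[?contains(MetricName, `anomaly_detector`)].MetricName' --output text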

In the last sink, the pipeline writes records with detected anomalies along with deviation_from_expected and grade attributes to the OpenSearch Service domain:

…
sink:
    - opensearch:
        hosts: [ "__AMAZON_OPENSEARCH_DOMAIN_URL__" ]
        index: "anomalies"
…
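
To spot-check what lands in this index once data is flowing, you can run a search that filters on the anomaly grade. This is a minimal sketch; the grade threshold of 0.7 is an arbitrary example value, and the authentication method depends on your domain’s access policy (basic auth is shown only as a placeholder):

# Query recent high-severity anomalies (0.7 is an example grade threshold)
curl -s -XGET "__AMAZON_OPENSEARCH_DOMAIN_URL__/flow-logs-anomalies/_search" \
  -u '__USERNAME__:__PASSWORD__' \
  -H 'Content-Type: application/json' \
  -d '{
        "size": 10,
        "query": { "range": { "grade": { "gte": 0.7 } } },
        "sort": [ { "@timestamp": { "order": "desc" } } ]
      }'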

Because only anomaly records are being routed and written to the OpenSearch Service domain, we are able to significantly reduce the size of our domain and optimize the cost of our sample observability infrastructure.

Another sink was used for storing all ICMP records in a separate index in the OpenSearch Service domain:

…
sink:
    - opensearch:
        hosts: [ "__AMAZON_OPENSEARCH_DOMAIN_URL__" ]
        index: " sensitive-icmp-traffic"
…

Query archived data from Amazon S3 using Athena

In this section, we review the configuration of Athena for querying archived events data stored in Amazon S3. Complete the following steps:

  1. Navigate to the Athena query editor and create a new database called vpc-flow-logs-archive using the following command:
CREATE DATABASE `vpc-flow-logs-archive`
  2. On the Database menu, choose vpc-flow-logs-archive.
  3. In the query editor, enter the following command to create a table (provide the S3 bucket used for archiving processed events). For simplicity, for this walkthrough, we create a table without partitions.
CREATE EXTERNAL TABLE `vpc-flow-logs-data`(
  `message` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://__AWS_S3_BUCKET_ARCHIVE__'
TBLPROPERTIES (
  'classification'='parquet', 
  'compressionType'='none'
)
  4. Run the following query to validate that you can query the archived VPC flow log data:
SELECT * FROM "vpc-flow-logs-archive"."vpc-flow-logs-data" LIMIT 10;

Because archived data is stored in its original format, issues related to format conversion are avoided; Athena queries and displays records in their original form. However, it’s often preferable to work with only a subset of columns or parts of the message. You can use the regexp_split function in Athena to split the message into columns and retrieve specific ones. Run the following query to see the source and destination address groupings from the VPC flow log data:

SELECT srcaddr, dstaddr FROM (
   SELECT regexp_split(message, ' ')[4] AS srcaddr, 
          regexp_split(message, ' ')[5] AS dstaddr, 
          regexp_split(message, ' ')[14] AS status  FROM "vpc-flow-logs-archive"."vpc-flow-logs-data" 
) WHERE status = 'OK' 
GROUP BY srcaddr, dstaddr 
ORDER BY srcaddr, dstaddr LIMIT 10;

This demonstrates that you can query all events using Athena, with the archived data in its original raw format used for the analysis. Athena pricing is based on the amount of data scanned, so storing the data in a read-optimized format and organizing it under date-based prefixes enables further cost optimization for on-demand querying of archived streaming and observability data.
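
If you prefer to run these queries outside the Athena console, the following minimal sketch uses the Athena CLI; the __ATHENA_RESULTS_BUCKET__ placeholder is ours and must point to an S3 location you own for query results:

# Start the query and capture its execution ID
QUERY_ID=$(aws athena start-query-execution \
  --query-string 'SELECT * FROM "vpc-flow-logs-archive"."vpc-flow-logs-data" LIMIT 10' \
  --query-execution-context Database=vpc-flow-logs-archive \
  --result-configuration OutputLocation=s3://__ATHENA_RESULTS_BUCKET__/ \
  --query 'QueryExecutionId' --output text)

# Wait for the query to reach the SUCCEEDED state, then fetch the results
aws athena get-query-execution --query-execution-id "$QUERY_ID" \
  --query 'QueryExecution.Status.State' --output text
aws athena get-query-results --query-execution-id "$QUERY_ID"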

Clean up

To avoid incurring future charges, delete the following resources created as part of this post (a CLI sketch follows the list):

  • OpenSearch Service domain
  • OpenSearch Ingestion pipeline
  • SQS queues
  • VPC flow logs configuration
  • All data stored in Amazon S3
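
A minimal CLI sketch for the cleanup, assuming the resource names used earlier in this post; the __DOMAIN_NAME__ and __FLOW_LOG_ID__ placeholders are ours and must be replaced with your own values:

# Delete the OpenSearch Ingestion pipeline and the OpenSearch Service domain
aws osis delete-pipeline --pipeline-name stream-analytics-pipeline
aws opensearch delete-domain --domain-name __DOMAIN_NAME__

# Delete the SQS queue used for S3 event notifications
aws sqs delete-queue --queue-url "$SQS_URL"

# Stop VPC flow log delivery (the flow log ID is account specific)
aws ec2 delete-flow-logs --flow-log-ids __FLOW_LOG_ID__

# Empty and remove the S3 buckets used for raw and archived data
aws s3 rb s3://__BUCKET_NAME__ --force
aws s3 rb s3://__AWS_S3_BUCKET_ARCHIVE__ --force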

Conclusion

In this post, we demonstrated how to use OpenSearch Ingestion pipelines to build a cost-optimized infrastructure for log analytics and observability events. We used routing, filtering, aggregation, and anomaly detection in an OpenSearch Ingestion pipeline, enabling you to downsize your OpenSearch Service domain and create a cost-optimized observability infrastructure. For our example, we used a data sample of 1.5 million events, which the pipeline distilled down to about 1,300 events with predicted anomalies based on source and destination IP pairs. In other words, the pipeline identified less than 0.1% of events as high importance and routed only those to OpenSearch Service for visualization and alerting. This translates to lower resource utilization in OpenSearch Service domains and can lead to provisioning of smaller OpenSearch Service environments.

We encourage you to use OpenSearch Ingestion pipelines to create your purpose-built and cost-optimized observability infrastructure that uses OpenSearch Service for storing and alerting on high-value events. If you have comments or feedback, please leave them in the comments section.


About the Authors

Mikhail Vaynshteyn is a Solutions Architect with Amazon Web Services. Mikhail works with healthcare and life sciences customers to build solutions that help improve patients’ outcomes. Mikhail specializes in data analytics services.

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.