Tag Archives: Technical How-to

How to prioritize IAM Access Analyzer findings

Post Syndicated from Swara Gandhi original https://aws.amazon.com/blogs/security/how-to-prioritize-iam-access-analyzer-findings/

AWS Identity and Access Management (IAM) Access Analyzer is an important tool in your journey towards least privilege access. You can use IAM Access Analyzer access previews to preview and validate public and cross-account access before deploying permissions changes in your environment.

For the permissions already in place, one of IAM Access Analyzer’s capabilities is that it helps you identify resources in your AWS Organizations organization and AWS accounts that are shared with an external entity.

For each external entity that has access to a resource in your account, IAM Access Analyzer generates a finding. Findings display information about the resource and the policy statement that generated the finding, with details such as the list of actions in the policy granting access, level of access, and conditions that allow the access. You can review the findings to determine if the access is intended or unintended.

As your use of AWS services grows and the number of accounts in your organization increases, the number of findings that you have might also increase. To help reduce noise and allow you to focus on unintended access findings, you can filter findings and create archive rules for intended access.

This blog post provides step-by-step guidance on how to get started with IAM Access Analyzer findings by using different filtering techniques that can help you filter approved use cases that result in access findings. For example, you might see a finding generated for an S3 bucket that hosts images for your website and thus allows public access, as approved by your organization, apply a filter so that you can concentrate on unintended access. IAM Access Analyzer offers a wide range of filters; for a complete list, see the IAM documentation.

In this post, we also share example archive rules for approved use cases that result in access findings. Archive rules automatically archive new findings that meet the criteria you define when you create the rule. You can also apply archive rules retroactively to archive existing findings that meet the archive rule criteria. Finally, we have included an example implementation of archive rules using an AWS CloudFormation template.

IAM Access Analyzer findings overview

To get started, create an analyzer for your entire organization or your account. The organization or account that you choose is known as the zone of trust for the analyzer. The zone of trust determines the type of access that IAM Access Analyzer considers to be trusted. IAM Access Analyzer continuously monitors to identify resource policies, access control lists, and other access controls that grant public or cross-account access from outside the zone of trust, and generates findings. For this blog post, we’ll demonstrate an organization as the zone of trust, showcasing findings from a large-scale, multi-account AWS deployment.

Prerequisites

This blog post assumes that you have the following in place:

  • IAM Access Analyzer is enabled in your organization or account in the AWS Regions where you operate. For more details on how to enable IAM Access Analyzer, see Enabling IAM Access Analyzer.
  • Access to the AWS Organizations management account or to a member account in the organization with delegated administrator access for creating and updating IAM Access Analyzer resources.

How to filter the findings

To start filtering your findings and create archive rules, you should complete the following steps:

  1. Review public access findings
  2. Filter by removing permissions errors
  3. Filter for known identity providers
  4. Filter cross-account access from trusted external accounts

We’ll walk you through each step.

1. Review public access findings

Some AWS resources allow public access on the resource by means of a resource-based policy—for example, an Amazon Simple Storage Service (Amazon S3) bucket policy that has the “Principal:*” permission added to its bucket policy. For resources such as Amazon Elastic Block Store (Amazon EBS) snapshots, you can share these by using a flag on the resource permission. IAM Access Analyzer looks for such sharing and reports it in the findings.

From the global report, you can generate a list of resources that allow public access by using the Public access: true query in the IAM console.

The following is an example of an AWS Command Line Interface (AWS CLI) command with public access as “true”. Replace <AccessAnalyzerARN> with the Amazon Resource Name (ARN) of your analyzer.

aws accessanalyzer list-findings --analyzer-arn <AccessAnalyzerARN> --filter isPublic={"eq"="true"}

Is the public access intended?

If the access is intended, you can archive the findings by creating an archive rule using the AWS Management Console, AWS CLI, or API. When you archive a security finding, IAM Access Analyzer removes it from the Active findings list and changes its status to Archived. For instructions on how to automatically archive expected findings, see How to automatically archive expected IAM Access Analyzer findings.

Example: Known S3 bucket that hosts public website images

If you have resources for which public access is expected, such as an S3 bucket that hosts images for your website, you can add an archive rule with Resource criteria equal to the bucket name, as shown in Figure 1.

Figure 1: Create IAM Access Analyzer archive rule using the console

Figure 1: Create IAM Access Analyzer archive rule using the console

Is the public access unintended?

If the finding results from policies that were misconfigured to allow unintended public access, you can constrain the access by using AWS global condition context keys or a specific IAM principal ARN. The findings show the account and resource that contain the policy.

For example, if the finding shows a misconfigured S3 bucket, the following policy shows how you can modify the S3 bucket policy to only allow IAM principals from your organization to access the bucket by using the PrincipalOrgID condition key. Replace <DOC-EXAMPLE-BUCKET> with the name of your S3 bucket, and <ORGANIZATION_ID> with your organization ID.

{
   "Version":"2008-10-17",
   "Id":"Policy1335892530063",
   "Statement":[
      {
         "Sid":"AllowS3Access",
         "Effect":"Allow",
         "Principal":"*",
         "Action":"s3:*",
         "Resource":[
            "arn:aws:s3:::<DOC-EXAMPLE-BUCKET>",
            "arn:aws:s3:::<DOC-EXAMPLE-BUCKET>/*"
         ],
         "Condition":{
            "StringEquals":{
               "aws:PrincipalOrgID":"<ORGANIZATION_ID>"
            }
         }
      }
   ]
}

2. Filter by removing permissions errors

Before you further investigate the IAM Access Analyzer findings, you should make sure that IAM Access Analyzer has enough permissions to access the resources in your accounts to be able to provide the analysis.

IAM Access Analyzer uses an AWS service-linked role to call other AWS services on your behalf. When IAM Access Analyzer analyzes a resource, it reads resource metadata, such as a resource-based policy, access control lists, and other access controls that grant public or cross-account access. If the policies don’t allow an IAM Access Analyzer role to read the resource metadata, it generates an Access Denied error finding, as shown in Figure 2.

Figure 2: IAM Access Analyzer access denied error example

Figure 2: IAM Access Analyzer access denied error example

To view these error findings from the IAM Access Analyzer console, filter the findings by using the Error: Access Denied property.

Resolution

To resolve the access issue, make sure that the IAM Access Analyzer service-linked role is not denied access. Review the resource-based policy attached to the resource that IAM Access Analyzer isn’t able to access. For a list of services that support resource-based policies, see the IAM documentation.

For example, if the analyzer can’t access an AWS Key Management Service (AWS KMS) key because of an explicit deny, add an exception for the IAM Access Analyzer service-linked role to the policy statement, similar to the following. Make sure that you change the <ACCOUNT_ID> to your account id.

Before After
{
   "Sid":"Deny unintended access to KMS key",
   "Effect":"Deny",
   "Principal":"*",
   "Action":[
      "kms:DescribeKey",
      "kms:GetKeyPolicy",
      "kms:List*"
   ],
   "Resource":"*",
   "Condition":{
      "ArnNotLikeIfExists":{
         "aws:PrincipalArn":[
            "arn:aws:iam::*:role/<YOUR-ADMIN-ROLE>"
         ]
      }
   }
}

{
   "Sid":"Deny unintended access to KMS key",
   "Effect":"Deny",
   "Principal":"*",
   "Action":[
      "kms:DescribeKey",
      "kms:GetKeyPolicy",
      "kms:List*"
   ],
   "Resource":"*",
   "Condition":{
      "ArnNotLikeIfExists":{
         "aws:PrincipalArn":[
            "arn:aws:iam::<ACCOUNT_ID>:role/aws-service-role/access-analyzer.amazonaws.com/AWSServiceRoleForAccessAnalyzer",
"arn:aws:iam::*:role/<YOUR-ADMIN-ROLE>"
         ]
      }
   }
}

3. Filter for known identity providers

With SAML 2.0 or Open ID Connect (OIDC)—which are open federation standards that many identity providers (IdPs) use—users can log in to the console or call the AWS API operations without you having to create an IAM user for everyone in your organization.

To set up federation, you must perform a one-time configuration so that your organization’s IdP and your account trust each other. To configure this trust, you must register AWS as a service provider (SP) with the IdP of your organization and set up metadata and key exchange.

The role or roles that you create in IAM define what the federated users from your organization are allowed to use on AWS. When you create the trust policy for the role, you specify the SAML or OIDC provider as the Principal. To only allow users that match certain attributes to access the role, you can scope the trust policy with a Condition.

Example 1: Federation with Okta

Let’s walk through an example that uses Okta as the IdP. Although access to a trusted IdP is intended, IAM Access Analyzer creates a finding for an IAM role that has trust policy granting access to a SAML provider because the trust policy allows access outside of the known zone of trust for the analyzer. You will see findings created for the IAM role granting access to Okta using the IAM trust policy, as shown in Figure 3.

Figure 3: IAM Access Analyzer identity provider finding example

Figure 3: IAM Access Analyzer identity provider finding example

Resolution 

Setting access through SAML providers is a privileged operation, so we recommend that you analyze each finding to decide if an exception is acceptable. If you approve of the SAML-provided access setup, you can implement an archive rule to archive such findings with conditions for federation used in combination with your SAML provider. The filter for the Federated User rule depends on the name that you gave to the SAML IdP in your federation setup. For example, if your SAML IdP name is Okta, the rule should have a filter for arn:aws:iam::<ACCOUNT_ID>:saml-provider/Okta, where <ACCOUNT_ID> is your account number, as shown in Figure 4.

Figure 4: Archive rule example for using an IdP-related finding

Figure 4: Archive rule example for using an IdP-related finding

Note: To include additional values for a multi-account setup, use the Add another value filter.

Example 2: IAM Identity Center

With AWS IAM Identity Center (successor to AWS Single Sign-On), you can manage sign-in security for your workforce. IAM Identity Center provides a central place to define your permission sets, assign them to your users and groups, and give your users a portal where they can access their assigned accounts.

With IAM Identity Center, you manage access to accounts by creating and assigning permission sets. These are IAM role templates that define (among other things) which policies to include in a role. When you create a permission set in IAM Identity Center and associate it to an account, IAM Identity Center creates a role in that account with a trust policy that allows a federated IdP as a principal — in this case, IAM Identity Center.

IAM Access Analyzer generates a finding for this setup because the allowed access is outside of the known zone of trust for the analyzer, as shown in Figure 5.

Figure 5: IAM Access Analyzer finding example for IAM Identity Center

Figure 5: IAM Access Analyzer finding example for IAM Identity Center

To filter this finding, you need to implement an archive rule.

Resolution

You can implement an archive rule with conditions for federation used in combination with IAM Identity Center as the SAML provider. The roles created by IAM Identity Center in member accounts use a reserved path on AWS: arn:aws:iam::<ACCOUNT_ID>:role/aws-reserved/sso.amazonaws.com/. Hence, you can create an archive rule with a filter that contains :saml-provider/AWSSSO in the Federated User name and aws-reserved/sso.amazonaws.com/ in the Resource, as shown in Figure 6.

Figure 6: Archive rule example for IAM Identity Center generated findings

Figure 6: Archive rule example for IAM Identity Center generated findings

4. Filter cross-account access findings from trusted external accounts

We recommend that you identify and document accounts and principals that should be allowed access outside of the zone of trust for IAM Access Analyzer.

When a resource-based policy attached to a resource allows cross-account access from outside the zone of trust, IAM Access Analyzer generates cross-account access findings.

Is the cross-account access intended?

When you review cross-account access findings, you need to determine whether the access is intended or not. For example, you might have access provided to your auditor’s account or a partner account for visibility and monitoring of your AWS applications.

For trusted external accounts, you can create an archive rule that includes the AWS account in the criteria for the rule. Figure 7 shows an example of how to create the archive rule for a trusted external account (EXTERNAL_ACCOUNT_ID). In your own rule, replace EXTERNAL_ACCOUNT_ID with the trusted account id.

Figure 7: Archive rule example for trusted account findings

Figure 7: Archive rule example for trusted account findings

Is the cross-account access unintended?

After you have archived the intended access findings, you can start analyzing the findings initiated from unintended access. When you confirm that the findings show unintended access, you should take steps to remove the access by altering or deleting the policy or access control that granted access. You can expand the solution outlined in the blog post Automate resolution for IAM Access Analyzer cross-account access findings on IAM roles by adding an explicit deny statement.

You can also use AWS CloudTrail to track API calls that could have changed access configuration on your AWS resources.

Deploy IAM Access Analyzer and archive rules with a CloudFormation template

In this section, we demonstrate a sample CloudFormation template that creates an IAM access analyzer and archive rules for findings that are created for identified intended access to resources.

Important: When you create an archive rule using the AWS console, the existing findings and new findings that match criteria mentioned in the rules will be archived. However, archive rules created through CloudFormation or the AWS CLI will only archive the new findings that meet the criteria defined. You need to perform the access-analyzer:ApplyArchiveRule API after you create the archive rule to archive existing findings as well.

The sample CloudFormation template takes the following values as inputs and creates archive rules for findings that are created for identified intended access to resources shared outside of your zone of trust for the specified analyzer:

  • Analyzer name
  • Zone of trust
  • Known public S3 buckets, if you have any (for example, a bucket that hosts public website images).

    Note: We use S3 buckets as an example. You can edit the rule to include resource types that are supported by IAM Access Analyzer, if public access is intended.

  • Trusted accounts — AWS accounts that don’t belong to your organization, but you trust them to have access to resources in your organization
  • SAML provider — The SAML provider approved to have access to your resources

    Note: If you don’t use federation, you can remove the rule SAMLFederatedUsers.

AWSTemplateFormatVersion: 2010-09-09
Description: >+
  Sample CloudFormation template creates archive rules for findings
  created for resources shared outside of your zone of trust for specified
  analyzer. 
   
Metadata:
  AWS::CloudFormation::Interface:
    ParameterGroups:
      - Label:
          default: Define Configuration
        Parameters:
          - AccessAnalyzerName
          - ZoneOfTrust
          - KnownPublicS3Buckets
          - TrustedAccounts
          - SAMLProvider
Parameters:
  AccessAnalyzerName:
    Description: Provide name of the analyzer you would like to create archive rules for.
    Type: String
  ZoneOfTrust:
    Description: Select the zone of trust of AccessAnalyzer
    AllowedValues:
      - ACCOUNT
      - ORGANIZATION
    Type: String
  KnownPublicS3Buckets:
    Description: List of comma-separated known S3 bucket arns, that should allow
      public access Example -
      arn:aws:s3:::DOC-EXAMPLE-BUCKET,arn:aws:s3:::DOC-EXAMPLE-BUCKET2
    Type: CommaDelimitedList
  TrustedAccounts:
    Description: List of comma-separated account IDs, that do not belong to your
      organization but you trust them to have access to resources in your
      organization. [Example - Your auditor’s AWS account]
    Type: List<Number>
  TrustedFederationPrincipals:
    Description: List of comma-separated trusted federated principals that are able
      to assume roles in your accounts. [Example -
      arn:aws:iam::012345678901:saml-provider/Okta,
      arn:aws:iam::1111222233334444:saml-provider/Okta]
    Type: CommaDelimitedList
Resources:
  AccessAnalyzer:
    Type: AWS::AccessAnalyzer::Analyzer
    Properties:
      AnalyzerName: ${AccessAnalyzerName}-${AWS::Region}
      Type: ZoneOfTrust
      ArchiveRules:
        - RuleName: ArchivePublicS3BucketsAccess
          Filter:
            - Property: resource
              Eq: KnownPublicS3Buckets
        - RuleName: AccountAccessNecessaryForBusinessProcesses
          Filter:
            - Property: principal.AWS
              Eq: TrustedAccounts
            - Property: isPublic
              Eq:
                - "false"
        - RuleName: SAMLFederatedUsers
          Filter:
            - Property: principal.Federated
              Eq: TrustedFederationPrincipals

To download this sample template, download the file IAMAccessAnalyzer.yaml from Amazon S3.

Conclusion

In this blog post, you learned how to start with IAM Access Analyzer findings, filter them based on the level of access given outside of your zone of trust, and create archive rules for intended access findings. By using different filtering techniques to remediate intended access findings, you can concentrate on unintended access.

To take this solution further, we recommend that you consider automating the resolution of unintended cross-account IAM roles found by IAM Access Analyzer by adding a deny statement to the IAM role’s trust policy. You can also include capabilities like an approval workflow to resolve the finding to suit your organization’s process requirements.

Lastly, we suggest that you use IAM Access Analyzer access previews to preview and validate public and cross-account access before deploying permissions changes in your environment.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Swara Gandhi

Swara Gandhi

Swara is a solutions architect on the AWS Identity Solutions team. She works on building secure and scalable end-to-end identity solutions. She is passionate about everything identity, security, and cloud.

Nitin Kulkarni

Nitin is a Solutions Architect on the AWS Identity Solutions team. He helps customers build secure and scalable solutions on the AWS platform. He also enjoys hiking, baseball and linguistics.

Monitoring Amazon DevOps Guru insights using Amazon Managed Grafana

Post Syndicated from MJ Kubba original https://aws.amazon.com/blogs/devops/monitoring-amazon-devops-guru-insights-using-amazon-managed-grafana/

As organizations operate day-to-day, having insights into their cloud infrastructure state can be crucial for the durability and availability of their systems. Industry research estimates[1] that downtime costs small businesses around $427 per minute of downtime, and medium to large businesses an average of $9,000 per minute of downtime. Amazon DevOps Guru customers want to monitor and generate alerts using a single dashboard. This allows them to reduce context switching between applications, providing them an opportunity to respond to operational issues faster.

DevOps Guru can integrate with Amazon Managed Grafana to create and display operational insights. Alerts can be created and communicated for any critical events captured by DevOps Guru and notifications can be sent to operation teams to respond to these events. The key telemetry data types of logs and metrics are parsed and filtered to provide the necessary insights into observability.

Furthermore, it provides plug-ins to popular open-source databases, third-party ISV monitoring tools, and other cloud services. With Amazon Managed Grafana, you can easily visualize information from multiple AWS services, AWS accounts, and Regions in a single Grafana dashboard.

In this post, we will walk you through integrating the insights generated from DevOps Guru with Amazon Managed Grafana.

Solution Overview:

This architecture diagram shows the flow of the logs and metrics that will be utilized by Amazon Managed Grafana. Insights originate from DevOps Guru, each insight generating an event. These events are captured by Amazon EventBridge, and then saved as logs to Amazon CloudWatch Log Group DevOps Guru service metrics, and then parsed by Amazon Managed Grafana to create new dashboards.

This architecture diagram shows the flow of the logs and metrics that will be utilized by Amazon Managed Grafana, starting with DevOps Guru and then using Amazon EventBridge to save the insight event logs to Amazon CloudWatch Log Group DevOps Guru service metrics to be parsed by Amazon Managed Grafana and create new dashboards in Grafana from these logs and Metrics.

Now we will walk you through how to do this and set up notifications to your operations team.

Prerequisites:

The following prerequisites are required for this walkthrough:

  • An AWS Account
  • Enabled DevOps Guru on your account with CloudFormation stack, or tagged resources monitored.

Using Amazon CloudWatch Metrics

 

DevOps Guru sends service metrics to CloudWatch Metrics. We will use these to      track metrics for insights and metrics for your DevOps Guru usage; the DevOps Guru service reports the metrics to the AWS/DevOps-Guru namespace in CloudWatch by default.

First, we will provision an Amazon Managed Grafana workspace and then create a Dashboard in the workspace that uses Amazon CloudWatch as a data source.

Setting up Amazon CloudWatch Metrics

  1. Create Grafana Workspace
    Navigate to Amazon Managed Grafana from AWS console, then click Create workspace

a. Select the Authentication mechanism

i. AWS IAM Identity Center (AWS SSO) or SAML v2 based Identity Providers

ii. Service Managed Permission or Customer Managed

iii. Choose Next

b. Under “Data sources and notification channels”, choose Amazon CloudWatch

c. Create the Service.

You can use this post for more information on how to create and configure the Grafana workspace with SAML based authentication.

Next, we will show you how to create a dashboard and parse the Logs and Metrics to display the DevOps Guru insights and recommendations.

2. Configure Amazon Managed Grafana

a. Add CloudWatch as a data source:
From the left bar navigation menu, hover over AWS and select Data sources.

b. From the Services dropdown select and configure CloudWatch.

3. Create a Dashboard

a. From the left navigation bar, click on add a new Panel.

b. You will see a demo panel.

c. In the demo panel – Click on Data source and select Amazon CloudWatch.

The Amazon Grafana Workspace dashboard with the Grafana data source dropdown menu open. The drop down has 'Amazon CloudWatch (region name)' highlighted, other options include 'Mixed, 'Dashboard', and 'Grafana'.

d. For this panel we will use CloudWatch metrics to display the number of insights.

e. From Namespace select the AWS/DevOps-Guru name space, Insights as Metric name and Average for Statistics.

In the Amazon Grafana Workspace dashboard the user has entered values in three fields. "Grafana Query with Namespace" has the chosen value: AWS/DevOps-Guru. "Metric name" has the chosen value: Insights. "Statistic" has the chosen value: Average.

click apply

Time series graph contains a single new data point, indicting a recent event.

f. This is our first panel. We can change the panel name from the right-side bar under Title. We will name this panel “Insights

g. From the top right menu, click save dashboard and give your new dashboard a name

Using Amazon CloudWatch Logs via Amazon EventBridge

For other insights outside of the service metrics, such as a number of insights per specific service or the average for a region or for a specific AWS account, we will need to parse the event logs. These logs first need to be sent to Amazon CloudWatch Logs. We will go over the details on how to set this up and how we can parse these logs in Amazon Managed Grafana using CloudWatch Logs Query Syntax. In this post, we will show a couple of examples. For more details, please check out this User Guide documentation. This is not done by default and we will need to use Amazon EventBridge to pass these logs to CloudWatch.

DevOps Guru logs include other details that can be helpful when building Dashboards, such as region, Insight Severity (High, Medium, or Low), associated resources, and DevOps guru dashboard URL, among other things.  For more information, please check out this User Guide documentation.

EventBridge offers a serverless event bus that helps you receive, filter, transform, route, and deliver events. It provides one to many messaging solutions to support decoupled architectures, and it is easy to integrate with AWS Services and 3rd-party tools. Using Amazon EventBridge with DevOps Guru provides a solution that is easy to extend to create a ticketing system through integrations with ServiceNow, Jira, and other tools. It also makes it easy to set up alert systems through integrations with PagerDuty, Slack, and more.

 

Setting up Amazon CloudWatch Logs

  1. Let’s dive in to creating the EventBridge rule and enhance our Grafana dashboard:

a. First head to Amazon EventBridge in the AWS console.

b. Click Create rule.

     Type in rule Name and Description. You can leave the Event bus to default and Rule type to Rule with an event pattern.

c. Select AWS events or EventBridge partner events.

    For event Pattern change to Customer patterns (JSON editor) and use:

{"source": ["aws.devops-guru"]}

This filters for all events generated from DevOps Guru. You can use the same mechanism to filter out specific messages such as new insights, or insights closed to a different channel. For this demonstration, let’s consider extracting all events.

As the user configures their EventBridge Rule, for the Creation method they have chosen "Custom pattern (JSON editor) write an event pattern in JSON." For the Event pattern editor just below they have entered {"source":["aws.devops-guru"]}

d. Next, for Target, select AWS service.

    Then use CloudWatch log Group.

    For the Log Group, give your group a name, such as “devops-guru”.

In the prompt for the new Target's configurations, the user has chosen AWS service as the Target type. For the Select a target drop down, they chose CloudWatch log Group. For the log group, they selected the /aws/events radio option, and then filled in the following input text box with the kebab case group name devops-guru.

e. Click Create rule.

f. Navigate back to Amazon Managed Grafana.
It’s time to add a couple more additional Panels to our dashboard.  Click Add panel.
    Then Select Amazon CloudWatch, and change from metrics to CloudWatch Logs and select the Log Group we created previously.

In the Grafana Workspace, the user has "Data source" selected as Amazon CloudWatch us-east-1. Underneath that they have chosen to use the default region and CloudWatch Logs. Below that, for the Log Groups they have entered /aws/events/DevOpsGuru

g. For the query use the following to get the number of closed insights:

fields @detail.messageType
| filter detail.messageType="CLOSED_INSIGHT"
| count(detail.messageType)

You’ll see the new dashboard get updated with “Data is missing a time field”.

New panel suggestion with switch to table or open visualization suggestions

You can either open the suggestions and select a gauge that makes sense;

New Suggestions display a dial graph, a bar graph, and a count numerical tracker

Or choose from multiple visualization options.

Now we have 2 panels:

Two panels are shown, one is the new dial graph, and the other is the time series graph that was created earlier.

h. You can repeat the same process. To create 3rd panel for the new insights using this query:

fields @detail.messageType 
| filter detail.messageType="NEW_INSIGHT" 
| count(detail.messageType)

Now we have 3 panels:

Grafana now shows three 3 panels. Two dial graphs, and the time series graph.

Next, depending on the visualizations, you can work with the Logs and metrics data types to parse and filter the data.

Setting up a 4th panel as table. Under the Query tab, in the query editor, the user has entered the text: fields detail.messageType, detail.insightSeverity, detail.insightUrlfilter | filter detail.messageType="CLOSED_INSIGHT" or detail.messageType="NEW_INSIGHT"

i. For our fourth panel, we will add DevOps Guru dashboard direct link to the AWS Console.

Repeat the same process as demonstrated previously one more time with this query:

fields detail.messageType, detail.insightSeverity, detail.insightUrlfilter 
| filter detail.messageType="CLOSED_INSIGHT" or detail.messageType="NEW_INSIGHT"                       

                        Switch to table when prompted on the panel.

Grafana now shows 4 panels. The new panel displays a data table that contains information about the most recent DevOps Guru insights. There are also the two dial graphs, and the time series graph from before.

This will give us a direct link to the DevOps Guru dashboard and help us get to the insight details and Recommendations.

Grafana now shows 4 panels. The new panel displays a data table that contains information about the most recent DevOps Guru insights. There are also the two dial graphs, and the time series graph from before.

Save your dashboard.

  1. You can extend observability by sending notifications through alerts on dashboards of panels providing metrics. The alerts will be triggered when a condition is met. The Alerts are communicated with Amazon SNS notification mechanism. This is our SNS notification channel setup.

Screenshot: notification settings show Name: DevopsGuruAlertsFromGrafana and Type: SNS

A previously created notification is used next to communicate any alerts when the condition is met across the metrics being observed.

Screenshot: notification setting with condition when count of query is above 5, a notification is sent to DevopsGuruAlertsFromGrafana with message, "More than 5 insights in the past 1 hour"

Cleanup

To avoid incurring future charges, delete the resources.

  • Navigate to EventBridge in AWS console and delete the rule created in step 4 (a-e) “devops-guru”.
  • Navigate to CloudWatch logs in AWS console and delete the log group created as results of step 4 (a-e) named “devops-guru”.
  • Amazon Managed Grafana: Navigate to Amazon Managed Grafana service and delete the Grafana services you created in step 1.

Conclusion

In this post, we have demonstrated how to successfully incorporate Amazon DevOps Guru insights into Amazon Managed Grafana and use Grafana as the observability tool. This will allow Operations team to successfully observe the state of their AWS resources and notify them through Alarms on any preset thresholds on DevOps Guru metrics and logs. You can expand on this to create other panels and dashboards specific to your needs. If you don’t have DevOps Guru, you can start monitoring your AWS applications with AWS DevOps Guru today using this link.

[1] https://www.atlassian.com/incident-management/kpis/cost-of-downtime

About the authors:

MJ Kubba

MJ Kubba is a Solutions Architect who enjoys working with public sector customers to build solutions that meet their business needs. MJ has over 15 years of experience designing and implementing software solutions. He has a keen passion for DevOps and cultural transformation.

David Ernst

David is a Sr. Specialist Solution Architect – DevOps, with 20+ years of experience in designing and implementing software solutions for various industries. David is an automation enthusiast and works with AWS customers to design, deploy, and manage their AWS workloads/architectures.

Sofia Kendall

Sofia Kendall is a Solutions Architect who helps small and medium businesses achieve their goals as they utilize the cloud. Sofia has a background in Software Engineering and enjoys working to make systems reliable, efficient, and scalable.

Automate discovery of data relationships using ML and Amazon Neptune graph technology

Post Syndicated from Moira Lennox original https://aws.amazon.com/blogs/big-data/automate-discovery-of-data-relationships-using-ml-and-amazon-neptune-graph-technology/

Data mesh is a new approach to data management. Companies across industries are using a data mesh to decentralize data management to improve data agility and get value from data. However, when a data producer shares data products on a data mesh self-serve web portal, it’s neither intuitive nor easy for a data consumer to know which data products they can join to create new insights. This is especially true in a large enterprise with thousands of data products.

This post shows how to use machine learning (ML) and Amazon Neptune to create automated recommendations to join data products and display those recommendations alongside the existing data products. This allows data consumers to easily identify new datasets and provides agility and innovation without spending hours doing analysis and research.

Background

The success of a data-driven organization recognizes data as a key enabler to increase and sustain innovation. It follows what is called a distributed system architecture. The goal of a data product is to solve the long-standing issue of data silos and data quality. Independent data products often only have value if you can connect them, join them, and correlate them to create a higher order data product that creates additional insights. A modern data architecture is critical in order to become a data-driven organization. It allows stakeholders to manage and work with data products across the organization, enhancing the pace and scale of innovation.

Solution overview

A data mesh architecture starts to solve for the decoupled architecture by decoupling the data infrastructure from the application infrastructure, which is a common challenge in traditional data architectures. It focuses on decentralized ownership, domain design, data products, and self-serve data infrastructure. This allows for a new way of thinking and new organizational elements—namely, a modern data community.

However, today’s data mesh platform contains largely independent data products. Even with well-documented data products, knowing how to connect or join data products is a time-consuming job. Data consumers spend hours, days, or months to understand and analyze the data. Identifying links or relationships between data products is critical to create value from the data mesh and enable a data-driven organization.

The solution in this post illustrates an approach to solving these challenges. It uses a fictional insurance company with several data products shared on their data mesh marketplace. The following figure shows the sample data products used in our solution.

Suppose a consumer is browsing the customer data product in the data mesh marketplace. The consumer wonders if the customer data could be linked to claim, policy, or encounter data. Because these data products come from different lines of business (LOBs) or silos, it’s hard to know. A consumer would have to review each data product and do the necessary analysis and research to know this with any certainty.

To solve this problem, our solution uses ML and Neptune to create recommendations for the data consumer. The solution generates a list of data products, product attributes, and the associated probability scores to show join ability. This reduces the time to discover, analyze, and create new insights.

We use Valentine, a data science algorithm for comparing datasets, to improve data product recommendations. Neptune, the managed AWS graph database service, stores information about explicit connections between datasets, improving the recommendations.

Example use case

Let’s walk through a concrete example. Suppose a consumer is browsing the Customer data product in the data mesh marketplace. Customer is similar to the Policy and Encounter data products, but these products come from different silos. Their similarity to the Customer is hard to gauge. To expedite the consumer’s work, the mesh recommends how the Policy and Encounter products can be connected to the Customer product.

Let’s consider two cases. First, is Customer similar to Claim? The following is a sample of the data in each product.

Intuitively, these two products have lots of overlap. Every Cust_Nbr in Claim has a corresponding Customer_ID in Customer. There is no foreign key constraint in Claim that assures us it points to Customer. We think there is enough similarity to infer a join relationship.

The data science algorithm Valentine is an effective tool for this. Valentine is presented in the paper Valentine: Evaluating Matching Techniques for Dataset Discovery (2021, Koutras et al.). Valentine determines if two datasets are joinable or unionable. We focus on the former. Two datasets are joinable if a record from one dataset has a link to a record in the other dataset using one or more columns. Valentine addresses the use case where data is messy: there is no foreign key constraint in place, and data doesn’t match perfectly between datasets. Valentine looks for similarities, and its findings are probabilistic. It scores its proposed matches.

This solution uses an implementation of Valentine available in the following GitHub repo. The first step is to load each data product from its source into a Pandas data frame. If the data is large, load a representative subset of it, at most a few million records. Pass the frames to the valentine_match() function and select the matching method. We use COMA, one of several methods that Valentine supports. The function’s result indicates the similarity of columns and the score. In this case, it tells us that the Customer_ID for Customer matches the Cust_Nbr for Claim, with a very high score. We then instruct the data mesh to recommend Claim to the consumer browsing Customer.

A graph database isn’t required to recommend Claim; the two products could be directly compared. But let’s consider Encounter. Is Customer similar to Encounter? This case is more complicated. Many encounters in the Encounter product don’t link to a customer. An encounter occurs when someone contacts the contact center, which could be by phone or email. The party may or may not be a customer, and if they are a customer, we may not know their customer ID during this encounter. Additionally, sometimes the phone or email they use isn’t the same as the one from a customer record in the Customer product.

In the following sample encounter set, encounters 1 and 2 match to Customer_ID 4. Note that encounter 2’s inbound_email doesn’t exactly match the inbound_email in that customer’s record in the Customer product. Encounter 3 has no Customer_ID, but its inbound_email matches the customer with ID 8. Encounter 4 appears to refer to the customer with ID 8, but the email doesn’t match, and no Customer_ID is given. Encounter 5 only has Inbound_Phone, but that matches the customer with ID 1. Encounter 6 only has an Inbound_Phone, and it doesn’t appear to match any of the customers we’ve listed so far.

We don’t have a strong enough comparison to infer similarity.

But we know more about the customer than the Customer product tells us. In the Neptune database, we maintain a knowledge graph that combines multiple products and links them through relationships. A knowledge graph allows us to combine data from different sources to gain a better understanding of a specific problem domain. In Neptune, we combine the Customer product data with an additional data product: Sales Opportunity. We ingest each product from its source into the knowledge graph and model a hasSalesOpportunity relationship between Customer and SalesOpportunity resources. The following figure shows these resources, their attributes, and their relationship.

With the AWS SDK for Pandas, we combine this data by running a query against the Neptune graph. We use a graph query language (such as SPARQL) to wrangle a representative subset of customer and sales opportunity data into a Pandas data frame (shown as Enhanced Customer View in the following figure). In the following example, we enhance customers 7 and 8 with alternate phone or email contact data from sales opportunities.

We pass that frame to Valentine and compare it to Encounter. This time, two additional encounters match a customer.

The score meets our threshold, and is high enough to share with the consumer as a possible match. To the customer browsing Customer in the mesh marketplace, we present the recommendation of Encounter, along with scoring details to support the recommendation. With this recommendation, the consumer can explore the Encounter product with greater confidence.

Conclusion

Data-driven organizations are transitioning to a data product way of thinking. Utilizing strategies like data mesh generates value on a large scale. We took this a step further by creating a blueprint to create smart recommendations by linking similar data products using graph technology and ML. In this post, we showed how an organization can augment a data catalog with additional metadata by using ML and Neptune with an automated process.

This solution solves the interoperability and linkage problem for data products. Additionally, it gives organizations real-time insights, agility, and innovation without spending time on data analysis and research. This approach creates a truly connected ecosystem with simplified access to delight your data consumers. The current solution is platform agnostic; however, in a future post we will show how to implement this using data.all (open-source software) and Amazon DataZone.

To learn more about ML in Neptune, refer to Amazon Neptune ML for machine learning on graphs. You can also explore Neptune notebooks demonstrating ML and data science for graphs. For more information about the data mesh architecture, refer to Design a data mesh architecture using AWS Lake Formation and AWS Glue. To learn more about Amazon DataZone and how you can share, search, and discover data at scale across organizational boundaries.


About the Authors


Moira Lennox
is a Senior Data Strategy Technical Specialist for AWS with 27 years’ experience helping companies innovate and modernize their data strategies to achieve new heights and allow for strategic decision-making. She has experience working in large enterprises and technology providers, in both business and technical roles across multiple industries, including health care live sciences, financial services, communications, digital entertainment, energy, and manufacturing.

Joel Farvault is Principal Specialist SA Analytics for AWS with 25 years’ experience working on enterprise architecture, data strategy, and analytics, mainly in the financial services industry. Joel has led data transformation projects on fraud analytics, claims automation, and data governance.

Mike Havey is a Solutions Architect for AWS with over 25 years of experience building enterprise applications. Mike is the author of two books and numerous articles. His Amazon author page

Accelerate HiveQL with Oozie to Spark SQL migration on Amazon EMR

Post Syndicated from Vinay Kumar Khambhampati original https://aws.amazon.com/blogs/big-data/accelerate-hiveql-with-oozie-to-spark-sql-migration-on-amazon-emr/

Many customers run big data workloads such as extract, transform, and load (ETL) on Apache Hive to create a data warehouse on Hadoop. Apache Hive has performed pretty well for a long time. But with advancements in infrastructure such as cloud computing and multicore machines with large RAM, Apache Spark started to gain visibility by performing better than Apache Hive.

Customers now want to migrate their Apache Hive workloads to Apache Spark in the cloud to get the benefits of optimized runtime, cost reduction through transient clusters, better scalability by decoupling the storage and compute, and flexibility. However, migration from Apache Hive to Apache Spark needs a lot of manual effort to write migration scripts and maintain different Spark job configurations.

In this post, we walk you through a solution that automates the migration from HiveQL to Spark SQL. The solution was used to migrate Hive with Oozie workloads to Spark SQL and run them on Amazon EMR for a large gaming client. You can also use this solution to develop new jobs with Spark SQL and process them on Amazon EMR. This post assumes that you have a basic understanding of Apache Spark, Hive, and Amazon EMR.

Solution overview

In our example, we use Apache Oozie, which schedules Apache Hive jobs as actions to collect and process data on a daily basis.

We migrate these Oozie workflows with Hive actions by extracting the HQL files, and dynamic and static parameters, and converting them to be Spark compliant. Manual conversion is both time consuming and error prone. To convert the HQL to Spark SQL, you’ll need to sort through existing HQLs, replace the parameters, and change the syntax for a bunch of files.

Instead, we can use automation to speed up the process of migration and reduce heavy lifting tasks, costs, and risks.

We split the solution into two primary components: generating Spark job metadata and running the SQL on Amazon EMR. The first component (metadata setup) consumes existing Hive job configurations and generates metadata such as number of parameters, number of actions (steps), and file formats. The second component consumes the generated metadata from the first component and prepares the run order of Spark SQL within a Spark session. With this solution, we support basic orchestration and scheduling with the help of AWS services like Amazon DynamoDB and Amazon Simple Storage Service (Amazon S3). We can validate the solution by running queries in Amazon Athena.

In the following sections, we walk through these components and how to use these automations in detail.

Generate Spark SQL metadata

Our batch job consists of Hive steps scheduled to run sequentially. For each step, we run HQL scripts that extract, transform, and aggregate input data into one final Hive table, which stores data in HDFS. We use the following Oozie workflow parser script, which takes the input of an existing Hive job and generates configurations artifacts needed for running SQL using PySpark.

Oozie workflow XML parser

We create a Python script to automatically parse the Oozie jobs, including workflow.xml, co-ordinator.xml, job properties, and HQL files. This script can handle many Hive actions in a workflow by organizing the metadata at the step level into separate folders. Each step includes the list of SQLs, SQL paths, and their static parameters, which are input for the solution in the next step.

The process consists of two steps:

  1. The Python parser script takes input of the existing Oozie Hive job and its configuration files.
  2. The script generates a metadata JSON file for each step.

The following diagram outlines these steps and shows sample output.

Prerequisites

You need the following prerequisites:

  • Python 3.8
  • Python packages:
    • sqlparse==0.4.2
    • jproperties==2.1.1
    • defusedxml== 0.7.1

Setup

Complete the following steps:

  1. Install Python 3.8.
  2. Create a virtual environment:
python3 -m venv /path/to/new/virtual/environment
  1. Activate the newly created virtual environment:
source /path/to/new/virtual/environment/bin/activate
  1. Git clone the project:
git clone https://github.com/aws-samples/oozie-job-parser-extract-hive-sql
  1. Install dependent packages:
cd oozie-job-parser-extract-hive-sql
pip install -r requirements.txt

Sample command

We can use the following sample command:

python xml_parser.py --base-folder ./sample_jobs/ --job-name sample_oozie_job_name --job-version V3 --hive-action-version 0.4 --coordinator-action-version 0.4 --workflow-version 0.4 --properties-file-name job.coordinator.properties

The output is as follows:

{'nameNode': 'hdfs://@{{/cluster/${{cluster}}/namenode}}:54310', 'jobTracker': '@{{/cluster/${{cluster}}/jobtracker}}:54311', 'queueName': 'test_queue', 'appName': 'test_app', 'oozie.use.system.libpath': 'true', 'oozie.coord.application.path': '${nameNode}/user/${user.name}/apps/${appName}', 'oozie_app_path': '${oozie.coord.application.path}', 'start': '${{startDate}}', 'end': '${{endDate}}', 'initial_instance': '${{startDate}}', 'job_name': '${appName}', 'timeOut': '-1', 'concurrency': '3', 'execOrder': 'FIFO', 'throttle': '7', 'hiveMetaThrift': '@{{/cluster/${{cluster}}/hivemetastore}}', 'hiveMySQL': '@{{/cluster/${{cluster}}/hivemysql}}', 'zkQuorum': '@{{/cluster/${{cluster}}/zookeeper}}', 'flag': '_done', 'frequency': 'hourly', 'owner': 'who', 'SLA': '2:00', 'job_type': 'coordinator', 'sys_cat_id': '6', 'active': '1', 'data_file': 'hdfs://${nameNode}/hive/warehouse/test_schema/test_dataset', 'upstreamTriggerDir': '/input_trigger/upstream1'}
('./sample_jobs/development/sample_oozie_job_name/step1/step1.json', 'w')

('./sample_jobs/development/sample_oozie_job_name/step2/step2.json', 'w')

Limitations

This method has the following limitations:

  • The Python script parses only HiveQL actions from the Oozie workflow.xml.
  • The Python script generates one file for each SQL statement and uses the sequence ID for file names. It doesn’t name the SQL based on the functionality of the SQL.

Run Spark SQL on Amazon EMR

After we create the SQL metadata files, we use another automation script to run them with Spark SQL on Amazon EMR. This automation script supports custom UDFs by adding JAR files to spark submit. This solution uses DynamoDB for logging the run details of SQLs for support and maintenance.

Architecture overview

The following diagram illustrates the solution architecture.

Prerequisites

You need the following prerequisites:

  • Version:
    • Spark 3.X
    • Python 3.8
    • Amazon EMR 6.1

Setup

Complete the following steps:

  1. Install the AWS Command Line Interface (AWS CLI) on your workspace by following the instructions in Installing or updating the latest version of the AWS CLI. To configure AWS CLI interaction with AWS, refer to Quick setup.
  2. Create two tables in DynamoDB: one to store metadata about jobs and steps, and another to log job runs.
    • Use the following AWS CLI command to create the metadata table in DynamoDB:
aws dynamodb create-table --region us-east-1 --table-name dw-etl-metadata --attribute-definitions '[ { "AttributeName": "id","AttributeType": "S" } , { "AttributeName": "step_id","AttributeType": "S" }]' --key-schema '[{"AttributeName": "id", "KeyType": "HASH"}, {"AttributeName": "step_id", "KeyType": "RANGE"}]' --billing-mode PROVISIONED --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5

You can check on the DynamoDB console that the table dw-etl-metadata is successfully created.

The metadata table has the following attributes.

Attributes Type Comments
id String partition_key
step_id String sort_key
step_name String Step description
sql_base_path string Base path
sql_info list List of SQLs in ETL pipeline
. sql_path SQL file name
. sql_active_flag y/n
. sql_load_order Order of SQL
. sql_parameters Parameters in SQL and values
spark_config Map Spark configs
    • Use the following AWS CLI command to create the log table in DynamoDB:
aws dynamodb create-table --region us-east-1 --table-name dw-etl-pipelinelog --attribute-definitions '[ { "AttributeName":"job_run_id", "AttributeType": "S" } , { "AttributeName":"step_id", "AttributeType": "S" } ]' --key-schema '[{"AttributeName": "job_run_id", "KeyType": "HASH"},{"AttributeName": "step_id", "KeyType": "RANGE"}]' --billing-mode PROVISIONED --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5

You can check on the DynamoDB console that the table dw-etl-pipelinelog is successfully created.

The log table has the following attributes.

Attributes Type Comments
job_run_id String partition_key
id String sort_key (UUID)
end_time String End time
error_description String Error in case of failure
expire Number Time to Live
sql_seq Number SQL sequence number
start_time String Start time
Status String Status of job
step_id String Job ID SQL belongs

The log table can grow quickly if there are too many jobs or if they are running frequently. We can archive them to Amazon S3 if they are no longer used or use the Time to Live feature of DynamoDB to clean up old records.

  1. Run the first command to set the variable in case you have an existing bucket that can be reused. If not, create a S3 bucket to store the Spark SQL code, which will be run by Amazon EMR.
export s3_bucket_name=unique-code-bucket-name # Change unique-code-bucket-name to a valid bucket name
aws s3api create-bucket --bucket $s3_bucket_name --region us-east-1
  1. Enable secure transfer on the bucket:
aws s3api put-bucket-policy --bucket $s3_bucket_name --policy '{"Version": "2012-10-17", "Statement": [{"Effect": "Deny", "Principal": {"AWS": "*"}, "Action": "s3:*", "Resource": ["arn:aws:s3:::unique-code-bucket-name", "arn:aws:s3:::unique-code-bucket-name/*"], "Condition": {"Bool": {"aws:SecureTransport": "false"} } } ] }' # Change unique-code-bucket-name to a valid bucket name

  1. Clone the project to your workspace:
git clone https://github.com/aws-samples/pyspark-sql-framework.git
  1. Create a ZIP file and upload it to the code bucket created earlier:
cd pyspark-sql-framework/code
zip code.zip -r *
aws s3 cp ./code.zip s3://$s3_bucket_name/framework/code.zip
  1. Upload the ETL driver code to the S3 bucket:
cd $OLDPWD/pyspark-sql-framework
aws s3 cp ./code/etl_driver.py s3://$s3_bucket_name/framework/

  1. Upload sample job SQLs to Amazon S3:
aws s3 cp ./sample_oozie_job_name/ s3://$s3_bucket_name/DW/sample_oozie_job_name/ --recursive

  1. Add a sample step (./sample_oozie_job_name/step1/step1.json) to DynamoDB (for more information, refer to Write data to a table using the console or AWS CLI):
{
  "name": "step1.q",
  "step_name": "step1",
  "sql_info": [
    {
      "sql_load_order": 5,
      "sql_parameters": {
        "DATE": "${coord:formatTime(coord:nominalTime(), 'yyyy-MM-dd')}",
        "HOUR": "${coord:formatTime(coord:nominalTime(), 'HH')}"
      },
      "sql_active_flag": "Y",
      "sql_path": "5.sql"
    },
    {
      "sql_load_order": 10,
      "sql_parameters": {
        "DATE": "${coord:formatTime(coord:nominalTime(), 'yyyy-MM-dd')}",
        "HOUR": "${coord:formatTime(coord:nominalTime(), 'HH')}"
      },
      "sql_active_flag": "Y",
      "sql_path": "10.sql"
    }
  ],
  "id": "emr_config",
  "step_id": "sample_oozie_job_name#step1",
  "sql_base_path": "sample_oozie_job_name/step1/",
  "spark_config": {
    "spark.sql.parser.quotedRegexColumnNames": "true"
  }
}

  1. In the Athena query editor, create the database base:
create database base;
  1. Copy the sample data files from the repo to Amazon S3:
    1. Copy us_current.csv:
aws s3 cp ./sample_data/us_current.csv s3://$s3_bucket_name/covid-19-testing-data/base/source_us_current/;

  1. Copy states_current.csv:
aws s3 cp ./sample_data/states_current.csv s3://$s3_bucket_name/covid-19-testing-data/base/source_states_current/;

  1. To create the source tables in the base database, run the DDLs present in the repo in the Athena query editor:
    1. Run the ./sample_data/ddl/states_current.q file by modifying the S3 path to the bucket you created.
    1. Run the ./sample_data/ddl/us_current.q file by modifying the S3 path to the bucket you created.

The ETL driver file implements the Spark driver logic. It can be invoked locally or on an EMR instance.

  1. Launch an EMR cluster.
    1. Make sure to select Use for Spark table metadata under AWS Glue Data Catalog settings.

  1. Add the following steps to EMR cluster.
aws emr add-steps --cluster-id <<cluster id created above>> --steps 'Type=CUSTOM_JAR,Name="boto3",ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[bash,-c,"sudo pip3 install boto3"]'
aws emr add-steps --cluster-id <<cluster id created above>> --steps 'Name="sample_oozie_job_name",Jar="command-runner.jar",Args=[spark-submit,--py-files,s3://unique-code-bucket-name-#####/framework/code.zip,s3://unique-code-bucket-name-#####/framework/etl_driver.py,--step_id,sample_oozie_job_name#step1,--job_run_id,sample_oozie_job_name#step1#2022-01-01-12-00-01,  --code_bucket=s3://unique-code-bucket-name-#####/DW,--metadata_table=dw-etl-metadata,--log_table_name=dw-etl-pipelinelog,--sql_parameters,DATE=2022-02-02::HOUR=12::code_bucket=s3://unique-code-bucket-name-#####]' # Change unique-code-bucket-name to a valid bucket name

The following table summarizes the parameters for the spark step.

Step type Spark Application
Name Any Name
Deploy mode Client
Spark-submit options --py-files s3://unique-code-bucket-name-#####/framework/code.zip
Application location s3://unique-code-bucket-name-####/framework/etl_driver.py
Arguments --step_id sample_oozie_job_name#step1 --job_run_id sample_oozie_job_name#step1#2022-01-01-12-00-01 --code_bucket=s3://unique-code-bucket-name-#######/DW --metadata_table=dw-etl-metadata --log_table_name=dw-etl-pipelinelog --sql_parameters DATE=2022-02-02::HOUR=12::code_bucket=s3://unique-code-bucket-name-#######
Action on failure Continue

The following table summarizes the script arguments.

Script Argument Argument Description
deploy-mode Spark deploy mode. Client/Cluster.
name <jobname>#<stepname> Unique name for the Spark job. This can be used to identify the job on the Spark History UI.
py-files <s3 path for code>/code.zip S3 path for the code.
<s3 path for code>/etl_driver.py S3 path for the driver module. This is the entry point for the solution.
step_id <jobname>#<stepname> Unique name for the step. This refers to the step_id in the metadata entered in DynamoDB.
job_run_id <random UUID> Unique ID to identify the log entries in DynamoDB.
log_table_name <DynamoDB Log table name> DynamoDB table for storing the job run details.
code_bucket <s3 bucket> S3 path for the SQL files that are uploaded in the job setup.
metadata_table <DynamoDB Metadata table name> DynamoDB table for storing the job metadata.
sql_parameters DATE=2022-07-04::HOUR=00 Any additional or dynamic parameters expected by the SQL files.

Validation

After completion of EMR step you should have data on S3 bucket for the table base.states_daily. We can validate the data by querying the table base.states_daily in Athena.

Congratulations, you have converted an Oozie Hive job to Spark and run on Amazon EMR successfully.

Solution highlights

This solution has the following benefits:

  • It avoids boilerplate code for any new pipeline and offers less maintenance of code
  • Onboarding any new pipeline only needs the metadata set up—the DynamoDB entries and SQL to be placed in the S3 path
  • Any common modifications or enhancements can be done at the solution level, which will be reflected across all jobs
  • DynamoDB metadata provides insight into all active jobs and their optimized runtime parameters
  • For each run, this solution persists the SQL start time, end time, and status in a log table to identify issues and analyze runtimes
  • It supports Spark SQL and UDF functionality. Custom UDFs can be provided externally to the spark submit command

Limitations

This method has the following limitations:

  • The solution only supports SQL queries on Spark
  • Each SQL should be a Spark action like insert, create, drop, and so on

In this post, we explained the scenario of migrating an existing Oozie job. We can use the PySpark solution independently for any new development by creating DynamoDB entries and SQL files.

Clean up

Delete all the resources created as part of this solution to avoid ongoing charges for the resources:

  1. Delete the DynamoDB tables:
aws dynamodb delete-table --table-name dw-etl-metadata --region us-east-1
aws dynamodb delete-table --table-name dw-etl-pipelinelog --region us-east-1
  1. Delete the S3 bucket:
aws s3 rm s3://$s3_bucket_name --region us-east-1 --recursive
aws s3api create-bucket --bucket $s3_bucket_name --region us-east-1

  1. Stop the EMR cluster if it wasn’t a transient cluster:
aws emr terminate-clusters --cluster-ids <<cluster id created above>> 

Conclusion

In this post, we presented two automated solutions: one for parsing Oozie workflows and HiveQL files to generate metadata, and a PySpark solution for running SQLs using generated metadata. We successfully implemented these solutions to migrate a Hive workload to EMR Spark for a major gaming customer and achieved about 60% effort reduction.

For a Hive with Oozie to Spark migration, these solutions help complete the code conversion quickly so you can focus on performance benchmark and testing. Developing a new pipeline is also quick—you only need to create SQL logic, test it using Spark (shell or notebook), add metadata to DynamoDB, and test via the PySpark SQL solution. Overall, you can use the solution in this post to accelerate Hive to Spark code migration.


About the authors

Vinay Kumar Khambhampati is a Lead Consultant with the AWS ProServe Team, helping customers with cloud adoption. He is passionate about big data and data analytics.

Sandeep Singh is a Lead Consultant at AWS ProServe, focused on analytics, data lake architecture, and implementation. He helps enterprise customers migrate and modernize their data lake and data warehouse using AWS services.

Amol Guldagad is a Data Analytics Consultant based in India. He has worked with customers in different industries like banking and financial services, healthcare, power and utilities, manufacturing, and retail, helping them solve complex challenges with large-scale data platforms. At AWS ProServe, he helps customers accelerate their journey to the cloud and innovate using AWS analytics services.

Configure SAML federation for Amazon OpenSearch Serverless with AWS IAM Identity Center

Post Syndicated from Utkarsh Agarwal original https://aws.amazon.com/blogs/big-data/configure-saml-federation-for-amazon-opensearch-serverless-with-aws-iam-identity-center/

Amazon OpenSearch Serverless is a serverless option of Amazon OpenSearch Service that makes it easy for you to run large-scale search and analytics workloads without having to configure, manage, or scale OpenSearch clusters. It automatically provisions and scales the underlying resources to deliver fast data ingestion and query responses for even the most demanding and unpredictable workloads. With OpenSearch Serverless, you can configure SAML to enable users to access data through OpenSearch Dashboards using an external SAML identity provider (IdP).

AWS IAM Identity Center (Successor to AWS Single Sign-On) helps you securely create or connect your workforce identities and manage their access centrally across AWS accounts and applications, OpenSearch Dashboards being one of them.

In this post, we show you how to configure SAML authentication for OpenSearch Dashboards using IAM Identity Center as its IdP.

Solution overview

The following diagram illustrates how the solution allows users or groups to authenticate into OpenSearch Dashboards using single sign-on (SSO) with IAM Identity Center using its built-in directory as the identity source.

The workflow steps are as follows:

  1. A user accesses the OpenSearch Dashboard URL in their browser and chooses the SAML provider.
  2. OpenSearch Serverless redirects the login to the specified IdP.
  3. The IdP provides a login form for the user to specify the credentials for authentication.
  4. After the user is authenticated successfully, a SAML assertion is sent back to OpenSearch Serverless.

OpenSearch Serverless validates the SAML assertion, and the user logs in to OpenSearch Dashboards.

Prerequisites

To get started, you must have an active OpenSearch Serverless collection. Refer to Creating and managing Amazon OpenSearch Serverless collections to learn more about creating a collection. Furthermore, you must have the correct AWS Identity and Access Management (IAM) permissions for configuring SAML authentication along with relevant IAM permissions for configuring the data access policy.

IAM Identity Center should be enabled, and you should have the relevant IAM permissions to create an application in IAM Identity Center and create and manage users and groups.

Create and configure the application in IAM Identity Center

To set up your application in IAM Identity Center, complete the following steps:

  1. On the IAM Identity Center dashboard, choose Applications in the navigation pane.
  2. Choose Add application
  3. For Custom application, select Add custom SAML 2.0 application.
  4. Choose Next.
  5. Under Configure application, enter a name and description for the application.
  6. Under IAM Identity Center metadata, choose Download under IAM Identity Center SAML metadata file.

We use this metadata file to create a SAML provider under OpenSearch Serverless. It contains the public certificate used to verify the signature of the IAM Identity Center SAML assertions.

  1. Under Application properties, leave Application start URL and Relay state blank.
  2. For Session duration, choose 1 hour (the default value).

Note that the session duration you configure in this step takes precedence over the OpenSearch Dashboards timeout setting specified in the configuration of the SAML provider details on the OpenSearch Serverless end.

  1. Under Application metadata, select Manually type your metadata values.
  2. For Application ACS URL, enter your URL using the format https://collection.<REGION>.aoss.amazonaws.com/_saml/acs. For example, we enter https://collection.us-east-1.aoss.amazonaws.com/_saml/acs for this post.
  3. For Application SAML audience, enter your service provider in the format aws:opensearch:<aws account id>.
  4. Choose Submit.

Now you modify the attribute settings. The attribute mappings you configure here become part of the SAML assertion that is sent to the application.

  1. On the Actions menu, choose Edit attribute mappings.
  2. Configure Subject to map to ${user:email}, with the format unspecified.

Using ${user:email} here ensures that the email address for the user in IAM Identity Center is passed in the <NameId> tag of the SAML response.

  1. Choose Save changes.

Now we assign a user to the application.

  1. Create a user in IAM Identity Center to use to log in to OpenSearch Dashboards.

Alternatively, you can use an existing user.

  1. On the IAM Identity Center console, navigate to your application and choose Assign Users and select the user(s) you would like to assign.

You have now created a custom SAML application. Next, you will configure the SAML provider in OpenSearch Serverless.

Create a SAML provider

The SAML provider you create in this step can be assigned to any collection in the same Region. Complete the following steps:

  1. On the OpenSearch Service console, under Serverless in the navigation pane, choose SAML authentication under Security.
  2. Choose Create SAML provider.
  3. Enter a name and description for your SAML provider.
  4. Enter the metadata from your IdP that you downloaded earlier.
  5. Under Additional settings, you can optionally add custom user ID and group attributes. We leave these settings blank for now.
  6. Choose Create a SAML provider.

You have now configured a SAML provider for OpenSearch Serverless. Next, we walk you through configuring the data access policy for accessing collections.

Create the data access policy

In this section, you set up data access policies for OpenSearch Serverless and allow access to the users. Complete the following steps:

  1. On the OpenSearch Service console, under Serverless in the navigation pane, choose Data access policies under Security.
  2. Choose Create access policy.
  3. Enter a name and description for your access policy.
  4. For Policy definition method, select Visual Editor.
  5. In the Rules section, enter a rule name.
  6. Under Select principals, for Add principals, choose SAML users and groups.
  7. For SAML provider name, choose the SAML provider you created earlier.
  8. Specify the user in the format user/<email> (for example, user/[email protected]).

The value of the email address should match the email address in IAM Identity Center.

  1. Choose Save.
  2. Choose Grant and specify the permissions.

You can configure what access you want to provide for the specific user at the collection level and specific indexes at the index pattern level.

You should select the access the user needs based on the least privilege model. Refer to Supported policy permissions and Supported OpenSearch API operations and permissions to set up more granular access for your users.

  1. Choose Save and configure any additional rules, if required.

You can now review and edit your configuration if needed.

  1. Choose Create to create the data access policy.

Now you have the data access policy that will allow the users to perform the allowed actions on OpenSearch Dashboards.

Access OpenSearch Dashboards

To sign in to OpenSearch Dashboards, complete the following steps:

  1. On the OpenSearch Service dashboard, under Serverless in the navigation pane, choose Dashboard.
  2. Locate your dashboard and copy the OpenSearch Dashboards URL (in the format <collection-endpoint>/_dashboards).
  3. Enter this URL into a new browser tab.
  4. On the OpenSearch login page, choose your IdP and specify your SSO credentials.
  5. Choose Login.

Configure SAML authentication using groups in IAM Identity Center

Groups can help you organize your users and permissions in a coherent way. With groups, you can add multiple users from the IdP, and then use groupid as the identifier in the data access policy. For more information, refer to Add groups and Add users to groups.

To configure group access to OpenSearch Dashboards, complete the following steps:

  1. On the IAM Identity Center console, navigate to your application.
  2. In the Attribute mappings section, add an additional user as group and map it to ${user:groups}, with the format unspecified.
  3. Choose Save changes.
  4. For the SAML provider in OpenSearch Serverless, under Additional settings, for Group attribute, enter group.
  5. For the data access policy, create a new rule or add an additional principal in the previous rule.
  6. Choose the SAML provider name and enter group/<GroupId>.

You can fetch the value for the group ID by navigating to the Group section on the IAM Identity Center console.

Clean up

If you don’t want to continue using the solution, be sure to delete the resources you created:

  1. On the IAM Identity Center console, remove the application.
  2. On OpenSearch Dashboards, delete the following resources:
    1. Delete your collection.
    2. Delete the data access policy.
    3. Delete the SAML provider.

Conclusion

In this post, you learned how to set up IAM Identity Center as an IdP to access OpenSearch Dashboards using SAML as SSO. You also learned on how to set up users and groups within IAM Identity Center and control the access of users and groups for OpenSearch Dashboards. For more details, refer to SAML authentication for Amazon OpenSearch Serverless.

Stay tuned for a series of posts focusing on the various options available for you to build effective log analytics and search solutions using OpenSearch Serverless. You can also refer to the Getting started with Amazon OpenSearch Serverless workshop to know more about OpenSearch Serverless.

If you have feedback about this post, submit it in the comments section. If you have questions about this post, start a new thread on the OpenSearch Service forum or contact AWS Support.


About the Authors

Utkarsh Agarwal is a Cloud Support Engineer in the Support Engineering team at Amazon Web Services. He specializes in Amazon OpenSearch Service. He provides guidance and technical assistance to customers thus enabling them to build scalable, highly available and secure solutions in AWS Cloud. In his free time, he enjoys watching movies, TV series and of course cricket! Lately, he his also attempting to master the art of cooking in his free time – The taste buds are excited, but the kitchen might disagree.

Ravi Bhatane is a software engineer with Amazon OpenSearch Serverless Service. He is passionate about security, distributed systems, and building scalable services. When he’s not coding, Ravi enjoys photography and exploring new hiking trails with his friends.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Scale your authorization needs for Secrets Manager using ABAC with IAM Identity Center

Post Syndicated from Aravind Gopaluni original https://aws.amazon.com/blogs/security/scale-your-authorization-needs-for-secrets-manager-using-abac-with-iam-identity-center/

With AWS Secrets Manager, you can securely store, manage, retrieve, and rotate the secrets required for your applications and services running on AWS. A secret can be a password, API key, OAuth token, or other type of credential used for authentication purposes. You can control access to secrets in Secrets Manager by using AWS Identity and Access Management (IAM) permission policies. In this blog post, I will show you how to use principles of attribute-based access control (ABAC) to define dynamic IAM permission policies in AWS IAM Identity Center (successor to AWS Single Sign-On) by using user attributes from an external identity provider (IdP) and resource tags in Secrets Manager.

What is ABAC and why use it?

Attribute-based access control (ABAC) is an authorization strategy that defines permissions based on attributes or characteristics of the user, the data, or the environment, such as the department, business unit, or other factors that could affect the authorization outcome. In the AWS Cloud, these attributes are called tags. By assigning user attributes as principal tags, you can simplify the process of creating fine-grained permissions on AWS.

With ABAC, you can use attributes to build more dynamic policies that provide access based on matching attribute conditions. ABAC rules are evaluated dynamically at runtime, which means that the users’ access to applications and data and the type of allowed operations automatically change based on the contextual factors in the policy. For example, if a user changes department, access is automatically adjusted without the need to update permissions or request new roles. You can use ABAC in conjunction with role-based access control (RBAC) to combine the ease of policy administration with flexible policy specification and dynamic decision-making capability to enforce least privilege.

AWS IAM Identity Center (successor to AWS Single Sign-On) expands the capabilities of IAM to provide a central place that brings together the administration of users and their access to AWS accounts and cloud applications. With IAM Identity Center, you can define user permissions and manage access to accounts and applications in your AWS Organizations organization centrally. You can also create ABAC permission policies in a central place. ABAC will work with attributes from a supported identity source in IAM Identity Center. For a list of supported external IdPs for identity synchronization through the System for Cross-domain Identity Management (SCIM) and Security Assertion Markup Language (SAML) 2.0, see Supported identity providers.

The following are key benefits of using ABAC with IAM Identity Center and Secrets Manager:

  1. Fewer permission sets — With ABAC, multiple users who use the same IAM Identity Center permission set and the same IAM role can still get unique permissions, because permissions are now based on user attributes. Administrators can author IAM policies that grant users access only to secrets that have matching attributes. This helps reduce the number of distinct permissions that you need to create and manage in IAM Identity Center and, in turn, reduces your permission management complexity.
  2. Teams can change and grow quickly — When you create new secrets, you can apply the appropriate tags, which will automatically grant access without requiring you to update the permission policies.
  3. Use employee attributes from your corporate directory to define access — You can use existing employee attributes from a supported identity source configured in IAM Identity Center to make access control decisions on AWS.

Figure 1 shows a framework to control access to Secrets Manager secrets using IAM Identity Center and ABAC principles.

Figure 1: ABAC framework to control access to secrets using IAM Identity Center

Figure 1: ABAC framework to control access to secrets using IAM Identity Center

The following is a brief introduction to the basic components of the framework:

  1. User attribute source or identity source — This is where your users and groups are administered. You can configure a supported identity source with IAM Identity Center. You can then define and manage supported user attributes in the identity source.
  2. Policy management — You can create and maintain policy definitions (permission sets) centrally in IAM Identity Center. You can assign access to a user or group to one or more accounts in IAM Identity Center with these permission sets. You can then use attributes defined in your identity source to build ABAC policies for managing access to secrets.
  3. Policy evaluation — When you assign a permission set, IAM Identity Center creates corresponding IAM Identity Center-controlled IAM roles in each account, and attaches the policies specified in the permission set to those roles. IAM Identity Center manages the role, and allows the authorized users that you’ve defined to assume the role. When users try to access a secret, IAM dynamically evaluates ABAC policies on the target account to determine access based on the attributes assigned to the user and resource tags assigned to that secret.

How to configure ABAC with IAM Identity Center

To configure ABAC with IAM Identity Center, you need to complete the following high-level steps. I will walk you through these steps in detail later in this post.

  1. Identify and set up identities that are created and managed in the identity source with user attributes, such as project, team, AppID or department.
  2. In IAM Identity Center, enable Attributes for access control and configure select attributes (such as department) to use for access control. For a list of supported attributes, see Supported external identity provider attributes.
  3. If you are using an external IdP and choose to use custom attributes from your IdP for access controls, configure your IdP to send the attributes through SAML assertions to IAM Identity Center.
  4. Assign appropriate tags to secrets in Secrets Manager.
  5. Create permission sets based on attributes added to identities and resource tags.
  6. Define guardrails to enforce access using ABAC.

ABAC enforcement and governance

Because an ABAC authorization model is based on tags, you must have a tagging strategy for your resources. To help prevent unintended access, you need to make sure that tagging is enforced and that a governance model is in place to protect the tags from unauthorized updates. By using service control policies (SCPs) and AWS Organizations tag policies, you can enforce tagging and tag governance on resources.

When you implement ABAC for your secrets, consider the following guidance for establishing a tagging strategy:

  • During secret creation, secrets must have an ABAC tag applied (tag-on-create).
  • During secret creation, the provided ABAC tag key must be the same case as the principal’s ABAC tag key.
  • After secret creation, the ABAC tag cannot be modified or deleted.
  • Only authorized principals can do tagging operations on secrets.
  • You enforce the permissions that give access to secrets through tags.

For more information on tag strategy, enforcement, and governance, see the following resources:

Solution overview

In this post, I will walk you through the steps to enable the IdP that is supported by IAM Identity Center.

Figure 2: Sample solution implementation

Figure 2: Sample solution implementation

In the sample architecture shown in Figure 2, Arnav and Ana are users who each have the attributes department and AppID. These attributes are created and updated in the external directory—Okta in this case. The attribute department is automatically synchronized between IAM Identity Center and Okta using SCIM. The attribute AppID is a custom attribute configured on Okta, and is passed to AWS as a SAML assertion. Both users are configured to use the same IAM Identity Center permission set that allows them to retrieve the value of secrets stored in Secrets Manager. However, access is granted based on the tags associated with the secret and the attributes assigned to the user. 

For example, user Arnav can only retrieve the value of the RDS_Master_Secret_AppAlpha secret. Although both users work in the same department, Arnav can’t retrieve the value of the RDS_Master_Secret_AppBeta secret in this sample architecture.

Prerequisites

Before you implement the solution in this blog post, make sure that you have the following prerequisites in place:

  1. You have IAM Identity Center enabled for your organization and connected to an external IdP using SAML 2.0 identity federation.
  2. You have IAM Identity Center configured for automatic provisioning with an external IdP using the SCIM v2.0 standard. SCIM keeps your IAM Identity Center identities in sync with identities from the external IdP.

Solution implementation

In this section, you will learn how to enable access to Secrets Manager using ABAC by completing the following steps:

  1. Configure ABAC in IAM Identity Center
  2. Define custom attributes in Okta
  3. Update configuration for the IAM Identity Center application on Okta
  4. Make sure that required tags are assigned to secrets in Secrets Manager
  5. Create and assign a permission set with an ABAC policy in IAM Identity Center
  6. Define guardrails to enforce access using ABAC

Step 1: Configure ABAC in IAM Identity Center

The first step is to set up attributes for your ABAC configuration in IAM Identity Center. This is where you will be mapping the attribute coming from your identity source to an attribute that IAM Identity Center passes as a session tag. The Key represents the name that you are giving to the attribute for use in the permission set policies. You need to specify the exact name in the policies that you author for access control. For the example in this post, you will create a new attribute with Key of department and Value of ${path:enterprise.department}. For supported external IdP attributes, see Attribute mappings.

To configure ABAC in IAM Identity Center (console)

  1. Open the IAM Identity Center console.
  2. In the Settings menu, enable Attributes for access control.
  3. Choose the Attributes for access control tab, select Add attribute, and then enter the Key and Value details as follows.
    • Key: department
    • Value: ${path:enterprise.department}

Note: For more information, see Attributes for access control.

Step 2: Define custom attributes in Okta

The sample architecture in this post uses a custom attribute (AppID) on an external IdP for access control. In this step, you will create a custom attribute in Okta.

To define custom attributes in Okta (console)

  1. Open the Okta console.
  2. Navigate to Directory and then select Profile Editor.
  3. On the Profile Editor page, choose Okta User (default).
  4. Select Add Attribute and create a new custom attribute with the following parameters.
    • For Data type, enter string
    • For Display name, enter AppID
    • For Variable name, enter user.AppID
    • For Attribute length, select Less Than from the dropdown and enter a value.
    • For User permission, enter Read Only
  5. Navigate to Directory, select People, choose in-scope users, and enter a value for Department and AppID attributes. The following shows these values for the users in our example.
    • First name (firstName): Arnav
    • Last name (lastName): Desai
    • Primary email (email): [email protected]
    • Department (department): Digital
    • AppID: Alpha
       
    • First name (firstName): Ana
    • Last name (lastName): Carolina
    • Primary email (email): [email protected]
    • Department (department): Digital
    • AppID: Beta

Step 3: Update SAML configuration for IAM Identity Center application on Okta

Automatic provisioning (through the SCIM v2.0 standard) of user and group information from Okta into IAM Identity Center supports a set of defined attributes. A custom attribute that you create on Okta won’t be automatically synchronized to IAM Identity Center through SCIM. You can, however, define the attribute in the SAML configuration so that it is inserted into the SAML assertions.

To update the SAML configuration in Okta (console)

  1. Open the Okta console and navigate to Applications.
  2. On the Applications page, select the app that you defined for IAM Identity Center.
  3. Under the Sign On tab, choose Edit.
  4. Under SAML 2.0, expand the Attributes (Optional) section, and add an attribute statement with the following values, as shown in Figure 3:
    Figure 3: Sample SAML configuration with custom attributes

    Figure 3: Sample SAML configuration with custom attributes

  5. To check that the newly added attribute is reflected in the SAML assertion, choose Preview SAML, review the information, and then choose Save.

Step 4: Make sure that required tags are assigned to secrets in Secrets Manager

The next step is to make sure that the required tags are assigned to secrets in Secrets Manager. You will review the required tags from the Secrets Manager console.

To verify required tags on secrets (console)

  1. Open the Secrets Manager console in the target AWS account and then choose Secrets.
  2. Verify that the required tags are assigned to the secrets in scope for this solution, as shown in Figure 4. In our example, the tags are as follows:
    • Keydepartment
    • Value: Digital
    • KeyAppID
    • Value: Alpha or Beta
    Figure 4: Sample secret configuration with required tags

    Figure 4: Sample secret configuration with required tags

Step 5a: Create a permission set in IAM Identity Center using ABAC policy

In this step, you will create a new permission set that allows access to secrets based on the principal attributes and resource tags.

When you enable ABAC and specify attributes, IAM Identity Center passes the attribute value of the authenticated user to AWS Security Token Service (AWS STS) as session tags when an IAM role is assumed. You can use access control attributes in your permission sets by using the aws:PrincipalTag condition key to create access control rules.

To create a permission set (console)

  1. Open the IAM Identity Center console and navigate to Multi-account permissions.
  2. Choose Permission sets, and then select Create permission set.
  3. On the Specify policies and permissions boundary page, choose Inline policy.
  4. For Inline policy, paste the following sample policy document and then choose Next. This policy allows users to retrieve the value of only those secrets that have resource tags that match the required user attributes (department and AppID in our example).
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListAllSecrets",
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:ListSecrets"
                ],
                "Resource": "*"
            },
            {
                "Sid": "AuthorizetoGetSecretValue",
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:DescribeSecret"
                ],
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        "secretsmanager:ResourceTag/department": "${aws:PrincipalTag/department}",
                        "secretsmanager:ResourceTag/AppID": "${aws:PrincipalTag/AppID}"
                    }
                }
            }
        ]
    }

  5. Configure the session duration, and optionally provide a description and tags for the permission set.
  6. Review and create the permission set.

Step 5b: Assign permission set to users in IAM Identity Center

Now that you have created a permission set with ABAC policy, complete the configuration by assigning the permission set to users to grant them access to secrets in one or more accounts in your organization.

To assign a permission set (console)

  1. Open the IAM Identity Center console and navigate to Multi-account permissions.
  2. Choose AWS accounts and select one or more accounts to which you want to assign access.
  3. Choose Assign users or groups.
  4. On the Assign users and groups page, select the users, groups, or both to which you want to assign access. For this example, I select both Arnav and Ana.
  5. On the Assign permission sets page, select the permission set that you created in the previous section.
  6. Review your changes, as shown in Figure 5, and then select Submit.
Figure 5: Sample permission set assignment

Figure 5: Sample permission set assignment

Step 6: Define guardrails to enforce access using ABAC

To govern access to secrets to your workforce users only through ABAC and to help prevent unauthorized access, you can define guardrails. In this section, I will show you some sample service control policies (SCPs) that you can use in your organization.

Note: Before you use these sample SCPs, you should carefully review, customize, and test them for your unique requirements. For additional instructions on how to attach an SCP, see Attaching and detaching service control policies.

Guardrail 1 – Enforce ABAC to access secrets

The following sample SCP requires the use of ABAC to access secrets in Secrets Manager. In this example, users and secrets must have matching values for the attributes department and AppID. Access is denied if those attributes don’t exist or if they don’t have matching values. Also, this example SCP allows only the admin role to access secrets without matching tags. Replace <arn:aws:iam::*:role/secrets-manager-admin-role> with your own information.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyAccesstoSecretsWithoutABACTag",
            "Effect": "Deny",
            "Action": [
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret"
            ],
            "Resource": "*",
            "Condition": {
                "StringNotEqualsIfExists": {
                    "secretsmanager:ResourceTag/department": "${aws:PrincipalTag/department}",
                    "secretsmanager:ResourceTag/AppID": "${aws:PrincipalTag/AppID}"
                },
                "ArnNotLike": {
                    "aws:PrincipalArn": "<arn:aws:iam::*:role/secrets-manager-admin-role>"
                }
            }
        }
    ]
}

Guardrail 2 – Enforce tagging on secret creation

The following sample SCP denies the creation of new secrets that don’t have the required tag key-value pairs. In this example, the SCP denies creation of a new secret if it doesn’t include department and AppID tag keys. It also denies access if the tag department doesn’t have the value Digital and the tag AppID doesn’t have either Alpha or Beta assigned to it. Also, this example SCP allows only the admin role to create secrets without matching tags. Replace <arn:aws:iam::*:role/secrets-manager-admin-role> with your own information.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyCreatingResourcesWithoutRequiredTag",
      "Effect": "Deny",
      "Action": [
        "secretsmanager:CreateSecret"
      ],
      "Resource": [
        "arn:aws:secretsmanager:*:*:secret:*"
      ],
      "Condition": {
        "StringNotEquals": {
          "aws:RequestTag/department": ["Digital"],
          "aws:RequestTag/AppID": ["Alpha", "Beta"]
        },
        "ArnNotLike": {
          "aws:PrincipalArn": "<arn:aws:iam::*:role/secrets-manager-admin-role>"
        }
      }
    }
  ]
}

Guardrail 3 – Restrict deletion of ABAC tags

The following sample SCP denies the ability to delete the tags used for ABAC. In this example, only the admin role can delete the tags department and AppID after they are attached to a secret. Replace <arn:aws:iam::*:role/secrets-manager-admin-role> with your own information.

Guardrail 4 – Restrict modification of ABAC tags

The following sample SCP denies the ability to modify required tags for ABAC after they are attached to a secret. In this example, only the admin role can modify the tags department and AppID after they are attached to a secret. Replace <arn:aws:iam::*:role/secrets-manager-admin-role> with your own information.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyModifyingABACTags",
      "Effect": "Deny",
      "Action": [
        "secretsmanager:TagResource"
      ],
      "Resource": [
        "arn:aws:secretsmanager:*:*:secret:*"
      ],
      "Condition": {
        "Null": {
          "aws:ResourceTag/department": "false",
          "aws:ResourceTag/AppID": "false"
        },
        "ArnNotLike": {
          "aws:PrincipalArn": "<arn:aws:iam::*:role/secrets-manager-admin-role>"
        }
      }
    }
  ]
}

Test the solution 

In this section, you will test the solution by retrieving a secret using the Secrets Manager console. Your attempt to retrieve the secret value will be successful only when the required resource and principal tags exist, and have matching values (AppID and department in our example).

Test scenario 1: Retrieve and view the value of an authorized secret

In this test, you will verify whether you can successfully retrieve the value of a secret that belongs to your application.

To test the scenario

  1. Sign in to IAM Identity Center and log in with your external IdP user. For this example, I log in as Arnav.
  2. On the IAM Identity Center dashboard, select the target account.
  3. From the list of available roles that the user has access to, choose the role that you created in Step 5a and select Management console, as shown in Figure 6. For this example, I select the SecretsManagerABACTest permission set.
    Figure 6: Sample IAM Identity Center dashboard

    Figure 6: Sample IAM Identity Center dashboard

  4. Open the Secrets Manager console and select a secret that belongs to your application. For this example, I select RDS_Master_Secret_AppAlpha.

    Because the AppID and department tags exist on both the secret and the user, the ABAC policy allowed the user to describe the secret, as shown in Figure 7.

    Figure 7: Sample secret that was described successfully

    Figure 7: Sample secret that was described successfully

  5. In the Secret value section, select Retrieve secret value.

    Because the value of the resource tags, AppID and department, matches the value of the corresponding user attributes (in other words, the principal tags), the ABAC policy allows the user to retrieve the secret value, as shown in Figure 8.

    Figure 8: Sample secret value that was retrieved successfully

    Figure 8: Sample secret value that was retrieved successfully

Test scenario 2: Retrieve and view the value of an unauthorized secret

In this test, you will verify whether you can retrieve the value of a secret that belongs to a different application.

To test the scenario

  1. Repeat steps 1-3 from test scenario 1.
  2. Open the Secrets Manager console and select a secret that belongs to a different application. For this example, I select RDS_Master_Secret_AppBeta.

    Because the value of the resource tag AppID doesn’t match the value of the corresponding user attribute (principal tag), the ABAC policy denies access to describe the secret, as shown in Figure 9.

    Figure 9: Sample error when describing an unauthorized secret

    Figure 9: Sample error when describing an unauthorized secret

Conclusion

In this post, you learned how to implement an ABAC strategy using attributes and to build dynamic policies that can simplify access management to Secrets Manager using IAM Identity Center configured with an external IdP. You also learned how to govern resource tags used for ABAC and establish guardrails to enforce access to secrets using ABAC. To learn more about ABAC and Secrets Manager, see Attribute-Based Access Control (ABAC) for AWS and the Secrets Manager documentation.

If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on AWS Secrets Manager re:Post.

 
Want more AWS Security news? Follow us on Twitter.

Aravind Gopaluni

Aravind Gopaluni

Aravind is a Senior Security Solutions Architect at AWS helping Financial Services customers meet their security and compliance objectives in the AWS cloud. Aravind has about 20 years of experience focusing on identity & access management and data protection solutions to numerous global enterprises. Outside of work, Aravind loves to have a ball with his family and explore new cuisines.

Investigate security events by using AWS CloudTrail Lake advanced queries

Post Syndicated from Rodrigo Ferroni original https://aws.amazon.com/blogs/security/investigate-security-events-by-using-aws-cloudtrail-lake-advanced-queries/

This blog post shows you how to use AWS CloudTrail Lake capabilities to investigate CloudTrail activity across AWS Organizations in response to a security incident scenario. We will walk you through two security-related scenarios while we investigate CloudTrail activity. The method described in this post will help you with the investigation process, allowing you to gain comprehensive understanding of the incident and its implications. CloudTrail Lake is a managed audit and security lake that allows you to aggregate, immutably store, and query your activity logs for auditing, security investigation, and operational troubleshooting.

Prerequisites

You must have the following AWS services enabled before you start the investigation.

  • CloudTrail Lake — To learn how to enable this service and use sample queries, see the blog post Announcing AWS CloudTrail Lake – a managed audit and security Lake. When you create a new event data store at the organization level, you will need to enable CloudTrail Lake for all of the accounts in the organization. We advise that you include not only management events but also data events.

    When you use CloudTrail Lake with AWS Organizations, you can designate an account within the organization to be the CloudTrail Lake delegated administrator. This provides a convenient way to perform queries from a designated AWS security account—for example, you can avoid granting access to your AWS management account.

  • Amazon GuardDuty — This is a threat detection service that continuously monitors your AWS accounts and workloads for malicious activity and delivers detailed security findings for visibility and remediation. To learn about the benefits of the service and how to get started, see Amazon GuardDuty.

Incident scenario 1: AWS access keys compromised

In the first scenario, you have observed activity within your AWS account from an unauthorized party. This example covers a situation where a threat actor has obtained and misused one of your AWS access keys that was exposed publicly by mistake. This investigation starts after Amazon GuardDuty generates an IAM finding identifying that the malicious activity came from the exposed AWS access key. Following the Incident Response Playbook Compromised IAM Credentials, focusing on step 12 in the playbook ([DETECTION AND ANALYSIS] Review CloudTrail Logs), you will use CloudTrail Lake capabilities to investigate the activity that was performed with this key. To do so, you will use the following nine query examples that we provide for this first scenario.

Query 1.1: Activity performed by access key during a specific time window

The first query is aimed at obtaining the specific activity that was performed by this key, either successfully or not, during the time the malicious activity took place. You can use the GuardDuty finding details “EventFirstSeen” and “EventLastSeen” to define the time window of the query. Also, and for further queries, you want to fetch artifacts that could be considered possible indicators of compromise (IoC) related to this security incident, such as IP addresses.

You can build and run the following query on CloudTrail Lake Editor, either in the CloudTrail console or programmatically.

Query 1.1


SELECT eventSource,eventName,sourceIPAddress,eventTime,errorCode FROM 1994bee2-d4a0-460e-8c07-1b5ee04765d8 WHERE userIdentity.accessKeyId = 'AKIAIOSFODNN7EXAMPLE' AND eventTime > '2022-03-15 13:10:00' AND eventTime < '2022-03-16 00:00:00' order by eventTime;

The results of the query are as follows:

Figure 1: Sample query 1.1 and results in the AWS Management Console

Figure 1: Sample query 1.1 and results in the AWS Management Console

The results demonstrate that the activity performed by the access key tried to unsuccessfully list Amazon Simple Storage Services (Amazon S3) buckets and CloudTrail trails. You can also see specific write activity related to AWS Identity and Access Management (IAM) that was denied, and afterwards there was activity possibly related to reconnaissance tactics in IAM to finally be able to assume a role, which indicates a possible attempt to perform an escalation of privileges. You can observe only one source IP from which this activity was performed.

Query 1.2: Confirm which IAM role was assumed by the threat actor during a specific time window

As you observed from the previous query results, the threat actor was able to assume an IAM role. In this query, you would like to confirm which IAM role was assumed during the security incident.

Query 1.2


SELECT requestParameters,responseElements FROM 1994bee2-d4a0-460e-8c07-1b5ee04765d8 WHERE eventName = 'AssumeRole' AND eventTime > '2022-03-15 13:10:00' AND eventTime < '2022-03-16 00:00:00' AND userIdentity.accessKeyId = 'AKIAIOSFODNN7EXAMPLE'

The results of the query are as follows:

Figure 2: Sample query 1.2 and results in the console

Figure 2: Sample query 1.2 and results in the console

The results show that an IAM role named “Alice” was assumed in a second account. For future queries, keep the temporary access key from the responseElements result to obtain activity performed by this role session.

Query 1.3: Activity performed from an IP address in an expanded time window search

Investigating the incident only from the time of discovery may result in overlooking signs or indicators of potential past incidents that were not detected related to this threat actor. For this reason, you want to expand the investigation window time, which might result in expanding the search back weeks, months, or even years, depending on factors such as the nature and severity of the incident, available resources, and so on. In this example, for balance and urgency, the window of time searched is expanded to a month. You want to also review whether there is past activity related to this account by the IP you previously observed.

Query 1.3

The results of the query are as follows:


SELECT eventSource,eventName,sourceIPAddress,eventTime,errorCode FROM 1994bee2-d4a0-460e-8c07-1b5ee04765d8 WHERE sourceIPAddress = '192.0.2.76' AND useridentity.accountid = '555555555555 AND eventTime > '2022-02-15 13:10:00' AND eventTime < '2022-03-15 13:10:00' order by eventTime;

Figure 3: Sample query 1.3 and results in the console

Figure 3: Sample query 1.3 and results in the console

As you can observe from the results, there is no activity coming from this IP address in this account in the previous month.

Query 1.4: Activity performed from an IP address in any other account in your organization during a specific time window

Before you start investigating what activity was performed by the role assumed in the second account, and considering that this malicious activity now involves cross-account access, you will want to review whether any other account in your organization has activity related to the specific IP address observed. You will need to expand the window of time to an entire month in order to see if previous activity was performed before this incident from this source IP, and you will need to exclude activity coming from the first account.

Query 1.4


SELECT useridentity.accountid,eventTime FROM 1994bee2-d4a0-460e-8c07-1b5ee04765d8 WHERE sourceIPAddress = '192.0.2.76' AND eventTime > '2022-02-15 13:10:00' AND eventTime < '2022-03-16 00:00:00' AND useridentity.accountid != '555555555555'GROUP by useridentity.accountid

The results of the query are as follows:

Figure 4: Sample query 1.4 and results in the console

Figure 4: Sample query 1.4 and results in the console

As you can observe from the results, there is activity only in the second account where the role was assumed. You can also confirm that there was no activity performed in other accounts in the previous month from this IP address.

Query 1.5: Count activity performed by an IAM role during a specific time period

For the next query example, you want to count and group activity based on the API actions that were performed in each service by the role assumed. This query helps you quantify and understand the impact of the possible unauthorized activity that might have happened in this second account.

Query 1.5


SELECT count (*) as NumberEvents, eventSource, eventName
FROM 1994bee2-d4a0-460e-8c07-1b5ee04765d8 WHERE eventTime > '2022-03-15 13:10:00' AND eventTime < '2022-03-16 00:00:00' AND useridentity.type = 'AssumedRole' AND useridentity.sessioncontext.sessionissuer.arn = 'arn:aws:iam::111122223333:role/Alice'
GROUP by eventSource, eventName
order by NumberEvents desc;

The results of the query are as follows:

Figure 5: Sample query 1.5 and results in the console

Figure 5: Sample query 1.5 and results in the console

You observe that the activity is consistent with what was shown in the first account, and the threat actor seems to be targeting trails, S3 buckets, and IAM activity related to possible further escalation of privileges.

Query 1.6: Confirm successful activity performed by an IAM role during a specific time window

Following the example in query 1.1, you will fetch the information related to activity that was successful or denied. This helps you confirm modifications that took place in the environment, or the creation of new resources. For this example, you will also want to obtain the event ID in case you need to dig further into one specific API call. You will then filter out activity done by any other session by using the temporary access key obtained from query 1.2.

Query 1.6


SELECT eventSource, eventName, eventTime, eventID, errorCode FROM 1994bee2-d4a0-460e-8c07-1b5ee04765d8 WHERE eventTime > '2022-03-15 13:10:00' AND eventTime < '2022-03-16 00:00:00' AND useridentity.type = 'AssumedRole' AND useridentity.sessioncontext.sessionissuer.arn = 'arn:aws:iam::111122223333:role/Alice' AND userIdentity.accessKeyId = 'ASIAZNYXHMZ37EXAMPLE '

The results of the query are as follows:

Figure 6: Sample query 1.6 and results in the console

Figure 6: Sample query 1.6 and results in the console

You can observe that the threat actor was again not able to perform activity upon the trails, S3 buckets, or IAM roles. But as you can see, the threat actor was able to perform specific IAM activity, which led to the creation of a new IAM user, policy attachment, and access key.

Query 1.7: Obtain new access key ID created

By making use of the event ID from the CreateAccesskey event displayed in the previous query, you can obtain the access key ID so that you can further dig into what activity was performed by it.

Query 1.7


SELECT responseElements FROM 1994bee2-d4a0-460e-8c07-1b5ee04765d8 WHERE eventID = 'bd29bab7-1153-4510-9e7f-9ff9bba4bd9a'

The results of the query are as follows:

Figure 7: Sample query 1.7 and results in the console

Figure 7: Sample query 1.7 and results in the console

Query 1.8: Obtain successful API activity that was performed by the access key during a specific time window

Following previous examples, you will count and group the API activity that was successfully performed by this access key ID. This time, you will exclude denied activity in order to understand the activity that actually took place.

Query 1.8


SELECT count (*) as NumberEvents, eventSource, eventName
FROM 1994bee2-d4a0-460e-8c07-1b5ee04765d8 WHERE eventTime > '2022-03-15 13:10:00' AND eventTime < '2022-03-16 00:00:00' AND
userIdentity.accessKeyId = 'AKIAI44QH8DHBEXAMPLE' AND errorcode IS NULL
GROUP by eventSource, eventName
order by NumberEvents desc;

The results of the query are as follows:

Figure 8: Sample query 1.8 and results in the console

Figure 8: Sample query 1.8 and results in the console

You can observe that this time, the threat actor was able to perform specific activities targeting your trails and buckets due to privilege escalation. In these results, you observe that a trail was successfully stopped, and S3 objects were downloaded and deleted.

You can also see bucket deletion activity. At first glance, this might indicate activity related to a data exfiltration scenario in the case where the bucket was not properly secured, and possible future ransom demands could be made if proper preventive controls and measures to recover the data were not in place. For more details on this scenario, see this AWS Security blog post.

Query 1.9: Obtain bucket and object names affected during a specific time window

After you obtain the activities on the S3 buckets by using sample query 1.8, you can use the following query to show what objects this activity was related to, and from which buckets. You can expand the query to exclude denied activity.

Query 1.9


SELECT element_at(requestParameters, 'bucketName') as BucketName, element_at(requestParameters, 'key') as ObjectName, eventName FROM 1994bee2-d4a0-460e-8c07-1b5ee04765d8 WHERE (eventName = 'GetObject' OR eventName = 'DeleteObject') AND eventTime > '2022-03-15 13:10:00' AND eventTime < '2022-03-16 00:00:00' AND userIdentity.accessKeyId = 'AKIAI44QH8DHBEXAMPLE' AND errorcode IS NULL

The results of the query are as follows:

Figure 9: Sample query 1.9 and results in the console

Figure 9: Sample query 1.9 and results in the console

As you can observe, the unauthorized user was able to first obtain and exfiltrate S3 objects, and then delete them afterwards.

Summary of incident scenario 1

This scenario describes a security incident involving a publicly exposed AWS access key that is exploited by a threat actor. Here is a summary of the steps taken to investigate this incident by using CloudTrail Lake capabilities:

  • Investigated AWS activity that was performed by the compromised access key
  • Observed possible adversary tactics and techniques that were used by the threat actor
  • Collected artifacts that could be potential indicators of compromise (IoC), such as IP addresses
  • Confirmed role assumption by the threat actor in a second account
  • Expanded the time window of your investigation and the scope to your entire organization in AWS Organizations; and searched for any activity that might have taken place originating from the IP address related to the unauthorized activity
  • Investigated AWS activity that was performed by the role assumed in the second account
  • Identified new resources that were created by the threat actor, and malicious activity performed by the actor
  • Confirmed the modifications caused by the threat actor and their impact in your environment

Incident scenario 2: AWS IAM Identity Center user credentials compromised

In this second scenario, you start your investigation from a GuardDuty finding stating that an Amazon Elastic Compute Cloud (Amazon EC2) instance is querying an IP address that is associated with cryptocurrency-related activity. There are several sources of logs that you might want to explore when you conduct this investigation, including network, operation system, or application logs, among others. In this example, you will use CloudTrail Lake capabilities to investigate API activity logged in CloudTrail for this security event. To understand what exactly happened and when, you start by querying information from the resource involved, in this case an EC2 instance, and then continue digging into the AWS IAM Identity Center (successor to AWS Single Sign-On) credentials that were used to launch that EC2 instance, to finally confirm what other actions were performed.

Query 2.1: Confirm who has launched the EC2 instance involved in the cryptocurrency-related activity

You can begin by looking at the finding CryptoCurrency:EC2/BitcoinTool.B to get more information related to this event, for example when (timestamp), where (AWS account and AWS Region), and also which resource (EC2 instance ID) was involved with the security incident and when it was launched. With this information, you can perform the first query for this scenario, which will confirm what type of user credentials were used to launch the instance involved.

Query 2.1


SELECT userIdentity.principalid, eventName, eventTime, recipientAccountId, awsRegion FROM 467f2e52-84b9-4d41-8049-bc8f8fad35dd
WHERE responseElements IS NOT NULL AND
element_at(responseElements, 'instancesSet') like '%"instanceId":"i-053a7e6164c0f0473"%' AND
eventTime > '2022-09-13 12:45:59' AND eventName='RunInstances'

The results of the query are as follows:

Figure 10: Sample query 2.1 and results in the console

Figure 10: Sample query 2.1 and results in the console

The results demonstrate that the IAM Identity Center user as principal ID AROASVPO5CIEXAMPLE:[email protected] was used to launch the EC2 instance that was involved in the incident.

Query 2.2: Confirm in which AWS accounts the IAM Identity Center user has federated and authenticated

You want to confirm which AWS accounts this specific IAM Identity Center user has federated and authenticated with, and also which IAM role was assumed. This is important information to make sure that the security event happened only within the affected AWS account. The window of time for this query is based on the maximum value for the permission sets’ session duration in IAM Identity Center.

Query 2.2


SELECT element_at(serviceEventDetails, 'account_id') as AccountID, element_at(serviceEventDetails, 'role_name') as SSORole, eventID, eventTime FROM 467f2e52-84b9-4d41-8049-bc8f8fad35dd WHERE eventSource = 'sso.amazonaws.com' AND eventName = 'Federate' AND userIdentity.username = '[email protected]' AND eventTime > '2022-09-13 00:00:00' AND eventTime < '2022-09-14 00:00:00'

The results of the query are as follows:

Figure 11: Sample query 2.2 and results in the console

Figure 11: Sample query 2.2 and results in the console

The results show that only one AWS account has been accessed during the time of the incident, and only one AWS role named AdministratorAccess has been used.

Query 2.3: Count and group activity based on API actions that were performed by the user in each AWS service

You now know exactly where the user has gained access, so next you can count and group the activity based on the API actions that were performed in each AWS service. This information helps you confirm the types of activity that were performed.

Query 2.3


SELECT eventSource, eventName, COUNT(*) AS apiCount
FROM 467f2e52-84b9-4d41-8049-bc8f8fad35dd
WHERE userIdentity.principalId = 'AROASVPO5CIEXAMPLE:[email protected]'
AND eventTime > '2022-09-13 00:00:00' AND eventTime < '2022-09-14 00:00:00'
GROUP BY eventSource, eventName ORDER BY apiCount DESC

The results of the query are as follows:

Figure 12: Sample query 2.3 and results in the console

Figure 12: Sample query 2.3 and results in the console

You can see that the list of APIs includes the read activities Get, Describe, and List. This activity is commonly associated with the discovery stage, when the unauthorized user is gathering information to determine credential permissions.

Query 2.4: Obtain mutable activity based on API actions performed by the user in each AWS service

To get a better understanding of the mutable actions performed by the user, you can add a new condition to hide the read-only actions by setting the readOnly parameter to false. You will want to focus on mutable actions to know whether there were new AWS resources created or if existing AWS resources were deleted or modified. Also, you can add the possible error code from the response element to the query, which will tell you if the actions were denied.

Query 2.4


SELECT eventSource, eventName, eventTime, eventID, errorCode
FROM 467f2e52-84b9-4d41-8049-bc8f8fad35dd
WHERE userIdentity.principalId = 'AROASVPO5CIEXAMPLE:[email protected]'
AND readOnly = false
AND eventTime > '2022-09-13 00:00:00' AND eventTime < '2022-09-14 00:00:00'

The results of the query are as follows:

Figure 13: Sample query 2.4 and results in the console

Figure 13: Sample query 2.4 and results in the console

You can confirm that some actions, like EC2 RunInstances, EC2 CreateImage, SSM StartSession, IAM CreateUser, and IAM PutRolePolicy were allowed. And in contrast, IAM CreateAccessKey, IAM CreateRole, IAM AttachRolePolicy, and GuardDuty DeleteDetector were denied. The IAM-related denied actions are commonly associated with persistence tactics, where an unauthorized user may try to maintain access to the environment. The GuardDuty denied action is commonly associated with defense evasion tactics, where the unauthorized user is trying to cover their tracks and avoid detection.

Query 2.5: Obtain more information about API action EC2 RunInstances

You can focus first on the API action EC2 RunInstances to understand how many EC2 instances were created by the same user. This information will confirm which other EC2 instances were involved in the security event.

Query 2.5


SELECT awsRegion, recipientAccountId, eventID, element_at(responseElements, 'instancesSet') as instances
FROM 467f2e52-84b9-4d41-8049-bc8f8fad35dd
WHERE userIdentity.principalId = 'AROASVPO5CIEXAMPLE:[email protected]'
AND eventName='RunInstances'
AND eventTime > '2022-09-13 00:00:00' AND eventTime < '2022-09-14 00:00:00'

The results of the query are as follows:

Figure 14: Sample query 2.5 and results in the console

Figure 14: Sample query 2.5 and results in the console

You can confirm that the API was called twice, and if you expand the column InstanceSet in the response element, you will see the exact number of EC2 instances that were launched. Also, you can find that these EC2 instances were launched with an IAM instance profile called ec2-role-ssm-core. By checking in the IAM console, you can confirm that the IAM role associated has only the AWS managed policy AmazonSSMManagedInstanceCore attached, which enables AWS Systems Manager core functionality.

Query 2.6: Get the list of denied API actions performed by the user for each AWS service

Now, you can filter more to focus only on those denied API actions by performing the following query. This is important because it can help you to identify what kind of malicious event was attempted.

Query 2.6


SELECT recipientAccountId, awsRegion, eventSource, eventName, eventID, eventTime
FROM 467f2e52-84b9-4d41-8049-bc8f8fad35dd
WHERE userIdentity.principalId = 'AROASVPO5CIEXAMPLE:[email protected]'
AND errorCode = 'AccessDenied'
AND eventTime > '2022-09-13 00:00:00' AND eventTime < '2022-09-14 00:00:00'

The results of the query are as follows:

Figure 15: Sample query 2.6 and results in the console

Figure 15: Sample query 2.6 and results in the console

You can see that the user has tried to stop GuardDuty by calling DeleteDetector, and has also performed actions within IAM that you should examine more closely to know if new unwanted access to the environment was created.

Query 2.7: Obtain more information about API action IAM CreateUserAccessKeys

With the previous query, you confirmed that more actions were denied within IAM. You can now focus on the failed attempt to create IAM user access keys that could have been used to gain persistent and programmatic access to the AWS account. With the following query, you can make sure that the actions were denied and determine the reason why.

Query 2.7


SELECT recipientAccountId, awsRegion, eventID, eventTime, errorCode, errorMessage
FROM 467f2e52-84b9-4d41-8049-bc8f8fad35dd
WHERE userIdentity.principalId = 'AROASVPO5CIEXAMPLE:[email protected]'
AND eventName='CreateAccessKey'
AND eventTime > '2022-09-13 00:00:00' AND eventTime < '2022-09-14 00:00:00'

The results of the query are as follows:

Figure 16: Sample query 2.7 and results in the console

Figure 16: Sample query 2.7 and results in the console

If you copy the errorMessage element from the response, you can confirm that the action was denied by a service control policy, as shown in the following example.


"errorMessage":"User: arn:aws:sts::111122223333:assumed-role/AWSReservedSSO_AdministratorAccess_f53d10b0f8a756ac/[email protected] is not authorized to perform: iam:CreateAccessKey on resource: user production-user with an explicit deny in a service control policy"

Query 2.8: Obtain more information about API IAM CreateUser

From the query error message in query 2.7, you can confirm the name of the IAM user that was used. Now you can check the allowed API action IAM CreateUser that you observed before to see if the IAM users match. This helps you confirm that there were no other IAM users involved in the security event.

Query 2.8


SELECT recipientAccountId, awsRegion, eventID, eventTime, element_at(responseElements, 'user') as userInfo
FROM 467f2e52-84b9-4d41-8049-bc8f8fad35dd
WHERE userIdentity.principalId = 'AROASVPO5CIEXAMPLE:[email protected]'
AND eventName='CreateUser'
AND eventTime > '2022-09-13 00:00:00' AND eventTime < '2022-09-14 00:00:00'

The results of the query are as follows:

Figure 17: Sample query 2.8 and results in the console

Figure 17: Sample query 2.8 and results in the console

Based on this output, you can confirm that the IAM user is indeed the same. This user was created successfully but was denied the creation of access keys, confirming the failed attempt to get new persistent and programmatic credentials.

Query 2.9: Get more information about the IAM role creation attempt

Now you can figure out what happened with the IAM CreateRole denied action. With the following query, you can see the full error message for the denied action.

Query 2.9


SELECT recipientAccountId, awsRegion, eventID, eventTime, errorCode, errorMessage
FROM 467f2e52-84b9-4d41-8049-bc8f8fad35dd
WHERE userIdentity.principalId = 'AROASVPO5CIEXAMPLE:[email protected]'
AND eventName='CreateRole'
AND eventTime > '2022-09-13 00:00:00' AND eventTime < '2022-09-14 00:00:00'

The results of the query are as follows:

Figure 18: Sample query 2.9 and results in the console

Figure 18: Sample query 2.9 and results in the console

If you copy the output of this query, you will see that the role was denied by a service control policy, as shown in the following example:


"errorMessage":"User: arn:aws:sts::111122223333:assumed-role/AWSReservedSSO_AdministratorAccess_f53d10b0f8a756ac/[email protected] is not authorized to perform: iam:CreateRole on resource: arn:aws:iam::111122223333:role/production-ec2-role with an explicit deny in a service control policy"

Query 2.10: Get more information about IAM role policy changes

With the previous query, you confirmed that the unauthorized user failed to create a new IAM role to replace the existing EC2 instance profile in an attempt to grant more permissions. And with another of the previous queries, you confirmed that the IAM API action AttachRolePolicy was also denied, in another attempt for the same goal, but this time trying to attach a new AWS managed policy directly. However, with this new query, you can confirm that the unauthorized user successfully applied an inline policy to the EC2 role associated with the existing EC2 instance profile, with full admin access.

Query 2.10


SELECT recipientAccountId, eventID, eventTime, element_at(requestParameters, 'roleName') as roleName, element_at(requestParameters, 'policyDocument') as policyDocument FROM 467f2e52-84b9-4d41-8049-bc8f8fad35dd WHERE userIdentity.principalId = 'AROASVPO5CIEXAMPLE:[email protected]' AND eventName = 'PutRolePolicy' AND eventTime > '2022-09-13 00:00:00' AND eventTime < '2022-09-14 00:00:00'

The results of the query are as follows:

Figure 19: Sample query 2.10 and results in the console

Figure 19: Sample query 2.10 and results in the console

Summary of incident scenario 2

This second scenario describes a security incident that involves an IAM Identity Center user that has been compromised. To investigate this incident by using CloudTrail Lake capabilities, you did the following:

  • Started the investigation by looking at metadata from the GuardDuty EC2 finding
  • Confirmed the AWS credentials that were used for the creation of that resource
  • Looked at whether the IAM Identity Center user credentials were used to access other AWS accounts
  • Did further investigation on the AWS APIs that were called by the IAM Identity Center user
  • Obtained the list of denied actions, confirming the unauthorized user’s attempt to get persistent access and cover their tracks
  • Obtained the list of EC2 resources that were successfully created in this security event

Conclusion

In this post, we’ve shown you how to use AWS CloudTrail Lake capabilities to investigate CloudTrail activity in response to security incidents across your organization. We also provided sample queries for two security incident scenarios. You now know how to use the capabilities of CloudTrail Lake to assist you and your security teams during the investigation process in a security incident. Additionally, you can find some of the sample queries related to this post and other topics in the following GitHub repository, and additional examples in the sample queries tab in the CloudTrail console. To learn more, see Working with CloudTrail Lake in the CloudTrail User Guide.

Regarding pricing for CloudTrail Lake, you pay for ingestion and storage together, where the billing is based on the amount of uncompressed data ingested. If you’re a new customer, you can try AWS CloudTrail Lake for a 30-day free trial or when you reach the free usage limits of 5GB of data. For more information, see see AWS CloudTrail pricing.

Finally, in combination with the investigation techniques shown in this post, we also recommend that you explore the use of Amazon Detective, an AWS managed and dedicated service that simplifies the investigative process and helps security teams conduct faster and more effective investigations. With the Amazon Detective prebuilt data aggregations, summaries, and context, you can quickly analyze and determine the nature and extent of possible security issues.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Author

Rodrigo Ferroni

Rodrigo Ferroni is a senior Security Specialist at AWS Enterprise Support. He is certified in CISSP, AWS Security Specialist, and AWS Solutions Architect Associate. He enjoys helping customers to continue adopting AWS security services to improve their security posture in the cloud. Outside of work, he loves to travel as much as he can. In every winter he enjoys snowboarding with his friends.

Eduardo Ortiz Pineda

Eduardo Ortiz Pineda

Eduardo is a Senior Security Specialist at AWS Enterprise Support. He is interested in different security topics, automation, and helping customers to improve their security posture. Outside of work, he spends his free time with family and friends, enjoying sports, reading and traveling.

Connect to Amazon MSK Serverless from your on-premises network

Post Syndicated from Masudur Rahaman Sayem original https://aws.amazon.com/blogs/big-data/connect-to-amazon-msk-serverless-from-your-on-premises-network/

Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed, highly available, and secure Apache Kafka service. Amazon MSK reduces the work needed to set up, scale, and manage Apache Kafka in production. With Amazon MSK, you can create a cluster in minutes and start sending data.

With Amazon MSK Serverless, you can run Apache Kafka without having to manage the underlying infrastructure. Amazon MSK will automatically provision, scale, and manage your Apache Kafka clusters, so you can focus on your applications without worrying about the operational overhead. Additionally, MSK Serverless offers fine-grained, pay-as-you-go pricing, making it a cost-effective option for organizations with unpredictable workloads.

Connecting to MSK Serverless is easy. You can set up a serverless cluster using the API or AWS Management Console in minutes. MSK Serverless provides bootstrap information as a private DNS endpoint, allowing clients to connect to the serverless Apache Kafka cluster. A common use case of using MSK Serverless is an on-premises client that needs to process real-time data streams. However, the private DNS endpoint is only accessible from virtual private clouds (VPCs) that have been configured to connect and isn’t directly resolvable from an on-premises network. This can pose a challenge for on-premises clients to discover and connect to the MSK Serverless cluster.
In this post, we guide you through a step-by-step process to connect your on-premises client to MSK Serverless, overcoming this challenge.

Solution overview

The following diagram illustrates the solution architecture.

The flow of the solution is as follows:

  1. The DNS query for your MSK endpoint is routed to a locally configured on-premises DNS server.
  2. The on-premises DNS as configured performs conditional forwarding for kafka-serverless.REPLACE-MSK-SERVERLESS-REGION.amazonaws.com to an Amazon Route 53 inbound resolver endpoint IP address.
  3. The inbound resolver endpoint performs DNS resolution by forwarding the query to the private hosted zone that was created along with the MSK Serverless cluster.
  4. The IP addresses returned by the DNS query are the private IP addresses of the interface VPC endpoint, which allow your on-premises host to establish private connectivity over AWS VPN or AWS Direct Connect.
  5. The interface endpoint is a collection of one or more elastic network interfaces with a private IP address in your account that serves as an entry point for traffic destined to a MSK Serverless service.

Note that at this time, this solution works only for MSK Serverless clusters with a single VPC.

Prerequisites

In this section, we discuss the prerequisite steps to complete in order to implement this solution.

Establish network connectivity between on premises and the AWS Cloud

To use MSK Serverless from your on-premises network, you need to establish a network connection between your on-premises environment and the VPC that you have set up for MSK Serverless. Various secure methods are available to connect your on-premises network to the AWS Cloud. Refer to Network-to-Amazon VPC connectivity options for more information.

Create a security group for allowing inbound TCP/UDP connections from your on-premises network

Create a security group with the following configurations on the same VPC that you configured for MSK Serverless:

Inbound rule:

  • Source: [On-premises CIDR range]
  • Protocol: TCP/UDP
  • Port Range: 53

Outbound rule: Leave it to default

For more information, refer to Work with security groups.

Update the MSK security group for inbound connections from your on-premises network

To ensure that your MSK Serverless cluster can be accessed from your on-premises network, you need to adjust the cluster’s security group settings to allow incoming traffic from your network on TCP port 9098. Complete the following steps:

  1. On the Amazon MSK console, choose Clusters in the navigation pane.
  2. Navigate to your serverless MSK cluster’s properties.

  1. Choose the security group associated with your MSK cluster.

Because MSK Serverless supports configuring multiple VPCs, make sure to choose the security group associated with the VPC that you configured for connecting from your on-premises network.

  1. To enable connections from your on-premises CIDR block to MSK Serverless, add an inbound rule that allows traffic on TCP port 9098 from your on-premises CIDR.

This ensures that your on-premises network can communicate with MSK Serverless on the specified port.

Configure a Route 53 inbound resolver endpoint

MSK Serverless provides a DNS endpoint that serves as the starting point for an Apache Kafka client to connect to the cluster. However, this endpoint isn’t publicly discoverable and can only be accessed from within the configured VPC. To resolve the serverless DNS endpoint outside of your VPC, you can set up a Route 53 resolver endpoint. This allows you to access the endpoint securely by creating a hybrid cloud setup over VPN or Direct Connect.

To configure the Route 53 resolver using the console, complete the following steps:

  1. On the Route 53 console, under Resolver in the navigation pane, choose Inbound endpoints.
  2. Choose Create inbound endpoint.

  1. For Endpoint name, enter the endpoint name.
  2. For VPC in the Region, choose the VPC where you configured MSK Serverless.
  3. For Security group for this endpoint, choose the security group that you created as a prerequisite for inbound TCP/UDP connections.

The security group of the inbound resolver endpoint should allow traffic from the on-premises DNS Server IP address on TCP/UDP port 53.

In the next step, you add your IP addresses, ensuring that the number of IP addresses matches the number of subnets in your MSK cluster.

  1. Choose the Availability Zones and subnets that are the same as your MSK Serverless network configuration.
  2. Select Use an IP address that is selected automatically.

  1. Choose Create inbound endpoint.

  1. Copy the inbound endpoint IP addresses.

Configure the on-premises DNS server

In this example, we use a Microsoft DNS server. To configure a conditional forwarder, complete the following steps:

  1. Open DNS Manager.
  2. Run the following command in the Run command window:
dnsmgmt.msc
  1. Choose (right-click) Conditional Forwarders under the server of your choosing, then choose New Conditional Forwarder.


In the next step, you enter kafka-serverless.REPLACE-MSK-SERVERLESS-REGION.amazonaws.com, using the IP address of Route 53 inbound resolver endpoints that you created earlier. You can find the MSK endpoint information by accessing the cluster’s client information. To learn more about getting client information, refer to Getting the bootstrap brokers for an Amazon MSK cluster.

  1. For DNS Domain, enter your endpoint name. For example, kafka-serverless.ap-southeast-2.amazonaws.com. Do not enter the entire endpoint name.
  2. Choose OK.

Test the DNS resolution

DNS (Domain Name System) uses TCP/UDP port 53. To test whether you can connect any of the Route 53 inbound endpoints, run the following command from your on-premises client:

telnet Route53-INBOUND-ENDPOINT-IP 53

For example: telnet 10.1.0.133 53

The following is a sample output:

Trying 10.1.0.133...
Connected to 10.1.0.133.
Escape character is '^]'.
Connection closed by foreign host.

Run the following command to check whether you can connect with the MSK Serverless endpoint from your on-premises client. To get the MSK Serverless endpoint information, refer to Create an MSK Serverless cluster.

dig MSK-SERVERLESS-ENDPOINT-REMOVE-PORT-NUMBER +short

For example: dig boot-abcdc9.c3.kafka-serverless.ap-southeast-2.amazonaws.com +short

The following is a sample output:

vpce-0bcb06d53aab34111-vt8yzx2b.vpce-svc-05dc791a527abcd.ap-southeast-2.vpce.amazonaws.com.
10.1.1.185
10.1.0.191

If the DNS resolution fails, check your network connectivity from on premises. For more information about troubleshooting connectivity issues, refer to How do I troubleshoot VPN tunnel connectivity to an Amazon VPC or Troubleshooting AWS Direct Connect.

After you create a serverless MSK cluster, the service automatically creates an interface VPC endpoint for the cluster. You can use the dig command as shown above to retrieve the VPC endpoint ID and its associated IP address, which confirms that you are now able to connect to the MSK Serverless cluster from your on-premises environment.

Test your Kafka client

Once you complete the configuration of the Route 53 inbound resolver endpoint and on-premises DNS server, you can test your Kafka client from an on-premises network. For instructions, refer to Create a client machine. This documentation guides you through the necessary steps to set up your client machine and verify that it can successfully connect to your MSK cluster from your on-premises network.

Conclusion

MSK Serverless makes it easy for you to manage your data. You don’t have to worry about setting up and running your own Kafka cluster, which saves time and effort. In this post, we explored the option of on-premises connectivity with MSK Serverless and how it can greatly benefit organizations. By establishing this connection, you can gain access to a wide range of real-time analytics use case possibilities and unlock the full potential of your data.

We encourage you to try on-premises connectivity with MSK serverless.


About the Authors

Masudur Rahaman Sayem is a Streaming Data Architect at AWS. He works with AWS customers globally to design and build data streaming architectures to solve real-world business problems. He specializes in optimizing solutions that use streaming data services and NoSQL. Sayem is very passionate about distributed computing.

Akeef Khan is a Solutions Architect at Amazon Web Services. He helps SMB Greenfield customers adopt the cloud. Whilst being a generalist SA, Akeef is passionate about networking.

Patterns for updating Amazon OpenSearch Service index settings and mappings

Post Syndicated from Mikhail Vaynshteyn original https://aws.amazon.com/blogs/big-data/patterns-for-updating-amazon-opensearch-service-index-settings-and-mappings/

Amazon OpenSearch Service is used for a broad set of use cases like real-time application monitoring, log analytics, and website search at scale. As your domain ages and you add additional consumers, you need to reevaluate and change the domain’s configuration to handle additional storage and compute needs. You want to minimize downtime and performance impact as you make these changes.

Customers have been seeking guidance on best practices and patterns for changing index settings without an index maintenance window or affecting overall performance of the OpenSearch Service domain. This is part one of a two-part series, in which we show how to make settings changes to OpenSearch Service indexes with little to no downtime while supporting active producers and consumers of the data.

Indexes in OpenSearch Service

In OpenSearch Service, data must be indexed before it can be queried. Indexing is the method by which search engines organize data for fast retrieval. The resulting structure is called, fittingly, an index. All operations performed on an index are done via index APIs. Also, each index contains index mappings, which define field names and data types in the index. Data producers can add new fields with data types to an index. Index mappings can’t change throughout the index lifecycle.

OpenSearch Service indexes have two types of settings that periodically need adjustments as the profile of your workload changes:

  • Dynamic – Settings that can be changed on the index at any time
  • Static – Settings that can only be defined at the index creation time and can’t be changed throughout the index lifecycle

Dynamic index settings can be changed at any time using the update settings API. While the OpenSearch Service domain is performing instructed operations on dynamic index settings, the index doesn’t require a downtime. Changes to most dynamic index settings won’t trigger background tasks that affect the overall utilization of domain resources; however, some settings such as increasing the number of replicas via index.number_of_replicas or index.auto_expand_replicas, and depending on the domain’s configuration, can cause a temporary increase in resource utilization while the domain adds replicas. We recommend maintaining at least one replica for redundancy reasons, and multiple replicas for high query throughput use cases.

Static index settings such as mapping and shard count are defined at index creation time and can’t be changed throughout the index lifecycle. In this post, we focus on patterns and best practices for working with static index settings, such as changing shard count and patterns for updating index mappings.

All operations and procedures that we cover in this post are issued directly to the OpenSearch REST API or via the Dev Tools in OpenSearch Dashboards.

As with any use case, there is a spectrum of solutions and constraints to be considered. We start with a few simple foundational patterns and build on them, accounting for use cases with more operational constraints and working with large datasets.

Solution overview

OpenSearch Service has a default sharding strategy of 5:1, where each index is divided into five primary shards. Within each index, each primary shard also has its own replica. OpenSearch Service automatically assigns primary shards and replica shards to separate data nodes.

It’s not possible to increase the primary shard number of an existing index, meaning an index must be recreated if you want to increase the primary shard count.

The _reindex operation is ideal for creating destination indexes with updated shards and mapping settings. The _reindex operation is resource intensive. We recommend disabling replicas in your destination index by setting number_of_replicas to 0 and re-enable replicas when the reindex process is complete. If you have your data in a second, durable store, the simplest thing to do is pause updates and reindex from the source. But that’s not always possible. In this post, we share several patterns that enable you to update even static index settings like shard count.

One the major advantages of using the _reindex operation is that it doesn’t require placing the source index in a read-only mode (data producers may continue to write the data while reindexing is in progress). Also, the _reindex operation enables reprocessing, transformation, and reindexing a subset of documents and even selectively combining documents from multiple indexes. With the _reindex operation, you can copy all or a subset of documents that you select through a query to another index. In its most basic form, the _reindex operation requires you to specify a source and a destination index and configuration parameters.

The following are the some of the use cases supported by the reindex API:

  • Reindexing all documents
  • Reindexing from a remote cluster when transferring data between clusters
  • Reindexing a subset of documents that match a search query
  • Combining one or more indexes
  • Transforming documents during reindexing

To increase the shard count, you create a new index, set number_of_shards to your desired primary shard count, set number_of_replicas to 0, update the new index mapping based on your requirement, and run the reindex API operation. After the _reindex operation is complete, we recommend updating number_of_replicas in the destination index settings to achieve your desired level of replica shards.

In the following sections, we provide a walkthrough of the reindex API operation. Note that the patterns and procedures presented in this post have been validated on Amazon OpenSearch Service version 1.3.

Prerequisites

The source of the documents must be stored in the index (the “_source” setting at the index mappings level must be set to “enabled”:true, which is the default). The _reindex operation can’t be used without source documents.

Create the destination index with your desired mapping (field or data type). For demonstration purposes, our source index has a field ratings defined as long, and we want the same field to use the float data type in the destination index:

GET /source_index_name/mappings
{  
  "source_index_name": {
    "mappings" : {
      "properties" : {
        "ratings " : {
          "type" : "long"
        },
…
      }
    }
  }
}

PUT /destination_index_name
{
  "settings": {
    "index": {
      "number_of_shards": <DESIRED_NUMBER_OF_PRIMARY_SHARDS>,
      "number_of_replicas": 0
    }
  },
  "mappings": {
    "properties" : {
      "ratings" : {
          "type" : "float"
        },
…
    }
  }
}

Ensure that you have sufficient disk space on each hot tier data node to house the new index primary shards and, depending on your configuration, replica shards. If disk space is insufficient, perform an update operation on the OpenSearch Service domain to add the required storage capacity. Depending on storage requirements, you may need to migrate the OpenSearch Service domain to a different instance type, because nodes have constraints on the EBS volume size that can be mounted to each instance type. Issue the following operation to validate available disk space:

GET _cat/allocation?v

The following screenshot shows the output.

Check the disk.avail metric for hot storage tier nodes to validate your available disk space.

Use the reindex API operation

The _reindex operation snapshots the index at the beginning of its run and performs processing on a snapshot to minimize impact on the source index. The source index can still be used for querying and processing the data. Although the _reindex operation can run both synchronously and asynchronously, we recommend using an asynchronous run. You can monitor the progress of the _reindex operation, cancel its run, or throttle its run using the _task, _cancel, and _rethrottle operations, respectively.

Because the _reindex operation doesn’t require the source index placed in a read-only mode, query and index update operations are free to continue.

Use the reindex API with the following command:

POST _reindex?wait_for_completion=false
{
  "source": {
	"index": "source_index_name"
  },
  "dest": {
	"index": "destination_index_name",
	"op_type" : "index"
  }
}

The source indexes as part of the _reindex API operation can be supplemented with a query for reindexing a subset of documents and storing them in the destination index. Progress of the re-indexing operation can be monitored via tasks API operation:

GET _tasks

Note that the _reindex operation can be throttled via a _rethrottle API or settings passed as a parameter. You can cancel the task with the _cancel operation:

POST _tasks/TASK_ID/_cancel

The following screenshot shows the output of the _reindex operation for reindexing from source_index_name to destination_index_name.

When the operation is complete, both consumers and producers of the source indexes or aliases need to re-point to the destination index and the same _reindex operation needs to run again to catch up on any create, update, or delete operations performed on the source indexes while the initial _reindex operation was running. This step is required because the _reindex operation is running on a snapshot of the index. At this time, the _reindex operation needs to run with “op_type”:”create” to realign missing and out-of-version documents. See the following API command:

POST _reindex?wait_for_completion=false
{
"conflicts":"proceed",
  "source": {
	"index": "source_index_name"
  },
  "dest": {
	"index": "destination_index_name",
	"op_type" : "create"
  }
}

After the operation is complete and data integrity in the destination index is confirmed, you can delete the source index to reclaim disk space.

Increase index shard count using the split index API

The split index API and shrink index API cover a large array of use cases and present with low resource utilization in the domain. However, these APIs require closing the index for write operations and don’t address use cases that require changes to the mapping settings.

In OpenSearch Service, the number_of_shards index setting is immutable and defined at the time when the index is created. However, although this setting is immutable, there are several patterns to increase or decrease index shard count without needing to explicitly reindex the data. You can alternatively use the split index API to increase index shard count in the environments that can suspend write operations. The split index API provides a simplified way of creating a new index with a different shard setting and without reindexing your data. The split index API operation creates a new index based off of a read-only index with a desired number of primary shards.

In OpenSearch Service, an index alias is a virtual index name that can point to one or more indexes. Referencing to indexes using aliases in your applications allows you to avoid index name changes. Index aliases are used to point consumers and producers to a new index after the split index API operation is complete.

Although the majority of use cases focus on increasing a number of shards on an existing index due to data growth, there are also instances where you need to reduce the number of shards on an existing index. Such cases occasionally happen when an actual index size is less than what was anticipated when the index was created, and you want to align with a shard strategy for operational best practices for OpenSearch Service. In cases where you need to reduce a number of shards on an index, you can use the shrink index API to achieve this task.

Conclusion

In this post, we reviewed best practices when reindexing data for making changes in OpenSearch Service static index settings and mappings that require little or no index downtime. We also covered use of the split index and shrink index APIs for changing the primary index shard count for use cases where the index can be placed in a read-only state.

In our next post, we’ll explore patterns for remote indexing to alleviate load and resource utilization on the source OpenSearch Service domain.


About the Authors

Mikhail Vaynshteyn is a Solutions Architect with Amazon Web Services. Mikhail works with healthcare and life sciences customers to build solutions that help improve patients’ outcomes. Mikhail specializes in data analytics services.

Sukhomoy Basak is a Solutions Architect at Amazon Web Services, with a passion for data and analytics solutions. Sukhomoy works with enterprise customers to help them architect, build, and scale applications to achieve their business outcomes.

Optimizing fleet utilization with Amazon Location Service and HERE Technologies

Post Syndicated from Mahesh Geeniga original https://aws.amazon.com/blogs/architecture/optimizing-fleet-utilization-with-amazon-location-service-and-here-technologies/

The fleet management market is expected to grow at a Compound Annual Growth Rate (CAGR) of 15.5 percent—from 25.5 billion US dollars in 2022 to USD 52.4 billion in 2027. Optimizing how your organization uses its vehicle fleet is important for logistics and service providers such as last mile, middle mile, and field services.

In this post, we demonstrate how to build and run a solution for many-to-many vehicle routing using HERE Tour Planning and Amazon Location Service. HERE Technologies is a data provider for Amazon Location Service and provides it with map rendering, geocoding, search, and routing. HERE Tour Planning expands on functionality such as geocoding, basic routing, and matrix routing to consider parameters such as time windows, job requirements or priorities, vehicle capabilities, range, and traffic information. They also support immediate re-planning when conditions change.

The architecture described in this post can help you optimize your fleets for delivering shipments, such as perishable items on pallets, from a central distribution center to multiple retail locations. The architecture uses the AWS Cloud Development Kit (AWS CDK) to help you provision and version control your infrastructure. The architecture also uses Event Driven Architecture (EDA) based on AWS Lambda and Amazon DynamoDB. It uses Amazon Simple Storage Service (Amazon S3) and DynamoDB to store the artifacts generated and integrates with Amazon Location Service to plot the routes visually on a map for each delivery driver.

Solution overview

This post will help you:

  • Configure the many-to-many vehicle routing architecture using HERE Tour Planning and Amazon Location Service
  • Submit a HERE Tour Planning problem
  • Generate an optimized solution file
  • Run a React app that:
    • Generates a list of routes for each vehicle in the fleet
    • Allows drivers to select and view routes in detail

The following diagram outlines how the architecture works.

Many-to-many vehicle routing architecture

Figure 1. Many-to-many vehicle routing architecture using HERE Tour Planning and Amazon Location Service

Let’s explore the steps in the diagram.

  1. The fleet operator uploads the tour requirements file to an Amazon S3 bucket.
  2. The upload invokes a Lambda function to process the new Tour Request. If the HERE API key is present, it calls the HERE Tour Planning API.
  3. The HERE Tour Planning API calculates the solution to the routing problem.
  4. The driver uses the React app to select a vehicle, which requests a route.
  5. The invoked Lambda function uses Amazon Location Service to calculate the route and render it in the React app.

Prerequisites

This walkthrough requires the following installations and resources:

  • Have an AWS account
  • Install the AWS Amplify CLI (command-line interface)
  • Install the AWS CDK CLI
  • Install the AWS CLI
    • Configure and authenticate the AWS CLI to interact with your AWS account.
  • Have your preferred integrated development environment (IDE), such as Visual Studio Code
  • Have GitHub repository access
    • git clone https://github.com/aws-samples/aws-here-optimize-fleet-utilization
  • Have a HERE API Key (optional)
    • This is needed to invoke the HERE Tour Planning API for generating solutions to new routing problems.
    • The GitHub sample repository includes a problem and pre-solved solution file, so you don’t need to acquire a HERE API key to learn about these offerings.
    • To acquire an API key, create a free account on the HERE Platform, then follow the instructions in the HERE Tour Planning documentation to create your API key.
    • There can be additional charges based on API key use. For more details, see the HERE Tour Planning section within HERE service rates.

Walkthrough

Provision the infrastructure

  1. The architecture uses NPM node modules. Run the following commands to install the dependencies:
    • # AWS Lambda function dependencies
      cd lib/lambda/calculate-route
      npm install
    • # Sample Frontend application dependencies
      cd frontend/here-driver-app
      npm install
  2. If you have never used AWS CDK in your AWS account, you must first bootstrap the solution, which creates Amazon S3 buckets and metadata to support AWS CDK operations. Note: Architectural components described in this article are covered under AWS Free Tier and HERE Free monthly usage, but additional charges can occur based on usage beyond the Free Tier limits. We recommend following the Cleanup instructions after completing the walkthrough.
      • From the root of the repository, run the following command to generate the needed infrastructure:
        • cdk bootstrap
      • Output similar to the following indicates that you successfully bootstrapped the AWS account with what AWS CDK requires.

        Successful bootstrap of AWS account with AWS CDK requirements

        Figure 2. Successful bootstrap of AWS account with AWS CDK requirements

  3. Deploy the infrastructure for this solution by running the following command. This provisions much of the infrastructure required for this solution such as DynamoDB, Lambda functions, and Amazon S3 buckets:
    • cdk deploy
  4. Next, Amplify provisions the remaining resources to complete the architecture. Navigate to the folder for the frontend by running the following command:
    • cd frontend/here-driver-app
  5. Run the following series of Amplify commands to create the remaining resources, Amazon API Gateway, Amazon Cognito, and Amazon Location Service. For additional details, see Clone sample Amplify project.
  6. To accept the defaults, run the following command:
    • amplify init
  7. To push the infrastructure out to the AWS account, run the following command:
    • amplify push
  8. To publish the environment, run the following command:
    • amplify publish

Create problem and generate architecture files

The next step is to create the HERE Tour Planning problem file and submit it to the HERE Tour Planning API to solve. Note: You can either sign up for the HERE Developer program to receive an API key to test this solution live or use the problem and pre-solved solution provided in the /data folder of the repository.

  1. Open the Amazon S3 bucket that the previous step created.
  2. Upload a problem file (in JSON format) to the bucket.
  3. An Amazon S3 event notification invokes a Lambda function that performs a synchronous call to the HERE Tour Planning API and generates vehicle routing problem solution file in JSON format.
  4. The Lambda function saves the solution file to the Amazon S3 bucket and additional details about the solution to a DynamoDB table.
  5. The delivery drivers can use the example React app to view the list of vehicles and the routes.
Creating a problem file and submitting it to HERE Tour Planning API

Figure 3. Creating a problem file and submitting it to HERE Tour Planning API

Frontend

The next step is to run the React Frontend app to see the results. Access to the app and the API Gateway is secured with Amazon Cognito.

  1. To run the web application, run the following command:
    • npm start
  2. A local web server runs on http://localhost:3000.
  3. To use the system, the user must authenticate. Amazon Cognito allows users to sign in or create a new account.
  4. After you authenticate, the Home screen displays a list of available vehicles and routes.

    Available vehicles and routes

    Figure 4. Available vehicles and routes

Choose a vehicle to see the detail of the route. Each red marker is a stop.

Vehicle route details

Figure 5. Vehicle route details

Cleanup

To avoid incurring future charges, delete all resources created.

  1. Run the following commands to delete the application in AWS Amplify. As an alternative, you can use the Amplify console to delete Amplify resources:
    • cd frontend/here-driver-app
      amplify pull
      amplify delete
  2. With AWS CDK, run the following command to delete the AWS CloudFormation stack that was used to provision the resources. Note: You can leave the AWS CLI, Amplify CLI, and CDK CLI installed on your computer for future development:
    • cdk destroy

Conclusion

In this post, using shipment delivery from a central distribution center use case, we’ve demonstrated how you can build your own serverless solution for optimizing middle and last mile operations. The solution uses multi-vehicle and multi-stop optimization services provided by HERE Tour Planning and Amazon Location Service to visualize the generated routes for each delivery vehicle driver. For additional details about HERE’s offerings on AWS Marketplace, see AWS Marketplace: HERE Technologies.

Reduce triage time for security investigations with Amazon Detective visualizations and export data

Post Syndicated from Alex Waddell original https://aws.amazon.com/blogs/security/reduce-triage-time-for-security-investigations-with-detective-visualizations-and-export-data/

To respond to emerging threats, you will often need to sort through large datasets rapidly to prioritize security findings. Amazon Detective recently released two new features to help you do this. New visualizations in Detective show the connections between entities related to multiple Amazon GuardDuty findings, and a new export data feature helps you use the data from Detective in your other tools and automated workflows.

By using these new features, you can quickly analyze, correlate, and visualize the large amounts of data generated by sources like Amazon Virtual Private Cloud (Amazon VPC) Flow Logs, AWS CloudTrail, Amazon GuardDuty findings, and Amazon Elastic Kubernetes Service (Amazon EKS) audit logs.

In this post, we’ll show you how you can use these new features to help reduce the time it takes to assess, investigate, and prioritize a security incident.

A security finding is raised

The workflow starts with GuardDuty. GuardDuty continuously monitors AWS accounts, Amazon Elastic Compute Cloud (Amazon EC2) instances, EKS clusters, and data stored in Amazon Simple Storage Service (Amazon S3) for malicious activity without the use of security software or agents.

If GuardDuty detects potential malicious activity, such as anomalous behavior, credential exfiltration, or command and control (C2) infrastructure communication, it generates detailed security incidents called findings.

Depending on the severity and complexity of the GuardDuty finding, the resolution might require deep investigation. Consider an example that involves cryptocurrency mining. If you frequently see a cryptocurrency finding on your EC2 instances, you might have a recurring malware issue that has enabled a backdoor. If a threat actor is attempting to compromise your AWS environment, they typically perform a sequence of actions that lead to multiple findings and unusual behavior. When security findings are investigated in isolation, it can lead to a misinterpretation of their significance and difficulty in finding the root cause. When you need more context around a finding, Detective can help.

Detective automatically collects log data and events from sources like CloudTrail logs, Amazon VPC Flow Logs, GuardDuty findings, and Amazon EKS audit logs and maintains up to a year of aggregated data for analysis. Detective uses machine learning to create a behavioral graph for these data sources that helps show how security issues have evolved. It highlights what AWS resources might be compromised and flags unusual activity like new API calls, new user agents, and new AWS Regions.

The search capabilities work across AWS workloads, providing the information required to show the potential impact of an incident. Detective helps you answer questions like: How did this security incident happen? What AWS resources were affected? How can we prevent this from happening again?

Finding groups help connect the dots of an incident

You can use finding groups, a recent feature of Detective, to help with your investigations. A finding group is a collection of entities related to a single potential security incident that should be investigated together. An entity can be an AWS resource like an EC2 instance, IAM role, or GuardDuty finding, but it can also be an IP address or user agent. For a full list of entities collected, see Searching for a finding or entity in the Detective User Guide.

Grouping these entities together helps provide context and a more complete understanding of the threat landscape. This makes it simpler for you to identify relationships between different events and to assess the overall impact of a potential threat.

In the cryptocurrency mining example described previously, finding groups could show the relationship between the cryptocurrency mining finding and a C2 finding so that you know the two are related and the AWS resources affected. To learn more about working with Detective finding groups, see How to improve security incident investigations using Amazon Detective finding groups.

Figure 1 shows the finding groups overview page on the Detective console, with a list of finding groups filtered by status. The dashboard also shows the severity, title, observed tactics, accounts, entities, and the total number of findings for each finding group. For more information about the attributes of finding groups, see Analyzing finding groups.

Figure 1: Finding groups overview

Figure 1: Finding groups overview

To see details about the finding group, select the title of the finding group to access the details page, which includes Details, Visualization, Involved entities, and Involved findings. On this panel, you can view entities and findings included in a finding group and interact with them. The information presented is the same in the Visualization panel, the Involved entities panel, and the Involved findings panel. The different views allow you to view the information in the way that is helpful for you. Figure 2 shows an example of the Details and Visualization for a specific finding group.

Figure 2: Details and Visualization

Figure 2: Details and Visualization

Note: Finding groups with over 100 nodes (findings and entities) do not include a graph visualization.

Visualizations to show you the situation

The new visualizations in Detective provide three layouts that display the same information from finding groups, but allow you to choose and arrange the different entities so that you can focus on the highest priority finding or resources.

To determine what each visual element represents, choose the Legend in the bottom left corner of the panel. You can change the placement of findings in the Visualization panel by selecting a different layout from the Select layout dropdown menu. Figure 2 in the preceding section shows the Force-directed layout, where the positioning of entities and findings presents an even distribution of links with minimal overlap, while maintaining consistent distance between items.

Figure 3 shows the Visualization panel with the Circle layout, where nodes are displayed in a circular layout. You can use the Legend to understand the different categories of Findings, Compute, Network, Identity, Storage, or Other.

If you’re unfamiliar with these terms, see Amazon Detective terms and concepts to learn more.

Figure 3: Visualization panel with Circle layout and Legend

Figure 3: Visualization panel with Circle layout and Legend

Figure 4 shows the Visualization panel with the Grid layout, where nodes are divided into four different columns: evidence, identity entities, GuardDuty findings, and other entities (compute, network, and storage).

Figure 4: Visualization panel with Grid layout

Figure 4: Visualization panel with Grid layout

In the Visualization panel, you can select one or more (using ctrl/cmd+click) nodes. Selected nodes are listed next to the graph, and you can select each node’s title for more information. Selecting an entity’s title opens a new page that displays detailed information about that entity, whereas selecting a finding or evidence expands the right sidebar to show details on the selected finding or evidence.

You can rearrange chosen entities and findings as needed to help improve your understanding of their connections. This can help speed up your assessment of findings. Figure 5 shows the Visualization panel with four nodes selected and the sidebar displaying information relevant to the selected finding.

Figure 5: Visualization panel with evidence selected

Figure 5: Visualization panel with evidence selected

Finding groups and visualizations provide an overview of the entities and resources related to a security activity. Presenting the information in this way highlights the interconnections between various activities. This means that you no longer have to use multiple tools or query different services to collect information or investigate entities and resources. This can help you reduce triage and scoping times and make your investigations faster and more comprehensive.

Increased flexibility for investigation with simpler data access

To expand the scope of your investigation or confirm if a security incident has taken place, you might want to combine data from Detective with your own tools or different services. This is where export data comes into play.

Detective has several Summary page panels that you can use as a starting point for your investigations because they highlight potentially suspicious activity. The panels include roles and users with the most API call volume, EC2 instances with the most traffic volume, and EKS clusters with the most Kubernetes pods created.

With export data, you can now export these panels as common-separated values (CSV) files and import the data into other AWS services or third-party applications, or manipulate the data with spreadsheet programs.

Export from the Detective console Summary screen

In the Detective console, on the Summary page, you will see an Export option on several summary panes. This is enabled and available for anyone with access to Detective. Figure 6 shows summary information for the roles and users with the most API call volume.

Figure 6: Detective console Summary screen

Figure 6: Detective console Summary screen

Choose Export to download a CSV file containing the data for the summary information. The file is downloaded to your browser’s default download folder on your local device. When you view the data, it will look something like the spreadsheet in Figure 7:

Figure 7: Example CSV data export from Detective console Summary

Figure 7: Example CSV data export from Detective console Summary

Export from the Detective console Search screen

You can also export data using the Search capability of Detective. After you apply specific filters to search for findings or entities based on your use case, an Export button appears at the top of the search results in Detective. Figure 8 shows an example of filtering for a particular CIDR range. Choose Export to download the CSV file containing the filtered data.

Figure 8: Detective search results filtering for a CIDR range

Figure 8: Detective search results filtering for a CIDR range

Conclusion

In this blog post, you learned how to use the two new features of Detective to visualize findings and export data. By using these new features, you and your teams can investigate an incident in the way that best fits your workflow. New visualizations show the entities involved in an issue and surface nuanced connections that can be difficult to find when you’re faced with line after line of log data. The new data export feature makes it simpler to integrate the insights discovered in Detective with the tools and automations that your team is already using.

These features are automatically enabled for both existing and new customers in AWS Regions that support Detective. There is no additional charge for finding groups. If you don’t currently use Detective, you can start a free 30-day trial. For more information on finding groups, see Analyzing finding groups in the Amazon Detective User Guide.

If you have feedback about this post, submit comments in the Comments section below. You can also start a new thread on Amazon Detective re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Alex Waddell

Alex Waddell

Alex is a Senior Security Specialist Solutions Architect at AWS based in Scotland. Alex provides security architectural guidance and operational best practices to customers of all sizes, helping them to implement AWS security services in the best way possible to keep their AWS workloads secure. Alex has been working in the security domain since 1999 and joined AWS in 2021. When he is not working, Alex enjoys spending time sampling rum from around the world (Alex is one of those rarely found Scottish people that doesn’t like whisky), walking his dogs in the local forest trails and traveling.

Nima Fotouhi

Nima Fotouhi

Nima is a Security Consultant. He’s a builder with passion for infrastructure as code (IaC) and policy as code (PaC), helping customer building secure infrastructure on AWS. In my spare time I love to hit the slopes and go snowboarding.

Rich Vorwaller

Rich Vorwaller

Rich is a Principal Product Manager of Amazon Detective. He boomerang to AWS with a passion for walking backwards from customer security problems. AWS is a great place for innovation and Rich is excited to dive deep on how customers are using AWS to strengthen their security posture in the cloud. In his spare time, Rich loves to read, travel, and perform a little bit of amateur stand up comedy.

Implement column-level encryption to protect sensitive data in Amazon Redshift with AWS Glue and AWS Lambda user-defined functions

Post Syndicated from Aaron Chong original https://aws.amazon.com/blogs/big-data/implement-column-level-encryption-to-protect-sensitive-data-in-amazon-redshift-with-aws-glue-and-aws-lambda-user-defined-functions/

Amazon Redshift is a massively parallel processing (MPP), fully managed petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using existing business intelligence tools.

When businesses are modernizing their data warehousing solutions to Amazon Redshift, implementing additional data protection mechanisms for sensitive data, such as personally identifiable information (PII) or protected health information (PHI), is a common requirement, especially for those in highly regulated industries with strict data security and privacy mandates. Amazon Redshift provides role-based access control, row-level security, column-level security, and dynamic data masking, along with other database security features to enable organizations to enforce fine-grained data security.

Security-sensitive applications often require column-level (or field-level) encryption to enforce fine-grained protection of sensitive data on top of the default server-side encryption (namely data encryption at rest). In other words, sensitive data should be always encrypted on disk and remain encrypted in memory, until users with proper permissions request to decrypt the data. Column-level encryption provides an additional layer of security to protect your sensitive data throughout system processing so that only certain users or applications can access it. This encryption ensures that only authorized principals that need the data, and have the required credentials to decrypt it, are able to do so.

In this post, we demonstrate how you can implement your own column-level encryption mechanism in Amazon Redshift using AWS Glue to encrypt sensitive data before loading data into Amazon Redshift, and using AWS Lambda as a user-defined function (UDF) in Amazon Redshift to decrypt the data using standard SQL statements. Lambda UDFs can be written in any of the programming languages supported by Lambda, such as Java, Go, PowerShell, Node.js, C#, Python, Ruby, or a custom runtime. You can use Lambda UDFs in any SQL statement such as SELECT, UPDATE, INSERT, or DELETE, and in any clause of the SQL statements where scalar functions are allowed.

Solution overview

The following diagram describes the solution architecture.

Architecture Diagram

To illustrate how to set up this architecture, we walk you through the following steps:

  1. We upload a sample data file containing synthetic PII data to an Amazon Simple Storage Service (Amazon S3) bucket.
  2. A sample 256-bit data encryption key is generated and securely stored using AWS Secrets Manager.
  3. An AWS Glue job reads the data file from the S3 bucket, retrieves the data encryption key from Secrets Manager, performs data encryption for the PII columns, and loads the processed dataset into an Amazon Redshift table.
  4. We create a Lambda function to reference the same data encryption key from Secrets Manager, and implement data decryption logic for the received payload data.
  5. The Lambda function is registered as a Lambda UDF with a proper AWS Identity and Access Management (IAM) role that the Amazon Redshift cluster is authorized to assume.
  6. We can validate the data decryption functionality by issuing sample queries using Amazon Redshift Query Editor v2.0. You may optionally choose to test it with your own SQL client or business intelligence tools.

Prerequisites

To deploy the solution, make sure to complete the following prerequisites:

  • Have an AWS account. For this post, you configure the required AWS resources using AWS CloudFormation in the us-east-2 Region.
  • Have an IAM user with permissions to manage AWS resources including Amazon S3, AWS Glue, Amazon Redshift, Secrets Manager, Lambda, and AWS Cloud9.

Deploy the solution using AWS CloudFormation

Provision the required AWS resources using a CloudFormation template by completing the following steps:

  1. Sign in to your AWS account.
  2. Choose Launch Stack:
    Launch Button
  3. Navigate to an AWS Region (for example, us-east-2).
  4. For Stack name, enter a name for the stack or leave as default (aws-blog-redshift-column-level-encryption).
  5. For RedshiftMasterUsername, enter a user name for the admin user account of the Amazon Redshift cluster or leave as default (master).
  6. For RedshiftMasterUserPassword, enter a strong password for the admin user account of the Amazon Redshift cluster.
  7. Select I acknowledge that AWS CloudFormation might create IAM resources.
  8. Choose Create stack.
    Create CloudFormation stack

The CloudFormation stack creation process takes around 5–10 minutes to complete.

  1. When the stack creation is complete, on the stack Outputs tab, record the values of the following:
    1. AWSCloud9IDE
    2. AmazonS3BucketForDataUpload
    3. IAMRoleForRedshiftLambdaUDF
    4. LambdaFunctionName

CloudFormation stack output

Upload the sample data file to Amazon S3

To test the column-level encryption capability, you can download the sample synthetic data generated by Mockaroo. The sample dataset contains synthetic PII and sensitive fields such as phone number, email address, and credit card number. In this post, we demonstrate how to encrypt the credit card number field, but you can apply the same method to other PII fields according to your own requirements.

Sample synthetic data

An AWS Cloud9 instance is provisioned for you during the CloudFormation stack setup. You may access the instance from the AWS Cloud9 console, or by visiting the URL obtained from the CloudFormation stack output with the key AWSCloud9IDE.

CloudFormation stack output for AWSCloud9IDE

On the AWS Cloud9 terminal, copy the sample dataset to your S3 bucket by running the following command:

S3_BUCKET=$(aws s3 ls| awk '{print $3}'| grep awsblog-pii-data-input-)
aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-2274/pii-sample-dataset.csv s3://$S3_BUCKET/

Upload sample dataset to S3

Generate a secret and secure it using Secrets Manager

We generate a 256-bit secret to be used as the data encryption key. Complete the following steps:

  1. Create a new file in the AWS Cloud9 environment.
    Create new file in Cloud9
  2. Enter the following code snippet. We use the cryptography package to create a secret, and use the AWS SDK for Python (Boto3) to securely store the secret value with Secrets Manager:
    from cryptography.fernet import Fernet
    import boto3
    import base64
    
    key = Fernet.generate_key()
    client = boto3.client('secretsmanager')
    
    response = client.create_secret(
        Name='data-encryption-key',
        SecretBinary=base64.urlsafe_b64decode(key)
    )
    
    print(response['ARN'])

  3. Save the file with the file name generate_secret.py (or any desired name ending with .py).
    Save file in Cloud9
  4. Install the required packages by running the following pip install command in the terminal:
    pip install --user boto3
    pip install --user cryptography

  5. Run the Python script via the following command to generate the secret:
    python generate_secret.py

    Run Python script

Create a target table in Amazon Redshift

A single-node Amazon Redshift cluster is provisioned for you during the CloudFormation stack setup. To create the target table for storing the dataset with encrypted PII columns, complete the following steps:

  1. On the Amazon Redshift console, navigate to the list of provisioned clusters, and choose your cluster.
    Amazon Redshift console
  2. To connect to the cluster, on the Query data drop-down menu, choose Query in query editor v2.
    Connect with Query Editor v2
  3. If this is the first time you’re using the Amazon Redshift Query Editor V2, accept the default setting by choosing Configure account.
    Configure account
  4. To connect to the cluster, choose the cluster name.
    Connect to Amazon Redshift cluster
  5. For Database, enter demodb.
  6. For User name, enter master.
  7. For Password, enter your password.

You may need to change the user name and password according to your CloudFormation settings.

  1. Choose Create connection.
    Create Amazon Redshift connection
  2. In the query editor, run the following DDL command to create a table named pii_table:
    CREATE TABLE pii_table(
      id BIGINT,
      full_name VARCHAR(50),
      gender VARCHAR(10),
      job_title VARCHAR(50),
      spoken_language VARCHAR(50),
      contact_phone_number VARCHAR(20),
      email_address VARCHAR(50),
      registered_credit_card VARCHAR(50)
    );

We recommend using the smallest possible column size as a best practice, and you may need to modify these table definitions per your specific use case. Creating columns much larger than necessary will have an impact on the size of data tables and affect query performance.

Create Amazon Redshift table

Create the source and destination Data Catalog tables in AWS Glue

The CloudFormation stack provisioned two AWS Glue data crawlers: one for the Amazon S3 data source and one for the Amazon Redshift data source. To run the crawlers, complete the following steps:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
    AWS Glue Crawlers
  2. Select the crawler named glue-s3-crawler, then choose Run crawler to trigger the crawler job.
    Run Amazon S3 crawler job
  3. Select the crawler named glue-redshift-crawler, then choose Run crawler.
    Run Amazon Redshift crawler job

When the crawlers are complete, navigate to the Tables page to verify your results. You should see two tables registered under the demodb database.

AWS Glue database tables

Author an AWS Glue ETL job to perform data encryption

An AWS Glue job is provisioned for you as part of the CloudFormation stack setup, but the extract, transform, and load (ETL) script has not been created. We create and upload the ETL script to the /glue-script folder under the provisioned S3 bucket in order to run the AWS Glue job.

  1. Return to your AWS Cloud9 environment either via the AWS Cloud9 console, or by visiting the URL obtained from the CloudFormation stack output with the key AWSCloud9IDE.
    CloudFormation stack output for AWSCloud9IDE

We use the Miscreant package for implementing a deterministic encryption using the AES-SIV encryption algorithm, which means that for any given plain text value, the generated encrypted value will be always the same. The benefit of using this encryption approach is to allow for point lookups, equality joins, grouping, and indexing on encrypted columns. However, you should also be aware of the potential security implication when applying deterministic encryption to low-cardinality data, such as gender, boolean values, and status flags.

  1. Create a new file in the AWS Cloud9 environment and enter the following code snippet:
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.dynamicframe import DynamicFrameCollection
    from awsglue.dynamicframe import DynamicFrame
    
    import boto3
    import base64
    from miscreant.aes.siv import SIV
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import StringType
    
    args = getResolvedOptions(sys.argv, ["JOB_NAME", "SecretName", "InputTable"])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)
    
    # retrieve the data encryption key from Secrets Manager
    secret_name = args["SecretName"]
    
    sm_client = boto3.client('secretsmanager')
    get_secret_value_response = sm_client.get_secret_value(SecretId = secret_name)
    data_encryption_key = get_secret_value_response['SecretBinary']
    siv = SIV(data_encryption_key)  # Without nonce, the encryption becomes deterministic
    
    # define the data encryption function
    def pii_encrypt(value):
        if value is None:
            value = ""
        ciphertext = siv.seal(value.encode())
        return base64.b64encode(ciphertext).decode('utf-8')
    
    # register the data encryption function as Spark SQL UDF   
    udf_pii_encrypt = udf(lambda z: pii_encrypt(z), StringType())
    
    # define the Glue Custom Transform function
    def Encrypt_PII (glueContext, dfc) -> DynamicFrameCollection:
        newdf = dfc.select(list(dfc.keys())[0]).toDF()
        
        # PII fields to be encrypted
        pii_col_list = ["registered_credit_card"]
    
        for pii_col_name in pii_col_list:
            newdf = newdf.withColumn(pii_col_name, udf_pii_encrypt(col(pii_col_name)))
    
        encrypteddyc = DynamicFrame.fromDF(newdf, glueContext, "encrypted_data")
        return (DynamicFrameCollection({"CustomTransform0": encrypteddyc}, glueContext))
    
    # Script generated for node S3 bucket
    S3bucket_node1 = glueContext.create_dynamic_frame.from_catalog(
        database="demodb",
        table_name=args["InputTable"],
        transformation_ctx="S3bucket_node1",
    )
    
    # Script generated for node ApplyMapping
    ApplyMapping_node2 = ApplyMapping.apply(
        frame=S3bucket_node1,
        mappings=[
            ("id", "long", "id", "long"),
            ("full_name", "string", "full_name", "string"),
            ("gender", "string", "gender", "string"),
            ("job_title", "string", "job_title", "string"),
            ("spoken_language", "string", "spoken_language", "string"),
            ("contact_phone_number", "string", "contact_phone_number", "string"),
            ("email_address", "string", "email_address", "string"),
            ("registered_credit_card", "long", "registered_credit_card", "string"),
        ],
        transformation_ctx="ApplyMapping_node2",
    )
    
    # Custom Transform
    Customtransform_node = Encrypt_PII(glueContext, DynamicFrameCollection({"ApplyMapping_node2": ApplyMapping_node2}, glueContext))
    
    # Script generated for node Redshift Cluster
    RedshiftCluster_node3 = glueContext.write_dynamic_frame.from_catalog(
        frame=Customtransform_node,
        database="demodb",
        table_name="demodb_public_pii_table",
        redshift_tmp_dir=args["TempDir"],
        transformation_ctx="RedshiftCluster_node3",
    )
    
    job.commit()

  2. Save the script with the file name pii-data-encryption.py.
    Save file in Cloud9
  3. Copy the script to the desired S3 bucket location by running the following command:
    S3_BUCKET=$(aws s3 ls| awk '{print $3}'| grep awsblog-pii-data-input-)
    aws s3 cp pii-data-encryption.py s3://$S3_BUCKET/glue-script/pii-data-encryption.py

    Upload AWS Glue script to S3

  4. To verify the script is uploaded successfully, navigate to the Jobs page on the AWS Glue console.You should be able to find a job named pii-data-encryption-job.
    AWS Glue console
  5. Choose Run to trigger the AWS Glue job.It will first read the source data from the S3 bucket registered in the AWS Glue Data Catalog, then apply column mappings to transform data into the expected data types, followed by performing PII fields encryption, and finally loading the encrypted data into the target Redshift table. The whole process should be completed within 5 minutes for this sample dataset.AWS Glue job scriptYou can switch to the Runs tab to monitor the job status.
    Monitor AWS Glue job

Configure a Lambda function to perform data decryption

A Lambda function with the data decryption logic is deployed for you during the CloudFormation stack setup. You can find the function on the Lambda console.

AWS Lambda console

The following is the Python code used in the Lambda function:

import boto3
import os
import json
import base64
import logging
from miscreant.aes.siv import SIV

logger = logging.getLogger()
logger.setLevel(logging.INFO)

secret_name = os.environ['DATA_ENCRYPT_KEY']

sm_client = boto3.client('secretsmanager')
get_secret_value_response = sm_client.get_secret_value(SecretId = secret_name)
data_encryption_key = get_secret_value_response['SecretBinary']

siv = SIV(data_encryption_key)  # Without nonce, the encryption becomes deterministic

# define lambda function logic
def lambda_handler(event, context):
    ret = dict()
    res = []
    for argument in event['arguments']:
        encrypted_value = argument[0]
        try:
            de_val = siv.open(base64.b64decode(encrypted_value)) # perform decryption
        except:
            de_val = encrypted_value
            logger.warning('Decryption for value failed: ' + str(encrypted_value)) 
        res.append(json.dumps(de_val.decode('utf-8')))

    ret['success'] = True
    ret['results'] = res

    return json.dumps(ret) # return decrypted results

If you want to deploy the Lambda function on your own, make sure to include the Miscreant package in your deployment package.

Register a Lambda UDF in Amazon Redshift

You can create Lambda UDFs that use custom functions defined in Lambda as part of your SQL queries. Lambda UDFs are managed in Lambda, and you can control the access privileges to invoke these UDFs in Amazon Redshift.

  1. Navigate back to the Amazon Redshift Query Editor V2 to register the Lambda UDF.
  2. Use the CREATE EXTERNAL FUNCTION command and provide an IAM role that the Amazon Redshift cluster is authorized to assume and make calls to Lambda:
    CREATE OR REPLACE EXTERNAL FUNCTION pii_decrypt (value varchar(max))
    RETURNS varchar STABLE
    LAMBDA '<--Replace-with-your-lambda-function-name-->'
    IAM_ROLE '<--Replace-with-your-redshift-lambda-iam-role-arn-->';

You can find the Lambda name and Amazon Redshift IAM role on the CloudFormation stack Outputs tab:

  • LambdaFunctionName
  • IAMRoleForRedshiftLambdaUDF

CloudFormation stack output
Create External Function in Amazon Redshift

Validate the column-level encryption functionality in Amazon Redshift

By default, permission to run new Lambda UDFs is granted to PUBLIC. To restrict usage of the newly created UDF, revoke the permission from PUBLIC and then grant the privilege to specific users or groups. To learn more about Lambda UDF security and privileges, see Managing Lambda UDF security and privileges.

You must be a superuser or have the sys:secadmin role to run the following SQL statements:

GRANT SELECT ON "demodb"."public"."pii_table" TO PUBLIC;
CREATE USER regular_user WITH PASSWORD '1234Test!';
CREATE USER privileged_user WITH PASSWORD '1234Test!';
REVOKE EXECUTE ON FUNCTION pii_decrypt(varchar) FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pii_decrypt(varchar) TO privileged_user;

First, we run a SELECT statement to verify that our highly sensitive data field, in this case the registered_credit_card column, is now encrypted in the Amazon Redshift table:

SELECT * FROM "demodb"."public"."pii_table";

Select statement

For regular database users who have not been granted the permission to use the Lambda UDF, they will see a permission denied error when they try to use the pii_decrypt() function:

SET SESSION AUTHORIZATION regular_user;
SELECT *, pii_decrypt(registered_credit_card) AS decrypted_credit_card FROM "demodb"."public"."pii_table";

Permission denied

For privileged database users who have been granted the permission to use the Lambda UDF for decrypting the data, they can issue a SQL statement using the pii_decrypt() function:

SET SESSION AUTHORIZATION privileged_user;
SELECT *, pii_decrypt(registered_credit_card) AS decrypted_credit_card FROM "demodb"."public"."pii_table";

The original registered_credit_card values can be successfully retrieved, as shown in the decrypted_credit_card column.

Decrypted results

Cleaning up

To avoid incurring future charges, make sure to clean up all the AWS resources that you created as part of this post.

You can delete the CloudFormation stack on the AWS CloudFormation console or via the AWS Command Line Interface (AWS CLI). The default stack name is aws-blog-redshift-column-level-encryption.

Conclusion

In this post, we demonstrated how to implement a custom column-level encryption solution for Amazon Redshift, which provides an additional layer of protection for sensitive data stored on the cloud data warehouse. The CloudFormation template gives you an easy way to set up the data pipeline, which you can further customize for your specific business scenarios. You can also modify the AWS Glue ETL code to encrypt multiple data fields at the same time, and to use different data encryption keys for different columns for enhanced data security. With this solution, you can limit the occasions where human actors can access sensitive data stored in plain text on the data warehouse.

You can learn more about this solution and the source code by visiting the GitHub repository. To learn more about how to use Amazon Redshift UDFs to solve different business problems, refer to Example uses of user-defined functions (UDFs) and Amazon Redshift UDFs.


About the Author

Aaron ChongAaron Chong is an Enterprise Solutions Architect at Amazon Web Services Hong Kong. He specializes in the data analytics domain, and works with a wide range of customers to build big data analytics platforms, modernize data engineering practices, and advocate AI/ML democratization.

Simplify web app authentication: A guide to AD FS federation with Amazon Cognito user pools

Post Syndicated from Leo Drakopoulos original https://aws.amazon.com/blogs/security/simplify-web-app-authentication-a-guide-to-ad-fs-federation-with-amazon-cognito-user-pools/

August 13, 2018: Date this post was first published, on the Front-End Web and Mobile Blog. We updated the CloudFormation template, provided additional clarification on implementation steps, and revised to account for the new Amazon Cognito UI.


User authentication and authorization can be challenging when you’re building web and mobile apps. The challenges include handling user data and passwords, token-based authentication, federating identities from external identity providers (IdPs), managing fine-grained permissions, scalability, and more.

In this blog post, we will show you how to federate identities from Windows Server Active Directory to authenticate users into your web app by using AWS services. The main AWS service that we’ll use for this purpose is Amazon Cognito.

With Amazon Cognito user pools, you can add user sign-up and sign-in to your mobile and web apps by using a secure and scalable user directory. In addition, you can federate users from a SAML IdP with Amazon Cognito user pools, map these users to a user directory, and get standard authentication tokens from a user pool after the user authenticates with a SAML IdP.

This post explains how to integrate Amazon Cognito user pools with Microsoft Active Directory Federation Services (AD FS) to obtain JSON web tokens (JWTs) in your web app—which in turn can be used for downstream authentication. To demonstrate the complete authentication flow, we’ve created a simple REST API that’s built on Amazon API Gateway. The REST API retrieves data from an Amazon DynamoDB table with the help of an AWS Lambda function. We’ll use the JWT tokens that are vended from user pools to authenticate to the REST API, which is hosted on API Gateway.

A benefit of using Amazon Cognito user pools to federate users from a SAML provider is that a user pool supports SAML 2.0 post-binding endpoints. This helps eliminate the need for client-side parsing of the SAML assertion response, and the user pool directly receives the SAML response from your IdP through a user agent.

As part of the SAML federation feature, the user pool acts as a service provider on behalf of your application. The user pool becomes a single point of identity management for your application, and your application doesn’t need to integrate with multiple SAML IdPs.

Solution overview

Figure 1 shows the authentication flow that we present throughout this blog post.

Figure 1: Authentication flow with Amazon Cognito user pool

Figure 1: Authentication flow with Amazon Cognito user pool

As shown in the figure, the authentication flow involves the following steps:

  1. The app starts the sign-up and sign-in process by directing the user to the Cognito user pools hosted web UI. For a mobile app, you can use a web view to show the hosted web UI. For this post, you will use a web app that is hosted on Amazon Simple Storage Service (Amazon S3) fronted by Amazon CloudFront.
  2. The Amazon Cognito user pool determines the appropriate IdP based on your configuration. For AD FS, the IdP is determined by the metadata file or metadata endpoint URL from your SAML IdP. For example, if you use AD FS, the metadata URL looks like the following: https://<yourservername>/FederationMetadata/2007-06/FederationMetadata.xml
  3. The user is redirected to the IdP—in this case, Active Directory.
  4. The IdP authenticates the user if necessary. If the IdP recognizes that the user has an active session, then the IdP skips the authentication to provide a single sign-on experience.
  5. The IdP sends the SAML assertion to Amazon Cognito.
  6. The user’s profile is created in the user pool.
  7. After verifying the SAML assertion and collecting the user attributes (claims) from the assertion, Amazon Cognito returns OIDC tokens (ID, access, and refresh tokens) to the app for the user who is now signed in.
  8. The app then makes a GET request to API Gateway, passing along the JWT token for authorization. If authorized, the request is forwarded to Lambda for data retrieval from DynamoDB.

Installation and configuration walkthrough

To build the authentication flow that we described in the previous section, complete the following steps.

  • Step 1: Install Active Directory and AD FS
  • Step 2: Create an Amazon Cognito user pool
  • Step 3: Configure Active Directory and AD FS
  • Step 4: Complete the Amazon Cognito configuration
  • Step 5: Deploy and configure the web app

Step 1: Install Active Directory and AD FS

You will need to set up Active Directory and AD FS. For instructions on how to install both with an AWS CloudFormation template, see Enabling Federation to AWS Using Windows Active Directory, ADFS, and SAML 2.0. To complete the walkthrough in this blog post, you will need to have a working Active Directory service and AD FS service, and a user created within Active Directory. For this walkthrough, we created a user named bob with an email address of [email protected].

Step 2: Create an Amazon Cognito user pool

  1. Sign in to the Amazon Cognito console and do one of the following:
    • If you have an existing user pool, in the left navigation pane, choose User pools and then choose Create user pool to create a new user pool for this walkthrough.
    • If you don’t have an existing user pool, you will see a landing page. Keep the dropdown list as default and choose Create user pool.
  2. In the Configure sign-in experience section, for Cognito user pool sign-in options, select Email, and then choose Next.
  3. In the Configure security requirements section, under Multi-factor authentication, select No MFA, leave the other fields as default, and then choose Next.
  4. In the Configure sign-up experience section, under Attribute verification and user account confirmation, deselect Allow Cognito to automatically send messages to verify and confirm, and choose Next.
  5. In the Configure message delivery section, under Email, select Send email with Cognito, leave the other fields as default, and then choose Next.
  6. In the Integrate your app section, enter a user pool name, select Use the Cognito Hosted UI, and create a domain name using a Cognito domain.
  7. In the Initial app client section as shown in Figure 2, for App client name, enter SAML-IdP; and for Allowed callback URLs, enter https://localhost. Then choose Next.
    Figure 2: Set up the initial app client to create the Cognito user pool

    Figure 2: Set up the initial app client to create the Cognito user pool

  8. In the Review and create section, review all settings, and then scroll to the bottom of the page and choose Create user pool.

Step 3: Configure Active Directory and AD FS

Now that you’ve created an Amazon Cognito user pool, you need to set up Amazon Cognito as a relying party in the SAML identity provider (in this case, AD FS). After you configure AD FS, you will return to Amazon Cognito to complete the final configurations for the application to work.

  1. Connect to the Windows Server instance where you installed AD FS as an administrator through the remote desktop protocol (RDP).
  2. Open the AD FS 2.0 console.
  3. To make sure that the user you created in Step 1 has an email address, in the user property window for your user, choose General. Figure 3 shows our user named bob in Active Directory with an email address of [email protected].
    Figure 3: User properties of bob in the Active Directory

    Figure 3: User properties of bob in the Active Directory

  4. Determine the Uniform Resource Name (URN) for the Amazon Cognito user pool. The form of the URN is urn:amazon:cognito:sp:<user-pool-id>. You can find the user pool ID in the General settings tab.
  5. Configure AD FS as follows to work with the Amazon Cognito user pool:
    1. Go to Trust Relationships > Relying Party Trusts > Add relying party trusts. This will start a wizard.
    2. Select Enter data about the relying party manually.
    3. Enter a display name for the relying party configuration.
    4. On the next screen, do not configure a certificate.
    5. Enable support for the SAML 2.0 single sign-on service URL.
    6. Add the Amazon Cognito user pool URN as the relying party trust identifier.
    7. Configure the SAML POST binding. The SAML 2.0 post-binding endpoint (also known as the assertion consumer URL) for the Amazon Cognito user pool is https://<domain-prefix>.auth.<<region>.amazoncognito.com/saml2/idpresponse.  You configured this as the domain name in Step 2.6.
    8. Select Permit all users to access this relying party.
    9. Choose Finish.
  6. Navigate to Trust Relationships Relying Party Trusts. You should see that the URN of Amazon Cognito is configured as the relying party, as shown in Figure 4:
Figure 4: Amazon Cognito trusted as the relying party

Figure 4: Amazon Cognito trusted as the relying party

In a SAML federation, the IdP can pass various attributes about the user, the authentication method, or other points of context to the service provider (in this case, Amazon Cognito) in the form of SAML attributes. In AD FS, claim rules are used to assemble these required attributes using a combination of Active Directory lookups, simple transformations, and regular expression-based custom rules. In this example, you will configure two claim rules: Name ID and E-Mail.

  1. The Edit Claim Rules window should already be open. If it isn’t, select your relying party trust from the Trust Relationships > Relying Party Trusts screen, and then, in the Actions tab on the right side, choose Edit Claim Rules.
  2. On the Configure Claim Rule page, enter the following values for each configuration element, and then choose OK.
    • Claim rule name: Name ID
    • Incoming claim type: Windows account name
    • Outgoing claim type: Name ID
    • Outgoing name ID format: Persistent identifier
  3. Repeat the preceding steps for the E-mail claim:
    • Claim rule name: Email
    • Attribute Directory: Active Directory
    • LDAP Attributes: Email Addresses
    • Outgoing Claim Type: Email Address
  4. Before leaving the AD FS configuration, download the metadata file for the AD FS. The metadata URL for AD FS looks like the following: https://<servername>/FederationMetadata/2007-06/FederationMetadata.xmlM. The metadata file describes the endpoint of your SAML IdP (the AD FS service) to the service provider (Amazon Cognito).

Step 4: Complete the Amazon Cognito configuration

  1. Sign in to the Amazon Cognito console.
  2. Select the Amazon Cognito user pool that you created earlier, navigate to Sign-in experience Federated identity provider sign-in, and choose Add identity provider, as shown in Figure 5.
    Figure 5: Add a federated identity provider in the Amazon Cognito console

    Figure 5: Add a federated identity provider in the Amazon Cognito console

  3. Choose SAML as the identity provider.
  4. As shown in Figure 6, enter a name for your identity provider, choose Select file, and then upload the FederationMetadata.xml file that you downloaded at the end of Step 3.
    Figure 6: Set up SAML federation with the user pool

    Figure 6: Set up SAML federation with the user pool

  5. Provide the SAML attribute to map attributes between your SAML provider and your user pool as follows:
    • For User pool attribute, select email.
    • For SAML attribute, enter http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress

    These mappings map the claims from the SAML assertion from AD FS to the user pool attributes. You configured an E-mail claim in AD FS, so you need to map this with the appropriate attribute in the user pool.

  6. Choose Add identity provider.

Step 5: Deploy and configure a web app

To reduce the number of steps required for this walkthrough, we have provided a CloudFormation template that you can use to complete the deployment, which deploys the architecture shown in Figure 7:

Figure 7: Web app architecture deployed by the CloudFormation template

Figure 7: Web app architecture deployed by the CloudFormation template

This architecture is essentially the same as step 8 from the authentication flow diagram (Figure 1) earlier in this post. In Figure 7, we have added Amazon S3 and Amazon CloudFront to the diagram, which is where your static website is hosted. Complete the following steps for this walkthrough:

  • Step 5.1: Create the AWS CloudFormation stack
  • Step 5.2: Manually integrate Amazon Cognito user pools with API Gateway
  • Step 5.3: Update the configuration for Amazon Cognito
  • Step 5.4: Update the configuration for the client-side application and upload to Amazon S3
  • Step 5.5: Insert a row into a DynamoDB table to help you test the application

Step 5.1: Create the AWS CloudFormation stack

Let’s deploy this infrastructure:

  1. Download the code repository, which includes the CloudFormation template named prerequisites.yaml and the sample code for a web app named DataManager.
  2. Navigate to the CloudFormation console in the Region where you deployed the user pool, and choose Create Stack.
  3. To upload the template to Amazon S3, choose Browse and select prerequisites.yaml  in the folder where you downloaded it.
  4. Provide a Stack name and a unique Bucket name.

    Note: S3 bucket names should not contain uppercase characters.

  5. Choose Next, and select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  6. Choose Create and then wait for the resources to be deployed.

    Note: If the deployment fails with the error message API: s3:CreateBucket Access Denied, review the IAM permissions available for the IAM user or the role used and make sure that the s3:CreateBucket permission has been granted.

Step 5.2: Manually integrate the Amazon Cognito user pool with API Gateway

  1. Open the API Gateway console. You should see that an API named DataManager has been created by CloudFormation, as shown in Figure 8:
    Figure 8: APIs in the API Gateway console

    Figure 8: APIs in the API Gateway console

  2. Under APIs, choose DataManager, and then choose Authorizers.
  3. Choose Create new Authorizer, and then populate the relevant details:
    • For Name, enter SamlAuthorizer (Make sure that the name of the user pool is the same as the one that you created).
    • For Type, select Cognito.
    • For Cognito user pool, enter Samlfederation.
    • For Token source, enter Authorization.

    With this configuration, you use the user pools authorizer to authenticate Get requests to your Rest API that’s hosted on API Gateway. In the dropdown for Cognito User Pool, add the user pool that you created in Step 2: Create an Amazon Cognito user pool. Choose Create.

  4. Navigate back to APIs > Resources, choose GET, and then choose Method Request.
  5. To add the authorizer that you just created, under Settings, in the Authorization dropdown, choose your authorizer. Remember to save the setting by choosing the small tick symbol on the right side. If you don’t see the Cognito authorizer, just wait for several minutes for updates from API Gateway.
    Figure 9: Add the Cognito authorizer for the API GET method

    Figure 9: Add the Cognito authorizer for the API GET method

Step 5.3: Update the configuration for Amazon Cognito

Now you need to update the Amazon Cognito configuration based on the CloudFront distribution that you deployed using the CloudFormation template in Step 5.1.

  1. Navigate to the CloudFormation console and locate the CloudFormation stack that was deployed. As shown in Figure 10, in the Outputs tab, copy the values for CloudfrontEndpoint and DataManagerApiInvokeUrl because you will need them later.
    Figure 10: Outputs of the CloudFormation template deployment

    Figure 10: Outputs of the CloudFormation template deployment

  2. Navigate to the Amazon Cognito console and go to your user pool. Choose the App integration tab, scroll to the bottom of the page, and for App client name, choose the App client that you added during user pool creation.
  3. On the page for your App client, in the Hosted UI section, choose Edit, and then do the following:
    • For both the Allowed callback URLs and Allowed sign-out URLs, enter the CloudFront endpoint.
    • For OAuth grant types, select Implicit grant.
    • For OpenID Connect scopes, select Email and OpenID.
    Figure 11: Configure the hosted UI for the app client

    Figure 11: Configure the hosted UI for the app client

The Amazon Cognito hosted UI provides an OAuth 2.0 compliant authorization server. It includes the default implementation of end user flows, such as registration and authentication. Because the application interacts with Amazon Cognito through an OAuth 2.0 implicit flow, which requires a redirect, the website needs to use HTTPS.

Note: In a production scenario, instead of implicit flow, an authorization code grant is the preferred method in the OAuth 2.0 framework because it’s more secure.

To have an HTTPS endpoint for the Amazon S3 static website, you can use the CloudFront distribution that was deployed by the CloudFormation template in Step 5.1.

When one of your users successfully logs in to the Active Directory infrastructure, the user is automatically redirected to the callback URL. In this case, this is a CloudFront distribution URL with an Amazon Cognito ID token, access token, and refresh token.

Step 5.4: Update the configuration for your client-side application, and upload it to Amazon S3

Navigate to the code that you previously cloned in Step 5.1, and perform the following steps:

  1. With a file manager, navigate to the folder where the cloned content is located. Open the DataManager directory.
  2. Open the js folder. Using a text editor, open the config.js file.
  3. From the Amazon Cognito console, copy the client app application ID as the value of the userPoolClientId property. You can find the application ID in the App clients menu of the Amazon Cognito console.
  4. Change the value of the Region property to the Region that you are using (for example, us-east-2)
  5. While you are still in the Amazon Cognito console, open the Domain name page, and copy the custom prefix into the value for the authDomainPrefix property.
  6. Open the CloudFormation console and choose the stack that was created in Step 5.1. With the stack selected, open the Outputs tab.
    • Copy the value of the CloudfrontEndpoint output variable to the redirect_uri property.
    • Copy the value of the DataManagerApiInvokeUrl output variable to the invokeUri property.
  7. Copy the files to the S3 bucket that hosts the static website. To upload the files, use the AWS Command Line Interface (AWS CLI) or the Amazon S3 console.

Step 5.5: Insert a row into the DynamoDB table to help test your application

The CloudFormation template that you used in Step 5.1 created a DynamoDB table that you can use to test your application. Now you need to add a row to the table (as shown in the Items returned section of Figure 12), so that you can get some results when you test your application. To add a row, in the left menu, choose Tables Update settings to find the table, and then choose Actions Create item.

The Lambda function that retrieves data from the ADFSSecretData DynamoDB table only retrieves data from rows where the email matches the one used to log in to Active Directory. To achieve this, you pass the event.requestContext.authorizer.claims.email.object within the Lambda function. This object contains the email that you used to log in to Active Directory.

Figure 12: Search result of DynamoDB table

Figure 12: Search result of DynamoDB table

Now you’re ready to test the application.

  1. Open the CloudFront URL in your browser and choose Enter. This should immediately take you to the web app landing page. From there, you’re automatically redirected to the Amazon Cognito hosted UI. You should see a screen similar to the following that says Sign in with your corporate ID:
    Figure 13: Cognito hosted UI sign-in page

    Figure 13: Cognito hosted UI sign-in page

  2. After you choose your SAML provider, you are redirected to your AD FS infrastructure that shows a login screen similar to the following:
    Figure 14: AD FS sign-in page

    Figure 14: AD FS sign-in page

    Note: If there’s an error, make sure that there’s a mapping in the host file for your AD FS server, with the appropriate hostname or public IP address of the EC2 instance where the AD FS infrastructure is hosted

    On the login screen, for Username, enter the user’s email address (in our example, that’s Bob’s email address), and for Password, enter the password that you defined in Active Directory, as shown in Figure 14. If the login is successful, you’re redirected back to the web app with a valid ID and access tokens.

    Figure 15: Sample web app home page

    Figure 15: Sample web app home page

  3. Choose Refresh to see the data that you stored in DynamoDB.
    Figure 16: Retrieval of the data from DynamoDB

    Figure 16: Retrieval of the data from DynamoDB

Summary

In this walkthrough, you federated users from AD FS, and successfully authenticated those users to our REST API that’s hosted on API Gateway.

The SAML federation feature in Amazon Cognito helps you set up and integrate your apps with multiple SAML IdPs. By using the SAML federation capabilities of Amazon Cognito, your apps don’t need to handle the type of SAML IdP that they are interacting with. Amazon Cognito takes care of it on behalf of your application.
 


This article was originally written by Adrian Hall, who was an AWS Solutions Architect when he wrote it.
 


 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Leo Drakopoulos

Leo Drakopoulos

Leo is a Principal Solutions Architect working within the financial services industry. His focus is AWS Serverless and Container-based architectures. He enjoys helping customers adopt a culture of innovation and use cloud-native architectures.

Jun Zhang

Jun Zhang

Jun is a Solutions Architect based in Zurich. He helps Swiss customers architect cloud-based solutions to achieve their business potential. He has a passion for sustainability and strives to solve current environmental challenges with technology. He is also a huge tennis fan and enjoys playing board games a lot.

Perform accent-insensitive search using OpenSearch

Post Syndicated from Aruna Govindaraju original https://aws.amazon.com/blogs/big-data/perform-accent-insensitive-search-using-opensearch/

We often need our text search to be agnostic of accent marks. Accent-insensitive search, also called diacritics-agnostic search, is where search results are the same for queries that may or may not contain Latin characters such as à, è, Ê, ñ, and ç. Diacritics are English letters with an accent to mark a difference in pronunciation. In recent years, words with diacritics have trickled into the mainstream English language, such as café or protégé. Well, touché! OpenSearch has the answer!

OpenSearch is a scalable, flexible, and extensible open-source software suite for your search workload. OpenSearch can be deployed in three different modes: the self-managed open-source OpenSearch, the managed Amazon OpenSearch Service, and Amazon OpenSearch Serverless. All three deployment modes are powered by Apache Lucene, and offer text analytics using the Lucene analyzers.

In this post, we demonstrate how to perform accent-insensitive search using OpenSearch to handle diacritics.

Solution overview

Lucene Analyzers are Java libraries that are used to analyze text while indexing and searching documents. These analyzers consist of tokenizers and filters. The tokenizers split the incoming text into one or more tokens, and the filters are used to transform the tokens by modifying or removing the unnecessary characters.

OpenSearch supports custom analyzers, which enable you to configure different combinations of tokenizers and filters. It can consist of character filters, tokenizers, and token filters. In order to enable our diacritic-insensitive search, we configure custom analyzers that use the ASCII folding token filter.

ASCIIFolding is a method used to covert alphabetic, numeric, and symbolic Unicode characters that aren’t in the first 127 ASCII characters (the Basic Latin Unicode block) into their ASCII equivalents, if one exists. For example, the filter changes “à” to “a”. This allows search engines to return results agnostic of the accent.

In this post, we configure accent-insensitive search using the ASCIIFolding filter supported in OpenSearch Service. We ingest a set of European movie names with diacritics and verify search results with and without the diacritics.

Create an index with a custom analyzer

We first create the index asciifold_movies with custom analyzer custom_asciifolding:

PUT /asciifold_movies
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_asciifolding": {
          "tokenizer": "standard",
          "filter": [
            "my_ascii_folding"
          ]
        }
      },
      "filter": {
        "my_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "custom_asciifolding",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

Ingest sample data

Next, we ingest sample data with Latin characters into the index asciifold_movies:

POST _bulk
{ "index" : { "_index" : "asciifold_movies", "_id":"1"} }
{  "title" : "Jour de fête"}
{ "index" : { "_index" : "asciifold_movies", "_id":"2"} }
{  "title" : "La gloire de mon père" }
{ "index" : { "_index" : "asciifold_movies", "_id":"3"} }
{  "title" : "Le roi et l’oiseau" }
{ "index" : { "_index" : "asciifold_movies", "_id":"4"} }
{  "title" : "Être et avoir" }
{ "index" : { "_index" : "asciifold_movies", "_id":"5"} }
{  "title" : "Kirikou et la sorcière"}
{ "index" : { "_index" : "asciifold_movies", "_id":"6"} }
{  "title" : "Señora Acero"}
{ "index" : { "_index" : "asciifold_movies", "_id":"7"} }
{  "title" : "Señora garçon"}
{ "index" : { "_index" : "asciifold_movies", "_id":"8"} }
{  "title" : "Jour de fete"}

Query the index

Now we query the asciifold_movies index for words with and without Latin characters.

Our first query uses an accented character:

GET asciifold_movies/_search
{
  "query": {
    "match": {
      "title": "fête"
    }
  }
}

Our second query uses a spelling of the same word without the accent mark:

GET asciifold_movies/_search
{
  "query": {
    "match": {
      "title": "fete"
    }
  }
}

In the preceding queries, the search terms “fête” and “fete” return the same results:

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.7361701,
    "hits": [
      {
        "_index": "asciifold_movies",
        "_id": "8",
        "_score": 0.7361701,
        "_source": {
          "title": "Jour de fete"
        }
      },
      {
        "_index": "asciifold_movies",
        "_id": "1",
        "_score": 0.42547938,
        "_source": {
          "title": "Jour de fête"
        }
      }
    ]
  }
}

Similarly, try comparing results for “señora” and “senora” or “sorcière” and “sorciere.” The accent-insensitive results are due to the ASCIIFolding filter used with the custom analyzers.

Enable aggregations for fields with accents

Now that we have enabled accent-insensitive search, let’s look at how we can make aggregations work with accents.

Try the following query on the index:

GET asciifold_movies/_search
{
  "size": 0,
  "aggs": {
    "test": {
      "terms": {
        "field": "title.keyword"
      }
    }
  }
}

We get the following response:

"aggregations" : {
    "test" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Jour de fete",
          "doc_count" : 1
        },
        {
          "key" : "Jour de fête",
          "doc_count" : 1
        },
        {
          "key" : "Kirikou et la sorcière",
          "doc_count" : 1
        },
        {
          "key" : "La gloire de mon père",
          "doc_count" : 1
        },
        {
          "key" : "Le roi et l’oiseau",
          "doc_count" : 1
        },
        {
          "key" : "Señora Acero",
          "doc_count" : 1
        },
        {
          "key" : "Señora garçon",
          "doc_count" : 1
        },
        {
          "key" : "Être et avoir",
          "doc_count" : 1
        }
      ]
    }
  }

Create accent-insensitive aggregations using a normalizer

In the previous example, the aggregation returns two different buckets, one for “Jour de fête” and one for “Jour de fete.” We can enable aggregations to create one bucket for the field, regardless of the diacritics. This is achieved using the normalizer filter.

The normalizer supports a subset of character and token filters. Using just the defaults, the normalizer filter is a simple way to standardize Unicode text in a language-independent way for search, thereby standardizing different forms of the same character in Unicode and allowing diacritic-agnostic aggregations.

Let’s modify the index mapping to include the normalizer. Delete the previous index, then create a new index with the following mapping and ingest the same dataset:

PUT /asciifold_movies
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_asciifolding": {
          "tokenizer": "standard",
          "filter": [
            "my_ascii_folding"
          ]
        }
      },
      "filter": {
        "my_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      },
      "normalizer": {
        "custom_normalizer": {
          "type": "custom",
          "filter": "asciifolding"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "custom_asciifolding",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256,
            "normalizer": "custom_normalizer"
          }
        }
      }
    }
  }
}

After you ingest the same dataset, try the following query:

GET asciifold_movies/_search
{
  "size": 0,
  "aggs": {
    "test": {
      "terms": {
        "field": "title.keyword"
      }
    }
  }
}

We get the following results:

"aggregations" : {
    "test" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Jour de fete",
          "doc_count" : 2
        },
        {
          "key" : "Etre et avoir",
          "doc_count" : 1
        },
        {
          "key" : "Kirikou et la sorciere",
          "doc_count" : 1
        },
        {
          "key" : "La gloire de mon pere",
          "doc_count" : 1
        },
        {
          "key" : "Le roi et l'oiseau",
          "doc_count" : 1
        },
        {
          "key" : "Senora Acero",
          "doc_count" : 1
        },
        {
          "key" : "Senora garcon",
          "doc_count" : 1
        }
      ]
    }
  }

Now we compare the results, and we can see the aggregations with term “Jour de fête” and “Jour de fete” are rolled up into one bucket with doc_count=2.

Summary

In this post, we showed how to enable accent-insensitive search and aggregations by designing the index mapping to do ASCII folding for search tokens and normalize the keyword field for aggregations. You can use the OpenSearch query DSL to implement a range of search features, providing a flexible foundation for structured and unstructured search applications. The Open Source OpenSearch community has also extended the product to enable support for natural language processing, machine learning algorithms, custom dictionaries, and a wide variety of other plugins.

If you have feedback about this post, submit it in the comments section. If you have questions about this post, start a new thread on the Amazon OpenSearch Service forum or contact AWS Support.


About the Author

Aruna Govindaraju is an Amazon OpenSearch Specialist Solutions Architect and has worked with many commercial and open-source search engines. She is passionate about search, relevancy, and user experience. Her expertise with correlating end-user signals with search engine behavior has helped many customers improve their search experience. Her favorite pastime is hiking the New England trails and mountains.

Build event-driven data pipelines using AWS Controllers for Kubernetes and Amazon EMR on EKS

Post Syndicated from Victor Gu original https://aws.amazon.com/blogs/big-data/build-event-driven-data-pipelines-using-aws-controllers-for-kubernetes-and-amazon-emr-on-eks/

An event-driven architecture is a software design pattern in which decoupled applications can asynchronously publish and subscribe to events via an event broker. By promoting loose coupling between components of a system, an event-driven architecture leads to greater agility and can enable components in the system to scale independently and fail without impacting other services. AWS has many services to build solutions with an event-driven architecture, such as Amazon EventBridge, Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), and AWS Lambda.

Amazon Elastic Kubernetes Service (Amazon EKS) is becoming a popular choice among AWS customers to host long-running analytics and AI or machine learning (ML) workloads. By containerizing your data processing tasks, you can simply deploy them into Amazon EKS as Kubernetes jobs and use Kubernetes to manage underlying computing compute resources. For big data processing, which requires distributed computing, you can use Spark on Amazon EKS. Amazon EMR on EKS, a managed Spark framework on Amazon EKS, enables you to run Spark jobs with benefits of scalability, portability, extensibility, and speed. With EMR on EKS, the Spark jobs run using the Amazon EMR runtime for Apache Spark, which increases the performance of your Spark jobs so that they run faster and cost less than open-source Apache Spark.

Data processes require a workflow management to schedule jobs and manage dependencies between jobs, and require monitoring to ensure that the transformed data is always accurate and up to date. One popular orchestration tool for managing workflows is Apache Airflow, which can be installed in Amazon EKS. Alternatively, you can use the AWS-managed version, Amazon Managed Workflows for Apache Airflow (Amazon MWAA). Another option is to use AWS Step Functions, which is a serverless workflow service that integrates with EMR on EKS and EventBridge to build event-driven workflows.

In this post, we demonstrate how to build an event-driven data pipeline using AWS Controllers for Kubernetes (ACK) and EMR on EKS. We use ACK to provision and configure serverless AWS resources, such as EventBridge and Step Functions. Triggered by an EventBridge rule, Step Functions orchestrates jobs running in EMR on EKS. With ACK, you can use the Kubernetes API and configuration language to create and configure AWS resources the same way you create and configure a Kubernetes data processing job. Because most of the managed services are serverless, you can build and manage your entire data pipeline using the Kubernetes API with tools such as kubectl.

Solution overview

ACK lets you define and use AWS service resources directly from Kubernetes, using the Kubernetes Resource Model (KRM). The ACK project contains a series of service controllers, one for each AWS service API. With ACK, developers can stay in their familiar Kubernetes environment and take advantage of AWS services for their application-supporting infrastructure. In the post Microservices development using AWS controllers for Kubernetes (ACK) and Amazon EKS blueprints, we show how to use ACK for microservices development.

In this post, we show how to build an event-driven data pipeline using ACK controllers for EMR on EKS, Step Functions, EventBridge, and Amazon Simple Storage Service (Amazon S3). We provision an EKS cluster with ACK controllers using Terraform modules. We create the data pipeline with the following steps:

  1. Create the emr-data-team-a namespace and bind it with the virtual cluster my-ack-vc in Amazon EMR by using the ACK controller.
  2. Use the ACK controller for Amazon S3 to create an S3 bucket. Upload the sample Spark scripts and sample data to the S3 bucket.
  3. Use the ACK controller for Step Functions to create a Step Functions state machine as an EventBridge rule target based on Kubernetes resources defined in YAML manifests.
  4. Use the ACK controller for EventBridge to create an EventBridge rule for pattern matching and target routing.

The pipeline is triggered when a new script is uploaded. An S3 upload notification is sent to EventBridge and, if it matches the specified rule pattern, triggers the Step Functions state machine. Step Functions calls the EMR virtual cluster to run the Spark job, and all the Spark executors and driver are provisioned inside the emr-data-team-a namespace. The output is saved back to the S3 bucket, and the developer can check the result on the Amazon EMR console.

The following diagram illustrates this architecture.

Prerequisites

Ensure that you have the following tools installed locally:

Deploy the solution infrastructure

Because each ACK service controller requires different AWS Identity and Access Management (IAM) roles for managing AWS resources, it’s better to use an automation tool to install the required service controllers. For this post, we use Amazon EKS Blueprints for Terraform and the AWS EKS ACK Addons Terraform module to provision the following components:

  • A new VPC with three private subnets and three public subnets
  • An internet gateway for the public subnets and a NAT Gateway for the private subnets
  • An EKS cluster control plane with one managed node group
  • Amazon EKS-managed add-ons: VPC_CNI, CoreDNS, and Kube_Proxy
  • ACK controllers for EMR on EKS, Step Functions, EventBridge, and Amazon S3
  • IAM execution roles for EMR on EKS, Step Functions, and EventBridge

Let’s start by cloning the GitHub repo to your local desktop. The module eks_ack_addons in addon.tf is for installing ACK controllers. ACK controllers are installed by using helm charts in the Amazon ECR public galley. See the following code:

cd examples/usecases/event-driven-pipeline
terraform init
terraform plan
terraform apply -auto-approve #defaults to us-west-2

The following screenshot shows an example of our output. emr_on_eks_role_arn is the ARN of the IAM role created for Amazon EMR running Spark jobs in the emr-data-team-a namespace in Amazon EKS. stepfunction_role_arn is the ARN of the IAM execution role for the Step Functions state machine. eventbridge_role_arn is the ARN of the IAM execution role for the EventBridge rule.

The following command updates kubeconfig on your local machine and allows you to interact with your EKS cluster using kubectl to validate the deployment:

region=us-west-2
aws eks --region $region update-kubeconfig --name event-driven-pipeline-demo

Test your access to the EKS cluster by listing the nodes:

kubectl get nodes
# Output should look like below
NAME                                        STATUS   ROLES    AGE     VERSION
ip-10-1-10-64.us-west-2.compute.internal    Ready    <none>   19h     v1.24.9-eks-49d8fe8
ip-10-1-10-65.us-west-2.compute.internal    Ready    <none>   19h     v1.24.9-eks-49d8fe8
ip-10-1-10-7.us-west-2.compute.internal     Ready    <none>   19h     v1.24.9-eks-49d8fe8
ip-10-1-10-73.us-west-2.compute.internal    Ready    <none>   19h     v1.24.9-eks-49d8fe8
ip-10-1-11-96.us-west-2.compute.internal    Ready    <none>   19h     v1.24.9-eks-49d8fe8
ip-10-1-12-197.us-west-2.compute.internal   Ready    <none>   19h     v1.24.9-eks-49d8fe8

Now we’re ready to set up the event-driven pipeline.

Create an EMR virtual cluster

Let’s start by creating a virtual cluster in Amazon EMR and link it with a Kubernetes namespace in EKS. By doing that, the virtual cluster will use the linked namespace in Amazon EKS for running Spark workloads. We use the file emr-virtualcluster.yaml. See the following code:

apiVersion: emrcontainers.services.k8s.aws/v1alpha1
kind: VirtualCluster
metadata:
  name: my-ack-vc
spec:
  name: my-ack-vc
  containerProvider:
    id: event-driven-pipeline-demo  # your eks cluster name
    type_: EKS
    info:
      eksInfo:
        namespace: emr-data-team-a # namespace binding with EMR virtual cluster

Let’s apply the manifest by using the following kubectl command:

kubectl apply -f ack-yamls/emr-virtualcluster.yaml

You can navigate to the Virtual clusters page on the Amazon EMR console to see the cluster record.

Create an S3 bucket and upload data

Next, let’s create a S3 bucket for storing Spark pod templates and sample data. We use the s3.yaml file. See the following code:

apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: sparkjob-demo-bucket
spec:
  name: sparkjob-demo-bucket

kubectl apply -f ack-yamls/s3.yaml

If you don’t see the bucket, you can check the log from the ACK S3 controller pod for details. The error is mostly caused if a bucket with the same name already exists. You need to change the bucket name in s3.yaml as well as in eventbridge.yaml and sfn.yaml. You also need to update upload-inputdata.sh and upload-spark-scripts.sh with the new bucket name.

Run the following command to upload the input data and pod templates:

bash spark-scripts-data/upload-inputdata.sh

The sparkjob-demo-bucket S3 bucket is created with two folders: input and scripts.

Create a Step Functions state machine

The next step is to create a Step Functions state machine that calls the EMR virtual cluster to run a Spark job, which is a sample Python script to process the New York City Taxi Records dataset. You need to define the Spark script location and pod templates for the Spark driver and executor in the StateMachine object .yaml file. Let’s make the following changes (highlighted) in sfn.yaml first:

  • Replace the value for roleARN with stepfunctions_role_arn
  • Replace the value for ExecutionRoleArn with emr_on_eks_role_arn
  • Replace the value for VirtualClusterId with your virtual cluster ID
  • Optionally, replace sparkjob-demo-bucket with your bucket name

See the following code:

apiVersion: sfn.services.k8s.aws/v1alpha1
kind: StateMachine
metadata:
  name: run-spark-job-ack
spec:
  name: run-spark-job-ack
  roleARN: "arn:aws:iam::xxxxxxxxxxx:role/event-driven-pipeline-demo-sfn-execution-role"   # replace with your stepfunctions_role_arn
  tags:
  - key: owner
    value: sfn-ack
  definition: |
      {
      "Comment": "A description of my state machine",
      "StartAt": "input-output-s3",
      "States": {
        "input-output-s3": {
          "Type": "Task",
          "Resource": "arn:aws:states:::emr-containers:startJobRun.sync",
          "Parameters": {
            "VirtualClusterId": "f0u3vt3y4q2r1ot11m7v809y6",  
            "ExecutionRoleArn": "arn:aws:iam::xxxxxxxxxxx:role/event-driven-pipeline-demo-emr-eks-data-team-a",
            "ReleaseLabel": "emr-6.7.0-latest",
            "JobDriver": {
              "SparkSubmitJobDriver": {
                "EntryPoint": "s3://sparkjob-demo-bucket/scripts/pyspark-taxi-trip.py",
                "EntryPointArguments": [
                  "s3://sparkjob-demo-bucket/input/",
                  "s3://sparkjob-demo-bucket/output/"
                ],
                "SparkSubmitParameters": "--conf spark.executor.instances=10"
              }
            },
            "ConfigurationOverrides": {
              "ApplicationConfiguration": [
                {
                 "Classification": "spark-defaults",
                "Properties": {
                  "spark.driver.cores":"1",
                  "spark.executor.cores":"1",
                  "spark.driver.memory": "10g",
                  "spark.executor.memory": "10g",
                  "spark.kubernetes.driver.podTemplateFile":"s3://sparkjob-demo-bucket/scripts/driver-pod-template.yaml",
                  "spark.kubernetes.executor.podTemplateFile":"s3://sparkjob-demo-bucket/scripts/executor-pod-template.yaml",
                  "spark.local.dir" : "/data1,/data2"
                }
              }
              ]
            }...

You can get your virtual cluster ID from the Amazon EMR console or with the following command:

kubectl get virtualcluster -o jsonpath={.items..status.id}
# result:
f0u3vt3y4q2r1ot11m7v809y6  # VirtualClusterId

Then apply the manifest to create the Step Functions state machine:

kubectl apply -f ack-yamls/sfn.yaml

Create an EventBridge rule

The last step is to create an EventBridge rule, which is used as an event broker to receive event notifications from Amazon S3. Whenever a new file, such as a new Spark script, is created in the S3 bucket, the EventBridge rule will evaluate (filter) the event and invoke the Step Functions state machine if it matches the specified rule pattern, triggering the configured Spark job.

Let’s use the following command to get the ARN of the Step Functions state machine we created earlier:

kubectl get StateMachine -o jsonpath={.items..status.ackResourceMetadata.arn}
# result
arn: arn:aws:states:us-west-2:xxxxxxxxxx:stateMachine:run-spark-job-ack # sfn_arn

Then, update eventbridge.yaml with the following values:

  • Under targets, replace the value for roleARN with eventbridge_role_arn

Under targets, replace arn with your sfn_arn

  • Optionally, in eventPattern, replace sparkjob-demo-bucket with your bucket name

See the following code:

apiVersion: eventbridge.services.k8s.aws/v1alpha1
kind: Rule
metadata:
  name: eb-rule-ack
spec:
  name: eb-rule-ack
  description: "ACK EventBridge Filter Rule to sfn using event bus reference"
  eventPattern: | 
    {
      "source": ["aws.s3"],
      "detail-type": ["Object Created"],
      "detail": {
        "bucket": {
          "name": ["sparkjob-demo-bucket"]    
        },
        "object": {
          "key": [{
            "prefix": "scripts/"
          }]
        }
      }
    }
  targets:
    - arn: arn:aws:states:us-west-2:xxxxxxxxxx:stateMachine:run-spark-job-ack # replace with your sfn arn
      id: sfn-run-spark-job-target
      roleARN: arn:aws:iam::xxxxxxxxx:role/event-driven-pipeline-demo-eb-execution-role # replace your eventbridge_role_arn
      retryPolicy:
        maximumRetryAttempts: 0 # no retries
  tags:
    - key:owner
      value: eb-ack

By applying the EventBridge configuration file, an EventBridge rule is created to monitor the folder scripts in the S3 bucket sparkjob-demo-bucket:

kubectl apply -f ack-yamls/eventbridge.yaml

For simplicity, the dead-letter queue is not set and maximum retry attempts is set to 0. For production usage, set them based on your requirements. For more information, refer to Event retry policy and using dead-letter queues.

Test the data pipeline

To test the data pipeline, we trigger it by uploading a Spark script to the S3 bucket scripts folder using the following command:

bash spark-scripts-data/upload-spark-scripts.sh

The upload event triggers the EventBridge rule and then calls the Step Functions state machine. You can go to the State machines page on the Step Functions console and choose the job run-spark-job-ack to monitor its status.

For the Spark job details, on the Amazon EMR console, choose Virtual clusters in the navigation pane, and then choose my-ack-vc. You can review all the job run history for this virtual cluster. If you choose Spark UI in any row, you’re redirected the Spark history server for more Spark driver and executor logs.

Clean up

To clean up the resources created in the post, use the following code:

aws s3 rm s3://sparkjob-demo-bucket --recursive # clean up data in S3
kubectl delete -f ack-yamls/. #Delete aws resources created by ACK
terraform destroy -target="module.eks_blueprints_kubernetes_addons" -target="module.eks_ack_addons" -auto-approve -var region=$region
terraform destroy -target="module.eks_blueprints" -auto-approve -var region=$region
terraform destroy -auto-approve -var region=$regionterraform destroy -auto-approve -var region=$region

Conclusion

This post showed how to build an event-driven data pipeline purely with native Kubernetes API and tooling. The pipeline uses EMR on EKS as compute and uses serverless AWS resources Amazon S3, EventBridge, and Step Functions as storage and orchestration in an event-driven architecture. With EventBridge, AWS and custom events can be ingested, filtered, transformed, and reliably delivered (routed) to more than 20 AWS services and public APIs (webhooks), using human-readable configuration instead of writing undifferentiated code. EventBridge helps you decouple applications and achieve more efficient organizations using event-driven architectures, and has quickly become the event bus of choice for AWS customers for many use cases, such as auditing and monitoring, application integration, and IT automation.

By using ACK controllers to create and configure different AWS services, developers can perform all data plane operations without leaving the Kubernetes platform. Also, developers only need to maintain the EKS cluster because all the other components are serverless.

As a next step, clone the GitHub repository to your local machine and test the data pipeline in your own AWS account. You can modify the code in this post and customize it for your own needs by using different EventBridge rules or adding more steps in Step Functions.


About the authors

Victor Gu is a Containers and Serverless Architect at AWS. He works with AWS customers to design microservices and cloud native solutions using Amazon EKS/ECS and AWS serverless services. His specialties are Kubernetes, Spark on Kubernetes, MLOps and DevOps.

Michael Gasch is a Senior Product Manager for AWS EventBridge, driving innovations in event-driven architectures. Prior to AWS, Michael was a Staff Engineer at the VMware Office of the CTO, working on open-source projects, such as Kubernetes and Knative, and related distributed systems research.

Peter Dalbhanjan is a Solutions Architect for AWS based in Herndon, VA. Peter has a keen interest in evangelizing AWS solutions and has written multiple blog posts that focus on simplifying complex use cases. At AWS, Peter helps with designing and architecting variety of customer workloads.

How to use Amazon GuardDuty and AWS WAF v2 to automatically block suspicious hosts

Post Syndicated from Eucke Warren original https://aws.amazon.com/blogs/security/how-to-use-amazon-guardduty-and-aws-waf-v2-to-automatically-block-suspicious-hosts/

In this post, we’ll share an automation pattern that you can use to automatically detect and block suspicious hosts that are attempting to access your Amazon Web Services (AWS) resources. The automation will rely on Amazon GuardDuty to generate findings about the suspicious hosts, and then you can respond to those findings by programmatically updating AWS WAF to block the host from accessing your workloads.

You should implement security measures across your AWS resources by using a holistic approach that incorporates controls across multiple areas. In the AWS CAF Security Perspective section of the AWS Security Incident Response Guide, we define these controls across four categories:

  • Directive controls — Establish the governance, risk, and compliance models the environment will operate within
  • Preventive controls — Protect your workloads and mitigate threats and vulnerabilities
  • Detective controls — Provide full visibility and transparency over the operation of your deployments in AWS
  • Responsive controls — Drive remediation of potential deviations from your security baselines

Security automation is a key principle outlined in the Response Guide. It helps reduce operational overhead and creates repeatable, predictable approaches to monitoring and responding to events. AWS services provide the building blocks to create powerful patterns for the automated detection and remediation of threats against your AWS environments. You can configure automated flows that use both detective and responsive controls and might also feed into preventative controls to help mitigate risks in the future. Depending on the type of source event, you can automatically invoke specific actions, such as modifying access controls, terminating instances, or revoking credentials.

The patterns highlighted in this post provide an example of how to automatically remediate detected threats. You should modify these patterns to suit your defined requirements, and test and validate them before deploying them in a production environment.

AWS services used for the example pattern

Amazon GuardDuty is a continuous security monitoring and threat detection service that incorporates threat intelligence, anomaly detection, and machine learning to help protect your AWS resources, including your AWS accounts. Amazon EventBridge delivers a near-real-time stream of system events that describe changes in AWS resources. Amazon GuardDuty sends events to Amazon CloudWatch when a change in the findings takes place. In the context of GuardDuty, such changes include newly generated findings and subsequent occurrences of these findings. You can quickly set up rules to match events generated by GuardDuty findings in EventBridge events and route those events to one or more target actions. The pattern in this post routes matched events to AWS Lambda, which then updates AWS WAF web access control lists (web ACLs) and Amazon Virtual Private Cloud (Amazon VPC) network access control lists (network ACLs). AWS WAF is a web application firewall that helps protect your web applications from common web exploits that could affect application availability, security, or excess resource consumption. It supports both managed rules as well as a powerful rule language for custom rules. A network ACL is stateless and is an optional layer of security for your VPC that helps you restrict specific inbound and outbound traffic at the subnet level.

Pattern overview

This example pattern assumes that Amazon GuardDuty is enabled in your AWS account. If it isn’t enabled, you can learn more about the free trial and pricing, and follow the steps in the GuardDuty documentation to configure the service and start monitoring your account. The example code will only work in the us-east-1 AWS Region due to the use of Amazon CloudFront and web ACLs within the template.

Figure 1 shows how the AWS CloudFormation template creates the sample pattern.

Figure 1: How the CloudFormation template works

Figure 1: How the CloudFormation template works

Here’s how the pattern works, as shown in the diagram:

  1. A GuardDuty finding is generated due to suspected malicious activity.
  2. An EventBridge event is configured to filter for GuardDuty finding types by using event patterns.
  3. A Lambda function is invoked by the EventBridge event and parses the GuardDuty finding.
  4. The Lambda function checks the Amazon DynamoDB state table for an existing entry that matches the identified host. If state data is not found in the table for the identified host, a new entry is created in the Amazon DynamoDB state table.
  5. The Lambda function creates a web ACL rule inside AWS WAF and updates a subnet network ACL.
  6. A notification email is sent through Amazon Simple Notification Service (SNS).

A second Lambda function runs on a 5-minute recurring schedule and removes entries that are past the configurable retention period from AWS WAF IPSets (an IPSet is a list that contains the blocklisted IPs or CIDRs), VPC network ACLs, and the DynamoDB table.

GuardDuty prefix patterns and findings

The EventBridge event rule provided by the example automation uses the following seven prefix patterns, which allow coverage for 36 GuardDuty finding types. These specific finding types are of a network nature, and so we can use AWS WAF to block them. Be sure to read through the full list of finding types in the GuardDuty documentation to better understand what GuardDuty can report findings for. The covered findings are as follows:

  1. UnauthorizedAccess:EC2
    • UnauthorizedAccess:EC2/MaliciousIPCaller.Custom
    • UnauthorizedAccess:EC2/MetadataDNSRebind
    • UnauthorizedAccess:EC2/RDPBruteForce
    • UnauthorizedAccess:EC2/SSHBruteForce
    • UnauthorizedAccess:EC2/TorClient
    • UnauthorizedAccess:EC2/TorRelay
  2. Recon:EC2
    • Recon:EC2/PortProbeEMRUnprotectedPort
    • Recon:EC2/PortProbeUnprotectedPort
    • Recon:EC2/Portscan
  3. Trojan:EC2
    • Trojan:EC2/BlackholeTraffic
    • Trojan:EC2/BlackholeTraffic!DNS
    • Trojan:EC2/DGADomainRequest.B
    • Trojan:EC2/DGADomainRequest.C!DNS
    • Trojan:EC2/DNSDataExfiltration
    • Trojan:EC2/DriveBySourceTraffic!DNS
    • Trojan:EC2/DropPoint
    • Trojan:EC2/DropPoint!DNS
    • Trojan:EC2/PhishingDomainRequest!DNS
  4. Backdoor:EC2
    • Backdoor:EC2/C&CActivity.B
    • Backdoor:EC2/C&CActivity.B!DNS
    • Backdoor:EC2/DenialOfService.Dns
    • Backdoor:EC2/DenialOfService.Tcp
    • Backdoor:EC2/DenialOfService.Udp
    • Backdoor:EC2/DenialOfService.UdpOnTcpPorts
    • Backdoor:EC2/DenialOfService.UnusualProtocol
    • Backdoor:EC2/Spambot
  5. Impact:EC2
    • Impact:EC2/AbusedDomainRequest.Reputation
    • Impact:EC2/BitcoinDomainRequest.Reputation
    • Impact:EC2/MaliciousDomainRequest.Reputation
    • Impact:EC2/PortSweep
    • Impact:EC2/SuspiciousDomainRequest.Reputation
    • Impact:EC2/WinRMBruteForce
  6. CryptoCurrency:EC2
    • CryptoCurrency:EC2/BitcoinTool.B
    • CryptoCurrency:EC2/BitcoinTool.B!DNS
  7. Behavior:EC2
    • Behavior:EC2/NetworkPortUnusual
    • Behavior:EC2/TrafficVolumeUnusual

When activity occurs that generates one of these GuardDuty finding types and is then matched by the EventBridge event rule, an entry is created in the target web ACLs and subnet network ACLs to deny access from the suspicious host, and then a notification is sent to an email address by this pattern’s Lambda function. Blocking traffic from the suspicious host helps to mitigate potential threats while you perform additional investigation and remediation. For more information, see Remediating a compromised EC2 instance.

Solution deployment

To deploy the solution, you’ll do the following steps. Each step is described in more detail in the sections that follow.

  1. Download the required files.
  2. Create your Amazon Simple Storage Service (Amazon S3) bucket and upload the .zip files.
  3. Deploy the CloudFormation template.
  4. Create and test the Lambda function for a GuardDuty finding event.
  5. Confirm the entry for the test event in the VPC network ACL.
  6. Confirm the entry in the AWS WAF IP sets.
  7. Confirm the SNS notification email alert.
  8. Apply the AWS WAF web ACLs to resources.

Step 1: Download the required files

Download the following four files from the amazon-guardduty-waf-acl GitHub code repository:

  1. CloudFormation template – Copy and save the linked raw text, using the file name guarddutytoacl.template on your local file system.
  2. JSON event test file – Copy and save the linked raw text, using the file name gd2acl_test_event.json on your local file system.
  3. guardduty_to_acl_lambda_wafv2.zip – Choose the Download button on the GitHub page and save the .zip file to your local file system.
  4. prune_old_entries_wafv2.zip – Choose the Download button on the GitHub page and save the .zip file to your local file system.

Step 2: Create your S3 bucket and upload .zip files

For this step, create an S3 bucket with public access blocked, and then upload the Lambda .zip files to the newly created S3 bucket.

To create your S3 bucket and upload .zip files

  1. Create an S3 bucket in the us-east-1 Region.
  2. Upload the .zip files guardduty_to_acl_lambda_wafv2.zip and prune_old_entries_wafv2.zip that you saved to your local file system in Step 1 to the newly created S3 bucket.

Step 3: Deploy the CloudFormation template

For this step, deploy the CloudFormation template only to the us-east-1 Region within the AWS account where GuardDuty findings are to be monitored.

To deploy the CloudFormation template

  1. Sign in to the AWS Management Console, choose the CloudFormation service, and set N.Virginia (us-east-1) as the Region.
  2. Choose Create stack, and then choose With new resources (standard).
  3. When the Create stack landing page is presented, make sure that Template is ready is selected in the Prepare template section. In the Template source section, choose Upload a template file.
  4. Choose the Choose file button and browse to the location where the guarddutytoacl.template file was saved on your local file system. Select the file, choose Open, and then choose Next.
  5. On the Specify stack details page, provide the following input parameters. You can modify the default values to customize the pattern for your environment.

    Input parameter Input parameter description
    Notification email The email address to receive notifications. Must be a valid email address.
    Retention time, in minutes How long to retain IP addresses in the blocklist (in minutes). The default is 12 hours.
    S3 bucket for artifacts The S3 bucket with artifact files (Lambda functions, templates, HTML files, and so on). Keep the default value for deployment into the N. Virginia Region.
    S3 path to artifacts The path in the S3 bucket that contains artifact files. Keep the default value for deployment into the N. Virginia Region.
    CloudFrontWebACL Create CloudFront Web ACL? If set to true, a CloudFront IP set will be created automatically.
    RegionalWebACL Create Regional Web ACL? If set to true, a Regional IP set will be created automatically.

    Figure 2 shows an example of the values entered on this page.

    Figure 2: CloudFormation parameters on the Specify stack details page

    Figure 2: CloudFormation parameters on the Specify stack details page

  6. Enter values for all of the input parameters, and then choose Next.
  7. On the Configure stack options page, accept the defaults, and then choose Next.
  8. On the Review page, confirm the details, check the box acknowledging that the template will require capabilities for AWS::IAM::Role, and then choose Create Stack.

    The stack normally requires no more than 3–5 minutes to complete.

  9. While the stack is being created, check the email inbox that you specified for the Notification email address parameter. Look for an email message with the subject “AWS Notification – Subscription Confirmation”. Choose the link in the email to confirm the subscription to the SNS topic. You should see a message similar to the following.
    Figure 3: Subscription confirmation

    Figure 3: Subscription confirmation

When the Status field for the CloudFormation stack changes to CREATE_COMPLETE, as shown in Figure 4, the pattern is implemented and is ready for testing.

Figure 4: The stack status is CREATE_COMPLETE

Figure 4: The stack status is CREATE_COMPLETE

Step 4: Create and test the Lambda function for a GuardDuty finding event

After the CloudFormation stack has completed deployment, you can test the functionality by using a Lambda test event.

To create and run a Lambda GuardDuty finding test event

  1. In the AWS Management Console, choose Services > VPC > Subnets and locate a subnet that is suitable for testing the pattern.
  2. On the Details tab, copy the subnet ID to the clipboard or to a text editor.
    Figure 5: The subnet ID value on the Details tab

    Figure 5: The subnet ID value on the Details tab

  3. In the AWS Management Console, choose Services > CloudFormation > GuardDutytoACL stack. On the Outputs tab for the stack, look for the GuardDutytoACLLambda entry.
    Figure 6: The GuardDutytoACLLambda entry on the Outputs tab

    Figure 6: The GuardDutytoACLLambda entry on the Outputs tab

  4. Choose the link for the entry, and you’ll be redirected to the Lambda console, with the Lambda Code source page already open.
    Figure 7: The Lambda function open in the Lambda console

    Figure 7: The Lambda function open in the Lambda console

  5. In the middle of the Code source menu, in the Test dropdown list, locate and select the Configure test event option.
    Figure 8: Select Configure test event from the dropdown list

    Figure 8: Select Configure test event from the dropdown list

  6. To facilitate testing, we’ve provided a test event file. On the Configure test event page, do the following:
    1. For Event name, enter a name.
    2. In the body of the Event JSON field, paste the provided test event JSON, overwriting the existing contents.
    3. Update the value of SubnetId key (line 35) to the value of the subnet ID that you chose in Step 1 of this procedure.
    4. Choose Save.
    Figure 9: Update the value of the subnetId key

    Figure 9: Update the value of the subnetId key

  7. Choose Test to invoke the Lambda function with the test event. You should see the message “Status: succeeded” at the top of the execution results, similar to what is shown in Figure 10.
    Figure 10: The Test button and the “succeeded” message

    Figure 10: The Test button and the “succeeded” message

Step 5: Confirm the entry in the VPC network ACL

In this step, you’ll confirm that the DENY entry was created in the network ACL. This pattern is configured to create up to 10 entries in an ACL, ranging between rule numbers 71 and 80. Because network ACL rules are processed in order, it’s important that the DENY rule is placed before the ALLOW rule.

To confirm the entry in the VPC network ACL

  1. In the AWS Management Console, choose Services > VPC > Subnets, and locate the subnet you provided for the test event.
  2. Choose the network ACL link and confirm that the new DENY entry was generated from the test event.
    Figure 11: Check the entry from the test event on the Network tab

    Figure 11: Check the entry from the test event on the Network tab

    Note that VPC network ACL entries are created in the rule number range between 71 and 80. Older entries are aged out to create a “sliding window” of blocked hosts.

Step 6: Confirm the entry in the AWS WAF IP sets and blocklists

Next, verify that the entry was added to the CloudFront AWS WAF IP set and to the Application Load Balancer (ALB) AWS WAF IP set.

To confirm the entry in the AWS WAF IP set and blocklist

  1. In the AWS Management Console, choose Services > WAF & Shield > Web ACLs, and then set the selected Region to Global (CloudFront).
  2. Find and select the web ACL name that starts with CloudFrontBlockListWeb. In the Rule view, on the Rules tab, select the rule named CloudFrontBlocklistIPSetRule. Note that 198.51.100.0/32 appears as an entry in the rule.
    Figure 12: Confirm that the IP address was added

    Figure 12: Confirm that the IP address was added

  3. In the AWS Management Console, on the left navigation menu, choose Web ACLs, and then set the selected Region to US East (N. Virginia).
  4. Find and select the web ACL name that starts with RegionalBlocklistACL. In the Rule view, on the Rules tab, select the rule named RegionalBlocklistIPSetRule. Note that 198.51.100.0/32 appears as an entry in the rule.
    Figure 13: Make sure that the IP address was added

    Figure 13: Make sure that the IP address was added

There might be specific host addresses that you want to prevent from being added to the blocklist. You can do this within GuardDuty by using a trusted IP list. Trusted IP lists consist of IP addresses that you have allowlisted for secure communication with your AWS infrastructure and applications. GuardDuty doesn’t generate findings for IP addresses on trusted IP lists. For more information, see Working with trusted IP lists and threat lists.

Step 7: Confirm the SNS notification email

Finally, verify that the SNS notification was sent to the email address you set up.

To confirm receipt of the SNS notification email

  • Review the email inbox that you specified for the AdminEmail parameter and look for a message with the subject line “AWS GD2ACL Alert”. The contents of the message from SNS should be similar to the following.
    Figure 14: SNS message example

    Figure 14: SNS message example

Step 8: Apply the AWS WAF web ACLs to resources

The final task is to associate the web ACL with the CloudFront distributions and Application Load Balancers that you want to automatically update with this pattern. To learn how to do this, see Associating or disassociating a web ACL with an AWS resource.

You can also use AWS Firewall Manager to associate the web ACLs. AWS Firewall Manager can simplify your AWS WAF administration and maintenance tasks across multiple accounts and resources. With Firewall Manager, you set up your firewall rules just once. The service automatically applies your rules across your accounts and resources, even as you add new resources.

Conclusion

In this post, you’ve learned how to use Lambda to automatically update AWS WAF and VPC network ACLs in response to GuardDuty findings. With just a few steps, you can use this sample pattern to help mitigate threats by blocking communication with suspicious hosts. You can explore additional possible patterns by using GuardDuty finding types and Amazon EventBridge target actions. This pattern’s code is available on GitHub. Feel free to play around with the code to add more GuardDuty findings to this pattern and also to build bigger and better patterns! Make sure to modify the patterns in this post to suit your defined requirements, and test and validate them before deploying them in a production environment.

If you have comments about this blog post, you can submit them in the Comments section below. If you have questions about using this pattern, start a thread in the GuardDutyAWS WAF, or CloudWatch forums, or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Eucke Warren

Eucke Warren

Eucke is a Sr Solution Architect helping ISV customers grow and mature securely. He has been fortunate to be able to work with technology for more than 30 years and counts automation, infrastructure, and security as areas of focus. When he’s not supporting customers, he enjoys time with his wife, family, and the company of a very bossy 18-pound dog.

Geoff Sweet

Geoff Sweet

Geoff has been in industry for over 20 years. He began his career in Electrical Engineering. Starting in IT during the dot-com boom, he has held a variety of diverse roles, such as systems architect, network architect, and, for the past several years, security architect. Geoff specializes in infrastructure security.

Extending CloudFormation and CDK with Third-Party Extensions

Post Syndicated from Lucas Chen original https://aws.amazon.com/blogs/devops/extending-cloudformation-and-cdk-with-third-party-extensions/

Did you know you can use CloudFormation to manage third-party resources? The AWS CloudFormation Public Registry provides a searchable collection of CloudFormation extensions and makes it easy to discover and provision them in CloudFormation templates and AWS Cloud Development Kit (CDK) applications. In the past three months, we’ve added a number of new, exciting partners to the Public Registry, including GitLab, Okta, and PagerDuty.

The extensions available on the registry are wide-ranging and include third-party resources from partners such as MongoDB; hooks, which are preventative controls that add safeguards to provisioning; and modules, which are re-usable components that take into account best practices and opinionated definitions of resources. AWS Partner Network (APN), third parties, and the developer community contribute these extensions to the Public Registry. Using extensions, customers no longer need to create and maintain custom provisioning logic for resource types from third-party vendors.

Over last few months, AWS collaborated with partners to develop and publish over 80 new resources across 14 providers to Public Registry for CloudFormation. Below is a summary of the new resource type additions.

Recently Updated Third-Party Providers

Provider Use case
MongoDB Atlas

Manage components in MongoDB Atlas. Add, edit, or delete administrative objects within Atlas, including projects, users, and database deployments

Note: You cannot read or write data to Atlas Clusters with Atlas Admin APIs and AWS CloudFormation resources. To read and write data in Atlas, you must use the Atlas Data API

GitLab Manage the users and groups in an organization, set up a new project with the right users, groups, and access token, tag a project automatically for every active CI/CD deployment
New Relic Create a new Dashboard with custom Pages, Widgets and Layout, add tags to your data to help improve data organization and findability, workloads-related tasks
GitHub Manage the users and groups in an organization, set up a new project with the right users, groups, and access token, Add a webhook to a repo
Dynatrace Set up a new project with service level objective, locations, monitors and metrics
Okta Onboard a new application into Okta with the right users and groups
PagerDuty Set up monitoring of a new or existing application
Databricks Set up a Databricks cluster and jobs
Fastly Configure Fastly as a CDN for your web app
BigID Connect S3 and DynamoDB data sources into your BigID application
Rollbar Set up a new Rollbar project and manage rules, teams, and users
Cloudflare Configure a DNS record and load-balancing using Cloudflare
Lacework Configure Lacework alert profiles, rules, channels and manage queries
Snowflake Create databases, users, and manage privileges

Key Benefits

Here are some of the benefits for extension builders and consumers when publishing extensions to the public registry:

  1. Discoverability – Publishing your extensions in the public registry will make them discoverable by 1M+ active CloudFormation and CDK customers.
  2. CDK Support – We’re seeing rapid growth in the adoption of the CDK amongst the developer population. Upon publishing to the registry, L1 CDK Constructs will automatically be created for your third party resources making them compatible with the CDK with no added work required. These constructs will also be listed on Construct Hub and aids discoverability discoverable by customers. Note: Automated L1 CDK construct generation is currently an experimental feature.
  3. Drift detection – Third-party resource types in the public registry also integrate with drift detection. After creating a resource from a third-party resource type, CloudFormation will detect changes to the third-party resource from its template configuration, known as configuration drift, just as it would with AWS resources.
  4. AWS Config – You can also use AWS Config to manage compliance for third-party resources consumed from the registry. The resource types are automatically tracked as Configuration Items when you have configured AWS Config to record them, and used CloudFormation to create, update, and delete them. Whether the resource types you use are third-party or AWS resources, you can view configuration history for them, in addition to being able to write AWS Config rules to verify configuration best practices.
  5. Abstraction of Best Practices with Modules – Browse and use modules from the registry when creating your CloudFormation templates to ensure you’re provisioning resources while adhering to best practices.
  6. AWS Cloud Control API – The AWS Cloud Control API allows AWS partners and customers to interface with your resource type through API calls using Create, Read, Update, Delete, and List (CRUD-L) operations. Resources in the registry will be automatically integrated with our AWS Cloud Control API and expands your third party resource compatibility to even more AWS services and IaC tools.

We’ve seen great momentum from our partners and developer community over the past year. We are looking forward to continued investment and innovation in the Public Registry.

How to Get Started

For Resource Type Users: Explore and Activate Third Party Resource Types

Third party resource types must first be activated before they can be used. You do this by logging into your AWS Console > Navigate to CloudFormation > Registry > Public extensions > Set the Publisher to Third Party. This will show you a list of available third-party resources in your region (note that different regions may have a different set of third-party resource types). Select the radio box next to the resource types you want to activate and click the activate button at the top of the list.

Figure 1:

Don’t see the extension you need in the registry?

You can submit requests for new third-party extensions through our Community Registry Extensions Github repo issue tracker! Click the New Issue button and describe the third-party extension along with information about your use case.

For Developers and Publishers: Join the CloudFormation Developer Community and Start Building

You can see several of the community-built registry extensions in the AWS CloudFormation Community Registry Extensions repository and even contribute yourself. You can also read about the experiences and lessons learned from publishing to the Registry through this blog written by Cloudsoft.

For developers looking to create new resource types to add to the public Registry, follow this creating resource types walkthrough help you get started. If you need assistance creating, publishing resources, or just want to join the discussion, you can join the conversation today in our CloudFormation Discord Channel. We’d love to hear about your experiences and use cases in developing innovations with registry extensions.

About the authors:

Anuj Sharma

Anuj Sharma is a Sr Container Partner Solution Architect with Amazon Web Services. He works with ISV partners and drives Partner-AWS product development and integrations.

Lucas Chen

Lucas is a Senior Product Manager at Amazon Web Services. He leads the CloudFormation Registry and its integrations with third-party products. Prior to AWS, he spent 9 years at VMware working on its end user computing product, Workspace ONE.

Rahul Sharma

Rahul is a Senior Product Manager-Technical at Amazon Web Services with over two years of product management spanning AWS CloudFormation and AWS Cloud Control API.

Implement slowly changing dimensions in a data lake using AWS Glue and Delta

Post Syndicated from Nith Govindasivan original https://aws.amazon.com/blogs/big-data/implement-slowly-changing-dimensions-in-a-data-lake-using-aws-glue-and-delta/

In a data warehouse, a dimension is a structure that categorizes facts and measures in order to enable users to answer business questions. To illustrate an example, in a typical sales domain, customer, time or product are dimensions and sales transactions is a fact. Attributes within the dimension can change over time—a customer can change their address, an employee can move from a contractor position to a full-time position, or a product can have multiple revisions to it. A slowly changing dimension (SCD) is a data warehousing concept that contains relatively static data that can change slowly over a period of time. There are three major types of SCDs maintained in data warehousing: Type 1 (no history), Type 2 (full history), and Type 3 (limited history). Change data capture (CDC) is a characteristic of a database that provides an ability to identify the data that changed between two database loads, so that an action can be performed on the changed data.

As organizations across the globe are modernizing their data platforms with data lakes on Amazon Simple Storage Service (Amazon S3), handling SCDs in data lakes can be challenging. It becomes even more challenging when source systems don’t provide a mechanism to identify the changed data for processing within the data lake and makes the data processing highly complex if the data source happens to be semi-structured instead of a database. The key objective while handling Type 2 SCDs is to define the start and end dates to the dataset accurately to track the changes within the data lake, because this provides the point-in-time reporting capability for the consuming applications.

In this post, we focus on demonstrating how to identify the changed data for a semi-structured source (JSON) and capture the full historical data changes (SCD Type 2) and store them in an S3 data lake, using AWS Glue and open data lake format Delta.io. This implementation supports the following use cases:

  • Track Type 2 SCDs with start and end dates to identify the current and full historical records and a flag to identify the deleted records in the data lake (logical deletes)
  • Use consumption tools such as Amazon Athena to query historical records seamlessly

Solution overview

This post demonstrates the solution with an end-to-end use case using a sample employee dataset. The dataset represents employee details such as ID, name, address, phone number, contractor or not, and more. To demonstrate the SCD implementation, consider the following assumptions:

  • The data engineering team receives daily files that are full snapshots of records and don’t contain any mechanism to identify source record changes
  • The team is tasked with implementing SCD Type 2 functionality for identifying new, updated, and deleted records from the source, and to preserve the historical changes in the data lake
  • Because the source systems don’t provide the CDC capability, a mechanism needs to be developed to identify the new, updated, and deleted records and persist them in the data lake layer

The architecture is implemented as follows:

  • Source systems ingest files in the S3 landing bucket (this step is mimicked by generating the sample records using the provided AWS Lambda function into the landing bucket)
  • An AWS Glue job (Delta job) picks the source data file and processes the changed data from the previous file load (new inserts, updates to the existing records, and deleted records from the source) into the S3 data lake (processed layer bucket)
  • The architecture uses the open data lake format (Delta), and builds the S3 data lake as a Delta Lake, which is mutable, because the new changes can be updated, new inserts can be appended, and source deletions can be identified accurately and marked with a delete_flag value
  • An AWS Glue crawler catalogs the data, which can be queried by Athena

The following diagram illustrates our architecture.

Prerequisites

Before you get started, make sure you have the following prerequisites:

Deploy the solution

For this solution, we provide a CloudFormation template that sets up the services included in the architecture, to enable repeatable deployments. This template creates the following resources:

  • Two S3 buckets: a landing bucket for storing sample employee data and a processed layer bucket for the mutable data lake (Delta Lake)
  • A Lambda function to generate sample records
  • An AWS Glue extract, transform, and load (ETL) job to process the source data from the landing bucket to the processed bucket

To deploy the solution, complete the following steps:

  1. Choose Launch Stack to launch the CloudFormation stack:

  1. Enter a stack name.
  2. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  3. Choose Create stack.

After the CloudFormation stack deployment is complete, navigate to AWS CloudFormation console to note the following resources on the Outputs tab:

  • Data lake resources – The S3 buckets scd-blog-landing-xxxx and scd-blog-processed-xxxx (referred to as scd-blog-landing and scd-blog-processed in the subsequent sections in this post)
  • Sample records generator Lambda functionSampleDataGenaratorLambda-<CloudFormation Stack Name> (referred to as SampleDataGeneratorLambda)
  • AWS Glue Data Catalog databasedeltalake_xxxxxx (referred to as deltalake)
  • AWS Glue Delta job<CloudFormation-Stack-Name>-src-to-processed (referred to as src-to-processed)

Note that deploying the CloudFormation stack in your account incurs AWS usage charges.

Test SCD Type 2 implementation

With the infrastructure in place, you’re ready to test out the overall solution design and query historical records from the employee dataset. This post is designed to be implemented for a real customer use case, where you get full snapshot data on a daily basis. We test the following aspects of SCD implementation:

  • Run an AWS Glue job for the initial load
  • Simulate a scenario where there are no changes to the source
  • Simulate insert, update, and delete scenarios by adding new records, and modifying and deleting existing records
  • Simulate a scenario where the deleted record comes back as a new insert

Generate a sample employee dataset

To test the solution, and before you can start your initial data ingestion, the data source needs to be identified. To simplify that step, a Lambda function has been deployed in the CloudFormation stack you just deployed.

Open the function and configure a test event, with the default hello-world template event JSON as seen in the following screenshot. Provide an event name without any changes to the template and save the test event.

Choose Test to invoke a test event, which invokes the Lambda function to generate the sample records.

When the Lambda function completes its invocation, you will be able to see the following sample employee dataset in the landing bucket.

Run the AWS Glue job

Confirm if you see the employee dataset in the path s3://scd-blog-landing/dataset/employee/. You can download the dataset and open it in a code editor such as VS Code. The following is an example of the dataset:

{"emp_id":1,"first_name":"Melissa","last_name":"Parks","Address":"19892 Williamson Causeway Suite 737\nKarenborough, IN 11372","phone_number":"001-372-612-0684","isContractor":false}
{"emp_id":2,"first_name":"Laura","last_name":"Delgado","Address":"93922 Rachel Parkways Suite 717\nKaylaville, GA 87563","phone_number":"001-759-461-3454x80784","isContractor":false}
{"emp_id":3,"first_name":"Luis","last_name":"Barnes","Address":"32386 Rojas Springs\nDicksonchester, DE 05474","phone_number":"127-420-4928","isContractor":false}
{"emp_id":4,"first_name":"Jonathan","last_name":"Wilson","Address":"682 Pace Springs Apt. 011\nNew Wendy, GA 34212","phone_number":"761.925.0827","isContractor":true}
{"emp_id":5,"first_name":"Kelly","last_name":"Gomez","Address":"4780 Johnson Tunnel\nMichaelland, WI 22423","phone_number":"+1-303-418-4571","isContractor":false}
{"emp_id":6,"first_name":"Robert","last_name":"Smith","Address":"04171 Mitchell Springs Suite 748\nNorth Juliaview, CT 87333","phone_number":"261-155-3071x3915","isContractor":true}
{"emp_id":7,"first_name":"Glenn","last_name":"Martinez","Address":"4913 Robert Views\nWest Lisa, ND 75950","phone_number":"001-638-239-7320x4801","isContractor":false}
{"emp_id":8,"first_name":"Teresa","last_name":"Estrada","Address":"339 Scott Valley\nGonzalesfort, PA 18212","phone_number":"435-600-3162","isContractor":false}
{"emp_id":9,"first_name":"Karen","last_name":"Spencer","Address":"7284 Coleman Club Apt. 813\nAndersonville, AS 86504","phone_number":"484-909-3127","isContractor":true}
{"emp_id":10,"first_name":"Daniel","last_name":"Foley","Address":"621 Sarah Lock Apt. 537\nJessicaton, NH 95446","phone_number":"457-716-2354x4945","isContractor":true}
{"emp_id":11,"first_name":"Amy","last_name":"Stevens","Address":"94661 Young Lodge Suite 189\nCynthiamouth, PR 01996","phone_number":"241.375.7901x6915","isContractor":true}
{"emp_id":12,"first_name":"Nicholas","last_name":"Aguirre","Address":"7474 Joyce Meadows\nLake Billy, WA 40750","phone_number":"495.259.9738","isContractor":true}
{"emp_id":13,"first_name":"John","last_name":"Valdez","Address":"686 Brian Forges Suite 229\nSullivanbury, MN 25872","phone_number":"+1-488-011-0464x95255","isContractor":false}
{"emp_id":14,"first_name":"Michael","last_name":"West","Address":"293 Jones Squares Apt. 997\nNorth Amandabury, TN 03955","phone_number":"146.133.9890","isContractor":true}
{"emp_id":15,"first_name":"Perry","last_name":"Mcguire","Address":"2126 Joshua Forks Apt. 050\nPort Angela, MD 25551","phone_number":"001-862-800-3814","isContractor":true}
{"emp_id":16,"first_name":"James","last_name":"Munoz","Address":"74019 Banks Estates\nEast Nicolefort, GU 45886","phone_number":"6532485982","isContractor":false}
{"emp_id":17,"first_name":"Todd","last_name":"Barton","Address":"2795 Kelly Shoal Apt. 500\nWest Lindsaytown, TN 55404","phone_number":"079-583-6386","isContractor":true}
{"emp_id":18,"first_name":"Christopher","last_name":"Noble","Address":"Unit 7816 Box 9004\nDPO AE 29282","phone_number":"215-060-7721","isContractor":true}
{"emp_id":19,"first_name":"Sandy","last_name":"Hunter","Address":"7251 Sarah Creek\nWest Jasmine, CO 54252","phone_number":"8759007374","isContractor":false}
{"emp_id":20,"first_name":"Jennifer","last_name":"Ballard","Address":"77628 Owens Key Apt. 659\nPort Victorstad, IN 02469","phone_number":"+1-137-420-7831x43286","isContractor":true}
{"emp_id":21,"first_name":"David","last_name":"Morris","Address":"192 Leslie Groves Apt. 930\nWest Dylan, NY 04000","phone_number":"990.804.0382x305","isContractor":false}
{"emp_id":22,"first_name":"Paula","last_name":"Jones","Address":"045 Johnson Viaduct Apt. 732\nNorrisstad, AL 12416","phone_number":"+1-193-919-7527x2207","isContractor":true}
{"emp_id":23,"first_name":"Lisa","last_name":"Thompson","Address":"1295 Judy Ports Suite 049\nHowardstad, PA 11905","phone_number":"(623)577-5982x33215","isContractor":true}
{"emp_id":24,"first_name":"Vickie","last_name":"Johnson","Address":"5247 Jennifer Run Suite 297\nGlenberg, NC 88615","phone_number":"708-367-4447x9366","isContractor":false}
{"emp_id":25,"first_name":"John","last_name":"Hamilton","Address":"5899 Barnes Plain\nHarrisville, NC 43970","phone_number":"341-467-5286x20961","isContractor":false}

Download the dataset and keep it ready, because you will modify the dataset for future use cases to simulate the inserts, updates, and deletes. The sample dataset generated for you will be entirely different than what you see in the preceding example.

To run the job, complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose the job src-to-processed.
  3. On the Runs tab, choose Run.

When the AWS Glue job is run for the first time, the job reads the employee dataset from the landing bucket path and ingests the data to the processed bucket as a Delta table.

When the job is complete, you can create a crawler to see the initial data load. The following screenshot shows the database available on the Databases page.

  1. Choose Crawlers in the navigation pane.
  2. Choose Create crawler.

  1. Name your crawler delta-lake-crawler, then choose Next.

  1. Select Not yet for data already mapped to AWS Glue tables.
  2. Choose Add a data source.

  1. On the Data source drop-down menu, choose Delta Lake.
  2. Enter the path to the Delta table.
  3. Select Create Native tables.
  4. Choose Add a Delta Lake data source.

  1. Choose Next.

  1. Choose the role that was created by the CloudFormation template, then choose Next.

  1. Choose the database that was created by the CloudFormation template, then choose Next.

  1. Choose Create crawler.

  1. Select your crawler and choose Run.

Query the data

After the crawler is complete, you can see the table it created.

To query the data, complete the following steps:

  1. Choose the employee table and on the Actions menu, choose View data.

You’re redirected to the Athena console. If you don’t have the latest Athena engine, create a new Athena workgroup with the latest Athena engine.

  1. Under Administration in the navigation pane, choose Workgroups.

  1. Choose Create workgroup.

  1. Provide a name for the workgroup, such as DeltaWorkgroup.
  2. Select Athena SQL as the engine, and choose Athena engine version 3 for Query engine version.

  1. Choose Create workgroup.

  1. After you create the workgroup, select the workgroup (DeltaWorkgroup) on the drop-down menu in the Athena query editor.

  1. Run the following query on the employee table:
SELECT * FROM "deltalake_2438fbd0"."employee";

Note: Update the correct database name from the CloudFormation outputs before running the above query.

You can observe that the employee table has 25 records. The following screenshot shows the total employee records with some sample records.

The Delta table is stored with an emp_key, which is unique to each and every change and is used to track the changes. The emp_key is created for every insert, update, and delete, and can be used to find all the changes pertaining to a single emp_id.

The emp_key is created using the SHA256 hashing algorithm, as shown in the following code:

df.withColumn("emp_key", sha2(concat_ws("||", col("emp_id"), col("first_name"), col("last_name"), col("Address"),
            col("phone_number"), col("isContractor")), 256))

Perform inserts, updates, and deletes

Before making changes to the dataset, let’s run the same job one more time. Assuming that the current load from the source is the same as the initial load with no changes, the AWS Glue job shouldn’t make any changes to the dataset. After the job is complete, run the previous Select query in the Athena query editor and confirm that there are still 25 active records with the following values:

  • All 25 records with the column isCurrent=true
  • All 25 records with the column end_date=Null
  • All 25 records with the column delete_flag=false

After you confirm the previous job run with these values, let’s modify our initial dataset with the following changes:

  1. Change the isContractor flag to false (change it to true if your dataset already shows false) for emp_id=12.
  2. Delete the entire row where emp_id=8 (make sure to save the record in a text editor, because we use this record in another use case).
  3. Copy the row for emp_id=25 and insert a new row. Change the emp_id to be 26, and make sure to change the values for other columns as well.

After we make these changes, the employee source dataset looks like the following code (for readability, we have only included the changed records as described in the preceding three steps):

{"emp_id":12,"first_name":"Nicholas","last_name":"Aguirre","Address":"7474 Joyce Meadows\nLake Billy, WA 40750","phone_number":"495.259.9738","isContractor":false}
{"emp_id":26,"first_name":"John-copied","last_name":"Hamilton-copied","Address":"6000 Barnes Plain\nHarrisville-city, NC 5000","phone_number":"444-467-5286x20961","isContractor":true}
  1. Now, upload the changed fake_emp_data.json file to the same source prefix.

  1. After you upload the changed employee dataset to Amazon S3, navigate to the AWS Glue console and run the job.
  2. When the job is complete, run the following query in the Athena query editor and confirm that there are 27 records in total with the following values:
SELECT * FROM "deltalake_2438fbd0"."employee";

Note: Update the correct database name from the CloudFormation output before running the above query.

  1. Run another query in the Athena query editor and confirm that there are 4 records returned with the following values:
SELECT * FROM "AwsDataCatalog"."deltalake_2438fbd0"."employee" where emp_id in (8, 12, 26)
order by emp_id;

Note: Update the correct database name from the CloudFormation output before running the above query.

You will see two records for emp_id=12:

  • One emp_id=12 record with the following values (for the record that was ingested as part of the initial load):
    • emp_key=44cebb094ef289670e2c9325d5f3e4ca18fdd53850b7ccd98d18c7a57cb6d4b4
    • isCurrent=false
    • delete_flag=false
    • end_date=’2023-03-02’
  • A second emp_id=12 record with the following values (for the record that was ingested as part of the change to the source):
    • emp_key=b60547d769e8757c3ebf9f5a1002d472dbebebc366bfbc119227220fb3a3b108
    • isCurrent=true
    • delete_flag=false
    • end_date=Null (or empty string)

The record for emp_id=8 that was deleted in the source as part of this run will still exist but with the following changes to the values:

  • isCurrent=false
  • end_date=’2023-03-02’
  • delete_flag=true

The new employee record will be inserted with the following values:

  • emp_id=26
  • isCurrent=true
  • end_date=NULL (or empty string)
  • delete_flag=false

Note that the emp_key values in your actual table may be different than what is provided here as an example.

  1. For the deletes, we check for the emp_id from the base table along with the new source file and inner join the emp_key.
  2. If the condition evaluates to true, we then check if the employee base table emp_key equals the new updates emp_key, and get the current, undeleted record (isCurrent=true and delete_flag=false).
  3. We merge the delete changes from the new file with the base table for all the matching delete condition rows and update the following:
    1. isCurrent=false
    2. delete_flag=true
    3. end_date=current_date

See the following code:

delete_join_cond = "employee.emp_id=employeeUpdates.emp_id and employee.emp_key = employeeUpdates.emp_key"
delete_cond = "employee.emp_key == employeeUpdates.emp_key and employee.isCurrent = true and employeeUpdates.delete_flag = true"

base_tbl.alias("employee")\
        .merge(union_updates_dels.alias("employeeUpdates"), delete_join_cond)\
        .whenMatchedUpdate(condition=delete_cond, set={"isCurrent": "false",
                                                        "end_date": current_date(),
                                                        "delete_flag": "true"}).execute()
  1. For both the updates and the inserts, we check for the condition if the base table employee.emp_id is equal to the new changes.emp_id and the employee.emp_key is equal to new changes.emp_key, while only retrieving the current records.
  2. If this condition evaluates to true, we then get the current record (isCurrent=true and delete_flag=false).
  3. We merge the changes by updating the following:
    1. If the second condition evaluates to true:
      1. isCurrent=false
      2. end_date=current_date
    2. Or we insert the entire row as follows if the second condition evaluates to false:
      1. emp_id=new record’s emp_key
      2. emp_key=new record’s emp_key
      3. first_name=new record’s first_name
      4. last_name=new record’s last_name
      5. address=new record’s address
      6. phone_number=new record’s phone_number
      7. isContractor=new record’s isContractor
      8. start_date=current_date
      9. end_date=NULL (or empty string)
      10. isCurrent=true
      11. delete_flag=false

See the following code:

upsert_cond = "employee.emp_id=employeeUpdates.emp_id and employee.emp_key = employeeUpdates.emp_key and employee.isCurrent = true"
upsert_update_cond = "employee.isCurrent = true and employeeUpdates.delete_flag = false"

base_tbl.alias("employee").merge(union_updates_dels.alias("employeeUpdates"), upsert_cond)\
    .whenMatchedUpdate(condition=upsert_update_cond, set={"isCurrent": "false",
                                                            "end_date": current_date()
                                                            }) \
    .whenNotMatchedInsert(
    values={
        "isCurrent": "true",
        "emp_id": "employeeUpdates.emp_id",
        "first_name": "employeeUpdates.first_name",
        "last_name": "employeeUpdates.last_name",
        "Address": "employeeUpdates.Address",
        "phone_number": "employeeUpdates.phone_number",
        "isContractor": "employeeUpdates.isContractor",
        "emp_key": "employeeUpdates.emp_key",
        "start_date": current_date(),
        "delete_flag":  "employeeUpdates.delete_flag",
        "end_date": "null"
    })\
    .execute()

As a last step, let’s bring back the deleted record from the previous change to the source dataset and see how it is reinserted into the employee table in the data lake and observe how the complete history is maintained.

Let’s modify our changed dataset from the previous step and make the following changes.

  1. Add the deleted emp_id=8 back to the dataset.

After making these changes, my employee source dataset looks like the following code (for readability, we have only included the added record as described in the preceding step):

{"emp_id":8,"first_name":"Teresa","last_name":"Estrada","Address":"339 Scott Valley\nGonzalesfort, PA 18212","phone_number":"435-600-3162","isContractor":false}

  1. Upload the changed employee dataset file to the same source prefix.
  2. After you upload the changed fake_emp_data.json dataset to Amazon S3, navigate to the AWS Glue console and run the job again.
  3. When the job is complete, run the following query in the Athena query editor and confirm that there are 28 records in total with the following values:
SELECT * FROM "deltalake_2438fbd0"."employee";

Note: Update the correct database name from the CloudFormation output before running the above query.

  1. Run the following query and confirm there are 5 records:
SELECT * FROM "AwsDataCatalog"."deltalake_2438fbd0"."employee" where emp_id in (8, 12, 26)
order by emp_id;

Note: Update the correct database name from the CloudFormation output before running the above query.

You will see two records for emp_id=8:

  • One emp_id=8 record with the following values (the old record that was deleted):
    • emp_key=536ba1ba5961da07863c6d19b7481310e64b58b4c02a89c30c0137a535dbf94d
    • isCurrent=false
    • deleted_flag=true
    • end_date=’2023-03-02
  • Another emp_id=8 record with the following values (the new record that was inserted in the last run):
    • emp_key=536ba1ba5961da07863c6d19b7481310e64b58b4c02a89c30c0137a535dbf94d
    • isCurrent=true
    • deleted_flag=false
    • end_date=NULL (or empty string)

The emp_key values in your actual table may be different than what is provided here as an example. Also note that because this is a same deleted record that was reinserted in the subsequent load without any changes, there will be no change to the emp_key.

End-user sample queries

The following are some sample end-user queries to demonstrate how the employee change data history can be traversed for reporting:

  • Query 1 – Retrieve a list of all the employees who left the organization in the current month (for example, March 2023).
SELECT * FROM "deltalake_2438fbd0"."employee" where delete_flag=true and date_format(CAST(end_date AS date),'%Y/%m') ='2023/03'

Note: Update the correct database name from the CloudFormation output before running the above query.

The preceding query would return two employee records who left the organization.

  • Query 2 – Retrieve a list of new employees who joined the organization in the current month (for example, March 2023).
SELECT * FROM "deltalake_2438fbd0"."employee" where date_format(start_date,'%Y/%m') ='2023/03' and iscurrent=true

Note: Update the correct database name from the CloudFormation output before running the above query.

The preceding query would return 23 active employee records who joined the organization.

  • Query 3 – Find the history of any given employee in the organization (in this case employee 18).
SELECT * FROM "deltalake_2438fbd0"."employee" where emp_id=18

Note: Update the correct database name from the CloudFormation output before running the above query.

In the preceding query, we can observe that employee 18 had two changes to their employee records before they left the organization.

Note that the data results provided in this example are different than what you will see in your specific records based on the sample data generated by the Lambda function.

Clean up

When you have finished experimenting with this solution, clean up your resources, to prevent AWS charges from being incurred:

  1. Empty the S3 buckets.
  2. Delete the stack from the AWS CloudFormation console.

Conclusion

In this post, we demonstrated how to identify the changed data for a semi-structured data source and preserve the historical changes (SCD Type 2) on an S3 Delta Lake, when source systems are unable to provide the change data capture capability, with AWS Glue. You can further extend this solution to enable downstream applications to build additional customizations from CDC data captured in the data lake.

Additionally, you can extend this solution as part of an orchestration using AWS Step Functions or other commonly used orchestrators your organization is familiar with. You can also extend this solution by adding partitions where appropriate. You can also maintain the delta table by compacting the small files.


About the authors

Nith Govindasivan, is a Data Lake Architect with AWS Professional Services, where he helps onboarding customers on their modern data architecture journey through implementing Big Data & Analytics solutions. Outside of work, Nith is an avid Cricket fan, watching almost any cricket during his spare time and enjoys long drives, and traveling internationally.

Vijay Velpula is a Data Architect with AWS Professional Services. He helps customers implement Big Data and Analytics Solutions. Outside of work, he enjoys spending time with family, traveling, hiking and biking.

Sriharsh Adari is a Senior Solutions Architect at Amazon Web Services (AWS), where he helps customers work backwards from business outcomes to develop innovative solutions on AWS. Over the years, he has helped multiple customers on data platform transformations across industry verticals. His core area of expertise include Technology Strategy, Data Analytics, and Data Science. In his spare time, he enjoys playing sports, binge-watching TV shows, and playing Tabla.

Customize marketing messages and promotions for personalized outreach

Post Syndicated from binpazho original https://aws.amazon.com/blogs/messaging-and-targeting/customize-marketing-messages-and-promotions-for-personalized-outreach/

Introduction

Amazon Pinpoint is widely used by many customers for their various user engagement use cases like marketing campaigns, scheduled communications (newsletters, reminders, etc.), and transactional messaging. By using the message template feature in Amazon Pinpoint, customers can design messages personalized to the specific end users, by using variable attributes. While Amazon Pinpoint enables customers to include up to 250 attributes for each user, often times there might be need to pick and choose from a wide range of attributes about a user, that can lead to needing more than the allowed number of attributes.

The CampaignHook feature of Amazon Pinpoint can come to rescue for a situation like this. Using the CampainHook feature, we can filter out attributes that are not applicable to a specific user, while allowing to add new attributes, right before of sending the message. In this blog, I will walk you through how I have implemented the CampaignHook feature for a similar use case.

Sample Use-Cases

When setting up your Pinpoint campaign, following are the use cases where a CampaignHook can be enabled:

  • Retrieving data and perform custom compute logic in real time from third party data stores.
  • Filter endpoints out of the send: This is useful if you need to do some type of custom logic that you can’t do in Segmentation (custom opt-out, quiet time, campaign prioritization, etc.)
  • Avoid costly and time consuming Extract, Transform & Load (ETL) processes by accessing the data sources directly and applying custom compute logic in real-time.

Solution overview

CampaignHook Demo Architecture

The diagram above shows the solution that we will setup in this blog. As you can see, the Campaign event will trigger the Amazon Pinpoint Campaign. The event can be triggered from your web or mobile app that are accessed by your end-users, and can be setup to be triggered when the user performs a certain action. You can read more about setting up Amazon Pinpoint campaign in the user guide. By having the CampaignHook enabled on your Amazon Pinpoint campaign, the Lambda function that is configured with the CampaignHook will be triggered. This function will have access to the endpoint attributes passed by the Campaign event, and perform additional logic to derive new attributes for the user. Once all the new fields are derived, the function will update the user endpoint. Amazon pinpoint will then perform the next steps in the Campaign, and substitute the variables in the message template, before the personalized message is sent to the end user.

Prerequisites

  • AWS Account with Console and Programmatic access
  • Access to AWS CloudShell
  • Email channel enabled in Amazon Pinpoint

Building the demo

Build the Amazon Pinpoint Project

From the AWS Management console, go to Amazon Pinpoint and create a new project called “PinpointCampaignHookDemo”, and choose the option to enable the email channel. For more information about creating a project see the user guide, and follow the instructions here to setup your email channel.

If your account is in the Sandbox account, you will need to verify the email address, before you can send the email. You can follow the steps here to upgrade your account to a Production status if you are ready to deploy this solution to production.

Create the segment.

A segment is a group of your users that share certain attributes. For example, a segment might contain all of your users who use version 2.0 of your app on an Android device, or all users who live in the city of Los Angeles. You can send multiple campaigns to a single segment, and you can send a single campaign to multiple segments.

For this demo, let’s create a Dynamic Segment. Let’s call it ‘CampaignHookDemoSegment’.  Follow the steps here to create your Dynamic Segment.

Create a Segment

Setup message template

Let’s create our first template and call it “CampaignHookDemoTemplate”. You can read more about Amazon Pinpoint templates in the user guide.

For this demo, I have used the HTML template shown below, and I have 3 endpoint attribute variables: 2 that are passed from the campaign event trigger, and the third one (Company) that will be generated by the CampaignHook lambda function. For the subject of the email, I used “Campaign Hook Demo Campaign“.

Create eMail Template

The email template can be found in this GitHub repository.

Create Campaign

Next, create your campaign and use the Segment and email Template that you created in the previous steps by following the instructions here.

Select the ‘when an event occurs’ option to trigger the campaign when an event occurs. (This option will trigger the campaign when a specific event occurs). Yoy may also schedule your campaign to run on a scheduled bases as available in the setup screen. I used ‘CampaignHookTrigger’ as my event name.

Create a campaign

Set your Campaign Start date, time and end date. I have left all the other settings to default and saved the campaign. Now that you have successfully created your first Campaign, you are ready for the next steps.

Set Campaign Start and End Times

Create the Lambda function

This is the function that we will configure to trigger the Amazon pinpoint campaign event . From the Lambda console page, create a new function by clicking on the ‘Create function’ button. You can then pick the following options and create the function.

Name: Campaign_event_trigger_function

Runtime: Python 3.9 or higher.

Replace the default script with the code from the GitHub repository, and then deploy your code by clicking on the “Deploy” button.

Assign permissions

In-order for the Lambda function trigger to trigger the Pinpoint Campaign, you will need to add an inline policy to the IAM role that is attached to your Lambda function, by selecting Pinpoint as the service and PutEvents from the Write options. You can select the Lambda function as the resource to which the access will be granted.

{

    "Version" :"2012-10-17",

    "Statement":[

        {

            "Sid": "VisualEditor0",

            "Effect": "Allow",

            "Action": [

                "mobiletargeting:PutEvents"

            ],

            "Resource":"ARN of your Lambda function goes here."

        }

    ]

}

Create the CampaignHook Lambda function

This is the function that we will triggered from the CampaignHook. From your Lambda console, click on “Create function” and enter the basic information as shown below to create your function.

Name: CampaignHookFunction

Runtime: Python 3.9 or higher.

Next replace your default code with the sample GitHub code, and then deploy your code by clicking on the “Deploy” button.

Assign permissions

Next add permissions for Amazon Pinpoint to invoke the Lambda function by running the command below from your Command Shell. Replace the Lambda function name and Account number with yours.

aws lambda add-permission \

--function-name [YourCampaignHookLambdaFunctionName] \

--statement-id my-hook-id1 \

--action lambda:InvokeFunction \

--principal pinpoint.us-east-1.amazonaws.com \

--source-arn 'arn:aws:mobiletargeting:us-east-1:[YourAccountNumber]:apps/*'

You can also do this from the Lambda console, by clicking on “Configuration” and then scrolling down to “Resource based Policy” and by clicking on “Add permissions“.

Update Campaign settings to add the Campaign Hook

Now that you have the Lambda function that needs to act as the hook is created, and granted Amazon Pinpoint service to invoke that function, run the command below to update the Campaign settings to add the Campaign Hook. You can also set a default CampaignHook for ALL campaigns in the project by setting the CampaignHook property on the Project Settings via this API.

Replace the application-id (project id), campaign-id, and the arn of the Campaign Hook lambda function and run the command below. (You can find the Project ID by clicking on All Projects at the top-left of the Pinpoint Console. The Campaign ID can be found by opening your Pinpoint Project and then clicking Campaigns in the Pinpoint Console.)

aws pinpoint   update-campaign --application-id /

[your-application-id-goes-here] –campaign-id /

[your-campaign-id-goes-here] --cli-input-json '{"ApplicationId": /

"","CampaignId": "","WriteCampaignRequest": {"Hook": {"LambdaFunctionName": /

"your-CampaignHook-Function-goes-here","Mode": "FILTER","WebUrl": ""}}}'

You can optionally run the command below to make sure that the campaign settings have been updated:

aws pinpoint get-campaign –application-id [your-application-id-goes-here]  –campaign-id [your-campaign-id-goes-here]

Test your Campaign.

Go back to your Lambda function that you have created to trigger the Campaign in the “Create the Lambda function” step above. I have used the test event as shown below. Update the Application id to reflect your Project id and change the email address to the email you verified earlier and click on “Test” button.

{

    "application_id": "your application id",

    "endpoint_id": "223",

    "event_type": "CampaignHookEvent",

    "nextTestDate": "12/15/2025",

    "FirstName": "Jack",

    "email": "[email protected]",

    "userid": "Jack123"

}

You should now receive an email with the variables replaced with the values that was passed from your json payload. Further you can see the Company name was added to the endpoint from the CampaignHook Lambda, which is passed to the email template. If you have not received the email, please check the following:

  • The Lambda function ran without any errors
  • The LambdaHook function has the proper rights assigned to be invoked from Pinpoint
  • The From and To email id that you have used are verified in SES.

Verify email identity

Clean up resources

Once you are satisfied with your setup and testing, you can now clean up the resources by following the steps below:

  • Delete your Amazon Pinpoint Project, Campaign and Segment.
    • aws pinpoint delete-campaign –application-id [your appl id] –campaign-id [your campaign id]
    • aws pinpoint delete-segment –application-id [your app id]  –segment-id [your segment id]
    • aws pinpoint delete-app –application-id [your app id]
  • Delete you Lambda functions
    • aws lambda delete-function –function-name CampaignHookFunction
    • aws lambda delete-function –function-name Campaign_event_Trigger_Function

Conclusion

By dynamically generating the attributes in real-time, customers can now add greater levels of personalization within a single user message template. By invoking a Lambda function, you can perform custom compute logic, calculate new attribute values, and access external data stores, to modify the campaign’s segment, right before Amazon Pinpoint sends the message. Campaign Hook feature makes this possible as explained in this blog by running few basic CLI commands to enable the feature on your Amazon Pinpoint Campaign. You can read more about Amazon Pinpoint Campaign from the user guide documentation”.

Visualize Confluent data in Amazon QuickSight using Amazon Athena

Post Syndicated from Ahmed Zamzam original https://aws.amazon.com/blogs/big-data/visualize-confluent-data-in-amazon-quicksight-using-amazon-athena/

This is a guest post written by Ahmed Saef Zamzam and Geetha Anne from Confluent.

Businesses are using real-time data streams to gain insights into their company’s performance and make informed, data-driven decisions faster. As real-time data has become essential for businesses, a growing number of companies are adapting their data strategy to focus on data in motion. Event streaming is the central nervous system of a data in motion strategy and, in many organizations, Apache Kafka is the tool that powers it.

Today, Kafka is well known and widely used for streaming data. However, managing and operating Kafka at scale can still be challenging. Confluent offers a solution through its fully managed, cloud-native service that simplifies running and operating data streams at scale. Confluent extends open-source Kafka through a suite of related services and features designed to enhance the data in motion experience for operators, developers, and architects in production.

In this post, we demonstrate how Amazon Athena, Amazon QuickSight, and Confluent work together to enable visualization of data streams in near-real time. We use the Kafka connector in Athena to do the following:

  • Join data inside Confluent with data stored in one of the many data sources supported by Athena, such as Amazon Simple Storage Service (Amazon S3)
  • Visualize Confluent data using QuickSight

Challenges

Purpose-built stream processing engines, like Confluent ksqlDB, often provide SQL-like semantics for real-time transformations, joins, aggregations, and filters on streaming data. With ksqlDB, you can create persistent queries, which continuously process streams of events according to specific logic, and materialize streaming data in views that can be queried at a point in time (pull queries) or subscribed to by clients (push queries).

ksqlDB is one solution that made stream processing accessible to a wider range of users. However, pull queries, like those supported by ksqlDB, may not be suitable for all stream processing use cases, and there may be complexities or unique requirements that pull queries are not designed for.

Data visualization for Confluent data

A frequent use case for enterprises is data visualization. To visualize data stored in Confluent, you can use one of over 120 pre-built connectors, provided by Confluent, to write streaming data to a destination data store of your choice. Next, you connect your business intelligence (BI) tool to the data store to begin visualizing the data.

The following diagram depicts a typical architecture utilized by many Confluent customers. In this workflow, data is written to Amazon S3 through the Confluent S3 sink connector and then analyzed with Athena, a serverless interactive analytics service that enables you to analyze and query data stored in Amazon S3 and various other data sources using standard SQL. You can then use Athena as an input data source to QuickSight, a highly scalable cloud native BI service, for further analysis.

typical architecture utilized by many Confluent customers.

Although this approach works well for many use cases, it requires data to be moved, and therefore duplicated, before it can be visualized. This duplication not only adds time and effort for data engineers who may need to develop and test new scripts, but also creates data redundancy, making it more challenging to manage and secure the data, and increases storage cost.

Enriching data with reference data in another data store

With ksqlDB queries, the source and destination are always Kafka topics. Therefore, if you have a data stream that you need to enrich with external reference data, you have two options. One option is to import the reference data into Confluent, model it as a table, and use ksqlDB’s stream-table join to enrich the stream. The other option is to ingest the data stream into a separate data store and perform join operations there. Both require data movement and result in duplicate data storage.

Solution overview

So far, we have discussed two challenges that are not addressed by conventional stream processing tools. Is there a solution that addresses both challenges simultaneously?

When you want to analyze data without separate pipelines and jobs, a popular choice is Athena. With Athena, you can run SQL queries on a wide range of data sources—in addition to Amazon S3—without learning a new language, developing scripts to extract (and duplicate) data, or managing infrastructure.

Recently, Athena announced a connector for Kafka. Like Athena’s other connectors, queries on Kafka are processed within Kafka and return results to Athena. The connector supports predicate pushdown, which means that adding filters to your queries can reduce the amount of data scanned, improve query performance, and reduce cost.

For example, when using this connector, the amount of data scanned by the query SELECT * FROM CONFLUENT_TABLE could be significantly higher than the amount of data scanned by the query SELECT * FROM CONFLUENT_TABLE WHERE COUNTRY = 'UK'. The reason is that the AWS Lambda function which provides the runtime environment for the Athena connector, filters data at the source before returning it to Athena.

Let’s assume we have a stream of online transactions flowing into Confluent and customer reference data stored in Amazon S3. We want to use Athena to join both data sources together and produce a new dataset for QuickSight. Instead of using the S3 sink connector to load data into Amazon S3, we use Athena to query Confluent and join it with S3 data—all without moving data. The following diagram illustrates this architecture.

Athena to join both data sources together and produce a new dataset for QuickSight

We perform the following steps:

  1. Register the schema of your Confluent data.
  2. Configure the Athena connector for Kafka.
  3. Optionally, interactively analyze Confluent data.
  4. Create a QuickSight dataset using Athena as the source.

Register the schema

To connect Athena to Confluent, the connector needs the schema of the topic to be registered in the AWS Glue Schema Registry, which Athena uses for query planning.

The following is a sample record in Confluent:

{
  "transaction_id": "23e5ed25-5818-4d4f-acb3-73ef04d51d21",
  "customer_id": "126-58-9758",
  "amount": 986,
  "timestamp": "2023-01-03T15:40:42",
  "product_category": "health_fitness"
}

The following is the schema of this record:

{
  "topicName": "transactions",
  "message": {
    "dataFormat": "json",
    "fields": [
      {
        "name": "transaction_id",
        "mapping": "transaction_id",
        "type": "VARCHAR"
      },
      {
        "name": "customer_id",
        "mapping": "customer_id",
        "type": "VARCHAR"
      },
      {
        "name": "amount",
        "mapping": "amount",
        "type": "INTEGER"
      },
      {
        "name": "timestamp",
        "mapping": "timestamp",
        "type": "timestamp",
        "formatHint": "yyyy-MM-dd\'T\'HH:mm:ss"
      },
      {
        "name": "product_category",
        "mapping": "product_category",
        "type": "VARCHAR"
      },
      {
        "name": "customer_id",
        "mapping": "customer_id",
        "type": "VARCHAR"
      }
    ]
  }
}

The data producer writing the data can register this schema with the AWS Glue Schema Registry. Alternatively, you can use the AWS Management Console or AWS Command Line Interface (AWS CLI) to create a schema manually.

We create the schema manually by running the following CLI command. Replace <registry_name> with your registry name and make sure that the text in the description field includes the required string {AthenaFederationKafka}:

aws glue create-registry –registry-name <registry_name> --description {AthenaFederationKafka}

Next, we run the following command to create a schema inside the newly created schema registry:

aws glue create-schema –registry-id RegistryName=<registry_name> --schema-name <schema_name> --compatibility <Compatibility_Mode> --data-format JSON –schema-definition <Schema>

Before running the command, be sure to provide the following details:

  • Replace <registry_name> with our AWS Glue Schema Registry name
  • Replace <schema_name> with the name of our Confluent Cloud topic, for example, transactions
  • Replace <Compatibility_Mode> with one of the supported compatibility modes, for example, ‘Backward’
  • Replace <Schema> with our schema

Configure and deploy the Athena Connector

With our schema created, we’re ready to deploy the Athena connector. Complete the following steps:

  1. On the Athena console, choose Data sources in the navigation pane.
  2. Choose Create data source.
  3. Search for and select Apache Kafka.
    Add Apache Kafka as data source
  4. For Data source name, enter the name for the data source.
    Enter name for data source

This data source name will be referenced in your queries. For example:

SELECT * 
FROM <data_source_name>.<registry_name>.<schema_name>
WHERE COL1='SOMETHING'

Applying this to our use case and previously defined schema, our query would be as follows:

SELECT * 
FROM "Confluent"."transactions_db"."transactions"
WHERE product_category='Kids'
  1. In the Connection details section, choose Create Lambda function.
    create lambda function

You’re redirected to the Applications page on the Lambda console. Some of the application settings are already filled.

The following are the important settings required for integrating with Confluent Cloud. For more information on these settings, refer to Parameters.

  1. For LambdaFunctionName, enter the name for the Lambda function the connector will use. For example, athena_confluent_connector.

We use this parameter in the next step.

  1. For KafkaEndpoint, enter the Confluent Cloud bootstrap URL.

You can find this on the Cluster settings page in the Confluent Cloud UI.

enter the Confluent Cloud bootstrap URL

Confluent Cloud supports two authentication mechanisms: OAuth and SASL/PLAIN (API keys). The connector doesn’t support OAuth; this leaves us with SASL/PLAIN. SASL/PLAIN uses SSL as a security protocol and PLAIN as SASL mechanism.

  1. For AuthType, enter SASL_SSL_PLAIN.

The API key and secret used by the connector to access Confluent need to be stored in AWS Secrets Manager.

  1. Get your Confluent API key or create a new one.
  2. Run the following AWS CLI command to create the secret in Secrets Manager:
    aws secretsmanager create-secret \
        --name <SecretNamePrefix>\
        --secret-string "{\"username\":\"<Confluent_API_KEY>\",\"password\":\"<Confluent_Secret>\"}"

The secret string should have two key-value pairs, one named username and the other password.

  1. For SecretNamePrefix, enter the secret name prefix created in the previous step.
  2. If the Confluent cloud cluster is reachable over the internet, leave SecurityGroupIds and SubnetIds blank. Otherwise, your Lambda function needs to run in a VPC that has connectivity to your Confluent Cloud network. Therefore, enter a security group ID and three private subnet IDs in this VPC.
  3. For SpillBucket, enter the name of an S3 bucket where the connector can spill data.

Athena connectors temporarily store (spill) data to Amazon S3 for further processing by Athena.

  1. Select I acknowledge that this app creates custom IAM roles and resource policies.
  2. Choose Deploy.
  3. Return to the Connection details section on the Athena console and for Lambda, enter the name of the Lambda function you created.
  4. Choose Next.
    Return to the Connection details section on the Athena console and for Lambda, enter the name of the Lambda function you created. And Choose Next.
  5. Choose Create data source.

Perform interactive analysis on Confluent data

With the Athena connector set up, our streaming data is now queryable from the same service we use to analyze S3 data lakes. Next, we use Athena to conduct point-in-time analysis of transactions flowing through Confluent Cloud.

Aggregation

We can use standard SQL functions to aggregate the data. For example, we can get the revenue by product category:

SELECT product_category, SUM(amount) AS Revenue
FROM "Confluent"."athena_blog"."transactions"
GROUP BY product_category
ORDER BY Revenue desc

SQL function to aggregate data

Enrich transaction data with customer data

The aggregation example is also available with ksqlDB pull queries. However, Athena’s connector allows us to join the data with other data sources like Amazon S3.

In our use case, the transactions streamed to Confluent Cloud lack detailed information about customers, apart from a customer_id. However, we have a reference dataset in Amazon S3 that has more information about the customers. With Athena, we can join both datasets together to gain insights about our customers. See the following code:

SELECT * 
FROM "Confluent"."athena_blog"."transactions" a
INNER JOIN "AwsDataCatalog"."athenablog"."customer" b 
ON a.customer_id=b.customer_id

join data

You can see from the results that we were able to enrich the streaming data with customer details, stored in Amazon S3, including name and address.

Visualize data using QuickSight

Another powerful feature this connector brings is the ability to visualize data stored in Confluent using any BI tool that supports Athena as a data source. In this post, we use QuickSight. QuickSight is a machine learning (ML)-powered BI service built for the cloud. You can use it to deliver easy-to-understand insights to the people you work with, wherever they are.

For more information about signing up for QuickSight, see Signing up for an Amazon QuickSight subscription.

Complete the following steps to visualize your streaming data with QuickSight:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.
  3. Choose Athena as the data source.
  4. For Data source name, enter a name.
  5. Choose Create data source.
  6. In the Choose your table section, choose Use custom SQL.
    In the Choose your table section, choose Use custom SQL.
  7. Enter the join query like the one given previously, then choose Confirm query.
    Enter the join query like the one given previously, then choose Confirm query.
  8. Next, choose to import the data into SPICE (Super-fast, Parallel, In-memory Calculation Engine), a fully managed in-memory cache that boosts performance, or directly query the data.

Utilizing SPICE will enhance performance, but the data may need to be periodically updated. You can choose to incrementally refresh your dataset or schedule regular refreshes with SPICE. If you want near-real-time data reflected in your dashboards, select Directly query your data. Note that with the direct query option, user actions in QuickSight, such as applying a drill-down filter, may invoke a new Athena query.

  1. Choose Visualize.
    Choose Visualize

That’s it, we have successfully connected QuickSight to Confluent through Athena. With just a few clicks, you can create a few visuals displaying data from Confluent.

successfully connected QuickSight to Confluent through Athena.

Clean up

To avoid incurring ongoing charges, delete the resources you provisioned by completing the following steps:

  1. Delete the AWS Glue schema and registry.
  2. Delete the Athena Kafka connector.
  3. Delete the QuickSight dataset.

Conclusion

In this post, we discussed use cases for Athena and Confluent. We provided examples of how you can use both for near-real-time data visualization with QuickSight and interactive analysis involving joins between streaming data in Confluent and data stored in Amazon S3.

The Athena connector for Kafka simplifies the process of querying and analyzing streaming data from Confluent Cloud. It removes the need to first move streaming data to persistent storage before it can be used in downstream use cases like business intelligence. This complements the existing integration between Confluent and Athena, using the S3 sink connector, which enables loading streaming data into a data lake, and is an additional option for customers who want to enable interactive analysis on Confluent data.


About the authors

Ahmed Zamzam is a Senior Partner Solutions Architect at Confluent, with a focus on the AWS partnership. In his role, he works with customers in the EMEA region across various industries to assist them in building applications that leverage their data using Confluent and AWS. Prior to Confluent, Ahmed was a Specialist Solutions Architect for Analytics AWS specialized in data streaming and search. In his free time, Ahmed enjoys traveling, playing tennis, and cycling.

Geetha Anne is a Partner Solutions Engineer at Confluent with previous experience in implementing solutions for data-driven business problems on the cloud, involving data warehousing and real-time streaming analytics. She fell in love with distributed computing during her undergraduate days and has followed her interest ever since. Geetha provides technical guidance, design advice, and thought leadership to key Confluent customers and partners. She also enjoys teaching complex technical concepts to both tech-savvy and general audiences.