Tag Archives: Intermediate (200)

Scale your authorization needs for Secrets Manager using ABAC with IAM Identity Center

2023-04-14 Aravind Gopaluni

Post Syndicated from Aravind Gopaluni original https://aws.amazon.com/blogs/security/scale-your-authorization-needs-for-secrets-manager-using-abac-with-iam-identity-center/

With AWS Secrets Manager, you can securely store, manage, retrieve, and rotate the secrets required for your applications and services running on AWS. A secret can be a password, API key, OAuth token, or other type of credential used for authentication purposes. You can control access to secrets in Secrets Manager by using AWS Identity and Access Management (IAM) permission policies. In this blog post, I will show you how to use principles of attribute-based access control (ABAC) to define dynamic IAM permission policies in AWS IAM Identity Center (successor to AWS Single Sign-On) by using user attributes from an external identity provider (IdP) and resource tags in Secrets Manager.

What is ABAC and why use it?

Attribute-based access control (ABAC) is an authorization strategy that defines permissions based on attributes or characteristics of the user, the data, or the environment, such as the department, business unit, or other factors that could affect the authorization outcome. In the AWS Cloud, these attributes are called tags. By assigning user attributes as principal tags, you can simplify the process of creating fine-grained permissions on AWS.

With ABAC, you can use attributes to build more dynamic policies that provide access based on matching attribute conditions. ABAC rules are evaluated dynamically at runtime, which means that the users’ access to applications and data and the type of allowed operations automatically change based on the contextual factors in the policy. For example, if a user changes department, access is automatically adjusted without the need to update permissions or request new roles. You can use ABAC in conjunction with role-based access control (RBAC) to combine the ease of policy administration with flexible policy specification and dynamic decision-making capability to enforce least privilege.

AWS IAM Identity Center (successor to AWS Single Sign-On) expands the capabilities of IAM to provide a central place that brings together the administration of users and their access to AWS accounts and cloud applications. With IAM Identity Center, you can define user permissions and manage access to accounts and applications in your AWS Organizations organization centrally. You can also create ABAC permission policies in a central place. ABAC will work with attributes from a supported identity source in IAM Identity Center. For a list of supported external IdPs for identity synchronization through the System for Cross-domain Identity Management (SCIM) and Security Assertion Markup Language (SAML) 2.0, see Supported identity providers.

The following are key benefits of using ABAC with IAM Identity Center and Secrets Manager:

Fewer permission sets — With ABAC, multiple users who use the same IAM Identity Center permission set and the same IAM role can still get unique permissions, because permissions are now based on user attributes. Administrators can author IAM policies that grant users access only to secrets that have matching attributes. This helps reduce the number of distinct permissions that you need to create and manage in IAM Identity Center and, in turn, reduces your permission management complexity.
Teams can change and grow quickly — When you create new secrets, you can apply the appropriate tags, which will automatically grant access without requiring you to update the permission policies.
Use employee attributes from your corporate directory to define access — You can use existing employee attributes from a supported identity source configured in IAM Identity Center to make access control decisions on AWS.

Figure 1 shows a framework to control access to Secrets Manager secrets using IAM Identity Center and ABAC principles.

Figure 1: ABAC framework to control access to secrets using IAM Identity Center

The following is a brief introduction to the basic components of the framework:

User attribute source or identity source — This is where your users and groups are administered. You can configure a supported identity source with IAM Identity Center. You can then define and manage supported user attributes in the identity source.
Policy management — You can create and maintain policy definitions (permission sets) centrally in IAM Identity Center. You can assign access to a user or group to one or more accounts in IAM Identity Center with these permission sets. You can then use attributes defined in your identity source to build ABAC policies for managing access to secrets.
Policy evaluation — When you assign a permission set, IAM Identity Center creates corresponding IAM Identity Center-controlled IAM roles in each account, and attaches the policies specified in the permission set to those roles. IAM Identity Center manages the role, and allows the authorized users that you’ve defined to assume the role. When users try to access a secret, IAM dynamically evaluates ABAC policies on the target account to determine access based on the attributes assigned to the user and resource tags assigned to that secret.

How to configure ABAC with IAM Identity Center

To configure ABAC with IAM Identity Center, you need to complete the following high-level steps. I will walk you through these steps in detail later in this post.

Identify and set up identities that are created and managed in the identity source with user attributes, such as project, team, AppID or department.
In IAM Identity Center, enable Attributes for access control and configure select attributes (such as department) to use for access control. For a list of supported attributes, see Supported external identity provider attributes.
If you are using an external IdP and choose to use custom attributes from your IdP for access controls, configure your IdP to send the attributes through SAML assertions to IAM Identity Center.
Assign appropriate tags to secrets in Secrets Manager.
Create permission sets based on attributes added to identities and resource tags.
Define guardrails to enforce access using ABAC.

ABAC enforcement and governance

Because an ABAC authorization model is based on tags, you must have a tagging strategy for your resources. To help prevent unintended access, you need to make sure that tagging is enforced and that a governance model is in place to protect the tags from unauthorized updates. By using service control policies (SCPs) and AWS Organizations tag policies, you can enforce tagging and tag governance on resources.

When you implement ABAC for your secrets, consider the following guidance for establishing a tagging strategy:

During secret creation, secrets must have an ABAC tag applied (tag-on-create).
During secret creation, the provided ABAC tag key must be the same case as the principal’s ABAC tag key.
After secret creation, the ABAC tag cannot be modified or deleted.
Only authorized principals can do tagging operations on secrets.
You enforce the permissions that give access to secrets through tags.

For more information on tag strategy, enforcement, and governance, see the following resources:

Solution overview

In this post, I will walk you through the steps to enable the IdP that is supported by IAM Identity Center.

Figure 2: Sample solution implementation

In the sample architecture shown in Figure 2, Arnav and Ana are users who each have the attributes department and AppID. These attributes are created and updated in the external directory—Okta in this case. The attribute department is automatically synchronized between IAM Identity Center and Okta using SCIM. The attribute AppID is a custom attribute configured on Okta, and is passed to AWS as a SAML assertion. Both users are configured to use the same IAM Identity Center permission set that allows them to retrieve the value of secrets stored in Secrets Manager. However, access is granted based on the tags associated with the secret and the attributes assigned to the user.

For example, user Arnav can only retrieve the value of the RDS_Master_Secret_AppAlpha secret. Although both users work in the same department, Arnav can’t retrieve the value of the RDS_Master_Secret_AppBeta secret in this sample architecture.

Prerequisites

Before you implement the solution in this blog post, make sure that you have the following prerequisites in place:

You have IAM Identity Center enabled for your organization and connected to an external IdP using SAML 2.0 identity federation.
You have IAM Identity Center configured for automatic provisioning with an external IdP using the SCIM v2.0 standard. SCIM keeps your IAM Identity Center identities in sync with identities from the external IdP.

Solution implementation

In this section, you will learn how to enable access to Secrets Manager using ABAC by completing the following steps:

Configure ABAC in IAM Identity Center
Define custom attributes in Okta
Update configuration for the IAM Identity Center application on Okta
Make sure that required tags are assigned to secrets in Secrets Manager
Create and assign a permission set with an ABAC policy in IAM Identity Center
Define guardrails to enforce access using ABAC

Step 1: Configure ABAC in IAM Identity Center

The first step is to set up attributes for your ABAC configuration in IAM Identity Center. This is where you will be mapping the attribute coming from your identity source to an attribute that IAM Identity Center passes as a session tag. The Key represents the name that you are giving to the attribute for use in the permission set policies. You need to specify the exact name in the policies that you author for access control. For the example in this post, you will create a new attribute with Key of department and Value of ${path:enterprise.department}. For supported external IdP attributes, see Attribute mappings.

To configure ABAC in IAM Identity Center (console)

Open the IAM Identity Center console.
In the Settings menu, enable Attributes for access control.
Choose the Attributes for access control tab, select Add attribute, and then enter the Key and Value details as follows.
- Key: department
- Value: ${path:enterprise.department}

Note: For more information, see Attributes for access control.

Step 2: Define custom attributes in Okta

The sample architecture in this post uses a custom attribute (AppID) on an external IdP for access control. In this step, you will create a custom attribute in Okta.

To define custom attributes in Okta (console)

Open the Okta console.
Navigate to Directory and then select Profile Editor.
On the Profile Editor page, choose Okta User (default).
Select Add Attribute and create a new custom attribute with the following parameters.
- For Data type, enter string
- For Display name, enter AppID
- For Variable name, enter user.AppID
- For Attribute length, select Less Than from the dropdown and enter a value.
- For User permission, enter Read Only
Navigate to Directory, select People, choose in-scope users, and enter a value for Department and AppID attributes. The following shows these values for the users in our example.
- First name (firstName): Arnav
- Last name (lastName): Desai
- Primary email (email): [email protected]
- Department (department): Digital
- AppID: Alpha
- First name (firstName): Ana
- Last name (lastName): Carolina
- Primary email (email): [email protected]
- Department (department): Digital
- AppID: Beta

Step 3: Update SAML configuration for IAM Identity Center application on Okta

Automatic provisioning (through the SCIM v2.0 standard) of user and group information from Okta into IAM Identity Center supports a set of defined attributes. A custom attribute that you create on Okta won’t be automatically synchronized to IAM Identity Center through SCIM. You can, however, define the attribute in the SAML configuration so that it is inserted into the SAML assertions.

To update the SAML configuration in Okta (console)

Open the Okta console and navigate to Applications.
On the Applications page, select the app that you defined for IAM Identity Center.
Under the Sign On tab, choose Edit.
Under SAML 2.0, expand the Attributes (Optional) section, and add an attribute statement with the following values, as shown in Figure 3:
- Name: https://aws.amazon.com/SAML/Attributes/AccessControl:AppID
- Name format: URI reference
- Value: user.AppID
Figure 3: Sample SAML configuration with custom attributes
To check that the newly added attribute is reflected in the SAML assertion, choose Preview SAML, review the information, and then choose Save.

Step 4: Make sure that required tags are assigned to secrets in Secrets Manager

The next step is to make sure that the required tags are assigned to secrets in Secrets Manager. You will review the required tags from the Secrets Manager console.

To verify required tags on secrets (console)

Open the Secrets Manager console in the target AWS account and then choose Secrets.
Verify that the required tags are assigned to the secrets in scope for this solution, as shown in Figure 4. In our example, the tags are as follows:
- Key: department
- Value: Digital
- Key: AppID
- Value: Alpha or Beta
Figure 4: Sample secret configuration with required tags

Step 5a: Create a permission set in IAM Identity Center using ABAC policy

In this step, you will create a new permission set that allows access to secrets based on the principal attributes and resource tags.

When you enable ABAC and specify attributes, IAM Identity Center passes the attribute value of the authenticated user to AWS Security Token Service (AWS STS) as session tags when an IAM role is assumed. You can use access control attributes in your permission sets by using the aws:PrincipalTag condition key to create access control rules.

To create a permission set (console)

Open the IAM Identity Center console and navigate to Multi-account permissions.
Choose Permission sets, and then select Create permission set.
On the Specify policies and permissions boundary page, choose Inline policy.

For Inline policy, paste the following sample policy document and then choose Next. This policy allows users to retrieve the value of only those secrets that have resource tags that match the required user attributes (department and AppID in our example).

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListAllSecrets",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:ListSecrets"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AuthorizetoGetSecretValue",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "secretsmanager:ResourceTag/department": "${aws:PrincipalTag/department}",
                    "secretsmanager:ResourceTag/AppID": "${aws:PrincipalTag/AppID}"
                }
            }
        }
    ]
}

Configure the session duration, and optionally provide a description and tags for the permission set.
Review and create the permission set.

Step 5b: Assign permission set to users in IAM Identity Center

Now that you have created a permission set with ABAC policy, complete the configuration by assigning the permission set to users to grant them access to secrets in one or more accounts in your organization.

To assign a permission set (console)

Open the IAM Identity Center console and navigate to Multi-account permissions.
Choose AWS accounts and select one or more accounts to which you want to assign access.
Choose Assign users or groups.
On the Assign users and groups page, select the users, groups, or both to which you want to assign access. For this example, I select both Arnav and Ana.
On the Assign permission sets page, select the permission set that you created in the previous section.
Review your changes, as shown in Figure 5, and then select Submit.

Figure 5: Sample permission set assignment

Step 6: Define guardrails to enforce access using ABAC

To govern access to secrets to your workforce users only through ABAC and to help prevent unauthorized access, you can define guardrails. In this section, I will show you some sample service control policies (SCPs) that you can use in your organization.

Note: Before you use these sample SCPs, you should carefully review, customize, and test them for your unique requirements. For additional instructions on how to attach an SCP, see Attaching and detaching service control policies.

Guardrail 1 – Enforce ABAC to access secrets

The following sample SCP requires the use of ABAC to access secrets in Secrets Manager. In this example, users and secrets must have matching values for the attributes department and AppID. Access is denied if those attributes don’t exist or if they don’t have matching values. Also, this example SCP allows only the admin role to access secrets without matching tags. Replace <arn:aws:iam::*:role/secrets-manager-admin-role> with your own information.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyAccesstoSecretsWithoutABACTag",
            "Effect": "Deny",
            "Action": [
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret"
            ],
            "Resource": "*",
            "Condition": {
                "StringNotEqualsIfExists": {
                    "secretsmanager:ResourceTag/department": "${aws:PrincipalTag/department}",
                    "secretsmanager:ResourceTag/AppID": "${aws:PrincipalTag/AppID}"
                },
                "ArnNotLike": {
                    "aws:PrincipalArn": "<arn:aws:iam::*:role/secrets-manager-admin-role>"
                }
            }
        }
    ]
}

Guardrail 2 – Enforce tagging on secret creation

The following sample SCP denies the creation of new secrets that don’t have the required tag key-value pairs. In this example, the SCP denies creation of a new secret if it doesn’t include department and AppID tag keys. It also denies access if the tag department doesn’t have the value Digital and the tag AppID doesn’t have either Alpha or Beta assigned to it. Also, this example SCP allows only the admin role to create secrets without matching tags. Replace <arn:aws:iam::*:role/secrets-manager-admin-role> with your own information.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyCreatingResourcesWithoutRequiredTag",
      "Effect": "Deny",
      "Action": [
        "secretsmanager:CreateSecret"
      ],
      "Resource": [
        "arn:aws:secretsmanager:*:*:secret:*"
      ],
      "Condition": {
        "StringNotEquals": {
          "aws:RequestTag/department": ["Digital"],
          "aws:RequestTag/AppID": ["Alpha", "Beta"]
        },
        "ArnNotLike": {
          "aws:PrincipalArn": "<arn:aws:iam::*:role/secrets-manager-admin-role>"
        }
      }
    }
  ]
}

Guardrail 3 – Restrict deletion of ABAC tags

The following sample SCP denies the ability to delete the tags used for ABAC. In this example, only the admin role can delete the tags department and AppID after they are attached to a secret. Replace <arn:aws:iam::*:role/secrets-manager-admin-role> with your own information.

Guardrail 4 – Restrict modification of ABAC tags

The following sample SCP denies the ability to modify required tags for ABAC after they are attached to a secret. In this example, only the admin role can modify the tags department and AppID after they are attached to a secret. Replace <arn:aws:iam::*:role/secrets-manager-admin-role> with your own information.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyModifyingABACTags",
      "Effect": "Deny",
      "Action": [
        "secretsmanager:TagResource"
      ],
      "Resource": [
        "arn:aws:secretsmanager:*:*:secret:*"
      ],
      "Condition": {
        "Null": {
          "aws:ResourceTag/department": "false",
          "aws:ResourceTag/AppID": "false"
        },
        "ArnNotLike": {
          "aws:PrincipalArn": "<arn:aws:iam::*:role/secrets-manager-admin-role>"
        }
      }
    }
  ]
}

Test the solution

In this section, you will test the solution by retrieving a secret using the Secrets Manager console. Your attempt to retrieve the secret value will be successful only when the required resource and principal tags exist, and have matching values (AppID and department in our example).

Test scenario 1: Retrieve and view the value of an authorized secret

In this test, you will verify whether you can successfully retrieve the value of a secret that belongs to your application.

To test the scenario

Sign in to IAM Identity Center and log in with your external IdP user. For this example, I log in as Arnav.
On the IAM Identity Center dashboard, select the target account.
From the list of available roles that the user has access to, choose the role that you created in Step 5a and select Management console, as shown in Figure 6. For this example, I select the SecretsManagerABACTest permission set.

Figure 6: Sample IAM Identity Center dashboard
Open the Secrets Manager console and select a secret that belongs to your application. For this example, I select RDS_Master_Secret_AppAlpha.
Because the AppID and department tags exist on both the secret and the user, the ABAC policy allowed the user to describe the secret, as shown in Figure 7.

Figure 7: Sample secret that was described successfully
In the Secret value section, select Retrieve secret value.
Because the value of the resource tags, AppID and department, matches the value of the corresponding user attributes (in other words, the principal tags), the ABAC policy allows the user to retrieve the secret value, as shown in Figure 8.

Figure 8: Sample secret value that was retrieved successfully

Test scenario 2: Retrieve and view the value of an unauthorized secret

In this test, you will verify whether you can retrieve the value of a secret that belongs to a different application.

To test the scenario

Repeat steps 1-3 from test scenario 1.
Open the Secrets Manager console and select a secret that belongs to a different application. For this example, I select RDS_Master_Secret_AppBeta.
Because the value of the resource tag AppID doesn’t match the value of the corresponding user attribute (principal tag), the ABAC policy denies access to describe the secret, as shown in Figure 9.

Figure 9: Sample error when describing an unauthorized secret

Conclusion

In this post, you learned how to implement an ABAC strategy using attributes and to build dynamic policies that can simplify access management to Secrets Manager using IAM Identity Center configured with an external IdP. You also learned how to govern resource tags used for ABAC and establish guardrails to enforce access to secrets using ABAC. To learn more about ABAC and Secrets Manager, see Attribute-Based Access Control (ABAC) for AWS and the Secrets Manager documentation.

If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on AWS Secrets Manager re:Post.

Want more AWS Security news? Follow us on Twitter.

Investigate security events by using AWS CloudTrail Lake advanced queries

2023-04-13 Rodrigo Ferroni

Post Syndicated from Rodrigo Ferroni original https://aws.amazon.com/blogs/security/investigate-security-events-by-using-aws-cloudtrail-lake-advanced-queries/

This blog post shows you how to use AWS CloudTrail Lake capabilities to investigate CloudTrail activity across AWS Organizations in response to a security incident scenario. We will walk you through two security-related scenarios while we investigate CloudTrail activity. The method described in this post will help you with the investigation process, allowing you to gain comprehensive understanding of the incident and its implications. CloudTrail Lake is a managed audit and security lake that allows you to aggregate, immutably store, and query your activity logs for auditing, security investigation, and operational troubleshooting.

Prerequisites

You must have the following AWS services enabled before you start the investigation.

CloudTrail Lake — To learn how to enable this service and use sample queries, see the blog post Announcing AWS CloudTrail Lake – a managed audit and security Lake. When you create a new event data store at the organization level, you will need to enable CloudTrail Lake for all of the accounts in the organization. We advise that you include not only management events but also data events.
When you use CloudTrail Lake with AWS Organizations, you can designate an account within the organization to be the CloudTrail Lake delegated administrator. This provides a convenient way to perform queries from a designated AWS security account—for example, you can avoid granting access to your AWS management account.
Amazon GuardDuty — This is a threat detection service that continuously monitors your AWS accounts and workloads for malicious activity and delivers detailed security findings for visibility and remediation. To learn about the benefits of the service and how to get started, see Amazon GuardDuty.

Incident scenario 1: AWS access keys compromised

In the first scenario, you have observed activity within your AWS account from an unauthorized party. This example covers a situation where a threat actor has obtained and misused one of your AWS access keys that was exposed publicly by mistake. This investigation starts after Amazon GuardDuty generates an IAM finding identifying that the malicious activity came from the exposed AWS access key. Following the Incident Response Playbook Compromised IAM Credentials, focusing on step 12 in the playbook ([DETECTION AND ANALYSIS] Review CloudTrail Logs), you will use CloudTrail Lake capabilities to investigate the activity that was performed with this key. To do so, you will use the following nine query examples that we provide for this first scenario.

Query 1.1: Activity performed by access key during a specific time window

The first query is aimed at obtaining the specific activity that was performed by this key, either successfully or not, during the time the malicious activity took place. You can use the GuardDuty finding details “EventFirstSeen” and “EventLastSeen” to define the time window of the query. Also, and for further queries, you want to fetch artifacts that could be considered possible indicators of compromise (IoC) related to this security incident, such as IP addresses.

You can build and run the following query on CloudTrail Lake Editor, either in the CloudTrail console or programmatically.

Query 1.1


 SELECT eventSource,eventName,sourceIPAddress,eventTime,errorCode FROM 1994bee2-d4a0-460e-8c07-1b5ee04765d8 WHERE userIdentity.accessKeyId = 'AKIAIOSFODNN7EXAMPLE' AND eventTime > '2022-03-15 13:10:00' AND eventTime < '2022-03-16 00:00:00' order by eventTime;

The results of the query are as follows:

Figure 1: Sample query 1.1 and results in the AWS Management Console

The results demonstrate that the activity performed by the access key tried to unsuccessfully list Amazon Simple Storage Services (Amazon S3) buckets and CloudTrail trails. You can also see specific write activity related to AWS Identity and Access Management (IAM) that was denied, and afterwards there was activity possibly related to reconnaissance tactics in IAM to finally be able to assume a role, which indicates a possible attempt to perform an escalation of privileges. You can observe only one source IP from which this activity was performed.

Query 1.2: Confirm which IAM role was assumed by the threat actor during a specific time window

As you observed from the previous query results, the threat actor was able to assume an IAM role. In this query, you would like to confirm which IAM role was assumed during the security incident.

Query 1.2


 SELECT requestParameters,responseElements FROM 1994bee2-d4a0-460e-8c07-1b5ee04765d8 WHERE eventName = 'AssumeRole' AND eventTime > '2022-03-15 13:10:00' AND eventTime < '2022-03-16 00:00:00' AND userIdentity.accessKeyId = 'AKIAIOSFODNN7EXAMPLE'

The results of the query are as follows:

Figure 2: Sample query 1.2 and results in the console

The results show that an IAM role named “Alice” was assumed in a second account. For future queries, keep the temporary access key from the responseElements result to obtain activity performed by this role session.

Query 1.3: Activity performed from an IP address in an expanded time window search

Investigating the incident only from the time of discovery may result in overlooking signs or indicators of potential past incidents that were not detected related to this threat actor. For this reason, you want to expand the investigation window time, which might result in expanding the search back weeks, months, or even years, depending on factors such as the nature and severity of the incident, available resources, and so on. In this example, for balance and urgency, the window of time searched is expanded to a month. You want to also review whether there is past activity related to this account by the IP you previously observed.

Query 1.3

The results of the query are as follows:


 SELECT eventSource,eventName,sourceIPAddress,eventTime,errorCode FROM 1994bee2-d4a0-460e-8c07-1b5ee04765d8 WHERE sourceIPAddress = '192.0.2.76' AND useridentity.accountid = '555555555555 AND eventTime > '2022-02-15 13:10:00' AND eventTime < '2022-03-15 13:10:00' order by eventTime;

Figure 3: Sample query 1.3 and results in the console

As you can observe from the results, there is no activity coming from this IP address in this account in the previous month.

Query 1.4: Activity performed from an IP address in any other account in your organization during a specific time window

Before you start investigating what activity was performed by the role assumed in the second account, and considering that this malicious activity now involves cross-account access, you will want to review whether any other account in your organization has activity related to the specific IP address observed. You will need to expand the window of time to an entire month in order to see if previous activity was performed before this incident from this source IP, and you will need to exclude activity coming from the first account.

Query 1.4


 SELECT useridentity.accountid,eventTime FROM 1994bee2-d4a0-460e-8c07-1b5ee04765d8 WHERE sourceIPAddress = '192.0.2.76' AND eventTime > '2022-02-15 13:10:00' AND eventTime < '2022-03-16 00:00:00' AND useridentity.accountid != '555555555555'GROUP by useridentity.accountid

The results of the query are as follows:

Figure 4: Sample query 1.4 and results in the console

As you can observe from the results, there is activity only in the second account where the role was assumed. You can also confirm that there was no activity performed in other accounts in the previous month from this IP address.

Query 1.5: Count activity performed by an IAM role during a specific time period

For the next query example, you want to count and group activity based on the API actions that were performed in each service by the role assumed. This query helps you quantify and understand the impact of the possible unauthorized activity that might have happened in this second account.

Query 1.5


 SELECT count (*) as NumberEvents, eventSource, eventName
 FROM 1994bee2-d4a0-460e-8c07-1b5ee04765d8 WHERE eventTime > '2022-03-15 13:10:00' AND eventTime < '2022-03-16 00:00:00' AND useridentity.type = 'AssumedRole' AND useridentity.sessioncontext.sessionissuer.arn = 'arn:aws:iam::111122223333:role/Alice'
 GROUP by eventSource, eventName
 order by NumberEvents desc;

The results of the query are as follows:

Figure 5: Sample query 1.5 and results in the console

You observe that the activity is consistent with what was shown in the first account, and the threat actor seems to be targeting trails, S3 buckets, and IAM activity related to possible further escalation of privileges.

Query 1.6: Confirm successful activity performed by an IAM role during a specific time window

Following the example in query 1.1, you will fetch the information related to activity that was successful or denied. This helps you confirm modifications that took place in the environment, or the creation of new resources. For this example, you will also want to obtain the event ID in case you need to dig further into one specific API call. You will then filter out activity done by any other session by using the temporary access key obtained from query 1.2.

Query 1.6


 SELECT eventSource, eventName, eventTime, eventID, errorCode FROM 1994bee2-d4a0-460e-8c07-1b5ee04765d8 WHERE eventTime > '2022-03-15 13:10:00' AND eventTime < '2022-03-16 00:00:00' AND useridentity.type = 'AssumedRole' AND useridentity.sessioncontext.sessionissuer.arn = 'arn:aws:iam::111122223333:role/Alice' AND userIdentity.accessKeyId = 'ASIAZNYXHMZ37EXAMPLE '

The results of the query are as follows:

Figure 6: Sample query 1.6 and results in the console

You can observe that the threat actor was again not able to perform activity upon the trails, S3 buckets, or IAM roles. But as you can see, the threat actor was able to perform specific IAM activity, which led to the creation of a new IAM user, policy attachment, and access key.

Query 1.7: Obtain new access key ID created

By making use of the event ID from the CreateAccesskey event displayed in the previous query, you can obtain the access key ID so that you can further dig into what activity was performed by it.

Query 1.7


 SELECT responseElements FROM 1994bee2-d4a0-460e-8c07-1b5ee04765d8 WHERE eventID = 'bd29bab7-1153-4510-9e7f-9ff9bba4bd9a'

The results of the query are as follows:

Figure 7: Sample query 1.7 and results in the console

Query 1.8: Obtain successful API activity that was performed by the access key during a specific time window

Following previous examples, you will count and group the API activity that was successfully performed by this access key ID. This time, you will exclude denied activity in order to understand the activity that actually took place.

Query 1.8


 SELECT count (*) as NumberEvents, eventSource, eventName
 FROM 1994bee2-d4a0-460e-8c07-1b5ee04765d8 WHERE eventTime > '2022-03-15 13:10:00' AND eventTime < '2022-03-16 00:00:00' AND
 userIdentity.accessKeyId = 'AKIAI44QH8DHBEXAMPLE' AND errorcode IS NULL
 GROUP by eventSource, eventName
 order by NumberEvents desc;

The results of the query are as follows:

Figure 8: Sample query 1.8 and results in the console

You can observe that this time, the threat actor was able to perform specific activities targeting your trails and buckets due to privilege escalation. In these results, you observe that a trail was successfully stopped, and S3 objects were downloaded and deleted.

You can also see bucket deletion activity. At first glance, this might indicate activity related to a data exfiltration scenario in the case where the bucket was not properly secured, and possible future ransom demands could be made if proper preventive controls and measures to recover the data were not in place. For more details on this scenario, see this AWS Security blog post.

Query 1.9: Obtain bucket and object names affected during a specific time window

After you obtain the activities on the S3 buckets by using sample query 1.8, you can use the following query to show what objects this activity was related to, and from which buckets. You can expand the query to exclude denied activity.

Query 1.9


 SELECT element_at(requestParameters, 'bucketName') as BucketName, element_at(requestParameters, 'key') as ObjectName, eventName FROM 1994bee2-d4a0-460e-8c07-1b5ee04765d8 WHERE (eventName = 'GetObject' OR eventName = 'DeleteObject') AND eventTime > '2022-03-15 13:10:00' AND eventTime < '2022-03-16 00:00:00' AND userIdentity.accessKeyId = 'AKIAI44QH8DHBEXAMPLE' AND errorcode IS NULL

The results of the query are as follows:

Figure 9: Sample query 1.9 and results in the console

As you can observe, the unauthorized user was able to first obtain and exfiltrate S3 objects, and then delete them afterwards.

Summary of incident scenario 1

This scenario describes a security incident involving a publicly exposed AWS access key that is exploited by a threat actor. Here is a summary of the steps taken to investigate this incident by using CloudTrail Lake capabilities:

Investigated AWS activity that was performed by the compromised access key
Observed possible adversary tactics and techniques that were used by the threat actor
Collected artifacts that could be potential indicators of compromise (IoC), such as IP addresses
Confirmed role assumption by the threat actor in a second account
Expanded the time window of your investigation and the scope to your entire organization in AWS Organizations; and searched for any activity that might have taken place originating from the IP address related to the unauthorized activity
Investigated AWS activity that was performed by the role assumed in the second account
Identified new resources that were created by the threat actor, and malicious activity performed by the actor
Confirmed the modifications caused by the threat actor and their impact in your environment

Incident scenario 2: AWS IAM Identity Center user credentials compromised

In this second scenario, you start your investigation from a GuardDuty finding stating that an Amazon Elastic Compute Cloud (Amazon EC2) instance is querying an IP address that is associated with cryptocurrency-related activity. There are several sources of logs that you might want to explore when you conduct this investigation, including network, operation system, or application logs, among others. In this example, you will use CloudTrail Lake capabilities to investigate API activity logged in CloudTrail for this security event. To understand what exactly happened and when, you start by querying information from the resource involved, in this case an EC2 instance, and then continue digging into the AWS IAM Identity Center (successor to AWS Single Sign-On) credentials that were used to launch that EC2 instance, to finally confirm what other actions were performed.

Query 2.1: Confirm who has launched the EC2 instance involved in the cryptocurrency-related activity

You can begin by looking at the finding CryptoCurrency:EC2/BitcoinTool.B to get more information related to this event, for example when (timestamp), where (AWS account and AWS Region), and also which resource (EC2 instance ID) was involved with the security incident and when it was launched. With this information, you can perform the first query for this scenario, which will confirm what type of user credentials were used to launch the instance involved.

Query 2.1


 SELECT userIdentity.principalid, eventName, eventTime, recipientAccountId, awsRegion FROM 467f2e52-84b9-4d41-8049-bc8f8fad35dd
 WHERE responseElements IS NOT NULL AND
 element_at(responseElements, 'instancesSet') like '%"instanceId":"i-053a7e6164c0f0473"%' AND
 eventTime > '2022-09-13 12:45:59' AND eventName='RunInstances'

The results of the query are as follows:

Figure 10: Sample query 2.1 and results in the console

The results demonstrate that the IAM Identity Center user as principal ID AROASVPO5CIEXAMPLE:[email protected] was used to launch the EC2 instance that was involved in the incident.

Query 2.2: Confirm in which AWS accounts the IAM Identity Center user has federated and authenticated

You want to confirm which AWS accounts this specific IAM Identity Center user has federated and authenticated with, and also which IAM role was assumed. This is important information to make sure that the security event happened only within the affected AWS account. The window of time for this query is based on the maximum value for the permission sets’ session duration in IAM Identity Center.

Query 2.2


 SELECT element_at(serviceEventDetails, 'account_id') as AccountID, element_at(serviceEventDetails, 'role_name') as SSORole, eventID, eventTime FROM 467f2e52-84b9-4d41-8049-bc8f8fad35dd WHERE eventSource = 'sso.amazonaws.com' AND eventName = 'Federate' AND userIdentity.username = '[email protected]' AND eventTime > '2022-09-13 00:00:00' AND eventTime < '2022-09-14 00:00:00'

The results of the query are as follows:

Figure 11: Sample query 2.2 and results in the console

The results show that only one AWS account has been accessed during the time of the incident, and only one AWS role named AdministratorAccess has been used.

Query 2.3: Count and group activity based on API actions that were performed by the user in each AWS service

You now know exactly where the user has gained access, so next you can count and group the activity based on the API actions that were performed in each AWS service. This information helps you confirm the types of activity that were performed.

Query 2.3


 SELECT eventSource, eventName, COUNT(*) AS apiCount
 FROM 467f2e52-84b9-4d41-8049-bc8f8fad35dd
 WHERE userIdentity.principalId = 'AROASVPO5CIEXAMPLE:[email protected]'
 AND eventTime > '2022-09-13 00:00:00' AND eventTime < '2022-09-14 00:00:00'
 GROUP BY eventSource, eventName ORDER BY apiCount DESC

The results of the query are as follows:

Figure 12: Sample query 2.3 and results in the console

You can see that the list of APIs includes the read activities Get, Describe, and List. This activity is commonly associated with the discovery stage, when the unauthorized user is gathering information to determine credential permissions.

Query 2.4: Obtain mutable activity based on API actions performed by the user in each AWS service

To get a better understanding of the mutable actions performed by the user, you can add a new condition to hide the read-only actions by setting the readOnly parameter to false. You will want to focus on mutable actions to know whether there were new AWS resources created or if existing AWS resources were deleted or modified. Also, you can add the possible error code from the response element to the query, which will tell you if the actions were denied.

Query 2.4


 SELECT eventSource, eventName, eventTime, eventID, errorCode
 FROM 467f2e52-84b9-4d41-8049-bc8f8fad35dd
 WHERE userIdentity.principalId = 'AROASVPO5CIEXAMPLE:[email protected]'
 AND readOnly = false
 AND eventTime > '2022-09-13 00:00:00' AND eventTime < '2022-09-14 00:00:00'

The results of the query are as follows:

Figure 13: Sample query 2.4 and results in the console

You can confirm that some actions, like EC2 RunInstances, EC2 CreateImage, SSM StartSession, IAM CreateUser, and IAM PutRolePolicy were allowed. And in contrast, IAM CreateAccessKey, IAM CreateRole, IAM AttachRolePolicy, and GuardDuty DeleteDetector were denied. The IAM-related denied actions are commonly associated with persistence tactics, where an unauthorized user may try to maintain access to the environment. The GuardDuty denied action is commonly associated with defense evasion tactics, where the unauthorized user is trying to cover their tracks and avoid detection.

Query 2.5: Obtain more information about API action EC2 RunInstances

You can focus first on the API action EC2 RunInstances to understand how many EC2 instances were created by the same user. This information will confirm which other EC2 instances were involved in the security event.

Query 2.5


 SELECT awsRegion, recipientAccountId, eventID, element_at(responseElements, 'instancesSet') as instances
 FROM 467f2e52-84b9-4d41-8049-bc8f8fad35dd
 WHERE userIdentity.principalId = 'AROASVPO5CIEXAMPLE:[email protected]'
 AND eventName='RunInstances'
 AND eventTime > '2022-09-13 00:00:00' AND eventTime < '2022-09-14 00:00:00'

The results of the query are as follows:

Figure 14: Sample query 2.5 and results in the console

You can confirm that the API was called twice, and if you expand the column InstanceSet in the response element, you will see the exact number of EC2 instances that were launched. Also, you can find that these EC2 instances were launched with an IAM instance profile called ec2-role-ssm-core. By checking in the IAM console, you can confirm that the IAM role associated has only the AWS managed policy AmazonSSMManagedInstanceCore attached, which enables AWS Systems Manager core functionality.

Query 2.6: Get the list of denied API actions performed by the user for each AWS service

Now, you can filter more to focus only on those denied API actions by performing the following query. This is important because it can help you to identify what kind of malicious event was attempted.

Query 2.6


 SELECT recipientAccountId, awsRegion, eventSource, eventName, eventID, eventTime
 FROM 467f2e52-84b9-4d41-8049-bc8f8fad35dd
 WHERE userIdentity.principalId = 'AROASVPO5CIEXAMPLE:[email protected]'
 AND errorCode = 'AccessDenied'
 AND eventTime > '2022-09-13 00:00:00' AND eventTime < '2022-09-14 00:00:00'

The results of the query are as follows:

Figure 15: Sample query 2.6 and results in the console

You can see that the user has tried to stop GuardDuty by calling DeleteDetector, and has also performed actions within IAM that you should examine more closely to know if new unwanted access to the environment was created.

Query 2.7: Obtain more information about API action IAM CreateUserAccessKeys

With the previous query, you confirmed that more actions were denied within IAM. You can now focus on the failed attempt to create IAM user access keys that could have been used to gain persistent and programmatic access to the AWS account. With the following query, you can make sure that the actions were denied and determine the reason why.

Query 2.7


 SELECT recipientAccountId, awsRegion, eventID, eventTime, errorCode, errorMessage
 FROM 467f2e52-84b9-4d41-8049-bc8f8fad35dd
 WHERE userIdentity.principalId = 'AROASVPO5CIEXAMPLE:[email protected]'
 AND eventName='CreateAccessKey'
 AND eventTime > '2022-09-13 00:00:00' AND eventTime < '2022-09-14 00:00:00'

The results of the query are as follows:

Figure 16: Sample query 2.7 and results in the console

If you copy the errorMessage element from the response, you can confirm that the action was denied by a service control policy, as shown in the following example.


 "errorMessage":"User: arn:aws:sts::111122223333:assumed-role/AWSReservedSSO_AdministratorAccess_f53d10b0f8a756ac/[email protected] is not authorized to perform: iam:CreateAccessKey on resource: user production-user with an explicit deny in a service control policy"

Query 2.8: Obtain more information about API IAM CreateUser

From the query error message in query 2.7, you can confirm the name of the IAM user that was used. Now you can check the allowed API action IAM CreateUser that you observed before to see if the IAM users match. This helps you confirm that there were no other IAM users involved in the security event.

Query 2.8


 SELECT recipientAccountId, awsRegion, eventID, eventTime, element_at(responseElements, 'user') as userInfo
 FROM 467f2e52-84b9-4d41-8049-bc8f8fad35dd
 WHERE userIdentity.principalId = 'AROASVPO5CIEXAMPLE:[email protected]'
 AND eventName='CreateUser'
 AND eventTime > '2022-09-13 00:00:00' AND eventTime < '2022-09-14 00:00:00'

The results of the query are as follows:

Figure 17: Sample query 2.8 and results in the console

Based on this output, you can confirm that the IAM user is indeed the same. This user was created successfully but was denied the creation of access keys, confirming the failed attempt to get new persistent and programmatic credentials.

Query 2.9: Get more information about the IAM role creation attempt

Now you can figure out what happened with the IAM CreateRole denied action. With the following query, you can see the full error message for the denied action.

Query 2.9


 SELECT recipientAccountId, awsRegion, eventID, eventTime, errorCode, errorMessage
 FROM 467f2e52-84b9-4d41-8049-bc8f8fad35dd
 WHERE userIdentity.principalId = 'AROASVPO5CIEXAMPLE:[email protected]'
 AND eventName='CreateRole'
 AND eventTime > '2022-09-13 00:00:00' AND eventTime < '2022-09-14 00:00:00'

The results of the query are as follows:

Figure 18: Sample query 2.9 and results in the console

If you copy the output of this query, you will see that the role was denied by a service control policy, as shown in the following example:


 "errorMessage":"User: arn:aws:sts::111122223333:assumed-role/AWSReservedSSO_AdministratorAccess_f53d10b0f8a756ac/[email protected] is not authorized to perform: iam:CreateRole on resource: arn:aws:iam::111122223333:role/production-ec2-role with an explicit deny in a service control policy"

Query 2.10: Get more information about IAM role policy changes

With the previous query, you confirmed that the unauthorized user failed to create a new IAM role to replace the existing EC2 instance profile in an attempt to grant more permissions. And with another of the previous queries, you confirmed that the IAM API action AttachRolePolicy was also denied, in another attempt for the same goal, but this time trying to attach a new AWS managed policy directly. However, with this new query, you can confirm that the unauthorized user successfully applied an inline policy to the EC2 role associated with the existing EC2 instance profile, with full admin access.

Query 2.10


 SELECT recipientAccountId, eventID, eventTime, element_at(requestParameters, 'roleName') as roleName, element_at(requestParameters, 'policyDocument') as policyDocument FROM 467f2e52-84b9-4d41-8049-bc8f8fad35dd WHERE userIdentity.principalId = 'AROASVPO5CIEXAMPLE:[email protected]' AND eventName = 'PutRolePolicy' AND eventTime > '2022-09-13 00:00:00' AND eventTime < '2022-09-14 00:00:00'

The results of the query are as follows:

Figure 19: Sample query 2.10 and results in the console

Summary of incident scenario 2

This second scenario describes a security incident that involves an IAM Identity Center user that has been compromised. To investigate this incident by using CloudTrail Lake capabilities, you did the following:

Started the investigation by looking at metadata from the GuardDuty EC2 finding
Confirmed the AWS credentials that were used for the creation of that resource
Looked at whether the IAM Identity Center user credentials were used to access other AWS accounts
Did further investigation on the AWS APIs that were called by the IAM Identity Center user
Obtained the list of denied actions, confirming the unauthorized user’s attempt to get persistent access and cover their tracks
Obtained the list of EC2 resources that were successfully created in this security event

Conclusion

In this post, we’ve shown you how to use AWS CloudTrail Lake capabilities to investigate CloudTrail activity in response to security incidents across your organization. We also provided sample queries for two security incident scenarios. You now know how to use the capabilities of CloudTrail Lake to assist you and your security teams during the investigation process in a security incident. Additionally, you can find some of the sample queries related to this post and other topics in the following GitHub repository, and additional examples in the sample queries tab in the CloudTrail console. To learn more, see Working with CloudTrail Lake in the CloudTrail User Guide.

Regarding pricing for CloudTrail Lake, you pay for ingestion and storage together, where the billing is based on the amount of uncompressed data ingested. If you’re a new customer, you can try AWS CloudTrail Lake for a 30-day free trial or when you reach the free usage limits of 5GB of data. For more information, see see AWS CloudTrail pricing.

Finally, in combination with the investigation techniques shown in this post, we also recommend that you explore the use of Amazon Detective, an AWS managed and dedicated service that simplifies the investigative process and helps security teams conduct faster and more effective investigations. With the Amazon Detective prebuilt data aggregations, summaries, and context, you can quickly analyze and determine the nature and extent of possible security issues.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Push Amazon EMR step logs from Amazon EC2 instances to Amazon CloudWatch logs

2023-04-07 Nausheen Sayed

Post Syndicated from Nausheen Sayed original https://aws.amazon.com/blogs/big-data/push-amazon-emr-step-logs-from-amazon-ec2-instances-to-amazon-cloudwatch-logs/

Amazon EMR is a big data service offered by AWS to run Apache Spark and other open-source applications on AWS to build scalable data pipelines in a cost-effective manner. Monitoring the logs generated from the jobs deployed on EMR clusters is essential to help detect critical issues in real time and identify root causes quickly.

Pushing those logs into Amazon CloudWatch enables you to centralize and drive actionable intelligence from your logs to address operational issues without needing to provision servers or manage software. You can instantly begin writing queries with aggregations, filters, and regular expressions. In addition, you can visualize time series data, drill down into individual log events, and export query results to CloudWatch dashboards.

To ingest logs that are persisted on the Amazon Elastic Compute Cloud (Amazon EC2) instances of an EMR cluster into CloudWatch, you can use the CloudWatch agent. This provides a simple way to push logs from an EC2 instance to CloudWatch.

The CloudWatch agent is a software package that autonomously and continuously runs on your servers. You can install and configure the CloudWatch agent to collect system and application logs from EC2 instances, on-premises hosts, and containerized applications. CloudWatch processes and stores the logs collected by the CloudWatch agent, which further helps with the performance and health monitoring of your infrastructure and applications.

In this post, we create an EMR cluster and centralize the EMR step logs of the jobs in CloudWatch. This will make it easier for you to manage your EMR cluster, troubleshoot issues, and monitor performance. This solution is particularly helpful if you want to use CloudWatch to collect and visualize real-time logs, metrics, and event data, streamlining your infrastructure and application maintenance.

Overview of solution

The solution presented in this post is based on a specific configuration where the EMR step concurrency level is set to 1. This means that only one step is run at a time on the cluster. It’s important to note that if the EMR step concurrency level is set to a value greater than 1, the solution may not work as expected. We highly recommend verifying your EMR step concurrency configuration before implementing the solution presented in this post.

The following diagram illustrates the solution architecture.

Solution Architecture Diagram

The workflow includes the following steps:

Users start an Apache Spark EMR job, creating a step on the EMR cluster. Using Apache Spark, the workload is distributed across the different nodes of the EMR cluster.
In each node (EC2 instance) of the cluster, a CloudWatch agent watches different logs directories, capturing new entries in the log files and pushing them to CloudWatch.
Users can view the step logs accessing the different log groups from the CloudWatch console. The step logs written by Amazon EMR are as follows:
- controller — Information about the processing of the step. If your step fails while loading, you can find the stack trace in this log.
- stderr — The standard error channel of Spark while it processes the step.
- stdout — The standard output channel of Spark while it processes the step.

We provide an AWS CloudFormation template in this post as a general guide. The template demonstrates how to configure a CloudWatch agent on Amazon EMR to push Spark logs to CloudWatch. You can review and customize it as needed to include your Amazon EMR security configurations. As a best practice, we recommend including your Amazon EMR security configurations in the template to encrypt data in transit.

You should also be aware that some of the resources deployed by this stack incur costs when they remain in use.

In the next sections, we go through the following steps:

Create and upload the bootstrap script to an Amazon Simple Storage Service (Amazon S3) bucket.
Use the CloudFormation template to create the following resources:
- An AWS Identity and Access Management (IAM) instance profile role for Amazon EMR.
- An AWS Systems Manager parameter with the CloudWatch agent’s configuration.
- An EMR cluster with Spark as an installed application.
- A Spark job.
Monitor the Spark logs on the CloudWatch console.

Prerequisites

This post assumes that you have the following:

An AWS account.
An S3 bucket for storing the bootstrap script.
A VPC created in Amazon Virtual Private Cloud (Amazon VPC), where your EMR cluster will be launched.
Default IAM service roles for Amazon EMR permissions to AWS services and resources. You can create these roles with the aws emr create-default-roles command in the AWS Command Line Interface (AWS CLI).
An optional EC2 key pair, if you plan to connect to your cluster through SSH rather than AWS Systems Manager Session Manager.

Create and upload the bootstrap script to an S3 bucket

For more information, see Uploading objects and Installing and running the CloudWatch agent on your servers.

To create and the upload the bootstrap script, complete the following steps:

Create a local file named bootstrap_cloudwatch_agent.sh with the following content:

#!/bin/bash

echo -e 'Installing CloudWatch Agent... \n'
sudo rpm -Uvh --force https://s3.amazonaws.com/amazoncloudwatch-agent/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm

echo -e 'Starting CloudWatch Agent... \n'
sudo amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c ssm:AmazonCloudWatch-Config.json -s

On the Amazon S3 console, choose your S3 bucket.
On the Objects tab, choose Upload.
Choose Add files, then choose the bootstrap script.
Choose Upload, then choose the file name: bootstrap_cloudwatch_agent.sh.
Choose Copy S3 URI. We use this value in a later step.

Provision resources with the CloudFormation template

Choose Launch Stack to launch a CloudFormation stack in your account and deploy the template:

This template creates an IAM role, IAM instance profile, Systems Manager parameter, and EMR cluster. The cluster starts the Spark PI estimation example application. You will be billed for the AWS resources used if you create a stack from this template.

The CloudFormation wizard will ask you to modify or provide these parameters:

InstanceType – The type of instance for all instance groups. The default is m4.xlarge.
InstanceCountCore – The number of instances in the core instance group. The default is 2.
EMRReleaseLabel – The Amazon EMR release label you want to use. The default is emr-6.9.0.
BootstrapScriptPath – The S3 path of your CloudWatch agent installation bootstrap script that you copied earlier.
Subnet – The EC2 subnet where the cluster launches. You must provide this parameter.
EC2KeyPairName – An optional EC2 keypair for connecting to cluster nodes, as an alternative to Session Manager.

Monitor the log streams

After the CloudFormation stack deploys successfully, on the CloudWatch console, choose Log groups in the navigation pane. Then filter the log groups by the prefix /aws/emr/master.

choose Log groups in the navigation pane

The ID in the log group corresponds to the EC2 instance ID of the EMR primary node. If you have multiple EMR clusters, you can use this ID to identify a particular EMR cluster, based on the primary node ID.

In the log group, you will find the three different log streams.

The log streams contain the following information:

step-stdout – The standard output channel of Spark while it processes the step.
step-stderr – The standard error channel of Spark while it processes the step.
step-controller – Information about the processing of the step. If your step fails while loading, you can find the stack trace in this log.

Clean up

To avoid future charges in your account, delete the resources you created in this walkthrough. The EMR cluster will incur charges as long as the cluster is active, so stop it when you’re done.

On the CloudFormation console, in the navigation pane, choose Stacks.
Choose the stack you launched (EMR-CloudWatch-Demo), then choose Delete.
Empty the S3 bucket you created.
Delete the S3 bucket you created.

Conclusion

Now that you have completed the steps in this walkthrough, you have the CloudWatch agent running on your cluster hosts and configured to push EMR step logs to CloudWatch. With this feature, you can effectively monitor the health and performance of your Spark jobs running on Amazon EMR, detecting critical issues in real time and identifying root causes quickly.

You can package and deploy this solution through a CloudFormation template like this example template, which creates the IAM instance profile role, Systems Manager parameter, and EMR cluster.

To take this further, consider using these logs in CloudWatch alarms for alerts on a log group-metric filter. You could collect them with other alarms into a composite alarm or configure alarm actions such as sending Amazon Simple Notification Service (Amazon SNS) notifications to trigger event-driven processes such as AWS Lambda functions.

About the Author

Ennio Pastore is a Senior Data Architect on the AWS Data Lab team. He is an enthusiast of everything related to new technologies that have a positive impact on businesses and general livelihood. Ennio has over 10 years of experience in data analytics. He helps companies define and implement data platforms across industries, such as telecommunications, banking, gaming, retail, and insurance.

Reduce triage time for security investigations with Amazon Detective visualizations and export data

2023-04-05 Alex Waddell

Post Syndicated from Alex Waddell original https://aws.amazon.com/blogs/security/reduce-triage-time-for-security-investigations-with-detective-visualizations-and-export-data/

To respond to emerging threats, you will often need to sort through large datasets rapidly to prioritize security findings. Amazon Detective recently released two new features to help you do this. New visualizations in Detective show the connections between entities related to multiple Amazon GuardDuty findings, and a new export data feature helps you use the data from Detective in your other tools and automated workflows.

By using these new features, you can quickly analyze, correlate, and visualize the large amounts of data generated by sources like Amazon Virtual Private Cloud (Amazon VPC) Flow Logs, AWS CloudTrail, Amazon GuardDuty findings, and Amazon Elastic Kubernetes Service (Amazon EKS) audit logs.

In this post, we’ll show you how you can use these new features to help reduce the time it takes to assess, investigate, and prioritize a security incident.

A security finding is raised

The workflow starts with GuardDuty. GuardDuty continuously monitors AWS accounts, Amazon Elastic Compute Cloud (Amazon EC2) instances, EKS clusters, and data stored in Amazon Simple Storage Service (Amazon S3) for malicious activity without the use of security software or agents.

If GuardDuty detects potential malicious activity, such as anomalous behavior, credential exfiltration, or command and control (C2) infrastructure communication, it generates detailed security incidents called findings.

Depending on the severity and complexity of the GuardDuty finding, the resolution might require deep investigation. Consider an example that involves cryptocurrency mining. If you frequently see a cryptocurrency finding on your EC2 instances, you might have a recurring malware issue that has enabled a backdoor. If a threat actor is attempting to compromise your AWS environment, they typically perform a sequence of actions that lead to multiple findings and unusual behavior. When security findings are investigated in isolation, it can lead to a misinterpretation of their significance and difficulty in finding the root cause. When you need more context around a finding, Detective can help.

Detective automatically collects log data and events from sources like CloudTrail logs, Amazon VPC Flow Logs, GuardDuty findings, and Amazon EKS audit logs and maintains up to a year of aggregated data for analysis. Detective uses machine learning to create a behavioral graph for these data sources that helps show how security issues have evolved. It highlights what AWS resources might be compromised and flags unusual activity like new API calls, new user agents, and new AWS Regions.

The search capabilities work across AWS workloads, providing the information required to show the potential impact of an incident. Detective helps you answer questions like: How did this security incident happen? What AWS resources were affected? How can we prevent this from happening again?

Finding groups help connect the dots of an incident

You can use finding groups, a recent feature of Detective, to help with your investigations. A finding group is a collection of entities related to a single potential security incident that should be investigated together. An entity can be an AWS resource like an EC2 instance, IAM role, or GuardDuty finding, but it can also be an IP address or user agent. For a full list of entities collected, see Searching for a finding or entity in the Detective User Guide.

Grouping these entities together helps provide context and a more complete understanding of the threat landscape. This makes it simpler for you to identify relationships between different events and to assess the overall impact of a potential threat.

In the cryptocurrency mining example described previously, finding groups could show the relationship between the cryptocurrency mining finding and a C2 finding so that you know the two are related and the AWS resources affected. To learn more about working with Detective finding groups, see How to improve security incident investigations using Amazon Detective finding groups.

Figure 1 shows the finding groups overview page on the Detective console, with a list of finding groups filtered by status. The dashboard also shows the severity, title, observed tactics, accounts, entities, and the total number of findings for each finding group. For more information about the attributes of finding groups, see Analyzing finding groups.

Figure 1: Finding groups overview

To see details about the finding group, select the title of the finding group to access the details page, which includes Details, Visualization, Involved entities, and Involved findings. On this panel, you can view entities and findings included in a finding group and interact with them. The information presented is the same in the Visualization panel, the Involved entities panel, and the Involved findings panel. The different views allow you to view the information in the way that is helpful for you. Figure 2 shows an example of the Details and Visualization for a specific finding group.

Figure 2: Details and Visualization

Note: Finding groups with over 100 nodes (findings and entities) do not include a graph visualization.

Visualizations to show you the situation

The new visualizations in Detective provide three layouts that display the same information from finding groups, but allow you to choose and arrange the different entities so that you can focus on the highest priority finding or resources.

To determine what each visual element represents, choose the Legend in the bottom left corner of the panel. You can change the placement of findings in the Visualization panel by selecting a different layout from the Select layout dropdown menu. Figure 2 in the preceding section shows the Force-directed layout, where the positioning of entities and findings presents an even distribution of links with minimal overlap, while maintaining consistent distance between items.

Figure 3 shows the Visualization panel with the Circle layout, where nodes are displayed in a circular layout. You can use the Legend to understand the different categories of Findings, Compute, Network, Identity, Storage, or Other.

If you’re unfamiliar with these terms, see Amazon Detective terms and concepts to learn more.

Figure 3: Visualization panel with Circle layout and Legend

Figure 4 shows the Visualization panel with the Grid layout, where nodes are divided into four different columns: evidence, identity entities, GuardDuty findings, and other entities (compute, network, and storage).

Figure 4: Visualization panel with Grid layout

In the Visualization panel, you can select one or more (using ctrl/cmd+click) nodes. Selected nodes are listed next to the graph, and you can select each node’s title for more information. Selecting an entity’s title opens a new page that displays detailed information about that entity, whereas selecting a finding or evidence expands the right sidebar to show details on the selected finding or evidence.

You can rearrange chosen entities and findings as needed to help improve your understanding of their connections. This can help speed up your assessment of findings. Figure 5 shows the Visualization panel with four nodes selected and the sidebar displaying information relevant to the selected finding.

Figure 5: Visualization panel with evidence selected

Finding groups and visualizations provide an overview of the entities and resources related to a security activity. Presenting the information in this way highlights the interconnections between various activities. This means that you no longer have to use multiple tools or query different services to collect information or investigate entities and resources. This can help you reduce triage and scoping times and make your investigations faster and more comprehensive.

Increased flexibility for investigation with simpler data access

To expand the scope of your investigation or confirm if a security incident has taken place, you might want to combine data from Detective with your own tools or different services. This is where export data comes into play.

Detective has several Summary page panels that you can use as a starting point for your investigations because they highlight potentially suspicious activity. The panels include roles and users with the most API call volume, EC2 instances with the most traffic volume, and EKS clusters with the most Kubernetes pods created.

With export data, you can now export these panels as common-separated values (CSV) files and import the data into other AWS services or third-party applications, or manipulate the data with spreadsheet programs.

Export from the Detective console Summary screen

In the Detective console, on the Summary page, you will see an Export option on several summary panes. This is enabled and available for anyone with access to Detective. Figure 6 shows summary information for the roles and users with the most API call volume.

Figure 6: Detective console Summary screen

Choose Export to download a CSV file containing the data for the summary information. The file is downloaded to your browser’s default download folder on your local device. When you view the data, it will look something like the spreadsheet in Figure 7:

Figure 7: Example CSV data export from Detective console Summary

Export from the Detective console Search screen

You can also export data using the Search capability of Detective. After you apply specific filters to search for findings or entities based on your use case, an Export button appears at the top of the search results in Detective. Figure 8 shows an example of filtering for a particular CIDR range. Choose Export to download the CSV file containing the filtered data.

Figure 8: Detective search results filtering for a CIDR range

Conclusion

In this blog post, you learned how to use the two new features of Detective to visualize findings and export data. By using these new features, you and your teams can investigate an incident in the way that best fits your workflow. New visualizations show the entities involved in an issue and surface nuanced connections that can be difficult to find when you’re faced with line after line of log data. The new data export feature makes it simpler to integrate the insights discovered in Detective with the tools and automations that your team is already using.

These features are automatically enabled for both existing and new customers in AWS Regions that support Detective. There is no additional charge for finding groups. If you don’t currently use Detective, you can start a free 30-day trial. For more information on finding groups, see Analyzing finding groups in the Amazon Detective User Guide.

If you have feedback about this post, submit comments in the Comments section below. You can also start a new thread on Amazon Detective re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

TLS inspection configuration for encrypted traffic and AWS Network Firewall

2023-04-04 Shiva Vaidyanathan

Post Syndicated from Shiva Vaidyanathan original https://aws.amazon.com/blogs/security/tls-inspection-configuration-for-encrypted-traffic-and-aws-network-firewall/

AWS Network Firewall is a managed service that provides a convenient way to deploy essential network protections for your virtual private clouds (VPCs). In this blog, we are going to cover how to leverage the TLS inspection configuration with AWS Network Firewall and perform Deep Packet Inspection for encrypted traffic. We shall also discuss key considerations and possible architectures.

Today, the majority of internet traffic is SSL/TLS encrypted to maintain privacy and secure communications between applications. Deep packet inspection (DPI) refers to the method of examining the full content of data packets as they traverse a network perimeter firewall. However, the lack of visibility into encrypted traffic presents a challenge to organizations that do not have the resources to decrypt and inspect network traffic. TLS encryption can hide malware, conceal data theft, or mask data leakage of sensitive information such as credit card numbers or passwords. Additionally, TLS decryption is compute-intensive and cryptographic standards are constantly evolving. Organizations that want to decrypt and inspect network traffic typically use a combination of hardware and software solutions from multiple vendors, which adds operational complexity and implementation challenges around capacity planning, scaling issues, and latency concerns. This forces some organizations to make adverse decisions to reduce the complexity of inspecting their network traffic such as blocking access to popular websites to mitigate performance problems.

There are multiple options you can use to perform DPI for encrypted traffic in your AWS environment, based on the use case. These include using AWS WAF or implementing third-party security appliances (next generation firewalls). The addition of new services like Gateway Load Balancer gives you more flexibility in designing your firewall architectures and the ability to perform DPI on AWS.

With this release, Network Firewall also becomes an option to support Deep Packet Inspection on encrypted payloads.

Considerations for deep packet inspection

The following are some key factors to consider when you enable TLS decryption functionality on Network Firewall.

DPI and performance. DPI is processor-intensive, because it not only looks into individual packets, but it also looks into traffic flows (a flow is a collection of related packets). This is combined with the fact that inspection needs to be done in real time with minimal impact to latency. Also, because many firewalls perform other advanced functions (for example, stateful packet inspection, NAT, virtual private network (VPN), and malware threat prevention), adding DPI increases the complexity of the entire system and impacts performance. However, because Network Firewall is an AWS managed service, the bandwidth performance of 100 gigabits per second (Gbps) per firewall endpoint is not impacted, even after you enable TLS inspection configuration. Single digit millisecond latency is expected at initial connection due to the TCP and TLS handshake before data can flow to the firewall. We recommend that you conduct your own testing for the rule sets to verify that the service meets your performance expectations.

DPI and encryption. Encryption has particularly been a challenge to DPI. Effective decisions can’t be made if the contents of the packets aren’t known. As more applications and websites use encryption, it is important that you implement the right TLS decryption technique. With Network Firewall, you can chose which traffic to decrypt by using your available certificates in AWS Certificate Manager (ACM). You can then apply the TLS configurations across the stateful rule groups, thereby authorizing Network Firewall to act as a go-between. For more information on how AWS Network Firewall handles privacy, please read the Network Firewall documentation.

AWS Network Firewall deployment architectures

There are three main architecture patterns for Network Firewall deployments. You can refer to the Deployment models for AWS Network Firewall blog post, which provides details on these, as well as key considerations. The three main models are as follows:

Distributed deployment model — Network Firewall is deployed into each individual VPC.
Centralized deployment model — Network Firewall is deployed into a centralized VPC for East-West (VPC-to-VPC) or North-South (inbound and outbound from internet, on-premises) traffic. We refer to this VPC as the inspection VPC throughout this blog post.
Combined deployment model — Network Firewall is deployed into a centralized inspection VPC for East-West (VPC-to-VPC) and a subset of North-South (on-premises, egress) traffic. Internet ingress is distributed to VPCs that require dedicated inbound access from the internet, and Network Firewall is deployed accordingly.

Each of these architectures is still valid for TLS inspection functionality. Today, AWS Network Firewall supports TLS inspection only for the ingress (inbound) traffic coming into the VPC.

In this section, we will highlight a deployment architecture with AWS Network Firewall and the process for deep packet inspection.

AWS Network Firewall – prior to TLS inspection configuration

Below figure 1 shows how Network Firewall performs inspection when the TLS inspection feature isn’t enabled. The workflow is as follows:

The ingress traffic enters the VPC. Ingress routing enables the internet traffic to be inspected by AWS Network Firewall.
The traffic from the firewall endpoint to the Network Firewall:
1. Network Firewall inspects the packet first through a stateless engine. Network Firewall makes a drop/pass decision by applying the rules that are present in the stateless engine.
2. If there is no match on the set of stateless rules present, the traffic is then forwarded to the stateful engine. Again, a drop/pass decision is made by applying the set of stateful rules.
If the decision is to pass traffic, then the firewall endpoint present in the firewall subnet sends the traffic to the customer subnet through the routes present in the VPC subnet route table.

Figure 1: AWS Network Firewall without TLS inspection configuration

AWS Network Firewall — after TLS inspection configuration

After you enable the TLS inspection capability in Network Firewall, the traffic flow changes slightly, as shown in Figure 2. Because the ingress data you want to inspect is encrypted, it first needs to be decrypted before it is sent to the firewall stateful engine.

In Figure 2, you can see the ingress traffic flow, which has the following steps:

The ingress traffic enters the VPC. Ingress routing enables the internet traffic to be inspected by AWS Network Firewall.
The traffic from the firewall endpoint to the Network Firewall:
1. Network Firewall inspects the packet first through a stateless engine. Network Firewall makes a drop/pass decision by applying the rules present in the stateless engine.
2. If there is no match on the set of stateless rules present, the traffic is then forwarded to the stateful engine. However, before the traffic passes to the stateful engine, if there is no match and the traffic is in the scope of the TLS encryption configuration, the traffic is forwarded for the decrypt operation.
3. After decryption, the traffic is then forwarded to the firewall stateful engine for inspection. Again, Network Firewall makes a drop/pass decision by applying the set of stateful rules.
If the decision is to pass traffic, then the firewall endpoint present in the firewall subnet sends the traffic to the customer subnet through the routes present in the VPC subnet route table.

Note: Customers must trust this certificate for the TLS inspection configuration to function properly.

Figure 2: AWS Network Firewall with TLS inspection configuration

Implement TLS inspection in AWS Network Firewall

Let’s look at how to implement TLS inspection when you create a new network firewall in AWS Network Firewall. A TLS inspection configuration contains one or more references to a valid AWS Certificate Manager (ACM) SSL/TLS certificate that Network Firewall uses to decrypt ingress (inbound) traffic. Network Firewall supports a variety of certificate types supported in addition to wildcard certificates. You can optionally define a scope (5-tuple based) to decrypt traffic by source and destination IP or port. To follow this procedure, you must have at least one valid certificate type supported by Network Firewall in ACM that’s accessible by your AWS account.

To create a TLS inspection configuration (console)

Sign in to the AWS Management Console and open the Amazon VPC console.
In the navigation pane, under Network Firewall, choose TLS inspection configurations.
Choose Create TLS inspection configuration.

Figure 3: TLS inspection configuration for AWS Network Firewall
On the Associate SSL/TLS certificates page, in the search box, select the ACM certificate to use in the TLS inspection configuration, and then choose Add certificate. You can use as many as 10 certificates for a single configuration.

Figure 4: SSL/TLS certificate as part of Network Firewall inspection configuration
Choose Next to go to the TLS inspection configuration’s Describe TLS inspection configuration page.
For Name, enter a name to identify this TLS inspection configuration, and optionally enter a description for the TLS inspection configuration.
Choose Next to go to the TLS inspection configuration’s Define scope page.

Figure 5: Description for Network Firewall inspection configuration

Note that you can’t change the name after you create the TLS inspection configuration.
In the Scope configuration pane, you can optionally define one or more 5-tuple scopes for the domains that you want Network Firewall to decrypt. Network Firewall uses the corresponding SSL/TLS certificates in your TLS inspection configuration to decrypt the SSL/TLS traffic that matches the scope criteria.

Figure 6: Define scope for Network Firewall to decrypt
- For Protocol, choose the protocol to decrypt and inspect for. Network Firewall currently supports only TCP.
- For Source, choose the source IP addresses and ranges to decrypt and inspect for. You can inspect for either Custom IP addresses or Any IPv4 address. (IPv6 is currently not supported.)
- For Source port, choose the source ports and source port ranges to decrypt and inspect for. You can inspect for either Custom port ranges or Any port.
- For Destination, choose the destination IP addresses and ranges to decrypt and inspect for. You can inspect for either Custom IP addresses or Any IPv4 address.
- For Destination port, choose the destination ports and destination port ranges to decrypt and inspect for. You can inspect for either Custom port ranges or Any port.
After you’ve set the scope criteria, choose Add scope configuration, and then choose Next.
On the next page, Select encryption options, determine whether you want to use the AWS managed key or customize encryption settings (advanced). Here we use the default key that AWS owns and manages on your behalf, choose Next.

Figure 7: Select the encryption options
On the Add tags page, choose Next. Tags are optional but are recommended as a best practice. Tags help you organize and manage your AWS resources. For more information about tagging your resources, see Tagging AWS Network Firewall resources.
On the Review and confirm page, check the TLS inspection configuration settings. Choose Create TLS inspection configuration. Your TLS inspection configuration is now ready for use.

Figure 8: Validate the TLS inspection configuration

Update an existing network firewall with TLS inspection configuration

There are two methods that you can use modify an existing firewall configuration for TLS inspection, depending on your scenario.

Scenario 1: Add TLS inspection to an existing network firewall. In this scenario, you only need to consider the scope that TLS inspection applies to. After you have followed steps 1 through 12 outlined in the procedure in this post, and created the TLS inspection configuration, ingress (inbound) traffic will be decrypted and then sent to the stateful engine for inspection that uses your existing firewall policies.

Scenario 2: Modify an existing firewall with TLS inspection configured. In this scenario, where TLS configuration has already been added and you just need to modify the configuration, you can use the following steps. Note that you can’t change the name of a TLS inspection configuration after creation, but you can change other details.

To update a TLS inspection configuration

Sign in to the AWS Management Console and open the Amazon VPC console.
In the navigation pane, under Network Firewall, choose TLS inspection configurations.
On the TLS inspection configuration page, select the name of the TLS inspection configuration you want to update.
Make your desired changes to the configuration.
Choose Save.

To understand more about how Network Firewall handles changes, including TLS inspection configuration, refer to Managing your firewall policy in AWS Network Firewall.

Conclusion

AWS Network Firewall lets you inspect traffic at scale in a variety of use cases. In this blog post, we looked into the recently launched TLS inspection configuration for ingress inspection architectures and discussed considerations for enabling this feature. We showed how you can enable and update the TLS inspection feature on Network Firewall. To learn more about the TLS inspection feature, check out the AWS Network Firewall Developer Guide. We hope this post is helpful and look forward to hearing about how you use the latest feature.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Logging strategies for security incident response

2023-04-04 Anna McAbee

Post Syndicated from Anna McAbee original https://aws.amazon.com/blogs/security/logging-strategies-for-security-incident-response/

Effective security incident response depends on adequate logging, as described in the AWS Security Incident Response Guide. If you have the proper logs and the ability to query them, you can respond more rapidly and effectively to security events. If a security event occurs, you can use various log sources to validate what occurred and understand the scope. Then, you can use the results of your analysis to take remediation actions. To learn more about logging best practices, see Configure service and application logging and Analyze logs, findings, and metrics centrally.

In this blog post, we will show you how to achieve an effective strategy for logging for security incident response. We will share logging options across the typical cloud application stack, log analysis options, and sample queries. AWS offers managed services, such as Amazon GuardDuty for threat detection and Amazon Detective for incident analysis. If you want to collect additional logs or perform custom analysis, then you should consider the options described in this blog post.

Selection of logs

To select the appropriate logs for security incident response, you should start with the common cloud application stack, which consists of the components and layers of your application deployed on AWS. For each component, we will describe the logging sources that you have. For each log source, we will describe why you should log it for security incident response, how to enable the logs, and what your log storage options are.

To select the logs for security incident response, first consider the following questions:

What are your compliance and regulatory requirements for logging?

Note: Make sure that you comply with the log retention requirements of compliance standards relevant to your organization, as well as your organization’s incident response strategy.
What AWS services do you commonly use?
What AWS services have access to or contain sensitive data?
What threats are most relevant to you?

Note: Performing a threat model of your cloud architectures can help you answer this question. For more information, see How to approach threat modelling.

Considering these questions can help you develop requirements for logging that will guide your selection of the following log sources.

AWS account logs

An AWS account is the first, fundamental component of an application deployed on AWS. The account is a container for your AWS resources. You create and manage your AWS resources in this account, and the account provides administrative capabilities for access and billing.

AWS CloudTrail

Within an account, each action performed is an API call. From a console sign-in to the deployment of each resource in an AWS CloudFormation stack, events are generated to provide transparency on what has occurred in the account. With AWS CloudTrail, you can log, continuously monitor, and retain account activity related to actions across supported AWS services. CloudTrail provides the event history of your account activity, including actions taken through the AWS Management Console, AWS SDKs, command line tools, and other AWS services. CloudTrail logs API calls as three types of events:

Management events (also known as control plane operations) show management operations that are performed on resources in your account. This includes actions like creating an Amazon Simple Storage Service (Amazon S3) bucket and setting up logging.
Data events (also known as data plane operations) show the resource operations performed on or within resources in your account. These operations are often high-volume activities, such as Amazon S3 object-level API activity (for example, GetObject, DeleteObject, and PutObject API operations) and AWS Lambda function invocation activity.
Insights events capture unusual API call rate or error rate activity in your account. You must enable these events on a trail in order to capture them, and they are logged to a different folder prefix in the destination S3 bucket for your trail. Insights events provide you with information such as the type of event, the incident time period, the associated API, the error code, and statistics to help you understand and respond effectively to unusual activity.

For security investigations, CloudTrail provides context on the creation, modification, and deletion of AWS resources. Therefore, CloudTrail is one of your most important log sources for security incident response in an AWS environment. You have three primary ways to set up CloudTrail:

CloudTrail Event history — CloudTrail is enabled by default with 90-day retention of management events that you can retrieve through the CloudTrail Event history facility using the console, AWS Command line Interface (AWS CLI), or AWS SDK. You don’t need to take any action to get started using the Event history feature.
CloudTrail trail — For longer retention and visibility of data events, you need to create a CloudTrail trail and associate it with an S3 bucket and optionally with an Amazon CloudWatch log group. If you use AWS Organizations, you can create an organization trail that will log events for each account in the organization. By default, trails are multi-Region, so you don’t need to enable CloudTrail logs in each AWS Region.
AWS CloudTrail Lake — You can create a CloudTrail lake, which retains CloudTrail logs for up to seven years and provides a SQL-based querying facility. You don’t need to have a trail configured in your account to use CloudTrail Lake.
Amazon Security Lake — You can use Security Lake to ingest CloudTrail events, which include management and data events. You can further analyze these events with Amazon QuickSight or another other third-party security information and event management (SIEM) tool.

AWS Config

Creating and modifying resources is an integral part of your account use. Tracking resource configuration changes made by calling the AWS API helps you review changes throughout the resource lifecycle. AWS Config provides a detailed view of the configuration of AWS resources in your account, examines the resource configurations periodically, and tracks configuration changes that were not initiated by the API. This includes how the resources are related to one another and how they were configured in the past so that you can see how configurations and relationships change over time.

You should enable AWS Config in each Region where you have resources deployed, and you should configure an S3 bucket to receive configuration history and configuration snapshot files, which contain details on the resources that AWS Config records. You can then review configuration compliance and analyze activities performed before, during, and after an event using the configuration history in S3. You should centralize AWS Config resource tracking across multiple accounts in the same organization by setting up an aggregator. You can use AWS Control Tower to automate the setup.

During a security investigation, you might want to understand how a resource configuration has changed over time. For example, you might want to investigate the changes to an S3 bucket policy before and after a security event that involves an S3 bucket. AWS Config provides a configuration history for resources that can help you track activities performed during a security event.

Operating system and application logs

To record interactions with applications, you must capture operating system (OS) and application logs, especially custom logs generated by the application development framework. OS and local application logs are relevant for security events that involve an Amazon Elastic Compute Cloud (Amazon EC2) instance. These instances could be standalone, in an auto scaling group behind a load balancer, or compute workloads for Amazon Elastic Container Service (Amazon ECS) or an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. OS logs track privileged use, processes, login events, access to directory services, and file system activity on a server. To analyze a potential compromise to an EC2 instance, you will want to review the security event logs for Windows OS and the system logs for Linux-based OS.

With the unified CloudWatch agent, you can collect metrics and logs from EC2 instances and on-premises servers. The CloudWatch agent aggregates log data into CloudWatch logs, which can then be exported to Amazon S3 for long-term retention and analyzed with a SIEM tool of your choice or Amazon Athena, as shown in Figure 1.

Figure 1: Aggregate OS and application logs using CloudWatch Logs

Database logs

With SQL databases, you can log transactions to help track modifications to the databases, such as additions or deletions. After an engine or system failure, you will need transaction logs to restore a database to a consistent state. Transaction logs are designed to be secure, and they require additional processing to access valuable information. It’s important that you understand data interactions during a security investigation, especially if your databases hold personally identifiable information (PII), financial and payments information, or other information subject to regulatory controls.

When you use Amazon Relational Database Service (Amazon RDS), you can publish database logs to Amazon CloudWatch Logs. For NoSQL databases, tracking atomic interactions is useful. You can find logs for managed NoSQL databases like Amazon DynamoDB in CloudTrail. DynamoDB integrates with CloudTrail, providing a record of actions taken by a user, role, or service. These events are classified as data events in CloudTrail.

Network logs

The goal of logging network activity is to gain insight into the communications that traverse your network. You might need this data for a variety of reasons, such as network troubleshooting or for use in a forensic investigation of suspected malware activity within your network.

In the AWS Cloud, you can log network activity by creating a proxy that logs network traffic or by using Traffic Mirroring to send a copy of network traffic to a logging server. You can adopt cloud-native approaches to capture this type of data using Amazon Route 53 DNS query logs and Amazon VPC Flow Logs.

There are also a variety of third-party networking solutions available like Palo Alto Networks and Fortinet, so you can continue to use the network logging mechanisms that you might have used in an on-premises environment.

Route 53 DNS query logs

You can configure Amazon Route 53 to log Domain Name System (DNS) queries. These logs are categorized into two groups:

Public DNS query logging
Resolver query logging

Logging public DNS queries against domains that you have hosted in Route 53 provides query information, such as the domain or subdomain requested, date and time stamp of the request, DNS record type, Route 53 edge location that responded, and response code.

The Amazon Route 53 Resolver comes with Amazon Virtual Private Cloud (Amazon VPC) by default. Capturing Resolver query logs provides the same information as public queries, as well as additional information such as the Instance ID of the resource that the query originated from. You can also capture Resolver query logs against different types of queries.

VPC Flow Logs

You can configure VPC Flow Logs for a VPC in your account to capture traffic that enters and moves around your VPC network, without the addition of instances or products. From these logs, you can review information, such as source and destination IP, ports, timestamps, protocol, account ID, and whether the traffic was accepted or rejected. For a complete list of the fields available for flow log records, see Available fields. You can create a flow log for a VPC, a subnet, or a network interface. If you create a flow log for a subnet or VPC, IP traffic going to and from each network interface in that subnet or VPC will be logged. For more details on VPC Flow Logs, see Logging IP traffic using VPC Flow Logs.

You can forward flow logs to Amazon CloudWatch Logs to create CloudWatch alarms based on metric filters. You can also forward flow logs to an S3 bucket for long-term retention and further analysis. Figure 2 demonstrates these configurations.

Figure 2: Sending VPC Flow logs to CloudWatch Logs and S3

Access logs

To identify access patterns for accessible endpoints, especially public endpoints, you should use access logs. Access logs capture detailed information about requests sent to your load balancer. Each log contains information such as the time the request was received, the client’s IP address, latencies, request paths, and server responses. With services built in layers behind a load balancer, unless you track the X-Forwarded-For request header, the requestor’s context is lost. Access logs help bridge that gap during investigations and analysis.

Amazon S3 server access logs

Access logs are critical to track object level access when using S3 buckets to store confidential or sensitive data. You can also turn on CloudTrail to capture S3 data events. You can store access logs in S3 buckets for long-term storage for compliance purposes and to run analyses during and after an event.

Load balancing logs

Elastic Load Balancing provides access logs that capture detailed information about requests sent to load balancers. Each log contains information such as the time the request was received, the client’s IP address, latencies, request paths, and server responses. You can use this log to analyze traffic patterns and to troubleshoot issues.

Access logs is an optional feature of Elastic Load Balancing that is turned off by default. To enable access logs for load balancers, see Access logs for your Application Load Balancer.

If you implement your own reverse proxy for load balancing needs, make sure that you capture the reverse proxy access logs. You can use the unified CloudWatch agent to forward the logs to CloudWatch. As with OS logs, you can export CloudWatch logs to an S3 bucket for long-term retention and analysis.

If you use an Amazon CloudFront distribution as the public endpoint for end users with load balancers as the custom origin, then load balancing access logs will represent the CloudFront distribution as the requestor, rather than the actual end user. If this information doesn’t add value to your incident handling process, then you can use CloudFront access logs as the log source that provides end user request details.

CloudFront access logs

You should enable standard logs, also known as access logs, when using CloudFront. Specify an S3 bucket where you want CloudFront to save the files.

CloudFront access logs are delivered on a best-effort basis. For information about requests made to a distribution in real time, use real-time logs that are delivered within seconds of receiving the requests. You should use real-time logs to monitor, analyze, and take action based on content delivery performance. For more details on the fields available from these logs, see the CloudFront standard log file format.

AWS WAF logs

When associated with a supported resource like a CloudFront distribution, Amazon API Gateway REST API, Application Load Balancer, AWS AppSync GraphQL API, Amazon Cognito user pool, or AWS App Runner, AWS WAF can help you monitor HTTP and HTTPS requests that are forwarded to the resource. You should configure web access control lists (ACLs) to gain fine-grained control over the requests, and enable logging for such ACLs to get detailed information about traffic that is analyzed by AWS WAF. Log information includes time of the request being received by AWS WAF from the AWS resource, details about the request, and the AWS WAF rules that the request matched. You can use this log information to monitor access patterns of public endpoints and configure rules to inspect requests in detail. For more information about AWS WAF logging, see Logging web ACL traffic.

Serverless logs

Serverless computing has become increasingly popular in the cloud-computing space. It provides on-demand compute power in a relatively short burst, meaning that cloud-based instances don’t need to be provisioned and kept around, idle, when there are no tasks to be completed. Although more and more compute tasks are being moved to serverless solutions, the need to log has not changed, but how the logs are generated has. In a serverless environment, security investigations not only benefit from logs that demonstrate the interactions and changes made by the code deployed, but that also document changes to the deployed code itself and access permissions of the Lambda execution role that is granting privileged access.

AWS Lambda

The logging of Lambda functions involves two components: how the function itself is operating, and what is happening inside the function (what your code is actually doing).

The logging of a Lambda function itself occurs through data events captured by CloudTrail. As noted earlier in this post, you must configure data events on a trail created in CloudTrail. During configuration, you will need to specify the function from which logs will be captured by your trail, and the destination S3 bucket where they will be stored. These logs contain details on the invocation of the function and help identify the IAM principals that called the Invoke API for Lambda.

AWS Lambda automatically monitors Lambda functions on your behalf and sends logs to CloudWatch. Your Lambda function comes with a CloudWatch Logs log group and a log stream for each instance of your function. The Lambda runtime environment sends details about each invocation to the log stream, and relays logs and other output from your function’s code. For more details on how to monitor Lambda functions, see Accessing Amazon CloudWatch logs for AWS Lambda.

Log analysis

For incident response, you need to be able to analyze and query your logs to validate what occurred and to understand the scope.

To begin, you can aggregate logs from various sources in S3 buckets for long-term storage, and you can integrate that data with query tools for further investigation. Logs can be exported and either parsed through directly, or ingested by another tool to help with the analysis. The following are some options that you can use to query these logs:

Amazon Athena — You can directly query CloudTrail events stored in S3 with Athena using SQL commands, specifying the LOCATION of the log files. You would generally use this approach if you have advanced queries to run, and you don’t have a SIEM. To set up Athena to query logs, you can use this open-source solution from AWS.
Amazon OpenSearch Service — OpenSearch is a distributed search and log analytics suite. Because it’s open source, it can ingest logs from more than just AWS log sources. To set this up, you can use this open-source SIEM solution from AWS.
CloudTrail Event History — Either from the console, or programmatically, you can query CloudTrail management events from the last 90-day period. This is ideal for when you have simple queries to make within the last 90 days, and you don’t need stored logs or more complex queries.
AWS CloudTrail Lake — Either from the console, or programmatically, you can query stored events in your configured CloudTrail Lake from the time of its configuration, up until the maximum storage duration of 2,557 days (7 years) from the time that you make your query. This approach allows for SQL-based queries, and it is ideal for when you need to make more complex queries against events, but don’t require the additional features of a SIEM solution.
Parse through raw JSON using CLI — This is achieved programmatically and parsed through terminal commands. It’s more a legacy method of parsing through logs. You might choose to use this approach for analysis if another service or solution isn’t feasible (for example, if you can’t use the service due to your corporate security policy).
Third-party SIEM — A third-party SIEM might be ideal if you already have a SIEM solution on AWS or elsewhere, and you don’t need a duplicated solution elsewhere. Typically, SIEM solutions will import logs from an S3 bucket and process and index events for analysis. To learn more about SIEM options, see the SIEM solutions in the AWS Marketplace, or the AWS Security Competency Partners for a partner local to you with threat detection and incident response (TDIR) expertise.

Sample queries

In this section, we provide samples of SQL queries. Both Athena and CloudTrail Lake accept SQL queries, but the following samples have been tested for use in Athena only. This is because some samples are for VPC Flow Logs, which you can’t query from CloudTrail Lake. To query CloudTrail logs in Athena, you must first create a table definition that points to the location of your logs stored in S3. You can do this from the CloudTrail Events console by using a hyperlinked suggestion, or from the Athena console directly. Alternatively, for Athena, you can use the AWS Security Analytics Bootstrap.

For each of these queries, you might need to modify some of the fields, such as the time frame that you are investigating, the IAM entity involved, and the account and Region in scope. For example, you might want to modify the time frame based on the current time and when you believe the security event began. This often involves expanding the time frame after running additional queries and learning more about the scope and timeline.

By using partitions for tables, you can restrict the amount of data scanned by each Athena query, helping to improve performance and reduce cost. For example, you can partition your CloudTrail Athena table manually or by using partition projection. You can include the partition column (for example, the timestamp) in your queries to limit the amount of data scanned.

Unauthorized attempts

When a security event occurs, you might want to review API calls that were attempted but failed due to the IAM principal not having access to perform the action on that resource. To discover this activity, run the following query (be sure to modify the time window first):

SELECT *
FROM cloudtrail
WHERE errorcode IN ('Client.UnauthorizedOperation','Client.InvalidPermission.NotFound','Client.OperationNotPermitted','AccessDenied')
AND useridentity.arn LIKE '%iam%'
AND eventtime >= '2023-01-01T00:00:00Z'
AND eventtime < '2023-03-01T00:00:00Z'
ORDER BY eventtime desc

This sample query can help you identify whether certain IAM principals have a significant amount of unauthorized API calls, which can indicate that an IAM principal is compromised.

Rejected TCP connections

During a security event, the unauthorized user that is interacting with the resources in your account is probably trying to establish persistence through the network layer. To get a list of rejected TCP connections and extract from it the day that these events occurred, run the following query:

SELECT day_of_week(date) AS
day,date,interface_id,srcaddr,action,protocol
FROM vpc_flow_logs
WHERE action = 'REJECT' AND protocol = 6
LIMIT 100;

Connections over older TLS versions

You might want to see how many calls to AWS APIs were made using older versions of the TLS protocol, as part of a forensic follow-up or a discovery job after a risk analysis. You can get this data by querying CloudTrail logs.

SELECT eventSource
COUNT(*) AS numOutdatedTlsCalls FROM cloudtrail WHERE tlsDetails.tlsVersion IN ('TLSv1', 'TLSv1.1') AND eventTime > '2023-01-01 00:00:00' GROUP BY eventSource ORDER BY numOutdatedTlsCalls DESC

Filter connections from an IP

With an IP address that you’d like to investigate, as a part of your forensic analysis, you might want to see the connections made to resources in a VPC. You can obtain this information by querying VPC Flow Logs. As with the server access logs, if you’re using Athena, you will first need to create a new table.

SELECT day_of_week(date) AS 
day, date, srcaddr, dstaddr, action, protocol
FROM vpc_flow_logs
WHERE day >= '2023/01/01' AND day < '2023/03/01' AND srcaddr LIKE '172.50.%'
ORDER BY day DESC
LIMIT 100

Investigate user actions

If you have identified a user who has been compromised, or that you suspect has been compromised, you might want to know the API calls that they made over a specific time period. Understanding the activity of a user can help you understand the scope of impact during an incident, as well as the reach of user permissions when you design your access management strategy.

SELECT eventID, eventName, eventSource, eventTime, userIdentity.arn
AS user
FROM cloudtrail
WHERE userIdentity.arn = '%<username>%' AND eventTime > '2022-12-05 00:00:00' AND eventTime < '2022-12-08 00:00:00'

Conclusion

It is essential that you capture logs from various layers within your application architecture, so that you can effectively respond to a security event at various layers of the application stack. If a security event occurs, logs can help provide a clear picture of what happened and the scope of the affected resources. This post helps you build a logging strategy for security incident response by understanding what logs you want to analyze, where you want to store those logs, and how you will analyze them.

Further resources

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Security, Identity, & Compliance re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Perform accent-insensitive search using OpenSearch

2023-03-30 Aruna Govindaraju

Post Syndicated from Aruna Govindaraju original https://aws.amazon.com/blogs/big-data/perform-accent-insensitive-search-using-opensearch/

We often need our text search to be agnostic of accent marks. Accent-insensitive search, also called diacritics-agnostic search, is where search results are the same for queries that may or may not contain Latin characters such as à, è, Ê, ñ, and ç. Diacritics are English letters with an accent to mark a difference in pronunciation. In recent years, words with diacritics have trickled into the mainstream English language, such as café or protégé. Well, touché! OpenSearch has the answer!

OpenSearch is a scalable, flexible, and extensible open-source software suite for your search workload. OpenSearch can be deployed in three different modes: the self-managed open-source OpenSearch, the managed Amazon OpenSearch Service, and Amazon OpenSearch Serverless. All three deployment modes are powered by Apache Lucene, and offer text analytics using the Lucene analyzers.

In this post, we demonstrate how to perform accent-insensitive search using OpenSearch to handle diacritics.

Solution overview

Lucene Analyzers are Java libraries that are used to analyze text while indexing and searching documents. These analyzers consist of tokenizers and filters. The tokenizers split the incoming text into one or more tokens, and the filters are used to transform the tokens by modifying or removing the unnecessary characters.

OpenSearch supports custom analyzers, which enable you to configure different combinations of tokenizers and filters. It can consist of character filters, tokenizers, and token filters. In order to enable our diacritic-insensitive search, we configure custom analyzers that use the ASCII folding token filter.

ASCIIFolding is a method used to covert alphabetic, numeric, and symbolic Unicode characters that aren’t in the first 127 ASCII characters (the Basic Latin Unicode block) into their ASCII equivalents, if one exists. For example, the filter changes “à” to “a”. This allows search engines to return results agnostic of the accent.

In this post, we configure accent-insensitive search using the ASCIIFolding filter supported in OpenSearch Service. We ingest a set of European movie names with diacritics and verify search results with and without the diacritics.

Create an index with a custom analyzer

We first create the index asciifold_movies with custom analyzer custom_asciifolding:

PUT /asciifold_movies
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_asciifolding": {
          "tokenizer": "standard",
          "filter": [
            "my_ascii_folding"
          ]
        }
      },
      "filter": {
        "my_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "custom_asciifolding",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

Ingest sample data

Next, we ingest sample data with Latin characters into the index asciifold_movies:

POST _bulk
{ "index" : { "_index" : "asciifold_movies", "_id":"1"} }
{  "title" : "Jour de fête"}
{ "index" : { "_index" : "asciifold_movies", "_id":"2"} }
{  "title" : "La gloire de mon père" }
{ "index" : { "_index" : "asciifold_movies", "_id":"3"} }
{  "title" : "Le roi et l’oiseau" }
{ "index" : { "_index" : "asciifold_movies", "_id":"4"} }
{  "title" : "Être et avoir" }
{ "index" : { "_index" : "asciifold_movies", "_id":"5"} }
{  "title" : "Kirikou et la sorcière"}
{ "index" : { "_index" : "asciifold_movies", "_id":"6"} }
{  "title" : "Señora Acero"}
{ "index" : { "_index" : "asciifold_movies", "_id":"7"} }
{  "title" : "Señora garçon"}
{ "index" : { "_index" : "asciifold_movies", "_id":"8"} }
{  "title" : "Jour de fete"}

Query the index

Now we query the asciifold_movies index for words with and without Latin characters.

Our first query uses an accented character:

GET asciifold_movies/_search
{
  "query": {
    "match": {
      "title": "fête"
    }
  }
}

Our second query uses a spelling of the same word without the accent mark:

GET asciifold_movies/_search
{
  "query": {
    "match": {
      "title": "fete"
    }
  }
}

In the preceding queries, the search terms “fête” and “fete” return the same results:

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.7361701,
    "hits": [
      {
        "_index": "asciifold_movies",
        "_id": "8",
        "_score": 0.7361701,
        "_source": {
          "title": "Jour de fete"
        }
      },
      {
        "_index": "asciifold_movies",
        "_id": "1",
        "_score": 0.42547938,
        "_source": {
          "title": "Jour de fête"
        }
      }
    ]
  }
}

Similarly, try comparing results for “señora” and “senora” or “sorcière” and “sorciere.” The accent-insensitive results are due to the ASCIIFolding filter used with the custom analyzers.

Enable aggregations for fields with accents

Now that we have enabled accent-insensitive search, let’s look at how we can make aggregations work with accents.

Try the following query on the index:

GET asciifold_movies/_search
{
  "size": 0,
  "aggs": {
    "test": {
      "terms": {
        "field": "title.keyword"
      }
    }
  }
}

We get the following response:

"aggregations" : {
    "test" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Jour de fete",
          "doc_count" : 1
        },
        {
          "key" : "Jour de fête",
          "doc_count" : 1
        },
        {
          "key" : "Kirikou et la sorcière",
          "doc_count" : 1
        },
        {
          "key" : "La gloire de mon père",
          "doc_count" : 1
        },
        {
          "key" : "Le roi et l’oiseau",
          "doc_count" : 1
        },
        {
          "key" : "Señora Acero",
          "doc_count" : 1
        },
        {
          "key" : "Señora garçon",
          "doc_count" : 1
        },
        {
          "key" : "Être et avoir",
          "doc_count" : 1
        }
      ]
    }
  }

Create accent-insensitive aggregations using a normalizer

In the previous example, the aggregation returns two different buckets, one for “Jour de fête” and one for “Jour de fete.” We can enable aggregations to create one bucket for the field, regardless of the diacritics. This is achieved using the normalizer filter.

The normalizer supports a subset of character and token filters. Using just the defaults, the normalizer filter is a simple way to standardize Unicode text in a language-independent way for search, thereby standardizing different forms of the same character in Unicode and allowing diacritic-agnostic aggregations.

Let’s modify the index mapping to include the normalizer. Delete the previous index, then create a new index with the following mapping and ingest the same dataset:

PUT /asciifold_movies
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_asciifolding": {
          "tokenizer": "standard",
          "filter": [
            "my_ascii_folding"
          ]
        }
      },
      "filter": {
        "my_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      },
      "normalizer": {
        "custom_normalizer": {
          "type": "custom",
          "filter": "asciifolding"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "custom_asciifolding",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256,
            "normalizer": "custom_normalizer"
          }
        }
      }
    }
  }
}

After you ingest the same dataset, try the following query:

GET asciifold_movies/_search
{
  "size": 0,
  "aggs": {
    "test": {
      "terms": {
        "field": "title.keyword"
      }
    }
  }
}

We get the following results:

"aggregations" : {
    "test" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Jour de fete",
          "doc_count" : 2
        },
        {
          "key" : "Etre et avoir",
          "doc_count" : 1
        },
        {
          "key" : "Kirikou et la sorciere",
          "doc_count" : 1
        },
        {
          "key" : "La gloire de mon pere",
          "doc_count" : 1
        },
        {
          "key" : "Le roi et l'oiseau",
          "doc_count" : 1
        },
        {
          "key" : "Senora Acero",
          "doc_count" : 1
        },
        {
          "key" : "Senora garcon",
          "doc_count" : 1
        }
      ]
    }
  }

Now we compare the results, and we can see the aggregations with term “Jour de fête” and “Jour de fete” are rolled up into one bucket with doc_count=2.

Summary

In this post, we showed how to enable accent-insensitive search and aggregations by designing the index mapping to do ASCII folding for search tokens and normalize the keyword field for aggregations. You can use the OpenSearch query DSL to implement a range of search features, providing a flexible foundation for structured and unstructured search applications. The Open Source OpenSearch community has also extended the product to enable support for natural language processing, machine learning algorithms, custom dictionaries, and a wide variety of other plugins.

If you have feedback about this post, submit it in the comments section. If you have questions about this post, start a new thread on the Amazon OpenSearch Service forum or contact AWS Support.

About the Author

Aruna Govindaraju is an Amazon OpenSearch Specialist Solutions Architect and has worked with many commercial and open-source search engines. She is passionate about search, relevancy, and user experience. Her expertise with correlating end-user signals with search engine behavior has helped many customers improve their search experience. Her favorite pastime is hiking the New England trails and mountains.

Extending CloudFormation and CDK with Third-Party Extensions

2023-03-29 Lucas Chen

Post Syndicated from Lucas Chen original https://aws.amazon.com/blogs/devops/extending-cloudformation-and-cdk-with-third-party-extensions/

Did you know you can use CloudFormation to manage third-party resources? The AWS CloudFormation Public Registry provides a searchable collection of CloudFormation extensions and makes it easy to discover and provision them in CloudFormation templates and AWS Cloud Development Kit (CDK) applications. In the past three months, we’ve added a number of new, exciting partners to the Public Registry, including GitLab, Okta, and PagerDuty.

The extensions available on the registry are wide-ranging and include third-party resources from partners such as MongoDB; hooks, which are preventative controls that add safeguards to provisioning; and modules, which are re-usable components that take into account best practices and opinionated definitions of resources. AWS Partner Network (APN), third parties, and the developer community contribute these extensions to the Public Registry. Using extensions, customers no longer need to create and maintain custom provisioning logic for resource types from third-party vendors.

Over last few months, AWS collaborated with partners to develop and publish over 80 new resources across 14 providers to Public Registry for CloudFormation. Below is a summary of the new resource type additions.

Key Benefits

Here are some of the benefits for extension builders and consumers when publishing extensions to the public registry:

Discoverability – Publishing your extensions in the public registry will make them discoverable by 1M+ active CloudFormation and CDK customers.
CDK Support – We’re seeing rapid growth in the adoption of the CDK amongst the developer population. Upon publishing to the registry, L1 CDK Constructs will automatically be created for your third party resources making them compatible with the CDK with no added work required. These constructs will also be listed on Construct Hub and aids discoverability discoverable by customers. Note: Automated L1 CDK construct generation is currently an experimental feature.
Drift detection – Third-party resource types in the public registry also integrate with drift detection. After creating a resource from a third-party resource type, CloudFormation will detect changes to the third-party resource from its template configuration, known as configuration drift, just as it would with AWS resources.
AWS Config – You can also use AWS Config to manage compliance for third-party resources consumed from the registry. The resource types are automatically tracked as Configuration Items when you have configured AWS Config to record them, and used CloudFormation to create, update, and delete them. Whether the resource types you use are third-party or AWS resources, you can view configuration history for them, in addition to being able to write AWS Config rules to verify configuration best practices.
Abstraction of Best Practices with Modules – Browse and use modules from the registry when creating your CloudFormation templates to ensure you’re provisioning resources while adhering to best practices.
AWS Cloud Control API – The AWS Cloud Control API allows AWS partners and customers to interface with your resource type through API calls using Create, Read, Update, Delete, and List (CRUD-L) operations. Resources in the registry will be automatically integrated with our AWS Cloud Control API and expands your third party resource compatibility to even more AWS services and IaC tools.

We’ve seen great momentum from our partners and developer community over the past year. We are looking forward to continued investment and innovation in the Public Registry.

How to Get Started

For Resource Type Users: Explore and Activate Third Party Resource Types

Third party resource types must first be activated before they can be used. You do this by logging into your AWS Console > Navigate to CloudFormation > Registry > Public extensions > Set the Publisher to Third Party. This will show you a list of available third-party resources in your region (note that different regions may have a different set of third-party resource types). Select the radio box next to the resource types you want to activate and click the activate button at the top of the list.

Figure 1:

Don’t see the extension you need in the registry?

You can submit requests for new third-party extensions through our Community Registry Extensions Github repo issue tracker! Click the New Issue button and describe the third-party extension along with information about your use case.

For Developers and Publishers: Join the CloudFormation Developer Community and Start Building

You can see several of the community-built registry extensions in the AWS CloudFormation Community Registry Extensions repository and even contribute yourself. You can also read about the experiences and lessons learned from publishing to the Registry through this blog written by Cloudsoft.

For developers looking to create new resource types to add to the public Registry, follow this creating resource types walkthrough help you get started. If you need assistance creating, publishing resources, or just want to join the discussion, you can join the conversation today in our CloudFormation Discord Channel. We’d love to hear about your experiences and use cases in developing innovations with registry extensions.

About the authors:

Visualize Confluent data in Amazon QuickSight using Amazon Athena

2023-03-27 Ahmed Zamzam

Post Syndicated from Ahmed Zamzam original https://aws.amazon.com/blogs/big-data/visualize-confluent-data-in-amazon-quicksight-using-amazon-athena/

This is a guest post written by Ahmed Saef Zamzam and Geetha Anne from Confluent.

Businesses are using real-time data streams to gain insights into their company’s performance and make informed, data-driven decisions faster. As real-time data has become essential for businesses, a growing number of companies are adapting their data strategy to focus on data in motion. Event streaming is the central nervous system of a data in motion strategy and, in many organizations, Apache Kafka is the tool that powers it.

Today, Kafka is well known and widely used for streaming data. However, managing and operating Kafka at scale can still be challenging. Confluent offers a solution through its fully managed, cloud-native service that simplifies running and operating data streams at scale. Confluent extends open-source Kafka through a suite of related services and features designed to enhance the data in motion experience for operators, developers, and architects in production.

In this post, we demonstrate how Amazon Athena, Amazon QuickSight, and Confluent work together to enable visualization of data streams in near-real time. We use the Kafka connector in Athena to do the following:

Join data inside Confluent with data stored in one of the many data sources supported by Athena, such as Amazon Simple Storage Service (Amazon S3)
Visualize Confluent data using QuickSight

Challenges

Purpose-built stream processing engines, like Confluent ksqlDB, often provide SQL-like semantics for real-time transformations, joins, aggregations, and filters on streaming data. With ksqlDB, you can create persistent queries, which continuously process streams of events according to specific logic, and materialize streaming data in views that can be queried at a point in time (pull queries) or subscribed to by clients (push queries).

ksqlDB is one solution that made stream processing accessible to a wider range of users. However, pull queries, like those supported by ksqlDB, may not be suitable for all stream processing use cases, and there may be complexities or unique requirements that pull queries are not designed for.

Data visualization for Confluent data

A frequent use case for enterprises is data visualization. To visualize data stored in Confluent, you can use one of over 120 pre-built connectors, provided by Confluent, to write streaming data to a destination data store of your choice. Next, you connect your business intelligence (BI) tool to the data store to begin visualizing the data.

The following diagram depicts a typical architecture utilized by many Confluent customers. In this workflow, data is written to Amazon S3 through the Confluent S3 sink connector and then analyzed with Athena, a serverless interactive analytics service that enables you to analyze and query data stored in Amazon S3 and various other data sources using standard SQL. You can then use Athena as an input data source to QuickSight, a highly scalable cloud native BI service, for further analysis.

typical architecture utilized by many Confluent customers.

Although this approach works well for many use cases, it requires data to be moved, and therefore duplicated, before it can be visualized. This duplication not only adds time and effort for data engineers who may need to develop and test new scripts, but also creates data redundancy, making it more challenging to manage and secure the data, and increases storage cost.

Enriching data with reference data in another data store

With ksqlDB queries, the source and destination are always Kafka topics. Therefore, if you have a data stream that you need to enrich with external reference data, you have two options. One option is to import the reference data into Confluent, model it as a table, and use ksqlDB’s stream-table join to enrich the stream. The other option is to ingest the data stream into a separate data store and perform join operations there. Both require data movement and result in duplicate data storage.

Solution overview

So far, we have discussed two challenges that are not addressed by conventional stream processing tools. Is there a solution that addresses both challenges simultaneously?

When you want to analyze data without separate pipelines and jobs, a popular choice is Athena. With Athena, you can run SQL queries on a wide range of data sources—in addition to Amazon S3—without learning a new language, developing scripts to extract (and duplicate) data, or managing infrastructure.

Recently, Athena announced a connector for Kafka. Like Athena’s other connectors, queries on Kafka are processed within Kafka and return results to Athena. The connector supports predicate pushdown, which means that adding filters to your queries can reduce the amount of data scanned, improve query performance, and reduce cost.

For example, when using this connector, the amount of data scanned by the query SELECT * FROM CONFLUENT_TABLE could be significantly higher than the amount of data scanned by the query SELECT * FROM CONFLUENT_TABLE WHERE COUNTRY = 'UK'. The reason is that the AWS Lambda function which provides the runtime environment for the Athena connector, filters data at the source before returning it to Athena.

Let’s assume we have a stream of online transactions flowing into Confluent and customer reference data stored in Amazon S3. We want to use Athena to join both data sources together and produce a new dataset for QuickSight. Instead of using the S3 sink connector to load data into Amazon S3, we use Athena to query Confluent and join it with S3 data—all without moving data. The following diagram illustrates this architecture.

Athena to join both data sources together and produce a new dataset for QuickSight

We perform the following steps:

Register the schema of your Confluent data.
Configure the Athena connector for Kafka.
Optionally, interactively analyze Confluent data.
Create a QuickSight dataset using Athena as the source.

Register the schema

To connect Athena to Confluent, the connector needs the schema of the topic to be registered in the AWS Glue Schema Registry, which Athena uses for query planning.

The following is a sample record in Confluent:

{
  "transaction_id": "23e5ed25-5818-4d4f-acb3-73ef04d51d21",
  "customer_id": "126-58-9758",
  "amount": 986,
  "timestamp": "2023-01-03T15:40:42",
  "product_category": "health_fitness"
}

The following is the schema of this record:

{
  "topicName": "transactions",
  "message": {
    "dataFormat": "json",
    "fields": [
      {
        "name": "transaction_id",
        "mapping": "transaction_id",
        "type": "VARCHAR"
      },
      {
        "name": "customer_id",
        "mapping": "customer_id",
        "type": "VARCHAR"
      },
      {
        "name": "amount",
        "mapping": "amount",
        "type": "INTEGER"
      },
      {
        "name": "timestamp",
        "mapping": "timestamp",
        "type": "timestamp",
        "formatHint": "yyyy-MM-dd\'T\'HH:mm:ss"
      },
      {
        "name": "product_category",
        "mapping": "product_category",
        "type": "VARCHAR"
      },
      {
        "name": "customer_id",
        "mapping": "customer_id",
        "type": "VARCHAR"
      }
    ]
  }
}

The data producer writing the data can register this schema with the AWS Glue Schema Registry. Alternatively, you can use the AWS Management Console or AWS Command Line Interface (AWS CLI) to create a schema manually.

We create the schema manually by running the following CLI command. Replace <registry_name> with your registry name and make sure that the text in the description field includes the required string {AthenaFederationKafka}:

aws glue create-registry –registry-name <registry_name> --description {AthenaFederationKafka}

Next, we run the following command to create a schema inside the newly created schema registry:

aws glue create-schema –registry-id RegistryName=<registry_name> --schema-name <schema_name> --compatibility <Compatibility_Mode> --data-format JSON –schema-definition <Schema>

Before running the command, be sure to provide the following details:

Replace <registry_name> with our AWS Glue Schema Registry name
Replace <schema_name> with the name of our Confluent Cloud topic, for example, transactions
Replace <Compatibility_Mode> with one of the supported compatibility modes, for example, ‘Backward’
Replace <Schema> with our schema

Configure and deploy the Athena Connector

With our schema created, we’re ready to deploy the Athena connector. Complete the following steps:

On the Athena console, choose Data sources in the navigation pane.
Choose Create data source.
Search for and select Apache Kafka.
For Data source name, enter the name for the data source.

This data source name will be referenced in your queries. For example:

SELECT * 
FROM <data_source_name>.<registry_name>.<schema_name>
WHERE COL1='SOMETHING'

Applying this to our use case and previously defined schema, our query would be as follows:

SELECT * 
FROM "Confluent"."transactions_db"."transactions"
WHERE product_category='Kids'

In the Connection details section, choose Create Lambda function.

You’re redirected to the Applications page on the Lambda console. Some of the application settings are already filled.

The following are the important settings required for integrating with Confluent Cloud. For more information on these settings, refer to Parameters.

For LambdaFunctionName, enter the name for the Lambda function the connector will use. For example, athena_confluent_connector.

We use this parameter in the next step.

For KafkaEndpoint, enter the Confluent Cloud bootstrap URL.

You can find this on the Cluster settings page in the Confluent Cloud UI.

enter the Confluent Cloud bootstrap URL

Confluent Cloud supports two authentication mechanisms: OAuth and SASL/PLAIN (API keys). The connector doesn’t support OAuth; this leaves us with SASL/PLAIN. SASL/PLAIN uses SSL as a security protocol and PLAIN as SASL mechanism.

For AuthType, enter SASL_SSL_PLAIN.

The API key and secret used by the connector to access Confluent need to be stored in AWS Secrets Manager.

Get your Confluent API key or create a new one.

Run the following AWS CLI command to create the secret in Secrets Manager:

aws secretsmanager create-secret \
    --name <SecretNamePrefix>\
    --secret-string "{\"username\":\"<Confluent_API_KEY>\",\"password\":\"<Confluent_Secret>\"}"

The secret string should have two key-value pairs, one named username and the other password.

For SecretNamePrefix, enter the secret name prefix created in the previous step.
If the Confluent cloud cluster is reachable over the internet, leave SecurityGroupIds and SubnetIds blank. Otherwise, your Lambda function needs to run in a VPC that has connectivity to your Confluent Cloud network. Therefore, enter a security group ID and three private subnet IDs in this VPC.
For SpillBucket, enter the name of an S3 bucket where the connector can spill data.

Athena connectors temporarily store (spill) data to Amazon S3 for further processing by Athena.

Select I acknowledge that this app creates custom IAM roles and resource policies.
Choose Deploy.
Return to the Connection details section on the Athena console and for Lambda, enter the name of the Lambda function you created.
Choose Next.
Choose Create data source.

Perform interactive analysis on Confluent data

With the Athena connector set up, our streaming data is now queryable from the same service we use to analyze S3 data lakes. Next, we use Athena to conduct point-in-time analysis of transactions flowing through Confluent Cloud.

Aggregation

We can use standard SQL functions to aggregate the data. For example, we can get the revenue by product category:

SELECT product_category, SUM(amount) AS Revenue
FROM "Confluent"."athena_blog"."transactions"
GROUP BY product_category
ORDER BY Revenue desc

SQL function to aggregate data

Enrich transaction data with customer data

The aggregation example is also available with ksqlDB pull queries. However, Athena’s connector allows us to join the data with other data sources like Amazon S3.

In our use case, the transactions streamed to Confluent Cloud lack detailed information about customers, apart from a customer_id. However, we have a reference dataset in Amazon S3 that has more information about the customers. With Athena, we can join both datasets together to gain insights about our customers. See the following code:

SELECT * 
FROM "Confluent"."athena_blog"."transactions" a
INNER JOIN "AwsDataCatalog"."athenablog"."customer" b 
ON a.customer_id=b.customer_id

join data

You can see from the results that we were able to enrich the streaming data with customer details, stored in Amazon S3, including name and address.

Visualize data using QuickSight

Another powerful feature this connector brings is the ability to visualize data stored in Confluent using any BI tool that supports Athena as a data source. In this post, we use QuickSight. QuickSight is a machine learning (ML)-powered BI service built for the cloud. You can use it to deliver easy-to-understand insights to the people you work with, wherever they are.

For more information about signing up for QuickSight, see Signing up for an Amazon QuickSight subscription.

Complete the following steps to visualize your streaming data with QuickSight:

On the QuickSight console, choose Datasets in the navigation pane.
Choose New dataset.
Choose Athena as the data source.
For Data source name, enter a name.
Choose Create data source.
In the Choose your table section, choose Use custom SQL.
Enter the join query like the one given previously, then choose Confirm query.
Next, choose to import the data into SPICE (Super-fast, Parallel, In-memory Calculation Engine), a fully managed in-memory cache that boosts performance, or directly query the data.

Utilizing SPICE will enhance performance, but the data may need to be periodically updated. You can choose to incrementally refresh your dataset or schedule regular refreshes with SPICE. If you want near-real-time data reflected in your dashboards, select Directly query your data. Note that with the direct query option, user actions in QuickSight, such as applying a drill-down filter, may invoke a new Athena query.

Choose Visualize.

That’s it, we have successfully connected QuickSight to Confluent through Athena. With just a few clicks, you can create a few visuals displaying data from Confluent.

successfully connected QuickSight to Confluent through Athena.

Clean up

To avoid incurring ongoing charges, delete the resources you provisioned by completing the following steps:

Conclusion

In this post, we discussed use cases for Athena and Confluent. We provided examples of how you can use both for near-real-time data visualization with QuickSight and interactive analysis involving joins between streaming data in Confluent and data stored in Amazon S3.

The Athena connector for Kafka simplifies the process of querying and analyzing streaming data from Confluent Cloud. It removes the need to first move streaming data to persistent storage before it can be used in downstream use cases like business intelligence. This complements the existing integration between Confluent and Athena, using the S3 sink connector, which enables loading streaming data into a data lake, and is an additional option for customers who want to enable interactive analysis on Confluent data.

About the authors

Ahmed Zamzam is a Senior Partner Solutions Architect at Confluent, with a focus on the AWS partnership. In his role, he works with customers in the EMEA region across various industries to assist them in building applications that leverage their data using Confluent and AWS. Prior to Confluent, Ahmed was a Specialist Solutions Architect for Analytics AWS specialized in data streaming and search. In his free time, Ahmed enjoys traveling, playing tennis, and cycling.

Geetha Anne is a Partner Solutions Engineer at Confluent with previous experience in implementing solutions for data-driven business problems on the cloud, involving data warehousing and real-time streaming analytics. She fell in love with distributed computing during her undergraduate days and has followed her interest ever since. Geetha provides technical guidance, design advice, and thought leadership to key Confluent customers and partners. She also enjoys teaching complex technical concepts to both tech-savvy and general audiences.

Manage your data warehouse cost allocations with Amazon Redshift Serverless tagging

2023-03-27 Sandeep Bajwa

Post Syndicated from Sandeep Bajwa original https://aws.amazon.com/blogs/big-data/manage-your-data-warehouse-cost-allocations-with-amazon-redshift-serverless-tagging/

Amazon Redshift Serverless makes it simple to run and scale analytics without having to manage your data warehouse infrastructure. Developers, data scientists, and analysts can work across databases, data warehouses, and data lakes to build reporting and dashboarding applications, perform real-time analytics, share and collaborate on data, and even build and train machine learning (ML) models with Redshift Serverless.

Tags allows you to assign metadata to your AWS resources. You can define your own key and value for your resource tag, so that you can easily manage and filter your resources. Tags can also improve transparency and map costs to specific teams, products, or applications. This way, you can raise cost awareness and also make teams and users accountable for their own cost and usage.

You can now use tagging in Redshift Serverless to categorize the following resources based on your grouping needs:

Namespace – A collection of database objects and users
Workgroup – A collection of compute resources
Snapshot – Point-in-time backups of a cluster
Recovery point – Recovery points in Redshift Serverless are created every 30 minutes and saved for 24 hours

When using Redshift Serverless, you may have to manage data across many business departments, environments, and billing groups. In doing so, you’re usually faced with one of the following tasks:

Cost allocation and financial management – You want to know what you’re spending on AWS for a given project, line of business, or environment
Operations support and incident management – You want to send issues to the right teams and users
Access control – You want to constrain user access to certain resources
Security risk management – You want to group resources based on their level of security or data sensitivity and make sure proper controls are in place

In this post, we focus on tagging Redshift Serverless resources for cost allocation and reporting purposes. Knowing where you have incurred costs at the resource, workload, team, and organization level enhances your ability to budget and manage cost.

Solution overview

Let’s say that your company has two departments: marketing and finance. Each department has multiple cost centers and environments, as illustrated in the following figure. In AWS Cost Explorer, you want to create cost reports for Redshift Serverless by department, environment, and cost center.

We start with creating and applying user-defined tags to Amazon Serverless workgroups for respective departments, environments, and cost centers. You can use both the AWS Command Line Interface (AWS CLI) and Redshift Serverless console to tag serverless resources.

The high-level steps are as follows:

Create tags.
View and edit tags.
Set up cost allocation tags.
Create cost reports.

Create tags

To create tags, complete the following steps:

On the Amazon Redshift console, choose Manage tags in the navigation pane.
For Filter by resource type, you can filter by Workgroup, Namespace, Snapshot, and Recovery Point.
Optionally, you can search for resources by an existing tag by entering values for Tag key or Tag value. For this post, we don’t include any tag filters, so we can view all the resources across our account.
Select your resource from the search results and choose Manage tags to customize the tag key and value parameters.

Here, you can add new tags, remove tags, save changes, and cancel your changes if needed.

Because we want to allocate cost across the various departments, we add a new key called department and a new value called marketing.

Choose Save changes.
Confirm the changes by choosing Apply changes.

For more details on tagging, refer to Tagging resources overview.

View and edit tags

If you already have resources such as workgroups (listed on the Workgroup configuration page) or snapshots (listed on the Data backup page), you can create new tags or edit existing tags on the given resource. In the following example, we manage tags on an existing workgroup.

On the Amazon Redshift console, choose Workgroup configuration in the navigation pane.
Select your workgroup and on the Actions menu, choose Manage tags.

Now we can remove existing tags or add new tags. For our use case, let’s assume that the marketing department is no longer using the default workgroup, so we want to remove the current tag.

Choose Remove next to the marketing tag.

We are given the option to choose Undo if needed.

Choose Save changes and then Apply the changes to confirm.

After we apply the tags, we can view the full list of resources. The number of tags applied to each resource is found in the Tags column.

Set up cost allocation tags

After you create and apply the user-defined tags to your Redshift Serverless workgroups, it can take up to 24 hours for the tags to appear on your cost allocation tags page for activation. You can activate tags by using the AWS Billing console for cost allocation tracking with the following steps:

On the AWS Billing console, choose Cost allocation tags in the navigation pane.
Under User-defined cost allocation tags¸ select the tags you created and applied (for this example, cost-center).
Choose Activate.

After we activate all the tags we created, we can view the full list by choosing Active on the drop-down menu.

Create cost reports

After you activate the cost allocation tags, they appear on your cost allocation reports in Cost Explorer.

Cost Explorer helps you manage your AWS costs by giving you detailed insights into the line items in your bill. In Cost Explorer, you can visualize daily, monthly, and forecasted spend by combining an array of available filters. Filters allow you to narrow down costs according to AWS service type, linked accounts, and tags.

The following screenshot shows the preconfigured reports in Cost Explorer.

To create custom reports for your cost and usage data, complete the following steps:

On the AWS Cost Management console, choose Reports in the navigation pane.
Choose Create new report.
Select the report type you want to create (for this example, we select Cost and usage).
Choose Create Report.
To view weekly Redshift Serverless cost by cost center, choose the applicable settings in the Report parameters pane. For this post, we group data by the cost-center tag and filter data by the department tag.
Save the report for later use by choosing Save to report library.
Enter a name for your report, then choose Save report.

The following screenshot shows a sample report for daily Redshift Serverless cost by department.

sample report for daily Redshift Serverless cost by department.

The following screenshot shows an example report of weekly Redshift Serverless cost by environment.

example report of weekly Redshift Serverless cost by environment.

Conclusion

Tagging resources in Amazon Redshift helps you maintain a central place to organize and view resources across the service for billing management. This feature saves you hours of manual work you would spend in grouping your Amazon Redshift resources via a spreadsheet or other manual alternatives.

For more tagging best practices, refer to Tagging AWS resources.

About the Authors

Sandeep Bajwa is a Sr. Analytics Specialist based out of Northern Virginia, specialized in the design and implementation of analytics and data lake solutions.

Michael Yitayew is a Product Manager for Amazon Redshift based out of New York. He works with customers and engineering teams to build new features that enable data engineers and data analysts to more easily load data, manage data warehouse resources, and query their data. He has supported AWS customers for over 3 years in both product marketing and product management roles.

Use backups to recover from security incidents

2023-03-22 Jason Hurst

Post Syndicated from Jason Hurst original https://aws.amazon.com/blogs/security/use-backups-to-recover-from-security-incidents/

Greetings from the AWS Customer Incident Response Team (CIRT)! AWS CIRT is dedicated to supporting customers during active security events on the customer side of the AWS Shared Responsibility Model.

Over the past three years, AWS CIRT has supported customers with security events in their AWS accounts. These include the unauthorized use of AWS Identity and Access Management (IAM) credentials, ransomware, and data deletion in an AWS account.

In this post, I will walk you through key AWS services and features that provide backup and recovery solutions to restore your data based upon the lessons our team has learned when supporting customers experiencing security events.

Shared Responsibility Model

Security is a shared responsibility between AWS and the customer. Customers are responsible for protecting their data IN the cloud. For Amazon Elastic Compute Cloud (Amazon EC2), this includes the guest operating system, installed applications, and data stored within the instance and associated Amazon Elastic Block Store (Amazon EBS) volumes. For Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB, AWS operates the infrastructure layer, the operating system, and service resources, and customers access the endpoints to store and retrieve data.

Backup and recovery configuration are a part of the customer’s side of the shared responsibility model. AWS doesn’t have the ability to recover a deleted resource. It doesn’t matter how quickly the event is reported to AWS. The inability to recover resources includes actions by the AWS account root user or an IAM principal in the account.

Customers are also responsible for managing their data (including encryption options), classifying their assets, and using IAM tools to apply the appropriate permissions. AWS strives to make it simple for customers to back up and restore their data. We recommend that you compare the risk and costs associated with losing data to the available solutions to make the best decision for your data and business use cases.

Why do you need backups?

The National Institute of Technology (NIST) Computer Security Incident Handling Guide SP 800-61 Rev. 2 defines a computer security incident as “a violation or imminent threat of violation of computer security policies, acceptable use policies, or standard security practices.” AWS recently updated the AWS Security Incident Response Guide as a resource to help customers throughout the incident response life cycle.

Backup and restore processes help you restore data to a point in time before unauthorized actions. Unauthorized actions can be accidental or part of a security event. Implementing backup and restore processes can help you reduce costs by limiting the number of resources that need backups, associated storage, and overall timelines associated with acceptable Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs). For additional guidance on backup solutions and programs, see Top 10 security best practices for securing backups in AWS

How does AWS help?

AWS provides several solutions for backups to integrate with your operational and security incident recovery procedures which I describe in more detail in this section. For additional information, see AWS Backup & Restore.

Amazon EC2

Amazon EC2 provides scalable computing capacity in the AWS Cloud. Using Amazon EC2 can help eliminate your need to invest in hardware up front, helping you to develop and deploy applications faster.

EBS volumes are the primary persistent storage option for Amazon EC2. Use this block storage for structured data, such as databases, or unstructured data, such as files in a file system on a volume. An EBS snapshot takes a copy of the EBS volume and places it in Amazon S3, where it is stored redundantly in multiple Availability Zones.
Restore an entire EC2 instance including its associated volumes by restoring an Amazon Machine Image (AMI) backup of your instance. Create AMIs for known good configurations, and integrate them with auto scaling groups to support the scaling and resiliency of your services. For more information on snapshots and AMIs, see Backup and recovery for Amazon EC2 with EBS volumes.
Create a golden image by preloading needed software and configuration on an EC2 instance, and then creating an image of that. Then, use the resulting image to launch new instances, with updates needed only for the period after image creation.
Amazon FSx for Windows File Server provides fully-managed Microsoft Windows file servers, backed by a fully native Windows file system. To help ensure file system consistency, Amazon FSx uses the Volume Shadow Copy Service (VSS) in Microsoft Windows. Each FSx for Windows File Server backup contains the information that is needed to create a new file system from the backup, effectively restoring a point-in-time snapshot of the file system. For more information, see Amazon FSx: Working with backups.
Amazon EC2 Recycle Bin is a data recovery feature that enables you to restore Amazon EBS snapshots and EBS-backed AMIs that were accidentally deleted. If your resources are deleted, they are retained in the Recycle Bin for a period that you specify, before they are permanently deleted.

Transactional databases

In cloud computing, the ideal scenario is to keep persistent transactional states in databases so that those resources are the only things that actively require backups. When used in conjunction with AWS compute services, this minimizes the volume of data that you need to back up. Everything else is restored from a golden image or equivalent through auto scaling or a continuous integration and continuous delivery (CI/CD) pipeline. To estimate costs associated with service usage and the use of backup storage, use the AWS Pricing Calculator. Work backwards from your critical data that requires backups to help limit costs associated with your overall backup solution.

Amazon Aurora backups are continuous and incremental so that you can quickly restore to any point within the backup retention period. You can specify a backup retention period of 1 to 35 days when you create or modify a database cluster. Aurora backups are stored in Amazon S3.
Amazon DynamoDB allows you to back up your table data continuously by using point-in-time recovery (PITR). When you use PITR, DynamoDB backs up your table data automatically with per-second granularity to restore to any second in the preceding 35 days. For more information, see DynamoDB PITR.
Amazon Neptune is a fast, reliable, fully managed graph database service. The core of Neptune is a purpose-built, high-performance graph database engine. Neptune backups are continuous and incremental so that you can quickly restore to any point within the backup retention period. You can specify a backup retention period, from 1 to 35 days, when you create or modify a DB cluster.
Amazon Relational Database Service (Amazon RDS) creates and saves automated backups of your DB instance during the backup window of your DB instance. Amazon RDS creates a storage volume snapshot of your DB instance, backing up the entire DB instance and not just individual databases. Amazon RDS saves the automated backups of your DB instance according to the backup retention period that you specify between 0 and 35 days. If necessary, you can recover your database to any point in time during the backup retention period.

Amazon Elastic File System

Amazon Elastic File System (Amazon EFS) provides serverless, fully elastic file storage to help you share file data without provisioning or managing storage capacity and performance. The service manages the file storage infrastructure for you to avoid the complexity of deploying, patching, and maintaining complex file system configurations.

The EFS-to-EFS Backup solution is suitable for Amazon EFS file systems in each AWS Region. It includes an AWS CloudFormation template that launches, configures, and runs the AWS services required to deploy the solution. This solution follows AWS best practices for security and availability.

Amazon S3

Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance designed for 99.999999999% (11 9’s) of durability. When using Amazon S3, you should configure the security of the S3 buckets and objects that are part of your backup solution. For more information on security best practices for Amazon S3, see Top 10 security best practices for securing data in Amazon S3 and The anatomy of ransomware event targeting data residing in Amazon S3.

AWS Backup: A comprehensive solution

If you need a backup strategy for multiple services or to manage backups from a single solution, consider using AWS Backup. AWS Backup is a fully-managed service that makes it simple to centralize and automate data protection across AWS services in the cloud, and on premises. For a list of supported services and resource feature availability, see the AWS Backup Developer Guide.

AWS Backup provides for centralized, policy-based data protection. Your backup data is encrypted using encryption keys managed by AWS Key Management Service (KMS), reducing your need to build and maintain a key management infrastructure. With AWS Backup, you can do the following:

Set backup retention policies that automatically retain and expire backups, minimizing backup storage costs.
Copy backups across different AWS Regions and accounts from a central console to help you meet your compliance and disaster recovery needs.
Create data protection policies and use AWS Organizations to enforce the protection policies throughout the accounts in that organization.
Set resource-based access policies on backup vaults. Use resource-based access policies to control access to backups in a backup vault across users, rather than having to define permissions for each user.

AWS Backup can help you align with your data protection needs with real-time analytics and insights, as follows:

You can audit and report on the compliance of your data protection policies to help meet your business and regulatory needs with AWS Backup Audit Manager.
AWS Backup supports legal hold, which is used when an organization must retain certain data either for preservation, auditing, or as evidence in legal proceedings and e-Discovery.
You can choose your controls. For information on the available controls, their customizable parameters, and their AWS Config recording resource types, see Choosing your controls. Every control requires the recording resource type AWS Config: resource compliance because this type records your compliance status with either the AWS Backup Framework or a custom framework that you define.

How much will this cost?

To estimate costs for individual services and features, use the AWS Pricing Calculator. For additional cost information, see the feature page for each service at AWS Cloud Products.

Conclusion

In this blog post, you learned about several AWS services and features to help you back up and restore your data. By analyzing and configuring backup and restore capabilities, you can enable resilience from an accidental deletion or security event.

How AWS Payments migrated from Redash to Amazon Redshift Query Editor v2

2023-03-21 Erol Murtezaoglu

Post Syndicated from Erol Murtezaoglu original https://aws.amazon.com/blogs/big-data/how-aws-payments-migrated-from-redash-to-amazon-redshift-query-editor-v2/

AWS Payments is part of the AWS Commerce Platform (CP) organization that owns the customer experience of paying AWS invoices. It helps AWS customers manage their payment methods and payment preferences, and helps customers make self-service payments to AWS.

The Machine Learning, Data and Analytics (MLDA) team at AWS Payments enables data-driven decision-making across payments processes and partners by delivering data, business insights, and causal and ML inferences through a scalable set of data, insights, and ML inference services.

In this post, we discuss how to democratize data access to Amazon Redshift using the Amazon Redshift Query Editor V2 .

Background

At AWS Payments, we had been using Redash to allow our users to author and run SQL queries against our Amazon Redshift data warehouse. Redash is a web-based SQL client application that can be used to author and run queries, visualize query results with charts, and collaborate with teams.

Over time, we began to notice incompatibilities between Redash’s operations and the needs of our workload.

We had the following requirements in mind when looking for an alternative tool:

Authentication and authorization
- Provide data access without creating a database user and password
- Allow list users using permission groups (POSIX/LDAP) for accessing the tool
- Limit user access to database objects
User experience
- Run SQL queries on the selected database
- Save a query and rerun it later
- Write a dynamic SQL query and run the query based on input parameters
- Export a query result to CSV
- Search saved queries
- Share a query with other users as a URL

After an evaluation of alternate services, we chose the Amazon Redshift Query Editor V2.

Amazon Redshift Query Editor V2

The Amazon Redshift Query Editor V2 has the following benefits:

It makes data across analytics and data scientists more accessible with a unified web-based analyst workbench for data analysts to explore, share, and collaborate on data through a SQL interface
It provides a managed service that allows you to focus on exploring your data without managing your infrastructure
Users can log in to the Query Editor using single sign-on (SSO)
Users can connect to Amazon Redshift using federated access without providing a user name and password
It enables you to collaborate with team members by providing the ability to share saved queries securely
You can benefit from new features as soon as they get released by the Amazon Redshift Query Editor team
You can keep track of changes made to saved queries using the Query History feature
You can write parameterized SQL queries, which allows you to reuse a query with different values
You can turn on the Chart feature to display a graphic visualization of the current page of results
You can use notebooks to organize, annotate, and share multiple SQL queries in a single document
You can run multiple queries in parallel by running each query in a separate tab

However, it presented the following challenges:

To restrict user access to other AWS services within our AWS account, we attached the AWS Identity and Access Management (IAM) policies (see the appendix at the end of this post) to the SAML IAM role. The policies promote the following:
- The user can only access the Query Editor V2 service.
- The federated user gets assigned to a database group with limited access.
The Query Editor V2 currently doesn’t support cross-account Amazon Redshift connections. However, we set up Amazon Redshift data sharing to access the Amazon Redshift cluster from other AWS accounts. For more details, refer to Sharing data across clusters in Amazon Redshift.

Architecture overview

The following diagram illustrates our architecture.
The diagram illustrates our architecture
In the following sections, we will walk you through the steps to set up the query editor and migrate Redash queries.

Prerequisites

To implement this solution, you must set up federated access to the Amazon Redshift Query Editor V2 using your identity provider (IdP) services.

You can find more information in the following posts:

Set up Amazon Redshift Query Editor V2

To set up the query editor, complete the following steps:

Create an Amazon Redshift database group with read-only access.
Create an IAM role for accessing Query Editor V2 in an AWS account and attach the required IAM policies based on your use case. For more information, refer to Configuring your AWS account.
Create a trust relationship between your IdP and AWS.
Add the principal tag sqlworkbench-team to the IAM role to share queries. For more information, refer to Sharing a query.

Migrate Redash queries to Amazon Redshift Query Editor V2

In this section, we walk you through different ways to migrate your Redash queries to the Amazon Redshift Query Editor V2.

Query without parameters

Querying without parameters is pretty straightforward, just copy your query from Redash and enter it in the query editor.

In Redash, navigate to the saved query and choose Edit Source.
Copy the source query.
In Amazon RedShift Query Editor V2, enter the query into the editor, choose the Save icon, and give your query a title.

Query with parameters

In Redash, a string between {{ }} will be treated as a parameter, but Amazon RedShift Query Editor V2 uses ${ } to identify a parameter. To migrate queries with parameters, follow the same steps but replace {{ with ${ and }} with }.

The following screenshot shows an example query in Redash.

screenshot shows an example query in Redash The following screenshot shows the same query in Amazon RedShift Query Editor V2.

screenshot shows the same query in Query Editor V2

Multi-part query to a Query Editor V2 notebook

For a multi-part query, copy the query of each section of a Redash dashboard and add it to a notebook. The notebook in Amazon Redshift Query Editor V2 runs queries successively. You also can add a description for your query.

The following screenshot shows an example query on the Redash dashboard.

The following screenshot shows the query in an Amazon Redshift Query Editor V2 notebook.

Summary

In this post, we demonstrated how we set up Amazon Redshift Query Editor V2 with SSO and Amazon Redshift federated access, and migrated our customers from Redash to Amazon Redshift Query Editor V2. This solution reduced our operational cost of maintaining a third-party application and its infrastructure.

If you have similar use cases and need to provide a web-based tool to your customers to explore data on your Amazon Redshift cluster, consider using Amazon Redshift Query Editor V2.

Appendix: Customer IAM policies

In this section, we provide the code for the IAM policies we attached to the SAML IAM role to restrict user access to other AWS services within our AWS account:

query-editor-credentials-policy – In the following code, provide your Region, account, and cluster parameters to grant access to Amazon Redshift to get cluster credentials, create users, and allow users to join groups:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "redshift:GetClusterCredentials",
            "Resource": [
                "arn:aws:redshift:<region>:<account>:cluster:<cluster>",
                "arn:aws:redshift:<region>:<account>:dbname:<cluster>/payments_beta",
                "arn:aws:redshift:<region>:<account>:dbuser:<cluster>/${redshift:DbUser}"
            ],
            "Effect": "Allow"
        },
        {
            "Action": "redshift:JoinGroup",
            "Resource": "arn:aws:redshift:<region>:<account>:dbgroup:<cluster>/payments_ro_users",
            "Effect": "Allow"
        },
        {
            "Action": "redshift:DescribeClusters",
            "Resource": "arn:aws:redshift:<region>:<account>:cluster:<cluster>",
            "Effect": "Allow"
        },
        {
            "Action": "redshift:CreateClusterUser",
            "Resource": "arn:aws:redshift:<region>:<account>:dbuser:<cluster>/${redshift:DbUser}",
            "Effect": "Allow"
        }
    ]
}

query-editor-access-policy – See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "redshift:DescribeClusters",
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "RedshiftPermissions"
        },
        {
            "Condition": {
                "StringEquals": {
                    "secretsmanager:ResourceTag/sqlworkbench-resource-owner": "${aws:userid}"
                }
            },
            "Action": [
                "secretsmanager:CreateSecret",
                "secretsmanager:GetSecretValue",
                "secretsmanager:DeleteSecret",
                "secretsmanager:TagResource"
            ],
            "Resource": "arn:aws:secretsmanager:::sqlworkbench!",
            "Effect": "Allow",
            "Sid": "SecretsManagerPermissions"
        },
        {
            "Condition": {
                "StringEquals": {
                    "aws:CalledViaLast": "sqlworkbench.amazonaws.com"
                }
            },
            "Action": "tag:GetResources",
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "ResourceGroupsTaggingPermissions"
        },
        {
            "Action": [
                "sqlworkbench:CreateFolder",
                "sqlworkbench:PutTab",
                "sqlworkbench:BatchDeleteFolder",
                "sqlworkbench:DeleteTab",
                "sqlworkbench:GenerateSession",
                "sqlworkbench:GetAccountInfo",
                "sqlworkbench:GetAccountSettings",
                "sqlworkbench:GetUserInfo",
                "sqlworkbench:GetUserWorkspaceSettings",
                "sqlworkbench:PutUserWorkspaceSettings",
                "sqlworkbench:ListConnections",
                "sqlworkbench:ListFiles",
                "sqlworkbench:ListTabs",
                "sqlworkbench:UpdateFolder",
                "sqlworkbench:ListRedshiftClusters",
                "sqlworkbench:DriverExecute",
                "sqlworkbench:ListTaggedResources"
            ],
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "AmazonRedshiftQueryEditorV2NonResourceLevelPermissions"
        },
        {
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/sqlworkbench-resource-owner": "${aws:userid}"
                }
            },
            "Action": [
                "sqlworkbench:CreateConnection",
                "sqlworkbench:CreateSavedQuery",
                "sqlworkbench:CreateChart"
            ],
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "AmazonRedshiftQueryEditorV2CreateOwnedResourcePermissions"
        },
        {
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sqlworkbench-resource-owner": "${aws:userid}"
                }
            },
            "Action": [
                "sqlworkbench:DeleteChart",
                "sqlworkbench:DeleteConnection",
                "sqlworkbench:DeleteSavedQuery",
                "sqlworkbench:GetChart",
                "sqlworkbench:GetConnection",
                "sqlworkbench:GetSavedQuery",
                "sqlworkbench:ListSavedQueryVersions",
                "sqlworkbench:UpdateChart",
                "sqlworkbench:UpdateConnection",
                "sqlworkbench:UpdateSavedQuery",
                "sqlworkbench:AssociateConnectionWithTab",
                "sqlworkbench:AssociateQueryWithTab",
                "sqlworkbench:AssociateConnectionWithChart",
                "sqlworkbench:UpdateFileFolder",
                "sqlworkbench:ListTagsForResource"
            ],
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "AmazonRedshiftQueryEditorV2OwnerSpecificPermissions"
        },
        {
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sqlworkbench-resource-owner": "${aws:userid}",
                    "aws:RequestTag/sqlworkbench-resource-owner": "${aws:userid}"
                },
                "ForAllValues:StringEquals": {
                    "aws:TagKeys": "sqlworkbench-resource-owner"
                }
            },
            "Action": "sqlworkbench:TagResource",
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "AmazonRedshiftQueryEditorV2TagOnlyUserIdPermissions"
        },
        {
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sqlworkbench-team": "${aws:PrincipalTag/sqlworkbench-team}"
                }
            },
            "Action": [
                "sqlworkbench:GetChart",
                "sqlworkbench:GetConnection",
                "sqlworkbench:GetSavedQuery",
                "sqlworkbench:ListSavedQueryVersions",
                "sqlworkbench:ListTagsForResource",
                "sqlworkbench:AssociateQueryWithTab"
            ],
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "AmazonRedshiftQueryEditorV2TeamReadAccessPermissions"
        },
        {
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sqlworkbench-resource-owner": "${aws:userid}",
                    "aws:RequestTag/sqlworkbench-team": "${aws:PrincipalTag/sqlworkbench-team}"
                }
            },
            "Action": "sqlworkbench:TagResource",
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "AmazonRedshiftQueryEditorV2TagOnlyTeamPermissions"
        },
        {
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sqlworkbench-resource-owner": "${aws:userid}"
                },
                "ForAllValues:StringEquals": {
                    "aws:TagKeys": "sqlworkbench-team"
                }
            },
            "Action": "sqlworkbench:UntagResource",
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "AmazonRedshiftQueryEditorV2UntagOnlyTeamPermissions"
        }
    ]
}

query-editor-notebook-policy – See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "sqlworkbench:ListNotebooks",
                "sqlworkbench:ListNotebookVersions",
                "sqlworkbench:ListQueryExecutionHistory"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/sqlworkbench-resource-owner": "${aws:userid}"
                }
            },
            "Action": [
                "sqlworkbench:CreateNotebook",
                "sqlworkbench:ImportNotebook",
                "sqlworkbench:DuplicateNotebook"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sqlworkbench-resource-owner": "${aws:userid}"
                }
            },
            "Action": [
                "sqlworkbench:GetNotebook",
                "sqlworkbench:UpdateNotebook",
                "sqlworkbench:DeleteNotebook",
                "sqlworkbench:CreateNotebookCell",
                "sqlworkbench:DeleteNotebookCell",
                "sqlworkbench:UpdateNotebookCellContent",
                "sqlworkbench:UpdateNotebookCellLayout",
                "sqlworkbench:BatchGetNotebookCell",
                "sqlworkbench:AssociateNotebookWithTab",
                "sqlworkbench:ExportNotebook",
                "sqlworkbench:CreateNotebookVersion",
                "sqlworkbench:GetNotebookVersion",
                "sqlworkbench:CreateNotebookFromVersion",
                "sqlworkbench:DeleteNotebookVersion",
                "sqlworkbench:RestoreNotebookVersion"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sqlworkbench-team": "${aws:PrincipalTag/sqlworkbench-team}"
                }
            },
            "Action": [
                "sqlworkbench:GetNotebook",
                "sqlworkbench:BatchGetNotebookCell",
                "sqlworkbench:AssociateNotebookWithTab"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

About the Authors

Mohammad Nejad

Mohammad Nejad leads the AWS Payments Data Platform team. He has experience leading teams, architecting designs, implementing solutions, and launching products. Currently, his team focuses on building a modern data platform on AWS to provide a complete solution for processing, analyzing, and presenting data.

Erol Murtezaoglu, a Technical Product Manager at AWS, is an inquisitive and enthusiastic thinker with a drive for self-improvement and learning. He has a strong and proven technical background in software development and architecture, balanced with a drive to deliver commercially successful products. Erol highly values the process of understanding customer needs and problems, in order to deliver solutions that exceed expectations.

Mohamed Shaaban is a Senior Software Engineer in Amazon Redshift and is based in Berlin, Germany. He has over 12 years of experience in the software engineering. He is passionate about cloud services and building solutions that delight customers. Outside of work, he is an amateur photographer who loves to explore and capture unique moments.

Introducing native support for Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark, Part 2: AWS Glue Studio Visual Editor

2023-03-20 Noritaka Sekiyama

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/part-2-glue-studio-visual-editor-introducing-native-support-for-apache-hudi-delta-lake-and-apache-iceberg-on-aws-glue-for-apache-spark/

In the first post of this series, we described how AWS Glue for Apache Spark works with Apache Hudi, Linux Foundation Delta Lake, and Apache Iceberg datasets tables using the native support of those data lake formats. This native support simplifies reading and writing your data for these data lake frameworks so you can more easily build and maintain your data lakes in a transactionally consistent manner. This feature removes the need to install a separate connector and reduces the configuration steps required to use these frameworks in AWS Glue for Apache Spark jobs.

These data lake frameworks help you store data more efficiently and enable applications to access your data faster. Unlike simpler data file formats such as Apache Parquet, CSV, and JSON, which can store big data, data lake frameworks organize distributed big data files into tabular structures that enable basic constructs of databases on data lakes.

Expanding on the functionality we announced at AWS re:Invent 2022, AWS Glue now natively supports Hudi, Delta Lake and Iceberg through the AWS Glue Studio visual editor. If you prefer authoring AWS Glue for Apache Spark jobs using a visual tool, you can now choose any of these three data lake frameworks as a source or target through a graphical user interface (GUI) without any custom code.

Even without prior experience using Hudi, Delta Lake or Iceberg, you can easily achieve typical use cases. In this post, we demonstrate how to ingest data stored in Hudi using the AWS Glue Studio visual editor.

Example scenario

To demonstrate the visual editor experience, this post introduces the Global Historical Climatology Network Daily (GHCN-D) dataset. The data is publicly accessible through an Amazon Simple Storage Service (Amazon S3) bucket. For more information, see the Registry of Open Data on AWS. You can also learn more in Visualize over 200 years of global climate data using Amazon Athena and Amazon QuickSight.

The Amazon S3 location s3://noaa-ghcn-pds/csv/by_year/ has all the observations from 1763 to the present organized in CSV files, one file for each year. The following block shows an example of what the records look like:

ID,DATE,ELEMENT,DATA_VALUE,M_FLAG,Q_FLAG,S_FLAG,OBS_TIME
AE000041196,20220101,TAVG,204,H,,S,
AEM00041194,20220101,TAVG,211,H,,S,
AEM00041217,20220101,TAVG,209,H,,S,
AEM00041218,20220101,TAVG,207,H,,S,
AE000041196,20220102,TAVG,226,H,,S,
...
AE000041196,20221231,TMAX,243,,,S,
AE000041196,20221231,PRCP,0,D,,S,
AE000041196,20221231,TAVG,202,H,,S,

The records have fields including ID, DATE, ELEMENT, and more. Each combination of ID, DATE, and ELEMENT represents a unique record in this dataset. For example, the record with ID as AE000041196, ELEMENT as TAVG, and DATE as 20220101 is unique.

In this tutorial, we assume that the files are updated with new records every day, and want to store only the latest record per the primary key (ID and ELEMENT) to make the latest snapshot data queryable. One typical approach is to do an INSERT for all the historical data, and calculate the latest records in queries; however, this can introduce additional overhead in all the queries. When you want to analyze only the latest records, it’s better to do an UPSERT (update and insert) based on the primary key and DATE field rather than just an INSERT in order to avoid duplicates and maintain a single updated row of data.

Prerequisites

To continue this tutorial, you need to create the following AWS resources in advance:

An AWS Identity and Access Management (IAM) role for your ETL job or notebook as instructed in Set up IAM permissions for AWS Glue Studio
An S3 bucket for storing data
An AWS Glue database called hudi_native

Process a Hudi dataset on the AWS Glue Studio visual editor

Let’s author an AWS Glue job to read daily records in 2022, and write the latest snapshot into the Hudi table on your S3 bucket using UPSERT. Complete following steps:

Open AWS Glue Studio.
Choose Jobs.
Choose Visual with a source and target.
For Source and Target, choose Amazon S3, then choose Create.

A new visual job configuration appears. The next step is to configure the data source to read an example dataset:

Under Visual, choose Data source – S3 bucket.
Under Node properties, for S3 source type, select S3 location.
For S3 URL, enter s3://noaa-ghcn-pds/csv/by_year/2022.csv.

The data source is configured.

data-source

The next step is to configure the data target to ingest data in Apache Hudi on your S3 bucket:

Choose Data target – S3 bucket.
Under Data target properties- S3, for Format, choose Apache Hudi.
For Hudi Table Name, enter ghcn.
For Hudi Storage Type, choose Copy on write.
For Hudi Write Operation, choose Upsert.
For Hudi Record Key Fields, choose ID.
For Hudi Precombine Key Field, choose DATE.
For Compression Type, choose GZIP.
For S3 Target location, enter s3://<Your S3 bucket name>/<Your S3 bucket prefix>/hudi_native/ghcn/. (Provide your S3 bucket name and prefix.)

To make it easy to discover the sample data, and also make it queryable from Athena, configure the job to create a table definition on the AWS Glue Data Catalog:

For Data Catalog update options, select Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions.
For Database, choose hudi_native.
For Table name, enter ghcn.
For Partition keys – optional, choose ELEMENT.

Now your data integration job is authored in the visual editor completely. Let’s add one remaining setting about the IAM role, then run the job:

Under Job details, for IAM Role, choose your IAM role.
Choose Save, then choose Run.

data-target

Navigate to the Runs tab to track the job progress and wait for it to complete.

job-run

Query the table with Athena

Now that the job has successfully created the Hudi table, you can query the table through different engines, including Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum, in addition to AWS Glue for Apache Spark.

To query through Athena, complete the following steps:

On the Athena console, open the query editor.
In the query editor, enter the following SQL and choose Run:

SELECT * FROM "hudi_native"."ghcn" limit 10;

The following screenshot shows the query result.
athena-query1

Let’s dive deep into the table to understand how the data is ingested and focus on the records with ID=’AE000041196′.

Run the following query to focus on the very specific example records with ID='AE000041196':

SELECT * FROM "hudi_native"."ghcn" WHERE ID='AE000041196';

The following screenshot shows the query result.
athena-query2

The original source file 2022.csv has historical records for record ID='USW00012894' from 20220101 to 20221231, however the query result shows only four records, one record per ELEMENT at the latest snapshot of the day 20221230 or 20221231. Because we used the UPSERT write option when writing data, we configured the ID field as a Hudi record key field, the DATE field as a Hudi precombine field, and the ELEMENT field as partition key field. When two records have the same key value, Hudi picks the one with the largest value for the precombine field. When the job ingested data, it compared all the values in the DATE field for each pair of ID and ELEMENT, and then picked the record with the largest value in the DATE field.

According to the preceding result, we were able to ingest the latest snapshot from all the 2022 data. Now let’s do an UPSERT of the new 2023 data to overwrite the records on the target Hudi table.

Go back to AWS Glue Studio console, modify the source S3 location to s3://noaa-ghcn-pds/csv/by_year/2023.csv, then save and run the job.

upsert-data-source

Run the same Athena query from the Athena console.

athena-query3
Now you see that the four records have been updated with the new records in 2023.

If you have further future records, this approach works well to upsert new records based on the Hudi record key and Hudi precombine key.

Clean up

Now to the final step, cleaning up the resources:

Delete the AWS Glue database hudi_native.
Delete the AWS Glue table ghcn.
Delete the S3 objects under s3://<Your S3 bucket name>/<Your S3 bucket prefix>/hudi_native/ghcn2022/.

Conclusion

This post demonstrated how to process Hudi datasets using the AWS Glue Studio visual editor. The AWS Glue Studio visual editor enables you to author jobs while taking advantage of data lake formats and without needing expertise in them. If you have comments or feedback, please feel free to leave them in the comments.

About the authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.

Scott Long is a Front End Engineer on the AWS Glue team. He is responsible for implementing new features in AWS Glue Studio. In his spare time, he enjoys socializing with friends and participating in various outdoor activities.

Sean Ma is a Principal Product Manager on the AWS Glue team. He has an 18+ year track record of innovating and delivering enterprise products that unlock the power of data for users. Outside of work, Sean enjoys scuba diving and college football.

Extend geospatial queries in Amazon Athena with UDFs and AWS Lambda

2023-03-17 John Telford

Post Syndicated from John Telford original https://aws.amazon.com/blogs/big-data/extend-geospatial-queries-in-amazon-athena-with-udfs-and-aws-lambda/

Amazon Athena is a serverless and interactive query service that allows you to easily analyze data in Amazon Simple Storage Service (Amazon S3) and 25-plus data sources, including on-premises data sources or other cloud systems using SQL or Python. Athena built-in capabilities include querying for geospatial data; for example, you can count the number of earthquakes in each Californian county. One disadvantage of analyzing at county-level is that it may give you a misleading impression of which parts of California have had the most earthquakes. This is because the counties aren’t equally sized; a county may have had more earthquakes simply because it’s a big county. What if we wanted a hierarchical system that allowed us to zoom in and out to aggregate data over different equally-sized geographic areas?

In this post, we present a solution that uses Uber’s Hexagonal Hierarchical Spatial Index (H3) to divide the globe into equally-sized hexagons. We then use an Athena user-defined function (UDF) to determine which hexagon each historical earthquake occurred in. Because the hexagons are equally-sized, this analysis gives a fair impression of where earthquakes tend to occur.

At the end, we’ll produce a visualization like the one below that shows the number of historical earthquakes in different areas of the western US.

H3 divides the globe into equal-sized regular hexagons. The number of hexagons depends on the chosen resolution, which may vary from 0 (122 hexagons, each with edge lengths of about 1,100 km) to 15 (569,707,381,193,162 hexagons, each with edge lengths of about 50 cm). H3 enables analysis at the area level, and each area has the same size and shape.

Solution overview

The solution extends Athena’s built-in geospatial capabilities by creating a UDF powered by AWS Lambda. Finally, we use an Amazon SageMaker notebook to run Athena queries that are rendered as a choropleth map. The following diagram illustrates this architecture.

The end-to-end architecture is as follows:

A CSV file of historical earthquakes is uploaded into an S3 bucket.
An AWS Glue external table is created based on the earthquake CSV.
A Lambda function calculates H3 hexagons for parameters (latitude, longitude, resolution). The function is written in Java and can be called as a UDF using queries in Athena.
A SageMaker notebook uses an AWS SDK for pandas package to run a SQL query in Athena, including the UDF.
A Plotly Express package renders a choropleth map of the number of earthquakes in each hexagon.

Prerequisites

For this post, we use Athena to read data in Amazon S3 using the table defined in the AWS Glue Data Catalog associated with our earthquake dataset. In terms of permissions, there are two main requirements:

Athena has access to the S3 bucket where you store the earthquake data and another S3 bucket used to store the output of Athena queries.
The AWS Identity and Access Management (IAM) role or user you use to run this example is authorized to use Athena. For details, see AWS managed policies for Amazon Athena.

Configure Amazon S3

The first step is to create an S3 bucket to store the earthquake dataset, as follows:

Download the CSV file of historical earthquakes from GitHub.
On the Amazon S3 console, choose Buckets in the navigation pane.
Choose Create bucket.
For Bucket name, enter a globally unique name for your data bucket.
Choose Create folder, and enter the folder name earthquakes.
Upload the file to the S3 bucket. In this example, we upload the earthquakes.csv file to the earthquakes prefix.

Create a table in Athena

Navigate to Athena console to create a table. Complete the following steps:

On the Athena console, choose Query editor.
Select your preferred Workgroup using the drop-down menu.

In the SQL editor, use the following code to create a table in the default database:

CREATE external TABLE earthquakes
(
  earthquake_date STRING,
  latitude DOUBLE,
  longitude DOUBLE,
  depth DOUBLE,
  magnitude DOUBLE,
  magtype STRING,
  mbstations STRING,
  gap STRING,
  distance STRING,
  rms STRING,
  source STRING,
  eventid STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION 's3://<MY-DATA-BUCKET>/earthquakes/';

Create a Lambda function for the Athena UDF

For a thorough explanation on how to build Athena UDFs, see Querying with user defined functions. We use Java 11 and Uber H3 Java binding to build the H3 UDF. We provide the implementation of the UDF on GitHub.

There are several options for deploying a UDF using Lambda. In this example, we use the AWS Management Console. For production deployments, you probably want to use infrastructure as code such as the AWS Cloud Development Kit (AWS CDK). For information about how to use the AWS CDK to deploy the Lambda function, refer to the project code repository. Another possible deployment option is using AWS Serverless Application Repository (SAR).

Deploy the UDF

Deploy the Uber H3 binding UDF using the console as follows:

Go to binary directory in the GitHub repository, and download aws-h3-athena-udf-*.jar to your local desktop.
Create a Lambda function called H3UDF with Runtime set to Java 11 (Corretto), and Architecture set to x86_64.
Upload the aws-h3-athena-udf*.jar file.
Change the handler name to com.aws.athena.udf.h3.H3AthenaHandler.
In the General configuration section, choose Edit to set the memory of the Lambda function to 4096 MB, which is an amount of memory that works for our examples. You may need to set the memory size larger for your use cases.

Use the Lambda function as an Athena UDF

After you create the Lambda function, you’re ready to use it as a UDF. The following screenshot shows the function details.

You can now use the function as an Athena UDF. On the Athena console, run the following command:

USING EXTERNAL FUNCTION lat_lng_to_cell_address(lat DOUBLE, lng DOUBLE, res INTEGER)
RETURNS VARCHAR
LAMBDA '<MY-LAMBDA-ARN>'-- Replace with ARN of your Lambda function.
SELECT *,
       lat_lng_to_cell_address(latitude, longitude, 4) AS h3_cell
FROM earthquakes
WHERE latitude BETWEEN 18 AND 70;

The udf/examples folder in the GitHub repository includes more examples of the Athena queries.

Developing the UDFs

Now that we showed you how to deploy a UDF for Athena using Lambda, let’s dive deeper into how to develop these kinds of UDFs. As explained in Querying with user defined functions, in order to develop a UDF, we first need to implement a class that inherits UserDefinedFunctionHandler. Then we need to implement the functions inside the class that can be used as UDFs of Athena.

We begin the UDF implementation by defining a class H3AthenaHandler that inherits the UserDefinedFunctionHandler. Then we implement functions that act as wrappers of functions defined in the Uber H3 Java binding. We make sure that all the functions defined in the H3 Java binding API are mapped, so that they can be used in Athena as UDFs. For example, we map the lat_lng_to_cell_address function used in the preceding example to the latLngToCell of the H3 Java binding.

On top of the call to the Java binding, many of the functions in the H3AthenaHandler check whether the input parameter is null. The null check is useful because we don’t assume the input to be non-null. In practice, null values for an H3 index or address are not unusual.

The following code shows the implementation of the get_resolution function:

/** Returns the resolution of an index.
     *  @param h3 the H3 index.
     *  @return the resolution. Null when h3 is null.
     *  @throws IllegalArgumentException when index is out of range.
     */
    public Integer get_resolution(Long h3){
        final Integer result;
        if (h3 == null) {
            result = null;
        } else {
            result = h3Core.getResolution(h3);
        }
        return result;
    }

Some H3 API functions such as cellToLatLng return List<Double> of two elements, where the first element is the latitude and the second is longitude. The H3 UDF that we implement provides a function that returns well-known text (WKT) representation. For example, we provide cell_to_lat_lng_wkt, which returns a Point WKT string instead of List<Double>. We can then use the output of cell_to_lat_lng_wkt in combination with the built-in spatial Athena function ST_GeometryFromText as follows:

USING EXTERNAL FUNCTION cell_to_lat_lng_wkt(h3 BIGINT) 
RETURNS VARCHAR
LAMBDA '<MY-LAMBDA-ARN>'
SELECT ST_GeometryFromText(cell_to_lat_lng_wkt(622506764662964223))

Athena UDF only supports scalar data types and does not support nested types. However, some H3 APIs return nested types. For example, the polygonToCells function in H3 takes a List<List<List<GeoCoord>>>. Our implementation of polygon_to_cells UDF receives a Polygon WKT instead. The following shows an example Athena query using this UDF:

-- get all h3 hexagons that cover Toulouse, Nantes, Lille, Paris, Nice 
USING EXTERNAL FUNCTION polygon_to_cells(polygonWKT VARCHAR, res INT)
RETURNS ARRAY(BIGINT)
LAMBDA '<MY-LAMBDA-ARN>'
SELECT polygon_to_cells('POLYGON ((43.604652 1.444209, 47.218371 -1.553621, 50.62925 3.05726, 48.864716 2.349014, 43.6961 7.27178, 3.604652 1.444209))', 2)

Use SageMaker notebooks for visualization

A SageMaker notebook is a managed machine learning compute instance that runs a Jupyter notebook application. In this example, we will use a SageMaker notebook to write and run our code to visualize our results, but if your use case includes Apache Spark then using Amazon Athena for Apache Spark would be a great choice. For advice on security best practices for SageMaker, see Building secure machine learning environments with Amazon SageMaker. You can create your own SageMaker notebook by following these instructions:

On the SageMaker console, choose Notebook in the navigation pane.
Choose Notebook instances.
Choose Create notebook instance.
Enter a name for the notebook instance.
Choose an existing IAM role or create a role that allows you to run SageMaker and grants access to Amazon S3 and Athena.
Choose Create notebook instance.
Wait for the notebook status to change from Creating to InService.
Open the notebook instance by choosing Jupyter or JupyterLab.

Explore the data

We’re now ready to explore the data.

On the Jupyter console, under New, choose Notebook.
On the Select Kernel drop-down menu, choose conda_python3.
Add new cells by choosing the plus sign.
In your first cell, download the following Python modules that aren’t included in the standard SageMaker environment:
```
!pip install geojson
!pip install awswrangler
!pip install geomet
!pip install shapely
```
GeoJSON is a popular format for storing spatial data in a JSON format. The geojson module allows you to easily read and write GeoJSON data with Python. The second module we install, awswrangler, is the AWS SDK for pandas. This is a very easy way to read data from various AWS data sources into Pandas data frames. We use it to read earthquake data from the Athena table.

Next, we import all the packages that we use to import the data, reshape it, and visualize it:

from geomet import wkt
import plotly.express as px
from shapely.geometry import Polygon, mapping
import awswrangler as wr
import pandas as pd
from shapely.wkt import loads
import geojson
import ast

We begin importing our data using the athena.read_sql._query function in AWS SDK for pandas. The Athena query has a subquery that uses the UDF to add a column h3_cell to each row in the earthquakes table, based on the latitude and longitude of the earthquake. The analytic function COUNT is then used to find out the number of earthquakes in each H3 cell. For this visualization, we’re only interested in earthquakes within the US, so we filter out rows in the data frame that are outside the area of interest:

def run_query(lambda_arn, db, resolution):
    query = f"""USING EXTERNAL FUNCTION cell_to_boundary_wkt(cell VARCHAR)
                    RETURNS ARRAY(VARCHAR)
                    LAMBDA '{lambda_arn}'
                       SELECT h3_cell, cell_to_boundary_wkt(h3_cell) as boundary, quake_count FROM(
                        USING EXTERNAL FUNCTION lat_lng_to_cell_address(lat DOUBLE, lng DOUBLE, res INTEGER)
                         RETURNS VARCHAR
                        LAMBDA '{lambda_arn}'
                    SELECT h3_cell, COUNT(*) AS quake_count
                      FROM
                        (SELECT *,
                           lat_lng_to_cell_address(latitude, longitude, {resolution}) AS h3_cell
                         FROM earthquakes
                         WHERE latitude BETWEEN 18 AND 70        -- For this visualisation, we're only interested in earthquakes within the USA.
                           AND longitude BETWEEN -175 AND -50
                         )
                       GROUP BY h3_cell ORDER BY quake_count DESC) cell_quake_count"""
    return wr.athena.read_sql_query(query, database=db)

lambda_arn = '<MY-LAMBDA-ARN>' # Replace with ARN of your lambda.
db_name = '<MY-DATABASE-NAME>' # Replace with name of your Glue database.
earthquakes_df = run_query(lambda_arn=lambda_arn,db=db_name, resolution=4)
earthquakes_df.head()

The following screenshot shows our results.

Follow along with the rest of the steps in our Jupyter notebook to see how we analyze and visualize our example with H3 UDF data.

Visualize the results

To visualize our results, we use the Plotly Express module to create a choropleth map of our data. A choropleth map is a type of visualization that is shaded based on quantitative values. This is a great visualization for our use case because we’re shading different regions based on the frequency of earthquakes.

In the resulting visual, we can see the ranges of frequency of earthquakes in different areas of North America. Note, the H3 resolution in this map is lower than in the earlier map, which makes each hexagon cover a larger area of the globe.

Clean up

To avoid incurring extra charges on your account, delete the resources you created:

On the SageMaker console, select the notebook and on the Actions menu, choose Stop.
Wait for the status of the notebook to change to Stopped, then select the notebook again and on the Actions menu, choose Delete.
On the Amazon S3 console, select the bucket you created and choose Empty.
Enter the bucket name and choose Empty.
Select the bucket again and choose Delete.
Enter the bucket name and choose Delete bucket.
On the Lambda console, select the function name and on the Actions menu, choose Delete.

Conclusion

In this post, you saw how to extend functions in Athena for geospatial analysis by adding your own user-defined function. Although we used Uber’s H3 geospatial index in this demonstration, you can bring your own geospatial index for your own custom geospatial analysis.

In this post, we used Athena, Lambda, and SageMaker notebooks to visualize the results of our UDFs in the western US. Code examples are in the h3-udf-for-athena GitHub repo.

As a next step, you can modify the code in this post and customize it for your own needs to gain further insights from your own geographical data. For example, you could visualize other cases such as droughts, flooding, and deforestation.

About the Authors

John Telford is a Senior Consultant at Amazon Web Services. He is a specialist in big data and data warehouses. John has a Computer Science degree from Brunel University.

Anwar Rizal is a Senior Machine Learning consultant based in Paris. He works with AWS customers to develop data and AI solutions to sustainably grow their business.

Pauline Ting is a Data Scientist in the AWS Professional Services team. She supports customers in achieving and accelerating their business outcome by developing sustainable AI/ML solutions. In her spare time, Pauline enjoys traveling, surfing, and trying new dessert places.

Establishing a data perimeter on AWS: Allow only trusted resources from my organization

2023-03-09 Laura Reith

Post Syndicated from Laura Reith original https://aws.amazon.com/blogs/security/establishing-a-data-perimeter-on-aws-allow-only-trusted-resources-from-my-organization/

Companies that store and process data on Amazon Web Services (AWS) want to prevent transfers of that data to or from locations outside of their company’s control. This is to support security strategies, such as data loss prevention, or to comply with the terms and conditions set forth by various regulatory and privacy agreements. On AWS, a resource perimeter is a set of AWS Identity and Access Management (IAM) features and capabilities that you can use to build your defense-in-depth protection against unintended data transfers. In this third blog post of the Establishing a data perimeter on AWS series, we review the benefits and implementation considerations when you define your resource perimeter.

The resource perimeter is one of the three perimeters in the data perimeter framework on AWS and has the following two control objectives:

My identities can access only trusted resources – This helps to ensure that IAM principals that belong to your AWS Organizations organization can access only the resources that you trust.
Only trusted resources can be accessed from my network – This helps to ensure that only resources that you trust can be accessed through expected networks, regardless of the principal that is making the API call.

Trusted resources are the AWS resources, such as Amazon Simple Storage Service (Amazon S3) buckets and objects or Amazon Simple Notification Service (Amazon SNS) topics, that are owned by your organization and in which you store and process your data. Additionally, there are resources outside your organization that your identities or AWS services acting on your behalf might need to access. You will need to consider these access patterns when you define your resource perimeter.

Security risks addressed by the resource perimeter

The resource perimeter helps address three main security risks.

Unintended data disclosure through use of corporate credentials — Your developers might have a personal AWS account that is not part of your organization. In that account, they could configure a resource with a resource-based policy that allows their corporate credentials to interact with the resource. For example, they could write an S3 bucket policy that allows them to upload objects by using their corporate credentials. This could allow the intentional or unintentional transfer of data from your corporate environment — your on-premises network or virtual private cloud (VPC) — to their personal account. While you advance through your least privilege journey, you should make sure that access to untrusted resources is prohibited, regardless of the permissions granted by identity-based policies that are attached to your IAM principals. Figure 1 illustrates an unintended access pattern where your employee uses an identity from your organization to move data from your on-premises or AWS environment to an S3 bucket in a non-corporate AWS account.

Figure 1: Unintended data transfer to an S3 bucket outside of your organization by your identities

Unintended data disclosure through non-corporate credentials usage — There is a risk that developers could introduce personal IAM credentials to your corporate network and attempt to move company data to personal AWS resources. We discussed this security risk in a previous blog post: Establishing a data perimeter on AWS: Allow only trusted identities to access company data. In that post, we described how to use the aws:PrincipalOrgID condition key to prevent the use of non-corporate credentials to move data into an untrusted location. In the current post, we will show you how to implement resource perimeter controls as a defense-in-depth approach to mitigate this risk.

Unintended data infiltration — There are situations where your developers might start the solution development process using commercial datasets, tooling, or software and decide to copy them from repositories, such as those hosted on public S3 buckets. This could introduce malicious components into your corporate environment, your on-premises network, or VPCs. Establishing the resource perimeter to only allow access to trusted resources from your network can help mitigate this risk. Figure 2 illustrates the access pattern where an employee with corporate credentials downloads assets from an S3 bucket outside of your organization.

Figure 2: Unintended data infiltration

Implement the resource perimeter

To achieve the resource perimeter control objectives, you can implement guardrails in your AWS environment by using the following AWS policy types:

Service control policies (SCPs) – Organization policies that are used to centrally manage and set the maximum available permissions for your IAM principals. SCPs help you ensure that your accounts stay within your organization’s access control guidelines. In the context of the resource perimeter, you will use SCPs to help prevent access to untrusted resources from AWS principals that belong to your organization.
VPC endpoint policy – An IAM resource-based policy that is attached to a VPC endpoint to control which principals, actions, and resources can be accessed through a VPC endpoint. In the context of the resource perimeter, VPC endpoint policies are used to validate that the resource the principal is trying to access belongs to your organization.

The condition key used to constrain access to resources in your organization is aws:ResourceOrgID. You can set this key in an SCP or VPC endpoint policy. The following table summarizes the relationship between the control objectives and the AWS capabilities used to implement the resource perimeter.

Control objective	Implemented by using	Primary IAM capability
My identities can access only trusted resources	SCPs	aws:ResourceOrgID
Only trusted resources can be accessed from my network	VPC endpoint policies	aws:ResourceOrgID

In the next section, you will learn how to use the IAM capabilities listed in the preceding table to implement each control objective of the resource perimeter.

My identities can access only trusted resources

The following is an example of an SCP that limits all actions to only the resources that belong to your organization. Replace <MY-ORG-ID> with your information.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EnforceResourcePerimeter",
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "StringNotEqualsIfExists": {
          "aws:ResourceOrgID": "<MY-ORG-ID>"
        }
      }
    }
  ]
}

In this policy, notice the use of the negated condition key StringNotEqualsIfExists. This means that this condition will evaluate to true and the policy will deny API calls if the organization identifier of the resource that is being accessed differs from the one specified in the policy. It also means that this policy will deny API calls if the resource being accessed belongs to a standalone account, which isn’t part of an organization. The negated condition operators in the Deny statement mean that the condition still evaluates to true if the key is not present in the request; however, as a best practice, I added IfExists to the end of the StringNotEquals operator to clearly express the intent in the policy.

Note that for a permission to be allowed for a specific account, a statement that allows access must exist at every level of the hierarchy of your organization.

Only trusted resources can be accessed from my network

You can achieve this objective by combining the SCP we just reviewed with the use of aws:PrincipalOrgID in your VPC endpoint policies, as shown in the Establishing a data perimeter on AWS: Allow only trusted identities to access company data blog post. However, as a defense in depth, you can also apply resource perimeter controls on your networks by using aws:ResourceOrgID in your VPC endpoint policies.

The following is an example of a VPC endpoint policy that allows access to all actions but limits access to only trusted resources and identities that belong to your organization. Replace <MY-ORG-ID> with your information.

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "AllowRequestsByOrgsIdentitiesToOrgsResources",
			"Effect": "Allow",
			"Principal": {
				"AWS": "*"
			},
			"Action": "*",
			"Resource": "*",
			"Condition": {
				"StringEquals": {
					"aws:PrincipalOrgID": "<MY-ORG-ID>",
					"aws:ResourceOrgID": "<MY-ORG-ID>"
				}
			}
		}
	]
}

The preceding VPC endpoint policy uses the StringEquals condition operator. To invoke the Allow effect, the principal making the API call and the resource they are trying to access both need to belong to your organization. Compared to the SCP example that we reviewed earlier, your intent for this policy is different — you want to make sure that the Allow condition evaluates to true only if the specified key exists in the request. Additionally, VPC endpoint policies apply to principals, as long as their request flows through the VPC endpoint.

In VPC endpoint policies, you do not grant permissions; rather, you define the maximum allowed access through the network. Therefore, this policy uses an Allow effect.

Extend your resource perimeter

The previous two policies help you ensure that your identities and networks can only be used to access AWS resources that belong to your organization. However, your company might require that you extend your resource perimeter to also include AWS owned resources — resources that do not belong to your organization and that are accessed by your principals or by AWS services acting on your behalf. For example, if you use the AWS Service Catalog in your environment, the service creates and uses Amazon S3 buckets that are owned by the service to store products. To allow your developers to successfully provision AWS Service Catalog products, your resource perimeter needs to account for this access pattern. The following statement shows how to account for the service catalog access pattern. Replace <MY-ORG-ID> with your information.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EnforceResourcePerimeter",
      "Effect": "Deny",
      "NotAction": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:PutObjectAcl"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEqualsIfExists": {
          "aws:ResourceOrgID": "<MY-ORG-ID>"
        }
      }
    },
    {
      "Sid": "ExtendResourcePerimeter",
      "Effect": "Deny",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:PutObjectAcl"
      ],
      "Resource": [
        "*"
      ],
      "Condition": {
        "StringNotEqualsIfExists": {
          "aws:ResourceOrgID": "<MY-ORG-ID>"
        },
        "ForAllValues:StringNotEquals": {
          "aws:CalledVia": [
            "servicecatalog.amazonaws.com"
          ]
        }
      }
    }
  ]
}

Note that the EnforceResourcePerimeter statement in the SCP was modified to exclude s3:GetObject, s3:PutObject, and s3:PutObjectAcl actions from its effect (NotAction element). This is because these actions are performed by the Service Catalog to access service-owned S3 buckets. These actions are then restricted in the ExtendResourcePerimeter statement, which includes two negated condition keys. The second statement denies the previously mentioned S3 actions unless the resource that is being accessed belongs to your organization (StringNotEqualsIfExists with aws:ResourceOrgID), or the actions are performed by Service Catalog on your behalf (ForAllValues:StringNotEquals with aws:CalledVia). The aws:CalledVia condition key compares the services specified in the policy with the services that made requests on behalf of the IAM principal by using that principal’s credentials. In the case of the Service Catalog, the credentials of a principal who launches a product are used to access S3 buckets that are owned by the Service Catalog.

It is important to highlight that we are purposely not using the aws:ViaAWSService condition key in the preceding policy. This is because when you extend your resource perimeter, we recommend that you restrict access to only calls to buckets that are accessed by the service you are using.

You might also need to extend your resource perimeter to include the third-party resources of your partners. For example, you could be working with business partners that require your principals to upload or download data to or from S3 buckets that belong to their account. In this case, you can use the aws:ResourceAccount condition key in your resource perimeter policy to specify resources that belong to the trusted third-party account.

The following is an example of an SCP that accounts for access to the Service Catalog and third-party partner resources. Replace <MY-ORG-ID> and <THIRD-PARTY-ACCOUNT> with your information.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EnforceResourcePerimeter",
      "Effect": "Deny",
      "NotAction": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:PutObjectAcl"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEqualsIfExists": {
          "aws:ResourceOrgID": "<MY-ORG-ID>"
        }
      }
    },
    {
      "Sid": "ExtendResourcePerimeter",
      "Effect": "Deny",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:PutObjectAcl"
      ],
      "Resource": [
        "*"
      ],
      "Condition": {
        "StringNotEqualsIfExists": {
          "aws:ResourceOrgID": "<MY-ORG-ID>",
          "aws:ResourceAccount": "<THIRD-PARTY-ACCOUNT>"
        },
        "ForAllValues:StringNotEquals": {
          "aws:CalledVia": [
            "servicecatalog.amazonaws.com"
          ]
        }
      }
    }
  ]
}

To account for access to trusted third-party account resources, the condition StringNotEqualsIfExists in the ExtendResourcePerimeter statement now also contains the condition key aws:ResourceAccount. Now, the second statement denies the previously mentioned S3 actions unless the resource that is being accessed belongs to your organization (StringNotEqualsIfExists with aws:ResourceOrgID), to a trusted third-party account (StringNotEqualsIfExists with aws:ResourceAccount), or the actions are performed by Service Catalog on your behalf (ForAllValues:StringNotEquals with aws:CalledVia).

The next policy example demonstrates how to extend your resource perimeter to permit access to resources that are owned by your trusted third parties through the networks that you control. This is required if applications running in your VPC or on-premises need to be able to access a dataset that is created and maintained in your business partner AWS account. Similar to the SCP example, you can use the aws:ResourceAccount condition key in your VPC endpoint policy to account for this access pattern. Replace <MY-ORG-ID>, <THIRD-PARTY-ACCOUNT>, and <THIRD-PARTY-RESOURCE-ARN> with your information.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowRequestsByOrgsIdentitiesToOrgsResources",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:PrincipalOrgID": "<MY-ORG-ID>",
          "aws:ResourceOrgID": "<MY-ORG-ID>"
        }
      }
    },
    {
      "Sid": "AllowRequestsByOrgsIdentitiesToThirdPartyResources",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:PutObjectAcl"
      ],
      "Resource": [
        "<THIRD-PARTY-RESOURCE-ARN>"
      ],
      "Condition": {
        "StringEquals": {
          "aws:PrincipalOrgID": "<MY-ORG-ID>",
          "aws:ResourceAccount": [
            "<THIRD-PARTY-ACCOUNT>"
          ]
        }
      }
    }
  ]
}

The second statement, AllowRequestsByOrgsIdentitiesToThirdPartyResources, in the updated VPC endpoint policy allows s3:GetObject, s3:PutObject, and s3:PutObjectAcl actions on trusted third-party resources (StringEquals with aws:ResourceAccount) by principals that belong to your organization (StringEquals with aws:PrincipalOrgID).

Note that you do not need to modify your VPC endpoint policy to support the previously discussed Service Catalog operations. This is because calls to Amazon S3 made by Service Catalog on your behalf originate from the Service Catalog service network and do not traverse your VPC endpoint. However, you should consider access patterns that are similar to the Service Catalog example when defining your trusted resources. To learn about services with similar access patterns, see the IAM policy samples section later in this post.

Deploy the resource perimeter at scale

For recommendations on deploying a data perimeter at scale, see the Establishing a data perimeter on AWS: Allow only trusted identities to access company data blog post. The section titled Deploying the identity perimeter at scale provides the details on how to achieve this for your organization.

IAM policy samples

Our GitHub repository contains policy examples that illustrate how to implement perimeter controls for a variety of AWS services. The policy examples in the repository are for reference only. You will need to tailor them to suit the specific needs of your AWS environment.

Conclusion

In this blog post, you learned about the resource perimeter, the control objectives achieved by the perimeter, and how to write SCPs and VPC endpoint policies that help achieve these objectives for your organization. You also learned how to extend your perimeter to include AWS service-owned resources and your third-party partner-owned resources.

For additional learning opportunities, see the Data perimeters on AWS page. This information resource provides additional materials such as a data perimeter workshop, blog posts, whitepapers, and webinar sessions.

If you have questions, comments, or concerns, contact AWS Support or browse AWS re:Post. If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security news? Follow us on Twitter.

How gaming companies can use Amazon Redshift Serverless to build scalable analytical applications faster and easier

2023-03-07 Satesh Sonti

Post Syndicated from Satesh Sonti original https://aws.amazon.com/blogs/big-data/how-gaming-companies-can-use-amazon-redshift-serverless-to-build-scalable-analytical-applications-faster-and-easier/

This post provides guidance on how to build scalable analytical solutions for gaming industry use cases using Amazon Redshift Serverless. It covers how to use a conceptual, logical architecture for some of the most popular gaming industry use cases like event analysis, in-game purchase recommendations, measuring player satisfaction, telemetry data analysis, and more. This post also discusses the art of the possible with newer innovations in AWS services around streaming, machine learning (ML), data sharing, and serverless capabilities.

Our gaming customers tell us that their key business objectives include the following:

Increased revenue from in-app purchases
High average revenue per user and lifetime value
Improved stickiness with better gaming experience
Improved event productivity and high ROI

Our gaming customers also tell us that while building analytics solutions, they want the following:

Low-code or no-code model – Out-of-the-box solutions are preferred to building customized solutions.
Decoupled and scalable – Serverless, auto scaled, and fully managed services are preferred over manually managed services. Each service should be easily replaceable, enhanced with little or no dependency. Solutions should be flexible to scale up and down.
Portability to multiple channels – Solutions should be compatible with most of endpoint channels like PC, mobile, and gaming platforms.
Flexible and easy to use – The solutions should provide less restrictive, easy-to-access, and ready-to-use data. They should also provide optimal performance with low or no tuning.

Analytics reference architecture for gaming organizations

In this section, we discuss how gaming organizations can use a data hub architecture to address the analytical needs of an enterprise, which requires the same data at multiple levels of granularity and different formats, and is standardized for faster consumption. A data hub is a center of data exchange that constitutes a hub of data repositories and is supported by data engineering, data governance, security, and monitoring services.

A data hub contains data at multiple levels of granularity and is often not integrated. It differs from a data lake by offering data that is pre-validated and standardized, allowing for simpler consumption by users. Data hubs and data lakes can coexist in an organization, complementing each other. Data hubs are more focused around enabling businesses to consume standardized data quickly and easily. Data lakes are more focused around storing and maintaining all the data in an organization in one place. And unlike data warehouses, which are primarily analytical stores, a data hub is a combination of all types of repositories—analytical, transactional, operational, reference, and data I/O services, along with governance processes. A data warehouse is one of the components in a data hub.

The following diagram is a conceptual analytics data hub reference architecture. This architecture resembles a hub-and-spoke approach. Data repositories represent the hub. External processes are the spokes feeding data to and from the hub. This reference architecture partly combines a data hub and data lake to enable comprehensive analytics services.

Let’s look at the components of the architecture in more detail.

Sources

Data can be loaded from multiple sources, such as systems of record, data generated from applications, operational data stores, enterprise-wide reference data and metadata, data from vendors and partners, machine-generated data, social sources, and web sources. The source data is usually in either structured or semi-structured formats, which are highly and loosely formatted, respectively.

Data inbound

This section consists of components to process and load the data from multiple sources into data repositories. It can be in batch mode, continuous, pub/sub, or any other
custom integration. ETL (extract, transform, and load) technologies, streaming services, APIs, and data exchange interfaces are the core components of this pillar. Unlike ingestion processes, data can be transformed as per business rules before loading. You can apply technical or business data quality rules and load raw data as well. Essentially, it provides the flexibility to get the data into repositories in its most usable form.

Data repositories

This section consists of a group of data stores, which includes data warehouses, transactional or operational data stores, reference data stores, domain data stores housing purpose-built business views, and enterprise datasets (file storage). The file storage component is usually a common component between a data hub and a data lake to avoid data duplication and provide comprehensiveness. Data can also be shared among all these repositories without physically moving with features, such as data sharing and federated queries. However, data copy and duplication are allowed considering various consumption needs in terms of formats and latency.

Data outbound

Data is often consumed using structured queries for analytical needs. Also, datasets are accessed for ML, data exporting, and publishing needs. This section consists of components to query the data, export, exchange, and APIs. In terms of implementation, the same technologies may be used for both inbound and outbound, but the functions are different. However, it’s not mandatory to use the same technologies. These processes aren’t transformation heavy because the data is already standardized and almost ready to consume. The focus is on the ease of consumption and integration with consuming services.

Consumption

This pillar consists of various consumption channels for enterprise analytical needs. It includes business intelligence (BI) users, canned and interactive reports, dashboards, data science workloads, Internet of Things (IoT), web apps, and third-party data consumers. Popular consumption entities in many organizations are queries, reports, and data science workloads. Because there are multiple data stores maintaining data at different granularity and formats to service consumer needs, these consumption components depend on data catalogs for finding the right source.

Data governance

Data governance is key to the success of a data hub reference architecture. It constitutes components like metadata management, data quality, lineage, masking, and stewardship, which are required for organized maintenance of the data hub. Metadata management helps organize the technical and business metadata catalog, and consumers can reference this catalog to know what data is available in which repository and at what granularity, format, owners, refresh frequency, and so on. Along with metadata management, data quality is important to increase confidence for consumers. This includes data cleansing, validation, conformance, and data controls.

Security and monitoring

Users and application access should be controlled at multiple levels. It starts with authentication, then authorizing who and what should be accessed, policy management, encryption, and applying data compliance rules. It also includes monitoring components to log the activity for auditing and analysis.

Analytics data hub solution architecture on AWS

The following reference architecture provides an AWS stack for the solution components.

Let’s look at each component again and the relevant AWS services.

Data inbound services

AWS Glue and Amazon EMR services are ideal for batch processing. They scale automatically and are able to process most of the industry standard data formats. Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, and Amazon Managed Streaming for Apache Kafka (Amazon MSK) enables you to build streaming process applications. These streaming services integrate well with the Amazon Redshift streaming feature. This helps you process real-time sources, IoT data, and data from online channels. You can also ingest data with third-party tools like Informatica, dbt, and Matallion.

You can build RESTful APIs and WebSocket APIs using Amazon API Gateway and AWS Lambda, which will enable real-time two-way communication with web sources, social, and IoT sources. AWS Data Exchange helps with subscribing to third-party data in AWS Marketplace. Data subscription and access is fully managed with this service. Refer to the respective service documentation for further details.

Data repository services

Amazon Redshift is the recommended data storage service for OLAP (Online Analytical Processing) workloads such as cloud data warehouses, data marts, and other analytical data stores. This service is the core of this reference architecture on AWS and can address most analytical needs out of the box. You can use simple SQL to analyze structured and semi-structured data across data warehouses, data marts, operational databases, and data lakes to deliver the best price performance at any scale. The Amazon Redshift data sharing feature provides instant, granular, and high-performance access without data copies and data movement across multiple Amazon Redshift data warehouses in the same or different AWS accounts, and across Regions.

For ease of use, Amazon Redshift offers a serverless option. Amazon Redshift Serverless automatically provisions and intelligently scales data warehouse capacity to deliver fast performance for even the most demanding and unpredictable workloads, and you pay only for what you use. Just load your data and start querying right away in Amazon Redshift Query Editor or in your favorite BI tool and continue to enjoy the best price performance and familiar SQL features in an easy-to-use, zero administration environment.

Amazon Relational Database Service (Amazon RDS) is a fully managed service for building transactional and operational data stores. You can choose from many popular engines such as MySQL, PostgreSQL, MariaDB, Oracle, and SQL Server. With the Amazon Redshift federated query feature, you can query transactional and operational data in place without moving the data. The federated query feature currently supports Amazon RDS for PostgreSQL, Amazon Aurora PostgreSQL-Compatible Edition, Amazon RDS for MySQL, and Amazon Aurora MySQL-Compatible Edition.

Amazon Simple Storage Service (Amazon S3) is the recommended service for multi-format storage layers in the architecture. It offers industry-leading scalability, data availability, security, and performance. Organizations typically store data in Amazon S3 using open file formats. Open file formats enable analysis of the same Amazon S3 data using multiple processing and consumption layer components. Data in Amazon S3 can be easily queried in place using SQL with Amazon Redshift Spectrum. It helps you query and retrieve structured and semi-structured data from files in Amazon S3 without having to load the data. Multiple Amazon Redshift data warehouses can concurrently query the same datasets in Amazon S3 without the need to make copies of the data for each data warehouse.

Data outbound services

Amazon Redshift comes with the web-based analytics workbench Query Editor V2.0, which helps you run queries, explore data, create SQL notebooks, and collaborate on data with your teams in SQL through a common interface. AWS Transfer Family helps securely transfer files using SFTP, FTPS, FTP, and AS2 protocols. It supports thousands of concurrent users and is a fully managed, low-code service. Similar to inbound processes, you can utilize Amazon API Gateway and AWS Lambda for data pull using the Amazon Redshift Data API. And AWS Data Exchange helps publish your data to third parties for consumption through AWS Marketplace.

Consumption services

Amazon QuickSight is the recommended service for creating reports and dashboards. It enables you to create interactive dashboards, visualizations, and advanced analytics with ML insights. Amazon SageMaker is the ML platform for all your data science workload needs. It helps you build, train, and deploy models consuming the data from repositories in the data hub. You can use Amazon front-end web and mobile services and AWS IoT services to build web, mobile, and IoT endpoint applications to consume data out of the data hub.

Data governance services

The AWS Glue Data Catalog and AWS Lake Formation are the core data governance services AWS currently offers. These services help manage metadata centrally for all the data repositories and manage access controls. They also help with data classification and can automatically handle schema changes. You can use Amazon DataZone to discover and share data at scale across organizational boundaries with built-in governance and access controls. AWS is investing in this space to provide more a unified experience for AWS services. There are many partner products such as Collibra, Alation, Amorphic, Informatica, and more, which you can use as well for data governance functions with AWS services.

Security and monitoring services

AWS Identity and Access Management (AWS IAM) manages identities for AWS services and resources. You can define users, groups, roles, and policies for fine-grained access management of your workforce and workloads. AWS Key Management Service (AWS KMS) manages AWS keys or customer managed keys for your applications. Amazon CloudWatch and AWS CloudTrail help provide monitoring and auditing capabilities. You can collect metrics and events and analyze them for operational efficiency.

In this post, we’ve discussed the most common AWS services for the respective solution components. However, you aren’t limited to only these services. There are many other AWS services for specific use cases that may be more appropriate for your needs than what we discussed here. You can reach to AWS Analytics Solutions Architects for appropriate guidance.

Example architectures for gaming use cases

In this section, we discuss example architectures for two gaming use cases.

Game event analysis

In-game events (also called timed or live events) encourage player engagement through excitement and anticipation. Events entice players to interact with the game, increasing player satisfaction and revenue with in-game purchases. Events have become more and more important, especially as games shift from being static pieces of entertainment to be played as is to offering dynamic and changing content through the use of services that use information to make decisions about game play as the game is being played. This enables games to change as the players play and influence what works and what doesn’t, and gives any game a potentially infinite lifespan.

This capability of in-game events to offer fresh content and activities within a familiar framework is how you keep players engaged and playing for months to years. Players can enjoy new experiences and challenges within the familiar framework or world that they have grown to love.

The following example shows how such an architecture might appear, including changes to support various sections of the process like breaking the data into separate containers to accommodate scalability, charge-back, and ownership.

To fully understand how events are viewed by the players and to make decisions about future events requires information on how the latest event was actually performed. This means gathering a lot of data as the players play to build key performance indicators (KPIs) that measure the effectiveness and player satisfaction with each event. This requires analytics that specifically measure each event and capture, analyze, report on, and measure player experience for each event. These KPIs include the following:

Initial user flow interactions – What actions users are taking after they first receive or download an event update in a game. Are there any clear drop-off points or bottlenecks that are turning people off the event?
Monetization – When, what, and where users are spending money on in the event, whether it’s buying in-game currencies, answering ads, specials, and so on.
Game economy – How can users earn and spend virtual currencies or goods during an event, using in-game money, trades, or barter.
In-game activity – Player wins, losses, leveling up, competition wins, or player achievements within the event.
User to user interactions – Invitations, gifting, chats (private and group), challenges, and so on during an event.

These are just some of the KPIs and metrics that are key for predictive modeling of events as the game acquires new players while keeping existing users involved, engaged, and playing.

In-game activity analysis

In-game activity analysis essentially looks at any meaningful, purposeful activity the player might show, with the goal of trying to understand what actions are taken, their timing, and outcomes. This includes situational information about the players, including where they are playing (both geographical and cultural), how often, how long, what they undertake on each login, and other activities.

The following example shows how such an architecture might appear, including changes to support various sections of the process like breaking the data into separate warehouses. The multi-cluster warehouse approach helps scale the workload independently, provides flexibility to the implemented charge-back model, and supports decentralized data ownership.

The solution essentially logs information to help understand the behavior of your players, which can lead to insights that increase retention of existing players, and acquisition of new ones. This can provide the ability to do the following:

Provide in-game purchase recommendations
Measure player trends in the short term and over time
Plan events the players will engage in
Understand what parts of your game are most successful and which are less so

You can use this understanding to make decisions about future game updates, make in-game purchase recommendations, determine when and how your game economy may need to be balanced, and even allow players to change their character or play as the game progresses by injecting this information and accompanying decisions back into the game.

Conclusion

This reference architecture, while showing examples of only a few analysis types, provides a faster technology path for enabling game analytics applications. The decoupled, hub/spoke approach brings the agility and flexibility to implement different approaches to analytics and understanding the performance of game applications. The purpose-built AWS services described in this architecture provide comprehensive capabilities to easily collect, store, measure, analyze, and report game and event metrics. This helps you efficiently perform in-game analytics, event analysis, measure player satisfaction, and provide tailor-made recommendations to game players, efficiently organize events, and increase retention rates.

Thanks for reading the post. If you have any feedback or questions, please leave them in the comments.

About the authors

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 16 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Tanya Rhodes is a Senior Solutions Architect based out of San Francisco, focused on games customers with emphasis on analytics, scaling, and performance enhancement of games and supporting systems. She has over 25 years of experience in enterprise and solutions architecture specializing in very large business organizations across multiple lines of business including games, banking, healthcare, higher education, and state governments.

How to set up least privilege access to your encrypted Amazon SQS queue

2023-03-03 Ahmed Bakry

Post Syndicated from Ahmed Bakry original https://aws.amazon.com/blogs/security/how-to-set-up-least-privilege-access-to-your-encrypted-amazon-sqs-queue/

Amazon Simple Queue Service (Amazon SQS) is a fully-managed message queueing service that enables you to decouple and scale microservices, distributed systems, and serverless applications. Amazon SQS provides authentication mechanisms so that you can control who has access to the queue. It also provides encryption in transit with HTTP over SSL or TLS, and it supports server-side encryption using AWS Key Management Service (AWS KMS) to help protect the data passing through Amazon SQS. These controls allow you to use Amazon SQS to exchange sensitive data between applications. With the integration of Amazon SQS and AWS KMS, you can centrally-manage the keys that protect Amazon SQS, as well as the keys that protect your other AWS resources.

AWS services, such as Amazon Simple Storage Service (Amazon S3) and Amazon Simple Notification Service (Amazon SNS), can act as event sources that send events to Amazon SQS. To enable an event source to access an encrypted SQS queue, you will need to configure the queue with a customer managed key in AWS KMS, and then use the key policy to allow the event source to use the required AWS KMS API methods. The event source also requires permissions to authenticate access to the queue to send events. You can achieve this by using an SQS policy, which is a resource-based policy that you can use to control access to the SQS queue and its data.

In this blog post, we will show you how to control access to your encrypted SQS queue through the key policy and the SQS policy. The policies introduced in this post will guide you towards achieving least privilege. We will also describe how the resource-based policies defined in this post address the confused deputy problem by using the aws:SourceArn, aws:SourceAccount, and aws:PrincipalOrgID global AWS Identity and Access Management (IAM) condition context keys.

Solution overview

In this post, we will walk you through a common use case to illustrate how you can build the key policy and the SQS queue policy. This use case is shown in Figure 1.

Figure 1: Architecture to publish Amazon SNS messages to Amazon SQS

As shown in Figure 1, the solution has the following steps:

The message producer is an Amazon SNS topic. The topic is configured to send messages to an encrypted Amazon SQS queue. The queue is encrypted by using an AWS KMS customer-managed key.
The message consumer is a compute service such as an AWS Lambda function, an Amazon Elastic Compute Cloud (Amazon EC2) instance, or an AWS Fargate container. The message consumer is configured to process messages from the queue.
The SQS queue is configured to send failed messages to a dead-letter queue (DLQ). This can help you debug your application or messaging system because DLQs let you isolate unconsumed messages to determine why their processing didn’t succeed.

Note: If the message consumer is located in an Amazon Virtual Private Cloud (Amazon VPC) and you need to restrict message reception to that specific VPC, then you should attach the DenyReceivingIfNotThroughVPCE policy statement to your SQS queue policy.

The SQS policy defined in this post doesn’t support redriving messages directly to the same or a different SQS queue.

Prerequisites

This post contains only the required IAM permissions in the form of policy statements. To construct the policy, you need to add the statements to your SQS policy or your AWS KMS key policy. This post doesn’t walk you through how to create the SQS queue or the AWS KMS key. Therefore, to use the policies included in this post, make sure that you’ve completed the following prerequisites:

Set up an SQS queue. For instructions, see Create a queue (console) in the Amazon SQS documentation.
Create an AWS KMS key. For instructions, see Creating keys in the AWS KMS documentation.

Least-privilege key policy for Amazon SQS

In this section, we describe the required least-privilege permissions in AWS KMS for the customer-managed key that you use to encrypt your SQS queue. With these permissions, you can limit access to only the intended entities while implementing least privilege. The key policy must consist of the following policy statements, which we describe in detail below:

Grant administrator permissions to the KMS key
Grant read-only access to the key metadata
Grant AWS KMS permissions to Amazon SNS to publish messages to the queue
Allow consumers to decrypt messages from the queue

Grant administrator permissions to the KMS key

To create an AWS KMS key, you need to provide AWS KMS administrator permissions to the IAM role that you use to deploy the KMS key. These administrator permissions are defined in the AllowKeyAdminPermissions policy statement that follows. When you add this statement to your key policy, make sure to replace <admin-role ARN> with the Amazon Resource Name (ARN) of the IAM role used to deploy the KMS key, manage the KMS key, or both. This can be the IAM role of your deployment pipeline or the administrator role for your organization in AWS Organizations.

{
  "Sid": "AllowKeyAdminPermissions",
  "Effect": "Allow",
  "Principal": {
    "AWS": [
      "<admin-role ARN>"
    ]
  },
  "Action": [
    "kms:Create*",
    "kms:Describe*",
    "kms:Enable*",
    "kms:List*",
    "kms:Put*",
    "kms:Update*",
    "kms:Revoke*",
    "kms:Disable*",
    "kms:Get*",
    "kms:Delete*",
    "kms:TagResource",
    "kms:UntagResource",
    "kms:ScheduleKeyDeletion",
    "kms:CancelKeyDeletion"
  ],
  "Resource": "*"
}

Note: In a key policy, the value of the Resource element needs to be “*”, which means “this KMS key”. The asterisk (“*”) identifies the KMS key to which the key policy is attached.

Grant read-only access to the key metadata

To grant other IAM roles read-only access to your key metadata, add the following AllowReadAccessToKeyMetaData statement to your key policy. This statement allows you, for example, to list the KMS keys in your account for auditing purposes. The statement grants the AWS account root user read-only access to the key metadata. Therefore, an IAM principal in the account can have access to the key metadata when their identity-based policies have the following permissions listed in the statement: kms:Describe*, kms:Get*, and kms:List*. Make sure to replace <account-ID> with your own information.

{
  "Sid": "AllowReadAcesssToKeyMetaData",
  "Effect": "Allow",
  "Principal": {
    "AWS": [
      "arn:aws:iam::<account-ID>:root"
    ]
  },
  "Action": [
    "kms:Describe*",
    "kms:Get*",
    "kms:List*"
  ],
  "Resource": "*"
}

Grant AWS KMS permissions to Amazon SNS to publish messages to the queue

To allow your SNS topic to publish messages to your encrypted SQS queue, add the following AllowSNSToSendToSQS policy statement to your key policy. This statement grants Amazon SNS permissions to use the KMS key to publish to your SQS queue. Make sure to replace <account-id> with your own information.

Note: The Condition element limits access to the SNS service in the same AWS account where the SNS topic exists.

{
  "Sid": "AllowSNSToSendToSQS",
  "Effect": "Allow",
  "Principal": {
    "Service": [
      "sns.amazonaws.com"
    ]
  },
  "Action": [
    "kms:GenerateDataKey",
    "kms:Decrypt"
  ],
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "aws:SourceAccount": "<account-id>"
    }
  }
}

Allow consumers to decrypt messages from the queue

The following AllowConsumersToReceiveFromTheQueue statement grants the SQS message consumer the required permissions to decrypt messages received from the encrypted SQS queue. When you attach the policy statement, replace <consumer’s runtime role ARN> with the ARN for the IAM runtime role of the message consumer.

{
  "Sid": "AllowConsumersToReceiveFromTheQueue",
  "Effect": "Allow",
  "Principal": {
    "AWS": [
      "<consumer's runtime role ARN>"
    ]
  },
  "Action": [
    "kms:Decrypt"
  ],
  "Resource": "*"
}

Least-privilege Amazon SQS policy

In this section, we will walk you through least-privilege SQS queue policies to help you send Amazon SNS messages to Amazon SQS. The defined policy is designed to prevent unintended access by using a mix of both allow and deny statements. The allow statements grant access to the intended entity or entities. The deny statements prevent other unintended entities from accessing the SQS queue, while excluding the intended entity within the policy condition. The SQS policy includes the following statements, which we describe in detail below:

Restrict Amazon SQS management permissions
Restrict SQS queue actions from the specified organization
Grant SQS permissions to consumers
Enforce encryption in transit
Restrict message transmission to a specific SNS topic
(Optional) Restrict message reception to a specific VPC endpoint

Restrict Amazon SQS management permissions

The following RestrictAdminQueueActions policy statement restricts the Amazon SQS management permissions to only the IAM role or roles that you use to deploy the queue, manage the queue, or both.

Make sure to replace the <placeholder values> with your own information. Specify the ARN of the IAM role used to deploy the SQS queue, as well as the ARNs of each administrator role that should have SQS management permissions. For the Resource element, you can specify either “*” or the ARN of the SQS queue.

{
  "Sid": "RestrictAdminQueueActions",
  "Effect": "Deny",
  "Principal": {
    "AWS": "*"
  },
  "Action": [
    "sqs:AddPermission",
    "sqs:DeleteQueue",
    "sqs:RemovePermission",
    "sqs:SetQueueAttributes"
  ],
  "Resource": "*",
  "Condition": {
    "StringNotLike": {
      "aws:PrincipalARN": [
        "arn:aws:iam::<account-id>:role/<deployment-role-name>",
        "<admin-role ARN>"
      ]
    }
  }
}

Restrict SQS queue actions from the specified organization

To help protect your Amazon SQS resources from external access (that is, access by an entity outside your AWS Organizations organization), use the following statement. The statement limits SQS queue access to the organization that you specify in the Condition element. Make sure to replace <org-id> with your organization ID. For the Resource element, you can specify either “*” or the ARN of the SQS queue.

{
  "Sid": "DenyQueueActionsOutsideOrg",
  "Effect": "Deny",
  "Principal": {
    "AWS": "*"
  },
  "Action": [
    "sqs:AddPermission",
    "sqs:ChangeMessageVisibility",
    "sqs:DeleteQueue",
    "sqs:RemovePermission",
    "sqs:SetQueueAttributes",
    "sqs:ReceiveMessage"
  ],
  "Resource": "*",
  "Condition": {
    "StringNotEquals": {
      "aws:PrincipalOrgID": [
        "<org-id>"
      ]
    }
  }
}

Grant SQS permissions to consumers

To receive messages from the SQS queue, you need to provide the message consumer with the necessary permissions. The following policy statement grants the consumer, which you specify, the required permissions to consume messages from the SQS queue. When adding the statement to your SQS policy, make sure to replace <consumer’s IAM runtime role ARN> with the ARN of the IAM runtime role used by the consumer. For the Resource element, you can specify either “*” or the ARN of the SQS queue.

{
  "Sid": "AllowConsumersToReceiveFromTheQueue",
  "Effect": "Allow",
  "Principal": {
    "AWS": "<consumer's IAM runtime role ARN>"
  },
  "Action": [
    "sqs:ChangeMessageVisibility",
    "sqs:DeleteMessage",
    "sqs:GetQueueAttributes",
    "sqs:ReceiveMessage"
  ],
  "Resource": "*"
}

To prevent other entities from receiving messages from the SQS queue, add the following DenyOtherConsumersFromReceiving statement to the SQS queue policy. This statement restricts message consumption to the consumer that you specify—allowing no other consumer to have access, even when their identity permissions would grant them access. Make sure to replace <consumer’s runtime role ARN> with your own information. For the Resource element, you can specify either “*” or the ARN of the SQS queue.

{
  "Sid": "DenyOtherConsumersFromReceiving",
  "Effect": "Deny",
  "Principal": {
    "AWS": "*"
  },
  "Action": [
    "sqs:ChangeMessageVisibility",
    "sqs:DeleteMessage",
    "sqs:ReceiveMessage"
  ],
  "Resource": "*",
  "Condition": {
    "StringNotLike": {
      "aws:PrincipalARN": "<consumer's runtime role ARN>"
    }
  }
}

Enforce encryption in transit

The following DenyUnsecureTransport policy statement enforces the consumers and producers to use secure channels (TLS connections) to send and receive messages to and from the SQS queue. For the Resource element, you can specify either “*” or the ARN of the SQS queue.

{
  "Sid": "DenyUnsecureTransport",
  "Effect": "Deny",
  "Principal": {
    "AWS": "*"
  },
  "Action": [
    "sqs:ReceiveMessage",
    "sqs:SendMessage"
  ],
  "Resource": "*",
  "Condition": {
    "Bool": {
      "aws:SecureTransport": "false"
    }
  }
}

Restrict message transmission to a specific SNS topic

The following AllowSNSToSendToTheQueue policy statement allows the specified SNS topic to send messages to the SQS queue. Make sure to replace <SNS topic ARN> with the SNS topic ARN. For the Resource element, you can specify either “*” or the ARN of the SQS queue.

{
  "Sid": "AllowSNSToSendToTheQueue",
  "Effect": "Allow",
  "Principal": {
    "Service": "sns.amazonaws.com"
  },
  "Action": "sqs:SendMessage",
  "Resource": "*",
  "Condition": {
    "ArnLike": {
      "aws:SourceArn": "<SNS topic ARN>"
    }
  }
}

The following DenyAllProducersExceptSNSFromSending policy statement prevents other producers from sending messages to the queue. Replace <SNS topic ARN> with your own information. For the Resource element, you can specify either “*” or the ARN of the SQS queue.

{
  "Sid": "DenyAllProducersExceptSNSFromSending",
  "Effect": "Deny",
  "Principal": {
    "AWS": "*"
  },
  "Action": "sqs:SendMessage",
  "Resource": "*",
  "Condition": {
    "ArnNotLike": {
      "aws:SourceArn": "<SNS topic ARN>"
    }
  }
}

(Optional) Restrict message reception to a specific VPC endpoint

To restrict the receipt of messages to only a specific VPC endpoint, add the following DenyReceivingIfNotThroughVPCE policy statement to your SQS queue policy. This statement prevents a message consumer from receiving messages from the queue unless the messages are from the desired VPC endpoint. Replace <vpce_id> with the ID of the VPC endpoint that you created for your SQS queue. For the Resource element, you can specify either “*” or the ARN of the SQS queue.

{
  "Sid": "DenyReceivingIfNotThroughVPCE",
  "Effect": "Deny",
  "Principal": "*",
  "Action": [
    "sqs:ReceiveMessage"
  ],
  "Resource": "*",
  "Condition": {
    "StringNotEquals": {
      "aws:sourceVpce": "<vpce id>"
    }
  }
}

SQS policy statements for the dead-letter queue

In this section, we will walk you through how to manage access to your SQS queue when you are using it as a dead-letter queue (DLQ) for another SQS queue.

Add policy statements to your DLQ access policy

Add the following policy statements, identified by their statement ID, to your DLQ access policy. These are the same policy statements introduced earlier in this post.

RestrictAdminQueueActions
DenyQueueActionsOutsideOrg
AllowConsumersToReceiveFromTheQueue
DenyOtherConsumersFromReceiving
DenyUnsecureTransport

In addition to adding the preceding policy statements to your DLQ access policy, you should add a statement to restrict message transmission to SQS queues, which we describe in the next section.

Restrict message transmission to SQS queues

To restrict access to only SQS queues from the same account, add the following DenyAnyProducersExceptSQS policy statement to the DLQ access policy. This statement doesn’t limit message transmission to a specific queue because you need to deploy the DLQ before you create the main queue, so you won’t know the SQS queue ARN when you create the DLQ. If you need to limit access to only one SQS queue, modify the aws:SourceArn in the Condition element with the ARN of your SQS source queue when you know it.

{
  "Sid": "DenyAnyProducersExceptSQS",
  "Effect": "Deny",
  "Principal": {
    "AWS": "*"
  },
  "Action": "sqs:SendMessage",
  "Resource": "*",
  "Condition": {
    "ArnNotLike": {
      "aws:SourceArn": "arn:aws:sqs:<region>:<account-id>:*"
    }
  }
}

Important: The SQS queue policies defined in this post don’t restrict the sqs:PurgeQueue action to a certain IAM role or roles. The sqs:PurgeQueue action enables you to delete all messages in the SQS queue. You can also use this action to make changes to the message format without replacing the SQS queue. When debugging an application, you can clear the SQS queue to remove potentially erroneous messages. When testing the application, you can drive a high message volume through the SQS queue and then purge the queue to start fresh before entering production. The reason for not restricting this action to a certain role is that this role might not be known when deploying the SQS queue. You will need to add this permission to the role’s identity-based policy to be able to purge the queue.

Prevent the cross-service confused deputy problem

The confused deputy problem is a security issue where an entity that doesn’t have permission to perform an action can coerce a more privileged entity to perform the action. To help prevent this problem, AWS provides tools that help you protect your account if you provide third parties (known as cross-account) or other AWS services (known as cross-service) access to resources in your account. The policy statements in this post can help you prevent the cross-service confused deputy problem.

Cross-service impersonation can occur when one service (the calling service) calls another service (the called service). The calling service can be manipulated to use its permissions to act on another customer’s resources in a way it shouldn’t otherwise have permission to access. To help protect against this issue, the resource-based policies defined in this post use the aws:SourceArn, aws:SourceAccount, and aws:PrincipalOrgID global IAM condition context keys. These limit the permissions that a service has to a specific resource, a specific account, or a specific organization in AWS Organizations.

For example, the following AllowS3ToSendToTheQueue policy statement allows Amazon S3 to deliver messages to your Amazon SQS queue; the aws:SourceArn condition in this policy grants access to a specific S3 bucket only.

{
  "Sid": "AllowS3ToSendToTheQueue",
  "Effect": "Allow",
  "Principal": {
    "Service": "s3.amazonaws.com"
  },
  "Action": "sqs:SendMessage",
  "Resource": "*",
  "Condition": {
    "ArnLike": {
      "aws:SourceArn": "<S3 bucket ARN>"
    }
  }
}

If a bad actor creates an S3 bucket to try to deliver messages to your Amazon SQS queue, the source ARN will not match that of the S3 bucket specified in this policy, so the policy will deny access. Without the aws:SourceArn condition, the unauthorized S3 bucket would be granted access unintentionally because any S3 bucket would be granted to deliver messages to our queue through the S3 service principal. Adding the aws:SourceArn condition prevents cross-service impersonation.

Use IAM Access Analyzer to review cross-account access

You can use IAM Access Analyzer to review your SQS queue policies and AWS KMS key policies and alert you when an SQS queue or a KMS key grants access to an external entity. IAM Access Analyzer helps identify resources in your organization and accounts that are shared with an entity outside the zone of trust. This zone of trust can be either an AWS account or the organization within AWS Organizations that you specify when you enable IAM Access Analyzer.

IAM Access Analyzer also helps identify resources shared with external principals by using logic-based reasoning to analyze the resource-based policies in your AWS environment. For each instance of a resource shared outside of your zone of trust, IAM Access Analyzer generates a finding. Figure 2 shows an IAM Access Analyzer finding, in which a sqs:SendMessage API call was made to our SQS queue from an account that is outside of our zone of trust.

Figure 2: IAM Access Analyzer example finding for an Amazon SQS queue

Findings include information about the access and the external principal granted to it. To determine whether the access is intended and safe, or unintended and a security risk, review the findings. For unintended access, review the affected policy and modify it by using the policy statements introduced in this blog post to further restrict access. For more information on how IAM Access Analyzer identifies unintended access to your AWS resources, see the blog post Identify Unintended Resource Access with IAM Access Analyzer.

Conclusion

In this post, you learned how to manage access to your encrypted Amazon SQS queue to help you achieve least privilege. We presented an SQS queue policy and an AWS KMS key policy so that you can use Amazon SQS to receive messages from an SNS topic. We addressed the confused deputy problem, specifying the exact source allowed to emit events. You also learned how to use IAM Access Analyzer to review the external access provided by your existing SQS queue policies and key policies.

You can follow the instructions in this post to resolve findings based on your SQS use case. You can also use the provided policies for newly created SQS queues and their KMS keys, or to modify existing queues (for example, to address IAM Access Analyzer findings). For more use cases, see the AWS SQS documentation.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon Simple Queue Service re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Build incremental data pipelines to load transactional data changes using AWS DMS, Delta 2.0, and Amazon EMR Serverless

2023-03-03 Sankar Sundaram

Post Syndicated from Sankar Sundaram original https://aws.amazon.com/blogs/big-data/build-incremental-data-pipelines-to-load-transactional-data-changes-using-aws-dms-delta-2-0-and-amazon-emr-serverless/

Building data lakes from continuously changing transactional data of databases and keeping data lakes up to date is a complex task and can be an operational challenge. A solution to this problem is to use AWS Database Migration Service (AWS DMS) for migrating historical and real-time transactional data into the data lake. You can then apply transformations and store data in Delta format for managing inserts, updates, and deletes.

Amazon EMR Serverless is a serverless option in Amazon EMR that makes it easy for data analysts and engineers to run open-source big data analytics frameworks without configuring, managing, and scaling clusters or servers. EMR Serverless automatically provisions and scales the compute and memory resources required by your applications, and you only pay for the resources that the applications use. EMR Serverless also provides you with more flexibility on overriding default Spark configurations, customizing EMR Serverless images, and customizing Spark driver and executor sizes to better suit specific workloads.

This post demonstrates how to implement a solution that uses AWS DMS to stream ongoing replication or change data capture (CDC) from an Amazon Aurora PostgreSQL-Compatible Edition database into Amazon Simple Storage Service (Amazon S3). We then apply transformations using Spark jobs on an EMR Serverless application and write transformed output into open-source Delta tables in Amazon S3. The Delta tables created by the EMR Serverless application are exposed through the AWS Glue Data Catalog and can be queried through Amazon Athena. Although this post uses an Aurora PostgreSQL database hosted on AWS as the data source, the solution can be extended to ingest data from any of the AWS DMS supported databases hosted on your data centers.

Solution overview

The following diagram shows the overall architecture of the solution that we implement in this post.

Architecture diagram

The solution consists of the following steps for implementing a full and incremental (CDC) data ingestion from a relational database:

Data storage and data generation – We create an Aurora PostgreSQL database and generate fictional trip data by running a stored procedure. The data will have attributes like trip ID (primary key), timestamp, source location, and destination location. Incremental data is generated in the PostgreSQL table by running custom SQL scripts.
Data ingestion – Steps 1 and 2 use AWS DMS, which connects to the source database and moves full and incremental data (CDC) to Amazon S3 in Parquet format. Let’s refer to this S3 bucket as the raw layer.
Data transformation – Steps 3 and 4 represent an EMR Serverless Spark application (Amazon EMR 6.9 with Apache Spark version 3.3.0) created using Amazon EMR Studio. The script reads input data from the S3 raw bucket, and then invokes Delta Lake’s MERGE statements to merge the data with the target S3 bucket (curated layer). The script also creates and updates a manifest file on Amazon S3 every time the job is run to enable data access from Athena and Amazon Redshift Spectrum.
Data access – The EMR Serverless job has code snippets that create a Delta table in the AWS Glue Data Catalog in Step 5. Steps 6 and 7 describe using Athena and Redshift Spectrum to query data from the Delta tables using standard SQL through the AWS Glue Data Catalog.
Data pipeline – Step 8 describes the process for triggering the data pipeline in a periodic manner through Airflow operators using Amazon Managed Workflows for Apache Airflow (Amazon MWAA). Refer to Submitting EMR Serverless jobs from Airflow for additional details. In this post, AWS DMS has been configured to replicate data from Amazon Aurora PostgreSQL-Compatible Edition into an S3 bucket with hourly partitions. The Airflow DAG can be configured to call an EMR Serverless job to process the past X hours of data based on specific project requirements. Implementation of the Airflow setup is not explored within the scope of this post.

The architecture has the following major features:

Reliability – The end-to-end architecture is made resilient with the Multi-AZ feature of EMR Serverless and using Multi-AZ deployments for AWS DMS and Amazon Aurora PostgreSQL-Compatible Edition. When you submit jobs to an EMR Serverless application, those jobs are automatically distributed to different Availability Zones in the Region. A job is run in a single Availability Zone to avoid performance implications of network traffic across Availability Zones. In case an Availability Zone is impaired, a job submitted to your EMR Serverless application is automatically run in a different (healthy) Availability Zone. When using resources in a private VPC, EMR Serverless recommends that you specify the private VPC configuration for multiple Availability Zones so that EMR Serverless can automatically select a healthy Availability Zone.
Cost optimization – When you run Spark or Hive applications using EMR Serverless, you pay for the amount of vCPU, memory, and storage resources consumed by your applications, leading to optimal utilization of resources. There is no separate charge for Amazon Elastic Compute Cloud (Amazon EC2) instances or Amazon Elastic Block Store (Amazon EBS) volumes. For additional details on cost, refer to Amazon EMR Serverless cost estimator.
Performance efficiency – You can run analytics workloads at any scale with automatic on-demand scaling that resizes resources in seconds to meet changing data volumes and processing requirements. EMR Serverless includes the Amazon EMR performance-optimized runtime for Apache Spark and Hive. The Amazon EMR runtime for Spark is 100% API-compatible with OSS Spark and is over 3.5 times as fast as the standard open-source, so your jobs run faster and incur less compute costs. With fast and fine-grained scaling in EMR Serverless, if a pipeline runs daily and needs to process 1 GB of data one day and 100 GB of data another day, EMR Serverless automatically scales to handle that load.
Monitoring – EMR Serverless sends metrics to Amazon CloudWatch at the application and job level every 1 minute. You can set up a single-view dashboard in CloudWatch to visualize application-level and job-level metrics using an AWS CloudFormation template provided on the EMR Serverless CloudWatch Dashboard GitHub repository. Also, EMR Serverless can store application logs in a managed storage, Amazon S3, or both based on your configuration settings. After you submit a job to an EMR Serverless application, you can view the real-time Spark UI or the Hive Tez UI for the running job from the EMR Studio console or request a secure URL using the GetDashboardForJobRun API. For completed jobs, you can view the Spark History Server or the Persistent Hive Tez UI from the EMR Studio console.

The following steps are performed to implement this solution:

Connect to the Aurora PostgreSQL instance and generate a sample dataset.
- Set up a data pipeline for loading data from Amazon Aurora PostgreSQL-Compatible Edition into Delta Lake on Amazon S3 and query using Athena:
- Start the AWS DMS task to perform full table load and capture ongoing replication to the S3 raw layer.
- Run the EMR Serverless Spark application to load data into Delta Lake.
- Query the Delta tables (native tables) through Athena.
Run the data pipeline to capture incremental data changes into Delta Lake:
- Generate an incremental (CDC) dataset and insert it into the Aurora PostgreSQL database.
- Run the EMR Serverless Spark application to merge CDC data in the S3 curated layer (incremental load).
- Query the Delta Lake tables through Athena to validate the merged data.

Prerequisites

We use a CloudFormation template to provision the AWS resources required for the solution. The CloudFormation template requires you to select an EC2 key pair. This key is configured on an EC2 instance that lives in the public subnet. We use this EC2 instance to connect to the Aurora PostgreSQL instance that lives in the private subnet. Make sure you have a key in the Region where you deploy the template. If you don’t have one, you can create a new key pair.

To walk through this post, we use Delta Lake version > 2.0.0, which is supported in Apache Spark 3.2.x. Choose the Delta Lake version compatible with your Spark version by visiting the Delta Lake releases page. We use an EMR Serverless application with version emr-6.9.0, which supports Spark version 3.3.0.

Deploy your resources

To provision the resources needed for the solution, complete the following steps:

Choose Launch Stack:

For Stack name, enter emr-serverless-deltalake-blog.
For DatabaseUserName, enter the user name for logging in to Amazon Aurora PostgreSQL-Compatible Edition. Keep the default value if you don’t want to change it.
For DatabasePassword, enter the password for logging in to Amazon Aurora PostgreSQL-Compatible Edition.
For ClientIPCIDR, enter the IP address of your SQL client that will be used to connect to the EC2 instance. We use this EC2 instance to connect to the Aurora PostgreSQL database.
For KeyName, enter the key pair to be used in your EC2 instance. This EC2 instance will be used as a proxy to connect from your SQL client to the Aurora PostgreSQL source database.
For EC2ImageId, PrivateSubnet1CIDR, PrivateSubnet2CIDR, PublicSubnetCIDR, and VpcCIDR, keep the default values or choose appropriate values for the VPC and EC2 image for your specific environment.
Choose Next.
Choose Next again.
On the review page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
Choose Create stack.

After the CloudFormation template is complete and the resources are created, the Outputs tab shows the information shown in the following screenshot.

The CloudFormation template creates all the resources needed for the solution workflow:

S3 raw and curated buckets
Aurora PostgreSQL database
AWS DMS migration task, replication instance, and other resources
EC2 instance for running data ingestion scripts
AWS Identity and Access Management (IAM) roles and policies needed to perform the necessary activities as part of this solution
VPC, subnets, security groups, and relevant network components
AWS Lambda functions that perform setup activities required for this workflow
Additional components needed for running the EMR Serverless workflow

You can find the PySpark script in the raw S3 bucket on the Amazon S3 console as shown in the following screenshot. The bucket will have the naming structure <CloudFormation template name>-rawS3bucket-<random string>. Make a note of the S3 path to the emr_delta_cdc.py script; you need this information while submitting the Spark job via the EMR Serverless application.

The preceding task for creating the resources via CloudFormation assumes that AWS Lake Formation is not enabled in the Region (which we enable later in this post). If you already have Lake Formation enabled in the Region, make sure the IAM user or role used in the CloudFormation template has the necessary permissions to create a database in the AWS Glue Data Catalog.

Connect to the Aurora PostgreSQL instance and generate a sample dataset

Connect to the Aurora PostgreSQL endpoint using your preferred client. For this post, we use the PSQL command line tool. Note that the IP address of the client machine from which you’re connecting to the database must be updated in the Aurora PostgreSQL security group. This is done by the CloudFormation template based on the input parameter value for ClientIPCIDR. If you’re accessing the database from another machine, update the security group accordingly.

Connect to your EC2 instance from the command line using the public DNS of the EC2 instance from the CloudFormation template output.
Log in to the EC2 instance and connect to the Aurora PostgreSQL instance using the following commands (the Aurora PostgreSQL endpoint is available on the Outputs tab of the CloudFormation stack):
```
psql -h << Aurora PostgreSQL endpoint >> -p 5432 -U <<username>> -d emrdelta_source_db
```

Run the following commands to create a schema and table for the fictional trip dataset:

create schema delta_emr_source;

create table delta_emr_source.travel_details (trip_id int PRIMARY KEY,tstamp timestamp, route_id varchar(2),destination varchar(50),source_location varchar(50));

Create the following stored procedure to generate the records for the trip dataset and insert the records into the table.

create or replace procedure delta_emr_source.insert_records(records int)
language plpgsql
as $$
declare
max_trip_id integer;
begin
--get max trip_id
select coalesce(max(trip_id),1) into max_trip_id from delta_emr_source.travel_details;

--insert records
for i in max_trip_id+1..max_trip_id+records loop
INSERT INTO delta_emr_source.travel_details (trip_id, tstamp, route_id,destination,source_location) values (i, current_timestamp, chr(65 + (i % 10)),(array['Seattle', 'New York', 'New Jersey', 'Los Angeles', 'Las Vegas',
'Tucson', 'Washington DC', 'Philadelphia', 'Miami', 'San Francisco'])[(floor(random() * 10))+1],(array['Seattle', 'New York', 'New Jersey', 'Los Angeles', 'Las Vegas',
'Tucson', 'Washington DC', 'Philadelphia', 'Miami', 'San Francisco'])[(floor(random() * 10))+1]);
end loop;

commit;

raise notice 'Inserted record count - %', records;
end; $$;

Call the preceding stored procedure to insert 20,000 records into the Aurora PostgreSQL database:
```
call delta_emr_source.insert_records(20000);
```
After the stored procedure is complete, verify that the records have been inserted successfully:
```
select count(*) from delta_emr_source.travel_details;
```

Set up a data pipeline for loading data into Delta tables on Amazon S3 and query using Athena

In this section, we walk through the steps to set up a data pipeline that loads data from Amazon Aurora PostgreSQL-Compatible Edition into Delta tables on Amazon S3 and then query the data using Athena.

Start the AWS DMS task to perform full table load to the S3 raw layer

To perform the full table load, complete the following steps:

On the AWS DMS console, choose Database migration tasks in the navigation pane.
Select the task that was created by the CloudFormation template (emrdelta-postgres-s3-migration).
On the Actions menu, choose Restart/Resume.

The task starts the full load and ongoing replication of data from the source database to Amazon S3.

Wait for the job to complete.

You can validate that the data has been migrated successfully checking the Load state column for the AWS DMS task.

Navigate to the S3 bucket created from the CloudFormation template to store raw data from AWS DMS.The bucket will have the naming structure <CloudFormation template name>-rawS3bucket-<random string>.
Navigate to the folder delta_emr_source/travel_details in the raw S3 bucket. You can verify the S3 folder has Parquet data populated from the AWS DMS task.

Run the EMR Serverless Spark application to load data into Delta tables

We use EMR Studio to manage and submit jobs in an EMR Serverless application.

Launch EMR Studio and create an EMR Serverless application.
For Name, enter emr-delta-blog.
For Type, choose Spark.
For Release version, choose your release version.
For Architecture, select x86_64.
For Application setup options, select Choose default settings.

Choose Create application and verify that the EMR application has been created successfully on the Amazon EMR console.

Choose emr_delta_blog and then choose Start application. You can verify that the EMR application has started successfully on the Amazon EMR console, as shown in the following screenshot.

The application will move to Stopped status after a period of inactivity. When you submit the job to the application, it will start again and start the job. This provides cost savings because the jobs are run on demand as opposed to maintaining a running EMR cluster.

While the application is in Started status, choose Submit job to submit the job to the application.

Create a new job in the Job details page

For Name, enter emr-delta-load-job.
For Runtime role, choose emrserverless-execution-role.
For S3 URI, enter the S3 (raw bucket) path where the script emr_delta_cdc.py is uploaded.
For Script arguments, enter ["I","delta_emr_source","9999-12-31-01","travel_details","route_id"].

The script arguments provide the following details to the EMR Serverless application:

I – The first argument represents the data load type. The allowed values are I for full load and U for incremental data load.
delta_emr_source – The second argument represents the source database schema from which data is being migrated through the AWS DMS task.
9999-12-31-01 – The third argument represents the partition from which data needs to be loaded in an incremental fashion. This argument is used only during CDC data load; for full load, we have provided a default value (9999-12-31-01).
travel_details – The fourth argument represents the source database table from which data is being migrated through the AWS DMS task. Use a semicolon as a delimiter when entering multiple tables.
route_id – The fifth argument represents the partition keys on which the table data should be partitioned when stored in the S3 curated bucket. Use a semicolon as a delimiter when entering comma-separated partition keys for multiple tables.

With arguments, you can group a set of tables and submit the job to an EMR Serverless application. You can provide multiple table names separated by semicolons and enter the partition keys for those tables also separated by semicolon. If a particular table doesn’t have a partition key, simply enter a semicolon alone. The number of semicolon-separated values should match the table and partition key arguments for the script to run successfully.

Also, if you want to capture additional tables as part of an existing EMR Serverless job, you need to create a new EMR Serverless job to capture full load separately (set the first argument as I along with the new table names) and then change the argument list of the existing EMR Serverless job to add those new tables to capture incremental data load going forward.

EMR Serverless version 6.9.0 comes pre-installed with Delta version 2.1.0. Refer to About Amazon EMR Releases for more details about pre-installed libraries and applications for a specific Amazon EMR release. Before this, we have to upload the Delta JAR files to an S3 bucket in your account and provide the JAR file path in the application configurations using the spark.jars option. In this walkthrough, we create an EMR Serverless 6.9.0 application and use the pre-installed Delta jars from Amazon EMR.

Under Spark properties, choose Edit in text and enter the following configurations:

--conf spark.jars=/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog --conf spark.submit.pyFiles=/usr/share/aws/delta/lib/delta-core.jar --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory

If you want to use a different version of Delta JAR files, you can replace the S3 path of the JAR files in these configuration options.

Leave the rest of the configurations at their default and choose Submit job.
Wait for the job to complete successfully. You can verify this on the EMR Serverless console.

Additionally, go to the S3 location (the curated bucket created by AWS CloudFormation) and verify that the Delta files are created along with the manifest file.

Select a job run and then choose Spark History Server (Completed jobs) on the View Application UIs menu.

You can now use the Spark History Server UI to navigate to various tabs and analyze the job run in a detailed manner. For Spark error and output logs, you can navigate to the Executors tab and explore the driver or executor logs as required. This can help you to debug the job in case of failures by looking at the Spark logs.You can also choose Spark UI (Running jobs) to track the progress of the EMR Serverless Spark jobs while they are running.

The data load script is the same for initial and incremental data load because it can handle both the workflows through script arguments:

from pyspark.sql import SparkSession
from datetime import datetime
from pyspark.sql.functions import *
from pyspark.sql.window import Window
import boto3
import sys
from delta import *

# S3 bucket location, auto-populated for this post. Replace for other jobs
raw_bucket="<<raw_bucket_name>>"
curated_bucket= "<<curated_bucket_name>>"

spark = (
SparkSession.builder.appName("SparkSQL")
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
.enableHiveSupport()
.getOrCreate()
)

#Check for argument list and if it doesn't match the expected argument count, exit the program
if len(sys.argv) != 6:
print("This script requires 5 arguments for successful execution - Load_type,database_schema,CDC_path,source_table,Partition_keys")
print(sys.argv)
sys.exit(0)

s3 = boto3.client('s3')

# Split table names into a list if there are more than one table seperated by semicolon
tables = sys.argv[4].split(";")

schema = sys.argv[2]
load_type = sys.argv[1]
cdc_partition = sys.argv[3]
deltaHivePath = "s3://" + curated_bucket + "/" + schema + "/"
columns_to_drop = ["Op","schema_name", "table_name", "update_ts_dms", "tstamp"]
db_name = "emrserverless_delta"

# Split table partition keys into a list if there are more than one table separated by semicolon
partition_keys = sys.argv[5].split(";")
# Exit if length of table names and partition keys are different to ensure data is provided for all tables.
if len(tables)!=len(partition_keys):
print("Please enter partition keys for all tables. if partition key is not present enter empty semicolon - T1_PK;;T3PK")
sys.exit(0)


i = 0
while i < len(tables):
table = tables[i]
partition_key = partition_keys[i].split(",")
print(partition_key)
if load_type == 'I':
print("Moving to Full-load logic for the table", table)

# Read the data from the raw bucket
source_df1 = spark.read.format("parquet").load(
"s3://" + raw_bucket + "/" + schema + "/" + table + "/")

# There is no target table in Delta format. Loading for the first time
# The following code segment populates Delta table in S3 and also
# updated the Glue catalog for querying with Athena.
additional_options = {"path": deltaHivePath + table + "/"}
if columns_to_drop is not None and columns_to_drop != '':
source_df1 = source_df1.drop(*columns_to_drop)

#Check for presence of partition key and before writing data to Curated bucket
if partition_key[0]:
source_df1.write.mode("append")\
.format("delta")\
.partitionBy(*partition_key)\
.options(**additional_options)\
.saveAsTable(db_name + ".spark_" + table)
else:
source_df1.write.mode("append")\
.format("delta")\
.options(**additional_options)\
.saveAsTable(db_name + ".spark_" + table)

# Generate symlink for Amazon Redshift Spectrum to read data
deltaTable = DeltaTable.forPath(spark, deltaHivePath + table + "/")
deltaTable.generate("symlink_format_manifest")

else:
print("Moving to upsert logic, Reading data from partition - ",cdc_partition)
# The below logic will verify if the CDC path has data before proceeding with
# incremental load. if CDC path is not available for a specific table the load
# process is skipped to avoid spark read error.
resp = s3.list_objects_v2(
Bucket=raw_bucket,
Prefix=schema +"/" +table +"/" +cdc_partition,
Delimiter='/',
MaxKeys=1)
if 'CommonPrefixes' in resp:
update_df = spark.read.format("parquet").load(
"s3://" + raw_bucket + "/" + schema + "/" + table + "/" + cdc_partition + "/")

# Get recent record for each primary key to update the recent transaction to the Delta table
# This step is needed to de-dup transactions like inserts and deletes within the same batch
sort_order = Window.partitionBy(
col('trip_id')).orderBy(
col('update_ts_dms').desc())
update_df = update_df.withColumn("rec_val", row_number().over(
sort_order)).filter("rec_val=1").drop("rec_val")

# upsert script using Merge operation. The below script updates/inserts data
# on all columns. In case you need to insert/update specific columns
# use whenNotMatchedInsert/whenMatchedUpdate functions and parameterize the input for each table
deltaTable = DeltaTable.forPath(spark, deltaHivePath + table + "/")
deltaTable.alias('trg') \
.merge(update_df.alias('src'),'trg.trip_id = src.trip_id')\
.whenNotMatchedInsertAll(condition="src.Op = 'I'") \
.whenMatchedUpdateAll(condition="src.Op='U'") \
.whenMatchedDelete(condition="src.Op = 'D'") \
.execute()

# Generate symlink for Amazon Redshift Spectrum to read data
deltaTable.generate("symlink_format_manifest")
else:
print("The path is empty for table -", table)
i = i + 1
print("The Job has completed execution...")

Monitor EMR Serverless application using CloudWatch dashboards

We can optionally monitor the EMR Serverless application using CloudWatch dashboards by installing the CloudFormation template from the EMR Serverless CloudWatch Dashboard GitHub repository.Follow the instructions on the Getting started section on the GitHub repository and deploy the CloudFormation template in your account.

You need to provide the EMR Serverless application ID as a parameter while deploying the CloudFormation stack, which can be obtained on the EMR Studio Applications page as shown in the following screenshot.

After the CloudFormation template is successfully deployed, navigate to the CloudWatch console to see a custom dashboard created for the EMR Serverless application ID that was provided to the CloudFormation template.

Choose the dashboard to see the different metrics for the EMR Serverless application in a single dashboard view.

You can see the available workers (one driver and two executors that were pre-initialized in the default configuration) and also the spike under successful job count that indicates the initial data load job that was completed successfully.

You could also monitor the CPU, memory, and storage allocated for the application, driver, and executor nodes separately.

The following image shows application metrics for three workers with 12 vCPUs (both driver and executor initialized with 4 vCPUs) and also the memory and storage usage. You can monitor the metrics from this dashboard and pre-initialize your application capacity that suits your specific workloads.

We can see the number of executors that were utilized for this job execution from the executor metrics section within the CloudWatch dashboard. We have used two executors and a driver for running this job.

Query the Delta tables through Athena

Previously, Delta tables were accessed through Athena by generating the manifest files (which maintain the list of data files to read for querying a Delta table). With the newly launched support in Athena for reading native Delta tables, it’s no longer required to generate and update manifest files. The Athena SQL engine version 3 can directly query native Delta tables. If you’re using an older engine version, change the engine version.

Navigate to the Athena console and start querying the data. Run a SELECT query and fetch the first 10 records to verify the data:

SELECT * FROM "AwsDataCatalog"."emrserverless_delta"."spark_travel_details" limit 10;

The table (native Delta table) has been created and updated to the AWS Glue Data Catalog from the EMR Serverless application code. You can successfully query and explore the data through Athena or Spark applications, but the schema definitions for individual columns aren’t updated in Data Catalog with this approach.

The following screenshot shows the Delta table created through code has a single array column. Athena supports reading native Delta tables and therefore we can read the data successfully even though the Data Catalog shows only a single array column.

If you need the individual column-level metadata to be available in the Data Catalog, run an AWS Glue crawler periodically to keep the AWS Glue metadata updated. For more information, refer to Introducing native Delta Lake table support with AWS Glue crawlers.

Run the data pipeline to load incremental data changes into the Delta tables

In this section, we walk through the steps to run the data pipeline.

Generate an incremental (CDC) dataset and insert it into the Aurora PostgreSQL database

Log in to the EC2 instance via SSH and using the PSQL CLI, run the following SQL commands to generate CDC data on the source database:

update delta_emr_source.travel_details set destination='Tucson' where destination='Miami';
call delta_emr_source.insert_records(200);
delete from delta_emr_source.travel_details where destination='Los Angeles';

Navigate to the AWS DMS console and verify whether the incremental records are populated to the S3 raw bucket by the replication task.

You can also verify in the S3 raw bucket location that the files are created under hourly partitioned folders.

Run the EMR Serverless Spark application to merge CDC data in the S3 curated layer (incremental load)

After the AWS DMS task has successfully loaded the incremental data, submit the Spark job on the EMR Serverless application to load the incremental data (CDC) with the following script arguments:

["U", "delta_emr_source", "2022-10-25-21", "travel_details","route_id"]

The partition path given here as 2022-10-25-21 should be changed as applicable in your use case. We use an example use case where the EMR Serverless job runs every hour, and the input data folder is partitioned on an hourly basis from AWS DMS. You can choose an appropriate partitioning strategy on the S3 raw bucket for your use case.

Under Spark properties, choose Edit in text and enter the following configurations:

--conf spark.jars=/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog --conf spark.submit.pyFiles=/usr/share/aws/delta/lib/delta-core.jar --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory

When the job is successful, verify in Amazon S3 that more files are created in the _delta_log folder, capturing the changes from the current run.

Query the Delta tables through Athena to validate the merged data

Go to the Athena console to query the data and validate count to ensure that the table contains the most recent data:

SELECT destination, count(*) FROM "AwsDataCatalog"."emrserverless_delta"."spark_travel_details" group by destination;

If you also want to query this data from Amazon Redshift, you can create external tables in Redshift Spectrum for Delta tables. For more information, refer to Creating external tables for data managed in Delta Lake. Redshift Spectrum currently supports querying Delta tables through the manifest file option. A Delta table manifest contains a list of files that make up a consistent snapshot of the Delta table. The code snippet given in this post updates the manifest files every time new data is loaded in the Delta tables to ensure only the latest data is read from the Delta tables.

Clean up

To avoid incurring ongoing charges, clean up your infrastructure by deleting the stack from the AWS CloudFormation console. Delete the EMR Serverless application and any other resources you created during this exercise.

Conclusion

In this post, we demonstrated how to create a transactional data lake with Delta table format using EMR Serverless and AWS DMS. With the flexibility provided by EMR Serverless, you can use the latest version of open-source Delta framework on EMR Serverless (with the latest version of Spark) in order to support a wider range of transactional data lake needs based on various use cases.

Now you can build a transactional data lake for your organization with Delta table format and access data using Athena and Redshift Spectrum for various analytical workloads. You could use this high-level architecture for any other use cases where you need to use the latest version of Spark on EMR Serverless.

About the Authors

Sankar Sundaram is a Data Lab Architect at AWS, where he helps customers build and modernize data architectures and help them build secure, scalable, and performant data lake, database, and data warehouse solutions.

Monjumi Sarma is a Data Lab Solutions Architect at AWS. She helps customers architect data analytics solutions, which gives them an accelerated path towards modernization initiatives.

Access Amazon Athena in your applications using the WebSocket API

2023-03-03 Abhi Sodhani

Post Syndicated from Abhi Sodhani original https://aws.amazon.com/blogs/big-data/access-amazon-athena-in-your-applications-using-the-websocket-api/

Modern applications are built with modular independent components or microservices that rely on an API framework to communicate with services. Many organizations are building data lakes to store and analyze large volumes of structured, semi-structured, and unstructured data. In addition, many teams are moving towards a data mesh architecture, which requires them to expose their data sets as easily consumable data products. To accomplish this on AWS, organizations use Amazon Simple Storage Service (Amazon S3) to provide cheap and reliable object storage to house their datasets. To enable interactive querying and analyzing their data in place using familiar SQL syntax, many teams are turning to Amazon Athena. Athena is an interactive query service that is used by modern applications to query large volumes of data on an S3 data lake using standard SQL.

When working with SQL databases, application developers and business analysts are most familiar with simple permissions management and synchronous query-response protocols—if a user has permissions to submit a query, they do so and receive the results from the server when the query is complete. Directly accessing Athena APIs, for example when integrating with a custom web application, requires an AWS Identity and Access Management (IAM) role for the applications, and requires you to build a custom process to poll for query completion asynchronously. The IAM role needs access to run Athena API calls, as well as S3 permissions to retrieve the Athena output stored on Amazon S3. Polling for Athena query completion when performed at several intervals could result in increased latency from the client perspective.

In this post, we present a solution that can integrate with your front-end application to query data from Amazon S3 using an Athena synchronous API invocation. With this solution, you can add a layer of abstraction to your application on direct Athena API calls and promote the access using the WebSocket API developed with Amazon API Gateway. The query results are returned back to the application as Amazon S3 presigned URLs.

Overview of solution

For illustration purposes, this post builds a COVID-19 data lake with a WebSocket API to handle Athena queries. Your application can invoke the WebSocket API to pull the data from Amazon S3 using an Athena SQL query, and the WebSocket API returns the JSON response with the presigned Amazon S3 URL. The application needs to parse the JSON message to read the presigned URL, download the data to local, and report the data back to the front end.

We use AWS Step Functions to poll the Athena query run. When the query is complete, Step Functions invokes an AWS Lambda function to generate the presigned URL and send the request back to the application.

The application doesn’t require direct access to Athena, just access to invoke the API. When using this solution, you should secure the API following AWS guidelines. For more information, refer to Controlling and managing access to a WebSocket API in API Gateway.

The following diagram summarizes the architecture, key components, and interactions in the solution.

Architecture diagram for the Athena WebSocket API. The user connects to the API through API Gateway. API Gateway uses Lambda and DynamoDB to store session data. SQL queries are routed to Amazon Athena and a Step Function polls for query status and returns the results back to the user.

The application is composed of the WebSocket API in API Gateway, which handles the connectivity between the client and Athena. A client application using the framework can submit the Athena SQL query and get back the presigned URL containing the query results data. The workflow includes the following steps:

The application invokes the WebSocket API connection.
A Lambda function is invoked to initiate the connection. The connection ID is stored in an Amazon DynamoDB
When the client application is connected, it can invoke the runquery action, which invokes the RunQuery Lambda function.
The function first runs the Athena query.
When the query is started, the function checks the status and uses Step Functions to track the query progress.
Step Functions invokes the third Lambda function to read the processed Athena results and get the presigned S3 URL. Failed messages are routed to an Amazon Simple Notification Service (Amazon SNS) topic, which you can subscribe to.
The presigned URL is returned to the client application.
The connection is closed using the OnDisconnect function.

The RunQuery Lambda function runs the Athena query using the start_query_execution request:

def run_query(client, query):
    """This function executes and sends the query request to Athena."""
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={
            'Database': params['Database']
        },
        ResultConfiguration={
            'OutputLocation': f's3://{params["BucketName"]}/{params["OutputDir"]}/'
        },
        WorkGroup=params["WorkGroup"]
    )
    return response

The Amazon S3 presigned URL is generated by invoking the generate_presigned_url request with the bucket and key information that hosts the Athena results. The code hard codes the presigner expiration to 120 seconds, which is configurable in the function input parameter PreSignerExpireSeconds. See the following code:

def signed_get_url(event):
    s3 = boto3.client('s3', region_name=params['Region'], config=Config(signature_version='s3v4'))
    # User provided body with object info
    bodyData = json.loads(event['body'])
    try:
        url = s3.generate_presigned_url(
            ClientMethod='get_object',
            Params={
                'Bucket': params['BucketName'],
                'Key': bodyData["ObjectName"]
            },
            ExpiresIn=int(params['PreSignerExpireSeconds'])
        )
        body = {'PreSignedUrl': url, 'ExpiresIn': params['PreSignerExpireSeconds']}
        response = {
            'statusCode': 200,
			'body': json.dumps(body),
            'headers': cors.global_returns["Allow Origin Header"]
        }
        logger.info(f"[MESSAGE] Response for PreSignedURL: {response}")
    except Exception as e:
        logger.exception(f"[MESSAGE] Unable to generate URL: {str(e)}")
        response = {
            'statusCode': 502,
            'body': 'Unable to generate PreSignedUrl',
            'headers': cors.global_returns["Allow Origin Header"]
        }
    return response

Prerequisites

This post assumes you have the following:

Access to an AWS account
Permissions to create an AWS CloudFormation stack
Permissions to create the following resources:
- AWS Glue catalog databases and tables
- API Gateway
- Lambda function
- IAM roles
- Step Functions state machine
- SNS topic
- DynamoDB table

Enable the WebSocket API

To enable the WebSocket API of API Gateway, complete the following steps:

Configure the Athena dataset.

To make the data from the AWS COVID-19 data lake available in the Data Catalog in your AWS account, create a CloudFormation stack using the following template. If you’re signed in to your AWS account, the following page fills out most of the stack creation form for you. All you need to do is choose Create stack. For instructions on creating a CloudFormation stack, see Getting started with AWS CloudFormation.

You can also use an existing Athena database to query, in which case you need to update the stack parameters.

If this is the first time you’re using Athena, you must specify a query result location on Amazon S3. For more information about querying and accessing the data from Athena, see A public data lake for analysis of COVID-19 data.

Configure the WebSocket framework using the following page, which deploys the API infrastructure using AWS Serverless Application Model (AWS SAM).
Update the parameters pBucketName with the S3 bucket (in the us-east-2 region) that stores the Athena results and also update the database if you want to query an existing database.
Select the check box to acknowledge creation of IAM roles and choose Deploy.

At a high level, these are the primary resources deployed by the application template:

An API Gateway with routes to the connect, disconnect, and query Lambda functions. Note that the API Gateway deployed with this sample doesn’t implement authentication and authorization. We recommend that you implement authentication and authorization before deploying into a production environment. Refer to Controlling and managing access to a WebSocket API in API Gateway to understand how to implement these security controls.
A DynamoDB table for tracking client connections.
Lambda functions to manage connection states using DynamoDB.
A Lambda function to run the query and start the step function. The function includes an associated IAM role and policies with permissions to Step Functions, the AWS Glue Data Catalog, Athena, AWS Key Management Service (AWS KMS), and Amazon S3. Note that the Lambda execution role gives read access to the Data Catalog and S3 bucket that you specify in the deployment parameters. We recommend that you don’t include a catalog that contains sensitive data without first understanding the impacts and implementing additional security controls.
A Lambda function with associated permissions to poll for the query results and return the presigned URL to the client.
A Step Functions state machine with associated permissions to run the polling Lambda function and send API notifications using Amazon SNS.

Test the setup

To test the WebSocket API, you can use wscat, an open-source command line tool.

Install NPM.
Install wscat:

$ npm install -g wscat

On the console, connect to your published API endpoint by running the following command. The full URI to use can be found on the AWS CloudFormation console by finding the WebSocketURI output in the serverlessrepo-aws-app-athena-websocket-integration stack that was deployed by the AWS SAM application you deployed previously.

$ wscat -c wss://{YOUR-API-ID}.execute-api.{YOUR-REGION}.amazonaws.com/{STAGE}

To test the runquery function, send a JSON message like the following example. This triggers the state machine to run your SQL query using Athena and, using Lambda, return an S3 presigned URL to your client, which you can access to download the query results. Note that the API accepts any valid Athena query. Additional query validation could be added to the internal Lambda function if desired.

$ wscat -c wss://{YOUR-API-ID}.execute-api.{YOUR-REGION}.amazonaws.com/{STAGE}
Connected (press CTRL+C to quit)
> {"action":"runquery", "data":"SELECT * FROM \"covid-19\".country_codes limit 5"}
< {"pre-signed-url": "https://xxx-s3.amazonaws.com/athena_api_access_results/xxxxx.csv?"}

Copy the value for pre-signed-url and enter it into your browser window to access the results.

The presigned URL provides you temporary credentials to download the query results. For more information, refer to Using presigned URLs. This process can be integrated into a front-end web application to automatically download and display the results of the query.

Clean up

To avoid incurring ongoing charges, delete the resources you provisioned by deleting the CloudFormation stacks CovidLakeStacks and serverlessrepo-AthenaWebSocketIntegration via the AWS CloudFormation console. For detailed instructions, refer to the cleanup sections in the starter kit README files in the GitHub repo.

Conclusion

In this post, we showed how to integrate your application with Athena using the WebSocket API. We have included a GitHub repo for you to understand the code and modify it per your application requirements, to get the full benefits of the solution. We encourage you to further explore the features of the API Gateway WebSocket API to add in security using authorizers, view live invocations using dashboards, and expand the framework for more routes on action request.

Let’s stay in touch via the GitHub repo.

About the Authors

Abhi Sodhani is a Sr. AI/ML Solutions Architect at AWS. He helps customers with a wide range of solutions, including machine leaning, artificial intelligence, data lakes, data warehousing, and data visualization. Outside of work, he is passionate about books, yoga, and travel.

Robin Zimmerman is a Data and ML Engineer with AWS Professional Services. He works with AWS enterprise customers to develop systems to extract value from large volumes of data using AWS data, analytics, and machine learning services. When he’s not working, you’ll probably find him in the mountains—rock climbing, skiing, mountain biking, or out on whatever other adventure he can dream up.

Visualize database privileges on Amazon Redshift using Grafana

2023-03-02 Yota Hamaoka

Post Syndicated from Yota Hamaoka original https://aws.amazon.com/blogs/big-data/visualize-database-privileges-on-amazon-redshift-using-grafana/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift enables you to use SQL for analyzing structured and semi-structured data with best price performance along with secure access to the data.

As more users start querying data in a data warehouse, access control is paramount to protect valuable organizational data. Database administrators want to continuously monitor and manage user privileges to maintain proper data access in the data warehouse. Amazon Redshift provides granular access control on the database, schema, table, column, row, and other database objects by granting privileges to roles, groups, and users from a SQL interface. To monitor privileges configured in Amazon Redshift, you can retrieve them by querying system tables.

Although Amazon Redshift provides a broad capability of managing access to database objects, we have heard from customers that they want to visualize and monitor privileges without using a SQL interface. In this post, we introduce predefined dashboards using Grafana which visualizes database privileges without writing SQL. This dashboard will help database administrators to reduce the time spent on database administration and increase the frequency of monitoring cycles.

Database security in Amazon Redshift

Security is the top priority at AWS. Amazon Redshift provides four levels of control:

Cluster management
Cluster connectivity
Database access
Temporary database credentials and single sign-on

This post focuses on database access, which relates to user access control against database objects. For more information, see Managing database security.

Amazon Redshift uses the GRANT command to define permissions in the database. For most database objects, GRANT takes three parameters:

Identity – The entity you grant access to. This could be a user, role, or group.
Object – The type of database object. This could be a database, schema, table or view, column, row, function, procedure, language, datashare, machine leaning (ML) model, and more.
Privilege – The type of operation. Examples include CREATE, SELECT, ALTER, DROP, DELETE, and INSERT. The level of privilege depends on the object.

To remove access, use the REVOKE command.

Additionally, Amazon Redshift offers granular access control with the Row-level security (RLS) feature. You can attach or detach RLS policies to identities with the ATTACH RLS POLICY and DETACH RLS POLICY commands, respectively. See RLS policy ownership and management for more details.

Generally, database administrator monitors and reviews the identities, objects, and privileges periodically to ensure proper access is configured. They also need to investigate access configurations if database users face permission errors. These tasks require a SQL interface to query multiple system tables, which can be a repetitive and undifferentiated operation. Therefore, database administrators need a single pane of glass to quickly navigate through identities, objects, and privileges without writing SQL.

Solution overview

The following diagram illustrates the solution architecture and its key components:

Amazon Redshift contains database privilege information in system tables.
Grafana provides a predefined dashboard to visualize database privileges. The dashboard runs queries against the Amazon Redshift system table via the Amazon Redshift Data API.

Note that the dashboard focuses on visualization. SQL interface is required to configure privileges in Amazon Redshift. You can use query editor v2, a web-based SQL interface which enables users to run SQL commands from a browser.

Prerequisites

Before moving to the next section, you should have the following prerequisites:

An AWS account
Create an Amazon Redshift cluster
Setup Amazon Managed Grafana or local Grafana
- To setup with Amazon Managed Grafana, go through the following sections in Query and visualize Amazon Redshift operational metrics using the Amazon Redshift plugin for Grafana:
  - Configure an Amazon Redshift cluster
  - Configure an Amazon Managed Grafana workspace and sign in to Grafana
  - Create a database user for Amazon Managed Grafana
  - Configure a user in AWS SSO for Amazon Managed Grafana UI access
  - Configure an Amazon Managed Grafana workspace and sign in to Grafana
  - Set up an Amazon Redshift data source in Grafana
- To setup with local Grafana, first install Grafana following the official setup documentation. Next, go through the following sections in Query and visualize Amazon Redshift operational metrics using the Amazon Redshift plugin for Grafana:
  - Create a database user for Amazon Managed Grafana
  - Set up an Amazon Redshift data source in Grafana

While Amazon Managed Grafana controls the plugin version and updates periodically, local Grafana allows user to control the version. Therefore, local Grafana could be an option if you need earlier access for the latest features. Refer to plugin changelog for released features and versions.

Import the dashboards

After you have finished the prerequisites, you should have access to Grafana configured with Amazon Redshift as a data source. Next, import two dashboards for visualization.

In Grafana console, go to the created Redshift data source and click Dashboards
Import the Amazon Redshift Identities and Objects
Go to the data source again and import the Amazon Redshift Privileges

Each dashboard will appear once imported.

Amazon Redshift Identities and Objects dashboard

The Amazon Redshift Identities and Objects dashboard shows identites and database objects in Amazon Redshift, as shown in the following screenshot.

The Identities section shows the detail of each user, role, and group in the source database.

One of the key features in this dashboard is the Role assigned to Role, User section, which uses a node graph panel to visualize the hierarchical structure of roles and users from multiple system tables. This visualization can help administrators quickly examine which roles are inherited to users instead of querying multiple system tables. For more information about role-based access, refer to Role-based access control (RBAC).

Amazon Redshift Privileges dashboard

The Amazon Redshift Privileges dashboard shows privileges defined in Amazon Redshift.

In the Role and Group assigned to User section, open the Role assigned to User panel to list the roles for a specific user. In this panel, you can list and compare roles assigned to multiple users. Use the User drop-down at the top of the dashboard to select users.

The dashboard will refresh immediately and show filtered result for selected users. Following screenshot is the filtered result for user hr1, hr2 and it3.

The Object Privileges section shows the privileges granted for each database object and identity. Note that objects with no privileges granted are not listed here. To show the full list of database objects, use the Amazon Redshift Identities and Objects dashboard.

The Object Privileges (RLS) section contains visualizations for row-level security (RLS). The Policy attachments panel enables you to examine RLS configuration by visualizing relation between of tables, policies, roles and users.

Conclusion

In this post, we introduced a visualization for database privileges of Amazon Redshift using predefined Grafana dashboards. Database administrators can use these dashboards to quickly navigate through identities, objects, and privileges without writing SQL. You can also customize the dashboard to meet your business requirements. The JSON definition file of this dashboard is maintained as part of OSS in the Redshift data source for Grafana GitHub repository.

For more information about the topics described to in this post, refer to the following:

About the author

Yota Hamaoka is an Analytics Solution Architect at Amazon Web Services. He is focused on driving customers to accelerate their analytics journey with Amazon Redshift.

Provider	Use case
MongoDB Atlas	Manage components in MongoDB Atlas. Add, edit, or delete administrative objects within Atlas, including projects, users, and database deployments Note: You cannot read or write data to Atlas Clusters with Atlas Admin APIs and AWS CloudFormation resources. To read and write data in Atlas, you must use the Atlas Data API
GitLab	Manage the users and groups in an organization, set up a new project with the right users, groups, and access token, tag a project automatically for every active CI/CD deployment
New Relic	Create a new Dashboard with custom Pages, Widgets and Layout, add tags to your data to help improve data organization and findability, workloads-related tasks
GitHub	Manage the users and groups in an organization, set up a new project with the right users, groups, and access token, Add a webhook to a repo
Dynatrace	Set up a new project with service level objective, locations, monitors and metrics
Okta	Onboard a new application into Okta with the right users and groups
PagerDuty	Set up monitoring of a new or existing application
Databricks	Set up a Databricks cluster and jobs
Fastly	Configure Fastly as a CDN for your web app
BigID	Connect S3 and DynamoDB data sources into your BigID application
Rollbar	Set up a new Rollbar project and manage rules, teams, and users
Cloudflare	Configure a DNS record and load-balancing using Cloudflare
Lacework	Configure Lacework alert profiles, rules, channels and manage queries
Snowflake	Create databases, users, and manage privileges