Tag Archives: Technical How-to

Export historical Security Hub findings to an S3 bucket to enable complex analytics

Post Syndicated from Jonathan Nguyen original https://aws.amazon.com/blogs/security/export-historical-security-hub-findings-to-an-s3-bucket-to-enable-complex-analytics/

AWS Security Hub is a cloud security posture management service that you can use to perform security best practice checks, aggregate alerts, and automate remediation. Security Hub has out-of-the-box integrations with many AWS services and over 60 partner products. Security Hub centralizes findings across your AWS accounts and supported AWS Regions into a single delegated administrator account in your aggregation Region of choice, creating a single pane of glass to consolidate and view individual security findings.

Because there are a large number of possible integrations across accounts and Regions, your delegated administrator account in the aggregation Region might have hundreds of thousands of Security Hub findings. To perform complex analytics or machine learning across the existing (historical) findings that are maintained in Security Hub, you can export findings to an Amazon Simple Storage Service (Amazon S3) bucket. To export new findings that have recently been created, you can implement the solution in the aws-security-hub-findings-export GitHub repository. However, Security Hub has data export API rate quotas, which can make exporting a large number of findings challenging.

In this blog post, we provide an example solution to export your historical Security Hub findings to an S3 bucket in your account, even if you have a large number of findings. We walk you through the components of the solution and show you how to use the solution after deployment.

Prerequisites

To deploy the solution, complete the following prerequisites:

  1. Enable Security Hub.
  2. If you want to export Security Hub findings for multiple accounts, designate a Security Hub administrator account.
  3. If you want to export Security Hub findings across multiple Regions, enable cross-Region aggregation.

Solution overview and architecture

In this solution, you use the following AWS services and features:

  • Security Hub export orchestration
    • AWS Step Functions helps you orchestrate automation and long-running jobs, which are integral to this solution. You need the ability to run a workflow for hours due to the Security Hub API rate limits and number of findings and objects.
    • AWS Lambda functions handle the logic for exporting and storing findings in an efficient and cost-effective manner. You can customize Lambda functions to most use cases.
  • Storage of exported findings
    • Amazon Simple Storage Service (Amazon S3) stores the exported Security Hub findings as objects in a bucket in your account.
  • Job status tracking
    • Amazon EventBridge tracks changes in the status of the Step Functions workflow. The solution can run for over 100 hours; by using EventBridge, you don’t have to manually check the status (an example event pattern is sketched after this list).
    • Amazon Simple Notification Service (Amazon SNS) sends you notifications when the long-running jobs are complete or when they might have issues.
    • AWS Systems Manager Parameter Store provides a quick way to track overall status by maintaining a numeric count of successfully exported findings that you can compare with the number of findings shown in the Security Hub dashboard.
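The EventBridge integration is wired up for you by the solution. If you want to see how such a rule could be configured yourself, the following boto3 sketch is one possibility; the rule name and ARNs are placeholders, and it assumes the standard Step Functions execution status change event that Step Functions emits to EventBridge.

import json

import boto3

events = boto3.client("events")

# Placeholder ARNs: replace with the SNS topic and state machine in your account.
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:111122223333:Security_Hub_Export_Status"
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:111122223333:stateMachine:sec_hub_finding_export"

# Match terminal status changes of the export state machine.
event_pattern = {
    "source": ["aws.states"],
    "detail-type": ["Step Functions Execution Status Change"],
    "detail": {
        "status": ["SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"],
        "stateMachineArn": [STATE_MACHINE_ARN],
    },
}

events.put_rule(
    Name="sechub-export-status-rule",
    EventPattern=json.dumps(event_pattern),
    State="ENABLED",
)

# Send matching events to the SNS topic that notifies subscribers.
events.put_targets(
    Rule="sechub-export-status-rule",
    Targets=[{"Id": "sns-notify", "Arn": SNS_TOPIC_ARN}],
)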

Figure 1 shows the architecture for the solution, deployed in the Security Hub delegated administrator account in the aggregation Region. The figure shows multiple Security Hub member accounts to illustrate how you can export findings for an entire AWS Organizations organization from a single delegated administrator account.

Figure 1: High-level overview of process and resources deployed in the Security Hub account

As shown in Figure 1, the workflow after deployment is as follows:

  1. The Step Functions workflow for the Security Hub export is invoked.
  2. The Step Functions workflow invokes a single Lambda function that does the following (a minimal sketch of this logic follows the list):
    1. Retrieves Security Hub findings that have an Active status and puts them in a temporary file.
    2. Pushes the file as an object to Amazon S3.
    3. Adds the global count of exported findings from the Step Functions workflow to a Systems Manager parameter for validation and tracking purposes.
    4. Repeats steps 2–3 for about 10 minutes to export as many findings as possible while preventing the Lambda function from timing out.
    5. If a nextToken is present, pushes the nextToken to the output of the Step Functions workflow.

      Note: If the number of items in the output is smaller than the number of items returned by the API call, then the return output includes a nextToken, which can be passed to a subsequent command to retrieve the next set of items.

  3. The Step Functions workflow goes through a Choice state as follows:
    • If a Security Hub nextToken is present, Step Functions invokes the Lambda function again.
    • If a Security Hub nextToken isn’t present, Step Functions ends the workflow successfully.
  4. An EventBridge rule tracks changes in the status of the Step Functions workflow and sends events to an SNS topic. Subscribers to the SNS topic receive a notification when the status of the Step Functions workflow changes.
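The following is a minimal sketch of the export logic described in step 2, not the exact code that the solution deploys. The bucket name is a placeholder, the parameter name matches the /sechubexport/findingcount parameter shown later in this post, and the 10-minute loop and temporary-file handling are omitted for brevity.

import json
import uuid

import boto3

securityhub = boto3.client("securityhub")
s3 = boto3.client("s3")
ssm = boto3.client("ssm")

BUCKET_NAME = "example-sechub-findings-bucket"  # placeholder: the bucket created by the solution
COUNT_PARAMETER = "/sechubexport/findingcount"  # parameter the solution uses for tracking


def export_one_page(next_token=None):
    """Fetch one page of active findings, push it to Amazon S3, and update the running count."""
    kwargs = {
        "Filters": {"RecordState": [{"Value": "ACTIVE", "Comparison": "EQUALS"}]},
        "MaxResults": 100,  # GetFindings returns at most 100 findings per request
    }
    if next_token:
        kwargs["NextToken"] = next_token

    response = securityhub.get_findings(**kwargs)
    findings = response.get("Findings", [])

    if findings:
        # Push this page of findings to Amazon S3 as a single object.
        key = f"security-hub-findings/{uuid.uuid4()}.json"
        s3.put_object(Bucket=BUCKET_NAME, Key=key, Body=json.dumps(findings, default=str))

        # Add the page size to the running count kept in Parameter Store.
        current = int(ssm.get_parameter(Name=COUNT_PARAMETER)["Parameter"]["Value"])
        ssm.put_parameter(
            Name=COUNT_PARAMETER,
            Value=str(current + len(findings)),
            Type="String",
            Overwrite=True,
        )

    # A returned nextToken tells the Step Functions workflow to invoke the function again.
    return response.get("NextToken")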

Deploy the solution

You can deploy the solution through either the AWS Management Console or the AWS Cloud Development Kit (AWS CDK).

To deploy the solution (console)

  • In your delegated administrator Security Hub account, launch the AWS CloudFormation template by choosing the following Launch Stack button. It will take about 10 minutes for the CloudFormation stack to complete.

    Launch Stack

    Note: The stack will launch in the US East (N. Virginia) Region (us-east-1). If you are using cross-Region aggregation, deploy the solution into the Region where Security Hub findings are consolidated. You can download the CloudFormation template for the solution, modify it, and deploy it to your selected Region.

To deploy the solution (AWS CDK)

  1. Download the code from our aws-security-hub-findings-historical-export GitHub repository, where you can also contribute to the sample code. The CDK initializes your environment and uploads the Lambda assets to Amazon S3. Then, you deploy the solution to your account.
  2. While you are authenticated in the security tooling account, run the following commands in your terminal. Make sure to replace <AWS_ACCOUNT> with the account number, and replace <REGION> with the AWS Region where you want to deploy the solution.
    cdk bootstrap aws://<AWS_ACCOUNT>/<REGION>
    cdk deploy SechubHistoricalPullStack

Solution walkthrough and validation

Now that you’ve successfully deployed the solution, you can see each aspect of the automation workflow in action.

Before you start the workflow, you need to subscribe to the SNS topic so that you’re notified of status changes within the Step Functions workflow. For this example, you will use email notification.

To subscribe to the SNS topic

  1. Open the Amazon SNS console.
  2. Go to Topics and choose the Security_Hub_Export_Status topic.
  3. Choose Create subscription.
  4. For Protocol, choose Email.
  5. For Endpoint, enter the email address where you want to receive notifications.
  6. Choose Create subscription.
  7. After you create the subscription, go to your email and confirm the subscription.

You’re now subscribed to the SNS topic, so any time that the Step Functions status changes, you will receive a notification. Let’s walk through how to run the export solution.

To run the export solution

  1. Open the AWS Step Functions console.
  2. In the left navigation pane, choose State machines.
  3. Choose the new state machine named sec_hub_finding_export.
  4. Choose Start execution.
  5. On the Start execution page, for Name – optional and Input – optional, leave the default values and then choose Start execution.
    Figure 2: Example input values for execution of the Step Functions workflow

  6. This will start the Step Functions workflow and redirect you to the Graph view. If successful, you will see that the overall Execution status and each step have a status of Successful.
  7. For long-running jobs, you can view the CloudWatch log group associated with the Lambda function to view the logs.
  8. To track the number of Security Hub findings that have been exported, open the Systems Manager console, choose Parameter Store, and then select the /sechubexport/findingcount parameter. Under Value, you will see the total number of Security Hub findings that have been exported, as shown in Figure 3.
    Figure 3: Systems Manager Parameter Store value for the number of Security Hub findings exported

Depending on the number of Security Hub findings, this process can take some time. This is primarily due to the GetFindings quota of 3 requests per second. Each GetFindings request can return a maximum of 100 findings, so this means that you can get up to 300 findings per second. On average, the solution can export about 1 million findings per hour. If you have a large number of findings, you can start the finding export process and wait for the SNS topic to notify you when the process is complete.

How to customize the solution

The solution provides a general framework to help you export your historical Security Hub findings. There are many ways that you can customize this solution based on your needs. The following are some enhancements that you can consider.

Change the Security Hub finding filter

The solution currently pulls all findings with RecordState: ACTIVE, which are the active Security Hub findings in the AWS account. You can update the Lambda function code, specifically the finding_filter JSON value within the create_filter function, to pull findings for your use case. For example, to export only active findings that have a workflow state of NEW, update the Lambda function code as follows.

{
    "WorkflowState": [
        {
            "Value": "NEW",
            "Comparison": "EQUALS"
        }
    ],
    "RecordState": [
        {
            "Value": "ACTIVE",
            "Comparison": "EQUALS"
        }
    ]
}
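If you also want to scope the export to a single standard, such as AWS Foundational Security Best Practices, one approach is to add a GeneratorId prefix filter inside the same create_filter function. The following sketch assumes that the standard's generator IDs begin with aws-foundational-security-best-practices; verify the prefix against your own findings before relying on it.

{
    "GeneratorId": [
        {
            "Value": "aws-foundational-security-best-practices",
            "Comparison": "PREFIX"
        }
    ],
    "RecordState": [
        {
            "Value": "ACTIVE",
            "Comparison": "EQUALS"
        }
    ]
}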

Export more than 100 million Security Hub findings

The example solution can export about 100 million Security Hub findings in a single run. This ceiling is primarily determined by the speed at which findings can be exported, which is constrained by the GetFindings rate quota described earlier and by the limits of a single long-running Step Functions execution.

If you want to export more than 100 million Security Hub findings, split the export across multiple Step Functions executions, for example by starting a new execution when the current one approaches its limits.

Note: If you implement either of these solutions, make sure that the nextToken also gets passed to the new Step Functions execution by updating the Lambda function code to parse and pass the nextToken received in the last request.

Speed up the export

One way to increase the export bandwidth, and reduce the overall execution time, is to run the export job in parallel across the individual Security Hub member accounts rather than from the single delegated administrator account.

You could use CloudFormation StackSets to deploy this solution in each Security Hub member account and send the findings to a centralized S3 bucket. You would need to modify the solution to allow an S3 bucket to be provided as an input, and all of the Lambda function AWS Identity and Access Management (IAM) roles would need cross-account access to the S3 bucket and corresponding AWS Key Management Service (AWS KMS) key. You would also need to make updates in each member account to iterate through the various Regions in which the Security Hub findings exist.

Next steps

The solution in this post is designed to assist in the retrieval and export of all existing findings currently in Security Hub. After you successfully run this solution to export historical findings, you can continuously export new Security Hub findings by using the sample solution in the aws-security-hub-findings-export GitHub repository.

Now that you’ve exported the Security Hub findings, you can set up and run custom complex reporting or queries against the S3 bucket by using Amazon Athena and AWS Glue. Additionally, you can run machine learning and analytics capabilities by using services like Amazon SageMaker or Amazon Lookout for Metrics.

Conclusion

In this post, you deployed a solution to export the existing Security Hub findings in your account to a central S3 bucket, so that you can apply complex analytics and machine learning to those findings. We walked you through how to use the solution and apply it to some example use cases after you successfully exported existing findings across your AWS environment. Now your security team can use the data in the S3 bucket for predictive analytics and determine whether there are Security Hub findings and specific resources that might need to be prioritized for review due to a deviation from normal behavior. Additionally, you can use this solution to enable more complex analytics on multiple fields by querying large and complex datasets with Amazon Athena.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a thread on AWS Security Hub re:Post.

 

Jonathan Nguyen

Jonathan is a Shared Delivery Team Senior Security Consultant at AWS. His background is in AWS Security with a focus on threat detection and incident response. Today, he helps enterprise customers develop a comprehensive security strategy and deploy security solutions at scale, and he trains customers on AWS Security best practices.

Microservice observability with Amazon OpenSearch Service part 1: Trace and log correlation

Post Syndicated from Subham Rakshit original https://aws.amazon.com/blogs/big-data/part-1-microservice-observability-with-amazon-opensearch-service-trace-and-log-correlation/

Modern enterprises are increasingly adopting microservice architectures and moving away from monolithic structures. Although microservices provide agility in development and scalability, and encourage use of polyglot systems, they also add complexity. Troubleshooting distributed services is hard because the application behavioral data is distributed across multiple machines. Therefore, to gain the deep insights needed to troubleshoot distributed applications, operational teams need to collect application behavioral data in one place where they can analyze it.

Although a monitoring system that focuses on analyzing only log data can help you understand what went wrong and notify you about anomalies, it fails to provide insight into why something went wrong and exactly where in the application code it went wrong. Fixing issues in a complex network of systems is like finding a needle in a haystack. Observability based on open standards defined by OpenTelemetry addresses the problem by providing support to handle logs, traces, and metrics within a single implementation.

In this series, we cover the setup and troubleshooting of a distributed microservice application using logs and traces. Logs are immutable, timestamped, discrete events happening over a period of time, whereas traces are a series of related events that capture the end-to-end request flow in a distributed system. We look into how to collect a large volume of logs and traces in Amazon OpenSearch Service and correlate these logs and traces to find the actual issue and where the issue was generated.

Any investigation of issues in enterprise applications needs to be logged in an incident report, so that operational and development teams can collaborate to roll out a fix. When any investigation is carried out, it’s important to write a narrative about the issue so that it can be used in discussion later. We look into how to use the latest notebook feature in OpenSearch Service to create the incident report.

In this post, we discuss the architecture and application troubleshooting steps.

Solution overview

The following diagram illustrates the observability solution architecture to capture logs and traces.

The solution components are as follows:

  • Amazon OpenSearch Service is a managed AWS service that makes it easy to deploy, operate, and scale OpenSearch clusters in the AWS Cloud. OpenSearch Service supports OpenSearch and legacy Elasticsearch open-source software (up to 7.10, the final open-source version of the software).
  • FluentBit is an open-source processor and forwarder that collects, enriches, and sends metrics and logs to various destinations.
  • AWS Distro for OpenTelemetry is a secure, production-ready, AWS-supported distribution of the OpenTelemetry project. With AWS Distro for OpenTelemetry, you can instrument your applications just once to send correlated metrics and traces to multiple AWS and Partner monitoring solutions, including OpenSearch Service.
  • Data Prepper is an open-source utility service with the ability to filter, enrich, transform, normalize, and aggregate data to enable an end-to-end analysis lifecycle, from gathering raw logs to facilitating sophisticated and actionable interactive ad hoc analyses on the data.
  • We use a sample observability shop web application built as a microservice to demonstrate the capabilities of the solution components.
  • Amazon Elastic Kubernetes Service (Amazon EKS) is a managed service that you can use to run Kubernetes on AWS without needing to install, operate, and maintain your own Kubernetes control plane or nodes. Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications.

In this solution, we have a sample o11y (Observability) Shop web application written in Python and Java, and deployed in an EKS cluster. The web application is composed of various services. When some operations are done from the front end, the request travels through multiple services on the backend. The application services are running as separate containers, while AWS Distro for OpenTelemetry, FluentBit, and Data Prepper are running as sidecar containers.

FluentBit collects log data from the application containers and then sends the logs to Data Prepper. For collecting traces, the application services are first instrumented using the OpenTelemetry SDK. Then the AWS Distro for OpenTelemetry collector collects the trace information and sends it to Data Prepper. Data Prepper forwards the logs and traces to OpenSearch Service.

We recommend deploying the OpenSearch Service domain within a VPC, so a reverse proxy is needed to be able to log in to OpenSearch Dashboards.

Prerequisite

You need an AWS account with necessary permissions to deploy the solution.

Set up the environment

We use AWS CloudFormation to provision the components of our architecture. Complete the following steps:

  1. Launch the CloudFormation stack in the us-east-1 Region:
  2. You may keep the default stack name, AOS-Observability.
  3. You may change the OpenSearchMasterUserName parameter used for the OpenSearch Service login while keeping the other parameter values at their defaults. The stack provisions a VPC, subnets, security groups, route tables, an AWS Cloud9 instance, and an OpenSearch Service domain, along with an Nginx reverse proxy. It also configures AWS Identity and Access Management (IAM) roles. The stack also generates a new random password for the OpenSearch Service domain, which can be seen on the CloudFormation Outputs tab under AOSDomainPassword.
  4. On the stack’s Outputs tab, choose the link for the AWS Cloud9 IDE.
  5. Run the following code to install the required packages, configure the environment variables and provision the EKS cluster:
    curl -sSL https://raw.githubusercontent.com/aws-samples/observability-with-amazon-opensearch-blog/main/scripts/eks-setup.sh | bash -s <<CloudFormation Stack Name>>

    After the resources are deployed, it prints the hostname for the o11y Shop web application.

  6. Copy the hostname and enter it in the browser.

This opens the o11y Shop microservice application, as shown in the following screenshot.

Access the OpenSearch Dashboards

To access the OpenSearch Dashboards, complete the following steps:

  1. Choose the link for AOSDashboardsPublicIP from the CloudFormation stack outputs. Because the OpenSearch Service domain is deployed inside the VPC, we use an Nginx reverse proxy to forward the traffic to the OpenSearch Service domain. Because the OpenSearch Dashboards URL uses a self-signed certificate, you need to bypass the browser security warning. In production, a valid certificate is recommended for secure access.
  2. Assuming you’re using Google Chrome, while you are on this page, enter thisisunsafe. Google Chrome redirects you to the OpenSearch Service login page.
  3. Log in with the OpenSearch Service login details (found in the CloudFormation stack outputs: AOSDomainUserName and AOSDomainPassword). You’re presented with a dialog requesting you to add data for exploration.
  4. Select Explore on my own.
  5. When asked to select a tenant, leave the default options and choose Confirm.
  6. Open the Hamburger menu to explore the plugins within OpenSearch Dashboards.

This is the OpenSearch Dashboards user interface. We use it in the next steps to analyze, explore, fix, and find the root cause of the issue.

Logs and traces generation

Click around the o11y Shop application to simulate user actions. This will generate logs and some traces for the associated microservices stored in OpenSearch Service. You can repeat the process multiple times to generate more sample log and trace data.

Create an index pattern

An index pattern selects the data to use and allows you to define properties of the fields. An index pattern can point to one or more indexes, data streams, or index aliases.

You need to create an index pattern to query the data through OpenSearch Dashboards.

  1. On OpenSearch Dashboards, choose Stack Management.
  2. Choose Index Patterns.
  3. Choose Create index pattern.
  4. For Index pattern name, enter sample_app_logs. OpenSearch Dashboards also supports wildcards.
  5. Choose Next step.
  6. For Time field, choose time.
  7. Choose Create index pattern.
  8. Repeat these steps to create the index pattern otel-v1-apm-span* with event.time as the time field for discovering traces.

Search logs

Choose the menu icon and look for the Discover section in OpenSearch Dashboards. The Discover panel allows you to view and query logs. Check the log activity happening in the microservice application.

If you can’t see any data, increase the time range to something large (like the last hour). Alternatively, you can play around with the o11y Shop application to generate recent log and trace data.

Instrument applications to generate traces

Applications need to be instrumented to generate and send trace data downstream. There are two types of instrumentation:

  • Automatic – In automatic instrumentation, no application code change is required. It uses an agent that can capture trace data from the running application. It requires usage of the language-specific API and SDK, which takes the configuration provided through the code or environment and provides good coverage of endpoints and operations. It automatically determines the span start and end.
  • Manual – In manual instrumentation, developers need to add trace capture code to the application. This provides customization in terms of capturing traces for a custom code block, naming various components in OpenTelemetry like traces and spans, adding attributes and events, and handling specific exceptions within the code.

In our application code, we use manual instrumentation. Refer to Manual Instrumentation to collect traces in the GitHub repository to understand the steps.
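As an illustration only (the actual instrumentation lives in the GitHub repository), the following Python sketch shows what manually instrumenting one operation could look like. The service name, span name, attribute, collector endpoint, and the process_payment call are assumptions made for the example.

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Send spans to the AWS Distro for OpenTelemetry collector sidecar
# (endpoint is an assumption; adjust it to your deployment).
provider = TracerProvider(resource=Resource.create({"service.name": "payment"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)


def checkout(order_id):
    # A manually created span: you control its name, attributes, and error handling.
    with tracer.start_as_current_span("client_checkout") as span:
        span.set_attribute("order.id", order_id)
        try:
            process_payment(order_id)  # hypothetical downstream call
        except Exception as exc:
            span.record_exception(exc)
            raise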

Explore trace analytics

OpenSearch Service version 1.3 has a new module to support observability.

  1. Choose the menu icon and look for the Observability section under OpenSearch Plugins.
  2. Choose Trace analytics to examine some of the traces generated by the backend service. If you fail to see sufficient data, increase the time range. Alternatively, choose all the buttons on the sample app webpage for each application service to generate sufficient trace data to debug. You can choose each option multiple times. The following screenshot shows a summarized view of the traces captured.

    The dashboard view groups traces together by trace group name and provides information about average latency, error rate, and trends associated with a particular operation. Latency variance indicates whether the latency of a request falls above or below the 95th percentile. If there are multiple trace groups, you can narrow the view by adding filters on various parameters.
  3. Add a filter on the trace group client_checkout.

    The following screenshot shows our filtered results.

    The dashboard also features a map of all the connected services. The Service map helps provide a high-level view on what’s going on in the services based on the color-coding grouped by Latency, Error rate, and Throughput. This helps you identify problems by service.
  4. Choose Error rate to explore the error rate of the connected services. Based on the color-coding in the following diagram, it’s evident that the payment service is throwing errors, whereas other services are working fine without any errors.
  5. Switch to the Latency view, which shows the relative latency in milliseconds with different colors.
    This is useful for troubleshooting bottlenecks in microservices.

    The Trace analytics dashboard also shows distribution of traces over time and trace error rate over time.
  6. To discover the list of traces, under Trace analytics in the navigation pane, choose Traces.
  7. To find the list of services, count of traces per service, and other service-level statistics, choose Services in the navigation pane.

Search traces

Now we want to drill down and learn more about how to troubleshoot errors.

  1. Go back to the Trace analytics dashboard.
  2. Choose Error Rate Service Map and choose the payment service on the graph. The payment service is in dark red. This also sets the payment service filter on the dashboard, and you can see the trace group in the upper pane.
  3. Choose the Traces link of the client_checkout trace group.

    You’re redirected to the Traces page. The list of traces for the client_checkout trace group can be found here.
  4. To view details of the traces, choose Trace IDs. You can see a pie chart showing how much time the trace has spent in each service. The trace is composed of multiple spans; a span is defined as a timed operation that represents a piece of the workflow in the distributed system. On the right, you can also see the time spent in each span, and which spans have an error.
  5. Copy the trace ID in the client-checkout group.

Log and trace correlation

Although the log and trace data provides valuable information individually, the actual advantage is when we can relate trace data to log data to capture more details about what went wrong. There are three ways we can correlate traces to logs:

  • Runtime – Logs, traces, and metrics can record the moment of time or the range of time the run took place.
  • Run context – This is also known as the request context. It’s standard practice to record the run context (trace and span IDs as well as user-defined context) in the spans. OpenTelemetry extends this practice to logs where possible by including the TraceID and SpanID in the log records. This allows us to directly correlate logs and traces that correspond to the same run context. It also allows us to correlate logs from different components of a distributed system that participated in the particular request.
  • Origin of the telemetry – This is also known as the resource context. OpenTelemetry traces and metrics contain information about the resource they come from. We extend this practice to logs by including the resource in the log records.

These three correlation methods can be the foundation of powerful navigational, filtering, querying, and analytical capabilities. OpenTelemetry aims to record and collect logs in a manner that enables such correlations.

  1. Use the copied traceId from the previous section and search for corresponding logs on the Event analytics page.
    We use the following PPL query:

    source = sample_app_logs | where traceId = "<<trace_id>>"

    Make sure to increase the time range to at least the last hour.

  2. Choose Update to find the corresponding log data for the trace ID.
  3. Choose the expand icon to find more details. This shows you the details of the log, including the traceId. This log shows that the payment checkout operation failed. This correlation allowed us to find key information in the log that allows us to go to the application and debug the code.
  4. Choose the Traces tab to see the corresponding trace data linked with the log data.
  5. Choose View surrounding events to discover other events happening at the same time.

This information can be valuable when you want to understand what’s going on in the whole application, particularly how other services are impacted during that time.

Cleanup

This section provides the necessary information for deleting various resources created as part of this post.

We recommend performing the following steps after you go through the next post in the series.

  1. Run the following command in the AWS Cloud9 terminal to remove the Amazon EKS cluster and its resources.
    eksctl delete cluster --name=observability-cluster

  2. Run the following script to delete the Amazon Elastic Container Registry (Amazon ECR) repositories.
    cd observability-with-amazon-opensearch-blog/scripts
    bash 03-delete-ecr-repo.sh

  3. Delete the CloudFormation stacks in sequence: first eksDeploy, then AOS-Observability.

Summary

In this post, we deployed an Observability (o11y) Shop microservice application with various services and captured logs and traces from the application. We used FluentBit to capture logs, AWS Distro for OpenTelemetry to capture traces, and Data Prepper to collect these logs and traces and send them to OpenSearch Service. We showed how to use the Trace analytics page to look into the captured traces, details about those traces, and service maps to find potential issues. To correlate log and trace data, we demonstrated how to use the Event analytics page to write a simple PPL query to find corresponding log data. The implementation code can be found in the GitHub repository for reference.

The next post in our series covers the use of PPL to create an operational panel to monitor our microservices along with an incident report using notebooks.


About the Authors

Subham Rakshit is a Streaming Specialist Solutions Architect for Analytics at AWS based in the UK. He works with customers to design and build search and streaming data platforms that help them achieve their business objective. Outside of work, he enjoys spending time solving jigsaw puzzles with his daughter.

Marvin Gersho is a Senior Solutions Architect at AWS based in New York City. He works with a wide range of startup customers. He previously worked for many years in engineering leadership and hands-on application development, and now focuses on helping customers architect secure and scalable workloads on AWS with a minimum of operational overhead. In his free time, Marvin enjoys cycling and strategy board games.

Rafael Gumiero is a Senior Analytics Specialist Solutions Architect at AWS. An open-source and distributed systems enthusiast, he provides guidance to customers who develop their solutions with AWS Analytics services, helping them optimize the value of their solutions.

How to Automatically Prevent Email Throttling when Reaching Concurrency Limit

Post Syndicated from Mark Richman original https://aws.amazon.com/blogs/messaging-and-targeting/prevent-email-throttling-concurrency-limit/

Introduction

Many users of Amazon Simple Email Service (Amazon SES) send large email campaigns that target tens of thousands of recipients. Regulating the flow of Amazon SES requests can prevent throttling due to exceeding the AWS service limit on the account.

Amazon SES service quotas include a soft limit on the number of emails sent per second (also known as the “sending rate”). This quota is intended to protect users from accidentally sending unintended volumes of email, or from spending more money than intended. Most Amazon SES customers have this quota increased, but very large campaigns may still exceed that limit. As a result, Amazon SES will throttle email requests. When this happens, your messages will fail to reach their destination.

This blog provides users of Amazon SES a mechanism for regulating the flow of messages that are sent to Amazon SES. Cloud Architects, Engineers, and DevOps Engineers designing new, or improving an existing Amazon SES solution would benefit from reading this post.

Overview

A common solution for regulating the flow of API requests to Amazon SES is achieved using Amazon Simple Queue Service (Amazon SQS). Amazon SQS can send, store, and receive messages at virtually any volume and can serve as part of a solution to buffer and throttle the rate of API calls. It achieves this without the need for other services to be available to process the messages. In this solution, Amazon SQS prevents messages from being lost while waiting for them to be sent as emails.

Fig 1 — High level architecture diagram

But this common solution introduces a new challenge. The mechanism used by the Amazon SQS event source mapping for AWS Lambda invokes a function as soon as messages are visible. Our challenge is to regulate the flow of messages, rather than invoke Amazon SES as messages arrive to the queue.

Fig 2 — Leaky bucket

Developers typically limit the flow of messages in a distributed system by implementing the “leaky bucket” algorithm. This algorithm is an analogy to a bucket which has a hole in the bottom from which water leaks out at a constant rate. Water can be added to the bucket intermittently. If too much water is added at once or at too high a rate, the bucket overflows.

In this solution, we prevent this overflow by using throttling. Throttling can be handled in two ways: either before messages reach Amazon SQS, or after messages are removed from the queue (“dequeued”). Both of these methods pose challenges in handling the throttled messages and reprocessing them. These challenges introduce complexity and lead to the excessive use of resources that may cause a snowball effect and make the throttling worse.
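To make the analogy concrete, the following is a minimal, generic leaky bucket sketch. It is illustrative only and is not part of the solution described later in this post.

import time


class LeakyBucket:
    # Requests are admitted only while the bucket has room; the bucket drains at a constant rate.

    def __init__(self, capacity, leak_rate_per_sec):
        self.capacity = capacity
        self.leak_rate = leak_rate_per_sec
        self.level = 0.0
        self.last_checked = time.monotonic()

    def _leak(self):
        # Water leaks out at a constant rate since the last check.
        now = time.monotonic()
        self.level = max(0.0, self.level - (now - self.last_checked) * self.leak_rate)
        self.last_checked = now

    def try_add(self):
        # Return True if the request fits in the bucket, False if it would overflow.
        self._leak()
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False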

Developers often use the following techniques to help improve the successful processing of feeds and submissions:

  • Submit requests at times other than on the hour or on the half hour. For example, submit requests at 11 minutes after the hour or at 41 minutes after the hour. This can have the effect of limiting competition for system resources with other periodic services.
  • Take advantage of times during the day when traffic is likely to be low, such as early evening or early morning hours.

However, these techniques assume that you have control over the rate of requests, which is usually not the case.

Amazon SQS, acting as a highly scalable buffer, allows you to disregard the incoming message rate and store messages at virtually any volume. Therefore, there is no need to throttle messages before adding them to the queue. As long as you process messages faster than new ones arrive over time, any backlog will eventually be worked off.

Regulating flow of messages from Amazon SQS

The proposed solution in this post regulates the dequeue of messages from one or more SQS queues. This approach can help prevent you from exceeding the per-second quota of Amazon SES, thereby preventing Amazon SES from throttling your API calls.

Available configuration controls

When it comes to regulating outflow from Amazon SQS you have a few options. MaxNumberOfMessages controls the maximum number of messages you can dequeue in a single read request. WaitTimeSeconds defines whether Amazon SQS uses short polling (0 seconds wait) or long polling (more than 0 seconds) while waiting to read messages from a queue. Though these capabilities are helpful in many use cases, they don’t provide full control over outflow rates.

The Amazon SQS event source mapping for Lambda is a built-in mechanism that uses a poller within the Lambda service. The poller polls for visible messages in the queue. Once messages are read, the poller immediately invokes the configured Lambda function. To prevent downstream throttling, this solution implements a custom poller that regulates the rate at which messages are polled, instead of using the Amazon SQS event source mapping.

Custom poller Lambda

Let’s look at the process of implementing a custom poller Lambda function. Your function should actively regulate the outflow rate without throttling or losing any messages.

First, you have to consider how to invoke the poller Lambda function once every second; Amazon EventBridge rules can schedule Lambda invocations no more frequently than once per minute. You also have to consider how to complete processing of Amazon SES invocations as soon as possible. And finally, you have to consider how to send requests to Amazon SES at a rate as close as possible to your per-second quota, without exceeding it.

You can use long polling to meet all of these requirements. With long polling (setting the WaitTimeSeconds value to a number greater than zero), the request queries all of the Amazon SQS servers and waits, up to the specified wait time, for messages to become available, returning at most the maximum number of messages you can handle per second (the MaxNumberOfMessages value, which a single request caps at 10). By setting MaxNumberOfMessages equal to your Amazon SES per-second request quota, you prevent your requests from exceeding that limit.

By splitting the looping logic from the polling logic (using two Lambda functions), the scheduler loops once per second (60 times per minute) and asynchronously invokes the polling logic.

Fig 3 — Custom poller diagram

You can use the following Python code to create the scheduler loop function:

import os
from time import sleep, time_ns

import boto3

SENDER_FUNCTION_NAME = os.getenv("SENDER_FUNCTION_NAME")
lambda_client = boto3.client("lambda")

def lambda_handler(event, context):
    print(event)

    # Invoke the poller function asynchronously once per second for one minute.
    for _ in range(60):
        prev_ns = time_ns()

        response = lambda_client.invoke_async(
            FunctionName=SENDER_FUNCTION_NAME, InvokeArgs="{}"
        )
        print(response)

        # Sleep for the remainder of the current second so that each
        # iteration takes approximately one second in total.
        delta_ns = time_ns() - prev_ns

        if delta_ns < 1_000_000_000:
            secs = (1_000_000_000.0 - delta_ns) / 1_000_000_000
            sleep(secs)

This Python code creates a poller function:

import json 
import os

import boto3

UNREGULATED_QUEUE_URL = os.getenv("UNREGULATED_QUEUE_URL") 
MAX_NUMBER_OF_MESSAGES = 3 
WAIT_TIME_SECONDS = 1 
CHARSET = "UTF-8"

ses_client = boto3.client("ses") 
sqs_client = boto3.client("sqs")

def lambda_handler(event, context): 
    response = sqs_client.receive_message( 
        QueueUrl=UNREGULATED_QUEUE_URL, 
        MaxNumberOfMessages=MAX_NUMBER_OF_MESSAGES, 
        WaitTimeSeconds=WAIT_TIME_SECONDS, 
    )

    try: 
        messages = response["Messages"] 
    except KeyError: 
        print("No messages in queue") 
        return

    for message in messages: 
        message_body = json.loads(message["Body"]) 
        to_address = message_body["to_address"] 
        from_address = message_body["from_address"] 
        subject = message_body["subject"] 
        body = message_body["body"]

        print(f"Sending email to {to_address}")

        ses_client.send_email( 
            Destination={ 
                "ToAddresses": [ 
                    to_address, 
                ], 
            }, 
            Message={ 
                "Body": { 
                    "Text": { 
                        "Charset": CHARSET, 
                        "Data": body, 
                    } 
                }, 
                "Subject": { 
                    "Charset": CHARSET, 
                    "Data": subject, 
                }, 
            }, 
            Source=from_address, 
        )

        sqs_client.delete_message( 
            QueueUrl=UNREGULATED_QUEUE_URL, ReceiptHandle=message["ReceiptHandle"] 
        )

Regulating flow of prioritized messages from Amazon SQS

In the use case above, you may be serving a very large marketing campaign (“campaign1”) that takes hours to process. At the same time, you may want to process another, much smaller campaign (“campaign2”), which won’t be able to run until campaign1 is complete.

An obvious solution is to prioritize the campaigns by processing both in parallel. For example, allocate 90% of the Amazon SES per-second capacity limit to the larger campaign1, while allowing the smaller campaign2 to take 10% of the available capacity under the limit. Amazon SQS does not provide message priority functionality out of the box. Instead, create two separate queues and poll each queue according to your desired frequency.

Fig 4 — Prioritize campaigns by queue diagram

This solution works fine if you have a consistent flow of incoming messages to both queues. Unfortunately, once you finish processing campaign2, you will keep processing campaign1 using only 90% of the capacity limit per second.

Handling unbalanced flow

To handle an unbalanced flow of messages, merge both of your poller Lambda functions. Implement a single Lambda function that polls both queues, with a combined MaxNumberOfMessages equal to 100% of your limit. In this implementation, the poller Lambda sends 90% campaign1 messages and 10% campaign2 messages. When campaign2 no longer has messages to process, keep processing 100% of the capacity from campaign1’s queue (see the sketch after the next paragraph).

Do not delete unsent messages from the queues. These messages will become visible after their queue’s visibility timeout is reached.
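The following is a sketch of one way a merged poller could weight the two queues. The queue URLs are placeholders, and because a single ReceiveMessage call returns at most 10 messages, the 90:10 split is expressed here as a 9:1 batch ratio per polling cycle; higher SES quotas would require multiple receive calls per cycle.

import boto3

sqs_client = boto3.client("sqs")

# Placeholder queue URLs for the two campaigns.
CAMPAIGN1_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/111122223333/campaign1"
CAMPAIGN2_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/111122223333/campaign2"

# 90:10 weighting expressed within the 10-message ReceiveMessage cap.
PRIMARY_BATCH = 9
SECONDARY_BATCH = 1


def poll_weighted():
    # Read up to 10 messages per cycle, preferring campaign1 and topping up
    # from campaign1 when campaign2 has nothing to process.
    primary = sqs_client.receive_message(
        QueueUrl=CAMPAIGN1_QUEUE_URL,
        MaxNumberOfMessages=PRIMARY_BATCH,
        WaitTimeSeconds=1,
    ).get("Messages", [])

    secondary = sqs_client.receive_message(
        QueueUrl=CAMPAIGN2_QUEUE_URL,
        MaxNumberOfMessages=SECONDARY_BATCH,
        WaitTimeSeconds=0,
    ).get("Messages", [])

    if secondary:
        secondary = [(CAMPAIGN2_QUEUE_URL, m) for m in secondary]
    else:
        # campaign2 is drained: use the remaining capacity for campaign1.
        extra = sqs_client.receive_message(
            QueueUrl=CAMPAIGN1_QUEUE_URL,
            MaxNumberOfMessages=SECONDARY_BATCH,
            WaitTimeSeconds=0,
        ).get("Messages", [])
        secondary = [(CAMPAIGN1_QUEUE_URL, m) for m in extra]

    # Each entry keeps its source queue URL so that the caller can delete the
    # message from the correct queue after the email is sent.
    return [(CAMPAIGN1_QUEUE_URL, m) for m in primary] + secondary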

To further improve on the previous implementations, introduce a third FIFO Queue to aggregate all messages from both queues and regulate dequeuing from that third FIFO queue. This will allow you to use all available capacity under your SES limit, while interweaving messages from both campaigns at a 1:10 ratio.

Fig 5 — Adding FIFO merge queue diagram

Interweaving the messages this way makes sure that campaign2 messages don’t wait until all campaign1 messages have been processed, while the full capacity limit is still used. Once campaign2’s messages are all processed, campaign1’s messages continue to be processed using 100% of the capacity limit.

For setup details, see the instructions for configuring Amazon SQS queues.

Conclusion

In this blog post, we have shown you how to regulate the dequeue of Amazon SQS queue messages. This prevents you from exceeding your Amazon SES per-second limit and removes the need to deal with throttled requests. We explained how to combine Amazon SQS, AWS Lambda, and Amazon EventBridge to create a custom serverless regulating queue poller. Finally, we described how to regulate the flow of Amazon SES requests when using multiple priority queues. These techniques can reduce the implementation time for reprocessing throttled requests, optimize utilization of your SES request quota, and reduce costs.

About the Authors

This blog post was written by Guy Loewy and Mark Richman, AWS Senior Solutions Architects for SMB.

Building highly resilient applications with on-premises interdependencies using AWS Local Zones

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/building-highly-resilient-applications-with-on-premises-interdependencies-using-aws-local-zones/

This blog post is written by Rachel Rui Liu, Senior Solutions Architect.

AWS Local Zones are a type of infrastructure deployment that places compute, storage, database, and other select AWS services close to large population and industry centers.

Following the successful launch of AWS Local Zones in 16 US cities since 2019, AWS announced in February 2022 plans to launch new AWS Local Zones in 32 metropolitan areas in 26 countries worldwide.

With Local Zones, we’ve seen use cases in two common categories.

The first category of use cases is for workloads that require extremely low latency between end-user devices and workload servers. For example, let’s consider media content creation and real-time multiplayer gaming. For these use cases, deploying the workload to a Local Zone can help achieve down to single-digit milliseconds latency between end-user devices and the AWS infrastructure, which is ideal for a good end-user experience.

This post will focus on addressing the second category of use cases, which is commonly seen in an enterprise hybrid architecture, where customers must achieve low latency between AWS infrastructure and existing on-premises data centers. Compared to the first category of use cases, these use cases can tolerate slightly higher latency between the end-user devices and the AWS infrastructure. However, these workloads have dependencies on these on-premises systems, so the lowest possible latency between AWS infrastructure and on-premises data centers is required for better application performance. Here are a few examples of these systems:

  • Financial services sector mainframe workloads hosted on premises serving regional customers.
  • Enterprise Active Directory hosted on premises, serving cloud and on-premises workloads.
  • Enterprise applications hosted on premises processing a high volume of locally generated data.

For workloads deployed in AWS, the time taken for each interaction with components still hosted in the on-premises data center is increased by the latency. In turn, this delays responses received by the end-user. The total latency accumulates and results in suboptimal user experiences.

By deploying modernized workloads in Local Zones, you can reduce latency while continuing to access systems hosted in on-premises data centers, thereby reducing the total latency for the end-user. At the same time, you can enjoy the benefits of agility, elasticity, and security offered by AWS, and can apply the same automation, compliance, and security best practices that you’ve been familiar with in the AWS Regions.

Enterprise workload resiliency with Local Zones

While designing hybrid architectures with Local Zones, resiliency is an important consideration. You want to route traffic to the nearest Local Zone for low latency. However, when disasters happen, it’s critical to fail over to the parent Region automatically.

Let’s look at the details of hybrid architecture design based on real world deployments from different angles to understand how the architecture achieves all of the design goals.

Hybrid architecture with resilient network connectivity

The following diagram shows a high-level overview of a resilient enterprise hybrid architecture with Local Zones, where you have redundant connections between the AWS Region, the Local Zone, and the corporate data center.

Resilient network connectivity

Here are a few key points with this network connectivity design:

  1. Use AWS Direct Connect or Site-to-Site VPN to connect the corporate data center and AWS Region.
  2. Use Direct Connect or self-hosted VPN to connect the corporate data center and the Local Zone. This connection will provide dedicated low-latency connectivity between the Local Zone and corporate data center.
  3. Transit Gateway is a regional service. When attaching the VPC to AWS Transit Gateway, you can only add subnets provisioned in the Region. Instances on subnets in the Local Zone can still use Transit Gateway to reach resources in the Region.
  4. For subnets provisioned in the Region, the VPC route table should be configured to route the traffic to the corporate data center via Transit Gateway.
  5. For subnets provisioned in Local Zone, the VPC route table should be configured to route the traffic to the corporate data center via the self-hosted VPN instance or Direct Connect.

Hybrid architecture with resilient workload deployment

The next examples show a public and a private facing workload.

To simplify the diagram and focus on application layer architecture, the following diagrams assume that you are using Direct Connect to connect between AWS and the on-premises data center.

Example 1: Resilient public facing workload

With a public facing workload, end-user traffic will be routed to the Local Zone. If the Local Zone is unavailable, then the traffic will be routed to the Region automatically using an Amazon Route 53 failover policy.

Public facing workload resiliency
Here are the key design considerations for this architecture:

  1. Deploy the workload in the Local Zone and put the compute layer in an Amazon EC2 Auto Scaling group, so that the application can scale up and down depending on the volume of requests.
  2. Deploy the workload in both the Local Zone and an AWS Region, and put the compute layer into an Auto Scaling group. The regional deployment acts as a pilot light or warm standby with a minimal footprint, but it can scale out when the Local Zone is unavailable.
  3. Two Application Load Balancers (ALBs) are required: one in the Region and one in the Local Zone. Each ALB dispatches the traffic to the workload cluster inside the Auto Scaling group local to it.
  4. An internet gateway is required for public facing workloads. When using a Local Zone, there’s no extra configuration needed: define a single internet gateway and attach it to the VPC.

If you want to specify an Elastic IP address to be the workload’s public endpoint, note that the Local Zone has a different address pool than the Region and that BYOIP is unsupported for Local Zones.

  5. Create a Route 53 DNS record with “Failover” as the routing policy (a code sketch follows this list).
  • For the primary record, point it to the alias of the ALB in the Local Zone. This will set Local Zone as the preferred destination for the application traffic which minimizes latency for end-users.
  • For the secondary record, point it to the alias of the ALB in the AWS Region.
  • Enable health check for the primary record. If health check against the primary record fails, which indicates that the workload deployed in the Local Zone has failed to respond, then Route 53 will automatically point to the secondary record, which is the workload deployed in the AWS Region.
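A sketch of these two records with boto3 follows. The hosted zone ID, record name, and ALB values are placeholders; in the deployed architecture they come from your own hosted zone and the two ALBs described in the previous list.

import boto3

route53 = boto3.client("route53")

# Placeholders: your hosted zone and the (ALB hosted zone ID, ALB DNS name) pairs.
HOSTED_ZONE_ID = "Z0123456789EXAMPLE"
LOCAL_ZONE_ALB = ("ZALBZONEIDEXAMPLE", "lz-alb-123456.us-west-2.elb.amazonaws.com")
REGION_ALB = ("ZALBZONEIDEXAMPLE", "region-alb-654321.us-west-2.elb.amazonaws.com")


def failover_record(name, failover, alb):
    alb_zone_id, alb_dns = alb
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "SetIdentifier": "app-" + failover.lower(),
            "Failover": failover,  # PRIMARY points to the Local Zone ALB, SECONDARY to the Region
            "AliasTarget": {
                "HostedZoneId": alb_zone_id,
                "DNSName": alb_dns,
                # Health checking of the alias target enables automatic failover.
                "EvaluateTargetHealth": True,
            },
        },
    }


route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [
            failover_record("app.example.com", "PRIMARY", LOCAL_ZONE_ALB),
            failover_record("app.example.com", "SECONDARY", REGION_ALB),
        ]
    },
)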

Example 2: Resilient private workload

For a private workload that’s only accessible by internal users, a few extra considerations must be made to keep the traffic inside of the trusted private network.

Private workload resiliency

The architecture for a resilient private facing workload has the same steps as the public facing workload, but with some key differences. These include:

  1. Instead of using a public hosted zone, create private hosted zones in Route 53 to respond to DNS queries for the workload.
  2. Create the primary and secondary records in Route 53 just like the public workload but referencing the private ALBs.
  3. To allow end-users on the corporate network (within offices or connected via VPN) to resolve the workload, use the Route 53 Resolver with an inbound endpoint (a code sketch follows this list). This allows end-users located on premises to resolve the records in the private hosted zone. Route 53 Resolver is designed to be integrated with an on-premises DNS server.
  4. No internet gateway is required for hosting the private workload. You might need an internet gateway in the Local Zone for other purposes: for example, to host a self-managed VPN solution to connect the Local Zone with the corporate data center.
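As an illustration of step 3, the following boto3 sketch creates an inbound Route 53 Resolver endpoint. The subnet IDs and security group ID are placeholders; the security group must allow DNS traffic (TCP and UDP port 53) from the corporate network.

import uuid

import boto3

resolver = boto3.client("route53resolver")

# Placeholders: two subnets for redundancy and a security group allowing DNS from on premises.
SUBNET_IDS = ["subnet-0aaaaaaaaaaaaaaaa", "subnet-0bbbbbbbbbbbbbbbb"]
SECURITY_GROUP_ID = "sg-0123456789abcdef0"

response = resolver.create_resolver_endpoint(
    CreatorRequestId=str(uuid.uuid4()),  # idempotency token
    Name="private-workload-inbound",
    SecurityGroupIds=[SECURITY_GROUP_ID],
    Direction="INBOUND",
    IpAddresses=[{"SubnetId": subnet_id} for subnet_id in SUBNET_IDS],
)

# The endpoint IP addresses become forwarding targets for the on-premises DNS server.
print(response["ResolverEndpoint"]["Id"])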

Hosting multiple workloads

Customers who host multiple workloads in a single VPC generally must consider how to segregate those workloads. As with workloads in the AWS Region, segregation can be implemented at a subnet or VPC level.

If you want to segregate workloads at the subnet level, you can extend your existing VPC architecture by provisioning extra sets of subnets to the Local Zone.

Segregating workloads at the subnet level

Although not shown in the diagram, for those of you using a self-hosted VPN to connect the Local Zone with an on-premises data center, the VPN solution can be deployed in a centralized subnet.

You can continue to use security groups, network access control lists (NACLs), and VPC route tables – just as you would in the Region – to segregate the workloads.

If you want to segregate workloads at the VPC level, like many of our customers do, within the Region, inter-VPC routing is generally handled by Transit Gateway. However, in this case, it may be undesirable to send traffic to the Region to reach a subnet in another VPC that is also extended to the Local Zone.

Segregating workloads at the VPC level

Key considerations for this design are as follows:

  1. Direct Connect is deployed to connect the Local Zone with the corporate data center. Therefore, each VPC will have a dedicated Virtual Private Gateway provisioned to allow association with the Direct Connect Gateway.
  2. To enable inter-VPC traffic within the Local Zone, peer the two VPCs together.
  3. Create a VPC route table in VPC A. Add a route for Subnet Y where the destination is the peering link. Assign this route table to Subnet X.
  4. Create a VPC route table in VPC B. Add a route for Subnet X where the destination is the peering link. Assign this route table to Subnet Y.
  5. If necessary, add routes for on-premises networks and the transit gateway to both route tables.

This design allows traffic between subnets X and Y to stay within the Local Zone, thereby avoiding any latency from the Local Zone to the AWS Region while still permitting full connectivity to all other networks.
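The following boto3 sketch shows what steps 3 and 4 could look like. The route table IDs, peering connection ID, and CIDR ranges are placeholders.

import boto3

ec2 = boto3.client("ec2")

# Placeholders: replace with your own resource IDs and CIDR ranges.
PEERING_CONNECTION_ID = "pcx-0123456789abcdef0"
VPC_A_ROUTE_TABLE_ID = "rtb-0aaaaaaaaaaaaaaaa"  # associated with Subnet X
VPC_B_ROUTE_TABLE_ID = "rtb-0bbbbbbbbbbbbbbbb"  # associated with Subnet Y
SUBNET_X_CIDR = "10.0.1.0/24"
SUBNET_Y_CIDR = "10.1.1.0/24"

# VPC A: send traffic destined for Subnet Y over the peering link.
ec2.create_route(
    RouteTableId=VPC_A_ROUTE_TABLE_ID,
    DestinationCidrBlock=SUBNET_Y_CIDR,
    VpcPeeringConnectionId=PEERING_CONNECTION_ID,
)

# VPC B: send traffic destined for Subnet X over the peering link.
ec2.create_route(
    RouteTableId=VPC_B_ROUTE_TABLE_ID,
    DestinationCidrBlock=SUBNET_X_CIDR,
    VpcPeeringConnectionId=PEERING_CONNECTION_ID,
)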

Conclusion

In this post, we summarized the use cases for enterprise hybrid architecture with Local Zones, and showed you:

  • Reference architectures to host workloads in Local Zones with low-latency connectivity to corporate data centers and resiliency to enable fail over to the AWS Region automatically.
  • Different design considerations for public and private facing workloads utilizing this hybrid architecture.
  • Segregation and connectivity considerations when extending this hybrid architecture to host multiple workloads.

Hopefully you will be able to follow along with these reference architectures to build and run highly resilient applications with local system interdependencies using Local Zones.

Adding approval notifications to EC2 Image Builder before sharing AMIs

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/adding-approval-notifications-to-ec2-image-builder-before-sharing-amis-2/

This blog post is written by Glenn Chia Jin Wee, Associate Cloud Architect, and Randall Han, Professional Services.

You may be required to manually validate the Amazon Machine Image (AMI) built from an Amazon Elastic Compute Cloud (Amazon EC2) Image Builder pipeline before sharing this AMI to other AWS accounts or to an AWS organization. Currently, Image Builder provides an end-to-end pipeline that automatically shares AMIs after they’ve been built.

In this post, we will walk through the steps to enable approval notifications before AMIs are shared with other AWS accounts. Image Builder supports automated image testing using test components. The recommended best practice is to automate test steps, however situations can arise where test steps become either challenging to automate or internal compliance policies mandate manual checks be conducted prior to distributing images. In such situations, having a manual approval step is useful if you would like to verify the AMI configuration before it is shared to other AWS accounts or an AWS Organization. A manual approval step reduces the potential for sharing an incorrectly configured AMI with other teams which can lead to downstream issues. This solution sends an email with a link to approve or reject the AMI. Users approve the AMI after they’ve verified that it is built according to specifications. Upon approving the AMI, the solution automatically shares it with the specified AWS accounts.

Overview

Architecture diagram

  1. In this solution, an Image Builder pipeline runs and builds a Golden AMI in Account A. After the AMI is built, Image Builder publishes data about the AMI to an Amazon Simple Notification Service (Amazon SNS) topic.
  2. The SNS Topic passes the data to an AWS Lambda function that subscribes to it.
  3. The Lambda function that subscribes to this topic retrieves the data, formats it, and then starts an SSM Automation, passing it the AMI Name and ID.
  4. The first step of the SSM Automation is a manual approval step. The SSM Automation first publishes to an SNS Topic that has an email subscription with the Approver’s email. The approver will receive the email with a URL that they can click to approve the step.
  5. The approval step defines a specific AWS Identity and Access Management (IAM) Role as an approver. This role has the minimum required permissions to approve the manual approval step. After performing manual tests on the Golden AMI, the Approver principal will assume this role.
  6. After assuming this role, the approver will click on the approval link that was sent via email. After approving the step, an AWS Lambda Function is triggered.
  7. This Lambda Function shares the Golden AMI with Account B and sends an email notifying the Target Account Recipients that the AMI has been shared (a sketch of the sharing call follows this list).
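
The GitHub repository contains the actual Lambda code for the solution. As a rough illustration of the sharing step only, a call to the Amazon EC2 ModifyImageAttribute API with the AWS SDK for Java v2 could look like the following sketch; the class and method names and the hard-coded IDs are hypothetical.

import software.amazon.awssdk.services.ec2.Ec2Client;
import software.amazon.awssdk.services.ec2.model.LaunchPermission;
import software.amazon.awssdk.services.ec2.model.LaunchPermissionModifications;
import software.amazon.awssdk.services.ec2.model.ModifyImageAttributeRequest;

public class ShareAmiExample {
    // Grants launch permission on the approved AMI to the target account
    public static void shareAmi(String amiId, String targetAccountId) {
        try (Ec2Client ec2 = Ec2Client.create()) {
            ec2.modifyImageAttribute(ModifyImageAttributeRequest.builder()
                    .imageId(amiId)
                    .launchPermission(LaunchPermissionModifications.builder()
                            .add(LaunchPermission.builder()
                                    .userId(targetAccountId)
                                    .build())
                            .build())
                    .build());
        }
    }

    public static void main(String[] args) {
        shareAmi("ami-0123456789abcdef0", "222222222222"); // placeholder AMI ID and Account B ID
    }
}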

Prerequisites

For this walkthrough, you will need the following:

  • Two AWS accounts – one to host the solution resources, and the second which receives the shared Golden AMI.
    • In the account that hosts the solution, prepare an AWS Identity and Access Management (IAM) principal with the sts:AssumeRole permission. This principal must assume the IAM Role that is listed as an approver in the Systems Manager approval step. The ARN of this IAM principal is used in the AWS CloudFormation Approver parameter; this ARN is added to the trust policy of the approval IAM Role.
    • In addition, in the account hosting the solution, ensure that the IAM principal deploying the CloudFormation template has the required permissions to create the resources in the stack.
  • A new Amazon Virtual Private Cloud (Amazon VPC) will be created from the stack. Make sure that you have fewer than five VPCs in the selected Region.

Walkthrough

In this section, we will guide you through the steps required to deploy the Image Builder solution. The solution is deployed with CloudFormation.

In this scenario, we deploy the solution within the approver’s account. The approval email will be sent to a predefined email address for manual approval, before the newly created AMI is shared to target accounts.

The approver first assumes the approval IAM Role and then selects the approval link. This leads to the Systems Manager approval page. Upon approval, an email notification will be sent to the predefined target account email address, notifying the relevant stakeholders that the AMI has been successfully shared.

The high-level steps we will follow are:

  1. In Account A, deploy the provided AWS CloudFormation template. This includes an example Image Builder Pipeline, Amazon SNS topics, Lambda functions, and an SSM Automation Document.
  2. Approve the SNS subscription from your supplied email address.
  3. Run the pipeline from the Amazon EC2 Image Builder Console.
  4. [Optional] To conduct manual tests, launch an Amazon EC2 instance from the built AMI after the pipeline runs.
  5. An email will be sent to you with options to approve or reject the step. Ensure that you have assumed the IAM Role that is the approver before clicking the approval link that leads to the SSM console approval page.
  6. Upon approving the step, an AWS Lambda function shares the AMI with Account B and also sends an email to the target account email recipients notifying them that the AMI has been shared.
  7. Log in to Account B and verify that the AMI has been shared.

Step 1: Deploy the AWS CloudFormation template

1. The CloudFormation template, template.yaml, that deploys the solution can be found in this GitHub repository. Follow the instructions at the repository to deploy the stack.

Step 2: Verify your email address

  1. After running the deployment, you will receive an email prompting you to confirm the Subscription at the approver email address. Choose Confirm subscription.

SNS Topic Subscription confirmation email

  2. This leads to the following screen, which shows that your subscription is confirmed.

subscription-confirmation

  3. Repeat the previous two steps for the target email address.

Step 3: Run the pipeline from the Image Builder console

  1. In the Image Builder console, under Image pipelines, select the checkbox next to the Pipeline created, choose Actions, and select Run pipeline.

run-image-builder-pipeline

Note: The pipeline takes approximately 20 – 30 minutes to complete.

Step 4: [Optional] Launch an Amazon EC2 instance from the built AMI

If you have a requirement to manually validate the AMI before sharing it with other accounts or with the AWS organization, an approver can launch an Amazon EC2 instance from the built AMI and conduct manual tests on it to make sure that it is functional.

  1. In the Amazon EC2 console, under Images, choose AMIs. Validate that the AMI is created.

ami-in-account-a

  2. Follow the AWS docs: Launching an EC2 instance from a custom AMI for steps on how to launch an Amazon EC2 instance from the AMI.

Step 5: Select the approval URL in the email sent

  1. When the pipeline is run successfully, you will receive another email with a URL to approve the AMI.

approval-email

  2. Before clicking on the Approve link, you must assume the IAM Role that is set as an approver for the Systems Manager step.
  3. In the CloudFormation console, choose the stack that was deployed.

cloudformation-stack

4. Choose Outputs and copy the IAM Role name.

ssm-approval-role-output

5. While logged in as the IAM Principal that has permissions to assume the approval IAM Role, follow the instructions at AWS IAM documentation for switching a role using the console to assume the approval role.
On the Switch Role page, in the Role field, paste the name of the IAM Role that you copied in the previous step.

Note: This IAM Role was deployed with minimum permissions. Hence, seeing warning messages in the console is expected after assuming this role.

switch-role

6. Now in the approval email, select the Approve URL. This leads to the Systems Manager console. Choose Submit.

approve-console

7. After approving the manual step, the second step runs and shares the AMI with the target account.

automation-step-success

Step 6: Verify that the AMI is shared to Account B

  1. Log in to Account B.
  2. In the Amazon EC2 console, under Images, choose AMIs. Then, in the dropdown, choose Private images. Validate that the AMI is shared.

verify-ami-in-account-b

  3. Verify that a success email notification was sent to the target account email address provided.

target-email

Clean up

This section provides the necessary information for deleting various resources created as part of this post.

  1. Deregister the AMIs that were created and shared.
    1. Log in to Account A and follow the steps at AWS documentation: Deregister your Linux AMI.
  2. Delete the CloudFormation stack. For instructions, refer to Deleting a stack on the AWS CloudFormation console.

Conclusion

In this post, we explained how to enable approval notifications for an Image Builder pipeline before AMIs are shared with other accounts. This solution can be extended to share with more than one AWS account or even with an AWS organization. With this solution, you will be notified when new golden images are created, allowing you to verify the accuracy of their configuration before sharing them for wider use. This reduces the possibility of sharing AMIs with misconfigurations that the written tests may not have identified.

We invite you to experiment with different AMIs created using Image Builder, and with different Image Builder components. Check out this GitHub repository for various examples that use Image Builder. Also check out this blog post on Image Builder integrations with EC2 Auto Scaling Instance Refresh. Let us know your questions and findings in the comments, and have fun!

Fine-tuning Operations at Slice using AWS DevOps Guru

Post Syndicated from Adnan Bilwani original https://aws.amazon.com/blogs/devops/fine-tuning-operations-at-slice-using-aws-devops-guru/

This guest post was authored by Sapan Jain, DevOps Engineer at Slice, and edited by Sobhan Archakam and Adnan Bilwani, at AWS.

Slice empowers over 18,000 independent pizzerias with the modern tools that have grown the major restaurant chains. By uniting these small businesses with specialized technology, marketing, data insights, and shared services, Slice enables them to serve their digitally-minded customers and move away from third-party apps. Using Amazon DevOps Guru, Slice is able to fine-tune their operations to better support these customers.

Serial tech entrepreneur Ilir Sela started Slice to modernize and support his family’s New York City pizzerias. Today, the company partners with restaurants in 3,000 cities and all 50 states, forming the nation’s largest pizza network. For more information, visit slicelife.com.

Slice’s challenge

At Slice, we manage a wide variety of systems, services, and platforms, all with varying levels of complexity. Observability, monitoring, and log aggregation are things we excel at, and they’re always critical for our platform engineering team. However, deriving insights from this data still requires some manual investigation, particularly when dealing with operational anomalies and/or misconfigurations.

To gain automated insights into our services and resources, Slice conducted a proof-of-concept utilizing Amazon DevOps Guru to analyze a small selection of AWS resources. Amazon DevOps Guru identified potential issues in our environment, resulting in actionable insights (ultimately leading to remediation). As a result of this analysis, we enabled Amazon DevOps Guru account-wide, thereby leading to numerous insights into our production environment.

Insights with Amazon DevOps Guru

After we configured Amazon DevOps Guru to begin its account-wide analysis, we left the tool alone to begin the process of collecting and analyzing data. We immediately began seeing some actionable insights for various production AWS resources, some of which are highlighted in the following section:

Amazon DynamoDB Point-in-time recovery

Amazon DynamoDB offers a point-in-time recovery (PITR) feature that provides continuous backups of your DynamoDB data for 35 days to help you protect against accidental writes or deletes. If enabled, PITR lets you restore the respective table to a previous state. Amazon DevOps Guru identified several tables in our environment that had PITR disabled, along with a corresponding Recommendation.

Figure 1. The graphic shows proactive insights for the last 1 month. The one insight shown is ‘Dynamo Table Point in Time Recovery not enabled’ with a status of OnGoing and a severity of low.
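
To act on this recommendation programmatically, you can enable PITR with a single UpdateContinuousBackups call. The following minimal sketch uses the AWS SDK for Java v2; the table name is a placeholder and this code is not part of the DevOps Guru output.

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.PointInTimeRecoverySpecification;
import software.amazon.awssdk.services.dynamodb.model.UpdateContinuousBackupsRequest;

public class EnablePitr {
    public static void main(String[] args) {
        try (DynamoDbClient dynamoDb = DynamoDbClient.create()) {
            // Turn on point-in-time recovery for the flagged table (placeholder name)
            dynamoDb.updateContinuousBackups(UpdateContinuousBackupsRequest.builder()
                    .tableName("my-table")
                    .pointInTimeRecoverySpecification(PointInTimeRecoverySpecification.builder()
                            .pointInTimeRecoveryEnabled(true)
                            .build())
                    .build());
        }
    }
}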

ElastiCache anomalous evictions

Amazon ElastiCache for Redis is used by a handful of our services to cache relevant application data. Amazon DevOps Guru identified that one of our instances was exhibiting anomalous behavior in its cache eviction rate. Essentially, due to memory pressure on the instance, the eviction rate of cache entries began to increase. DevOps Guru recommended revisiting the sizing of this instance and scaling it vertically or horizontally, where appropriate.

Figure 2. The graph shows the metric: count of ElastiCache evictions plotted for the time period Jul 3, 20:35 to Jul 3, 21:35 UTC. A highlighted section shows that the evictions increased to a peak of 2500 between 21:00 and 21:08. Outside of this interval the evictions are below 500.

AWS Lambda anomalous errors

We manage a few AWS Lambda functions that all serve different purposes. During the beginning of normal work day, we began to see increased error rates for a particular function resulting in an exception being thrown. DevOps Guru was able to detect the increase in error rates and flag them as anomalous. Although retries in this case wouldn’t have solved the problem, it did increase our visibility into the issue (which was also corroborated by our APM platform).

Figure 3. The graph shows the metric: count of AWS/Lambda errors plotted between 11:00 and 13:30 on Jul 6. The sections between the times 11:23 and 12:15 and at 12:37 and 13:13 UTC are highlighted to show the anomalies.

Conclusion

Amazon DevOps Guru integrated into our environment quickly, with no additional configuration or setup aside from a few button clicks to enable the service. After reviewing several of the proactive insights that DevOps Guru provided, we could formulate plans of action for remediation. One specific example is when DevOps Guru flagged several of our Lambda functions for not containing enough subnets. After triaging the finding, we discovered that we were lacking multi-AZ redundancy for several of those functions. As a result, we could implement a change that maximized the availability of those resources.

With the continuous analysis that DevOps Guru performs, we continue to gain new insights into the resources that we utilize and deploy in our environment. This lets us improve operationally while simultaneously maintaining production stability.

About the author:

Adnan Bilwani

Adnan Bilwani is a Sr. Specialist Builders Experience at AWS and part of the AI for DevOps portfolio of services providing fully managed ML-based solutions to enhance your DevOps workflows.

Sobhan Archakam

Sobhan Archakam is a Senior Technical Account Manager at Amazon Web Services. He provides advocacy and guidance to Enterprise Customers to plan, build, deploy and operate solutions at scale using best practices.

Sapan Jain

Sapan Jain is a DevOps Engineer at Slice. He provides support in all facets of DevOps, and has an interest in performance, observability, automation, and troubleshooting.

Use IAM Access Analyzer policy generation to grant fine-grained permissions for your AWS CloudFormation service roles

Post Syndicated from Joel Knight original https://aws.amazon.com/blogs/security/use-iam-access-analyzer-policy-generation-to-grant-fine-grained-permissions-for-your-aws-cloudformation-service-roles/

AWS Identity and Access Management (IAM) Access Analyzer provides tools to simplify permissions management by making it simpler for you to set, verify, and refine permissions. One such tool is IAM Access Analyzer policy generation, which creates fine-grained policies based on your AWS CloudTrail access activity—for example, the actions you use with Amazon Elastic Compute Cloud (Amazon EC2), AWS Lambda, and Amazon Simple Storage Service (Amazon S3). AWS has expanded policy generation capabilities to support the identification of actions used from over 140 services. New additions include services such as AWS CloudFormation, Amazon DynamoDB, and Amazon Simple Queue Service (Amazon SQS). When you request a policy, IAM Access Analyzer generates a policy by analyzing your CloudTrail logs to identify actions used from this group of over 140 services. The generated policy makes it efficient to grant only the required permissions for your workloads. For other services, Access Analyzer helps you by identifying the services used and guides you to add the necessary actions.

In this post, we will show how you can use Access Analyzer to generate an IAM permissions policy that restricts CloudFormation permissions to only those actions that are necessary to deploy a given template, in order to follow the principle of least privilege.

Permissions for AWS CloudFormation

AWS CloudFormation lets you create a collection of related AWS and third-party resources and provision them in a consistent and repeatable fashion. A common access management pattern is to grant developers permission to use CloudFormation to provision resources in the production environment and limit their ability to do so directly. This directs developers to make infrastructure changes in production through CloudFormation, using infrastructure-as-code patterns to manage the changes.

CloudFormation can create, update, and delete resources on the developer’s behalf by assuming an IAM role that has sufficient permissions. Cloud administrators often grant this IAM role broad permissions–in excess of what’s necessary to just create, update, and delete the resources from the developer’s template–because it’s not clear what the minimum permissions are for the template. As a result, the developer could use CloudFormation to create or modify resources outside of what’s required for their workload.

The best practice for CloudFormation is to acquire permissions by using the credentials from an IAM role you pass to CloudFormation. When you attach a least-privilege permissions policy to the role, the actions CloudFormation is allowed to perform can be scoped to only those that are necessary to manage the resources in the template. In this way, you can avoid anti-patterns such as assigning the AdministratorAccess or PowerUserAccess policies—both of which grant excessive permissions—to the role.

The following section will describe how to set up your account and grant these permissions.

Prepare your development account

Within your development account, you will configure the same method for deploying infrastructure as you use in production: passing a role to CloudFormation when you launch a stack. First, you will verify that you have the necessary permissions, and then you will create the role and the role’s permissions policy.

Get permissions to use CloudFormation and IAM Access Analyzer

You will need the following minimal permissions in your development account:

  • Permission to use CloudFormation, in particular to create, update, and delete stacks
  • Permission to pass an IAM role to CloudFormation
  • Permission to create IAM roles and policies
  • Permission to use Access Analyzer, specifically the GetGeneratedPolicy, ListPolicyGenerations, and StartPolicyGeneration actions

The following IAM permissions policy can be used to grant your identity these permissions.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DeveloperPermissions”,
            "Effect": "Allow",
            "Action": [
                "access-analyzer:GetGeneratedPolicy",
                "access-analyzer:ListPolicyGenerations",
                "access-analyzer:StartPolicyGeneration",
                "cloudformation:*",
                "iam:AttachRolePolicy",
                "iam:CreatePolicy",
                "iam:CreatePolicyVersion",
                "iam:CreateRole",
                "iam:DeletePolicyVersion",
                "iam:DeleteRolePolicy",
                "iam:DetachRolePolicy",
                "iam:GetPolicy",
                "iam:GetPolicyVersion",
                "iam:GetRole",
                "iam:GetRolePolicy",
                "iam:ListPolicies",
                "iam:ListPolicyTags",
                "iam:ListPolicyVersions",
                "iam:ListRolePolicies",
                "iam:ListRoleTags",
                "iam:ListRoles",
                "iam:PutRolePolicy",
                "iam:UpdateAssumeRolePolicy"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowPassCloudFormationRole”,
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": "cloudformation.amazonaws.com"
                }
            }
        }
    ]
}

Note: If your identity already has these permissions through existing permissions policies, there is no need to apply the preceding policy to your identity.

Create a role for CloudFormation

Creating a service role for CloudFormation in the development account makes it less challenging to generate the least-privilege policy, because it becomes simpler to identify the actions CloudFormation is taking as it creates and deletes resources defined in the template. By identifying the actions CloudFormation has taken, you can create a permissions policy to match.

To create an IAM role in your development account for CloudFormation

  1. Open the IAM console and choose Roles, then choose Create role.
  2. For the trusted entity, choose AWS service. From the list of services, choose CloudFormation.
  3. Choose Next: Permissions.
  4. Select one or more permissions policies that align with the types of resources your stack will create. For example, if your stack creates a Lambda function and an IAM role, choose the AWSLambda_FullAccess and IAMFullAccess policies.

    Note: Because you have not yet created the least-privilege permissions policy, the role is granted broader permissions than required. You will use this role to launch your stack and evaluate the resulting actions that CloudFormation takes, in order to build a lower-privilege policy.

  5. Choose Next: Tags to proceed.
  6. Enter one or more optional tags, and then choose Next: Review.
  7. Enter a name for the role, such as CloudFormationDevExecRole.
  8. Choose Create role.

Create and destroy the stack

To have CloudFormation exercise the actions required by the stack, you will need to create and destroy the stack.

To create and destroy the stack

  1. Navigate to CloudFormation in the console, expand the menu in the left-hand pane, and choose Stacks.
  2. On the Stacks page, choose Create Stack, and then choose With new resources.
  3. Choose Template is ready, choose Upload a template file, and then select the file for your template. Choose Next.
  4. Enter a Stack name, and then choose Next.
  5. For IAM execution role name, select the name of the role you created in the previous section (CloudFormationDevExecRole). Choose Next.
  6. Review the stack configuration. If present, select the check box(es) in the Capabilities section, and then choose Create stack.
  7. Wait for the stack to reach the CREATE_COMPLETE state before continuing.
  8. From the list of stacks, select the stack you just created, choose Delete, and then choose Delete stack.
  9. Wait until the stack reaches the DELETE_COMPLETE state (at which time it will also disappear from the list of active stacks).

Note: It’s recommended that you also modify the CloudFormation template and update the stack to initiate updates to the deployed resources. This will allow Access Analyzer to capture actions that update the stack’s resources, in addition to create and delete actions. You should also review the API documentation for the resources that are being used in your stack and identify any additional actions that may be required.

Now that the development environment is ready, you can create the least-privilege permissions policy for the CloudFormation role.

Use Access Analyzer to generate a fine-grained identity policy

Access Analyzer reviews the access history in AWS CloudTrail to identify the actions an IAM role has used. Because CloudTrail delivers logs within an average of about 15 minutes of an API call, you should wait at least that long after you delete the stack before you attempt to generate the policy, in order to properly capture all of the actions.

Note: CloudTrail must be enabled in your AWS account in order for policy generation to work. To learn how to create a CloudTrail trail, see Creating a trail for your AWS account in the AWS CloudTrail User Guide.

To generate a permissions policy by using Access Analyzer

  1. Open the IAM console and choose Roles. In the search box, enter CloudFormationDevExecRole and select the role name in the list.
  2. On the Permissions tab, scroll down and choose Generate policy based on CloudTrail events to expand this section. Choose Generate policy.
  3. Select the time period of the CloudTrail logs you want analyzed.
  4. Select the AWS Region where you created and deleted the stack, and then select the CloudTrail trail name in the drop-down list.
  5. If this is your first time generating a policy, choose Create and use a new service role to have an IAM role automatically created for you. You can view the permissions policy the role will receive by choosing View permission details. Otherwise, choose Use an existing service role and select a role in the drop-down list.

    The policy generation options are shown in Figure 1.

    Figure 1: Policy generation options

  6. Choose Generate policy.

You will be redirected back to the page that shows the CloudFormationDevExecRole role. The Status in the Generate policy based on CloudTrail events section will show In progress. Wait for the policy to be generated, at which time the status will change to Success.
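
If you prefer to start policy generation with the API instead of the console, the following is a minimal sketch using the AWS SDK for Java v2. The ARNs and the time window are placeholders, and the access role here stands for the service role that Access Analyzer uses to read your CloudTrail trail; this is an illustrative assumption, not a required part of the walkthrough.

import java.time.Instant;
import java.time.temporal.ChronoUnit;
import software.amazon.awssdk.services.accessanalyzer.AccessAnalyzerClient;
import software.amazon.awssdk.services.accessanalyzer.model.CloudTrailDetails;
import software.amazon.awssdk.services.accessanalyzer.model.PolicyGenerationDetails;
import software.amazon.awssdk.services.accessanalyzer.model.StartPolicyGenerationRequest;
import software.amazon.awssdk.services.accessanalyzer.model.Trail;

public class StartPolicyGenerationExample {
    public static void main(String[] args) {
        try (AccessAnalyzerClient analyzer = AccessAnalyzerClient.create()) {
            // Ask Access Analyzer to analyze the role's CloudTrail activity for the last day
            String jobId = analyzer.startPolicyGeneration(StartPolicyGenerationRequest.builder()
                    .policyGenerationDetails(PolicyGenerationDetails.builder()
                            .principalArn("arn:aws:iam::123456789012:role/CloudFormationDevExecRole")
                            .build())
                    .cloudTrailDetails(CloudTrailDetails.builder()
                            .trails(Trail.builder()
                                    .cloudTrailArn("arn:aws:cloudtrail:us-east-1:123456789012:trail/my-trail")
                                    .allRegions(true)
                                    .build())
                            .accessRole("arn:aws:iam::123456789012:role/AccessAnalyzerCloudTrailRole")
                            .startTime(Instant.now().minus(1, ChronoUnit.DAYS))
                            .endTime(Instant.now())
                            .build())
                    .build())
                    .jobId();
            System.out.println("Policy generation job started: " + jobId);
        }
    }
}

You can then poll GetGeneratedPolicy with the returned job ID, or continue with the console steps that follow.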

Review the generated policy

You must review and save the generated policy before it can be applied to the role.

To review the generated policy

  1. While you are still viewing the CloudFormationDevExecRole role in the IAM console, under Generate policy based on CloudTrail events, choose View generated policy.
  2. The Generated policy page will open. The Actions included in the generated policy section will show a list of services and one or more actions that were found in the CloudTrail log. Review the list for omissions. Refer to the IAM documentation for a list of AWS services for which an action-level policy can be generated. An example list of services and actions for a CloudFormation template that creates a Lambda function is shown in Figure 2.
    Figure 2: Actions included in the generated policy

  3. Use the drop-down menus in the Add actions for services used section to add any necessary additional actions to the policy for the services that were identified by using CloudTrail. This might be needed if an action isn’t recorded in CloudTrail or if action-level information isn’t supported for a service.

    Note: The iam:PassRole action will not show up in CloudTrail and should be added manually if your CloudFormation template assigns an IAM role to a service (for example, when creating a Lambda function). A good rule of thumb is: If you see iam:CreateRole in the actions, you likely need to also allow iam:PassRole. An example of this is shown in Figure 3.

    Figure 3: Adding PassRole as an IAM action

  4. When you’ve finished adding additional actions, choose Next.

Generated policies contain placeholders that need to be filled in with resource names, AWS Region names, and other variable data. The actual values for these placeholders should be determined based on the content of your CloudFormation template and the Region or Regions you plan to deploy the template to.

To replace placeholders with real values

  • In the generated policy, identify each of the Resource properties that use placeholders in the value, such as ${RoleNameWithPath} or ${Region}. Use your knowledge of the resources that your CloudFormation template creates to properly fill these in with real values.
    • ${RoleNameWithPath} is an example of a placeholder that reflects the name of a resource from your CloudFormation template. Replace the placeholder with the actual name of the resource.
    • ${Region} is an example of a placeholder that reflects where the resource is being deployed, which in this case is the AWS Region. Replace this with either the Region name (for example, us-east-1), or a wildcard character (*), depending on whether you want to restrict the policy to a specific Region or to all Regions, respectively.

For example, a statement from the policy generated earlier is shown following.

{
    "Effect": "Allow",
    "Action": [
        "lambda:CreateFunction",
        "lambda:DeleteFunction",
        "lambda:GetFunction",
        "lambda:GetFunctionCodeSigningConfig"
    ],
    "Resource": "arn:aws:lambda:${Region}:${Account}:function:${FunctionName}"
},

After substituting real values for the placeholders in Resource, it looks like the following.

{
    "Effect": "Allow",
    "Action": [
        "lambda:CreateFunction",
        "lambda:DeleteFunction",
        "lambda:GetFunction",
        "lambda:GetFunctionCodeSigningConfig"
    ],
    "Resource": "arn:aws:lambda:*:123456789012:function:MyLambdaFunction"
},

This statement allows the Lambda actions to be performed on a function named MyLambdaFunction in AWS account 123456789012 in any Region (*). Substitute the correct values for Region, Account, and FunctionName in your policy.

The IAM policy editor window will automatically identify security or other issues in the generated policy. Review and remediate the issues identified in the Security, Errors, Warnings, and Suggestions tabs across the bottom of the window.

To review and remediate policy issues

  1. Use the Errors tab at the bottom of the IAM policy editor window (powered by IAM Access Analyzer policy validation) to help identify any placeholders that still need to be replaced. Access Analyzer policy validation reviews the policy and provides findings that include security warnings, errors, general warnings, and suggestions for your policy. To find more information about the different checks, see Access Analyzer policy validation. An example of policy errors caused by placeholders still being present in the policy is shown in Figure 4.
    Figure 4: Errors identified in the generated policy

  2. Use the Security tab at the bottom of the editor window to review any security warnings, such as passing a wildcard (*) resource with the iam:PassRole permission. Choose the Learn more link beside each warning for information about remediation. An example of a security warning related to PassRole is shown in Figure 5.
    Figure 5: Security warnings identified in the generated policy

Remediate the PassRole With Star In Resource warning by modifying Resource in the iam:PassRole statement to list the Amazon Resource Name (ARN) of any roles that CloudFormation needs to pass to other services. Additionally, add a condition to restrict which service the role can be passed to. For example, to allow passing a role named MyLambdaRole to the Lambda service, the statement would look like the following.

        {
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::123456789012:role/MyLambdaRole"
            ],
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": [
                        "lambda.amazonaws.com"
                    ]
                }
            }
        }

The generated policy can now be saved as an IAM policy.

To save the generated policy

  1. In the IAM policy editor window, choose Next.
  2. Enter a name for the policy and an optional description.
  3. Review the Summary section with the list of permissions in the policy.
  4. Enter optional tags in the Tags section.
  5. Choose Create and attach policy.

Test this policy by replacing the existing role policy with the newly generated policy, and then create and destroy the stack again to confirm that the policy grants the necessary permissions. If the stack fails during creation or deletion, follow the steps to generate the policy again and make sure that the correct values are being used for any iam:PassRole statements.

Deploy the CloudFormation role and policy

Now that you have the least-privilege policy created, you can give this policy to the cloud administrator so that they can deploy the policy and CloudFormation service role into production.

To create a CloudFormation template that the cloud administrator can use

  1. Open the IAM console, choose Policies, and then use the search box to search for the policy you created. Select the policy name in the list.
  2. On the Permissions tab, make sure that the {}JSON button is activated. Select the policy document by highlighting from line 1 all the way to the last line in the policy, as shown in Figure 6.
    Figure 6: Highlighting the generated policy

  3. With the policy still highlighted, use your keyboard to copy the policy into the clipboard (Ctrl-C on Linux or Windows, Cmd-C on macOS).
  4. Paste the permissions policy JSON object into the following CloudFormation template, replacing the <POLICY-JSON-GOES-HERE> marker. Be sure to indent the left-most curly braces of the JSON object so that they are to the right of the PolicyDocument keyword.
    AWSTemplateFormatVersion: '2010-09-09'
    
    Parameters:
      PolicyName:
        Type: String
        Description: The name of the IAM policy that will be created
    
      RoleName:
        Type: String
        Description: The name of the IAM role that will be created
    
    Resources:
      CfnPolicy:
        Type: AWS::IAM::ManagedPolicy
        Properties:
          ManagedPolicyName: !Ref PolicyName
          Path: /
          PolicyDocument: >
            <POLICY-JSON-GOES-HERE>
    
      CfnRole:
        Type: AWS::IAM::Role
        Properties:
          RoleName: !Ref RoleName
          AssumeRolePolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Action:
                - sts:AssumeRole
                Effect: Allow
                Principal:
                  Service:
                    - cloudformation.amazonaws.com
          ManagedPolicyArns:
            - !Ref CfnPolicy
          Path: /

    For example, after pasting the policy, the CfnPolicy resource in the template will look like the following.

    CfnPolicy:
        Type: AWS::IAM::ManagedPolicy
        Properties:
          ManagedPolicyName: !Ref PolicyName
          Path: /
          PolicyDocument: >
            {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Action": "ec2:DescribeNetworkInterfaces",
                        "Resource": [
                            "*"
                        ]
                    },
                    {
                        "Effect": "Allow",
                        "Action": [
                            "iam:AttachRolePolicy",
                            "iam:CreateRole",
                            "iam:DeleteRole",
                            "iam:DetachRolePolicy",
                            "iam:GetRole"
                        ],
                        "Resource": [
                            "arn:aws:iam::123456789012:role/MyLambdaRole"
                        ]
                    },
                    {
                        "Effect": "Allow",
                        "Action": [
                            "lambda:CreateFunction",
                            "lambda:DeleteFunction",
                            "lambda:GetFunction",
                            "lambda:GetFunctionCodeSigningConfig"
                        ],
                        "Resource": [
                            "arn:aws:lambda:*:123456789012:function:MyLambdaFunction"
                        ]
                    },
                    {
                        "Effect": "Allow",
                        "Action": [
                            "iam:PassRole"
                        ],
                        "Resource": [
                            "arn:aws:iam::123456789012:role/MyLambdaRole"
                        ],
                        "Condition": {
                            "StringEquals": {
                                "iam:PassedToService": [
                                    "lambda.amazonaws.com"
                                ]
                            }
                        }
                    }
                ]
            }

  5. Save the CloudFormation template and share it with the cloud administrator. They can use this template to create an IAM role and permissions policy that CloudFormation can use to deploy resources in the production account.

Note: Verify that in addition to having the necessary permissions to work with CloudFormation, your production identity also has permission to perform the iam:PassRole action with CloudFormation for the role that the preceding template creates.

As you continue to develop your stack, you will need to repeat the steps in the Use Access Analyzer to generate a fine-grained identity policy and Deploy the CloudFormation role and policy sections of this post in order to make sure that the permissions policy remains up to date with the permissions required to deploy your stack.

Considerations

If your CloudFormation template uses custom resources that are backed by AWS Lambda, you should also run Access Analyzer on the IAM role that is created for the Lambda function in order to build an appropriate permissions policy for that role.

To generate a permissions policy for a Lambda service role

  1. Launch the stack in your development AWS account to create the Lambda function’s role.
  2. Make a note of the name of the role that was created.
  3. Destroy the stack in your development AWS account.
  4. Follow the instructions from the Use Access Analyzer to generate a fine-grained identity policy and Review the generated policy sections of this post to create the least-privilege policy for the role, substituting the Lambda function’s role name for CloudFormationDevExecRole.
  5. Build the resulting least-privilege policy into the CloudFormation template as the Lambda function’s permission policy.

Conclusion

IAM Access Analyzer helps generate fine-grained identity policies that you can use to grant CloudFormation the permissions it needs to create, update, and delete resources in your stack. By granting CloudFormation only the necessary permissions, you can incorporate the principle of least privilege, developers can deploy their stacks in production using reduced permissions, and cloud administrators can create guardrails for developers in production settings.

For additional information on applying the principle of least privilege to AWS CloudFormation, see How to implement the principle of least privilege with CloudFormation StackSets.

If you have feedback about this blog post, submit comments in the Comments section below. You can also start a new thread on AWS Identity and Access Management re:Post to get answers from the community.

Want more AWS Security news? Follow us on Twitter.

Author

Joel Knight

Joel is a Senior Consultant, Infrastructure Architecture, with AWS and is based in Calgary, Canada. When not wrangling infrastructure-as-code templates, Joel likes to spend time with his family and dabble in home automation.

Mathangi Ramesh

Mathangi is the product manager for AWS Identity and Access Management. She enjoys talking to customers and working with data to solve problems. Outside of work, Mathangi is a fitness enthusiast and a Bharatanatyam dancer. She holds an MBA degree from Carnegie Mellon University.

Common streaming data enrichment patterns in Amazon Kinesis Data Analytics for Apache Flink

Post Syndicated from Ali Alemi original https://aws.amazon.com/blogs/big-data/common-streaming-data-enrichment-patterns-in-amazon-kinesis-data-analytics-for-apache-flink/

Stream data processing allows you to act on data in real time. Real-time data analytics can help you have on-time and optimized responses while improving overall customer experience.

Apache Flink is a distributed computation framework that allows for stateful real-time data processing. It provides a single set of APIs for building batch and streaming jobs, making it easy for developers to work with bounded and unbounded data. Apache Flink provides different levels of abstraction to cover a variety of event processing use cases.

Amazon Kinesis Data Analytics is an AWS service that provides a serverless infrastructure for running Apache Flink applications. This makes it easy for developers to build highly available, fault tolerant, and scalable Apache Flink applications without needing to become an expert in building, configuring, and maintaining Apache Flink clusters on AWS.

Data streaming workloads often require data in the stream to be enriched via external sources (such as databases or other data streams). For example, assume you are receiving coordinates data from a GPS device and need to understand how these coordinates map to physical geographic locations; you need to enrich the stream with geolocation data. You can use several approaches to enrich your real-time data in Kinesis Data Analytics depending on your use case and Apache Flink abstraction level. Each method has different effects on throughput, network traffic, and CPU (or memory) utilization. In this post, we cover these approaches and discuss their benefits and drawbacks.

Data enrichment patterns

Data enrichment is a process that appends additional context to, and thereby enhances, the collected data. The additional data is often collected from a variety of sources, in different formats, and with update frequencies that range from once a month to many times a second. The following table shows a few examples of different sources, formats, and update frequencies.

Data | Format | Update Frequency
IP address ranges by country | CSV | Once a month
Company organization chart | JSON | Twice a year
Machine names by ID | CSV | Once a day
Employee information | Table (Relational database) | A few times a day
Customer information | Table (Non-relational database) | A few times an hour
Customer orders | Table (Relational database) | Many times a second

Based on the use case, your data enrichment application may have different requirements in terms of latency, throughput, or other factors. The remainder of the post dives deeper into different patterns of data enrichment in Kinesis Data Analytics, which are listed in the following table with their key characteristics. You can choose the best pattern based on the trade-off of these characteristics.

Enrichment Pattern | Latency | Throughput | Accuracy if Reference Data Changes | Memory Utilization | Complexity
Pre-load reference data in Apache Flink Task Manager memory | Low | High | Low | High | Low
Partitioned pre-loading of reference data in Apache Flink state | Low | High | Low | Low | Low
Periodic partitioned pre-loading of reference data in Apache Flink state | Low | High | Medium | Low | Medium
Per-record asynchronous lookup with unordered map | Medium | Medium | High | Low | Low
Per-record asynchronous lookup from an external cache system | Low or Medium (depending on cache storage and implementation) | Medium | High | Low | Medium
Enriching streams using the Table API | Low | High | High | Low to Medium (depending on the selected join operator) | Low

Enrich streaming data by pre-loading the reference data

When the reference data is small in size and static in nature (for example, country data including country code and country name), it’s recommended to enrich your streaming data by pre-loading the reference data, which you can do in several ways.

To see the code implementation for pre-loading reference data in various ways, refer to the GitHub repo. Follow the instructions in the GitHub repository to run the code and understand the data model.

Pre-loading of reference data in Apache Flink Task Manager memory

The simplest and also fastest enrichment method is to load the enrichment data into each of the Apache Flink task managers’ on-heap memory. To implement this method, you create a new class by extending the RichFlatMapFunction abstract class. You define a global static variable in your class definition. The variable can be of any type; the only limitation is that it should implement java.io.Serializable (for example, java.util.HashMap). Within the open() method, you define the logic that loads the static data into your defined variable. The open() method is always called first, during the initialization of each task in Apache Flink’s task managers, which makes sure the whole reference data is loaded before the processing begins. You then implement your processing logic by overriding the flatMap() method, accessing the reference data by its key from the defined global variable.
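
The following is a minimal sketch of this pattern, assuming a simple stream of country codes and a hypothetical loadCountryCodes() helper that reads a small reference file (for example, from Amazon S3); it is illustrative and not the code from the GitHub repository.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

import java.util.HashMap;
import java.util.Map;

public class PreLoadCountryEnrichment extends RichFlatMapFunction<String, Tuple2<String, String>> {

    // Static reference data shared by all subtasks running in the same task manager JVM
    private static Map<String, String> countryByCode;

    @Override
    public void open(Configuration parameters) {
        // Load the reference data once, before any records are processed
        countryByCode = loadCountryCodes();
    }

    @Override
    public void flatMap(String countryCode, Collector<Tuple2<String, String>> out) {
        // Enrich each record with the country name looked up from the pre-loaded map
        out.collect(Tuple2.of(countryCode, countryByCode.getOrDefault(countryCode, "UNKNOWN")));
    }

    private static Map<String, String> loadCountryCodes() {
        // Placeholder: in a real application, read and parse the reference file here
        Map<String, String> data = new HashMap<>();
        data.put("US", "United States");
        data.put("DE", "Germany");
        return data;
    }
}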

The following architecture diagram shows the full reference data load in each task slot of the task manager.

This method has the following benefits:

  • Easy to implement
  • Low latency
  • Can support high throughput

However, it has the following disadvantages:

  • If the reference data is large in size, the Apache Flink task manager may run out of memory.
  • Reference data can become stale over a period of time.
  • Multiple copies of the same reference data are loaded in each task slot of the task manager.
  • Reference data should be small to fit in the memory allocated to a single task slot. In Kinesis Data Analytics, each Kinesis Processing Unit (KPU) has 4 GB of memory, out of which 3 GB can be used for heap memory. If ParallelismPerKPU in Kinesis Data Analytics is set to 1, one task slot runs in each task manager, and the task slot can use the whole 3 GB of heap memory. If ParallelismPerKPU is set to a value greater than 1, the 3 GB of heap memory is distributed across multiple task slots in the task manager. If you’re deploying Apache Flink in Amazon EMR or in a self-managed mode, you can tune taskmanager.memory.task.heap.size to increase the heap memory of a task manager.

Partitioned pre-loading of reference data in Apache Flink State

In this approach, the reference data is loaded and kept in the Apache Flink state store at the start of the Apache Flink application. To optimize the memory utilization, first the main data stream is divided by a specified field via the keyBy() operator across all task slots. Furthermore, only the portion of the reference data that corresponds to each task slot is loaded in the state store.

This is achieved in Apache Flink by creating the class PartitionPreLoadEnrichmentData, extending the RichFlatMapFunction abstract class. Within the open method, you create a ValueStateDescriptor and use it to obtain a state handle from the runtime context. In the referenced example, the descriptor is named locationRefData, the state key type is String, and the value type is Location. In this code, we use ValueState rather than MapState because we only hold the location reference data for a particular key. For example, when we query Amazon S3 to get the location reference data, we query for the specific role and get a particular location as a value.

In Apache Flink, ValueState is used to hold a specific value for a key, whereas MapState is used to hold a combination of key-value pairs.

This technique is useful when you have a large static dataset that is difficult to fit in memory as a whole for each partition.
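
A minimal sketch of this pattern follows, assuming the stream is keyed by role (a String) and that a hypothetical fetchBuildingForRole() helper stands in for the Amazon S3 lookup used in the GitHub example; here the per-key reference data is loaded lazily on first use and then served from keyed state.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class PartitionPreLoadSketch extends RichFlatMapFunction<String, Tuple2<String, String>> {

    // Holds one building number per key (role), which is why ValueState is sufficient
    private transient ValueState<String> locationRefData;

    @Override
    public void open(Configuration parameters) {
        locationRefData = getRuntimeContext().getState(
                new ValueStateDescriptor<>("locationRefData", String.class));
    }

    @Override
    public void flatMap(String role, Collector<Tuple2<String, String>> out) throws Exception {
        String buildingNo = locationRefData.value();
        if (buildingNo == null) {
            // Load only the slice of reference data that belongs to this key
            buildingNo = fetchBuildingForRole(role);
            locationRefData.update(buildingNo);
        }
        out.collect(Tuple2.of(role, buildingNo));
    }

    private String fetchBuildingForRole(String role) {
        // Placeholder for the actual reference data lookup (for example, against Amazon S3)
        return "building-1";
    }
}

Apply it after partitioning the stream, for example: stream.keyBy(role -> role).flatMap(new PartitionPreLoadSketch()).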

The following architecture diagram shows the load of reference data for the specific key for each partition of the stream.

For example, our reference data in the sample GitHub code has roles which are mapped to each building. Because the stream is partitioned by roles, only the specific building information per role is required to be loaded for each partition as the reference data.

This method has the following benefits:

  • Low latency.
  • Can support high throughput.
  • Reference data for specific partition is loaded in the keyed state.
  • In Kinesis Data Analytics, the default state store configured is RocksDB. RocksDB can utilize a significant portion of 1 GB of managed memory and 50 GB of disk space provided by each KPU. This provides enough room for the reference data to grow.

However, it has the following disadvantages:

  • Reference data can become stale over a period of time

Periodic partitioned pre-loading of reference data in Apache Flink State

This approach is a refinement of the previous technique, where each partition’s reference data is reloaded on a periodic basis to keep it fresh. This is useful if your reference data changes occasionally.

The following architecture diagram shows the periodic load of reference data for the specific key for each partition of the stream.

In this approach, the class PeriodicPerPartitionLoadEnrichmentData is created, extending the KeyedProcessFunction class. Similar to the previous pattern, in the context of the GitHub example, ValueState is recommended here because each partition only loads a single value for the key. In the same way as mentioned earlier, in the open method, you define the ValueStateDescriptor to handle the value state and use the runtime context to access the state.

Within the processElement method, load the value state and attach the reference data (in the referenced GitHub example, the buildingNo) to the customer data. Also register a processing-time timer to be invoked when the processing time passes the given time. In the sample code, the timer is scheduled to be invoked periodically (for example, every 60 seconds). In the onTimer method, update the state by making a call to reload the reference data for the specific role.
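
The following minimal sketch shows the shape of this pattern with a hypothetical fetchBuildingForRole() lookup; the 60-second interval and the key and value types are illustrative and not the exact code from the repository.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class PeriodicPerPartitionReloadSketch
        extends KeyedProcessFunction<String, String, Tuple2<String, String>> {

    private static final long RELOAD_INTERVAL_MS = 60_000L; // refresh every 60 seconds

    private transient ValueState<String> locationRefData;

    @Override
    public void open(Configuration parameters) {
        locationRefData = getRuntimeContext().getState(
                new ValueStateDescriptor<>("locationRefData", String.class));
    }

    @Override
    public void processElement(String role, Context ctx, Collector<Tuple2<String, String>> out)
            throws Exception {
        String buildingNo = locationRefData.value();
        if (buildingNo == null) {
            buildingNo = fetchBuildingForRole(role);
            locationRefData.update(buildingNo);
            // Schedule the first refresh for this key
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + RELOAD_INTERVAL_MS);
        }
        out.collect(Tuple2.of(role, buildingNo));
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple2<String, String>> out)
            throws Exception {
        // Reload the reference data for this key and schedule the next refresh
        locationRefData.update(fetchBuildingForRole(ctx.getCurrentKey()));
        ctx.timerService().registerProcessingTimeTimer(timestamp + RELOAD_INTERVAL_MS);
    }

    private String fetchBuildingForRole(String role) {
        return "building-1"; // placeholder lookup
    }
}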

This method has the following benefits:

  • Low latency.
  • Can support high throughput.
  • Reference data for specific partitions is loaded in the keyed state.
  • Reference data is refreshed periodically.
  • In Kinesis Data Analytics, the default state store configured is RocksDB. RocksDB can also use the 50 GB of disk space provided by each KPU. This provides enough room for the reference data to grow.

However, it has the following disadvantages:

  • If the reference data changes frequently, the application still has stale data depending on how frequently the state is reloaded
  • The application can face load spikes during reload of reference data

Enrich streaming data using per-record lookup

Although pre-loading of reference data provides low latency and high throughput, it may not be suitable for certain types of workloads, such as the following:

  • Reference data updates with high frequency
  • Apache Flink needs to make an external call to compute the business logic
  • Accuracy of the output is important and the application shouldn’t use stale data

Normally, for these types of use cases, developers trade off high throughput and low latency for data accuracy. In this section, you learn about a few common implementations for per-record data enrichment and their benefits and disadvantages.

Per-record asynchronous lookup with unordered map

In a synchronous per-record lookup implementation, the Apache Flink application has to wait until it receives the response after sending every request. This causes the processor to stay idle for a significant portion of the processing time. Instead, the application can send requests for other elements in the stream while it waits for the response for the first element. This way, the wait time is amortized across multiple requests, which increases processing throughput. Apache Flink provides asynchronous I/O for external data access. While using this pattern, you have to decide between unorderedWait (where it emits the result to the next operator as soon as the response is received, disregarding the order of the elements on the stream) and orderedWait (where it waits until all in-flight I/O operations complete, then sends the results to the next operator in the same order as the original elements were placed on the stream). Usually, when downstream consumers disregard the order of the elements in the stream, unorderedWait provides better throughput and less idle time. Visit Enrich your data stream asynchronously using Kinesis Data Analytics for Apache Flink to learn more about this pattern.
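
A minimal sketch of the unordered variant follows; the lookupAsync() helper is a stand-in for a non-blocking client call (for example, an asynchronous DynamoDB client), and the timeout and capacity values are illustrative.

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class AsyncLookupEnrichment extends RichAsyncFunction<String, String> {

    @Override
    public void asyncInvoke(String key, ResultFuture<String> resultFuture) {
        // Start the non-blocking lookup and complete the result when the response arrives
        lookupAsync(key).whenComplete((value, error) -> {
            if (error != null) {
                resultFuture.completeExceptionally(error);
            } else {
                resultFuture.complete(Collections.singleton(key + "," + value));
            }
        });
    }

    private CompletableFuture<String> lookupAsync(String key) {
        // Placeholder for an asynchronous client call to the external reference data store
        return CompletableFuture.completedFuture("building-1");
    }

    // Wiring the function into the pipeline with unordered results:
    public static DataStream<String> enrich(DataStream<String> input) {
        return AsyncDataStream.unorderedWait(
                input, new AsyncLookupEnrichment(), 1_000, TimeUnit.MILLISECONDS, 100);
    }
}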

The following architecture diagram shows how an Apache Flink application on Kinesis Data Analytics does asynchronous calls to an external database engine (for example Amazon DynamoDB) for every event in the main stream.

This method has the following benefits:

  • Still reasonably simple and easy to implement
  • Reads the most up-to-date reference data

However, it has the following disadvantages:

  • It generates a heavy read load for the external system (for example, a database engine or an external API) that hosts the reference data
  • Overall, it might not be suitable for systems that require high throughput with low latency

Per-record asynchronous lookup from an external cache system

A way to enhance the previous pattern is to use a cache system to enhance the read time for every lookup I/O call. You can use Amazon ElastiCache for caching, which accelerates application and database performance, or as a primary data store for use cases that don’t require durability like session stores, gaming leaderboards, streaming, and analytics. ElastiCache is compatible with Redis and Memcached.

For this pattern to work, you must implement a caching pattern for populating data in the cache storage. You can choose between a proactive or reactive approach depending on your application objectives and latency requirements. For more information, refer to Caching patterns.

The following architecture diagram shows how an Apache Flink application calls to read the reference data from an external cache storage (for example, Amazon ElastiCache for Redis). Data changes must be replicated from the main database (for example, Amazon Aurora) to the cache storage by implementing one of the caching patterns.

Implementation for this data enrichment pattern is similar to the per-record asynchronous lookup pattern; the only difference is that the Apache Flink application makes a connection to the cache storage, instead of connecting to the primary database.

This method has the following benefits:

  • Better throughput because caching can accelerate application and database performance
  • Protects the primary data source from the read traffic created by the stream processing application
  • Can provide lower read latency for every lookup call
  • Overall, might not be suitable for medium to high throughput systems that want to improve data freshness

However, it has the following disadvantages:

  • Additional complexity of implementing a cache pattern for populating and syncing the data between the primary database and the cache storage
  • There is a chance for the Apache Flink stream processing application to read stale reference data depending on what caching pattern is implemented
  • Depending on the chosen cache pattern (proactive or reactive), the response time for each enrichment I/O may differ, therefore the overall processing time of the stream could be unpredictable

Alternatively, you can avoid these complexities by using the Apache Flink JDBC connector for Flink SQL APIs. We discuss enrichment stream data via Flink SQL APIs in more detail later in this post.

Enrich stream data via another stream

In this pattern, the data in the main stream is enriched with the reference data in another data stream. This pattern is good for use cases in which the reference data is updated frequently and it’s possible to perform change data capture (CDC) and publish the events to a data streaming service such as Apache Kafka or Amazon Kinesis Data Streams. This pattern is useful in the following use cases, for example:

  • Customer purchase orders are published to a Kinesis data stream and are then joined with customer billing information from a DynamoDB stream
  • Data events captured from IoT devices should be enriched with reference data in a table in Amazon Relational Database Service (Amazon RDS)
  • Network log events should be enriched with the machine names of the source (and destination) IP addresses

The following architecture diagram shows how an Apache Flink application on Kinesis Data Analytics joins data in the main stream with the CDC data in a DynamoDB stream.

To enrich streaming data from another stream, we use common stream-to-stream join patterns, which we explain in the following sections.

Enrich streams using the Table API

Apache Flink Table APIs provide higher abstraction for working with data events. With Table APIs, you can define your data stream as a table and attach the data schema to it.

In this pattern, you define tables for each data stream and then join those tables to achieve the data enrichment goals. Apache Flink Table APIs support different types of join conditions, like inner join and outer join. However, you should avoid unrestricted inner and outer joins on unbounded streams because they are resource intensive. To limit the resource utilization and run joins effectively, you should use either interval or temporal joins. An interval join requires one equi-join predicate and a join condition that bounds the time on both sides. To better understand how to implement an interval join, refer to Get started with Apache Flink SQL APIs in Kinesis Data Analytics Studio.
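
For illustration only (the Orders and Shipments tables are hypothetical and assumed to have event-time attributes with watermarks), an interval join in Flink SQL could look like the following:

SELECT o.order_id, s.ship_time
FROM Orders AS o
JOIN Shipments AS s
    ON o.order_id = s.order_id
    AND o.order_time BETWEEN s.ship_time - INTERVAL '4' HOUR AND s.ship_time;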

Unlike interval joins, temporal table joins don’t keep a time window within which multiple versions of a record are retained. Records from the main stream are always joined with the corresponding version of the reference data at the time specified by the watermark, so fewer versions of the reference data need to be kept in state.

Note that the reference data may or may not have a time element associated with it. If it doesn’t, you may need to add a processing time element for the join with the time-based stream.

In the following example code snippet, the update_time column is added to the currency_rates reference table from the change data capture metadata (for example, Debezium), and it’s used to define a watermark strategy for the table.

CREATE TABLE currency_rates (
    currency STRING,
    conversion_rate DECIMAL(32, 2),
    update_time TIMESTAMP(3) METADATA FROM 'value.source.timestamp' VIRTUAL,
    WATERMARK FOR update_time AS update_time,
    PRIMARY KEY(currency) NOT ENFORCED
) WITH (
   'connector' = 'kafka',
   'value.format' = 'debezium-json',
   /* ... */
);
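
Assuming a hypothetical orders table for the main stream (with an event-time attribute order_time and a watermark defined on it), a temporal join against this reference table could then look like the following:

SELECT
    o.order_id,
    o.price,
    o.currency,
    r.conversion_rate,
    o.order_time
FROM orders AS o
LEFT JOIN currency_rates FOR SYSTEM_TIME AS OF o.order_time AS r
    ON o.currency = r.currency;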

This method has the following benefits:
  • Easy to implement
  • Low latency
  • Can support high throughput when reference data is a data stream

SQL APIs provide a higher abstraction over how the data is processed. For more complex logic around how the join operator should process events, we recommend that you start with the SQL APIs and move to the DataStream APIs only if you really need the lower-level control.

Conclusion

In this post, we demonstrated different data enrichment patterns in Kinesis Data Analytics. You can use these patterns to find the one that addresses your needs and quickly develop a stream processing application.

For further reading on Kinesis Data Analytics, visit the official product page.


About the Authors

Ali Alemi is a Streaming Specialist Solutions Architect at AWS. Ali advises AWS customers with architectural best practices and helps them design real-time analytics data systems that are reliable, secure, efficient, and cost-effective. He works backward from customers’ use cases and designs data solutions to solve their business problems. Prior to joining AWS, Ali supported several public sector customers and AWS consulting partners in their application modernization journey and migration to the cloud.

Subham Rakshit is a Streaming Specialist Solutions Architect for Analytics at AWS based in the UK. He works with customers to design and build search and streaming data platforms that help them achieve their business objective. Outside of work, he enjoys spending time solving jigsaw puzzles with his daughter.

Dr. Sam Mokhtari is a Senior Solutions Architect in AWS. His main area of depth is data and analytics, and he has published more than 30 influential articles in this field. He is also a respected data and analytics advisor who led several large-scale implementation projects across different industries, including energy, health, telecom, and transport.

IAM Access Analyzer makes it simpler to author and validate role trust policies

Post Syndicated from Mathangi Ramesh original https://aws.amazon.com/blogs/security/iam-access-analyzer-makes-it-simpler-to-author-and-validate-role-trust-policies/

AWS Identity and Access Management (IAM) Access Analyzer provides many tools to help you set, verify, and refine permissions. One part of IAM Access Analyzer—policy validation—helps you author secure and functional policies that grant the intended permissions. Now, I’m excited to announce that AWS has updated the IAM console experience for role trust policies to make it simpler for you to author and validate the policy that controls who can assume a role. In this post, I’ll describe the new capabilities and show you how to use them as you author a role trust policy in the IAM console.

Overview of changes

A role trust policy is a JSON policy document in which you define the principals that you trust to assume the role. The principals that you can specify in the trust policy include users, roles, accounts, and services. The new IAM console experience provides the following features to help you set the right permissions in the trust policy:

  • An interactive policy editor prompts you to add the right policy elements, such as the principal and the allowed actions, and offers context-specific documentation.
  • As you author the policy, IAM Access Analyzer runs over 100 checks against your policy and highlights issues to fix. This includes new policy checks specific to role trust policies, such as a check to make sure that you’ve formatted your identity provider correctly. These new checks are also available through the IAM Access Analyzer policy validation API.
  • Before saving the policy, you can preview findings for the external access granted by your trust policy. This helps you review external access, such as access granted to a federated identity provider, and confirm that you grant only the intended access when you create the policy. This functionality was previously available through the APIs, but now it’s also available in the IAM console.

In the following sections, I’ll walk you through how to use these new features.

Example scenario

For the walkthrough, consider the following example, which is illustrated in Figure 1. You are a developer for Example Corp., and you are working on a web application. You want to grant the application hosted in one account—the ApplicationHost account—access to data in another account—the BusinessData account. To do this, you can use an IAM role in the BusinessData account to grant temporary access to the application through a role trust policy. You will grant a role in the ApplicationHost account—the PaymentApplication role—to access the BusinessData account through a role—the ApplicationAccess role. In this example, you create the ApplicationAccess role and grant cross-account permissions through the trust policy by using the new IAM console experience that helps you set the right permissions.

Figure 1: Visual explanation of the scenario

Create the role and grant permissions through a role trust policy with the policy editor

In this section, I will show you how to create a role trust policy for the ApplicationAccess role to grant the application access to the data in your account through the policy editor in the IAM console.

To create a role and grant access

  1. In the BusinessData account, open the IAM console, and in the left navigation pane, choose Roles.
  2. Choose Create role, and then select Custom trust policy, as shown in Figure 2.
    Figure 2: Select “Custom trust policy” when creating a role

  3. In the Custom trust policy section, for 1. Add actions for STS, select the actions that you need for your policy. For example, to add the action sts:AssumeRole, choose AssumeRole.
    Figure 3: JSON role trust policy

  4. For 2. Add a principal, choose Add to add a principal.
  5. In the Add principal box, for Principal type, select IAM roles. This populates the ARN field with the format of the role ARN that you need to add to the policy, as shown in Figure 4.
    Figure 4: Add a principal to your role trust policy

  6. Update the role ARN template with the actual account and role information, and then choose Add principal. In our example, the account is ApplicationHost with an AWS account number of 111122223333, and the role is PaymentApplication. Therefore, the role ARN is arn:aws:iam::111122223333:role/PaymentApplication. Figure 5 shows the role trust policy with the action and principal added (a sample of the resulting policy document follows this procedure).
    Figure 5: Sample role trust policy

  7. (Optional) To add a condition, for 3. Add a condition, choose Add, and then complete the Add condition box according to your needs.
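
For reference, the trust policy produced by the preceding steps (without the optional condition) should look similar to the following, using the example account ID and role name:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::111122223333:role/PaymentApplication"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}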

Author secure policies by reviewing policy validation findings

As you author the policy, you can see errors or warnings related to your policy in the policy validation window, which is located below the policy editor in the console. With this launch, policy validation in IAM Access Analyzer includes 13 new checks focused on the trust relationship for the role. The following are a few examples of these checks and how to address them:

  • Role trust policy unsupported wildcard in principal – You can’t use * as a principal in your role trust policy.
  • Invalid federated principal syntax in role trust policy – You need to fix the format of the identity provider.
  • Missing action for condition key – You need to add the right action for a given condition key, such as sts:TagSession when there are session tag conditions.

For a complete list of checks, see Access Analyzer policy check reference.

To review and fix policy validation findings

  1. In the policy validation window, do the following:
    • Choose the Security tab to check if your policy is overly permissive.
    • Choose the Errors tab to review any errors associated with the policy.
    • Choose the Warnings tab to review if aspects of the policy don’t align with AWS best practices.
    • Choose the Suggestions tab to get recommendations on how to improve the quality of your policy.
    Figure 6: Policy validation window in IAM Access Analyzer with a finding for your policy

  2. For each finding, choose Learn more to review the documentation associated with the finding and take steps to fix it. For example, Figure 6 shows the error Mismatched Action For Principal. To fix the error, remove the action sts:AssumeRoleWithWebIdentity.

Preview external access by reviewing cross-account access findings

IAM Access Analyzer also generates findings to help you assess if a policy grants access to external entities. You can review the findings before you create the policy to make sure that the policy grants only intended access. To preview the findings, you create an analyzer and then review the findings.

To preview findings for external access

  1. Below the policy editor, in the Preview external access section, choose Go to Access Analyzer, as shown in Figure 7.

    Note: IAM Access Analyzer is a regional service, and you can create a new analyzer in each AWS Region where you operate. In this situation, IAM Access Analyzer looks for an analyzer in the Region where you landed on the IAM console. If IAM Access Analyzer doesn’t find an analyzer there, it asks you to create an analyzer.

    Figure 7: Preview external access widget without an analyzer

  2. On the Create analyzer page, do the following to create an analyzer:
    • For Name, enter a name for your analyzer.
    • For Zone of trust, select the correct account.
    • Choose Create analyzer.
    Figure 8: Create an analyzer to preview findings

  3. After you create the analyzer, navigate back to the role trust policy for your role to review the external access granted by this policy. The following figure shows that external access is granted to PaymentApplication.
    Figure 9: Preview finding

  4. If the access is intended, you don’t need to take any action. In this example, I want the PaymentApplication role in the ApplicationHost account to assume the role that I’m creating.
  5. If the access is unintended, resolve the finding by updating the role ARN information.
  6. Select Next and grant the required IAM permissions for the role.
  7. Name the role ApplicationAccess, and then choose Save to save the role.

Now the application can use this role to access the BusinessData account.

Conclusion

By using the new IAM console experience for role trust policies, you can confidently author policies that grant the intended access. IAM Access Analyzer helps you in your least-privilege journey by evaluating the policy for potential issues to make it simpler for you to author secure policies. IAM Access Analyzer also helps you preview external access granted through the trust policy to help ensure that the granted access is intended. To learn more about how to preview IAM Access Analyzer cross-account findings, see Preview access in the documentation. To learn more about IAM Access Analyzer policy validation checks, see Access Analyzer policy validation. These features are also available through APIs.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread at AWS IAM re:Post or contact AWS Support.

Mathangi Ramesh

Mathangi is the product manager for AWS Identity and Access Management. She enjoys talking to customers and working with data to solve problems. Outside of work, Mathangi is a fitness enthusiast and a Bharatanatyam dancer. She holds an MBA degree from Carnegie Mellon University.

Build, Test and Deploy ETL solutions using AWS Glue and AWS CDK based CI/CD pipelines

Post Syndicated from Puneet Babbar original https://aws.amazon.com/blogs/big-data/build-test-and-deploy-etl-solutions-using-aws-glue-and-aws-cdk-based-ci-cd-pipelines/

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning (ML), and application development. It’s serverless, so there’s no infrastructure to set up or manage.

This post provides a step-by-step guide to build a continuous integration and continuous delivery (CI/CD) pipeline using AWS CodeCommit, AWS CodeBuild, and AWS CodePipeline to define, test, provision, and manage changes of AWS Glue based data pipelines using the AWS Cloud Development Kit (AWS CDK).

The AWS CDK is an open-source software development framework for defining cloud infrastructure as code using familiar programming languages and provisioning it through AWS CloudFormation. It provides you with high-level components called constructs that preconfigure cloud resources with proven defaults, cutting down boilerplate code and allowing for faster development in a safe, repeatable manner.

Solution overview

The solution constructs a CI/CD pipeline with multiple stages. The CI/CD pipeline constructs a data pipeline using COVID-19 Harmonized Data managed by Talend / Stitch. The data pipeline crawls the datasets provided by neherlab from the public Amazon Simple Storage Service (Amazon S3) bucket, exposes the public datasets in the AWS Glue Data Catalog so they’re available for SQL queries using Amazon Athena, performs ETL (extract, transform, and load) transformations to denormalize the datasets to a table, and makes the denormalized table available in the Data Catalog.

The solution is designed as follows:

  • A data engineer deploys the initial solution. The solution creates two stacks:
    • cdk-covid19-glue-stack-pipeline – This stack creates the CI/CD infrastructure as shown in the architectural diagram (labeled Tool Chain).
    • cdk-covid19-glue-stack – The cdk-covid19-glue-stack-pipeline stack deploys the cdk-covid19-glue-stack stack to create the AWS Glue based data pipeline as shown in the diagram (labeled ETL).
  • The data engineer makes changes on cdk-covid19-glue-stack (when a change in the ETL application is required).
  • The data engineer pushes the change to a CodeCommit repository (generated in the cdk-covid19-glue-stack-pipeline stack).
  • The pipeline is automatically triggered by the push, and deploys and updates all the resources in the cdk-covid19-glue-stack stack.

At the time of publishing of this post, the AWS CDK has two versions of the AWS Glue module: @aws-cdk/aws-glue and @aws-cdk/aws-glue-alpha, containing L1 constructs and L2 constructs, respectively. At this time, the @aws-cdk/aws-glue-alpha module is still in an experimental stage. We use the stable @aws-cdk/aws-glue module for the purpose of this post.

The following diagram shows all the components in the solution.

Figure 1 – Architecture diagram

The data pipeline consists of an AWS Glue workflow, triggers, jobs, and crawlers. The AWS Glue job uses an AWS Identity and Access Management (IAM) role with appropriate permissions to read and write data to an S3 bucket. AWS Glue crawlers crawl the data available in the S3 bucket, update the AWS Glue Data Catalog with the metadata, and create tables. You can run SQL queries on these tables using Athena. For ease of identification, we followed the naming convention for triggers to start with t_*, crawlers with c_*, and jobs with j_*. A CI/CD pipeline based on CodeCommit, CodeBuild, and CodePipeline builds, tests and deploys the solution. The complete infrastructure is created using the AWS CDK.

The following tables are created by this solution and can be queried using Athena:

  • neherlab_case_counts – Total number of cases. Dataset location: s3://covid19-harmonized-dataset/covid19tos3/neherlab_case_counts/ (Read access, public location)
  • neherlab_country_codes – Country code. Dataset location: s3://covid19-harmonized-dataset/covid19tos3/neherlab_country_codes/ (Read access, public location)
  • neherlab_icu_capacity – Intensive Care Unit (ICU) capacity. Dataset location: s3://covid19-harmonized-dataset/covid19tos3/neherlab_icu_capacity/ (Read access, public location)
  • neherlab_population – Population. Dataset location: s3://covid19-harmonized-dataset/covid19tos3/neherlab_population/ (Read access, public location)
  • neherlab_denormalized – Denormalized table that combines all the preceding tables into one. Dataset location: s3://<your-S3-bucket-name>/neherlab_denormalized (Read/Write access, in the reader’s AWS account)

Anatomy of the AWS CDK application

In this section, we visit key concepts and anatomy of the AWS CDK application, review the important sections of the code, and discuss how the AWS CDK reduces complexity of the solution as compared to AWS CloudFormation.

An AWS CDK app defines one or more stacks. Stacks (equivalent to CloudFormation stacks) contain constructs, each of which defines one or more concrete AWS resources. Each stack in the AWS CDK app is associated with an environment. An environment is the target AWS account ID and Region into which the stack is intended to be deployed.

In the AWS CDK, the top-most object is the AWS CDK app, which contains multiple stacks vs. the top-level stack in AWS CloudFormation. Given this difference, you can define all the stacks required for the application in the AWS CDK app. In AWS Glue based ETL projects, developers need to define multiple data pipelines by subject area or business logic. In AWS CloudFormation, we can achieve this by writing multiple CloudFormation stacks and often deploying them independently. In some cases, developers write nested stacks, which over time become very large and complicated to maintain. In the AWS CDK, all stacks are deployed from the AWS CDK app, increasing the modularity of the code and allowing developers to identify all the data pipelines associated with an application easily.

Our AWS CDK application consists of four main files:

  • app.py – This is the AWS CDK app and the entry point for the AWS CDK application
  • pipeline.py – The pipeline.py stack, invoked by app.py, creates the CI/CD pipeline
  • etl/infrastructure.py – The etl/infrastructure.py stack, invoked by pipeline.py, creates the AWS Glue based data pipeline
  • default-config.yaml – The configuration file contains the AWS account ID and Region.

The AWS CDK application reads the configuration from the default-config.yaml file, sets the environment information (AWS account ID and Region), and invokes the PipelineCDKStack class in pipeline.py. Let’s break down this flow and discuss the benefits of this design.

For every application, we want to deploy in pre-production environments and a production environment. The application in all the environments will have different configurations, such as the size of the deployed resources. In the AWS CDK, every stack has a property called env, which defines the stack’s target environment. This property receives the AWS account ID and Region for the given stack.

Lines 26–34 in app.py show the aforementioned details:

# Initiating the CodePipeline stack
PipelineCDKStack(
    app,
    "PipelineCDKStack",
    config=config,
    env=env,
    stack_name=config["codepipeline"]["pipelineStackName"]
)

The env=env line sets the target AWS account ID and Region for PipelineCDKStack. This design allows an AWS CDK app to be deployed in multiple environments at once and increases the parity of the application across environments. For our example, if we want to deploy PipelineCDKStack in multiple environments, such as development, test, and production, we simply call the PipelineCDKStack stack after populating the env variable appropriately with the target AWS account ID and Region. This was more difficult in AWS CloudFormation, where developers usually needed to deploy the stack for each environment individually. The AWS CDK also provides features to pass the stage at the command line. We look into this option and its usage in a later section.
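
As a minimal sketch (not the repository’s exact code), the stage could be read from CDK context and used to select the environment; the config file layout and key names (awsAccount, awsRegion) are assumptions:

#!/usr/bin/env python3
# Hypothetical sketch of selecting a deployment stage through CDK context (-c stage=...)
import aws_cdk as cdk
import yaml

app = cdk.App()
stage = app.node.try_get_context("stage") or "default"

with open("default-config.yaml") as config_file:
    config = yaml.safe_load(config_file)[stage]  # assumes one section per stage

env = cdk.Environment(
    account=str(config["awsAccount"]),  # hypothetical key names
    region=config["awsRegion"],
)

# The same stack class can now be instantiated for any stage, for example:
# PipelineCDKStack(app, "PipelineCDKStack", config=config, env=env, ...)
app.synth()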

Coming back to the AWS CDK application, the PipelineCDKStack class in pipeline.py uses the aws_cdk.pipelines construct library to create continuous delivery of AWS CDK applications. The AWS CDK provides multiple opinionated construct libraries like aws_cdk.pipelines to reduce boilerplate code in an application. The pipeline.py file creates the CodeCommit repository, populates the repository with the sample code, and creates a pipeline with the necessary AWS CDK stages for CodePipeline to run the CdkGlueBlogStack class from the etl/infrastructure.py file.

Line 99 in pipeline.py invokes the CdkGlueBlogStack class.

The CdkGlueBlogStack class in etl/infrastructure.py creates the crawlers, jobs, database, triggers, and workflow to provision the AWS Glue based data pipeline.

Refer to line 539 for creating a crawler using the CfnCrawler construct, line 564 for creating jobs using the CfnJob construct, and line 168 for creating the workflow using the CfnWorkflow construct. We use the CfnTrigger construct to stitch together multiple triggers to create the workflow. The AWS CDK L1 constructs expose all the available AWS CloudFormation resources and entities using methods from popular programming languages. This allows developers to use popular programming languages to provision resources instead of working with JSON or YAML files in AWS CloudFormation.

Refer to etl/infrastructure.py for additional details.
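
To illustrate the L1 construct style (a sketch only, not the exact code from the repository; the construct ID, crawler name, and role variable are assumptions), a crawler over one of the public datasets could be defined as follows:

from aws_cdk import Stack, aws_glue as glue
from constructs import Construct


class CrawlerSketchStack(Stack):
    """Minimal sketch of defining an AWS Glue crawler with the CfnCrawler L1 construct."""

    def __init__(self, scope: Construct, construct_id: str, crawler_role_arn: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        glue.CfnCrawler(
            self,
            "NeherlabCaseCountsCrawler",
            name="neherlab-case-counts",  # illustrative name
            role=crawler_role_arn,        # ARN of an IAM role with read access to the source path
            database_name="covid19db",
            targets=glue.CfnCrawler.TargetsProperty(
                s3_targets=[
                    glue.CfnCrawler.S3TargetProperty(
                        path="s3://covid19-harmonized-dataset/covid19tos3/neherlab_case_counts/"
                    )
                ]
            ),
        )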

Walkthrough of the CI/CD pipeline

In this section, we walk through the various stages of the CI/CD pipeline. Refer to CDK Pipelines: Continuous delivery for AWS CDK applications for additional information.

  • Source – This stage fetches the source of the AWS CDK app from the CodeCommit repo and triggers the pipeline every time a new commit is made.
  • Build – This stage compiles the code (if necessary), runs the tests, and performs a cdk synth. The output of the step is a cloud assembly, which is used to perform all the actions in the rest of the pipeline. The pytest is run using the amazon/aws-glue-libs:glue_libs_3.0.0_image_01 Docker image. This image comes with all the required libraries to run tests for AWS Glue version 3.0 jobs using a Docker container. Refer to Develop and test AWS Glue version 3.0 jobs locally using a Docker container for additional information.
  • UpdatePipeline – This stage modifies the pipeline if necessary. For example, if the code is updated to add a new deployment stage to the pipeline or add a new asset to your application, the pipeline is automatically updated to reflect the changes.
  • Assets – This stage prepares and publishes all AWS CDK assets of the app to Amazon S3 and all Docker images to Amazon Elastic Container Registry (Amazon ECR). When the AWS CDK deploys an app that references assets (either directly by the app code or through a library), the AWS CDK CLI first prepares and publishes the assets to Amazon S3 using a CodeBuild job. This AWS Glue solution creates four assets.
  • CDKGlueStage – This stage deploys the assets to the AWS account. In this case, the pipeline deploys the AWS CDK template etl/infrastructure.py to create all the AWS Glue artifacts.

Code

The code can be found at AWS Samples on GitHub.

Prerequisites

This post assumes you have the following:

Deploy the solution

To deploy the solution, complete the following steps:

  • Download the source code from the AWS Samples GitHub repository to the client machine:
$ git clone [email protected]:aws-samples/aws-glue-cdk-cicd.git
  • Create the virtual environment:
$ cd aws-glue-cdk-cicd 
$ python3 -m venv .venv

This step creates a Python virtual environment specific to the project on the client machine. We use a virtual environment in order to isolate the Python environment for this project and not install software globally.

  • Activate the virtual environment according to your OS:
    • On MacOS and Linux, use the following code:
$ source .venv/bin/activate
    • On a Windows platform, use the following code:
% .venv\Scripts\activate.bat

After this step, the subsequent steps run within the bounds of the virtual environment on the client machine and interact with the AWS account as needed.

  • Install the required dependencies described in requirements.txt to the virtual environment:
$ pip install -r requirements.txt
  • Bootstrap the AWS CDK app:
cdk bootstrap

This step populates a given environment (AWS account ID and Region) with resources required by the AWS CDK to perform deployments into the environment. Refer to Bootstrapping for additional information. At this step, you can see the CloudFormation stack CDKToolkit on the AWS CloudFormation console.

  • Synthesize the CloudFormation template for the specified stacks:
$ cdk synth # optional if not default (-c stage=default)

You can verify the CloudFormation templates to identify the resources to be deployed in the next step.

  • Deploy the AWS resources (CI/CD pipeline and AWS Glue based data pipeline):
$ cdk deploy # optional if not default (-c stage=default)

At this step, you can see CloudFormation stacks cdk-covid19-glue-stack-pipeline and cdk-covid19-glue-stack on the AWS CloudFormation console. The cdk-covid19-glue-stack-pipeline stack gets deployed first, which in turn deploys cdk-covid19-glue-stack to create the AWS Glue pipeline.

Verify the solution

When all the previous steps are complete, you can check for the created artifacts.

CloudFormation stacks

You can confirm the existence of the stacks on the AWS CloudFormation console. As shown in the following screenshot, the CloudFormation stacks have been created and deployed by cdk bootstrap and cdk deploy.

Figure 2 – AWS CloudFormation stacks

CodePipeline pipeline

On the CodePipeline console, check for the cdk-covid19-glue pipeline.

Figure 3 – AWS CodePipeline summary view

You can open the pipeline for a detailed view.

Figure 4 – AWS CodePipeline detailed view

AWS Glue workflow

To validate the AWS Glue workflow and its components, complete the following steps:

  • On the AWS Glue console, choose Workflows in the navigation pane.
  • Confirm the presence of the Covid_19 workflow.

Figure 5 – AWS Glue Workflow summary view

You can select the workflow for a detailed view.

Figure 6 – AWS Glue Workflow detailed view

  • Choose Triggers in the navigation pane and check for the presence of seven t-* triggers.

Figure 7 – AWS Glue Triggers

  • Choose Jobs in the navigation pane and check for the presence of three j_* jobs.

Figure 8 – AWS Glue Jobs

The jobs perform the following tasks:

    • etlScripts/j_emit_start_event.py – A Python job that starts the workflow and creates the event
    • etlScripts/j_neherlab_denorm.py – A Spark ETL job to transform the data and create a denormalized view by combining all the base data together in Parquet format
    • etlScripts/j_emit_ended_event.py – A Python job that ends the workflow and creates the specific event
  • Choose Crawlers in the navigation pane and check for the presence of five neherlab-* crawlers.

Figure 9 – AWS Glue Crawlers

Execute the solution

  • The solution creates a scheduled AWS Glue workflow that runs at 10:00 AM UTC on day 1 of every month. A scheduled workflow can also be triggered on demand. For the purpose of this post, we run the workflow on demand using the following command from the AWS CLI. If the workflow is successfully started, the command returns the run ID. For instructions on how to run and monitor a workflow in AWS Glue, refer to Running and monitoring a workflow in AWS Glue.
aws glue start-workflow-run --name Covid_19
  • You can verify the status of a workflow run by running the following command from the AWS CLI. Use the run ID returned by the preceding command. A successfully completed Covid_19 workflow should return a value of 7 for SucceededActions and 0 for FailedActions.
aws glue get-workflow-run --name Covid_19 --run-id <run_ID>
  • The following is sample output of the preceding command.
{
"Run": {
"Name": "Covid_19",
"WorkflowRunId": "wr_c8855e82ab42b2455b0e00cf3f12c81f957447abd55a573c087e717f54a4e8be",
"WorkflowRunProperties": {},
"StartedOn": "2022-09-20T22:13:40.500000-04:00",
"CompletedOn": "2022-09-20T22:21:39.545000-04:00",
"Status": "COMPLETED",
"Statistics": {
"TotalActions": 7,
"TimeoutActions": 0,
"FailedActions": 0,
"StoppedActions": 0,
"SucceededActions": 7,
"RunningActions": 0
}
}
}
  • (Optional) To verify the status of the workflow run using the AWS Glue console, choose Workflows in the navigation pane, select the Covid_19 workflow, choose the History tab, select the latest row, and choose View run details. A successfully completed workflow is marked with green check marks. Refer to the Legend section in the following screenshot for additional statuses.

    Figure 10 – AWS Glue Workflow successful run

Check the output

  • When the workflow is complete, navigate to the Athena console to check the successful creation and population of the neherlab_denormalized table. You can run SQL queries against all five tables to check the data. A sample SQL query is provided below.
SELECT "country", "location", "date", "cases", "deaths", "ecdc-countries",
        "acute_care", "acute_care_per_100K", "critical_care", "critical_care_per_100K" 
FROM "AwsDataCatalog"."covid19db"."neherlab_denormalized"
limit 10;

Figure 11 – Amazon Athena

Clean up

To clean up the resources created in this post, delete the AWS CloudFormation stacks in the following order:

  • cdk-covid19-glue-stack
  • cdk-covid19-glue-stack-pipeline
  • CDKToolkit

Then delete all associated S3 buckets:

  • cdk-covid19-glue-stack-p-pipelineartifactsbucketa-*
  • cdk-*-assets-<AWS_ACCOUNT_ID>-<AWS_REGION>
  • covid19-glue-config-<AWS_ACCOUNT_ID>-<AWS_REGION>
  • neherlab-denormalized-dataset-<AWS_ACCOUNT_ID>-<AWS_REGION>

Conclusion

In this post, we demonstrated a step-by-step guide to define, test, provision, and manage changes to an AWS Glue based ETL solution using the AWS CDK. We used an AWS Glue example, which has all the components to build a complex ETL solution, and demonstrated how to integrate individual AWS Glue components into a frictionless CI/CD pipeline. We encourage you to use this post and associated code as the starting point to build your own CI/CD pipelines for AWS Glue based ETL solutions.


About the authors

Puneet Babbar is a Data Architect at AWS, specialized in big data and AI/ML. He is passionate about building products, in particular products that help customers get more out of their data. During his spare time, he loves to spend time with his family and engage in outdoor activities including hiking, running, and skating. Connect with him on LinkedIn.

Suvojit Dasgupta is a Sr. Lakehouse Architect at Amazon Web Services. He works with customers to design and build data solutions on AWS.

Justin Kuskowski is a Principal DevOps Consultant at Amazon Web Services. He works directly with AWS customers to provide guidance and technical assistance around improving their value stream, which ultimately reduces product time to market and leads to a better customer experience. Outside of work, Justin enjoys traveling the country to watch his two kids play soccer and spending time with his family and friends wake surfing on the lakes in Michigan.

Using CloudFormation events to build custom workflows for post provisioning management

Post Syndicated from Vivek Kumar original https://aws.amazon.com/blogs/devops/using-cloudformation-events-to-build-custom-workflows-for-post-provisioning-management/

Over one million active customers manage application resources with AWS CloudFormation every week. CloudFormation is a service that helps you model, provision, and manage your cloud resources by treating Infrastructure as Code (IaC). It can simplify infrastructure management, quickly replicate your environment to multiple AWS regions with a single turn-key solution, and let you easily control and track changes in your infrastructure.

You can create various AWS resources using CloudFormation to set up an environment for your workloads. You continue to interact with and manage those resources throughout the workload lifecycle to make sure the resource configuration is aligned with business objectives such as adhering to security compliance standards, meeting required reliability targets, and aligning with budget requirements. The inability to perform a hand-off between resource provisioning actions in CloudFormation and resource management actions in other relevant AWS and non-AWS services poses a challenge. For example, after provisioning of resources, customers might need to perform additional tasks to manage these resources, such as adding cost allocation tags, populating a resource inventory database, or triggering downstream processes.

While customers can obtain the logical resource grouping that is tied to a workload or a workload component with a CloudFormation stack, that context largely does not extend beyond CloudFormation when they use various AWS and non-AWS services to conduct post-provisioning resource management. These AWS and non-AWS services typically offer a resource-level view, or in some cases basic aggregated views such as supporting a tag group, or an account-level abstraction to see all resources in a given account. Because the stack context is not carried beyond resource provisioning, the experience is disjointed: there is no hand-off between resource provisioning actions in CloudFormation and the various resource management actions customers take with their workload resources throughout their lifecycle in other relevant AWS and non-AWS services.

CloudFormation events provide a robust way to track the status of individual resources during the lifecycle of a stack. You can send CloudFormation events to Amazon EventBridge whenever a create, update, or drift detection action is performed on your stack. Then you can set up additional workflows based on those events from EventBridge. For example, by tagging the resources automatically, you can reference that tag group when using AWS Trusted Advisor, and continue your resource management experience post-provisioning. CloudFormation sends these events to EventBridge automatically so that you don’t need to do anything. One real-world use case is to use these events to create actionable tasks for your teams to troubleshoot issues. CloudFormation events published to EventBridge can be used to create OpsItems within AWS Systems Manager OpsCenter. OpsItems are the work items created in OpsCenter for engineers to view, investigate, and remediate tasks or issues. This enables teams to respond to and resolve any issues more efficiently.

Walkthrough

To set up the EventBridge rule, go to the AWS console and navigate to EventBridge. Select Create rule to get started. Enter a name and description, and select Next:

Create Rule

On the next screen, select AWS events in the Event source section.

This sample event is for the CREATE_COMPLETE event. It contains the source, AWS account number, AWS region, event type, resources and details about the event.
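
For reference, a CloudFormation resource status change event delivered to EventBridge looks similar to the following illustrative, abridged example; the values are placeholders and the detail fields shown are only a subset of what CloudFormation publishes:

{
    "version": "0",
    "id": "12345678-1234-1234-1234-123456789012",
    "detail-type": "CloudFormation Resource Status Change",
    "source": "aws.cloudformation",
    "account": "111122223333",
    "time": "2022-09-20T22:13:40Z",
    "region": "us-east-1",
    "resources": [
        "arn:aws:cloudformation:us-east-1:111122223333:stack/my-stack/<stack-guid>"
    ],
    "detail": {
        "stack-id": "arn:aws:cloudformation:us-east-1:111122223333:stack/my-stack/<stack-guid>",
        "logical-resource-id": "MyBucket",
        "resource-type": "AWS::S3::Bucket",
        "status-details": {
            "status": "CREATE_COMPLETE",
            "status-reason": ""
        }
    }
}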

On the same page in the Event pattern section:

Select Custom patterns (JSON editor) and enter the following event pattern. This will match any events when a resource fails to create, update, or delete. Learn more about EventBridge event patterns.

{
    "source": [
        "aws.cloudformation"
    ],
    "detail-type": [
        "CloudFormation Resource Status Change"
    ],
    "detail": {
        "status-details": {
            "status": [
                "CREATE_FAILED",
                "UPDATE_FAILED",
                "DELETE_FAILED"
            ]
        }
    }
}

Custom patterns - JSON editor

Select Next. On the Target screen, select AWS service, then select Systems Manager OpsItem as the target for this rule.

Target 1

Add a second target – an Amazon Simple Notification Service (SNS) Topic – to notify the Ops team whenever a failure occurs and an OpsItem has been created.

Target 2

Select Next and optionally add tags.

Select Next to review the selections, and select Create rule.

Now your rule is created and whenever a stack failure occurs, an OpsItem gets created and a notification is sent out for the operators to troubleshoot and fix the issue. The OpsItem contains operational data, such as the resource that failed, the reason for failure, as well as the stack to which it belongs, which is useful for troubleshooting the issue. Operators can take manual actions or use runbooks codified as Systems Manager Documents to take corrective actions. From the AWS Console you can go to OpsCenter to see the events:

operational data

Once the issues have been addressed, operators can mark the OpsItem as resolved, and retry the stack operation that failed, resulting in a swift resolution of the issue, and preventing duplication of efforts.

This walkthrough is for the console, but you can use the AWS Command Line Interface (AWS CLI), AWS SDKs, or even CloudFormation to accomplish all of this. Refer to the AWS CLI documentation for more information on creating EventBridge rules through the CLI. Furthermore, refer to the AWS SDK documentation for creating EventBridge rules through the AWS SDKs. You can use the following CloudFormation template to deploy the EventBridge rules example used as part of the walkthrough in this blog post:

{
	"Parameters": {
		"SNSTopicARN": {
			"Type": "String",
			"Description": "Enter the ARN of the SNS Topic where you want stack failure notifications to be sent."
		}
	},
	"Resources": {
		"CFNEventsRule": {
			"Type": "AWS::Events::Rule",
			"Properties": {
				"Description": "Event rule to capture CloudFormation failure events",
				"EventPattern": {
					"source": [
						"aws.cloudformation"
					],
					"detail-type": [
						"CloudFormation Resource Status Change"
					],
					"detail": {
						"status-details": {
							"status": [
								"CREATE_FAILED",
								"UPDATE_FAILED",
								"DELETE_FAILED"
							]
						}
					}
				},
				"Name": "cfn-stack-failure-test",
				"State": "ENABLED",
				"Targets": [
					{
						"Arn": {
							"Fn::Sub": "arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:opsitem"
						},
						"Id": "opsitems",
						"RoleArn": {
							"Fn::GetAtt": [
								"TargetInvocationRole",
								"Arn"
							]
						}
					},
					{
						"Arn": {
							"Ref": "SNSTopicARN"
						},
						"Id": "sns"
					}
				]
			}
		},
		"TargetInvocationRole": {
			"Type": "AWS::IAM::Role",
			"Properties": {
				"AssumeRolePolicyDocument": {
					"Version": "2012-10-17",
					"Statement": [
						{
							"Effect": "Allow",
							"Principal": {
								"Service": [
									"events.amazonaws.com"
								]
							},
							"Action": [
								"sts:AssumeRole"
							]
						}
					]
				},
				"Path": "/",
				"Policies": [
					{
						"PolicyName": "createopsitem",
						"PolicyDocument": {
							"Version": "2012-10-17",
							"Statement": [
								{
									"Effect": "Allow",
									"Action": [
										"ssm:CreateOpsItem"
									],
									"Resource": "*"
								}
							]
						}
					}
				]
			}
		},
		"AllowSNSPublish": {
			"Type": "AWS::SNS::TopicPolicy",
			"Properties": {
				"PolicyDocument": {
					"Statement": [
						{
							"Sid": "grant-eventbridge-publish",
							"Effect": "Allow",
							"Principal": {
								"Service": "events.amazonaws.com"
							},
							"Action": [
								"sns:Publish"
							],
							"Resource": {
								"Ref": "SNSTopicARN"
							}
						}
					]
				},
				"Topics": [
					{
						"Ref": "SNSTopicARN"
					}
				]
			}
		}
	}
}

Summary

Responding to CloudFormation stack events becomes easy with the integration between CloudFormation and EventBridge. CloudFormation events can be used to perform post-provisioning actions on workload resources. With the variety of targets available to EventBridge rules, various actions such as adding tags and troubleshooting issues can be performed. The example above uses Systems Manager and Amazon SNS, but you can have numerous targets including Amazon API Gateway, AWS Lambda, Amazon Elastic Container Service (Amazon ECS) tasks, Amazon Kinesis services, Amazon Redshift, Amazon SageMaker pipelines, and many more. These events are available for free in EventBridge.

Learn more about Managing events with CloudFormation and EventBridge.

About the Author

Vivek is a Solutions Architect at AWS based out of New York. He works with customers providing technical assistance and architectural guidance on various AWS services. He brings more than 25 years of experience in software engineering and architecture roles for various large-scale enterprises.

 

 

Mahanth is a Solutions Architect at Amazon Web Services (AWS). As part of the AWS Well-Architected team, he works with customers and AWS Partner Network partners of all sizes to help them build secure, high-performing, resilient, and efficient infrastructure for their applications. He spends his free time playing with his pup Cosmo, learning more about astronomy, and is an avid gamer.

 

 

Sukhchander is a Solutions Architect at Amazon Web Services. He is passionate about helping startups and enterprises adopt the cloud in the most scalable, secure, and cost-effective way by providing technical guidance, best practices, and well architected solutions.

Design considerations for Amazon EMR on EKS in a multi-tenant Amazon EKS environment

Post Syndicated from Lotfi Mouhib original https://aws.amazon.com/blogs/big-data/design-considerations-for-amazon-emr-on-eks-in-a-multi-tenant-amazon-eks-environment/

Many AWS customers use Amazon Elastic Kubernetes Service (Amazon EKS) in order to take advantage of Kubernetes without the burden of managing the Kubernetes control plane. With Kubernetes, you can centrally manage your workloads and offer administrators a multi-tenant environment where they can create, update, scale, and secure workloads using a single API. Kubernetes also allows you to improve resource utilization, reduce cost, and simplify infrastructure management to support different application deployments. This model is beneficial for those running Apache Spark workloads, for several reasons. For example, it allows you to have multiple Spark environments running concurrently with different configurations and dependencies that are segregated from each other through Kubernetes multi-tenancy features. In addition, the same cluster can be used for various workloads like machine learning (ML), hosting applications, and data streaming, thereby reducing the operational overhead of managing multiple clusters.

AWS offers Amazon EMR on EKS, a managed service that enables you to run your Apache Spark workloads on Amazon EKS. This service uses the Amazon EMR runtime for Apache Spark, which increases the performance of your Spark jobs so that they run faster and cost less. When you run Spark jobs on EMR on EKS instead of on self-managed Apache Spark on Kubernetes, you can take advantage of automated provisioning, scaling, faster runtimes, and the development and debugging tools that Amazon EMR provides.

In this post, we show how to configure and run EMR on EKS in a multi-tenant EKS cluster that can be used by your various teams. We tackle multi-tenancy through four topics: network, resource management, cost management, and security.

Concepts

Throughout this post, we use terminology that is either specific to EMR on EKS, Spark, or Kubernetes:

  • Multi-tenancy – Multi-tenancy in Kubernetes can come in three forms: hard multi-tenancy, soft multi-tenancy, and sole multi-tenancy. Hard multi-tenancy means each business unit or group of applications gets a dedicated Kubernetes cluster; there is no sharing of the control plane. This model is out of scope for this post. Soft multi-tenancy is where pods might share the same underlying compute resource (node) and are logically separated using Kubernetes constructs through namespaces, resource quotas, or network policies. Another way to achieve multi-tenancy in Kubernetes is to assign pods to specific nodes that are pre-provisioned and allocated to a specific team. In this case, we talk about sole multi-tenancy. Unless your security posture requires you to use hard or sole multi-tenancy, you would want to consider using soft multi-tenancy for the following reasons:
    • Soft multi-tenancy avoids underutilization of resources and waste of compute resources.
    • There is a limited number of managed node groups that can be used by Amazon EKS, so for large deployments, this limit can quickly become a limiting factor.
    • In sole multi-tenancy, there is a high chance of ghost nodes with no pods scheduled on them due to misconfiguration, because pods are forced onto dedicated nodes with labels, taints, tolerations, and anti-affinity rules.
  • Namespace – Namespaces are core in Kubernetes and a pillar to implement soft multi-tenancy. With namespaces, you can divide the cluster into logical partitions. These partitions are then referenced in quotas, network policies, service accounts, and other constructs that help isolate environments in Kubernetes.
  • Virtual cluster – An EMR virtual cluster is mapped to a Kubernetes namespace that Amazon EMR is registered with. Amazon EMR uses virtual clusters to run jobs and host endpoints. Multiple virtual clusters can be backed by the same physical cluster. However, each virtual cluster maps to one namespace on an EKS cluster. Virtual clusters don’t create any active resources that contribute to your bill or require lifecycle management outside the service.
  • Pod template – In EMR on EKS, you can provide a pod template to control pod placement, or define a sidecar container. This pod template can be defined for executor pods and driver pods, and stored in an Amazon Simple Storage Service (Amazon S3) bucket. The S3 locations are then submitted as part of the applicationConfiguration object that is part of configurationOverrides, as defined in the EMR on EKS job submission API. A sample pod template is shown after this list.
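
The following is a minimal, illustrative executor pod template; the node selector label is a placeholder, and the reserved main container name should be confirmed against the EMR on EKS pod template documentation.

apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    team: data-engineering              # placeholder label used to steer executors to a node group
  containers:
    - name: spark-kubernetes-executor   # reserved name for the main executor container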

Security considerations

In this section, we address security from different angles. We first discuss how to protect the IAM role that is used for running the job, then address how to protect secrets used in jobs, and finally discuss how you can protect data while it is processed by Spark.

IAM role protection

A job submitted to EMR on EKS needs an AWS Identity and Access Management (IAM) execution role to interact with AWS resources, for example with Amazon S3 to get data, with Amazon CloudWatch Logs to publish logs, or to use an encryption key in AWS Key Management Service (AWS KMS). It’s a best practice in AWS to apply least privilege for IAM roles. In Amazon EKS, this is achieved through IAM Roles for Service Accounts (IRSA). This mechanism allows a pod to assume an IAM role at the pod level and not at the node level, while using short-term credentials that are provided through the EKS OIDC provider.

IRSA creates a trust relationship between the EKS OIDC provider and the IAM role. This method allows only pods with a service account (annotated with an IAM role ARN) to assume a role that has a trust policy with the EKS OIDC provider. However, this isn’t enough, because it would allow any pod with a service account within the EKS cluster that is annotated with a role ARN to assume the execution role. This must be further scoped down using conditions on the role trust policy. This condition allows the assume role to happen only if the calling service account is the one used for running a job associated with the virtual cluster. The following code shows the structure of the condition to add to the trust policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": <OIDC provider ARN >
            },
            "Action": "sts:AssumeRoleWithWebIdentity"
            "Condition": { "StringLike": { “<OIDC_PROVIDER>:sub": "system:serviceaccount:<NAMESPACE>:emr-containers-sa-*-*-<AWS_ACCOUNT_ID>-<BASE36_ENCODED_ROLE_NAME>”} }
        }
    ]
}

To scope down the trust policy using the service account condition, you need to run the following the command with AWS CLI:

aws emr-containers update-role-trust-policy \
--cluster-name cluster \
--namespace namespace \
--role-name iam_role_name_for_job_execution

The command adds to the trust policy the service account that will be used by the Spark client, Jupyter Enterprise Gateway, Spark kernel, driver, or executor. The service account names have the following structure: emr-containers-sa-*-*-<AWS_ACCOUNT_ID>-<BASE36_ENCODED_ROLE_NAME>.

In addition to the role segregation offered by IRSA, we recommend blocking access to instance metadata because a pod can still inherit the rights of the instance profile assigned to the worker node. For more information about how you can block access to metadata, refer to Restrict access to the instance profile assigned to the worker node.

Secret protection

Sometime a Spark job needs to consume data stored in a database or from APIs. Most of the time, these are protected with a password or access key. The most common way to pass these secrets is through environment variables. However, in a multi-tenant environment, this means any user with access to the Kubernetes API can potentially access the secrets in the environment variables if this access isn’t scoped well to the namespaces the user has access to.

To overcome this challenge, we recommend using a secrets store like AWS Secrets Manager that can be mounted through the Secrets Store CSI Driver. The benefit of using Secrets Manager is the ability to use IRSA and allow only the role assumed by the pod access to the given secret, thereby improving your security posture. You can refer to the best practices guide for sample code showing the use of Secrets Manager with EMR on EKS.

Spark data encryption

When a Spark application is running, the driver and executors produce intermediate data. This data is written to the node local storage. Anyone who is able to exec into the pods would be able to read this data. Spark supports encryption of this data, and it can be enabled by passing --conf spark.io.encryption.enabled=true. Because this configuration adds performance penalty, we recommend enabling data encryption only for workloads that store and access highly sensitive data and in untrusted environments.

Network considerations

In this section, we discuss how to manage networking within the cluster as well as outside the cluster. We first address how Spark handles communication between the driver and executors and how to secure it. Then we discuss how to restrict network traffic between pods in the EKS cluster and allow only traffic destined for EMR on EKS. Last, we discuss how to restrict traffic from executor and driver pods to external AWS services using security groups.

Network encryption

The communication between the driver and executors uses the RPC protocol and is not encrypted. Starting with Spark 3 on Kubernetes-backed clusters, Spark offers a mechanism to encrypt this communication using AES encryption.

The driver generates a key and shares it with executors through an environment variable. Because the key is shared through the environment variable, potentially any user with access to the Kubernetes API (kubectl) can read the key. We recommend securing access so that only authorized users can have access to the EMR virtual cluster. In addition, you should set up Kubernetes role-based access control in such a way that access to the pod spec in the namespace where the EMR virtual cluster runs is granted to only a few selected service accounts. This method of passing secrets through environment variables might change in the future; there is a proposal to use Kubernetes secrets instead.

To enable encryption, RPC authentication must also be enabled in your Spark configuration. To enable encryption in-transit in Spark, you should use the following parameters in your Spark config:

--conf spark.authenticate=true

--conf spark.network.crypto.enabled=true

Note that these are the minimal parameters to set; refer to Encryption for the complete list of parameters.

Additionally, applying encryption in Spark has a negative impact on processing speed. You should only apply it when there is a compliance or regulatory need.
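
As a hedged illustration, both the disk I/O and RPC encryption settings could be passed to a job through the spark-defaults classification inside configurationOverrides when calling the EMR on EKS StartJobRun API; the snippet below shows only that fragment of the request:

{
    "configurationOverrides": {
        "applicationConfiguration": [
            {
                "classification": "spark-defaults",
                "properties": {
                    "spark.authenticate": "true",
                    "spark.network.crypto.enabled": "true",
                    "spark.io.encryption.enabled": "true"
                }
            }
        ]
    }
}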

Securing network traffic within the cluster

In Kubernetes, by default pods can communicate over the network across different namespaces in the same cluster. This behavior is not always desirable in a multi-tenant environment. In some instances, for example in regulated industries, to be compliant you want to enforce strict control over the network and send and receive traffic only from the namespace that you’re interacting with. For EMR on EKS, it would be the namespace associated to the EMR virtual cluster. Kubernetes offers constructs that allow you to implement network policies and define fine-grained control over the pod-to-pod communication. These policies are implemented by the CNI plugin; in Amazon EKS, the default plugin would be the VPC CNI. A policy is defined as follows and is applied with kubectl:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-np-ns1
  namespace: <EMR-VC-NAMESPACE>
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          nsname: <EMR-VC-NAMESPACE>
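
Note that the namespaceSelector in the policy above matches on the nsname label, so the source namespace must actually carry that label for its traffic to be allowed. A minimal namespace manifest with such a label (the label key and value follow the example above and are otherwise arbitrary) looks like this:

apiVersion: v1
kind: Namespace
metadata:
  name: <EMR-VC-NAMESPACE>
  labels:
    nsname: <EMR-VC-NAMESPACE>   # must match the namespaceSelector in the NetworkPolicy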

Network traffic outside the cluster

In Amazon EKS, when you deploy pods on Amazon Elastic Compute Cloud (Amazon EC2) instances, all the pods on a node use the security group associated with that node. This can be an issue if your pods (for example, executor pods) access a data source, such as a database, that allows traffic based on the source security group. Database servers often restrict network access to only the sources they expect. In a multi-tenant EKS cluster, this means pods from other teams that shouldn't have access to the database servers would be able to send traffic to them.

To overcome this challenge, you can use security groups for pods. This feature allows you to assign a specific security group to your pods, thereby controlling the network traffic to your database server or data source. You can also refer to the best practices guide for a reference implementation.
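
As an illustrative sketch, a SecurityGroupPolicy resource similar to the following assigns a dedicated security group to the executor pods selected by the label. The policy name, pod label, and security group ID are placeholder assumptions, and the feature has prerequisites (such as a supported VPC CNI configuration) that are covered in the best practices guide.

apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: emr-executor-sgp                 # hypothetical name
  namespace: <EMR-VC-NAMESPACE>
spec:
  podSelector:
    matchLabels:
      spark-role: executor               # label set through the executor pod template (assumption)
  securityGroups:
    groupIds:
    - sg-0123456789abcdef0               # placeholder security group allowed by the database

With this in place, the database can continue to allow traffic based on the source security group without opening access to the rest of the cluster.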

Cost management and chargeback

In a multi-tenant environment, cost management is a critical subject. You have multiple users from various business units, and you need to be able to precisely charge back the cost of the compute resources they have used. At the beginning of the post, we introduced three models of multi-tenancy in Amazon EKS: hard multi-tenancy, soft multi-tenancy, and sole multi-tenancy. Hard multi-tenancy is out of scope because its cost tracking is trivial: all the resources are dedicated to the single team using the cluster. That isn't the case for sole multi-tenancy and soft multi-tenancy. In the next sections, we discuss how to track cost for each of these two models.

Soft multi-tenancy

In a soft multi-tenant environment, you can perform chargeback to your data engineering teams based on the resources they consumed rather than the nodes allocated. In this method, you use the namespaces associated with the EMR virtual clusters to track how many resources were used for processing jobs. The following diagram illustrates an example.

Diagram to illustrate soft multi-tenancy

Diagram 1: Soft multi-tenancy

Tracking resources based on the namespace isn't an easy task because jobs are transient in nature and fluctuate in duration. However, there are partner tools available that keep track of the resources used, such as Kubecost, CloudZero, Vantage, and many others. For instructions on using Kubecost on Amazon EKS, refer to this blog post on cost monitoring for EKS customers.

Sole multi-tenancy

For sole multi-tenancy, chargeback is done at the instance (node) level. Each team uses a specific set of nodes dedicated to it. These nodes aren't always running; they are spun up by the Kubernetes auto scaling mechanism. The following diagram illustrates an example.

Diagram to illustrate Sole tenancy

Diagram 2: Sole tenancy

With sole multi-tenancy, you use cost allocation tags, an AWS mechanism that allows you to track how much each resource has consumed. Although sole multi-tenancy isn't efficient in terms of resource utilization, it provides a simplified strategy for chargeback. With cost allocation tags, you can charge back a team based on all the resources they used, such as Amazon S3, Amazon DynamoDB, and other AWS resources. The chargeback mechanism based on cost allocation tags can be augmented with the recently launched AWS Billing Conductor, which allows you to issue bills internally for your teams.

Resource management

In this section, we discuss considerations regarding resource management in multi-tenant clusters. We briefly cover sharing resources fairly, setting guardrails on resource consumption, techniques for reserving resources for time-sensitive or critical jobs, meeting quick resource scaling requirements, and cost optimization practices with node selectors.

Sharing resources

In a multi-tenant environment, the goal is to share resources like compute and memory for better resource utilization. However, this requires careful capacity management and resource allocation to make sure each tenant gets their fair share. In Kubernetes, resource allocation is controlled and enforced by using ResourceQuota and LimitRange. ResourceQuota limits resources at the namespace level, and LimitRange allows you to make sure that all containers are submitted with a resource request and a limit. In this section, we demonstrate how a data engineer or Kubernetes administrator can set up ResourceQuota and LimitRange configurations.

The administrator creates one ResourceQuota per namespace that provides constraints for aggregate resource consumption:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
  namespace: teamA
spec:
  hard:
    requests.cpu: "1000"
    requests.memory: 4000Gi
    limits.cpu: "2000"
    limits.memory: 6000Gi

For LimitRange, the administrator can review the following sample configuration. We recommend using default and defaultRequest to enforce the limit and request fields on containers. Lastly, from a data engineer's perspective, when submitting EMR on EKS jobs, you need to make sure the Spark resource requirements are within the range defined by the LimitRange. For example, in the following configuration, a request for spark.executor.cores=7 will fail because the maximum CPU limit is 6 per container:

apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-min-max
  namespace: teamA
spec:
  limits:
  - max:
      cpu: "6"
    min:
      cpu: "100m"
    default:
      cpu: "500m"
    defaultRequest:
      cpu: "100m"
    type: Container

Priority-based resource allocation

Diagram Illustrates an example of resource allocation with priority

Diagram 3: An example of resource allocation with priority

Because all the EMR virtual clusters share the same EKS compute platform with limited resources, there will be scenarios in which you need to prioritize jobs with a sensitive timeline. In this case, high-priority jobs can utilize the resources and finish, whereas running low-priority jobs are stopped and any new pods must wait in the queue. EMR on EKS achieves this with the help of pod templates, where you specify a priority class for a given job.

When a pod priority is enabled, the Kubernetes scheduler orders pending pods by their priority and places them in the scheduling queue. As a result, the higher-priority pod may be scheduled sooner than pods with lower priority if its scheduling requirements are met. If this pod can’t be scheduled, the scheduler continues and tries to schedule other lower-priority pods.

The preemptionPolicy field on the PriorityClass defaults to PreemptLowerPriority, and the pods of that PriorityClass can preempt lower-priority pods. If preemptionPolicy is set to Never, pods of that PriorityClass are non-preempting. In other words, they can’t preempt any other pods. When lower-priority pods are preempted, the victim pods get a grace period to finish their work and exit. If the pod doesn’t exit within that grace period, that pod is stopped by the Kubernetes scheduler. Therefore, there is usually a time gap between the point when the scheduler preempts victim pods and the time that a higher-priority pod is scheduled. If you want to minimize this gap, you can set a deletion grace period of lower-priority pods to zero or a small number. You can do this by setting the terminationGracePeriodSeconds option in the victim Pod YAML.
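
For example, the following excerpt of a low-priority (victim) executor pod template sets the grace period to zero so that preempted executors are removed immediately. This is a sketch; weigh a zero grace period against any cleanup work your executors perform on shutdown.

apiVersion: v1
kind: Pod
spec:
  priorityClassName: "low-priority"
  terminationGracePeriodSeconds: 0   # remove preempted executors immediately
  containers:
  - name: spark-kubernetes-executors # This will be interpreted as Spark executor container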

See the following code samples for priority class:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100
globalDefault: false
description: " High-priority Pods and for Driver Pods."

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 50
globalDefault: false
description: " Low-priority Pods."

One of the key considerations when templatizing the driver pods, especially for low-priority jobs, is to avoid using the same low-priority class for both the driver and executors. This saves the driver pods from being evicted and losing the progress of all their executors in a resource congestion scenario. In this low-priority job example, we use a high-priority class for the driver pod template and a low-priority class only for the executor template. This way, the driver pods are safe during the eviction process of low-priority jobs: only executors are evicted, and the driver can bring back the evicted executor pods as resources become free. See the following code:

apiVersion: v1
kind: Pod
spec:
  priorityClassName: "high-priority"
  nodeSelector:
    eks.amazonaws.com/capacityType: ON_DEMAND
  containers:
  - name: spark-kubernetes-driver # This will be interpreted as Spark driver container

apiVersion: v1
kind: Pod
spec:
  priorityClassName: "low-priority"
  nodeSelector:
    eks.amazonaws.com/capacityType: SPOT
  containers:
  - name: spark-kubernetes-executors # This will be interpreted as Spark executor container

Overprovisioning with priority

Diagram Illustrates an example of overprovisioning with priority

Diagram 4: An example of overprovisioning with priority

As pods wait in a pending state due to resource availability, additional capacity can be added to the cluster with Amazon EKS auto scaling. The time it takes to scale the cluster by adding new nodes has to be considered for time-sensitive jobs. Overprovisioning is an option to mitigate the auto scaling delay by using temporary pods with negative priority. These pods occupy space in the cluster; when pods with high priority are unschedulable, the temporary pods are preempted to make room, which causes the auto scaler to scale out new nodes. Be aware that this is a trade-off, because it adds cost while minimizing scheduling latency. For more information about overprovisioning best practices, refer to Overprovisioning.
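
A common way to implement overprovisioning is a negative-priority PriorityClass combined with a Deployment of pause pods that reserve headroom. The following is a sketch; the namespace, replica count, and resource sizes are placeholders you would tune to the amount of spare capacity you want to keep warm.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1                        # negative priority so these pods are always preempted first
globalDefault: false
description: "Placeholder pods that keep spare capacity warm."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
  namespace: kube-system         # placeholder namespace
spec:
  replicas: 2                    # amount of headroom to reserve (tune per cluster)
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: reserve-resources
        image: registry.k8s.io/pause:3.9   # pause container; only holds the resource request
        resources:
          requests:
            cpu: "1"
            memory: 2Gi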

Node selectors

EKS clusters can span multiple Availability Zones in a VPC. A Spark application whose driver and executor pods are distributed across multiple Availability Zones can incur inter-Availability Zone data transfer costs. To minimize or eliminate this cost, you can configure the job to run in a specific Availability Zone, or even on a specific node type, with the help of node labels. Amazon EKS places a set of default labels that identify the capacity type (On-Demand or Spot Instance), Availability Zone, instance type, and more. In addition, you can use custom labels to meet workload-specific node affinity requirements.

EMR on EKS allows you to choose specific nodes in two ways:

  • At the job level. Refer to EKS Node Placement for more details.
  • At the driver and executor level, using pod templates.

When using pod templates, we recommend using On-Demand Instances for driver pods. You can also consider Spot Instances for executor pods in workloads that can tolerate occasional periods when the target capacity is not completely available. Using Spot Instances allows you to save cost for jobs that aren't critical and can tolerate interruption. Refer to Define a NodeSelector in PodTemplates; an example pod template excerpt follows.
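
For instance, the following executor pod template excerpt pins executors to a single Availability Zone and to Spot capacity using the default Kubernetes and Amazon EKS labels. The zone name is a placeholder; use one of the zones your node groups actually span.

apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    topology.kubernetes.io/zone: us-east-1a    # placeholder Availability Zone
    eks.amazonaws.com/capacityType: SPOT
  containers:
  - name: spark-kubernetes-executors # This will be interpreted as Spark executor container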

Conclusion

In this post, we provided guidance on how to design and deploy EMR on EKS in a multi-tenant EKS environment through different lenses: network, security, cost management, and resource management. For any deployment, we recommend the following:

  • Use IRSA with a condition scoped on the EMR on EKS service account
  • Use a secrets manager such as AWS Secrets Manager to store credentials and the Secrets Store CSI Driver to access them in your Spark application
  • Use ResourceQuota and LimitRange to specify the resources that each of your data engineering teams can use and avoid compute resource abuse and starvation
  • Implement a network policy to segregate network traffic between pods

Lastly, if you are considering migrating your Spark workload to EMR on EKS, you can learn more about design patterns to manage Apache Spark workloads on EMR on EKS in this blog post, and about migrating your EMR transient clusters to EMR on EKS in this blog post.


About the Authors

Lotfi Mouhib is a Senior Solutions Architect working for the Public Sector team with Amazon Web Services. He helps public sector customers across EMEA realize their ideas, build new services, and innovate for citizens. In his spare time, Lotfi enjoys cycling and running.

Ajeeb Peter is a Senior Solutions Architect with Amazon Web Services based in Charlotte, North Carolina, where he guides global financial services customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. He brings over 20 years of technology experience in software development, architecture, and analytics from industries like finance and telecom.

Implementing long running deployments with AWS CloudFormation Custom Resources using AWS Step Functions

Post Syndicated from DAMODAR SHENVI WAGLE original https://aws.amazon.com/blogs/devops/implementing-long-running-deployments-with-aws-cloudformation-custom-resources-using-aws-step-functions/

An AWS CloudFormation custom resource provides a mechanism to provision AWS resources that don't have built-in support in CloudFormation. It lets us write custom provisioning logic for resources that aren't supported as resource types in CloudFormation. This post focuses on use cases where a CloudFormation custom resource is used to implement a long-running task or job. With custom resources, you can manage these custom tasks (which are one-off in nature) as deployment stack resources.

The common pattern for implementing custom resources is to use an AWS Lambda function. However, when using a Lambda function as the custom resource provider, you must consider its trade-offs, such as its 15-minute timeout. Tasks involved in provisioning certain AWS resources can be long running and span beyond the Lambda timeout. In these scenarios, you must look beyond the conventional Lambda function-based approach for custom resources.

In this post, I'll demonstrate how to use AWS Step Functions to implement custom resources using the AWS Cloud Development Kit (AWS CDK). Step Functions lets you orchestrate complex deployment tasks as a step-by-step workflow, and it offers direct integration with any AWS service via AWS SDK integrations. By default, the CloudFormation stack waits for 1 hour before timing out; the timeout can be increased to a maximum of 12 hours using wait conditions. In this post, you'll also see how to use wait conditions with a custom resource to run long-running deployment tasks as part of a CloudFormation stack.

Prerequisites

Before proceeding any further, you must identify and designate an AWS account required for the solution to work. You must also create an AWS account profile in ~/.aws/credentials for the designated AWS account, if you don’t already have one. The profile must have sufficient permissions to run an AWS CDK stack. It should be your private profile and only be used during the course of this post. Therefore, it should be fine if you want to use admin privileges. Don’t share the profile details, especially if it has admin privileges. I recommend removing the profile when you’re finished with this walkthrough. For more information about creating an AWS account profile, see Configuring the AWS CLI.

Services and frameworks used in the post include CloudFormation, Step Functions, Lambda, DynamoDB, Amazon S3, and AWS CDK.

Solution overview

The following architecture diagram shows the application of Step Functions to implement custom resources.

Architecture diagram

Figure 1. Architecture diagram

  1. The user deploys a CloudFormation stack that includes a custom resource implementation.
  2. The CloudFormation custom resource triggers a Lambda function with the appropriate event, which can be CREATE, UPDATE, or DELETE.
  3. The custom resource Lambda function invokes the Step Functions workflow and offloads the event handling responsibility. The CloudFormation event and context are wrapped inside the Step Functions input at the time of invocation.
  4. The custom resource Lambda function returns SUCCESS to the CloudFormation stack, indicating that custom resource provisioning has begun. The CloudFormation stack then goes into a waiting mode, where it waits for a SUCCESS or FAILURE signal to continue.
  5. In the interim, the Step Functions workflow handles the custom resource event through one or more steps.
  6. The Step Functions workflow prepares the response to be sent back to the CloudFormation stack.
  7. The Send Response Lambda function sends a success or failure response back to the CloudFormation stack. This moves the CloudFormation stack out of the waiting mode and into completion.

Solution deep dive

In this section, I get into the details of several key aspects of the solution.

Custom Resource Definition

The following code snippet shows the custom resource definition, which can be found here. Note that we also define AWS::CloudFormation::WaitCondition and AWS::CloudFormation::WaitConditionHandle resources alongside the custom resource. The AWS::CloudFormation::WaitConditionHandle resource sets up a presigned URL that is passed into the CallbackUrl property of the custom resource.

The final completion signal for the custom resource (SUCCESS or FAILURE) is received over this CallbackUrl. To learn more about wait conditions, refer to the user guide. Note that when updating the custom resource, you can't reuse the existing WaitCondition-WaitConditionHandle resource pair; you need to create a new pair to track each update or delete operation on the custom resource.

/************************** Custom Resource Definition *****************************/
// When you intend to update CustomResource make sure that a new WaitCondition and 
// a new WaitConditionHandle resource is created to track CustomResource update.
// The strategy we are using here is to create a hash of Custom Resource properties.
// The resource names for WaitCondition and WaitConditionHandle carry this hash.
// Anytime there is an update to the custom resource properties, a new hash is generated,
// which automatically leads to new WaitCondition and WaitConditionHandle resources.
const resourceName: string = getNormalizedResourceName('DemoCustomResource');
const demoData = {
    pk: 'demo-sfn',
    sk: resourceName,
    ts: Date.now().toString()
};
const dataHash = hash(demoData);
const wcHandle = new CfnWaitConditionHandle(
    this, 
    'WCHandle'.concat(dataHash)
)
const customResource = new CustomResource(this, resourceName, {
    serviceToken: customResourceLambda.functionArn,
    properties: {
        DDBTable: String(demoTable.tableName),
        Data: JSON.stringify(demoData),
        CallbackUrl: wcHandle.ref
    }
});
        
// Note: AWS::CloudFormation::WaitCondition resource type does not support updates.
new CfnWaitCondition(
    this,
    'WC'.concat(dataHash),
    {
        count: 1,
        timeout: '300',
        handle: wcHandle.ref
    }
).node.addDependency(customResource)
/**************************************************************************************/

Custom Resource Lambda

The following code snippet shows how the custom resource Lambda function passes the CloudFormation event as input to the Step Functions state machine at the time of invocation. The CloudFormation event contains the CallbackUrl resource property discussed in the previous section.

private async startExecution() {
    const input = {
        cfnEvent: this.event,
        cfnContext: this.context
    };
    const params: StartExecutionInput = {
        stateMachineArn: String(process.env.SFN_ARN),
        input: JSON.stringify(input)
    };
    let attempt = 0;
    let retry = false;
    do {
        try {
            const response = await this.sfnClient.startExecution(params).promise();
            console.debug('Response: ' + JSON.stringify(response));
            retry = false;
        } catch (error) {
            // Sketch of the retry handling; the full implementation in the repository may differ.
            console.error('Failed to start execution: ' + JSON.stringify(error));
            retry = ++attempt < 3; // retry a limited number of times before giving up
        }
    } while (retry);
}

Custom Resource StepFunction

The Step Functions workflow handles the CloudFormation event based on the event type. The CloudFormation event containing the CallbackUrl is passed down the stages of the workflow all the way to the final step. The last step of the workflow sends the response back over the CallbackUrl via the send-cfn-response Lambda function, as shown in the following code snippet.

/**
 * Send response back to cloudformation
 * @param event
 * @param context
 * @param response
 */
export async function sendResponse(event: any, context: any, response: any) {
    const responseBody = JSON.stringify({
        Status: response.Status,
        Reason: "Success",
        UniqueId: response.PhysicalResourceId,
        Data: JSON.stringify(response.Data)
    });
    console.debug("Response body:\n", responseBody);
    const parsedUrl = url.parse(event.ResourceProperties.CallbackUrl);
    const options = {
        hostname: parsedUrl.hostname,
        port: 443,
        path: parsedUrl.path,
        method: "PUT",
        headers: {
            "content-type": "",
            "content-length": responseBody.length
        }
    };
    // Wrap the HTTPS PUT in a promise that resolves on response and rejects on error,
    // so that the await completes once the signal has been delivered.
    await new Promise<void>((resolve, reject) => {
        const request = https.request(options, function(response: any) {
            console.debug("Status code: " + response.statusCode);
            console.debug("Status message: " + response.statusMessage);
            resolve();
        });
        request.on("error", function(error) {
            console.debug("send(..) failed executing https.request(..): " + error);
            reject(error);
        });
        request.write(responseBody);
        request.end();
    });
    return;
}

Demo

Clone the GitHub repo cfn-custom-resource-using-step-functions and navigate to the folder cfn-custom-resource-using-step-functions. Then run the script script-deploy.sh, passing the name of the AWS profile that you created in the prerequisites section. This deploys the solution. The commands are shown below for your reference. Note that if you don't pass an AWS profile name, the 'default' profile is used for deployment.

git clone 
cd cfn-custom-resource-using-step-functions
./script-deploy.sh "<AWS- ACCOUNT-PROFILE-NAME>"

The deployed solution consists of two stacks, as shown in the following screenshot:

  1. cfn-custom-resource-common-lib: Deploys common components
    • DynamoDB table that custom resources write to during their lifecycle events
    • Lambda layer used across the rest of the stacks
  2. cfn-custom-resource-sfn: Deploys Step Functions backed custom resource implementation
CloudFormation stacks deployed

Figure 2. CloudFormation stacks deployed

For demo purposes, I implemented a custom resource that inserts data into the DynamoDB table. When you deploy the solution for the first time, as you just did in the previous step, it initiates a CREATE event, resulting in the creation of a new custom resource using Step Functions. You should see a new record with a Unix epoch timestamp in the DynamoDB table, indicating that the resource was created, as shown in the following screenshot. You can find the DynamoDB table name and ARN in the SSM Parameter Store parameter /CUSTOM_RESOURCE_PATTERNS/DYNAMODB/ARN.

DynamoDB record indicating custom resource creation

Figure 3. DynamoDB record indicating custom resource creation

Now, run the script script-deploy.sh again. This initiates an UPDATE event, resulting in the update of the custom resource. The code also automatically creates the new WaitConditionHandle and WaitCondition resources required to wait for the update event to finish. You should now see that the record in the DynamoDB table has been updated with new values for the lastOperation and ts attributes, as follows.

DynamoDB record indicating custom resource update

Figure 4. DynamoDB record indicating custom resource update

Cleaning up

To remove all of the stacks, run the script script-undeploy.sh as follows.

./script-undeploy.sh "<AWS- ACCOUNT-PROFILE-NAME>"

Conclusion

In this post I showed how to look beyond the conventional approach of building CloudFormation custom resources using a Lambda function. I discussed implementing custom resources using Step Functions and CloudFormation wait conditions. Try this solution in scenarios where you must execute a long running deployment task/job as part of your CloudFormation stack deployment.

 

 

About the author:

Damodar Shenvi

Damodar Shenvi Wagle is a Cloud Application Architect at AWS Professional Services. His areas of expertise include architecting serverless solutions, CI/CD and automation.

Hazard analysis and Chaos engineering at Vanguard Group

Post Syndicated from Jason Barto original https://aws.amazon.com/blogs/devops/hazard-analysis-and-chaos-engineering-at-vanguard-group/

Anticipating events that can cause a disruption to your system’s service is critical to building highly available, reliable systems.  Hazard analysis gives you a method to identify such events.  Chaos engineering gives you a method to confirm that a system behaves as expected in adverse conditions.  By combining these methods, Vanguard is building reliability into their systems.

Vanguard engineering teams perform hazard analysis on their systems and capture the identified events as failure scenarios.  They use the identified failure scenarios to create hypotheses to support chaos engineering experiments.  These hypotheses predict how the system will respond to failures and each hypothesis is then confirmed through experimentation to increase the team’s confidence in the system’s reliability.

In this article we will walk you through how Vanguard uses hazard analysis and chaos engineering.  We will also provide guidance on how you can employ these techniques on your applications.

Failure Mode & Effects Analysis

A hazard analysis can be performed using different methods. Vanguard has adapted the failure mode and effects analysis (FMEA) method to support its important services.

FMEA is a bottom-up approach to analyzing an architecture, focusing on the impact to system functions when one or more components of the system are disrupted. Members of the engineering team and the architects responsible for designing and building a system brainstorm possible failure scenarios, or failure modes, and document the impact of these failures on the system. Combined with a quantitative method for ranking the failure modes, the analysis process produces a prioritized list of failure modes that describes how the system would respond to individual or combined failures in its component parts or dependencies.

For each failure mode the team conducting the analysis will highlight what protections exist within the system to guard against the failure mode.  Sometimes, fault isolation boundaries have been put in place to prevent client impact in failure scenarios. In other scenarios, for one reason or another, there are hard dependencies in place for which the engineering team has decided not to build in fault tolerance. For example, a team responsible for a less-critical function may have architected its system to operate across multiple availability zones, but could decide not to implement other mitigations to prioritize cost over increased resilience.

The FMEA method has been in use by engineers in the automotive, aeronautical, healthcare, and military industries for more than 60 years.  Over that time, FMEA has been modified to best suit the organization and the field in which it was applied.  In many variations the FMEA measures each failure mode with a risk priority number (RPN), which is intended to quantitatively rank the failure mode based upon:

  1. The failure mode’s impact to the system as a whole
  2. The probability of the failure mode’s occurrence
  3. How easily the failure mode can be detected

Vanguard has adapted the FMEA process to serve its own specific requirements and processes. Vanguard decided not to adopt the RPN element of the FMEA process because teams found they spent a lot of time debating the impact, probability, and detectability of individual failure modes. To perform an FMEA more quickly, teams instead focus only on the failure modes and system impact, documenting a mental model of system performance that can then be tested through chaos engineering.

An excerpt of a Vanguard FMEA output is provided as an example in the following table:

The “Process Step” in the table above refers to a business function of the system being analyzed, for example “Request to retrieve stored data”. As part of the analysis, the team identifies the system components needed to perform the Process Step and considers the interactions of those components. Focusing on a Process Step makes it easier to anticipate the failure scenarios that would affect the system in performing this particular business function. Also, the Process Step implies an importance or criticality, which can be a factor when prioritizing mitigations.

After selecting a Process Step, you walk through the system components involved and identify how component failures or disruptions will affect the wider system. Such component failures may involve individual components or a combination of components and are captured as “Failure Mode”. This identifies the component or components that are disrupted and their behaviour; for example, “Microservice is unavailable or returns an error”.

“Expected Behaviour” describes the effect of the failure mode on the wider system, in the context of the Process Step. This captures what other system components are affected by the Failure Mode and why, and how this impacts the Process Step as a whole.

Lastly, the “Hypothesis” column forms the basis for the chaos experiments that will follow from the FMEA to confirm that the system performs as expected.

At Vanguard, all mission-critical product teams are conducting FMEAs for their production applications. The outputs of these sessions are maintained over time and serve multiple purposes:

  1. When onboarding new team members, it is helpful to provide the FMEA document alongside an architecture diagram and narrative. It will paint a more robust picture of how the system is intended to operate in both “happy path” and “unhappy path” scenarios.
  2. When troubleshooting incidents, an FMEA document can help on-call engineers – especially those less experienced with debugging – to match up the documented expectations to the observed system behavior.
  3. Site Reliability Engineers (SREs) looking for opportunities to improve the resilience of a system might look to FMEA documentation to understand the existing fault isolation boundaries and introduce additional resilience mechanisms through automation and system changes.
  4. Finally, when selecting scenarios for experimentation with Chaos Engineering, the FMEA document provides a list of conjectures that have been mapped to hypotheses, ready to be validated through experimentation. This input into the Chaos Engineering workflow is the primary use of FMEA documents for Vanguard product teams.

There are many resources available online to learn more about how FMEA is used and applied in other organizations. In Failure Modes and Continuous Resilience, Adrian Cockcroft introduces FMEA as a method for anticipating failure scenarios. The NASA Software Engineering Handbook details how FMEAs are conducted as part of their engineering process. The Automotive Industry Action Group has also formally documented the use of FMEA in its FMEA Handbook.

Chaos Engineering

After failure modes have been identified and mitigated through system design, it’s time to understand how resilient the system’s implementation is to those failure modes. Chaos engineering can be used to explore a system and validate that a system’s implementation meets business resiliency objectives.

Chaos engineering helps to improve a team’s mental model about the system under experimentation and provides insights into how a complex system behaves under adverse conditions. It also enables an engineer to find the unknown unknowns and the known unknowns through experiments that are built on top of the hypothesis. These experiments should simulate real world events, such as network degradation and increased client requests, and the outcome of the experiment should not be known. In other words, an experiment is not an experiment if it’s known that the conditions will cause the system to fail.

Prerequisites to Chaos Experiments at Vanguard

At Vanguard, there are some necessary prerequisites to running a chaos experiment. Firstly, the system under experiment must be set up with some basic observability tooling that will allow teams to monitor the state of the application during the failure injection. This could be as simple as an Amazon CloudWatch dashboard and some associated alarms, or as elaborate as a dedicated dashboard set up in a vendor tool.

Secondly, teams must be able to drive load to the application during the experiment; depending on the experiment type, the level and type of load may vary. The load generator can be as simple as a script on someone’s machine, or a fully automated load test depending on the requirements of the hypothesis.

Finally, teams need to have a good understanding of what the application's “steady state” looks like. Ideally, this takes the form of metrics such as expected error rate, expected latency, and/or a service level objective (SLO) that can be monitored throughout the duration of the experiment. For example, a service level objective for a RESTful API might be that 90% of requests receive a response within 100 milliseconds.

With the prerequisites met and a completed FMEA, teams can then experiment with their hypothesis using various experiment templates defined by Vanguard’s Climate of Chaos tooling.

Vanguard’s Climate of Chaos

At Vanguard, ensuring its software systems are resilient to adverse events is a critical part of its ongoing mission to provide world-class service to their clients. Vanguard believes that in order to develop high quality software, one must plan for the inevitable “stormy weather” events that occur in a distributed system.

Over the past 2 years, as a response to this need, Vanguard has developed in-house tooling called “The Climate of Chaos” to give teams easy access to common experiment templates, along with a friendly UI interface. The Climate of Chaos helps developers experiment on their systems and validate the hypotheses generated from FMEAs. It also provides the tooling for them to simulate the most common failure scenarios on Vanguard’s most commonly utilized AWS infrastructure, including Amazon Elastic Container Service (Amazon ECS), AWS Fargate, Amazon DynamoDB, Amazon Relational Database Service (Amazon RDS), AWS Lambda, and others.

The Climate of Chaos was created prior to Amazon’s release of the AWS Fault Injection Simulator (FIS), and today there is a lot of overlap with the experiment capabilities available in FIS. The Climate of Chaos has also been enhanced with company-specific features and integrations that make it easier for Vanguard developers to run chaos experiments in a controlled and predictable manner.

The Climate of Chaos includes important safety features such as an “emergency stop” function. This feature enables teams to terminate the experiment immediately if unintended side effects are encountered, rolling back the simulated events to resume steady-state operation. The Climate of Chaos is coupled with other systems, like in-house load testing tooling, and adds features like the ability to monitor CloudWatch alarms. Vanguard also offers teams the ability to schedule experiments to run at their convenience. Soon, Vanguard hopes to make running chaos experiments even smarter, introducing tools that will help teams run bulk experiments that systematically inject failures into a group of related applications to help pinpoint more complex failure modes.

Next Steps

Failure modes and effects analysis is a hazard analysis method which can help you identify single and combined points of failure in your system so you can prioritize the failure modes. To learn more about the FMEA process, you can read the NASA Software Engineering Handbook which outlines how they perform FMEA on their software-based systems. The AWS Whitepaper Building Mission-Critical Financial Services Applications on AWS provides example forms and suggestions for severity, probability, and detectability rankings. Appendix F in the whitepaper suggests a 1 to 10 ranking for each Risk Priority Number input, and the example spreadsheets recommend performing FMEAs for the application, platform, infrastructure, and operation layers of the system. Using these examples, you can perform an analysis of your own systems and generate hypotheses.

To experiment on your systems and validate your own hypotheses, you can use the AWS Fault Injection Simulator (FIS) mentioned earlier in this article. FIS provides you with a framework for performing controlled chaos experiments on your AWS workloads. It helps you to safely manage your experiments by providing tooling to monitor, rollback, and orchestrate chaos experiments. FIS provides the fault injection mechanisms that you will need to experiment upon your system’s implementation and resilience to identified failure modes. You can start by running experiments in pre-production environments, and then step up to running them as part of your CI/CD workflow and ultimately in your production environment. To learn more about FIS, you can read the FIS User Guide and FIS tutorials.

By using FMEA to anticipate the failures and experimenting on your systems with chaos engineering, you will gain confidence in the reliability of your system.

The content and opinions in this post are those of The Vanguard Group and AWS is not responsible for the content or accuracy of this post.

About the authors:

Tory Benya

Tory works as a Chaos Engineering Tech Lead at Vanguard.  She is passionate about automation, data, and making software work for people.  She likes to automate, integrate, and improve processes and technology.  Tory makes data-driven decisions to make a difference as part of her team at Vanguard.

Christina Yakomin

Christina works as a Senior Site Reliability Engineering Specialist in Vanguard’s Chief Technology Office. Throughout her career, she has developed an expansive skill set in front- and back-end web development, as well as cloud infrastructure and automation, with a specialization in Site Reliability Engineering. She has earned several Amazon Web Services certifications, including the Solutions Architect – Professional. Christina has also worked closely with the Women’s Initiative for Leadership Success at Vanguard, both internally at the company and externally in the local community, to further the career advancement of women and girls – in particular within the tech industry.

Jason Barto

Jason works as a Principal Solutions Architect at AWS where he works with customers to design resilient system architectures and develop chaos engineering practices. Prior to joining AWS Jason was designing and building distributed systems for complex event processing and real-time telemetry analytics.

John Formento

John is a Solutions Architect at AWS. He helps large enterprises achieve their goals by architecting secure and scalable solutions on the AWS Cloud. John holds 7 AWS certifications including AWS Certified Solutions Architect – Professional and DevOps Engineer – Professional.

Deploying Local Gateway Ingress Routing on AWS Outposts

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/deploying-local-gateway-ingress-routing-on-aws-outposts/

This post is written by Leonardo Solano, Senior Hybrid Cloud Solution Architect and Chris Lunsford, Senior Specialist Solutions Architect, AWS Outposts.

AWS Outposts lets customers use the same Amazon Virtual Private Cloud (VPC) security mechanisms, such as security groups and network access control lists, to control traffic flows for on-premises applications running on Outposts. Some customers, desiring additional security or consistency with on-premises systems, want the ability to inspect and filter incoming application traffic as it enters the Outpost. Ideally, they would like to deploy virtual appliances in front of the workloads running on Outposts.

Today, we are announcing a new feature called Outposts local gateway (LGW) ingress routing. This lets you create LGW inbound routes to redirect incoming traffic to an Amazon Elastic Compute Cloud (Amazon EC2) Elastic Network Interface (ENI) associated with an EC2 instance running on an Outposts rack. The traffic is redirected for inspection before it reaches the workloads running on the Outposts rack. Moreover, it lets the EC2 virtual appliance inspect, filter, or optimize the traffic in a similar way as VPC ingress routing in the Region.

Use case

A common use case for this feature is deploying a customer-preferred third-party virtual network appliance. The appliance can inspect, modify, or monitor the incoming traffic for policy compliance and forward compliant traffic on to the workloads running on the Outpost. A typical virtual appliance could be a firewall, intrusion detection system (IDS), or intrusion prevention system (IPS). The features provided by the virtual appliances vary, and they may include deep packet inspection, traffic optimization, and flow monitoring. This new Outposts rack feature modifies the default behavior of the local gateway routing table (LGW-RTB), and it lets customers redirect traffic coming into an Outposts deployment to the virtual appliance.

 Local Gateway Ingress Routing on Outposts Architecture

The new behavior

Now you can create static routes in the LGW-RTB that target a specific ENI on the Outpost as the next hop. These static routes are propagated toward the customer network through the Border Gateway Protocol (BGP) peering sessions with the Customer Networking Devices. The on-premises network will route traffic destined for the Classless Inter-Domain Routing (CIDR) prefixes defined in the static routes toward the Outposts Network Devices.

 Local Gateway Routing Table

In the preceding diagram, the static route 198.19.33.248/29 has a longer prefix length than 198.19.33.240/28, and both routes will be propagated toward the customer network via BGP. The incoming traffic for the 198.19.33.248/29 prefix will be directed toward the ENI eni-1234example0. The architecture looks like the following diagram, where the security virtual appliance sits between the LGW and a set of EC2 instances on the Outpost.

Local Gateway Advertised routes

As ingress traffic is routed through the virtual appliance for inspection and filtering, the destination addresses of packets arriving at the ENI of the virtual appliance won’t match its ENI’s private IP address (the packets are transiting the instance). By default, the ENI will drop the inbound traffic unless you disable source/destination checking on the virtual appliance instance ENI settings. The following screenshot shows how you can disable the EC2 instance source/destination checking in the AWS console.

EC2 source/destination check

Considerations for LGW ingress routing

Consider the following requirements when preparing to deploy LGW ingress routing:

  • The ENIs used as the next-hop target must be deployed in an Outposts Subnet.
  • The subnets must belong to a VPC associated with the LGW-RTB.
  • Routes with the longest matches are prioritized. If there are two with the same destination CIDR, then static routes are preferred over propagated ones.

Working with Outposts LGW ingress routing

The following output shows what the LGW route table looks like before applying the ingress routing feature:

{
    "Routes": [
        {
            "DestinationCidrBlock": "0.0.0.0/0",
            "LocalGatewayVirtualInterfaceGroupId": "lgw-vif-grp-XXX",
            "Type": "static",
            "State": "active",
            "LocalGatewayRouteTableId": "lgw-rtb-XXX",
            "LocalGatewayRouteTableArn": "arn:aws:ec2:>AWS-REGION>:<account-id>:local-gateway-route-table/lgw-rtb-XXX",
            "OwnerId": "<account-id>"
        },
        {
            "DestinationCidrBlock": "198.19.33.16/28",
            "CoipPoolId": "coip-pool-0000aaaabbbbcccc1111",
            "Type": "propagated",
            "State": "active",
            "LocalGatewayRouteTableId": "lgw-rtb-XXX",
            "LocalGatewayRouteTableArn": "arn:aws:ec2:<AWS-REGION>:<account-id>:local-gateway-route-table/lgw-XXX",
            "OwnerId": "<account-id>"
        },
        {
            "DestinationCidrBlock": "198.19.33.240/28",
            "CoipPoolId": "coip-pool-0000aaaabbbbcccc2222",
            "Type": "propagated",
            "State": "active",
            "LocalGatewayRouteTableId": "lgw-rtb-XXX",
            "LocalGatewayRouteTableArn": "arn:aws:ec2:<AWS-REGION>:<account-id>:local-gateway-route-table/lgw-XXX",
            "OwnerId": "<account-id>"
        }
     ]
}

The notable entries in the LGW-RTB before adding a local-gateway-route are the propagated routes. These represent the Outposts subnets and can't be deleted or modified to use a specific ENI on the Outpost as the next hop. In the following section, we cover how the table looks after the creation of a local-gateway-route.

Configuring LGW ingress routing

To configure LGW ingress routing, you must provide the LGW route table ID, the ID of the ENI that will be used as the next hop, and the destination CIDR block. Once you have identified those three parameters, you can configure LGW ingress routing with the AWS CLI, as shown in the following example, where the prefix 198.19.33.248/29 is routed to an ENI on the Outpost. If the route points to an ENI attached to an instance, then the route shows as active. If the route points to an ENI that isn't attached to an EC2 instance, then the route shows a blackhole state.

$ aws ec2 create-local-gateway-route \
  --local-gateway-route-table-id <lgw-rtb-id> \
  --network-interface-id <eni-id> \
  --destination-cidr-block 198.19.33.248/29
  
{
    "Route": {
        "DestinationCidrBlock": "198.19.33.248/29",
        "NetworkInterfaceId": "eni-id",
        "Type": "static",
        "State": "active",
        "LocalGatewayRouteTableId": "lgw-rtb-id",
        "LocalGatewayRouteTableArn": "arn:aws:ec2:<AWS-REGION>:<account-id>:local-gateway-route-table/<lgw-rtb-id>",
        "OwnerId": "<account-id>"
    }
}

Once LGW ingress routing has been configured, the LGW routes traffic destined for the 198.19.33.248/29 prefix to the target ENI, which must be in an Outposts subnet. Note that the segment 198.19.33.248/29 is part of the Outposts CIDR range of 198.19.33.240/28, which in this case belongs to the Outposts customer-owned IP address (CoIP) CIDRs. When traffic follows a static route to an ENI, the packet destination address is preserved and isn't translated to the private address of the ENI.

In this case, the new LGW-RTB will look like the following:

{
    "Routes": [
        {
            "DestinationCidrBlock": "0.0.0.0/0",
            "LocalGatewayVirtualInterfaceGroupId": "lgw-vif-grp-XXX",
            "Type": "static",
            "State": "active",
            "LocalGatewayRouteTableId": "lgw-rtb-XXX",
            "LocalGatewayRouteTableArn": "arn:aws:ec2:<AWS-REGION>:<account-id>:local-gateway-route-table/lgw-rtb-XXX",
            "OwnerId": "<account-id>"
        },
        {
            "DestinationCidrBlock": "198.19.33.16/28",
            "CoipPoolId": "coip-pool-0000aaaabbbbcccc1111",
            "Type": "propagated",
            "State": "active",
            "LocalGatewayRouteTableId": "lgw-rtb-XXX",
            "LocalGatewayRouteTableArn": "arn:aws:ec2:<AWS-REGION>:<account-id>:local-gateway-route-table/lgw-XXX",
            "OwnerId": "<account-id>"
        },
        {
            "DestinationCidrBlock": "198.19.33.240/28",
            "CoipPoolId": "coip-pool-0000aaaabbbbcccc1111",
            "Type": "propagated",
            "State": "active",
            "LocalGatewayRouteTableId": "lgw-rtb-XXX",
            "LocalGatewayRouteTableArn": "arn:aws:ec2:<AWS-REGION>:<account-id>:local-gateway-route-table/lgw-XXX",
            "OwnerId": "<account-id>"
        },
         {
            "DestinationCidrBlock": "198.19.33.248/29",
            "NetworkInterfaceId": "eni-XXX",
            "Type": "static",
            "State": "active",
            "LocalGatewayRouteTableId": "lgw-rtb-XXX",
            "LocalGatewayRouteTableArn": "arn:aws:ec2:<AWS-REGION>:<account-id>:local-gateway-route-table/lgw-rtb-XXX",
            "OwnerId": "<account-id>"
        }
     ]
}

In the AWS console, the LGW-RTB will show the new ingress routing route:

Console view of the Local Gateway Routing Table

Modifying LGW ingress routing

Use an AWS CLI command similar to the one we used to create the LGW ingress routing route to modify existing routes. In this case, the command is aws ec2 modify-local-gateway-route, and the arguments are the same as with the create command. Use this command when you want to shift inbound traffic from one EC2 instance to another, perhaps from an active to a standby network appliance while you perform required maintenance on the primary instance.

$ aws ec2 modify-local-gateway-route \
  --local-gateway-route-table-id <lgw-rtb-id> \
  --network-interface-id <new-eni-id> \
  --destination-cidr-block 198.19.33.248/29
{
    "Route": {
        "DestinationCidrBlock": "198.19.33.248/29",
        "NetworkInterfaceId": "new-eni-id",
        "Type": "static",
        "State": "active",
        "LocalGatewayRouteTableId": "lgw-rtb-id",
        "LocalGatewayRouteTableArn": "arn:aws:ec2:<AWS-REGION>:<account-id>:local-gateway-route-table/<lgw-rtb-id>",
        "OwnerId": "<account-id>"
    }
}

Conclusion

AWS Outposts LGW ingress routing allows AWS customers and partners to deploy virtual appliances on Outposts rack and direct inbound traffic through those appliances. The virtual appliance can inspect, filter, and optimize the ingress traffic before forwarding it on to the workloads running on Outposts rack, creating fine-grained network and security policies for your workloads. To learn more about AWS Outposts rack, visit the product overview page.

Choose the k-NN algorithm for your billion-scale use case with OpenSearch

Post Syndicated from Jack Mazanec original https://aws.amazon.com/blogs/big-data/choose-the-k-nn-algorithm-for-your-billion-scale-use-case-with-opensearch/

When organizations set out to build machine learning (ML) applications such as natural language processing (NLP) systems, recommendation engines, or search-based systems, they often use k-Nearest Neighbor (k-NN) search at some point in the workflow. As the number of data points reaches the hundreds of millions or even billions, scaling a k-NN search system can be a major challenge. Applying Approximate Nearest Neighbor (ANN) search is a great way to overcome this challenge.

The k-NN problem is relatively simple compared to other ML techniques: given a set of points and a query, find the k nearest points in the set to the query. The naive solution is equally understandable: for each point in the set, compute its distance from the query and keep track of the top k along the way.

K-NN concept

The problem with this naive approach is that it doesn't scale particularly well. The runtime search complexity is O(N log k), where N is the number of vectors and k is the number of nearest neighbors. Although this may not be noticeable when the set contains thousands of points, it becomes noticeable when the size gets into the millions. Although some exact k-NN algorithms can speed up the search, they tend to perform similarly to the naive approach in higher dimensions.

Enter ANN search. We can reduce the runtime search latency if we loosen a few constraints on the k-NN problem:

  • Allow indexing to take longer
  • Allow more space to be used at query time
  • Allow the search to return an approximation of the k-NN in the set

Several different algorithms have been discovered to do just that.

OpenSearch is a community-driven, Apache 2.0-licensed, open-source search and analytics suite that makes it easy to ingest, search, visualize, and analyze data. The OpenSearch k-NN plugin provides the ability to use some of these algorithms within an OpenSearch cluster. In this post, we discuss the different algorithms that are supported and run experiments to see some of the trade-offs between them.

Hierarchical Navigable Small Worlds algorithm

The Hierarchical Navigable Small Worlds algorithm (HNSW) is one of the most popular algorithms out there for ANN search. It was the first algorithm that the k-NN plugin supported, using a very efficient implementation from the nmslib similarity search library. It has one of the best query latency vs. recall trade-offs and doesn’t require any training. The core idea of the algorithm is to build a graph with edges connecting index vectors that are close to each other. Then, on search, this graph is partially traversed to find the approximate nearest neighbors to the query vector. To steer the traversal towards the query’s nearest neighbors, the algorithm always visits the closest candidate to the query vector next.

But which vector should the traversal start from? It could just pick a random vector, but for a large index, this might be very far from the query’s actual nearest neighbors, leading to poor results. To pick a vector that is generally close to the query vector to start from, the algorithm builds not just one graph, but a hierarchy of graphs. All vectors are added to the bottom layer, and then a random subset of those are added to the layer above, and then a subset of those are added to the layer above that, and so on.

During search, we start from a random vector in the top layer, partially traverse the graph to find (approximately) the nearest point to the query vector in that layer, and then use this vector as the starting point for our traversal of the layer below. We repeat this until we get to the bottom layer. At the bottom layer, we perform the traversal, but this time, instead of just searching for the nearest neighbor, we keep track of the k-nearest neighbors that are visited along the way.

The following figure illustrates this process (inspired by the image in the original paper, Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs).

You can tune three parameters for HNSW:

  • m – The maximum number of edges a vector will get in a graph. The higher this number is, the more memory the graph will consume, but the better the search approximation may be.
  • ef_search – The size of the queue of the candidate nodes to visit during traversal. When a node is visited, its neighbors are added to the queue to be visited in the future. When this queue is empty, the traversal will end. A larger value will increase search latency, but may provide better search approximation.
  • ef_construction – Very similar to ef_search. When a node is to be inserted into the graph, the algorithm will find its m edges by querying the graph with the new node as the query vector. This parameter controls the candidate queue size for this traversal. A larger value will increase index latency, but may provide a better search approximation.

For more information on HNSW, you can read through the paper Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs.

Memory consumption

Although HNSW provides very good approximate nearest neighbor search at low latencies, it can consume a large amount of memory. Each HNSW graph uses roughly 1.1 * (4 * d + 8 * m) * num_vectors bytes of memory:

  • d is the dimension of the vectors
  • m is the algorithm parameter that controls the number of connections each node will have in a layer
  • num_vectors is the number of vectors in the index

To ensure durability and availability, especially when running production workloads, OpenSearch indexes are recommended to have at least one replica shard. Therefore, the memory requirement is multiplied by (1 + number of replicas). For use cases where the data size is 1 billion vectors of 128 dimensions each and m is set to the default value of 16, the estimated amount of memory required would be:

1.1 * (4 * 128 + 8 * 16) * 1,000,000,000 * 2 = 1,408 GB.

If we increase the size of the vectors to 512, for example, and m to 100, which is recommended for vectors with high intrinsic dimensionality, the same formula with one replica gives approximately 6.3 TB.
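These estimates are easy to reproduce with a few lines of Python. The helper below is a sketch of the formula above (with GB meaning 10^9 bytes), not a measurement of the plugin's actual memory usage.

def hnsw_memory_gb(dimension, m, num_vectors, replicas=1):
    """Rough native memory estimate for an HNSW graph, in GB (1 GB = 10^9 bytes)."""
    bytes_per_copy = 1.1 * (4 * dimension + 8 * m) * num_vectors
    return bytes_per_copy * (1 + replicas) / 1e9

print(hnsw_memory_gb(dimension=128, m=16, num_vectors=1_000_000_000))   # ~1,408 GB
print(hnsw_memory_gb(dimension=512, m=100, num_vectors=1_000_000_000))  # ~6,266 GB, roughly 6.3 TB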

With OpenSearch, we can always horizontally scale the cluster to handle this memory requirement. However, this comes at the expense of raising infrastructure costs. For cases where scaling doesn’t make sense, options to reduce the memory footprint of the k-NN system need to be explored. Fortunately, there are algorithms that we can use to do this.

Inverted File System algorithm

Consider a different approach for approximating a nearest neighbor search: separate your index vectors into a set of buckets, then, to reduce your search time, only search through a subset of these buckets. From a high level, this is what the Inverted File System (IVF) ANN algorithm does. In OpenSearch 1.2, the k-NN plugin introduced support for the implementation of IVF by Faiss. Faiss is an open-source library from Meta for efficient similarity search and clustering of dense vectors.

However, if we just randomly split up our vectors into different buckets, and only search a subset of them, this will be a poor approximation. The IVF algorithm uses a more elegant approach. First, before indexing begins, it assigns each bucket a representative vector. When a vector is indexed, it gets added to the bucket that has the closest representative vector. This way, vectors that are closer to each other are placed roughly in the same or nearby buckets.

To determine what the representative vectors for the buckets are, the IVF algorithm requires a training step. In this step, k-Means clustering is run on a set of training data, and the centroids it produces become the representative vectors. The following diagram illustrates this process.

Inverted file system indexing concept
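The following numpy sketch illustrates the bucket assignment and the restricted search. It is an illustration of the concept rather than the Faiss implementation, and the centroids here are simply sampled points standing in for trained k-Means centroids.

import numpy as np

def ivf_assign(vectors, centroids):
    """Assign each vector to the bucket whose representative (centroid) is closest."""
    buckets = {i: [] for i in range(len(centroids))}
    for idx, v in enumerate(vectors):
        bucket = int(np.argmin(np.linalg.norm(centroids - v, axis=1)))
        buckets[bucket].append(idx)
    return buckets

def ivf_search(query, vectors, centroids, buckets, k=10, nprobes=2):
    """Search only the nprobes buckets whose representatives are closest to the query."""
    closest_buckets = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobes]
    candidate_ids = [idx for b in closest_buckets for idx in buckets[int(b)]]
    distances = [(float(np.linalg.norm(vectors[idx] - query)), idx) for idx in candidate_ids]
    return sorted(distances)[:k]

# Toy usage: 4 buckets over random 2-D data (real representatives come from k-Means training)
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))
centroids = data[rng.choice(len(data), size=4, replace=False)]
buckets = ivf_assign(data, centroids)
print(ivf_search(rng.normal(size=2), data, centroids, buckets, k=3, nprobes=2))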

IVF has two parameters:

  • nlist – The number of buckets to create. More buckets will result in longer training times, but may improve the granularity of the search.
  • nprobes – The number of buckets to search. This parameter is fairly straightforward. The more buckets that are searched, the longer the search will take, but the better the approximation.

Memory consumption

In general, IVF requires less memory than HNSW because IVF doesn’t need to store a set of edges for each indexed vector.

We estimate that IVF will roughly require the following amount of memory:

1.1 * (((4 * dimension) * num_vectors) + (4 * nlist * dimension)) bytes

For the case explored for HNSW, where there are 1,000,000,000 128-dimensional vectors with one replica, an IVF index with an nlist of 4096 would take roughly 1.1 * (((4 * 128) * 2,000,000,000) + (4 * 4096 * 128)) bytes = 1,126 GB.

This savings does come at a cost, however, because HNSW offers a better query latency versus approximation accuracy tradeoff.

Product quantization vector compression

Although you can use HNSW and IVF to speed up nearest neighbor search, they can consume a considerable amount of memory. When we get into the billion-vector scale, we start to require thousands of GBs of memory to support their index structures. As we scale up the number of vectors or the dimension of vectors, this requirement continues to grow. Is there a way to use noticeably less space for our k-NN index?

The answer is yes! In fact, there are a lot of different ways to reduce the amount of memory vectors require. You can change your embedding model to produce smaller vectors, or you can apply techniques like Principal Component Analysis (PCA) to reduce the vector’s dimensionality. Another approach is to use quantization. The general idea of vector quantization is to map a large vector space with continuous values into a smaller space with discrete values. When a vector is mapped into a smaller space, it requires fewer bits to represent. However, this comes at a cost—when mapping to a smaller input space, some information about the vector is lost.

Product quantization (PQ) is a very popular quantization technique in the field of nearest neighbor search. It can be used together with ANN algorithms for nearest neighbor search. Along with IVF, the k-NN plugin added support for Faiss’s PQ implementation in OpenSearch 1.2.

The main idea of PQ is to break up a vector into several sub-vectors and encode the sub-vectors independently with a fixed number of bits. The number of sub-vectors that the original vector is broken up into is controlled by a parameter, m, and the number of bits to encode each sub-vector with is controlled by a parameter, code_size. After encoding finishes, a vector is compressed into roughly m * code_size bits. So, assume we have a set of 100,000 1024-dimensional vectors. With m = 8 and code_size = 8, PQ breaks each vector into 8 128-dimensional sub-vectors and encodes each sub-vector with 8 bits.

The values used for encoding are produced during a training step. During training, tables are created with 2^code_size entries for each sub-vector partition. Next, k-Means clustering, with a k value of 2^code_size, is run on the corresponding partition of sub-vectors from the training data. The centroids produced here are added as the entries to the partition’s table.

After all the tables are created, we encode a vector by replacing each sub-vector with the ID of the closest vector in the partition’s table. In the example where code_size = 8, we only need 8 bits to store an ID because there are 2^8 = 256 elements in the table. So, with dimension = 1024 and m = 8, the total size of one vector (assuming it uses a 32-bit floating point data type) is reduced from 4,096 bytes to roughly 8 bytes!
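The following numpy sketch walks through the idea end to end on toy data: a tiny k-Means produces one codebook per sub-vector partition, and each vector is then encoded as a handful of small integers. This is an illustration only, not Faiss's optimized implementation.

import numpy as np

def kmeans(data, k, iters=10, seed=0):
    """Tiny Lloyd's k-Means, just enough to produce codebook centroids."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
    return centroids

def pq_train(train_vectors, m, code_size):
    """One codebook of 2**code_size centroids per sub-vector partition."""
    sub_dim = train_vectors.shape[1] // m
    return [kmeans(train_vectors[:, i * sub_dim:(i + 1) * sub_dim], 2 ** code_size)
            for i in range(m)]

def pq_encode(vector, codebooks):
    sub_dim = len(vector) // len(codebooks)
    return [int(np.argmin(np.linalg.norm(cb - vector[i * sub_dim:(i + 1) * sub_dim], axis=1)))
            for i, cb in enumerate(codebooks)]

def pq_decode(code, codebooks):
    return np.concatenate([cb[idx] for cb, idx in zip(codebooks, code)])

# Toy usage: 16-dimensional vectors, m = 4 partitions, code_size = 4 (16 centroids per table)
rng = np.random.default_rng(1)
train = rng.normal(size=(2000, 16)).astype(np.float32)
codebooks = pq_train(train, m=4, code_size=4)
v = rng.normal(size=16).astype(np.float32)
code = pq_encode(v, codebooks)                                       # 4 small integers instead of 16 floats
print(code, float(np.linalg.norm(v - pq_decode(code, codebooks))))   # the IDs and the reconstruction error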

Product quantization encoding step

When we want to decode a vector, we can reconstruct an approximated version of it by using the stored IDs to retrieve the vectors from each partition’s table. The distance from the query vector to the reconstructed vector can then be computed and used in a nearest neighbor search. (It’s worth noting that, in practice, further optimization techniques like asymmetric distance computation (ADC) are used to speed up this process for k-NN search).

Product quantization decoding step

Memory consumption

As we mentioned earlier, PQ will encode each vector into roughly m * code_size bits plus some overhead for each vector.

When combining it with IVF, we can estimate the index size as follows:

1.1 * ((((code_size / 8) * m + overhead_per_vector) * num_vectors) + (4 * nlist * dimension) + (2^code_size * 4 * dimension)) bytes

Using 1 billion vectors, dimension = 128, m = 8, code_size = 8, and nlist = 4096, we get an estimated total memory consumption of 70 GB: 1.1 * ((((8 / 8) * 8 + 24) * 1,000,000,000) + (4 * 4096 * 128) + (2^8 * 4 * 128)) * 2 = 70 GB, where 24 is the assumed per-vector overhead in bytes and the final factor of 2 accounts for one replica.
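As with HNSW, these estimates are straightforward to reproduce in Python. The following sketch implements the two formulas above, assuming 24 bytes of per-vector overhead and multiplying the whole estimate by (1 + replicas) to account for one replica.

def ivf_memory_gb(dimension, num_vectors, nlist, replicas=1):
    """Rough memory estimate for a Faiss IVF (flat) index, in GB."""
    per_copy = 1.1 * (4 * dimension * num_vectors + 4 * nlist * dimension)
    return per_copy * (1 + replicas) / 1e9

def ivfpq_memory_gb(dimension, num_vectors, nlist, m, code_size,
                    overhead_per_vector=24, replicas=1):
    """Rough memory estimate for a Faiss IVF index with PQ encoding, in GB."""
    per_copy = 1.1 * (((code_size / 8) * m + overhead_per_vector) * num_vectors
                      + 4 * nlist * dimension
                      + 2 ** code_size * 4 * dimension)
    return per_copy * (1 + replicas) / 1e9

print(ivf_memory_gb(128, 1_000_000_000, 4096))                       # ~1,126 GB
print(ivfpq_memory_gb(128, 1_000_000_000, 4096, m=8, code_size=8))   # ~70 GB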

Running k-NN with OpenSearch

First, make sure you have an OpenSearch cluster up and running. For instructions, refer to Cluster formation. For a more managed solution, you can use Amazon OpenSearch Service.

Before getting into the experiments, let’s go over how to run k-NN workloads in OpenSearch. First, we need to create an index. An index stores a set of documents in a way that they can be easily searched. For k-NN, the index’s mapping tells OpenSearch what algorithms to use and what parameters to use with them. We start by creating an index that uses HNSW as its search algorithm:

PUT my-hnsw-index
{
  "settings": {
    "index": {
      "knn": true,
      "number_of_shards": 10,
      "number_of_replicas": 1
    }
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 4,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "nmslib",
          "parameters": {
            "ef_construction": 128,
            "m": 24
          }
        }
      }
    }
  }
}

In the settings, we need to enable knn so that the index can be searched with the knn query type (more on this later). We also set the number of shards, and the number of replicas each shard will have. An index is made up of a collection of shards. Sharding is how OpenSearch distributes an index across multiple nodes in a cluster. For more information about shards, refer to Sizing Amazon OpenSearch Service domains.

In the mappings, we configure a field called my_vector of type knn_vector to store the vector data. We also pass nmslib as the engine to let OpenSearch know it should use nmslib’s implementation of HNSW. Additionally, we pass l2 as the space_type. The space_type determines the function used to compute the distance between two vectors. l2 refers to the Euclidean distance. OpenSearch also supports cosine similarity and the inner product distance functions.
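For reference, these functions are simple to compute directly. The following sketch uses numpy on the example query vector from the search request shown later in this section and on one of the indexed vectors.

import numpy as np

q = np.array([2.0, 3.0, 5.0, 6.0])   # the query vector used in the knn search example below
v = np.array([2.5, 3.5, 5.6, 6.7])   # one of the indexed vectors (document 7)

l2_distance = float(np.linalg.norm(q - v))                                  # Euclidean distance (l2)
cosine_similarity = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
inner_product = float(q @ v)

print(l2_distance, cosine_similarity, inner_product)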

After the index is created, we can ingest some fake data:

POST _bulk
{ "index": { "_index": "my-hnsw-index", "_id": "1" } }
{ "my_vector": [1.5, 2.5, 3.5, 4.5], "price": 12.2 }
{ "index": { "_index": "my-hnsw-index", "_id": "2" } }
{ "my_vector": [2.5, 3.5, 4.5, 5.5], "price": 7.1 }
{ "index": { "_index": "my-hnsw-index", "_id": "3" } }
{ "my_vector": [3.5, 4.5, 5.5, 6.5], "price": 12.9 }
{ "index": { "_index": "my-hnsw-index", "_id": "4" } }
{ "my_vector": [5.5, 6.5, 7.5, 8.5], "price": 1.2 }
{ "index": { "_index": "my-hnsw-index", "_id": "5" } }
{ "my_vector": [4.5, 5.5, 6.5, 7.5], "price": 3.7 }
{ "index": { "_index": "my-hnsw-index", "_id": "6" } }
{ "my_vector": [1.5, 5.5, 4.5, 6.4], "price": 10.3 }
{ "index": { "_index": "my-hnsw-index", "_id": "7" } }
{ "my_vector": [2.5, 3.5, 5.6, 6.7], "price": 5.5 }
{ "index": { "_index": "my-hnsw-index", "_id": "8" } }
{ "my_vector": [4.5, 5.5, 6.7, 3.7], "price": 4.4 }
{ "index": { "_index": "my-hnsw-index", "_id": "9" } }
{ "my_vector": [1.5, 5.5, 4.5, 6.4], "price": 8.9 }

After adding some documents to the index, we can search it:

GET my-hnsw-index/_search
{
  "size": 2,
  "query": {
    "knn": {
      "my_vector": {
        "vector": [2, 3, 5, 6],
        "k": 2
      }
    }
  }
}

Creating an index that uses IVF or PQ is a little bit different because these algorithms require training. Before creating the index, we need to create a model using the training API:

POST /_plugins/_knn/models/my_ivfpq_model/_train
{
  "training_index": "train-index",
  "training_field": "train-field",
  "dimension": 128,
  "description": "My model description",
  "method": {
      "name":"ivf",
      "engine":"faiss",
      "parameters":{
        "encoder":{
            "name":"pq",
            "parameters":{
                "code_size": 8,
                "m": 8
            }
        }
      }
  }
}

The training_index and training_field specify where the training data is stored. The only requirement for the training data index is that it has a knn_vector field that has the same dimension as you want your model to have. The method defines the algorithm that should be used for search.
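For example, a matching training index could be created and populated with a short Python script such as the following sketch. The endpoint, the absence of authentication, the batch size, and the use of random vectors are assumptions for illustration; in practice, ingest real vectors from your dataset and size the training set appropriately (see the Parameter selection section later in this post).

import json
import random
import requests

ENDPOINT = "http://localhost:9200"   # placeholder; point this at your cluster and add auth if required

# The training index only needs a knn_vector field whose dimension matches the model (128 here)
mapping = {"mappings": {"properties": {"train-field": {"type": "knn_vector", "dimension": 128}}}}
requests.put(f"{ENDPOINT}/train-index", json=mapping).raise_for_status()

# Bulk-ingest training vectors in batches (random values here, purely for illustration)
BATCH = 500
for start in range(0, 100_000, BATCH):
    lines = []
    for i in range(start, start + BATCH):
        lines.append(json.dumps({"index": {"_index": "train-index", "_id": str(i)}}))
        lines.append(json.dumps({"train-field": [random.random() for _ in range(128)]}))
    requests.post(f"{ENDPOINT}/_bulk", data="\n".join(lines) + "\n",
                  headers={"Content-Type": "application/x-ndjson"}).raise_for_status()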

After the training request is submitted, it will run in the background. To check if the training is complete, you can use the GET model API:

GET /_plugins/_knn/models/my_ivfpq_model?filter_path=model_id,state
{
  "model_id" : "my_ivfpq_model",
  "state" : "created"
}

After the model is created, you can create an index that uses this model:

PUT /my-ivfpq-index
{
  "settings" : {
    "index.knn": true,
    "number_of_shards" : 10,
    "number_of_replicas" : 1
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "model_id": "my_ivfpq_model"
      }
    }
  }
}

After the index is created, we can add documents to it and search it just like we did for HNSW.

Experiments

Let’s run a few experiments to see how these algorithms perform in practice and what tradeoffs are made. We look at an HNSW versus an IVF index using PQ. For these experiments, we’re interested in search accuracy, query latency, and memory consumption. Because these trade-offs are mainly observed at scale, we use the BIGANN dataset containing 1 billion vectors of 128 dimensions. The dataset also contains 10,000 queries of test data mapping a query to the ground truth closest 100 vectors based on the Euclidean distance.

Specifically, we compute the following search metrics:

  • Latency p99 (ms), Latency p90 (ms), Latency p50 (ms) – Query latency at various quantiles in milliseconds
  • recall@10 – The fraction of the top 10 ground truth neighbors found in the 10 results returned by the plugin
  • Native memory consumption (GB) – The amount of memory used by the plugin during querying

One thing to note is that the BIGANN dataset uses an unsigned integer as the data type. Because the knn_vector field doesn’t support unsigned integers, the data is automatically converted to floats.
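As a point of reference, recall@10 for a single query is just the overlap between the returned document IDs and the ground truth neighbors, averaged over all test queries. The following is a minimal sketch, not the benchmark code itself.

def recall_at_k(returned_ids, ground_truth_ids, k=10):
    """Fraction of the top-k ground truth neighbors present in the k returned results."""
    return len(set(returned_ids[:k]) & set(ground_truth_ids[:k])) / k

print(recall_at_k([1, 5, 7, 9, 2, 11, 3, 4, 8, 12], list(range(10))))  # 0.8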

To run the experiments, we complete the following steps:

  1. Ingest the dataset into the cluster using the OpenSearch Benchmarks framework (the code can be found on GitHub).
  2. When ingestion is complete, we use the warmup API to prepare the cluster for the search workload.
  3. We run the 10,000 test queries against the cluster 10 times and collect the aggregated results.

The queries return the document ID only, and not the vector, to improve performance (code for this can be found on GitHub).

Parameter selection

One tricky aspect of running experiments is selecting the parameters. There are too many different combinations of parameters to test them all. That being said, we decided to create three configurations for HNSW and IVFPQ:

  • Optimize for search latency and memory
  • Optimize for recall
  • Fall somewhere in the middle

For each optimization strategy, we chose two configurations.

For HNSW, we can tune the m, ef_construction, and ef_search parameters to achieve our desired trade-off:

  • m – Controls the maximum number of edges a node in a graph can have. Because each node has to store all of its edges, increasing this value will increase the memory footprint, but also increase the connectivity of the graph, which will improve recall.
  • ef_construction – Controls the size of the candidate queue for edges when adding a node to the graph. Increasing this value will increase the number of candidates to consider, which will increase the index latency. However, because more candidates will be considered, the quality of the graph will be better, leading to better recall during search.
  • ef_search – Similar to ef_construction, it controls the size of the candidate queue for graph traversal during search. Increasing this value will increase the search latency, but will also improve the recall.

In general, we chose configurations that gradually increased the parameters, as detailed in the following table.

Config ID Optimization Strategy m ef_construction ef_search
hnsw1 Optimize for memory and search latency 8 32 32
hnsw2 Optimize for memory and search latency 16 32 32
hnsw3 Balance between latency, memory, and recall 16 128 128
hnsw4 Balance between latency, memory, and recall 32 256 256
hnsw5 Optimize for recall 32 512 512
hnsw6 Optimize for recall 64 512 512

For IVF, we can tune two parameters:

  • nlist – Controls the granularity of the partitioning. The recommended value for this parameter is a function of the number of vectors in the index. One thing to keep in mind is that there are Faiss indexes that map to Lucene segments. There are several Lucene segments per shard and several shards per OpenSearch index. For our estimates, we assumed that there would be 100 segments per shard and 24 shards, so about 420,000 vectors per Faiss index. With this value, we estimated a good value to be 4096 and kept this constant for the experiments.
  • nprobes – Controls the number of nlist buckets we search. Higher values generally lead to improved recalls at the expense of increased search latencies.

For PQ, we can tune two parameters:

  • m – Controls the number of partitions to break the vector into. The larger this value is, the better the encoding will approximate the original, at the expense of raising memory consumption.
  • code_size – Controls the number of bits to encode a sub-vector with. The larger this value is, the better the encoding approximates the original, at the expense of raising memory consumption. The max value is 8, so we kept it constant at 8 for all experiments.

The following table summarizes our strategies.

Config ID Optimization Strategy nprobes m (num_sub_vectors)
ivfpq1 Optimize for memory and search latency 8 8
ivfpq2 Optimize for memory and search latency 16 8
ivfpq3 Balance between latency, memory, and recall 32 16
ivfpq4 Balance between latency, memory, and recall 64 32
ivfpq5 Optimize for recall 128 16
ivfpq6 Optimize for recall 128 32

Additionally, we need to figure out how much training data to use for IVFPQ. In general, Faiss recommends between 30 and 256 training vectors per centroid for components involving k-Means training. For our configurations, the maximum k for k-Means is 4096, from the nlist parameter. With this guideline, the recommended training set size is between 122,880 and 1,048,576 vectors, so we settled on 1 million vectors. The training data comes from the index vector dataset.

Lastly, for the index configurations, we need to select the shard count. It is recommended to keep the shard size between 10 GB and 50 GB for OpenSearch. Experimentally, we determined that for HNSW, a good number would be 64 shards, and for IVFPQ, 42. Both index configurations were configured with one replica.

Cluster configuration

To run these experiments, we used Amazon OpenSearch Service running OpenSearch version 1.3 to create the clusters. We decided to use the r5 instance family, which provides a good trade-off between memory size and cost.

The number of nodes will depend on the amount of memory that can be used for the algorithm per node and the total amount of memory required by the algorithm. Having more nodes and more memory will generally improve performance, but for these experiments, we want to minimize cost. The amount of memory available per node is computed as memory_available = (node_memory - jvm_size) * circuit_breaker_limit, with the following parameters:

  • node_memory – The total memory of the instance.
  • jvm_size – The OpenSearch JVM heap size. Set to 32 GB.
  • circuit_breaker_limit – The native memory usage threshold for the circuit breaker. Set to 0.5.

Because HNSW and IVFPQ have different memory requirements, we estimate how much memory is needed for each algorithm and determine the required number of nodes accordingly.

For HNSW, with m = 64, the total memory required using the formula from the previous sections is approximately 2,252 GB. Therefore, with r5.12xlarge (384 GB of memory), memory_available is 176 GB and the total number of nodes required is about 12, which we round up to 16 for stability purposes.

Because the IVFPQ algorithm requires less memory, we can use a smaller instance type, the r5.4xlarge instance, which has 128 GB of memory. Therefore, the memory_available for the algorithm is 48 GB. The estimated algorithm memory consumption where m = 64 is a total of 193 GB and the total number of nodes required is four, which we round up to six for stability purposes.
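The sizing math above can be captured in a small helper. This is a sketch that assumes the same 32 GB JVM heap and 0.5 circuit breaker limit used in this post.

def nodes_required(algorithm_memory_gb, node_memory_gb, jvm_size_gb=32, circuit_breaker_limit=0.5):
    """Approximate number of data nodes needed to hold the k-NN structures in native memory."""
    memory_available = (node_memory_gb - jvm_size_gb) * circuit_breaker_limit
    return algorithm_memory_gb / memory_available

print(round(nodes_required(2252, node_memory_gb=384), 1))  # HNSW on r5.12xlarge: ~12.8 nodes (16 used)
print(round(nodes_required(193, node_memory_gb=128), 1))   # IVFPQ on r5.4xlarge: ~4.0 nodes (6 used)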

For both clusters, we use c5.2xlarge instance types as dedicated leader nodes. This will provide more stability for the cluster.

According to the AWS Pricing Calculator, for this particular use case, the cost per hour of the HNSW cluster is around $75 an hour, and the IVFPQ cluster costs around $11 an hour. This is important to remember when comparing the results.

Also, keep in mind that you can run these benchmarks on your own infrastructure using Amazon Elastic Compute Cloud (Amazon EC2), as long as the instance types and their memory sizes are equivalent.

Results

The following tables summarize the results from the experiments.

Test ID p50 Query latency (ms) p90 Query latency (ms) p99 Query latency (ms) Recall@10 Native memory consumption (GB)
hnsw1 9.1 11 16.9 0.84 1182
hnsw2 11 12.1 17.8 0.93 1305
hnsw3 23.1 27.1 32.2 0.99 1306
hnsw4 54.1 68.3 80.2 0.99 1555
hnsw5 83.4 100.6 114.7 0.99 1555
hnsw6 103.7 131.8 151.7 0.99 2055
Test ID p50 Query latency (ms) p90 Query latency (ms) p99 Query latency (ms) Recall@10 Native memory consumption (GB)
ivfpq1 74.9 100.5 106.4 0.17 68
ivfpq2 78.5 104.6 110.2 0.18 68
ivfpq3 87.8 107 122 0.39 83
ivfpq4 117.2 131.1 151.8 0.61 114
ivfpq5 128.3 174.1 195.7 0.40 83
ivfpq6 163 196.5 228.9 0.61 114

As you might expect, given how many more resources it uses, the HNSW cluster has lower query latencies and better recall. However, the IVFPQ indexes use significantly less memory.

For HNSW, increasing the parameters does in fact lead to better recall at the expense of latency. For IVFPQ, increasing m has the most significant impact on improving recall. Increasing nprobes improves the recall marginally, but at the expense of significant increases in latencies.

Conclusion

In this post, we covered different algorithms and techniques used to perform approximate k-NN search at scale (over 1 billion data points) within OpenSearch. As we saw in the previous benchmarks section, there isn’t one algorithm or approach that optimizes for all the metrics at once. HNSW, IVF, and PQ each allow you to optimize for different metrics in your k-NN workload. When choosing the k-NN algorithm to use, first understand the requirements of your use case (How accurate does my approximate nearest neighbor search need to be? How fast should it be? What’s my budget?) and then tailor the algorithm configuration to meet them.

You can take a look at the benchmarking code base we used on GitHub. You can also get started with approximate k-NN search today following the instructions in Approximate k-NN search. If you’re looking for a managed solution for your OpenSearch cluster, check out Amazon OpenSearch Service.


About the Authors

Jack Mazanec is a software engineer working on OpenSearch plugins. His primary interests include machine learning and search engines. Outside of work, he enjoys skiing and watching sports.

Othmane Hamzaoui is a Data Scientist working at AWS. He is passionate about solving customer challenges using Machine Learning, with a focus on bridging the gap between research and business to achieve impactful outcomes. In his spare time, he enjoys running and discovering new coffee shops in the beautiful city of Paris.

DevOps with serverless Jenkins and AWS Cloud Development Kit (AWS CDK)

Post Syndicated from sangusah original https://aws.amazon.com/blogs/devops/devops-with-serverless-jenkins-and-aws-cloud-development-kit-aws-cdk/

The objective of this post is to walk you through how to set up a completely serverless Jenkins environment on AWS Fargate using AWS Cloud Development Kit (AWS CDK).

Jenkins is a popular open-source automation server that provides hundreds of plugins to support building, testing, deploying, and automation. Jenkins uses a controller-agent architecture in which the controller is responsible for serving the web UI, storing the configuration and related data on disk, and delegating jobs to the worker agents, which run those jobs as their primary responsibility.

Amazon Elastic Container Service (Amazon ECS) using Fargate is a fully managed container orchestration service that helps you easily deploy, manage, and scale containerized applications. It deeply integrates with the rest of the AWS platform to provide a secure and easy-to-use solution for running container workloads in the cloud and now on your own infrastructure. Fargate is a serverless, pay-as-you-go compute engine that lets you focus on building applications without managing servers. Fargate is compatible with both Amazon ECS and Amazon Elastic Kubernetes Service (Amazon EKS).

Solution overview

The following diagram illustrates the solution architecture. The dashed lines indicate the AWS CDK deployment.

Figure 1 This diagram shows AWS CDK and how it deploys using AWS CloudFormation to create the Elastic Load Balancer, AWS Fargate, and Amazon EFS

You’ll be using the following:

  1. The Jenkins controller URL backed by an Application Load Balancer (ALB).
  2. You’ll be using your default Amazon Virtual Private Cloud (Amazon VPC) for this example.
  3. The Jenkins controller runs as a service in Amazon ECS using Fargate as the launch type. You’ll use Amazon Elastic File System (Amazon EFS) as the persistent backing store for the Jenkins controller task. The Jenkins controller and Amazon EFS are launched in private subnets.

Prerequisites

For this post, you’ll utilize AWS CDK using TypeScript.

Follow the guide on Getting Started for AWS CDK to:

  • Get your local environment set up
  • Bootstrap your development account

Code

Let’s review the code used to define the Jenkins environment in AWS using the AWS CDK.

Set up your imports

import { Duration, IResource, RemovalPolicy, Stack, Tags } from 'aws-cdk-lib';
import { Construct } from 'constructs';

import * as cdk from 'aws-cdk-lib';

import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as efs from 'aws-cdk-lib/aws-efs';
import { Port } from 'aws-cdk-lib/aws-ec2';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';

Set up your Amazon ECS cluster, which is a logical grouping of tasks or services, and get its VPC

export class AppStack extends Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const jenkinsHomeDir: string = 'jenkins-home';
    const appName: string = 'jenkins-cdk';

    const cluster = new ecs.Cluster(this, `${appName}-cluster`, {
      clusterName: appName,
    });

    const vpc = cluster.vpc;

Set up Amazon EFS to store the Jenkins data

    const fileSystem = new efs.FileSystem(this, `${appName}-efs`, {
      vpc: vpc,
      fileSystemName: appName,
      removalPolicy: RemovalPolicy.DESTROY,
    });

Set up an access point, an application-specific entry point into the Amazon EFS file system that makes it easier to manage application access to shared datasets

const accessPoint = fileSystem.addAccessPoint(`${appName}-ap`, {
      path: `/${jenkinsHomeDir}`,
      posixUser: {
        uid: '1000',
        gid: '1000',
      },
      createAcl: {
        ownerGid: '1000',
        ownerUid: '1000',
        permissions: '755',
      },
    });

Set up a task definition to run Docker containers in Amazon ECS

const taskDefinition = new ecs.FargateTaskDefinition(
      this,
      `${appName}-task`,
      {
        family: appName,
        cpu: 1024,
        memoryLimitMiB: 2048,
      }
    );

Set up a volume that maps the Amazon EFS file system from above into the task definition

taskDefinition.addVolume({
      name: jenkinsHomeDir,
      efsVolumeConfiguration: {
        fileSystemId: fileSystem.fileSystemId,
        transitEncryption: 'ENABLED',
        authorizationConfig: {
          accessPointId: accessPoint.accessPointId,
          iam: 'ENABLED',
        },
      },
    });

Set up the container using the task definition and the Jenkins image from the public registry

const containerDefinition = taskDefinition.addContainer(appName, {
      image: ecs.ContainerImage.fromRegistry('jenkins/jenkins:lts'),
      logging: ecs.LogDrivers.awsLogs({ streamPrefix: 'jenkins' }),
      portMappings: [{ containerPort: 8080 }],
    });

Set up a mount point to bind the Amazon EFS volume to the Jenkins home directory in the container

containerDefinition.addMountPoints({
      containerPath: '/var/jenkins_home',
      sourceVolume: jenkinsHomeDir,
      readOnly: false,
    });

Set up the Fargate service to run the container without managing servers

    const fargateService = new ecs.FargateService(this, `${appName}-service`, {
      serviceName: appName,
      cluster: cluster,
      taskDefinition: taskDefinition,
      desiredCount: 1,
      maxHealthyPercent: 100,
      minHealthyPercent: 0,
      healthCheckGracePeriod: Duration.minutes(5),
    });
    fargateService.connections.allowTo(fileSystem, Port.tcp(2049));

Set up the ALB and add a listener that checks for connection requests, using the protocol and port that you configure.

    const loadBalancer = new elbv2.ApplicationLoadBalancer(
      this,
      `${appName}-elb`,
      {
        loadBalancerName: appName,
        vpc: vpc,
        internetFacing: true,
      }
    );
    const lbListener = loadBalancer.addListener(`${appName}-listener`, {
      port: 80,
    });

Set up a target to route requests to Jenkins running on Amazon ECS using Fargate

const loadBalancerTarget = lbListener.addTargets(`${appName}-target`, {
      port: 8080,
      targets: [fargateService],
      deregistrationDelay: Duration.seconds(10),
      healthCheck: { path: '/login' },
    });
  }
}

Jenkins Deployment

Now that you have all the code, let’s deploy the AWS CDK definition:

  1. Make sure that you have completed the Prerequisites steps from earlier.
  2. Install packages by running the following command in your IDE CLI:
npm i
  3. Deploy your AWS CDK definition to your dev account:
cdk deploy

Let’s now login to Jenkins

  1. In your browser, use the DNS Name from the deployed Load Balancer
  2. In Amazon CloudWatch, there will be a Log group that will be created that is associated to Cluster Service.
    1. Go into that log and you’ll see it output the Password to login to Jenkins
  1. In Jenkins, follow the wizard to continue the setup
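If you prefer to script the password lookup in step 2, the following boto3 sketch reads the most recent log stream in the task's log group and prints the lines around the password banner. The log group name is a placeholder (the AWS CDK auto-generates it for the awslogs driver configured earlier), and the banner text is an assumption based on Jenkins' standard startup output.

import boto3

logs = boto3.client("logs")

# Replace with the log group that the CDK created for the Jenkins task (it is auto-named;
# you can find it in the CloudWatch console next to the jenkins-cdk service).
LOG_GROUP = "<jenkins-task-log-group-name>"

streams = logs.describe_log_streams(logGroupName=LOG_GROUP,
                                    orderBy="LastEventTime", descending=True)
stream_name = streams["logStreams"][0]["logStreamName"]

events = logs.get_log_events(logGroupName=LOG_GROUP, logStreamName=stream_name,
                             startFromHead=True)["events"]
messages = [e["message"] for e in events]

# Jenkins prints the generated administrator password a couple of lines after this banner
for i, message in enumerate(messages):
    if "Please use the following password" in message:
        print("\n".join(messages[i:i + 4]))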

Cleaning up

To avoid incurring future charges, delete the resources.

Let’s destroy our deploy solution

  1. In your IDE CLI:
cdk destroy

Conclusion

With this overview we were able to cover the following:

  • Build an Elastic Load Balancer
  • Use AWS Fargate with the Jenkins container image
  • All resources running serverlessly
  • All build using the AWS CDK

About the author:

Josh Thornes

Josh Thornes is a Sr. Technical Account Manager at AWS. He works with AWS Partners at any stage of their software-as-a-service (SaaS) journey in order to help build new products, migrate existing applications, or optimize SaaS solutions on AWS. His areas of interest include builder experience (e.g., developer tools, DevOps culture, CI/CD, Front-end, Mobile, Microservices), security, IoT, analytics.

Use AWS Network Firewall to filter outbound HTTPS traffic from applications hosted on Amazon EKS and collect hostnames provided by SNI

Post Syndicated from Kirankumar Chandrashekar original https://aws.amazon.com/blogs/security/use-aws-network-firewall-to-filter-outbound-https-traffic-from-applications-hosted-on-amazon-eks/

This blog post shows how to set up an Amazon Elastic Kubernetes Service (Amazon EKS) cluster such that the applications hosted on the cluster can have their outbound internet access restricted to a set of hostnames provided by the Server Name Indication (SNI) in the allow list in the AWS Network Firewall rules. For encrypted web traffic, SNI can be used for blocking access to specific sites in the network firewall. SNI is an extension to TLS that remains unencrypted in the traffic flow and indicates the destination hostname a client is attempting to access over HTTPS.

This post also shows you how to use Network Firewall to collect hostnames of the specific sites that are being accessed by your application. Securing outbound traffic to specific hostnames is called egress filtering. In computer networking, egress filtering is the practice of monitoring and potentially restricting the flow of information outbound from one network to another. Securing outbound traffic is usually done by means of a firewall that blocks packets that fail to meet certain security requirements. One such firewall is AWS Network Firewall, a managed service that you can use to deploy essential network protections for all of your VPCs that you create with Amazon Virtual Private Cloud (Amazon VPC).

Example scenario

You have the option to filter your application traffic by the hostname of the requested SSL/TLS certificate, which makes you independent of the relationship between the IP address and the certificate. The certificate could be served from any IP address, and traditional stateful packet filters are not able to follow the changing IP addresses of the endpoints. Therefore, the hostname information that you get from the SNI becomes important in making security decisions. Amazon EKS has gained popularity for running containerized workloads in the AWS Cloud, and you can restrict its outbound traffic to only the known hostnames provided by SNI. This post walks you through the process of setting up the EKS cluster in two different subnets so that your software can use the additional traffic routing in the VPC and traffic filtering through Network Firewall.
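To see the SNI value that a client presents (and that the firewall can therefore inspect without any decryption), you can open a TLS connection from Python. The server_hostname argument is what populates the SNI extension in the unencrypted ClientHello; this sketch simply confirms the handshake succeeds for a given hostname.

import socket
import ssl

hostname = "aws.amazon.com"
context = ssl.create_default_context()

with socket.create_connection((hostname, 443), timeout=5) as sock:
    # server_hostname is sent in clear text as the SNI extension of the TLS ClientHello,
    # which is the field AWS Network Firewall matches against its allow list
    with context.wrap_socket(sock, server_hostname=hostname) as tls:
        print("Connected to", hostname, "using", tls.version())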

Solution architecture

The architecture illustrated in Figure 1 shows a VPC with three subnets in Availability Zone A, and three subnets in Availability Zone B. There are two public subnets where Network Firewall endpoints are deployed, two private subnets where the worker nodes for the EKS cluster are deployed, and two protected subnets where NAT gateways are deployed.

Figure 1: Outbound internet access through Network Firewall from Amazon EKS worker nodes

The workflow in the architecture for outbound access to a third-party service is as follows:

  1. The outbound request originates from the application running in the private subnet (for example, to https://aws.amazon.com) and is passed to the NAT gateway in the protected subnet.
  2. The HTTPS traffic received in the protected subnet is routed to the AWS Network Firewall endpoint in the public subnet.
  3. The network firewall computes the rules, and either accepts or declines the request to pass to the internet gateway.
  4. If the request is allowed, the application-requested hostname (provided by SNI in the unencrypted portion of the TLS handshake) is permitted by the network firewall, and the request successfully reaches the third-party server.

The VPC settings for this blog post follow the recommendation for using public and private subnets described in Creating a VPC for your Amazon EKS cluster in the Amazon EKS User Guide, but with additional subnets called protected subnets. Instead of placing the NAT gateway in a public subnet, it will be placed in the protected subnet, and the Network Firewall endpoints in the public subnet will filter the egress traffic that flows through the NAT gateway. This design pattern adds further checks and could be a recommendation for your VPC setup.

As suggested in Creating a VPC for your Amazon EKS cluster, using the Public and private subnets option allows you to deploy your worker nodes to private subnets, and allows Kubernetes to deploy load balancers to the public subnets. This arrangement can load-balance traffic to pods that are running on nodes in the private subnets. As shown in Figure 1, the solution uses an additional subnet named the protected subnet, apart from the public and private subnets. The protected subnet is a VPC subnet deployed between the public subnet and private subnet. The outbound internet traffic that is routed through the protected subnet is rerouted to the Network Firewall endpoint hosted within the public subnet. You can use the same strategy mentioned in Creating a VPC for your Amazon EKS cluster to place different AWS resources within private subnets and public subnets. The main difference in this solution is that you place the NAT gateway in a separate protected subnet, between the public and private subnets, and place Network Firewall endpoints in the public subnets to filter traffic in the network firewall. The NAT gateway’s IP address is still preserved, and could still be used for adding to the allow list of third-party entities that need connectivity for the applications running on the EKS worker nodes.

To see a practical example of how the outbound traffic is filtered based on the hostnames provided by SNI, follow the steps in the following Deploy a sample section. You will deploy an AWS CloudFormation template that deploys the solution architecture, consisting of the VPC components, EKS cluster components, and the Network Firewall components. When that's complete, you can deploy a sample app running on Amazon EKS to test egress traffic filtering through AWS Network Firewall.

Deploy a sample to test the network firewall

Follow the steps in this section to perform a sample app deployment to test the use case of securing outbound traffic through AWS Network Firewall.

Prerequisites

The prerequisite actions required for the sample deployment are as follows:

  1. Make sure you have the AWS CLI installed, and configure access to your AWS account.
  2. Install and set up the eksctl tool to create an Amazon EKS cluster.
  3. Copy the necessary CloudFormation templates and the sample eksctl config files from the blog’s Amazon S3 bucket to your local file system. You can do this by using the following AWS CLI S3 cp command.
    aws s3 cp s3://awsiammedia/public/sample/803-network-firewall-to-filter-outbound-traffic/config.yaml .
    aws s3 cp s3://awsiammedia/public/sample/803-network-firewall-to-filter-outbound-traffic/lambda_function.py .
    aws s3 cp s3://awsiammedia/public/sample/803-network-firewall-to-filter-outbound-traffic/network-firewall-eks-collect-all.yaml .
    aws s3 cp s3://awsiammedia/public/sample/803-network-firewall-to-filter-outbound-traffic/network-firewall-eks.yaml .

    Important: These commands will download the S3 bucket contents to the current directory on your terminal, so the "." (dot) in each command is very important.

  4. Once this is complete, you should be able to see the list of files shown in Figure 2. (The list includes config.yaml, lambda_function.py, network-firewall-eks-collect-all.yaml, and network-firewall-eks.yaml.)
    Figure 2: Files downloaded from the S3 bucket

Deploy the VPC architecture with AWS Network Firewall

In this procedure, you’ll deploy the VPC architecture by using a CloudFormation template.

To deploy the VPC architecture (AWS CLI)

  1. Deploy the CloudFormation template network-firewall-eks.yaml, which you previously downloaded to your local file system from the Amazon S3 bucket.

    You can do this through the AWS CLI by using the create-stack command, as follows.

    aws cloudformation create-stack --stack-name AWS-Network-Firewall-Multi-AZ \
    --template-body file://network-firewall-eks.yaml \
    --parameters ParameterKey=NetworkFirewallAllowedWebsites,ParameterValue=".amazonaws.com\,.docker.io\,.docker.com" \
    --capabilities CAPABILITY_NAMED_IAM

    Note: The initially allowed hostnames for egress filtering are passed to the network firewall by using the parameter key NetworkFirewallAllowedWebsites in the CloudFormation stack. In this example, the allowed hostnames are .amazonaws.com, .docker.io, and .docker.com.

  2. Make a note of the subnet IDs from the stack outputs of the CloudFormation stack after the status goes to Create_Complete.
     
    aws cloudformation describe-stacks \
    --stack-name AWS-Network-Firewall-Multi-AZ

    Note: For simplicity, the CloudFormation stack name is AWS-Network-Firewall-Multi-AZ, but you can change this name according to your needs; just use the same name throughout the rest of this post.

To deploy the VPC architecture (console)

In your account, launch the AWS CloudFormation template by choosing the following Launch Stack button. It will take approximately 10 minutes for the CloudFormation stack to complete.

Select this image to open a link that starts building the CloudFormation stack

Note: The stack will launch in the N. Virginia (us-east-1) Region. To deploy this solution into other AWS Regions, download the solution’s CloudFormation template, modify it, and deploy it to the selected Region.

Deploy and set up access to the EKS cluster

In this step, you’ll use the eksctl CLI tool to create an EKS cluster.

To deploy an EKS cluster by using the eksctl tool

There are two methods for creating an EKS cluster. Method A uses the eksctl create cluster command without a configuration (config) file. Method B uses a config file.

Note: Before you start, make sure you have the VPC subnet details available from the previous procedure.

Method A: No config file

You can create an EKS cluster without a config file by using the eksctl create cluster command.

  1. From the CLI, enter the following commands.
    eksctl create cluster \
    --vpc-private-subnets=<private-subnet-A>,<private-subnet-B> \
    --vpc-public-subnets=<public-subnet-A>,<public-subnet-B>
  2. Make sure that the subnets passed to the --vpc-public-subnets parameter are protected subnets taken from the VPC architecture CloudFormation stack output. You can verify the subnet IDs by looking at step 2 in the To deploy the VPC architecture section.

Method B: With config file

Another way to create an EKS cluster is by using a config file, which gives you more options (cluster.yaml in this example).

  1. Create a file named cluster.yaml by adding the following contents to it.
    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    metadata:
      name: filter-egress-traffic-test
      region: us-east-1
      version: "1.19"
    availabilityZones: ["us-east-1a", "us-east-1b"]
    vpc:
      id: 
      subnets:
        public:
          us-east-1a: { id: <public-subnet-A> }
          us-east-1b: { id: <public-subnet-B> }
        private:
          us-east-1a: { id: <private-subnet-A> }
          us-east-1b: { id: <private-subnet-B> }
    
    managedNodeGroups:
    - name: nodegroup
      desiredCapacity: 3
      ssh:
        allow: true
        publicKeyName: main
      iam:
        attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
        - arn:aws:iam::aws:policy/AmazonEKSServicePolicy
        - arn:aws:iam::aws:policy/AmazonEKSClusterPolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
      preBootstrapCommands:
        - yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm
        - sudo systemctl enable amazon-ssm-agent
        - sudo systemctl start amazon-ssm-agent

  2. Run the following command to create an EKS cluster using the eksctl tool and the cluster.yaml config file.
    eksctl create cluster -f cluster.yaml

To set up access to the EKS cluster

  1. Before you deploy a sample Kubernetes Pod, make sure you have the kubeconfig file set up for the EKS cluster that you created in step 2 of To deploy an EKS cluster by using the eksctl tool. For more information, see Create a kubeconfig for Amazon EKS. You can use eksctl to do this, as follows.

    eksctl utils write-kubeconfig --cluster filter-egress-traffic-test

  2. List the available kubectl contexts by using the following command.

    kubectl config get-contexts

    Figure 3 shows an example of the output from this command.

    Figure 3: kubectl config get-contexts command output

  3. Copy the context name from the command output and set the context by using the following command.

    kubectl config use-context <NAME-OF-CONTEXT>

To deploy a sample Pod on the EKS cluster

  1. Next, deploy a sample Kubernetes Pod in the EKS cluster.

    kubectl run -i --tty amazon-linux --image=public.ecr.aws/amazonlinux/amazonlinux:latest -- sh

    If you already have a Pod, you can use the following command to get a shell to a running container.

    kubectl attach amazon-linux -i -t

  2. Now you can test access to a non-allowed website in the AWS Network Firewall stateful rules, using these steps.
    1. First, make sure that the cURL tool is available on the sample Pod you created previously. cURL is a command-line tool for getting or sending data, including files, using URL syntax. Because cURL uses the libcurl library, it supports every protocol libcurl supports. The Amazon Linux container image usually includes cURL; if it isn't present in your shell, install it with the image's package manager by running the following command.
      yum install -y curl
    2. Access a website using cURL.
      curl -I https://aws.amazon.com

      This gives a timeout error similar to the following.

      curl -I https://aws.amazon.com
      curl: (28) Operation timed out after 300476 milliseconds with 0 out of 0 bytes received

    3. Navigate to the AWS CloudWatch console and check the alert logs for Network Firewall. You will see a log entry like the following sample, indicating that the access to https://aws.amazon.com was blocked.
      {
          "firewall_name": "AWS-Network-Firewall-Multi-AZ-firewall",
          "availability_zone": "us-east-1a",
          "event_timestamp": "1623651293",
          "event": {
              "timestamp": "2021-06-14T06:14:53.483069+0000",
              "flow_id": 649458981081302,
              "event_type": "alert",
              "src_ip": "xxx.xxx.xxx.xxx",
              "src_port": xxxxx,
              "dest_ip": "xxx.xxx.xxx.xxx",
              "dest_port": 443,
              "proto": "TCP",
              "alert": {
                  "action": "blocked",
                  "signature_id": 4,
                  "rev": 1,
                  "signature": "not matching any TLS allowlisted FQDNs",
                  "category": "",
                  "severity": 1
              },
              "tls": {
                  "sni": "aws.amazon.com",
                  "version": "UNDETERMINED",
                  "ja3": {},
                  "ja3s": {}
              },
              "app_proto": "tls"
          }
      }

      The error shown here occurred because the hostname aws.amazon.com was not added to the Network Firewall stateful rules allow list.

      When you deployed the network firewall in step 1 of the To deploy the VPC architecture procedure, the values provided for the CloudFormation parameter NetworkFirewallAllowedWebsites were just .amazonaws.com, .docker.io, and .docker.com, and not aws.amazon.com.

Update the Network Firewall stateful rules

In this procedure, you’ll update the Network Firewall stateful rules to allow the aws.amazon.com domain name.

To update the Network Firewall stateful rules (console)

  1. In the AWS CloudFormation console, locate the stack you used to create the network firewall earlier in the To deploy the VPC architecture procedure.
  2. Select the stack you want to update, and choose Update. In the Parameters section, update the stack by adding the hostname aws.amazon.com to the parameter NetworkFirewallAllowedWebsites as a comma-separated value. See Updating stacks directly in the AWS CloudFormation User Guide for more information on stack updates.

Re-test from the sample pod

In this step, you’ll test the outbound access once again from the sample Pod you created earlier in the To deploy a sample Pod on the EKS cluster procedure.

To test the outbound access to the aws.amazon.com hostname

  1. Get a shell to a running container in the sample Pod that you deployed earlier, by using the following command.
    kubectl attach amazon-linux -i -t
  2. On the terminal where you got a shell to a running container in the sample Pod, run the following cURL command.
    curl -I https://aws.amazon.com
  3. The response should be a success HTTP 200 OK message similar to this one.
    curl -Ik https://aws.amazon.com
    HTTP/2 200
    content-type: text/html;charset=UTF-8
    server: Server

If the VPC subnets are organized according to the architecture suggested in this solution, outbound traffic from the EKS cluster can be sent to the network firewall and then filtered based on hostnames provided by SNI.

Collecting hostnames provided by the SNI

In this step, you’ll see how to configure the network firewall to collect all the hostnames provided by SNI that are accessed by an already running application—without blocking any access—by making use of CloudWatch and alert logs.

To configure the network firewall (console)

  1. In the AWS CloudFormation console, locate the stack that created the network firewall earlier in the To deploy the VPC architecture procedure.
  2. Select the stack to update, and then choose Update.
  3. Choose Replace current template and upload the template network-firewall-eks-collect-all.yaml. (This template should be available from the files that you downloaded earlier from the S3 bucket in the Prerequisites section.) Choose Next. See Updating stacks directly for more information.

To configure the network firewall (AWS CLI)

  1. Update the CloudFormation stack by using the network-firewall-eks-collect-all.yaml template file that you previously downloaded from the S3 bucket in the Prerequisites section, using the update-stack command as follows.
    aws cloudformation update-stack --stack-name AWS-Network-Firewall-Multi-AZ \
    --template-body file://network-firewall-eks-collect-all.yaml \
    --capabilities CAPABILITY_NAMED_IAM

To check the rules in the AWS Management Console

  1. In the AWS Management Console, navigate to the Amazon VPC console and locate the AWS Network Firewall tab.
  2. Select the network firewall that you created earlier, and then select the stateful rule with the name log-all-tls.
  3. The rule group should appear as shown in Figure 4, indicating that the logs are captured and sent to the Alert logs.
    Figure 4: Network Firewall rule groups

To test based on stateful rule

  1. On the terminal, get the shell for the running container in the Pod you created earlier. If this Pod is not available, follow the instructions in the To deploy a sample Pod on the EKS cluster procedure to create a new sample Pod.
  2. Run the cURL command to aws.amazon.com. It should return HTTP 200 OK, as follows.
    curl -Ik https://aws.amazon.com/
    HTTP/2 200
    content-type: text/html;charset=UTF-8
    server: Server
    date:
    ------
    ----------
    --------------
  3. Navigate to the AWS CloudWatch Logs console and look up the Alert logs log group with the name /AWS-Network-Firewall-Multi-AZ/anfw/alert.

    You can see the hostnames provided by SNI within the TLS protocol passing through the network firewall. The CloudWatch Alert logs for allowed hostnames in the SNI look like the following example.

    {
        "firewall_name": "AWS-Network-Firewall-Multi-AZ-firewall",
        "availability_zone": "us-east-1b",
        "event_timestamp": "1627283521",
        "event": {
            "timestamp": "2021-07-26T07:12:01.304222+0000",
            "flow_id": 1977082435410607,
            "event_type": "alert",
            "src_ip": "xxx.xxx.xxx.xxx",
            "src_port": xxxxx,
            "dest_ip": "xxx.xxx.xxx.xxx",
            "dest_port": 443,
            "proto": "TCP",
            "alert": {
                "action": "allowed",
                "signature_id": 2,
                "rev": 0,
                "signature": "",
                "category": "",
                "severity": 3
            },
            "tls": {
                "subject": "CN=aws.amazon.com",
                "issuerdn": "C=US, O=Amazon, OU=Server CA 1B, CN=Amazon",
                "serial": "08:13:34:34:48:07:64:27:4D:BC:CB:14:4D:AF:F2:11",
                "fingerprint": "f7:53:97:5e:76:1e:fb:f6:70:72:02:95:d5:9f:2f:05:52:79:5d:ae",
                "sni": "aws.amazon.com",
                "version": "TLS 1.2",
                "notbefore": "2020-09-30T00:00:00",
                "notafter": "2021-09-23T12:00:00",
                "ja3": {},
                "ja3s": {}
            },
            "app_proto": "tls"
        }
    }

Optionally, you can also create an AWS Lambda function to collect the hostnames that are passed through the network firewall.

To create a Lambda function to collect hostnames provided by SNI (optional)

Sample Lambda code

The following sample Lambda code is written in Python 3. The sample collects the hostnames that are provided by SNI and captured in Network Firewall. Network Firewall logs the hostnames provided by SNI in the CloudWatch Alert logs. Then, by creating a CloudWatch Logs subscription filter, you can send the logs to the Lambda function for further processing, for example to invoke SNS notifications (a sketch of creating the subscription filter follows the function code).

import json
import gzip
import base64
import boto3
import sys
import traceback
sns_client = boto3.client('sns')
def lambda_handler(event, context):
    try:
        decoded_event = json.loads(gzip.decompress(base64.b64decode(event['awslogs']['data'])))
        body = '''
        {filtermatch}
        '''.format(
            loggroup=decoded_event['logGroup'],
            logstream=decoded_event['logStream'],
            filtermatch=decoded_event['logEvents'][0]['message'],
        )
        # print(body)# uncomment this for debugging
        filterMatch = json.loads(body)
        data = []
        if 'http' in filterMatch['event']:
            data.append(filterMatch['event']['http']['hostname'])
        elif 'tls' in filterMatch['event']:
            data.append(filterMatch['event']['tls']['sni'])
        result = 'Trying to reach ' + data[0] + ' via Network Firewall ' + filterMatch['firewall_name']
        # print(result)# uncomment this for debugging
        message = {'HostName': result}
        send_to_sns = sns_client.publish(
            TargetArn='<SNS-topic-ARN>', #Replace with the SNS topic ARN
            Message=json.dumps({'default': json.dumps(message),
                            'sms': json.dumps(message),
                            'email': json.dumps(message)}),
            Subject='Trying to reach the hostname through the Network Firewall',
            MessageStructure='json')
    except Exception as e:
        # Log the failure details so they are visible in the function's own CloudWatch Logs
        print('Function failed due to exception: %s' % e)
        traceback.print_exc()
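To deliver the alert logs to this function, create a CloudWatch Logs subscription filter on the alert log group. The following boto3 sketch uses the alert log group name from this post; the Lambda function ARN, account ID, and Region are placeholders that you would replace.

import boto3

logs = boto3.client("logs")
lambda_client = boto3.client("lambda")

LOG_GROUP = "/AWS-Network-Firewall-Multi-AZ/anfw/alert"
FUNCTION_ARN = "<lambda-function-ARN>"   # replace with your function's ARN
ACCOUNT_ID = "<account-id>"              # replace with your AWS account ID
REGION = "us-east-1"

# Allow CloudWatch Logs to invoke the Lambda function
lambda_client.add_permission(
    FunctionName=FUNCTION_ARN,
    StatementId="anfw-alert-logs-invoke",
    Action="lambda:InvokeFunction",
    Principal=f"logs.{REGION}.amazonaws.com",
    SourceArn=f"arn:aws:logs:{REGION}:{ACCOUNT_ID}:log-group:{LOG_GROUP}:*",
)

# Forward every alert log event to the Lambda function for hostname extraction
logs.put_subscription_filter(
    logGroupName=LOG_GROUP,
    filterName="collect-sni-hostnames",
    filterPattern="",                    # an empty pattern forwards all events
    destinationArn=FUNCTION_ARN,
)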

Clean up

In this step, you’ll clean up the infrastructure that was created as part of this solution.

To delete the Kubernetes workloads

  1. On the terminal, using the kubectl CLI tool, run the following command to delete the sample Pod that you created earlier.
    kubectl delete pods amazon-linux

    Note: Clean up all the Kubernetes workloads running on the EKS cluster. For example, if the Kubernetes service of type LoadBalancer is deployed, and if the EKS cluster where it exists is deleted, the LoadBalancer will not be deleted. The best practice is to clean up all the deployed workloads.

  2. On the terminal, using the eksctl CLI tool, delete the created EKS cluster by using the following command.
    eksctl delete cluster --name filter-egress-traffic-test

To delete the CloudFormation stack and AWS Network Firewall

  1. Navigate to the AWS CloudFormation console and choose the stack with the name AWS-Network-Firewall-Multi-AZ.
  2. Choose Delete, and then at the prompt choose Delete Stack. For more information, see Deleting a stack on the AWS CloudFormation console.

Conclusion

By following the VPC architecture explained in this blog post, you can protect the applications running on an Amazon EKS cluster by filtering the outbound traffic based on the approved hostnames that are provided by SNI in the Network Firewall Allow list.

Additionally, with a simple Lambda function, CloudWatch Logs, and an SNS topic, you can get readable hostnames provided by the SNI. Using these hostnames, you can learn about the traffic pattern for the applications that are running within the EKS cluster, and later create a strict list to allow only the required outbound traffic. To learn more about Network Firewall stateful rules, see Working with stateful rule groups in AWS Network Firewall in the AWS Network Firewall Developer Guide.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Kirankumar Chandrashekar

Kirankumar is a Sr. Solutions Architect for Strategic Accounts at AWS. He focuses on leading customers in architecting DevOps, containers and container technologies to name a few. Kirankumar is passionate about DevOps, Infrastructure as Code, and solving complex customer issues. He enjoys music, as well as cooking and traveling.