How to improve cross-account access for SaaS applications accessing customer accounts

Post Syndicated from Ashwin Phadke original https://aws.amazon.com/blogs/security/how-to-improve-cross-account-access-for-saas-applications-accessing-customer-accounts/

Several independent software vendors (ISVs) and software as a service (SaaS) providers need to access their customers’ Amazon Web Services (AWS) accounts, especially if the SaaS product accesses data from customer environments. SaaS providers have adopted multiple variations of this third-party access scenario. In some cases, the providers ask the customer for an access key and a secret key, which is not recommended because these are long-term user credentials and require processes to be built for periodic rotation. However, in most cases, the provider has an integration guide with specific details on creating a cross-account AWS Identity and Access Management (IAM) role.

In all these scenarios, as a SaaS vendor, you should add the necessary protections to your SaaS implementation. At AWS, security is the top priority and we recommend that customers follow best practices and incorporate security in their product design. In this blog post intended for SaaS providers, I describe three ways to improve your cross-account access implementation for your products.

Why is this important?

As a security specialist, I’ve worked with multiple ISV customers on improving the security of their products, specifically on this third-party cross-account access scenario. Consumers of your SaaS products don’t want to give more access permissions than are necessary for the product’s proper functioning. At the same time, you should maintain and provide a secure SaaS product to protect your customers’ and your own AWS accounts from unauthorized access or privilege escalations.

Let’s consider a hypothetical scenario with a simple SaaS implementation where a customer is planning to use a SaaS product. In Figure 1, you can see that the SaaS product has multiple components that perform separate functions, for example, compute analysis, storage analysis, and log analysis. The SaaS provider asks the customer to provide IAM user credentials and uses those in their product to access customer resources. Let’s look at three techniques for improving the cross-account access in this scenario. Each technique builds on the previous one, so you could adopt an incremental approach to implement them.

Figure 1: SaaS architecture using customer IAM user credentials

Technique 1 – Using IAM roles and an external ID

As stated previously, IAM user credentials are long-term, so customers would need to implement processes to rotate these periodically and share them with the ISV.

As a better option, SaaS product components can use IAM roles, which provide short-term credentials to the component assuming the role. These credentials need to be refreshed depending on the role’s session duration setting (the default is 1 hour) to continue accessing the resources. IAM roles also provide an advantage for auditing purposes because each time an IAM principal assumes a role, a new session is created, and this can be used to identify and audit activity for separate sessions.

When using IAM roles for third-party access, an important consideration is the confused deputy problem, where an unauthorized entity could coerce the product components into performing an action against another customer’s resources. To mitigate this problem, a highly recommended approach is to use the external ID parameter when assuming roles in customers’ accounts. It’s important that you generate these external ID parameters so that they’re unique for each of your customers, for example, by using a customer ID or similar attribute. For external ID character restrictions, see the IAM quotas page. Your customers will use this external ID in their IAM role’s trust policy, and your product components will pass it as a parameter in all AssumeRole API calls to customer environments. An example of the trust policy principal and condition blocks for the role to be assumed in the customer’s account follows:

    "Principal": {"AWS": "<SaaS Provider’s AWS account ID>"},
    "Condition": {"StringEquals": {"sts:ExternalId": "<Unique ID Assigned by SaaS Provider>"}}

Figure 2: SaaS architecture using an IAM role and external ID

Technique 2 – Using least-privilege IAM policies and role chaining

As an IAM best practice, we recommend that an IAM role have only the minimum set of permissions required to perform its functions. When your customers create an IAM role in Technique 1, they might inadvertently provide more permissions than necessary to use your product. The role could have permissions associated with multiple AWS services and might become overly permissive. If you provide granular permissions for separate AWS services, you might reach the policy size quota or the policies per role quota. See IAM quotas for more information. That’s why, in addition to Technique 1, we recommend that each component have a separate IAM role in the customer’s account with only the minimum permissions required for its functions.

As part of your integration guide for the customer, you should ask them to create appropriate IAM policies for these IAM roles. There needs to be a clear separation of duties and least-privilege access for the product components. For example, an account-monitoring SaaS provider might use a separate IAM role for Amazon Elastic Compute Cloud (Amazon EC2) monitoring and another one for AWS CloudTrail monitoring. Your components will also use separate IAM roles in your own AWS account. However, you might want to provide a single integration IAM role to customers to establish the trust relationship with each component role in their account. In effect, you will be using the concept of role chaining to access your customers’ accounts. The auditing mechanisms on the customer’s end will only display the integration IAM role sessions.

When using role chaining, you must be aware of certain caveats and limitations. Your components will each have separate roles: Role A, which will assume the integration role (Role B), and then use the Role B credentials to assume the customer role (Role C) in the customer’s account. You need to properly define the correct permissions for each of these roles, because the permissions of the previous role aren’t passed when assuming the next role. Optionally, you can pass an IAM policy document known as a session policy as a parameter while assuming the role, and the effective permissions will be the intersection of the passed policy and the permissions attached to the role. To learn more, see Session policies.

Another consideration of using role chaining is that it limits your AWS Command Line Interface (AWS CLI) or AWS API role session duration to a maximum of one hour. This means that you must track the sessions and perform credential refresh actions every hour to continue accessing the resources.
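To make the chain concrete, the following is a minimal sketch (Python with boto3; the role ARNs, external ID, and session policy are placeholder values) of a component assuming the integration role and then using those credentials to assume the customer role with a session policy:

import json
import boto3

# Step 1: the component (running as Role A) assumes the integration role (Role B).
sts_role_a = boto3.client("sts")
role_b = sts_role_a.assume_role(
    RoleArn="arn:aws:iam::444455556666:role/ExampleIntegrationRole",  # placeholder
    RoleSessionName="integration-session",
)["Credentials"]

# Step 2: the Role B credentials assume the customer role (Role C).
sts_role_b = boto3.client(
    "sts",
    aws_access_key_id=role_b["AccessKeyId"],
    aws_secret_access_key=role_b["SecretAccessKey"],
    aws_session_token=role_b["SessionToken"],
)

# Optional session policy: effective permissions are the intersection of this
# document and the permissions attached to Role C.
session_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "ec2:Describe*", "Resource": "*"}
    ],
}

role_c = sts_role_b.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/ExampleCustomerRole",  # placeholder
    RoleSessionName="ec2-monitoring-session",
    ExternalId="example-customer-1234",
    Policy=json.dumps(session_policy),
    DurationSeconds=3600,  # role chaining limits sessions to a maximum of one hour
)["Credentials"]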

Figure 3: SaaS architecture with role chaining

Technique 3 – Using role tags and session tags for attribute-based access control

When you create your IAM roles for role chaining, you define which entity can assume the role. You will need to add each component-specific IAM role to the integration role’s trust relationship. As the number of components within your product increases, you might reach the maximum length of the role trust policy. See IAM quotas for more information.

That’s why, in addition to the previous two techniques, we recommend using attribute-based access control (ABAC), which is an authorization strategy that defines permissions based on tag attributes. You should tag all the component IAM roles with role tags and use these role tags as conditions in the trust policy for the integration role, as shown in the following example. Optionally, you could also include instructions in the product integration guide for tagging customers’ IAM roles with certain role tags, and modify the IAM policy of the integration role so that it can assume only roles with those role tags. This helps in reducing IAM policy length and minimizing the risk of reaching the IAM quotas.

"Condition": {
     "StringEquals": {"iam:ResourceTag/<Product>": "<ExampleSaaSProduct>"}

Another consideration for improving the auditing and traceability for your product is IAM role session tags. These could be helpful if you use CloudTrail log events for alerting on specific role sessions. If your SaaS product also operates on CloudTrail logs, you could use these session tags to identify the different sessions from your product. As opposed to role tags, which are tags attached to an IAM role, session tags are key-value pair attributes that you pass when you assume an IAM role. These can be used to identify a session and further control or restrict access to resources based on the tags. Session tags can also be used along with role chaining. When you use session tags with role chaining, you can set the keys as transitive to make sure that you pass them to subsequent sessions. CloudTrail log events for these role sessions will contain the session tags, transitive tags, and role (also called principal) tags.
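The following is a minimal sketch (Python with boto3; the role ARN, tag keys, and tag values are placeholders) that passes session tags and marks them as transitive so that they flow to subsequent sessions in the chain:

import boto3

sts = boto3.client("sts")

# Placeholder role ARN and tag values for illustration.
response = sts.assume_role(
    RoleArn="arn:aws:iam::444455556666:role/ExampleIntegrationRole",
    RoleSessionName="log-analysis-session",
    Tags=[
        {"Key": "Product", "Value": "ExampleSaaSProduct"},
        {"Key": "Component", "Value": "log-analysis"},
    ],
    # Transitive keys are passed along to subsequent roles in the chain and
    # appear in the CloudTrail log events for those sessions.
    TransitiveTagKeys=["Product", "Component"],
)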

Conclusion

In this post, we discussed three incremental techniques that build on each other and that SaaS providers can use to improve security and access control when implementing cross-account access to their customers’ accounts. As a SaaS provider, it’s important to verify that your product adheres to security best practices. When you improve security for your product, you’re also improving security for your customers.

To see more tutorials about cross-account access concepts, visit the AWS documentation on IAM Roles, ABAC, and session tags.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Identity and Access Management re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Ashwin Phadke

Ashwin is a Sr. Solutions Architect, working with large enterprises and ISV customers to build highly available, scalable, and secure applications, and to help them successfully navigate through their cloud journey. He is passionate about information security and enjoys working on creative solutions for customers’ security challenges.

Optimize AWS administration with IAM paths

Post Syndicated from David Rowe original https://aws.amazon.com/blogs/security/optimize-aws-administration-with-iam-paths/

As organizations expand their Amazon Web Services (AWS) environment and migrate workloads to the cloud, they find themselves dealing with many AWS Identity and Access Management (IAM) roles and policies. These roles and policies multiply because IAM plays a crucial role in securing and controlling access to AWS resources. Imagine you have a team creating an application. You create an IAM role to grant them access to the necessary AWS resources, such as Amazon Simple Storage Service (Amazon S3) buckets, AWS Key Management Service (AWS KMS) keys, and Amazon Elastic File System (Amazon EFS) shares. With additional workloads and new data access patterns, the number of IAM roles and policies naturally increases. With the growing complexity of resources and data access patterns, it becomes crucial to streamline access and simplify the management of IAM policies and roles.

In this blog post, we illustrate how you can use IAM paths to organize IAM policies and roles and provide examples you can use as a foundation for your own use cases.

How to use paths with your IAM roles and policies

When you create a role or policy, you create it with a default path. In IAM, the default path for resources is “/”. Instead of using the default path, you can create and use paths and nested paths as a structure to manage IAM resources. The following example shows an IAM role named S3Access in the path /developer/:

arn:aws:iam::111122223333:role/developer/S3Access

Service-linked roles are created in the reserved path /aws-service-role/. The following is an example of a service-linked role path:

arn:aws:iam::*:role/aws-service-role/SERVICE-NAME.amazonaws.com/SERVICE-LINKED-ROLE-NAME

The following example is of an IAM policy named S3ReadOnlyAccess in the path /security/:

arn:aws:iam::111122223333:policy/security/S3ReadOnlyAccess

Why use IAM paths with roles and policies?

By using IAM paths with roles and policies, you can create groupings and design a logical separation to simplify management. You can use these groupings to grant access to teams, delegate permissions, and control what roles can be passed to AWS services. In the following sections, we illustrate how to use IAM paths to create groupings of roles and policies by referencing a fictional company and its expansion of AWS resources.

First, to create roles and policies with a path, you use the IAM API or the AWS Command Line Interface (AWS CLI), for example by running aws iam create-role.

The following is an example of an AWS CLI command that creates a role in an IAM path.

aws iam create-role --role-name <ROLE-NAME> --assume-role-policy-document file://assume-role-doc.json --path <PATH>

Replace <ROLE-NAME> and <PATH> in the command with your role name and role path respectively. Use a trust policy for the trust document that matches your use case. The following example trust policy allows Amazon Elastic Compute Cloud (Amazon EC2) instances to assume this role on your behalf:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sts:AssumeRole"
            ],
            "Principal": {
                "Service": [
                    "ec2.amazonaws.com"
                ]
            }
        }
    ]
}

The following is an example of an AWS CLI command that creates a policy in an IAM path.

aws iam create-policy --policy-name <POLICY-NAME> --path <PATH> --policy-document file://policy.json
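
If you prefer to work with the API directly, the same Path parameter is available there. The following is a minimal sketch (Python with boto3; the role name, policy name, and permissions are placeholders for illustration) that creates a role and a customer managed policy in the /teamA/ path and attaches the policy to the role:

import json
import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Principal": {"Service": "ec2.amazonaws.com"},
        }
    ],
}

# Placeholder role and policy names for illustration.
iam.create_role(
    RoleName="DBA-role-A",
    Path="/teamA/",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

policy = iam.create_policy(
    PolicyName="DBA-policy-A",
    Path="/teamA/",
    PolicyDocument=json.dumps(
        {
            "Version": "2012-10-17",
            "Statement": [
                {"Effect": "Allow", "Action": "rds:Describe*", "Resource": "*"}
            ],
        }
    ),
)

iam.attach_role_policy(
    RoleName="DBA-role-A",
    PolicyArn=policy["Policy"]["Arn"],
)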

IAM paths sample implementation

Let’s assume you’re a cloud platform architect at AnyCompany, a startup that’s planning to expand its AWS environment. By the end of the year, AnyCompany is going to expand from one team of developers to multiple teams, all of which require access to AWS. You want to design a scalable way to manage IAM roles and policies to simplify the administrative process to give permissions to each team’s roles. To do that, you create groupings of roles and policies based on teams.

Organize IAM roles with paths

AnyCompany decided to create the following roles based on teams.

Team name | Role name | IAM path | Has access to
Security | universal-security-readonly | /security/ | All resources
Team A database administrators | DBA-role-A | /teamA/ | Team A's databases
Team B database administrators | DBA-role-B | /teamB/ | Team B's databases

The following are example Amazon Resource Names (ARNs) for the roles listed above. In this example, you define IAM paths to create a grouping based on team names.

  1. arn:aws:iam::444455556666:role/security/universal-security-readonly-role
  2. arn:aws:iam::444455556666:role/teamA/DBA-role-A
  3. arn:aws:iam::444455556666:role/teamB/DBA-role-B

Note: Role names must be unique within your AWS account regardless of their IAM paths. You cannot have two roles named DBA-role, even if they’re in separate paths.
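
Because paths group related roles under a common prefix, you can also enumerate a team’s roles programmatically. The following is a minimal sketch (Python with boto3) that lists the roles under the /teamA/ path:

import boto3

iam = boto3.client("iam")
paginator = iam.get_paginator("list_roles")

# PathPrefix limits the results to roles created under the /teamA/ path.
for page in paginator.paginate(PathPrefix="/teamA/"):
    for role in page["Roles"]:
        print(role["Arn"])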

Organize IAM policies with paths

After you’ve created roles in IAM paths, you will create policies to provide permissions to these roles. The same path structure that was defined in the IAM roles is used for the IAM policies. The following is an example of how to create a policy with an IAM path. After you create the policy, you can attach the policy to a role using the attach-role-policy command.

aws iam create-policy --policy-name <POLICY-NAME> --policy-document file://policy-doc.json --path <PATH>

The following are example ARNs for the policies that AnyCompany created, using the same team-based paths:

  1. arn:aws:iam::444455556666:policy/security/universal-security-readonly-policy
  2. arn:aws:iam::444455556666:policy/teamA/DBA-policy-A
  3. arn:aws:iam::444455556666:policy/teamB/DBA-policy-B

Grant access to groupings of IAM roles with resource-based policies

Now that you’ve created roles and policies in paths, you can more readily define which groups of principals can access a resource. In this deny statement example, you allow only the roles in the IAM path /teamA/ to act on your bucket, and you deny access to all other IAM principals. Rather than use individual roles to deny access to the bucket, which would require you to list every role, you can deny access to an entire group of principals by path. If you create a new role in the specified path in your AWS account, you don’t need to modify the policy to include it. The path-based deny statement will apply automatically.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::EXAMPLE-BUCKET",
        "arn:aws:s3:::EXAMPLE-BUCKET/*"
      ],
      "Principal": "*",
      "Condition": {
        "ArnNotLike": {
          "aws:PrincipalArn": "arn:aws:iam::*:role/teamA/*"
        }
      }
    }
  ]
}

Delegate access with IAM paths

IAM paths can also enable teams to more safely create IAM roles and policies and allow teams to use only the roles and policies contained in their paths. Paths can help prevent privilege escalation by denying teams the use of roles that don’t belong to them.

Continuing the example above, AnyCompany established a process that allows each team to create their own IAM roles and policies, provided they’re in a specified IAM path. For example, AnyCompany allows team A to create IAM roles and policies for team A in the path /teamA/:

  1. arn:aws:iam::444455556666:role/teamA/<role-name>
  2. arn:aws:iam::444455556666:policy/teamA/<policy-name>

Using IAM paths, AnyCompany can allow team A to more safely create and manage their own IAM roles and policies and safely pass those roles to AWS services using the iam:PassRole permission.

At AnyCompany, four IAM policies that use IAM paths allow teams to more safely create and manage their own IAM roles and policies. Following recommended best practices, AnyCompany uses infrastructure as code (IaC) for all IAM role and policy creation. The four path-based policies that follow will be attached to each team’s CI/CD pipeline role, which has permissions to create roles. The following example focuses on team A and how these policies enable the team to self-manage its IAM roles and policies.

  1. Create a role in the path and modify inline policies on the role: This policy allows three actions: iam:CreateRole, iam:PutRolePolicy, and iam:DeleteRolePolicy. When this policy is attached to a principal, that principal is allowed to create roles in the IAM path /teamA/ and add and delete inline policies on roles in that IAM path.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "iam:CreateRole",
                    "iam:PutRolePolicy",
                    "iam:DeleteRolePolicy"
                ],
                "Resource": "arn:aws:iam::444455556666:role/teamA/*"
            }
        ]
    }

  2. Add and remove managed policies: The second policy example allows two actions: iam:AttachRolePolicy and iam:DetachRolePolicy. This policy allows a principal to attach and detach managed policies in the /teamA/ path to roles that are created in the /teamA/ path.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "iam:AttachRolePolicy",
                    "iam:DetachRolePolicy"
                ],
                "Resource": "arn:aws:iam::444455556666:role/teamA/*",
                "Condition": {
                    "ArnLike": {
                        "iam:PolicyARN": "arn:aws:iam::444455556666:policy/teamA/*"
                    }
                }
            }
        ]
    }

  3. Delete roles, tag and untag roles, read roles: The third policy allows a principal to delete roles, tag and untag roles, and retrieve information about roles that are created in the /teamA/ path.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "iam:DeleteRole",
                    "iam:TagRole",
                    "iam:UntagRole",
                    "iam:GetRole",
                    "iam:GetRolePolicy"
                ],
                "Resource": "arn:aws:iam::444455556666:role/teamA/*"
            }
        ]
    }

  4. Policy management in IAM path: The final policy example allows access to create, modify, get, and delete policies that are created in the /teamA/ path. This includes creating, deleting, and tagging policies.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "iam:CreatePolicy",
                    "iam:DeletePolicy",
                    "iam:CreatePolicyVersion",
                    "iam:DeletePolicyVersion",
                    "iam:GetPolicy",
                    "iam:TagPolicy",
                    "iam:UntagPolicy",
                    "iam:SetDefaultPolicyVersion",
                    "iam:ListPolicyVersions"
                ],
                "Resource": "arn:aws:iam::444455556666:policy/teamA/*"
            }
        ]
    }

Safely pass roles with IAM paths and iam:PassRole

To pass a role to an AWS service, a principal must have the iam:PassRole permission. IAM paths are the recommended option to restrict which roles a principal can pass when granted the iam:PassRole permission. IAM paths help verify that principals can pass only specific roles or groupings of roles to an AWS service.

At AnyCompany, the security team wants to allow team A to add IAM roles to an instance profile and attach it to Amazon EC2 instances, but only if the roles are in the /teamA/ path. The IAM action that allows team A to provide the role to the instance is iam:PassRole. The security team doesn’t want team A to be able to pass other roles in the account, such as an administrator role.

The policy that follows allows passing of a role that was created in the /teamA/ path and does not allow the passing of other roles such as an administrator role.

{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "iam:PassRole",
        "Resource": "arn:aws:iam::444455556666:role/teamA/*"
    }]
}

How to create preventative guardrails for sensitive IAM paths

You can use service control policies (SCPs) to restrict access to sensitive roles and policies within specific IAM paths. You can use an SCP to prevent the modification of sensitive roles and policies that are created in a defined path.

In the following SCP, the IAM path appears in the Resource and Condition portions of the statement. Only the role named IAMAdministrator created in the /security/ path can create or modify roles in the security path. This SCP allows you to delegate IAM role and policy management to developers with confidence that they won’t be able to create, modify, or delete any roles or policies in the security path.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RestrictIAMWithPathManagement",
            "Effect": "Deny",
            "Action": [
                "iam:AttachRolePolicy",
                "iam:CreateRole",
                "iam:DeleteRole",
                "iam:DeleteRolePermissionsBoundary",
                "iam:DeleteRolePolicy",
                "iam:DetachRolePolicy",
                "iam:PutRolePermissionsBoundary",
                "iam:PutRolePolicy",
                "iam:UpdateRole",
                "iam:UpdateAssumeRolePolicy",
                "iam:UpdateRoleDescription",
                "sts:AssumeRole",
                "iam:TagRole",
                "iam:UntagRole"
            ],
            "Resource": [
                "arn:aws:iam::*:role/security/*"
            ],
            "Condition": {
                "ArnNotLike": {
                    "aws:PrincipalArn": "arn:aws:iam::444455556666:role/security/IAMAdministrator"
                }
            }
        }
    ]
}

This next example shows how you can safely exempt IAM roles created in the security path from specific controls in your organization. The policy denies the organizations:CloseAccount action to all principals except roles created in the /security/ IAM path, so only those roles can close member accounts.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PreventCloseAccount",
      "Effect": "Deny",
      "Action": "organizations:CloseAccount",
      "Resource": "*",
      "Condition": {
        "ArnNotLikeIfExists": {
          "aws:PrincipalArn": [
            "arn:aws:iam::*:role/security/*"
          ]
        }
      }
    }
  ]
}

Additional considerations when using IAM paths

You should be aware of some additional considerations when you start using IAM paths.

  1. Paths are immutable for IAM roles and policies. To change a path, you must delete the IAM resource and recreate it in the alternative path. See Deleting roles or instance profiles for step-by-step instructions on deleting an IAM resource.
  2. You can only create IAM paths using the AWS API or command line tools. You cannot create IAM paths with the AWS Management Console.
  3. IAM paths don’t contribute to the uniqueness of role names. Role names must be unique within your account, regardless of path.
  4. AWS reserves several paths, including /aws-service-role/, and you cannot create roles in these reserved paths.

Conclusion

IAM paths provide a powerful mechanism for effectively grouping IAM resources. Path-based groupings can streamline access management across AWS services. In this post, you learned how to use paths with IAM principals to create structured access with IAM roles, how to delegate and segregate access within an account, and how to safely pass roles using iam:PassRole. These techniques can empower you to fine-tune your AWS access management and help improve security while streamlining operational workflows.

You can use the following references to help extend your knowledge of IAM paths. This post references the processes outlined in the user guides and blog post, and sources the IAM policies from the GitHub repositories.

  1. AWS Organizations User Guide, SCP General Examples
  2. AWS-Samples Service-control-policy-examples GitHub Repository
  3. AWS Security Blog: IAM Policy types: How and when to use them
  4. AWS-Samples how-and-when-to-use-aws-iam-policy-blog-samples GitHub Repository

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

David Rowe

As a Senior Solutions Architect, David unites diverse global teams to drive cloud transformation through strategies and intuitive identity solutions. He creates consensus, guiding teams to adopt emerging technologies. He thrives on bringing together cross-functional perspectives to transform vision into reality in dynamic industries.

Security at multiple layers for web-administered apps

Post Syndicated from Guy Morton original https://aws.amazon.com/blogs/security/security-at-multiple-layers-for-web-administered-apps/

In this post, I will show you how to apply security at multiple layers of a web application hosted on AWS.

“Apply security at all layers” is a design principle of the Security pillar of the AWS Well-Architected Framework. It encourages you to apply security at the network edge, virtual private cloud (VPC), load balancer, compute instance (or service), operating system, application, and code.

Many popular web apps are designed with a single layer of security: the login page. Behind that login page is an in-built administration interface that is directly exposed to the internet. Admin interfaces for these apps typically have simple login mechanisms and often lack multi-factor authentication (MFA) support, which can make them an attractive target for threat actors.

The in-built admin interface can also be problematic if you want to horizontally scale across multiple servers. The admin interface is available on every server that runs the app, so it creates a large attack surface. Because the admin interface updates the software on its own server, you must synchronize updates across a fleet of instances.

Multi-layered security is about identifying (or creating) isolation boundaries around the parts of your architecture and minimizing what is permitted to cross each boundary. Adding more layers to your architecture gives you the opportunity to introduce additional controls at each layer, creating more boundaries where security controls can be enforced.

In the example app scenario in this post, you have the opportunity to add many additional layers of security.

Example of multi-layered security

This post demonstrates how you can use the Run Web-Administered Apps on AWS sample project to help address these challenges, by implementing a horizontally-scalable architecture with multi-layered security. The project builds and configures many different AWS services, each designed to help provide security at different layers.

By running this solution, you can produce a segmented architecture that separates the two functions of these apps into an unprivileged public-facing view and an admin view. This design limits access to the web app’s admin functions while creating a fleet of unprivileged instances to serve the app at scale.

Figure 1 summarizes how the different services in this solution work to help provide security at the following layers:

  1. At the network edge
  2. Within the VPC
  3. At the load balancer
  4. On the compute instances
  5. Within the operating system

Figure 1: Logical flow diagram to apply security at multiple layers

Deep dive on a multi-layered architecture

The following diagram shows the solution architecture deployed by Run Web-Administered Apps on AWS. The figure shows how the services in this solution are deployed across AWS Regions, and how requests flow from the application user through the different service layers.

Figure 2: Multi-layered architecture

This post will dive deeper into each of the architecture’s layers to see how security is added at each layer. But before we talk about the technology, let’s consider how infrastructure is built and managed — by people.

Perimeter 0 – Security at the people layer

Security starts with the people in your team and your organization’s operational practices. How your “people layer” builds and manages your infrastructure contributes significantly to your security posture.

A design principle of the Security pillar of the Well-Architected Framework is to automate security best practices. This helps in two ways: it reduces the effort required by people over time, and it helps prevent resources from being in inconsistent or misconfigured states. When people use manual processes to complete tasks, misconfigurations and missed steps are common.

The simplest way to automate security while reducing human effort is to adopt services that AWS manages for you, such as Amazon Relational Database Service (Amazon RDS). With Amazon RDS, AWS is responsible for the operating system and database software patching, and provides tools to make it simple for you to back up and restore your data.

You can automate and integrate key security functions by using managed AWS security services, such as Amazon GuardDuty, AWS Config, Amazon Inspector, and AWS Security Hub. These services provide network monitoring, configuration management, and detection of software vulnerabilities and unintended network exposure. As your cloud environments grow in scale and complexity, automated security monitoring is critical.

Infrastructure as code (IaC) is a best practice that you can follow to automate the creation of infrastructure. By using IaC to define, configure, and deploy the AWS resources that you use, you reduce the likelihood of human error when building AWS infrastructure.

Adopting IaC can help you improve your security posture because it applies the rigor of application code development to infrastructure provisioning. Storing your infrastructure definition in a source control system (such as AWS CodeCommit) creates an auditable artifact. With version control, you can track changes made to it over time as your architecture evolves.

You can add automated testing to your IaC project to help ensure that your infrastructure is aligned with your organization’s security policies. If you ever need to recover from a disaster, you can redeploy the entire architecture from your IaC project.

Another people-layer discipline is to apply the principle of least privilege. AWS Identity and Access Management (IAM) is a flexible and fine-grained permissions system that you can use to grant the smallest set of actions that your solution needs. You can use IAM to control access for both humans and machines, and we use it in this project to grant the compute instances the least privileges required.

You can also adopt other IAM best practices such as using temporary credentials instead of long-lived ones (such as access keys), and regularly reviewing and removing unused users, roles, permissions, policies, and credentials.

Perimeter 1 – network protections

The internet is public and therefore untrusted, so you must proactively address the risks from threat actors and network-level attacks.

To reduce the risk of distributed denial of service (DDoS) attacks, this solution uses AWS Shield for managed protection at the network edge. AWS Shield Standard is automatically enabled for all AWS customers at no additional cost and is designed to provide protection from common network and transport layer DDoS attacks. For higher levels of protection against attacks that target your applications, subscribe to AWS Shield Advanced.

Amazon Route 53 resolves the hostnames that the solution uses and maps the hostnames as aliases to an Amazon CloudFront distribution. Route 53 is a robust and highly available globally distributed DNS service that inspects requests to protect against DNS-specific attack types, such as DNS amplification attacks.

Perimeter 2 – request processing

CloudFront also operates at the AWS network edge and caches, transforms, and forwards inbound requests to the relevant origin services across the low-latency AWS global network. The risk of DDoS attempts overwhelming your application servers is further reduced by caching web requests in CloudFront.

The solution configures CloudFront to add a shared secret to the origin request within a custom header. A CloudFront function copies the originating user’s IP to another custom header. These headers get checked when the request arrives at the load balancer.

AWS WAF, a web application firewall, blocks known bad traffic, including cross-site scripting (XSS) and SQL injection events that come into CloudFront. This project uses AWS Managed Rules, but you can add your own rules, as well. To restrict frontend access to permitted IP CIDR blocks, this project configures an IP restriction rule on the web application firewall.

Perimeter 3 – the VPC

After CloudFront and AWS WAF check the request, CloudFront forwards it to the compute services inside an Amazon Virtual Private Cloud (Amazon VPC). VPCs are logically isolated networks within your AWS account that you can use to control the network traffic that is allowed in and out. This project configures its VPC to use a private IPv4 CIDR block that cannot be directly routed to or from the internet, creating a network perimeter around your resources on AWS.

The Amazon Elastic Compute Cloud (Amazon EC2) instances are hosted in private subnets within the VPC that have no inbound route from the internet. Using a NAT gateway, instances can make necessary outbound requests. This design hosts the database instances in isolated subnets that don’t have inbound or outbound internet access. Amazon RDS is a managed service, so AWS manages patching of the server and database software.

The solution accesses AWS Secrets Manager by using an interface VPC endpoint. VPC endpoints use AWS PrivateLink to connect your VPC to AWS services as if they were in your VPC. In this way, resources in the VPC can communicate with Secrets Manager without traversing the internet.

The project configures VPC Flow Logs as part of the VPC setup. VPC flow logs capture information about the IP traffic going to and from network interfaces in your VPC. GuardDuty analyzes these logs and uses threat intelligence data to identify unexpected, potentially unauthorized, and malicious activity within your AWS environment.

Although using VPCs and subnets to segment parts of your application is a common strategy, there are other ways that you can achieve partitioning for application components:

  • You can use separate VPCs to restrict access to a database, and use VPC peering to route traffic between them.
  • You can use a multi-account strategy so that different security and compliance controls are applied in different accounts to create strong logical boundaries between parts of a system. You can route network requests between accounts by using services such as AWS Transit Gateway, and control them using AWS Network Firewall.

There are always trade-offs between complexity, convenience, and security, so the right level of isolation between components depends on your requirements.

Perimeter 4 – the load balancer

After the request is sent to the VPC, an Application Load Balancer (ALB) processes it. The ALB distributes requests to the underlying EC2 instances. The ALB uses TLS version 1.2 to encrypt incoming connections with an AWS Certificate Manager (ACM) certificate.

Public access to the load balancer isn’t allowed. A security group applied to the ALB only allows inbound traffic on port 443 from the CloudFront IP range. This is achieved by specifying the Region-specific AWS-managed CloudFront prefix list as the source in the security group rule.
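
As an illustration of what such a rule can look like in AWS CDK (Python), the following sketch uses placeholder construct names and a placeholder prefix list ID; the sample project defines its own resources, so treat this as a sketch rather than the project’s implementation:

from aws_cdk import aws_ec2 as ec2

# Inside a Stack or Construct; `vpc` is an existing ec2.IVpc.
# "pl-xxxxxxxx" is a placeholder for the Region-specific
# com.amazonaws.global.cloudfront.origin-facing managed prefix list ID.
alb_security_group = ec2.SecurityGroup(
    self, "AlbSecurityGroup", vpc=vpc, allow_all_outbound=True
)
alb_security_group.add_ingress_rule(
    peer=ec2.Peer.prefix_list("pl-xxxxxxxx"),
    connection=ec2.Port.tcp(443),
    description="Allow HTTPS inbound from CloudFront only",
)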

The ALB uses rules to decide whether to forward the request to the target instances or reject the traffic. As an additional layer of security, it uses the custom headers that the CloudFront distribution added to make sure that the request is from CloudFront. In another rule, the ALB uses the originating user’s IP to decide which target group of Amazon EC2 instances should handle the request. In this way, you can direct admin users to instances that are configured to allow admin tasks.

If a request doesn’t match a valid rule, the ALB returns a 404 response to the user.
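
To sketch how such header-based routing might be expressed in AWS CDK (Python), the following example uses hypothetical header names and values; the sample project defines its own header names, rules, and target groups:

from aws_cdk import aws_elasticloadbalancingv2 as elbv2

# Hypothetical header names and values; assumes `listener` is an existing
# ApplicationListener and `admin_target_group` is the target group for the
# admin Auto Scaling group. Requests that match no rule fall through to the
# listener's default action, which the solution configures to return a 404.
listener.add_action(
    "ForwardAdminRequests",
    priority=10,
    conditions=[
        # The request must carry the shared secret that CloudFront injected.
        elbv2.ListenerCondition.http_header("x-origin-verify", ["<shared-secret>"]),
        # The originating user IP, copied into a header by the CloudFront function.
        elbv2.ListenerCondition.http_header("x-client-ip", ["203.0.113.10"]),
    ],
    action=elbv2.ListenerAction.forward([admin_target_group]),
)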

Perimeter 5 – compute instance network security

A security group creates an isolation boundary around the EC2 instances. The only traffic that reaches the instance is the traffic that the security group rules allow. In this solution, only the ALB is allowed to make inbound connections to the EC2 instances.

A common practice is for customers to also open ports, or to set up and manage bastion hosts to provide remote access to their compute instances. The risk in this approach is that the ports could be left open to the whole internet, exposing the instances to vulnerabilities in the remote access protocol. With remote work on the rise, there is an increased risk for the creation of these overly permissive inbound rules.

Using AWS Systems Manager Session Manager, you can remove the need for bastion hosts or open ports by creating secure temporary connections to your EC2 instances using the installed SSM agent. As with every software package that you install, you should check that the SSM agent aligns with your security and compliance requirements. To review the source code of the SSM agent, see the amazon-ssm-agent GitHub repo.

The compute layer of this solution consists of two separate Amazon EC2 Auto Scaling groups of EC2 instances. One group handles requests from administrators, while the other handles requests from unprivileged users. This creates another isolation boundary by keeping the functions separate while also helping to protect the system from a failure in one component causing the whole system to fail. Each Amazon EC2 Auto Scaling group spans multiple Availability Zones (AZs), providing resilience in the event of an outage in an AZ.

By using managed database services, you can reduce the risk that database server instances haven’t been proactively patched for security updates. Managed infrastructure helps reduce the risk of security issues that result from the underlying operating system not receiving security patches in a timely manner and the risk of downtime from hardware failures.

Perimeter 6 – compute instance operating system

When instances are first launched, the operating system must be secure, and the instances must be updated as required when new security patches are released. We recommend that you create immutable servers that you build and harden by using a tool such as EC2 Image Builder. Instead of patching running instances in place, replace them when an updated Amazon Machine Image (AMI) is created. This approach works in our example scenario because the application code (which changes over time) is stored on Amazon Elastic File System (Amazon EFS), so when you replace the instances with a new AMI, you don’t need to update them with data that has changed after the initial deployment.

Another way that the solution helps improve security on your instances at the operating system layer is by using EC2 instance profiles to allow them to assume IAM roles. IAM roles grant temporary credentials to applications running on EC2, instead of using hard-coded credentials stored on the instance. Access to other AWS resources is provided using these temporary credentials.

The IAM roles have least privilege policies attached that grant permission to mount the EFS file system and access AWS Systems Manager. If a database secret exists in Secrets Manager, the IAM role is granted permission to access it.

Perimeter 7 – at the file system

Both Amazon EC2 Auto Scaling groups of EC2 instances share access to Amazon EFS, which hosts the files that the application uses. IAM authorization, applied through EFS file system policies, controls each instance’s access to the file system. This creates another isolation boundary that helps prevent the non-admin instances from modifying the application’s files.

The admin group’s instances have the file system mounted in read-write mode. This is necessary so that the application can update itself, install add-ons, upload content, or make configuration changes. On the unprivileged instances, the file system is mounted in read-only mode. This means that these instances can’t make changes to the application code or configuration files.

The unprivileged instances have local file caching enabled. This caches files from the EFS file system on the local Amazon Elastic Block Store (Amazon EBS) volume to help improve scalability and performance.

Perimeter 8 – web server configuration

This solution applies different web server configurations to the instances running in each Amazon EC2 Auto Scaling group. This creates a further isolation boundary at the web server layer.

The admin instances use the default configuration for the application that permits access to the admin interface. Non-admin, public-facing instances block admin routes, such as wp-login.php, and will return a 403 Forbidden response. This creates an additional layer of protection for those routes.

Perimeter 9 – database security

The database layer is within two additional isolation boundaries. The solution uses Amazon RDS, with database instances deployed in isolated subnets. Isolated subnets have no inbound or outbound internet access and can only be reached through other network interfaces within the VPC. The RDS security group further isolates the database instances by only allowing inbound traffic from the EC2 instances on the database server port.

By using IAM authentication for the database access, you can add an additional layer of security by configuring the non-admin instances with less privileged database user credentials.
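
If you enable IAM database authentication, an application can request a short-lived authentication token instead of storing a database password. The following is a minimal sketch (Python with boto3; the endpoint, port, and database user name are placeholders):

import boto3

rds = boto3.client("rds")

# Placeholder endpoint, port, and database user; the user must exist in the
# database and be configured for IAM authentication (for example, granted the
# rds_iam role on PostgreSQL or created with AWSAuthenticationPlugin on MySQL).
token = rds.generate_db_auth_token(
    DBHostname="wp-dev-db.cluster-abc123.us-west-2.rds.amazonaws.com",
    Port=3306,
    DBUsername="app_readonly",
)

# Use `token` as the password when opening the database connection over SSL.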

Perimeter 10 – Security at the application code layer

To apply security at the application code level, you should establish good practices around installing updates as they become available. Most applications have email lists that you can subscribe to that will notify you when updates become available.

You should evaluate the quality of an application before you adopt it. The following are some metrics to consider:

  • Number of developers who are actively working on it
  • Frequency of updates to it
  • How quickly the developers respond with patches when bugs are reported

Other steps that you can take

Use AWS Verified Access to help secure application access for human users. With Verified Access, you can add another user authentication stage, to help ensure that only verified users can access an application’s administrative functions.

Amazon GuardDuty is a threat detection service that continuously monitors your AWS accounts and workloads for malicious activity and delivers detailed security findings for visibility and remediation. It can detect communication with known malicious domains and IP addresses and identify anomalous behavior. GuardDuty Malware Protection helps you detect the potential presence of malware by scanning the EBS volumes that are attached to your EC2 instances.

Amazon Inspector is an automated vulnerability management service that automatically discovers the Amazon EC2 instances that are running and scans them for software vulnerabilities and unintended network exposure. To help ensure that your web server instances are updated when security patches are available, use AWS Systems Manager Patch Manager.

Deploy the sample project

We wrote the Run Web-Administered Apps on AWS project by using the AWS Cloud Development Kit (AWS CDK). With the AWS CDK, you can use the expressive power of familiar programming languages to define your application resources and accelerate development. The AWS CDK has support for multiple languages, including TypeScript, Python, .NET, Java, and Go.

This project uses Python. To deploy it, you need to have a working version of Python 3 on your computer. For instructions on how to install the AWS CDK, see Get Started with AWS CDK.

Configure the project

To enable this project to deploy multiple different web projects, you must do the configuration in the parameters.properties file. Two variables identify the configuration blocks: app (which identifies the web application to deploy) and env (which identifies whether the deployment is to a dev or test environment, or to production).

When you deploy the stacks, you specify the app and env variables as CDK context variables so that you can select between different configurations at deploy time. If you don’t specify a context, a [default] stanza in the parameters.properties file specifies the default app name and environment that will be deployed.

To name other stanzas, combine valid app and env values by using the format <app>-<env>. For each stanza, you can specify its own Regions, accounts, instance types, instance counts, hostnames, and more. For example, if you want to support three different WordPress deployments, you might specify the app name as wp, and for env, you might want dev, test, and prod, giving you three stanzas: wp-dev, wp-test, and wp-prod.

The project includes sample configuration items that are annotated with comments that explain their function.
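
For illustration only, a hypothetical wp-dev stanza might combine parameters described in this post as follows; the values are placeholders, and the parameters-template.properties file in the project remains the authoritative reference for the available settings:

[wp-dev]
# Restrict public access to an example CIDR block (use allowedIps=* to allow all)
allowedIps=203.0.113.0/24
# Create a single Amazon RDS instance running MySQL
dbConfig=instance
dbEngine=mysql
# Instance class without the db. prefix; the CDK prepends it automatically
dbInstanceType=t4g.small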

Use CDK bootstrapping

Before you can use the AWS CDK to deploy stacks into your account, you need to use CDK bootstrapping to provision resources in each AWS environment (account and Region combination) that you plan to use. For this project, you need to bootstrap both the US East (N. Virginia) Region (us-east-1) and the home Region in which you plan to host your application.

Create a hosted zone in the target account

You need to have a hosted zone in Route 53 to allow the creation of DNS records and certificates. You must manually create the hosted zone by using the AWS Management Console. You can delegate a domain that you control to Route 53 and use it with this project. You can also register a domain through Route 53 if you don’t currently have one.

Run the project

Clone the project to your local machine and navigate to the project root. To create the Python virtual environment (venv) and install the dependencies, follow the steps in the Generic CDK instructions.

To create and configure the parameters.properties file

Copy the parameters-template.properties file (in the root folder of the project) to a file called parameters.properties and save it in the root folder. Open it with a text editor and then do the following:

If you want to restrict public access to your site, change 192.0.2.0/24 to the IP range that you want to allow. By providing a comma-separated list of allowedIps, you can add multiple allowed CIDR blocks.

If you don’t want to restrict public access, set allowedIps=* instead.

If you have forked this project into your own private repository, you can commit the parameters.properties file to your repo. To do that, comment out the parameters.properties line in the .gitignore file.

To install the custom resource helper

The solution uses an AWS CloudFormation custom resource for cross-Region configuration management. To install the needed Python package, run the following command in the custom_resource directory:

cd custom_resource
pip install crhelper -t .

To learn more about CloudFormation custom resource creation, see AWS CloudFormation custom resource creation with Python, AWS Lambda, and crhelper.

To configure the database layer

Before you deploy the stacks, decide whether you want to include a data layer as part of the deployment. The dbConfig parameter determines what will happen, as follows:

  • If dbConfig is left empty — no database will be created and no database credentials will be available in your compute stacks
  • If dbConfig is set to instance — you will get a new Amazon RDS instance
  • If dbConfig is set to cluster — you will get an Amazon Aurora cluster
  • If dbConfig is set to none — if you previously created a database in this stack, the database will be deleted

If you specify either instance or cluster, you should also configure the following database parameters to match your requirements:

  • dbEngine — set the database engine to either mysql or postgres
  • dbSnapshot — specify the named snapshot for your database
  • dbSecret — if you are using an existing database, specify the Amazon Resource Name (ARN) of the secret where the database credentials and DNS endpoint are located
  • dbMajorVersion — set the major version of the engine that you have chosen; leave blank to get the default version
  • dbFullVersion — set the minor version of the engine that you have chosen; leave blank to get the default version
  • dbInstanceType — set the instance type that you want (note that these vary by service); don’t prefix with db. because the CDK will automatically prepend it
  • dbClusterSize — if you request a cluster, set this parameter to determine how many Amazon Aurora replicas are created

You can choose between mysql or postgres for the database engine. Other settings that you can choose are determined by that choice.

Because the compute instances use the AWS CLI in their user data to retrieve the database credentials, you will need to use an Amazon Machine Image (AMI) that has the CLI preinstalled, such as Amazon Linux 2, or install the AWS Command Line Interface (AWS CLI) yourself with a user data command. If, instead of creating a new, empty database, you want to create one from a snapshot, supply the snapshot name by using the dbSnapshot parameter.

To create the database secret

AWS automatically creates and stores the RDS instance or Aurora cluster credentials in a Secrets Manager secret when you create a new instance or cluster. You make these credentials available to the compute stack through the db_secret_command variable, which contains a single-line bash command that returns the JSON from the AWS CLI command aws secretsmanager get-secret-value. You can interpolate this variable into your user data commands as follows:

SECRET=$({db_secret_command})
USERNAME=`echo $SECRET | jq -r '.username'`
PASSWORD=`echo $SECRET | jq -r '.password'`
DBNAME=`echo $SECRET | jq -r '.dbname'`
HOST=`echo $SECRET | jq -r '.host'`

If you create a database from a snapshot, make sure that your Secrets Manager secret and Amazon RDS snapshot are in the target Region. If you supply the secret for an existing database, make sure that the secret contains at least the following four key-value pairs (replace the <placeholder values> with your values):

{
    "password":"<your-password>",
    "dbname":"<your-database-name>",
    "host":"<your-hostname>",
    "username":"<your-username>"
}

The name for the secret must match the app value followed by the env value (both in title case), followed by DatabaseSecret, so for app=wp and env=dev, your secret name should be WpDevDatabaseSecret.
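
If you script this naming convention, a small helper (Python, illustrative only) makes the title-case rule explicit:

def database_secret_name(app: str, env: str) -> str:
    """Return the Secrets Manager secret name that the compute stack expects."""
    return f"{app.title()}{env.title()}DatabaseSecret"

# database_secret_name("wp", "dev") returns "WpDevDatabaseSecret"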

To deploy the stacks

The following commands deploy the stacks defined in the CDK app. To deploy the stacks individually, use the specific stack names (these will vary according to the configuration that you supplied previously), as shown in the following examples.

cdk deploy wp-dev-network-stack -c app=wp -c env=dev
cdk deploy wp-dev-database-stack -c app=wp -c env=dev
cdk deploy wp-dev-compute-stack -c app=wp -c env=dev
cdk deploy wp-dev-cdn-stack -c app=wp -c env=dev

To create a database stack, deploy the network and database stacks first.

cdk deploy wp-dev-network-stack -c app=wp -c env=dev
cdk deploy wp-dev-database-stack -c app=wp -c env=dev

You can then initiate the deployment of the compute stack.

cdk deploy wp-dev-compute-stack -c app=wp -c env=dev

After the compute stack deploys, you can deploy the stack that creates the CloudFront distribution.

cdk deploy wp-dev-cdn-stack -c env=dev

This deploys the CloudFront infrastructure to the US East (N. Virginia) Region (us-east-1). CloudFront is a global AWS service, which means that you must create it in this Region. The other stacks are deployed to the Region that you specified in your configuration stanza.

To test the results

If your stacks deploy successfully, your site appears at one of the following URLs:

  • subdomain.hostedZone (if you specified a value for the subdomain) — for example, www.example.com
  • appName-env.hostedZone (if you didn’t specify a value for the subdomain) — for example, wp-dev.example.com.

If you connect through the IP address that you configured in the adminIps configuration, you should be connected to the admin instance for your site. Because the admin instance can modify the file system, you should use it to do your administrative tasks.

Users who connect to your site from an IP that isn’t in your allowedIps list will be connected to your fleet instances and won’t be able to alter the file system (for example, they won’t be able to install plugins or upload media).

If you need to redeploy the same app-env combination, manually remove the parameter store items and the replicated secret that you created in us-east-1. You should also delete the cdk.context.json file because it caches values that you will be replacing.

One project, multiple configurations

You can modify the configuration file in this project to deploy different applications to different environments using the same project. Each app can have different configurations for dev, test, or production environments.

Using this mechanism, you can deploy sites for test and production into different accounts or even different Regions. The solution uses CDK context variables as command-line switches to select different configuration stanzas from the configuration file.

CDK projects allow for multiple deployments to coexist in one account by using unique names for the deployed stacks, based on their configuration.

Check the configuration file into your source control repo so that you track changes made to it over time.

Got a different web app that you want to deploy? Create a new configuration by copying and pasting one of the examples and then modify the build commands as needed for your use case.

Conclusion

In this post, you learned how to build an architecture on AWS that implements multi-layered security. You can use different AWS services to provide protections to your application at different stages of the request lifecycle.

You can learn more about the services used in this sample project by building it in your own account. It’s a great way to explore how the different services work and the full features that are available. By understanding how these AWS services work, you will be ready to use them to add security, at multiple layers, in your own architectures.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Guy Morton

Guy is a Senior Solutions Architect at AWS. He enjoys bringing his decades of experience as a full stack developer, architect, and people manager to helping customers build and scale their applications securely in the AWS Cloud. Guy has a passion for automation in all its forms, and is also an occasional songwriter and musician who performs under the pseudonym Whtsqr.

Introducing IAM Access Analyzer custom policy checks

Post Syndicated from Mitch Beaumont original https://aws.amazon.com/blogs/security/introducing-iam-access-analyzer-custom-policy-checks/

AWS Identity and Access Management (IAM) Access Analyzer was launched in late 2019. Access Analyzer guides customers toward least-privilege permissions across Amazon Web Services (AWS) by using analysis techniques, such as automated reasoning, to make it simpler for customers to set, verify, and refine IAM permissions. Today, we are excited to announce the general availability of IAM Access Analyzer custom policy checks, a new IAM Access Analyzer feature that helps customers accurately and proactively check IAM policies for critical permissions and increases in policy permissiveness.

In this post, we’ll show how you can integrate custom policy checks into builder workflows to automate the identification of overly permissive IAM policies and IAM policies that contain permissions that you decide are sensitive or critical.

What is the problem?

Although security teams are responsible for the overall security posture of the organization, developers are the ones creating the applications that require permissions. To enable developers to move fast while maintaining high levels of security, organizations look for ways to safely delegate the ability of developers to author IAM policies. Many AWS customers implement manual IAM policy reviews before deploying developer-authored policies to production environments. Customers follow this practice to try to prevent excessive or unwanted permissions from finding their way into production. Depending on the volume and complexity of the policies that need to be reviewed, these reviews can be intensive and take time. The result is a slowdown in development and potential delay in deployment of applications and services. Some customers write custom tooling to remove the manual burden of policy reviews, but this can be costly to build and maintain.

How do custom policy checks solve that problem?

Custom policy checks are a new IAM Access Analyzer capability that helps security teams accurately and proactively identify critical permissions in their policies. Custom policy checks can also tell you if a new version of a policy is more permissive than the previous version. Custom policy checks use automated reasoning, a form of static analysis, to provide a higher level of security assurance in the cloud. For more information, see Formal Reasoning About the Security of Amazon Web Services.

Custom policy checks can be embedded in a continuous integration and continuous delivery (CI/CD) pipeline so that checks can be run against policies without having to deploy the policies. In addition, developers can run custom policy checks from their local development environments and get fast feedback about whether or not the policies they are authoring are in line with your organization’s security standards.

How to analyze IAM policies with custom policy checks

In this section, we provide step-by-step instructions for using custom policy checks to analyze IAM policies.

Prerequisites

To complete the examples in our walkthrough, you will need the following:

  1. An AWS account and an identity that has permissions to use the AWS services and create the resources used in the following examples. For more information, see the full sample code used in this blog post on GitHub.
  2. An installed and configured AWS CLI. For more information, see Configure the AWS CLI.
  3. The AWS Cloud Development Kit (AWS CDK). For installation instructions, refer to Install the AWS CDK.

Example 1: Use custom policy checks to compare two IAM policies and check that one does not grant more access than the other

In this example, you will create two IAM identity policy documents, NewPolicyDocument and ExistingPolicyDocument. You will use the new CheckNoNewAccess API to compare these two policies and check that NewPolicyDocument does not grant more access than ExistingPolicyDocument.

Step 1: Create two IAM identity policy documents

  1. Use the following command to create ExistingPolicyDocument.
    cat << EOF > existing-policy-document.json
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ec2:StartInstances",
                    "ec2:StopInstances"
                ],
                "Resource": "arn:aws:ec2:*:*:instance/*",
                "Condition": {
                    "StringEquals": {
                        "aws:ResourceTag/Owner": "\${aws:username}"
                    }
                }
            }
        ]
    }
    EOF

  2. Use the following command to create NewPolicyDocument.
    cat << EOF > new-policy-document.json
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ec2:StartInstances",
                    "ec2:StopInstances"
                ],
                "Resource": "arn:aws:ec2:*:*:instance/*"
            }
        ]
    }
    EOF

Notice that ExistingPolicyDocument grants access to the ec2:StartInstances and ec2:StopInstances actions only if the aws:ResourceTag/Owner condition resolves to true, that is, if the value of the resource's Owner tag matches the policy variable aws:username. NewPolicyDocument grants access to the same actions, but does not include a condition key.

Step 2: Check the policies by using the AWS CLI

  1. Use the following command to call the CheckNoNewAccess API to check whether NewPolicyDocument grants more access than ExistingPolicyDocument.
    aws accessanalyzer check-no-new-access \
    --new-policy-document file://new-policy-document.json \
    --existing-policy-document file://existing-policy-document.json \
    --policy-type IDENTITY_POLICY

After a moment, you will see a response from Access Analyzer. The response will look similar to the following.

{
    "result": "FAIL",
    "message": "The modified permissions grant new access compared to your existing policy.",
    "reasons": [
        {
            "description": "New access in the statement with index: 1.",
            "statementIndex": 1
        }
    ]
}

In this example, the validation returned a result of FAIL. This is because NewPolicyDocument is missing the condition key, potentially granting any principal with this identity policy attached more access than intended or needed.

Example 2: Use custom policy checks to check that an IAM policy does not contain sensitive permissions

In this example, you will create an IAM identity-based policy that contains a set of permissions. You will use the CheckAccessNotGranted API to check that the new policy does not give permissions to disable AWS CloudTrail or delete any associated trails.

Step 1: Create a new IAM identity policy document

  • Use the following command to create IamPolicyDocument.
    cat << EOF > iam-policy-document.json
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "cloudtrail:StopLogging",
                    "cloudtrail:Delete*"
                ],
                "Resource": ["*"] 
            }
        ]
    }
    EOF

Step 2: Check the policy by using the AWS CLI

  • Use the following command to call the CheckAccessNotGranted API to check if the new policy grants permission to the set of sensitive actions. In this example, you are asking Access Analyzer to check that IamPolicyDocument does not contain the actions cloudtrail:StopLogging or cloudtrail:DeleteTrail (passed as a list to the access parameter).
    aws accessanalyzer check-access-not-granted \
    --policy-document file://iam-policy-document.json \
    --access actions=cloudtrail:StopLogging,cloudtrail:DeleteTrail \
    --policy-type IDENTITY_POLICY

Because the policy that you created contains both cloudtrail:StopLogging and cloudtrail:DeleteTrail actions, Access Analyzer returns a FAIL.

{
    "result": "FAIL",
    "message": "The policy document grants access to perform one or more of the listed actions.",
    "reasons": [
        {
            "description": "One or more of the listed actions in the statement with index: 0.",
            "statementIndex": 0
        }
    ]
}
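.
The checks report PASS or FAIL in the result field of the response, so if you want a script or pipeline step to stop on a failed check, one approach is to parse that field. The following sketch assumes jq is installed; it wraps the same check-access-not-granted call shown above.

# Run the check and extract the result field (requires jq)
result=$(aws accessanalyzer check-access-not-granted \
    --policy-document file://iam-policy-document.json \
    --access actions=cloudtrail:StopLogging,cloudtrail:DeleteTrail \
    --policy-type IDENTITY_POLICY \
    --output json | jq -r '.result')

# Stop the script with a non-zero exit code if the policy grants a sensitive action
if [ "$result" != "PASS" ]; then
    echo "Custom policy check returned: $result"
    exit 1
fi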

Example 3: Integrate custom policy checks into the developer workflow

Building on the previous two examples, in this example, you will automate the analysis of the IAM policies defined in an AWS CloudFormation template. Figure 1 shows the workflow that will be used. The workflow will initiate each time a pull request is created against the main branch of an AWS CodeCommit repository called my-iam-policy (the commit stage in Figure 1). The first check uses the CheckNoNewAccess API to determine if the updated policy is more permissive than a reference IAM policy. The second check uses the CheckAccessNotGranted API to automatically check for critical permissions within the policy (the validation stage in Figure 1). In both cases, if the updated policy is more permissive, or contains critical permissions, a comment with the results of the validation is posted to the pull request. This information can then be used to decide whether the pull request is merged into the main branch for deployment (the deploy stage is shown in Figure 1).

Figure 1: Diagram of the pipeline that will check policies

Step 1: Deploy the infrastructure and set up the pipeline

  1. Use the following command to clone the Cloud Development Kit (CDK) project associated with this blog post and change into its directory.
    git clone https://github.com/aws-samples/access-analyzer-automated-policy-analysis-blog.git
    cd ./access-analyzer-automated-policy-analysis-blog

  2. Create a virtual Python environment to contain the project dependencies by using the following command.
    python3 -m venv .venv

  3. Activate the virtual environment with the following command.
    source .venv/bin/activate

  4. Install the project requirements by using the following command.
    pip install -r requirements.txt

  5. Use the following command to update the CDK CLI to the latest major version.
    npm install -g aws-cdk@2 --force

  6. Before you can deploy the CDK project, use the following command to bootstrap your AWS environment. Bootstrapping is the process of creating resources needed for deploying CDK projects. These resources include an Amazon Simple Storage Service (Amazon S3) bucket for storing files and IAM roles that grant permissions needed to perform deployments.
    cdk bootstrap

  7. Finally, use the following command to deploy the pipeline infrastructure.
    cdk deploy --require-approval never

    The deployment will take a few minutes to complete. Feel free to grab a coffee and check back shortly.

    When the deployment completes, there will be two stack outputs listed: one with a name that contains CodeCommitRepo and another with a name that contains ConfigBucket. Make a note of the values of these outputs, because you will need them later.

    The deployed pipeline is displayed in the AWS CodePipeline console and should look similar to the pipeline shown in Figure 2.

    Figure 2: AWS CodePipeline and CodeBuild Management Console view

    In addition to initiating when a pull request is created, the newly deployed pipeline can also be initiated when changes to the main branch of the AWS CodeCommit repository are detected. The pipeline has three stages, CheckoutSources, IAMPolicyAnalysis, and deploy. The CheckoutSource stage checks out the contents of the my-iam-policy repository when the pipeline is triggered due to a change in the main branch.

    The IAMPolicyAnalysis stage, which runs after the CheckoutSource stage or when a pull request has been created against the main branch, has two actions. The first action, Check no new access, verifies that changes to the IAM policies in the CloudFormation template do not grant more access than a pre-defined reference policy. The second action, Check access not granted, verifies that those same updates do not grant access to API actions that are deemed sensitive or critical. Finally, the Deploy stage will deploy the resources defined in the CloudFormation template, if the actions in the IAMPolicyAnalysis stage are successful.

    To analyze the IAM policies, the Check no new access and Check access not granted actions depend on a reference policy and a predefined list of API actions, respectively.

  8. Use the following command to create the reference policy.
    cd ../ 
    cat << EOF > cnna-reference-policy.json
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "*",
                "Resource": "*"
            },
            {
                "Effect": "Deny",
                "Action": "iam:PassRole",
                "Resource": "arn:aws:iam::*:role/my-sensitive-roles/*"
            }
        ]
    }	
    EOF

    This reference policy sets out the maximum permissions for policies that you plan to validate with custom policy checks. iam:PassRole is a permission that allows an IAM principal to pass an IAM role to an AWS service, like Amazon Elastic Compute Cloud (Amazon EC2) or AWS Lambda. The reference policy says that the only way that a policy is more permissive is if it allows iam:PassRole on this group of sensitive resources: arn:aws:iam::*:role/my-sensitive-roles/*.

    Why might a reference policy be useful? A reference policy helps ensure that a particular combination of actions, resources, and conditions is not allowed in your environment. Reference policies typically allow actions and resources in one statement, then deny the problematic permissions in a second statement. This means that a policy that is more permissive than the reference policy allows access to a permission that the reference policy has denied.

    In this example, a developer who is authorized to create IAM roles could, intentionally or unintentionally, create an IAM role for an AWS service (like Amazon EC2 or AWS Lambda) that has permission to pass a privileged role to another service or principal, leading to an escalation of privilege.

  9. Use the following command to create a list of sensitive actions. This list will be parsed during the build pipeline and passed to the CheckAccessNotGranted API. If the policy grants access to one or more of the sensitive actions in this list, a result of FAIL will be returned. To keep this example simple, add a single API action, as follows.
    cat << EOF > sensitive-actions.file
    dynamodb:DeleteTable
    EOF

  10. So that the CodeBuild projects can access the dependencies, use the following command to copy the cnna-reference-policy.json and sensitive-actions.file files to an S3 bucket. Refer to the stack outputs you noted earlier and replace <ConfigBucket> with the name of the S3 bucket created in your environment.
    aws s3 cp ./cnna-reference-policy.json s3://<ConfigBucket>/cnna-reference-policy.json
    aws s3 cp ./sensitive-actions.file s3://<ConfigBucket>/sensitive-actions.file

Step 2: Create a new CloudFormation template that defines an IAM policy

With the pipeline deployed, the next step is to clone the repository that was created and populate it with a CloudFormation template that defines an IAM policy.

  1. Install git-remote-codecommit by using the following command.
    pip install git-remote-codecommit

    For more information on installing and configuring git-remote-codecommit, see the AWS CodeCommit User Guide.

  2. With git-remote-codecommit installed, use the following command to clone the my-iam-policy repository from AWS CodeCommit.
    git clone codecommit://my-iam-policy && cd ./my-iam-policy

    If you’ve configured a named profile for use with the AWS CLI, use the following command, replacing <profile> with the name of your named profile.

    git clone codecommit://<profile>@my-iam-policy && cd ./my-iam-policy

  3. Use the following command to create the CloudFormation template in the local clone of the repository.
    cat << EOF > ec2-instance-role.yaml
    ---
    AWSTemplateFormatVersion: 2010-09-09
    Description: CloudFormation Template to deploy base resources for access_analyzer_blog
    Resources:
      EC2Role:
        Type: AWS::IAM::Role
        Properties:
          AssumeRolePolicyDocument:
            Version: 2012-10-17
            Statement:
            - Effect: Allow
              Principal:
                Service: ec2.amazonaws.com
              Action: sts:AssumeRole
          Path: /
          Policies:
          - PolicyName: my-application-permissions
            PolicyDocument:
              Version: 2012-10-17
              Statement:
              - Effect: Allow
                Action:
                  - 'ec2:RunInstances'
                  - 'lambda:CreateFunction'
                  - 'lambda:InvokeFunction'
                  - 'dynamodb:Scan'
                  - 'dynamodb:Query'
                  - 'dynamodb:UpdateItem'
                  - 'dynamodb:GetItem'
                Resource: '*'
              - Effect: Allow
                Action:
                  - iam:PassRole 
                Resource: "arn:aws:iam::*:role/my-custom-role"
            
      EC2InstanceProfile:
        Type: AWS::IAM::InstanceProfile
        Properties:
          Path: /
          Roles:
            - !Ref EC2Role
    EOF

The actions in the IAMPolicyValidation stage are run by a CodeBuild project. CodeBuild environments run arbitrary commands that are passed to the project using a buildspec file. Each project has already been configured to use an inline buildspec file.

You can inspect the buildspec file for each project by opening the project’s Build details page as shown in Figure 3.

Figure 3: AWS CodeBuild console and build details

Step 3: Run analysis on the IAM policy

The next step involves checking in the first version of the CloudFormation template to the repository and checking two things. First, that the policy does not grant more access than the reference policy. Second, that the policy does not contain any of the sensitive actions defined in the sensitive-actions.file.

  1. To begin tracking the CloudFormation template created earlier, use the following command.
    git add ec2-instance-role.yaml 

  2. Commit the changes you have made to the repository.
    git commit -m 'committing a new CFN template with IAM policy'

  3. Finally, push these changes to the remote repository.
    git push

  4. Pushing these changes will initiate the pipeline. After a few minutes the pipeline should complete successfully. To view the status of the pipeline, do the following:
    1. Navigate to https://<region>.console.aws.amazon.com/codesuite/codepipeline/pipelines (replacing <region> with your AWS Region).
    2. Choose the pipeline called accessanalyzer-pipeline.
    3. Scroll down to the IAMPolicyValidation stage of the pipeline.
    4. For both the check no new access and check access not granted actions, choose View Logs to inspect the log output.
  5. If you inspect the build logs for both the check no new access and check access not granted actions within the pipeline, you should see that there were no blocking or non-blocking findings, similar to what is shown in Figure 4. This indicates that the policy was validated successfully. In other words, the policy was not more permissive than the reference policy, and it did not include any of the critical permissions.
    Figure 4: CodeBuild log entry confirming that the IAM policy was successfully validated

Step 4: Create a pull request to merge a new update to the CloudFormation template

In this step, you will make a change to the IAM policy in the CloudFormation template. The change deliberately makes the policy grant more access than the reference policy. The change also includes a critical permission.

  1. Use the following command to create a new branch called add-new-permissions in the local clone of the repository.
    git checkout -b add-new-permissions

  2. Next, edit the IAM policy in ec2-instance-role.yaml to include an additional API action, dynamodb:Delete*, and update the resource property of the inline policy to use an IAM role in the /my-sensitive-roles/ path. You can copy the following example if you're unsure how to do this.
    ---
    AWSTemplateFormatVersion: 2010-09-09
    Description: CloudFormation Template to deploy base resources for access_analyzer_blog
    Resources:
      EC2Role:
        Type: AWS::IAM::Role
        Properties:
          AssumeRolePolicyDocument:
            Version: 2012-10-17
            Statement:
            - Effect: Allow
              Principal:
                Service: ec2.amazonaws.com
              Action: sts:AssumeRole
          Path: /
          Policies:
          - PolicyName: my-application-permissions
            PolicyDocument:
              Version: 2012-10-17
              Statement:
              - Effect: Allow
                Action:
                  - 'ec2:RunInstances'
                  - 'lambda:CreateFunction'
                  - 'lambda:InvokeFunction'
                  - 'dynamodb:Scan'
                  - 'dynamodb:Query'
                  - 'dynamodb:UpdateItem'
                  - 'dynamodb:GetItem'
                  - 'dynamodb:Delete*'
                Resource: '*'
              - Effect: Allow
                Action:
                  - iam:PassRole 
                Resource: "arn:aws:iam::*:role/my-sensitive-roles/my-custom-admin-role"
            
      EC2InstanceProfile:
        Type: AWS::IAM::InstanceProfile
        Properties:
          Path: /
          Roles:
            - !Ref EC2Role

  3. Commit the policy change and push the updated policy document to the repo by using the following commands.
    git add ec2-instance-role.yaml 
    git commit -m "adding new permission and allowing my ec2 instance to assume a pass sensitive IAM role"

  4. The add-new-permissions branch is currently a local branch. Use the following command to push the branch to the remote repository. This action will not initiate the pipeline, because the pipeline only runs when changes are made to the repository’s main branch.
    git push -u origin add-new-permissions

  5. With the new branch and changes pushed to the repository, follow these steps to create a pull request:
    1. Navigate to https://console.aws.amazon.com/codesuite/codecommit/repositories (don't forget to switch to the correct Region).
    2. Choose the repository called my-iam-policy.
    3. Choose the branch add-new-permissions from the drop-down list at the top of the repository screen.
      Figure 5: my-iam-policy repository with new branch available

    4. Choose Create pull request.
    5. Enter a title and description for the pull request.
    6. (Optional) Scroll down to see the differences between the current version and new version of the CloudFormation template highlighted.
    7. Choose Create pull request.
  6. The creation of the pull request will initiate the pipeline to fetch the CloudFormation template from the repository and run the check no new access and check access not granted analysis actions.
  7. After a few minutes, choose the Activity tab for the pull request. You should see a comment from the pipeline that contains the results of the failed validation.
    Figure 6: Results from the failed validation posted as a comment to the pull request

Why did the validations fail?

The updated IAM role and inline policy failed validation for two reasons. First, the reference policy said that no one should have more permissions than the reference policy does. The reference policy in this example included a deny statement for the iam:PassRole permission with a resource of arn:aws:iam::*:role/my-sensitive-roles/*. The newly created inline policy included an allow statement for the iam:PassRole permission with a resource of arn:aws:iam::*:role/my-sensitive-roles/my-custom-admin-role. In other words, the new policy had more permissions than the reference policy.

Second, the list of critical permissions included the dynamodb:DeleteTable permission. The inline policy included a statement that would allow the EC2 instance to perform the dynamodb:DeleteTable action.

Cleanup

Use the following command to delete the infrastructure that was provisioned as part of the examples in this blog post.

cdk destroy 

Conclusion

In this post, I introduced you to two new IAM Access Analyzer APIs: CheckNoNewAccess and CheckAccessNotGranted. The main example in the post demonstrated one way in which you can use these APIs to automate security testing throughout the development lifecycle. The example did this by integrating both APIs into the developer workflow and validating the developer-authored IAM policy when the developer created a pull request to merge changes into the repository’s main branch. The automation helped the developer to get feedback about the problems with the IAM policy quickly, allowing the developer to take action in a timely way. This is often referred to as shifting security left — identifying misconfigurations early and automatically supporting an iterative, fail-fast model of continuous development and testing. Ultimately, this enables teams to make security an inherent part of a system’s design and architecture and can speed up product development workflow.

You can find the full sample code used in this blog post on GitHub.

To learn more about IAM Access Analyzer and the new custom policy checks feature, see the IAM Access Analyzer documentation.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Mitch Beaumont

Mitch is a Principal Solutions Architect for AWS, based in Sydney, Australia. Mitch works with some of Australia’s largest financial services customers, helping them to continually raise the security bar for the products and features that they build and ship. Outside of work, Mitch enjoys spending time with his family, photography, and surfing.

Author

Matt Luttrell

Matt is a Principal Solutions Architect on the AWS Identity Solutions team. When he’s not spending time chasing his kids around, he enjoys skiing, cycling, and the occasional video game.

How to use the PassRole permission with IAM roles

Post Syndicated from Liam Wadman original https://aws.amazon.com/blogs/security/how-to-use-the-passrole-permission-with-iam-roles/

iam:PassRole is an AWS Identity and Access Management (IAM) permission that allows an IAM principal to delegate or pass permissions to an AWS service by configuring a resource such as an Amazon Elastic Compute Cloud (Amazon EC2) instance or AWS Lambda function with an IAM role. The service then uses that role to interact with other AWS resources in your accounts. Typically, workloads, applications, or services run with different permissions than the developer who creates them, and iam:PassRole is the mechanism in AWS to specify which IAM roles can be passed to AWS services, and by whom.

In this blog post, we’ll dive deep into iam:PassRole, explain how it works and what’s required to use it, and cover some best practices for how to use it effectively.

A typical example of using iam:PassRole is a developer passing a role’s Amazon Resource Name (ARN) as a parameter in the Lambda CreateFunction API call. After the developer makes the call, the service verifies whether the developer is authorized to do so, as seen in Figure 1.

Figure 1: Developer passing a role to a Lambda function during creation

The following command shows the parameters the developer needs to pass during the CreateFunction API call. Notice that the role ARN is a parameter, but there is no passrole parameter.

aws lambda create-function \
    --function-name my-function \
    --runtime nodejs14.x \
    --zip-file fileb://my-function.zip \
    --handler my-function.handler \
    --role arn:aws:iam::123456789012:role/service-role/MyTestFunction-role-tges6bf4

The API call will create the Lambda function only if the developer has the iam:PassRole permission as well as the CreateFunction API permissions. If the developer is lacking either of these, the request will be denied.

Now that the permissions have been checked and the Function resource has been created, the Lambda service principal will assume the role you passed whenever your function is invoked and use the role to make requests to other AWS services in your account.

Understanding IAM PassRole

When we say that iam:PassRole is a permission, we mean specifically that it is not an API call; it is an IAM action that can be specified within an IAM policy. The iam:PassRole permission is checked whenever a resource is created with an IAM service role or is updated with a new IAM service role.

Here is an example IAM policy that allows a principal to pass a role named lambda_role.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": [
        "arn:aws:iam::111122223333:role/lambda_role"
      ]
    }
  ]
}

The roles that can be passed are specified in the Resource element of the IAM policy. It is possible to list multiple IAM roles, and it is possible to use a wildcard (*) to match roles that begin with the pattern you specify. Use a wildcard only as the last characters when you're matching a role pattern, to help prevent over-entitlement.

Note: We recommend that you avoid using resource ”*” with the iam:PassRole action in most cases, because this could grant someone the permission to pass any role, opening the possibility of unintended privilege escalation.

The iam:PassRole action can only grant permissions when used in an identity-based policy attached to an IAM role or user, and it is governed by all relevant AWS policy types, such as service control policies (SCPs) and VPC endpoint policies.

When a principal attempts to pass a role to an AWS service, there are three prerequisites that must be met to allow the service to use that role:

  1. The principal that attempts to pass the role must have the iam:PassRole permission in an identity-based policy with the role desired to be passed in the Resource field, all IAM conditions met, and no implicit or explicit denies in other policies such as SCPs, VPC endpoint policies, session policies, or permissions boundaries.
  2. The role that is being passed is configured via the trust policy to trust the service principal of the service you’re trying to pass it to. For example, the role that you pass to Amazon EC2 has to trust the Amazon EC2 service principal, ec2.amazonaws.com.

    To learn more about role trust policies, see this blog post. In certain scenarios, the resource may end up being created or modified even if a passed IAM role doesn't trust the required service principal, but the AWS service won't be able to use the role to perform actions. A minimal example trust policy is shown after this list.

  3. The role being passed and the principal passing the role must both be in the same AWS account.
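.
As a reference for the second prerequisite, the following is a minimal example of a trust policy for a role that will be passed to Lambda. For a role passed to Amazon EC2, the service principal would be ec2.amazonaws.com instead.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}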

Best practices for using iam:PassRole

In this section, you will learn strategies to use when working with iam:PassRole within your AWS account.

Place iam:PassRole in its own policy statements

As we demonstrated earlier, the iam:PassRole policy action takes an IAM role as its resource. If you specify a wildcard as a resource in a policy granting iam:PassRole permission, it means that the principals to whom this policy applies will be able to pass any role in that account, allowing them to potentially escalate their privilege beyond what you intended.

To be able to specify the Resource value and be more granular in comparison to other permissions you might be granting in the same policy, we recommend that you keep the iam:PassRole action in its own policy statement, as indicated by the following example.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": [
        "arn:aws:iam::111122223333:role/lambda_role"
      ]
    },
    {
      "Effect": "Allow",
      "Action": "cloudwatch:GetMetricData",
      "Resource": [
        "*"
      ]
    }
  ]
}

Use IAM paths or naming conventions to organize IAM roles within your AWS accounts

You can use IAM paths or a naming convention to grant a principal access to pass IAM roles using wildcards (*) in a portion of the role ARN. This reduces the need to update IAM policies whenever new roles are created.

In your AWS account, you might have IAM roles that are used for different reasons, for example roles that are used for your applications, and roles that are used by your security team. In most circumstances, you would not want your developers to associate a security team’s role to the resources they are creating, but you still want to allow them to create and pass business application roles.

You may want to give developers the ability to create roles for their applications, as long as they are safely governed. You can do this by verifying that those roles have permissions boundaries attached to them, and that they are created in a specific IAM role path. You can then allow developers to pass only the roles in that path. To learn more about using permissions boundaries, see our Example Permissions Boundaries GitHub repo.

In the following example policy, access is granted to pass only the roles that are in the /application_role/ path.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": [
        "arn:aws:iam::111122223333:role/application_role/*"
      ]
    }
  ]
}

Protect specific IAM paths with an SCP

You can also protect specific IAM paths by using an SCP.

In the following example, the SCP prevents your principals from passing a role unless they have a tag of “team” with a value of “security” when the role they are trying to pass is in the IAM path /security_app_roles/.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::*:role/security_app_roles/*",
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalTag/team": "security"
        }
      }
    }
  ]
}

Similarly, you can craft a policy to only allow a specific naming convention or IAM path to pass a role in a specific path. For example, the following SCP shows how to prevent a role outside of the IAM path security_response_team from passing a role in the IAM path security_app_roles.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::*:role/security_app_roles/*",
      "Condition": {
        "ArnNotLike": {
          "aws:PrincipalARN": "arn:aws:iam::*:role/security_response_team/*"
        }
      }
    }
  ]
}

Using variables and tags with iam:PassRole

iam:PassRole does not support using the iam:ResourceTag or aws:ResourceTag condition keys to specify which roles can be passed. However, the IAM policy language supports using variables as part of the Resource element in an IAM policy.

The following IAM policy example uses the aws:PrincipalTag condition key as a variable in the Resource element. That allows this policy to construct the IAM path based on the values of the caller’s IAM tags or Session tags.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": [
"arn:aws:iam::111122223333:role/${aws:PrincipalTag/AllowedRolePath}/*"
      ]
    }
  ]
}

If there was no value set for the AllowedRolePath tag, the resource would not match any role ARN, and no iam:PassRole permissions would be granted.
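.
For example, if whoever grants access sets an AllowedRolePath session tag when the principal assumes its role, the policy resolves the path at evaluation time. The following sketch shows one way to set such a session tag; the role ARN, session name, and tag value are illustrative.

aws sts assume-role \
    --role-arn arn:aws:iam::111122223333:role/developer-role \
    --role-session-name dev-session \
    --tags Key=AllowedRolePath,Value=application_role

With that tag in place, the Resource element in the preceding policy resolves to arn:aws:iam::111122223333:role/application_role/*, the same path used in the earlier path-based example.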

Pass different IAM roles for different use cases, and for each AWS service

As a best practice, use a single IAM role for each use case, and avoid situations where the same role is used by multiple AWS services.

We recommend that you also use different IAM roles for different workloads in your AWS accounts, even if those workloads are built on the same AWS service. This will allow you to grant only the permissions necessary to your workloads and make it possible to adhere to the principle of least privilege.

Using iam:PassRole condition keys

The iam:PassRole action has two available condition keys, iam:PassedToService and iam:AssociatedResourceArn.

iam:PassedToService allows you to specify what service a role may be passed to. iam:AssociatedResourceArn allows you to specify what resource ARNs a role may be associated with.

As mentioned previously, we typically recommend that customers use an IAM role with only one AWS service wherever possible. This is best accomplished by listing a single AWS service in a role’s trust policy, reducing the need to use the iam:PassedToService condition key in the calling principal’s identity-based policy. In circumstances where you have an IAM role that can be assumed by more than one AWS service, you can use iam:PassedToService to specify which service the role can be passed to. For example, the following policy allows ExampleRole to be passed only to the Amazon EC2 service.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::*:role/ExampleRole",
      "Condition": {
        "StringEquals": {
          "iam:PassedToService": "ec2.amazonaws.com"
        }
      }
    }
  ]
}

When you use iam:AssociatedResourceArn, it’s important to understand that ARN formats typically do not change, but each AWS resource will have a unique ARN. Some AWS resources have non-predictable components, such as EC2 instance IDs in their ARN. This means that when you’re using iam:AssociatedResourceArn, if an AWS resource is ever deleted and a new resource created, you might need to modify the IAM policy with a new resource ARN to allow a role to be associated with it.

Most organizations prefer to limit who can delete and modify resources in their AWS accounts, rather than limit what resource a role can be associated with. An example of this would be limiting which principals can modify a Lambda function, rather than limiting which function a role can be associated with, because in order to pass a role to Lambda, the principals would need permissions to update the function itself.
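.
If you do decide to constrain which resource a role can be associated with, an identity-based policy might look like the following sketch; the role and Lambda function ARNs are illustrative.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::111122223333:role/lambda_role",
      "Condition": {
        "ArnEquals": {
          "iam:AssociatedResourceArn": "arn:aws:lambda:us-east-1:111122223333:function:my-function"
        }
      }
    }
  ]
}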

Using iam:PassRole with service-linked roles

If you’re dealing with a service that uses service-linked roles (SLRs), most of the time you don’t need the iam:PassRole permission. This is because in most cases such services will create and manage the SLR on your behalf, so that you don’t pass a role as part of a service configuration, and therefore, the iam:PassRole permission check is not performed.

Some AWS services allow you to create multiple SLRs and pass them when you create or modify resources by using those services. In this case, you need the iam:PassRole permission on service-linked roles, just the same as you do with a service role.

For example, Amazon EC2 Auto Scaling allows you to create multiple SLRs with specific suffixes and then pass a role ARN in the request as part of the autoscaling:CreateAutoScalingGroup API action. For the Auto Scaling group to be successfully created, you need permissions to perform both the autoscaling:CreateAutoScalingGroup and iam:PassRole actions.
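.
For reference, creating an Amazon EC2 Auto Scaling SLR with a custom suffix looks like the following; the suffix value is an example. The resulting role name ends with _BlueTeamSuffix, which the example policy later in this section matches on.

aws iam create-service-linked-role \
    --aws-service-name autoscaling.amazonaws.com \
    --custom-suffix BlueTeamSuffix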

SLRs are created in the /aws-service-role/ path. To help confirm that principals in your AWS account are only passing service-linked roles that they are allowed to pass, we recommend using suffixes and IAM policies to separate SLRs owned by different teams.

For example, the following policy allows only SLRs with the _BlueTeamSuffix to be passed.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": [
        "arn:aws:iam::*:role/aws-service-role/*_BlueTeamSuffix"
      ]
    }
  ]
}

You could attach this policy to the role used by the blue team to allow them to pass SLRs they’ve created for their use case and that have their specific suffix.

AWS CloudTrail logging

Because iam:PassRole is not an API call, there is no entry in AWS CloudTrail for it. To identify what role was passed to an AWS service, you must check the CloudTrail trail for events that created or modified the relevant AWS service’s resource.
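.
For example, to find recent Lambda CreateFunction events and then inspect the role field in each event's request parameters, you could start with a lookup like the following.

aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=EventName,AttributeValue=CreateFunction \
    --max-results 10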

In Figure 2, you can see the CloudTrail log created after a developer used the Lambda CreateFunction API call with the role ARN noted in the role field.

Figure 2: CloudTrail log of a CreateFunction API call

PassRole and VPC endpoints

Earlier, we mentioned that iam:PassRole is subject to VPC endpoint policies. If a request that requires the iam:PassRole permission is made over a VPC endpoint with a custom VPC endpoint policy configured, iam:PassRole should be allowed through the Action element of that VPC endpoint policy, or the request will be denied.
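.
As an illustration, a custom policy on a Lambda interface VPC endpoint that permits creating functions would need to allow iam:PassRole alongside the Lambda action. The following sketch is intentionally broad; scope the principals, actions, and resources to your own requirements before using anything like it.

{
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": [
        "lambda:CreateFunction",
        "iam:PassRole"
      ],
      "Resource": "*"
    }
  ]
}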

Conclusion

In this post, you learned about iam:PassRole, how you use it to interact with AWS services and resources, and the three prerequisites to successfully pass a role to a service. You now also know best practices for using iam:PassRole in your AWS accounts. To learn more, see the documentation on granting a user permissions to pass a role to an AWS service.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Security, Identity, & Compliance re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Author

Laura Reith

Laura is an Identity Solutions Architect at AWS, where she thrives on helping customers overcome security and identity challenges.

Liam Wadman

Liam is a Solutions Architect with the Identity Solutions team. When he’s not building exciting solutions on AWS or helping customers, he’s often found in the hills of British Columbia on his mountain bike. Liam points out that you cannot spell LIAM without IAM.

Establishing a data perimeter on AWS: Require services to be created only within expected networks

Post Syndicated from Harsha Sharma original https://aws.amazon.com/blogs/security/establishing-a-data-perimeter-on-aws-require-services-to-be-created-only-within-expected-networks/

Welcome to the fifth post in the Establishing a data perimeter on AWS series. Throughout this series, we’ve discussed how a set of preventative guardrails can create an always-on boundary to help ensure that your trusted identities are accessing your trusted resources over expected networks. In a previous post, we emphasized the importance of preventing access from unexpected locations, even for authorized users. For example, you wouldn’t expect non-public corporate data to be accessed from outside the corporate network. In this post, we demonstrate how to use preventative controls to help ensure that your resources are deployed within your Amazon Virtual Private Cloud (Amazon VPC), so that you can effectively enforce the network perimeter controls. We also explore detective controls you can use to detect the lack of adherence to this requirement.

Let’s begin with a quick refresher on the fundamental concept of data perimeters using Figure 1 as a reference. Customers generally prefer establishing a high-level perimeter to help prevent untrusted entities from coming in and data from going out. The perimeter defines what access customers expect within their AWS environment. It refers to the access patterns among your identities, resources, and networks that should always be blocked. Using those three elements, an assertion can be made to define your perimeter’s goal: access can only be allowed if the identity is trusted, the resource is trusted, and the network is expected. If any of these conditions are false, then the access inside the perimeter is unintended and should be denied. The perimeter is composed of controls implemented on your identities, resources, and networks to maintain that the necessary conditions are true.

Figure 1: A high-level depiction of defining a perimeter around your AWS resources to prevent interaction with unintended IAM principals, unintended resources, and unexpected networks

Now, let’s consider a scenario to understand the problem statement this post is trying to solve. Assume a setup like the one in Figure 2, where an application needs to access an Amazon Simple Storage Service (Amazon S3) bucket using its temporary AWS Identity and Access Management (IAM) credentials over an Amazon S3 VPC endpoint.

Figure 2: Scenario of a simple app using its temporary credential to access an S3 bucket

From our previous posts in this series, we’ve learned that we can use the following set of capabilities to build a network perimeter to achieve our control objectives for this sample scenario.

Control objective: My identities can access resources only from expected networks. For example, in Figure 2, my application's temporary credential can only access my S3 bucket when my application is within my expected network space.
  • Implemented using: Service control policies (SCP)
  • Applicable IAM capabilities: aws:SourceIp, aws:SourceVpc, aws:SourceVpce

Control objective: My resources can only be accessed from expected networks. For example, in Figure 2, my S3 bucket can only be accessed from my expected network space.
  • Implemented using: Resource-based policies
  • Applicable IAM capabilities: aws:SourceIp, aws:SourceVpc, aws:SourceVpce

But there are certain AWS services that allow for different network deployment models, such as providing the choice of associating the service resources with either an AWS managed VPC or a customer managed VPC. For example, an AWS Lambda function always runs inside a VPC owned by the Lambda service (AWS managed VPC) and by default isn’t connected to VPCs in your account (customer managed VPC). For more information, see Connecting Lambda functions to your VPC.

This means that if your application code was deployed as a Lambda function that isn’t connected to your VPC, then the function cannot access your resources with standard network perimeter controls enforced. Let’s understand this situation better using Figure 3, where a Lambda function isn’t configured to connect to the customer VPC. This function cannot access your S3 bucket over the internet because of how the recommended data perimeter in the preceding table has been defined, that is, to only allow your bucket to be accessible from a known network segment (the customer VPC and IP CIDR range) and only allow the IAM role associated with the Lambda function to allow accessing the bucket from known networks. The function also cannot access your S3 bucket through your S3 VPC endpoint because the function isn’t associated with the customer VPC. Lastly, unless other compensating controls are in place, this function might be able to access untrusted resources as your standard data perimeter controls enforced with the VPC endpoint policies won’t be in effect, which might not meet your company’s security requirements.

Figure 3: Lambda function configured to be associated with AWS managed VPC

This means that for the Lambda function to conform to your data perimeter, it must be associated with your network segment (customer VPC) as shown in Figure 4.

Figure 4: Lambda function configured to be associated with the customer managed VPC

To make sure that your Lambda functions are deployed into your networks so that they can access your resources under the purview of data perimeter controls, it’s preferable to have a way to automatically prevent deployment or configuration errors. Additionally, if you have a large deployment of Lambda functions across hundreds or even thousands of accounts, you want an efficient way to enforce conformance of these functions to your data perimeter.

To solve for this problem and make sure that an application team or a developer cannot create a function that’s not associated with your VPC, you can use the lambda:VpcIds or lambda:SubnetIds IAM condition keys (for more information, see Using IAM condition keys for VPC settings). These keys allow you to create and update functions only when VPC settings are satisfied.

In the following SCP example, an IAM principal that is subject to the following SCP policy will only be able to create or update a Lambda function if the function is associated with a VPC (customer VPC). When the customer VPC isn’t specified, the lambda:VpcIds condition key has no value—it is null—and thus this policy will deny creating or updating the function. For more information about how the Null condition operator functions, see Condition operator to check existence of condition keys.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EnforceVPCFunction",
      "Action": [
          "lambda:CreateFunction",
          "lambda:UpdateFunctionConfiguration"
       ],
      "Effect": "Deny",
      "Resource": "*",
      "Condition": {
        "Null": {
           "lambda:VpcIds": "true"
        }
      }
    }
  ]
}

Additionally, you can use variations of the preceding example and create more fine-grained controls using these condition keys. For more such examples, see Example policies with condition keys for VPC settings.
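.
For instance, a variation of the preceding SCP might require not just any VPC but a specific VPC that you own; the VPC ID below is a placeholder. This is a sketch based on the condition keys described above, so validate it against the linked example policies before relying on it. Because StringNotEquals also evaluates to true when lambda:VpcIds is missing from the request, this statement denies functions that aren't associated with any VPC as well as functions associated with the wrong one.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EnforceSpecificVPCFunction",
      "Action": [
          "lambda:CreateFunction",
          "lambda:UpdateFunctionConfiguration"
       ],
      "Effect": "Deny",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
           "lambda:VpcIds": "vpc-0a1b2c3d4e5f67890"
        }
      }
    }
  ]
}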

AWS services such as AWS Glue and Amazon SageMaker have similar feature behavior and provide similar condition keys. For example, the glue:VpcIds condition key allows you to govern the creation of AWS Glue jobs only in your VPC. For further details and an example policy, see Control policies that control settings using condition keys.

Similarly, Amazon SageMaker Studio, SageMaker notebook instances, SageMaker training, and deployed inference containers are internet accessible or enabled by default. The sagemaker:VpcSubnets condition key can be used to restrict launching these resources in a VPC. For more information, see Condition keys for Amazon SageMaker, Connect to Resources From Within a VPC, and Run Training and Inference Containers in Internet-Free Mode.

Detective controls

The AWS Well-Architected Framework recommends applying a defense in-depth approach with multiple security controls (see Security Pillar). This is why in addition to the preventative controls discussed in the form of condition keys in this post, you should also consider using AWS native fully managed governance tools to help you manage your environment’s deployed resources and their conformance to your data perimeter (see Management and Governance on AWS).

For example, AWS Config provides managed rules to check for Lambda functions inside a VPC and SageMaker notebooks inside a VPC. You can also use the built-in checks of AWS Security Hub to detect and consolidate findings, such as [Lambda.3] Lambda functions should be in a VPC and [SageMaker.2] SageMaker notebook instances should be launched in a custom VPC.

You can also use similar detective controls for AWS services that don't currently offer built-in preventative controls. For example, OpenSearch Service has an AWS Config managed rule for OpenSearch in VPC only and a Security Hub check for [Opensearch.2] OpenSearch domains should be in a VPC.
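.
As one example of turning on such a detective control, the following sketch deploys the managed rule that checks for Lambda functions inside a VPC by using the AWS CLI, assuming AWS Config is already recording in the account; you could equally enable it through the console or a conformance pack.

aws configservice put-config-rule --config-rule '{
    "ConfigRuleName": "lambda-inside-vpc",
    "Source": {
        "Owner": "AWS",
        "SourceIdentifier": "LAMBDA_INSIDE_VPC"
    }
}'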

Conclusion

In this post, we discussed how you can enforce that specific AWS services resources can only be created such that they adhere to your data perimeter. We used a sample scenario to dive into AWS Lambda and its network deployment options. We then used IAM condition keys as preventative controls to enforce predictable creation of Lambda functions conforming with our security standard. We also discussed additional AWS services that have similar behavior when the same concepts apply. Finally, we briefly discussed some AWS provided managed rules and security checks that you can use as supplementary detective controls to ensure that your preventative controls are in effect as expected.

Additional resources

The following are some additional resources that you can use to further explore data perimeters.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Harsha Sharma

Harsha is a Principal Solutions Architect with AWS in New York. He joined AWS in 2016 and works with Global Financial Services customers to design and develop architectures on AWS, supporting their journey on the AWS Cloud.

Visualize Amazon DynamoDB insights in Amazon QuickSight using the Amazon Athena DynamoDB connector and AWS Glue

Post Syndicated from Antonio Samaniego Jurado original https://aws.amazon.com/blogs/big-data/visualize-amazon-dynamodb-insights-in-amazon-quicksight-using-the-amazon-athena-dynamodb-connector-and-aws-glue/

Amazon DynamoDB is a fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale. DynamoDB offers built-in security, continuous backups, automated multi-Region replication, in-memory caching, and data import and export tools. The scalability and flexible data schema of DynamoDB make it well-suited for a variety of use cases. These include internet-scale web and mobile applications, low-latency metadata stores, high-traffic retail websites, Internet of Things (IoT) and time series data, online gaming, and more.

Data stored in DynamoDB is the basis for valuable business intelligence (BI) insights. To make this data accessible to data analysts and other consumers, you can use Amazon Athena. Athena is a serverless, interactive service that allows you to query data from a variety of sources in heterogeneous formats, with no provisioning effort. Athena accesses data stored in DynamoDB via the open source Amazon Athena DynamoDB connector. Table metadata, such as column names and data types, is stored using the AWS Glue Data Catalog.

Finally, to visualize BI insights, you can use Amazon QuickSight, a cloud-powered business analytics service. QuickSight makes it straightforward for organizations to build visualizations, perform ad hoc analysis, and quickly get business insights from their data, anytime, on any device. Its generative BI capabilities enable you to ask questions about your data using natural language, without having to write SQL queries or learn a BI tool.

This post shows how you can use the Athena DynamoDB connector to easily query data in DynamoDB with SQL and visualize insights in QuickSight.

Solution overview

The following diagram illustrates the solution architecture.

Architecture Diagram

  1. The Athena DynamoDB connector runs in a pre-built, serverless AWS Lambda function. You don’t need to write any code.
  2. AWS Glue provides supplemental metadata from the DynamoDB table. In particular, an AWS Glue crawler is run to infer and store the DynamoDB table format, schema, and associated properties in the Glue Data Catalog.
  3. The Athena editor is used to test the connector and perform analysis via SQL queries.
  4. QuickSight uses the Athena connector to visualize BI insights from DynamoDB.

This walkthrough uses data from the ProductCatalog table, part of the DynamoDB developer guide sample data files.

Prerequisites

Before you get started, you should meet the following prerequisites:

Set up the Athena DynamoDB connector

The Athena DynamoDB connector is a pre-built, serverless Lambda function provided by AWS that communicates with DynamoDB so you can query your tables with SQL using Athena. The connector is available in the AWS Serverless Application Repository and is used to create the Athena data source for later use in data analysis and visualization. To set up the connector, complete the following steps:

  1. On the Athena console, choose Data sources in the navigation pane.
  2. Choose Create data source.
  3. In the search bar, search for and choose Amazon DynamoDB.
  4. Choose Next.
  5. Under Data source details, enter a name. Note that this name should be unique and will be referenced in your SQL statements when you query your Athena data source.
  6. Under Connection details, choose Create Lambda function.

This will take you to the Lambda applications page on the Lambda console. Do not close the Athena data source creation tab; you will return to it in a later step.

  1. Scroll down to Application settings and enter a value for the following parameters (leave the other parameters as default):
    • SpillBucket – Specifies the Amazon Simple Storage Service (Amazon S3) bucket name for storing data that exceeds Lambda function response size limits. To create an S3 bucket, refer to Creating a bucket.
    • AthenaCatalogName – A lowercase name for the Lambda function to be created.
Lambda Application Settings
  2. Select the acknowledgement check box and choose Deploy.

Wait for deployment to complete before moving to the next step.

  1. Return to the Athena data source creation tab.
  2. Under Connection details, choose the refresh icon and choose the Lambda function you created.
Lambda Connection Details
  3. Choose Next.
  4. Review and choose Create data source.

Provide supplemental metadata via AWS Glue

The Athena connector already comes with a built-in inference capability to discover the schema and table properties of your data source. However, this capability is limited. To accurately discover the metadata of your DynamoDB table and centralize schema management as your data evolves over time, the connector integrates with AWS Glue.

To achieve this, an AWS Glue crawler is run to automatically determine the format, schema, and associated properties of the raw data stored in your DynamoDB table, writing the resulting metadata to a Glue database. Glue databases contain tables, which hold metadata from different data stores, independent from the actual location of the data. The Athena connector then references the Glue table and retrieves the corresponding DynamoDB metadata to enable queries.

Create the AWS Glue database

Complete the following steps to create the Glue database:

  1. On the AWS Glue console, under Data Catalog in the navigation pane, choose Databases.
  2. Choose Add database (you can also edit an existing database if you already have one).
  3. For Name, enter a database name.
  4. For Location, enter the string literal dynamo-db-flag. This keyword indicates that the database contains tables that the connector can use for supplemental metadata.
  5. Choose Create database.

Following security best practices, it is also recommended that you enable encryption at rest for your Data Catalog. For details, refer to Encrypting your Data Catalog.

Create the AWS Glue crawler

Complete the following steps to create and run the Glue crawler:

  1. On the AWS Glue console, under Data Catalog in the navigation pane, choose Crawlers.
  2. Choose Create crawler.
  3. Enter a crawler name and choose Next.
  4. For Data sources, choose Add a data source.
  5. On the Data source drop-down menu, choose DynamoDB. For Table name, enter the name of your DynamoDB table (string literal).
  6. Choose Add a DynamoDB data source.
  7. Choose Next.
  8. For IAM Role, choose Create new IAM role.
  9. Enter a role name and choose Create. This will automatically create an IAM role that trusts AWS Glue and has permissions to access the crawler targets.
  10. Choose Next.
  11. For Target database, choose the database previously created.
  12. Choose Next.
  13. Review and choose Create crawler.
  14. On the newly created crawler page, choose Run crawler.

Crawler runtimes depend on your DynamoDB table size and properties. You can find crawler run details under Crawler runs.

Validate the output metadata

When your crawler run status shows as Completed, follow these steps to validate the output metadata:

  1. On the AWS Glue console, choose Tables in the navigation pane. Here, you can confirm a new table has been added to the database as a result of the crawler run.
  2. Navigate to the newly created table and take a look at the Schema tab. This tab shows the column names, data types, and other parameters inferred from your DynamoDB table.
  3. If needed, edit the schema by choosing Edit schema.
Glue Table Details
  4. Choose Advanced properties.
  5. Under Table properties, verify the crawler automatically created and set the classification key to dynamodb. This indicates to the Athena connector that the table can be used for supplemental metadata.
  6. Optionally, add the following properties to correctly catalog and reference DynamoDB data in AWS Glue and Athena queries. This is necessary because AWS Glue table and column names don't permit capital letters, whereas DynamoDB table and attribute names do.
    1. If your DynamoDB table name contains any capital letters, choose Actions and Edit Table and add an extra table property as follows:
      • Key: sourceTable
      • Value: YourDynamoDBTableName
    2. If your DynamoDB table has attributes that contain any capital letters, add an extra table property as follows:
      • Key: columnMapping
      • Value: yourcolumn1=YourColumn1, yourcolumn2=YourColumn2, …

Test the connector with the Athena SQL editor

After the Athena DynamoDB connector is deployed and the AWS Glue table is populated with supplemental metadata, the DynamoDB table is ready for analysis. The example in this post uses the Athena editor to make SQL queries to the ProductCatalog table. For further options to interact with Athena, see Accessing Athena.

Complete the following steps to test the connector:

  1. Open the Athena query editor.
  2. If this is your first time visiting the Athena console in your current AWS Region, complete the following steps. This is a prerequisite before you can run Athena queries. See Getting Started for more details.
    1. Choose Query editor in the navigation pane to open the editor.
    2. Navigate to Settings and choose Manage to set up a query result location in Amazon S3.
  3. Under Data, select the data source and database you created (you may need to choose the refresh icon for them to sync up with Athena).
  4. Tables belonging to the selected database appear under Tables. You can choose a table name for Athena to show the table column list and data types.
  5. Test the connector by pulling data from your table via a SELECT statement. When you run Athena queries, you can reference Athena data sources, databases, and tables as <datasource_name>.<database>.<table_name>. Retrieved records are shown under Results.

For increased security, refer to Encrypting Athena query results stored in Amazon S3 to encrypt query results at rest.

Athena Query Results

For this post, we run a SELECT statement to validate the process. You can refer to the SQL reference for Athena to build more complex queries and analyses.
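For reference, a minimal validation query might look like the following sketch. The data source, database, and table names (dynamodb_ds, ddb_metadata_db, and productcatalog) are placeholders for the names you chose earlier, and the lowercase column names assume the defaults inferred by the crawler:

    -- Placeholder names; replace with your own data source, database, and table
    SELECT id, title, price, productcategory
    FROM dynamodb_ds.ddb_metadata_db.productcatalog
    WHERE productcategory = 'Bicycle'
    ORDER BY price DESC
    LIMIT 10;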

Visualize in QuickSight

QuickSight allows for building modern interactive dashboards, paginated reports, embedded analytics, and natural language queries through a unified BI solution. In this step, we use QuickSight to generate visual insights from the DynamoDB table by connecting to the Athena data source previously created.

Allow QuickSight access to resources

Complete the following steps to grant QuickSight access to resources:

  1. On the QuickSight console, choose the profile icon and choose Manage QuickSight.
  2. In the navigation pane, choose Security & Permissions.
  3. Under QuickSight access to AWS services, choose Manage.
  4. QuickSight may ask you to switch to the Region in which users and groups in your account are managed. To change the current Region, navigate to the profile icon on the QuickSight console and choose the Region you want to switch to.
  5. For IAM Role, choose Use QuickSight-managed role (default).

Subsequent instructions assume that the default QuickSight-managed role is being used. If you use a different role, make sure it has equivalent permissions.

  1. Under Allow access and autodiscovery for these resources, select IAM and Amazon S3.
  2. For Amazon S3, choose Select S3 buckets.
  3. Choose the spill bucket you specified earlier when deploying the Lambda function for the connector, as well as the bucket you specified as the Athena query result location in Amazon S3.
  4. For both buckets, select Write permission for Athena Workgroup.
  5. Choose Amazon Athena.
  6. In the pop-up window, choose Next.
  7. Choose Lambda and choose the Amazon Resource Name (ARN) of the Lambda function previously used for the Athena data source connector.
  8. Choose Finish.
  9. Choose Save.

Create the Athena dataset

To create the Athena dataset, complete the following steps:

  1. On the QuickSight console, choose the user profile and switch to the Region you deployed the Athena data source to.
  2. Return to the QuickSight home page.
  3. In the navigation pane, choose Datasets.
  4. Choose New dataset.
  5. For Create a Dataset, select Athena.
  6. For Data source name, enter a name and choose Validate connection.
  7. When the connection shows as Validated, choose Create data source.
  8. Under Catalog, Database, and Tables, select the Athena data source, AWS Glue database, and AWS Glue table previously created.
  9. Choose Select.
  10. On the Finish dataset creation page, select Import to SPICE for quicker analytics.
  11. Choose Visualize.

For additional information on QuickSight query modes, see Importing data into SPICE and Using SQL to customize data.

Build QuickSight visualizations

Once the DynamoDB data is available in QuickSight via the Athena DynamoDB connector, it is ready to be visualized. The QuickSight analysis in the following example shows a vertical stacked bar chart with the average price per product category for the ProductCatalog sample dataset. In addition, it shows a donut chart with the proportion of products by product category, and a tree map containing the count of bicycles per bicycle type.

If you use data imported to SPICE in a QuickSight analysis, the dataset will only be available after the import is complete. For further details, see Using SPICE data in an analysis.

QuickSight Analysis

For comprehensive information on how to create and share visualizations in QuickSight, refer to Visualizing data in Amazon QuickSight and Sharing and subscribing to data in Amazon QuickSight.

Clean up

To avoid incurring continued AWS usage charges, make sure you delete all resources created as part of this walkthrough.

  • Delete the Athena data source:
    1. On the Athena console, switch to the Region you deployed your resources in.
    2. Choose Data sources in the navigation pane.
    3. Select the data source you created and on the Actions menu, choose Delete.
  • Delete the Lambda application:
    1. On the AWS CloudFormation console, switch to the Region you deployed your resources in.
    2. Choose Stacks in the navigation pane.
    3. Select serverlessrepo-AthenaDynamoDBConnector and choose Delete.
  • Delete the AWS Glue resources:
    1. On the AWS Glue console, switch to the Region you deployed your resources in.
    2. Choose Databases in the navigation pane.
    3. Select the database you created and choose Delete.
    4. Choose Crawlers in the navigation pane.
    5. Select the crawler you created and on the Action menu, choose Delete crawler.
  • Delete the QuickSight resources:
    1. On the QuickSight console, switch to the Region you deployed your resources in.
    2. Delete the analysis created for this walkthrough.
    3. Delete the Athena dataset created for this walkthrough.
    4. If you no longer need the Athena data source to create other datasets, delete the data source.

Summary

This post demonstrated how you can use the Athena DynamoDB connector to query data in DynamoDB with SQL and build visualizations in QuickSight.

Learn more about the Athena DynamoDB connector in the Amazon Athena User Guide. Discover more available data source connectors to query and visualize a variety of data sources without setting up or managing any infrastructure while only paying for the queries you run.

For advanced QuickSight capabilities powered by AI, see Gaining insights with machine learning (ML) in Amazon QuickSight and Answering business questions with Amazon QuickSight Q.


About the Authors

Antonio Samaniego Jurado is a Solutions Architect at Amazon Web Services. With a strong passion for modern technology, Antonio helps customers build state-of-the-art applications on AWS. A creator at heart, he loves community-driven learning and sharing of best practices across the AWS service portfolio to make the best of customers' cloud journeys.

Pascal Vogel is a Solutions Architect at Amazon Web Services. Pascal helps startups and enterprises build cloud-native solutions. As a cloud enthusiast, Pascal loves learning new technologies and connecting with like-minded customers who want to make a difference in their cloud journey.

Power enterprise-grade Data Vaults with Amazon Redshift – Part 1

Post Syndicated from Asser Moustafa original https://aws.amazon.com/blogs/big-data/power-enterprise-grade-data-vaults-with-amazon-redshift-part-1/

Amazon Redshift is a popular cloud data warehouse, offering a fully managed cloud-based service that seamlessly integrates with an organization’s Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional workflows, and much more—all while providing up to 7.9x better price-performance than other cloud data warehouses.

As with all AWS services, Amazon Redshift is a customer-obsessed service that recognizes there isn’t a one-size-fits-all for customers when it comes to data models, which is why Amazon Redshift supports multiple data models such as Star Schemas, Snowflake Schemas and Data Vault. This post discusses best practices for designing enterprise-grade Data Vaults of varying scale using Amazon Redshift; the second post in this two-part series discusses the most pressing needs when designing an enterprise-grade Data Vault and how those needs are addressed by Amazon Redshift.

Whether it’s a desire to easily retain data lineage directly within the data warehouse, establish a source-system agnostic data model within the data warehouse, or more easily comply with GDPR regulations, customers that implement a Data Vault model will benefit from this post’s discussion of considerations, best practices, and Amazon Redshift features relevant to the building of enterprise-grade Data Vaults. Building a starter version of anything can often be straightforward, but building something with enterprise-grade scale, security, resiliency, and performance typically requires knowledge of and adherence to battle-tested best practices, and using the right tools and features in the right scenario.

Data Vault overview

Let’s first briefly review the core Data Vault premise and concepts. Data models provide a framework for how the data in a data warehouse should be organized into database tables. Amazon Redshift supports a number of data models, and some of the most popular data models include STAR schemas and Data Vault.

Data Vault is not only a modeling methodology, it's also an opinionated framework that tells you how to solve certain problems in your data ecosystem. An opinionated framework provides a set of guidelines and conventions that developers are expected to follow, rather than leaving all decisions up to the developer. You can compare this with what big enterprise frameworks like Spring or Micronaut do when developing applications at enterprise scale. This is incredibly helpful, especially on large data warehouse projects, because it structures your extract, load, and transform (ELT) pipeline and clearly tells you how to solve certain problems within the data and pipeline contexts. This also allows for a high degree of automation.

Data Vault 2.0 allows for the following:

  • Agile data warehouse development
  • Parallel data ingestion
  • A scalable approach to handle multiple data sources even on the same entity
  • A high level of automation
  • Historization
  • Full lineage support

However, Data Vault 2.0 also comes with costs, and there are use cases where it’s not a good fit, such as the following:

  • You only have a few data sources with no related or overlapping data (for example, a bank with a single core system)
  • You have simple reporting with infrequent changes
  • You have limited resources and knowledge of Data Vault

Data Vault typically organizes an organization’s data into a pipeline of four layers: staging, raw, business, and presentation. The staging layer represents data intake and light data transformations and enhancements that occur before the data comes to its more permanent resting place, the raw Data Vault (RDV).

The RDV holds the historized copy of all of the data from multiple source systems. It is referred to as raw because no filters or business transformations have occurred at this point except for storing the data in source system independent targets. The RDV organizes data into three key types of tables:

  • Hubs – This type of table represents a core business entity such as a customer. Each record in a hub table is married with metadata that identifies the record’s creation time, originating source system, and unique business key.
  • Links – This type of table defines a relationship between two or more hubs—for example, how the customer hub and order hub are to be joined.
  • Satellites – This type of table records the historized reference data about either hubs or links, such as product_info and customer_info.

The RDV is used to feed data into the business Data Vault (BDV), which is responsible for reorganizing, denormalizing, and aggregating data for optimized consumption by the presentation mart. The presentation marts, also known as the data mart layer, further reorganizes the data for optimized consumption by downstream clients such as business dashboards. The presentation marts may, for example, reorganize data into a STAR schema.

For a more detailed overview of Data Vault along with a discussion of its applicability in the context of very interesting use cases, refer to the following:

How does Data Vault fit into a modern data architecture?

Currently, the lake house paradigm is becoming a major pattern in data warehouse design, even as part of a data mesh architecture. This follows the pattern of data lakes getting closer to what a data warehouse can do and vice versa. To compete with the flexibility of a data lake, Data Vault is a good choice. This way, the data warehouse doesn't become a bottleneck and you can achieve similar agility, flexibility, scalability, and adaptability when ingesting and onboarding new data.

Platform flexibility

In this section, we discuss some recommended Redshift configurations for Data Vaults of varying scale. As mentioned earlier, the layers within a Data Vault platform are well known. We typically see a flow from the staging layer to the RDV, BDV, and finally presentation mart.

Amazon Redshift is highly flexible in supporting both modest and large-scale Data Vaults, offering features like those discussed in the following sections.

Modest vs. large-scale Data Vaults

Amazon Redshift is flexible in how you decide to structure these layers. For modest data vaults, a single Redshift warehouse with one database containing multiple schemas will work just fine.

For large data vaults with more complex transformations, we would look at multiple warehouses, each with their own schema of mastered data representing one or more layers. The reason for using multiple warehouses is to take advantage of the flexibility of the Amazon Redshift architecture for large-scale data vault implementations, such as using Redshift RA3 nodes and Redshift Serverless to separate compute from the data storage layer, and using Redshift data sharing to share the data between different Redshift warehouses. This enables you to scale the compute capacity independently at each layer depending on the processing complexity. The staging layer, for example, can be a layer within your data lake (Amazon S3 storage) or a schema within a Redshift database.

Using Amazon Aurora zero-ETL integrations with Amazon Redshift, you can create a data vault implementation with a staging layer in an Amazon Aurora database that will take care of real-time transaction processing and move the data to Amazon Redshift automatically for further processing in the Data Vault implementation without creating any complex ETL pipelines. This way, you can use Amazon Aurora for transactions and Amazon Redshift for analytics. Compute resources are isolated for the same data, and you’re using the most efficient tools to process it.

Large-scale Data Vaults

For larger Data Vaults platforms, concurrency and compute power become important to process both the loading of data and any business transformations. Amazon Redshift offers flexibility to increase compute capacity both horizontally via concurrency scaling and vertically via cluster resize and also via different architectures for each Data Vault layer.

Staging layer

You can create a data warehouse for the staging layer and perform hard business rules processing here, including calculation of hash keys, hash diffs, and addition of technical metadata columns. If data is not loaded 24/7, consider either pause/resume or a Redshift Serverless workgroup.

Raw Data Vault layer

For the raw Data Vault (RDV), it's recommended to create either one Redshift warehouse for the whole RDV or one Redshift warehouse for one or more subject areas within the RDV. For example, if the volume of data and number of normalized tables within the RDV for a particular subject area is large (either the raw data layer has so many tables that it exceeds the maximum table limit on Amazon Redshift, or the advantage of workload isolation within a single Redshift warehouse outweighs the performance and management overhead), this subject area within the RDV can be run and mastered on its own Redshift warehouse.

The RDV is typically loaded 24/7, so a provisioned Redshift data warehouse may be most suitable here to take advantage of reserved instance pricing.

Business Data Vault layer

The data warehouse for the business Data Vault (BDV) layer could be larger in size than the previous data warehouses due to the nature of BDV processing, which typically involves denormalization of data from a large number of source RDV tables.

Some customers run their BDV processing once a day, so a pause/resume window for a Redshift provisioned cluster may be cost-beneficial here. You can also run BDV processing on an Amazon Redshift Serverless warehouse, which will automatically pause when workloads complete and resume when workloads start processing again.

Presentation Data Mart layer

For creating Redshift (provisioned or serverless) warehouses for one or more data marts, the schemas within these marts typically contain views or materialized views, so a Redshift data share will be set up between the data marts and the previous layers.

We need to ensure there is enough concurrency to cope with the increased read traffic at this level. This is achieved via multiple read-only warehouses with a data share or the use of concurrency scaling to scale automatically.

Example architectures

The following diagram illustrates an example platform for a modest Data Vault model.

The following diagram illustrates the architecture for a large-scale Data Vault model.

Data Vault data model guiding principles

In this section, we discuss some recommended design principles for joining and filtering table access within a Data Vault implementation. These guiding principles address different combinations of entity type access, but should be tested for suitability with each client’s particular use case.

Let’s begin with a brief refresher of table distribution styles in Amazon Redshift. There are four ways that a table’s data can be distributed among the different compute nodes in a Redshift cluster: ALL, KEY, EVEN, and AUTO.

The ALL distribution style ensures that a full copy of the table is maintained on each compute node to eliminate the need for inter-node network communication during workload runs. This distribution style is ideal for tables that are relatively small in size (such as fewer than 5 million rows) and don’t exhibit frequent changes.

The KEY distribution style uses a hash-based approach to persisting a table’s rows in the cluster. A distribution key column is defined to be one of the columns in the row, and the value of that column is hashed to determine on which compute node the row will be persisted. The current generation RA3 node type is built on the AWS Nitro System with managed storage that uses high performance SSDs for your hot data and Amazon S3 for your cold data, providing ease of use, cost-effective storage, and fast query performance. Managed storage means this mapping of row to compute node is more in terms of metadata and compute node ownership rather than the actual persistence. This distribution style is ideal for large tables that have well-known and frequent join patterns on the distribution key column.

The EVEN distribution style uses a round-robin approach to locating a table's rows. Simply put, table rows are cycled through the different compute nodes, and when the last compute node in the cluster is reached, the cycle begins again with the next row being persisted to the first compute node in the cluster. This distribution style is ideal for large tables that exhibit frequent table scans.

Finally, the default table distribution style in Amazon Redshift is AUTO, which empowers Amazon Redshift to monitor how a table is used and change the table’s distribution style at any point in the table’s lifecycle for greater performance with workloads. However, you are also empowered to explicitly state a particular distribution style at any point in time if you have a good understanding of how the table will be used by workloads.
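If you want to verify which distribution style AUTO has currently selected (or confirm an explicit choice), you can query the SVV_TABLE_INFO system view. The following is a minimal sketch, assuming a hypothetical datavault schema:

    -- Check the current distribution style, row counts, and distribution skew
    SELECT "table", diststyle, tbl_rows, skew_rows
    FROM svv_table_info
    WHERE "schema" = 'datavault'
    ORDER BY tbl_rows DESC;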

Hub and hub satellites

Hub and hub satellites are often joined together, so it’s best to co-locate these datasets based on the primary key of the hub, which will also be part of the compound key of each satellite. As mentioned earlier, for smaller volumes (typically fewer than 5–7 million rows) use the ALL distribution style and for larger volumes, use the KEY distribution style (with the _PK column as the distribution KEY column).
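A minimal DDL sketch for the larger-volume case follows; the schema, table, and column names (and the VARCHAR sizes) are hypothetical and will differ in your implementation:

    -- Hub and hub satellite co-located on the hub primary key (_PK) column
    CREATE TABLE datavault.hub_customer (
        customer_pk      VARCHAR(32) NOT NULL,  -- hash of the business key
        customer_bk      VARCHAR(64) NOT NULL,  -- business key
        load_dts         TIMESTAMP   NOT NULL,
        source_system_cd VARCHAR(16) NOT NULL
    )
    DISTSTYLE KEY
    DISTKEY (customer_pk)
    SORTKEY (customer_bk);

    CREATE TABLE datavault.sat_customer_details (
        customer_pk    VARCHAR(32) NOT NULL,  -- same distribution key as the hub
        load_dts       TIMESTAMP   NOT NULL,
        hash_diff      VARCHAR(32) NOT NULL,
        customer_name  VARCHAR(256),
        customer_email VARCHAR(256)
    )
    DISTSTYLE KEY
    DISTKEY (customer_pk)
    SORTKEY (customer_pk, load_dts);

For smaller hubs and satellites, you would use DISTSTYLE ALL instead of the KEY distribution shown here.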

Link and link satellites

Link and link satellites are often joined together, so it’s best to co-locate these datasets based on the primary key of the link, which will also be part of the compound key of each link satellite. These typically involve larger data volumes, so look at a KEY distribution style (with the _PK column as the distribution KEY column).

Point in time and satellites

You may decide to denormalize key satellite attributes by adding them to point in time (PIT) tables with the goal of reducing or eliminating runtime joins. Because denormalization of data helps reduce or eliminate the need for runtime joins, denormalized PIT tables can be defined with an EVEN distribution style to optimize table scans.

However, if you decide not to denormalize, then smaller volumes should use the ALL distribution style and larger volumes should use the KEY distribution style (with the _PK column as the distribution KEY column). Also, be sure to define the business key column as a sort key on the PIT table for optimized filtering.

Bridge and link satellites

Similar to PIT tables, you may decide to denormalize key satellite attributes by adding them to bridge tables with the goal of reducing or eliminating runtime joins. Although denormalization of data helps reduce or eliminate the need for runtime joins, denormalized bridge tables are still typically larger in data volume and involve frequent joins, so the KEY distribution style (with the _PK column as the distribution KEY column) would be the recommended distribution style. Also, be sure to define the bridge table's dominant business key columns as sort keys for optimized filtering.

KPI and reporting

KPI and reporting tables are designed to meet the specific needs of each customer, so flexibility on their structure is key here. These are often standalone tables that exhibit multiple types of interactions, so the EVEN distribution style may be the best table distribution style to evenly spread the scan workloads.

Be sure to choose a sort key that is based on common WHERE clauses such as a date[time] element or a common business key. In addition, a time series table can be created for very large datasets that are always sliced on a time attribute to optimize workloads that typically interact with one slice of time. We discuss this subject in greater detail later in the post.

Non-functional design principles

In this section, we discuss potential additional data dimensions that are often created and married with business data to satisfy non-functional requirements. In the physical data model, these additional data dimensions take the form of technical columns added to each row to enable tracking of non-functional requirements. Many of these technical columns will be populated by the Data Vault framework. The following list describes some of the common technical columns and the table types they apply to, but you can extend the list as needed.

  • LOAD_DTS (all tables) – A timestamp recording when this row was inserted. This is a primary key column for historized targets (links, satellites, reference tables), and a non-primary key column for transactional links and hubs.
  • BATCH_ID (all tables) – A unique process ID identifying the run of the ETL code that populated the row.
  • JOB_NAME (all tables) – The process name from the ETL framework. This may be a sub-process within a larger process.
  • SOURCE_SYSTEM_CD (all tables) – The system from which this data was discovered.
  • HASH_DIFF (satellites) – A hash value used in Data Vault to perform change data capture (CDC).
  • RECORD_ID (satellites, links, reference tables) – A unique identifier captured by the code framework for each row.
  • EFFECTIVE_DTS (links) – Business effective dates to record the business validity of the row. It's set to the LOAD_DTS if no business date is present or needed.
  • DQ_AUDIT (satellites, links, reference tables) – Warnings and errors found during staging for this row, tied to the RECORD_ID.
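As an illustration of how these technical columns might appear in a physical table, the following sketch adds them to a hypothetical satellite. The table name, payload columns, data types, and sizes are assumptions; in practice they are defined and populated by your Data Vault load framework:

    -- Hypothetical satellite carrying the technical columns listed above
    CREATE TABLE datavault.sat_product_info (
        product_pk       VARCHAR(32)  NOT NULL,  -- hash key of the parent hub
        load_dts         TIMESTAMP    NOT NULL,  -- insert timestamp (part of the historized key)
        batch_id         VARCHAR(36)  NOT NULL,  -- run ID of the populating ETL process
        job_name         VARCHAR(128) NOT NULL,  -- ETL process or sub-process name
        source_system_cd VARCHAR(16)  NOT NULL,  -- originating source system
        hash_diff        VARCHAR(32)  NOT NULL,  -- change data capture comparison value
        record_id        VARCHAR(36)  NOT NULL,  -- unique row identifier from the framework
        dq_audit         SUPER,                  -- staging warnings/errors tied to RECORD_ID
        product_name     VARCHAR(256),           -- payload attributes
        product_category VARCHAR(128)
    )
    DISTSTYLE KEY
    DISTKEY (product_pk)
    SORTKEY (product_pk, load_dts);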

Advanced optimizations and guidelines

In this section, we discuss potential optimizations that can be deployed at the start or later on in the lifecycle of the Data Vault implementation.

Time series tables

Let’s begin with a brief refresher on time series tables as a pattern. Time series tables involve taking a large table and segmenting it into multiple identical tables that hold a time-bound portion of the rows in the original table. One common scenario is to divide a monolithic sales table into monthly or annual versions of the sales table (such as sales_jan, sales_feb, and so on). For example, let’s assume we want to maintain data for a rolling time period using a series of tables, as the following diagram illustrates.

With each new calendar quarter, we create a new table to hold the data for the new quarter and drop the oldest table in the series. Furthermore, if the table rows arrive in a naturally sorted order (such as sales date), then no work is needed to sort the table data, which means the expensive VACUUM SORT operation on the table can be skipped.

Time series tables can help significantly optimize workloads that often need to scan these large tables but within a certain time range. Furthermore, by segmenting the data across tables that represent calendar quarters, we are able to drop aged data with a single DROP command. Had we tried to perform the same operation on a monolithic table design using the DELETE command, for example, it would have been a much more expensive deletion that would have left the table in a suboptimal state, requiring defragmentation and a subsequent VACUUM process to reclaim space.

Should a workload ever need to query against the entire time range, you can use standard or materialized views using a UNION ALL operation within Amazon Redshift to easily stitch all the component tables back into the unified dataset. Materialized views can also be used to abstract the table segmentation from downstream users.
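A minimal sketch of this stitching follows, assuming hypothetical quarterly sales tables; recreate the view (or a materialized view) whenever a quarterly table is added or dropped:

    -- Unified view over the quarterly time series tables
    CREATE OR REPLACE VIEW sales_all AS
    SELECT * FROM sales_2024_q1
    UNION ALL
    SELECT * FROM sales_2024_q2
    UNION ALL
    SELECT * FROM sales_2024_q3
    UNION ALL
    SELECT * FROM sales_2024_q4;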

In the context of Data Vault, time series tables can be useful for archiving rows within satellites, PIT, and bridge tables that aren’t used often. Time series tables can then be used to distribute the remaining hot rows (rows that are either recently added or referenced often) with more aggressive table properties.

Conclusion

In this post, we discussed a number of areas ripe for optimization and automation to successfully implement a Data Vault 2.0 system at scale and the Amazon Redshift capabilities that you can use to satisfy the related requirements. There are many more Amazon Redshift capabilities and features that will surely come in handy, and we strongly encourage current and prospective customers to reach out to us or other AWS colleagues to delve deeper into Data Vault with Amazon Redshift.


About the Authors

Asser Moustafa is a Principal Analytics Specialist Solutions Architect at AWS based out of Dallas, Texas. He advises customers globally on their Amazon Redshift and data lake architectures, migrations, and visions—at all stages of the data ecosystem lifecycle—starting from the POC stage to actual production deployment and post-production growth.

Philipp Klose is a Global Solutions Architect at AWS based in Munich. He works with enterprise FSI customers and helps them solve business problems by architecting serverless platforms. In his free time, Philipp spends time with his family and enjoys every geek hobby possible.

Saman Irfan is a Specialist Solutions Architect at Amazon Web Services. She focuses on helping customers across various industries build scalable and high-performant analytics solutions. Outside of work, she enjoys spending time with her family, watching TV series, and learning new technologies.

Power enterprise-grade Data Vaults with Amazon Redshift – Part 2

Post Syndicated from Asser Moustafa original https://aws.amazon.com/blogs/big-data/power-enterprise-grade-data-vaults-with-amazon-redshift-part-2/

Amazon Redshift is a popular cloud data warehouse, offering a fully managed cloud-based service that seamlessly integrates with an organization’s Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional workflows, and much more—all while providing up to 7.9x better price-performance than other cloud data warehouses.

As with all AWS services, Amazon Redshift is a customer-obsessed service that recognizes there isn’t a one-size-fits-all for customers when it comes to data models, which is why Amazon Redshift supports multiple data models such as Star Schemas, Snowflake Schemas and Data Vault. This post discusses the most pressing needs when designing an enterprise-grade Data Vault and how those needs are addressed by Amazon Redshift in particular and AWS cloud in general. The first post in this two-part series discusses best practices for designing enterprise-grade data vaults of varying scale using Amazon Redshift.

Whether it’s a desire to easily retain data lineage directly within the data warehouse, establish a source-system agnostic data model within the data warehouse, or more easily comply with GDPR regulations, customers that implement a data vault model will benefit from this post’s discussion of considerations, best practices, and Amazon Redshift features as well as the AWS cloud capabilities relevant to the building of enterprise-grade data vaults. Building a starter version of anything can often be straightforward, but building something with enterprise-grade scale, security, resiliency, and performance typically requires knowledge of and adherence to battle-tested best practices, and using the right tools and features in the right scenario.

Data Vault overview

For a brief review of the core Data Vault premise and concepts, refer to the first post in this series.

In the following sections, we discuss the most common areas of consideration that are critical for Data Vault implementations at scale: data protection, performance and elasticity, analytical functionality, cost and resource management, availability, and scalability. Although these areas can also be critical areas of consideration for any data warehouse data model, in our experience, these areas present their own flavor and special needs to achieve data vault implementations at scale.

Data protection

Security is always priority-one at AWS, and we see the same attention to security every day with our customers. Data security has many layers and facets, ranging from encryption at rest and in transit to fine-grained access controls and more. In this section, we explore the most common data security needs within the raw and business data vaults and the Amazon Redshift features that help achieve those needs.

Data encryption

Amazon Redshift encrypts data in transit by default. With the click of a button, you can configure Amazon Redshift to encrypt data at rest at any point in a data warehouse’s lifecycle, as shown in the following screenshot.

You can use either AWS Key Management Service (AWS KMS) or a hardware security module (HSM) to perform encryption of data at rest. If you use AWS KMS, you can use either an AWS managed key or a customer managed key. For more information, refer to Amazon Redshift database encryption.

You can also modify cluster encryption after cluster creation, as shown in the following screenshot.

Moreover, Amazon Redshift Serverless is encrypted by default.

Fine-grained access controls

When it comes to achieving fine-grained access controls at scale, Data Vaults typically need to use both static and dynamic access controls. You can use static access controls to restrict access to databases, tables, rows, and columns to explicit users, groups, or roles. With dynamic access controls, you can mask part or all of a data item, such as a column, based on a user’s role or some other functional analysis of the user’s privileges.

Amazon Redshift has long supported static access controls through the GRANT and REVOKE commands for databases, schemas, and tables, down to the column level. Amazon Redshift also supports row-level security, where you can further restrict access to particular rows of the visible columns, as well as role-based access control, which helps simplify the management of security privileges in Amazon Redshift.

In the following example, we demonstrate how you can use GRANT and REVOKE statements to implement static access control in Amazon Redshift.

  1. First, create a table and populate it with credit card values:
    -- Create the credit cards table
    
    CREATE TABLE credit_cards 
    ( customer_id INT, 
    is_fraud BOOLEAN, 
    credit_card TEXT);
    
    --populate the table with sample values
    
    INSERT INTO credit_cards 
    VALUES
    (100,'n', '453299ABCDEF4842'),
    (100,'y', '471600ABCDEF5888'),
    (102,'n', '524311ABCDEF2649'),
    (102,'y', '601172ABCDEF4675'),
    (102,'n', '601137ABCDEF9710'),
    (103,'n', '373611ABCDEF6352');
    

  2. Create the user user1 and check permissions for user1 on the credit_cards table. We use SET SESSION AUTHORIZATION to switch to user1 in the current session:
       -- Create user
    
       CREATE USER user1 WITH PASSWORD '1234Test!';
    
       -- Check access permissions for user1 on credit_cards table
       SET SESSION AUTHORIZATION user1; 
       SELECT * FROM credit_cards; -- This will return a permission denied error
    

  3. Grant SELECT access on the credit_cards table to user1:
    RESET SESSION AUTHORIZATION;
     
    GRANT SELECT ON credit_cards TO user1;
    

  4. Verify access permissions on the table credit_cards for user1:
    SET SESSION AUTHORIZATION user1;
    
    SELECT * FROM credit_cards; -- Query will return rows
    RESET SESSION AUTHORIZATION;

Data obfuscation

Static access controls are often useful to establish hard boundaries (guardrails) of the user communities that should be able to access certain datasets (for example, only those users that are part of the marketing user group should be allowed access to marketing data). However, what if access controls need to restrict only partial aspects of a field, not the entire field? Amazon Redshift supports partial, full, or custom data masking of a field through dynamic data masking. Dynamic data masking enables you to protect sensitive data in your data warehouse. You can manipulate how Amazon Redshift shows sensitive data to the user at query time without transforming it in the database by using masking policies.

In the following example, we achieve a full redaction of credit card numbers at runtime using a masking policy on the previously used credit_cards table.

  1. Create a masking policy that fully masks the credit card number:
    CREATE MASKING POLICY mask_credit_card_full 
    WITH (credit_card VARCHAR(256)) 
    USING ('000000XXXX0000'::TEXT);

  2. Attach mask_credit_card_full to the credit_cards table as the default policy. Note that all users will see this masking policy unless a higher priority masking policy is attached to them or their role.
    ATTACH MASKING POLICY mask_credit_card_full 
    ON credit_cards(credit_card) TO PUBLIC;

  3. Users will see credit card information being masked when running the following query:
    SELECT * FROM credit_cards;

Centralized security policies

You can achieve a great deal of scale by combining static and dynamic access controls to span a broad swath of user communities, datasets, and access scenarios. However, what about datasets that are shared across multiple Redshift warehouses, as might be done between raw data vaults and business data vaults? How can scale be achieved with access controls for a dataset that resides on one Redshift warehouse but is authorized for use across multiple Redshift warehouses using Amazon Redshift data sharing?

The integration of Amazon Redshift with AWS Lake Formation enables centrally managed access and permissions for data sharing. Amazon Redshift data sharing policies are established in Lake Formation and will be honored by all of your Redshift warehouses.

Performance

It is not uncommon for sub-second SLAs to be associated with data vault queries, particularly when interacting with the business vault and the data marts sitting atop the business vault. Amazon Redshift delivers on that needed performance through a number of mechanisms such as caching, automated data model optimization, and automated query rewrites.

The following are common performance requirements for Data Vault implementations at scale:

  • Query and table optimization in support of high-performance query throughput
  • High concurrency
  • High-performance string-based data processing

Amazon Redshift features and capabilities for performance

In this section, we discuss Amazon Redshift features and capabilities that address those performance requirements.

Caching

Amazon Redshift uses multiple layers of caching to deliver subsecond response times for repeat queries. Through Amazon Redshift in-memory result set caching and compilation caching, workloads ranging from dashboarding to visualization to business intelligence (BI) that run repeat queries experience a significant performance boost.

With in-memory result set caching, queries that have a cached result set and no changes to the underlying data return immediately and typically within milliseconds.

The current generation RA3 node type is built on the AWS Nitro System with managed storage that uses high performance SSDs for your hot data and Amazon S3 for your cold data, providing ease of use, cost-effective storage, and fast query performance. In short, managed storage means fast retrieval for your most frequently accessed data and automated/managed identification of hot data by Amazon Redshift.

The large majority of queries in a typical production data warehouse are repeat queries, and data warehouses with data vault implementations observe the same pattern. The most optimal run profile for a repeat query is one that avoids costly query runtime interpretation, which is why queries in Amazon Redshift are compiled during the first run and the compiled code is cached in a global cache, providing repeat queries a significant performance boost.

Materialized views

Pre-computing the result set for repeat queries is a powerful mechanism for boosting performance. The fact that it automatically refreshes to reflect the latest changes in the underlying data is yet another powerful pattern for boosting performance. For example, consider the denormalization queries that might be run on the raw data vault to populate the business vault. It’s quite possible that some less-active source systems will have exhibited little to no changes in the raw data vault since the last run. Avoiding the hit of rerunning the business data vault population queries from scratch in those cases could be a tremendous boost to performance. Redshift materialized views provide that exact functionality by storing the precomputed result set of their backing query.

Queries that are similar to the materialized view’s backing query don’t have to rerun the same logic each time, because they can pull records from the existing result set. Developers and analysts can choose to create materialized views after analyzing their workloads to determine which queries would benefit. Materialized views also support automatic query rewriting to have Amazon Redshift rewrite queries to use materialized views, as well as auto refreshing materialized views, where Amazon Redshift can automatically refresh materialized views with up-to-date data from its base tables.

Alternatively, the automated materialized views (AutoMV) feature provides the same performance benefits of user-created materialized views without the maintenance overhead because Amazon Redshift automatically creates the materialized views based on observed query patterns. Amazon Redshift continually monitors the workload using machine learning and then creates new materialized views when they are beneficial. AutoMV balances the costs of creating and keeping materialized views up to date against expected benefits to query latency. The system also monitors previously created AutoMVs and drops them when they are no longer beneficial. AutoMV behavior and capabilities are the same as user-created materialized views. They are refreshed automatically and incrementally, using the same criteria and restrictions.

Also, whether the materialized views are user-created or auto-generated, Amazon Redshift automatically rewrites queries to use materialized views when there is enough of a similarity between the query and the materialized view’s backing query, without requiring users to change their queries.
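The following is a minimal sketch of a user-created materialized view with auto refresh enabled; the raw vault and business vault schema, table, and column names are hypothetical:

    -- Pre-compute a simple denormalization from the raw vault into the business vault
    CREATE MATERIALIZED VIEW bdv.mv_customer_current
    AUTO REFRESH YES
    AS
    SELECT h.customer_bk,
           s.customer_name,
           s.customer_email,
           s.load_dts
    FROM rdv.hub_customer h
    JOIN rdv.sat_customer_details s
      ON s.customer_pk = h.customer_pk;

Queries that join these hub and satellite tables in a similar way can then be automatically rewritten by Amazon Redshift to read from the materialized view instead.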

Concurrency scaling

Amazon Redshift automatically and elastically scales query processing power to provide consistently fast performance for hundreds of concurrent queries. Concurrency scaling resources are added to your Redshift warehouse transparently in seconds, as concurrency increases, to process read/write queries without wait time. When workload demand subsides, Amazon Redshift automatically shuts down concurrency scaling resources to save you cost. You can continue to use your existing applications and BI tools without any changes.

Because Data Vault allows for highly concurrent data processing and is primarily run within Amazon Redshift, concurrency scaling is the recommended way to handle concurrent transformation operations. You should avoid operations that aren’t supported by concurrency scaling.

Concurrent ingestion

One of the key attractions of Data Vault 2.0 is its ability to support high-volume concurrent ingestion from multiple source systems into the data warehouse. Amazon Redshift provides a number of options for concurrent ingestion, including batch and streaming.

For batch- and microbatch-based ingestion, we suggest using the COPY command in conjunction with CSV format. CSV is well supported by concurrency scaling. If your data is already on Amazon S3 but in big data formats like ORC or Parquet, always consider the trade-off of converting the data to CSV vs. non-concurrent ingestion. You can also use workload management to prioritize non-concurrent ingestion jobs to keep the throughput high.
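A minimal sketch of such a batch load follows; the staging table name, S3 path, and IAM role ARN are placeholders:

    -- Batch load of staged CSV files into a staging table
    COPY staging.customer_src
    FROM 's3://example-datavault-staging/customer/2024-06-01/'
    IAM_ROLE 'arn:aws:iam::111122223333:role/ExampleRedshiftCopyRole'
    FORMAT AS CSV
    IGNOREHEADER 1
    TIMEFORMAT 'auto';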

For low-latency workloads, we suggest using the native Amazon Redshift streaming capability or the Amazon Redshift zero-ETL integration with Amazon Aurora. By using Aurora as a staging layer for the raw data, you can handle small increments of data efficiently and with high concurrency, and then use this data inside your Redshift data warehouse without any extract, transform, and load (ETL) processes. For stream ingestion, we suggest using the native streaming feature (Amazon Redshift streaming ingestion) and having a dedicated stream for ingesting each table. This might require a stream processing solution upfront that splits the input stream into the respective elements, such as the hub and the satellite records.

String-optimized compression

The Data Vault 2.0 methodology often involves time-sensitive lookup queries against potentially very large satellite tables (in terms of row count) that have low-cardinality hash/string indexes. Low-cardinality indexes and very large tables tend to work against time-sensitive queries. Amazon Redshift, however, provides a specialized compression method for low-cardinality string-based indexes called BYTEDICT. Using BYTEDICT creates a dictionary of the low-cardinality string index values, which allows Amazon Redshift to read the rows even in a compressed state, thereby significantly improving performance. You can manually select the BYTEDICT compression method for a column, or alternatively rely on Amazon Redshift automated table optimization facilities to select it for you.
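The following sketch shows an explicit BYTEDICT selection on a hypothetical low-cardinality status column; the table and column names are illustrative, and in practice you can also let automatic table optimization choose the encoding:

    -- Explicitly apply BYTEDICT compression to a low-cardinality string column
    CREATE TABLE rdv.sat_order_status (
        order_pk    VARCHAR(32) NOT NULL,
        load_dts    TIMESTAMP   NOT NULL,
        hash_diff   VARCHAR(32) NOT NULL,
        status_code VARCHAR(24) ENCODE BYTEDICT  -- small set of distinct string values
    )
    DISTSTYLE KEY
    DISTKEY (order_pk)
    SORTKEY (order_pk, load_dts);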

Support of transactional data lake frameworks

Data Vault 2.0 is an insert-only framework. Therefore, reorganizing data to save money is a challenge you may face. Amazon Redshift integrates seamlessly with S3 data lakes, allowing you to query your data lake in Amazon S3 using standard SQL as you would with native tables. This way, you can offload less frequently used satellites to your S3 data lake, which is cheaper than keeping them as native tables.

Modern transactional lake formats like Apache Iceberg are also an excellent option to store this data. They ensure transactional safety and therefore ensure that your audit trail, which is a fundamental feature of Data Vault, doesn’t break.

We also see customers using these frameworks as a mechanism to implement incremental loads. Apache Iceberg lets you query for the last state for a given point in time. You can use this mechanism to optimize merge operations while still making the data accessible from within Amazon Redshift.
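As a sketch of how offloaded satellites might be queried from Amazon Redshift, the following maps an external schema to a hypothetical AWS Glue Data Catalog database (which can contain Apache Iceberg tables) and queries an archived satellite; the database name, table name, and IAM role ARN are placeholders:

    -- Expose data lake tables registered in the AWS Glue Data Catalog
    CREATE EXTERNAL SCHEMA datalake_archive
    FROM DATA CATALOG
    DATABASE 'datavault_archive'
    IAM_ROLE 'arn:aws:iam::111122223333:role/ExampleRedshiftDataLakeRole';

    -- Query an offloaded satellite alongside native Redshift tables
    SELECT COUNT(*)
    FROM datalake_archive.sat_customer_details_archive
    WHERE load_dts < DATEADD(year, -2, GETDATE());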

Amazon Redshift data sharing performance considerations

For large-scale Data Vault implementations, one of the preferred design principles is to have a separate Redshift data warehouse for each layer (staging, raw Data Vault, business Data Vault, and presentation data mart). These layers have separate Redshift provisioned or serverless warehouses according to their storage and compute requirements, and they use Amazon Redshift data sharing to share the data between these layers without physically moving the data.

Amazon Redshift data sharing enables you to seamlessly share live data across multiple Redshift warehouses without any data movement. Because the data sharing feature serves as the backbone in implementing large-scale Data Vaults, it’s important to understand the performance of Amazon Redshift in this scenario.

In a data sharing architecture, we have producer and consumer Redshift warehouses. The producer warehouse shares data objects with one or more consumer warehouses for read purposes only, without having to copy the data.
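The following sketch shows the basic data sharing flow between a producer (for example, the business data vault warehouse) and a consumer (for example, a presentation data mart warehouse); the share name, schema name, and namespace identifiers are placeholders:

    -- On the producer warehouse: create and grant the datashare
    CREATE DATASHARE bdv_share;
    ALTER DATASHARE bdv_share ADD SCHEMA bdv;
    ALTER DATASHARE bdv_share ADD ALL TABLES IN SCHEMA bdv;
    GRANT USAGE ON DATASHARE bdv_share
      TO NAMESPACE 'aaaaaaaa-1111-2222-3333-444444444444';  -- consumer namespace

    -- On the consumer warehouse: create a local database from the share
    CREATE DATABASE bdv_shared
      FROM DATASHARE bdv_share
      OF NAMESPACE 'bbbbbbbb-5555-6666-7777-888888888888';  -- producer namespace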

Producer/consumer Redshift cluster performance dependency

From a performance perspective, the producer (provisioned or serverless) warehouse is not responsible for query performance on the consumer (provisioned or serverless) warehouse, and activity on the consumer has zero performance impact on the producer Redshift warehouse. Query performance depends on the consumer Redshift warehouse’s compute capacity; the producer warehouse is only responsible for the shared data.

Result set caching on the consumer Redshift cluster

Amazon Redshift uses result set caching to speed up the retrieval of data when it knows that the data in the underlying table has not changed. In a data sharing architecture, Amazon Redshift also uses result set caching on the consumer Redshift warehouse. This is quite helpful for repeatable queries that commonly occur in a data warehousing environment.

Best practices for materialized views in Data Vault with Amazon Redshift data sharing

In Data Vault implementation, the presentation data mart layer typically contains views or materialized views. There are two possible routes to create materialized views for the presentation data mart layer. First, create the materialized views on the producer Redshift warehouse (business data vault layer) and share materialized views with the consumer Redshift warehouse (dedicated data marts). Alternatively, share the table objects directly from the business data vault layer to the presentation data mart layer and build the materialized view on the shared objects directly on the consumer Redshift warehouse.

The second option is recommended in this case because it gives us the flexibility of creating customized materialized views of data on each consumer according to the specific use case, and it simplifies management because each data mart user can create and manage materialized views on their own Redshift warehouse rather than depending on the producer warehouse.

Table distribution implications in Amazon Redshift data sharing

Table distribution style and how data is distributed across Amazon Redshift play a significant role in query performance. In Amazon Redshift data sharing, the data is distributed on the producer Redshift warehouse according to the distribution style defined for the table. When we associate the data via a data share to the consumer Redshift warehouse, it maps to the same disk block layout. Also, a bigger consumer Redshift warehouse will result in better query performance for queries running on it.

Concurrency scaling

Concurrency scaling is also supported on both producer and consumer Redshift warehouses for read and write operations.

Cost and resource management

Given that multiple source systems and users will interact heavily with the data vault data warehouse, it’s a prudent best practice to enable usage and query limits to serve as guardrails against runaway queries and unapproved usage patterns. Furthermore, it often helps to have a systematic way for allocating service costs based on usage of the data vault to different source systems and user groups within your organization.

The following are common cost and resource management requirements for Data Vault implementations at scale:

  • Utilization limits and query resource guardrails
  • Advanced workload management
  • Chargeback capabilities

Amazon Redshift features and capabilities for cost and resource management

In this section, we discuss Amazon Redshift features and capabilities that address those cost and resource management requirements.

Utilization limits and query monitoring rules

Runaway queries and excessive auto scaling are the two most common runaway cost patterns observed with Data Vault implementations at scale.

A Redshift provisioned cluster supports usage limits for features such as Redshift Spectrum, concurrency scaling, and cross-Region data sharing. A concurrency scaling limit specifies the threshold of the total amount of time used by concurrency scaling in 1-minute increments. A limit can be specified for a daily, weekly, or monthly period (using UTC to determine the start and end of the period).

You can also define multiple usage limits for each feature. Each limit can have a different action, such as logging to system tables, alerting via Amazon CloudWatch alarms and optionally Amazon Simple Notification Service (Amazon SNS) subscriptions to that alarm (such as email or text), or disabling the feature outright until the next time period begins (such as the start of the month). When a usage limit threshold is reached, events are also logged to a system table.
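As a sketch of how such a limit might be created with the AWS SDK, the following caps concurrency scaling usage on a provisioned cluster; the cluster identifier and thresholds are placeholder values.

import boto3

redshift = boto3.client("redshift")

# Cap concurrency scaling at 60 minutes per day on a provisioned cluster and
# log the breach to system tables when the threshold is reached.
redshift.create_usage_limit(
    ClusterIdentifier="data-vault-rdv",   # placeholder cluster name
    FeatureType="concurrency-scaling",    # also: "spectrum", "cross-region-datasharing"
    LimitType="time",                     # concurrency scaling is measured in minutes
    Amount=60,                            # 60 minutes per period
    Period="daily",                       # "daily" | "weekly" | "monthly"
    BreachAction="log",                   # "log" | "emit-metric" | "disable"
)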

Redshift provisioned clusters also support query monitoring rules to define metrics-based performance boundaries for workload management queues and the action that should be taken when a query goes beyond those boundaries. For example, for a queue dedicated to short-running queries, you might create a rule that cancels queries that run for more than 60 seconds. To track poorly designed queries, you might have another rule that logs queries that contain nested loops.

Each query monitoring rule includes up to three conditions, or predicates, and one query action (such as stop, hop, or log). A predicate consists of a metric, a comparison condition (=, <, or >), and a value. If all of the predicates for any rule are met, that rule’s action is triggered. Amazon Redshift evaluates metrics every 10 seconds and if more than one rule is triggered during the same period, Amazon Redshift initiates the most severe action (stop, then hop, then log).
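The following sketch shows roughly what the two example rules could look like when defined through the cluster parameter group's wlm_json_configuration parameter. The parameter group name, queue settings, and rule details are placeholders; check the exact rule schema against the Amazon Redshift WLM documentation before applying it.

import json
import boto3

redshift = boto3.client("redshift")

# One manual WLM queue with two query monitoring rules:
#  - abort queries running longer than 60 seconds
#  - log queries that contain nested loop joins
wlm_config = [
    {
        "query_concurrency": 5,
        "rules": [
            {
                "rule_name": "abort_long_running",
                "predicate": [
                    {"metric_name": "query_execution_time", "operator": ">", "value": 60}
                ],
                "action": "abort",
            },
            {
                "rule_name": "log_nested_loops",
                "predicate": [
                    {"metric_name": "nested_loop_join_row_count", "operator": ">", "value": 0}
                ],
                "action": "log",
            },
        ],
    },
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="data-vault-wlm",  # placeholder parameter group
    Parameters=[
        {
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
        }
    ],
)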

With Redshift Serverless, you specify the base capacity according to your price-performance requirements, and you can also define usage limits. You can set the maximum RPU (Redshift Processing Units) hours used per day, per week, or per month to keep the cost predictable, and specify different actions when the limit is reached, such as writing to a system table, receiving an alert, or turning off user queries. A cross-Region data sharing usage limit is also supported, which caps the amount of data transferred from the producer Region that consumers in the consumer Region can query.

You can also specify query limits in Redshift Serverless to stop poorly performing queries that exceed the threshold value.
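A comparable sketch for Redshift Serverless follows; the workgroup ARN, thresholds, and actions are placeholder values.

import boto3

serverless = boto3.client("redshift-serverless")

# Cap RPU-hours per day on a Serverless workgroup and turn off user queries when breached.
serverless.create_usage_limit(
    resourceArn="arn:aws:redshift-serverless:us-east-1:111122223333:workgroup/example-id",  # placeholder
    usageType="serverless-compute",      # or "cross-region-datasharing"
    amount=200,                          # RPU-hours per period
    period="daily",                      # "daily" | "weekly" | "monthly"
    breachAction="deactivate",           # "log" | "emit-metric" | "deactivate"
)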

Auto workload management

Not all queries have the same performance profile or priority, and data vault queries are no different. Amazon Redshift workload management (WLM) adapts in real time to the priority, resource allocation, and concurrency settings required to optimally run different data vault queries. These queries could consist of a high number of joins between the hubs, links, and satellites tables; large-scale scans of the satellite tables; or massive aggregations. Amazon Redshift WLM enables you to flexibly manage priorities within workloads so that, for example, short or fast-running queries won’t get stuck in queues behind long-running queries.

You can use automatic WLM to maximize system throughput and use resources effectively. You can enable Amazon Redshift to manage how resources are divided to run concurrent queries with automatic WLM. Automatic WLM manages the resources required to run queries. Amazon Redshift determines how many queries run concurrently and how much memory is allocated to each dispatched query.

Chargeback metadata

Amazon Redshift provides different pricing models to cater to different customer needs. On-demand pricing offers the greatest flexibility, whereas Reserved Instances provide significant discounts for predictable and steady usage scenarios. Redshift Serverless provides a pay-as-you-go model that is ideal for sporadic workloads.

With any of these pricing models, Amazon Redshift customers can attribute cost to different users. To start, Amazon Redshift provides itemized billing, like many other AWS services, in AWS Cost Explorer to obtain the overall cost of using Amazon Redshift. Moreover, the cross-group collaboration (data sharing) capability of Amazon Redshift enables a more explicit and structured chargeback model for different teams.

Availability

In the modern data organization, data warehouses are no longer used purely to perform historical analysis in batches overnight with relatively forgiving SLAs, Recovery Time Objectives (RTOs), and Recovery Point Objectives (RPOs). They have become mission-critical systems in their own right that are used for both historical analysis and near-real-time data analysis. Data Vault systems at scale very much fit that mission-critical profile, which makes availability key.

The following are common availability requirements for Data Vault implementations at scale:

  • RTO of near-zero
  • RPO of near-zero
  • Automated failover
  • Advanced backup management
  • Commercial-grade SLA

Amazon Redshift features and capabilities for availability

In this section, we discuss the features and capabilities in Amazon Redshift that address those availability requirements.

Separation of storage and compute

AWS and Amazon Redshift are inherently resilient. With Amazon Redshift, there’s no additional cost for active-passive disaster recovery. Amazon Redshift replicates all of your data within your data warehouse when it is loaded and also continuously backs up your data to Amazon S3. Amazon Redshift always attempts to maintain at least three copies of your data (the original and replica on the compute nodes, and a backup in Amazon S3).

With separation of storage and compute and Amazon S3 as the persistence layer, you can achieve an RPO of near-zero, if not zero itself.

Cluster relocation to another Availability Zone

Amazon Redshift provisioned RA3 clusters support cluster relocation to another Availability Zone in events where cluster operation in the current Availability Zone is not optimal, without any data loss or changes to your application. Cluster relocation is available free of charge; however, relocation might not always be possible if there is a resource constraint in the target Availability Zone.

Multi-AZ deployment

For many customers, the cluster relocation feature is sufficient; however, enterprise data warehouse customers require a low RTO and higher availability to support their business continuity with minimal impact to applications.

Amazon Redshift supports Multi-AZ deployment for provisioned RA3 clusters.

A Redshift Multi-AZ deployment uses compute resources in multiple Availability Zones to scale data warehouse workload processing as well as provide an active-active failover posture. In situations where there is a high level of concurrency, Amazon Redshift will automatically use the resources in both Availability Zones to scale the workload for both read and write requests using active-active processing. In cases where there is a disruption to an entire Availability Zone, Amazon Redshift will continue to process user requests using the compute resources in the sister Availability Zone.

With features such as multi-AZ deployment, you can achieve a low RTO, should there ever be a disruption to the primary Redshift cluster or an entire Availability Zone.

Automated backup

Amazon Redshift automatically takes incremental snapshots that track changes to the data warehouse since the previous automated snapshot. Automated snapshots retain all of the data required to restore a data warehouse from a snapshot. You can create a snapshot schedule to control when automated snapshots are taken, or you can take a manual snapshot any time.

Automated snapshots can be taken as often as once every hour and retained for up to 35 days at no additional charge to the customer. Manual snapshots can be kept indefinitely at standard Amazon S3 rates. Furthermore, automated snapshots can be automatically replicated to another Region and stored there as a disaster recovery site, also at no additional charge (with the exception of cross-Region data transfer charges). Manual snapshots can also be replicated to another Region, with standard Amazon S3 rates and data transfer costs applying.
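The following sketch configures hourly automated snapshots and cross-Region snapshot copy through the SDK; the cluster identifier, schedule name, destination Region, and retention values are placeholders.

import boto3

redshift = boto3.client("redshift")

# Take automated snapshots every hour and attach the schedule to the cluster.
redshift.create_snapshot_schedule(
    ScheduleIdentifier="hourly-snapshots",
    ScheduleDefinitions=["rate(1 hour)"],
)
redshift.modify_cluster_snapshot_schedule(
    ClusterIdentifier="data-vault-rdv",          # placeholder cluster name
    ScheduleIdentifier="hourly-snapshots",
)

# Replicate automated snapshots to a disaster recovery Region with a 7-day retention.
redshift.enable_snapshot_copy(
    ClusterIdentifier="data-vault-rdv",
    DestinationRegion="us-west-2",
    RetentionPeriod=7,
)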

Amazon Redshift SLA

As a managed service, Amazon Redshift frees you from being the first and only line of defense against disruptions. AWS will use commercially reasonable efforts to make Amazon Redshift available with a monthly uptime percentage, during any monthly billing cycle, of at least 99.99% for each Multi-AZ Redshift cluster and at least 99.9% for each multi-node cluster. In the event that Amazon Redshift doesn't meet the Service Commitment, you will be eligible to receive a Service Credit.

Scalability

One of the major motivations of organizations migrating to the cloud is improved and increased scalability. With Amazon Redshift, Data Vault systems will always have a number of scaling options available to them.

The following are common scalability requirements for Data Vault implementations at scale:

  • Automated and fast-initiating horizontal scaling
  • Robust and performant vertical scaling
  • Data reuse and sharing mechanisms

Amazon Redshift features and capabilities for scalability

In this section, we discuss the features and capabilities in Amazon Redshift that address those scalability requirements.

Horizontal and vertical scaling

Amazon Redshift uses concurrency scaling automatically to support virtually unlimited horizontal scaling of concurrent users and concurrent queries, with consistently fast query performance. Furthermore, concurrency scaling requires no downtime, supports read/write operations, and is typically the most impactful and used scaling option for customers during normal business operations to maintain consistent performance.

With an Amazon Redshift provisioned warehouse, as your data warehousing capacity and performance needs change or grow, you can vertically scale your cluster to make the best use of the computing and storage options that Amazon Redshift provides. Resizing your cluster by changing the node type or number of nodes can typically be achieved in 10–15 minutes. Vertical scaling typically occurs much less frequently, in response to persistent and organic growth, and is typically performed during a planned maintenance window when the short downtime doesn't impact business operations.

Explicit horizontal or vertical resize and pause operations can be automated per a schedule (for example, development clusters can be automatically scaled down or paused for the weekends). Note that the storage of paused clusters remains accessible to clusters with which their data was shared.
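As an example of scheduling such operations, the following sketch pauses a development cluster on Friday evenings and resumes it on Monday mornings; the cluster name, IAM role, and cron expressions are placeholders.

import boto3

redshift = boto3.client("redshift")
scheduler_role = "arn:aws:iam::111122223333:role/RedshiftSchedulerRole"  # placeholder role

# Pause the development cluster every Friday at 22:00 UTC.
redshift.create_scheduled_action(
    ScheduledActionName="pause-dev-for-weekend",
    TargetAction={"PauseCluster": {"ClusterIdentifier": "data-vault-dev"}},
    Schedule="cron(0 22 ? * FRI *)",
    IamRole=scheduler_role,
    Enable=True,
)

# Resume it every Monday at 06:00 UTC.
redshift.create_scheduled_action(
    ScheduledActionName="resume-dev-for-week",
    TargetAction={"ResumeCluster": {"ClusterIdentifier": "data-vault-dev"}},
    Schedule="cron(0 6 ? * MON *)",
    IamRole=scheduler_role,
    Enable=True,
)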

For resource-intensive workloads that might benefit from a vertical scaling operation vs. concurrency scaling, there are also other best-practice options that avoid downtime, such as deploying the workload onto its own Redshift Serverless warehouse while using data sharing.

Redshift Serverless measures data warehouse capacity in RPUs, which are resources used to handle workloads. You can specify the base data warehouse capacity Amazon Redshift uses to serve queries (ranging from as little as 8 RPUs to as high as 512 RPUs) and change the base capacity at any time.

Data sharing

Amazon Redshift data sharing is a secure and straightforward way to share live data for read purposes across Redshift warehouses within the same or different accounts and Regions. This enables high-performance data access while preserving workload isolation. You can have separate Redshift warehouses, either provisioned or serverless, for different use cases according to your compute requirement and seamlessly share data between them.

Common use cases for data sharing include setting up a central ETL warehouse to share data with many BI warehouses for read workload isolation and chargeback, offering data as a service to external consumers or to multiple business groups within an organization, sharing and collaborating on data to gain differentiated insights, and sharing data between development, test, and production environments.
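As a sketch of the basic mechanics, the following statements (run here through the Redshift Data API, with placeholder warehouse, namespace, and object names) create a data share on the producer and consume it on the consumer.

import boto3

data_api = boto3.client("redshift-data")

# On the producer warehouse: create the share and add the Business Data Vault schema.
producer_sqls = [
    "CREATE DATASHARE bdv_share;",
    "ALTER DATASHARE bdv_share ADD SCHEMA bdv;",
    "ALTER DATASHARE bdv_share ADD ALL TABLES IN SCHEMA bdv;",
    # Placeholder consumer namespace GUID
    "GRANT USAGE ON DATASHARE bdv_share TO NAMESPACE '11111111-2222-3333-4444-555555555555';",
]
data_api.batch_execute_statement(
    WorkgroupName="data-vault-bdv", Database="dev", Sqls=producer_sqls
)

# On the consumer warehouse: create a local database that points at the share.
consumer_sql = """
CREATE DATABASE data_vault_share
FROM DATASHARE bdv_share OF NAMESPACE 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee';
"""
data_api.execute_statement(
    WorkgroupName="data-mart-wg", Database="dev", Sql=consumer_sql
)

After the consumer database exists, queries reference the shared objects with three-part notation (for example, data_vault_share.bdv.hub_customer), as in the materialized view example earlier in this post.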

Reference architecture

The diagram in this section shows one possible reference architecture of a Data Vault 2.0 system implemented with Amazon Redshift.

We suggest using three different Redshift warehouses to run a Data Vault 2.0 model in Amazon Redshift. The data between these data warehouses is shared via Amazon Redshift data sharing, which allows you to consume data from a consumer data warehouse even if the producer data warehouse is inactive.

  • Raw Data Vault – The RDV data warehouse hosts hubs, links, and satellite tables. For large models, you can additionally slice the RDV into additional data warehouses to better adapt the data warehouse sizing to your workload patterns. Data is ingested via the patterns described in the previous section as batch or high-velocity data.
  • Business Data Vault – The BDV data warehouse hosts bridge and point in time (PIT) tables. These tables are computed based on the RDV tables using Amazon Redshift. Materialized or automatic materialized views are straightforward mechanisms to create those.
  • Consumption cluster – This data warehouse contains query-optimized data formats and marts. Users interact with this layer.

If the workload pattern is unknown, we suggest starting with a Redshift Serverless warehouse and learning the workload pattern. You can easily migrate between a serverless and provisioned Redshift cluster at a later stage based on your processing requirements, as discussed in Part 1 of this series.

Best practices building a Data Vault warehouse on AWS

In this section, we cover how the AWS Cloud as a whole plays its role in building an enterprise-grade Data Vault warehouse on Amazon Redshift.

Education

Education is a fundamental success factor. Data Vault is more complex than traditional data modeling methodologies. Before you start the project, make sure everyone understands the principles of Data Vault. Amazon Redshift is designed to be very easy to use, but to ensure the most optimal Data Vault implementation on Amazon Redshift, gaining a good understanding of how Amazon Redshift works is recommended. Start with free resources like reaching out to your AWS account representative to schedule a free Amazon Redshift Immersion Day or train for the AWS Analytics specialty certification.

Automation

Automation is a major benefit of Data Vault. This will increase efficiency and consistency across your data landscape. Most customers focus on the following aspects when automating Data Vault:

  • Automated DDL and DML creation, including modeling tools especially for the raw data vault
  • Automated ingestion pipeline creation
  • Automated metadata and lineage support

Depending on your needs and skills, we typically see three different approaches:

  • DSL – A common approach is to generate data vault models and flows with a domain-specific language (DSL). Popular frameworks for building such DSLs are EMF with Xtext, or MPS. This approach provides the most flexibility. You directly build your business vocabulary into the application and generate documentation and a business glossary along with the code. This approach requires the most skill and the biggest resource investment.
  • Modeling tool – You can build on an existing modeling language like UML 2.0. Many modeling tools come with code generators. Therefore, you don’t need to build your own tool, but these tools are often hard to integrate into modern DevOps pipelines. They also require UML 2.0 knowledge, which raises the bar for non-tech users.
  • Buy – There are a number of different third-party solutions that integrate well into Amazon Redshift and are available via AWS Marketplace.

Whichever of these approaches you choose, all three offer multiple benefits. For example, you can take repetitive tasks away from your development team and enforce modeling standards like data types, data quality rules, and naming conventions. To generate the code and deploy it, you can use AWS DevOps services. As part of this process, you save the generated metadata to the AWS Glue Data Catalog, which serves as a central technical metadata catalog. You then deploy the generated code to Amazon Redshift (SQL scripts) and to AWS Glue.

We designed AWS CloudFormation for automation; it's the AWS-native way of automating infrastructure creation and management. A major use case for infrastructure as code (IaC) is to create new ingestion pipelines for new data sources or add new entities to existing ones.

You can also use our new AI coding tool Amazon CodeWhisperer, which helps you quickly write secure code by generating whole line and full function code suggestions in your IDE in real time, based on your natural language comments and surrounding code. For example, CodeWhisperer can automatically take a prompt such as “get new files uploaded in the last 24 hours from the S3 bucket” and suggest appropriate code and unit tests. This can greatly reduce development effort in writing code, for example for ETL pipelines or generating SQL queries, and allow more time for implementing new ideas and writing differentiated code.

Operations

As previously mentioned, one of the benefits of Data Vault is the high level of automation which, in conjunction with serverless technologies, can lower operating effort. On the other hand, some industry products come with built-in schedulers or orchestration tools, which might increase operational complexity. By using AWS-native services, you'll benefit from the integrated monitoring options of all AWS services.

Conclusion

In this series, we discussed a number of crucial areas required for implementing a Data Vault 2.0 system at scale, and the Amazon Redshift capabilities and AWS ecosystem that you can use to satisfy those requirements. There are many more Amazon Redshift capabilities and features that will surely come in handy, and we strongly encourage current and prospective customers to reach out to us or other AWS colleagues to delve deeper into Data Vault with Amazon Redshift.


About the Authors

Asser Moustafa is a Principal Analytics Specialist Solutions Architect at AWS based out of Dallas, Texas. He advises customers globally on their Amazon Redshift and data lake architectures, migrations, and visions—at all stages of the data ecosystem lifecycle—starting from the POC stage to actual production deployment and post-production growth.

Philipp Klose is a Global Solutions Architect at AWS based in Munich. He works with enterprise FSI customers and helps them solve business problems by architecting serverless platforms. In his free time, Philipp spends time with his family and enjoys every geek hobby possible.

Saman Irfan is a Specialist Solutions Architect at Amazon Web Services. She focuses on helping customers across various industries build scalable and high-performant analytics solutions. Outside of work, she enjoys spending time with her family, watching TV series, and learning new technologies.

Building sensitive data remediation workflows in multi-account AWS environments

Post Syndicated from Darren Roback original https://aws.amazon.com/blogs/security/building-sensitive-data-remediation-workflows-in-multi-account-aws-environments/

The rapid growth of data has empowered organizations to develop better products, more personalized services, and deliver transformational outcomes for their customers. As organizations use Amazon Web Services (AWS) to modernize their data capabilities, they can sometimes find themselves with data spread across several AWS accounts, each aligned to distinct use cases and business units. This can present a challenge for security professionals, who need not only a mechanism to identify sensitive data types—such as protected health information (PHI), payment card industry (PCI), and personally identifiable information (PII), or organizational intellectual property—stored on AWS, but also the ability to automatically act upon these findings through custom logic that supports organizational policies and regulatory requirements.

In this blog post, we present a solution that provides you with visibility into sensitive data residing across a fleet of AWS accounts through a ChatOps-style notification mechanism using Microsoft Teams, which also provides contextual information needed to conduct security investigations. This solution also incorporates a decision logic mechanism to automatically quarantine sensitive data assets while they’re pending review, which can be tailored to meet unique organizational, industry, or regulatory environment requirements.

Prerequisites

Before you proceed, ensure that you have the following within your environment:

Assumptions

Things to know about the solution in this blog post:

Solution overview

The solution architecture and overall workflow are detailed in Figure 1 that follows.

Figure 1: Solution overview

Upon discovering sensitive data in member accounts, this solution selectively quarantines objects based on their finding severity and public status. This logic can be customized to evaluate additional details such as the finding type or number of sensitive data occurrences detected. The ability to adjust this workflow logic can provide a custom-tailored solution to meet a variety of industry use cases, helping you adhere to industry-specific security regulations and frameworks.

Figure 1 provides an overview of the components used in this solution, and for illustrative purposes we step through them here.

Automated scanning of buckets

Macie supports various scope options for sensitive data discovery jobs, including the use of bucket tags to determine in-scope buckets for scanning. Setting up automated scanning includes the use of an AWS Organizations tag policy to verify that the S3 buckets deployed in your chosen OUs conform to tagging requirements, an AWS Config rule to automatically check that tags have been applied correctly, an Amazon EventBridge rule and bus to receive compliance events from a fleet of member accounts, and an AWS Lambda function to notify administrators of compliance change events.

  1. An AWS Organizations tag policy verifies that the S3 buckets created in the in-scope AWS accounts have a tag structure that facilitates automated scanning and identification of the bucket owner. Specific tags enforced with this tag policy include RequireMacieScan : True|False and BucketOwner : <variable>. The tag policy is enforced at the OU level (see the policy sketch after this list).
  2. A custom AWS Config rule evaluates whether these tags have been applied to all in-scope S3 buckets, and marks resources as compliant or not compliant.
  3. After every evaluation of S3 bucket tag compliance, AWS Config will send compliance events to an EventBridge event bus in the same AWS member account.
  4. An EventBridge rule is used to send compliance messages from member accounts to a centralized security account, which is used for operational administration of the solution.
  5. An EventBridge rule in the centralized security account is used to send all tag compliance events to a Lambda serverless function, which is used for notification.
  6. The tag compliance notification function receives the tag compliance events from EventBridge, parses the information, and posts it to a Microsoft Teams webhook. This provides you with details such as the affected bucket, compliance status, and bucket owner, along with a link to investigate the compliance event in AWS Config.
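To illustrate what the tag policy in step 1 enforces, here is a hedged sketch of creating and attaching an equivalent policy with the SDK. The CloudFormation stack described later in this post performs this for you; the policy name, tag values, and OU ID shown here are placeholders.

import json
import boto3

org = boto3.client("organizations")

# Require the RequireMacieScan and BucketOwner tags on S3 buckets in the targeted OUs.
tag_policy = {
    "tags": {
        "RequireMacieScan": {
            "tag_key": {"@@assign": "RequireMacieScan"},
            "tag_value": {"@@assign": ["True", "False"]},
            "enforced_for": {"@@assign": ["s3:bucket"]},
        },
        "BucketOwner": {
            "tag_key": {"@@assign": "BucketOwner"},
        },
    }
}

policy = org.create_policy(
    Name="macie-scan-tagging",
    Description="Tag requirements for automated Macie scanning",
    Type="TAG_POLICY",
    Content=json.dumps(tag_policy),
)

# Attach the policy to an OU (placeholder OU ID).
org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-abcd-11111111",
)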

Detecting sensitive data

Macie is a key component of this solution and facilitates sensitive data discovery across a fleet of AWS accounts based on the RequireMacieScan tag value configured at the time of bucket creation. Setting up sensitive data detection includes using Macie for sensitive data discovery, an EventBridge rule and bus to receive these finding events from the Macie delegated administrator account, an AWS Step Functions state machine to process these findings, and a Lambda function to notify administrators of new findings.

  1. Macie is used to detect sensitive data in member account S3 buckets with job definitions based on bucket tags, and central reporting of findings through a Macie delegated administrator account (that is, the central security account).
  2. When new findings are detected, Macie will send finding events to an EventBridge event bus in the security account.
  3. An EventBridge rule is used to send all finding events to AWS Step Functions, which runs custom business logic to evaluate each finding and determine the next action to take on the affected object.
  4. In the default configuration, for findings that are of low or medium severity and in objects that are not publicly accessible, Step Functions sends finding event information to a Lambda serverless function, which is used for notification. You can alter this event processing logic in Step Functions to change which finding types initiate an automated quarantine of the object.
  5. The Macie finding notification function receives the finding events from Step Functions, parses this information, and posts it to a Microsoft Teams webhook. This presents you with details such as the affected bucket and AWS account, finding ID and severity, public status, and bucket owner, along with a link to investigate the finding in Macie.
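As a rough sketch of what that notification function does, the following Lambda handler posts a few finding fields to a Teams incoming webhook. It assumes the function receives the EventBridge event with the Macie finding under detail and reads the webhook URL from an environment variable; the deployed function builds a richer message card, and the variable and field names here are assumptions.

import json
import os
import urllib.request

TEAMS_WEBHOOK_URL = os.environ["MACIE_FINDING_WEBHOOK_URL"]  # assumed environment variable

def lambda_handler(event, context):
    # Pull a few fields from the Macie finding for the notification text.
    finding = event["detail"]
    message = {
        "text": (
            f"Macie finding {finding['id']} "
            f"(severity: {finding['severity']['description']}) "
            f"in bucket {finding['resourcesAffected']['s3Bucket']['name']} "
            f"of account {finding['accountId']}."
        )
    }
    request = urllib.request.Request(
        TEAMS_WEBHOOK_URL,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return {"statusCode": response.status}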

Automated quarantine

As outlined in the previous section, this solution uses event processing decision logic in Step Functions to determine whether to quarantine a sensitive object in place. Setting up automated quarantine includes a Step Functions state machine for Macie finding processing logic, an Amazon DynamoDB table to track items moved to quarantine, a Lambda function to notify administrators of quarantine, and an Amazon API Gateway and second Step Functions state machine to facilitate remediation or removal of objects from quarantine.

  1. In the default configuration for findings that are of high severity, or in objects that are publicly accessible, Step Functions adds the affected object’s metadata to an Amazon DynamoDB table, which is used to track quarantine and remediation status at scale.
  2. Step Functions then quarantines the affected object, moving it to an automatically configured and secured prefix in the same S3 bucket while the finding is being investigated. Only select personnel (that is, cybersecurity) have access to the object.
  3. Step Functions then sends finding event information to a Lambda serverless function, which is used for notification.
  4. The Macie quarantine notification function receives the finding events from Step Functions, parses this information, and posts it to a Microsoft Teams webhook. This presents you with similar details to the Macie finding notification function, but also notes the object has been moved to quarantine, and provides a one-click option to release the object from quarantine.
  5. In the Microsoft Teams channel, which should only be accessible to qualified security professionals, DLP administrators select the option to release the object from quarantine. This invokes a REST API deployed on API Gateway.
  6. API Gateway invokes a release object workflow in Step Functions, which begins to process releasing the affected object from quarantine.
  7. Step Functions inspects the affected object ID received through API Gateway, and first retrieves details about this object from DynamoDB. Upon receiving these details, the object is removed from the quarantine tracking database.
  8. Step Functions then moves the affected object to its original location in Amazon S3, thereby making the object accessible again under the original bucket policy and IAM permissions.

Organization structure

As mentioned in the prerequisites, the solution uses a set of AWS accounts that have been joined to an organization, which is shown in Figure 2. While the logical structure of your AWS Organizations deployment can differ from what’s shown, for illustration purposes, we’re looking for sensitive data in the Development and Production AWS accounts, and the examples shown throughout the remainder of this blog post reflect that.

Figure 2: Organization structure

Deploy the solution

The overall deployment process of this solution has been decomposed into three AWS CloudFormation templates to be deployed to your management, security, and member accounts as CloudFormation stacks and StackSets, respectively. Performing the deployment in this manner not only ensures that the solution is extended to other member accounts created after the initial solution deployment, but also illustrates the components deployed in each portion of the solution. An overview of the deployment process is as follows:

  1. Set up of Microsoft Teams webhooks to receive information from this solution.
  2. Deployment of a CloudFormation stack to the management account to configure the tag policy for this solution.
  3. Deployment of a CloudFormation stack set to member accounts to enable monitoring of S3 bucket tags and forwarding of tag compliance events to the security account.
  4. Deployment of a CloudFormation stack to the security account to configure the remainder of the solution that will facilitate sensitive data discovery, finding event processing, administrator notification, and release from quarantine functionality.
  5. Remaining manual configuration of quarantine remediation API authorization, enabling Macie, and specifying a Macie results location.

Set up Microsoft Teams

Before deploying the solution, you must create two incoming webhooks in a Microsoft Teams channel of your choice. Due to the sensitivity of the event information provided and the ability to release objects from quarantine, we recommend that this channel only be accessible to information security professionals. In addition, we recommend creating two distinct webhooks to distinguish tag compliance events from finding events and have named the webhooks in the examples S3-Tag-Compliance-Notification and Macie-Finding-Notification, respectively. A complete walkthrough of creating an incoming webhook is out of scope for this blog post, but you can access Microsoft’s documentation on creating incoming webhooks for an overview. After the webhooks have been created, save the URLs, to use in the solution deployment process.

Configure the management account

The first step of the deployment process is to deploy a CloudFormation stack in your management account that creates an AWS Organizations tag policy and applies it to the OUs of your choice. Before performing this step, note the two OU IDs you will apply the policy to, as these will be captured as input parameters for the CloudFormation stack.

  1. Choose the following Launch Stack button to open the CloudFormation console pre-loaded with the template for this step:

    Launch Stack

    Note: The stack will launch in the N. Virginia (us-east-1) Region. To deploy this solution into other AWS Regions, change your regional selection in the CloudFormation console, and deploy it to the selected Region.

  2. For Stack name, enter ConfigureManagementAccount.
  3. For First AWS Organizations Unit ID, enter your first OU ID.
  4. For Second AWS Organizations Unit ID, enter your second OU ID.
  5. Choose Next.
    Figure 3: Management account stack details

After you’ve entered all details, launch the stack and wait until the stack has reached CREATE_COMPLETE status before proceeding. The deployment process will take 1–2 minutes.

Configure the member accounts

The next step of the deployment process is to deploy a CloudFormation Stack, which will initiate a StackSet deployment from your management account that’s scoped to the OUs of your choice. This stack set will enable AWS Config along with an AWS Config rule to evaluate Amazon S3 tag compliance and will deploy an EventBridge rule to send compliance events from AWS Config in your member accounts to a centralized event bus in your security account. If AWS Config has previously been enabled, select True for the AWS Config Status parameter to help prevent an overwrite of your existing settings. Prior to performing this setup, note the two AWS Organizations OU IDs you will deploy the stack set to. You will also be prompted to enter the AWS account ID and Region of your security account.

  1. Choose the following Launch Stack button to open the CloudFormation console pre-loaded with the template for this step:

    Launch Stack

    Note: The stack will launch in the N. Virginia (us-east-1) Region. To deploy this solution into other AWS Regions, change your regional selection in the CloudFormation console, and deploy it to the selected Region.

  2. For Stack name, enter DeployToMemberAccounts.
  3. For First AWS Organizations Unit ID, enter your first OU ID.
  4. For Second AWS Organizations Unit ID, enter your second OU ID.
  5. For Deployment Region, enter the Region you want to deploy the Stack set to.
  6. For AWS Config Status, accept the default value of false if you have not previously enabled AWS Config in your accounts.
  7. For Support all resource types, accept the default value of false.
  8. For Include global resource types, accept the default value of false.
  9. For List of resource types if not all supported, accept the default value of AWS::S3::Bucket.
  10. For Configuration delivery channel name, accept the default value of <Generated>.
  11. For Snapshot delivery frequency, accept the default value of 1 hour.
  12. For Security account ID, enter your security account ID.
  13. For Security account region, select the Region of your security account.
  14. Choose Next.
    Figure 4: Member account Stack details

After you’ve entered all details, launch the stack and wait until the stack has reached CREATE_COMPLETE status before proceeding. The deployment process will take 3–5 minutes.

Configure the security account

The next step of the deployment process involves deploying a CloudFormation stack in your security account that creates all the resources needed for at-scale sensitive data detection, automated quarantine, and security professional notification and response. This stack configures the following:

  • An S3 bucket and AWS Key Management Service (AWS KMS) keys for storage and encryption of discovery results in Macie.
  • Two rules in EventBridge for routing tag compliance and Macie finding events.
  • Two Step Functions state machines whose logic will be used for automated object quarantine and release.
  • Three Lambda functions for tag compliance, Macie findings, and quarantine notification.
  • A DynamoDB table for quarantine status tracking.
  • An API Gateway REST endpoint to facilitate the release of objects from quarantine.

Before performing this setup, note your AWS Organizations ID and two Microsoft Teams webhook URLs previously configured.

  1. Choose the following Launch Stack button to open the CloudFormation console pre-loaded with the template for this step:

    Launch Stack

    Note: The stack will launch in the N. Virginia (us-east-1) Region. To deploy this solution into other AWS Regions, change your regional selection in the CloudFormation console, and deploy it to the selected Region.

  2. For Stack name, enter ConfigureSecurityAccount.
  3. For AWS Org ID, enter your AWS Organizations ID.
  4. For Webhook URI for S3 tag compliance notifications, enter the first webhook URL you created in Microsoft Teams.
  5. For Webhook URI for Macie finding and quarantine notifications, enter the second webhook URL you created in Microsoft Teams.
  6. Choose Next.
    Figure 5: Security account stack details

After you’ve entered all details, launch the stack and wait until the stack has reached CREATE_COMPLETE status before proceeding. The deployment process will take 3–5 minutes.

Remaining configuration

While most of the solution is deployed automatically using CloudFormation, there are a few items that you must configure manually.

Configure Lambda API key environment variable

When the CloudFormation stack was deployed to the security account, CloudFormation created a new REST API for security professionals to release objects from quarantine. This API was configured with an API key to be used for authorization, and you must retrieve the value of this API key and set it as an environment variable in your MacieQuarantineNotification function, which was also deployed in the security account. To retrieve the value of this API key, navigate to the REST API created in the security account, select API Keys, and retrieve the value of APIKey1. Next, navigate to the MacieQuarantineNotification function in the Lambda console, and set the ReleaseObjectApiKey environment variable to the value of your API key.
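If you prefer to script this step, the following sketch achieves the same result as the console steps above; the function name and API key name are placeholders that should match what the stack actually deployed in your account.

import boto3

apigw = boto3.client("apigateway")
lambda_client = boto3.client("lambda")

# Look up the value of the API key created by the security account stack.
keys = apigw.get_api_keys(nameQuery="APIKey1", includeValues=True)
api_key_value = keys["items"][0]["value"]

# Merge the key into the notification function's existing environment variables.
function_name = "MacieQuarantineNotification"  # placeholder; use the deployed function's name
current = lambda_client.get_function_configuration(FunctionName=function_name)
env_vars = current.get("Environment", {}).get("Variables", {})
env_vars["ReleaseObjectApiKey"] = api_key_value

lambda_client.update_function_configuration(
    FunctionName=function_name,
    Environment={"Variables": env_vars},
)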

Enable Macie

Next, you must enable Macie to facilitate sensitive data discovery in selected accounts in your organization, and this process begins with the selection of a delegated administrator account (that is, the security account), followed by onboarding the member accounts you want to test with. See Integrating and configuring an organization in Amazon Macie for detailed instructions on enabling Macie in your organization.

Configure the Macie results bucket and KMS key

Macie creates an analysis record for each Amazon S3 object that it analyzes. This includes objects where Macie has detected sensitive data as well as objects where sensitive data was not detected or that Macie could not analyze. The CloudFormation stack deployed in the security account created an S3 bucket and KMS key for this, and they are noted as MacieResultsS3BucketName and MacieResultsKmsKeyAlias in the CloudFormation stack output. Use these resources to configure the Macie results bucket and KMS key in the security account according to Storing and retaining sensitive data discovery results with Amazon Macie. Customization of the S3 bucket policy or KMS key policy has already been done for you as part of the ConfigureSecurityAccount CloudFormation template deployed earlier.

Validate the solution

With the solution fully deployed, you now need to deploy an S3 bucket with sample data to test the solution and review the findings.

Create a member account S3 bucket

In any of the member accounts onboarded into this solution as part of the Configure the member accounts step, deploy a new S3 bucket and the KMS key used to encrypt the bucket using the CloudFormation template that follows. Before performing this step, note the InvestigateMacieFindingsRole, StepFunctionsProcessFindingsRole, and StepFunctionsReleaseObjectRole outputs from the CloudFormation template deployed to the security account, as these will be captured as input parameters for the CloudFormation stack.

  1. Choose the following Launch Stack button to open the CloudFormation console pre-loaded with the template for this step:

    Launch Stack

    Note: The stack will launch in the N. Virginia (us-east-1) Region. To deploy this solution into other AWS Regions, change your regional selection in the CloudFormation console, and deploy it to the selected Region.

  2. For Stack name, enter DeployS3BucketKmsKey.
  3. For IAM Role used by Step Functions to process Macie findings, enter the ARN that was provided to you as the StepFunctionsProcessFindingsRole output from the Configure security account step.
  4. For IAM Role used by Step Functions to release objects from quarantine, enter the ARN that was provided to you as the StepFunctionsReleaseObjectRole output from the Configure security account step.
  5. For IAM Role used by security professionals to investigate Macie findings, enter the ARN that was provided to you as the InvestigateMacieFindingsRole output from the Configure security account step.
  6. For Department name of the bucket owner, enter any department or team name you want to designate as having ownership responsibility over the S3 bucket.
  7. Choose Next.
    Figure 6: Deploy S3 bucket and KMS key stack details

After you’ve entered all details, launch the stack and wait until the stack has reached CREATE_COMPLETE status before proceeding. The deployment process will take 3–5 minutes.

Monitor S3 bucket tag compliance

Shortly after the deployment of the new S3 bucket, you should see a message in your Microsoft Teams channel notifying you of the tag compliance status of the new bucket. This AWS Config rule is evaluated automatically any time an S3 resource event takes place, and the tag compliance event is sent to the centralized security account for notification purposes. While the notification shown in Figure 7 depicts a compliant S3 bucket, a bucket deployed without the required tags will be marked as NON_COMPLIANT, and security professionals can check the AWS Config compliance status directly in the AWS console for the member account.

Figure 7: S3 tag compliance notification

Upload sample data

Download this .zip file of sample data and upload the expanded files into the newly created S3 bucket. The sample files include fictitious PII, including credit card information and social security numbers, and so will generate various Macie findings.

Note: All data in this blog post has been artificially created by AWS for demonstration purposes and has not been collected from any individual person. Similarly, such data does not, nor is it intended, to relate back to any individual person.

Configure a Macie discovery job

Configure a sensitive data discovery job in the Amazon Macie delegated administrator account (that is, the security account) according to Creating a sensitive data discovery job. When creating the job, specify tag-based bucket criteria instructing Macie to scan any bucket with a tag key of RequireMacieScan and a tag value of True. This instructs Macie to scan buckets matching this criterion across the accounts that have been onboarded into Macie.

Figure 8: Macie bucket criteria

On the discovery options page, specify a one-time job with a sampling depth of 100 percent. Further refine the job scope by adding the quarantine prefix to the exclusion criteria of the sensitive data discovery job.

Figure 9: Macie scan scope

Select the AWS_CREDENTIALS, CREDIT_CARD_NUMBER, CREDIT_CARD_SECURITY_CODE, and DATE_OF_BIRTH managed data identifiers and proceed to the review screen. On the review screen, ensure that the bucket name you created is listed under bucket matches, and launch the discovery job.
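The equivalent job definition can also be created programmatically. The following is a hedged sketch; the job name and the quarantine prefix value are assumptions, and the managed data identifier IDs should be checked against the identifiers you selected.

import boto3

macie = boto3.client("macie2")  # run in the Macie delegated administrator account

macie.create_classification_job(
    name="data-remediation-validation",
    jobType="ONE_TIME",
    samplingPercentage=100,
    managedDataIdentifierSelector="INCLUDE",
    managedDataIdentifierIds=[
        "AWS_CREDENTIALS",
        "CREDIT_CARD_NUMBER",
        "CREDIT_CARD_SECURITY_CODE",
        "DATE_OF_BIRTH",
    ],
    s3JobDefinition={
        # Scan every bucket tagged RequireMacieScan=True across onboarded accounts.
        "bucketCriteria": {
            "includes": {
                "and": [
                    {
                        "tagCriterion": {
                            "comparator": "EQ",
                            "tagValues": [{"key": "RequireMacieScan", "value": "True"}],
                        }
                    }
                ]
            }
        },
        # Exclude objects already moved under the quarantine prefix (placeholder name).
        "scoping": {
            "excludes": {
                "and": [
                    {
                        "simpleScopeTerm": {
                            "comparator": "STARTS_WITH",
                            "key": "OBJECT_KEY",
                            "values": ["quarantine/"],
                        }
                    }
                ]
            }
        },
    },
)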

Note: This solution also works with the Automated Sensitive Data Discovery feature in Amazon Macie. I recommend you investigate this feature further for broad visibility into where sensitive data might reside within your Amazon S3 data estate. Regardless of the method you choose, you will be able to integrate the appropriate notification and quarantine solution.

Review and investigate findings

After a few minutes, the discovery job will complete, and soon you should see four messages in your Microsoft Teams channel notifying you of the finding events created by Macie. One of these findings will be marked as medium severity, while the other three will be marked as high.

Review the medium severity finding, and recall the flow of event information from the solution overview section. Macie scanned a bucket in the member account and presented this finding in the Macie delegated administrator account. Macie then sent this finding event to EventBridge, which initiated a workflow run in Step Functions. Step Functions invoked customer-specified logic to evaluate the finding and determined that because the object isn’t publicly accessible, and because the finding isn’t high severity, it should only notify security professionals rather than quarantine the object in question. Several key pieces of information necessary for investigation are presented to the security team, along with a link to directly investigate the finding in the AWS console of the central security account.

Figure 10: Macie medium severity finding notification

Now review the high severity finding. The flow of event information in this scenario is identical to the medium severity finding, but in this case, Step Functions quarantined the object because the severity is high. The security team is again presented with an option to use the console to investigate further. The process to investigate this finding is a bit different due to the object being moved to a quarantine prefix. If security professionals want to view the original object in its entirety, they would assume the InvestigateMacieFindingsRole in the security account, which has cross-account access to the S3 bucket quarantine prefix in the in-scope member accounts. S3 buckets deployed in member accounts using the CloudFormation template listed above will have a special bucket policy that denies access to the quarantine prefix for any role other than the InvestigateMacieFindingsRole, StepFunctionsProcessFindingsRole, and StepFunctionsReleaseObjectRole. This makes sure that objects are truly quarantined and inaccessible while being investigated by security professionals.

Figure 11: Macie high severity finding notification

Unlike the previous example, the security team is also notified that an affected object was moved to quarantine, and is presented with a separate option to release the object from quarantine. Choosing Release Object from Quarantine runs an HTTP POST to the REST API transparently in the background, and the API responds with a SUCCEEDED or FAILED message indicating the result of the operation.

Figure 12: Release object from quarantine

The state machine uses decision logic based on the affected object’s public status and the severity of the finding. Customers deploying this solution can choose to alter this logic or add additional customization by altering the Step Functions state machine definition either directly in the CloudFormation template or through the Step Functions Workflow Studio low-code interface available in the Step Functions console. For reference, the full event schema used by Macie can be found in the Eventbridge event schema for Macie findings.

The logic of the Step Functions state machine used to process Macie finding events follows and is shown in Figure 13.

  1. An EventBridge rule invokes the Step Functions state machine as Macie findings are received.
  2. Step Functions parses Macie event data to determine the finding severity.
  3. If the finding is high severity or determined to be public, the affected object is then added to the quarantine tracking database in DynamoDB.
  4. After adding the object to quarantine tracking, the object is copied into the quarantine prefix within its S3 bucket.
  5. After being copied, the object is deleted from its original S3 location.
  6. After the object is deleted, the MacieQuarantineNotification function is invoked to alert you of the finding and quarantine status.
  7. If the finding is not high severity and not determined to be public, the MacieFindingNotification function is invoked to alert you of the finding.
    Figure 13: Process Macie findings workflow
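For reference, the quarantine decision in step 3 above can be expressed as a Choice state along the following lines. The state names and JSON paths are assumptions based on the Macie finding event schema linked earlier; the deployed state machine definition may differ.

# A fragment of an Amazon States Language definition illustrating the decision logic.
# It routes high-severity or publicly accessible objects to the quarantine branch.
choice_state = {
    "EvaluateFinding": {
        "Type": "Choice",
        "Choices": [
            {
                "Or": [
                    {
                        "Variable": "$.detail.severity.description",
                        "StringEquals": "High",
                    },
                    {
                        "Variable": "$.detail.resourcesAffected.s3Bucket.publicAccess.effectivePermission",
                        "StringEquals": "PUBLIC",
                    },
                ],
                "Next": "AddObjectToQuarantineTracking",
            }
        ],
        "Default": "NotifyFindingOnly",
    }
}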

Solution cleanup

To remove the solution and avoid incurring additional charges for the AWS resources used in this solution, perform the following steps.

Note: If you want to only suspend Macie, which will preserve your settings, resources, and findings, see Suspending or disabling Amazon Macie.

  1. Open the Macie console in your security account. Under Settings, choose Accounts. Select the checkboxes next to the member accounts onboarded previously, select the Actions dropdown, and select Disassociate Account. When that has completed, select the same accounts again, and choose Delete.
  2. Open the Macie console in your management account. Click on Get started, and remove your security account as the Macie delegated administrator.
  3. Open the Macie console in your security account, choose Settings, then choose Disable Macie.
  4. Open the S3 console in your security account. Remove all objects from the Macie results S3 bucket.
  5. Open the CloudFormation console in your security account. Select the ConfigureSecurityAccount stack and choose Delete.
  6. Open the Macie console in your member accounts. Under Settings, choose Disable Macie.
  7. Open the Amazon S3 console in your member accounts. Remove all sample data objects from the S3 bucket.
  8. Open the CloudFormation console in your member accounts. Select the DeployS3BucketKmsKey stack and choose Delete.
  9. Open the CloudFormation console in your management account. Select the DeployToMemberAccounts stack and choose Delete.
  10. Still in the CloudFormation console in your management account, select the ConfigureManagementAccount stack and choose Delete.

Summary

In this blog post, we demonstrated a solution to process—and act upon—sensitive data discovery findings surfaced by Macie through the incorporation of decision logic implemented in Step Functions. This logic provides granularity on quarantine decisions and extensibility you can use to customize automated finding response to suit your business and regulatory needs.

In addition, we demonstrated a mechanism to notify you of finding event information as it’s discovered, while providing you with the contextual details necessary for investigative purposes. Additional customization of the information presented in Microsoft Teams can be accomplished through parsing of the event payload. You can also customize the Lambda functions to interface with Slack, as demonstrated in this sample solution for Macie.

Finally, we demonstrated a solution that can operate at-scale, and will automatically be deployed to new member accounts in your organization. By using an in-place quarantine strategy instead of one that is centralized, you can more easily manage this solution across potentially hundreds of AWS accounts and thousands of S3 buckets. By incorporating a global tracking database in DynamoDB, this solution can also be enhanced through a visual dashboard depicting quarantine insights.

If you have feedback about this post, let us know in the comments section. If you have questions about this post, start a new thread on AWS Security, Identity, & Compliance re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Darren Roback

Darren is a Senior Solutions Architect with Amazon Web Services based in St. Louis, Missouri. He has a background in Security and Compliance, Serverless Event-Driven Architecture, and Enterprise Architecture. At AWS, Darren partners with customers to help them solve business challenges with AWS technology. Outside of work, Darren enjoys spending time in his shop working on woodworking projects.

Seth Malone

Seth is a Solutions Architect with Amazon Web Services based in Missouri. He has experience maintaining secure workloads across healthcare, financial, and engineering firms. Seth enjoys the outdoors, and can be found on the water after building solutions with his customers.

Chuck Enstall

Chuck is a Principal Solutions Architect with AWS based in St. Louis, MO, focusing on enterprise customers. He has decades of experience in security across multiple clouds, and holds numerous security certifications such as CISSP, CCSP, and others. As a former Navy veteran, Chuck enjoys mentoring other veterans as they explore possible careers within the security field.

Implement Apache Flink near-online data enrichment patterns

Post Syndicated from Luis Morales original https://aws.amazon.com/blogs/big-data/implement-apache-flink-near-online-data-enrichment-patterns/

Stream data processing allows you to act on data in real time. Real-time data analytics can help you have on-time and optimized responses while improving the overall customer experience.

Data streaming workloads often require data in the stream to be enriched via external sources (such as databases or other data streams). Pre-loading of reference data provides low latency and high throughput. However, this pattern may not be suitable for certain types of workloads:

  • Reference data updates with high frequency
  • The streaming application needs to make an external call to compute the business logic
  • Accuracy of the output is important and the application shouldn’t use stale data
  • Cardinality of reference data is very high, and the reference dataset is too big to be held in the state of the streaming application

For example, if you’re receiving temperature data from a sensor network and need to get additional metadata of the sensors to analyze how these sensors map to physical geographic locations, you need to enrich it with sensor metadata data.

Apache Flink is a distributed computation framework that allows for stateful real-time data processing. It provides a single set of APIs for building batch and streaming jobs, making it easy for developers to work with bounded and unbounded data. Amazon Managed Service for Apache Flink (successor to Amazon Kinesis Data Analytics) is an AWS service that provides a serverless, fully managed infrastructure for running Apache Flink applications. Developers can build highly available, fault tolerant, and scalable Apache Flink applications with ease and without needing to become an expert in building, configuring, and maintaining Apache Flink clusters on AWS.

You can use several approaches to enrich your real-time data in Amazon Managed Service for Apache Flink depending on your use case and Apache Flink abstraction level. Each method has different effects on the throughput, network traffic, and CPU (or memory) utilization. For a general overview of data enrichment patterns, refer to Common streaming data enrichment patterns in Amazon Managed Service for Apache Flink.

This post covers how you can implement data enrichment for near-online streaming events with Apache Flink and how you can optimize performance. To compare the performance of the enrichment patterns, we ran performance testing based on synthetic data. The result of this test is useful as a general reference. It's important to note that the actual performance for your Flink workload will depend on several factors, such as API latency, throughput, size of the event, and cache hit ratio.

We discuss three enrichment patterns, detailed in the following table.

| | Synchronous Enrichment | Asynchronous Enrichment | Synchronous Cached Enrichment |
| --- | --- | --- | --- |
| Enrichment approach | Synchronous, blocking per-record requests to the external endpoint | Non-blocking parallel requests to the external endpoint, using asynchronous I/O | Frequently accessed information is cached in the Flink application state, with a fixed TTL |
| Data freshness | Always up-to-date enrichment data | Always up-to-date enrichment data | Enrichment data may be stale, up to the TTL |
| Development complexity | Simple model | Harder to debug, due to multi-threading | Harder to debug, due to relying on Flink state |
| Error handling | Straightforward | More complex, using callbacks | Straightforward |
| Impact on enrichment API | Max: one request per message | Max: one request per message | Reduced I/O to the enrichment API (depends on cache TTL) |
| Application latency | Sensitive to enrichment API latency | Less sensitive to enrichment API latency | Reduced application latency (depends on cache hit ratio) |
| Other considerations | None | None | Customizable TTL; only synchronous implementation as of Flink 1.17 |
| Result of the comparative test (throughput) | ~350 events per second | ~2,000 events per second | ~28,000 events per second |

Solution overview

For this post, we use an example of a temperature sensor network (component 1 in the following architecture diagram) that emits sensor information, such as temperature, sensor ID, status, and the timestamp when the event was produced. These temperature events get ingested into Amazon Kinesis Data Streams (2). Downstream systems also require the brand and country code information of the sensors, in order to analyze, for example, the reliability per brand and the temperature per plant site.

Based on the sensor ID, we enrich this sensor information from the Sensor Info API (3), which provides us with information about the brand, location, and an image. The resulting enriched stream is sent to another Kinesis data stream and can then be analyzed in an Amazon Managed Service for Apache Flink Studio notebook (4).

Solution overview

Prerequisites

To get started with implementing near-online data enrichment patterns, you can clone or download the code from the GitHub repository. This repository implements the Flink streaming application we described. You can find the instructions on how to set up Flink in either Amazon Managed Service for Apache Flink or other available Flink deployment options in the README.md file.

If you want to learn how these patterns are implemented and how to optimize performance for your Flink application, you can simply follow along with this post without deploying the samples.

Project overview

The project is structured as follows:

docs/                               -- Contains project documentation
src/
├── main/java/...                   -- Contains all the Flink application code
│   ├── ProcessTemperatureStream    -- Main class that decides on the enrichment strategy
│   ├── enrichment.                 -- Contains the different enrichment strategies (sync, async and cached)
│   ├── event.                      -- Event POJOs
│   ├── serialize.                  -- Utils for serialization
│   └── utils.                      -- Utils for Parameter parsing
└── test/                           -- Contains all the Flink testing code

The main method in the ProcessTemperatureStream class sets up the run environment and either takes the parameters from the command line, if it’s a local environment, or uses the application properties from Amazon Managed Service for Apache Flink. Based on the parameter EnrichmentStrategy, it decides which implementation to pick: synchronous enrichment (default), asynchronous enrichment, or cached enrichment based on the Flink concept of KeyedState.

public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    ParameterTool parameter = ParameterToolUtils.getParameters(args, env);

    String strategy = parameter.get("EnrichmentStrategy", "SYNC");
    switch (strategy) {
        case "SYNC":
            new SyncProcessTemperatureStreamStrategy().run(env, parameter);
            break;
        case "ASYNC":
            new AsyncProcessTemperatureStreamStrategy().run(env, parameter);
            break;
        case "CACHED":
            new CachedProcessTemperatureStreamStrategy().run(env, parameter);
            break;
        default:
            throw new InvalidParameterException("Please choose one of the existing enrichment strategies (SYNC|ASYNC|CACHED)");
    }
}

We go over the three approaches in the following sections.

Synchronous data enrichment

When you want to enrich your data from an external provider, you can use synchronous per-record lookup. When your Flink application processes an incoming event, it makes an external HTTP call and after sending every request, it has to wait until it receives the response.

As Flink processes events synchronously, the thread that is running the enrichment is blocked until it receives the HTTP response. This results in the processor staying idle for a significant period of processing time. On the other hand, the synchronous model is easier to design, debug, and trace. It also allows you to always have the latest data.

It can be integrated into your streaming application as follows:

DataStream<EnrichedTemperature> enrichedTemperatureDataStream =
        temperatureDataStream
                .map(new SyncEnrichmentFunction(parameter.get("SensorApiUrl", DEFAULT_API_URL)));

The implementation of the enrichment function looks like the following code:

public class SyncEnrichmentFunction extends RichMapFunction<Temperature, EnrichedTemperature> {

    // Setup of HTTP client and ObjectMapper

    @Override
    public EnrichedTemperature map(Temperature temperature) throws Exception {
        String url = this.getRequestUrl + temperature.getSensorId();

        // Retrieve response from sensor info API
        Response response = client
                .prepareGet(url)
                .execute()
                .toCompletableFuture()
                .get();

        // Parse the sensor info
        SensorInfo sensorInfo = parseSensorInfo(response.getResponseBody());

        // Merge the temperature sensor data and sensor info data
        return getEnrichedTemperature(temperature, sensorInfo);
    }

    // ...
}

To optimize the performance of synchronous enrichment, you can enable the KeepAlive flag on the HTTP client so that connections are reused across multiple events.
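As a concrete illustration, the following minimal sketch builds an HTTP client with keep-alive enabled. It assumes the samples use the AsyncHttpClient (org.asynchttpclient) library, which the prepareGet and getResponseBody calls in the snippets suggest; you would typically create one such client in the enrichment function’s open() method and reuse it for every record.

import org.asynchttpclient.AsyncHttpClient;
import org.asynchttpclient.Dsl;

public class KeepAliveClientFactory {

    // Build the client once (for example, in the RichMapFunction's open() method)
    // and reuse it for every record so that keep-alive connections are shared.
    public static AsyncHttpClient create() {
        return Dsl.asyncHttpClient(
                Dsl.config()
                        .setKeepAlive(true)); // reuse TCP connections across requests
    }
}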

For applications with I/O-bound operators (such as external data enrichment), it can also make sense to increase the application parallelism without increasing the resources dedicated to the application. You can do this by increasing the ParallelismPerKPU setting of the Amazon Managed Service for Apache Flink application. This configuration describes the number of parallel subtasks an application can perform per Kinesis Processing Unit (KPU), and a higher value of ParallelismPerKPU can lead to full utilization of KPU resources. But keep in mind that increasing the parallelism doesn’t work in all cases, such as when you are consuming from sources with few shards or partitions.

In our synthetic testing with Amazon Managed Service for Apache Flink, we saw a throughput of approximately 350 events per second on a single KPU, with a parallelism of 4 per KPU and the default settings.

Synchronous enrichment performance

Asynchronous data enrichment

Synchronous enrichment doesn’t take full advantage of computing resources, because Flink waits for HTTP responses. To address this, Flink offers asynchronous I/O for external data access. This allows you to enrich the stream events asynchronously: the application can send requests for other elements in the stream while it waits for the response for the first element, and requests can be batched for greater efficiency.

Sync I/O vs Async I/O

While using this pattern, you have to decide between unorderedWait (where it emits the result to the next operator as soon as the response is received, disregarding the order of the elements on the stream) and orderedWait (where it waits until all inflight I/O operations complete, then sends the results to the next operator in the same order as the original elements were placed on the stream). When your use case doesn’t require event ordering, unorderedWait provides better throughput and less idle time. Refer to Enrich your data stream asynchronously using Amazon Managed Service for Apache Flink to learn more about this pattern.

The asynchronous enrichment can be added as follows:

SingleOutputStreamOperator<EnrichedTemperature> asyncEnrichedTemperatureSingleOutputStream =
        AsyncDataStream
                .unorderedWait(
                        temperatureDataStream,
                        new AsyncEnrichmentFunction(parameter.get("SensorApiUrl", DEFAULT_API_URL)),
                        ASYNC_OPERATOR_TIMEOUT,
                        TimeUnit.MILLISECONDS,
                        ASYNC_OPERATOR_CAPACITY);
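If your downstream logic depends on the original event order, you can switch the same call to orderedWait, trading some latency and buffering for ordering guarantees. A minimal variant, reusing the same constants as the preceding snippet, could look like the following:

SingleOutputStreamOperator<EnrichedTemperature> orderedEnrichedTemperatureSingleOutputStream =
        AsyncDataStream
                .orderedWait(
                        temperatureDataStream,
                        new AsyncEnrichmentFunction(parameter.get("SensorApiUrl", DEFAULT_API_URL)),
                        ASYNC_OPERATOR_TIMEOUT,
                        TimeUnit.MILLISECONDS,
                        ASYNC_OPERATOR_CAPACITY);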

The enrichment function works similarly to the synchronous implementation. It first retrieves the sensor info as a Java Future, which represents the result of an asynchronous computation. As soon as the result is available, it parses the information and then merges both objects into an EnrichedTemperature:

public class AsyncEnrichmentFunction extends RichAsyncFunction<Temperature, EnrichedTemperature> {

    // Setup of HTTP client and ObjectMapper

    @Override
    public void asyncInvoke(final Temperature temperature, final ResultFuture<EnrichedTemperature> resultFuture) {
        String url = this.getRequestUrl + temperature.getSensorId();

        // Retrieve response from sensor info API
        Future<Response> future = client
                .prepareGet(url)
                .execute();
        CompletableFuture
                .supplyAsync(() -> {
                    try {
                        Response response = future.get();

                        // Parse the sensor info as soon as it is available
                        return parseSensorInfo(response.getResponseBody());
                    } catch (Exception e) {
                        return null;
                    }
                })
                .thenAccept((SensorInfo sensorInfo) ->

                    // Merge the temperature sensor data and sensor info data
                    // (ResultFuture.complete expects a collection of results)
                    resultFuture.complete(Collections.singleton(getEnrichedTemperature(temperature, sensorInfo))));
    }

    // ...
}

In our testing with Amazon Managed Service for Apache Flink, we saw a throughput of approximately 2,000 events per second on a single KPU, with a parallelism of 2 per KPU and the default settings.

Async enrichment performance

Synchronous cached data enrichment

Although numerous operations in a data flow focus on individual events independently, such as event parsing, there are certain operations that retain information across multiple events. These operations, such as window operators, are referred to as stateful due to their ability to maintain state.

The keyed state is stored within an embedded key-value store, conceptualized as a part of Flink’s architecture. This state is partitioned and distributed in conjunction with the streams that are consumed by the stateful operators. As a result, access to the key-value state is limited to keyed streams, meaning it can only be accessed after a keyed or partitioned data exchange, and is restricted to the values associated with the current event’s key. For more information about the concepts, refer to Stateful Stream Processing.

You can use the keyed state for frequently accessed information that doesn’t change often, such as the sensor information. This not only allows you to reduce the load on downstream resources, but also increases the efficiency of your data enrichment, because no round trip to an external resource is necessary for already fetched keys and there’s no need to recompute the information. Keep in mind that Amazon Managed Service for Apache Flink stores transient data in a RocksDB backend, which adds latency when retrieving the information. However, because RocksDB is local to the node processing the data, this is still faster than reaching out to external resources, as you can see in the following example.

To use keyed streams, you have to partition your stream using the .keyBy(...) method, which ensures that events for the same key, in this case the sensor ID, are routed to the same worker. You can implement it as follows:

SingleOutputStreamOperator<EnrichedTemperature> cachedEnrichedTemperatureSingleOutputStream = temperatureDataStream
        .keyBy(Temperature::getSensorId)
        .process(new CachedEnrichmentFunction(
                parameter.get("SensorApiUrl", DEFAULT_API_URL),
                parameter.get("CachedItemsTTL", String.valueOf(CACHED_ITEMS_TTL))));

We are using the sensor ID as the key to partition the stream and later enrich it. This way, we can then cache the sensor information as part of the keyed state. When picking a partition key for your use case, choose one that has a high cardinality. This leads to an even distribution of events across different workers.

To store the sensor information, we use ValueState. To configure the state management, we have to describe the state type by using a TypeHint. Additionally, we can configure how long a certain state will be cached by specifying the time-to-live (TTL), after which the state is cleaned up and has to be retrieved or recomputed again.

public class CachedEnrichmentFunction extends KeyedProcessFunction<String, Temperature, EnrichedTemperature> {

    // Setup of HTTP client and ObjectMapper...

    private transient ValueState<SensorInfo> cachedSensorInfoLight;
    
    @Override
    public void open(Configuration configuration) throws Exception {
        // Initialize HTTP client
    
        ValueStateDescriptor<SensorInfo> descriptor =
                new ValueStateDescriptor<>("sensorInfo", TypeInformation.of(new TypeHint<SensorInfo>() {}));
    
        StateTtlConfig ttlConfig = StateTtlConfig
                .newBuilder(Time.seconds(this.ttl))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.ReturnExpiredIfNotCleanedUp)
                .build();
        descriptor.enableTimeToLive(ttlConfig);
    
        cachedSensorInfoLight = getRuntimeContext().getState(descriptor);
    }
    
    // ...
}

As of Flink 1.17, access to the state is not possible in asynchronous functions, so the implementation must be synchronous.

The processElement method first checks whether the sensor information for this particular key already exists in the keyed state; if so, the event gets enriched from the cache. Otherwise, it retrieves the sensor information, parses it, and then merges both objects into an EnrichedTemperature:

public class CachedEnrichmentFunction extends KeyedProcessFunction<String, Temperature, EnrichedTemperature> {

    // Setup of HTTP client, ObjectMapper and ValueState

    @Override
    public void processElement(Temperature temperature, KeyedProcessFunction<String, Temperature, EnrichedTemperature>.Context ctx, Collector<EnrichedTemperature> out) throws Exception {
        SensorInfo sensorInfoCachedEntry = cachedSensorInfoLight.value();

        // Check if sensor info is cached
        if (sensorInfoCachedEntry != null) {
            out.collect(getEnrichedTemperature(temperature, sensorInfoCachedEntry));
        } else {
            String url = this.getRequestUrl + temperature.getSensorId();

            // Retrieve response from sensor info API
            Response response = client
                    .prepareGet(url)
                    .execute()
                    .toCompletableFuture()
                    .get();

            // Parse the sensor info
            SensorInfo sensorInfo = parseSensorInfo(response.getResponseBody());

            // Cache the sensor info
            cachedSensorInfoLight.update(sensorInfo);

            // Merge the temperature sensor data and sensor info data
            out.collect(getEnrichedTemperature(temperature, sensorInfo));
        }
    }

    // ...
}

In our synthetic testing with Amazon Managed Service for Apache Flink, we saw a throughput of approximately 28,000 events per second on a single KPU, with a parallelism of 4 per KPU and the default settings.

Sync+Cached enrichment performance

You can also see the impact and reduced load on the downstream sensor API.

Impact on Enrichment API

Test your workload on Amazon Managed Service for Apache Flink

This post compared different approaches to run an application on Amazon Managed Service for Apache Flink with 1 KPU. Testing with a single KPU gives a good performance baseline that allows you to compare the enrichment patterns without generating a full-scale production workload.

It’s important to understand that the actual performance of the enrichment patterns depends on the actual workload and other external systems the Flink application interacts with. For example, performance of cached enrichment may vary with the cache hit ratio. Synchronous enrichment may behave differently depending on the response latency of the enrichment endpoint.

To evaluate which approach best suits your workload, you should first perform scaled-down tests with 1 KPU and a limited throughput of realistic data, possibly experimenting with different values of Parallelism per KPU. After you identify the best approach, it’s important to test the implementation at full scale, with real data and integrating with real external systems, before moving to production.

Summary

This post explored different approaches to implement near-online data enrichment using Flink, focusing on three communication patterns: synchronous enrichment, asynchronous enrichment, and caching with Flink KeyedState.

We compared the throughput achieved by each approach, with caching using Flink KeyedState being up to 14 times faster than using asynchronous I/O, in this particular experiment with synthetic data. Furthermore, we delved into optimizing the performance of Apache Flink, specifically on Amazon Managed Service for Apache Flink. We discussed strategies and best practices to maximize the performance of Flink applications in a managed environment, enabling you to fully take advantage of the capabilities of Flink for your near-online data enrichment needs.

Overall, this overview offers insights into different data enrichment patterns, their performance characteristics, and optimization techniques when using Apache Flink, particularly in the context of near-online data enrichment scenarios and on Amazon Managed Service for Apache Flink.

We welcome your feedback. Please leave your thoughts and questions in the comments section.


About the authors

Luis Morales works as Senior Solutions Architect with digital-native businesses to support them in constantly reinventing themselves in the cloud. He is passionate about software engineering, cloud-native distributed systems, test-driven development, and all things code and security.

Lorenzo Nicora works as Senior Streaming Solution Architect helping customers across EMEA. He has been building cloud-native, data-intensive systems for several years, working in the finance industry both through consultancies and for fin-tech product companies. He leveraged open source technologies extensively and contributed to several projects, including Apache Flink.

Clean up your Excel and CSV files without writing code using AWS Glue DataBrew

Post Syndicated from Ismail Makhlouf original https://aws.amazon.com/blogs/big-data/clean-up-your-excel-and-csv-files-without-writing-code-using-aws-glue-databrew/

Managing data within an organization is complex. Handling data from outside the organization adds even more complexity. As the organization receives data from multiple external vendors, it often arrives in different formats, typically Excel or CSV files, with each vendor using their own unique data layout and structure. In this blog post, we’ll explore a solution that streamlines this process by leveraging the capabilities of AWS Glue DataBrew.

DataBrew is an excellent tool for data quality and preprocessing. You can use its built-in transformations and recipes, as well as its integrations with the AWS Glue Data Catalog and Amazon Simple Storage Service (Amazon S3), to preprocess the data in your landing zone, clean it up, and send it downstream for analytical processing.

In this post, we demonstrate the following:

  • Extracting non-transactional metadata from the top rows of a file and merging it with transactional data
  • Combining multi-line rows into single-line rows
  • Extracting unique identifiers from within strings or text

Solution overview

For this use case, imagine you’re a data analyst working at your organization. The sales leadership has requested a consolidated view of the net sales they are making from each of the organization’s suppliers. Unfortunately, this information is not available in a database. The sales data comes from each supplier in layouts like the following example.

However, with hundreds of resellers, manually extracting the information at the top is not feasible. Your goal is to clean up and flatten the data into the following output layout.


To achieve this, you can use pre-built transformations in DataBrew to quickly get the data in the layout you want.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Connect to the dataset

The first thing we need to do is upload the input dataset to Amazon S3. Create an S3 bucket for the project and create a folder to upload the raw input data. The output data will be stored in another folder in a later step.

Next, we need to connect DataBrew to our CSV file. We create what we call a dataset, which is an artifact that points to whatever data source we will be using. Navigate to “Datasets” on the left-hand menu.

Ensure the Column header values field is set to Add default header. The input CSV has an irregular format, so the first row will not have the needed column values.

Create a project

To create a new project, complete the following steps:

  1. On the DataBrew console, choose Projects in the navigation pane.
  2. Choose Create project.
  3. For Project name, enter FoodMartSales-AllUpProject.
  4. For Attached recipe, choose Create new recipe.
  5. For Recipe name, enter FoodMartSales-AllUpProject-recipe.
  6. For Select a dataset, select My datasets.
  7. Select the FoodMartSales-AllUp dataset.
  8. Under Permissions, for Role name, choose the IAM role you created as a prerequisite or create a new role.
  9. Choose Create project.

After the project is opened, an interactive session is created where you can author transformations on a sample of the data.

Extract non-transactional metadata from within the contents of the file and merge it with transactional data

In this section, we consider data that has metadata on the first few rows of the file, followed by transactional data. We walk through how to extract data relevant to the whole file from the top of the document and combine it with the transactional data into one flat table.

Extract metadata from the header and remove invalid rows

Complete the following steps to extract metadata from the header:

  1. Choose Conditions and then choose IF.
  2. For Matching conditions, choose Match all conditions.
  3. For Source, choose Value of and Column_1.
  4. For Logical condition, choose Is exactly.
  5. For Enter a value, choose Enter custom value and enter RESELLER NAME.
  6. For Flag result value as, choose Custom value.
  7. For Value if true, choose Select source column and set Value of to Column_2.
  8. For Value if false, choose Enter custom value and enter INVALID.
  9. Choose Apply.

Your dataset should now look like the following screenshot, with the Reseller Name value extracted to a column by itself.

Next, you remove invalid rows and fill the rows with the Reseller Name value.

  1. Choose Clean and then choose Custom values.
  2. For Source column, choose ResellerName.
  3. For Specify values to remove, choose Custom value.
  4. For Values to remove, choose Invalid.
  5. For Apply transform to, choose All rows.
  6. Choose Apply.
  7. Choose Missing and then choose Fill with most frequent value.
  8. For Source column, choose FirstTransactionDate.
  9. For Missing value action, choose Fill with most frequent value.
  10. For Apply transform to, choose All rows.
  11. Choose Apply.

Your dataset should now look like the following screenshot, with the invalid values removed and the Reseller Name value filled in for every row.

Repeat the same steps in this section for the rest of the metadata, including Reseller Email Address, Reseller ID, and First Transaction Date.

Promote column headers and clean up data

To promote column headers, complete the following steps:

  1. Reorder the columns to put the metadata columns to the left of the dataset by choosing Column, Move column, and Start of the table.
  2. Rename the columns with the appropriate names.

Now you can clean up some columns and rows.

  1. Delete unnecessary columns, such as Column_7.

You can also delete invalid rows by filtering out records that don’t have a transaction date value.

  1. Choose the ABC icon on the menu of the Transaction_Date column and choose date.

  2. For Handle invalid values, select Delete rows, then choose Apply.

The dataset should now have the metadata extracted and the column headers promoted.

Combine multi-line rows into single-line rows

The next issue to address is data pertaining to the same transaction that is split across multiple lines. In the following steps, we extract the needed data from the rows and merge it into single-line transactions. For this example specifically, the Reseller Margin data is split across two lines.


Complete the following steps to get the Reseller Margin value on the same line as the corresponding transaction. First, we identify the Reseller Margin rows and store them in a temporary column.

  1. Choose Conditions and then choose IF.
  2. For Matching conditions, choose Match all conditions.
  3. For Source, choose Value of and Transaction_ID.
  4. For Logical condition, choose Contains.
  5. For Enter a value, choose Enter custom value and enter Reseller Margin.
  6. For Flag result value as, choose Custom value.
  7. For Value if true, choose Select source column and set Value of to TransactionAmount.
  8. For Value if false, choose Enter custom value and enter Invalid.
  9. For Destination column, choose ResellerMargin_Temp.
  10. Choose Apply.

Next, you shift the Reseller Margin value up one row.

  1. Choose Functions and then choose NEXT.
  2. For Source column, choose ResellerMargin_Temp.
  3. For Number of rows, enter 1.
  4. For Destination column, choose ResellerMargin.
  5. For Apply transform to, choose All rows.
  6. Choose Apply.

Next, delete the invalid rows.

  1. Choose Missing and then choose Remove missing rows.
  2. For Source column, choose TransactionDate.
  3. For Missing value action, choose Delete rows with missing values.
  4. For Apply transform to, choose All rows.
  5. Choose Apply.

Your dataset should now look like the following screenshot, with the Reseller Margin value extracted to a column by itself.

With the data structured properly, we can move on to mining the cleaned data.

Extract unique identifiers from within strings and text

Many types of data contain important information stored as unstructured text in a cell. In this section, we look at how to extract this data. Within the sample dataset, the BankTransferText column has valuable information around our resellers’ registered bank account numbers as well as the currency of the transaction, namely IBAN, SWIFT Code, and Currency.

Complete the following steps to extract the IBAN, SWIFT code, and Currency into separate columns. First, you extract the IBAN number from the text using a regular expression (regex); a small standalone example after these steps shows how the pattern behaves.

  1. Choose Extract and then choose Custom value or pattern.
  2. For Create column options, choose Extract values.
  3. For Source column, choose BankTransferText.
  4. For Extract options, choose Custom value or pattern.
  5. For Values to extract, enter [a-zA-Z][a-zA-Z][0-9]{2}[A-Z0-9]{1,30}.
  6. For Destination column, choose IBAN.
  7. For Apply transform to, choose All rows.
  8. Choose Apply.
  9. Extract the SWIFT code from the text using a regex following the same steps used to extract the IBAN number, but using the following regex instead: (?!^)(SWIFT Code: )([A-Z]{2}[A-Z0-9]+).
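Before applying the pattern at scale, you can sanity-check the IBAN regex from step 5 outside of DataBrew against a sample value of the BankTransferText column. The following standalone snippet is only an illustration; the sample text is hypothetical.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IbanExtractionDemo {
    public static void main(String[] args) {
        // Hypothetical BankTransferText cell value
        String bankTransferText = "IBAN: DE89370400440532013000 SWIFT Code: DEUTDEFF Currency: EUR";

        // Same pattern used in the DataBrew extract step
        Pattern ibanPattern = Pattern.compile("[a-zA-Z][a-zA-Z][0-9]{2}[A-Z0-9]{1,30}");
        Matcher matcher = ibanPattern.matcher(bankTransferText);

        if (matcher.find()) {
            System.out.println("Extracted IBAN: " + matcher.group()); // DE89370400440532013000
        }
    }
}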

Next, remove the SWIFT Code: label from the extracted text.

  1. Choose Remove and then choose Custom values.
  2. For Source column, choose SWIFT Code.
  3. For Specify values to remove, choose Custom value.
  4. For Apply transform to, choose All rows.
  5. Extract the currency from the text using a regex following the same steps used to extract the IBAN number, but using the following regex instead: (?!^)(Currency: )([A-Z]{3}).
  6. Remove the Currency: label from the extracted text following the same steps used to remove the SWIFT Code: label.

You can clean up by deleting any unnecessary columns.

  1. Choose Column and then choose Delete.
  2. For Source columns, choose BankTransferText.
  3. Choose Apply.
  4. Repeat for any remaining columns.

Your dataset should now look like the following screenshot, with IBAN, SWIFT Code, and Currency extracted to separate columns.

Write the transformed data to Amazon S3

With all the steps captured in the recipe, the last step is to write the transformed data to Amazon S3.

  1. On the DataBrew console, choose Run job.
  2. For Job name, enter FoodMartSalesToDataLake.
  3. For Output to, choose Amazon S3.
  4. For File type, choose CSV.
  5. For Delimiter, choose Comma (,).
  6. For Compression, choose None.
  7. For S3 bucket owners’ account, select Current AWS account.
  8. For S3 location, enter s3://{name of S3 bucket}/clean/.
  9. For Role name, choose the IAM role created as a prerequisite or create a new role.
  10. Choose Create and run job.
  11. Go to the Jobs tab and wait for the job to complete.
  12. Navigate to the job output folder on the Amazon S3 console.
  13. Download the CSV file and view the transformed output.

Your dataset should look similar to the following screenshot.

Clean up

To optimize cost, make sure to clean up the resources deployed for this project by completing the following steps:

  1. Delete every DataBrew project along with their linked recipes.
  2. Delete all the DataBrew datasets.
  3. Delete the contents in your S3 bucket.
  4. Delete the S3 bucket.

Conclusion

The reality of exchanging data with suppliers is that we can’t always control the shape of the input data. With DataBrew, we can use a list of pre-built transformations and repeatable steps to transform incoming data into a desired layout and extract relevant data and insights from Excel or CSV files. Start using DataBrew today and transform third-party files into structured datasets ready for consumption by your business.


About the Author

Ismail Makhlouf is a Senior Specialist Solutions Architect for Data Analytics at AWS. Ismail focuses on architecting solutions for organizations across their end-to-end data analytics estate, including batch and real-time streaming, big data, data warehousing, and data lake workloads. He primarily works with direct-to-consumer platform companies in the ecommerce, FinTech, PropTech, and HealthTech space to achieve their business objectives with well-architected data platforms.

Set up AWS Private Certificate Authority to issue certificates for use with IAM Roles Anywhere

Post Syndicated from Chris Sciarrino original https://aws.amazon.com/blogs/security/set-up-aws-private-certificate-authority-to-issue-certificates-for-use-with-iam-roles-anywhere/

Traditionally, applications or systems—defined as pieces of autonomous logic functioning without direct user interaction—have faced challenges associated with long-lived credentials such as access keys. In certain circumstances, long-lived credentials can increase operational overhead and the scope of impact in the event of an inadvertent disclosure.

To help mitigate these risks and follow the best practice of using short-term credentials, Amazon Web Services (AWS) introduced IAM Roles Anywhere, a feature of AWS Identity and Access Management (IAM). With the introduction of IAM Roles Anywhere, systems running outside of AWS can exchange X.509 certificates to assume an IAM role and receive temporary IAM credentials from AWS Security Token Service (AWS STS).

You can use IAM Roles Anywhere to help you implement a secure and manageable authentication method. It uses the same IAM policies and roles as within AWS, simplifying governance and policy management across hybrid cloud environments. Additionally, the certificates used in this process come with a built-in validity period defined when the certificate request is created, enhancing the security by providing a time-limited trust for the identities. Furthermore, customers in high security environments can optionally keep private keys for the certificates stored in PKCS #11-compatible hardware security modules for extra protection.

For organizations that lack an existing public key infrastructure (PKI), AWS Private Certificate Authority allows for the creation of a certificate hierarchy without the complexity of self-hosting a PKI.

With the introduction of IAM Roles Anywhere, there is now an accompanying requirement to manage certificates and their lifecycle. AWS Private CA is an AWS managed service that can issue X.509 certificates for hosts. This makes it ideal for use with IAM Roles Anywhere. However, AWS Private CA doesn’t natively deploy certificates to hosts.

Certificate deployment is an essential part of managing the certificate lifecycle for IAM Roles Anywhere, the absence of which can lead to operational inefficiencies. Fortunately, there is a solution. By using AWS Systems Manager with its Run Command capability, you can automate issuing and renewing certificates from AWS Private CA. This simplifies the management process of IAM Roles Anywhere on a large scale.

In this blog post, we walk you through an architectural pattern that uses AWS Private CA and Systems Manager to automate issuing and renewing X.509 certificates. This pattern smooths the integration of non-AWS hosts with IAM Roles Anywhere. It can help you replace long-term credentials while reducing the operational complexity of IAM Roles Anywhere with certificate vending automation.

While IAM Roles Anywhere supports both Windows and Linux, this solution is designed for a Linux environment. Windows users integrating with Active Directory should check out the AWS Private CA Connector for Active Directory. By implementing this architectural pattern, you can distribute certificates to your non-AWS Linux hosts, thereby enabling them to use IAM Roles Anywhere. This approach can help you simplify certificate management tasks.

Architecture overview

Figure 1: Architecture overview

Figure 1: Architecture overview

The architectural pattern we propose (Figure 1) is composed of multiple stages, involving AWS services including Amazon EventBridge, AWS Lambda, Amazon DynamoDB, and Systems Manager.

  1. Amazon EventBridge Scheduler invokes a Lambda function called CertCheck twice daily.
  2. The Lambda function scans a DynamoDB table to identify instances that require certificate management. It specifically targets instances managed by Systems Manager, which the administrator populates into the table.
  3. CertCheck receives the information about instances with no certificate and about instances requiring new certificates because the existing ones are close to expiry.
  4. Depending on the certificate’s expiration date for a particular instance, a second Lambda function called CertIssue is launched.
  5. CertIssue instructs Systems Manager to apply a run command on the instance.
  6. Run Command generates a certificate signing request (CSR) and a private key on the instance.
  7. The CSR is retrieved by Systems Manager, the private key remains securely on the instance.
  8. CertIssue then retrieves the CSR from Systems Manager.
  9. CertIssue uses the CSR to request a signed certificate from AWS Private CA.
  10. On successful certificate issuance, AWS Private CA creates an event through EventBridge that contains the ID of the newly issued certificate.
  11. This event subsequently invokes a third Lambda function called CertDeploy.
  12. CertDeploy retrieves the certificate from AWS Private CA and invokes Systems Manager to launch Run Command with the certificate data and updates the certificate’s expiration date in the DynamoDB table for future reference.
  13. Run Command conducts a brief test to verify the certificate’s functionality, and upon success, stores the signed certificate on the instance.
  14. The instance can then exchange the certificate for AWS credentials through IAM Roles Anywhere.

Additionally, on a certificate rotation failure, an Amazon Simple Notification Service (Amazon SNS) notification is delivered to an email address specified during the AWS CloudFormation deployment.

The solution enables periodic certificate rotation. If a certificate is nearing expiration, the process initiates the generation of a new private key and CSR, thus issuing a new certificate. Newly generated certificates, private keys, and CSRs replace the existing ones.

With certificates in place, they can be used by IAM Roles Anywhere to obtain short-term IAM credentials. For more details on setting up IAM Roles Anywhere, see the IAM Roles Anywhere User Guide.

Costs

Although this solution offers significant benefits, it’s important to consider the associated costs before you deploy. To provide a cost estimate, managing certificates for 100 hosts would cost approximately $85 per month (for example, 100 hosts rotating certificates six times a month results in 600 certificates at $0.058 each, or about $35, plus the $50 monthly AWS Private CA charge). However, for a larger deployment of 1,100 hosts with the Systems Manager advanced tier, the cost would be around $5,937 per month. These pricing estimates include the rotation of certificates six times a month.

AWS Private CA in short-lived mode incurs a monthly charge of $50, and each certificate issuance costs $0.058. Systems Manager Hybrid Activation standard has no additional cost for managing fewer than 1,000 hosts. If you have more than 1,000 hosts, the advanced plan must be used at an approximate cost of $5 per host per month. DynamoDB, Amazon SNS, and Lambda costs should be under $5 per month per service for under 1,000 hosts. For environments with over 1,000 hosts, it might be worthwhile to explore other options for machine-to-machine authentication or another option for distributing certificates.

Please note that the estimated pricing mentioned here is specific to the us-east-1 AWS Region and can be calculated for other regions using the AWS Pricing Calculator.

Prerequisites

You should have several items already set up to make it easier to follow along with the blog.

Enabling Systems Manager hybrid activation

To create a hybrid activation, follow these steps:

  1. Open the AWS Management Console for Systems Manager, go to Hybrid activations and choose Create an Activation.
    Figure 2: Hybrid activation page

    Figure 2: Hybrid activation page

  2. Enter a description [optional] for the activation and adjust the Instance limit value to the maximum you need, then choose Create activation.
    Figure 3: Create hybrid activation

    Figure 3: Create hybrid activation

  3. This gives you a green banner with an Activation Code and Activation ID. Make a note of these.
    Figure 4: Successful hybrid activation with activation code and ID

    Figure 4: Successful hybrid activation with activation code and ID

  4. Install the AWS Systems Manager Agent (SSM Agent) on the hosts to be managed. Follow the instructions for the appropriate operating system. In the example commands, replace <activation-code>, <activation-id>, and <region> with the activation code and ID from the previous step and your Region. Here is an example of commands to run for an Ubuntu host:
    mkdir /tmp/ssm
    
    curl https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/debian_amd64/amazon-ssm-agent.deb -o /tmp/ssm/amazon-ssm-agent.deb
    
    sudo dpkg -i /tmp/ssm/amazon-ssm-agent.deb
    
    sudo service amazon-ssm-agent stop
    
    sudo -E amazon-ssm-agent -register -code "<activation-code>" -id "<activation-id>" -region <region> 
    
    sudo service amazon-ssm-agent start
    

You should see a message confirming the instance was successfully registered with Systems Manager.

Note: If you receive errors during Systems Manager registration about the Region having invalid characters, verify that the Region is not in quotation marks.

Deploy with CloudFormation

We’ve created a Git repository with a CloudFormation template that sets up the aforementioned architecture. An existing S3 bucket is required for CloudFormation to upload the Lambda package.

To launch the CloudFormation stack:

  1. Clone the Git repository that contains the CloudFormation template and the Lambda function code.
    git clone https://github.com/aws-samples/aws-privateca-certificate-deployment-automator.git
    

  2. cd into the directory created by Git.
    cd aws-privateca-certificate-deployment-automator
    

  3. Launch the CloudFormation stack within the cloned Git directory using the cf_template.yaml file, replacing <DOC-EXAMPLE-BUCKET> with the name of your S3 bucket from the prerequisites.
    aws cloudformation package --template-file cf_template.yaml --output-template-file packaged.yaml --s3-bucket <DOC-EXAMPLE-BUCKET>
    

Note: These commands should be run on the system you plan to use to deploy the CloudFormation and have the Git and AWS CLI installed.

After successfully running the CloudFormation package command, run the CloudFormation deploy command. The template supports various parameters to change the path where the certificates and keys will be generated. Adjust the paths as needed with the parameter-overrides flag, but verify that they exist on the hosts. Replace the <email> placeholder with one that you want to receive alerts for failures. The stack name must be in lower case.

aws cloudformation deploy --template packaged.yaml --stack-name ssm-pca-stack --capabilities CAPABILITY_NAMED_IAM --parameter-overrides SNSSubscriberEmail=<email>

The available CloudFormation parameters are listed in the following table:

Parameter Default value Use
AWSSigningHelperPath /root Default path on the host for the AWS Signing Helper binary
CACertPath /tmp Default path on the host the CA certificate will be created in
CACertValidity 10 Default CA certificate length in years
CACommonName ca.example.com Default CA certificate common name
CACountry US Default CA certificate country code
CertPath /tmp Default path on the host the certificates will be created in
CSRPath /tmp Default path on the host the CSRs will be created in
KeyAlgorithm RSA_2048 Default algorithm used to create the CA private key
KeyPath /tmp Default path on the host the private keys will be created in
OrgName Example Corp Default CA certificate organization name
SigningAlgorithm SHA256WITHRSA Default CA signing algorithm for issued certificates

After the CloudFormation stack is ready, manually add the hosts requiring certificate management into the DynamoDB table.

You will also receive an email at the email address specified to accept the SNS topic subscription. Make sure to choose the Confirm Subscription link as shown in Figure 5.

Figure 5: SNS topic subscription confirmation

Figure 5: SNS topic subscription confirmation

Add data to the DynamoDB table

  1. Open the AWS Systems Manager console and select Fleet Manager.
  2. Choose Managed Nodes and copy the Node ID value. The node ID value in the Fleet Manager as shown in Figure 6 will be the host ID to be used in a subsequent step.
    Figure 6: Systems Manager Node ID

    Figure 6: Systems Manager Node ID

  3. Open the DynamoDB console and select Dashboard and then Tables in the left navigation pane.
    Figure 7: DynamoDB menu

    Figure 7: DynamoDB menu

  4. Select the certificates table.
    Figure 8: DynamoDB tables

    Figure 8: DynamoDB tables

  5. Choose Explore table items and then choose Create item.
  6. Enter the node ID as a value for the hostID attribute as copied in step 2.
    Figure 9: DynamoDB table hostID attribute creation

    Figure 9: DynamoDB table hostID attribute creation

Additional string attributes listed in the following table can be added to the item to specify paths for the certificates on a per host basis. If these attributes aren’t created, either the default paths or overrides in the CloudFormation parameters will be used.

Additional supported attributes Use
certPath Path on the host the certificate will be created in
keyPath Path on the host the private key will be created in
signinghelperPath Path on the host for the AWS Signing Helper binary
cacertPath Path on the host the CA certificate will be created in

The CertCheck Lambda function created by the CloudFormation template runs twice daily to verify that the certificates for these hosts are kept up to date. If necessary, you can use the Lambda invoke command to run the Lambda function on-demand.

aws lambda invoke --function-name CertCheck-Trigger --cli-binary-format raw-in-base64-out response.json

The certificate expiration and serial number metadata are stored in the certificates DynamoDB table. Select the certificates table and choose Explore table items to view the data.

Figure 10: DynamoDB table item with certificate expiration and serial for a host

Figure 10: DynamoDB table item with certificate expiration and serial for a host

Validation

To validate successful certificate deployment, you should find four files in the location specified in the CloudFormation parameter or DynamoDB table attribute, as shown in the following table.

File Use Location
{host}.crt The certificate containing the public key, signed by AWS Private CA. certPath attribute in DynamoDB. Otherwise, default specified by the certPath CF parameter.
ca_chain_certificate.crt The certificate chain including intermediates from AWS Private CA. cacertPath attribute in DynamoDB. Otherwise, default specified by the CACertPath CF parameter.
{host}.key The private key for the certificate. keyPath attribute in DynamoDB. Otherwise, default specified by the KeyPath CF parameter.
{host}.csr The CSR used to generate the signed certificate. Default specified by the CSRPath CF parameter.

These certificates can now be used to configure the host for IAM Roles Anywhere. See Obtaining temporary security credentials from AWS Identity and Access Management Roles Anywhere for using the signing helper tool provided by IAM Roles Anywhere. The signing helper must be installed on the instance for the validation to work. You can pass the location of the signing helper as a parameter to the CloudFormation template.
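As a quick orientation, the signing helper is typically wired into the AWS CLI and SDKs as a credential_process. The following invocation is a sketch only: the binary path reflects the AWSSigningHelperPath default, the certificate and key paths reflect the default /tmp locations from the table above, and the trust anchor, profile, and role ARNs are placeholders you create when configuring IAM Roles Anywhere.

/root/aws_signing_helper credential-process \
    --certificate /tmp/{host}.crt \
    --private-key /tmp/{host}.key \
    --trust-anchor-arn <trust-anchor-arn> \
    --profile-arn <profile-arn> \
    --role-arn <role-arn>

You can also set this command (on a single line) as the credential_process entry in the host’s ~/.aws/config file so that the AWS CLI and SDKs obtain temporary credentials automatically.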

Note: As a security best practice, it’s important to use permissions and ACLs to keep the private key secure and restrict access to it. The automation creates the private key and sets chmod 400 permissions on it. The chmod command is used to change the permissions of a file or directory; chmod 400 allows the owner of the file to read it while preventing others from reading, writing, or running it.

Revoke a certificate

AWS Private CA also supports generating a certificate revocation list (CRL), which can be imported to IAM Roles Anywhere. The CloudFormation template automatically sets up the CRL process between AWS Private CA and IAM Roles Anywhere.

Figure 11: Certificate revocation process

Figure 11: Certificate revocation process

Within 30 minutes after revocation, AWS Private CA generates a CRL file and uploads it to the CRL S3 bucket that was created by the CloudFormation template. Then, the CRLProcessor Lambda function receives a notification through EventBridge of the new CRL file and passes it to the IAM Roles Anywhere API.

To revoke a certificate, use the AWS CLI. In the following example, replace <certificate-authority-arn>, <certificate-serial>, and <revocation-reason> with your own information.

aws acm-pca revoke-certificate --certificate-authority-arn <certificate-authority-arn> --certificate-serial <certificate-serial> --revocation-reason <revocation-reason>

The AWS Private CA ARN can be found in the CloudFormation stack outputs under the name PCAARN. The certificate serial numbers are listed in the DynamoDB table for each host, as previously mentioned. The revocation reason can be one of the following values:

  • UNSPECIFIED
  • KEY_COMPROMISE
  • CERTIFICATE_AUTHORITY_COMPROMISE
  • AFFILIATION_CHANGED
  • SUPERSEDED
  • CESSATION_OF_OPERATION
  • PRIVILEGE_WITHDRAWN
  • A_A_COMPROMISE

Revoking a certificate won’t automatically generate a new certificate for the host. See Manually rotate certificates.

Manually rotate certificates

The certificates are set to expire weekly and are rotated the day of expiration. If you need to manually replace a certificate sooner, remove the expiration date for the host’s record in the DynamoDB table (see Figure 12). On the next run of the Lambda function, the lack of an expiration date will cause the certificate for that host to be replaced. To immediately renew a certificate or test the rotation function, remove the expiration date from the DynamoDB table and run the following Lambda invoke command. After the certificates have been rotated, the new expiration date will be listed in the table.

aws lambda invoke --function-name CertCheck-Trigger --cli-binary-format raw-in-base64-out response.json
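If you prefer to clear the expiration attribute programmatically instead of editing the item in the DynamoDB console, a minimal sketch using the AWS SDK for Java could look like the following. The attribute name expiration and the example host ID are assumptions; check the actual attribute names on the item (shown in Figure 10) before running anything like this.

import java.util.Map;

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.UpdateItemRequest;

public class ClearCertificateExpiration {
    public static void main(String[] args) {
        try (DynamoDbClient dynamoDb = DynamoDbClient.create()) {
            UpdateItemRequest request = UpdateItemRequest.builder()
                    .tableName("certificates") // table created by the CloudFormation stack
                    .key(Map.of("hostID", AttributeValue.builder()
                            .s("mi-0123456789abcdef0") // hypothetical Systems Manager node ID
                            .build()))
                    .updateExpression("REMOVE #exp")
                    .expressionAttributeNames(Map.of("#exp", "expiration")) // assumed attribute name
                    .build();

            // Removing the expiration attribute causes the next CertCheck run to reissue the certificate
            dynamoDb.updateItem(request);
        }
    }
}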

Conclusion

By using AWS IAM Roles Anywhere, systems outside of AWS can exchange X.509 certificates for short-term AWS STS credentials. This can help you improve your security in a hybrid environment by reducing the use of long-term access keys as credentials.

For organizations without an existing enterprise PKI, the solution described in this post provides an automated method of generating and rotating certificates using AWS Private CA and AWS Systems Manager. We showed you how you can use Systems Manager to set up a non-AWS host with certificates for use with IAM Roles Anywhere and ensure they’re rotated regularly.

Deploy this solution today and move towards IAM Roles Anywhere to remove long-term credentials for programmatic access. For more information, see the IAM Roles Anywhere blog article or post your queries on AWS re:Post.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on IAM re:Post or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Chris Sciarrino

Chris Sciarrino

Chris is a Senior Solutions Architect and a member of the AWS security field community based in Toronto, Canada. He works with enterprise customers helping them design solutions on AWS. Outside of work, Chris enjoys spending his time hiking and skiing with friends and listening to audiobooks.

Ravikant Sharma

Ravikant Sharma

Ravikant is a Solutions Architect based in London. He specializes in cloud security and financial services. He helps Fintech and Web3 startups build and scale their business using AWS Cloud. Prior to AWS, he worked at Citi Singapore as Vice President – API management, where he played a pivotal role in implementing Open API security frameworks and Open Banking regulations.

Rahul Gautam

Rahul Gautam

Rahul is a Security and Compliance Specialist Solutions Architect based in London. He helps customers in adopting AWS security services to meet and improve their security posture in the cloud. Before joining the SSA team, Rahul spent 5 years as a Cloud Support Engineer in AWS Premium Support. Outside of work, Rahul enjoys travelling as much as he can.

Use IAM runtime roles with Amazon EMR Studio Workspaces and AWS Lake Formation for cross-account fine-grained access control

Post Syndicated from Ashley Zhou original https://aws.amazon.com/blogs/big-data/use-iam-runtime-roles-with-amazon-emr-studio-workspaces-and-aws-lake-formation-for-cross-account-fine-grained-access-control/

Amazon EMR Studio is an integrated development environment (IDE) that makes it straightforward for data scientists and data engineers to develop, visualize, and debug data engineering and data science applications written in R, Python, Scala, and PySpark. EMR Studio provides fully managed Jupyter notebooks and tools such as Spark UI and YARN Timeline Server via EMR Studio Workspaces. You can attach an EMR Studio Workspace to an EMR cluster, and use the compute power of the EMR cluster and run data science jobs on the cluster. Data is often stored in data lakes managed by AWS Lake Formation, enabling you to apply fine-grained access control through a simple grant or revoke mechanism.

We’re happy to introduce runtime roles for EMR Studio Workspaces. You can now define a runtime role and assign it to an EMR cluster when attaching an EMR Studio Workspace. The jobs on the EMR cluster will use this runtime role to access AWS resources. After configuring a runtime role, you can also use Lake Formation and apply fine-grained data access control for the jobs submitted by the EMR Studio Workspace.

Previously, when attaching EMR Studio Workspaces to EMR clusters, all Workspaces had to use the same AWS Identity and Access Management (IAM) role—namely, the cluster’s Amazon Elastic Compute Cloud (Amazon EC2) instance profile. Therefore, all Workspaces attached to the same EMR cluster had the same data access. To control access to data sources, each EMR Studio Workspace had to use a different EMR cluster, and multiple EMR instance profiles were needed.

Starting with the release of Amazon EMR 6.11, you can now choose a runtime role when attaching an EMR Studio Workspace to an EMR cluster. This runtime role scopes down access at the Workspace level. Your Apache Livy and Apache Spark jobs that run from the EMR Studio Workspaces will have permission to access only the data and resources permitted by policies attached to the runtime role. Also, when data is accessed from data lakes managed with Lake Formation, you can enforce fine-grained data access control using Lake Formation permissions. This helps you reduce operational overhead.

In this post, we demonstrate how to configure runtime roles for EMR Studio Workspaces and attach a Workspace to an EMR cluster with runtime roles. Because large enterprises typically use multiple AWS accounts, and many of those accounts might need access to a data lake managed by a single AWS account, our example uses two AWS accounts. We explain how to control access to EMR Studio runtime roles, manage data access across accounts in a data lake via Lake Formation, and enforce table-level and column-level permissions to the EMR runtime roles.

Solution overview

To demonstrate fine-grained access control, we create a sample AWS Glue database named company and manage the database permission in Lake Formation. The database consists of two separate tables:

  • employees – This table stores information about the company’s employees, including employee ID, name, department, and salary
  • products – This table stores information about the products sold by the company, including product ID, name, category, and price

To demonstrate data access control, we consider the following data users:

  • Alice, a data scientist in the sales team – She should have read-only access to all columns in the products table and to selected columns (uID, name, and department) in the employees table, as sketched in the example after this list
  • Bob, a data scientist in the human resources team – He should have read-only access to all columns in employees table and should not have access to the products table
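To give a sense of what Alice’s column-level permission looks like at the API level, the following is a minimal, illustrative sketch using the AWS SDK for Java Lake Formation client. It is not part of the walkthrough steps, and the principal ARN is hypothetical; in this post, the actual permissions are granted through the Lake Formation setup described later.

import software.amazon.awssdk.services.lakeformation.LakeFormationClient;
import software.amazon.awssdk.services.lakeformation.model.DataLakePrincipal;
import software.amazon.awssdk.services.lakeformation.model.GrantPermissionsRequest;
import software.amazon.awssdk.services.lakeformation.model.Permission;
import software.amazon.awssdk.services.lakeformation.model.Resource;
import software.amazon.awssdk.services.lakeformation.model.TableWithColumnsResource;

public class GrantAliceColumnAccess {
    public static void main(String[] args) {
        try (LakeFormationClient lakeFormation = LakeFormationClient.create()) {
            GrantPermissionsRequest request = GrantPermissionsRequest.builder()
                    // Hypothetical runtime role used by Alice's EMR Studio Workspace
                    .principal(DataLakePrincipal.builder()
                            .dataLakePrincipalIdentifier("arn:aws:iam::111122223333:role/alice-runtime-role")
                            .build())
                    // SELECT on only the uID, name, and department columns of company.employees
                    .resource(Resource.builder()
                            .tableWithColumns(TableWithColumnsResource.builder()
                                    .databaseName("company")
                                    .name("employees")
                                    .columnNames("uID", "name", "department")
                                    .build())
                            .build())
                    .permissions(Permission.SELECT)
                    .build();

            lakeFormation.grantPermissions(request);
        }
    }
}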

To demonstrate cross-account data sharing, we consider two accounts:

  • Data producer account – We refer to this account as 123456789012 in this post. This account manages the raw data in Amazon Simple Storage Service (Amazon S3) and writes data to the data lake. The company database and tables should be in this account.
  • Data consumer account – We refer to this account as 111122223333 in this post. This account is accessed directly by the users for data analysis and doesn’t have write access to the data. This account should be accessible by Alice and Bob.

The architecture is implemented as follows:

  • The data producer account manages a data lake. Raw data is stored in S3 buckets and catalogued in the AWS Glue Data Catalog.
  • Lake Formation in the data producer account governs the data access via the Data Catalog, and provides cross-account data sharing with the data consumer account.
  • Lake Formation in the data consumer account governs cross-account access to the data lake on table level and fine-grained Lake Formation permissions. For more information, refer to Methods for fine-grained access control.
  • EMR Studio Workspaces in the data consumer account use runtime roles when running jobs on an EMR cluster.
  • The EMR cluster connects to Glue Data Catalog in the data consumer account and queries the data from the data lake through cross-account data sharing.

The following diagram illustrates this architecture.

In the following sections, we go through the steps to share data across accounts via Lake Formation, run an EMR Studio Workspace with runtime roles, and demonstrate fine-grained access control.

Prerequisites

You should have the following prerequisites:

Create the infrastructure in the data producer account

Complete the following steps to create the infrastructure resources:

  1. Log in to the data producer AWS account (123456789012).
  2. Choose Launch Stack to deploy a CloudFormation template to create the necessary resources.
  3. For DataLakeBucketSuffix, enter the suffix for the S3 bucket used by the data lake. The full name of the S3 bucket to be created will be {AwsAccountId}-{AwsRegion}-{DataLakeBucketSuffix}.
  4. After the CloudFormation stack is created, navigate to the Outputs tab of the stack and capture the value of DataLakeS3Bucket to use in the next step.

Create data files and upload them to Amazon S3 in the data producer account

Configure your AWS CLI to use the IAM identity with permission to upload to DataLakeS3BucketName in the data producer AWS account (123456789012), or you can sign in to CloudShell using the AWS Management Console. Complete the following steps:

  1. On your local machine, move to a directory of your choice with the cd command, for example, cd ~.
  2. Run the script with chmod 744 create_sample_data.sh && ./create_sample_data.sh <DataLakeS3BucketName>.

The script will create a subdirectory tmp in your current working directory, create the test data in CSV files, and upload the files to the DataLakeS3BucketName S3 bucket.
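
To get a sense of what the script does before you run it, the following bash sketch shows the kind of commands it contains. The file names, CSV columns, and S3 prefixes below are illustrative assumptions; the script in the solution's repository is the authoritative version.

#!/bin/bash
# Hypothetical sketch of create_sample_data.sh; the real script lives in the repository
BUCKET=$1    # first argument: DataLakeS3BucketName

mkdir -p tmp

# Sample employees data (columns assumed from the table description)
cat > tmp/employees.csv <<EOF
uid,name,department,salary
1,Alice,Sales,50000
2,Bob,Human Resources,60000
EOF

# Sample products data (columns assumed from the table description)
cat > tmp/products.csv <<EOF
product_id,name,category,price
100,Widget,Hardware,19.99
EOF

# Upload the files to the data lake bucket (prefix names are assumptions)
aws s3 cp tmp/employees.csv s3://$BUCKET/company-database/employees/employees.csv
aws s3 cp tmp/products.csv s3://$BUCKET/company-database/products/products.csv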

Set up Lake Formation in the data producer account

In this section, we walk through the steps to set up Lake Formation in the data producer account.

Set up Lake Formation cross-account data sharing version settings

Lake Formation supports multiple data sharing versions. For this post, we use version 3. To learn more about the differences between data sharing versions, refer to Updating cross-account data sharing version settings. To change the data sharing version, see To enable the new version.
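
If you want to confirm the current setting from the AWS CLI before changing it, the following read-only sketch prints the cross-account data sharing version configured in the data producer account. It assumes the version is exposed through the CROSS_ACCOUNT_VERSION key in the data lake settings parameters.

# Print the cross-account data sharing version configured in Lake Formation
aws lakeformation get-data-lake-settings \
    --query 'DataLakeSettings.Parameters.CROSS_ACCOUNT_VERSION' \
    --region <region>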

Register the Amazon S3 location as the data lake location

When you register an Amazon S3 location with Lake Formation, you specify an IAM role with read/write permissions on that location. After registering, when EMR clusters request access to this Amazon S3 location, Lake Formation will supply temporary credentials of the provided role to access the data. We already created the role LakeFormationCompanyDatabaseDataAccessRole for this purpose in the previous step. To register the Amazon S3 location as the data lake location, complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data producer account (123456789012).
  2. In the navigation pane, choose Data lake locations under Administration.
  3. Choose Register location.
  4. For Amazon S3 path, enter s3://<DataLakeS3BucketName>/company-database.
  5. For IAM role, enter LakeFormationCompanyDatabaseDataAccessRole.
  6. For Permission mode, select Lake Formation.
  7. Choose Register location.

Register data location
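
If you prefer to script the registration, the following AWS CLI sketch performs the same steps in the data producer account. It assumes the bucket name placeholder used earlier and the role name created by the CloudFormation stack.

# Register the S3 location with Lake Formation using the data access role
aws lakeformation register-resource \
    --resource-arn arn:aws:s3:::<DataLakeS3BucketName>/company-database \
    --role-arn arn:aws:iam::123456789012:role/LakeFormationCompanyDatabaseDataAccessRole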

Revoke permissions granted to IAMAllowedPrincipals

The IAMAllowedPrincipals group includes any IAM users and roles that are allowed access to your Data Catalog resources by your IAM policies. To enforce the Lake Formation model, we need to revoke permission from IAMAllowedPrincipals using the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data producer account.
  2. In the navigation pane, choose Data lake permissions under Permissions.
  3. Filter permissions by Database = company and Principal = IAMAllowedPrincipals.
  4. Select all the permissions given to the principal IAMAllowedPrincipals and choose Revoke.

Revoke permissions granted to IAMAllowedPrincipals
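
The same revocation can be scripted with the AWS CLI. This sketch assumes the database-level permission granted to IAMAllowedPrincipals is ALL; adjust the permission list to match what the console shows for your database.

# Revoke the default permissions granted to IAMAllowedPrincipals on the company database
aws lakeformation revoke-permissions \
    --principal DataLakePrincipalIdentifier=IAM_ALLOWED_PRINCIPALS \
    --resource '{"Database": {"Name": "company"}}' \
    --permissions ALL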

Set up application integration settings

To enforce permissions for the EMR cluster, you need to register a session tag value with Lake Formation. Lake Formation uses this session tag to authorize callers and provide access to the data lake. We register Amazon EMR as the session tag value. This value will be referenced in the security configuration when creating the EMR cluster.

Set up the session tag using the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data producer account.
  2. Choose Application integration settings under Administration in the navigation pane.
  3. Select Allow external engines to filter data in Amazon S3 locations registered with Lake Formation.
  4. For Session tag values, enter Amazon EMR.
  5. For AWS account IDs, enter the data consumer AWS account ID (111122223333).
  6. Choose Save.

Set up application integration settings in data producer account

Share the database and tables to the data consumer account

We now grant permissions to the data consumer AWS account, including grantable permissions. This allows the Lake Formation data lake administrator in the data consumer account to control access to the data within the account.

Grant database permissions to the data consumer account

Complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data producer account.
  2. In the navigation pane, choose Databases.
  3. Select the database company, and on the Actions menu, under Permissions, choose Grant.
  4. In the Principals section, select External accounts and enter the data consumer AWS account (111122223333).
  5. In the LF-Tags or catalog resources section, choose company for Databases.
  6. In the Database permissions section, select Describe for both Database permissions and Grantable permissions.

This allows the data lake administrator in the data consumer account to describe the database and grant describe permissions to other principals in the data consumer account.

  7. Choose Grant.

Grant database permissions to the data consumer account

Grant table permissions to the data consumer account

Complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data producer account.
  2. In the navigation pane, choose Tables.
  3. Select the products table, which belongs to the company database, and on the Actions menu, under Permissions, choose Grant.
  4. In the Principals section, select External accounts and enter the data consumer AWS account (111122223333).
  5. In the LF-Tags or catalog resources section, select Named data catalog resources and specify the following:
    1. For Databases, choose company.
    2. For Tables, choose products and employees.
  6. In the Table permissions section, choose Select and Describe for both Table permissions and Grantable permissions.

This allows the data lake administrator in the data consumer account to select and describe the tables, and grant select and describe table permissions to other principals in the data consumer account.

  7. In the Data permissions section, select All data access.
  8. Choose Grant.
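
If you want to script this cross-account grant instead of using the console, an AWS CLI sketch for the products table looks like the following; repeat it for the employees table.

# Grant SELECT and DESCRIBE on the products table to the data consumer account,
# including grantable permissions (run in the data producer account)
aws lakeformation grant-permissions \
    --principal DataLakePrincipalIdentifier=111122223333 \
    --resource '{"Table": {"DatabaseName": "company", "Name": "products"}}' \
    --permissions SELECT DESCRIBE \
    --permissions-with-grant-option SELECT DESCRIBE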

Grant table permissions to the data consumer account

Now we have finished setting up the data producer account.

Set up the infrastructure in the data consumer account

Complete the following steps to create the infrastructure resources:

  1. Log in to the data consumer account (111122223333).
  2. Choose Launch stack to deploy a CloudFormation template to create the necessary resources.
  3. For Release Label, enter the Amazon EMR release label to use, which must be emr-6.11 or later.
  4. For InstanceType, choose the instance type for the EMR cluster, such as r4.4xlarge.
  5. For EMRS3BucketNameSuffix, enter the S3 bucket suffix to store EMR cluster logs and EMR notebook files. The full S3 bucket name to be created will be {AWSAccountId}-{AWSRegion}-{EMRS3BucketNameSuffix}.
  6. For S3PathToInTransitCertificate, enter the S3 path for the .zip file that contains the .pem files used for in-transit encryption.

For instructions on creating the .zip file that contains the .pem files and uploading them to your S3 bucket, refer to Providing certificates for encrypting data in transit with Amazon EMR encryption.

  7. After the CloudFormation stack is created, navigate to the Outputs tab of the stack.
  8. Capture the value of EMRStudioLink to use to sign in to EMR Studio.

Accept the resource share in the data consumer account

To access shared resources, you must accept the invitation first.

  1. Open the AWS RAM console of the data consumer account with the IAM identity that has AWS RAM access.
  2. In the navigation pane, choose Resource shares under Shared with me.

You should see two pending resource shares from the data producer account.

  3. Accept both resource shares.

You should see the company database, employees table, and products table in the Data Catalog.
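
You can also accept the invitations from the AWS CLI. The following sketch lists the pending invitations and accepts one by its ARN.

# List pending resource share invitations in the data consumer account
aws ram get-resource-share-invitations \
    --query 'resourceShareInvitations[?status==`PENDING`]'

# Accept an invitation by its ARN (repeat for each pending invitation)
aws ram accept-resource-share-invitation \
    --resource-share-invitation-arn <invitation-arn>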

Set up Lake Formation in the data consumer account

In this section, we walk through the steps to set up Lake Formation in the data consumer account.

Set up application integration settings

Similar to the setup in the data producer account, you need to register Amazon EMR as a session tag. This value is referenced in the security configuration when creating the EMR cluster in the CloudFormation stack.

To do that, complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data consumer account (111122223333).
  2. Choose Application integration settings under Administration in the navigation pane.
  3. Select Allow external engines to filter data in Amazon S3 locations registered with Lake Formation.
  4. For Session tag values, enter Amazon EMR.
  5. For AWS account IDs, enter the data consumer AWS account ID (111122223333).
  6. Choose Save.

Set up application integration settings in data consumer account

Grant describe permissions to runtime roles on the default database

If you don’t have a default database in Lake Formation, or if your default database already has permissions granted to IAMAllowedPrincipals, you can skip this step.

Amazon EMR checks the default database by default. If you already have a default database in Lake Formation, grant the describe permission to the runtime roles on the default database by completing the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator user in the data consumer account.
  2. In the navigation pane, choose Databases.
  3. Select the default database, verify that the owner account ID is the data consumer account (111122223333), and on the Actions menu, choose Grant.
  4. In the Principals section, select IAM users and roles.
  5. For IAM users and roles, choose sales-runtime-role and human-resource-runtime-role.
  6. For LF-Tags or catalog resources, select Named data catalog resources and choose default for Databases.
  7. In the Database permissions section, for Database permissions, choose Describe.
  8. Choose Grant.

Grant describe permissions to runtime roles on the default database

Create a resource link for the shared database

To access the database and table resources that were shared by the data producer AWS account, you need to create a resource link in the data consumer AWS account. A resource link is a Data Catalog object that is a link to a local or shared database or table. After you create a resource link to a database or table, you can use the resource link name wherever you would use the database or table name. In this step, you grant permission on the resource link to the runtime role principals. The runtime roles will then access the data in the shared database and underlying tables through the resource link.

To create a resource link, complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data consumer account.
  2. In the navigation pane, choose Databases.
  3. Select the company database, verify that the owner account ID is the data producer account (123456789012), and on the Actions menu, choose Create Resource links.
  4. For Resource link name, enter the name of the resource link (for example, company-shared).
  5. For Shared database’s region, choose the Region of the company database.
  6. For Shared database, choose the company database.
  7. For Shared database’s owner ID, enter the account ID of the data producer account (123456789012).
  8. Choose Create.

Create a resource link for the shared database
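
Because a resource link is a Data Catalog object, you can also create it through the AWS Glue CreateDatabase API. The following CLI sketch creates the company-shared link in the data consumer account.

# Create a resource link named company-shared that points to the shared company database
aws glue create-database \
    --database-input '{
        "Name": "company-shared",
        "TargetDatabase": {
            "CatalogId": "123456789012",
            "DatabaseName": "company"
        }
    }'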

Grant permissions on the resource link to the runtime role principals

Grant permissions on the resource link to sales-runtime-role and human-resource-runtime-role using the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data consumer account.
  2. In the navigation pane, choose Databases.
  3. Select the resource link (company-shared) and on the Actions menu, choose Grant.
  4. In the Principals section, select IAM users and roles, and choose sales-runtime-role and human-resource-runtime-role.
  5. In the LF-Tags or catalog resources section, for Databases, choose company-shared.
  6. In the Resource link permissions section, select Describe.

This allows the runtime roles to describe the resource link. We don’t make any selections for grantable permissions because runtime roles shouldn’t be able to grant permissions to other principals.

  7. Choose Grant.

Grant permissions on the resource link to the runtime role principals

Grant permission on the tables to the runtime role principals

You need to grant permissions on the tables to sales-runtime-role and human-resource-runtime-role to allow data access:

  • The human-resource-runtime-role should have describe and select permissions on all columns in the employees table, and no permissions on the products table.
  • The sales-runtime-role should have select permissions on the columns uid, name, and department in the employees table, and describe and select permissions on all columns in the products table.

Grant permission on the employees table to human-resource-runtime-role

Complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data consumer account.
  2. In the navigation pane, choose Databases.
  3. Select the resource link (company-shared) and on the Actions menu, choose Grant on Target.
  4. In the Principals section, select IAM users and roles, then choose human-resource-runtime-role.
  5. In the LF-Tags or catalog resources section, select Named data catalog resources and specify the following:
    1. For Databases, choose company.
    2. For Tables, choose employees.
  6. In the Table permissions section, for Table permissions, select Describe and Select.
  7. In the Data permissions section, select All data access.
  8. Choose Grant.

Grant permission on the employees table to human-resource-runtime-role

Grant permission on the employees table to sales-runtime-role

Complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data consumer account.
  2. In the navigation pane, choose Databases.
  3. Select the resource link (company-shared) and on the Actions menu, choose Grant on Target.
  4. In the Principals section, select IAM users and roles, then choose sales-runtime-role.
  5. In the LF-Tags or catalog resources section, select Named data catalog resources and specify the following:
    1. For Databases, choose company.
    2. For Tables, choose employees.
  6. In the Table permissions section, for Table permissions, select Select.
  7. In the Data permissions section, select Column-based access.
  8. Select Include columns and choose the uid, name, and department columns.
  9. Choose Grant.

 Grant permission on the employees table to sales-runtime-role
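
For reference, the same column-restricted grant can be expressed with the AWS CLI by using a TableWithColumns resource. This sketch grants sales-runtime-role SELECT on only the three allowed columns; the role ARN format is an assumption, so substitute the actual ARN of the role in your account.

# Grant column-restricted SELECT on the shared employees table to sales-runtime-role
# (run in the data consumer account; CatalogId is the data producer account that owns the table)
aws lakeformation grant-permissions \
    --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:role/sales-runtime-role \
    --resource '{
        "TableWithColumns": {
            "CatalogId": "123456789012",
            "DatabaseName": "company",
            "Name": "employees",
            "ColumnNames": ["uid", "name", "department"]
        }
    }' \
    --permissions SELECT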

Grant permission on the products table to sales-runtime-role

Complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the data consumer account.
  2. In the navigation pane, choose Databases.
  3. Select the resource link (company-shared) and on the Actions menu, choose Grant on Target.
  4. In the Principals section, select IAM users and roles, then choose sales-runtime-role.
  5. In the LF-Tags or catalog resources section, select Named data catalog resources and specify the following:
    1. For Databases, choose company.
    2. For Tables, choose products.
  6. In the Table permissions section, for Table permissions, select Select and Describe.
  7. In the Data permissions section, select All data access.
  8. Choose Grant.

Grant permission on the products table to sales-runtime-role

Log in to EMR Studio and use the EMR Studio Workspace

Switch your role to alice-role or bob-role on the console using different web browsers to test access. Open the EMRStudioLink URL from the CloudFormation stack output to sign in to the EMR Studio with each role, then complete the following steps:

  1. Choose Workspaces in the navigation pane and choose Create Workspace.
  2. Enter a name and a description for the Workspace.
  3. Choose Create Workspace.

A new tab containing JupyterLab will open automatically when the Workspace is ready. Enable pop-ups in your browser if necessary.

  1. Choose the Compute icon in the navigation pane to attach the EMR Studio Workspace to a compute engine.
  2. Select EMR cluster on EC2 for Compute type.
  3. Choose the EMR cluster ID you created with AWS CloudFormation.
  4. For Runtime role, choose sales-runtime-role if signed in as alice-role. Choose human-resource-runtime-role if signed in as bob-role.
  5. Choose Attach.

attach EMR Studio Workspace to cluster

Run code in the EMR Studio Workspace and verify data access

Run the following code in the EMR Studio Workspace with a PySpark kernel after signing in with alice-role or bob-role:

%%sql -o result -n -1
select * from `company-shared`.products limit 5;

%%sql -o result -n -1
select * from `company-shared`.employees limit 5;

You should see different results when using different roles.

According to our data access configuration in Lake Formation, Alice has full data access to the products table. In the employees table, she can view all the columns except salary.

Alice (sales) query result

For Bob, according to our data access configuration in Lake Formation, he will have full data access to the employees table, but he has no access to the products table.

Bob (human resource) query result

Clean up

When you’re finished experimenting with this solution, clean up your resources:

  1. Stop and delete the EMR Studio Workspaces created in the data consumer AWS account.
  2. Delete all the content in the S3 bucket EMRS3Bucket in the data consumer AWS account.
  3. Delete the CloudFormation stack in the data consumer AWS account.
  4. Delete all the content in the S3 bucket DataLakeS3Bucket in the data producer AWS account.
  5. Delete the CloudFormation stack in the data producer AWS account.

Conclusion

This post showed how you can use runtime roles to connect to an EMR Studio Workspace with Amazon EMR to apply cross-account fine-grained data access control with Lake Formation. We also demonstrated how multiple EMR Studio users can connect to the same EMR cluster, each using a runtime role scoped with permissions matching their individual level of access to data.

To learn more about using EMR Studio Workspaces with Lake Formation, refer to Run an EMR Studio Workspace with a runtime role. We encourage you to try out this new functionality, and connect with us if you have any questions or feedback!


About the Authors

Ashley Zhou is a Software Development Engineer at AWS. She is interested in data analytics and distributed systems.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building analytics and data mesh solutions on AWS and sharing them with the community.

Build an entitlement service for business applications using Amazon Verified Permissions

Post Syndicated from Abdul Qadir original https://aws.amazon.com/blogs/security/build-an-entitlement-service-for-business-applications-using-amazon-verified-permissions/

Amazon Verified Permissions is designed to simplify the process of managing permissions within an application. In this blog post, we aim to help customers understand how this service can be applied to several business use cases.

Companies typically use custom entitlement logic embedded in their business applications. This is the most common approach, and it involves writing custom code to manage user access permissions. We’ll explore the common challenges faced by application developers and access administrators when handling user access permissions in an application and how Verified Permissions can help you solve these challenges. We’ll provide an integration guide for incorporating Verified Permissions into an entitlement service, specifically for use cases such as payment management. Finally, we’ll discuss the advantages of using a granular, adaptable, and externally managed access control system.

This blog post will provide a comprehensive and centralized approach to managing access policies, reducing administrative overhead, and empowering line-of-business users to define, administer, and enforce application entitlement policies.

Challenges of building an entitlement system

Entitlements refer to the rules that determine what each user can or cannot do within an application. Figure 1 shows the architecture of a common entitlement system, with components embedded in applications and entitlements stored in multiple data stores.

Figure 1: Typical entitlement system

Creating your own permissions management system can be resource-intensive, requiring time and expertise to ensure its effectiveness. Enterprises face many issues when building a custom entitlement management system, such as complexity, security risks, performance, and lack of scalability. Let’s delve into these issues in detail.

  • Data complexity – Entitlement decisions are often based on complex data relationships, such as user roles, group membership, and product permissions. Managing this complexity can be challenging, especially in a large organization with a lot of users, groups, and products.
  • Compliance and security – Building an entitlement system requires careful consideration of compliance regulations and security best practices. You need to protect user data, implement secure communication protocols, and handle potential security vulnerabilities.
  • Scalability – Permissions management systems must scale to handle a large number of users and transactions. This can be a challenge, especially if the service is used to control access to critical resources.
  • Performance and availability – Entitlement services need to be performant, because they are often used to make real-time decisions. Additionally, they need to be reliable and consistent, so that users can be confident that their entitlements are accurate.

Architecting an entitlement service using Amazon Verified Permissions

Amazon Verified Permissions is a scalable permissions management and fine-grained authorization service that helps you build and modernize applications without relying heavily on coding authorization within your applications.

Let’s discuss how you can use Verified Permissions to manage entitlements.

Creating and deploying policies

Verified Permissions uses Cedar, a policy language that allows developers to express permissions as policies that permit users or forbid them from doing certain tasks. A central policy-based authorization system gives developers a consistent way to define and manage fine-grained authorization across applications, simplifies changing permission rules without a need to change code, and improves visibility by moving permissions out of the code.

By using Verified Permissions, you can create specific permission policies that incorporate characteristics of role-based access control (RBAC) and attribute-based access control (ABAC). This approach enables you to implement granular controls while prioritizing the principle of least privilege.

Use case 1: Mary, who works as a clerk, can submit and view payments. Her role within the payment management system allows for multiple actions, and the policy for this role can be defined as follows.

permit (
    principal,
    action in [
        PaymentManager::Action::"SubmitPayment",
        PaymentManager::Action::"UpdatePayment",
        PaymentManager::Action::"ListPayment"
    ],
    resource
)
when { principal.role == "clerk" };

In contrast, Shirley is an auditor, with access that only allows her to list payments. The policy for this role is as follows.

permit (
    principal,
    action in [PaymentManager::Action::"ListPayment"],
    resource
)
when { principal.role == "auditor" };

The payment system will pass the principal, action, resource, and the entity data to Verified Permissions. If the user information is not explicitly defined within the application, the payment system must retrieve it from data stores such as an identity provider or database.

Following that, Verified Permissions evaluates relevant policies by assembling policies that affect the calling principal and the resource in question to make a decision on whether the action should be permitted or denied. Once a decision is made, it is conveyed back to the application, which can then enforce the decision.
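
To make this flow concrete, the following AWS CLI sketch shows what such an authorization request can look like. The entity type names (PaymentManager::User and PaymentManager::Payment) and the attribute shape are assumptions based on the policies above; your policy store's schema defines the actual names.

# Ask Verified Permissions whether Mary (role: clerk) may submit a specific payment
aws verifiedpermissions is-authorized \
    --policy-store-id <policy-store-id> \
    --principal entityType=PaymentManager::User,entityId=Mary \
    --action actionType=PaymentManager::Action,actionId=SubmitPayment \
    --resource entityType=PaymentManager::Payment,entityId=payment-001 \
    --entities '{
        "entityList": [
            {
                "identifier": {"entityType": "PaymentManager::User", "entityId": "Mary"},
                "attributes": {"role": {"string": "clerk"}}
            }
        ]
    }'
# The response contains a decision of ALLOW or DENY plus the determining policies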

As you can see in Figure 2, Mary has access to submit a payment because she has the role of “clerk” and the policy shown earlier permits this action.

Figure 2: Using the test bench to test if Mary can submit payment

Shirley can’t submit a payment based on her role as an “auditor” and the action is denied, as shown in Figure 3.

Figure 3: Using the test bench to test if Shirley can submit payment

However, she can list the payments, as the policy shown earlier permits this action, as shown in Figure 4.

Figure 4: Using the test bench to test if Shirley can list payments

Use case 2: Using the payment system application, CFO Jane delegates access for a high-value account, 111222333, to John, VP of Finance, during her vacation by creating a policy from a template. This gives John permission to approve payments on the account without Jane’s direct presence.

Policy template for approving payment: Figure 5 shows a sample policy template to approve payment. Policies created by using this template, like the one following, will provide the principal with the ability to approve payments for the resource.

permit (
    principal == ?principal,
    action in [PaymentManager::Action::"ApprovePayment"],
    resource == ?resource
);
Figure 5: Creating a policy template

Create the policy from the template: Figure 6 shows the policy created by using the preceding template. The parameters that you have to pass are the principal and resource information. For this use case, the principal is “John” and the resource is the account “111222333”, enabling John to approve payment for the account. (AWS recommends using a universally unique identifier (UUID) for the principal, but “John” is used in this blog post to make it more readable.)

Figure 6: Creating a policy from template
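
The same template-linked policy can be created programmatically. The following AWS CLI sketch assumes placeholder IDs and the entity type names used in this example; replace them with the values from your policy store.

# Create a policy from the approval template, binding John to account 111222333
aws verifiedpermissions create-policy \
    --policy-store-id <policy-store-id> \
    --definition '{
        "templateLinked": {
            "policyTemplateId": "<policy-template-id>",
            "principal": {"entityType": "PaymentManager::User", "entityId": "John"},
            "resource": {"entityType": "PaymentManager::Account", "entityId": "111222333"}
        }
    }'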

Evaluate the policy: As expected, John is granted access to approve payment for the account 111222333, as shown in Figure 7.

Figure 7: Using the test bench to test if John can approve payment

Building an entitlement service with Verified Permissions

Verified Permissions enables you to build an entitlement service by externalizing authorization and centralizing policy management and administration. It allows you to tailor access control to your specific application requirements while leveraging the underlying entitlement management provided by Verified Permissions.

Integrating an existing entitlement service with Verified Permissions

Let’s look at how you can integrate an existing entitlement service with Verified Permissions, as shown in Figure 8. In this diagram, the underlying implementation of the entitlement service uses the standard enterprise technology stack. Amazon DynamoDB is used to store the user and role information.

Figure 8: Integrating an entitlement service with Verified Permissions

Here’s an approach you can use to seamlessly integrate your existing entitlement service with Verified Permissions:

  1. Identify permissions: Begin by assessing your existing entitlement service to identify the permissions it currently uses, different roles, actions, and resources. Compile a detailed list of the permissions along with their respective purposes.
  2. Formulate policies: Map the permissions identified for each use case in the previous step into policies. You can use both inline policies and policy templates. In the AWS Management Console, use the Verified Permissions test bench to evaluate the policies you’ve drafted.
  3. Create policies: Depending on your business needs, create one or more policy stores within Verified Permissions. Create the policies within these policy stores. This is a one-time task and we recommend using automation to accomplish it.
  4. Update entitlement service: Use your entitlement service’s existing interface to create logic that transforms the current request payload into the format that the Verified Permissions authorization request expects. You might need to identify and incorporate missing parameters into the existing interfaces. Apply this same transformation logic to the response payload. Refer to this documentation for the Verified Permissions authorization request and response format.
  5. Integrate with Verified Permissions: Use the Verified Permissions API or AWS SDK to integrate the entitlement service with Verified Permissions. This involves tasks such as fetching the user role from Amazon DynamoDB, making authorization requests to Verified Permissions, and processing the resulting responses.
  6. Testing: Thoroughly test your service after making the permission changes. Verify that all functionalities are working as expected and that the policies in Verified Permissions are being utilized correctly.
  7. Deployment: After your service passes the review process, roll out the updated entitlement service along with the integrated Verified Permissions functionality.
  8. Monitor and maintain: Following deployment, continuously monitor the performance and gather feedback. Be prepared to make further adjustments if necessary.
  9. Documentation and support: Provide comprehensive documentation for developers who will use your entitlement service. Clearly explain the available endpoints, the request and response formats, and the authorization requirements.

You can use a similar approach to integrate your existing entitlement service with other third-party permission management systems.

Building a new entitlement service in AWS using Amazon Verified Permissions

The reference architecture in Figure 9 shows how to build a new entitlement service using Verified Permissions. AWS customers already use Amazon Cognito for simple, fast authentication. With Amazon Verified Permissions, customers can also add simple, fast authorization to their applications by adding user profile attributes to the identity token generated by Amazon Cognito.

Figure 9: Entitlement service using Verified Permissions

The workflow in the diagram is as follows:

  1. The user signs in to the application by using Amazon Cognito.
  2. If the authentication is successful, the pre-token generation Lambda function will be invoked.
  3. You can use the pre-token generation Lambda function to customize an identity token before Amazon Cognito generates it. In this case, the trigger is used to add the user profile attributes as new claims in the identity token.
    1. The user profile attributes are retrieved from Amazon DynamoDB.
    2. The attributes are then added as new claims in the identity token.
  4. After the user is signed in, they request access to the protected resource in the application through Amazon API Gateway.
  5. Amazon API Gateway initiates an authorization check using a Lambda authorizer. A Lambda authorizer is a feature of API Gateway that allows you to implement a custom authorization scheme using the identity token generated by Amazon Cognito.
  6. The Lambda authorizer validates, decodes, and retrieves the user profile attributes from the identity token.
  7. The Lambda authorizer calls the Verified Permission authorization API and passes the principal, action, resource, and user profile attributes as entities.
  8. Based on the decision returned by Verified Permissions, the user is permitted or denied access to the resource.
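
As a sketch of the request made in steps 6 and 7, the following CLI call shows the equivalent authorization check. If the policy store is configured with the Amazon Cognito user pool as an identity source, the authorizer can instead pass the identity token directly by using the IsAuthorizedWithToken API, as shown here; the action and resource names are placeholders.

# Authorize a request using the Cognito identity token
aws verifiedpermissions is-authorized-with-token \
    --policy-store-id <policy-store-id> \
    --identity-token <cognito-identity-token> \
    --action actionType=MyApplication::Action,actionId=ViewResource \
    --resource entityType=MyApplication::Resource,entityId=protected-resource-1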

Common pitfalls of using an entitlement service

Entitlement services can be tricky, but there are a few common mistakes you can avoid to make them more secure and simpler to use:

  • Entitlement service misconfigurations can create security vulnerabilities and lead to data breaches. It is important to carefully configure the entitlement service and to regularly review policies to verify that they are correct and up-to-date.
  • When you first start using an entitlement service, it’s easy to give users too many permissions. This can make your application less secure and harder to manage. It’s important to give users only the permissions they need to do their jobs.
  • Users need to be trained on how to use the entitlement service correctly, especially when it comes to requesting and managing permissions. If users don’t know how to do these tasks appropriately, they could make mistakes that could leave your system vulnerable.

Conclusion

Amazon Verified Permissions is a comprehensive solution for businesses looking to manage granular access control, flexible authorization, and externalized access control. With this service, organizations can quickly and conveniently apply new policies across their environment, streamlining user management processes and helping to improve overall security. This post has highlighted the many benefits of using Verified Permissions for entitlement management within an application. We hope it has been helpful in understanding how you can apply this service to your business use cases.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Abdul Qadir

Abdul is a Solutions Architect based in New York. He designs and architects solutions for independent software vendor (ISV) customers and helps customers in their cloud journeys. He’s been working in the financial and insurance industries and helping companies with digital transformation and modernizing their legacy systems.

Arun Sivaraman

Arun is a Boston-based Solutions Architect. He enjoys working with customers to create innovative solutions and supporting their digital transformation.

How to create an AMI hardening pipeline and automate updates to your ECS instance fleet

Post Syndicated from Nima Fotouhi original https://aws.amazon.com/blogs/security/how-to-create-an-ami-hardening-pipeline-and-automate-updates-to-your-ecs-instance-fleet/

Amazon Elastic Container Service (Amazon ECS) is a comprehensive managed container orchestrator that simplifies the deployment, maintenance, and scalability of container-based applications. With Amazon ECS, you can deploy your containerized application as a standalone task, or run a task as part of a service in your cluster. The Amazon ECS infrastructure for tasks includes Amazon Elastic Compute Cloud (Amazon EC2) instances in the AWS Cloud, serverless (AWS Fargate) in the AWS Cloud, or on-premises virtual machines (VMs) or servers. You can enable auto-scaling for Amazon ECS capacity providers when using EC2 instances, allowing your infrastructure to dynamically adjust based on workload demands. You define the infrastructure type or the capacity providers where you deploy your tasks or services.

You can choose EC2 instances as the computing resources for your ECS cluster, which allows you to control your cluster’s underlying infrastructure, including the size of EC2 instances, the instance operating system, and extra security controls required by a compliance framework. AWS recommends that you use Amazon ECS-optimized Amazon Machine Images (AMIs), which are set up with the requirements and recommendations to efficiently run your container workloads on Amazon Linux instances. We recommend that you refresh your container instances fleet with the latest ECS-optimized AMIs to include the latest bug fixes and feature updates. However, managing and updating your container instance fleet might become complex as your Amazon ECS workload grows.

In this blog post, I will show you how to create a workflow to enhance Amazon ECS-optimized AMIs by using the CIS Docker Benchmark and automatically updating your EC2 instances in your ECS cluster with the newly created AMIs.

Overview of CIS Docker Benchmark

The CIS Docker Benchmark provides prescriptive guidance for establishing a secure configuration posture for a Docker container engine, container host, container images and build files. The CIS Docker Benchmark has seven sections about Docker and container security:

  1. Host configuration
  2. Docker daemon configuration
  3. Docker daemon configuration files
  4. Container images and build file configuration
  5. Container runtime configuration
  6. Docker security operations
  7. Docker swarm configuration

The solution described in this post covers sections 1, 2, and 3 of the CIS Docker Benchmark, including security recommendations to prepare the host machine used for Amazon ECS workloads, securing the behavior of the Docker daemon (server), and securing Docker-related files and directory permissions and ownerships. However, the solution doesn’t implement all of the controls listed in these three sections. For a complete list of controls implemented, see the solution’s repository.

Solution overview

EC2 Image Builder is a fully managed AWS service, designed to simplify the process of creating, handling, and implementing server images that are custom, secure, and consistently updated. For this solution, you will deploy an EC2 Image Builder pipeline to apply the CIS Docker Benchmarks to an Amazon ECS-optimized AMI and use the created AMI to refresh the Amazon ECS instance fleet. This solution is customizable, so you can select the security controls to harden your base AMI. You can also specify cluster tags during CloudFormation template deployment; these tags will filter the ECS clusters that you have included in the Amazon EC2 instance refresh process. I’ve provided an AWS CloudFormation template that you can use to provision the necessary resources.

Figure 1: Amazon ECS instance refresh workflow

As shown in Figure 1, the solution involves the following steps:

  1. EC2 Image Builder
    1. The AMI image pipeline downloads the ansible playbook from the S3 bucket, and runs it against the base image.
    2. The pipeline publishes the hardened AMI.
    3. The pipeline validates the benchmarks applied to the base image and publishes the results to a test results S3 bucket. It also invokes Amazon Inspector to run a vulnerability scan on the published image.
  2. State machine initiation
    1. When the AMI is successfully published, the pipeline publishes a message to the AMI status SNS topic. The SNS topic invokes the State machine initiation Lambda function.
    2. The State machine initiation Lambda function extracts the image ID of the published AMI and uses it as the input to initiate the state machine.
  3. State machine
    1. The first state gathers information related to Amazon ECS clusters, including the capacity providers for the EC2 auto scaling group. It creates a new launch template version with the hardened AMI image ID for the EC2 auto scaling group.
    2. The second state uses the new launch template to initiate an instance refresh for the EC2 auto scaling group.
  4. Instance refresh status update
    1. The instance refresh rule selects the auto scaling group instance refresh events (failure, success, and cancellation events) and sends them to the Instance refresh status SNS topic.
    2. The Instance refresh status SNS topic sends an email on the instance refresh status to subscribers.
  5. Image update reminder
    1. A weekly scheduled rule invokes the Image update reminder Lambda function.
    2. The Image update reminder Lambda function retrieves the value for LatestECSOptimizedAMI from the CloudFormation template, and extracts the last modified date of the Amazon ECS-optimized AMI used as the base image in the EC2 Image Builder pipeline. It compares the last modified date of the AMI with the creation date of the latest AMI published by the pipeline. If a new base image is available, it publishes a message to the image update reminder SNS topic.
    3. The Image update reminder SNS topic sends a message to subscribers notifying them of a new base image. You need to create a new version of your image recipe to update it with the new AMI.

Prerequisites

To follow along with this walkthrough, make sure that you have the following prerequisites in place:

Walkthrough

To deploy the solution, complete the following steps.

Step 1: Download or clone the repository

The first step is to download or clone the solution’s repository.

To download the repository

  1. Go to the main page of the repository on GitHub.
  2. Choose Code, and then choose Download ZIP.

To clone the repository

  1. Make sure that you have Git installed.
  2. Run the following command in your terminal:

git clone https://github.com/aws-samples/ecs-image-hardening-and-instance-refresh.git

Step 2: Create an S3 bucket

Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. An S3 bucket is a container for objects stored on Amazon S3. For this walkthrough, you need to create an S3 bucket and copy the content of the ansible folder to your newly created bucket. Make a note of your S3 bucket name because you will need it in the next step.
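
A quick way to do this from the AWS CLI is shown below. The bucket name is a placeholder and must be globally unique, and the path assumes you cloned the repository into your current directory.

# Create the bucket and copy the ansible folder from the cloned repository into it
aws s3 mb s3://<your-ansible-bucket-name> --region <region>
aws s3 cp ./ecs-image-hardening-and-instance-refresh/ansible/ \
    s3://<your-ansible-bucket-name>/ --recursive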

Step 3: Create the CloudFormation stack

In this step, you deploy the solution’s resources by creating a CloudFormation stack using the provided CloudFormation template. Sign in to your account and choose an AWS Region where you want to create the stack. Make sure that the Region you choose supports the services used by this solution. To create the stack, follow the steps in Creating a stack on the AWS CloudFormation console. Note that you need to provide values for the parameters defined in the template to deploy the stack. The following list describes the parameters that you need to provide.

  • AnsiblePlaybookArguments – ansible-playbook command arguments
  • AnsiblePlaybookBucket – Name of the S3 bucket containing the ansible playbook
  • CloudFormationUpdaterEventBridgeRuleState – State of the Amazon EventBridge rule that invokes the Lambda function that checks for a new version of the EC2 Image Builder parent image
  • ClusterTags – Tags in JSON format to filter the ECS clusters that you want to update
  • ComponentName – Name of the EC2 Image Builder component
  • DistributionConfigurationName – Name of the EC2 Image Builder distribution configuration
  • EnableImageScanning – Whether to enable Amazon Inspector image scanning
  • ImagePipelineName – Name of the EC2 Image Builder pipeline
  • InfrastructureConfigurationName – Name of the EC2 Image Builder infrastructure configuration
  • InstanceType – EC2 instance type for the EC2 Image Builder infrastructure configuration
  • LatestECSOptimizedAMI – ECS-optimized AMI parameter name; for more information, see Retrieving Amazon ECS-optimized AMI metadata
  • libDockerVolumeSize – Container partition size in gigabytes (GB)
  • libDockerVolumeType – Container partition volume type
  • RecipeName – Name of the EC2 Image Builder recipe
  • RootVolumeSize – AMI root partition volume size in GB
  • RootVolumeType – AMI root partition volume type
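
If you prefer to deploy the stack from the AWS CLI instead of the console, a sketch looks like the following. The stack name, template file name, and parameter values are placeholders; the template creates IAM resources, so the IAM capability is acknowledged.

# Deploy the solution's CloudFormation stack (only a few parameters shown for brevity)
aws cloudformation create-stack \
    --stack-name ecs-ami-hardening \
    --template-body file://<path-to-template>.yaml \
    --parameters \
        ParameterKey=AnsiblePlaybookBucket,ParameterValue=<your-ansible-bucket-name> \
        ParameterKey=ImagePipelineName,ParameterValue=ECSAnsiblePipeline \
    --capabilities CAPABILITY_NAMED_IAM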

Step 4: Set up Amazon SNS topic subscribers

Amazon Simple Notification Service (Amazon SNS) is a web service that coordinates and manages the delivery or sending of messages to subscribing endpoints or clients. An Amazon SNS topic is a logical access point that acts as a communication channel.

The solution in this post creates three Amazon SNS topics to keep you informed of each step of the process. The following is a list of the topics that the solution creates and their purpose.

  • AMI status topic – a message is published to this topic upon successful creation of an AMI.
  • Image update reminder topic – a message is published to this topic if a newer version of the base Amazon ECS-optimized AMI is published by AWS.
  • Instance refresh status topic – a message is published to this topic each time that an ECS cluster capacity provider gets an instance fleet refresh.

You need to manually modify the subscriptions for each topic to receive messages published to that topic.

To modify the subscriptions for the topics created by the CloudFormation template

  1. Sign in to the Amazon SNS console.
  2. In the left navigation pane, choose Subscriptions.
  3. On the Subscriptions page, choose Create subscription.
  4. On the Create subscription page, in the Details section, do the following:
    • For Topic ARN, choose the Amazon Resource Name (ARN) of one of the topics that the CloudFormation template created.
    • For Protocol, choose Email.
    • For Endpoint, enter the endpoint value. In our example, this is an email address, such as the email address of a distribution list.
    • Choose Create subscription.
  5. Repeat the preceding steps for the other two topics.
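
You can also create the subscriptions from the AWS CLI; the topic ARN and email address below are placeholders.

# Subscribe an email address to one of the three topics created by the stack;
# repeat for the other two topic ARNs
aws sns subscribe \
    --topic-arn <topic-arn> \
    --protocol email \
    --notification-endpoint security-team@example.com
# Each subscriber must confirm the subscription from the confirmation email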

Step 5: Run the pipeline

The EC2 Image Builder pipeline that the solution creates consists of an image recipe with one component, an infrastructure configuration, and a distribution configuration. I’ve set up the image recipe to create an AMI, select a base image, choose components, and define block device mapping. There’s only one component where building and testing steps are defined. For the building step, the solution creates a separate partition for /var/lib/docker and mounts it to a dedicated device specified in the image recipe. It then applies the CIS Docker Benchmark ansible playbook and cleans up the unnecessary files and folders. In the test step, the solution runs Amazon Inspector, a continuous assessment service that scans your AWS workloads for software vulnerabilities and unintended network exposure, and Docker Bench for Security. Optionally, you can create your own components and associate them with the image recipe to make further modifications to the base image.

You will need to manually run the pipeline by using either the AWS Management Console or the AWS CLI.

To run the pipeline (console)

  1. Open the EC2 Image Builder console.
  2. From the pipeline details page, choose the name of your pipeline.
  3. From the Actions menu at the top of the page, select Run pipeline.

To run the pipeline (AWS CLI)

  1. Make sure that you have properly configured your AWS CLI.
  2. Run the following command. Replace <pipeline region> with your own information.

aws imagebuilder list-image-pipelines --region <pipeline region>

  3. From the list of pipelines, find the pipeline named ECSAnsiblePipeline and note the pipeline ARN, which you will use in the next step.
  4. Run the pipeline. Make sure to replace <pipeline arn> and <region> with your own information.

aws imagebuilder start-image-pipeline-execution --image-pipeline-arn <pipeline arn> --region <region>

The following is a process overview of the image hardening and instance refresh:

  1. Image hardening – when you start the pipeline, EC2 Image Builder creates the required infrastructure to build your AMI, applies the ansible playbook (CIS Docker Benchmark) to the base AMI, and publishes the hardened AMI. A message is published to the AMI status topic as well.
  2. Image testing – after publishing the AMI, EC2 Image Builder scans the newly created AMI with Amazon Inspector and reports the findings back. It also runs Docker Bench for Security to verify the changes that the ansible playbook made to the base AMI and publishes the results to an S3 bucket.
  3. State machine initiation – after a new AMI is successfully published, the AMI status topic invokes the State machine initiation Lambda function. The Lambda function invokes the instance refresh state machine and passes on the AMI info.
  4. Instance refresh – the instance refresh state machine has two steps:
    1. Gather cluster information – a Lambda function gathers information regarding EC2 capacity providers and their associated auto scaling groups. For each auto scaling group, it creates a new launch template and includes the hardened AMI information. When you create the CloudFormation stack, if you pass a tag or a list of tags, only clusters with matching tags are processed in this step.
    2. Auto scaling group instance refresh – the state machine uses the output of the first Lambda function (first state) and starts instance refresh for auto scaling groups in parallel (second state). An EventBridge rule publishes a message to the Instance refresh status topic upon successful refresh of each auto scaling group.

This solution also creates an EventBridge rule that is invoked weekly. This rule invokes the Image update reminder Lambda function, and notifies you if a new version of your base AMI has been published by AWS so that you can run the pipeline and update your hardened AMI.

Conclusion

In this blog post, you learned how to create a workflow to harden Amazon ECS-optimized AMIs by using the CIS Docker Benchmark and to automate the refresh of EC2 instances in your ECS clusters. This automated workflow has several advantages. First, it helps ensure a consistent and standardized process for image hardening, reducing potential human errors and inconsistencies. By automating the entire process, you can apply security and compliance standards across your instances. Second, the tight integration with AWS Step Functions enables smooth, orchestrated updates to the ECS cluster instances, enhancing the reliability and predictability of deployments. This automation also reduces manual intervention, helping you achieve time savings so that your teams can focus on more value-driven tasks. Moreover, this systematic approach helps to enhance the security posture of your Amazon ECS workloads because you can address vulnerabilities rapidly and systematically, helping to keep the environment resilient against potential threats.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Nima Fotouhi

Nima is a Security Consultant at AWS. He’s a builder with a passion for infrastructure as code (IaC) and policy as code (PaC) and helps customers build secure infrastructure on AWS. In his spare time, he loves to hit the slopes and go snowboarding.

How to use chaos engineering in incident response

Post Syndicated from Kevin Low original https://aws.amazon.com/blogs/security/how-to-use-chaos-engineering-in-incident-response/

Simulations, tests, and game days are critical parts of preparing and verifying incident response processes. Customers often face challenges getting started and building their incident response function as the applications they build become increasingly complex. In this post, we will introduce the concept of chaos engineering and how you can use it to accelerate your incident response preparation and testing processes.

Why chaos engineering?

Chaos engineering is a formalized approach that uses fault injection experiments to create real-world conditions needed to understand how your system will react to unknowns and build confidence in the system’s resiliency and security.

Modern applications can have multiple components, including web, API, application, and data persistence layers. To respond to potential security events, you must understand the failure scenarios across each component and their downstream impacts. One challenge is that creating incident response processes and playbooks for components in a silo doesn’t consider known unknowns—how these components interact with each other—and can’t reveal unknown unknowns such as second-order effects during a security event.

As an example, consider the personalization microservice shown in Figure 1.

The microservice relies on two Amazon Elastic Compute Cloud (Amazon EC2) instances that are deployed in an auto scaling group across two Availability Zones. An upstream data collection microservice sends data for the personalization microservice to process. In addition, a downstream website microservice takes the personalized data and displays it to customers.

Figure 1: Architecture of the personalization microservice

Now imagine that unexpected activity occurred on an EC2 instance. The instance started to query a domain name that’s associated with cryptocurrency-related activity. A first set of unknowns already emerges:

  • Can your detective controls detect the activity on the instance?
  • How long do they take to do so?
  • How long does it take your security team to be notified?
  • Does the security team know what to do?
  • Does the notification have all the information that the team needs to respond?
  • Is there an existing automated response to other stakeholders?

Security professionals may not consider all of these questions when building and designing their threat detection and incident response capabilities.

In our example, Amazon GuardDuty is able to detect the unexpected activity and generates the CryptoCurrency:EC2/BitcoinTool.B!DNS finding within 15 minutes. The security team takes a snapshot for further forensics before the instance is isolated, as shown in Figure 2.

Figure 2: Architecture after GuardDuty detects unexpected activity and the security team isolates the EC2 instance

Although this might seem like an adequate response in isolation, it leads to more questions.

From a security perspective:

  • What other logs do we need for further investigation?
  • Do we know if the credentials need to be rotated and what impact that will have on the workload?
  • Should other parts of the system be replaced or restarted?

From an operational perspective:

  • Do any of the (manual or automated) incident response processes impact the performance of the workload?
  • Can the remaining instance handle the traffic before the auto scaling group creates another instance?
  • If there is increased latency or failure of the microservice, how will the data collection and website microservices react to it?

Creating detection and incident response plans in isolation doesn’t consider the second order effects that could have an impact on the integrity and availability of the system.

How can chaos engineering help?

Chaos engineering is a formalized process that can help solve this problem. It creates failure in a controlled environment with well-defined experiments to generate data on system behavior during a simulated event. You can use this data to improve incident response processes and make proactive changes that improve the security of your workloads. By using chaos engineering, developer and security teams can reveal additional unknowns and understand areas of opportunity to improve incident response processes and workload availability.

Chaos engineering has five phases—steady state, hypothesis, run experiment, verify, and improve—which we’ll discuss in more detail next.

Steady state 

The first phase involves an understanding of the behavior and configuration of the system under normal conditions. Instead of focusing on the internal attributes of the system, you should focus on an output metric or indicator that ties operational metrics and customer experience together. Including these output metrics in your hypothesis helps you collect data on security events and understand how these events and your response to them impact business outcomes.

Returning to our earlier example, this could be the latency when a user attempts to retrieve personalized information. This output is critical to the customer experience and relies on multiple operational metrics.

In addition, two key metrics in incident response are time to detect (TTD) and time to remediate (TTR). These metrics help capture how effectively your team has responded to the security event.

By defining your steady state, you can detect deviations from that state and determine if your system has fully returned to the known good state. You should identify the relevant metrics to measure your system and make these metrics simple for engineers to consume.

Using AWS, you can collect logs from the different services that you use in a workload, such as Amazon VPC Flow Logs, Amazon CloudWatch log groups, and AWS CloudTrail. For more details about the different log sources, see Logging strategies for security incident response.

Hypothesis

After you understand the steady state behavior, you can write a hypothesis about it. Security hypotheses can take the following form:

When _________ happens, ________ system will notify the team within _______ and the application’s metric _________ will remain at ________.

It can be challenging to decide what should happen. Chaos engineering recommends that you choose real-world events that are likely to occur and that will impact the user experience. Get your team to brainstorm. For security issues, this is an ideal time to use your threat model as the starting point for discussions. Starting with one of your identified threats and then running experiments based on that threat can help you test both your processes and automation.

After you’ve chosen your component, decide which variable to influence or what could happen in your complex system. For example, a misconfigured Amazon Simple Storage Service (Amazon S3) bucket or an open database port could lead to unintended exposure of customer data. A software flaw in your application could lead to the misuse of resources by an unauthorized user.

Here are a few examples of hypotheses:

  • If port 22 permits unrestricted access on a security group, AWS Config will detect it, run an automation to remove the security group rule, and notify the security team through Slack within 5 minutes, and the application’s latency will remain at 0.005 seconds (a sketch of injecting this misconfiguration appears after this list).
  • If malware is run on an EC2 instance, Amazon GuardDuty will detect it within 15 minutes and notify the security team. Remediation playbooks will not affect the application’s error rate of 1 error for every thousand requests.
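
To turn the first hypothesis into an experiment, you need a controlled way to inject the misconfiguration. The following sketch, using the AWS SDK for JavaScript v3, opens port 22 to the internet on a dedicated test security group; the group ID is a hypothetical placeholder, and you should only run this against a resource created specifically for the experiment.

import { EC2Client, AuthorizeSecurityGroupIngressCommand } from "@aws-sdk/client-ec2";

const ec2 = new EC2Client({});

// Inject the fault: open port 22 to the world on a test security group
await ec2.send(new AuthorizeSecurityGroupIngressCommand({
  GroupId: "sg-0123456789abcdef0", // hypothetical test security group
  IpPermissions: [{
    IpProtocol: "tcp",
    FromPort: 22,
    ToPort: 22,
    IpRanges: [{ CidrIp: "0.0.0.0/0", Description: "chaos experiment injection" }],
  }],
}));

// Record the injection time so you can later calculate time to detect (TTD) and time to remediate (TTR)
console.log(`Fault injected at ${new Date().toISOString()}`);

You can then measure how long AWS Config and your remediation automation take to detect and remove the rule, which gives you TTD and TTR data points for the verify phase.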

Design and run the experiment 

The next phase is to run the experiment. You don’t need to run experiments in production right away. A great place to get started with chaos engineering is the staging environment. One benefit of the AWS Cloud is that you can configure your staging environment to be identical to production. This increases the value of using an approach like chaos engineering before you get to production. By running experiments in staging, you can see how your system will likely react in production while earning trust within your organization.

As you gain confidence, you can begin running experiments in production. Because you configured staging to be identical to production, the risk of this transition is mitigated.

You can use AWS Fault Injection Simulator (FIS), our fully managed service for running fault injection experiments. FIS supports multiple fault injection actions, such as injecting API errors, restarting instances, running scripts on instances, disrupting network connectivity, and more. For the full list, see the FIS actions reference.

Although FIS doesn’t support security-related actions out of the box, you can use FIS to run AWS Systems Manager Automation documents that can run AWS APIs and scripts to simulate security events. To learn how to set up FIS to run a Systems Manager document that turns off bucket-level block public access for a randomly-selected S3 bucket, see the workshop Chaos Kitty – Gamifying Incident Response with Chaos Engineering. To learn how to set up FIS to run experiments that simulate events such as an RDP brute force event, lateral movement, cryptocurrency mining, and DNS data exfiltration, see the workshop Validating security guardrails with Chaos Engineering.

During this phase, you must understand the scope of impact of your experiment and work to minimize it. If an Amazon CloudWatch alarm goes into an alarm state, FIS can automatically stop the experiment. You should have a plan to return the environment to the steady state if the experiment has an unintended impact.
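
The following is a minimal sketch, using the AWS SDK for JavaScript v3, of creating an FIS experiment template that runs a Systems Manager Automation document and stops automatically if a CloudWatch alarm goes into an alarm state. The role ARN, alarm ARN, and Automation document are hypothetical placeholders; adapt them to the simulation you’re running, for example one of the workshops mentioned above.

import { FISClient, CreateExperimentTemplateCommand, StartExperimentCommand } from "@aws-sdk/client-fis";

const fis = new FISClient({});

// Create an experiment template that runs an SSM Automation document to simulate a security event
const { experimentTemplate } = await fis.send(new CreateExperimentTemplateCommand({
  clientToken: "security-chaos-template-1",
  description: "Simulate a security event via an SSM Automation document",
  roleArn: "arn:aws:iam::111122223333:role/FisExperimentRole", // hypothetical role with FIS and SSM permissions
  stopConditions: [{
    source: "aws:cloudwatch:alarm", // stop the experiment if the steady-state alarm fires
    value: "arn:aws:cloudwatch:us-east-1:111122223333:alarm:personalization-latency-steady-state",
  }],
  targets: {},
  actions: {
    simulateSecurityEvent: {
      actionId: "aws:ssm:start-automation-execution",
      parameters: {
        documentArn: "arn:aws:ssm:us-east-1:111122223333:document/SimulateSecurityEvent", // hypothetical document
        maxDuration: "PT10M",
      },
    },
  },
}));

// Start an experiment from the template
await fis.send(new StartExperimentCommand({
  clientToken: "security-chaos-run-1",
  experimentTemplateId: experimentTemplate.id,
}));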

As you run your experiment, remember to document the key metrics and human responses, such as whether incident responders were confident, knew where to find the correct resources, or were aware of the escalation points.

Learn and verify 

The next step is to analyze and document the data to understand what happened. Lessons learned during the experiment are critical and should promote a culture of support instead of blame.

Here are some questions that you should address:

  1. What happened?
  2. What was the impact on our customers?
  3. What did we learn? Did we have enough information in the notification to investigate?
  4. What could have reduced our time to detect or time to remediate by 50 percent?
  5. Can we apply this to other similar systems?
  6. How can we improve our incident response processes?
  7. What human steps in the process can we automate?

Here are a few examples of things you might learn from your chaos engineering experiments:

  • After port 22 was opened, AWS Config detected the misconfigured security group within 2 minutes. However, the notification system was misconfigured, and the security team wasn’t notified. During the five minutes that port 22 was opened, the EC2 instance received 22 attempts to connect to it from unknown IP addresses.
  • After a cryptocurrency mining script was run on an EC2 instance, GuardDuty detected the activity, generated a finding within 10 minutes, and notified the security team.
  • The security team’s remediation actions—terminating the instance—led to increased application latency beyond the SLA of 0.05 seconds.

Improve and fix it

Use your learnings to improve the workload. It’s vital that you get leadership alignment and support to prioritize the remediation of findings from chaos experiments or other testing scenarios. Examples include improving incident response playbooks, creating new forms of automation, or creating preventative controls to prevent the event from happening. For more guidance on playbooks, see these sample incident response playbooks and the workshop Building an AWS incident response runbook using Jupyter notebooks and CloudTrail Lake.

Preparation is important in incident response. As you improve your processes, run more experiments to collect additional data and continue to iteratively improve. Automate chaos experiments on new environments or applications with minimal user traffic before directing the majority of traffic to them. As you use chaos engineering approaches to prepare incident response processes, your detection and incident response capabilities should improve.

Conclusion

It’s vital to be prepared when a security event happens. In this blog post, you learned about the five phases of chaos engineering—steady state, hypothesis, design and run the experiment, learn and verify, and improve and fix—and how you can use them to accelerate your incident response preparation and testing processes. For more information on chaos engineering, see the following resources. Choose a workload and run an experiment on it to verify and improve your incident response processes today.

Additional resources

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Kevin Low

Kevin Low

Kevin is a Security Solutions Architect who helps customers of all sizes across ASEAN build securely. He is passionate about integrating resilience and security and has a keen interest in chaos engineering. Outside of work, he loves spending time with his wife and dog, a poodle called Noodle.

Approaches for migrating users to Amazon Cognito user pools

Post Syndicated from Edward Sun original https://aws.amazon.com/blogs/security/approaches-for-migrating-users-to-amazon-cognito-user-pools/

Update: An earlier version of this post was published on September 14, 2017, on the Front-End Web and Mobile Blog.


Amazon Cognito user pools offer a fully managed OpenID Connect (OIDC) identity provider so you can quickly add authentication and control access to your mobile app or web application. User pools scale to millions of users and add layers of additional features for security, identity federation, app integration, and customization of the user experience. Amazon Cognito is available in regions around the globe, processing over 100 billion authentications each month. User pools in Cognito also provide security features such as email and phone number verification, multi-factor authentication, and advanced security features such as compromised credentials detection and adaptive authentication.

Many customers ask about the best way to migrate their existing users to Amazon Cognito user pools. In this blog post, we describe several different recommended approaches and provide step-by-step instructions on how to implement them.

Key considerations

The main consideration when migrating users across identity providers is maintaining a consistent end-user experience. Ideally, users can continue to use their existing passwords so that their experience is seamless. However, security best practices dictate that passwords should never be stored directly as cleartext in a user store. Instead, passwords are used to compute cryptographic hashes and verifiers that can later be used to verify submitted passwords. This means that you cannot securely export passwords in cleartext form from an existing user store and import them into a Cognito user pool. You might ask your users to choose a new password during the migration. Or, if you want to retain the existing passwords, you need to retain access to the existing hashes and verifiers, at least during the migration period.

A secondary consideration is the migration timeline. For example, do you need a faster migration timeline because your current identity store’s license is expiring? Or do you prefer a slow and steady migration because you are modernizing your current application, and it takes time to connect your existing systems to the new identity provider?

The following two methods define our recommended approaches for migrating existing users into a user pool:

  • Bulk user import – Export your existing users into a comma-separated (.csv) file, and then upload this .csv file to import users into a user pool. Your desired user attributes (except passwords) can be included and mapped to attributes in the target user pool. This approach requires users to reset their passwords when they sign in with Cognito. You can choose to migrate your existing user store entirely in a single import job or split users into multiple jobs for parallel or incremental processing.
  • Just-in-time user migration – Migrate users just in time into a Cognito user pool as they sign in to your mobile or web app. This approach allows users to retain their current passwords, because the migration process captures and verifies the password during the sign-in process, seamlessly migrating them to the Cognito user pool.

In the following sections, we describe the bulk user import and just-in-time user migration methods in more detail and then walk through the steps of each approach.

Bulk user import

You perform bulk import of users into an Amazon Cognito user pool by uploading a .csv file that contains user profile data, including usernames, email addresses, phone numbers, and other attributes. You can download a template .csv file for your user pool from Cognito, with a user schema structured in the template header.

Following is an example of performing bulk user import.

To create an import job

  1. Open the Cognito user pool console and select the target user pool for migration.
  2. On the Users tab, navigate to the Import users section, and choose Create import job.
  3. Figure 1: Create import job

    Figure 1: Create import job

  4. In the Create import job dialog box, download the template.csv file for user import.
  5. Export your existing user data from your existing user directory or user store into the .csv file.
  6. Match the user attribute types with column headings in the template. Each user must have an email address or a phone number that is marked as verified in the .csv file, in order to receive the password reset confirmation code.
  7. Figure 2: Configure import job

    Figure 2: Configure import job

  8. Go back to the Create import job dialog box (as shown in Figure 2) and do the following:
    1. Enter a Job name.
    2. Choose to Create a new IAM role or Use an existing IAM role. This role grants Amazon Cognito permission to write to Amazon CloudWatch Logs in your account, so that Cognito can provide logs for successful imports and errors for skipped or failed transactions.
    3. Upload the .csv file that you have prepared, and choose Create and start job.

Depending on the size of the .csv file, the job can run for minutes or hours, and you can follow the status from that same page in the Amazon Cognito console.

Figure 3: Check import job status

Figure 3: Check import job status

Cognito runs through the import job and imports users with a RESET_REQUIRED state. When users attempt to sign in, Cognito will return PasswordResetRequiredException from the sign-in API, and the app should direct the user into the ForgotPassword flow.

Figure 4: View imported user

Figure 4: View imported user
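
As a hedged sketch of how the app side can handle this, the following AWS SDK for JavaScript v3 code signs a user in with the USER_PASSWORD_AUTH flow and, if the imported user is still in the RESET_REQUIRED state, starts the forgot password flow for them; the client ID is a placeholder supplied by the caller.

import {
  CognitoIdentityProviderClient,
  InitiateAuthCommand,
  ForgotPasswordCommand,
} from "@aws-sdk/client-cognito-identity-provider";

const cognito = new CognitoIdentityProviderClient({});

async function signIn(clientId, username, password) {
  try {
    return await cognito.send(new InitiateAuthCommand({
      ClientId: clientId, // hypothetical app client ID passed in by the caller
      AuthFlow: "USER_PASSWORD_AUTH",
      AuthParameters: { USERNAME: username, PASSWORD: password },
    }));
  } catch (err) {
    if (err.name === "PasswordResetRequiredException") {
      // Imported users are in RESET_REQUIRED state; start the forgot password flow,
      // then prompt the user for the confirmation code and a new password.
      await cognito.send(new ForgotPasswordCommand({ ClientId: clientId, Username: username }));
    }
    throw err;
  }
}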

The bulk import approach can also be used continuously to incrementally import users. You can set up an Extract-Transform-Load (ETL) batch job process to extract incremental changes to your existing user directories, such as the new sign-ups on the existing systems before you switch over to a Cognito user pool. Your batch job will transform the changes into a .csv file to map user attribute schemas, and load the .csv file as a Cognito import job through the CreateUserImportJob CLI or SDK operation. Then start the import job through the StartUserImportJob CLI or SDK operation. For more information, see Importing users into user pools in the Amazon Cognito Developer Guide.
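
As a rough sketch of what the final step of such a batch job could look like with the AWS SDK for JavaScript v3, the following code creates an import job, indicates where to upload the .csv file, and then starts the job; the job name, user pool ID, and role ARN are placeholders.

import {
  CognitoIdentityProviderClient,
  CreateUserImportJobCommand,
  StartUserImportJobCommand,
} from "@aws-sdk/client-cognito-identity-provider";

const cognito = new CognitoIdentityProviderClient({});

// Create the import job; the response includes a pre-signed URL for uploading the .csv file
const { UserImportJob } = await cognito.send(new CreateUserImportJobCommand({
  JobName: "incremental-import-batch-1",                                     // hypothetical job name
  UserPoolId: "us-east-1_EXAMPLE",
  CloudWatchLogsRoleArn: "arn:aws:iam::111122223333:role/CognitoImportRole", // hypothetical logging role
}));

// Upload the prepared .csv file to UserImportJob.PreSignedUrl before starting the job, for example:
//   curl -v -T users.csv -H "x-amz-server-side-encryption: aws:kms" "<PreSignedUrl>"

// Start the import job once the file has been uploaded
await cognito.send(new StartUserImportJobCommand({
  UserPoolId: "us-east-1_EXAMPLE",
  JobId: UserImportJob.JobId,
}));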

Just-in-time user migration

The just-in-time (JIT) user migration method involves first attempting to sign in the user through the Amazon Cognito user pool. Then, if the user doesn’t exist in the Cognito user pool, Cognito calls your Migrate User Lambda trigger and sends the username and password to the Lambda trigger to sign the user in through the existing user store. If successful, the Migrate User Lambda trigger will also fetch user attributes and return them to Cognito. Then Cognito silently creates the user in the user pool with those attributes, along with salts and password verifiers derived from the user-provided password. With the Migrate User Lambda trigger, your client app can start using the Cognito user pool to sign in users who have already been migrated, and continue migrating the remaining users to the user pool as they sign in for the first time. This just-in-time migration approach helps create a seamless authentication experience for your users.

Cognito, by default, uses the USER_SRP_AUTH authentication flow with the Secure Remote Password (SRP) protocol. This flow doesn’t involve sending the password across the network, but rather allows the client to exchange a cryptographic proof with the Cognito service to prove the client’s knowledge of the password. For JIT user migration, Cognito needs to verify the username and password against the existing user store. Therefore, you need to enable a different Cognito authentication flow. You can choose to use either the USER_PASSWORD_AUTH flow for client-side authentication or the ADMIN_USER_PASSWORD_AUTH flow for server-side authentication. This will allow the password to be sent to Cognito over an encrypted TLS connection, and allow Cognito to pass the information to the Lambda function to perform user authentication against the original user store.
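
A minimal sketch of enabling the client-side password flow on an existing app client with the AWS SDK for JavaScript v3 follows; the user pool ID and client ID are placeholders. Note that UpdateUserPoolClient resets any settings you omit to their defaults, so in practice you would pass the client’s full existing configuration along with the updated auth flows.

import {
  CognitoIdentityProviderClient,
  UpdateUserPoolClientCommand,
} from "@aws-sdk/client-cognito-identity-provider";

const cognito = new CognitoIdentityProviderClient({});

// Allow the password-based flow (alongside SRP and refresh tokens) on an existing app client
await cognito.send(new UpdateUserPoolClientCommand({
  UserPoolId: "us-east-1_EXAMPLE",   // hypothetical user pool ID
  ClientId: "1example23456789",      // hypothetical app client ID
  ExplicitAuthFlows: [
    "ALLOW_USER_PASSWORD_AUTH",
    "ALLOW_USER_SRP_AUTH",
    "ALLOW_REFRESH_TOKEN_AUTH",
  ],
}));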

This JIT approach might not be compatible with existing identity providers that have multi-factor authentication (MFA) enabled, because the Lambda function cannot support multiple rounds of challenges. If the existing identity provider requires MFA, you might consider the alternative JIT migration approach discussed later in this blog post.

Figure 5 illustrates the steps for the JIT sign-in flow. The mobile or web app first tries to sign in the user in the user pool. If the user isn’t already in the user pool, Cognito handles user authentication and invokes the Migrate User Lambda trigger to migrate the user. This flow keeps the logic in the app simple and allows the app to use the Amazon Cognito SDK to sign in users in the standard way. The migration logic takes place in the Lambda function in the backend.

Figure 5: JIT migration user authentication flow

Figure 5: JIT migration user authentication flow

The flow in Figure 5 starts in the mobile or web app, which attempts to sign in the user by using the AWS SDK. If the user doesn’t exist in the user pool, the migration attempt starts. Cognito calls the Migrate User Lambda trigger with triggerSource set to UserMigration_Authentication, and passes the user’s username and password in the request in order to attempt to migrate the user.

This approach also works in the forgot password flow shown in Figure 6, where the user has forgotten their password and hasn’t been migrated yet. In this case, once the user makes a “Forgot Password” request, your mobile or web app will send a forgot password request to Cognito. Cognito invokes your Migrate User Lambda trigger with triggerSource set to UserMigration_ForgotPassword, and passes the username in the request in order to attempt user lookup, migrate the user profile, and facilitate the password reset process.

Figure 6: JIT migration forgot password flow

Figure 6: JIT migration forgot password flow

Just-in-time user migration sample code

In this section, we show the overall structure of a sample Migrate User Lambda trigger. We will fill in the commented sections with additional code, shown later in this section. When you set up your own Lambda function, configure a Lambda execution role that grants permission to write to CloudWatch Logs.

const handler = async (event) => {
    if (event.triggerSource == "UserMigration_Authentication") {
        //***********************************************************************
        // Attempt to sign in the user or verify the password with existing identity store
        // (shown in the Section A – Migrate User of this post)
        //***********************************************************************
    }
    else if (event.triggerSource == "UserMigration_ForgotPassword") {
       //***********************************************************************
       // Attempt to look up the user in your existing identity store
       // (shown in Section B – Forgot Password of this post)
       //***********************************************************************
    }
    return event;
};
export { handler };

In the migration flow, the Lambda trigger will sign in the user and verify the user’s password in the existing user store. That may involve a sign-in attempt against your existing user store or a check of the password against a stored hash. You need to customize this step based on your existing setup. You can also create a function to fetch user attributes that you want to migrate. If your existing user store conforms to the OIDC specification, you can parse the ID Token claims to retrieve the user’s attributes. The following example shows how to set the username and attributes for the migrated user.

// Section A – Migrate User
if (event.triggerSource == "UserMigration_Authentication") {
// Attempt to sign in the user or verify the password with the existing user store.
// Add an authenticateUser() function based on your existing user store setup.
    const user = await authenticateUser(event.userName, event.request.password);
    if (user) {
        // Migrating user attributes from the source user store. You can migrate additional attributes as needed.
        event.response.userAttributes = {
            // Setting username and email address
            username: event.userName,
            email: user.emailAddress,
            email_verified: "true",
        };
        // Setting user status to CONFIRMED to autoconfirm users so they can sign in to the user pool
        event.response.finalUserStatus = "CONFIRMED";
        // Setting messageAction to SUPPRESS to decline to send the welcome message that Cognito usually sends to new users
        event.response.messageAction = "SUPPRESS";
    }
}

The user, along with their attributes, is now migrated from the existing user store to the user pool. Users will also be redirected to your application with the authorization code or JSON Web Tokens, depending on the OAuth 2.0 grant types you configured in the user pool.

Let’s look at the forgot password flow. Your Lambda function looks up the user in the existing user store, migrates the attributes in the user’s profile, and then sets those attributes in the response to the Cognito user pool. Cognito then initiates the ForgotPassword flow and sends a confirmation code to the user to confirm the password reset. The user needs a verified email address or phone number migrated from the existing user store to receive the confirmation code. The following sample code demonstrates how to complete the ForgotPassword flow.

// Section B – Forgot Password
else if (event.triggerSource == "UserMigration_ForgotPassword") {
        // Look up the user in your existing user store service.
        // Add a lookupUser() function based on your existing user store setup.
        const lookupResult = await lookupUser(event.userName);
        if (lookupResult) {
            // Setting user attributes from the source user store
            event.response.userAttributes = {
                username: event.userName,
                // Required to set verified communication to receive password recovery code
                email: lookupResult.emailAddress,
                email_verified: "true",
            };
            event.response.finalUserStatus = "RESET_REQUIRED";
            event.response.messageAction = "SUPPRESS";
        }
    }

Just-in-time user migration – alternative approach

Using the Migrate User Lambda trigger, we showed the JIT migration approach where the app switches to use the Cognito user pool at the beginning of the migration period, to interface with the user for signing in and migrating them from the existing user store. An alternative JIT approach is to maintain the existing systems and user store, but to silently create each user in the Cognito user pool in a backend process as users sign in, then switch over to use Cognito after enough users have been migrated.

Figure 7: JIT migration alternative approach with backend process

Figure 7: JIT migration alternative approach with backend process

Figure 7 shows this alternative approach in depth. When an end user signs in successfully in your mobile or web app, the backend migration process is initiated. This backend process first calls the Cognito admin API operation, AdminCreateUser, to create users and map user attributes in the destination user pool. The user will be created with a temporary password and be placed in FORCE_CHANGE_PASSWORD status. If you capture the user password during the sign-in process, you can also migrate the password by setting it permanently for the newly created user in the Cognito user pool using the AdminSetUserPassword API operation. This operation will also set the user status to CONFIRMED to allow the user to sign in to Cognito using the existing password.

Following is a code example for the AdminCreateUser function using the AWS SDK for JavaScript.

import { CognitoIdentityProviderClient, AdminCreateUserCommand } from "@aws-sdk/client-cognito-identity-provider";

var params = {
    MessageAction: "SUPPRESS",
    UserAttributes: [{
        Name: "name",
        Value: "Nikki Wolf"
    },
    {
        Name: "email",
        Value: "[email protected]"
    },
    {
        Name: "email_verified",
        Value: "true"
    }
    ],
    UserPoolId: "us-east-1_EXAMPLE",
    Username: "nikki_wolf"
};
const cognito = new CognitoIdentityProviderClient();
const createUserCommand = new AdminCreateUserCommand(params);
await cognito.send(createUserCommand);

The following is a code example for the AdminSetUserPassword function.

import { CognitoIdentityProviderClient, AdminSetUserPasswordCommand } from "@aws-sdk/client-cognito-identity-provider";

var params = {
    UserPoolId: 'us-east-1_EXAMPLE',
    Username: 'nikki_wolf',
    Password: 'ExamplePassword1$',
    Permanent: true
};
const cognito = new CognitoIdentityProviderClient();
const setUserPasswordCommand = new AdminSetUserPasswordCommand(params);
await cognito.send(setUserPasswordCommand);

This alternative approach does not require the app to update its authentication codebase until a majority of users are migrated, but you need to propagate user attribute changes and new user signups from the existing systems to Cognito. If you are capturing and migrating passwords, you should also build similar logic to capture password changes in existing systems and set the new password in the user pool to keep it synchronized until you perform a full switchover from the existing identity store to the Cognito user pool.

Summary and best practices

In this post, we described our two recommended approaches for migrating users into an Amazon Cognito user pool. You can decide which approach is best suited for your use case. The bulk method is simpler to implement, but it doesn’t preserve user passwords like the just-in-time migration does. The just-in-time migration is transparent to users and mitigates the potential attrition of users that can occur when users need to reset their passwords.

You could also consider a hybrid approach, where you first apply JIT migration as users are actively signing in to your app, and perform bulk import for the remaining less-active users. This hybrid approach helps provide a good experience for your active user communities, while being able to decommission existing user stores in a manageable timeline because you don’t need to wait for every user to sign in and be migrated through JIT migration.

We hope you can use these explanations and code samples to set up the most suitable approach for your migration project.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Edward Sun

Edward Sun

Edward is a Security Specialist Solutions Architect focused on identity and access management. He loves helping customers throughout their cloud transformation journey with architecture design, security best practices, migration, and cost optimizations. Outside of work, Edward enjoys hiking, golfing, and cheering for his alma mater, the Georgia Bulldogs.

How to share security telemetry per OU using Amazon Security Lake and AWS Lake Formation

Post Syndicated from Chris Lamont-Smith original https://aws.amazon.com/blogs/security/how-to-share-security-telemetry-per-ou-using-amazon-security-lake-and-aws-lake-formation/

This is the final part of a three-part series on visualizing security data using Amazon Security Lake and Amazon QuickSight. In part 1, Aggregating, searching, and visualizing log data from distributed sources with Amazon Athena and Amazon QuickSight, you learned how you can visualize metrics and logs centrally with QuickSight and AWS Lake Formation irrespective of the service or tool generating them. In part 2, How to visualize Amazon Security Lake findings with Amazon QuickSight, you learned how to integrate Amazon Athena with Security Lake and create visualizations with QuickSight of the data and events captured by Security Lake.

For companies where security administration and ownership are distributed across a single organization in AWS Organizations, it’s important to have a mechanism for securely sharing and visualizing security data. This can be achieved by enriching data within Security Lake with organizational unit (OU) structure and account tags and using AWS Lake Formation to securely share data across your organization on a per-OU basis. Users can then analyze and visualize security data of only those AWS accounts in the OU that they have been granted access to. Enriching the data enables users to effectively filter information using business-specific criteria, minimizing distractions and enabling them to concentrate on key priorities.

Distributed security ownership

It’s not unusual to find security ownership distributed across an organization in AWS Organizations. Take for example a parent company with legal entities operating under it, which are responsible for the security posture of the AWS accounts within their lines of business. Not only is each entity accountable for managing and reporting on security within its area, it must not be able to view the security data of other entities within the same organization.

In this post, we discuss a common example of distributing dashboards on a per-OU basis for visualizing security posture measured by the AWS Foundational Security Best Practices (FSBP) standard as part of AWS Security Hub. In this post, you learn how to use a simple tool published on AWS Samples to extract OU and account tags from your organization and automatically create row-level security policies to share Security Lake data to AWS accounts you specify. At the end, you will have an aggregated dataset of Security Hub findings enriched with AWS account metadata that you can use as a basis for building QuickSight dashboards.

Although this post focuses on sharing Security Hub data through Security Lake, the same steps can be performed to share any data—including Security Hub findings in Amazon S3—according to OU. You need to ensure any tables you want to share contain an AWS account ID column and that the tables are managed by Lake Formation.

Prerequisites

This solution assumes you have:

  • Followed the previous posts in this series and understand how Security Lake, Lake Formation, and QuickSight work together.
  • Enabled Security Lake across your organization and have set up a delegated administrator account.
  • Configured Security Hub across your organization and have enabled the AWS FSBP standard.

Example organization

AnyCorp Inc, a fictional organization, wants to provide security compliance dashboards to its two subsidiaries, ExampleCorpEast and ExampleCorpWest, so that each only has access to data for their respective companies.

Each subsidiary has an OU under AnyCorp’s organization as well as multiple nested OUs for each line of business they operate. ExampleCorpEast and ExampleCorpWest have their own security teams and each operates a security tooling AWS account and uses QuickSight for visibility of security compliance data. AnyCorp has implemented Security Lake to centralize the collection and availability of security data across their organization and has enabled Security Hub and the AWS FSBP standard across every AWS account.

Figure 1: Overview of AnyCorp Inc OU structure and AWS accounts

Figure 1: Overview of AnyCorp Inc OU structure and AWS accounts


Note: Although this post describes a fictional OU structure to demonstrate the grouping and distribution of security data, you can substitute your specific OU and AWS account details and achieve the same results.

Logical architecture

Figure 2: Logical overview of solution components

Figure 2: Logical overview of solution components

The solution includes the following core components:

  • An AWS Lambda function is deployed into the Security Lake delegated administrator account (Account A) and extracts AWS account metadata for grouping Security Lake data and manages secure sharing through Lake Formation.
  • Lake Formation implements row-level security using data filters to restrict access to Security Lake data to only records from AWS accounts in a particular OU. Lake Formation also manages the grants that allow consumer AWS accounts access to the filtered data.
  • An Amazon Simple Storage Service (Amazon S3) bucket is used to store metadata tables that the solution uses. Apache Iceberg tables are used to allow record-level updates in S3.
  • QuickSight is configured within each data consumer AWS account (Account B) and is used to visualize the data for the AWS accounts within an OU.

Deploy the solution

You can deploy the solution through either the AWS Management Console or the AWS Cloud Development Kit (AWS CDK).

To deploy the solution using the AWS Management Console, follow these steps:

  1. Download the CloudFormation template.
  2. In your Amazon Security Lake delegated administrator account (Account A), navigate to create a new AWS CloudFormation stack.
  3. Under Specify a template, choose Upload a template file and upload the file downloaded in the previous step. Then choose Next.
  4. Enter RowLevelSecurityLakeStack as the stack name.

    The table names used by Security Lake include AWS Region identifiers that you might need to change depending on the Region you’re using Security Lake in. Edit the following parameters if required and then choose Next.

    • MetadataDatabase: the name you want to give the metadata database.
      • Default: aws_account_metadata_db
    • SecurityLakeDB: the Security Lake database as registered by Security Lake.
      • Default: amazon_security_lake_glue_db_ap_southeast_2
    • SecurityLakeTable: the Security Lake table you want to share.
      • Default: amazon_security_lake_table_ap_southeast_2_sh_findings_1_0
  5. On the Configure stack options screen, leave all other values as default and choose Next.
  6. On the next screen, navigate to the bottom of the page and select the checkbox next to I acknowledge that AWS CloudFormation might create IAM resources. Choose Submit.

The solution takes about 5 minutes to deploy.

To deploy the solution using the AWS CDK, follow these steps:

  1. Download the code from the row-level-security-lake GitHub repository, where you can also contribute to the sample code. The CDK initializes your environment and uploads the Lambda assets to Amazon S3. Then, deploy the solution to your account.
  2. For a CDK deployment, you can edit the same Region identifier parameters discussed in the CloudFormation deployment option by editing the cdk.context.json file and changing the metadata_database, security_lake_db, and security_lake_table values if required.
  3. While you’re authenticated in the Security Lake delegated administrator account, you can bootstrap the account and deploy the solution by running the following commands:
  4. cdk bootstrap
    cdk deploy

Configuring the solution in the Security Lake delegated administrator account

After the solution has been successfully deployed, you can review the OUs discovered within your organization and specify which consumer AWS accounts (Account B) you want to share OU data with.

To specify AWS accounts to share OU security data with, follow these steps:

  1. While in the Security Lake delegated administrator account (Account A), go to the Lake Formation console.
  2. To view and update the metadata discovered by the Lambda function, you first must grant yourself access to the tables where it’s stored. Select the radio button for aws_account_metadata_db. Then, under the Action dropdown menu, select Grant.
  3. Figure 3: Creating a grant for your IAM role

    Figure 3: Creating a grant for your IAM role

  4. On the Grant data permissions page, under Principals, select the IAM users and roles dropdown and select the IAM role that you are currently logged in as.
  5. Under LF-Tags or catalog resources, select the Tables dropdown and select All tables.
  6. Figure 4: Choosing All Tables for the grant

    Figure 4: Choosing All Tables for the grant

  7. Under Table permissions, select Select, Insert, and Alter. These permissions let you view and update the data in the tables.
  8. Leave all other options as default and choose Grant.
  9. Now go to the AWS Athena console.
  10. Note: To use Athena for queries you must configure an S3 bucket to store query results. If this is the first time Athena is being used in your account, you will receive a message saying that you need to configure an S3 bucket. To do this, select the Edit settings button in the blue information notice and follow the instructions.

  11. On the left side, select aws_account_metadata_db as the Database. You will see aws_account_metadata and ou_groups as tables within the database.
  12. Figure 5: List of tables under the aws_account_metadata_db database

    Figure 5: List of tables under the aws_account_metadata_db database

  13. To view the OUs available within your organization, paste the following query into the Athena query editor window and choose Run.
  14. SELECT * FROM "aws_account_metadata_db"."ou_groups"
    

  15. Next, you must specify an AWS account you want to share an OU’s data with. Run the following SQL query in Athena and replace <AWS account Id> and <OU to assign> with values from your organization:
  16. UPDATE "aws_account_metadata_db"."ou_groups"
    SET consumer_aws_account_id = '<AWS account Id>'
    WHERE ou = '<OU to assign>' 

    In the example organization, all ExampleCorpWest security data is shared with AWS account 123456789012 (Account B) using the following SQL query:

    UPDATE "aws_account_metadata_db"."ou_groups"
    SET consumer_aws_account_id = '123456789012'
    WHERE ou = 'OU=root,OU=ExampleCorpWest'

    Note: You must specify the full OU path beginning with OU=root.

  17. Repeat this process for each OU you want to assign different AWS accounts to.
  18. Note: You can only assign one AWS account ID to each OU group.

  19. You can confirm that changes have been applied by running the Athena query from Step 3 again.
  20. SELECT * FROM "aws_account_metadata_db"."ou_groups"

You should see the AWS account ID you specified next to your OU.

Figure 6: Consumer AWS account listed against ExampleCorpWest OU

Figure 6: Consumer AWS account listed against ExampleCorpWest OU

Invoke the Lambda function manually

By default, the Lambda function is scheduled to run hourly to monitor for changes to AWS account metadata and to update Lake Formation sharing permissions (grants) if needed. To perform the remaining steps in this post without having to wait for the hourly run, you must manually invoke the Lambda function.

To invoke the Lambda function manually, follow these steps:

  1. Open the AWS Lambda console.
  2. Select the RowLevelSecurityLakeStack-* Lambda function.
  3. Under Code source, choose Test.
  4. The Lambda function doesn’t take any parameters. Enter rl-sec-lake-test as the Event name and leave all other options as the default. Choose Save.
  5. Choose Test again. The Lambda function will take approximately 5 minutes to complete in an environment with less than 100 AWS accounts.
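
If you prefer to invoke the function programmatically instead of through the console, a rough equivalent with the AWS SDK for JavaScript v3 follows; the function name is a placeholder, so replace it with the name created by your stack.

import { LambdaClient, InvokeCommand } from "@aws-sdk/client-lambda";

const lambda = new LambdaClient({});

// Invoke the solution's Lambda function synchronously; it takes no input payload
const response = await lambda.send(new InvokeCommand({
  FunctionName: "RowLevelSecurityLakeStack-ExampleFunctionName", // placeholder; use your deployed function name
  InvocationType: "RequestResponse",
}));

console.log(`Status code: ${response.StatusCode}`);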

After the Lambda function has finished, you can review the data cell filters and grants that have been created in Lake Formation to securely share Security Lake data with your consumer AWS account (Account B).

To review the data filters and grants, follow these steps:

  1. Open the Lake Formation console.
  2. In the navigation pane, select Data filters under Data catalog to see a list of data cell filters that have been created for each OU that you assigned a consumer AWS account to. One filter is created per table. Each consumer AWS account is granted restricted access to the aws_account_metadata table and the aggregated Security Lake table.
  3. Figure 7: Viewing data filters in Lake Formation

    Figure 7: Viewing data filters in Lake Formation

  4. Select one of the filters in the list and choose Edit. Edit data filter displays information about the filter such as the database and table it’s applied to, as well as the Row filter expression that enforces row-level security to only return rows where the AWS account ID is in the OU it applies to. Choose Cancel to close the window.
  5. Figure 8: Details of a data filter showing row filter expression

    Figure 8: Details of a data filter showing row filter expression

  6. To see how the filters are used to grant restricted access to your tables, select Data lake permissions under Permissions in the navigation pane. In the search bar under Data permissions, enter the AWS account ID for your consumer AWS account (Account B) and press Enter. You will see a list of all the grants applied to that AWS account. Scroll to the right to see a column titled Resource that lists the names of the data cell filters you saw in the previous step.
  7. Figure 9: Grants to the data consumer account for data filters

    Figure 9: Grants to the data consumer account for data filters

You can now move on to setting up the consumer AWS account.

Configuring QuickSight in the consumer AWS account (Account B)

Now that you’ve configured everything in the Security Lake delegated administrator account (Account A), you can configure QuickSight in the consumer account (Account B).

To confirm you can access shared tables, follow these steps:

  1. Sign in to your consumer AWS account (also known as Account B).
  2. Follow the same steps as outlined in this previous post to accept the AWS Resource Access Manager invitation, create a new database, and create resource links for the aws_account_metadata and amazon_security_lake_table_<region>_sh_findings_1_0 tables that have been shared with your consumer AWS account. Make sure you create resource links for both tables shared with the account. When done, return to this post and continue with step 3.
  3. [Optional] After the resource links have been created, test that you’re able to query the data by selecting the radio button next to the aws_account_metadata resource link, select Actions, and then select View data under Table. This takes you to the Athena query editor where you can now run queries on the shared tables.
  4. Figure 10: Selecting View data in Lake Formation to open Athena

    Figure 10: Selecting View data in Lake Formation to open Athena

    Note: To use Athena for queries you must configure an S3 bucket to store query results. If this is the first time using Athena in your account, you will receive a message saying that you need to configure an S3 bucket. To do this, choose Edit settings in the blue information notice and follow the instructions.

  5. In the Editor configuration, select AwsDataCatalog from the Data source options. The Database should be the database you created in the previous steps, for example security_lake_visualization. After selecting the database, copy the SQL query that follows and paste it into your Athena query editor, and choose Run. You will only see rows of account information from the OU you previously shared.
  6. SELECT * FROM "security_lake_visualization"."aws_account_metadata"

  7. Next, to enrich your Security Lake data with the AWS account metadata, create an Athena view that joins the datasets and filters the results to return only findings from the AWS Foundational Security Best Practices standard. You can do this by copying the following query and running it in the Athena query editor.
  8. CREATE OR REPLACE VIEW "security_hub_fsbps_joined_view" AS 
    WITH
      security_hub AS (
       SELECT *
       FROM
         "security_lake_visualization"."amazon_security_lake_table_ap_southeast_2_sh_findings_1_0"
       WHERE (metadata.product.feature.uid LIKE 'aws-foundational-security-best-practices%')
    ) 
    SELECT
      amm.*
    , security_hub.*
    FROM
      (security_hub
    INNER JOIN "security_lake_visualization"."aws_account_metadata" amm ON (security_hub.cloud.account_uid = amm.id))

The SQL above performs a subquery to find only those findings in the Security Lake table that are from the AWS FSBP standard and then joins those rows with the aws_account_metadata table based on the AWS account ID. You can see it has created a new view listed under Views containing enriched security data that you can import as a dataset in QuickSight.

Figure 11: Additional view added to the security_lake_visualization database

Figure 11: Additional view added to the security_lake_visualization database

Configuring QuickSight

To perform the initial steps to set up QuickSight in the consumer AWS account, you can follow the steps listed in the second post in this series. You must also provide the following grants to your QuickSight user:

  • GRANT SELECT on security_hub_fsbps_joined_view
  • GRANT DESCRIBE on aws_metadata_db (resource link)
  • GRANT DESCRIBE on amazon_security_lake_table_<region>_sh_findings_1_0 (resource link)
  • GRANT ON TARGET SELECT on aws_metadata_db (resource link)
  • GRANT ON TARGET SELECT on amazon_security_lake_table_<region>_sh_findings_1_0 (resource link)
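
You can create these grants in the Lake Formation console, or script them. The following is a hedged sketch using the AWS SDK for JavaScript v3 that grants the QuickSight user DESCRIBE on one of the resource links; the principal ARN, database, and table names are placeholders, and you would repeat the call (or grant on the target table) for the remaining permissions listed above.

import { LakeFormationClient, GrantPermissionsCommand } from "@aws-sdk/client-lakeformation";

const lakeformation = new LakeFormationClient({});

// Grant the QuickSight user DESCRIBE on the aws_account_metadata resource link
await lakeformation.send(new GrantPermissionsCommand({
  Principal: {
    // QuickSight users and groups are identified by their ARN
    DataLakePrincipalIdentifier: "arn:aws:quicksight:ap-southeast-2:111122223333:user/default/security-analyst", // placeholder
  },
  Resource: {
    Table: {
      DatabaseName: "security_lake_visualization", // database created in the consumer account
      Name: "aws_account_metadata",                // the resource link
    },
  },
  Permissions: ["DESCRIBE"],
}));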

To create a new dataset in QuickSight, follow these steps:

  1. After your QuickSight user has the necessary permissions, open the QuickSight console and verify that you’re in the same Region where Lake Formation is sharing the data.
  2. Add your data by choosing Datasets from the navigation pane and then selecting New dataset. To create a new dataset from new data sources, select Athena.
  3. Enter a data source name, for example security_lake_visualization, leave the Athena workgroup as [ primary ]. Then choose Create data source.
  4. The next step is to select the tables to build your dashboards. On the Choose your table prompt, for Catalog, select AwsDataCatalog. For Database, select the database you created in the previous steps, for example security_lake_visualization. For Table, select the security_hub_fsbps_joined_view you created previously and choose Edit/Preview data.
  5. Figure 12: Choosing the joined dataset in QuickSight

    Figure 12: Choosing the joined dataset in QuickSight

  6. You will be taken to a screen where you can preview the data in your dataset.
  7. Figure 13: Previewing data in QuickSight

    Figure 13: Previewing data in QuickSight

  8. After you confirm you’re able to preview the data from the view, select the SPICE radio button in the bottom left of the screen and then choose PUBLISH & VISUALIZE.
  9. You can now create analyses and dashboards from Security Hub AWS FSBP standard findings per OU and filter data based on business dimensions available to you through OU structure and account tags.
  10. Figure 14: QuickSight dashboard showing only ExampleCorpWest OU data and incorporating business dimensions

    Figure 14: QuickSight dashboard showing only ExampleCorpWest OU data and incorporating business dimensions

Clean up the resources

To clean up the resources that you created for this example:

  1. Sign in to the Security Lake delegated admin account and delete the CloudFormation stack by either:
    • Using the CloudFormation console to delete the stack, or
    • Using the AWS CDK to run cdk destroy in your terminal. Follow the instructions and enter y when prompted to delete the stack.
  2. Remove any data filters you created by navigating to data filters within Lake Formation, selecting each one and choosing Delete.

Conclusion

In this final post of the series on visualizing Security Lake data with QuickSight, we introduced you to using a tool—available from AWS Samples—to extract OU structure and account metadata from your organization and use it to securely share Security Lake data on a per-OU basis across your organization. You learned how to enrich Security Lake data with account metadata and use it to create row-level security controls in Lake Formation. You were then able to address a common example of distributing security posture measured by the AWS Foundational Security Best Practices standard as part of AWS Security Hub.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Chris Lamont-Smith

Chris Lamont-Smith

Chris is a Senior Security Consultant working in the Security, Risk and Compliance team for AWS ProServe based out of Perth, Australia. He enjoys working in the area where security and data analytics intersect, and is passionate about helping customers gain actionable insights from their security data. When Chris isn’t working, he is out camping or off-roading with his family in the Australian bush.

Implement model versioning with Amazon Redshift ML

Post Syndicated from Rohit Bansal original https://aws.amazon.com/blogs/big-data/implement-model-versioning-with-amazon-redshift-ml/

Amazon Redshift ML allows data analysts, developers, and data scientists to train machine learning (ML) models using SQL. In previous posts, we demonstrated how you can use the automatic model training capability of Redshift ML to train classification and regression models. Redshift ML allows you to create a model using SQL and specify your algorithm, such as XGBoost. You can use Redshift ML to automate data preparation, preprocessing, and selection of your problem type (for more information, refer to Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML). You can also bring a model previously trained in Amazon SageMaker into Amazon Redshift via Redshift ML for local inference. For local inference on models created in SageMaker, the ML model type must be supported by Redshift ML. However, remote inference is available for model types that are not natively available in Redshift ML.

Over time, ML models grow old, and even if nothing drastic happens, small changes accumulate. Common reasons why ML models need to be retrained or audited include:

  • Data drift – Because your data has changed over time, the prediction accuracy of your ML models may begin to decrease compared to the accuracy exhibited during testing
  • Concept drift – The ML algorithm that was initially used may need to be changed due to different business environments and other changing needs

You may need to refresh the model on a regular basis, automate the process, and reevaluate your model’s improved accuracy. As of this writing, Amazon Redshift doesn’t support versioning of ML models. In this post, we show how you can use the bring your own model (BYOM) functionality of Redshift ML to implement versioning of Redshift ML models.

We use local inference to implement model versioning as part of operationalizing ML models. We assume that you have a good understanding of your data and the problem type that is most applicable for your use case, and have created and deployed models to production.

Solution overview

In this post, we use Redshift ML to build a regression model that predicts the number of people that may use the city of Toronto’s bike sharing service at any given hour of a day. The model accounts for various aspects, including holidays and weather conditions, and because we need to predict a numerical outcome, we used a regression model. We use data drift as a reason for retraining the model, and use model versioning as part of the solution.

After a model is validated and is being used regularly for predictions, you can create versions of the model as you retrain it using an updated training set and possibly a different algorithm. Versioning serves two main purposes:

  • You can refer to prior versions of a model for troubleshooting or audit purposes. This enables you to ensure that your model still retains high accuracy before switching to a newer model version.
  • You can continue to run inference queries on the current version of a model during the model training process of the new version.

At the time of this writing, Redshift ML doesn’t have native versioning capabilities, but you can still achieve versioning with a few simple SQL techniques using the BYOM capability. BYOM was introduced so that you can run inference queries in Amazon Redshift against models pre-trained in SageMaker. In this post, we use the same BYOM technique to create a version of an existing model built using Redshift ML.

The following figure illustrates this workflow.

In the following sections, we show you how you can create a version from an existing model and then perform model retraining.

Prerequisites

As a prerequisite for implementing the example in this post, you need to set up a Redshift cluster or Amazon Redshift Serverless endpoint. For the preliminary steps to get started and set up your environment, refer to Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.

We use the regression model created in the post Build regression models with Amazon Redshift ML. We assume that it has already been deployed, and we use this model to create new versions and retrain it.

Create a version from the existing model

The first step is to create a version of the existing model (that is, to save its current state) so that a history is maintained and the model remains available for comparison later on.

The following code is the generic format of the CREATE MODEL command syntax; in the next step, you get the information needed to use this command to create a new version:

CREATE MODEL model_name
    FROM ('job_name' | 's3_path' )
    FUNCTION function_name ( data_type [, ...] )
    RETURNS data_type
    IAM_ROLE { default }
    [ SETTINGS (
      S3_BUCKET 'bucket', | --required
      KMS_KEY_ID 'kms_string') --optional
    ];

Next, we collect the input parameters needed for the preceding CREATE MODEL command. We need the job name and the data types of the model input and output values, which we collect by running the show model command on our existing model. Run the following command in Amazon Redshift Query Editor v2:

show model predict_rental_count;

Note the values for AutoML Job Name, Function Parameter Types, and the Target Column (trip_count) from the model output. We use these values in the CREATE MODEL command to create the version.

The following CREATE MODEL statement creates a version of the current model using the values collected from our show model command. We append the date (the example format is YYYYMMDD) to the end of the model and function names to track when this new version was created.

CREATE MODEL predict_rental_count_20230706 
FROM 'redshiftml-20230706171639810624' 
FUNCTION predict_rental_count_20230706 (int4, int4, int4, int4, int4, int4, int4, numeric, numeric, int4)
RETURNS float8 
IAM_ROLE default
SETTINGS (
S3_BUCKET '<<your S3 Bucket>>');

This command may take few minutes to complete. When it’s complete, run the following command:

show model predict_rental_count_20230706;

We can observe the following in the output:

  • AutoML Job Name is the same as the original version of the model
  • Function Name shows the new name, as expected
  • Inference Type shows Local, which designates this is BYOM with local inference

You can run inference queries using both versions of the model to validate the inference outputs.

The following screenshot shows the output of the model inference using the original version.

The following screenshot shows the output of model inference using the version copy.

As you can see, the inference outputs are the same.

You have now learned how to create a version of a previously trained Redshift ML model.

Retrain your Redshift ML model

After you create a version of an existing model, you can retrain the existing model by simply creating a new model.

You can create and train a new model using the same CREATE MODEL command but with different input parameters, datasets, or problem types as applicable. For this post, we retrain the model on newer datasets. We append _new to the model name so it’s similar to the existing model for identification purposes.

In the following code, we use the CREATE MODEL command with a new dataset available in the training_data table:

CREATE MODEL predict_rental_count_new
FROM training_data
TARGET trip_count
FUNCTION predict_rental_count_new
IAM_ROLE 'arn:aws:iam::<accountid>:role/RedshiftML'
PROBLEM_TYPE regression
OBJECTIVE 'mse'
SETTINGS (s3_bucket 'redshiftml-<your-account-id>',
          s3_garbage_collect off,
          max_runtime 5000);

Run the following command to check the status of the new model:

show model predict_rental_count_new;

Replace the existing Redshift ML model with the retrained model

The last step is to replace the existing model with the retrained model. We do this by dropping the original version of the model and recreating a model using the BYOM technique.

First, check your retrained model to ensure the MSE/RMSE scores are staying stable between model training runs. To validate the models, you can run inferences by each of the model functions on your dataset and compare the results. We use the inference queries provided in Build regression models with Amazon Redshift ML.

After validation, you can replace your model.

Start by collecting the details of the predict_rental_count_new model.

Note the AutoML Job Name value, the Function Parameter Types values, and the Target Column name in the model output.

Replace the original model by dropping the original model and then creating the model with the original model and function names to make sure the existing references to the model and function names work:

drop model predict_rental_count;

-- Use the AutoML Job Name and function parameter types noted from the show model output
CREATE MODEL predict_rental_count
FROM 'redshiftml-20230706171639810624'
FUNCTION predict_rental_count(int4, int4, int4, int4, int4, int4, int4, numeric, numeric, int4)
RETURNS float8
IAM_ROLE default
SETTINGS (
S3_BUCKET '<<your S3 Bucket>>');

The model creation should complete in a few minutes. You can check the status of the model by running the following command:

show model predict_rental_count;

When the model status is ready, the newer version predict_rental_count of your existing model is available for inference and the original version of the ML model predict_rental_count_20230706 is available for reference if needed.

Please refer to this GitHub repository for sample scripts to automate model versioning.
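
As one hedged example of what such automation could look like with the AWS SDK for JavaScript v3, the following sketch submits a dated BYOM CREATE MODEL statement through the Amazon Redshift Data API; the workgroup, database, S3 bucket, and AutoML job name are placeholders that you would take from your own environment and show model output.

import {
  RedshiftDataClient,
  ExecuteStatementCommand,
  DescribeStatementCommand,
} from "@aws-sdk/client-redshift-data";

const redshiftData = new RedshiftDataClient({});

// Build a dated version name, for example predict_rental_count_20230706
const suffix = new Date().toISOString().slice(0, 10).replace(/-/g, "");
const versionName = `predict_rental_count_${suffix}`;

// Submit the BYOM CREATE MODEL statement that snapshots the current model as a version
const { Id } = await redshiftData.send(new ExecuteStatementCommand({
  WorkgroupName: "default",   // or use ClusterIdentifier and DbUser for a provisioned cluster
  Database: "dev",            // placeholder database name
  Sql: `CREATE MODEL ${versionName}
          FROM 'redshiftml-20230706171639810624'
          FUNCTION ${versionName} (int4, int4, int4, int4, int4, int4, int4, numeric, numeric, int4)
          RETURNS float8
          IAM_ROLE default
          SETTINGS (S3_BUCKET 'your-s3-bucket');`,
}));

// Check the statement status; poll until it reaches FINISHED or FAILED
const { Status } = await redshiftData.send(new DescribeStatementCommand({ Id }));
console.log(`CREATE MODEL statement ${Id} is ${Status}`);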

Conclusion

In this post, we showed how you can use the BYOM feature of Redshift ML to do model versioning. This allows you to have a history of your models so that you can compare model scores over time, respond to audit requests, and run inferences while training a new model.

For more information about building different models with Redshift ML, refer to Amazon Redshift ML.


About the Authors

Rohit Bansal is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to build next-generation analytics solutions using other AWS Analytics services.

Phil Bates is a Senior Analytics Specialist Solutions Architect at AWS. He has more than 25 years of experience implementing large-scale data warehouse solutions. He is passionate about helping customers through their cloud journey and using the power of ML within their data warehouse.