Tag Archives: Technical How-to

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

Post Syndicated from Pradeep Misra original https://aws.amazon.com/blogs/big-data/use-your-corporate-identities-for-analytics-with-amazon-emr-and-aws-iam-identity-center/

To enable your workforce users for analytics with fine-grained data access controls and audit data access, you might have to create multiple AWS Identity and Access Management (IAM) roles with different data permissions and map the workforce users to one of those roles. Multiple users are often mapped to the same role where they need similar privileges to enable data access controls at the corporate user or group level and audit data access.

AWS IAM Identity Center enables centralized management of workforce user access to AWS accounts and applications using a local identity store or by connecting corporate directories via identity providers (IdPs). IAM Identity Center now supports trusted identity propagation, a streamlined experience for users who require access to data with AWS analytics services.

Amazon EMR Studio is an integrated development environment (IDE) that makes it straightforward for data scientists and data engineers to build data engineering and data science applications. With trusted identity propagation, data access management can be based on a user’s corporate identity and can be propagated seamlessly as they access data with single sign-on to build analytics applications with Amazon EMR (EMR Studio and Amazon EMR on EC2).

AWS Lake Formation allows data administrators to centrally govern, secure, and share data for analytics and machine learning (ML). With trusted identity propagation, data administrators can directly provide granular access to corporate users using their identity attributes and simplify the traceability of end-to-end data access across AWS services. Because access is managed based on a user’s corporate identity, they don’t need to use database local user credentials or assume an IAM role to access data.

In this post, we show how to bring your workforce identity to EMR Studio for analytics use cases, directly manage fine-grained permissions for the corporate users and groups using Lake Formation, and audit their data access.

Solution overview

For our use case, we want to enable a data analyst user named analyst1 to use their own enterprise credentials to query data they have been granted permissions to and audit their data access. We use Okta as the IdP for this demonstration. The following diagram illustrates the solution architecture.

This architecture is based on the following components:

  • Okta is responsible for maintaining the corporate user identities, related groups, and user authentication.
  • IAM Identity Center connects Okta users and centrally manages their access across AWS accounts and applications.
  • Lake Formation provides fine-grained access controls on data directly to corporate users using trusted identity propagation.
  • EMR Studio is an IDE for users to build and run applications. It allows users to log in directly with their corporate credentials without signing in to the AWS Management Console.
  • AWS Service Catalog provides a product template to create EMR clusters.
  • EMR cluster is integrated with IAM Identity Center using a security configuration.
  • AWS CloudTrail captures user data access activities.

The following are the high-level steps to implement the solution:

  1. Integrate Okta with IAM Identity Center.
  2. Set up Amazon EMR Studio.
  3. Create an IAM Identity Center enabled security configuration for EMR clusters.
  4. Create a Service Catalog product template to create the EMR clusters.
  5. Use Lake Formation to grant permissions to users to access data.
  6. Test the solution by accessing data with a corporate identity.
  7. Audit user data access.

Prerequisites

You should have the following prerequisites:

Integrate Okta with IAM Identity Center

For more information about configuring Okta with IAM Identity Center, refer to Configure SAML and SCIM with Okta and IAM Identity Center.

For this setup, we have created two users, analyst1 and engineer1, and assigned them to the corresponding Okta application. You can validate the integration is working by navigating to the Users page on the IAM Identity Center console, as shown in the following screenshot. Both enterprise users from Okta are provisioned in IAM Identity Center.

The following exact users will not be listed in your account. You can either create similar users or use an existing user.

Each provisioned user in IAM Identity Center has a unique user ID. This ID does not originate from Okta; it’s created in IAM Identity Center to uniquely identify this user. With trusted identity propagation, this user ID will be propagated across services and also used for traceability purposes in CloudTrail. The following screenshot shows the IAM Identity Center user matching the provisioned Okta user analyst1.

Choose the link under AWS access portal URL and log in with the analyst1 Okta user credentials that are already assigned to this application.

If you are able to log in and see the landing page, then all your configurations up to this step are set correctly. You will not see any applications on this page yet.

Set up EMR Studio

In this step, we demonstrate the actions needed from the data lake administrator to set up EMR Studio enabled for trusted identity propagation and with IAM Identity Center integration. This allows users to directly access EMR Studio with their enterprise credentials.

Note: All Amazon S3 buckets (created after January 5, 2023) have encryption configured by default (Amazon S3 managed keys (SSE-S3)), and all new objects that are uploaded to an S3 bucket are automatically encrypted at rest. To use a different type of encryption, to meet your security needs, please update the default encryption configuration for the bucket. See Protecting data for server-side encryption for further details.

  • On the Amazon EMR console, choose Studios in the navigation pane under EMR Studio.
  • Choose Create Studio.

  • For Setup options¸ select Custom.
  • For Studio name, enter a name (for this post, emr-studio-with-tip).
  • For S3 location for Workspace storage, select Select existing location and enter an existing S3 bucket (if you have one). Otherwise, select Create new bucket.

  • For Service role to let Studio access your AWS resources, choose View permissions details to get the trust and IAM policy information that is needed and create a role with those specific policies in IAM. In this case, we create a new role called emr_tip_role.

  • For Service role to let Studio access your AWS resources, choose the IAM role you created.
  • For Workspace name, enter a name (for this post, studio-workspace-with-tip).

  • For Authentication, select IAM Identity Center.
  • For User role¸ you can create a new role or choose an existing role. For this post, we choose the role we created (emr_tip_role).
  • To use the same role, add the following statement to the trust policy of the service role:
{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "elasticmapreduce.amazonaws.com",
 "AWS": "arn:aws:iam::xxxxxx:role/emr_tip_role"
      },
      "Action": [
              "sts:AssumeRole",
              "sts:SetContext"
              ]
    }
  ]
}
  • Select Enable trusted identity propagation to allow you to control and log user access across connected applications.

  • For Choose who can access your application, select All users and groups.

Later, we restrict access to resources using Lake Formation. However, there is an option here to restrict access to only assigned users and groups.

  • In the Networking and security section, you can provide optional details for your VPC, subnets, and security group settings.
  • Choose Create Studio.

  • On the Studios page of the Amazon EMR console, locate your Studio enabled with IAM Identity Center.
  • Copy the link for Studio Access URL.

  • Enter the URL into a web browser and log in using Okta credentials.

You should be able to successfully sign in to the EMR Studio console.

Create an AWS Identity Center enabled security configuration for EMR clusters

EMR security configurations allow you to configure data encryption, Kerberos authentication, and Amazon S3 authorization for the EMR File System (EMRFS) on the clusters. The security configuration is available to use and reuse when you create clusters.

To integrate Amazon EMR with IAM Identity Center, you need to first create an IAM role that authenticates with IAM Identity Center from the EMR cluster. Amazon EMR uses IAM credentials to relay the IAM Identity Center identity to downstream services such as Lake Formation. The IAM role should also have the respective permissions to invoke the downstream services.

  1. Create a role (for this post, called emr-idc-application) with the following trust and permission policy. The role referenced in the trust policy is the InstanceProfile role for EMR clusters. This allows the EC2 instance profile to assume this role and act as an identity broker on behalf of the federated users.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AssumeRole",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::xxxxxxxxxxn:role/service-role/AmazonEMR-InstanceProfile-20240127T102444"
            },
            "Action": [
                "sts:AssumeRole",
                "sts:SetContext"
            ]
        }
    ]
}
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "IdCPermissions",
            "Effect": "Allow",
            "Action": [
                "sso-oauth:*"
            ],
            "Resource": "*"
        },
        {
            "Sid": "GlueandLakePermissions",
            "Effect": "Allow",
            "Action": [
                "glue:*",
                "lakeformation:GetDataAccess"
            ],
            "Resource": "*"
        },
        {
            "Sid": "S3Permissions",
            "Effect": "Allow",
            "Action": [
                "s3:GetDataAccess",
                "s3:GetAccessGrantsInstanceForPrefix"
            ],
            "Resource": "*"
        }
    ]
}

Next, you create certificates for encrypting data in transit with Amazon EMR.

  • For this post, we use OpenSSL to generate a self-signed X.509 certificate with a 2048-bit RSA private key.

The key allows access to the issuer’s EMR cluster instances in the AWS Region being used. For a complete guide on creating and providing a certificate, refer to Providing certificates for encrypting data in transit with Amazon EMR encryption.

  • Upload my-certs.zip to an S3 location that will be used to create the security configuration.

The EMR service role should have access to the S3 location. The key allows access to the issuer’s EMR cluster instances in the us-west-2 Region as specified by the *.us-west-2.compute.internal domain name as the common name. You can change this to the Region your cluster is in.

$ openssl req -x509 -newkey rsa:2048 -keyout privateKey.pem -out certificateChain.pem -days 365 -nodes -subj '/CN=*.us-west-2.compute.internal'
$ cp certificateChain.pem trustedCertificates.pem
$ zip -r -X my-certs.zip certificateChain.pem privateKey.pem trustedCertificates.pem
  • Create an EMR security configuration with IAM Identity Center enabled from the AWS Command Line Interface (AWS CLI) with the following code:
aws emr create-security-configuration --name "IdentityCenterConfiguration-with-lf-tip" --region "us-west-2" --endpoint-url https://elasticmapreduce.us-west-2.amazonaws.com --security-configuration '{
    "AuthenticationConfiguration":{
        "IdentityCenterConfiguration":{
            "EnableIdentityCenter":true,
            "IdentityCenterApplicationAssigmentRequired":false,
            "IdentityCenterInstanceARN": "arn:aws:sso:::instance/ssoins-7907b0d7d77e3e0d",
            "IAMRoleForEMRIdentityCenterApplicationARN": "arn:aws:iam::1xxxxxxxxx0:role/emr-idc-application"
        }
    },
    "AuthorizationConfiguration": {
        "LakeFormationConfiguration": {
            "EnableLakeFormation": true
        }
    },
    "EncryptionConfiguration": {
        "EnableInTransitEncryption": true,
        "EnableAtRestEncryption": false,
        "InTransitEncryptionConfiguration": {
            "TLSCertificateConfiguration": {
                "CertificateProviderType": "PEM",
                "S3Object": "s3://<<Bucket Name>>/emr-transit-encry-certs/my-certs.zip"
            }
        }
    }
}' 

You can view the security configuration on the Amazon EMR console.

Create a Service Catalog product template to create EMR clusters

EMR Studio with trusted identity propagation enabled can only work with clusters created from a template. Complete the following steps to create a product template in Service Catalog:

  • On the Service Catalog console, choose Portfolios under Administration in the navigation pane.
  • Choose Create portfolio.

  • Enter a name for your portfolio (for this post, EMR Clusters Template) and an optional description.
  • Choose Create.

  • On the Portfolios page, choose the portfolio you just created to view its details.

  • On the Products tab, choose Create product.

  • For Product type, select CloudFormation.
  • For Product name, enter a name (for this post, EMR-7.0.0).
  • Use the security configuration IdentityCenterConfiguration-with-lf-tip you created in previous steps with the appropriate Amazon EMR service roles.
  • Choose Create product.

The following is an example CloudFormation template. Update the account-specific values for SecurityConfiguration, JobFlowRole, ServiceRole, LogUri, Ec2KeyName, and Ec2SubnetId. We provide a sample Amazon EMR service role and trust policy in Appendix A at the end of this post.

'Parameters':
  'ClusterName':
    'Type': 'String'
    'Default': 'EMR_TIP_Cluster'
  'EmrRelease':
    'Type': 'String'
    'Default': 'emr-7.0.0'
    'AllowedValues':
    - 'emr-7.0.0'
  'ClusterInstanceType':
    'Type': 'String'
    'Default': 'm5.xlarge'
    'AllowedValues':
    - 'm5.xlarge'
    - 'm5.2xlarge'
'Resources':
  'EmrCluster':
    'Type': 'AWS::EMR::Cluster'
    'Properties':
      'Applications':
      - 'Name': 'Spark'
      - 'Name': 'Livy'
      - 'Name': 'Hadoop'
      - 'Name': 'JupyterEnterpriseGateway'       
      'SecurityConfiguration': 'IdentityCenterConfiguration-with-lf-tip'
      'EbsRootVolumeSize': '20'
      'Name':
        'Ref': 'ClusterName'
      'JobFlowRole': <Instance Profile Role>
      'ServiceRole': <EMR Service Role>
      'ReleaseLabel':
        'Ref': 'EmrRelease'
      'VisibleToAllUsers': !!bool 'true'
      'LogUri':
        'Fn::Sub': <S3 LOG Path>
      'Instances':
        "Ec2KeyName" : <Key Pair Name>
        'TerminationProtected': !!bool 'false'
        'Ec2SubnetId': <subnet-id>
        'MasterInstanceGroup':
          'InstanceCount': !!int '1'
          'InstanceType':
            'Ref': 'ClusterInstanceType'
        'CoreInstanceGroup':
          'InstanceCount': !!int '2'
          'InstanceType':
            'Ref': 'ClusterInstanceType'
          'Market': 'ON_DEMAND'
          'Name': 'Core'
'Outputs':
  'ClusterId':
    'Value':
      'Ref': 'EmrCluster'
    'Description': 'The ID of the  EMR cluster'
'Metadata':
  'AWS::CloudFormation::Designer': {}
'Rules': {}

Trusted identity propagation is supported from Amazon EMR 6.15 onwards. For Amazon EMR 6.15, add the following bootstrap action to the CloudFormation script:

'BootstrapActions':
- 'Name': 'spark-config'
'ScriptBootstrapAction':
'Path': 's3://emr-data-access-control-<aws-region>/customer-bootstrap-actions/idc-fix/replace-puppet.sh'

The portfolio now should have the EMR cluster creation product added.

  • Grant the EMR Studio role emr_tip_role access to the portfolio.

Grant Lake Formation permissions to users to access data

In this step, we enable Lake Formation integration with IAM Identity Center and grant permissions to the Identity Center user analyst1. If Lake Formation is not already enabled, refer to Getting started with Lake Formation.

To use Lake Formation with Amazon EMR, create a custom role to register S3 locations. You need to create a new custom role with Amazon S3 access and not use the default role AWSServiceRoleForLakeFormationDataAccess. Additionally, enable external data filtering in Lake Formation. For more details, refer to Enable Lake Formation with Amazon EMR.

Complete the following steps to manage access permissions in Lake Formation:

  • On the Lake Formation console, choose IAM Identity Center integration under Administration in the navigation pane.

Lake Formation will automatically specify the correct IAM Identity Center instance.

  • Choose Create.

You can now view the IAM Identity Center integration details.

For this post, we have a Marketing database and a customer table on which we grant access to our enterprise user analyst1. You can use an existing database and table in your account or create a new one. For more examples, refer to Tutorials.

The following screenshot shows the details of our customer table.

Complete the following steps to grant analyst1 permissions. For more information, refer to Granting table permissions using the named resource method.

  • On the Lake Formation console, choose Data lake permissions under Permissions in the navigation pane.
  • Choose Grant.

  • Select Named Data Catalog resources.
  • For Databases, choose your database (marketing).
  • For Tables, choose your table (customer).

  • For Table permissions, select Select and Describe.
  • For Data permissions, select All data access.
  • Choose Grant.

The following screenshot shows a summary of permissions that user analyst1 has. They have Select access on the table and Describe permissions on the databases.

Test the solution

To test the solution, we log in to EMR Studio as enterprise user analyst1, create a new Workspace, create an EMR cluster using a template, and use that cluster to perform an analysis. You could also use the Workspace that was created during the Studio setup. In this demonstration, we create a new Workspace.

You need additional permissions in the EMR Studio role to create and list Workspaces, use a template, and create EMR clusters. For more details, refer to Configure EMR Studio user permissions for Amazon EC2 or Amazon EKS. Appendix B at the end of this post contains a sample policy.

When the cluster is available, we attach the cluster to the Workspace and run queries on the customer table, which the user has access to.

User analyst1 is now able to run queries for business use cases using their corporate identity. To open a PySpark notebook, we choose PySpark under Notebook.

When the notebook is open, we run a Spark SQL query to list the databases:

%%sql 
show databases

In this case, we query the customer table in the marketing database. We should be able to access the data.

%%sql
select * from marketing.customer

Audit data access

Lake Formation API actions are logged by CloudTrail. The GetDataAccess action is logged whenever a principal or integrated AWS service requests temporary credentials to access data in a data lake location that is registered with Lake Formation. With trusted identity propagation, CloudTrail also logs the IAM Identity Center user ID of the corporate identity who requested access to the data.

The following screenshot shows the details for the analyst1 user.

Choose View event to view the event logs.

The following is an example of the GetDataAccess event log. We can trace that user analyst1, Identity Center user ID c8c11390-00a1-706e-0c7a-bbcc5a1c9a7f, has accessed the customer table.

{
    "eventVersion": "1.09",
    
….
        "onBehalfOf": {
            "userId": "c8c11390-00a1-706e-0c7a-bbcc5a1c9a7f",
            "identityStoreArn": "arn:aws:identitystore::xxxxxxxxx:identitystore/d-XXXXXXXX"
        }
    },
    "eventTime": "2024-01-28T17:56:25Z",
    "eventSource": "lakeformation.amazonaws.com",
    "eventName": "GetDataAccess",
    "awsRegion": "us-west-2",
….
        "requestParameters": {
        "tableArn": "arn:aws:glue:us-west-2:xxxxxxxxxx:table/marketing/customer",
        "supportedPermissionTypes": [
            "TABLE_PERMISSION"
        ]
    },
    …..
    }
}

Here is an end to end demonstration video of steps to follow for enabling trusted identity propagation to your analytics flow in Amazon EMR

Clean up

Clean up the following resources when you’re done using this solution:

Conclusion

In this post, we demonstrated how to set up and use trusted identity propagation using IAM Identity Center, EMR Studio, and Lake Formation for analytics. With trusted identity propagation, a user’s corporate identity is seamlessly propagated as they access data using single sign-on across AWS analytics services to build analytics applications. Data administrators can provide fine-grained data access directly to corporate users and groups and audit usage. To learn more, see Integrate Amazon EMR with AWS IAM Identity Center.


About the Authors

Pradeep Misra is a Principal Analytics Solutions Architect at AWS. He works across Amazon to architect and design modern distributed analytics and AI/ML platform solutions. He is passionate about solving customer challenges using data, analytics, and AI/ML. Outside of work, Pradeep likes exploring new places, trying new cuisines, and playing board games with his family. He also likes doing science experiments with his daughters.

Deepmala Agarwal works as an AWS Data Specialist Solutions Architect. She is passionate about helping customers build out scalable, distributed, and data-driven solutions on AWS. When not at work, Deepmala likes spending time with family, walking, listening to music, watching movies, and cooking!

Abhilash Nagilla is a Senior Specialist Solutions Architect at Amazon Web Services (AWS), helping public sector customers on their cloud journey with a focus on AWS analytics services. Outside of work, Abhilash enjoys learning new technologies, watching movies, and visiting new places.


Appendix A

Sample Amazon EMR service role and trust policy:

Note: This is a sample service role. Fine grained access control is done using Lake Formation. Modify the permissions as per your enterprise guidance and to comply with your security team.

Trust policy:

{
    "Version": "2008-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "Service": "elasticmapreduce.amazonaws.com",
   "AWS": "arn:aws:iam::xxxxxx:role/emr_tip_role"

            },
            "Action": [
                "sts:AssumeRole",
                "sts:SetContext"
            ]
        }
    ]
}

Permission Policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ResourcesToLaunchEC2",
            "Effect": "Allow",
            "Action": [
                "ec2:RunInstances",
                "ec2:CreateFleet",
                "ec2:CreateLaunchTemplate",
                "ec2:CreateLaunchTemplateVersion"
            ],
            "Resource": [
                "arn:aws:ec2:*:*:network-interface/*",
                "arn:aws:ec2:*::image/ami-*",
                "arn:aws:ec2:*:*:key-pair/*",
                "arn:aws:ec2:*:*:capacity-reservation/*",
                "arn:aws:ec2:*:*:placement-group/pg-*",
                "arn:aws:ec2:*:*:fleet/*",
                "arn:aws:ec2:*:*:dedicated-host/*",
                "arn:aws:resource-groups:*:*:group/*"
            ]
        },
        {
            "Sid": "TagOnCreateTaggedEMRResources",
            "Effect": "Allow",
            "Action": [
                "ec2:CreateTags"
            ],
            "Resource": [
                "arn:aws:ec2:*:*:network-interface/*",
                "arn:aws:ec2:*:*:instance/*",
                "arn:aws:ec2:*:*:volume/*",
                "arn:aws:ec2:*:*:launch-template/*"
            ],
            "Condition": {
                "StringEquals": {
                    "ec2:CreateAction": [
                        "RunInstances",
                        "CreateFleet",
                        "CreateLaunchTemplate",
                        "CreateNetworkInterface"
                    ]
                }
            }
        },
        {
            "Sid": "ListActionsForEC2Resources",
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeAccountAttributes",
                "ec2:DescribeCapacityReservations",
                "ec2:DescribeDhcpOptions",
                "ec2:DescribeImages",
                "ec2:DescribeInstances",
                "ec2:DescribeLaunchTemplates",
                "ec2:DescribeNetworkAcls",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribePlacementGroups",
                "ec2:DescribeRouteTables",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSubnets",
                "ec2:DescribeVolumes",
                "ec2:DescribeVolumeStatus",
                "ec2:DescribeVpcAttribute",
                "ec2:DescribeVpcEndpoints",
                "ec2:DescribeVpcs"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AutoScaling",
            "Effect": "Allow",
            "Action": [
                "application-autoscaling:DeleteScalingPolicy",
                "application-autoscaling:DeregisterScalableTarget",
                "application-autoscaling:DescribeScalableTargets",
                "application-autoscaling:DescribeScalingPolicies",
                "application-autoscaling:PutScalingPolicy",
                "application-autoscaling:RegisterScalableTarget"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AutoScalingCloudWatch",
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricAlarm",
                "cloudwatch:DeleteAlarms",
                "cloudwatch:DescribeAlarms"
            ],
            "Resource": "arn:aws:cloudwatch:*:*:alarm:*_EMR_Auto_Scaling"
        },
        {
            "Sid": "PassRoleForAutoScaling",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::*:role/EMR_AutoScaling_DefaultRole",
            "Condition": {
                "StringLike": {
                    "iam:PassedToService": "application-autoscaling.amazonaws.com*"
                }
            }
        },
        {
            "Sid": "PassRoleForEC2",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::xxxxxxxxxxx:role/service-role/<Instance-Profile-Role>",
            "Condition": {
                "StringLike": {
                    "iam:PassedToService": "ec2.amazonaws.com*"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:*",
                "s3-object-lambda:*"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket>/*",
                "arn:aws:s3:::*logs*/*"
            ]
        },
        {
            "Effect": "Allow",
            "Resource": "*",
            "Action": [
                "ec2:AuthorizeSecurityGroupEgress",
                "ec2:AuthorizeSecurityGroupIngress",
                "ec2:CancelSpotInstanceRequests",
                "ec2:CreateFleet",
                "ec2:CreateLaunchTemplate",
                "ec2:CreateNetworkInterface",
                "ec2:CreateSecurityGroup",
                "ec2:CreateTags",
                "ec2:DeleteLaunchTemplate",
                "ec2:DeleteNetworkInterface",
                "ec2:DeleteSecurityGroup",
                "ec2:DeleteTags",
                "ec2:DescribeAvailabilityZones",
                "ec2:DescribeAccountAttributes",
                "ec2:DescribeDhcpOptions",
                "ec2:DescribeImages",
                "ec2:DescribeInstanceStatus",
                "ec2:DescribeInstances",
                "ec2:DescribeKeyPairs",
                "ec2:DescribeLaunchTemplates",
                "ec2:DescribeNetworkAcls",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribePrefixLists",
                "ec2:DescribeRouteTables",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSpotInstanceRequests",
                "ec2:DescribeSpotPriceHistory",
                "ec2:DescribeSubnets",
                "ec2:DescribeTags",
                "ec2:DescribeVpcAttribute",
                "ec2:DescribeVpcEndpoints",
                "ec2:DescribeVpcEndpointServices",
                "ec2:DescribeVpcs",
                "ec2:DetachNetworkInterface",
                "ec2:ModifyImageAttribute",
                "ec2:ModifyInstanceAttribute",
                "ec2:RequestSpotInstances",
                "ec2:RevokeSecurityGroupEgress",
                "ec2:RunInstances",
                "ec2:TerminateInstances",
                "ec2:DeleteVolume",
                "ec2:DescribeVolumeStatus",
                "ec2:DescribeVolumes",
                "ec2:DetachVolume",
                "iam:GetRole",
                "iam:GetRolePolicy",
                "iam:ListInstanceProfiles",
                "iam:ListRolePolicies",
                "cloudwatch:PutMetricAlarm",
                "cloudwatch:DescribeAlarms",
                "cloudwatch:DeleteAlarms",
                "application-autoscaling:RegisterScalableTarget",
                "application-autoscaling:DeregisterScalableTarget",
                "application-autoscaling:PutScalingPolicy",
                "application-autoscaling:DeleteScalingPolicy",
                "application-autoscaling:Describe*"
            ]
        }
    ]
}

Appendix B

Sample EMR Studio role policy:

Note: This is a sample service role. Fine grained access control is done using Lake Formation. Modify the permissions as per your enterprise guidance and to comply with your security team.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowEMRReadOnlyActions",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:ListInstances",
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:ListSteps"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowEC2ENIActionsWithEMRTags",
            "Effect": "Allow",
            "Action": [
                "ec2:CreateNetworkInterfacePermission",
                "ec2:DeleteNetworkInterface"
            ],
            "Resource": [
                "arn:aws:ec2:*:*:network-interface/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/for-use-with-amazon-emr-managed-policies": "true"
                }
            }
        },
        {
            "Sid": "AllowEC2ENIAttributeAction",
            "Effect": "Allow",
            "Action": [
                "ec2:ModifyNetworkInterfaceAttribute"
            ],
            "Resource": [
                "arn:aws:ec2:*:*:instance/*",
                "arn:aws:ec2:*:*:network-interface/*",
                "arn:aws:ec2:*:*:security-group/*"
            ]
        },
        {
            "Sid": "AllowEC2SecurityGroupActionsWithEMRTags",
            "Effect": "Allow",
            "Action": [
                "ec2:AuthorizeSecurityGroupEgress",
                "ec2:AuthorizeSecurityGroupIngress",
                "ec2:RevokeSecurityGroupEgress",
                "ec2:RevokeSecurityGroupIngress",
                "ec2:DeleteNetworkInterfacePermission"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/for-use-with-amazon-emr-managed-policies": "true"
                }
            }
        },
        {
            "Sid": "AllowDefaultEC2SecurityGroupsCreationWithEMRTags",
            "Effect": "Allow",
            "Action": [
                "ec2:CreateSecurityGroup"
            ],
            "Resource": [
                "arn:aws:ec2:*:*:security-group/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/for-use-with-amazon-emr-managed-policies": "true"
                }
            }
        },
        {
            "Sid": "AllowDefaultEC2SecurityGroupsCreationInVPCWithEMRTags",
            "Effect": "Allow",
            "Action": [
                "ec2:CreateSecurityGroup"
            ],
            "Resource": [
                "arn:aws:ec2:*:*:vpc/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/for-use-with-amazon-emr-managed-policies": "true"
                }
            }
        },
        {
            "Sid": "AllowAddingEMRTagsDuringDefaultSecurityGroupCreation",
            "Effect": "Allow",
            "Action": [
                "ec2:CreateTags"
            ],
            "Resource": "arn:aws:ec2:*:*:security-group/*",
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/for-use-with-amazon-emr-managed-policies": "true",
                    "ec2:CreateAction": "CreateSecurityGroup"
                }
            }
        },
        {
            "Sid": "AllowEC2ENICreationWithEMRTags",
            "Effect": "Allow",
            "Action": [
                "ec2:CreateNetworkInterface"
            ],
            "Resource": [
                "arn:aws:ec2:*:*:network-interface/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/for-use-with-amazon-emr-managed-policies": "true"
                }
            }
        },
        {
            "Sid": "AllowEC2ENICreationInSubnetAndSecurityGroupWithEMRTags",
            "Effect": "Allow",
            "Action": [
                "ec2:CreateNetworkInterface"
            ],
            "Resource": [
                "arn:aws:ec2:*:*:subnet/*",
                "arn:aws:ec2:*:*:security-group/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/for-use-with-amazon-emr-managed-policies": "true"
                }
            }
        },
        {
            "Sid": "AllowAddingTagsDuringEC2ENICreation",
            "Effect": "Allow",
            "Action": [
                "ec2:CreateTags"
            ],
            "Resource": "arn:aws:ec2:*:*:network-interface/*",
            "Condition": {
                "StringEquals": {
                    "ec2:CreateAction": "CreateNetworkInterface"
                }
            }
        },
        {
            "Sid": "AllowEC2ReadOnlyActions",
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribeTags",
                "ec2:DescribeInstances",
                "ec2:DescribeSubnets",
                "ec2:DescribeVpcs"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowSecretsManagerReadOnlyActionsWithEMRTags",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue"
            ],
            "Resource": "arn:aws:secretsmanager:*:*:secret:*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/for-use-with-amazon-emr-managed-policies": "true"
                }
            }
        },
        {
            "Sid": "AllowWorkspaceCollaboration",
            "Effect": "Allow",
            "Action": [
                "iam:GetUser",
                "iam:GetRole",
                "iam:ListUsers",
                "iam:ListRoles",
                "sso:GetManagedApplicationInstance",
                "sso-directory:SearchUsers"
            ],
            "Resource": "*"
        },
        {
            "Sid": "S3Access",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:GetEncryptionConfiguration",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket>",
                "arn:aws:s3:::<bucket>/*"
            ]
        },
        {
            "Sid": "EMRStudioWorkspaceAccess",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:CreateEditor",
                "elasticmapreduce:DescribeEditor",
                "elasticmapreduce:ListEditors",
                "elasticmapreduce:DeleteEditor",
                "elasticmapreduce:UpdateEditor",
                "elasticmapreduce:PutWorkspaceAccess",
                "elasticmapreduce:DeleteWorkspaceAccess",
                "elasticmapreduce:ListWorkspaceAccessIdentities",
                "elasticmapreduce:StartEditor",
                "elasticmapreduce:StopEditor",
                "elasticmapreduce:OpenEditorInConsole",
                "elasticmapreduce:AttachEditor",
                "elasticmapreduce:DetachEditor",
                "elasticmapreduce:ListInstanceGroups",
                "elasticmapreduce:ListBootstrapActions",
                "servicecatalog:SearchProducts",
                "servicecatalog:DescribeProduct",
                "servicecatalog:DescribeProductView",
                "servicecatalog:DescribeProvisioningParameters",
                "servicecatalog:ProvisionProduct",
                "servicecatalog:UpdateProvisionedProduct",
                "servicecatalog:ListProvisioningArtifacts",
                "servicecatalog:DescribeRecord",
                "servicecatalog:ListLaunchPaths",
                "elasticmapreduce:RunJobFlow",      
                "elasticmapreduce:ListClusters",
                "elasticmapreduce:DescribeCluster",
                "codewhisperer:GenerateRecommendations",
                "athena:StartQueryExecution",
                "athena:StopQueryExecution",
                "athena:GetQueryExecution",
                "athena:GetQueryRuntimeStatistics",
                "athena:GetQueryResults",
                "athena:ListQueryExecutions",
                "athena:BatchGetQueryExecution",
                "athena:GetNamedQuery",
                "athena:ListNamedQueries",
                "athena:BatchGetNamedQuery",
                "athena:UpdateNamedQuery",
                "athena:DeleteNamedQuery",
                "athena:ListDataCatalogs",
                "athena:GetDataCatalog",
                "athena:ListDatabases",
                "athena:GetDatabase",
                "athena:ListTableMetadata",
                "athena:GetTableMetadata",
                "athena:ListWorkGroups",
                "athena:GetWorkGroup",
                "athena:CreateNamedQuery",
                "athena:GetPreparedStatement",
                "glue:CreateDatabase",
                "glue:DeleteDatabase",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:UpdateDatabase",
                "glue:CreateTable",
                "glue:DeleteTable",
                "glue:BatchDeleteTable",
                "glue:UpdateTable",
                "glue:GetTable",
                "glue:GetTables",
                "glue:BatchCreatePartition",
                "glue:CreatePartition",
                "glue:DeletePartition",
                "glue:BatchDeletePartition",
                "glue:UpdatePartition",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:BatchGetPartition",
                "kms:ListAliases",
                "kms:ListKeys",
                "kms:DescribeKey",
                "lakeformation:GetDataAccess",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:ListMultipartUploadParts",
                "s3:AbortMultipartUpload",
                "s3:PutObject",
                "s3:PutBucketPublicAccessBlock",
                "s3:ListAllMyBuckets",
                "elasticmapreduce:ListStudios",
                "elasticmapreduce:DescribeStudio",
                "cloudformation:GetTemplate",
                "cloudformation:CreateStack",
                "cloudformation:CreateStackSet",
                "cloudformation:DeleteStack",
                "cloudformation:GetTemplateSummary",
                "cloudformation:ValidateTemplate",
                "cloudformation:ListStacks",
                "cloudformation:ListStackSets",
                "elasticmapreduce:AddTags",
                "ec2:CreateNetworkInterface",
                "elasticmapreduce:GetClusterSessionCredentials",
                "elasticmapreduce:GetOnClusterAppUIPresignedURL",
                "cloudformation:DescribeStackResources"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "AllowPassingServiceRoleForWorkspaceCreation",
            "Action": "iam:PassRole",
            "Resource": [
                "arn:aws:iam::*:role/<Studio Role>",
                "arn:aws:iam::*:role/<EMR Service Role>",
                "arn:aws:iam::*:role/<EMR Instance Profile Role>"
            ],
            "Effect": "Allow"
        },
{
			"Sid": "Statement1",
			"Effect": "Allow",
			"Action": [
				"iam:PassRole"
			],
			"Resource": [
				"arn:aws:iam::*:role/<EMR Instance Profile Role>"
			]
		}
    ]
}

Orchestrate an end-to-end ETL pipeline using Amazon S3, AWS Glue, and Amazon Redshift Serverless with Amazon MWAA

Post Syndicated from Radhika Jakkula original https://aws.amazon.com/blogs/big-data/orchestrate-an-end-to-end-etl-pipeline-using-amazon-s3-aws-glue-and-amazon-redshift-serverless-with-amazon-mwaa/

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that you can use to set up and operate data pipelines in the cloud at scale. Apache Airflow is an open source tool used to programmatically author, schedule, and monitor sequences of processes and tasks, referred to as workflows. With Amazon MWAA, you can use Apache Airflow and Python to create workflows without having to manage the underlying infrastructure for scalability, availability, and security.

By using multiple AWS accounts, organizations can effectively scale their workloads and manage their complexity as they grow. This approach provides a robust mechanism to mitigate the potential impact of disruptions or failures, making sure that critical workloads remain operational. Additionally, it enables cost optimization by aligning resources with specific use cases, making sure that expenses are well controlled. By isolating workloads with specific security requirements or compliance needs, organizations can maintain the highest levels of data privacy and security. Furthermore, the ability to organize multiple AWS accounts in a structured manner allows you to align your business processes and resources according to your unique operational, regulatory, and budgetary requirements. This approach promotes efficiency, flexibility, and scalability, enabling large enterprises to meet their evolving needs and achieve their goals.

This post demonstrates how to orchestrate an end-to-end extract, transform, and load (ETL) pipeline using Amazon Simple Storage Service (Amazon S3), AWS Glue, and Amazon Redshift Serverless with Amazon MWAA.

Solution overview

For this post, we consider a use case where a data engineering team wants to build an ETL process and give the best experience to their end-users when they want to query the latest data after new raw files are added to Amazon S3 in the central account (Account A in the following architecture diagram). The data engineering team wants to separate the raw data into its own AWS account (Account B in the diagram) for increased security and control. They also want to perform the data processing and transformation work in their own account (Account B) to compartmentalize duties and prevent any unintended changes to the source raw data present in the central account (Account A). This approach allows the team to process the raw data extracted from Account A to Account B, which is dedicated for data handling tasks. This makes sure the raw and processed data can be maintained securely separated across multiple accounts, if required, for enhanced data governance and security.

Our solution uses an end-to-end ETL pipeline orchestrated by Amazon MWAA that looks for new incremental files in an Amazon S3 location in Account A, where the raw data is present. This is done by invoking AWS Glue ETL jobs and writing to data objects in a Redshift Serverless cluster in Account B. The pipeline then starts running stored procedures and SQL commands on Redshift Serverless. As the queries finish running, an UNLOAD operation is invoked from the Redshift data warehouse to the S3 bucket in Account A.

Because security is important, this post also covers how to configure an Airflow connection using AWS Secrets Manager to avoid storing database credentials within Airflow connections and variables.

The following diagram illustrates the architectural overview of the components involved in the orchestration of the workflow.

The workflow consists of the following components:

  • The source and target S3 buckets are in a central account (Account A), whereas Amazon MWAA, AWS Glue, and Amazon Redshift are in a different account (Account B). Cross-account access has been set up between S3 buckets in Account A with resources in Account B to be able to load and unload data.
  • In the second account, Amazon MWAA is hosted in one VPC and Redshift Serverless in a different VPC, which are connected through VPC peering. A Redshift Serverless workgroup is secured inside private subnets across three Availability Zones.
  • Secrets like user name, password, DB port, and AWS Region for Redshift Serverless are stored in Secrets Manager.
  • VPC endpoints are created for Amazon S3 and Secrets Manager to interact with other resources.
  • Usually, data engineers create an Airflow Directed Acyclic Graph (DAG) and commit their changes to GitHub. With GitHub actions, they are deployed to an S3 bucket in Account B (for this post, we upload the files into S3 bucket directly). The S3 bucket stores Airflow-related files like DAG files, requirements.txt files, and plugins. AWS Glue ETL scripts and assets are stored in another S3 bucket. This separation helps maintain organization and avoid confusion.
  • The Airflow DAG uses various operators, sensors, connections, tasks, and rules to run the data pipeline as needed.
  • The Airflow logs are logged in Amazon CloudWatch, and alerts can be configured for monitoring tasks. For more information, see Monitoring dashboards and alarms on Amazon MWAA.

Prerequisites

Because this solution centers around using Amazon MWAA to orchestrate the ETL pipeline, you need to set up certain foundational resources across accounts beforehand. Specifically, you need to create the S3 buckets and folders, AWS Glue resources, and Redshift Serverless resources in their respective accounts prior to implementing the full workflow integration using Amazon MWAA.

Deploy resources in Account A using AWS CloudFormation

In Account A, launch the provided AWS CloudFormation stack to create the following resources:

  • The source and target S3 buckets and folders. As a best practice, the input and output bucket structures are formatted with hive style partitioning as s3://<bucket>/products/YYYY/MM/DD/.
  • A sample dataset called products.csv, which we use in this post.

Upload the AWS Glue job to Amazon S3 in Account B

In Account B, create an Amazon S3 location called aws-glue-assets-<account-id>-<region>/scripts (if not present). Replace the parameters for the account ID and Region in the sample_glue_job.py script and upload the AWS Glue job file to the Amazon S3 location.

Deploy resources in Account B using AWS CloudFormation

In Account B, launch the provided CloudFormation stack template to create the following resources:

  • The S3 bucket airflow-<username>-bucket to store Airflow-related files with the following structure:
    • dags – The folder for DAG files.
    • plugins – The file for any custom or community Airflow plugins.
    • requirements – The requirements.txt file for any Python packages.
    • scripts – Any SQL scripts used in the DAG.
    • data – Any datasets used in the DAG.
  • A Redshift Serverless environment. The name of the workgroup and namespace are prefixed with sample.
  • An AWS Glue environment, which contains the following:
    • An AWS Glue crawler, which crawls the data from the S3 source bucket sample-inp-bucket-etl-<username> in Account A.
    • A database called products_db in the AWS Glue Data Catalog.
    • An ELT job called sample_glue_job. This job can read files from the products table in the Data Catalog and load data into the Redshift table products.
  • A VPC gateway endpointto Amazon S3.
  • An Amazon MWAA environment. For detailed steps to create an Amazon MWAA environment using the Amazon MWAA console, refer to Introducing Amazon Managed Workflows for Apache Airflow (MWAA).

launch stack 1

Create Amazon Redshift resources

Create two tables and a stored procedure on an Redshift Serverless workgroup using the products.sql file.

In this example, we create two tables called products and products_f. The name of the stored procedure is sp_products.

Configure Airflow permissions

After the Amazon MWAA environment is created successfully, the status will show as Available. Choose Open Airflow UI to view the Airflow UI. DAGs are automatically synced from the S3 bucket and visible in the UI. However, at this stage, there are no DAGs in the S3 folder.

Add the customer managed policy AmazonMWAAFullConsoleAccess, which grants Airflow users permissions to access AWS Identity and Access Management (IAM) resources, and attach this policy to the Amazon MWAA role. For more information, see Accessing an Amazon MWAA environment.

The policies attached to the Amazon MWAA role have full access and must only be used for testing purposes in a secure test environment. For production deployments, follow the least privilege principle.

Set up the environment

This section outlines the steps to configure the environment. The process involves the following high-level steps:

  1. Update any necessary providers.
  2. Set up cross-account access.
  3. Establish a VPC peering connection between the Amazon MWAA VPC and Amazon Redshift VPC.
  4. Configure Secrets Manager to integrate with Amazon MWAA.
  5. Define Airflow connections.

Update the providers

Follow the steps in this section if your version of Amazon MWAA is less than 2.8.1 (the latest version as of writing this post).

Providers are packages that are maintained by the community and include all the core operators, hooks, and sensors for a given service. The Amazon provider is used to interact with AWS services like Amazon S3, Amazon Redshift Serverless, AWS Glue, and more. There are over 200 modules within the Amazon provider.

Although the version of Airflow supported in Amazon MWAA is 2.6.3, which comes bundled with the Amazon provided package version 8.2.0, support for Amazon Redshift Serverless was not added until the Amazon provided package version 8.4.0. Because the default bundled provider version is older than when Redshift Serverless support was introduced, the provider version must be upgraded in order to use that functionality.

The first step is to update the constraints file and requirements.txt file with the correct versions. Refer to Specifying newer provider packages for steps to update the Amazon provider package.

  1. Specify the requirements as follows:
    --constraint "/usr/local/airflow/dags/constraints-3.10-mod.txt"
    apache-airflow-providers-amazon==8.4.0

  2. Update the version in the constraints file to 8.4.0 or higher.
  3. Add the constraints-3.11-updated.txt file to the /dags folder.

Refer to Apache Airflow versions on Amazon Managed Workflows for Apache Airflow for correct versions of the constraints file depending on the Airflow version.

  1. Navigate to the Amazon MWAA environment and choose Edit.
  2. Under DAG code in Amazon S3, for Requirements file, choose the latest version.
  3. Choose Save.

This will update the environment and new providers will be in effect.

  1. To verify the providers version, go to Providers under the Admin table.

The version for the Amazon provider package should be 8.4.0, as shown in the following screenshot. If not, there was an error while loading requirements.txt. To debug any errors, go to the CloudWatch console and open the requirements_install_ip log in Log streams, where errors are listed. Refer to Enabling logs on the Amazon MWAA console for more details.

Set up cross-account access

You need to set up cross-account policies and roles between Account A and Account B to access the S3 buckets to load and unload data. Complete the following steps:

  1. In Account A, configure the bucket policy for bucket sample-inp-bucket-etl-<username> to grant permissions to the AWS Glue and Amazon MWAA roles in Account B for objects in bucket sample-inp-bucket-etl-<username>:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": [
                        "arn:aws:iam::<account-id-of- AcctB>:role/service-role/<Glue-role>",
                        "arn:aws:iam::<account-id-of-AcctB>:role/service-role/<MWAA-role>"
                    ]
                },
                "Action": [
                    "s3:GetObject",
    "s3:PutObject",
    		   "s3:PutObjectAcl",
    		   "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::sample-inp-bucket-etl-<username>/*",
                    "arn:aws:s3:::sample-inp-bucket-etl-<username>"
                ]
            }
        ]
    }
    

  2. Similarly, configure the bucket policy for bucket sample-opt-bucket-etl-<username> to grant permissions to Amazon MWAA roles in Account B to put objects in this bucket:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::<account-id-of-AcctB>:role/service-role/<MWAA-role>"
                },
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:PutObjectAcl",
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::sample-opt-bucket-etl-<username>/*",
                    "arn:aws:s3:::sample-opt-bucket-etl-<username>"
                ]
            }
        ]
    }
    

  3. In Account A, create an IAM policy called policy_for_roleA, which allows necessary Amazon S3 actions on the output bucket:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": [
                    "kms:Decrypt",
                    "kms:Encrypt",
                    "kms:GenerateDataKey"
                ],
                "Resource": [
                    "<KMS_KEY_ARN_Used_for_S3_encryption>"
                ]
            },
            {
                "Sid": "VisualEditor1",
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject",
                    "s3:GetObject",
                    "s3:GetBucketAcl",
                    "s3:GetBucketCors",
                    "s3:GetEncryptionConfiguration",
                    "s3:GetBucketLocation",
                    "s3:ListAllMyBuckets",
                    "s3:ListBucket",
                    "s3:ListBucketMultipartUploads",
                    "s3:ListBucketVersions",
                    "s3:ListMultipartUploadParts"
                ],
                "Resource": [
                    "arn:aws:s3:::sample-opt-bucket-etl-<username>",
                    "arn:aws:s3:::sample-opt-bucket-etl-<username>/*"
                ]
            }
        ]
    }

  4. Create a new IAM role called RoleA with Account B as the trusted entity role and add this policy to the role. This allows Account B to assume RoleA to perform necessary Amazon S3 actions on the output bucket.
  5. In Account B, create an IAM policy called s3-cross-account-access with permission to access objects in the bucket sample-inp-bucket-etl-<username>, which is in Account A.
  6. Add this policy to the AWS Glue role and Amazon MWAA role:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:PutObjectAcl"
                ],
                "Resource": "arn:aws:s3:::sample-inp-bucket-etl-<username>/*"
            }
        ]
    }

  7. In Account B, create the IAM policy policy_for_roleB specifying Account A as a trusted entity. The following is the trust policy to assume RoleA in Account A:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "CrossAccountPolicy",
                "Effect": "Allow",
                "Action": "sts:AssumeRole",
                "Resource": "arn:aws:iam::<account-id-of-AcctA>:role/RoleA"
            }
        ]
    }

  8. Create a new IAM role called RoleB with Amazon Redshift as the trusted entity type and add this policy to the role. This allows RoleB to assume RoleA in Account A and also to be assumable by Amazon Redshift.
  9. Attach RoleB to the Redshift Serverless namespace, so Amazon Redshift can write objects to the S3 output bucket in Account A.
  10. Attach the policy policy_for_roleB to the Amazon MWAA role, which allows Amazon MWAA to access the output bucket in Account A.

Refer to How do I provide cross-account access to objects that are in Amazon S3 buckets? for more details on setting up cross-account access to objects in Amazon S3 from AWS Glue and Amazon MWAA. Refer to How do I COPY or UNLOAD data from Amazon Redshift to an Amazon S3 bucket in another account? for more details on setting up roles to unload data from Amazon Redshift to Amazon S3 from Amazon MWAA.

Set up VPC peering between the Amazon MWAA and Amazon Redshift VPCs

Because Amazon MWAA and Amazon Redshift are in two separate VPCs, you need to set up VPC peering between them. You must add a route to the route tables associated with the subnets for both services. Refer to Work with VPC peering connections for details on VPC peering.

Make sure that CIDR range of the Amazon MWAA VPC is allowed in the Redshift security group and the CIDR range of the Amazon Redshift VPC is allowed in the Amazon MWAA security group, as shown in the following screenshot.

If any of the preceding steps are configured incorrectly, you are likely to encounter a “Connection Timeout” error in the DAG run.

Configure the Amazon MWAA connection with Secrets Manager

When the Amazon MWAA pipeline is configured to use Secrets Manager, it will first look for connections and variables in an alternate backend (like Secrets Manager). If the alternate backend contains the needed value, it is returned. Otherwise, it will check the metadata database for the value and return that instead. For more details, refer to Configuring an Apache Airflow connection using an AWS Secrets Manager secret.

Complete the following steps:

  1. Configure a VPC endpoint to link Amazon MWAA and Secrets Manager (com.amazonaws.us-east-1.secretsmanager).

This allows Amazon MWAA to access credentials stored in Secrets Manager.

  1. To provide Amazon MWAA with permission to access Secrets Manager secret keys, add the policy called SecretsManagerReadWrite to the IAM role of the environment.
  2. To create the Secrets Manager backend as an Apache Airflow configuration option, go to the Airflow configuration options, add the following key-value pairs, and save your settings.

This configures Airflow to look for connection strings and variables at the airflow/connections/* and airflow/variables/* paths:

secrets.backend: airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend secrets.backend_kwargs: {"connections_prefix" : "airflow/connections", "variables_prefix" : "airflow/variables"}

  1. To generate an Airflow connection URI string, go to AWS CloudShell and enter into a Python shell.
  2. Run the following code to generate the connection URI string:
    import urllib.parse
    conn_type = 'redshift'
    host = 'sample-workgroup.<account-id-of-AcctB>.us-east-1.redshift-serverless.amazonaws.com' #Specify the Amazon Redshift workgroup endpoint
    port = '5439'
    login = 'admin' #Specify the username to use for authentication with Amazon Redshift
    password = '<password>' #Specify the password to use for authentication with Amazon Redshift
    role_arn = urllib.parse.quote_plus('arn:aws:iam::<account_id>:role/service-role/<MWAA-role>')
    database = 'dev'
    region = 'us-east-1' #YOUR_REGION
    conn_string = '{0}://{1}:{2}@{3}:{4}?role_arn={5}&database={6}&region={7}'.format(conn_type, login, password, host, port, role_arn, database, region)
    print(conn_string)
    

The connection string should be generated as follows:

redshift://admin:<password>@sample-workgroup.<account_id>.us-east-1.redshift-serverless.amazonaws.com:5439?role_arn=<MWAA role ARN>&database=dev&region=<region>
  1. Add the connection in Secrets Manager using the following command in the AWS Command Line Interface (AWS CLI).

This can also be done from the Secrets Manager console. This will be added in Secrets Manager as plaintext.

aws secretsmanager create-secret --name airflow/connections/secrets_redshift_connection --description "Apache Airflow to Redshift Cluster" --secret-string "redshift://admin:<password>@sample-workgroup.<account_id>.us-east-1.redshift-serverless.amazonaws.com:5439?role_arn=<MWAA role ARN>&database=dev&region=us-east-1" --region=us-east-1

Use the connection airflow/connections/secrets_redshift_connection in the DAG. When the DAG is run, it will look for this connection and retrieve the secrets from Secrets Manager. In case of RedshiftDataOperator, pass the secret_arn as a parameter instead of connection name.

You can also add secrets using the Secrets Manager console as key-value pairs.

  1. Add another secret in Secrets Manager in and save it as airflow/connections/redshift_conn_test.

Create an Airflow connection through the metadata database

You can also create connections in the UI. In this case, the connection details will be stored in an Airflow metadata database. If the Amazon MWAA environment is not configured to use the Secrets Manager backend, it will check the metadata database for the value and return that. You can create an Airflow connection using the UI, AWS CLI, or API. In this section, we show how to create a connection using the Airflow UI.

  1. For Connection Id, enter a name for the connection.
  2. For Connection Type, choose Amazon Redshift.
  3. For Host, enter the Redshift endpoint (without port and database) for Redshift Serverless.
  4. For Database, enter dev.
  5. For User, enter your admin user name.
  6. For Password, enter your password.
  7. For Port, use port 5439.
  8. For Extra, set the region and timeout parameters.
  9. Test the connection, then save your settings.

Create and run a DAG

In this section, we describe how to create a DAG using various components. After you create and run the DAG, you can verify the results by querying Redshift tables and checking the target S3 buckets.

Create a DAG

In Airflow, data pipelines are defined in Python code as DAGs. We create a DAG that consists of various operators, sensors, connections, tasks, and rules:

  • The DAG starts with looking for source files in the S3 bucket sample-inp-bucket-etl-<username> under Account A for the current day using S3KeySensor. S3KeySensor is used to wait for one or multiple keys to be present in an S3 bucket.
    • For example, our S3 bucket is partitioned as s3://bucket/products/YYYY/MM/DD/, so our sensor should check for folders with the current date. We derived the current date in the DAG and passed this to S3KeySensor, which looks for any new files in the current day folder.
    • We also set wildcard_match as True, which enables searches on bucket_key to be interpreted as a Unix wildcard pattern. Set the mode to reschedule so that the sensor task frees the worker slot when the criteria is not met and it’s rescheduled at a later time. As a best practice, use this mode when poke_interval is more than 1 minute to prevent too much load on a scheduler.
  • After the file is available in the S3 bucket, the AWS Glue crawler runs using GlueCrawlerOperator to crawl the S3 source bucket sample-inp-bucket-etl-<username> under Account A and updates the table metadata under the products_db database in the Data Catalog. The crawler uses the AWS Glue role and Data Catalog database that were created in the previous steps.
  • The DAG uses GlueCrawlerSensor to wait for the crawler to complete.
  • When the crawler job is complete, GlueJobOperator is used to run the AWS Glue job. The AWS Glue script name (along with location) and is passed to the operator along with the AWS Glue IAM role. Other parameters like GlueVersion, NumberofWorkers, and WorkerType are passed using the create_job_kwargs parameter.
  • The DAG uses GlueJobSensor to wait for the AWS Glue job to complete. When it’s complete, the Redshift staging table products will be loaded with data from the S3 file.
  • You can connect to Amazon Redshift from Airflow using three different operators:
    • PythonOperator.
    • SQLExecuteQueryOperator, which uses a PostgreSQL connection and redshift_default as the default connection.
    • RedshiftDataOperator, which uses the Redshift Data API and aws_default as the default connection.

In our DAG, we use SQLExecuteQueryOperator and RedshiftDataOperator to show how to use these operators. The Redshift stored procedures are run RedshiftDataOperator. The DAG also runs SQL commands in Amazon Redshift to delete the data from the staging table using SQLExecuteQueryOperator.

Because we configured our Amazon MWAA environment to look for connections in Secrets Manager, when the DAG runs, it retrieves the Redshift connection details like user name, password, host, port, and Region from Secrets Manager. If the connection is not found in Secrets Manager, the values are retrieved from the default connections.

In SQLExecuteQueryOperator, we pass the connection name that we created in Secrets Manager. It looks for airflow/connections/secrets_redshift_connection and retrieves the secrets from Secrets Manager. If Secrets Manager is not set up, the connection created manually (for example, redshift-conn-id) can be passed.

In RedshiftDataOperator, we pass the secret_arn of the airflow/connections/redshift_conn_test connection created in Secrets Manager as a parameter.

  • As final task, RedshiftToS3Operator is used to unload data from the Redshift table to an S3 bucket sample-opt-bucket-etl in Account B. airflow/connections/redshift_conn_test from Secrets Manager is used for unloading the data.
  • TriggerRule is set to ALL_DONE, which enables the next step to run after all upstream tasks are complete.
  • The dependency of tasks is defined using the chain() function, which allows for parallel runs of tasks if needed. In our case, we want all tasks to run in sequence.

The following is the complete DAG code. The dag_id should match the DAG script name, otherwise it won’t be synced into the Airflow UI.

from datetime import datetime
from airflow import DAG 
from airflow.decorators import task
from airflow.models.baseoperator import chain
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.amazon.aws.operators.glue_crawler import GlueCrawlerOperator
from airflow.providers.amazon.aws.sensors.glue import GlueJobSensor
from airflow.providers.amazon.aws.sensors.glue_crawler import GlueCrawlerSensor
from airflow.providers.amazon.aws.operators.redshift_data import RedshiftDataOperator
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator
from airflow.providers.amazon.aws.transfers.redshift_to_s3 import RedshiftToS3Operator
from airflow.utils.trigger_rule import TriggerRule


dag_id = "data_pipeline"
vYear = datetime.today().strftime("%Y")
vMonth = datetime.today().strftime("%m")
vDay = datetime.today().strftime("%d")
src_bucket_name = "sample-inp-bucket-etl-<username>"
tgt_bucket_name = "sample-opt-bucket-etl-<username>"
s3_folder="products"
#Please replace the variable with the glue_role_arn
glue_role_arn_key = "arn:aws:iam::<account_id>:role/<Glue-role>"
glue_crawler_name = "products"
glue_db_name = "products_db"
glue_job_name = "sample_glue_job"
glue_script_location="s3://aws-glue-assets-<account_id>-<region>/scripts/sample_glue_job.py"
workgroup_name = "sample-workgroup"
redshift_table = "products_f"
redshift_conn_id_name="secrets_redshift_connection"
db_name = "dev"
secret_arn="arn:aws:secretsmanager:us-east-1:<account_id>:secret:airflow/connections/redshift_conn_test-xxxx"
poll_interval = 10

@task
def get_role_name(arn: str) -> str:
    return arn.split("/")[-1]

@task
def get_s3_loc(s3_folder: str) -> str:
    s3_loc  = s3_folder + "/year=" + vYear + "/month=" + vMonth + "/day=" + vDay + "/*.csv"
    return s3_loc

with DAG(
    dag_id=dag_id,
    schedule="@once",
    start_date=datetime(2021, 1, 1),
    tags=["example"],
    catchup=False,
) as dag:
    role_arn = glue_role_arn_key
    glue_role_name = get_role_name(role_arn)
    s3_loc = get_s3_loc(s3_folder)


    # Check for new incremental files in S3 source/input bucket
    sensor_key = S3KeySensor(
        task_id="sensor_key",
        bucket_key=s3_loc,
        bucket_name=src_bucket_name,
        wildcard_match=True,
        #timeout=18*60*60,
        #poke_interval=120,
        timeout=60,
        poke_interval=30,
        mode="reschedule"
    )

    # Run Glue crawler
    glue_crawler_config = {
        "Name": glue_crawler_name,
        "Role": role_arn,
        "DatabaseName": glue_db_name,
    }

    crawl_s3 = GlueCrawlerOperator(
        task_id="crawl_s3",
        config=glue_crawler_config,
    )

    # GlueCrawlerOperator waits by default, setting as False to test the Sensor below.
    crawl_s3.wait_for_completion = False

    # Wait for Glue crawler to complete
    wait_for_crawl = GlueCrawlerSensor(
        task_id="wait_for_crawl",
        crawler_name=glue_crawler_name,
    )

    # Run Glue Job
    submit_glue_job = GlueJobOperator(
        task_id="submit_glue_job",
        job_name=glue_job_name,
        script_location=glue_script_location,
        iam_role_name=glue_role_name,
        create_job_kwargs={"GlueVersion": "4.0", "NumberOfWorkers": 10, "WorkerType": "G.1X"},
    )

    # GlueJobOperator waits by default, setting as False to test the Sensor below.
    submit_glue_job.wait_for_completion = False

    # Wait for Glue Job to complete
    wait_for_job = GlueJobSensor(
        task_id="wait_for_job",
        job_name=glue_job_name,
        # Job ID extracted from previous Glue Job Operator task
        run_id=submit_glue_job.output,
        verbose=True,  # prints glue job logs in airflow logs
    )

    wait_for_job.poke_interval = 5

    # Execute the Stored Procedure in Redshift Serverless using Data Operator
    execute_redshift_stored_proc = RedshiftDataOperator(
        task_id="execute_redshift_stored_proc",
        database=db_name,
        workgroup_name=workgroup_name,
        secret_arn=secret_arn,
        sql="""CALL sp_products();""",
        poll_interval=poll_interval,
        wait_for_completion=True,
    )

    # Execute the Stored Procedure in Redshift Serverless using SQL Operator
    delete_from_table = SQLExecuteQueryOperator(
        task_id="delete_from_table",
        conn_id=redshift_conn_id_name,
        sql="DELETE FROM products;",
        trigger_rule=TriggerRule.ALL_DONE,
    )

    # Unload the data from Redshift table to S3
    transfer_redshift_to_s3 = RedshiftToS3Operator(
        task_id="transfer_redshift_to_s3",
        s3_bucket=tgt_bucket_name,
        s3_key=s3_loc,
        schema="PUBLIC",
        table=redshift_table,
        redshift_conn_id=redshift_conn_id_name,
    )

    transfer_redshift_to_s3.trigger_rule = TriggerRule.ALL_DONE

    #Chain the tasks to be executed
    chain(
        sensor_key,
        crawl_s3,
        wait_for_crawl,
        submit_glue_job,
        wait_for_job,
        execute_redshift_stored_proc,
        delete_from_table,
        transfer_redshift_to_s3
        )
    

Verify the DAG run

After you create the DAG file (replace the variables in the DAG script) and upload it to the s3://sample-airflow-instance/dags folder, it will be automatically synced with the Airflow UI. All DAGs appear on the DAGs tab. Toggle the ON option to make the DAG runnable. Because our DAG is set to schedule="@once", you need to manually run the job by choosing the run icon under Actions. When the DAG is complete, the status is updated in green, as shown in the following screenshot.

In the Links section, there are options to view the code, graph, grid, log, and more. Choose Graph to visualize the DAG in a graph format. As shown in the following screenshot, each color of the node denotes a specific operator, and the color of the node outline denotes a specific status.

Verify the results

On the Amazon Redshift console, navigate to the Query Editor v2 and select the data in the products_f table. The table should be loaded and have the same number of records as S3 files.

On the Amazon S3 console, navigate to the S3 bucket s3://sample-opt-bucket-etl in Account B. The product_f files should be created under the folder structure s3://sample-opt-bucket-etl/products/YYYY/MM/DD/.

Clean up

Clean up the resources created as part of this post to avoid incurring ongoing charges:

  1. Delete the CloudFormation stacks and S3 bucket that you created as prerequisites.
  2. Delete the VPCs and VPC peering connections, cross-account policies and roles, and secrets in Secrets Manager.

Conclusion

With Amazon MWAA, you can build complex workflows using Airflow and Python without managing clusters, nodes, or any other operational overhead typically associated with deploying and scaling Airflow in production. In this post, we showed how Amazon MWAA provides an automated way to ingest, transform, analyze, and distribute data between different accounts and services within AWS. For more examples of other AWS operators, refer to the following GitHub repository; we encourage you to learn more by trying out some of these examples.


About the Authors


Radhika Jakkula is a Big Data Prototyping Solutions Architect at AWS. She helps customers build prototypes using AWS analytics services and purpose-built databases. She is a specialist in assessing wide range of requirements and applying relevant AWS services, big data tools, and frameworks to create a robust architecture.

Sidhanth Muralidhar is a Principal Technical Account Manager at AWS. He works with large enterprise customers who run their workloads on AWS. He is passionate about working with customers and helping them architect workloads for costs, reliability, performance, and operational excellence at scale in their cloud journey. He has a keen interest in data analytics as well.

Authorize API Gateway APIs using Amazon Verified Permissions and Amazon Cognito

Post Syndicated from Kevin Hakanson original https://aws.amazon.com/blogs/security/authorize-api-gateway-apis-using-amazon-verified-permissions-and-amazon-cognito/

Externalizing authorization logic for application APIs can yield multiple benefits for Amazon Web Services (AWS) customers. These benefits can include freeing up development teams to focus on application logic, simplifying application and resource access audits, and improving application security by using continual authorization. Amazon Verified Permissions is a scalable permissions management and fine-grained authorization service that you can use for externalizing application authorization. Along with controlling access to application resources, you can use Verified Permissions to restrict API access to authorized users by using Cedar policies. However, a key challenge in adopting an external authorization system like Verified Permissions is the effort involved in defining the policy logic and integrating with your API. This blog post shows how Verified Permissions accelerates the process of securing REST APIs that are hosted on Amazon API Gateway for Amazon Cognito customers.

Setting up API authorization using Amazon Verified Permissions

As a developer, there are several tasks you need to do in order to use Verified Permissions to store and evaluate policies that define which APIs a user is permitted to access. Although Verified Permissions enables you to decouple authorization logic from your application code, you may need to spend time up front integrating Verified Permissions with your applications. You may also need to spend time learning the Cedar policy language, defining a policy schema, and authoring policies that enforce access control on APIs. Lastly, you may need to spend additional time developing and testing the AWS Lambda authorizer function logic that builds the authorization request for Verified Permissions and enforces the authorization decision.

Getting started with the simplified wizard

Amazon Verified Permissions now includes a console-based wizard that you can use to quickly create building blocks to set up your application’s API Gateway to use Verified Permissions for authorization. Verified Permissions generates an authorization model based on your APIs and policies that allows only authorized Cognito groups access to your APIs. Additionally, it deploys a Lambda authorizer, which you attach to the APIs you want to secure. After the authorizer is attached, API requests are authorized by Verified Permissions. The generated Cedar policies and schema flatten the learning curve, yet allow you full control to modify and help you adhere to your security requirements.

Overview of sample application

In this blog post, we demonstrate how you can simplify the task of securing permissions to a sample application API by using the Verified Permissions console-based wizard. We use a sample pet store application which has two resources:

  1. PetStorePool – An Amazon Cognito user pool with users in one of three groups: customers, employees, and owners.
  2. PetStore – An Amazon API Gateway REST API derived from importing the PetStore example API and extended with a mock integration for administration. This mock integration returns a message with a URI path that uses {“statusCode”: 200} as the integration request and {“Message”: “User authorized for $context.path”} as the integration response.

The PetStore has the following four authorization requirements that allow access to the related resources. All other behaviors should be denied.

  1. Both authenticated and unauthenticated users are allowed to access the root URL.
    • GET /
  2. All authenticated users are allowed to get the list of pets, or get a pet by its identifier.
    • GET /pets
    • GET /pets/{petid}
  3. The employees and owners group are allowed to add new pets.
    • POST /pets
  4. Only the owners group is allowed to perform administration functions. These are defined using an API Gateway proxy resource that enables a single integration to implement a set of API resources.
    • ANY /admin/{proxy+}

Walkthrough

Verified Permissions includes a setup wizard that connects a Cognito user pool to an API Gateway REST API and secures resources based on Cognito group membership. In this section, we provide a walkthrough of the wizard that generates authorization building blocks for our sample application.

To set up API authorization based on Cognito groups

  1. On the Amazon Verified Permissions page in the AWS Management Console, choose Create a new policy store.
  2. On the Specify policy store details page under Starting options, select Set up with Cognito and API Gateway, and then choose Next.

    Figure 1: Starting options

    Figure 1: Starting options

  3. On the Import resources and actions page under API Gateway details, select the API and Deployment stage from the dropdown lists. (A REST API stage is a named reference to a deployment.) For this example, we selected the PetStore API and the demo stage.

    Figure 2: API Gateway and deployment stage

    Figure 2: API Gateway and deployment stage

  4. Choose Import API to generate a Map of imported resources and actions. For our example, this list includes Action::”get /pets” for getting the list of pets, Action::”get /pets/{petId}” for getting a single pet, and Action::”post /pets” for adding a new pet. Choose Next.

    Figure 3: Map of imported resources and actions

    Figure 3: Map of imported resources and actions

  5. On the Choose identity source page, select an Amazon Cognito user pool (PetStorePool in our example). For Token type to pass to API, select a token type. For our example, we chose the default value, Access token, because Cognito recommends using the access token to authorize API operations. The additional claims available in an id token may support more fine-grained access control. For Client application validation, we also specified the default, to not validate that tokens match a configured app client ID. Consider validation when you have multiple user pool app clients configured with different permissions.

    Figure 4: Choose Cognito user pool as identity source

    Figure 4: Choose Cognito user pool as identity source

  6. Choose Next.
  7. On the Assign actions to groups page under Group selection, choose the Cognito user pool groups that can take actions in the application. This solution uses native Cognito group membership to control permissions. In Figure 5, the customers group is not used for access control, we deselected it and it isn’t included in the generated policies. Instead, access to get /pets and get/pets/{petId} is granted to all authenticated users using a different authorizer that we define later in this post.

    Figure 5: Assign actions to groups

    Figure 5: Assign actions to groups

  8. For each of the groups, choose which actions are allowed. In our example, post /pets is the only action selected for the employees group. For the owners group, all of the /admin/{proxy+} actions are additionally selected. Choose Next.

    Figure 6: Groups employees and owners

    Figure 6: Groups employees and owners

  9. On the Deploy app integration page, review the API Gateway Integration details. Choose Create policy store.

    Figure 7: API Gateway integration

    Figure 7: API Gateway integration

  10. On the Create policy store summary page, review the progress of the setup. Choose Check deployment to check the progress of Lambda authorizer.

    Figure 8: Create policy store

    Figure 8: Create policy store

The setup wizard deployed a CloudFormation stack with a Lambda authorizer. This authorizes access to the API Gateway resources for the employees and owners groups. For the resources that should be authorized for all authenticated users, a separate Cognito User Pool authorizer is required. You can use the following AWS CLI apigateway create-authorizer command to create the authorizer.

aws apigateway create-authorizer \
--rest-api-id wrma51eup0 \
--name "Cognito-PetStorePool" \
--type COGNITO_USER_POOLS \
--identity-source "method.request.header.Authorization" \
--provider-arns "arn:aws:cognito-idp:us-west-2:000000000000:userpool/us-west-2_iwWG5nyux"

After the CloudFormation stack deployment completes and the second Cognito authorizer is created, there are two authorizers that can be attached to PetStore API resources, as shown in Figure 9.

Figure 9: PetStore API Authorizers

Figure 9: PetStore API Authorizers

In Figure 9, Cognito-PetStorePool is a Cognito user pool authorizer. Because this example uses an access token, an authorization scope (for example, a custom scope like petstore/api) is specified when attached to the GET /pets and GET /pets/{petId} resources.

AVPAuthorizer-XXX is a request parameter-based Lambda authorizer, which determines the caller’s identity from the configured identity sources. In Figure 9, these sources are Authorization (Header), httpMethod (Context), and path (Context). This authorizer is attached to the POST /pets and ANY /admin/{proxy+} resources. Authorization caching is initially set at 120 seconds and can be configured using the API Gateway console.

This combination of multiple authorizers and caching reduces the number of authorization requests to Verified Permissions. For API calls that are available to all authenticated users, using the Cognito-PetStorePool authorizer instead of a policy permitting the customers group helps avoid chargeable authorization requests to Verified Permissions. Applications where the users initiate the same action multiple times or have a predictable sequence of actions will experience high cache hit rates. For repeated API calls that use the same token, AVPAuthorizer-XXX caching results in lower latency, fewer requests per second, and reduced costs from chargeable requests. The use of caching can delay the time between policy updates and policy enforcement, meaning that the policy updates to Verified Permissions are not realized until the timeout or the FlushStageAuthorizersCache API is called.

Deployment architecture

Figure 10 illustrates the runtime architecture after you have used the Verified Permissions setup wizard to perform the deployment and configuration steps. After the users are authenticated with the Cognito PetStorePool, API calls to the PetStore API are authorized with the Cognito access token. Fine-grained authorization is performed by Verified Permissions using a Lambda authorizer. The wizard automatically created the following four items for you, which are labelled in Figure 10:

  1. A Verified Permissions policy store that is connected to a Cognito identity source.
  2. A Cedar schema that defines the User and UserGroup entities, and an action for each API Gateway resource.
  3. Cedar policies that assign permissions for the employees and owners groups to related actions.
  4. Lambda authorizer that is configured on the API Gateway.

Figure 10: Architecture diagram after deployment

Figure 10: Architecture diagram after deployment

Verified Permissions uses the Cedar policy language to define fine-grained permissions. The default decision for an authorization response is “deny.” The Cedar policies that are generated by the setup wizard can determine an “allow” decision. The principal for each policy is a UserGroup entity with an entity ID format of {user pool id}|{group name}. The action IDs for each policy represent the set of selected API Gateway HTTP methods and resource paths. Note that post /pets is permitted for both employees and owners. The resource in the policy scope is unspecified, because the resource is implicitly the application.

permit (
    principal in PetStore::UserGroup::"us-west-2_iwWG5nyux|employees",
    action in [PetStore::Action::"post /pets"],
    resource
);

permit (
    principal in PetStore::UserGroup::"us-west-2_iwWG5nyux|owners",
    action in
        [PetStore::Action::"delete /admin/{proxy+}",
         PetStore::Action::"post /admin/{proxy+}",
         PetStore::Action::"get /admin/{proxy+}",
         PetStore::Action::"patch /admin/{proxy+}",
         PetStore::Action::"put /admin/{proxy+}",
         PetStore::Action::"post /pets"],
    resource
);

Validating API security

A set of terminal-based curl commands validate API security for both authorized and unauthorized users, by using different access tokens. For readability, a set of environment variables is used to represent the actual values. TOKEN_C, TOKEN_E, and TOKEN_O contain valid access tokens for respective users in the customers, employees, and owners groups. API_STAGE is the base URL for the PetStore API and demo stage that we selected earlier.

To test that an unauthenticated user is allowed for the GET / root path (Requirement 1 as described in the Overview section of this post), but not allowed to call the GET /pets API (Requirement 2), run the following curl commands. The Cognito-PetStorePool authorizer should return {“message”:”Unauthorized”}.

curl -X GET ${API_STAGE}/
<html>
...Welcome to your Pet Store API...
</html>

curl -X GET ${API_STAGE}/pets
{"message":"Unauthorized"}

To test that an authenticated user is allowed to call the GET /pets API (Requirement 2) by using an access token (due to the Cognito-PetStorePool authorizer), run the following curl commands. The user should receive an error message when they try to call the POST /pets API (Requirement 3), because of the AVPAuthorizer. There are no Cedar polices defined for the customers group with the action post /pets.

curl -H "Authorization: Bearer ${TOKEN_C}" -X GET ${API_STAGE}/pets
[
  {
    "id": 1,
    "type": "dog",
    "price": 249.99
  },
  {
    "id": 2,
    "type": "cat",
    "price": 124.99
  },
  {
    "id": 3,
    "type": "fish",
    "price": 0.99
  }
]

curl -H "Authorization: Bearer ${TOKEN_C}" -X POST ${API_STAGE}/pets
{"Message":"User is not authorized to access this resource with an explicit deny"}

The following commands will verify that a user in the employees group is allowed the post /pets action (Requirement 3).

curl -H "Authorization: Bearer ${TOKEN_E}" \
     -H "Content-Type: application/json" \
     -d '{"type": "dog","price": 249.99}' \
     -X POST ${API_STAGE}/pets
{
  "pet": {
    "type": "dog",
    "price": 249.99
  },
  "message": "success"
}

The following commands will verify that a user in the employees group is not authorized for the admin APIs, but a user in the owners group is allowed (Requirement 4).

curl -H "Authorization: Bearer ${TOKEN_E}" -X GET ${API_STAGE}/admin/curltest1
{"Message":"User is not authorized to access this resource with an explicit deny"} 

curl -H "Authorization: Bearer ${TOKEN_O}" -X GET ${API_STAGE}/admin/curltest1
{"Message": "User authorized for /demo/admin/curltest1"}

Try it yourself

How could this work with your user pool and REST API? Before you try out the solution, make sure that you have the following prerequisites in place, which are required by the Verified Permissions setup wizard:

  1. A Cognito user pool, along with Cognito groups that control authorization to the API endpoints.
  2. An API Gateway REST API in the same Region as the Cognito user pool.

As you review the resources generated by the solution, consider these authorization modeling topics:

  • Are access tokens or id tokens preferable for your API? Are there custom claims on your tokens that you would use in future Cedar policies for fine-grained authorization?
  • Do multiple authorizers fit your model, or do you have an “all users” group for use in Cedar policies?
  • How might you extend the Cedar schema, allowing for new Cedar policies that include URL path parameters, such as {petId} from the example?

Conclusion

This post demonstrated how the Amazon Verified Permissions setup wizard provides you with a step-by-step process to build authorization logic for API Gateway REST APIs using Cognito user groups. The wizard generates a policy store, schema, and Cedar policies to manage access to API endpoints based on the specification of the APIs deployed. In addition, the wizard creates a Lambda authorizer that authorizes access to the API Gateway resources based on the configured Cognito groups. This removes the modeling effort required for initial configuration of API authorization logic and setup of Verified Permissions to receive permission requests. You can use the wizard to set up and test access controls to your APIs based on Cognito groups in non-production accounts. You can further extend the policy schema and policies to accommodate fine-grained or attribute-based access controls, based on specific requirements of the application, without making code changes.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon Verified Permissions re:Post or contact AWS Support.

Kevin Hakanson

Kevin Hakanson

Kevin is a Senior Solutions Architect for AWS World Wide Public Sector, based in Minnesota. He works with EdTech and GovTech customers to ideate, design, validate, and launch products using cloud-native technologies and modern development practices. When not staring at a computer screen, he is probably staring at another screen, either watching TV or playing video games with his family.

sowjir-1.jpeg

Sowjanya Rajavaram

Sowjanya is a Senior Solutions Architect who specializes in identity and security solutions at AWS. Her career has been focused on helping customers of all sizes solve their identity and access management problems. She enjoys traveling and experiencing new cultures and food.

Genomics workflows, Part 6: cost prediction

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/part-6-genomics-workflows-cost-prediction/

Genomics workflows run on large pools of compute resources and take petabyte-scale datasets as inputs. Workflow runs can cost as much as hundreds of thousands of US dollars. Given this large scale, scientists want to estimate the projected cost of their genomics workflow runs before deciding to launch them.

In Part 6 of this series, we build on the benchmarking concepts presented in Part 5. You will learn how to train machine learning (ML) models on historical data to predict the cost of future runs. While we focus on genomics, the design pattern is broadly applicable to any compute-intensive workflow use case.

Use case

In large life-sciences organizations, multiple research teams often use the same genomics applications. The actual cost of consuming shared resources is only periodically shown or charged back to research teams.

In this blog post’s scenario, scientists want to predict the cost of future workflow runs based on the following input parameters:

  • Workflow name
  • Input data set size
  • Expected output dataset size

In our experience, scientists might not know how to reliably estimate compute cost based on the preceding parameters. This is because workflow run cost doesn’t linearly correlate to the input dataset size. For example, some workflow steps might be highly parallelizable while others aren’t. Otherwise, scientists could simply use the AWS Pricing Calculator or interact programmatically with the AWS Price List API. To solve this problem, we use ML to model the pattern of correlation and predict workflow cost.

Business benefits of predicting the cost of genomics workflow runs

Price prediction brings the following benefits:

  • Prioritizing workflow runs based on financial impact
  • Promoting cost awareness and frugality with application users
  • Supporting enterprise resource planning and prevention of budget overruns by integrating estimation data into management reporting and approval workflows

Prerequisites

To build this solution, you must have workflows running on AWS for which you collect actual cost data after each workflow run. This setup is demonstrated in Part 3 and Part 5 of this blog series. This data provides training data for the solution’s cost prediction models.

Solution overview

This solution includes a friendly user interface, ML models that predict usage parameters, and a metadata storage mechanism to estimate the cost of a workflow run. We use the automated workflow manager presented in Part 3 and the benchmarking solution from Part 5. The data on historical workflow launches and their cost serves as training and testing data for our ML models. We store this in Amazon DynamoDB. We use AWS Amplify to host a serverless user interface and a library/framework such as React to build it.

Scientists input the required parameters about their genomics workflow run to the Amplify frontend React application. The latter makes a request to an Amazon API Gateway REST API. This invokes an AWS Lambda function, which calls an Amazon SageMaker hosted endpoint to return predicted costs (Figure 1).

This visual summarizes the cost prediction and model training processes. Users request cost predictions for future workflow runs on a web frontend hosted in AWS Amplify. The frontend passes the requests to an Amazon API Gateway endpoint with Lambda integration. The Lambda function retrieves the suitable model endpoint from the DynamoDB table and invokes the model via the Amazon SageMaker API. Model training runs on a schedule and is orchestrated by an AWS Step Functions state machine. The state machine queries training datasets from the DynamoDB table. If the new model performs better, it is registered in the SageMaker model registry. Otherwise, the state machine sends a notification to an Amazon Simple Notification Service topic stating that there are no updates.

Figure 1. Automated cost prediction of genomics workflow runs

Each workflow captured in the DynamoDB table has a corresponding ML model trained for the specific use case. Separating out models for specific workflows simplifies the model development process. This solution periodically trains ML models to improve their overall accuracy and performance. A rule in Amazon EventBridge Scheduler invokes model training on a regular basis. An AWS Step Functions state machine automates the model training process.

Implementation considerations

When a scientist submits a request (which includes the name of the workflow they’re running), API Gateway uses Lambda integration. The Lambda function retrieves a record from the DynamoDB table that keeps track of the SageMaker hosted endpoints. The partition key of the table is the workflow name (indicated as workflow_name), as shown in the following example:

This visual displays an exemplary DynamoDB record. The record includes the Amazon SageMaker hosted endpoint that AWS Lambda would retrieve for a regenie workflow.

Using the input parameters, the Lambda function invokes the SageMaker hosted endpoint and returns the inference values back to the frontend.

Automating model training

Our Step Functions state machine for model training uses native SageMaker SDK integration. It runs as follows:

  1. The state machine invokes a SageMaker training job to train a new ML model. The training job uses the historical workflow run data sourced from the DynamoDB table. After the training job completes, it outputs the ML model to an Amazon Simple Storage Service (Amazon S3) bucket.
  2. The state machine registers the new model in the SageMaker model registry.
  3. A Lambda function compares the performance of the new model with the prior version on the training dataset.
  4. If the new model performs better than the prior model, the state machine creates a new SageMaker hosted endpoint configuration and puts the endpoint name in the DynamoDB table.
  5. Otherwise, the state machine sends a notification to an Amazon Simple Notification Service (Amazon SNS) topic stating that there are no updates.

Conclusion

In this blog post, we demonstrated how genomics research teams can build a price estimator to predict genomics workflow run cost. This solution trains ML models for each workflow based on data from historical workflow runs. A state machine helps automate the entire model training process. You can use price estimation to promote cost awareness in your organization and reduce the risk of budget overruns.

Our solution is particularly suitable if you want to predict the price of individual workflow runs. If you want forecast overall consumption of your shared application infrastructure, consider deploying a forecasting workflow with Amazon Forecast. The Build workflows for Amazon Forecast with AWS Step Functions blog post provides details on the specific use case for using Amazon Forecast workflows.

Related information

Run interactive workloads on Amazon EMR Serverless from Amazon EMR Studio

Post Syndicated from Sekar Srinivasan original https://aws.amazon.com/blogs/big-data/run-interactive-workloads-on-amazon-emr-serverless-from-amazon-emr-studio/

Starting from release 6.14, Amazon EMR Studio supports interactive analytics on Amazon EMR Serverless. You can now use EMR Serverless applications as the compute, in addition to Amazon EMR on EC2 clusters and Amazon EMR on EKS virtual clusters, to run JupyterLab notebooks from EMR Studio Workspaces.

EMR Studio is an integrated development environment (IDE) that makes it straightforward for data scientists and data engineers to develop, visualize, and debug analytics applications written in PySpark, Python, and Scala. EMR Serverless is a serverless option for Amazon EMR that makes it straightforward to run open source big data analytics frameworks such as Apache Spark without configuring, managing, and scaling clusters or servers.

In the post, we demonstrate how to do the following:

  • Create an EMR Serverless endpoint for interactive applications
  • Attach the endpoint to an existing EMR Studio environment
  • Create a notebook and run an interactive application
  • Seamlessly diagnose interactive applications from within EMR Studio

Prerequisites

In a typical organization, an AWS account administrator will set up AWS resources such as AWS Identity and Access management (IAM) roles, Amazon Simple Storage Service (Amazon S3) buckets, and Amazon Virtual Private Cloud (Amazon VPC) resources for internet access and access to other resources in the VPC. They assign EMR Studio administrators who manage setting up EMR Studios and assigning users to a specific EMR Studio. Once they’re assigned, EMR Studio developers can use EMR Studio to develop and monitor workloads.

Make sure you set up resources like your S3 bucket, VPC subnets, and EMR Studio in the same AWS Region.

Complete the following steps to deploy these prerequisites:

  1. Launch the following AWS CloudFormation stack.
    Launch Cloudformation Stack
  2. Enter values for AdminPassword and DevPassword and make a note of the passwords you create.
  3. Choose Next.
  4. Keep the settings as default and choose Next again.
  5. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  6. Choose Submit.

We have also provided instructions to deploy these resources manually with sample IAM policies in the GitHub repo.

Set up EMR Studio and a serverless interactive application

After the AWS account administrator completes the prerequisites, the EMR Studio administrator can log in to the AWS Management Console to create an EMR Studio, Workspace, and EMR Serverless application.

Create an EMR Studio and Workspace

The EMR Studio administrator should log in to the console using the emrs-interactive-app-admin-user user credentials. If you deployed the prerequisite resources using the provided CloudFormation template, use the password that you provided as an input parameter.

  1. On the Amazon EMR console, choose EMR Serverless in the navigation pane.
  2. Choose Get started.
  3. Select Create and launch EMR Studio.

This creates a Studio with the default name studio_1 and a Workspace with the default name My_First_Workspace. A new browser tab will open for the Studio_1 user interface.

Create and Launch EMR Studio

Create an EMR Serverless application

Complete the following steps to create an EMR Serverless application:

  1. On the EMR Studio console, choose Applications in the navigation pane.
  2. Create a new application.
  3. For Name, enter a name (for example, my-serverless-interactive-application).
  4. For Application setup options, select Use custom settings for interactive workloads.
    Create Serverless Application using custom settings

For interactive applications, as a best practice, we recommend keeping the driver and workers pre-initialized by configuring the pre-initialized capacity at the time of application creation. This effectively creates a warm pool of workers for an application and keeps the resources ready to be consumed, enabling the application to respond in seconds. For further best practices for creating EMR Serverless applications, see Define per-team resource limits for big data workloads using Amazon EMR Serverless.

  1. In the Interactive endpoint section, select Enable Interactive endpoint.
  2. In the Network connections section, choose the VPC, private subnets, and security group you created previously.

If you deployed the CloudFormation stack provided in this post, choose emr-serverless-sg­  as the security group.

A VPC is needed for the workload to be able to access the internet from within the EMR Serverless application in order to download external Python packages. The VPC also allows you to access resources such as Amazon Relational Database Service (Amazon RDS) and Amazon Redshift that are in the VPC from this application. Attaching a serverless application to a VPC can lead to IP exhaustion in the subnet, so make sure there are sufficient IP addresses in your subnet.

  1. Choose Create and start application.

Enable Interactive Endpoints, Choose private subnets and security group

On the applications page, you can verify that the status of your serverless application changes to Started.

  1. Select your application and choose How it works.
  2. Choose View and launch workspaces.
  3. Choose Configure studio.

  1. For Service role¸ provide the EMR Studio service role you created as a prerequisite (emr-studio-service-role).
  2. For Workspace storage, enter the path of the S3 bucket you created as a prerequisite (emrserverless-interactive-blog-<account-id>-<region-name>).
  3. Choose Save changes.

Choose emr-studio-service-role and emrserverless-interactive-blog s3 bucket

14.  Navigate to the Studios console by choosing Studios in the left navigation menu in the EMR Studio section. Note the Studio access URL from the Studios console and provide it to your developers to run their Spark applications.

Run your first Spark application

After the EMR Studio administrator has created the Studio, Workspace, and serverless application, the Studio user can use the Workspace and application to develop and monitor Spark workloads.

Launch the Workspace and attach the serverless application

Complete the following steps:

  1. Using the Studio URL provided by the EMR Studio administrator, log in using the emrs-interactive-app-dev-user user credentials shared by the AWS account admin.

If you deployed the prerequisite resources using the provided CloudFormation template, use the password that you provided as an input parameter.

On the Workspaces page, you can check the status of your Workspace. When the Workspace is launched, you will see the status change to Ready.

  1. Launch the workspace by choosing the workspace name (My_First_Workspace).

This will open a new tab. Make sure your browser allows pop-ups.

  1. In the Workspace, choose Compute (cluster icon) in the navigation pane.
  2. For EMR Serverless application, choose your application (my-serverless-interactive-application).
  3. For Interactive runtime role, choose an interactive runtime role (for this post, we use emr-serverless-runtime-role).
  4. Choose Attach to attach the serverless application as the compute type for all the notebooks in this Workspace.

Choose my-serverless-interactive-application as your app and emr-serverless-runtime-role and attach

Run your Spark application interactively

Complete the following steps:

  1. Choose the Notebook samples (three dots icon) in the navigation pane and open Getting-started-with-emr-serverless notebook.
  2. Choose Save to Workspace.

There are three choices of kernels for our notebook: Python 3, PySpark, and Spark (for Scala).

  1. When prompted, choose PySpark as the kernel.
  2. Choose Select.

Choose PySpark as kernel

Now you can run your Spark application. To do so, use the %%configure Sparkmagic command, which configures the session creation parameters. Interactive applications support Python virtual environments. We use a custom environment in the worker nodes by specifying a path for a different Python runtime for the executor environment using spark.executorEnv.PYSPARK_PYTHON. See the following code:

%%configure -f
{
  "conf": {
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.python": "/usr/bin/python3",
    "spark.executorEnv.PYSPARK_PYTHON": "/usr/bin/python3"
  }
}

Install external packages

Now that you have an independent virtual environment for the workers, EMR Studio notebooks allow you to install external packages from within the serverless application by using the Spark install_pypi_package function through the Spark context. Using this function makes the package available for all the EMR Serverless workers.

First, install matplotlib, a Python package, from PyPi:

sc.install_pypi_package("matplotlib")

If the preceding step doesn’t respond, check your VPC setup and make sure it is configured correctly for internet access.

Now you can use a dataset and visualize your data.

Create visualizations

To create visualizations, we use a public dataset on NYC yellow taxis:

file_name = "s3://athena-examples-us-east-1/notebooks/yellow_tripdata_2016-01.parquet"
taxi_df = (spark.read.format("parquet").option("header", "true") \
.option("inferSchema", "true").load(file_name))

In the preceding code block, you read the Parquet file from a public bucket in Amazon S3. The file has headers, and we want Spark to infer the schema. You then use a Spark dataframe to group and count specific columns from taxi_df:

taxi1_df = taxi_df.groupBy("VendorID", "passenger_count").count()
taxi1_df.show()

Use %%display magic to view the result in table format:

%%display
taxi1_df

Table shows vendor_id, passenger_count and count columns

You can also quickly visualize your data with five types of charts. You can choose the display type and the chart will change accordingly. In the following screenshot, we use a bar chart to visualize our data.

bar chart showing passenger_count against each vendor_id

Interact with EMR Serverless using Spark SQL

You can interact with tables in the AWS Glue Data Catalog using Spark SQL on EMR Serverless. In the sample notebook, we show how you can transform data using a Spark dataframe.

First, create a new temporary view called taxis. This allows you to use Spark SQL to select data from this view. Then create a taxi dataframe for further processing:

taxi_df.createOrReplaceTempView("taxis")
sqlDF = spark.sql(
    "SELECT DOLocationID, sum(total_amount) as sum_total_amount \
     FROM taxis where DOLocationID < 25 Group by DOLocationID ORDER BY DOLocationID"
)
sqlDF.show(5)

Table shows vendor_id, passenger_count and count columns

In each cell in your EMR Studio notebook, you can expand Spark Job Progress to view the various stages of the job submitted to EMR Serverless while running this specific cell. You can see the time taken to complete each stage. In the following example, stage 14 of the job has 12 completed tasks. In addition, if there is any failure, you can see the logs, making troubleshooting a seamless experience. We discuss this more in the next section.

Job[14]: showString at NativeMethodAccessorImpl.java:0 and Job[15]: showString at NativeMethodAccessorImpl.java:0

Use the following code to visualize the processed dataframe using the matplotlib package. You use the maptplotlib library to plot the dropoff location and the total amount as a bar chart.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
plt.clf()
df = sqlDF.toPandas()
plt.bar(df.DOLocationID, df.sum_total_amount)
%matplot plt

Diagnose interactive applications

You can get the session information for your Livy endpoint using the %%info Sparkmagic. This gives you links to access the Spark UI as well as the driver log right in your notebook.

The following screenshot is a driver log snippet for our application, which we opened via the link in our notebook.

driver log screenshot

Similarly, you can choose the link below Spark UI to open the UI. The following screenshot shows the Executors tab, which provides access to the driver and executor logs.

The following screenshot shows stage 14, which corresponds to the Spark SQL step we saw earlier in which we calculated the location wise sum of total taxi collections, which had been broken down into 12 tasks. Through the Spark UI, the interactive application provides fine-grained task-level status, I/O, and shuffle details, as well as links to corresponding logs for each task for this stage right from your notebook, enabling a seamless troubleshooting experience.

Clean up

If you no longer want to keep the resources created in this post, complete the following cleanup steps:

  1. Delete the EMR Serverless application.
  2. Delete the EMR Studio and the associated workspaces and notebooks.
  3. To delete rest of the resources, navigate to CloudFormation console, select the stack, and choose Delete.

All of the resources will be deleted except the S3 bucket, which has its deletion policy set to retain.

Conclusion

The post showed how to run interactive PySpark workloads in EMR Studio using EMR Serverless as the compute. You can also build and monitor Spark applications in an interactive JupyterLab Workspace.

In an upcoming post, we’ll discuss additional capabilities of EMR Serverless Interactive applications, such as:

  • Working with resources such as Amazon RDS and Amazon Redshift in your VPC (for example, for JDBC/ODBC connectivity)
  • Running transactional workloads using serverless endpoints

If this is your first time exploring EMR Studio, we recommend checking out the Amazon EMR workshops and referring to Create an EMR Studio.


About the Authors

Sekar Srinivasan is a Principal Specialist Solutions Architect at AWS focused on Data Analytics and AI. Sekar has over 20 years of experience working with data. He is passionate about helping customers build scalable solutions modernizing their architecture and generating insights from their data. In his spare time he likes to work on non-profit projects, focused on underprivileged Children’s education.

Disha Umarwani is a Sr. Data Architect with Amazon Professional Services within Global Health Care and LifeSciences. She has worked with customers to design, architect and implement Data Strategy at scale. She specializes in architecting Data Mesh architectures for Enterprise platforms.

Using Amazon Verified Permissions to manage authorization for AWS IoT smart home applications

Post Syndicated from Rajat Mathur original https://aws.amazon.com/blogs/security/using-amazon-verified-permissions-to-manage-authorization-for-aws-iot-smart-thermostat-applications/

This blog post introduces how manufacturers and smart appliance consumers can use Amazon Verified Permissions to centrally manage permissions and fine-grained authorizations. Developers can offer more intuitive, user-friendly experiences by designing interfaces that align with user personas and multi-tenancy authorization strategies, which can lead to higher user satisfaction and adoption. Traditionally, implementing authorization logic using role based access control (RBAC) or attribute based access control (ABAC) within IoT applications can become complex as the number of connected devices and associated user roles grows. This often leads to an unmanageable increase in access rules that must be hard-coded into each application, requiring excessive compute power for evaluation. By using Verified Permissions, you can externalize the authorization logic using Cedar policy language, enabling you to define fine-grained permissions that combine RBAC and ABAC models. This decouples permissions from your application’s business logic, providing a centralized and scalable way to manage authorization while reducing development effort.

In this post, we walk you through a reference architecture that outlines an end-to-end smart thermostat application solution using AWS IoT Core, Verified Permissions, and other AWS services. We show you how to use Verified Permissions to build an authorization solution using Cedar policy language to define dynamic policy-based access controls for different user personas. The post includes a link to a GitHub repository that houses the code for the web dashboard and the Verified Permissions logic to control access to the solution APIs.

Solution overview

This solution consists of a smart thermostat IoT device and an AWS hosted web application using Verified Permissions for fine-grained access to various application APIs. For this use case, the AWS IoT Core device is being simulated by an AWS Cloud9 environment and communicates with the IoT service using AWS IoT Device SDK for Python. After being configured, the device connects to AWS IoT Core to receive commands and send messages to various MQTT topics.

As a general practice, when a user-facing IoT solution is implemented, the manufacturer performs administrative tasks such as:

  1. Embedding AWS Private Certificate Authority certificates into each IoT device (in this case a smart thermostat). Usually this is done on the assembly line and the certificates used to verify the IoT endpoints are burned into device memory along with the firmware.
  2. Creating an Amazon Cognito user pool that provides sign-up and sign-in options for web and mobile application users and hosts the authentication process.
  3. Creating policy stores and policy templates in Verified Permissions. Based on who signs up, the manufacturer creates policies with Verified Permissions to link each signed-up user to certain allowed resources or IoT devices.
  4. The mapping of user to device is stored in a datastore. For this solution, you’ll use an Amazon DynamoDB table to record the relationship.

The user who purchases the device (the primary device owner) performs the following tasks:

  1. Signs up on the manufacturer’s web application or mobile app and registers the IoT device by entering a unique serial number. The mapping between user details and the device serial number is stored in the datastore through an automated process that is initiated after sign-up and device claim.
  2. Connects the new device to an existing wireless network, which initiates a registration process to securely connect to AWS IoT Core services within the manufacturer’s account.
  3. Invites other users (such as guests, family members, or the power company) through a referral, invitation link, or a designated OAuth process.
  4. Assign roles to the other users and therefore permissions.
     
Figure 1: Sample smart home application architecture built using AWS services

Figure 1: Sample smart home application architecture built using AWS services

Figure 1 depicts the solution as three logical components:

  1. The first component depicts device operations through AWS IoT Core. The smart thermostat is on site and it communicates with AWS IoT Core and its state is managed through the AWS IoT Device Shadow Service.
  2. The second component depicts the web application, which is the application interface that customers use. It’s a ReactJS-backed single page application deployed using AWS Amplify.
  3. The third component shows the backend application, which is built using Amazon API Gateway, AWS Lambda, and DynamoDB. A Cognito user pool is used to manage application users and their authentication. Authorization is handled by Verified Permissions where you create and manage policies that are evaluated when the web application calls backend APIs. These policies are evaluated against each authorization policy to provide an access decision to deny or allow an action.

The solution flow itself can be broken down into three steps after the device is onboarded and users have signed up:

  1. The smart thermostat device connects and communicates with AWS IoT Core using the MQTT protocol. A classic Device Shadow is created for the AWS IoT thing Thermostat1 when the UpdateThingShadow call is made the first time through the AWS SDK for a new device. AWS IoT Device Shadow service lets the web application query and update the device’s state in case of connectivity issues.
  2. Users sign up or sign in to the Amplify hosted smart home application and authenticate themselves against a Cognito user pool. They’re mapped to a device, which is stored in a DynamoDB table.
  3. After the users sign in, they’re allowed to perform certain tasks and view certain sections of the dashboard based on the different roles and policies managed by Verified Permissions. The underlying Lambda function that’s responsible for handling the API calls queries the DynamoDB table to provide user context to Verified Permissions.

Prerequisites

  1. To deploy this solution, you need access to the AWS Management Console and AWS Command Line Interface (AWS CLI) on your local machine with sufficient permissions to access required services, including Amplify, Verified Permissions, and AWS IoT Core. For this solution, you’ll give the services full access to interact with different underlying services. But in production, we recommend following security best practices with AWS Identity and Access Management (IAM), which involves scoping down policies.
  2. Set up Amplify CLI by following these instructions. We recommend the latest NodeJS stable long-term support (LTS) version. At the time of publishing this post, the LTS version was v20.11.1. Users can manage multiple NodeJS versions on their machines by using a tool such as Node Version Manager (nvm).

Walkthrough

The following table describes the actions, resources, and authorization decisions that will be enforced through Verified Permissions policies to achieve fine-grained access control. In this example, John is the primary device owner and has purchased and provisioned a new smart thermostat device called Thermostat1. He has invited Jane to access his device and has given her restricted permissions. John has full control over the device whereas Jane is only allowed to read the temperature and set the temperature between 72°F and 78°F.

John has also decided to give his local energy provider (Power Company) access to the device so that they can set the optimum temperature during the day to manage grid load and offer him maximum savings on his energy bill. However, they can only do so between 2:00 PM and 5:00 PM.

For security purposes the verified permissions default decision is DENY for unauthorized principals.

Name Principal Action Resource Authorization decision
Any Default Default Default Deny
John john_doe Any Thermostat1 Allow
Jane jane_doe GetTemperature Thermostat1 Allow
Jane jane_doe SetTemperature Thermostat1 Allow only if desired temperature is between 72°F and 78°F.
Power Company powercompany GetTemperature Thermostat1 Allow only if accessed between the hours of 2:00 PM and 5:00 PM
Power Company powercompany SetTemperature Thermostat1 Allow only if the temperature is set between the hours of 2:00 PM and 5:00 PM

Create a Verified Permissions policy store

Verified Permissions is a scalable permissions management and fine-grained authorization service for the applications that you build. The policies are created using Cedar, a dedicated language for defining access permissions in applications. Cedar seamlessly integrates with popular authorization models such as RBAC and ABAC.

A policy is a statement that either permits or forbids a principal to take one or more actions on a resource. A policy store is a logical container that stores your Cedar policies, schema, and principal sources. A schema helps you to validate your policy and identify errors based on the definitions you specify. See Cedar schema to learn about the structure and formal grammar of a Cedar schema.

To create the policy store

  1. Sign in to the Amazon Verified Permissions console and choose Create policy store.
  2. In the Configuration Method section, select Empty Policy Store and choose Create policy store.
     
Figure 2: Create an empty policy store

Figure 2: Create an empty policy store

Note: Make a note of the policy store ID to use when you deploy the solution.

To create a schema for the application

  1. On the Verified Permissions page, select Schema.
  2. In the Schema section, choose Create schema.
     
    Figure 3: Create a schema

    Figure 3: Create a schema

  3. In the Edit schema section, choose JSON mode, paste the following sample schema for your application, and choose Save changes.
    {
        "AwsIotAvpWebApp": {
            "entityTypes": {
                "Device": {
                    "shape": {
                        "attributes": {
                            "primaryOwner": {
                                "name": "User",
                                "required": true,
                                "type": "Entity"
                            }
                        },
                        "type": "Record"
                    },
                    "memberOfTypes": []
                },
                "User": {}
            },
            "actions": {
                "GetTemperature": {
                    "appliesTo": {
                        "context": {
                            "attributes": {
                                "desiredTemperature": {
                                    "type": "Long"
                                },
                                "time": {
                                    "type": "Long"
                                }
                            },
                            "type": "Record"
                        },
                        "resourceTypes": [
                            "Device"
                        ],
                        "principalTypes": [
                            "User"
                        ]
                    }
                },
                "SetTemperature": {
                    "appliesTo": {
                        "resourceTypes": [
                            "Device"
                        ],
                        "principalTypes": [
                            "User"
                        ],
                        "context": {
                            "attributes": {
                                "desiredTemperature": {
                                    "type": "Long"
                                },
                                "time": {
                                    "type": "Long"
                                }
                            },
                            "type": "Record"
                        }
                    }
                }
            }
        }
    }

When creating policies in Cedar, you can define authorization rules using a static policy or a template-linked policy.

Static policies

In scenarios where a policy explicitly defines both the principal and the resource, the policy is categorized as a static policy. These policies are immediately applicable for authorization decisions, as they are fully defined and ready for implementation.

Template-linked policies

On the other hand, there are situations where a single set of authorization rules needs to be applied across a variety of principals and resources. Consider an IoT application where actions such as SetTemperature and GetTemperature must be permitted for specific devices. Using static policies for each unique combination of principal and resource can lead to an excessive number of almost identical policies, differing only in their principal and resource components. This redundancy can be efficiently addressed with policy templates. Policy templates allow for the creation of policies using placeholders for the principal, the resource, or both. After a policy template is established, individual policies can be generated by referencing this template and specifying the desired principal and resource. These template-linked policies function the same as static policies, offering a streamlined and scalable solution for policy management.

To create a policy that allows access to the primary owner of the device using a static policy

  1. In the Verified Permissions console, on the left pane, select Policies, then choose Create policy and select Create static policy from the drop-down menu.
     
    Figure 4: Create static policy

    Figure 4: Create static policy

  2. Define the policy scope:
    1. Select Permit for the Policy effect.
       
      Figure 5: Define policy effect

      Figure 5: Define policy effect

    2. Select All Principals for Principals scope.
    3. Select All Resources for Resource scope.
    4. Select All Actions for Actions scope and choose Next.
       
      Figure 6: Define policy scope

      Figure 6: Define policy scope

  3. On the Details page, under Policy, paste the following full-access policy, which grants the primary owner permission to perform both SetTemperature and GetTemperature actions on the smart thermostat unconditionally. Choose Create policy.
    	permit (principal, action, resource)
    	when { resource.primaryOwner == principal };
    Figure 7: Write and review policy statement

    Figure 7: Write and review policy statement

To create a static policy to allow a guest user to read the temperature

In this example, the guest user is Jane (username: jane_doe).

  1. Create another static policy and specify the policy scope.
    1. Select Permit for the Policy effect.
       
      Figure 8: Define the policy effect

      Figure 8: Define the policy effect

    2. Select Specific principal for the Principals scope.
    3. Select AwsIotAvpWebApp::User and enter jane_doe.
       
      Figure 9: Define the policy scope

      Figure 9: Define the policy scope

    4. Select Specific resource for the Resources scope.
    5. Select AwsIotAvpWebApp::Device and enter Thermostat1.
    6. Select Specific set of actions for the Actions scope.
    7. Select GetTemperature and choose Next.
       
      Figure 10: Define resource and action scopes

      Figure 10: Define resource and action scopes

    8. Enter the Policy description: Allow jane_doe to read thermostat1.
    9. Choose Create policy.

Next, you will create reusable policy templates to manage policies efficiently. To create a policy template for a guest user with restricted temperature settings that limit the temperature range they can set to between 72°F and 78°F. In this case, the guest user is going to be Jane (username: jane_doe)

To create a reusable policy template

  1. Select Policy template and enter Guest user template as the description.
  2. Paste the following sample policy in the Policy body and choose Create policy template.
    permit (
        principal == ?principal,
        action in [AwsIotAvpWebApp::Action::"SetTemperature"],
        resource == ?resource
    )
    when { context.desiredTemperature >= 72 && context.desiredTemperature <= 78 };
Figure 11: Create guest user policy template

Figure 11: Create guest user policy template

As you can see, you don’t specify the principal and resource yet. You enter those when you create an actual policy from the policy template. The context object will be populated with the desiredTemperature property in the application and used to evaluate the decision.

You also need to create a policy template for the Power Company user with restricted time settings. Cedar policies don’t support date/time format, so you must represent 2:00 PM and 5:00 PM as elapsed minutes from midnight.

To create a policy template for the power company

  1. Select Policy template and enter Power company user template as the description.
  2. Paste the following sample policy in the Policy body and choose Create policy template.
    permit (
        principal == ?principal,
        action in [AwsIotAvpWebApp::Action::"SetTemperature", AwsIotAvpWebApp::Action::"GetTemperature"],
        resource == ?resource
    )
    when { context.time >= 840 && context.time < 1020 };

The policy templates accept the user and resource. The next step is to create a template-linked policy for Jane to set and get thermostat readings based on the Guest user template that you created earlier. For simplicity, you will manually create this policy using the Verified Permissions console. In production, application policies can be dynamically created using the Verified Permissions API.

To create a template-linked policy for a guest user

  1. In the Verified Permissions console, on the left pane, select Policies, then choose Create policy and select Create template-linked policy from the drop-down menu.
     
    Figure 12: Create new template-linked policy

    Figure 12: Create new template-linked policy

  2. Select the Guest user template and choose next.
     
    Figure 13: Select Guest user template

    Figure 13: Select Guest user template

  3. Under parameter selection:
    1. For Principal enter AwsIotAvpWebApp::User::”jane_doe”.
    2. For Resource enter AwsIotAvpWebApp::Device::”Thermostat1″.
    3. Choose Create template-linked policy.
       
      Figure 14: Create guest user template-linked policy

      Figure 14: Create guest user template-linked policy

Note that with this policy in place, jane_doe can only set the temperature of the device Thermostat1 to between 72°F and 78°F.

To create a template-linked policy for the power company user

Based on the template that was set up for power company, you now need an actual policy for it.

  1. In the Verified Permissions console, go to the left pane and select Policies, then choose Create policy and select Create template-linked policy from the drop-down menu.
  2. Select the Power company user template and choose next.
  3. Under Parameter selection, for Principal enter AwsIotAvpWebApp::User::”powercompany”, and for Resource enter AwsIotAvpWebApp::Device::”Thermostat1″, and choose Create template-linked policy.

Now that you have a set of policies in a policy store, you need to update the backend codebase to include this information and then deploy the web application using Amplify.

The policy statements in this post intentionally use human-readable values such as jane_doe and powercompany for the principal entity. This is useful when discussing general concepts but in production systems, customers should use unique and immutable values for entities. See Get the best out of Amazon Verified Permissions by using fine-grained authorization methods for more information.

Deploy the solution code from GitHub

Go to the GitHub repository to set up the Amplify web application. The repository Readme file provides detailed instructions on how to set up the web application. You will need your Verified Permissions policy store ID to deploy the application. For convenience, we’ve provided an onboarding script—deploy.sh—which you can use to deploy the application.

To deploy the application

  1. Close the repository.
    git clone https://github.com/aws-samples/amazon-verified-permissions-iot-
    amplify-smart-home-application.git

  2. Deploy the application.
    ./deploy.sh <region> <Verified Permissions Policy Store ID>

After the web dashboard has been deployed, you’ll create an IoT device using AWS IoT Core.

Create an IoT device and connect it to AWS IoT Core

With the users, policies, and templates, and the Amplify smart home application in place, you can now create a device and connect it to AWS IoT Core to complete the solution.

To create Thermostat1” device and connect it to AWS IoT Core

  1. From the left pane in the AWS IoT console, select Connect one device.
     
    Figure 15: Connect device using AWS IoT console

    Figure 15: Connect device using AWS IoT console

  2. Review how IoT Thing works and then choose Next.
     
    Figure 16: Review how IoT Thing works before proceeding

    Figure 16: Review how IoT Thing works before proceeding

  3. Choose Create a new thing and enter Thermostat1 as the Thing name and choose next.
    &bsp;
    Figure 17: Create the new IoT thing

    Figure 17: Create the new IoT thing

  4. Select Linux/macOS as the Device platform operating system and Python as the AWS IoT Core Device SDK and choose next.
     
    Figure 18: Choose the platform and SDK for the device

    Figure 18: Choose the platform and SDK for the device

  5. Choose Download connection kit and choose next.
     
    Figure 19: Download the connection kit to use for creating the Thermostat1 device

    Figure 19: Download the connection kit to use for creating the Thermostat1 device

  6. Review the three steps to display messages from your IoT device. You will use them to verify the thermostat1 IoT device connectivity to the AWS IoT Core platform. They are:
    1. Step 1: Add execution permissions
    2. Step 2: Run the start script
    3. Step 3: Return to the AWS IoT Console to view the device’s message
       
      Figure 20: How to display messages from an IoT device

      Figure 20: How to display messages from an IoT device

Solution validation

With all of the pieces in place, you can now test the solution.

Primary owner signs in to the web application to set Thermostat1 temperature to 82°F

Figure 21: Thermostat1 temperature update by John

Figure 21: Thermostat1 temperature update by John

  1. Sign in to the Amplify web application as John. You should be able to view the Thermostat1 controller on the dashboard.
  2. Set the temperature to 82°F.
  3. The Lambda function processes the request and performs an API call to Verified Permissions to determine whether to ALLOW or DENY the action based on the policies. Verified Permissions sends back an ALLOW, as the policy that was previously set up allows unrestricted access for primary owners.
  4. Upon receiving the response from Verified Permissions, the Lambda function sends ALLOW permission back to the web application and an API call to the AWS IoT Device Shadow service to update the device (Thermostat1) temperature to 82°F.
     
Figure 22: Policy evaluation decision is ALLOW when a primary owner calls SetTemperature

Figure 22: Policy evaluation decision is ALLOW when a primary owner calls SetTemperature

Guest user signs in to the web application to set Thermostat1 temperature to 80°F

Figure 23: Thermostat1 temperature update by Jane

Figure 23: Thermostat1 temperature update by Jane

  1. If you sign in as Jane to the Amplify web application, you can view the Thermostat1 controller on the dashboard.
  2. Set the temperature to 80°F.
  3. The Lambda function validates the actions by sending an API call to Verified Permissions to determine whether to ALLOW or DENY the action based on the established policies. Verified Permissions sends back a DENY, as the policy only permits temperature adjustments between 72°F and 78°F.
  4. Upon receiving the response from Verified Permissions, the Lambda function sends DENY permissions back to the web application and an unauthorized response is returned.
     
    Figure 24: Guest user jane_doe receives a DENY when calling SetTemperature for a desired temperature of 80°F

    Figure 24: Guest user jane_doe receives a DENY when calling SetTemperature for a desired temperature of 80°F

  5. If you repeat the process (still as Jane) but set Thermostat1 to 75°F, the policy will cause the request to be allowed.
     
    Figure 25: Guest user jane_doe receives an ALLOW when calling SetTemperature for a desired temperature of 75°F

    Figure 25: Guest user jane_doe receives an ALLOW when calling SetTemperature for a desired temperature of 75°F

  6. Similarly, jane_doe is allowed run GetTemperature on the device Thermostat1. When the temperature is set to 74°F, the device shadow is updated. The IoT device being simulated by your AWS Cloud9 instance reads desired the temperature field and sets the reported value to 74.
  7. Now, when jane_doe runs GetTemperature, the value of the device is reported as 74 as shown in Figure 26. We encourage you to try different restrictions in the World Settings (outside temperature and time) by adding restrictions to the static policy that allows GetTemperature for guest user.
     
    Figure 26: Guest user jane_doe receives an ALLOW when calling GetTemperature for the reported temperature

    Figure 26: Guest user jane_doe receives an ALLOW when calling GetTemperature for the reported temperature

Power company signs in to the web application to set Thermostat1 to 78°F at 3.30 PM

Figure 27: Thermostat1 temperature set to 78°F by powercompany user at a specified time

Figure 27: Thermostat1 temperature set to 78°F by powercompany user at a specified time

  1. Sign in as the powercompany user to the Amplify web application using an API. You can view the Thermostat1 controller on the dashboard.
  2. To test this scenario, set the current time to 3:30 PM, and try to set the temperature to 78°F.
  3. The Lambda function validates the actions by sending an API call to Verified Permissions to determine whether to ALLOW or DENY the action based on pre-established policies. Verified Permissions returns ALLOW permission, because the policy for powercompany permits device temperature changes between 2:00 PM and 5:00 PM.
  4. Upon receiving the response from Verified Permissions, the Lambda function sends ALLOW permission back to the web application and an API call to the AWS IoT Device Shadow service to update the Thermostat1 temperature to 78°F.
     
    Figure 28: powercompany receives an ALLOW when SetTemperature is called with the desired temperature of 78°F

    Figure 28: powercompany receives an ALLOW when SetTemperature is called with the desired temperature of 78°F

Note: As an optional exercise, we also made jane_doe a device owner for device Thermostat2. This can be observed in the users.json file in the Github repository. We encourage you to create your own policies and restrict functions for Thermostat2 after going through this post. You will need to create separate Verified Permissions policies and update the Lambda functions to interact with these policies.

We encourage you to create policies for guests and the power company and restrict permissions based on the following criteria:

  1. Verify Jane Doe can perform GetTemperature and SetTemperature actions on Thermostat2.
  2. John Doe should not be able to set the temperature on device Thermostat2 outside of the time range of 4:00 PM and 6:00 PM and outside of the temperature range of 68°F and 72°F.
  3. Power Company can only perform the GetTemperature operation, but there are no restrictions on time and outside temperature.

To help you verify the solution, we’ve provided the correct policies under the challenge directory in the GitHub repository.

Clean up

Deploying the Thermostat application in your AWS account will incur costs. To avoid ongoing charges, when you’re done examining the solution, delete the resources that were created. This includes the Amplify hosted web application, API Gateway resource, AWS Cloud 9 environment, the Lambda function, DynamoDB table, Cognito user pool, AWS IoT Core resources, and Verified Permissions policy store.

Amplify resources can be deleted by going to the AWS CloudFormation console and deleting the stacks that were used to provision various services.

Conclusion

In this post, you learned about creating and managing fine-grained permissions using Verified Permissions for different user personas for your smart thermostat IoT device. With Verified Permissions, you can strengthen your security posture and build smart applications aligned with Zero Trust principles for real-time authorization decisions. To learn more, we recommend:

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Author

Rajat Mathur

Rajat is a Principal Solutions Architect at Amazon Web Services. Rajat is a passionate technologist who enjoys building innovative solutions for AWS customers. His core areas of focus are IoT, Networking, and Serverless computing. In his spare time, Rajat enjoys long drives, traveling, and spending time with family.

Pronoy Chopra

Pronoy Chopra

Pronoy is a Senior Solutions Architect with the Startups Generative AI team at AWS. He specializes in architecting and developing IoT and Machine Learning solutions. He has co-founded two startups and enjoys being hands-on with projects in the IoT, AI/ML and Serverless domain. His work in Magnetoencephalography has been cited many times in the effort to build better brain-compute interfaces.

Syed Sanoor

Syed Sanoor

Syed serves as a Solutions Architect, assisting customers in the enterprise sector. With a foundation in software engineering, he takes pleasure in crafting solutions tailored to client needs. His expertise predominantly lies in C# and IoT. During his leisure time, Syed enjoys piloting drones and playing cricket.

Dynamic DAG generation with YAML and DAG Factory in Amazon MWAA

Post Syndicated from Jayesh Shinde original https://aws.amazon.com/blogs/big-data/dynamic-dag-generation-with-yaml-and-dag-factory-in-amazon-mwaa/

Amazon Managed Workflow for Apache Airflow (Amazon MWAA) is a managed service that allows you to use a familiar Apache Airflow environment with improved scalability, availability, and security to enhance and scale your business workflows without the operational burden of managing the underlying infrastructure. In Airflow, Directed Acyclic Graphs (DAGs) are defined as Python code. Dynamic DAGs refer to the ability to generate DAGs on the fly during runtime, typically based on some external conditions, configurations, or parameters. Dynamic DAGs helps you to create, schedule, and run tasks within a DAG based on data and configurations that may change over time.

There are various ways to introduce dynamism in Airflow DAGs (dynamic DAG generation) using environment variables and external files. One of the approaches is to use the DAG Factory YAML based configuration file method. This library aims to facilitate the creation and configuration of new DAGs by using declarative parameters in YAML. It allows default customizations and is open-source, making it simple to create and customize new functionalities.

In this post, we explore the process of creating Dynamic DAGs with YAML files, using the DAG Factory library. Dynamic DAGs offer several benefits:

  1. Enhanced code reusability – By structuring DAGs through YAML files, we promote reusable components, reducing redundancy in your workflow definitions.
  2. Streamlined maintenance – YAML-based DAG generation simplifies the process of modifying and updating workflows, ensuring smoother maintenance procedures.
  3. Flexible parameterization – With YAML, you can parameterize DAG configurations, facilitating dynamic adjustments to workflows based on varying requirements.
  4. Improved scheduler efficiency – Dynamic DAGs enable more efficient scheduling, optimizing resource allocation and enhancing overall workflow runs
  5. Enhanced scalability – YAML-driven DAGs allow for parallel runs, enabling scalable workflows capable of handling increased workloads efficiently.

By harnessing the power of YAML files and the DAG Factory library, we unleash a versatile approach to building and managing DAGs, empowering you to create robust, scalable, and maintainable data pipelines.

Overview of solution

In this post, we will use an example DAG file that is designed to process a COVID-19 data set. The workflow process involves processing an open source data set offered by WHO-COVID-19-Global. After we install the DAG-Factory Python package, we create a YAML file that has definitions of various tasks. We process the country-specific death count by passing Country as a variable, which creates individual country-based DAGs.

The following diagram illustrates the overall solution along with data flows within logical blocks.

Overview of the Solution

Prerequisites

For this walkthrough, you should have the following prerequisites:

Additionally, complete the following steps (run the setup in an AWS Region where Amazon MWAA is available):

  1. Create an Amazon MWAA environment (if you don’t have one already). If this is your first time using Amazon MWAA, refer to Introducing Amazon Managed Workflows for Apache Airflow (MWAA).

Make sure the AWS Identity and Access Management (IAM) user or role used for setting up the environment has IAM policies attached for the following permissions:

The access policies mentioned here are just for the example in this post. In a production environment, provide only the needed granular permissions by exercising least privilege principles.

  1. Create an unique (within an account) Amazon S3 bucket name while creating your Amazon MWAA environment, and create folders called dags and requirements.
    Amazon S3 Bucket
  2. Create and upload a requirements.txt file with the following content to the requirements folder. Replace {environment-version} with your environment’s version number, and {Python-version} with the version of Python that’s compatible with your environment:
    --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-{Airflow-version}/constraints-{Python-version}.txt"
    dag-factory==0.19.0
    pandas==2.1.4

Pandas is needed just for the example use case described in this post, and dag-factory is the only required plug-in. It is recommended to check the compatibility of the latest version of dag-factory with Amazon MWAA. The boto and psycopg2-binary libraries are included with the Apache Airflow v2 base install and don’t need to be specified in your requirements.txt file.

  1. Download the WHO-COVID-19-global data file to your local machine and upload it under the dags prefix of your S3 bucket.

Make sure that you are pointing to the latest AWS S3 bucket version of your requirements.txt file for the additional package installation to happen. This should typically take between 15 – 20 minutes depending on your environment configuration.

Validate the DAGs

When your Amazon MWAA environment shows as Available on the Amazon MWAA console, navigate to the Airflow UI by choosing Open Airflow UI next to your environment.

Validate the DAG

Verify the existing DAGs by navigating to the DAGs tab.

Verify the DAG

Configure your DAGs

Complete the following steps:

  1. Create empty files named dynamic_dags.yml, example_dag_factory.py and process_s3_data.py on your local machine.
  2. Edit the process_s3_data.py file and save it with following code content, then upload the file back to the Amazon S3 bucket dags folder. We are doing some basic data processing in the code:
    1. Read the file from an Amazon S3 location
    2. Rename the Country_code column as appropriate to the country.
    3. Filter data by the given country.
    4. Write the processed final data into CSV format and upload back to S3 prefix.
import boto3
import pandas as pd
import io
   
def process_s3_data(COUNTRY):
### Top level Variables replace S3_BUCKET with your bucket name ###
    s3 = boto3.client('s3')
    S3_BUCKET = "my-mwaa-assets-bucket-sfj33ddkm"
    INPUT_KEY = "dags/WHO-COVID-19-global-data.csv"
    OUTPUT_KEY = "dags/count_death"
### get csv file ###
   response = s3.get_object(Bucket=S3_BUCKET, Key=INPUT_KEY)
   status = response['ResponseMetadata']['HTTPStatusCode']
   if status == 200:
### read csv file and filter based on the country to write back ###
       df = pd.read_csv(response.get("Body"))
       df.rename(columns={"Country_code": "country"}, inplace=True)
       filtered_df = df[df['country'] == COUNTRY]
       with io.StringIO() as csv_buffer:
                   filtered_df.to_csv(csv_buffer, index=False)
                   response = s3.put_object(
                       Bucket=S3_BUCKET, Key=OUTPUT_KEY + '_' + COUNTRY + '.csv', Body=csv_buffer.getvalue()
                   )
       status = response['ResponseMetadata']['HTTPStatusCode']
       if status == 200:
           print(f"Successful S3 put_object response. Status - {status}")
       else:
           print(f"Unsuccessful S3 put_object response. Status - {status}")
   else:
       print(f"Unsuccessful S3 get_object response. Status - {status}")
  1. Edit the dynamic_dags.yml and save it with the following code content, then upload the file back to the dags folder. We are stitching various DAGs based on the country as follows:
    1. Define the default arguments that are passed to all DAGs.
    2. Create a DAG definition for individual countries by passing op_args
    3. Map the process_s3_data function with python_callable_name.
    4. Use Python Operator to process csv file data stored in Amazon S3 bucket.
    5. We have set schedule_interval as 10 minutes, but feel free to adjust this value as needed.
default:
  default_args:
    owner: "airflow"
    start_date: "2024-03-01"
    retries: 1
    retry_delay_sec: 300
  concurrency: 1
  max_active_runs: 1
  dagrun_timeout_sec: 600
  default_view: "tree"
  orientation: "LR"
  schedule_interval: "*/10 * * * *"
 
module3_dynamic_dag_Australia:
  tasks:
    task_process_s3_data:
      task_id: process_s3_data
      operator: airflow.operators.python.PythonOperator
      python_callable_name: process_s3_data
      python_callable_file: /usr/local/airflow/dags/process_s3_data.py
      op_args:
        - "Australia"
 
module3_dynamic_dag_Brazil:
  tasks:
    task_process_s3_data:
      task_id: process_s3_data
      operator: airflow.operators.python.PythonOperator
      python_callable_name: process_s3_data
      python_callable_file: /usr/local/airflow/dags/process_s3_data.py
      op_args:
        - "Brazil"
 
module3_dynamic_dag_India:
  tasks:
    task_process_s3_data:
      task_id: process_s3_data
      operator: airflow.operators.python.PythonOperator
      python_callable_name: process_s3_data
      python_callable_file: /usr/local/airflow/dags/process_s3_data.py
      op_args:
        - "India"
 
module3_dynamic_dag_Japan:
  tasks:
    task_process_s3_data:
      task_id: process_s3_data
      operator: airflow.operators.python.PythonOperator
      python_callable_name: process_s3_data
      python_callable_file: /usr/local/airflow/dags/process_s3_data.py
      op_args:
        - "Japan"
 
module3_dynamic_dag_Mexico:
  tasks:
    task_process_s3_data:
      task_id: process_s3_data
      operator: airflow.operators.python.PythonOperator
      python_callable_name: process_s3_data
      python_callable_file: /usr/local/airflow/dags/process_s3_data.py
      op_args:
        - "Mexico"
 
module3_dynamic_dag_Russia:
  tasks:
    task_process_s3_data:
      task_id: process_s3_data
      operator: airflow.operators.python.PythonOperator
      python_callable_name: process_s3_data
      python_callable_file: /usr/local/airflow/dags/process_s3_data.py
      op_args:
        - "Russia"
 
module3_dynamic_dag_Spain:
  tasks:
    task_process_s3_data:
      task_id: process_s3_data
      operator: airflow.operators.python.PythonOperator
      python_callable_name: process_s3_data
      python_callable_file: /usr/local/airflow/dags/process_s3_data.py
      op_args:
        - "Spain"
  1. Edit the file example_dag_factory.py and save it with the following code content, then upload the file back to dags folder. The code cleans the existing the DAGs and generates clean_dags() method and the creating new DAGs using the generate_dags() method from the DagFactory instance.
from airflow import DAG
import dagfactory
  
config_file = "/usr/local/airflow/dags/dynamic_dags.yml"
example_dag_factory = dagfactory.DagFactory(config_file)
  
## to clean up or delete any existing DAGs ##
example_dag_factory.clean_dags(globals())
## generate and create new DAGs ##
example_dag_factory.generate_dags(globals())
  1. After you upload the files, go back to the Airflow UI console and navigate to the DAGs tab, where you will find new DAGs.
    List the new DAGs
  2. Once you upload the files, go back to the Airflow UI console and under the DAGs tab you will find new DAGs are appearing as shown below:DAGs

You can enable DAGs by making them active and testing them individually. Upon activation, an additional CSV file named count_death_{COUNTRY_CODE}.csv is generated in the dags folder.

Cleaning up

There may be costs associated with using the various AWS services discussed in this post. To prevent incurring future charges, delete the Amazon MWAA environment after you have completed the tasks outlined in this post, and empty and delete the S3 bucket.

Conclusion

In this blog post we demonstrated how to use the dag-factory library to create dynamic DAGs. Dynamic DAGs are characterized by their ability to generate results with each parsing of the DAG file based on configurations. Consider using dynamic DAGs in the following scenarios:

  • Automating migration from a legacy system to Airflow, where flexibility in DAG generation is crucial
  • Situations where only a parameter changes between different DAGs, streamlining the workflow management process
  • Managing DAGs that are reliant on the evolving structure of a source system, providing adaptability to changes
  • Establishing standardized practices for DAGs across your team or organization by creating these blueprints, promoting consistency and efficiency
  • Embracing YAML-based declarations over complex Python coding, simplifying DAG configuration and maintenance processes
  • Creating data driven workflows that adapt and evolve based on the data inputs, enabling efficient automation

By incorporating dynamic DAGs into your workflow, you can enhance automation, adaptability, and standardization, ultimately improving the efficiency and effectiveness of your data pipeline management.

To learn more about Amazon MWAA DAG Factory, visit Amazon MWAA for Analytics Workshop: DAG Factory. For additional details and code examples on Amazon MWAA, visit the Amazon MWAA User Guide and the Amazon MWAA examples GitHub repository.


About the Authors

 Jayesh Shinde is Sr. Application Architect with AWS ProServe India. He specializes in creating various solutions that are cloud centered using modern software development practices like serverless, DevOps, and analytics.

Harshd Yeola is Sr. Cloud Architect with AWS ProServe India helping customers to migrate and modernize their infrastructure into AWS. He specializes in building DevSecOps and scalable infrastructure using containers, AIOPs, and AWS Developer Tools and services.

Quickly go from Idea to PR with CodeCatalyst using Amazon Q

Post Syndicated from Brendan Jenkins original https://aws.amazon.com/blogs/devops/quickly-go-from-idea-to-pr-with-codecatalyst-using-amazon-q/

Amazon Q feature development enables teams using Amazon CodeCatalyst to scale with AI to assist developers in completing everyday software development tasks. Developers can now go from an idea in an issue to a fully tested, merge-ready, running application code in a Pull Request (PR) with natural language inputs in a few clicks. Developers can also provide feedback to Amazon Q directly on the published pull request and ask it to generate a new revision. If the code change falls short of expectations, a new development environment can be created directly from the pull request, necessary adjustments can be made manually, a new revision published, and proceed with the merge upon approval.

In this blog, we will walk through a use case leveraging the Modern three-tier web application blueprint, and adding a feature to the web application. We’ll leverage Amazon Q feature development to quickly go from Idea to PR. We also suggest following the steps outlined below in this blog in your own application so you can gain a better understanding of how you can use this feature in your daily work.

Solution Overview

Amazon Q feature development is integrated into CodeCatalyst. Figure 1 details how users can assign Amazon Q an issue. When assigning the issue, users answer a few preliminary questions and Amazon Q outputs the proposed approach, where users can either approve or provide additional feedback to Amazon Q. Once approved, Amazon Q will generate a PR where users can review, revise, and merge the PR into the repository.

Figure 1: Amazon Q feature development workflow

Figure 1: Amazon Q feature development workflow

Prerequisites

Although we will walk through a sample use case in this blog using a Blueprint from CodeCatalyst, after, we encourage you to try this with your own application so you can gain hands-on experience with utilizing this feature. If you are using CodeCatalyst for the first time, you’ll need:

Walkthrough

Step 1: Creating the blueprint

In this blog, we’ll leverage the Modern three-tier web application blueprint to walk through a sample use case. This blueprint creates a Mythical Mysfits three-tier web application with modular presentation, application, and data layers.

Figure 2: Creating a new Modern three-tier application blueprint

Figure 2: Creating a new Modern three-tier application blueprint

First, within your space click “Create Project” and select the Modern three-tier web application CodeCatalyst Blueprint as shown above in Figure 2.

Enter a Project name and select: Lambda for the Compute Platform and Amplify Hosting for Frontend Hosting Options. Additionally, ensure your AWS account is selected along with creating a new IAM Role.

Once the project is finished creating, the application will deploy via a CodeCatalyst workflow, assuming the AWS account and IAM role were setup correctly. The deployed application will be similar to the Mythical Mysfits website.

Step 2: Create a new issue

The Product Manager (PM) has asked us to add a feature to the newly created application, which entails creating the ability to add new mythical creatures. The PM has provided a detailed description to get started.

In the Issues section of our new project, click Create Issue

For the Issue title, enter “Ability to add a new mythical creature” and for the Description enter “Users should be able to add a new mythical creature to the website. There should be a new Add button on the UI, when prompted should allow the user to fill in Name, Age, Description, Good/Evil, Lawful/Chaotic, Species, Profile Image URI and thumbnail Image URI for the new creature. When the user clicks save, the application should leverage the existing API in app.py to save the new creature to the DynamoDB table.”

Furthermore, click Assign to Amazon Q as shown below in Figure 3.

Figure 3: Assigning a new issue to Amazon Q

Figure 3: Assigning a new issue to Amazon Q

Lastly, enable the Require Amazon Q to stop after each step and await review of its work. In this use case, we do not anticipate having any changes to our workflow files to support this new feature so we will leave the Allow Amazon Q to modify workflow files disabled as shown below in Figure 4. Click Create Issue and Amazon Q will get started.

Figure 4: Configurations for assigning Amazon Q

Figure 4: Configurations for assigning Amazon Q

Step 3: Review Amazon Qs Approach

After a few minutes, Amazon Q will generate its understanding of the project in the Background section as well as an Approach to make the changes for the issue you created as show in Figure 5 below

(**Note: The Background and Approach generated for you may be different than what is shown in Figure 5 below).

We have the option to proceed as is or can reply to the Approach via a Comment to provide feedback so Amazon Q can refine it to align better with the use case.

Figure 5: Reviewing Amazon Qs Background and Approach

Figure 5: Reviewing Amazon Qs Background and Approach

In the approach, we notice Amazon Q is suggesting it will create a new method to create and save the new item to the table, but we already have an existing method. We decide to leave feedback as show in Figure 6 letting Amazon Q know the existing method should be leveraged.

Figure 6: Provide feedback to Approach

Figure 6: Provide feedback to Approach

Amazon Q will now refine the approach based on the feedback provided. The refined approach generated by Amazon Q meets our requirements, including unit tests, so we decide to click Proceed as shown in Figure 7 below.

Figure 7: Confirm approach and click Proceed

Figure 7: Confirm approach and click Proceed

Now, Amazon Q will generate the code for implementation & create a PR with code changes that can be reviewed.

Step 4: Review the PR

Within our project, under Code on the left panel click on Pull requests. You should see the new PR created by Amazon Q.

The PR description contains the approach that Amazon Q took to generate the code. This is helpful to reviewers who want to gain a high-level understanding of the changes included in the PR before diving into the details. You will also be able to review all changes made to the code as shown below in Figure 8.

Figure 8: Changes within PR

Figure 8: Changes within PR

Step 5 (Optional): Provide feedback on PR

After reviewing the changes in the PR, I leave comments on a few items that can be improved. Notably, all fields on the new input form for creating a new creature should be required. After I complete leaving comments, I hit the Create Revision button. Amazon Q will take my comments, update the code accordingly and create a new revision of the PR as shown in Figure 9 below.

Figure 9: PR Revision created

Figure 9: PR Revision created.

After reviewing the latest revision created by Amazon Q, I am happy with the changes and proceed with testing the changes directly from CodeCatalyst by utilizing Dev Environments. Once I have completed testing of the new feature and everything works as expected, we will let our peers review the PR to provide feedback and approve the pull request.

As part of following the steps in this blog post, if you upgraded your Space to Standard or Enterprise tier, please ensure you downgrade to the Free tier to avoid any unwanted additional charges. Additionally, delete the project and any associated resources deployed in the walkthrough.

Unassign Amazon Q from any issues no longer being worked on. If Amazon Q has finished its work on an issue or could not find a solution, make sure to unassign Amazon Q to avoid reaching the maximum quota for generative AI features. For more information, see Managing generative AI features and Pricing.

Best Practices for using Amazon Q Feature Development

You can follow a few best practices to ensure you experience the best results when using Amazon Q feature development:

  1. When describing your feature or issue, provide as much context as possible to get the best result from Amazon Q. Being too vague or unclear may not produce ideal results for your use case.
  2. Changes and new features should be as focused as possible. You will likely not experience the best results when making large and complex changes in a single issue. Instead, break the changes or feature up into smaller, more manageable issues where you will see better results.
  3. Leverage the feedback feature to practice giving input on approaches Amazon Q takes to ensure it gets to a similar outcome as highlighted in the blog.

Conclusion

In this post, you’ve seen how you can quickly go from Idea to PR using the Amazon Q Feature development capability in CodeCatalyst. You can leverage this new feature to start building new features in your applications. Check out Amazon CodeCatalyst feature development today.

About the authors

Brent Everman

Brent is a Senior Technical Account Manager with AWS, based out of Pittsburgh. He has over 17 years of experience working with enterprise and startup customers. He is passionate about improving the software development experience and specializes in AWS’ Next Generation Developer Experience services.

Brendan Jenkins

Brendan Jenkins is a Solutions Architect at Amazon Web Services (AWS) working with Enterprise AWS customers providing them with technical guidance and helping achieve their business goals. He has an area of specialization in DevOps and Machine Learning technology.

Fahim Sajjad

Fahim is a Solutions Architect at Amazon Web Services. He helps customers transform their business by helping in designing their cloud solutions and offering technical guidance. Fahim graduated from the University of Maryland, College Park with a degree in Computer Science. He has deep interested in AI and Machine learning. Fahim enjoys reading about new advancements in technology and hiking.

Abdullah Khan

Abdullah is a Solutions Architect at AWS. He attended the University of Maryland, Baltimore County where he earned a degree in Information Systems. Abdullah currently helps customers design and implement solutions on the AWS Cloud. He has a strong interest in artificial intelligence and machine learning. In his spare time, Abdullah enjoys hiking and listening to podcasts.

Architecting for Disaster Recovery on AWS Outposts Racks with AWS Elastic Disaster Recovery

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/architecting-for-disaster-recovery-on-aws-outposts-racks-with-aws-elastic-disaster-recovery/

This blog post is written by Brianna Rosentrater, Hybrid Edge Specialist SA.

AWS Elastic Disaster Recovery Service (AWS DRS) now supports disaster recovery (DR) architectures that include on-premises Windows and Linux workloads running on AWS Outposts. AWS DRS minimizes downtime and data loss with fast, reliable recovery of on-premises and cloud-based applications using affordable storage, minimal compute, and point-in-time recovery. Both services are billed and managed from your AWS Management Console.

Like workloads running in AWS Regions, it’s critical to plan for failures. Outposts are designed with resiliency in mind, providing redundant power, networking, and are available to order with N+M active compute instance capacity. In other words, for every physical N compute servers, you have the option of including M redundant hosts capable of handling the workload during a failure. When leveraging AWS DRS with Outpost, you can plan for larger-scale failure modes, such as data center outages, by replicating mission-critical workloads to other remote data center locations or the AWS Region.

In this post, you’ll learn how AWS DRS can be used with Outpost rack to architect for high availability in the event of a site failure. The post will examine several different architectures enabled by AWS DRS that provide DR for Outpost, and the benefits of each method described.

Prerequisites

Each of these architectures described below need the following:

Public internet access isn’t needed, AWS PrivateLink and AWS Direct Connect are supported for replication and failback which is a significant security benefit.

Planning for failure

Disasters come in many forms and are often unplanned and unexpected events. Regardless of whether your workload resides on premises, in a colocation facility, or in an AWS Region, it’s critical to define the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) which are often workload-specific. These two metrics profile how long a service can be down during recovery and quantify the acceptable amount of data loss. RTO and RPO guide you in choosing the appropriate strategy such as backup and recovery, pilot light, warm standby, or a multi-site (active-active) approach.

With AWS DRS, while failing back to a test machine (not the original source server), replication of the source server continues. This allows failback drills without impacting RPO, and non-disruptive failback drills are an important part of disaster planning to validate your recovery plan meets your expected RPO/RTO as per your business requirements.

How AWS DRS integrates with Outpost

AWS DRS uses an AWS Replication Agent at the source to capture the workload and transfer it to a lightweight staging area, which resides on an Outpost equipped with Amazon S3 on Outposts. This method also provides the ability to perform low-effort, non-disruptive DR drills before making the final cutover. The AWS Replication Agent doesn’t need a reboot nor does it impact your applications during installation.

When an Outpost’s subnet is selected as the target for replication or launch, all associated AWS DRS components remain within the Outpost, including the AWS DRS server conversion technology. These conversion servers convert source disks of servers being migrated so that they can boot and run in the target infrastructure, Amazon EBS volumes, snapshots, and replication servers. The replication servers replicate the disks to the target infrastructure. With AWS DRS you can control the data replication path using private connectivity options such as a virtual private network (VPN), AWS Direct Connect, VPC peering, or another private connection. Learn more about using a private IP for data replication.

AWS DRS provides nearly continuous replication for mission-critical workloads and supports deployment patterns including on-premises to Outpost, Outpost to Region, Region to Outpost, and between two logical Outposts through local networks. To leverage Outpost with AWS DRS, simply select the Outpost subnet as your target or source for replication when configuring AWS DRS for your workload. If you are currently using CloudEndure DR for disaster recovery with Outpost, see these detailed instructions for migrating to AWS DRS from CloudEndure DR.

DR from on-premises to Outpost

Outpost can be used as a DR target for on-premises workloads. By deploying an Outpost in a remote data center or colocation a significant distance from the source within the same geo-political boundary, you can replicate workloads across great distances and increase resiliency of the data while ensuring adherence to data residency policies or legislation.

DR from on-premises to Outposts

Figure 1 – DR from on-premises to Outposts

In Figure 1, on premises sources replicate traffic from a LAN to a staging area residing in an Outpost subnet via the local gateway. This allows workloads to failover from their on-premises environment to an Outpost in a different physical location during a disaster.

The staging areas and replication servers run on Amazon Elastic Compute Cloud (Amazon EC2) with Amazon EBS volumes and require Amazon S3 on Outposts where the Amazon EBS snapshots reside.

The replication agent is responsible for providing nearly continuous, block-level replication from your LAN using TCP/1500 with traffic routing to Amazon EC2 instances using the Outposts local gateway.

DR from Outpost to Region

Since its initial release, Outpost has supported Amazon EBS snapshots written to Amazon S3 located in the AWS Region. Backup to an AWS Region is one of the most cost-effective and easiest-to-configure DR approaches, enabling data redundancy outside of your Outpost and data center.

This method also offers flexibility for restoration within an AWS Region if the original deployment is irrecoverable. However, depending on the frequency of the snapshots and the timing of the failure, backup, and recovery to the Region has the potential to have an RPO/RTO spanning hours depending on the throughput of the service link.

For critical workloads, AWS DRS can reduce RTO to minutes and RPO in the sub-second range. After creating an initial replication of workloads that reside on the Outpost, AWS DRS provides nearly continuous, block-level replication in the Region. Just like replication from non-AWS virtual machines or bare metal servers, AWS DRS resources, including Replication Servers, Conversion Servers, Amazon EBS Volumes, and Snapshots reside in the Region.

DR from Outpost to Region

Figure 2 – DR from Outpost to Region

In Figure 2, data replication is performed over the service link from Amazon EC2 instances running locally on an Outpost to an AWS Region. The service link traverses either public Region connectivity or AWS Direct Connect.

AWS Direct Connect is the recommended option because it provides low latency and consistent bandwidth for the service link back to a Region, which also improves the reliability of transmission for AWS DRS replication traffic.

The service link is comprised of redundant, encrypted VPN tunnels. Replication traffic can also be sent privately without traversing the public internet by leveraging Private Virtual Interfaces with Direct Connect for the service link.

With this architecture in place, you can mitigate disasters and reduce downtime by failing over to the AWS Region using AWS DRS.

DR from Region to Outpost

AWS provides multiple Availability Zones (AZs) within a Region and isolated AWS Regions globally for the greatest possible fault tolerance and stability. The reliability pillar of AWS’s Well-Architected Framework encourages distributing workloads across AZs and replicating data between Regions when the need for distances exceeds those of AZs.

AWS DRS supports nearly continuous replication of workloads from a Region to an Outpost within your data center or colocation facility for DR. This deployment model provides increased durability from a source AWS Region to an Outpost anchored to a different Region.

In this model, AWS DRS components remain on-premises within the Outpost, but data charges are applicable as data egresses from the Region back to the data center and Amazon S3 on Outposts is required on the destination Outpost.

DR from Region to Outpost

Figure 3 – DR from Region to Outpost

Implementing the preceding architecture diagram enables failover of critical workloads from the Region to on-premises Outposts seamlessly. Keep in mind that AWS Regions provide the management and control plane for Outpost, making it critical to consider probability and frequency of service link interruptions as a part of your DR planning. Scenarios such as warm standby with pre-allocated Amazon EC2 and Amazon EBS resources may prove more resilient during service link disruptions.

DR between two Outposts

Each logical Outpost is comprised of one or more physical racks. Logical Outposts are in independent colocations of one another, and support deployments in disparate data centers or colocation facilities. You can elect to have multiple logical Outposts anchored to different Availability Zones or Regions. AWS DRS unlocks options for replication between two logical Outposts, leading to increased resiliency and reducing the impact of your data center as a single point of failure. In the following architecture, nearly continuous replication captured from a single Outpost source is applied at a second logical Outpost.

DR between two Outposts

Figure 4 – DR between two Outposts

Supporting both directional and bidirectional replication between Outposts can minimize disruption caused by events that take down a data center, Availability Zone, or even the entire Region result in minimal disruption. In the following architecture diagram, bidirectional data replication occurs between the Outposts by routing traffic via the local gateways, minimizing outbound data charges from the Region and allowing for more direct routing between deployment sites that could potentially span significant distances. AWS DRS cannot communicate with resources directly utilizing a customer-owned IP address pool (CoIP pool).

Figure 5 – DR between two Outposts – bidirectional 

Architecture Considerations

When planning an Outpost deployment leveraging AWS DRS, it’s critical to consider the impact on storage. As a general best practice, AWS recommends planning for a 2:1 ratio consisting of EBS volumes used for nearly continuous replication and Amazon EBS snapshots on Amazon S3 for point-in-time recovery. While it’s unlikely that all servers would need recovery simultaneously, it’s also important to allocate a reserve of EBS volume capacity, which will launch at the time of recovery. Amazon S3 on Outpost is needed for each Outpost used as a replication destination, and the recommendation is to plan for a 1:1 ratio consisting of S3 on Outposts storage, plus the rate of data change. For example, if your data change rate is 10%, you’d want to plan for 110% S3 on Outpost use with AWS DRS.

Amazon CloudWatch has integrated metrics for EC2, Amazon EBS, and Amazon S3 capacity on Outposts, making it easy to create custom tailored dashboards and integrate with Simple Notification Service (Amazon SNS) for alerts at defined thresholds. Monitoring these metrics is critical in making sure that proper free space is available for data replication to occur unimpeded. CloudWatch has metrics available for AWS DRS as well. You can also use the AWS DRS service page in the AWS Console to monitor the status of your recovery instances.

Consider taking advantage of Recovery Plans within AWS DRS to make sure that related services are recovered in a particular order. For example, during a disaster, it might be critical to first bring up a database before recovering application tiers. Recovery plans provide the ability to group related services and apply wait times to individual targets.

Conclusion

AWS Outpost enables low latency, data residency, or data gravity-constrained workloads by supplying managed cloud compute and storage services within your data center or colocation. When coupled with AWS DRS, you can decrease RPO and RTO through a variety of flexible deployment models with sources and destinations ranging from on-premises, the Region, or another AWS Outpost.

Serverless IoT email capture, attachment processing, and distribution

Post Syndicated from Stacy Conant original https://aws.amazon.com/blogs/messaging-and-targeting/serverless-iot-email-capture-attachment-processing-and-distribution/

Many customers need to automate email notifications to a broad and diverse set of email recipients, sometimes from a sensor network with a variety of monitoring capabilities. Many sensor monitoring software products include an SMTP client to achieve this goal. However, managing email server infrastructure requires specialty expertise and operating an email server comes with additional cost and inherent risk of breach, spam, and storage management. Organizations also need to manage distribution of attachments, which could be large and potentially contain exploits or viruses. For IoT use cases, diagnostic data relevance quickly expires, necessitating retention policies to regularly delete content.

Solution Overview

This solution uses the Amazon Simple Email Service (SES) SMTP interface to receive SMTP client messages, and processes the message to replace an attachment with a pre-signed URL in the resulting email to its intended recipients. Attachments are stored separately in an Amazon Simple Storage Service (S3) bucket with a lifecycle policy implemented. This reduces the storage requirements of recipient email server receiving notification emails. Additionally, this solution leverages built-in anti-spam and security scanning capabilities to deal with spam and potentially malicious attachments while at the same time providing the mechanism by which pre-signed attachment links can be revoked should the emails be distributed to unintended recipients.

The solution uses:

  • Amazon SES SMTP interface to receive incoming emails.
  • Amazon SES receipt rule on a (sub)domain controlled by administrators, to store raw incoming emails in an Amazon S3 bucket.
  • AWS Lambda function, triggered on S3 ObjectCreated event, to process raw emails, extract attachments, replace each with pre-signed URL with configurable expiry, and send the processed emails to intended recipients.

Solution Flow Details:

  1. SMTP client transmits email content to an email address in a (sub) domain with MX record set to Amazon SES service’s regional endpoint.
  2. Amazon SES SMTP interface receives an email and forwards it to SES Receipt Rule(s) for processing.
  3. A matching Amazon SES Receipt Rule saves incoming email into an Amazon S3 Bucket.
  4. Amazon S3 Bucket emits an S3 ObjectCreated Event, and places the event onto the Amazon Simple Queue Services (SQS) queue.
  5. The AWS Lambda service polls the inbound messages’ SQS queue and feeds events to the Lambda function.
  6. The Lambda function, retrieves email files from the S3 bucket, parses the email sender/subject/body, saves attachments to a separate attachment S3 bucket (7), and replaces attachments with pre-signed URLs in the email body. The Lambda function then extracts intended recipient addresses from the email body. If the body contains properly formatted recipients list, email is then sent using SES API (9), otherwise a notice is posted to a fallback Amazon Simple Notification Service (SNS) Topic (8).
  7. The Lambda function saves extracted attachments, if any, into an attachments bucket.
  8. Malformed email notifications are posted to a fallback Amazon SNS Topic.
  9. The Lambda function invokes Amazon SES API to send the processed email to all intended recipient addresses.
  10. If the Lambda function is unable to process email successfully, the inbound message is placed on to the SQS dead-letter queue (DLQ) queue for later intervention by the operator.
  11. SES delivers an email to each recipients’ mail server.
  12. Intended recipients download emails from their corporate mail servers and retrieve attachments from the S3 pre-signed URL(s) embedded in the email body.
  13. An alarm is triggered and a notification is published to Amazon SNS Alarms Topic whenever:
    • More than 50 failed messages are in the DLQ.
    • Oldest message on incoming SQS queue is older than 3 minutes – unable to keep up with inbound messages (flooding).
    • The incoming SQS queue contains over 180 messages (configurable) over 5 minutes old.

Setting up Amazon SES

For this solution you will need an email account where you can receive emails. You’ll also need a (sub)domain for which you control the mail exchanger (MX) record. You can obtain your (sub)domain either from Amazon Route53 or another domain hosting provider.

Verify the sender email address

You’ll need to follow the instructions to Verify an email address for all identities that you use as “From”, “Source”, ” Sender”, or “Return-Path” addresses. You’ll also need to follow these instructions for any identities you wish to send emails to during initial testing while your SES account is in the “Sandbox” (see next “Moving out of the SES Sandbox” section).

Moving out of the SES Sandbox

Amazon SES accounts are “in the Sandbox” by default, limiting email sending only to verified identities. AWS does this to prevent fraud and abuse as well as protecting your reputation as an email sender. When your account leaves the Sandbox, SES can send email to any recipient, regardless of whether the recipient’s address or domain is verified by SES. However, you still have to verify all identities that you use as “From”, “Source”, “Sender”, or “Return-Path” addresses.
Follow the Moving out of the SES Sandbox instructions in the SES Developer Guide. Approval is usually within 24 hours.

Set up the SES SMTP interface

Follow the workshop lab instructions to set up email sending from your SMTP client using the SES SMTP interface. Once you’ve completed this step, your SMTP client can open authenticated sessions with the SES SMTP interface and send emails. The workshop will guide you through the following steps:

  1. Create SMTP credentials for your SES account.
    • IMPORTANT: Never share SMTP credentials with unauthorized individuals. Anyone with these credentials can send as many SMTP requests and in whatever format/content they choose. This may result in end-users receiving emails with malicious content, administrative/operations overload, and unbounded AWS charges.
  2. Test your connection to ensure you can send emails.
  3. Authenticate using the SMTP credentials generated in step 1 and then send a test email from an SMTP client.

Verify your email domain and bounce notifications with Amazon SES

In order to replace email attachments with a pre-signed URL and other application logic, you’ll need to set up SES to receive emails on a domain or subdomain you control.

  1. Verify the domain that you want to use for receiving emails.
  2. Publish a mail exchanger record (MX record) and include the Amazon SES inbound receiving endpoint for your AWS region ( e.g. inbound-smtp.us-east-1.amazonaws.com for US East Northern Virginia) in the domain DNS configuration.
  3. Amazon SES automatically manages the bounce notifications whenever recipient email is not deliverable. Follow the Set up notifications for bounces and complaints guide to setup bounce notifications.

Deploying the solution

The solution is implemented using AWS CDK with Python. First clone the solution repository to your local machine or Cloud9 development environment. Then deploy the solution by entering the following commands into your terminal:

python -m venv .venv
. ./venv/bin/activate
pip install -r requirements.txt

cdk deploy \
--context SenderEmail=<verified sender email> \
 --context RecipientEmail=<recipient email address> \
 --context ConfigurationSetName=<configuration set name>

Note:

The RecipientEmail CDK context parameter in the cdk deploy command above can be any email address in the domain you verified as part of the Verify the domain step. In other words, if the verified domain is acme-corp.com, then the emails can be [email protected], [email protected], etc.

The ConfigurationSetName CDK context can be obtained by navigating to Identities in Amazon SES console, selecting the verified domain (same as above), switching to “Configuration set” tab and selecting name of the “Default configuration set”

After deploying the solution, please, navigate to Amazon SES Email receiving in AWS console, edit the rule set and set it to Active.

Testing the solution end-to-end

Create a small file and generate a base64 encoding so that you can attach it to an SMTP message:

echo content >> demo.txt
cat demo.txt | base64 > demo64.txt
cat demo64.txt

Install openssl (which includes an SMTP client capability) using the following command:

sudo yum install openssl

Now run the SMTP client (openssl is used for the proof of concept, be sure to complete the steps in the workshop lab instructions first):

openssl s_client -crlf -quiet -starttls smtp -connect email-smtp.<aws-region>.amazonaws.com:587

and feed in the commands (replacing the brackets [] and everything between them) to send the SMTP message with the attachment you created.

EHLO amazonses.com
AUTH LOGIN
[base64 encoded SMTP user name]
[base64 encoded SMTP password]
MAIL FROM:[VERIFIED EMAIL IN SES]
RCPT TO:[VERIFIED EMAIL WITH SES RECEIPT RULE]
DATA
Subject: Demo from openssl
MIME-Version: 1.0
Content-Type: multipart/mixed;
 boundary="XXXXboundary text"

This is a multipart message in MIME format.

--XXXXboundary text
Content-Type: text/plain

Line1:This is a Test email sent to coded list of email addresses using the Amazon SES SMTP interface from openssl SMTP client.
Line2:Email_Rxers_Code:[ANYUSER1@DOMAIN_A,ANYUSER2@DOMAIN_B,ANYUSERX@DOMAIN_Y]:Email_Rxers_Code:
Line3:Last line.

--XXXXboundary text
Content-Type: text/plain;
Content-Transfer-Encoding: Base64
Content-Disposition: attachment; filename="demo64.txt"
Y29udGVudAo=
--XXXXboundary text
.
QUIT

Note: For base64 SMTP username and password above, use values obtained in Set up the SES SMTP interface, step 1. So for example, if the username is AKZB3LJAF5TQQRRPQZO1, then you can obtain base64 encoded value using following command:

echo -n AKZB3LJAF5TQQRRPQZO1 |base64
QUtaQjNMSkFGNVRRUVJSUFFaTzE=

This makes base64 encoded value QUtaQjNMSkFGNVRRUVJSUFFaTzE= Repeat same process for SMTP username and password values in the example above.

The openssl command should result in successful SMTP authentication and send. You should receive an email that looks like this:

Optimizing Security of the Solution

  1. Do not share DNS credentials. Unauthorized access can lead to domain control, potential denial of service, and AWS charges. Restrict access to authorized personnel only.
  2. Do not set the SENDER_EMAIL environment variable to the email address associated with the receipt rule. This address is a closely guarded secret, known only to administrators, and should be changed frequently.
  3. Review access to your code repository regularly to ensure there are no unauthorized changes to your code base.
  4. Utilize Permissions Boundaries to restrict the actions permitted by an IAM user or role.

Cleanup

To cleanup, start by navigating to Amazon SES Email receiving in AWS console, and setting the rule set to Inactive.

Once completed, delete the stack:

cdk destroy

Cleanup AWS SES Access Credentials

In Amazon SES Console, select Manage existing SMTP credentials, select the username for which credentials were created in Set up the SES SMTP interface above, navigate to the Security credentials tab and in the Access keys section, select Action -> Delete to delete AWS SES access credentials.

Troubleshooting

If you are not receiving the email or email is not being sent correctly there are a number of common causes of these errors:

  • HTTP Error 554 Message rejected email address is not verified. The following identities failed the check in region :
    • This means that you have attempted to send an email from address that has not been verified.
    • Please, ensure that the “MAIL FROM:[VERIFIED EMAIL IN SES]” email address sent via openssl matches the SenderEmail=<verified sender email> email address used in cdk deploy.
    • Also make sure this email address was used in Verify the sender email address step.
  • Email is not being delivered/forwarded
    • The incoming S3 bucket under the incoming prefix, contains file called AMAZON_SES_SETUP_NOTIFICATION. This means that MX record of the domain setup is missing. Please, validate that the MX record (step 2) of Verify your email domain with Amazon SES to receive emails section is fully configured.
    • Please ensure after deploying the Amazon SES solution, the created rule set was made active by navigating to Amazon SES Email receiving in AWS console, and set it to Active.
    • This may mean that the destination email address has bounced. Please, navigate to Amazon SES Suppression list in AWS console ensure that recipient’s email is not in the suppression list. If it is listed, you can see the reason in the “Suppression reason” column. There you may either manually remove from the suppression list or if the recipient email is not valid, consider using a different recipient email address.
AWS Legal Disclaimer: Sample code, software libraries, command line tools, proofs of concept, templates, or other related technology are provided as AWS Content or Third-Party Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content or Third-Party Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content or Third-Party Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content or Third-Party Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.

About the Authors

Tarek Soliman

Tarek Soliman

Tarek is a Senior Solutions Architect at AWS. His background is in Software Engineering with a focus on distributed systems. He is passionate about diving into customer problems and solving them. He also enjoys building things using software, woodworking, and hobby electronics.

Dave Spencer

Dave Spencer

Dave is a Senior Solutions Architect at AWS. His background is in cloud solutions architecture, Infrastructure as Code (Iac), systems engineering, and embedded systems programming. Dave’s passion is developing partnerships with Department of Defense customers to maximize technology investments and realize their strategic vision.

Ayman Ishimwe

Ayman Ishimwe

Ayman is a Solutions Architect at AWS based in Seattle, Washington. He holds a Master’s degree in Software Engineering and IT from Oakland University. With prior experience in software development, specifically in building microservices for distributed web applications, he is passionate about helping customers build robust and scalable solutions on AWS cloud services following best practices.

Dmytro Protsiv

Dmytro Protsiv

Dmytro is a Cloud Applications Architect for with Amazon Web Services. He is passionate about helping customers to solve their business challenges around application modernization.

Stacy Conant

Stacy Conant

Stacy is a Solutions Architect working with DoD and US Navy customers. She enjoys helping customers understand how to harness big data and working on data analytics solutions. On the weekends, you can find Stacy crocheting, reading Harry Potter (again), playing with her dogs and cooking with her husband.

Power analytics as a service capabilities using Amazon Redshift

Post Syndicated from Sandipan Bhaumik original https://aws.amazon.com/blogs/big-data/power-analytics-as-a-service-capabilities-using-amazon-redshift/

Analytics as a service (AaaS) is a business model that uses the cloud to deliver analytic capabilities on a subscription basis. This model provides organizations with a cost-effective, scalable, and flexible solution for building analytics. The AaaS model accelerates data-driven decision-making through advanced analytics, enabling organizations to swiftly adapt to changing market trends and make informed strategic choices.

Amazon Redshift is a cloud data warehouse service that offers real-time insights and predictive analytics capabilities for analyzing data from terabytes to petabytes. It offers features like data sharing, Amazon Redshift ML, Amazon Redshift Spectrum, and Amazon Redshift Serverless, which simplify application building and make it effortless for AaaS companies to embed rich data analytics capabilities. Amazon Redshift delivers up to 4.9 times lower cost per user and up to 7.9 times better price-performance than other cloud data warehouses.

The Powered by Amazon Redshift program helps AWS Partners operating an AaaS model quickly build analytics applications using Amazon Redshift and successfully scale their business. For example, you can build visualizations on top of Amazon Redshift and embed them within applications to provide outstanding analytics experiences for end-users. In this post, we explore how AaaS providers scale their processes with Amazon Redshift to deliver insights to their customers.

AaaS delivery models

While serving analytics at scale, AaaS providers and customers can choose where to store the data and where to process the data.

AaaS providers could choose to ingest and process all the customer data into their own account and deliver insights to the customer account. Alternatively, they could choose to directly process data in-place within the customer’s account.

The choice of these delivery models depends on many factors, and each has their own benefits. Because AaaS providers service multiple customers, they could mix these models in a hybrid fashion, meeting each customer’s preference. The following diagram illustrates the two delivery models.

We explore the technical details of each model in the next sections.

Build AaaS on Amazon Redshift

Amazon Redshift has features that allow AaaS providers the flexibility to deploy three unique delivery models:

  • Managed model – Processing data within the Redshift data warehouse the AaaS provider manages
  • Bring-your-own-Redshift (BYOR) model – Processing data directly within the customer’s Redshift data warehouse
  • Hybrid model – Using a mix of both models depending on customer needs

These delivery models give AaaS providers the flexibility to deliver insights to their customers no matter where the data warehouse is located.

Let’s look at how each of these delivery models work in practice.

Managed model

In this model, the AaaS provider ingests customer data in their own account, and engages their own Redshift data warehouse for processing. Then they use one or more methods to deliver the generated insights to their customers. Amazon Redshift enables companies to securely build multi-tenant applications, ensuring data isolation, integrity, and confidentiality. It provides features like row-level security (RLS), column-level security (CLS) for fine-grained access control, role-based access control (RBAC), and assigning permissions at the database and schema level.

The following diagram illustrates the managed delivery model and the various methods AaaS providers can use to deliver insights to their customers.

The workflow includes the following steps:

  1. The AaaS provider pulls data from customer data sources like operational databases, files, and APIs, and ingests them into the Redshift data warehouse hosted in their account.
  2. Data processing jobs enrich the data in Amazon Redshift. This could be an application the AaaS provider has built to process data, or they could use a data processing service like Amazon EMR or AWS Glue to run Spark applications.
  3. Now the AaaS provider has multiple methods to deliver insights to their customers:
    1. Option 1 – The enriched data with insights is shared directly with the customer’s Redshift instance using the Amazon Redshift data sharing feature. End-users consume data using business intelligence (BI) tools and analytics applications.
    2. Option 2 – If AaaS providers are publishing generic insights to AWS Data Exchange to reach millions of AWS customers and monetize those insights, their customers can use AWS Data Exchange for Amazon Redshift. With this feature, customers get instant insights in their Redshift data warehouse without having to write extract, transform, and load (ETL) pipelines to ingest the data. AWS Data Exchange provides their customers a secure and compliant way to subscribe to the data with consolidated billing and subscription management.
    3. Option 3 – The AaaS provider exposes insights on a web application using the Amazon Redshift Data API. Customers access the web application directly from the internet. The gives the AaaS provider the flexibility to expose insights outside an AWS account.
    4. Option 4 – Customers connect to the AaaS provider’s Redshift instance using Amazon QuickSight or other third-party BI tools through a JDBC connection.

In this model, the customer shifts the responsibility of data management and governance to the AaaS providers, with light services to consume insights. This leads to improved decision-making as customers focus on core activities and save time from tedious data management tasks. Because AaaS providers move data from the customer accounts, there could be associated data transfer costs depending on how they move the data. However, because they deliver this service at scale to multiple customers, they can offer cost-efficient services using economies of scale.

BYOR model

In cases where the customer hosts a Redshift data warehouse and wants to run analytics in their own data platform without moving data out, you use the BYOR model.

The following diagram illustrates the BYOR model, where AaaS providers process data to add insights directly in their customer’s data warehouse so the data never leaves the customer account.

The solution includes the following steps:

  1. The customer ingests all the data from various data sources into their Redshift data warehouse.
  2. The data undergoes processing:
    1. The AaaS provider uses a secure channel, AWS PrivateLink for the Redshift Data API, to push data processing logic directly in the customer’s Redshift data warehouse.
    2. They use the same channel to process data at scale with multiple customers. The diagram illustrates a second customer, but this can scale to hundreds or thousands of customers. AaaS providers can tailor data processing logic per customer by isolating scripts for each customer and deploying them according to the customer’s identity, providing a customized and efficient service.
  3. The customer’s end-users consume data from their own account using BI tools and analytics applications.
  4. The customer has control over how to expose insights to their end-users.

This delivery model allows customers to manage their own data, reducing dependency on AaaS providers and cutting data transfer costs. By keeping data in their own environment, customers can reduce the risk of data breach while benefiting from insights for better decision-making.

Hybrid model

Customers have diverse needs influenced by factors like data security, compliance, and technical expertise. To cover a broader range of customers, AaaS providers can choose a hybrid approach that delivers both the managed model and the BYOR model depending on the customer, offering flexibility and the ability to serve multiple customers.

The following diagram illustrates the AaaS provider delivering insights through the BYOR model for Customer 1 and 4, the managed model for Customer 2 and 3, and so on.

Conclusion

In this post, we talked about the rising demand of analytics as a service and how providers can use the capabilities of Amazon Redshift to deliver insights to their customers. We examined two primary delivery models: the managed model, where AaaS providers process data on their own accounts, and the BYOR model, where AaaS providers process and enrich data directly in their customer’s account. Each method offers unique benefits, such as cost-efficiency, enhanced control, and personalized insights. The flexibility of the AWS Cloud facilitates a hybrid model, accommodating diverse customer needs and allowing AaaS providers to scale. We also introduced the Powered by Amazon Redshift program, which supports AaaS businesses in building effective analytics applications, fostering improved user engagement and business growth.

We take this opportunity to invite our ISV partners to reach out to us and learn more about the Powered by Amazon Redshift program.


About the Authors

Sandipan Bhaumik is a Senior Analytics Specialist Solutions Architect based in London, UK. He helps customers modernize their traditional data platforms using the modern data architecture in the cloud to perform analytics at scale.

Sain Das is a Senior Product Manager on the Amazon Redshift team and leads Amazon Redshift GTM for partner programs, including the Powered by Amazon Redshift and Redshift Ready programs.

Accelerate security automation using Amazon CodeWhisperer

Post Syndicated from Brendan Jenkins original https://aws.amazon.com/blogs/security/accelerate-security-automation-using-amazon-codewhisperer/

In an ever-changing security landscape, teams must be able to quickly remediate security risks. Many organizations look for ways to automate the remediation of security findings that are currently handled manually. Amazon CodeWhisperer is an artificial intelligence (AI) coding companion that generates real-time, single-line or full-function code suggestions in your integrated development environment (IDE) to help you quickly build software. By using CodeWhisperer, security teams can expedite the process of writing security automation scripts for various types of findings that are aggregated in AWS Security Hub, a cloud security posture management (CSPM) service.

In this post, we present some of the current challenges with security automation and walk you through how to use CodeWhisperer, together with Amazon EventBridge and AWS Lambda, to automate the remediation of Security Hub findings. Before reading further, please read the AWS Responsible AI Policy.

Current challenges with security automation

Many approaches to security automation, including Lambda and AWS Systems Manager Automation, require software development skills. Furthermore, the process of manually writing code for remediation can be a time-consuming process for security professionals. To help overcome these challenges, CodeWhisperer serves as a force multiplier for qualified security professionals with development experience to quickly and effectively generate code to help remediate security findings.

Security professionals should still cultivate software development skills to implement robust solutions. Engineers should thoroughly review and validate any generated code, as manual oversight remains critical for security.

Solution overview

Figure 1 shows how the findings that Security Hub produces are ingested by EventBridge, which then invokes Lambda functions for processing. The Lambda code is generated with the help of CodeWhisperer.

Figure 1: Diagram of the solution

Security Hub integrates with EventBridge so you can automatically process findings with other services such as Lambda. To begin remediating the findings automatically, you can configure rules to determine where to send findings. This solution will do the following:

  1. Ingest an Amazon Security Hub finding into EventBridge.
  2. Use an EventBridge rule to invoke a Lambda function for processing.
  3. Use CodeWhisperer to generate the Lambda function code.

It is important to note that there are two types of automation for Security Hub finding remediation:

  • Partial automation, which is initiated when a human worker selects the Security Hub findings manually and applies the automated remediation workflow to the selected findings.
  • End-to-end automation, which means that when a finding is generated within Security Hub, this initiates an automated workflow to immediately remediate without human intervention.

Important: When you use end-to-end automation, we highly recommend that you thoroughly test the efficiency and impact of the workflow in a non-production environment first before moving forward with implementation in a production environment.

Prerequisites

To follow along with this walkthrough, make sure that you have the following prerequisites in place:

Implement security automation

In this scenario, you have been tasked with making sure that versioning is enabled across all Amazon Simple Storage Service (Amazon S3) buckets in your AWS account. Additionally, you want to do this in a way that is programmatic and automated so that it can be reused in different AWS accounts in the future.

To do this, you will perform the following steps:

  1. Generate the remediation script with CodeWhisperer
  2. Create the Lambda function
  3. Integrate the Lambda function with Security Hub by using EventBridge
  4. Create a custom action in Security Hub
  5. Create an EventBridge rule to target the Lambda function
  6. Run the remediation

Generate a remediation script with CodeWhisperer

The first step is to use VS Code to create a script so that CodeWhisperer generates the code for your Lambda function in Python. You will use this Lambda function to remediate the Security Hub findings generated by the [S3.14] S3 buckets should use versioning control.

Note: The underlying model of CodeWhisperer is powered by generative AI, and the output of CodeWhisperer is nondeterministic. As such, the code recommended by the service can vary by user. By modifying the initial code comment to prompt CodeWhisperer for a response, customers can change the corresponding output to help meet their needs. Customers should subject all code generated by CodeWhisperer to typical testing and review protocols to verify that it is free of errors and is in line with applicable organizational security policies. To learn about best practices on prompt engineering with CodeWhisperer, see this AWS blog post.

To generate the remediation script

  1. Open a new VS Code window, and then open or create a new folder for your file to reside in.
  2. Create a Python file called cw-blog-remediation.py as shown in Figure 2.
     
    Figure 2: New VS Code file created called cw-blog-remediation.py

    Figure 2: New VS Code file created called cw-blog-remediation.py

  3. Add the following imports to the Python file.
    import json
    import boto3

  4. Because you have the context added to your file, you can now prompt CodeWhisperer by using a natural language comment. In your file, below the import statements, enter the following comment and then press Enter.
    # Create lambda function that turns on versioning for an S3 bucket after the function is triggered from Amazon EventBridge

  5. Accept the first recommendation that CodeWhisperer provides by pressing Tab to use the Lambda function handler, as shown in Figure 3.
    &ngsp;
    Figure 3: Generation of Lambda handler

    Figure 3: Generation of Lambda handler

  6. To get the recommendation for the function from CodeWhisperer, press Enter. Make sure that the recommendation you receive looks similar to the following. CodeWhisperer is nondeterministic, so its recommendations can vary.
    import json
    import boto3
    
    # Create lambda function that turns on versioning for an S3 bucket after function is triggered from Amazon EventBridge
    def lambda_handler(event, context):
        s3 = boto3.client('s3')
        bucket = event['detail']['requestParameters']['bucketName']
        response = s3.put_bucket_versioning(
            Bucket=bucket,
            VersioningConfiguration={
                'Status': 'Enabled'
            }
        )
        print(response)
        return {
            'statusCode': 200,
            'body': json.dumps('Versioning enabled for bucket ' + bucket)
        }
    

  7. Take a moment to review the user actions and keyboard shortcut keys. Press Tab to accept the recommendation.
  8. You can change the function body to fit your use case. To get the Amazon Resource Name (ARN) of the S3 bucket from the EventBridge event, replace the bucket variable with the following line:
    bucket = event['detail']['findings'][0]['Resources'][0]['Id']

  9. To prompt CodeWhisperer to extract the bucket name from the bucket ARN, use the following comment:
    # Take the S3 bucket name from the ARN of the S3 bucket

    Your function code should look similar to the following:

    import json
    import boto3
    
    # Create lambda function that turns on versioning for an S3 bucket after function is triggered from Amazon EventBridge
    def lambda_handler(event, context):
        s3 = boto3.client('s3')
       bucket = event['detail']['findings'][0]['Resources'][0]['Id']
             # Take the S3 bucket name from the ARN of the S3 bucket
       bucket = bucket.split(':')[5]
    
        response = s3.put_bucket_versioning(
            Bucket=bucket,
            VersioningConfiguration={
                'Status': 'Enabled'
            }
        )
        print(response)
        return {
            'statusCode': 200,
            'body': json.dumps('Versioning enabled for bucket ' + bucket)
        }
    

  10. Create a .zip file for cw-blog-remediation.py. Find the file in your local file manager, right-click the file, and select compress/zip. You will use this .zip file in the next section of the post.

Create the Lambda function

The next step is to use the automation script that you generated to create the Lambda function that will enable versioning on applicable S3 buckets.

To create the Lambda function

  1. Open the AWS Lambda console.
  2. In the left navigation pane, choose Functions, and then choose Create function.
  3. Select Author from Scratch and provide the following configurations for the function:
    1. For Function name, select sec_remediation_function.
    2. For Runtime, select Python 3.12.
    3. For Architecture, select x86_64.
    4. For Permissions, select Create a new role with basic Lambda permissions.
  4. Choose Create function.
  5. To upload your local code to Lambda, select Upload from and then .zip file, and then upload the file that you zipped.
  6. Verify that you created the Lambda function successfully. In the Code source section of Lambda, you should see the code from the automation script displayed in a new tab, as shown in Figure 4.
     
    Figure 4: Source code that was successfully uploaded

    Figure 4: Source code that was successfully uploaded

  7. Choose the Code tab.
  8. Scroll down to the Runtime settings pane and choose Edit.
  9. For Handler, enter cw-blog-remediation.lambda_handler for your function handler, and then choose Save, as shown in Figure 5.
     
    Figure 5: Updated Lambda handler

    Figure 5: Updated Lambda handler

  10. For security purposes, and to follow the principle of least privilege, you should also add an inline policy to the Lambda function’s role to perform the tasks necessary to enable versioning on S3 buckets.
    1. In the Lambda console, navigate to the Configuration tab and then, in the left navigation pane, choose Permissions. Choose the Role name, as shown in Figure 6.
       
      Figure 6: Lambda role in the AWS console

      Figure 6: Lambda role in the AWS console

    2. In the Add permissions dropdown, select Create inline policy.
       
      Figure 7: Create inline policy

      Figure 7: Create inline policy

    3. Choose JSON, add the following policy to the policy editor, and then choose Next.
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "VisualEditor0",
                  "Effect": "Allow",
                  "Action": "s3:PutBucketVersioning",
                  "Resource": "*"
              }
          ]
      }

    4. Name the policy PutBucketVersioning and choose Create policy.

Create a custom action in Security Hub

In this step, you will create a custom action in Security Hub.

To create the custom action

  1. Open the Security Hub console.
  2. In the left navigation pane, choose Settings, and then choose Custom actions.
  3. Choose Create custom action.
  4. Provide the following information, as shown in Figure 8:
    • For Name, enter TurnOnS3Versioning.
    • For Description, enter Action that will turn on versioning for a specific S3 bucket.
    • For Custom action ID, enter TurnOnS3Versioning.
       
      Figure 8: Create a custom action in Security Hub

      Figure 8: Create a custom action in Security Hub

  5. Choose Create custom action.
  6. Make a note of the Custom action ARN. You will need this ARN when you create a rule to associate with the custom action in EventBridge.

Create an EventBridge rule to target the Lambda function

The next step is to create an EventBridge rule to capture the custom action. You will define an EventBridge rule that matches events (in this case, findings) from Security Hub that were forwarded by the custom action that you defined previously.

To create the EventBridge rule

  1. Navigate to the EventBridge console.
  2. On the right side, choose Create rule.
  3. On the Define rule detail page, give your rule a name and description that represents the rule’s purpose—for example, you could use the same name and description that you used for the custom action. Then choose Next.
  4. Scroll down to Event pattern, and then do the following:
    1. For Event source, make sure that AWS services is selected.
    2. For AWS service, select Security Hub.
    3. For Event type, select Security Hub Findings – Custom Action.
    4. Select Specific custom action ARN(s) and enter the ARN for the custom action that you created earlier.
       
    Figure 9: Specify the EventBridge event pattern for the Security Hub custom action workflow

    Figure 9: Specify the EventBridge event pattern for the Security Hub custom action workflow

    As you provide this information, the Event pattern updates.

  5. Choose Next.
  6. On the Select target(s) step, in the Select a target dropdown, select Lambda function. Then from the Function dropdown, select sec_remediation_function.
  7. Choose Next.
  8. On the Configure tags step, choose Next.
  9. On the Review and create step, choose Create rule.

Run the automation

Your automation is set up and you can now test the automation. This test covers a partial automation workflow, since you will manually select the finding and apply the remediation workflow to one or more selected findings.

Important: As we mentioned earlier, if you decide to make the automation end-to-end, you should assess the impact of the workflow in a non-production environment. Additionally, you may want to consider creating preventative controls if you want to minimize the risk of event occurrence across an entire environment.

To run the automation

  1. In the Security Hub console, on the Findings tab, add a filter by entering Title in the search box and selecting that filter. Select IS and enter S3 general purpose buckets should have versioning enabled (case sensitive). Choose Apply.
  2. In the filtered list, choose the Title of an active finding.
  3. Before you start the automation, check the current configuration of the S3 bucket to confirm that your automation works. Expand the Resources section of the finding.
  4. Under Resource ID, choose the link for the S3 bucket. This opens a new tab on the S3 console that shows only this S3 bucket.
  5. In your browser, go back to the Security Hub tab (don’t close the S3 tab—you will need to return to it), and on the left side, select this same finding, as shown in Figure 10.
     
    Figure 10: Filter out Security Hub findings to list only S3 bucket-related findings

    Figure 10: Filter out Security Hub findings to list only S3 bucket-related findings

  6. In the Actions dropdown list, choose the name of your custom action.
     
    Figure 11: Choose the custom action that you created to start the remediation workflow

    Figure 11: Choose the custom action that you created to start the remediation workflow

  7. When you see a banner that displays Successfully started action…, go back to the S3 browser tab and refresh it. Verify that the S3 versioning configuration on the bucket has been enabled as shown in figure 12.
     
    Figure 12: Versioning successfully enabled

    Figure 12: Versioning successfully enabled

Conclusion

In this post, you learned how to use CodeWhisperer to produce AI-generated code for custom remediations for a security use case. We encourage you to experiment with CodeWhisperer to create Lambda functions that remediate other Security Hub findings that might exist in your account, such as the enforcement of lifecycle policies on S3 buckets with versioning enabled, or using automation to remove multiple unused Amazon EC2 elastic IP addresses. The ability to automatically set public S3 buckets to private is just one of many use cases where CodeWhisperer can generate code to help you remediate Security Hub findings.

To sum up, CodeWhisperer acts as a tool that can help boost the productivity of security experts who have coding abilities, assisting them to swiftly write code to address security issues. However, security specialists should continue building their software development capabilities to implement robust solutions. Engineers should carefully review and test any generated code, since human oversight is still vital for security.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Brendan Jenkins

Brendan Jenkins

Brendan is a Solutions Architect at AWS who works with enterprise customers, providing them with technical guidance and helping them achieve their business goals. He specializes in DevOps and machine learning (ML) technology.

Chris Shea

Chris Shea

Chris is an AWS Solutions Architect serving enterprise customers in the PropTech and AdTech industry verticals, providing guidance and the tools that customers need for success. His areas of interest include AI for DevOps and AI/ML technology.

Tim Manik

Tim Manik

Tim is a Solutions Architect at AWS working with enterprise customers on migrations and modernizations. He specializes in cybersecurity and AI/ML and is passionate about bridging the gap between the two fields.

Angel Tolson

Angel Tolson

Angel is a Solutions Architect at AWS working with small to medium size businesses, providing them with technical guidance and helping them achieve their business goals. She is particularly interested in cloud operations and networking.

Achieve near real time operational analytics using Amazon Aurora PostgreSQL zero-ETL integration with Amazon Redshift

Post Syndicated from Raks Khare original https://aws.amazon.com/blogs/big-data/achieve-near-real-time-operational-analytics-using-amazon-aurora-postgresql-zero-etl-integration-with-amazon-redshift/

“Data is at the center of every application, process, and business decision. When data is used to improve customer experiences and drive innovation, it can lead to business growth,”

Swami Sivasubramanian, VP of Database, Analytics, and Machine Learning at AWS in With a zero-ETL approach, AWS is helping builders realize near-real-time analytics.

Customers across industries are becoming more data driven and looking to increase revenue, reduce cost, and optimize their business operations by implementing near real time analytics on transactional data, thereby enhancing agility. Based on customer needs and their feedback, AWS is investing and steadily progressing towards bringing our zero-ETL vision to life so that builders can focus more on creating value from data, instead of preparing data for analysis.

Our zero-ETL integration with Amazon Redshift facilitates point-to-point data movement to get it ready for analytics, artificial intelligence (AI) and machine learning (ML) using Amazon Redshift on petabytes of data. Within seconds of transactional data being written into supported AWS databases, zero-ETL seamlessly makes the data available in Amazon Redshift, removing the need to build and maintain complex data pipelines that perform extract, transform, and load (ETL) operations.

To help you focus on creating value from data instead of investing undifferentiated time and resources in building and managing ETL pipelines between transactional databases and data warehouses, we announced four AWS database zero-ETL integrations with Amazon Redshift at AWS re:Invent 2023:

In this post, we provide step-by-step guidance on how to get started with near real time operational analytics using the Amazon Aurora PostgreSQL zero-ETL integration with Amazon Redshift.

Solution overview

To create a zero-ETL integration, you specify an Amazon Aurora PostgreSQL-Compatible Edition cluster (compatible with PostgreSQL 15.4 and zero-ETL support) as the source, and a Redshift data warehouse as the target. The integration replicates data from the source database into the target data warehouse.

You must create Aurora PostgreSQL DB provisioned clusters within the Amazon RDS Database Preview Environment and a Redshift provisioned preview cluster or serverless preview workgroup, in the US East (Ohio) AWS Region. For Amazon Redshift, make sure that you choose the preview_2023 track in order to use zero-ETL integrations.

The following diagram illustrates the architecture implemented in this post.

The following are the steps needed to set up the zero-ETL integration for this solution. For complete getting started guides, refer to Working with Aurora zero-ETL integrations with Amazon Redshift and Working with zero-ETL integrations.

bdb-3883-image001

After Step1, you can also skip Steps 2–4 and directly start creating your zero-ETL integration from Step 5, in which case Amazon RDS will show a message about missing configurations and you can choose Fix it for me to let Amazon RDS automatically configure the steps.

  1. Configure the Aurora PostgreSQL source with a customized DB cluster parameter group.
  2. Configure the Amazon Redshift Serverless destination with the required resource policy for its namespace.
  3. Update the Redshift Serverless workgroup to enable case-sensitive identifiers.
  4. Configure the required permissions.
  5. Create the zero-ETL integration.
  6. Create a database from the integration in Amazon Redshift.
  7. Start analyzing the near real time transactional data.

Configure the Aurora PostgreSQL source with a customized DB cluster parameter group

For Aurora PostgreSQL DB clusters, you must create the custom parameter group within the Amazon RDS Database Preview Environment, in the US East (Ohio) Region. You can directly access the Amazon RDS Preview Environment.

To create an Aurora PostgreSQL database, complete the following steps:

  1. On the Amazon RDS console, choose Parameter groups in the navigation pane.
  2. Choose Create parameter group.
  3. For Parameter group family, choose aurora-postgresql15.
  4. For Type, choose DB Cluster Parameter Group.
  5. For Group name, enter a name (for example, zero-etl-custom-pg-postgres).
  6. Choose Create.bdb-3883-image002

Aurora PostgreSQL zero-ETL integrations with Amazon Redshift require specific values for the Aurora DB cluster parameters, which requires enhanced logical replication (aurora.enhanced_logical_replication).

  1. On the Parameter groups page, select the newly created parameter group.
  2. On the Actions menu, choose Edit.
  3. Set the following Aurora PostgreSQL (aurora-postgresql15 family) cluster parameter settings:
    • rds.logical_replication=1
    • aurora.enhanced_logical_replication=1
    • aurora.logical_replication_backup=0
    • aurora.logical_replication_globaldb=0

Enabling enhanced logical replication (aurora.enhanced_logical_replication) automatically sets the REPLICA IDENTITY parameter to FULL, which means that all column values are written to the write ahead log (WAL).

  1. Choose Save Changes.bdb-3883-image003
  2. Choose Databases in the navigation pane, then choose Create database.
    bdb-3883-image004
  3. For Engine type, select Amazon Aurora.
  4. For Edition, select Amazon Aurora PostgreSQL-Compatible Edition.
  5. For Available versions, choose Aurora PostgreSQL (compatible with PostgreSQL 15.4 and Zero-ETL Support).bdb-3883-image006
  6. For Templates, select Production.
  7. For DB cluster identifier, enter zero-etl-source-pg.bdb-3883-image007
  8. Under Credentials Settings, enter a password for Master password or use the option to automatically generate a password for you.
  9. In the Instance configuration section, select Memory optimized classes.
  10. Choose a suitable instance size (the default is db.r5.2xlarge).bdb-3883-image008
  11. Under Additional configuration, for DB cluster parameter group, choose the parameter group you created earlier (zero-etl-custom-pg-postgres).bdb-3883-image009
  12. Leave the default settings for the remaining configurations.
  13. Choose Create database.

In a few minutes, this should spin up an Aurora PostgreSQL cluster, with one writer and one reader instance, with the status changing from Creating to Available. The newly created Aurora PostgreSQL cluster will be the source for the zero-ETL integration.

bdb-3883-image010

The next step is to create a named database in Amazon Aurora PostgreSQL for the zero-ETL integration.

The PostgreSQL resource model allows you to create multiple databases within a cluster. Therefore, during the zero-ETL integration creation step, you need to specify which database you want to use as the source for your integration.

When setting up PostgreSQL, you get three standard databases out of the box: template0, template1, and postgres. Whenever you create a new database in PostgreSQL, you are actually basing it off one of these three databases in your cluster. The database created during Aurora PostgreSQL cluster creation is based on template0. The CREATE DATABASE command works by copying an existing database, and if not explicitly specified, by default, it copies the standard system database template1. For the named database for zero-ETL integration, the database is required to be created using template1 and not template0. Therefore, if an initial database name is added under Additional configuration, that would be created using template0 and cannot be used for zero-ETL integration.

  1. To create a new named database using CREATE DATABASE within the new Aurora PostgreSQL cluster zero-etl-source-pg, first get the endpoint of the writer instance of the PostgreSQL cluster.bdb-3883-image011
  2. From a terminal or using AWS CloudShell, SSH into the PostgreSQL cluster and run the following commands to install psql and create a new database zeroetl_db:
    sudo dnf install postgresql15
    psql –version
    psql -h <RDS Write Instance Endpoint> -p 5432 -U postgres
    create database zeroetl_db template template1;

Adding template template1 is optional, because by default, if not mentioned, CREATE DATABASE will use template1.

You can also connect via a client and create the database. Refer to Connect to an Aurora PostgreSQL DB cluster for the options to connect to the PostgreSQL cluster.

Configure Redshift Serverless as destination

After you create your Aurora PostgreSQL source database cluster, you configure a Redshift target data warehouse. The data warehouse must comply with the following requirements:

  • Created in preview (for Aurora PostgreSQL sources only)
  • Uses an RA3 node type (ra3.16xlarge, ra3.4xlarge, or ra3.xlplus) with at least two nodes, or Redshift Serverless
  • Encrypted (if using a provisioned cluster)

For this post, we create and configure a Redshift Serverless workgroup and namespace as the target data warehouse, following these steps:

  1. On the Amazon Redshift console, choose Serverless dashboard in the navigation pane.

Because the zero-ETL integration for Amazon Aurora PostgreSQL to Amazon Redshift has been launched in preview (not for production purposes), you need to create the target data warehouse in a preview environment.

  1. Choose Create preview workgroup.

The first step is to configure the Redshift Serverless workgroup.

  1. For Workgroup name, enter a name (for example, zero-etl-target-rs-wg).bdb-3883-image014
  2. Additionally, you can choose the capacity, to limit the compute resources of the data warehouse. The capacity can be configured in increments of 8, from 8–512 RPUs. For this post, set this to 8 RPUs.
  3. Choose Next.bdb-3883-image016

Next, you need to configure the namespace of the data warehouse.

  1. Select Create a new namespace.
  2. For Namespace, enter a name (for example, zero-etl-target-rs-ns).
  3. Choose Next.bdb-3883-image017
  4. Choose Create workgroup.
  5. After the workgroup and namespace are created, choose Namespace configurations in the navigation pane and open the namespace configuration.
  6. On the Resource policy tab, choose Add authorized principals.

An authorized principal identifies the user or role that can create zero-ETL integrations into the data warehouse.

bdb-3883-image018

  1. For IAM principal ARN or AWS account ID, you can enter either the ARN of the AWS user or role, or the ID of the AWS account that you want to grant access to create zero-ETL integrations. (An account ID is stored as an ARN.)
  2. Choose Save changes.bdb-3883-image019

After the Authorized principal is configured, you need to allow the source database to update your Redshift data warehouse. Therefore, you must add the source database as an authorized integration source to the namespace.

  1. Choose Add authorized integration source.bdb-3883-image020
  2. For Authorized source ARN, enter the ARN of the Aurora PostgreSQL cluster, because it’s the source of the zero-ETL integration.

You can obtain the ARN of the Aurora PostgreSQL cluster on the Amazon RDS console, the Configuration tab under Amazon Resource Name.

  1. Choose Save changes.bdb-3883-image021

Update the Redshift Serverless workgroup to enable case-sensitive identifiers

Amazon Aurora PostgreSQL is case sensitive by default, and case sensitivity is disabled on all provisioned clusters and Redshift Serverless workgroups. For the integration to be successful, the case sensitivity parameter enable_case_sensitive_identifier must be enabled for the data warehouse.

In order to modify the enable_case_sensitive_identifier parameter in a Redshift Serverless workgroup, you need to use the AWS Command Line Interface (AWS CLI), because the Amazon Redshift console doesn’t currently support modifying Redshift Serverless parameter values. Run the following command to update the parameter:

aws redshift-serverless update-workgroup --workgroup-name zero-etl-target-rs-wg --config-parameters parameterKey=enable_case_sensitive_identifier,parameterValue=true --region us-east-2

A simple way to connect to the AWS CLI is to use CloudShell, which is a browser-based shell that provides command line access to the AWS resources and tools directly from a browser. The following screenshot illustrates how to run the command in the CloudShell.

bdb-3883-image022

Configure required permissions

To create a zero-ETL integration, your user or role must have an attached identity-based policy with the appropriate AWS Identity and Access Management (IAM) permissions. An AWS account owner can configure required permissions for user or roles who may create zero-ETL integrations. The sample policy allows the associated principal to perform following actions:

  • Create zero-ETL integrations for the source Aurora DB cluster.
  • View and delete all zero-ETL integrations.
  • Create inbound integrations into the target data warehouse. Amazon Redshift has a different ARN format for provisioned and serverless:
  • Provisioned clusterarn:aws:redshift:{region}:{account-id}:namespace:namespace-uuid
  • Serverlessarn:aws:redshift-serverless:{region}:{account-id}:namespace/namespace-uuid

This permission is not required if the same account owns the Redshift data warehouse and this account is an authorized principal for that data warehouse.

Complete the following steps to configure the permissions:

  1. On the IAM console, choose Policies in the navigation pane.
  2. Choose Create policy.
  3. Create a new policy called rds-integrations using the following JSON. For the Amazon Aurora PostgreSQL preview, all ARNs and actions within the Amazon RDS Database Preview Environment have -preview appended to the service namespace. Therefore, in the following policy, instead of rds, you need to use rds-preview. For example, rds-preview:CreateIntegration.
{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "rds:CreateIntegration"
        ],
        "Resource": [
            "arn:aws:rds:{region}:{account-id}:cluster:source-cluster",
            "arn:aws:rds:{region}:{account-id}:integration:*"
        ]
    },
    {
        "Effect": "Allow",
        "Action": [
            "rds:DescribeIntegration"
        ],
        "Resource": ["*"]
    },
    {
        "Effect": "Allow",
        "Action": [
            "rds:DeleteIntegration"
        ],
        "Resource": [
            "arn:aws:rds:{region}:{account-id}:integration:*"
        ]
    },
    {
        "Effect": "Allow",
        "Action": [
            "redshift:CreateInboundIntegration"
        ],
        "Resource": [
            "arn:aws:redshift:{region}:{account-id}:cluster:namespace-uuid"
        ]
    }]
}
  1. Attach the policy you created to your IAM user or role permissions.

Create the zero-ETL integration

To create the zero-ETL integration, complete the following steps:

  1. On the Amazon RDS console, choose Zero-ETL integrations in the navigation pane.
  2. Choose Create zero-ETL integration.bdb-3883-image023
  3. For Integration identifier, enter a name, for example zero-etl-demo.
  4. Choose Next.bdb-3883-image025
  5. For Source database, choose Browse RDS databases.bdb-3883-image026
  6. Select the source database zero-etl-source-pg and choose Choose.
  7. For Named database, enter the name of the new database created in the Amazon Aurora PostgreSQL (zeroetl-db).
  8. Choose Next.bdb-3883-image028
  9. In the Target section, for AWS account, select Use the current account.
  10. For Amazon Redshift data warehouse, choose Browse Redshift data warehouses.bdb-3883-image029

We discuss the Specify a different account option later in this section.

  1. Select the Redshift Serverless destination namespace (zero-etl-target-rs-ns), and choose Choose.bdb-3883-image031
  2. Add tags and encryption, if applicable, and choose Next.bdb-3883-image032
  3. Verify the integration name, source, target, and other settings, and choose Create zero-ETL integration.

You can choose the integration on the Amazon RDS console to view the details and monitor its progress. It takes about 30 minutes to change the status from Creating to Active, depending on size of the dataset already available in the source.

bdb-3883-image033

bdb-3883-image034

To specify a target Redshift data warehouse that’s in another AWS account, you must create a role that allows users in the current account to access resources in the target account. For more information, refer to Providing access to an IAM user in another AWS account that you own.

Create a role in the target account with the following permissions:

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Action":[
            "redshift:DescribeClusters",
            "redshift-serverless:ListNamespaces"
         ],
         "Resource":[
            "*"
         ]
      }
   ]
}

The role must have the following trust policy, which specifies the target account ID. You can do this by creating a role with a trusted entity as an AWS account ID in another account.

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Principal":{
            "AWS": "arn:aws:iam::{external-account-id}:root"
         },
         "Action":"sts:AssumeRole"
      }
   ]
}

The following screenshot illustrates creating this on the IAM console.

bdb-3883-image035

Then, while creating the zero-ETL integration, for Specify a different account, choose the destination account ID and the name of the role you created.

Create a database from the integration in Amazon Redshift

To create your database, complete the following steps:

  1. On the Redshift Serverless dashboard, navigate to the zero-etl-target-rs-ns namespace.
  2. Choose Query data to open the query editor v2.
    bdb-3883-image036
  3. Connect to the Redshift Serverless data warehouse by choosing Create connection.
    bdb-3883-image037
  4. Obtain the integration_id from the svv_integration system table:
    SELECT integration_id FROM svv_integration; -- copy this result, use in the next sql

  5. Use the integration_id from the previous step to create a new database from the integration. You must also include a reference to the named database within the cluster that you specified when you created the integration.
    CREATE DATABASE aurora_pg_zetl FROM INTEGRATION '<result from above>' DATABASE zeroetl_db;

bdb-3883-image038

The integration is now complete, and an entire snapshot of the source will reflect as is in the destination. Ongoing changes will be synced in near real time.

Analyze the near real time transactional data

Now you can start analyzing the near real time data from the Amazon Aurora PostgreSQL source to the Amazon Redshift target:

  1. Connect to your source Aurora PostgreSQL database. In this demo, we use psql to connect to Amazon Aurora PostgreSQL:
    psql -h <amazon_aurora_postgres_writer_endpoint> -p 5432 -d zeroetl_db -U postgres

bdb-3883-image039

  1. Create a sample table with a primary key. Make sure that all tables to be replicated from source to target have a primary key. Tables without a primary key can’t be replicated to the target.
CREATE TABLE NATION  ( 
N_NATIONKEY  INTEGER NOT NULL PRIMARY KEY, 
N_NAME       CHAR(25) NOT NULL,
N_REGIONKEY  INTEGER NOT NULL,
N_COMMENT    VARCHAR(152));
  1. Insert dummy data into the nation table and verify if the data is properly loaded:
INSERT INTO nation VALUES (1, 'USA', 1 , 'united states of america');
SELECT * FROM nation;

bdb-3883-image040

This sample data should now be replicated in Amazon Redshift.

Analyze the source data in the destination

On the Redshift Serverless dashboard, open query editor v2 and connect to the database aurora_pg_zetl you created earlier.

Run the following query to validate the successful replication of the source data into Amazon Redshift:

SELECT * FROM aurora_pg_etl.public.nation;

bdb-3883-image041

You can also use the following query to validate the initial snapshot or ongoing change data capture (CDC) activity:

SELECT * FROM sys_integration_activity ORDER BY last_commit_timestamp desc;

bdb-3883-image042

Monitoring

There are several options to obtain metrics on the performance and status of the Aurora PostgreSQL zero-ETL integration with Amazon Redshift.

If you navigate to the Amazon Redshift console, you can choose Zero-ETL integrations in the navigation pane. You can choose the zero-ETL integration you want and display Amazon CloudWatch metrics related to the integration. These metrics are also directly available in CloudWatch.

bdb-3883-image043

For each integration, there are two tabs with information available:

  • Integration metrics – Shows metrics such as the number of tables successfully replicated and lag details
    bdb-3883-image044
  • Table statistics – Shows details about each table replicated from Amazon Aurora PostgreSQL to Amazon Redshift
    bdb-3883-image045

In addition to the CloudWatch metrics, you can query the following system views, which provide information about the integrations:

Clean up

When you delete a zero-ETL integration, your transactional data isn’t deleted from Aurora or Amazon Redshift, but Aurora doesn’t send new data to Amazon Redshift.

To delete a zero-ETL integration, complete the following steps:

  1. On the Amazon RDS console, choose Zero-ETL integrations in the navigation pane.
  2. Select the zero-ETL integration that you want to delete and choose Delete.
    bdb-3883-image046
  3. To confirm the deletion, enter confirm and choose Delete.
    bdb-3883-image048

Conclusion

In this post, we explained how you can set up the zero-ETL integration from Amazon Aurora PostgreSQL to Amazon Redshift, a feature that reduces the effort of maintaining data pipelines and enables near real time analytics on transactional and operational data.

To learn more about zero-ETL integration, refer to Working with Aurora zero-ETL integrations with Amazon Redshift and Limitations.


About the Authors

Raks KhareRaks Khare is an Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers architect data analytics solutions at scale on the AWS platform.

Juan Luis Polo Garzon is an Associate Specialist Solutions Architect at AWS, specialized in analytics workloads. He has experience helping customers design, build and modernize their cloud-based analytics solutions. Outside of work, he enjoys travelling, outdoors and hiking, and attending to live music events.

Sushmita Barthakur is a Senior Solutions Architect at Amazon Web Services, supporting Enterprise customers architect their workloads on AWS. With a strong background in Data Analytics and Data Management, she has extensive experience helping customers architect and build Business Intelligence and Analytics Solutions, both on-premises and the cloud. Sushmita is based out of Tampa, FL and enjoys traveling, reading and playing tennis.

Detecting and remediating inactive user accounts with Amazon Cognito

Post Syndicated from Harun Abdi original https://aws.amazon.com/blogs/security/detecting-and-remediating-inactive-user-accounts-with-amazon-cognito/

For businesses, particularly those in highly regulated industries, managing user accounts isn’t just a matter of security but also a compliance necessity. In sectors such as finance, healthcare, and government, where regulations often mandate strict control over user access, disabling stale user accounts is a key compliance activity. In this post, we show you a solution that uses serverless technologies to track and disable inactive user accounts. While this process is particularly relevant for those in regulated industries, it can also be beneficial for other organizations looking to maintain a clean and secure user base.

The solution focuses on identifying inactive user accounts in Amazon Cognito and automatically disabling them. Disabling a user account in Cognito effectively restricts the user’s access to applications and services linked with the Amazon Cognito user pool. After their account is disabled, the user cannot sign in, access tokens are revoked for their account and they are unable to perform API operations that require user authentication. However, the user’s data and profile within the Cognito user pool remain intact. If necessary, the account can be re-enabled, allowing the user to regain access and functionality.

While the solution focuses on the example of a single Amazon Cognito user pool in a single account, you also learn considerations for multi-user pool and multi-account strategies.

Solution overview

In this section, you learn how to configure an AWS Lambda function that captures the latest sign-in records of users authenticated by Amazon Cognito and write this data to an Amazon DynamoDB table. A time-to-live (TTL) indicator is set on each of these records based on the user inactivity threshold parameter defined when deploying the solution. This TTL represents the maximum period a user can go without signing in before their account is disabled. As these items reach their TTL expiry in DynamoDB, a second Lambda function is invoked to process the expired items and disable the corresponding user accounts in Cognito. For example, if the user inactivity threshold is configured to be 7 days, the accounts of users who don’t sign in within 7 days of their last sign-in will be disabled. Figure 1 shows an overview of the process.

Note: This solution functions as a background process and doesn’t disable user accounts in real time. This is because DynamoDB Time to Live (TTL) is designed for efficiency and to remain within the constraints of the Amazon Cognito quotas. Set your users’ and administrators’ expectations accordingly, acknowledging that there might be a delay in the reflection of changes and updates.

Figure 1: Architecture diagram for tracking user activity and disabling inactive Amazon Cognito users

Figure 1: Architecture diagram for tracking user activity and disabling inactive Amazon Cognito users

As shown in Figure 1, this process involves the following steps:

  1. An application user signs in by authenticating to Amazon Cognito.
  2. Upon successful user authentication, Cognito initiates a post authentication Lambda trigger invoking the PostAuthProcessorLambda function.
  3. The PostAuthProcessorLambda function puts an item in the LatestPostAuthRecordsDDB DynamoDB table with the following attributes:
    1. sub: A unique identifier for the authenticated user within the Amazon Cognito user pool.
    2. timestamp: The time of the user’s latest sign-in, formatted in UTC ISO standard.
    3. username: The authenticated user’s Cognito username.
    4. userpool_id: The identifier of the user pool to which the user authenticated.
    5. ttl: The TTL value, in seconds, after which a user’s inactivity will initiate account deactivation.
  4. Items in the LatestPostAuthRecordsDDB DynamoDB table are automatically purged upon reaching their TTL expiry, launching events in DynamoDB Streams.
  5. DynamoDB Streams events are filtered to allow invocation of the DDBStreamProcessorLambda function only for TTL deleted items.
  6. The DDBStreamProcessorLambda function runs to disable the corresponding user accounts in Cognito.

Implementation details

In this section, you’re guided through deploying the solution, demonstrating how to integrate it with your existing Amazon Cognito user pool and exploring the solution in more detail.

Note: This solution begins tracking user activity from the moment of its deployment. It can’t retroactively track or manage user activities that occurred prior to its implementation. To make sure the solution disables currently inactive users in the first TTL period after deploying the solution, you should do a one-time preload of those users into the DynamoDB table. If this isn’t done, the currently inactive users won’t be detected because users are detected as they sign in. For the same reason, users who create accounts but never sign in won’t be detected either. To detect user accounts that sign up but never sign in, implement a post confirmation Lambda trigger to invoke a Lambda function that processes user sign-up records and writes them to the DynamoDB table.

Prerequisites

Before deploying this solution, you must have the following prerequisites in place:

  • An existing Amazon Cognito user pool. This user pool is the foundation upon which the solution operates. If you don’t have a Cognito user pool set up, you must create one before proceeding. See Creating a user pool.
  • The ability to launch a CloudFormation template. The second prerequisite is the capability to launch an AWS CloudFormation template in your AWS environment. The template provisions the necessary AWS services, including Lambda functions, a DynamoDB table, and AWS Identity and Access Management (IAM) roles that are integral to the solution. The template simplifies the deployment process, allowing you to set up the entire solution with minimal manual configuration. You must have the necessary permissions in your AWS account to launch CloudFormation stacks and provision these services.

To deploy the solution

  1. Choose the following Launch Stack button to deploy the solution’s CloudFormation template:

    Launch Stack

    The solution deploys in the AWS US East (N. Virginia) Region (us-east-1) by default. To deploy the solution in a different Region, use the Region selector in the console navigation bar and make sure that the services required for this walkthrough are supported in your newly selected Region. For service availability by Region, see AWS Services by Region.

  2. On the Quick Create Stack screen, do the following:
    1. Specify the stack details.
      1. Stack name: The stack name is an identifier that helps you find a particular stack from a list of stacks. A stack name can contain only alphanumeric characters (case sensitive) and hyphens. It must start with an alphabetic character and can’t be longer than 128 characters.
      2. CognitoUserPoolARNs: A comma-separated list of Amazon Cognito user pool Amazon Resource Names (ARNs) to monitor for inactive users.
      3. UserInactiveThresholdDays: Time (in days) that the user account is allowed to be inactive before it’s disabled.
    2. Scroll to the bottom, and in the Capabilities section, select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
    3. Choose Create Stack.

Integrate with your existing user pool

With the CloudFormation template deployed, you can set up Lambda triggers in your existing user pool. This is a key step for tracking user activity.

Note: This walkthrough is using the new AWS Management Console experience. Alternatively, These steps could also be done using CloudFormation.

To integrate with your existing user pool

  1. Navigate to the Amazon Cognito console and select your user pool.
  2. Navigate to User pool properties.
  3. Under Lambda triggers, choose Add Lambda trigger. Select the Authentication radio button, then add a Post authentication trigger and assign the PostAuthProcessorLambda function.

Note: Amazon Cognito allows you to set up one Lambda trigger per event. If you already have a configured post authentication Lambda trigger, you can refactor the existing Lambda function, adding new features directly to minimize the cold starts associated with invoking additional functions (for more information, see Anti-patterns in Lambda-based applications). Keep in mind that when Cognito calls your Lambda function, the function must respond within 5 seconds. If it doesn’t and if the call can be retried, Cognito retries the call. After three unsuccessful attempts, the function times out. You can’t change this 5-second timeout value.

Figure 2: Add a post-authentication Lambda trigger and assign a Lambda function

Figure 2: Add a post-authentication Lambda trigger and assign a Lambda function

When you add a Lambda trigger in the Amazon Cognito console, Cognito adds a resource-based policy to your function that permits your user pool to invoke the function. When you create a Lambda trigger outside of the Cognito console, including a cross-account function, you must add permissions to the resource-based policy of the Lambda function. Your added permissions must allow Cognito to invoke the function on behalf of your user pool. You can add permissions from the Lambda console or use the Lambda AddPermission API operation. To configure this in CloudFormation, you can use the AWS::Lambda::Permission resource.

Explore the solution

The solution should now be operational. It’s configured to begin monitoring user sign-in activities and automatically disable inactive user accounts according to the user inactivity threshold. Use the following procedures to test the solution:

Note: When testing the solution, you can set the UserInactiveThresholdDays CloudFormation parameter to 0. This minimizes the time it takes for user accounts to be disabled.

Step 1: User authentication

  1. Create a user account (if one doesn’t exist) in the Amazon Cognito user pool integrated with the solution.
  2. Authenticate to the Cognito user pool integrated with the solution.
     
    Figure 3: Example user signing in to the Amazon Cognito hosted UI

    Figure 3: Example user signing in to the Amazon Cognito hosted UI

Step 2: Verify the sign-in record in DynamoDB

Confirm the sign-in record was successfully put in the LatestPostAuthRecordsDDB DynamoDB table.

  1. Navigate to the DynamoDB console.
  2. Select the LatestPostAuthRecordsDDB table.
  3. Select Explore Table Items.
  4. Locate the sign-in record associated with your user.
     
Figure 4: Locating the sign-in record associated with the signed-in user

Figure 4: Locating the sign-in record associated with the signed-in user

Step 3: Confirm user deactivation in Amazon Cognito

After the TTL expires, validate that the user account is disabled in Amazon Cognito.

  1. Navigate to the Amazon Cognito console.
  2. Select the relevant Cognito user pool.
  3. Under Users, select the specific user.
  4. Verify the Account status in the User information section.
     
Figure 5: Screenshot of the user that signed in with their account status set to disabled

Figure 5: Screenshot of the user that signed in with their account status set to disabled

Note: TTL typically deletes expired items within a few days. Depending on the size and activity level of a table, the actual delete operation of an expired item can vary. TTL deletes items on a best effort basis, and deletion might take longer in some cases.

The user’s account is now disabled. A disabled user account can’t be used to sign in, but still appears in the responses to GetUser and ListUsers API requests.

Design considerations

In this section, you dive deeper into the key components of this solution.

DynamoDB schema configuration:

The DynamoDB schema has the Amazon Cognito sub attribute as the partition key. The Cognito sub is a globally unique user identifier within Cognito user pools that cannot be changed. This configuration ensures each user has a single entry in the table, even if the solution is configured to track multiple user pools. See Other considerations for more about tracking multiple user pools.

Using DynamoDB Streams and Lambda to disable TTL deleted users

This solution uses DynamoDB TTL and DynamoDB Streams alongside Lambda to process user sign-in records. The TTL feature automatically deletes items past their expiration time without write throughput consumption. The deleted items are captured by DynamoDB Streams and processed using Lambda. You also apply event filtering within the Lambda event source mapping, ensuring that the DDBStreamProcessorLambda function is invoked exclusively for TTL-deleted items (see the following code example for the JSON filter pattern). This approach reduces invocations of the Lambda functions, simplifies code, and reduces overall cost.

{
    "Filters": [
        {
            "Pattern": { "userIdentity": { "type": ["Service"], "principalId": ["dynamodb.amazonaws.com"] } }
        }
    ]
}

Handling API quotas:

The DDBStreamProcessorLambda function is configured to comply with the AdminDisableUser API’s quota limits. It processes messages in batches of 25, with a parallelization factor of 1. This makes sure that the solution remains within the nonadjustable 25 requests per second (RPS) limit for AdminDisableUser, avoiding potential API throttling. For more details on these limits, see Quotas in Amazon Cognito.

Dead-letter queues:

Throughout the architecture, dead-letter queues (DLQs) are used to handle message processing failures gracefully. They make sure that unprocessed records aren’t lost but instead are queued for further inspection and retry.

Other considerations

The following considerations are important for scaling the solution in complex environments and maintaining its integrity. The ability to scale and manage the increased complexity is crucial for successful adoption of the solution.

Multi-user pool and multi-account deployment

While this solution discussed a single Amazon Cognito user pool in a single AWS account, this solution can also function in environments with multiple user pools. This involves deploying the solution and integrating with each user pool as described in Integrating with your existing user pool. Because of the AdminDisableUser API’s quota limit for the maximum volume of requests in one AWS Region in one AWS account, consider deploying the solution separately in each Region in each AWS account to stay within the API limits.

Efficient processing with Amazon SQS:

Consider using Amazon Simple Queue Service (Amazon SQS) to add a queue between the PostAuthProcessorLambda function and the LatestPostAuthRecordsDDB DynamoDB table to optimize processing. This approach decouples user sign-in actions from DynamoDB writes, and allows for batching writes to DynamoDB, reducing the number of write requests.

Clean up

Avoid unwanted charges by cleaning up the resources you’ve created. To decommission the solution, follow these steps:

  1. Remove the Lambda trigger from the Amazon Cognito user pool:
    1. Navigate to the Amazon Cognito console.
    2. Select the user pool you have been working with.
    3. Go to the Triggers section within the user pool settings.
    4. Manually remove the association of the Lambda function with the user pool events.
  2. Remove the CloudFormation stack:
    1. Open the CloudFormation console.
    2. Locate and select the CloudFormation stack that was used to deploy the solution.
    3. Delete the stack.
    4. CloudFormation will automatically remove the resources created by this stack, including Lambda functions, Amazon SQS queues, and DynamoDB tables.

Conclusion

In this post, we walked you through a solution to identify and disable stale user accounts based on periods of inactivity. While the example focuses on a single Amazon Cognito user pool, the approach can be adapted for more complex environments with multiple user pools across multiple accounts. For examples of Amazon Cognito architectures, see the AWS Architecture Blog.

Proper planning is essential for seamless integration with your existing infrastructure. Carefully consider factors such as your security environment, compliance needs, and user pool configurations. You can modify this solution to suit your specific use case.

Maintaining clean and active user pools is an ongoing journey. Continue monitoring your systems, optimizing configurations, and keeping up-to-date on new features. Combined with well-architected preventive measures, automated user management systems provide strong defenses for your applications and data.

For further reading, see the AWS Well-Architected Security Pillar and more posts like this one on the AWS Security Blog.

If you have feedback about this post, submit comments in the Comments section. If you have questions about this post, start a new thread on the Amazon Cognito re:Post forum or contact AWS Support.

Harun Abdi

Harun Abdi

Harun is a Startup Solutions Architect based in Toronto, Canada. Harun loves working with customers across different sectors, supporting them to architect reliable and scalable solutions. In his spare time, he enjoys playing soccer and spending time with friends and family.

Dylan Souvage

Dylan Souvage

Dylan is a Partner Solutions Architect based in Austin, Texas. Dylan loves working with customers to understand their business needs and enable them in their cloud journey. In his spare time, he enjoys going out in nature and going on long road trips.

Automate large-scale data validation using Amazon EMR and Apache Griffin

Post Syndicated from Dipal Mahajan original https://aws.amazon.com/blogs/big-data/automate-large-scale-data-validation-using-amazon-emr-and-apache-griffin/

Many enterprises are migrating their on-premises data stores to the AWS Cloud. During data migration, a key requirement is to validate all the data that has been moved from source to target. This data validation is a critical step, and if not done correctly, may result in the failure of the entire project. However, developing custom solutions to determine migration accuracy by comparing the data between the source and target can often be time-consuming.

In this post, we walk through a step-by-step process to validate large datasets after migration using a configuration-based tool using Amazon EMR and the Apache Griffin open source library. Griffin is an open source data quality solution for big data, which supports both batch and streaming mode.

In today’s data-driven landscape, where organizations deal with petabytes of data, the need for automated data validation frameworks has become increasingly critical. Manual validation processes are not only time-consuming but also prone to errors, especially when dealing with vast volumes of data. Automated data validation frameworks offer a streamlined solution by efficiently comparing large datasets, identifying discrepancies, and ensuring data accuracy at scale. With such frameworks, organizations can save valuable time and resources while maintaining confidence in the integrity of their data, thereby enabling informed decision-making and enhancing overall operational efficiency.

The following are standout features for this framework:

  • Utilizes a configuration-driven framework
  • Offers plug-and-play functionality for seamless integration
  • Conducts count comparison to identify any disparities
  • Implements robust data validation procedures
  • Ensures data quality through systematic checks
  • Provides access to a file containing mismatched records for in-depth analysis
  • Generates comprehensive reports for insights and tracking purposes

Solution overview

This solution uses the following services:

  • Amazon Simple Storage Service (Amazon S3) or Hadoop Distributed File System (HDFS) as the source and target.
  • Amazon EMR to run the PySpark script. We use a Python wrapper on top of Griffin to validate data between Hadoop tables created over HDFS or Amazon S3.
  • AWS Glue to catalog the technical table, which stores the results of the Griffin job.
  • Amazon Athena to query the output table to verify the results.

We use tables that store the count for each source and target table and also create files that show the difference of records between source and target.

The following diagram illustrates the solution architecture.

Architecture_Diagram

In the depicted architecture and our typical data lake use case, our data either resides n Amazon S3 or is migrated from on premises to Amazon S3 using replication tools such as AWS DataSync or AWS Database Migration Service (AWS DMS). Although this solution is designed to seamlessly interact with both Hive Metastore and the AWS Glue Data Catalog, we use the Data Catalog as our example in this post.

This framework operates within Amazon EMR, automatically running scheduled tasks on a daily basis, as per the defined frequency. It generates and publishes reports in Amazon S3, which are then accessible via Athena. A notable feature of this framework is its capability to detect count mismatches and data discrepancies, in addition to generating a file in Amazon S3 containing full records that didn’t match, facilitating further analysis.

In this example, we use three tables in an on-premises database to validate between source and target : balance_sheet, covid, and survery_financial_report.

Prerequisites

Before getting started, make sure you have the following prerequisites:

Deploy the solution

To make it straightforward for you to get started, we have created a CloudFormation template that automatically configures and deploys the solution for you. Complete the following steps:

  1. Create an S3 bucket in your AWS account called bdb-3070-griffin-datavalidation-blog-${AWS::AccountId}-${AWS::Region} (provide your AWS account ID and AWS Region).
  2. Unzip the following file to your local system.
  3. After unzipping the file to your local system, change <bucket name> to the one you created in your account (bdb-3070-griffin-datavalidation-blog-${AWS::AccountId}-${AWS::Region}) in the following files:
    1. bootstrap-bdb-3070-datavalidation.sh
    2. Validation_Metrics_Athena_tables.hql
    3. datavalidation/totalcount/totalcount_input.txt
    4. datavalidation/accuracy/accuracy_input.txt
  4. Upload all the folders and files in your local folder to your S3 bucket:
    aws s3 cp . s3://<bucket_name>/ --recursive

  5. Run the following CloudFormation template in your account.

The CloudFormation template creates a database called griffin_datavalidation_blog and an AWS Glue crawler called griffin_data_validation_blog on top of the data folder in the .zip file.

  1. Choose Next.
    Cloudformation_template_1
  2. Choose Next again.
  3. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  4. Choose Create stack.

You can view the stack outputs on the AWS Management Console or by using the following AWS CLI command:

aws cloudformation describe-stacks --stack-name <stack-name> --region us-east-1 --query Stacks[0].Outputs
  1. Run the AWS Glue crawler and verify that six tables have been created in the Data Catalog.
  2. Run the following CloudFormation template in your account.

This template creates an EMR cluster with a bootstrap script to copy Griffin-related JARs and artifacts. It also runs three EMR steps:

  • Create two Athena tables and two Athena views to see the validation matrix produced by the Griffin framework
  • Run count validation for all three tables to compare the source and target table
  • Run record-level and column-level validations for all three tables to compare between the source and target table
  1. For SubnetID, enter your subnet ID.
  2. Choose Next.
    Cloudformation_template_2
  3. Choose Next again.
  4. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  5. Choose Create stack.

You can view the stack outputs on the console or by using the following AWS CLI command:

aws cloudformation describe-stacks --stack-name <stack-name> --region us-east-1 --query Stacks[0].Outputs

It takes approximately 5 minutes for the deployment to complete. When the stack is complete, you should see the EMRCluster resource launched and available in your account.

When the EMR cluster is launched, it runs the following steps as part of the post-cluster launch:

  • Bootstrap action – It installs the Griffin JAR file and directories for this framework. It also downloads sample data files to use in the next step.
  • Athena_Table_Creation – It creates tables in Athena to read the result reports.
  • Count_Validation – It runs the job to compare the data count between source and target data from the Data Catalog table and stores the results in an S3 bucket, which will be read via an Athena table.
  • Accuracy – It runs the job to compare the data rows between the source and target data from the Data Catalog table and store the results in an S3 bucket, which will be read via the Athena table.

Athena_table

When the EMR steps are complete, your table comparison is done and ready to view in Athena automatically. No manual intervention is needed for validation.

Validate data with Python Griffin

When your EMR cluster is ready and all the jobs are complete, it means the count validation and data validation are complete. The results have been stored in Amazon S3 and the Athena table is already created on top of that. You can query the Athena tables to view the results, as shown in the following screenshot.

The following screenshot shows the count results for all tables.

Summary_table

The following screenshot shows the data accuracy results for all tables.

Detailed_view

The following screenshot shows the files created for each table with mismatched records. Individual folders are generated for each table directly from the job.

mismatched_records

Every table folder contains a directory for each day the job is run.

S3_path_mismatched

Within that specific date, a file named __missRecords contains records that do not match.

S3_path_mismatched_2

The following screenshot shows the contents of the __missRecords file.

__missRecords

Clean up

To avoid incurring additional charges, complete the following steps to clean up your resources when you’re done with the solution:

  1. Delete the AWS Glue database griffin_datavalidation_blog and drop the database griffin_datavalidation_blog cascade.
  2. Delete the prefixes and objects you created from the bucket bdb-3070-griffin-datavalidation-blog-${AWS::AccountId}-${AWS::Region}.
  3. Delete the CloudFormation stack, which removes your additional resources.

Conclusion

This post showed how you can use Python Griffin to accelerate the post-migration data validation process. Python Griffin helps you calculate count and row- and column-level validation, identifying mismatched records without writing any code.

For more information about data quality use cases, refer to Getting started with AWS Glue Data Quality from the AWS Glue Data Catalog and AWS Glue Data Quality.


About the Authors

Dipal Mahajan serves as a Lead Consultant at Amazon Web Services, providing expert guidance to global clients in developing highly secure, scalable, reliable, and cost-efficient cloud applications. With a wealth of experience in software development, architecture, and analytics across diverse sectors such as finance, telecom, retail, and healthcare, he brings invaluable insights to his role. Beyond the professional sphere, Dipal enjoys exploring new destinations, having already visited 14 out of 30 countries on his wish list.

Akhil is a Lead Consultant at AWS Professional Services. He helps customers design & build scalable data analytics solutions and migrate data pipelines and data warehouses to AWS. In his spare time, he loves travelling, playing games and watching movies.

Ramesh Raghupathy is a Senior Data Architect with WWCO ProServe at AWS. He works with AWS customers to architect, deploy, and migrate to data warehouses and data lakes on the AWS Cloud. While not at work, Ramesh enjoys traveling, spending time with family, and yoga.

Simplify your query management with search templates in Amazon OpenSearch Service

Post Syndicated from Arun Lakshmanan original https://aws.amazon.com/blogs/big-data/simplify-your-query-management-with-search-templates-in-amazon-opensearch-service/

Amazon OpenSearch Service is an Apache-2.0-licensed distributed search and analytics suite offered by AWS. This fully managed service allows organizations to secure data, perform keyword and semantic search, analyze logs, alert on anomalies, explore interactive log analytics, implement real-time application monitoring, and gain a more profound understanding of their information landscape. OpenSearch Service provides the tools and resources needed to unlock the full potential of your data. With its scalability, reliability, and ease of use, it’s a valuable solution for businesses seeking to optimize their data-driven decision-making processes and improve overall operational efficiency.

This post delves into the transformative world of search templates. We unravel the power of search templates in revolutionizing the way you handle queries, providing a comprehensive guide to help you navigate through the intricacies of this innovative solution. From optimizing search processes to saving time and reducing complexities, discover how incorporating search templates can elevate your query management game.

Search templates

Search templates empower developers to articulate intricate queries within OpenSearch, enabling their reuse across various application scenarios, eliminating the complexity of query generation in the code. This flexibility also grants you the ability to modify your queries without requiring application recompilation. Search templates in OpenSearch use the mustache template, which is a logic-free templating language. Search templates can be reused by their name. A search template that is based on mustache has a query structure and placeholders for the variable values. You use the _search API to query, specifying the actual values that OpenSearch should use. You can create placeholders for variables that will be changed to their true values at runtime. Double curly braces ({{}}) serve as placeholders in templates.

Mustache enables you to generate dynamic filters or queries based on the values passed in the search request, making your search requests more flexible and powerful.

In the following example, the search template runs the query in the “source” block by passing in the values for the field and value parameters from the “params” block:

GET /myindex/_search/template
 { 
      "source": {   
         "query": { 
             "bool": {
               "must": [
                 {
                   "match": {
                    "{{field}}": "{{value}}"
                 }
             }
        ]
     }
    }
  },
 "params": {
    "field": "place",
    "value": "sweethome"
  }
}

You can store templates in the cluster with a name and refer to them in a search instead of attaching the template in each request. You use the PUT _scripts API to publish a template to the cluster. Let’s say you have an index of books, and you want to search for books with publication date, ratings, and price. You could create and publish a search template as follows:

PUT /_scripts/find_book
{
  "script": {
    "lang": "mustache",
    "source": {
      "query": {
        "bool": {
          "must": [
            {
              "range": {
                "publish_date": {
                  "gte": "{{gte_date}}"
                }
              }
            },
            {
              "range": {
                "rating": {
                  "gte": "{{gte_rating}}"
                }
              }
            },
            {
              "range": {
                "price": {
                  "lte": "{{lte_price}}"
                }
              }
            }
          ]
        }
      }
    }
  }
}

In this example, you define a search template called find_book that uses the mustache template language with defined placeholders for the gte_date, gte_rating, and lte_price parameters.

To use the search template stored in the cluster, you can send a request to OpenSearch with the appropriate parameters. For example, you can search for products that have been published in the last year with ratings greater than 4.0, and priced less than $20:

POST /books/_search/template
{
  "id": "find_book",
  "params": {
    "gte_date": "now-1y",
    "gte_rating": 4.0,
    "lte_price": 20
  }
}

This query will return all books that have been published in the last year, with a rating of at least 4.0, and a price less than $20 from the books index.

Default values in search templates

Default values are values that are used for search parameters when the query that engages the template doesn’t specify values for them. In the context of the find_book example, you can set default values for the from, size, and gte_date parameters in case they are not provided in the search request. To set default values, you can use the following mustache template:

PUT /_scripts/find_book
{
  "script": {
    "lang": "mustache",
    "source": {
      "query": {
        "bool": {
          "filter": [
            {
              "range": {
                "publish_date": {
                  "gte": "{{gte_date}}{{^gte_date}}now-1y{{/gte_date}}"
                }
              }
            },
            {
              "range": {
                "rating": {
                  "gte": "{{gte_rating}}"
                }
              }
            },
            {
              "range": {
                "price": {
                  "lte": "{{lte_price}}"
                }
              }
            }
          ]
        },
        "from": "{{from}}{{^from}}0{{/from}}",
        "size": "{{size}}{{^size}}2{{/size}}"
      }
    }
  }
}

In this template, the {{from}}, {{size}}, and {{gte_date}} parameters are placeholders that can be filled in with specific values when the template is used in a search. If no value is specified for {{from}}, {{size}}, and {{gte_date}}, OpenSearch uses the default values of 0, 2, and now-1y, respectively. This means that if a user searches for products without specifying from, size, and gte_date, the search will return just two products matching the search criteria for 1 year.

You can also use the render API as follows if you have a stored template and want to validate it:

POST _render/template
{
  "id": "find_book",
  "params": {
    "gte_date": "now-1y",
    "gte_rating": 4.0,
    "lte_price": 20
  }
}

Conditions in search templates

The conditional statement that allows you to control the flow of your search template based on certain conditions. It’s often used to include or exclude certain parts of the search request based on certain parameters. The syntax as follows:

{{#Any condition}}
  ... code to execute if the condition is true ...
{{/Any}}

The following example searches for books based on the gte_date, gte_rating, and lte_price parameters and an optional stock parameter. The if condition is used to include the condition_block/term query only if the stock parameter is present in the search request. If the is_available parameter is not present, the condition_block/term query will be skipped.

GET /books/_search/template
{
  "source": """{
    "query": {
      "bool": {
        "must": [
        {{#is_available}}
        {
          "term": {
            "in_stock": "{{is_available}}"
          }
        },
        {{/is_available}}
          {
            "range": {
              "publish_date": {
                "gte": "{{gte_date}}"
              }
            }
          },
          {
            "range": {
              "rating": {
                "gte": "{{gte_rating}}"
              }
            }
          },
          {
            "range": {
              "price": {
                "lte": "{{lte_price}}"
              }
            }
          }
        ]
      }
    }
  }""",
  "params": {
    "gte_date": "now-3y",
    "gte_rating": 4.0,
    "lte_price": 20,
    "is_available": true
  }
}

By using a conditional statement in this way, you can make your search requests more flexible and efficient by only including the necessary filters when they are needed.

To make the query valid inside the JSON, it needs to be escaped with triple quotes (""") in the payload.

Loops in search templates

A loop is a feature of mustache templates that allows you to iterate over an array of values and run the same code block for each item in the array. It’s often used to generate a dynamic list of filters or queries based on the values passed in the search request. The syntax is as follows:

{{#list item in array}}
  ... code to execute for each item ...
{{/list}}

The following example searches for books based on a query string ({{query}}) and an array of categories to filter the search results. The mustache loop is used to generate a match filter for each item in the categories array.

GET books/_search/template
{
  "source": """{
    "query": {
      "bool": {
        "must": [
        {{#list}}
        {
          "match": {
            "category": "{{list}}"
          }
        }
        {{/list}}
          {
          "match": {
            "title": "{{name}}"
          }
        }
        ]
      }
    }
  }""",
  "params": {
    "name": "killer",
    "list": ["Classics", "comics", "Horror"]
  }
}

The search request is rendered as follows:

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "killer"
          }
        },
        {
          "match": {
            "category": "Classics"
          }
        },
        {
          "match": {
            "category": "comics"
          }
        },
        {
          "match": {
            "category": "Horror"
          }
        }
      ]
    }
  }
}

The loop has generated a match filter for each item in the categories array, resulting in a more flexible and efficient search request that filters by multiple categories. By using the loops, you can generate dynamic filters or queries based on the values passed in the search request, making your search requests more flexible and powerful.

Advantages of using search templates

The following are key advantages of using search templates:

  • Maintainability – By separating the query definition from the application code, search templates make it straightforward to manage changes to the query or tune search relevancy. You don’t have to compile and redeploy your application.
  • Consistency – You can construct search templates that allow you to design standardized query patterns and reuse them throughout your application, which can help maintain consistency across your queries.
  • Readability – Because templates can be constructed using a more terse and expressive syntax, complicated queries are straightforward to test and debug.
  • Testing – Search templates can be tested and debugged independently of the application code, facilitating simpler problem-solving and relevancy tuning without having to re-deploy the application. You can easily create A/B testing with different templates for the same search.
  • Flexibility – Search templates can be quickly updated or adjusted to account for modifications to the data or search specifications.

Best practices

Consider the following best practices when using search templates:

  •  Before deploying your template to production, make sure it is fully tested. You can test the effectiveness and correctness of your template with example data. It is highly recommended to run the application tests that use these templates before publishing.
  • Search templates allow for the addition of input parameters, which you can use to modify the query to suit the needs of a particular use case. Reusing the same template with varied inputs is made simpler by parameterizing the inputs.
  • Manage the templates in an external source control system.
  • Avoid hard-coding values inside the query—instead, use defaults.

Conclusion

In this post, you learned the basics of search templates, a powerful feature of OpenSearch, and how templates help streamline search queries and improve performance. With search templates, you can build more robust search applications in less time.

If you have feedback about this post, submit it in the comments section. If you have questions about this post, start a new thread on the Amazon OpenSearch Service forum or contact AWS Support.

Stay tuned for more exciting updates and new features in OpenSearch Service.


About the authors

Arun Lakshmanan is a Search Specialist with Amazon OpenSearch Service based out of Chicago, IL. He has over 20 years of experience working with enterprise customers and startups. He loves to travel and spend quality time with his family.

Madhan Kumar Baskaran works as a Search Engineer at AWS, specializing in Amazon OpenSearch Service. His primary focus involves assisting customers in constructing scalable search applications and analytics solutions. Based in Bengaluru, India, Madhan has a keen interest in data engineering and DevOps.

Terraform CI/CD and testing on AWS with the new Terraform Test Framework

Post Syndicated from Kevon Mayers original https://aws.amazon.com/blogs/devops/terraform-ci-cd-and-testing-on-aws-with-the-new-terraform-test-framework/

Image of HashiCorp Terraform logo and Amazon Web Services (AWS) Logo. Underneath the AWS Logo are the service logos for AWS CodeCommit, AWS CodeBuild, AWS CodePipeline, and Amazon S3. Graphic created by Kevon Mayers

Graphic created by Kevon Mayers

 Introduction

Organizations often use Terraform Modules to orchestrate complex resource provisioning and provide a simple interface for developers to enter the required parameters to deploy the desired infrastructure. Modules enable code reuse and provide a method for organizations to standardize deployment of common workloads such as a three-tier web application, a cloud networking environment, or a data analytics pipeline. When building Terraform modules, it is common for the module author to start with manual testing. Manual testing is performed using commands such as terraform validate for syntax validation, terraform plan to preview the execution plan, and terraform apply followed by manual inspection of resource configuration in the AWS Management Console. Manual testing is prone to human error, not scalable, and can result in unintended issues. Because modules are used by multiple teams in the organization, it is important to ensure that any changes to the modules are extensively tested before the release. In this blog post, we will show you how to validate Terraform modules and how to automate the process using a Continuous Integration/Continuous Deployment (CI/CD) pipeline.

Terraform Test

Terraform test is a new testing framework for module authors to perform unit and integration tests for Terraform modules. Terraform test can create infrastructure as declared in the module, run validation against the infrastructure, and destroy the test resources regardless if the test passes or fails. Terraform test will also provide warnings if there are any resources that cannot be destroyed. Terraform test uses the same HashiCorp Configuration Language (HCL) syntax used to write Terraform modules. This reduces the burden for modules authors to learn other tools or programming languages. Module authors run the tests using the command terraform test which is available on Terraform CLI version 1.6 or higher.

Module authors create test files with the extension *.tftest.hcl. These test files are placed in the root of the Terraform module or in a dedicated tests directory. The following elements are typically present in a Terraform tests file:

  • Provider block: optional, used to override the provider configuration, such as selecting AWS region where the tests run.
  • Variables block: the input variables passed into the module during the test, used to supply non-default values or to override default values for variables.
  • Run block: used to run a specific test scenario. There can be multiple run blocks per test file, Terraform executes run blocks in order. In each run block you specify the command Terraform (plan or apply), and the test assertions. Module authors can specify the conditions such as: length(var.items) != 0. A full list of condition expressions can be found in the HashiCorp documentation.

Terraform tests are performed in sequential order and at the end of the Terraform test execution, any failed assertions are displayed.

Basic test to validate resource creation

Now that we understand the basic anatomy of a Terraform tests file, let’s create basic tests to validate the functionality of the following Terraform configuration. This Terraform configuration will create an AWS CodeCommit repository with prefix name repo-.

# main.tf

variable "repository_name" {
  type = string
}
resource "aws_codecommit_repository" "test" {
  repository_name = format("repo-%s", var.repository_name)
  description     = "Test repository."
}

Now we create a Terraform test file in the tests directory. See the following directory structure as an example:

├── main.tf 
└── tests 
└── basic.tftest.hcl

For this first test, we will not perform any assertion except for validating that Terraform execution plan runs successfully. In the tests file, we create a variable block to set the value for the variable repository_name. We also added the run block with command = plan to instruct Terraform test to run Terraform plan. The completed test should look like the following:

# basic.tftest.hcl

variables {
  repository_name = "MyRepo"
}

run "test_resource_creation" {
  command = plan
}

Now we will run this test locally. First ensure that you are authenticated into an AWS account, and run the terraform init command in the root directory of the Terraform module. After the provider is initialized, start the test using the terraform test command.

❯ terraform test
tests/basic.tftest.hcl... in progress
run "test_resource_creation"... pass
tests/basic.tftest.hcl... tearing down
tests/basic.tftest.hcl... pass

Our first test is complete, we have validated that the Terraform configuration is valid and the resource can be provisioned successfully. Next, let’s learn how to perform inspection of the resource state.

Create resource and validate resource name

Re-using the previous test file, we add the assertion block to checks if the CodeCommit repository name starts with a string repo- and provide error message if the condition fails. For the assertion, we use the startswith function. See the following example:

# basic.tftest.hcl

variables {
  repository_name = "MyRepo"
}

run "test_resource_creation" {
  command = plan

  assert {
    condition = startswith(aws_codecommit_repository.test.repository_name, "repo-")
    error_message = "CodeCommit repository name ${var.repository_name} did not start with the expected value of ‘repo-****’."
  }
}

Now, let’s assume that another module author made changes to the module by modifying the prefix from repo- to my-repo-. Here is the modified Terraform module.

# main.tf

variable "repository_name" {
  type = string
}
resource "aws_codecommit_repository" "test" {
  repository_name = format("my-repo-%s", var.repository_name)
  description = "Test repository."
}

We can catch this mistake by running the the terraform test command again.

❯ terraform test
tests/basic.tftest.hcl... in progress
run "test_resource_creation"... fail
╷
│ Error: Test assertion failed
│
│ on tests/basic.tftest.hcl line 9, in run "test_resource_creation":
│ 9: condition = startswith(aws_codecommit_repository.test.repository_name, "repo-")
│ ├────────────────
│ │ aws_codecommit_repository.test.repository_name is "my-repo-MyRepo"
│
│ CodeCommit repository name MyRepo did not start with the expected value 'repo-***'.
╵
tests/basic.tftest.hcl... tearing down
tests/basic.tftest.hcl... fail

Failure! 0 passed, 1 failed.

We have successfully created a unit test using assertions that validates the resource name matches the expected value. For more examples of using assertions see the Terraform Tests Docs. Before we proceed to the next section, don’t forget to fix the repository name in the module (revert the name back to repo- instead of my-repo-) and re-run your Terraform test.

Testing variable input validation

When developing Terraform modules, it is common to use variable validation as a contract test to validate any dependencies / restrictions. For example, AWS CodeCommit limits the repository name to 100 characters. A module author can use the length function to check the length of the input variable value. We are going to use Terraform test to ensure that the variable validation works effectively. First, we modify the module to use variable validation.

# main.tf

variable "repository_name" {
  type = string
  validation {
    condition = length(var.repository_name) <= 100
    error_message = "The repository name must be less than or equal to 100 characters."
  }
}

resource "aws_codecommit_repository" "test" {
  repository_name = format("repo-%s", var.repository_name)
  description = "Test repository."
}

By default, when variable validation fails during the execution of Terraform test, the Terraform test also fails. To simulate this, create a new test file and insert the repository_name variable with a value longer than 100 characters.

# var_validation.tftest.hcl

variables {
  repository_name = “this_is_a_repository_name_longer_than_100_characters_7rfD86rGwuqhF3TH9d3Y99r7vq6JZBZJkhw5h4eGEawBntZmvy”
}

run “test_invalid_var” {
  command = plan
}

Notice on this new test file, we also set the command to Terraform plan, why is that? Because variable validation runs prior to Terraform apply, thus we can save time and cost by skipping the entire resource provisioning. If we run this Terraform test, it will fail as expected.

❯ terraform test
tests/basic.tftest.hcl… in progress
run “test_resource_creation”… pass
tests/basic.tftest.hcl… tearing down
tests/basic.tftest.hcl… pass
tests/var_validation.tftest.hcl… in progress
run “test_invalid_var”… fail
╷
│ Error: Invalid value for variable
│
│ on main.tf line 1:
│ 1: variable “repository_name” {
│ ├────────────────
│ │ var.repository_name is “this_is_a_repository_name_longer_than_100_characters_7rfD86rGwuqhF3TH9d3Y99r7vq6JZBZJkhw5h4eGEawBntZmvy”
│
│ The repository name must be less than or equal to 100 characters.
│
│ This was checked by the validation rule at main.tf:3,3-13.
╵
tests/var_validation.tftest.hcl… tearing down
tests/var_validation.tftest.hcl… fail

Failure! 1 passed, 1 failed.

For other module authors who might iterate on the module, we need to ensure that the validation condition is correct and will catch any problems with input values. In other words, we expect the validation condition to fail with the wrong input. This is especially important when we want to incorporate the contract test in a CI/CD pipeline. To prevent our test from failing due introducing an intentional error in the test, we can use the expect_failures attribute. Here is the modified test file:

# var_validation.tftest.hcl

variables {
  repository_name = “this_is_a_repository_name_longer_than_100_characters_7rfD86rGwuqhF3TH9d3Y99r7vq6JZBZJkhw5h4eGEawBntZmvy”
}

run “test_invalid_var” {
  command = plan

  expect_failures = [
    var.repository_name
  ]
}

Now if we run the Terraform test, we will get a successful result.

❯ terraform test
tests/basic.tftest.hcl… in progress
run “test_resource_creation”… pass
tests/basic.tftest.hcl… tearing down
tests/basic.tftest.hcl… pass
tests/var_validation.tftest.hcl… in progress
run “test_invalid_var”… pass
tests/var_validation.tftest.hcl… tearing down
tests/var_validation.tftest.hcl… pass

Success! 2 passed, 0 failed.

As you can see, the expect_failures attribute is used to test negative paths (the inputs that would cause failures when passed into a module). Assertions tend to focus on positive paths (the ideal inputs). For an additional example of a test that validates functionality of a completed module with multiple interconnected resources, see this example in the Terraform CI/CD and Testing on AWS Workshop.

Orchestrating supporting resources

In practice, end-users utilize Terraform modules in conjunction with other supporting resources. For example, a CodeCommit repository is usually encrypted using an AWS Key Management Service (KMS) key. The KMS key is provided by end-users to the module using a variable called kms_key_id. To simulate this test, we need to orchestrate the creation of the KMS key outside of the module. In this section we will learn how to do that. First, update the Terraform module to add the optional variable for the KMS key.

# main.tf

variable "repository_name" {
  type = string
  validation {
    condition = length(var.repository_name) <= 100
    error_message = "The repository name must be less than or equal to 100 characters."
  }
}

variable "kms_key_id" {
  type = string
  default = ""
}

resource "aws_codecommit_repository" "test" {
  repository_name = format("repo-%s", var.repository_name)
  description = "Test repository."
  kms_key_id = var.kms_key_id != "" ? var.kms_key_id : null
}

In a Terraform test, you can instruct the run block to execute another helper module. The helper module is used by the test to create the supporting resources. We will create a sub-directory called setup under the tests directory with a single kms.tf file. We also create a new test file for KMS scenario. See the updated directory structure:

├── main.tf
└── tests
├── setup
│ └── kms.tf
├── basic.tftest.hcl
├── var_validation.tftest.hcl
└── with_kms.tftest.hcl

The kms.tf file is a helper module to create a KMS key and provide its ARN as the output value.

# kms.tf

resource "aws_kms_key" "test" {
  description = "test KMS key for CodeCommit repo"
  deletion_window_in_days = 7
}

output "kms_key_id" {
  value = aws_kms_key.test.arn
}

The new test will use two separate run blocks. The first run block (setup) executes the helper module to generate a KMS key. This is done by assigning the command apply which will run terraform apply to generate the KMS key. The second run block (codecommit_with_kms) will then use the KMS key ARN output of the first run as the input variable passed to the main module.

# with_kms.tftest.hcl

run "setup" {
  command = apply
  module {
    source = "./tests/setup"
  }
}

run "codecommit_with_kms" {
  command = apply

  variables {
    repository_name = "MyRepo"
    kms_key_id = run.setup.kms_key_id
  }

  assert {
    condition = aws_codecommit_repository.test.kms_key_id != null
    error_message = "KMS key ID attribute value is null"
  }
}

Go ahead and run the Terraform init, followed by Terraform test. You should get the successful result like below.

❯ terraform test
tests/basic.tftest.hcl... in progress
run "test_resource_creation"... pass
tests/basic.tftest.hcl... tearing down
tests/basic.tftest.hcl... pass
tests/var_validation.tftest.hcl... in progress
run "test_invalid_var"... pass
tests/var_validation.tftest.hcl... tearing down
tests/var_validation.tftest.hcl... pass
tests/with_kms.tftest.hcl... in progress
run "create_kms_key"... pass
run "codecommit_with_kms"... pass
tests/with_kms.tftest.hcl... tearing down
tests/with_kms.tftest.hcl... pass

Success! 4 passed, 0 failed.

We have learned how to run Terraform test and develop various test scenarios. In the next section we will see how to incorporate all the tests into a CI/CD pipeline.

Terraform Tests in CI/CD Pipelines

Now that we have seen how Terraform Test works locally, let’s see how the Terraform test can be leveraged to create a Terraform module validation pipeline on AWS. The following AWS services are used:

  • AWS CodeCommit – a secure, highly scalable, fully managed source control service that hosts private Git repositories.
  • AWS CodeBuild – a fully managed continuous integration service that compiles source code, runs tests, and produces ready-to-deploy software packages.
  • AWS CodePipeline – a fully managed continuous delivery service that helps you automate your release pipelines for fast and reliable application and infrastructure updates.
  • Amazon Simple Storage Service (Amazon S3) – an object storage service offering industry-leading scalability, data availability, security, and performance.
Terraform module validation pipeline Architecture. Multiple interconnected AWS services such as AWS CodeCommit, CodeBuild, CodePipeline, and Amazon S3 used to build a Terraform module validation pipeline.

Terraform module validation pipeline

In the above architecture for a Terraform module validation pipeline, the following takes place:

  • A developer pushes Terraform module configuration files to a git repository (AWS CodeCommit).
  • AWS CodePipeline begins running the pipeline. The pipeline clones the git repo and stores the artifacts to an Amazon S3 bucket.
  • An AWS CodeBuild project configures a compute/build environment with Checkov installed from an image fetched from Docker Hub. CodePipeline passes the artifacts (Terraform module) and CodeBuild executes Checkov to run static analysis of the Terraform configuration files.
  • Another CodeBuild project configured with Terraform from an image fetched from Docker Hub. CodePipeline passes the artifacts (repo contents) and CodeBuild runs Terraform command to execute the tests.

CodeBuild uses a buildspec file to declare the build commands and relevant settings. Here is an example of the buildspec files for both CodeBuild Projects:

# Checkov
version: 0.1
phases:
  pre_build:
    commands:
      - echo pre_build starting

  build:
    commands:
      - echo build starting
      - echo starting checkov
      - ls
      - checkov -d .
      - echo saving checkov output
      - checkov -s -d ./ > checkov.result.txt

In the above buildspec, Checkov is run against the root directory of the cloned CodeCommit repository. This directory contains the configuration files for the Terraform module. Checkov also saves the output to a file named checkov.result.txt for further review or handling if needed. If Checkov fails, the pipeline will fail.

# Terraform Test
version: 0.1
phases:
  pre_build:
    commands:
      - terraform init
      - terraform validate

  build:
    commands:
      - terraform test

In the above buildspec, the terraform init and terraform validate commands are used to initialize Terraform, then check if the configuration is valid. Finally, the terraform test command is used to run the configured tests. If any of the Terraform tests fails, the pipeline will fail.

For a full example of the CI/CD pipeline configuration, please refer to the Terraform CI/CD and Testing on AWS workshop. The module validation pipeline mentioned above is meant as a starting point. In a production environment, you might want to customize it further by adding Checkov allow-list rules, linting, checks for Terraform docs, or pre-requisites such as building the code used in AWS Lambda.

Choosing various testing strategies

At this point you may be wondering when you should use Terraform tests or other tools such as Preconditions and Postconditions, Check blocks or policy as code. The answer depends on your test type and use-cases. Terraform test is suitable for unit tests, such as validating resources are created according to the naming specification. Variable validations and Pre/Post conditions are useful for contract tests of Terraform modules, for example by providing error warning when input variables value do not meet the specification. As shown in the previous section, you can also use Terraform test to ensure your contract tests are running properly. Terraform test is also suitable for integration tests where you need to create supporting resources to properly test the module functionality. Lastly, Check blocks are suitable for end to end tests where you want to validate the infrastructure state after all resources are generated, for example to test if a website is running after an S3 bucket configured for static web hosting is created.

When developing Terraform modules, you can run Terraform test in command = plan mode for unit and contract tests. This allows the unit and contract tests to run quicker and cheaper since there are no resources created. You should also consider the time and cost to execute Terraform test for complex / large Terraform configurations, especially if you have multiple test scenarios. Terraform test maintains one or many state files within the memory for each test file. Consider how to re-use the module’s state when appropriate. Terraform test also provides test mocking, which allows you to test your module without creating the real infrastructure.

Conclusion

In this post, you learned how to use Terraform test and develop various test scenarios. You also learned how to incorporate Terraform test in a CI/CD pipeline. Lastly, we also discussed various testing strategies for Terraform configurations and modules. For more information about Terraform test, we recommend the Terraform test documentation and tutorial. To get hands on practice building a Terraform module validation pipeline and Terraform deployment pipeline, check out the Terraform CI/CD and Testing on AWS Workshop.

Authors

Kevon Mayers

Kevon Mayers is a Solutions Architect at AWS. Kevon is a Terraform Contributor and has led multiple Terraform initiatives within AWS. Prior to joining AWS he was working as a DevOps Engineer and Developer, and before that was working with the GRAMMYs/The Recording Academy as a Studio Manager, Music Producer, and Audio Engineer. He also owns a professional production company, MM Productions.

Welly Siauw

Welly Siauw is a Principal Partner Solution Architect at Amazon Web Services (AWS). He spends his day working with customers and partners, solving architectural challenges. He is passionate about service integration and orchestration, serverless and artificial intelligence (AI) and machine learning (ML). He has authored several AWS blog posts and actively leads AWS Immersion Days and Activation Days. Welly spends his free time tinkering with espresso machines and outdoor hiking.

How to generate security findings to help your security team with incident response simulations

Post Syndicated from Jonathan Nguyen original https://aws.amazon.com/blogs/security/how-to-generate-security-findings-to-help-your-security-team-with-incident-response-simulations/

Continually reviewing your organization’s incident response capabilities can be challenging without a mechanism to create security findings with actual Amazon Web Services (AWS) resources within your AWS estate. As prescribed within the AWS Security Incident Response whitepaper, it’s important to periodically review your incident response capabilities to make sure your security team is continually maturing internal processes and assessing capabilities within AWS. Generating sample security findings is useful to understand the finding format so you can enrich the finding with additional metadata or create and prioritize detections within your security information event management (SIEM) solution. However, if you want to conduct an end-to-end incident response simulation, including the creation of real detections, sample findings might not create actionable detections that will start your incident response process because of alerting suppressions you might have configured, or imaginary metadata (such as synthetic Amazon Elastic Compute Cloud (Amazon EC2) instance IDs), which might confuse your remediation tooling.

In this post, we walk through how to deploy a solution that provisions resources to generate simulated security findings for actual provisioned resources within your AWS account. Generating simulated security findings in your AWS account gives your security team an opportunity to validate their cyber capabilities, investigation workflow and playbooks, escalation paths across teams, and exercise any response automation currently in place.

Important: It’s strongly recommended that the solution be deployed in an isolated AWS account with no additional workloads or sensitive data. No resources deployed within the solution should be used for any purpose outside of generating the security findings for incident response simulations. Although the security findings are non-destructive to existing resources, they should still be done in isolation. For any AWS solution deployed within your AWS environment, your security team should review the resources and configurations within the code.

Conducting incident response simulations

Before deploying the solution, it’s important that you know what your goal is and what type of simulation to conduct. If you’re primarily curious about the format that active Amazon GuardDuty findings will create, you should generate sample findings with GuardDuty. At the time of this writing, Amazon Inspector doesn’t currently generate sample findings.

If you want to validate your incident response playbooks, make sure you have playbooks for the security findings the solution generates. If those playbooks don’t exist, it might be a good idea to start with a high-level tabletop exercise to identify which playbooks you need to create.

Because you’re running this sample in an AWS account with no workloads, it’s recommended to run the sample solution as a purple team exercise. Purple team exercises should be periodically run to support training for new analysts, validate existing playbooks, and identify areas of improvement to reduce the mean time to respond or identify areas where processes can be optimized with automation.

Now that you have a good understanding of the different simulation types, you can create security findings in an isolated AWS account.

Prerequisites

  1. [Recommended] A separate AWS account containing no customer data or running workloads
  2. GuardDuty, along with GuardDuty Kubernetes Protection
  3. Amazon Inspector must be enabled
  4. [Optional] AWS Security Hub can be enabled to show a consolidated view of security findings generated by GuardDuty and Inspector

Solution architecture

The architecture of the solution can be found in Figure 1.

Figure 1: Sample solution architecture diagram

Figure 1: Sample solution architecture diagram

  1. A user specifies the type of security findings to generate by passing an AWS CloudFormation parameter.
  2. An Amazon Simple Notification Service (Amazon SNS) topic is created to subscribe to findings for notifications. Subscribed users are notified of the finding through the deployed SNS topic.
  3. Upon user selection of the CloudFormation parameter, EC2 instances are provisioned to run commands to generate security findings.

    Note: If the parameter inspector is provided during deployment, then only one EC2 instance is deployed. If the parameter guardduty is provided during deployment, then two EC2 instances are deployed.

  4. For Amazon Inspector findings:
    1. The Amazon EC2 user data creates a .txt file with vulnerable images, pulls down Docker images from open source vulhub, and creates an Amazon Elastic Container Registry (Amazon ECR) repository with the vulnerable images.
    2. The EC2 user data pushes and tags the images in the ECR repository which results in Amazon Inspector findings being generated.
    3. An Amazon EventBridge cron-style trigger rule, inspector_remediation_ecr, invokes an AWS Lambda function.
    4. The Lambda function, ecr_cleanup_function, cleans up the vulnerable images in the deployed Amazon ECR repository based on applied tags and sends a notification to the Amazon SNS topic.

      Note: The ecr_cleanup_function Lambda function is also invoked as a custom resource to clean up vulnerable images during deployment. If there are issues with cleanup, the EventBridge rule continually attempts to clean up vulnerable images.

  5. For GuardDuty, the following actions are taken and resources are deployed:
    1. An AWS Identity and Access Management (IAM) user named guardduty-demo-user is created with an IAM access key that is INACTIVE.
    2. An AWS Systems Manager parameter stores the IAM access key for guardduty-demo-user.
    3. An AWS Secrets Manager secret stores the inactive IAM secret access key for guardduty-demo-user.
    4. An Amazon DynamoDB table is created, and the table name is stored in a Systems Manager parameter to be referenced within the EC2 user data.
    5. An Amazon Simple Storage Service (Amazon S3) bucket is created, and the bucket name is stored in a Systems Manager parameter to be referenced within the EC2 user data.
    6. A Lambda function adds a threat list to GuardDuty that includes the IP addresses of the EC2 instances deployed as part of the sample.
    7. EC2 user data generates GuardDuty findings for the following:
      1. Amazon Elastic Kubernetes Service (Amazon EKS)
        1. Installs eksctl from GitHub.
        2. Creates an EC2 key pair.
        3. Creates an EKS cluster (dependent on availability zone capacity).
        4. Updates EKS cluster configuration to make a dashboard public.
      2. DynamoDB
        1. Adds an item to the DynamoDB table for Joshua Tree.
      3. EC2
        1. Creates an AWS CloudTrail trail named guardduty-demo-trail-<GUID> and subsequently deletes the same CloudTrail trail. The <GUID> is randomly generated by using the $RANDOM function
        2. Runs portscan on 172.31.37.171 (an RFC 1918 private IP address) and private IP of the EKS Deployment EC2 instance provisioned as part of the sample. Port scans are primarily used by bad actors to search for potential vulnerabilities. The target of the port scans are internal IP addresses and do not leave the sample VPC deployed.
        3. Curls DNS domains that are labeled for bitcoin, command and control, and other domains associated with known threats.
      4. Amazon S3
        1. Disables Block Public Access and server access logging for the S3 bucket provisioned as part of the solution.
      5. IAM
        1. Deletes the existing account password policy and creates a new password policy with a minimum length of six characters.
  6. The following Amazon EventBridge rules are created:
    1. guardduty_remediation_eks_rule – When a GuardDuty finding for EKS is created, a Lambda function attempts to delete the EKS resources. Subscribed users are notified of the finding through the deployed SNS topic.
    2. guardduty_remediation_credexfil_rule – When a GuardDuty finding for InstanceCredentialExfiltration is created, a Lambda function is used to revoke the IAM role’s temporary security credentials and AWS permissions. Subscribed users are notified of the finding through the deployed SNS topic.
    3. guardduty_respond_IAMUser_rule – When a GuardDuty finding for IAM is created, subscribed users are notified through the deployed SNS topic. There is no remediation activity triggered by this rule.
    4. Guardduty_notify_S3_rule – When a GuardDuty finding for Amazon S3 is created, subscribed users are notified through the deployed Amazon SNS topic. This rule doesn’t invoke any remediation activity.
  7. The following Lambda functions are created:
    1. guardduty_iam_remediation_function – This function revokes active sessions and sends a notification to the SNS topic.
    2. eks_cleanup_function – This function deletes the EKS resources in the EKS CloudFormation template.

      Note: Upon attempts to delete the overall sample CloudFormation stack, this runs to delete the EKS CloudFormation template.

  8. An S3 bucket stores EC2 user data scripts run from the EC2 instances

Solution deployment

You can deploy the SecurityFindingGeneratorStack solution by using either the AWS Management Console or the AWS Cloud Development Kit (AWS CDK).

Option 1: Deploy the solution with AWS CloudFormation using the console

Use the console to sign in to your chosen AWS account and then choose the Launch Stack button to open the AWS CloudFormation console pre-loaded with the template for this solution. It takes approximately 10 minutes for the CloudFormation stack to complete.

Launch Stack

Option 2: Deploy the solution by using the AWS CDK

You can find the latest code for the SecurityFindingGeneratorStack solution in the SecurityFindingGeneratorStack GitHub repository, where you can also contribute to the sample code. For instructions and more information on using the AWS Cloud Development Kit (AWS CDK), see Get Started with AWS CDK.

To deploy the solution by using the AWS CDK

  1. To build the app when navigating to the project’s root folder, use the following commands:
    npm install -g aws-cdk-lib
    npm install

  2. Run the following command in your terminal while authenticated in your separate deployment AWS account to bootstrap your environment. Be sure to replace <INSERT_AWS_ACCOUNT> with your account number and replace <INSERT_REGION> with the AWS Region that you want the solution deployed to.
    cdk bootstrap aws://<INSERT_AWS_ACCOUNT>/<INSERT_REGION>

  3. Deploy the stack to generate findings based on a specific parameter that is passed. The following parameters are available:
    1. inspector
    2. guardduty
    cdk deploy SecurityFindingGeneratorStack –parameters securityserviceuserdata=inspector

Reviewing security findings

After the solution successfully deploys, security findings should start appearing in your AWS account’s GuardDuty console within a couple of minutes.

Amazon GuardDuty findings

In order to create a diverse set of GuardDuty findings, the solution uses Amazon EC2 user data to run scripts. Those scripts can be found in the sample repository. You can also review and change scripts as needed to fit your use case or to remove specific actions if you don’t want specific resources to be altered or security findings to be generated.

A comprehensive list of active GuardDuty finding types and details for each finding can be found in the Amazon GuardDuty user guide. In this solution, activities which cause the following GuardDuty findings to be generated, are performed:

To generate the EKS security findings, the EKS Deployment EC2 instance is running eksctl commands that deploy CloudFormation templates. If the EKS cluster doesn’t deploy, it might be because of capacity restraints in a specific Availability Zone. If this occurs, manually delete the failed EKS CloudFormation templates.

If you want to create the EKS cluster and security findings manually, you can do the following:

  1. Sign in to the Amazon EC2 console.
  2. Connect to the EKS Deployment EC2 instance using an IAM role that has access to start a session through Systems Manager. After connecting to the ssm-user, issue the following commands in the Session Manager session:
    1. sudo chmod 744 /home/ec2-user/guardduty-script.sh
    2. chown ec2-user /home/ec2-user/guardduty-script.sh
    3. sudo /home/ec2-user/guardduty-script.sh

It’s important that your security analysts have an incident response playbook. If playbooks don’t exist, you can refer to the GuardDuty remediation recommendations or AWS sample incident response playbooks to get started building playbooks.

Amazon Inspector findings

The findings for Amazon Inspector are generated by using the open source Vulhub collection. The open source collection has pre-built vulnerable Docker environments that pull images into Amazon ECR.

The Amazon Inspector findings that are created vary depending on what exists within the open source library at deployment time. The following are examples of findings you will see in the console:

For Amazon Inspector findings, you can refer to parts 1 and 2 of Automate vulnerability management and remediation in AWS using Amazon Inspector and AWS Systems Manager.

Clean up

If you deployed the security finding generator solution by using the Launch Stack button in the console or the CloudFormation template security_finding_generator_cfn, do the following to clean up:

  1. In the CloudFormation console for the account and Region where you deployed the solution, choose the SecurityFindingGeneratorStack stack.
  2. Choose the option to Delete the stack.

If you deployed the solution by using the AWS CDK, run the command cdk destroy.

Important: The solution uses eksctl to provision EKS resources, which deploys additional CloudFormation templates. There are custom resources within the solution that will attempt to delete the provisioned CloudFormation templates for EKS. If there are any issues, you should verify and manually delete the following CloudFormation templates:

  • eksctl-GuardDuty-Finding-Demo-cluster
  • eksctl-GuardDuty-Finding-Demo-addon-iamserviceaccount-kube-system-aws-node
  • eksctl-GuardDuty-Finding-Demo-nodegroup-ng-<GUID>

Conclusion

In this blog post, I showed you how to deploy a solution to provision resources in an AWS account to generate security findings. This solution provides a technical framework to conduct periodic simulations within your AWS environment. By having real, rather than simulated, security findings, you can enable your security teams to interact with actual resources and validate existing incident response processes. Having a repeatable mechanism to create security findings also provides your security team the opportunity to develop and test automated incident response capabilities in your AWS environment.

AWS has multiple services to assist with increasing your organization’s security posture. Security Hub provides native integration with AWS security services as well as partner services. From Security Hub, you can also implement automation to respond to findings using custom actions as seen in Use Security Hub custom actions to remediate S3 resources based on Amazon Macie discovery results. In part two of a two-part series, you can learn how to use Amazon Detective to investigate security findings in EKS clusters. Amazon Security Lake automatically normalizes and centralizes your data from AWS services such as Security Hub, AWS CloudTrail, VPC Flow Logs, and Amazon Route 53, as well as custom sources to provide a mechanism for comprehensive analysis and visualizations.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Incident Response re:Post or contact AWS Support.

Author

Jonathan Nguyen

Jonathan is a Principal Security Architect at AWS. His background is in AWS security with a focus on threat detection and incident response. He helps enterprise customers develop a comprehensive AWS security strategy and deploy security solutions at scale, and trains customers on AWS security best practices.

Improve healthcare services through patient 360: A zero-ETL approach to enable near real-time data analytics

Post Syndicated from Saeed Barghi original https://aws.amazon.com/blogs/big-data/improve-healthcare-services-through-patient-360-a-zero-etl-approach-to-enable-near-real-time-data-analytics/

Healthcare providers have an opportunity to improve the patient experience by collecting and analyzing broader and more diverse datasets. This includes patient medical history, allergies, immunizations, family disease history, and individuals’ lifestyle data such as workout habits. Having access to those datasets and forming a 360-degree view of patients allows healthcare providers such as claim analysts to see a broader context about each patient and personalize the care they provide for every individual. This is underpinned by building a complete patient profile that enables claim analysts to identify patterns, trends, potential gaps in care, and adherence to care plans. They can then use the result of their analysis to understand a patient’s health status, treatment history, and past or upcoming doctor consultations to make more informed decisions, streamline the claim management process, and improve operational outcomes. Achieving this will also improve general public health through better and more timely interventions, identify health risks through predictive analytics, and accelerate the research and development process.

AWS has invested in a zero-ETL (extract, transform, and load) future so that builders can focus more on creating value from data, instead of having to spend time preparing data for analysis. The solution proposed in this post follows a zero-ETL approach to data integration to facilitate near real-time analytics and deliver a more personalized patient experience. The solution uses AWS services such as AWS HealthLake, Amazon Redshift, Amazon Kinesis Data Streams, and AWS Lake Formation to build a 360 view of patients. These services enable you to collect and analyze data in near real time and put a comprehensive data governance framework in place that uses granular access control to secure sensitive data from unauthorized users.

Zero-ETL refers to a set of features on the AWS Cloud that enable integrating different data sources with Amazon Redshift:

Solution overview

Organizations in the healthcare industry are currently spending a significant amount of time and money on building complex ETL pipelines for data movement and integration. This means data will be replicated across multiple data stores via bespoke and in some cases hand-written ETL jobs, resulting in data inconsistency, latency, and potential security and privacy breaches.

With support for querying cross-account Apache Iceberg tables via Amazon Redshift, you can now build a more comprehensive patient-360 analysis by querying all patient data from one place. This means you can seamlessly combine information such as clinical data stored in HealthLake with data stored in operational databases such as a patient relationship management system, together with data produced from wearable devices in near real-time. Having access to all this data enables healthcare organizations to form a holistic view of patients, improve care coordination across multiple organizations, and provide highly personalized care for each individual.

The following diagram depicts the high-level solution we build to achieve these outcomes.

Deploy the solution

You can use the following AWS CloudFormation template to deploy the solution components:

This stack creates the following resources and necessary permissions to integrate the services:

AWS Solution setup

AWS HealthLake

AWS HealthLake enables organizations in the health industry to securely store, transform, transact, and analyze health data. It stores data in HL7 FHIR format, which is an interoperability standard designed for quick and efficient exchange of health data. When you create a HealthLake data store, a Fast Healthcare Interoperability Resources (FHIR) data repository is made available via a RESTful API endpoint. Simultaneously and as part of AWS HealthLake managed service, the nested JSON FHIR data undergoes an ETL process and is stored in Apache Iceberg open table format in Amazon S3.

To create an AWS HealthLake data store, refer to Getting started with AWS HealthLake. Make sure to select the option Preload sample data when creating your data store.

In real-world scenarios and when you use AWS HealthLake in production environments, you don’t need to load sample data into your AWS HealthLake data store. Instead, you can use FHIR REST API operations to manage and search resources in your AWS HealthLake data store.

We use two tables from the sample data stored in HealthLake: patient and allergyintolerance.

Query AWS HealthLake tables with Redshift Serverless

Amazon Redshift is the data warehousing service available on the AWS Cloud that provides up to six times better price-performance than any other cloud data warehouses in the market, with a fully managed, AI-powered, massively parallel processing (MPP) data warehouse built for performance, scale, and availability. With continuous innovations added to Amazon Redshift, it is now more than just a data warehouse. It enables organizations of different sizes and in different industries to access all the data they have in their AWS environments and analyze it from one single location with a set of features under the zero-ETL umbrella. Amazon Redshift integrates with AWS HealthLake and data lakes through Redshift Spectrum and Amazon S3 auto-copy features, enabling you to query data directly from files on Amazon S3.

Query AWS HealthLake data with Amazon Redshift

Amazon Redshift makes it straightforward to query the data stored in S3-based data lakes with automatic mounting of an AWS Glue Data Catalog in the Redshift query editor v2. This means you no longer have to create an external schema in Amazon Redshift to use the data lake tables cataloged in the Data Catalog. To get started with this feature, see Querying the AWS Glue Data Catalog. After it is set up and you’re connected to the Redshift query editor v2, complete the following steps:

  1. Validate that your tables are visible in the query editor V2. The Data Catalog objects are listed under the awsdatacatalog database.

FHIR data stored in AWS HealthLake is highly nested. To learn about how to un-nest semi-structured data with Amazon Redshift, see Tutorial: Querying nested data with Amazon Redshift Spectrum.

  1. Use the following query to un-nest the allergyintolerance and patient tables, join them together, and get patient details and their allergies:
    WITH patient_allergy AS 
    (
        SELECT
            resourcetype, 
            c AS allery_category,
            a."patient"."reference",
            SUBSTRING(a."patient"."reference", 9, LEN(a."patient"."reference")) AS patient_id,
            a.recordeddate AS allergy_record_date,
            NVL(cd."code", 'NA') AS allergy_code,
            NVL(cd.display, 'NA') AS allergy_description
    
        FROM "awsdatacatalog"."datastore_01_179674d36391d68926a8d74c12599306_healthlake_view"."allergyintolerance" a
                LEFT JOIN a.category c ON TRUE
                LEFT JOIN a.reaction r ON TRUE
                LEFT JOIN r.manifestation m ON TRUE
                LEFT JOIN m.coding cd ON TRUE
    ), patinet_info AS
    (
        SELECT id,
                gender,
                g as given_name,
                n.family as family_name,
                pr as prefix
    
        FROM "awsdatacatalog"."datastore_01_179674d36391d68926a8d74c12599306_healthlake_view"."patient" p
                LEFT JOIN p.name n ON TRUE
                LEFT JOIN n.given g ON TRUE
                LEFT JOIN n.prefix pr ON TRUE
    )
    SELECT DISTINCT p.id, 
            p.gender, 
            p.prefix,
            p.given_name,
            p.family_name,
            pa.allery_category,
            pa.allergy_code,
            pa.allergy_description
    from patient_allergy pa
        JOIN patinet_info p
            ON pa.patient_id = p.id
    ORDER BY p.id, pa.allergy_code
    ;
    

To eliminate the need for Amazon Redshift to un-nest data every time a query is run, you can create a materialized view to hold un-nested and flattened data. Materialized views are an effective mechanism to deal with complex and repeating queries. They contain a precomputed result set, based on a SQL query over one or more base tables. You can issue SELECT statements to query a materialized view, in the same way that you can query other tables or views in the database.

  1. Use the following SQL to create a materialized view. You use it later to build a complete view of patients:
    CREATE MATERIALIZED VIEW patient_allergy_info AUTO REFRESH YES AS
    WITH patient_allergy AS 
    (
        SELECT
            resourcetype, 
            c AS allery_category,
            a."patient"."reference",
            SUBSTRING(a."patient"."reference", 9, LEN(a."patient"."reference")) AS patient_id,
            a.recordeddate AS allergy_record_date,
            NVL(cd."code", 'NA') AS allergy_code,
            NVL(cd.display, 'NA') AS allergy_description
    
        FROM
            "awsdatacatalog"."datastore_01_179674d36391d68926a8d74c12599306_healthlake_view"."allergyintolerance" a
                LEFT JOIN a.category c ON TRUE
                LEFT JOIN a.reaction r ON TRUE
                LEFT JOIN r.manifestation m ON TRUE
                LEFT JOIN m.coding cd ON TRUE
    ), patinet_info AS
    (
        SELECT id,
                gender,
                g as given_name,
                n.family as family_name,
                pr as prefix
    
        FROM "awsdatacatalog"."datastore_01_179674d36391d68926a8d74c12599306_healthlake_view"."patient" p
                LEFT JOIN p.name n ON TRUE
                LEFT JOIN n.given g ON TRUE
                LEFT JOIN n.prefix pr ON TRUE
    )
    SELECT DISTINCT p.id, 
            p.gender, 
            p.prefix,
            p.given_name,
            p.family_name,
            pa.allery_category,
            pa.allergy_code,
            pa.allergy_description
    from patient_allergy pa
        JOIN patinet_info p
            ON pa.patient_id = p.id
    ORDER BY p.id, pa.allergy_code
    ;
    

You have confirmed you can query data in AWS HealthLake via Amazon Redshift. Next, you set up zero-ETL integration between Amazon Redshift and Amazon Aurora MySQL.

Set up zero-ETL integration between Amazon Aurora MySQL and Redshift Serverless

Applications such as front-desk software, which are used to schedule appointments and register new patients, store data in OLTP databases such as Aurora. To get data out of OLTP databases and have them ready for analytics use cases, data teams might have to spend a considerable amount of time to build, test, and deploy ETL jobs that are complex to maintain and scale.

With the Amazon Redshift zero-ETL integration with Amazon Aurora MySQL, you can run analytics on the data stored in OLTP databases and combine them with the rest of the data in Amazon Redshift and AWS HealthLake in near real time. In the next steps in this section, we connect to a MySQL database and set up zero-ETL integration with Amazon Redshift.

Connect to an Aurora MySQL database and set up data

Connect to your Aurora MySQL database using your editor of choice using AdminUsername and AdminPassword that you entered when running the CloudFormation stack. (For simplicity, it is the same for Amazon Redshift and Aurora.)

When you’re connected to your database, complete the following steps:

  1. Create a new database by running the following command:
    CREATE DATABASE front_desk_app_db;

  2. Create a new table. This table simulates storing patient information as they visit clinics and other healthcare centers. For simplicity and to demonstrate specific capabilities, we assume that patient IDs are the same in AWS HealthLake and the front-of-office application. In real-world scenarios, this can be a hashed version of a national health care number:
    CREATE TABLE patient_appointment ( 
          patient_id varchar(250), 
          gender varchar(1), 
          date_of_birth date, 
          appointment_datetime datetime, 
          phone_number varchar(15), 
          PRIMARY KEY (patient_id, appointment_datetime) 
    );

Having a primary key in the table is mandatory for zero-ETL integration to work.

  1. Insert new records into the source table in the Aurora MySQL database. To demonstrate the required functionalities, make sure the patient_id of the sample records inserted into the MySQL database match the ones in AWS HealthLake. Replace [patient_id_1] and [patient_id_2] in the following query with the ones from the Redshift query you ran previously (the query that joined allergyintolerance and patient):
    INSERT INTO front_desk_app_db.patient_appointment (patient_id, gender, date_of_birth, appointment_datetime, phone_number)
    
    VALUES([PATIENT_ID_1], 'F', '1988-7-04', '2023-12-19 10:15:00', '0401401401'),
    ([PATIENT_ID_1], 'F', '1988-7-04', '2023-09-19 11:00:00', '0401401401'),
    ([PATIENT_ID_1], 'F', '1988-7-04', '2023-06-06 14:30:00', '0401401401'),
    ([PATIENT_ID_2], 'F', '1972-11-14', '2023-12-19 08:15:00', '0401401402'),
    ([PATIENT_ID_2], 'F', '1972-11-14', '2023-01-09 12:15:00', '0401401402');

Now that your source table is populated with sample records, you can set up zero-ETL and have data ingested into Amazon Redshift.

Set up zero-ETL integration between Amazon Aurora MySQL and Amazon Redshift

Complete the following steps to create your zero-ETL integration:

  1. On the Amazon RDS console, choose Databases in the navigation pane.
  2. Choose the DB identifier of your cluster (not the instance).
  3. On the Zero-ETL Integration tab, choose Create zero-ETL integration.
  4. Follow the steps to create your integration.

Create a Redshift database from the integration

Next, you create a target database from the integration. You can do this by running a couple of simple SQL commands on Amazon Redshift. Log in to the query editor V2 and run the following commands:

  1. Get the integration ID of the zero-ETL you set up between your source database and Amazon Redshift:
    SELECT * FROM svv_integration;

  2. Create a database using the integration ID:
    CREATE DATABASE ztl_demo FROM INTEGRATION '[INTEGRATION_ID ';

  3. Query the database and validate that a new table is created and populated with data from your source MySQL database:
    SELECT * FROM ztl_demo.front_desk_app_db.patient_appointment;

It might take a few seconds for the first set of records to appear in Amazon Redshift.

This shows that the integration is working as expected. To validate it further, you can insert a new record in your Aurora MySQL database, and it will be available in Amazon Redshift for querying in near real time within a few seconds.

Set up streaming ingestion for Amazon Redshift

Another aspect of zero-ETL on AWS, for real-time and streaming data, is realized through Amazon Redshift Streaming Ingestion. It provides low-latency, high-speed ingestion of streaming data from Kinesis Data Streams and Amazon MSK. It lowers the effort required to have data ready for analytics workloads, lowers the cost of running such workloads on the cloud, and decreases the operational burden of maintaining the solution.

In the context of healthcare, understanding an individual’s exercise and movement patterns can help with overall health assessment and better treatment planning. In this section, you send simulated data from wearable devices to Kinesis Data Streams and integrate it with the rest of the data you already have access to from your Redshift Serverless data warehouse.

For step-by-step instructions, refer to Real-time analytics with Amazon Redshift streaming ingestion. Note the following steps when you set up streaming ingestion for Amazon Redshift:

  1. Select wearables_stream and use the following template when sending data to Amazon Kinesis Data Streams via Kinesis Data Generator, to simulate data generated by wearable devices. Replace [PATIENT_ID_1] and [PATIENT_ID_2] with the patient IDs you earlier when inserting new records into your Aurora MySQL table:
    {
       "patient_id": "{{random.arrayElement(["[PATIENT_ID_1]"," [PATIENT_ID_2]"])}}",
       "steps_increment": "{{random.arrayElement(
          [0,1]
       )}}",
       "heart_rate": {{random.number( 
          {
             "min":45,
             "max":120}
       )}}
    }

  2. Create an external schema called from_kds by running the following query and replacing [IAM_ROLE_ARN] with the ARN of the role created by the CloudFormation stack (Patient360BlogRole):
    CREATE EXTERNAL SCHEMA from_kds
    FROM KINESIS
    IAM_ROLE '[IAM_ROLE_ARN]';

  3. Use the following SQL when creating a materialized view to consume data from the stream:
    CREATE MATERIALIZED VIEW patient_wearable_data AUTO REFRESH YES AS 
    SELECT approximate_arrival_timestamp, 
          JSON_PARSE(kinesis_data) as Data FROM from_kds."wearables_stream" 
    WHERE CAN_JSON_PARSE(kinesis_data);

  4. To validate that streaming ingestion works as expected, refresh the materialized view to get the data you already sent to the data stream and query the table to make sure data has landed in Amazon Redshift:
    REFRESH MATERIALIZED VIEW patient_wearable_data;
    
    SELECT *
    FROM patient_wearable_data
    ORDER BY approximate_arrival_timestamp DESC;

Query and analyze patient wearable data

The results in the data column of the preceding query are in JSON format. Amazon Redshift makes it straightforward to work with semi-structured data in JSON format. It uses PartiQL language to offer SQL-compatible access to relational, semi-structured, and nested data. Use the following query to flatten data:

SELECT data."patient_id"::varchar AS patient_id,       
      data."steps_increment"::integer as steps_increment,       
      data."heart_rate"::integer as heart_rate, 
      approximate_arrival_timestamp 
FROM patient_wearable_data 
ORDER BY approximate_arrival_timestamp DESC;

The result looks like the following screenshot.

Now that you know how to flatten JSON data, you can analyze it further. Use the following query to get the number of minutes a patient has been physically active per day, based on their heart rate (greater than 80):

WITH patient_wearble_flattened AS
(
   SELECT data."patient_id"::varchar AS patient_id,
      data."steps_increment"::integer as steps_increment,
      data."heart_rate"::integer as heart_rate,
      approximate_arrival_timestamp,
      DATE(approximate_arrival_timestamp) AS date_received,
      extract(hour from approximate_arrival_timestamp) AS    hour_received,
      extract(minute from approximate_arrival_timestamp) AS minute_received
   FROM patient_wearable_data
), patient_active_minutes AS
(
   SELECT patient_id,
      date_received,
      hour_received,
      minute_received,
      avg(heart_rate) AS heart_rate
   FROM patient_wearble_flattened
   GROUP BY patient_id,
      date_received,
      hour_received,
      minute_received
   HAVING avg(heart_rate) > 80
)
SELECT patient_id,
      date_received,
      COUNT(heart_rate) AS active_minutes_count
FROM patient_active_minutes
GROUP BY patient_id,
      date_received
ORDER BY patient_id,
      date_received;

Create a complete patient 360

Now that you are able to query all patient data with Redshift Serverless, you can combine the three datasets you used in this post and form a comprehensive patient 360 view with the following query:

WITH patient_appointment_info AS
(
      SELECT "patient_id",
         "gender",
         "date_of_birth",
         "appointment_datetime",
         "phone_number"
      FROM ztl_demo.front_desk_app_db.patient_appointment
),
patient_wearble_flattened AS
(
      SELECT data."patient_id"::varchar AS patient_id,
         data."steps_increment"::integer as steps_increment,
         data."heart_rate"::integer as heart_rate,
         approximate_arrival_timestamp,
         DATE(approximate_arrival_timestamp) AS date_received,
         extract(hour from approximate_arrival_timestamp) AS hour_received,
         extract(minute from approximate_arrival_timestamp) AS minute_received
      FROM patient_wearable_data
), patient_active_minutes AS
(
      SELECT patient_id,
         date_received,
         hour_received,
         minute_received,
         avg(heart_rate) AS heart_rate
      FROM patient_wearble_flattened
      GROUP BY patient_id,
         date_received,
         hour_received,
         minute_received
         HAVING avg(heart_rate) > 80
), patient_active_minutes_count AS
(
      SELECT patient_id,
         date_received,
         COUNT(heart_rate) AS active_minutes_count
      FROM patient_active_minutes
      GROUP BY patient_id,
         date_received
)
SELECT pai.patient_id,
      pai.gender,
      pai.prefix,
      pai.given_name,
      pai.family_name,
      pai.allery_category,
      pai.allergy_code,
      pai.allergy_description,
      ppi.date_of_birth,
      ppi.appointment_datetime,
      ppi.phone_number,
      pamc.date_received,
      pamc.active_minutes_count
FROM patient_allergy_info pai
      LEFT JOIN patient_active_minutes_count pamc
            ON pai.patient_id = pamc.patient_id
      LEFT JOIN patient_appointment_info ppi
            ON pai.patient_id = ppi.patient_id
GROUP BY pai.patient_id,
      pai.gender,
      pai.prefix,
      pai.given_name,
      pai.family_name,
      pai.allery_category,
      pai.allergy_code,
      pai.allergy_description,
      ppi.date_of_birth,
      ppi.appointment_datetime,
      ppi.phone_number,
      pamc.date_received,
      pamc.active_minutes_count
ORDER BY pai.patient_id,
      pai.gender,
      pai.prefix,
      pai.given_name,
      pai.family_name,
      pai.allery_category,
      pai.allergy_code,
      pai.allergy_description,
      ppi.date_of_birth DESC,
      ppi.appointment_datetime DESC,
      ppi.phone_number DESC,
      pamc.date_received,
      pamc.active_minutes_count

You can use the solution and queries used here to expand the datasets used in your analysis. For example, you can include other tables from AWS HealthLake as needed.

Clean up

To clean up resources you created, complete the following steps:

  1. Delete the zero-ETL integration between Amazon RDS and Amazon Redshift.
  2. Delete the CloudFormation stack.
  3. Delete AWS HealthLake data store

Conclusion

Forming a comprehensive 360 view of patients by integrating data from various different sources offers numerous benefits for organizations operating in the healthcare industry. It enables healthcare providers to gain a holistic understanding of a patient’s medical journey, enhances clinical decision-making, and allows for more accurate diagnosis and tailored treatment plans. With zero-ETL features for data integration on AWS, it is effortless to build a view of patients securely, cost-effectively, and with minimal effort.

You can then use visualization tools such as Amazon QuickSight to build dashboards or use Amazon Redshift ML to enable data analysts and database developers to train machine learning (ML) models with the data integrated through Amazon Redshift zero-ETL. The result is a set of ML models that are trained with a broader view into patients, their medical history, and their lifestyle, and therefore enable you make more accurate predictions about their upcoming health needs.


About the Authors

Saeed Barghi is a Sr. Analytics Specialist Solutions Architect specializing in architecting enterprise data platforms. He has extensive experience in the fields of data warehousing, data engineering, data lakes, and AI/ML. Based in Melbourne, Australia, Saeed works with public sector customers in Australia and New Zealand.

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 17 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Use Amazon Verified Permissions for fine-grained authorization at scale

Post Syndicated from Abhishek Panday original https://aws.amazon.com/blogs/security/use-amazon-verified-permissions-for-fine-grained-authorization-at-scale/

Implementing user authentication and authorization for custom applications requires significant effort. For authentication, customers often use an external identity provider (IdP) such as Amazon Cognito. Yet, authorization logic is typically implemented in code. This code can be prone to errors, especially as permissions models become complex, and presents significant challenges when auditing permissions and deciding who has access to what. As a result, within Common Weakness Enumeration’s (CWE’s) list of the Top 25 Most Dangerous Software Weaknesses for 2023, four are related to incorrect authorization.

At re:Inforce 2023, we launched Amazon Verified Permissions, a fine-grained permissions management service for the applications you build. Verified Permissions centralizes permissions in a policy store and lets developers use those permissions to authorize user actions within their applications. Permissions are expressed as Cedar policies. You can learn more about the benefits of moving your permissions centrally and expressing them as policies in Policy-based access control in application development with Amazon Verified Permissions.

In this post, we explore how you can provide a faster and richer user experience while still authorizing all requests in the application. You will learn two techniques—bulk authorization and response caching—to improve the efficiency of your applications. We describe how you can apply these techniques when listing authorized resources and actions and loading multiple components on webpages.

Use cases

You can use Verified Permissions to enforce permissions that determine what the user is able to see at the level of the user interface (UI), and what the user is permitted to do at the level of the API.

  1. UI permissions enable developers to control what a user is allowed see in the application. Developers enforce permissions in the UI to control the list of resources a user can see and the actions they can take. For example, a UI-level permission in a banking application might determine whether a transfer funds button is enabled for a given account.
  2. API permissions enable developers to control what a user is allowed to do in an application. Developers control access to individual API calls made by an application on behalf of the user. For example, an API-level permission in a banking application might determine whether a user is permitted to initiate a funds transfer from an account.

Cedar provides consistent and readable policies that can be used at both the level of the UI and the API. For example, a single policy can be checked at the level of the UI to determine whether to show the transfer funds button and checked at the level of the API to determine authority to initiate the funds transfer.

Challenges

Verified Permissions can be used for implementing fine-grained API permissions. Customer applications can use Verified Permissions to authorize API requests, based on centrally managed Cedar policies, with low latency. Applications authorize such requests by calling the IsAuthorized API of the service, and the response contains whether the request is allowed or denied. Customers are happy with the latency of individual authorization requests, but have asked us to help them improve performance for use cases that require multiple authorization requests. They typically mention two use cases:

  • Compound authorizationCompound authorization is needed when one high-level API action involves many low-level actions, each of which has its own permissions. This requires the application to make multiple requests to Verified Permissions to authorize the user action. For example, in a banking application, loading a credit card statement requires three API calls: GetCreditCardDetails, GetCurrentStatement, and GetCreditLimit. This requires three calls to Verified Permissions, one for each API call.
  • UI permissions: Developers implement UI permissions by calling the same authorization API for every possible resource a principal can access. Each request involves an API call, and the UI can only be presented after all of them have completed. Alternatively, for a resource-centric view, the application can make the call for multiple principals to determine which ones have access.

Solution

In this post, we show you two techniques to optimize the application’s latency based on API permissions and UI permissions.

  1. Batch authorization allows you to make up to 30 authorization decisions in a single API call. This feature was released in November 2023. See the what’s new post and API specifications to learn more.
  2. Response caching enables you to cache authorization responses in a policy enforcement point such as Amazon API Gateway, AWS AppSync, or AWS Lambda. You can cache responses using native enforcement point caches (for example, API Gateway caching) or managed caching services such as Amazon ElastiCache.

Solving for enforcing fine grained permissions while delivering a great user experience

You can use UI permissions to authorize what resources and actions a user can view in an application. We see developers implementing these controls by first generating a small set of resources based on database filters and then further reducing the set down to authorized resources by checking permissions on each resource using Verified Permissions. For example, when a user of a business banking system tries to view balances on company bank accounts, the application first filters the list to the set of bank accounts for that company. The application then filters the list further to only include the accounts that the user is authorized to view by making an API request to Verified Permissions for each account in the list. With batch authorization, the application can make a single API call to Verified Permissions to filter the list down to the authorized accounts.

Similarly, you can use UI permissions to determine what components of a page or actions should be visible to users of the application. For example, in a banking application, the application wants to control the sub-products (such as credit card, bank account, or stock trading) visible to a user or only display authorized actions (such as transfer or change address) when displaying an account overview page. Customers want to use Verified Permissions to determine which components of the page to display, but that can adversely impact the user experience (UX) if they make multiple API calls to build the page. With batch authorization, you can make one call to Verified Permissions to determine permissions for all components of the page. This enables you to provide a richer experience in your applications by displaying only the components that the user is allowed to access while maintaining low page load latency.

Solving for enforcing permissions for every API call without impacting performance

Compound authorization is where a single user action results in a sequence of multiple authorization calls. You can use bulk authorization combined with response caching to improve efficiency. The application makes a single bulk authorization request to Verified Permissions to determine whether each of the component API calls are permitted and the response is cached. This cache is then referenced for each component’s API call in the sequence.

Sample application – Use cases, personas, and permissions

We’re using an online order management application for a toy store to demonstrate how you can apply batch authorization and response caching to improve UX and application performance.

One function of the application is to enable employees in a store to process online orders.

Personas

The application is used by two types of users:

  • Pack associates are responsible for picking, packing, and shipping orders. They’re assigned to a specific department.
  • Store managers are responsible for overseeing the operations of a store.

Use cases

The application supports these use cases:

  1. Listing orders: Users can list orders. A user should only see the orders for which they have view permissions.
    • Pack associates can list all orders of their department.
    • Store managers can list all orders of their store.

    Figure 1 shows orders for Julian, who is a pack associate in the Soft Toy department

    Figure 1: Orders for Julian in the Soft Toy department

    Figure 1: Orders for Julian in the Soft Toy department

  2. Order actions: Users can take some actions on an order. The application enables the relevant UI elements based on the user’s permissions.
    • Pack associates can select Get Box Size and Mark as Shipped, as shown in Figure 2.
    • Store managers can select Get Box Size, Mark as Shipped, Cancel Order, and Route to different warehouse.
    Figure 2: Actions available to Julian as a pack associate

    Figure 2: Actions available to Julian as a pack associate

  3. Viewing an order: Users can view the details of a specific order. When a user views an order, the application loads the details, label, and receipt. Figure 3 shows the available actions for Julian who is a pack associate.
    Figure 3: Order Details for Julian, showing permitted actions

    Figure 3: Order Details for Julian, showing permitted actions

Policy design

The application uses Verified Permissions as a centralized policy store. These policies are expressed in Cedar. The application uses the Role management using policy templates approach for implementing role-based access controls. We encourage you to read best practices for using role-based access control in Cedar to understand if the approach fits your use case.

In the sample application, the policy template for the store owner role looks like the following:

permit (
        principal == ?principal,
        action in [
                avp::sample::toy::store::Action::"OrderActions",
                avp::sample::toy::store::Action::"AddPackAssociate",
                avp::sample::toy::store::Action::"AddStoreManager",
                avp::sample::toy::store::Action::"ListPackAssociates",
                avp::sample::toy::store::Action::"ListStoreManagers"
        ],
        resource in ?resource
);

When a user is assigned a role, the application creates a policy from the corresponding template by passing the user and store. For example, the policy created for the store owner is as follows:

permit (
    principal ==  avp::sample::toy::store::User::"test_user_pool|sub_store_manager_user", 
    action in  [
                avp::sample::toy::store::Action::"OrderActions",
                avp::sample::toy::store::Action::"AddPackAssociate",
                avp::sample::toy::store::Action::"AddStoreManager",
                avp::sample::toy::store::Action::"ListPackAssociates",
                avp::sample::toy::store::Action::"ListStoreManagers"
    ],
    resource in avp::sample::toy::store::Store::"toy store 1"
);

To learn more about the policy design of this application, see the readme file of the application.

Use cases – Design and implementation

In this section, we discuss high level design, challenges with the barebones integration, and how you can use the preceding techniques to reduce latency and costs.

Listing orders

Figure 4: Architecture for listing orders

Figure 4: Architecture for listing orders

As shown in Figure 4, the process to list orders is:

  1. The user accesses the application hosted in AWS Amplify.
  2. The user then authenticates through Amazon Cognito and obtains an identity token.
  3. The application uses Amplify to load the order page. The console calls the API ListOrders to load the order.
  4. The API is hosted in API Gateway and protected by a Lambda authorizer function.
  5. The Lambda function collects entity information from an in-memory data store to formulate the isAuthorized request.
  6. Then the Lambda function invokes Verified Permissions to authorize the request. The function checks against Verified Permissions for each order in the data store for the ListOrder call. If Verified Permissions returns deny, the order is not provided to the user. If Verified Permissions returns allow, the request is moved forward.

Challenge

Figure 5 shows that the application called IsAuthorized multiple times, sequentially. Multiple sequential calls cause the page to be slow to load and increase infrastructure costs.

Figure 5: Graph showing repeated calls to IsAuthorized

Figure 5: Graph showing repeated calls to IsAuthorized

Reduce latency using batch authorization

If you transition to using batch authorization, the application can receive 30 authorization decisions with a single API call to Verified Permissions. As you can see in Figure 6, the time to authorize has reduced from close to 800 ms to 79 ms, delivering a better overall user experience.

Figure 6: Reduced latency by using batch authorization

Figure 6: Reduced latency by using batch authorization

Order actions

Figure 7: Order actions architecture

Figure 7: Order actions architecture

As shown in Figure 7, the process to get authorized actions for an order is:

  1. The user goes to the application landing page on Amplify.
  2. The application calls the Order actions API at API Gateway
  3. The application sends a request to initiate order actions to display only authorized actions to the user.
  4. The Lambda function collects entity information from an in-memory data store to formulate the isAuthorized request.
  5. The Lambda function then checks with Verified Permissions for each order action. If Verified Permissions returns deny, the action is dropped. If Verified Permissions returns allow, the request is moved forward and the action is added to a list of order actions to be sent in a follow-up request to Verified Permissions to provide the actions in the user’s UI.

Challenge

As you saw with listing orders, Figure 8 shows how the application is still calling IsAuthorized multiple times, sequentially. This means the page remains slow to load and has increased impacts on infrastructure costs.

Figure 8: Graph showing repeated calls to IsAuthorized

Figure 8: Graph showing repeated calls to IsAuthorized

Reduce latency using batch authorization

If you add another layer by transitioning to using batch authorization once again, the application can receive all decisions with a single API call to Verified Permissions. As you can see from Figure 9, the time to authorize has reduced from close to 500 ms to 150 ms, delivering an improved user experience.

Figure 9: Graph showing results of layering batch authorization

Figure 9: Graph showing results of layering batch authorization

Viewing an order

Figure 10: Order viewing architecture

Figure 10: Order viewing architecture

The process to view an order, shown in Figure 10, is:

  1. The user accesses the application hosted in Amplify.
  2. The user authenticates through Amazon Cognito and obtains an identity token.
  3. The application calls three APIs hosted at API Gateway.
  4. The API’s: Get order details, Get label, and Get receipt are targeted sequentially to load the UI for the user in the application.
  5. A Lambda authorizer protects each of the above-mentioned APIs and is launched for each invoke.
  6. The Lambda function collects entity information from an in-memory data store to formulate the isAuthorized request.
  7. For each API, the following steps are repeated. The Lambda authorizer is invoked three times during page load.
    1. The Lambda function invokes Verified Permissions to authorize the request. If Verified Permissions returns deny, the request is rejected and an HTTP unauthorized response (403) is sent back. If Verified Permissions returns allow, the request is moved forward. 
    2. If the request is allowed, API Gateway calls the Lambda Order Management function to process the request. This is the primary Lambda function supporting the application and typically contains the core business logic of the application.

Challenge

In using the standard authorization pattern for this use case, the application calls Verified Permissions three times. This is because the user action to view an order requires compound authorization because each API call made by the console is authorized. While this enforces least privilege, it impacts the page load and reload latency of the application.

Reduce latency using batch authorization and decision caching

You can use batch authorization and decision caching to reduce latency. In the sample application, the cache is maintained by API Gateway. As shown in Figure 11, applying these techniques to the console application results in only one call to Verified Permissions, reducing latency.

Figure 11: Batch authorization with decision caching architecture

Figure 11: Batch authorization with decision caching architecture

The decision caching processshown in Figure 11, is:

  1. The user accesses the application hosted in Amplify.
  2. The user then authenticates through Amazon Cognito and obtains an identity token.
  3. The application then calls three APIs hosted at API Gateway
  4. When the Lambda function for the Get order details API is invoked, it uses the Lambda Authorizer to call batch authorization to get authorization decisions for the requested action, Get order details, and related actions, Get label and Get receipt.
  5. A Lambda authorizer protects each of the above-mentioned APIs but because of batch authorization, is invoked only once.
  6. The Lambda function collects entity information from an in-memory data store to formulate the isAuthorized request.
  7. The Lambda function invokes Verified Permissions to authorize the request. If Verified Permissions returns deny, the request is rejected and an HTTP unauthorized response (403) is sent back. If Verified Permissions returns allow, the request is moved forward.
    1. API Gateway caches the authorization decision for all actions (the requested action and related actions).
    2. If the request is allowed by the Lambda authorizer function, API Gateway calls the order management Lambda function to process the request. This is the primary Lambda function supporting the application and typically contains the core business logic of the application.
    3. When subsequent APIs are called, the API Gateway uses the cached authorization decisions and doesn’t use the Lambda authorization function.

Caching considerations

You’ve seen how you can use caching to implement fine-grained authorization at scale in your applications. This technique works well when your application has high cache hit rates, where authorization results are frequently loaded from the cache. Applications where the users initiate the same action multiple times or have a predictable sequence of actions will observe high cache hit rates. Another consideration is that employing caching can delay the time between policy updates and policy enforcement. We don’t recommend using caching for authorization decisions if your application requires policies to take effect quickly or your policies are time dependent (for example, a policy that gives access between 10:00 AM and 2:00 PM).

Conclusion

In this post, we showed you how to implement fine grained permissions in application at scale using Verified Permissions. We covered how you can use batch authorization and decision caching to improve performance and ensure Verified Permissions remains a cost-effective solution for large-scale applications. We applied these techniques to a demo application, avp-toy-store-sample, that is available to you for hands-on testing. For more information about Verified Permissions, see the Amazon Verified Permissions product details and Resources.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Abhishek Panday

Abhishek Panday

Abhishek is a product manager in the Amazon Verified Permissions team. He’s been working with AWS for more than two years and has been at Amazon for over five years. Abhishek enjoys working with customers to understand their challenges and building products to solve those challenges. Abhishek currently lives in Seattle and enjoys playing soccer, hiking, and cooking Indian cuisines.

Jeremy Wave

Jeremy Ware

Jeremy is a Security Specialist Solutions Architect focused on Identity and Access Management. Jeremy and his team enable AWS customers to implement sophisticated, scalable, and secure architectures to solve business challenges. Jeremy has spent many years working to improve the security maturity at numerous global enterprises. In his free time, Jeremy loves to enjoy the outdoors with his family.