
Implement a custom subscription workflow for unmanaged Amazon S3 assets published with Amazon DataZone

Post Syndicated from Somdeb Bhattacharjee original https://aws.amazon.com/blogs/big-data/implement-a-custom-subscription-workflow-for-unmanaged-amazon-s3-assets-published-with-amazon-datazone/

Organizational data is often fragmented across multiple lines of business, leading to inconsistent and sometimes duplicate datasets. This fragmentation can delay decision-making and erode trust in available data. Amazon DataZone, a data management service, helps you catalog, discover, share, and govern data stored across AWS, on-premises systems, and third-party sources. Although Amazon DataZone automates subscription fulfillment for structured data assets—such as data stored in Amazon Simple Storage Service (Amazon S3), cataloged with the AWS Glue Data Catalog, or stored in Amazon Redshift—many organizations also rely heavily on unstructured data. For these customers, extending the streamlined data discovery and subscription workflows in Amazon DataZone to unstructured data, such as files stored in Amazon S3, is critical.

For example, Genentech, a leading biotechnology company, has vast sets of unstructured gene sequencing data organized across multiple S3 buckets and prefixes. They need to enable direct access to these data assets for downstream applications efficiently, while maintaining governance and access controls.

In this post, we demonstrate how to implement a custom subscription workflow using Amazon DataZone, Amazon EventBridge, and AWS Lambda to automate the fulfillment process for unmanaged data assets, such as unstructured data stored in Amazon S3. This solution enhances governance and simplifies access to unstructured data assets across the organization.

Solution overview

For our use case, the data producer has unstructured data stored in S3 buckets, organized with S3 prefixes. We want to publish this data to Amazon DataZone as discoverable S3 data. On the consumer side, users need to search for these assets, request subscriptions, and access the data within an Amazon SageMaker notebook, using their own custom AWS Identity and Access Management (IAM) roles.

The proposed solution involves creating a custom subscription workflow that uses the event-driven architecture of Amazon DataZone. Amazon DataZone keeps you informed of key activities (events) within your data portal, such as subscription requests, updates, comments, and system events. These events are delivered through the EventBridge default event bus.

An EventBridge rule captures subscription events and invokes a custom Lambda function. This Lambda function contains the logic to manage access policies for the subscribed unmanaged asset, automating the subscription process for unstructured S3 assets. This approach streamlines data access while ensuring proper governance.

To learn more about working with events using EventBridge, refer to Events via Amazon EventBridge default bus.

The solution architecture is shown in the following diagram.

Custom subscription workflow architecture diagram

To implement the solution, we complete the following steps:

  1. As a data producer, publish an unstructured S3-based data asset as S3ObjectCollectionAssetType to Amazon DataZone.
  2. For the consumer, create a custom AWS service environment in the consumer Amazon DataZone project and add a subscription target for the IAM role attached to a SageMaker notebook instance. Now, as a consumer, request access to the unstructured asset published in the previous step.
  3. When the request is approved, capture the subscription created event using an EventBridge rule.
  4. Invoke a Lambda function as the target for the EventBridge rule and pass the event payload to it.
  5. The Lambda function does two things:
    1. Fetches the asset details, including the Amazon Resource Name (ARN) of the S3 published asset and the IAM role ARN from the subscription target.
    2. Uses the information to update the S3 bucket policy granting List/Get access to the IAM role.

Prerequisites

To follow along with the post, you should have an AWS account. If you don’t have one, you can sign up for one.

For this post, we assume you know how to create an Amazon DataZone domain and Amazon DataZone projects. For more information, see Create domains and Working with projects and environments in Amazon DataZone.

Also, for simplicity, we use the same IAM role for the Amazon DataZone admin (creating domains) as well as the producer and consumer personas.

Publish unstructured S3 data to Amazon DataZone

We have uploaded some sample unstructured data into an S3 bucket. This is the data that will be published to Amazon DataZone. You can use any unstructured data, such as an image or text file.

On the Properties tab of the S3 folder, note the ARN of the S3 bucket prefix.

Complete the following steps to publish the data:

  1. Create an Amazon DataZone domain in the account and navigate to the domain portal using the link for Data portal URL.

DataZone domain creation

  1. Create a new Amazon DataZone project (for this post, we name it unstructured-data-producer-project) for publishing the unstructured S3 data asset.
  2. On the Data tab of the project, choose Create data asset.

Data asset creation

  1. Enter a name for the asset.
  2. For Asset type, choose S3 object collection.
  3. For S3 location ARN, enter the ARN of the S3 prefix.

After you create the asset, you can add glossaries or metadata forms, but it’s not necessary for this post. You can publish the data asset so it’s now discoverable within the Amazon DataZone portal.

Set up the SageMaker notebook and SageMaker instance IAM role

Create an IAM role which will be attached to the SageMaker notebook instance. For the trust policy, allow SageMaker to assume this role and leave the Permissions tab blank. We refer to this role as the instance-role throughout the post.

SageMaker instance role

Next, create a SageMaker notebook instance from the SageMaker console. Attach the instance-role to the notebook instance.

SageMaker instance

Set up the consumer Amazon DataZone project, custom AWS service environment, and subscription target

Complete the following steps:

  1. Log in to the Amazon DataZone portal and create a consumer project (for this post, we call it custom-blueprint-consumer-project), which will be used by the consumer persona to subscribe to the unstructured data asset.

Custom blueprint project name

We use the recently launched custom blueprints for AWS services for creating the environment in this consumer project. The custom blueprint allows you to bring your own environment IAM role to integrate your existing AWS resources with Amazon DataZone. For this post, we create a custom environment to directly integrate SageMaker notebook access from the Amazon DataZone portal.

  1. Before you create the custom environment, create the environment IAM role that will be used in the custom blueprint. The role should have a trust policy as shown in the following screenshot. For the permissions, attach the AWS managed policy AmazonSageMakerFullAccess. We refer to this role as the environment-role throughout the post.

Custom Environment role

  1. To create the custom environment, first enable the Custom AWS Service blueprint on the Amazon DataZone console.

Enable custom blueprint

  1. Open the blueprint to create a new environment as shown in the following screenshot.
  2. For Owning project, use the consumer project that you created earlier and for Permissions, use the environment-role.

Custom environment project and role

  1. After you create the environment, open it to create a customized URL for the SageMaker notebook access.

SageMaker custom URL

  1. Create a new custom AWS link and enter the URL from the SageMaker notebook.

You can find it by navigating to the SageMaker console and choosing Notebooks in the navigation pane.

  1. Choose Customize to add the custom link.

Add the custom link

  1. Next, create a subscription target in the custom environment to pass the instance role that needs access to the unstructured data.

A subscription target is an Amazon DataZone engineering concept that allows Amazon DataZone to fulfill subscription requests for managed assets by granting access based on the information defined in the target like domain-id, environment-id, or authorized-principals.

Currently, creation of subscription targets is only allowed using the AWS Command Line Interface (AWS CLI). You can use the command create-subscription-target to create the subscription target.

The following is an example JSON payload for the subscription target creation. Create it as a JSON file on your workstation (for this post, we call it blog-sub-target.json). Replace the domain ID and the environment ID with the corresponding values for your domain and environment.

{
"domainIdentifier": "<<your-domain-id>>",
"environmentIdentifier": "<<your-environment-id>>",
"name": "custom-s3-target-consumerenv",
"type": "GlueSubscriptionTargetType",
"manageAccessRole": "<<provide the environment-role here>>",
"applicableAssetTypes": ["S3ObjectCollectionAssetType"],
"provider": "Custom Provider",
"authorizedPrincipals": [ "<<provide the instance-role here>>"],
"subscriptionTargetConfig": [{
"formName": "GlueSubscriptionTargetConfigForm",
"content": "{\"databaseName\":\"customdb1\"}"
}]
}
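If you would rather generate the payload than edit it by hand, the following is a minimal Python sketch. The function names and the placeholder IDs and role ARNs are assumptions for illustration; the field names and fixed values mirror the JSON above.

```python
import json


def build_subscription_target(domain_id, environment_id,
                              environment_role_arn, instance_role_arn):
    """Return the create-subscription-target payload as a dict."""
    return {
        "domainIdentifier": domain_id,
        "environmentIdentifier": environment_id,
        "name": "custom-s3-target-consumerenv",
        "type": "GlueSubscriptionTargetType",
        "manageAccessRole": environment_role_arn,
        "applicableAssetTypes": ["S3ObjectCollectionAssetType"],
        "provider": "Custom Provider",
        "authorizedPrincipals": [instance_role_arn],
        "subscriptionTargetConfig": [{
            "formName": "GlueSubscriptionTargetConfigForm",
            # The form content is itself a JSON string, so encode it separately
            "content": json.dumps({"databaseName": "customdb1"},
                                  separators=(",", ":")),
        }],
    }


def write_payload(path, payload):
    """Write the payload to a file that the AWS CLI can read."""
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)


# Placeholder values (hypothetical); replace with your own IDs and role ARNs
payload = build_subscription_target(
    "dzd_exampledomain",
    "exampleenvid",
    "arn:aws:iam::111122223333:role/environment-role",
    "arn:aws:iam::111122223333:role/instance-role",
)
# write_payload("blog-sub-target.json", payload)
```

Encoding the inner form content with json.dumps avoids escaping the quotation marks by hand.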

You can get the domain ID from the user name button in the upper right of the Amazon DataZone data portal; it’s in the format dzd_<<some-random-characters>>.

For the environment ID, you can find it on the Settings tab of the environment within your consumer project.

  1. Open an AWS CloudShell environment and upload the JSON payload file using the Actions option in the CloudShell terminal.
  2. You can now create a new subscription target using the following AWS CLI command:

aws datazone create-subscription-target --cli-input-json file://blog-sub-target.json

Create subscription target

  1. To verify the subscription target was created successfully, run the list-subscription-targets command from the AWS CloudShell environment:

aws datazone list-subscription-targets --domain-identifier <<domain-id>> --environment-identifier <<environment-id>>

Create a function to respond to subscription events

Now that you have the consumer environment and subscription target set up, the next step is to implement a custom workflow for handling subscription requests.

The simplest mechanism to handle subscription events is a Lambda function. The exact implementation may vary based on environment; for this post, we walk through the steps to create a simple function to handle subscription creation and cancellation.

  1. On the Lambda console, choose Functions in the navigation pane.
  2. Choose Create function.
  3. Select Author from scratch.
  4. For Function name, enter a name (for example, create-s3policy-for-subscription-target).
  5. For Runtime, choose Python 3.12.
  6. Choose Create function.

Author Lambda function

This should open the Code tab for the function and allow editing of the Python code for the function. Let’s look at some of the key components of a function to handle the subscription for unmanaged S3 assets.

Handle only relevant events

When the function gets invoked, we check to make sure it’s one of the events that’s relevant for managing access. Otherwise, the function can simply return a message without taking further action.

def lambda_handler(event, context):
    # Get the basic info about the event
    event_detail = event['detail']

    # Make sure it's one of the events we're interested in
    event_source = event['source']
    event_type = event['detail-type']

    if event_source != 'aws.datazone':
        return '{"Response" : "Not a DataZone event"}'
    elif event_type not in ['Subscription Created', 'Subscription Cancelled', 
                               'Subscription Revoked']:
        return '{"Response" : "Not a subscription created, cancelled, or revoked event"}'

These subscription events should include both the domain ID and a request ID (among other attributes). You can use these to look up the details of the subscription request in Amazon DataZone:

sub_request = dz.get_subscription_request_details(
    domainIdentifier=domain_id,
    identifier=sub_request_id
)
asset_listing = sub_request['subscribedListings'][0]['item']['assetListing']
form_data = json.loads(asset_listing['forms'])
asset_id = asset_listing['entityId']
asset_version = asset_listing['entityRevision']
asset_type = asset_listing['entityType']
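The lookup above assumes that domain_id and sub_request_id have already been read from the event. The following is a minimal sketch of that step, using the same field paths as the completed function later in this post. The helper name and the sample event values are made up for illustration, and a real event carries more attributes.

```python
def extract_subscription_ids(event):
    """Pull the identifiers the handler needs out of a DataZone event."""
    detail = event["detail"]
    return {
        "domain_id": detail["metadata"]["domain"],
        "project_id": detail["metadata"]["owningProjectId"],
        "sub_request_id": detail["data"]["subscriptionRequestId"],
    }


# Hypothetical, trimmed-down event for local testing
sample_event = {
    "source": "aws.datazone",
    "detail-type": "Subscription Created",
    "detail": {
        "metadata": {"domain": "dzd_example", "owningProjectId": "proj123"},
        "data": {"subscriptionRequestId": "req456"},
    },
}

ids = extract_subscription_ids(sample_event)
```

Keeping the extraction in one small function makes it possible to test the handler logic against saved sample events without calling Amazon DataZone.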

Part of the subscription request should include the ARN for the S3 bucket in question, so you can retrieve that:

    # We only want to take action if this is an S3 asset
    if asset_type == 'S3ObjectCollectionAssetType':
        # Get the bucket ARN from the form info for the asset
        bucket_arn = form_data['S3ObjectCollectionForm']['bucketArn']
        
        #Get the principal from the subscription target
        principal = get_principal(domain_id,project_id)

        try:
            # Get the bucket name from the ARN                    
            bucket_name_with_prefix = bucket_arn.split(':')[5]
            bucket_name = bucket_name_with_prefix.split('/')[0]
           
        except IndexError:
            response = '{"Response" : "Could not find bucket name in ARN"}'
            return response
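The ARN handling above can also be pulled into a small helper that returns both the bucket name and the prefix. This is a sketch with a hypothetical function name, and it checks the ARN more strictly than the snippet above does.

```python
def split_s3_arn(arn):
    """Split an S3 ARN into (bucket_name, prefix); prefix may be empty."""
    # An S3 ARN has the form arn:aws:s3:::bucket-name/optional/prefix
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[2] != "s3" or not parts[5]:
        raise ValueError(f"Not an S3 ARN: {arn!r}")
    bucket_name, _, prefix = parts[5].partition("/")
    if not bucket_name:
        raise ValueError(f"No bucket name in ARN: {arn!r}")
    return bucket_name, prefix
```

For example, an ARN that ends in my-bucket/genomics/run1/ yields the bucket my-bucket and the prefix genomics/run1/.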

You can also use the Amazon DataZone API calls to get the environment associated with the project making the subscription request for this S3 asset. After retrieving the environment ID, you can check which IAM principals have been authorized to access unmanaged S3 assets using the subscription target:

        list_sub_target = dz.list_subscription_targets(
            domainIdentifier=domain_id,
            environmentIdentifier=environment_id,
            maxResults=50,
            sortBy='CREATED_AT',
            sortOrder='DESCENDING'
            )
        
        print('asset type:', list_sub_target['items'][0]['applicableAssetTypes'])
        
        if list_sub_target['items'][0]['applicableAssetTypes'] == ['S3ObjectCollectionAssetType']:
            role_arn = list_sub_target['items'][0]['authorizedPrincipals']
            print('role arn',role_arn)
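The snippet above inspects only the first subscription target returned. If an environment has more than one target, a scan over all of them may be safer. The following sketch assumes its input has the same shape as the items list returned by list_subscription_targets; the helper name and the sample data are hypothetical.

```python
S3_ASSET_TYPE = "S3ObjectCollectionAssetType"


def find_authorized_principals(subscription_targets):
    """Return the principals of the first target that applies to S3 assets."""
    for target in subscription_targets:
        if S3_ASSET_TYPE in target.get("applicableAssetTypes", []):
            return list(target.get("authorizedPrincipals", []))
    # No target applies to unmanaged S3 assets
    return []


# Hypothetical sample shaped like list_sub_target['items']
sample_targets = [
    {"applicableAssetTypes": ["GlueTableAssetType"],
     "authorizedPrincipals": ["arn:aws:iam::111122223333:role/glue-role"]},
    {"applicableAssetTypes": ["S3ObjectCollectionAssetType"],
     "authorizedPrincipals": ["arn:aws:iam::111122223333:role/instance-role"]},
]

principals = find_authorized_principals(sample_targets)
```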

If this is a new subscription, add the relevant IAM principal to the S3 bucket policy by appending a statement that allows the desired S3 actions on this bucket for the new principal:

        if event_type == 'Subscription Created':
            if bucket_arn[-1] == '/':
                statement_block.append({
                    'Sid' : sid_string,
                    'Action': S3_ACTION_STRING,
                    'Resource': [
                        bucket_arn,
                        bucket_arn + '*'
                    ],
                    'Effect': 'Allow',
                    'Principal': {'AWS': principal}
                })

Conversely, if this is a subscription being revoked or cancelled, remove the previously added statement from the bucket policy to make sure the IAM principal no longer has access:

        elif event_type == 'Subscription Cancelled' or event_type == 'Subscription Revoked':
            # Remove the statement from the policy if it's there
            # Made sure to handle case where there's no Sid for a statement
            pruned_statement_block = []
            for statement in statement_block:
                if 'Sid' not in statement or statement['Sid'] != sid_string:
                    pruned_statement_block.append(statement)
            statement_block = pruned_statement_block

The completed function should be able to handle adding or removing principals like IAM roles or users to a bucket policy. Be sure to handle cases where there is no existing bucket policy or where a cancellation means removing the only statement in the policy, meaning the entire bucket policy is no longer needed.
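The two edge cases just mentioned can be exercised locally with pure functions that operate on the policy document only. This is a minimal sketch, not the function used in this post: the names grant and revoke are hypothetical, and the action list is passed in by the caller.

```python
import copy


def grant(policy, sid, principal, resource_arn, actions):
    """Return a copy of the policy with one Allow statement appended."""
    # Start from an empty policy when the bucket has none yet
    new_policy = copy.deepcopy(policy) if policy else {
        "Version": "2012-10-17", "Statement": []}
    base = resource_arn.rstrip("/")
    new_policy["Statement"].append({
        "Sid": sid,
        "Effect": "Allow",
        "Principal": {"AWS": principal},
        "Action": actions,
        # Cover the ARN itself and everything below it
        "Resource": [resource_arn, base + "/*"],
    })
    return new_policy


def revoke(policy, sid):
    """Return the policy without statement `sid`, or None if nothing is left."""
    remaining = [s for s in policy.get("Statement", []) if s.get("Sid") != sid]
    if not remaining:
        # Caller should delete the bucket policy instead of writing an empty one
        return None
    return {**policy, "Statement": remaining}
```

Returning None from revoke signals that the caller should delete the bucket policy rather than write one with an empty Statement list.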

The following is an example of a completed function:

import json
import boto3
import os


dz = boto3.client('datazone')
s3 = boto3.client('s3')

# The list of actions to be permitted on the bucket in the newly granted policy
S3_ACTION_STRING = 's3:*'

def build_policy_statements(event_type, statement_block, principal, sub_request_id, bucket_arn):
        # Generate a Sid that should be unique
        sid_string = ''.join(c for c in f'DZ{principal}{sub_request_id}' if c.isalnum())
        # Add a new policy statement that gives the principal access to whole bucket.
        # If it turns out something other than bucket ARN is allowed in asset, we can
        # get more granular than that
        # Sid that should be unique in case we need to handle unsubscribe
        print('statement block :',statement_block)
        if event_type == 'Subscription Created':
            if bucket_arn[-1] == '/':
                statement_block.append({
                    'Sid' : sid_string,
                    'Action': S3_ACTION_STRING,
                    'Resource': [
                        bucket_arn,
                        bucket_arn + '*'
                    ],
                    'Effect': 'Allow',
                    'Principal': {'AWS': principal}
                })
            else:
                statement_block.append({
                    'Sid' : sid_string,
                    'Action': S3_ACTION_STRING,
                    'Resource': [
                        bucket_arn,
                        bucket_arn + '/*'
                    ],
                    'Effect': 'Allow',
                    'Principal': {'AWS': principal}
                })
        elif event_type == 'Subscription Cancelled' or event_type == 'Subscription Revoked':
            # Remove the statement from the policy if it's there
            # Made sure to handle case where there's no Sid for a statement
            pruned_statement_block = []
            for statement in statement_block:
                if 'Sid' not in statement or statement['Sid'] != sid_string:
                    pruned_statement_block.append(statement)
            statement_block = pruned_statement_block
           

        return statement_block

def lambda_handler(event, context):
    """Lambda function reacting to DataZone subscribe events

    Parameters
    ----------
    event: dict, required
        Event Bridge Events Format

    context: object, required
        Lambda Context runtime methods and attributes

    Returns
    ------
        Simple response indicating success or failure reason
    """
    # Get the basic info about the event
    event_detail = event['detail']

    # Make sure it's one of the events we're interested in
    event_source = event['source']
    event_type = event['detail-type']

    if event_source != 'aws.datazone':
        return '{"Response" : "Not a DataZone event"}'
    elif event_type not in ['Subscription Created', 'Subscription Cancelled', 
                               'Subscription Revoked']:
        return '{"Response" : "Not a subscription created, cancelled, or revoked event"}'

    
    # get the domain_id and other information
    domain_id = event_detail['metadata']['domain']
    project_id = event_detail['metadata']['owningProjectId']
    sub_request_id = event_detail['data']['subscriptionRequestId']
    listing_id = event_detail['data']['subscribedListing']['id']
    listing_version = event_detail['data']['subscribedListing']['version']
    
    print('domain-id',domain_id)
    print('project-id:',project_id)
    
    sub_request = dz.get_subscription_request_details(
        domainIdentifier = domain_id,
        identifier= sub_request_id
    )
   
    # Retrieve info about the asset from the request
    asset_listing = sub_request['subscribedListings'][0]['item']['assetListing']
    form_data = json.loads(asset_listing['forms'])
    asset_id = asset_listing['entityId']
    asset_version = asset_listing['entityRevision']
    asset_type = asset_listing['entityType']

    # We only want to take action if this is a S3 asset
    if asset_type == 'S3ObjectCollectionAssetType':
        # Get the bucket ARN from the form info for the asset
        bucket_arn = form_data['S3ObjectCollectionForm']['bucketArn']
        
        #Get the principal from the subscription target
        principal = get_principal(domain_id,project_id)

        try:
            # Get the bucket name from the ARN                    
            bucket_name_with_prefix = bucket_arn.split(':')[5]
            bucket_name = bucket_name_with_prefix.split('/')[0]
           
        except IndexError:
            response = '{"Response" : "Could not find bucket name in ARN"}'
            return response

        # Get the current bucket policy, or else make a blank one if there currently
        # is no policy
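        try:
            bucket_policy = json.loads(s3.get_bucket_policy(Bucket=bucket_name)['Policy'])
        except s3.exceptions.ClientError as e:
            # Only a missing policy is safe to treat as empty; any other error
            # (for example AccessDenied) must not lead to overwriting the policy
            if e.response['Error']['Code'] != 'NoSuchBucketPolicy':
                response = '{"Response" : "Could not get bucket policy"}'
                return response
            bucket_policy = {'Version': '2012-10-17', 'Statement': []}
        except Exception:
            response = '{"Response" : "Could not get bucket policy"}'
            return response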
        
        # Gets new policy with the subscribing principal either added or removed based on
        # event type
        new_policy_statements = build_policy_statements(event_type, bucket_policy['Statement'], principal, 
                                               sub_request_id, bucket_arn)

            
        # Write back the new policy. This can fail if the new policy is too big
        # or if for some reason the function role doesn't have rights to do this
        # If we removed the only policy statement, then just delete the policy
        try: 
            if not new_policy_statements:
                s3.delete_bucket_policy(Bucket = bucket_name)
            else:
                bucket_policy['Statement'] = new_policy_statements
                policy_string = json.dumps(bucket_policy)
                print('policy string :',policy_string)
                s3.put_bucket_policy(
                    Bucket=bucket_name,
                    Policy = policy_string
                )
        except Exception as e: 
            response = f'{{"Response" : "Error updating bucket policy: {e.args}"}}'
            return response
        
        # If we got here everything went as planned
        response = f'{{"Response" : "Updated policy for {bucket_name}"}}'
    else:
        response = '{"Response" : "Not an S3 asset"}'


    return response

def get_principal(domain_id,project_id):
    # Call list environments to get the environment id
    listenv_request = dz.list_environments(
        domainIdentifier = domain_id,
        projectIdentifier= project_id
    )
    
    # In our example environment, there is only one of these
    environment_id = listenv_request['items'][0]['id']

    # Get the role we want to give access to from the subscription target info
    list_sub_target = dz.list_subscription_targets(
        domainIdentifier=domain_id,
        environmentIdentifier=environment_id,
        maxResults=50,
        sortBy='CREATED_AT',
        sortOrder='DESCENDING'
    )

    if list_sub_target['items'][0]['applicableAssetTypes'] == ['S3ObjectCollectionAssetType']:
        role_arn = list_sub_target['items'][0]['authorizedPrincipals']
    else:
        role_arn = []

    return role_arn

Because this Lambda function is intended to manage bucket policies, the role assigned to it will need a policy that allows the following actions on any buckets it is intended to manage:

  • s3:GetBucketPolicy
  • s3:PutBucketPolicy
  • s3:DeleteBucketPolicy
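As an illustration, the following sketch builds such a policy document for a list of bucket names. The function name and bucket name are hypothetical. The second statement is an assumption: it is derived from the Amazon DataZone API calls the function makes, and you should verify the action names against the service authorization reference for your setup.

```python
import json


def build_lambda_role_policy(bucket_names):
    """Return an IAM policy document for the Lambda execution role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ManageBucketPolicies",
                "Effect": "Allow",
                "Action": [
                    "s3:GetBucketPolicy",
                    "s3:PutBucketPolicy",
                    "s3:DeleteBucketPolicy",
                ],
                # Bucket policy actions apply to the bucket ARN itself
                "Resource": [f"arn:aws:s3:::{name}" for name in bucket_names],
            },
            {
                # Assumption: read access for the DataZone lookups in the function
                "Sid": "ReadDataZoneSubscriptionInfo",
                "Effect": "Allow",
                "Action": [
                    "datazone:GetSubscriptionRequestDetails",
                    "datazone:ListEnvironments",
                    "datazone:ListSubscriptionTargets",
                ],
                "Resource": "*",
            },
        ],
    }


# Hypothetical bucket name
policy_document = build_lambda_role_policy(["my-unstructured-data-bucket"])
policy_json = json.dumps(policy_document, indent=2)
```

Scoping the first statement to named buckets keeps the function from editing policies on buckets it was never meant to manage.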

Now you have a function that is capable of editing bucket policies to add or remove the principals configured for your subscription targets, but you need something to invoke this function any time a subscription is created, cancelled, or revoked. In the next section, we cover how to use EventBridge to integrate this new function with Amazon DataZone.

Respond to subscription events in EventBridge

Amazon DataZone publishes information about each event that takes place within it to EventBridge. You can watch for any of these events and invoke actions based on predefined matching rules. In this case, we’re interested in asset subscriptions being created, cancelled, or revoked, because those will determine when we grant or revoke access to the data in Amazon S3.

  1. On the EventBridge console, choose Rules in the navigation pane.

The default event bus should automatically be present; we use it for creating the Amazon DataZone subscription rule.

  1. Choose Create rule.
  2. In the Rule detail section, enter the following:
    1. For Name, enter a name (for example, DataZoneSubscriptions).
    2. For Description, enter a description that explains the purpose of the rule.
    3. For Event bus, choose default.
    4. Turn on Enable the rule on the selected event bus.
    5. For Rule type, select Rule with an event pattern.
  3. Choose Next.

EventBridge rule

  1. In the Event source section, select AWS Events or EventBridge partner events as the source of the events.

Define Event source

  1. In the Creation method section, select Custom Pattern (JSON editor) to enable exact specification of the events needed for this solution.

Choose custom pattern

  1. In the Event pattern section, enter the following code:

{
"detail-type": ["Subscription Created", "Subscription Cancelled", "Subscription Revoked"],
"source": ["aws.datazone"]
}

Define custom pattern JSON
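To sanity-check which events this pattern selects before you deploy the rule, you can approximate the match locally. The following is a deliberately simplified sketch and not the EventBridge matching engine: it handles only top-level keys with lists of exact values, which is all this pattern uses. The function name is hypothetical.

```python
EVENT_PATTERN = {
    "detail-type": ["Subscription Created", "Subscription Cancelled",
                    "Subscription Revoked"],
    "source": ["aws.datazone"],
}


def matches_pattern(event, pattern):
    """Simplified check: each pattern key must list the event's value."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


created = {"source": "aws.datazone", "detail-type": "Subscription Created"}
other = {"source": "aws.s3", "detail-type": "Object Created"}

print(matches_pattern(created, EVENT_PATTERN))
print(matches_pattern(other, EVENT_PATTERN))
```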

  1. Choose Next.

Now that we’ve defined the events to watch for, we can make sure those Amazon DataZone events get sent to the Lambda function we defined in the previous section.

  1. On the Select target(s) page, enter the following for Target 1:
    1. For Target types, select AWS service.
    2. For Select a target, choose Lambda function.
    3. For Function, choose create-s3policy-for-subscription-target.
  2. Choose Skip to Review and create.

Define event target

  1. On the Review and create page, choose Create rule.

Subscribe to the unstructured data asset

Now that you have the custom subscription workflow in place, you can test the workflow by subscribing to the unstructured data asset.

  1. In the Amazon DataZone portal, search for the unstructured data asset you published by browsing the catalog.

Search unstructured asset

  1. Subscribe to the unstructured data asset using the consumer project, which starts the Amazon DataZone approval workflow.

Subscribe to unstructured asset

  1. You should get a notification for the subscription request; follow the link and approve it.

When the subscription is approved, it will invoke the custom EventBridge Lambda workflow, which will create the S3 bucket policies for the instance role to access the S3 object. You can verify that by navigating to the S3 bucket and reviewing the permissions.

Access the subscribed asset from the Amazon DataZone portal

Now that the consumer project has been given access to the unstructured asset, you can access it from the Amazon DataZone portal.

  1. In the Amazon DataZone portal, open the consumer project and navigate to the Environments
  2. Choose the SageMaker-Notebook

Choose SageMaker notebook on the consumer project

  1. In the confirmation pop-up, choose Open custom.

Choose Custom

This will redirect you to the SageMaker notebook assuming the environment role. You can see the SageMaker notebook instance.

  1. Choose Open JupyterLab.

Open JupyterLab Notebook

  1. Choose conda_python3 to launch a new notebook.

Launch Notebook

  1. Add code to run get_object on the unstructured S3 data that you subscribed to earlier, and run the cells.

Now, because the S3 bucket policy has been updated to allow the instance role access to the S3 objects, you should see the get_object call return an HTTPStatusCode of 200.
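The notebook cell can be as small as the following sketch. The helper name, bucket, and key are placeholders. The S3 client is passed in as a parameter so that the function can be tried with a stub before running it against real data.

```python
def object_status(s3_client, bucket, key):
    """Fetch one object and return the HTTP status code of the response."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response["ResponseMetadata"]["HTTPStatusCode"]


# In the notebook (bucket and key are placeholders):
#   import boto3
#   print(object_status(boto3.client("s3"), "my-bucket", "my-prefix/sample.txt"))
```

If the bucket policy was not updated, the get_object call raises an access denied error instead of returning a response.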

Multi-account implementation

In the instructions so far, we’ve deployed everything in a single AWS account, but in larger organizations, resources can be distributed throughout AWS accounts, often managed by AWS Organizations. The same pattern can be applied in a multi-account environment, with some minor additions. Instead of directly acting on a bucket, the Lambda function in the domain account can assume a role in other accounts that contain S3 buckets to be managed. In each account with an S3 bucket containing assets, create a role that allows editing the bucket policy and has a trust policy referencing the Lambda role in the domain account as a principal.
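The cross-account pieces described above can be sketched as follows. The function names and the session name are hypothetical, and the STS client is passed in so the logic can be tested without AWS access. The returned dictionary is intended to be passed as keyword arguments to boto3.client('s3', ...).

```python
def build_trust_policy(lambda_role_arn):
    """Trust policy for the role in the account that owns the S3 bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            # Only the Lambda role in the domain account may assume this role
            "Principal": {"AWS": lambda_role_arn},
            "Action": "sts:AssumeRole",
        }],
    }


def assumed_role_client_kwargs(sts_client, role_arn,
                               session_name="datazone-subscription"):
    """Assume the bucket account's role and return S3 client credentials."""
    creds = sts_client.assume_role(
        RoleArn=role_arn, RoleSessionName=session_name)["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }


# In the Lambda function (role ARN is a placeholder):
#   import boto3
#   kwargs = assumed_role_client_kwargs(boto3.client("sts"), bucket_account_role_arn)
#   s3 = boto3.client("s3", **kwargs)
```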

Clean up

If you’ve finished experimenting and don’t want to incur any further cost for the resources deployed, you can clean up the components as follows:

  1. Delete the Amazon DataZone domain.
  2. Delete the Lambda function.
  3. Delete the SageMaker instance.
  4. Delete the S3 bucket that hosted the unstructured asset.
  5. Delete the IAM roles.

Conclusion

By implementing this custom workflow, organizations can extend the simplified subscription and access workflows provided by Amazon DataZone to their unstructured data stored in Amazon S3. This approach provides greater control over unstructured data assets, facilitating discovery and access across the enterprise.

We encourage you to try out the solution for your own use case, and share your feedback in the comments.


About the Authors

Somdeb Bhattacharjee is a Senior Solutions Architect specializing in data and analytics. He is part of the global Healthcare and Life Sciences industry at AWS, helping his customers modernize their data platform solutions to achieve their business outcomes.

Sam Yates is a Senior Solutions Architect in the Healthcare and Life Sciences business unit at AWS. He has spent most of the past two decades helping life sciences companies apply technology in pursuit of their missions to help patients. Sam holds BS and MS degrees in Computer Science.

How to Audit and Report S3 Prefix Level Access Using S3 Access Analyzer

Post Syndicated from Somdeb Bhattacharjee original https://aws.amazon.com/blogs/architecture/how-to-audit-and-report-s3-prefix-level-access-using-s3-access-analyzer/

Data Services teams in all industries are developing centralized data platforms that provide shared access to datasets across multiple business units and teams within the organization. This makes data governance easier, minimizes data redundancy and thereby reduces cost, and improves data integrity. The central data platform is often built with Amazon Simple Storage Service (Amazon S3).

A common pattern for providing access to this data is for you to set up cross-account IAM Users and IAM Roles to allow direct access to the datasets stored in S3 buckets. You then enforce the permission on these datasets with S3 Bucket Policies or S3 Access Point policies. These policies can be very granular and you can provide access at the bucket level, prefix level as well as object level within an S3 bucket.

To reduce risk and unintended access, you can use Access Analyzer for S3 to identify S3 buckets within your zone of trust (Account or Organization) that are shared with external identities. Access Analyzer for S3 provides a lot of useful information at the bucket level but you often need S3 audit capability one layer down, at the S3 prefix level, since you are most likely going to organize your data using S3 prefixes.

Common use cases

Many organizations need to ingest a lot of third-party/vendor datasets and then distribute these datasets within the organization in a subscription-based model. Irrespective of how the data is ingested, whether it is using AWS Transfer Family service or other mechanisms, all the ingested datasets are stored in a single S3 bucket with a separate prefix for each vendor dataset. The hierarchy can be represented as:

vendor-s3-bucket
       ->vendorA-prefix
               ->vendorA.dataset.csv
       ->vendorB-prefix
               ->vendorB.dataset.csv

Based on this design, access is also granted to data subscribers at the S3 prefix level. Access Analyzer for S3 does not provide visibility at the S3 prefix level, so you need to develop custom scripts to extract this information from the S3 policy documents. You also need the information in an easy-to-consume format, such as a CSV file, that can be queried, filtered, downloaded, and shared across the organization.

To help address this requirement, we show how to implement a solution that builds on the Access Analyzer for S3 findings to generate a CSV file at a preconfigured frequency. The report provides details about:

  • External principals outside your zone of trust that have access to your S3 buckets
  • Permissions granted to these external principals (read, write)
  • S3 prefixes these external principals can access, as configured through the S3 bucket policy, S3 Access Point policies, or both

Architecture Overview

Figure 1 – How to Audit and Report S3 prefix level access using S3 Access Analyzer

The solution entails the following steps:

Step 1 – The Access Analyzer ARN and the S3 bucket parameters are passed to an AWS Lambda function through environment variables.

Step 2 – The Lambda code uses the Access Analyzer ARN to call the list-findings API, retrieves the findings, and stores them in JSON format in the S3 bucket (under the json prefix).

Step 3 – The Lambda function then parses the JSON file to extract the required fields and stores them as a CSV file in the same S3 bucket (under the report prefix). It also scans the bucket policy, the access point policies, or both to retrieve the S3 prefix-level permissions granted to the external identity, and adds that information to the CSV file.

Steps 4 and 5 – As part of the initial deployment, an AWS Glue crawler is provided to discover the schema of the CSV file and store it in the AWS Glue Data Catalog.

Step 6 – An Amazon Athena query is run to create a spreadsheet of the findings that can be downloaded and distributed for audit.
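The JSON-to-CSV part of Step 3 can be sketched in a few lines of Python. This is a minimal illustration, not the Lambda code deployed by the template. The sample findings are hypothetical and simplified, and the column names are assumptions, apart from externalaccount, which appears in the Athena query later in this post.

```python
import csv
import io

# Assumed report columns; only "externalaccount" is named in the post.
FIELDS = ["resource", "externalaccount", "actions", "status"]


def findings_to_csv(findings):
    """Flatten S3 bucket findings into CSV text, one row per finding."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for finding in findings:
        # Keep only findings that concern S3 buckets.
        if finding.get("resourceType") != "AWS::S3::Bucket":
            continue
        writer.writerow({
            "resource": finding.get("resource", ""),
            "externalaccount": finding.get("principal", {}).get("AWS", ""),
            "actions": ";".join(finding.get("action", [])),
            "status": finding.get("status", ""),
        })
    return buf.getvalue()


# Hypothetical, trimmed-down findings in the general shape of list-findings output.
SAMPLE_FINDINGS = [
    {
        "id": "example-finding-1",
        "resource": "arn:aws:s3:::vendor-s3-bucket",
        "resourceType": "AWS::S3::Bucket",
        "principal": {"AWS": "111122223333"},
        "action": ["s3:GetObject", "s3:ListBucket"],
        "status": "ACTIVE",
    },
    {
        "id": "example-finding-2",
        "resource": "arn:aws:iam::999999999999:role/ExampleRole",
        "resourceType": "AWS::IAM::Role",
        "principal": {"AWS": "444455556666"},
        "action": ["sts:AssumeRole"],
        "status": "ACTIVE",
    },
]

if __name__ == "__main__":
    print(findings_to_csv(SAMPLE_FINDINGS))
```

Joining the actions with a semicolon keeps each finding on a single row, which makes the file easier for a crawler to read as one table.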

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • S3 buckets that are shared with external identities through cross-account IAM roles or IAM users. Follow the instructions in this user guide to set up cross-account S3 bucket access.
  • IAM Access Analyzer enabled for your AWS account. Follow these instructions to enable IAM Access Analyzer within your account.

Once IAM Access Analyzer is enabled, you can view the analyzer findings from the S3 console by selecting the bucket name and choosing ‘View findings’, or by going directly to the Access Analyzer findings on the IAM console.

When you select a ‘Finding id’ for an S3 Bucket, a screen similar to the following will appear:

Figure 2 – IAM Console Screenshot

Setup

Now that your access analyzer is running, you can open the link below to deploy the CloudFormation template. Make sure to launch the CloudFormation stack in the same AWS Region where IAM Access Analyzer has been enabled.

Launch template

Specify a name for the stack and input the following parameters:

  • ARN of the Access Analyzer which you can find from the IAM Console.
  • New S3 bucket where your findings will be stored. The CloudFormation template will add a suffix to the bucket name you provide to ensure uniqueness.

Figure 3 – CloudFormation Template screenshot

  • Select Next twice and on the final screen check the box allowing CloudFormation to create the IAM resources before selecting Create Stack.
  • It will take a couple of minutes for the stack to create the resources and launch the AWS Lambda function.
  • Once the stack is in CREATE_COMPLETE status, go to the Outputs tab of the stack and note the value of the DataS3BucketName key. This is the S3 bucket that the template created; its name has the format analyzer-findings-xxxxxxxxxxxx. Go to the S3 console and view the contents of the bucket.
    There should be two folders, archive/ and report/. The report folder should contain the CSV file with the findings report.
  • You can download the CSV file directly and open it in a spreadsheet application such as Excel to view the contents. If you would like to query the CSV file based on different attributes, follow the next set of steps.
    Go to the AWS Glue console and choose Crawlers. There should be an analyzer-crawler created for you. Select the crawler and run it.
  • After the crawler runs successfully, you should see a new table, analyzer-report, created under the analyzerdb AWS Glue database.
  • Select the table name to view the table properties and schema.
  • To query the table, go to the Athena console and select the analyzerdb database. Then you can run a query like “Select * from analyzer-report where externalaccount = <<valid external account>>” to list all the S3 buckets the external account has access to. Because the table name contains a hyphen, you may need to enclose it in double quotes in the query.

Figure 4 – Amazon Athena Console screenshot

The output of the query with a subset of columns is shown as follows:

Figure 5 – Output of Amazon Athena Query

This CloudFormation template also creates an Amazon CloudWatch Events rule, testanalyzer-ScheduledRule-xxxxxxx, that invokes the Lambda function every Monday to generate a new version of the findings CSV file. You can update the rule to set the frequency you want.
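If you change the schedule, the rule expects a six-field cron expression (minutes, hours, day of month, month, day of week, year), evaluated in UTC. The helper below is a hypothetical convenience for building a weekly expression; the post does not show the exact expression used by the template, so the time of day here is an assumption.

```python
DAYS = ("SUN", "MON", "TUE", "WED", "THU", "FRI", "SAT")


def weekly_schedule(day, hour=0, minute=0):
    """Build a weekly schedule expression for a scheduled rule (times in UTC)."""
    short = day.upper()[:3]
    if short not in DAYS:
        raise ValueError(f"unknown day of week: {day!r}")
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    # Day of month is "?" because the day of week is specified instead.
    return f"cron({minute} {hour} ? * {short} *)"


if __name__ == "__main__":
    print(weekly_schedule("Monday", hour=8))
```

For example, calling weekly_schedule("Friday", hour=18, minute=30) would move the report to Friday evenings UTC.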

Clean Up

To avoid incurring costs, remember to delete the resources you created. First, manually delete the folders ‘archive’ and ‘report’ in the S3 bucket and then delete the CloudFormation stack you deployed at the beginning of the setup.

Conclusion

In this blog, we showed how you can build audit capabilities for external principals accessing your S3 buckets at a prefix level. Organizations looking to provide shared access to datasets across multiple business units will find this solution helpful in improving their security posture. Give this solution a try and share your feedback!

Field Notes: Restricting Amazon WorkSpaces Users to Run Amazon Athena Queries

Post Syndicated from Somdeb Bhattacharjee original https://aws.amazon.com/blogs/architecture/field-notes-restricting-amazon-workspaces-users-to-run-amazon-athena-queries/

One use case we hear from customers is that they want to give Amazon WorkSpaces users (for example, contractors or consultants) very limited access in an AWS account. At the same time, they want to allow those users to query Amazon Simple Storage Service (Amazon S3) data in another account using Amazon Athena over a JDBC connection.

For example, marketing companies might provide media agencies with private access to their first-party data through this mechanism.

The restrictions they want to put in place are:

  • For security reasons, these Amazon WorkSpaces should not have internet connectivity, so access to Amazon Athena must be over AWS PrivateLink.
  • Public access to Athena is not allowed with the credentials used for the JDBC connection. This prevents users from using those credentials to query the data from anywhere else.

In this post, we show how to use Amazon Virtual Private Cloud (Amazon VPC) endpoints for Athena, along with AWS Identity and Access Management (AWS IAM) policies. This provides private access to query the Amazon S3 data while preventing users from querying the data from outside their Amazon WorkSpaces or using the Athena public endpoint.

Let’s review the steps to achieve this:

  • Initial setup of two AWS accounts (AccountA and AccountB)
  • Set up Amazon S3 bucket with sample data in AccountA
  • Set up an IAM user with Amazon S3 and Athena access in AccountA
  • Create an Amazon VPC endpoint for Athena in AccountA
  • Set up Amazon WorkSpaces for a user in AccountB
  • Install a SQL client tool (we will use DbVisualizer Free) and Athena JDBC driver in Amazon WorkSpaces in AccountB
  • Use DbVisualizer to query the Amazon S3 data in AccountA using the Athena public endpoint
  • Update the IAM policy for the user in AccountA to restrict access to private connections only

Prerequisites

To follow the steps in this post, you need two AWS Accounts. The Amazon VPC and subnet requirements are specified in the detailed steps.

Note: The AWS CloudFormation template used in this blog post is for the US East (N. Virginia) Region (us-east-1), so make sure the Region for both accounts is set to US East (N. Virginia).

Walkthrough

The two AWS accounts are:

AccountA – Contains the Amazon S3 bucket where the data is stored. For AccountA you can create a new Amazon VPC or use the default Amazon VPC.

AccountB – Amazon WorkSpaces account. Use the following AWS CloudFormation template for AccountB:

  • The AWS CloudFormation template will create a new Amazon VPC in AccountB with CIDR 10.10.0.0/16 and set up one public subnet and two private subnets.
  • It will also create a NAT Gateway in the public subnet and create both public and private route tables.
  • Since we will be launching Amazon WorkSpaces in these private subnets and not all Availability Zones (AZ) are supported by Amazon WorkSpaces, it is important to choose the right AZ when creating them.

Review the documentation to learn which AWS Regions/AZ are supported.

We have provided two parameters in the AWS CloudFormation template:

  • AZName1
  • AZName2

Step 1

Before launching the CloudFormation stack:

  • Log in to AccountB
  • Search for AWS Resource Access Manager
  • On the right-hand side, you will notice the AZ ID to AZ name mapping. Note the AZ names that correspond to AZ IDs use1-az2 and use1-az4.
  • Now launch the CloudFormation template and remember to choose the AZ names you noted earlier
    • https://athena-workspaces-blogpost.s3.amazonaws.com/vpc.yaml
  • Enter ‘AthenaWorkspaces’ as the CloudFormation stack name and leave all other settings at their defaults.
  • Once the CloudFormation stack creation is complete, create a peering connection from AccountB to AccountA.
  • Update the associated route tables for the private subnets with the new peering connection.
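The AZ ID to AZ name lookup in this step can also be scripted. The sketch below works on a dictionary shaped like the response of the EC2 DescribeAvailabilityZones API, which returns a ZoneName and a ZoneId for each zone. The sample mapping is made up for illustration; the real mapping differs from account to account, which is why the lookup is needed at all. The function name is hypothetical.

```python
def zone_names_for_ids(response, wanted_ids):
    """Map AZ IDs (such as use1-az2) to this account's AZ names."""
    mapping = {z["ZoneId"]: z["ZoneName"] for z in response["AvailabilityZones"]}
    missing = [zone_id for zone_id in wanted_ids if zone_id not in mapping]
    if missing:
        raise KeyError(f"AZ IDs not found in this Region: {missing}")
    return [mapping[zone_id] for zone_id in wanted_ids]


# Made-up sample; in practice the response could come from
# boto3.client("ec2", region_name="us-east-1").describe_availability_zones()
SAMPLE_RESPONSE = {
    "AvailabilityZones": [
        {"ZoneName": "us-east-1a", "ZoneId": "use1-az4"},
        {"ZoneName": "us-east-1b", "ZoneId": "use1-az6"},
        {"ZoneName": "us-east-1c", "ZoneId": "use1-az2"},
    ]
}

if __name__ == "__main__":
    az1, az2 = zone_names_for_ids(SAMPLE_RESPONSE, ["use1-az2", "use1-az4"])
    print("AZName1:", az1)
    print("AZName2:", az2)
```

The two returned names are the values to supply for the AZName1 and AZName2 template parameters.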

For information on how to create a VPC peering connection, refer to AWS documentation on VPC Peering.

AccountB VPC Route Table:

AccountA VPC Route Table:

Step 2

  • Create a new Amazon S3 bucket in AccountA with a bucket name that starts with ‘athena-’.
  • Next, you can download a sample file and upload it to the Amazon S3 bucket you just created.
  • Use the following statements to create an AWS Glue database and an external table over the data in the Amazon S3 bucket so that you can query it from Athena.
  • Go to Athena console and define a new database:

CREATE DATABASE IF NOT EXISTS sampledb

Once the database is created, create a new table in sampledb (by selecting sampledb from the “Database” drop down menu). Replace the <<your bucket name>> with the bucket you just created:

CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.amazon_reviews_tsv(
  marketplace string, 
  customer_id string, 
  review_id string, 
  product_id string, 
  product_parent string, 
  product_title string,
  product_category string,
  star_rating int, 
  helpful_votes int, 
  total_votes int, 
  vine string, 
  verified_purchase string, 
  review_headline string, 
  review_body string, 
  review_date string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
LOCATION
  's3://<<your bucket name>>/'
TBLPROPERTIES ("skip.header.line.count"="1")

 

Step 3

  • In AccountA, create a new IAM user with programmatic access.
  • Save the access key and secret access key.
  • For the same user add an Inline Policy which allows the following actions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowAthenaReadActions",
            "Effect": "Allow",
            "Action": [
                "athena:ListWorkGroups",
                "athena:ListDataCatalogs",
                "athena:GetExecutionEngine",
                "athena:GetExecutionEngines",
                "athena:GetNamespace",
                "athena:GetCatalogs",
                "athena:GetNamespaces",
                "athena:GetTables",
                "athena:GetTable"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowAthenaWorkgroupActions",
            "Effect": "Allow",
            "Action": [
                "athena:StartQueryExecution",
                "athena:GetQueryResults",
                "athena:DeleteNamedQuery",
                "athena:GetNamedQuery",
                "athena:ListQueryExecutions",
                "athena:StopQueryExecution",
                "athena:GetQueryResultsStream",
                "athena:ListNamedQueries",
                "athena:CreateNamedQuery",
                "athena:GetQueryExecution",
                "athena:BatchGetNamedQuery",
                "athena:BatchGetQueryExecution",
                "athena:GetWorkGroup"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowGlueActionsViaVPCE",
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:CreateDatabase",
                "glue:GetTables",
                "glue:GetTable"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowGlueActionsViaAthena",
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:CreateDatabase",
                "glue:GetTables",
                "glue:GetTable"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowS3ActionsViaAthena",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:ListMultipartUploadParts",
                "s3:AbortMultipartUpload",
                "s3:CreateBucket",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::athena-*"
            ]
        }
    ]
}

 

Step 4

  • In this step, we create an Interface VPC endpoint (AWS PrivateLink) for Athena in AccountA. When you use an interface VPC endpoint, communication between your Amazon VPC and Athena is conducted entirely within the AWS network.
  • Each VPC endpoint is represented by one or more Elastic Network Interfaces (ENIs) with private IP addresses in your VPC subnets.
  • To create an interface VPC endpoint, follow the instructions and select Athena in the AWS Services list. Do not select the checkbox for Enable Private DNS Name.
  • Ensure the security group that is attached to the Amazon VPC endpoint allows inbound traffic on ports 443 and 444 from the AccountB VPC CIDR 10.10.0.0/16. Port 444 is used by Athena to stream query results.
  • Once you create the VPC endpoint, you will get a DNS endpoint name in the following format. We are going to use it in the JDBC connection from the SQL client.

      VPC_Endpoint_ID.athena.Region.vpce.amazonaws.com
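The security group requirement from this step can be written down as data. The following sketch builds ingress rules in the IpPermissions shape accepted by the EC2 AuthorizeSecurityGroupIngress API and includes a small local check of which sources and ports those rules cover. The function names and the sample IP addresses are illustrative assumptions; nothing here calls AWS.

```python
import ipaddress


def endpoint_ingress_rules(source_cidr, ports=(443, 444)):
    """Build one TCP ingress rule per port for the given source CIDR."""
    network = ipaddress.ip_network(source_cidr)  # raises ValueError if malformed
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": str(network)}],
        }
        for port in ports
    ]


def source_allowed(ip, port, rules):
    """Return True if the rules admit traffic from ip on the given port."""
    address = ipaddress.ip_address(ip)
    for rule in rules:
        if not rule["FromPort"] <= port <= rule["ToPort"]:
            continue
        for ip_range in rule["IpRanges"]:
            if address in ipaddress.ip_network(ip_range["CidrIp"]):
                return True
    return False


if __name__ == "__main__":
    rules = endpoint_ingress_rules("10.10.0.0/16")  # AccountB VPC CIDR
    print(rules)
    print(source_allowed("10.10.1.25", 444, rules))  # an address inside the VPC
    print(source_allowed("10.20.0.5", 443, rules))   # an address outside the VPC
```

Keeping port 444 in the rule set matters because query results are streamed over it; opening only 443 would let connections start but not return streamed results.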

Step 5

  • In this step we set up Amazon WorkSpaces in AccountB.
  • Each Amazon WorkSpace is associated with the specific Amazon VPC and AWS Directory Service construct that you used to create it. All Directory Service constructs (Simple AD, AD Connector, and Microsoft AD) require two subnets to operate, each in a different Availability Zone. This is why we created two private subnets at the beginning.
  • For this blog post I have used Simple AD as the directory service for the Amazon WorkSpaces.
  • By default, IAM users don’t have permissions for Amazon WorkSpaces resources and operations.
  • To allow IAM users to manage Amazon WorkSpaces resources, you must create an IAM policy that explicitly grants them permissions.
  • Then attach the policy to the IAM users or groups that require those permissions.
  • To start, go to the Amazon WorkSpaces console and select Advanced Setup.
    • Set up a new directory using the SimpleAD option.
    • Use the “small” directory size and choose the Amazon VPC and private subnets you created in Step 1 for AccountB.
    • Once you create the directory, register the directory with Amazon WorkSpaces by selecting “Register” from the “Action” menu.
    • Select private subnets you created in Step 1 for AccountB.

  • Next, launch Amazon WorkSpaces by following the Launch WorkSpaces button.
  • Select the directory you created and create a new user.
  • For the bundle, choose Standard with Windows 10 (PCoIP).
  • After the Amazon WorkSpace is created, you can log in to it using the client application. You can download it from https://clients.amazonworkspaces.com/
  • Log in to your Amazon WorkSpace and install a SQL client of your choice. At this point your Amazon WorkSpace still has internet access through the NAT Gateway.
  • I have used DbVisualizer (the free version) as the SQL client. Once you have that installed, install the JDBC driver for Athena following the instructions.
  • Now you can set up the JDBC connections to Athena using the access key and secret key you created for the IAM user in AccountA.

Step 6

To test out both the Athena public endpoint and the Athena VPC endpoint, create two connections using the same credentials.

For the Athena public endpoint, use the athena.us-east-1.amazonaws.com service endpoint (jdbc:awsathena://athena.us-east-1.amazonaws.com:443;S3OutputLocation=s3://<athena-bucket-name>/).

For the VPC Endpoint Connection, use the VPC Endpoint you created in Step 4 (jdbc:awsathena://vpce-<>.athena.us-east-1.vpce.amazonaws.com:443;S3OutputLocation=s3://<athena-bucket-name>/)
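The two connection strings differ only in the host name, which the following sketch makes explicit. The helper names are hypothetical, and the endpoint identifier and bucket name in the usage example are placeholders rather than real resources.

```python
def public_host(region):
    """Host name of the Athena public service endpoint for a Region."""
    return f"athena.{region}.amazonaws.com"


def vpce_host(endpoint_id, region):
    """Host name of an Athena interface VPC endpoint (format from Step 4)."""
    return f"{endpoint_id}.athena.{region}.vpce.amazonaws.com"


def athena_jdbc_url(host, output_bucket, port=443):
    """Assemble the JDBC URL used in the SQL client connection settings."""
    return f"jdbc:awsathena://{host}:{port};S3OutputLocation=s3://{output_bucket}/"


if __name__ == "__main__":
    region = "us-east-1"
    bucket = "athena-example-bucket"              # placeholder bucket name
    endpoint_id = "vpce-0123456789abcdef0-abcd1234"  # placeholder endpoint ID
    print(athena_jdbc_url(public_host(region), bucket))
    print(athena_jdbc_url(vpce_host(endpoint_id, region), bucket))
```

Both URLs use the same credentials and the same output location, so any difference in behavior between the two connections comes from the network path and the IAM conditions, not from the connection settings.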

Now run a simple query to select records from the amazon_reviews_tsv table using both the connections.

SELECT * FROM sampledb.amazon_reviews_tsv limit 10

You should be able to see results using both the connections. Since the private subnets are still connected to the internet via the NAT Gateway, you can query using the Athena public endpoint.

From your workstation, run the following AWS Command Line Interface (AWS CLI) commands with the same credentials used for the JDBC connection. You should be able to list the Amazon S3 bucket objects and the Athena query executions.

aws s3 ls s3://athena-workspaces-blogpost

aws athena list-query-executions

Step 7

  • Now we lock down access as described at the beginning of this blog post by taking the following actions:
  • Update the route table for the private subnets by removing the route to the internet, so that the Athena public endpoint can no longer be reached from the Amazon WorkSpaces. Access is then possible only through the Athena VPC endpoint.
  • Add condition checks to the IAM user's access policy so that access to the Amazon S3 buckets and Athena is allowed only if:
    • The request came in through the VPC endpoint. For this we use the “aws:SourceVpce” check and provide the VPC endpoint ID value.
    • The request for Amazon S3 data is made through Athena. For this we use the condition “aws:CalledVia” and provide a value of “athena.amazonaws.com”.
  • In the IAM access policy below, replace <<your vpce id>> with your VPC endpoint ID, and use the result to update the inline policy that you added to the IAM user in Step 3.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowAthenaReadActions",
            "Effect": "Allow",
            "Action": [
                "athena:ListWorkGroups",
                "athena:ListDataCatalogs",
                "athena:GetExecutionEngine",
                "athena:GetExecutionEngines",
                "athena:GetNamespace",
                "athena:GetCatalogs",
                "athena:GetNamespaces",
                "athena:GetTables",
                "athena:GetTable"
            ],
            "Resource": "*",
            "Condition":{
               "StringEquals":{
                  "aws:SourceVpce":[
                     "<<your vpce id>>"
                  ]
               }
            }
        },
        {
            "Sid": "AllowAthenaWorkgroupActions",
            "Effect": "Allow",
            "Action": [
                "athena:StartQueryExecution",
                "athena:GetQueryResults",
                "athena:DeleteNamedQuery",
                "athena:GetNamedQuery",
                "athena:ListQueryExecutions",
                "athena:StopQueryExecution",
                "athena:GetQueryResultsStream",
                "athena:ListNamedQueries",
                "athena:CreateNamedQuery",
                "athena:GetQueryExecution",
                "athena:BatchGetNamedQuery",
                "athena:BatchGetQueryExecution",
                "athena:GetWorkGroup"
            ],
            "Resource": "*",
            "Condition":{
               "StringEquals":{
                  "aws:SourceVpce":[
                     "<<your vpce id>>"
                  ]
               }
            }
        },
        {
            "Sid": "AllowGlueActionsViaVPCE",
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:CreateDatabase",
                "glue:GetTables",
                "glue:GetTable"
            ],
            "Resource": "*",
            "Condition":{
               "StringEquals":{
                  "aws:SourceVpce":[
                     "<<your vpce id>>"
                  ]
               }
            }
        },
        {
            "Sid": "AllowGlueActionsViaAthena",
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:CreateDatabase",
                "glue:GetTables",
                "glue:GetTable"
            ],
            "Resource": "*",
            "Condition":{
               "ForAnyValue:StringEquals":{
                  "aws:CalledVia":[
                     "athena.amazonaws.com"
                  ]
               }
            }
        },
        {
            "Sid": "AllowS3ActionsViaAthena",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:ListMultipartUploadParts",
                "s3:AbortMultipartUpload",
                "s3:CreateBucket",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::athena-*"
            ],
            "Condition":{
               "ForAnyValue:StringEquals":{
                  "aws:CalledVia":[
                     "athena.amazonaws.com"
                  ]
               }
            }
        }
    ]
}
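The difference between the Step 3 policy and the policy above is purely the Condition blocks, so the change can be expressed as a small transformation. The sketch below is one way to do that, under the assumption that statements whose Sid ends in ViaAthena receive the aws:CalledVia condition and all others receive the aws:SourceVpce condition, which mirrors the Sid names used in this post. The function name, the trimmed sample policy, and the endpoint ID are illustrative.

```python
import copy
import json


def lock_down(policy, vpce_id):
    """Return a copy of the policy with a Condition added to every statement."""
    locked = copy.deepcopy(policy)  # leave the caller's policy untouched
    for stmt in locked["Statement"]:
        if stmt["Sid"].endswith("ViaAthena"):
            # Calls that Athena makes on the user's behalf (AWS Glue, Amazon S3).
            stmt["Condition"] = {
                "ForAnyValue:StringEquals": {
                    "aws:CalledVia": ["athena.amazonaws.com"]
                }
            }
        else:
            # Calls that arrive directly through the interface VPC endpoint.
            stmt["Condition"] = {
                "StringEquals": {"aws:SourceVpce": [vpce_id]}
            }
    return locked


# Trimmed, illustrative version of the Step 3 policy (two statements only).
OPEN_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowAthenaWorkgroupActions",
            "Effect": "Allow",
            "Action": ["athena:StartQueryExecution", "athena:GetQueryResults"],
            "Resource": "*",
        },
        {
            "Sid": "AllowS3ActionsViaAthena",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": ["arn:aws:s3:::athena-*"],
        },
    ],
}

if __name__ == "__main__":
    locked = lock_down(OPEN_POLICY, "vpce-0123456789abcdef0")  # placeholder ID
    print(json.dumps(locked, indent=4))
```

Generating the restricted policy from the open one keeps the action lists identical in both versions, so only the access path changes between Step 3 and Step 7.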

Once you have applied the changes, try to reconnect using both the Athena VPC endpoint connection and the Athena public endpoint connection. The VPC endpoint connection should work, but the public endpoint connection will time out. Also try the same Amazon S3 and Athena AWS CLI commands from your workstation. Both operations should return an access denied error.

Clean Up

To avoid incurring costs, remember to delete the resources that you created.

For AWS AccountA:

  • Delete the S3 buckets
  • Delete the database you created in AWS Glue
  • Delete the Amazon VPC endpoint you created for Amazon Athena

For AccountB:

  • Delete the Amazon WorkSpace you created along with the Simple AD directory. You can review more information on how to delete your WorkSpaces.

Conclusion

In this blog post, I showed how to use Amazon VPC endpoints and IAM policies to privately connect to Amazon Athena from Amazon WorkSpaces that don’t have internet connectivity.

Give this solution a try and share your feedback in the comments!

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.