Tag Archives: Amazon Macie

Detecting sensitive data in DynamoDB with Macie

Post Syndicated from Sheldon Sides original https://aws.amazon.com/blogs/security/detecting-sensitive-data-in-dynamodb-with-macie/

Amazon Macie is a fully managed data security and data privacy service that uses machine learning and pattern matching to discover and protect your sensitive data in Amazon Web Services (AWS). It gives you the ability to automatically scan for sensitive data and get an inventory of your Amazon Simple Storage Service (Amazon S3) buckets. Macie also gives you the added ability to detect which buckets are public, unencrypted, and accessible from other AWS accounts.

In this post, we’ll walk through how to use Macie to detect sensitive data in Amazon DynamoDB tables by exporting the data to Amazon S3 so that Macie can scan the data. An example of why you would deploy a solution like this is if you have potentially sensitive data stored in DynamoDB tables. When we’re finished, you’ll have a solution that can set up on-demand or scheduled Macie discovery jobs to detect sensitive data exported from DynamoDB to S3.

Architecture

In figure 1, you can see an architectural diagram explaining the flow of the solution that you’ll be deploying.

Figure 1: Solution architecture

Figure 1: Solution architecture

Here’s a brief overview of the steps that you’ll take to deploy the solution. Some steps you will do manually, while others will be handled by the provided AWS CloudFormation template. The following outline describes the steps taken to extract the data from DynamoDB and store it in S3, which allows Macie to run a discovery job against the data.

  1. Enable Amazon Macie, if it isn’t already enabled.
  2. Deploy a test DynamoDB dataset.
  3. Create an S3 bucket to export DynamoDB data to.
  4. Configure an AWS Identity and Access Management (IAM) policy and role. (These are used by the Lambda function to access the S3 and DynamoDB tables)
  5. Deploy an AWS Lambda function to export DynamoDB data to S3.
  6. Set up an Amazon EventBridge rule to schedule export of the DynamoDB data.
  7. Create a Macie discovery job to discover sensitive data from the DynamoDB data export.
  8. View the results of the Macie discovery job.

The goal is that when you finish, you have a solution that you can use to set up either on-demand or scheduled Macie discovery jobs to detect sensitive data that was exported from DynamoDB to S3.

Prerequisite: Enable Macie

If Macie hasn’t been enabled in your account, complete Step 1 in Getting started with Amazon Macie to enable Macie. Once you’ve enabled Macie, you can proceed with the deployment of the CloudFormation template.

Deploy the CloudFormation template

In this section, you start by deploying the CloudFormation template that will deploy all the resources needed for the solution. You can then review the output of the resources that have been deployed.

To deploy the CloudFormation template

  1. Download the CloudFormation template: https://github.com/aws-samples/macie-dynamodb-blog/blob/main/src/cft.yaml
  2. Sign in to the AWS Management Console and navigate to the CloudFormation console.
  3. Choose Upload a template file, and then select the CloudFormation template that you downloaded in the previous step. Choose Next.

    Figure 2 - Uploading the CloudFormation template to be deployed

    Figure 2 – Uploading the CloudFormation template to be deployed

  4. For Stack Name, name your stack macie-blog, and then choose Next.

    Figure 3: Naming your CloudFormation stack

    Figure 3: Naming your CloudFormation stack

  5. For Configure stack options, keep the default values and choose Next.
  6. At the bottom of the Review screen, select the I acknowledge that AWS CloudFormation might create IAM resources check box, and then choose Create stack.

    Figure 4: Acknowledging that this CloudFormation template will create IAM roles

    Figure 4: Acknowledging that this CloudFormation template will create IAM roles

You should then see the following screen. It may take several minutes for the CloudFormation template to finish deploying.

Figure 5: CloudFormation stack creation in progress

Figure 5: CloudFormation stack creation in progress

View CloudFormation output

Once the CloudFormation template has been completely deployed, choose the Outputs tab, and you will see the following screen. Here you’ll find the names and URLs for all the AWS resources that are needed to complete the remainder of the solution.

Figure 6: Completed CloudFormation stack output

Figure 6: Completed CloudFormation stack output

For easier reference, open a new browser tab to your AWS Management Console and leave this tab open. This will make it easier to quickly copy and paste the resource URLs as you navigate to different resources during this walkthrough.

Import DynamoDB data

In this section, we walk through importing the test dataset to DynamoDB. You first start by downloading the test CSV datasets, then upload those datasets to S3 and run the Lambda function that imports the data to DynamoDB. Finally, you review the data that was imported into DynamoDB.

Test datasets

Download the following test datasets:

  1. Accounts Info test dataset (accounts.csv): https://github.com/aws-samples/macie-dynamodb-blog/blob/main/datasets/accounts.csv
  2. People test dataset (people.csv): https://github.com/aws-samples/macie-dynamodb-blog/blob/main/datasets/people.csv

Upload data to the S3 import bucket

Now that you’ve downloaded the test datasets, you’ll need to navigate to the data import S3 bucket and upload the data.

To upload the datasets to the S3 import bucket

  1. Navigate to the CloudFormation Outputs tab, where you’ll find the bucket information.

    Figure 7: S3 bucket output values for the CloudFormation stack

    Figure 7: S3 bucket output values for the CloudFormation stack

  2. Copy the ImportS3BucketURL link and navigate to the URL.
  3. Upload the two test CSV datasets, people.csv and accounts.csv, to your S3 bucket.
  4. After the upload is complete, you should see the two CSV files in the S3 bucket. You’ll use these files as your test DynamoDB data.

    Figure 8: Test S3 datasets in the S3 bucket

    Figure 8: Test S3 datasets in the S3 bucket

View the data import Lambda function

Now that you have your test data staged for loading, you’ll import it into DynamoDB by using a Lambda function that was deployed with the CloudFormation template. To start, navigate to the CloudFormation console and get the URL to the Lambda function that will handle the data import to DynamoDB, as shown in figure 9.

Figure 9: CloudFormation output information for the People DynamoDB table

Figure 9: CloudFormation output information for the People DynamoDB table

To run the data import Lambda function

  1. Copy the LambdaImportS3DataToDynamoURL link and navigate to the URL. You will see the Import-Data-To-DynamoDB Lambda function, as shown in figure 10.

    Figure 10: The Lambda function that imports data to DynamoDB

    Figure 10: The Lambda function that imports data to DynamoDB

  2. Choose the Test button in the upper right-hand corner. In the dialog screen, for Event name, enter Test and replace the value with {}.
  3. Your screen should now look as shown in figure 11. Choose Create.

    Figure 11: Configuring a test event to manually run the Lambda function

    Figure 11: Configuring a test event to manually run the Lambda function

  4. Choose the Test button again in the upper right-hand corner. You should now see the Lambda function running, as shown in figure 12.

    Figure 12: View of the Lambda function running

    Figure 12: View of the Lambda function running

  5. Once the Lambda function is finished running, you can expand the Details section. You should see a screen similar to the one in figure 13. When you see this screen, the test datasets have successfully been imported into the DynamoDB tables.

    Figure 13: View of the data import Lambda function after it runs successfully

    Figure 13: View of the data import Lambda function after it runs successfully

View the DynamoDB test dataset

Now that you have the datasets imported, you can look at the data in the console.

To view the test dataset

  1. Navigate to the two DynamoDB tables. You can do this by getting the URL values from the CloudFormation Outputs tab. Figure 14 shows the URL for the accounts tables.
    Figure 14: Output values for CloudFormation stack DynamoDB account tables

    Figure 14: Output values for CloudFormation stack DynamoDB account tables

    Figure 15 shows the URL for the people tables.

    Figure 15: Output values for CloudFormation stack DynamoDB people tables

    Figure 15: Output values for CloudFormation stack DynamoDB people tables

  2. Copy the AccountsDynamoDBTableURL link value and navigate to it in the browser. Then choose the Items tab.
    Figure 16: View of DynamoDB account-info-macie table data

    Figure 16: View of DynamoDB account-info-macie table data

    You should now see a screen showing data similar to the screen in figure 16. This DynamoDB table stores the test account data that you will use to run a Macie discovery job against after the data has been exported to S3.

  3. Navigate to the PeopleDynamoDBTableURL link that is in the CloudFormation output. Then choose the Items tab.

    Figure 17: View of DynamoDB people table data

    Figure 17: View of DynamoDB people table data

You should now see a screen showing data similar to the screen in figure 17. This DynamoDB table stores the test people data that you will use to run a Macie discovery job against after the data has been exported to S3.

Export DynamoDB data to S3

In the previous section, you set everything up and staged the data to DynamoDB. In this section, you will export data from DynamoDB to S3.

View the EventBridge rule

The EventBridge rule that was deployed earlier allows you to automatically schedule the export of DynamoDB data to S3. You will can export data in hours, in minutes, or in days. The purpose of the EventBridge rule is to allow you to set up an automated data pipeline from DynamoDB to S3. For demonstration purposes, you’ll run the Lambda function that the EventBridge rule uses manually, so that you can see the data be exported to S3 without having to wait.

To view the EventBridge rule

  1. Navigate to the CloudFormation Outputs tab for the CloudFormation stack you deployed earlier.

    Figure 18: CloudFormation output information for the EventBridge rule

    Figure 18: CloudFormation output information for the EventBridge rule

  2. Navigate to the EventBridgeRule link. You should see the following screen.

    Figure 19: EventBridge rule configuration details page

    Figure 19: EventBridge rule configuration details page

On this screen, you can see that we’ve set the event schedule to run every hour. The interval can be changed to fit your business needs. We have set it for 1 hour for demonstration purposes only. To make changes to the interval, you can choose the Edit button to make changes and then save the rule.

In the Target(s) section, we’ve configured a Lambda function named Export-DynamoDB-Data-To-S3 to handle the process of exporting data to the S3 bucket the Macie discovery job will run against. We will cover the Lambda function that handles the export of the data from DynamoDB next.

View the data export Lambda function

In this section, you’ll take a look at the Lambda function that handles the exporting of DynamoDB data to the S3 bucket that Macie will run its discovery job against.

To view the Lambda function

  1. Navigate to the CloudFormation Outputs tab for the CloudFormation stack you deployed earlier.

    Figure 20: CloudFormation output information for the Lambda function that exports DynamoDB data to S3

    Figure 20: CloudFormation output information for the Lambda function that exports DynamoDB data to S3

  2. Copy the link value for LambdaExportDynamoDBDataToS3URL and navigate to the URL in your browser. You should see the Python code that will handle the exporting of data to S3. The code has been commented so that you can easily follow it and refactor it for your needs.
  3. Scroll to the Environment variables section.
    Figure 21: Environment variables used by the Lambda function

    Figure 21: Environment variables used by the Lambda function

    You will see two environment variables:

  • bucket_to_export_to – This environment variable is used by the function as the S3 bucket location to save the DynamoDB data to. This is the bucket that the Macie discovery will run against.
  • dynamo_db_tables – This environment variable is a comma-delimited list of DynamoDB tables that will be read and have data exported to S3. If there was another table that you wanted to export data from, you would simply add it to the comma-delimited list and it would be part of the export.

Export DynamoDB data

In this section, you will manually run the Lambda function to export the DynamoDB tables data to S3. As stated previously, you would normally allow the EventBridge rule to handle the automated export of the data to S3. In order to see the export in action, you’re going to manually run the function.

To run the export Lambda function

  1. In the console, scroll back to the top of the screen and choose the Test button.
  2. Name the test dynamoDBExportTest, and for the test data create an empty JSON object “{}” as shown in figure 22.

    Figure 22: Configuring a test event to manually test the data export Lambda function

    Figure 22: Configuring a test event to manually test the data export Lambda function

  3. Choose Create.
  4. Choose the Test button again to run the Lambda function to export the DynamoDB data to S3.

    Figure 23: View of the screen where you run the Lambda function to export data

    Figure 23: View of the screen where you run the Lambda function to export data

  5. It could take about one minute to export the data from DynamoDB to S3. Once the Lambda function exports the data, you should see a screen similar to the following one.

    Figure 24: The result after you successfully run the data export Lambda function

    Figure 24: The result after you successfully run the data export Lambda function

View the exported DynamoDB data

Now that the DynamoDB data has been exported for Macie to run discovery jobs against, you can navigate to S3 to verify that the files exported to the bucket.

To view the data, navigate to the CloudFormation stack Output tab. Find the ExportS3BucketURL, shown in figure 25, and navigate to the link.

Figure 25: CloudFormation output information for the S3 buckets that the DynamoDB data was exported to

Figure 25: CloudFormation output information for the S3 buckets that the DynamoDB data was exported to

You should then see two different JSON files for the two DynamoDB tables that data was exported from, as shown in figure 26.

Figure 26: View of S3 objects that were exported to S3

Figure 26: View of S3 objects that were exported to S3

This is the file naming convention that’s used for the files:

<Service-name>-<DynamoDB-Table-Name>-<AWS-Region>-<DataAndTime>.json

Next, you’ll create a Macie discovery job to run against the files in this S3 bucket to discover sensitive data.

Create the Macie discovery job

In this section, you’ll create a Macie discovery job and view the results after the job has finished running.

To create the discovery job

  1. In the AWS Management Console, navigate to Macie. In the left-hand menu, choose Jobs.

    Figure 27: Navigation menu to Macie discovery jobs

    Figure 27: Navigation menu to Macie discovery jobs

  2. Choose the Create job button.

    Figure 28: Macie discovery job list screen

    Figure 28: Macie discovery job list screen

  3. Using the Bucket Name filter, search for the S3 bucket that the DynamoDB data was exported to. This can be found in the CloudFormation stack output, as shown in figure 29.

    Figure 29: CloudFormation stack output

    Figure 29: CloudFormation stack output

  4. Select the value you see for ExportS3BucketName, as shown in figure 30.

    Note: The value you see for your bucket name will be slightly different, based on the random characters added to the end of the bucket name generated by CloudFormation.

    Figure 30: Selecting the S3 bucket to include in the Macie discovery job

    Figure 30: Selecting the S3 bucket to include in the Macie discovery job

  5. Once you’ve found the S3 bucket, select the check box next to it, and then choose Next.
  6. On the Review S3 Buckets screen, if you’re satisfied with the selected buckets, choose Next.

Following are some important options when setting up Macie data discovery jobs.

Scheduling
You have the following scheduling options for the data discovery job:

  • Daily
  • Weekly
  • Monthly

Data Sampling
This allows you to randomly sample a percentage of the data that the Macie discovery job will run against.

Object criteria
This enables you to target objects based on certain metadata values. The values are:

  • Tags – Target objects with certain tags.
  • Last modified – Target objects based on when they were last modified.
  • File extensions – Target objects based on file extensions.
  • Object size – Target objects based on the file size.

You can include or exclude objects based on these object criteria filters.

Set the discovery job scope

For demonstration purposes, this will be a one-time discovery job.

To set the discovery job scope

  1. On the Scope page that appears after you create the job, set the following options for the job scope:
    1. Select the One-time job option.
    2. Leave Sampling depth set to 100%, and choose Next.

      Figure 31: Selecting the objects that should be in scope for this discovery job

      Figure 31: Selecting the objects that should be in scope for this discovery job

  2. On the Custom data identifiers screen, select account_number, and then choose Next.With the custom identifier, you can create custom business logic to look for certain patterns in files stored in S3. In this example, the job generates a finding for any file that contains data with the following format:

    Account Number Format: Starts with “XYZ-” followed by 11 numbers

    The logic to create a custom data identifier can be found in the CloudFormation template.

    Figure 32: Custom data identifiers

    Figure 32: Custom data identifiers

  3. Give your discovery job the name dynamodb-macie-discovery-job. For Description, enter Discovery job to detect sensitive data exported from DynamoDB, and choose Next.
    Figure 33: Giving the Macie discovery job a name and description

    Figure 33: Giving the Macie discovery job a name and description

    You will then see the Review and create screen, as shown in figure 34.

    Figure 34: The Macie discovery job review screen

    Figure 34: The Macie discovery job review screen

    Note: Macie must have proper permissions to decrypt objects that are part of the Macie discovery job. The CloudFormation template that you deployed during the initial setup has already deployed an AWS Key Management Service (AWS KMS) key with the proper permissions.

    For this proof of concept you won’t store the results, so you can select the check box next to Override this requirement. If you wanted to store detailed results of the discovery job long term, you would configure a repository for data discovery results. To view detailed steps for setting this up, see Storing and retaining discovery results with Amazon Macie.

Submit the discovery job

Next, you can submit the discovery job. On the Review and create screen, choose the Submit button to start the discovery job. You should see a screen similar to the following.

Figure 35: A Macie discovery job run that is in progress

Figure 35: A Macie discovery job run that is in progress

The amount of data that is being scanned dictates how long the job will take to run. You can choose the Refresh button at the top of the screen to see the updated status of the job. This job, based on the size of the test dataset, will take about seven minutes to complete.

Review the job results

Now that the Macie discovery job has run, you can review the results to see what sensitive data was discovered in the data exported from DynamoDB.

You should see the following screen once the job has successfully run.

Figure 36: View of the completed Macie discovery job

Figure 36: View of the completed Macie discovery job

On the right, you should see another pane with more information related to the discovery job. The pane should look like the following screen.

Figure 37: Summary showing which S3 bucket the discovery job ran against and start and complete time

Figure 37: Summary showing which S3 bucket the discovery job ran against and start and complete time

Note: If you don’t see this pane, choose on the discovery job to have this information displayed.

To review the job results

  1. On the page for the discovery job, in the Show Results list, select Show findings.

    Figure 38: Option to view discovery job findings

    Figure 38: Option to view discovery job findings

  2. The Findings screen appears, as follows.
    Figure 39: Viewing the list of findings generated by the Macie discovery job

    Figure 39: Viewing the list of findings generated by the Macie discovery job

    The discovery job that you ran has two different “High Severity” finding types:

    SensitiveData:S3Object/Personal – The object contains personal information, such as full names or identification numbers.

    SensitiveData:S3Object/Multiple – The object contains more than one type of sensitive data.

    Learn more about Macie findings types.

  3. Choose the SensitiveData:S3Object/Personal finding type, and you will see an information pane appear to the right, as shown in figure 40.Some of the key information that you can find here:

    Severity – What the severity of the finding is: Low, Medium, or High.
    Resource – The S3 bucket where the S3 object exists that caused the finding to be generated.
    Region – The Region where the S3 bucket exists.

    Figure 40: Viewing the severity of the discovery job finding

    Figure 40: Viewing the severity of the discovery job finding

    Since the finding is based on the detection of personal information in the S3 object, you get the number of times and type of personal data that was discovered, as shown in figure 41.

    Figure 41: Viewing the number of social security numbers that were discovered in the finding

    Figure 41: Viewing the number of social security numbers that were discovered in the finding

    Here you can see that 10 names were detected in the data that you exported from the DynamoDB table. Occurrences of name equals 10 line ranges, which tells you that the names were found on 10 different lines in the file. If you choose the 10 line ranges link, you are given the starting line and column in the document where the name was discovered.

    The S3 object that triggered the finding is displayed in the Resource affected section, as shown in figure 42.

    Figure 42: The S3 object that generated the Macie finding

    Figure 42: The S3 object that generated the Macie finding

Now that you know which S3 object contains the sensitive data, you can investigate further to take appropriate action to protect the data.

View the Macie finding details

In this section, you will walk through how to read and download the objects related to the Macie discovery job.

To download and view the S3 object that contains the finding

  1. In the Overview section of the finding details, select the value for the Resource link. You will then be taken to the object in the S3 bucket.

    Figure 43: Viewing the S3 bucket where the object is located that generated the Macie finding

    Figure 43: Viewing the S3 bucket where the object is located that generated the Macie finding

  2. You can then download the S3 object from the S3 bucket to view the file content and further investigate the file content for sensitive data. Select the check box next to the S3 object, and choose the Download button at the top of the screen.Next, we will look at the SensitiveData:S3Object/Multiple finding type that was generated. This finding type lets us know that there are multiple types of potentially sensitive data related to an object stored in S3.
  3. In the left navigation menu, navigate back to the Jobs menu.
  4. Choose the job that you created in the previous steps. In the Show Results list, select Show Findings.
  5. Select the SensitiveData:S3Object/Multiple finding type. An information pane appears to the right. As with the previous finding, you will see the severity, Region, S3 bucket location, and other relevant information about the finding. For this finding, we will focus on the Custom data identifiers and Personal info sections.
    Figure 44: Details about the sensitive data that was discovered by the Macie discovery job

    Figure 44: Details about the sensitive data that was discovered by the Macie discovery job

    Here you can see that the discovery job found 10 names on 10 different lines in the file. Also, you can see that 10 account numbers were discovered on 10 different lines in the file, based on the custom identifier that was included as part of the discovery job.

    This finding demonstrates how you can use the built-in Macie identifiers, such as names, and also include custom business logic based on your organization’s needs by using Macie custom data identifiers.

    To view the data and investigate further, follow the same steps as in the previous finding you investigated.

  6. Navigate to the top of the screen and in the Overview section, locate the Resource.

    Figure 45: Viewing the S3 bucket where the object is located that generated the Macie finding

    Figure 45: Viewing the S3 bucket where the object is located that generated the Macie finding

  7. Choose Resource, which will take you to the S3 object to download. You can now view the contents of the file and investigate further.

You’ve now created a Macie discovery job to scan for sensitive data stored in an S3 bucket that originated in DynamoDB. You can also automate this solution further by using EventBridge rules to detect Macie findings to take actions against those objects with sensitive data.

Solution cleanup

In order to clean up the solution that you just deployed, complete the following steps. Note that you need to do these steps to stop data from being exported from DynamoDB to S3 every 1 hour.

To perform cleanup

  1. Navigate to the S3 buckets used to import and export data. You can find the bucket names in the CloudFormation Outputs tab in the console, as shown in figure 7 and figure 25.
  2. After you’ve navigated to each of the buckets, delete all objects from the bucket.
  3. Navigate to the CloudFormation console, and then delete the CloudFormation stack named macie-blog. After the stack is deleted, the solution will no longer be deployed in your AWS account.

Summary

After deploying the solution, we hope you have a better understanding of how you can use Macie to detect sensitive from other data sources, such as DynamoDB, as outlined in this post. The following are links to resources that you can use to further expand your knowledge of Amazon Macie capabilities and features.

Additional resources

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Sheldon Sides

Sheldon is a Senior Solutions Architect, focused on helping customers implement native AWS security services. He enjoys using his experience as a consultant and running a cloud security startup to help customers build secure AWS Cloud solutions. His interests include working out, software development, and learning about the latest technologies.

Use Macie to discover sensitive data as part of automated data pipelines

Post Syndicated from Brandon Wu original https://aws.amazon.com/blogs/security/use-macie-to-discover-sensitive-data-as-part-of-automated-data-pipelines/

Data is a crucial part of every business and is used for strategic decision making at all levels of an organization. To extract value from their data more quickly, Amazon Web Services (AWS) customers are building automated data pipelines—from data ingestion to transformation and analytics. As part of this process, my customers often ask how to prevent sensitive data, such as personally identifiable information, from being ingested into data lakes when it’s not needed. They highlight that this challenge is compounded when ingesting unstructured data—such as files from process reporting, text files from chat transcripts, and emails. They also mention that identifying sensitive data inadvertently stored in structured data fields—such as in a comment field stored in a database—is also a challenge.

In this post, I show you how to integrate Amazon Macie as part of the data ingestion step in your data pipeline. This solution provides an additional checkpoint that sensitive data has been appropriately redacted or tokenized prior to ingestion. Macie is a fully managed data security and privacy service that uses machine learning and pattern matching to discover sensitive data in AWS.

When Macie discovers sensitive data, the solution notifies an administrator to review the data and decide whether to allow the data pipeline to continue ingesting the objects. If allowed, the objects will be tagged with an Amazon Simple Storage Service (Amazon S3) object tag to identify that sensitive data was found in the object before progressing to the next stage of the pipeline.

This combination of automation and manual review helps reduce the risk that sensitive data—such as personally identifiable information—will be ingested into a data lake. This solution can be extended to fit your use case and workflows. For example, you can define custom data identifiers as part of your scans, add additional validation steps, create Macie suppression rules to archive findings automatically, or only request manual approvals for findings that meet certain criteria (such as high severity findings).

Solution overview

Many of my customers are building serverless data lakes with Amazon S3 as the primary data store. Their data pipelines commonly use different S3 buckets at each stage of the pipeline. I refer to the S3 bucket for the first stage of ingestion as the raw data bucket. A typical pipeline might have separate buckets for raw, curated, and processed data representing different stages as part of their data analytics pipeline.

Typically, customers will perform validation and clean their data before moving it to a raw data zone. This solution adds validation steps to that pipeline after preliminary quality checks and data cleaning is performed, noted in blue (in layer 3) of Figure 1. The layers outlined in the pipeline are:

  1. Ingestion – Brings data into the data lake.
  2. Storage – Provides durable, scalable, and secure components to store the data—typically using S3 buckets.
  3. Processing – Transforms data into a consumable state through data validation, cleanup, normalization, transformation, and enrichment. This processing layer is where the additional validation steps are added to identify instances of sensitive data that haven’t been appropriately redacted or tokenized prior to consumption.
  4. Consumption – Provides tools to gain insights from the data in the data lake.

 

Figure 1: Data pipeline with sensitive data scan

Figure 1: Data pipeline with sensitive data scan

The application runs on a scheduled basis (four times a day, every 6 hours by default) to process data that is added to the raw data S3 bucket. You can customize the application to perform a sensitive data discovery scan during any stage of the pipeline. Because most customers do their extract, transform, and load (ETL) daily, the application scans for sensitive data on a scheduled basis before any crawler jobs run to catalog the data and after typical validation and data redaction or tokenization processes complete.

You can expect that this additional validation will add 5–10 minutes to your pipeline execution at a minimum. The validation processing time will scale linearly based on object size, but there is a start-up time per job that is constant.

If sensitive data is found in the objects, an email is sent to the designated administrator requesting an approval decision, which they indicate by selecting the link corresponding to their decision to approve or deny the next step. In most cases, the reviewer will choose to adjust the sensitive data cleanup processes to remove the sensitive data, deny the progression of the files, and re-ingest the files in the pipeline.

Additional considerations for deploying this application for regular use are discussed at the end of the blog post.

Application components

The following resources are created as part of the application:

Note: the application uses various AWS services, and there are costs associated with these resources after the Free Tier usage. See AWS Pricing for details. The primary drivers of the solution cost will be the amount of data ingested through the pipeline, both for Amazon S3 storage and data processed for sensitive data discovery with Macie.

The architecture of the application is shown in Figure 2 and described in the text that follows.
 

Figure 2: Application architecture and logic

Figure 2: Application architecture and logic

Application logic

  1. Objects are uploaded to the raw data S3 bucket as part of the data ingestion process.
  2. A scheduled EventBridge rule runs the sensitive data scan Step Functions workflow.
  3. triggerMacieScan Lambda function moves objects from the raw data S3 bucket to the scan stage S3 bucket.
  4. triggerMacieScan Lambda function creates a Macie sensitive data discovery job on the scan stage S3 bucket.
  5. checkMacieStatus Lambda function checks the status of the Macie sensitive data discovery job.
  6. isMacieStatusCompleteChoice Step Functions Choice state checks whether the Macie sensitive data discovery job is complete.
    1. If yes, the getMacieFindingsCount Lambda function runs.
    2. If no, the Step Functions Wait state waits 60 seconds and then restarts Step 5.
  7. getMacieFindingsCount Lambda function counts all of the findings from the Macie sensitive data discovery job.
  8. isSensitiveDataFound Step Functions Choice state checks whether sensitive data was found in the Macie sensitive data discovery job.
    1. If there was sensitive data discovered, run the triggerManualApproval Lambda function.
    2. If there was no sensitive data discovered, run the moveAllScanStageS3Files Lambda function.
  9. moveAllScanStageS3Files Lambda function moves all of the objects from the scan stage S3 bucket to the scanned data S3 bucket.
  10. triggerManualApproval Lambda function tags and moves objects with sensitive data discovered to the manual review S3 bucket, and moves objects with no sensitive data discovered to the scanned data S3 bucket. The function then sends a notification to the ApprovalRequestNotification Amazon SNS topic as a notification that manual review is required.
  11. Email is sent to the email address that’s subscribed to the ApprovalRequestNotification Amazon SNS topic (from the application deployment template) for the manual review user with the option to Approve or Deny pipeline ingestion for these objects.
  12. Manual review user assesses the objects with sensitive data in the manual review S3 bucket and selects the Approve or Deny links in the email.
  13. The decision request is sent from the Amazon API Gateway to the receiveApprovalDecision Lambda function.
  14. manualApprovalChoice Step Functions Choice state checks the decision from the manual review user.
    1. If denied, run the deleteManualReviewS3Files Lambda function.
    2. If approved, run the moveToScannedDataS3Files Lambda function.
  15. deleteManualReviewS3Files Lambda function deletes the objects from the manual review S3 bucket.
  16. moveToScannedDataS3Files Lambda function moves the objects from the manual review S3 bucket to the scanned data S3 bucket.
  17. The next step of the automated data pipeline will begin with the objects in the scanned data S3 bucket.

Prerequisites

For this application, you need the following prerequisites:

You can use AWS Cloud9 to deploy the application. AWS Cloud9 includes the AWS CLI and AWS SAM CLI to simplify setting up your development environment.

Deploy the application with AWS SAM CLI

You can deploy this application using the AWS SAM CLI. AWS SAM uses AWS CloudFormation as the underlying deployment mechanism. AWS SAM is an open-source framework that you can use to build serverless applications on AWS.

To deploy the application

  1. Initialize the serverless application using the AWS SAM CLI from the GitHub project in the aws-samples repository. This will clone the project locally which includes the source code for the Lambda functions, Step Functions state machine definition file, and the AWS SAM template. On the command line, run the following:
    sam init --location gh: aws-samples/amazonmacie-datapipeline-scan
    

    Alternatively, you can clone the Github project directly.

  2. Deploy your application to your AWS account. On the command line, run the following:
    sam deploy --guided
    

    Complete the prompts during the guided interactive deployment. The first deployment prompt is shown in the following example.

    Configuring SAM deploy
    ======================
    
            Looking for config file [samconfig.toml] :  Found
            Reading default arguments  :  Success
    
            Setting default arguments for 'sam deploy'
            =========================================
            Stack Name [maciepipelinescan]:
    

  3. Settings:
    • Stack Name – Name of the CloudFormation stack to be created.
    • AWS RegionRegion—for example, us-west-2, eu-west-1, ap-southeast-1—to deploy the application to. This application was tested in the us-west-2 and ap-southeast-1 Regions. Before selecting a Region, verify that the services you need are available in those Regions (for example, Macie and Step Functions).
    • Parameter StepFunctionName – Name of the Step Functions state machine to be created—for example, maciepipelinescanstatemachine).
    • Parameter BucketNamePrefix – Prefix to apply to the S3 buckets to be created (S3 bucket names are globally unique, so choosing a random prefix helps ensure uniqueness).
    • Parameter ApprovalEmailDestination – Email address to receive the manual review notification.
    • Parameter EnableMacie – Whether you need Macie enabled in your account or Region. You can select yes or no; select yes if you need Macie to be enabled for you as part of this template, select no, if you already have Macie enabled.
  4. Confirm changes and provide approval for AWS SAM CLI to deploy the resources to your AWS account by responding y to prompts, as shown in the following example. You can accept the defaults for the SAM configuration file and SAM configuration environment prompts.
    #Shows you resources changes to be deployed and require a 'Y' to initiate deploy
    Confirm changes before deploy [y/N]: y
    #SAM needs permission to be able to create roles to connect to the resources in your template
    Allow SAM CLI IAM role creation [Y/n]: y
    ReceiveApprovalDecisionAPI may not have authorization defined, Is this okay? [y/N]: y
    ReceiveApprovalDecisionAPI may not have authorization defined, Is this okay? [y/N]: y
    Save arguments to configuration file [Y/n]: y
    SAM configuration file [samconfig.toml]: 
    SAM configuration environment [default]:
    

    Note: This application deploys an Amazon API Gateway with two REST API resources without authorization defined to receive the decision from the manual review step. You will be prompted to accept each resource without authorization. A token (Step Functions taskToken) is used to authenticate the requests.

  5. This creates an AWS CloudFormation changeset. Once the changeset creation is complete, you must provide a final confirmation of y to Deploy the changeset? [y/N] when prompted as shown in the following example.
    Changeset created successfully. arn:aws:cloudformation:ap-southeast-1:XXXXXXXXXXXX:changeSet/samcli-deploy1605213119/db681961-3635-4305-b1c7-dcc754c7XXXX
    
    
    Previewing CloudFormation changeset before deployment
    ======================================================
    Deploy this changeset? [y/N]:
    

Your application is deployed to your account using AWS CloudFormation. You can track the deployment events in the command prompt or via the AWS CloudFormation console.

After the application deployment is complete, you must confirm the subscription to the Amazon SNS topic. An email will be sent to the email address entered in Step 3 with a link that you need to select to confirm the subscription. This confirmation provides opt-in consent for AWS to send emails to you via the specified Amazon SNS topic. The emails will be notifications of potentially sensitive data that need to be approved. If you don’t see the verification email, be sure to check your spam folder.

Test the application

The application uses an EventBridge scheduled rule to start the sensitive data scan workflow, which runs every 6 hours. You can manually start an execution of the workflow to verify that it’s working. To test the function, you will need a file that contains data that matches your rules for sensitive data. For example, it is easy to create a spreadsheet, document, or text file that contains names, addresses, and numbers formatted like credit card numbers. You can also use this generated sample data to test Macie.

We will test by uploading a file to our S3 bucket via the AWS web console. If you know how to copy objects from the command line, that also works.

Upload test objects to the S3 bucket

  1. Navigate to the Amazon S3 console and upload one or more test objects to the <BucketNamePrefix>-data-pipeline-raw bucket. <BucketNamePrefix> is the prefix you entered when deploying the application in the AWS SAM CLI prompts. You can use any objects as long as they’re a supported file type for Amazon Macie. I suggest uploading multiple objects, some with and some without sensitive data, in order to see how the workflow processes each.

Start the Scan State Machine

  1. Navigate to the Step Functions state machines console. If you don’t see your state machine, make sure you’re connected to the same region that you deployed your application to.
  2. Choose the state machine you created using the AWS SAM CLI as seen in Figure 3. The example state machine is maciepipelinescanstatemachine, but you might have used a different name in your deployment.
     
    Figure 3: AWS Step Functions state machines console

    Figure 3: AWS Step Functions state machines console

  3. Select the Start execution button and copy the value from the Enter an execution name – optional box. Change the Input – optional value replacing <execution id> with the value just copied as follows:
    {
        “id”: “<execution id>”
    }
    

    In my example, the <execution id> is fa985a4f-866b-b58b-d91b-8a47d068aa0c from the Enter an execution name – optional box as shown in Figure 4. You can choose a different ID value if you prefer. This ID is used by the workflow to tag the objects being processed to ensure that only objects that are scanned continue through the pipeline. When the EventBridge scheduled event starts the workflow as scheduled, an ID is included in the input to the Step Functions workflow. Then select Start execution again.
     

    Figure 4: New execution dialog box

    Figure 4: New execution dialog box

  4. You can see the status of your workflow execution in the Graph inspector as shown in Figure 5. In the figure, the workflow is at the pollForCompletionWait step.
     
    Figure 5: AWS Step Functions graph inspector

    Figure 5: AWS Step Functions graph inspector

The sensitive discovery job should run for about five to ten minutes. The jobs scale linearly based on object size, but there is a start-up time per job that is constant. If sensitive data is found in the objects uploaded to the <BucketNamePrefix>-data-pipeline-upload S3 bucket, an email is sent to the address provided during the AWS SAM deployment step, notifying the recipient requesting of the need for an approval decision, which they indicate by selecting the link corresponding to their decision to approve or deny the next step as shown in Figure 6.
 

Figure 6: Sensitive data identified email

Figure 6: Sensitive data identified email

When you receive this notification, you can investigate the findings by reviewing the objects in the <BucketNamePrefix>-data-pipeline-manual-review S3 bucket. Based on your review, you can either apply remediation steps to remove any sensitive data or allow the data to proceed to the next step of the data ingestion pipeline. You should define a standard response process to address discovery of sensitive data in the data pipeline. Common remediation steps include review of the files for sensitive data, deleting the files that you do not want to progress, and updating the ETL process to redact or tokenize sensitive data when re-ingesting into the pipeline. When you re-ingest the files into the pipeline without sensitive data, the files will not be flagged by Macie.

The workflow performs the following:

  • If you select Approve, the files are moved to the <BucketNamePrefix>-data-pipeline-scanned-data S3 bucket with an Amazon S3 SensitiveDataFound object tag with a value of true.
  • If you select Deny, the files are deleted from the <BucketNamePrefix>-data-pipeline-manual-review S3 bucket.
  • If no action is taken, the Step Functions workflow execution times out after five days and the file will automatically be deleted from the <BucketNamePrefix>-data-pipeline-manual-review S3 bucket after 10 days.

Clean up the application

You’ve successfully deployed and tested the sensitive data pipeline scan workflow. To avoid ongoing charges for resources you created, you should delete all associated resources by deleting the CloudFormation stack. In order to delete the CloudFormation stack, you must first delete all objects that are stored in the S3 buckets that you created for the application.

To delete the application

  1. Empty the S3 buckets created in this application (<BucketNamePrefix>-data-pipeline-raw S3 bucket, <BucketNamePrefix>-data-pipeline-scan-stage, <BucketNamePrefix>-data-pipeline-manual-review, and <BucketNamePrefix>-data-pipeline-scanned-data).
  2. Delete the CloudFormation stack used to deploy the application.

Considerations for regular use

Before using this application in a production data pipeline, you will need to stop and consider some practical matters. First, the notification mechanism used when sensitive data is identified in the objects is email. Email doesn’t scale: you should expand this solution to integrate with your ticketing or workflow management system. If you choose to use email, subscribe a mailing list so that the work of reviewing and responding to alerts is shared across a team.

Second, the application is run on a scheduled basis (every 6 hours by default). You should consider starting the application when your preliminary validations have completed and are ready to perform a sensitive data scan on the data as part of your pipeline. You can modify the EventBridge Event Rule to run in response to an Amazon EventBridge event instead of a scheduled basis.

Third, the application currently uses a 60 second Step Functions Wait state when polling for the Macie discovery job completion. In real world scenarios, the discovery scan will take 10 minutes at a minimum, likely several orders of magnitude longer. You should evaluate the typical execution times for your application execution and tune the polling period accordingly. This will help reduce costs related to running Lambda functions and log storage within CloudWatch Logs. The polling period is defined in the Step Functions state machine definition file (macie_pipeline_scan.asl.json) under the pollForCompletionWait state.

Fourth, the application currently doesn’t account for false positives in the sensitive data discovery job results. Also, the application will progress or delete all objects identified based on the decision by the reviewer. You should consider expanding the application to handle false positives through automation rather than manual review / intervention (such as deleting the files from the manual review bucket or removing the sensitive data tags applied).

Last, the solution will stop the ingestion of a subset of objects into your pipeline. This behavior is similar to other validation and data quality checks that most customers perform as part of the data pipeline. However, you should test to ensure that this will not cause unexpected outcomes and address them in your downstream application logic accordingly.

Conclusion

In this post, I showed you how to integrate sensitive data discovery using Macie as an additional validation step in an automated data pipeline. You’ve reviewed the components of the application, deployed it using the AWS SAM CLI, tested to validate that the application functions as expected, and cleaned up by removing deployed resources.

You now know how to integrate sensitive data scanning into your ETL pipeline. You can use automation and—where required—manual review to help reduce the risk of sensitive data, such as personally identifiable information, being inadvertently ingested into a data lake. You can take this application and customize it to fit your use case and workflows, such as using custom data identifiers as part of your scans, adding additional validation steps, creating Macie suppression rules to define cases to archive findings automatically, or only request manual approvals for findings that meet certain criteria (such as high severity findings).

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon Macie forum.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Brandon Wu

Brandon is a security solutions architect helping financial services organizations secure their critical workloads on AWS. In his spare time, he enjoys exploring outdoors and experimenting in the kitchen.

Discover sensitive data by using custom data identifiers with Amazon Macie

Post Syndicated from Kayla Jing original https://aws.amazon.com/blogs/security/discover-sensitive-data-by-using-custom-data-identifiers-with-amazon-macie/

As you put more and more data in the cloud, you need to rely on security automation to keep it secure at scale. AWS recently launched Amazon Macie, a fully managed service that uses machine learning and pattern matching to help you detect, classify, and better protect your sensitive data stored in the AWS Cloud.

Many data breaches are not the result of malicious activity from unauthorized users, but rather from mistakes made by authorized users. To monitor and manage the security of sensitive data, you must first be able to identify it. In this post, we show you how to use custom data identifiers with Macie to identify sensitive data. Once you know what’s sensitive, you can start designing security controls that operate at scale to monitor and remediate risk automatically.

Macie comes with a set of managed data identifiers that you can use to discover many types of sensitive data. These are somewhat generic and broadly applicable to many organizations. What makes Macie unique is its ability to help you address specific data needs. Macie enables you to expand your sensitive data detection through the new custom data identifiers. Custom data identifiers can be used to highlight organizational proprietary data, intellectual property, and specific scenarios.

Custom Data Identifiers in Macie help you find and identify sensitive data based on your own organization’s specific needs. In this post, we show you a step-by-step walkthrough of how to define and run custom data identifiers to automatically discover specific, sensitive data. Before you begin using Custom Data Identifiers, you need to enable Macie and configure detailed logging. Follow these instructions to enable Macie and follow these instructions to configure detailed logging, if you haven’t done that already.

When to use the Custom Data Identifier resource

To begin, imagine you’re an IT administrator for a manufacturing company that’s headquartered in France. Your company has acquired a few additional local subsidiaries, including an R&D facility in São Paulo, Brazil. The company is migrating to AWS, and in the process is classifying registration information, employee information, and product data into encrypted and non-encrypted storage.

You want to identify sensitive data for the following three scenarios:

  • SIRET-NIC: SIRET-NIC is a unique number assigned to businesses in France. This number is issued by their National Institute of Statistics (INSEE) when a business is registered. A sample file that contains SIRET-NIC information is shown in the following figure. Each record in the file includes the GUID, employee name, employee email, the company name, the date it was issued, and the SIRET-NIC number.

    Figure 1: SIRET-NIC dataset

    Figure 1: SIRET-NIC dataset

  • Brazil CPF (Cadastro de Pessoas Físicas – Natural Persons Register): CPF is a unique number assigned by the Brazilian revenue agency to people subject to taxes in the country. Each of your employees residing in the Brazilian office has a CPF.
  • Prototyping naming convention: Your company has products that are publicly available, but also products that are still in the prototyping stage and should be kept confidential. A sample file that contains Brazil CPF numbers and the prototype names is shown in the following figure.

    Figure 2: Brazil CPF and prototype number dataset

    Figure 2: Brazil CPF and prototype number dataset

Configure the Custom Data Identifier resource in the Macie console

To use custom data identifiers to identify your organization’s sensitive information, you must:

  1. Create custom data identifiers.
  2. Create a job to scan your Amazon Simple Storage Service (Amazon S3) bucket to locate the data patterns that match your custom data identifiers.
  3. Respond to the returned results.

The following steps introduce you to the Custom Data Identifier resource in Macie.

Designing Custom Data Identifiers for use with Amazon Macie

In the previous section you discovered 3 scenarios that your company will like to protect SIRET-NIC, Brazil CPF, and your prototyping naming convention. You now need to first create a specific REGEX pattern for each of these scenarios. There are different syntaxes and dialects of regular expression languages. Amazon Macie supports a subset of the Perl Compatible Regular Expressions (PCRE) library, and you can learn more about it in Regex support in custom data identifiers section. Once the patterns are ready, follow the instructions below to create the custom data identifiers.

Creating Custom Data Identifiers in Amazon Macie

  1. Sign in to the AWS Management Console.
  2. Enter Amazon Macie in the AWS services search box.
  3. Choose Amazon Macie.
  4. In the navigation pane on the left-hand side, under Settings, choose Custom data identifiers as shown in the following figure.

    Figure 3: Custom data identifiers console

    Figure 3: Custom data identifiers console

Create a custom data identifier

  1. Choose Create on the custom data identifier console.
  2. Name: Enter a name for your custom data identifier. Make it descriptive so you know what it does. For example, enter SIRET-NIC for the SIRET-NIC number you use.
  3. Description: Enter a description of the custom data identifier.
  4. Regular expression (regex): Define the pattern you want to identify. Use a Regular Expression (“regex”) to create the desired pattern. For example, a SIRET-NIC number contains 14 digits—9 numbers followed by a hyphen and then 5 more numbers. The first part, 9 numbers, can stay together or separated by spaces into 3 groups of 3. The specific regex pattern for this is \b(\d{3}\s?){2}\d{3}\-\d{5}\b
  5. Keywords: Define expressions that identify the text to match. The SIRET-NIC number itself is publicly accessible information. But in your case, you want to encrypt the information about the company that was registered during the month the acquisition happened (April 2020), thus the information will not leak to your competitors. So, the keywords here will be all the days in April.
  6. (Optional) Ignore words: Use this box to enter text that you want to be ignored. In this example scenario, you know your security training materials always use an example SIRET-NICs of 12345789-12345 and 000000000-00000. You can enter these values here, so that your security training materials are not flagged as sensitive data containing SIRET-NICs.
  7. Maximum match distance: Use this box to define the proximity between the result and the keywords. If you enter 20, Macie will provide results that include the specified keyword and 20 characters on either side of it.

Note: Do not select Submit yet. After entering the settings and before selecting Submit, you should test your custom data identifier with sample data to confirm that it works.

With all the attributes set, your console will look like what is shown in Figure 4.

Figure 4: SIRET-NIC custom data identifier creation

Figure 4: SIRET-NIC custom data identifier creation

Test your SIRET-NIC custom data identifier

Use the Evaluate section on the right-hand panel of the Macie console to confirm that the regex pattern and other configurations for your custom data identifier are correct.

Follow the steps below to use the Evaluate section.

  1. Enter test data in the sample data box.
  2. Select Submit. There will be one match per record in the file if the configurations are correct and your custom data identifier is ready.The following figure is an example of the Evaluate section using test data. The test data has 3 records, each record has 5 fields which are GUID, employee name, employee email, company name, date SIRET-NIC was issued, and the SIRET-NIC number.

    Figure 5: Evaluate, showing sample data

    Figure 5: Evaluate, showing sample data

  3. After verifying your SIRET-NIC custom data identifier works in the Evaluate section, now select Submit on the New custom data identifier window to create the custom data identifier.

Create a Brazil CPF Custom Data Identifier

Congrats on creating your first custom data identifier! Now use the same steps to create and test custom data identifiers for the Brazil CPF and prototyping naming convention scenarios. The Brazil CPF number usually shows up in the format of 000.000.000-00.

Use the following values for the Brazil CPF scenario, as shown in the following figure:

  • Name: Brazil CPF
  • Description: The format for Brazil CPF in our sample data is 000.000.000-00
  • Regular expression: \b(\d{3}\.){2}\d{3}\-\d{2}\b

    Figure 6: Brazil CPF custom data identifier

    Figure 6: Brazil CPF custom data identifier

Create a Prototype Name Custom Data Identifier

Assume that your company has a very strict and regular naming scheme for prototype part numbers. It is P, followed by a hyphen, and then 2 letters and 4 digits. E.g., P-AB1234. You want to identify objects in S3 that contain references to private prototype parts. This is a small pattern, and so if we’re not careful it will cause Macie to flag objects that do not actually contain one of our prototype numbers. We suggest adding \b at the beginning and the end of the regular expression. The \b symbol means a “word boundary” and word boundaries are basically whitespace, punctuation, or other things that are not letters and numbers. With \b, you limit the pattern so that you only match if the entire word matches the pattern. For example, P-AB1234 will match the pattern, but STEP-AB123456 and P-XY123 will not match the pattern. This gives you finer grained control and reduces false positives.

Use the following values for the prototyping name scenario, as shown in the following figure:

  • Name: Prototyping Naming
  • Description: Any prototype name start with P means it’s private. The format for private prototype name is P-2 capital letters and 4 numbers
  • Regular expression: \bP\-[A-Z]{2}\d{4}\b
Figure 7: Prototyping naming custom data identifier

Figure 7: Prototyping naming custom data identifier

You should now see a page like the following figure, indicating that the SIRET-NIC, Brazil CPF, and Prototyping Naming custom data identifiers are successfully configured.

Figure 8: Successfully configured custom data identifier

Figure 8: Successfully configured custom data identifier

Set up a Test Bucket to Demonstrate Macie

Before we can see Macie do its work, we have to create a bucket with some test data that we can scan. We’ve provided some sample data files that you can download. Follow these instructions to create a test bucket and load our test data into the test bucket.

  1. Download the sample data and unzip it.
  2. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.
  3. Choose Create bucket. The Create bucket wizard opens.
  4. In Bucket name, enter a DNS-compliant name for your bucket. The bucket name must:
    • Be unique across all of Amazon S3.
    • Be between 3 and 63 characters long.
    • Not contain uppercase characters.
    • Start with a lowercase letter or number.

    We created a bucket called bucketformacieuse; you have to choose another name because this one is already taken by us.

  5. In Region, choose the AWS Region where you want the bucket to reside.
  6. Select Create, to finish the bucket creation.
  7. Open the bucket you just created and upload the two Excel files you downloaded in step 1.

Use Macie to create a job to scan your data

Now you can create a job to scan your Amazon S3 bucket to detect and locate the data patterns defined in the SIRET-NIC, Brazil CPF, and Prototyping Naming custom data identifiers.

To create a job

  1. In the navigation pane, choose Jobs, and then select Create Job on the upper right.
  2. Select Amazon S3 buckets: Select the S3 bucket you want to analyze. In this case, we are using the bucket previously created, bucketformacieuse.
  3. Review Amazon S3 buckets: Verify that you selected the S3 bucket you want the job to scan and analyze.
  4. Scope: Select your scope. For this example, choose the One-time job option as your scope. The scope specifies how often you want the job to run. This can be either a one-time job or a scheduled job. If you choose a scheduled job, you can define how often you want your job to scan your Amazon S3 bucket.
  5. Custom data identifiers: Select the 3 custom data identifiers you created to be associated with this job, and then select Next. This is shown in the following figure.

    Figure 9: Select your custom data identifiers

    Figure 9: Select your custom data identifiers

  6. Name and description: Enter a name and description for the job.
  7. Review and create: Review and verify all your settings, and then select Create.

You now have a job in Macie to scan the Amazon S3 buckets you’ve chosen using the 3 custom data identifiers you created. More information about creating jobs is available in Running sensitive data discovery jobs in Amazon Macie.

Respond to results

Macie will help you be secure when you’re effectively responding to the findings that it produces. For our example, we’ll show you how to review your findings manually. You can look at your findings by bucket, type, or job, or see a collective summary of all findings. In this example, let’s look at all findings.

To review your results

  1. In the navigation pane on the left-hand side, choose Findings. Findings include the severity, the type, the resources affected, and when the findings were last updated.
  2. The following figure shows an example of the results you might see on the findings page. There are two findings for the selected job. The compagnie_français.csv and the empresa_brasileira.csv files contain the custom data identifiers that you created earlier and added to the job.

    Figure 10: Findings

    Figure 10: Findings

  3. Let’s look at the details of one of the findings so you can review the results. From the page showing the 4r findings, select the file that contains your custom data identifier for the Brazil CPF: empresa_brasileira.csv. The number of custom data identifiers found in the document is shown in the Result section on the right, as shown in the following figure.

    Figure 11: Findings detail page for the Brazil CPF custom data identifiers

    Figure 11: Findings detail page for the Brazil CPF custom data identifiers

  4. Now look at the findings details for the compagnie_français.csv file. It shows the number of custom data identifiers found in the file. In this case Macie found 13 SIRET-NIC numbers as shown in the following figure.

    Figure 12: Findings page for the French company file

    Figure 12: Findings page for the French company file

  5. If you configured detailed logging, the results will be saved in the Amazon S3 bucket you specified. The S3 bucket location can be found in the Details section after Detailed result location as shown in the preceding figure.

Now that you’ve used Macie and the Custom Data Identifiers resource to obtain these findings, you can identify what data to place in encrypted storage, and what can be placed in non-encrypted storage when migrating to AWS. Macie and custom data identifiers provide an automated tool to help you enhance protection of your sensitive data by providing you the information to help detect and classify your data in the AWS Cloud.

Using Macie at Scale

Custom Data Identifiers help you tell Macie what to look for. As you move more and more data to the cloud, you’ll need to make new identifiers and new rules. As your rules and identifiers grow you will need to create automation that responds to things that are found. For example, perhaps a lambda function turns on encryption in a bucket when it finds sensitive data in that bucket. Or perhaps a function automatically applies tags to buckets where sensitive data is found, and those buckets and their owners start to appear on reports for audit and compliance. Once you’ve done this at small scale, think about how you will automate responses at larger scale.

Conclusion

The new Custom Data Identifier resource in the newly enhanced Macie can help you detect, classify, and protect sensitive data types unique to your organization. This post focused on the functionality and use of custom data identifiers to automatically discover sensitive data stored in Amazon S3. You can also review the managed data identifiers to see a list of personally identifiable information (PII) that Macie can detect by default. Visit What is Amazon Macie? to learn more.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon Macie forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Kayla Jing

Kayla is a Solutions Architect at Amazon Web Services based out of Seattle. She has experience in data science with a focus on Data Analytics and Machine Learning.

Author

Joshua Choung

Joshua is a Solutions Architect based out of Seattle. He works with customers to provide architectural and technical guidance and training on their AWS cloud journey.

Author

Laura Reith

Laura is a Solutions Architect at Amazon Web Services. Before AWS, she worked as a Solutions Architect in Taiwan focusing on physical security and retail analytics.