Tag Archives: Amazon Macie

Automate the archival and deletion of sensitive data using Amazon Macie

Post Syndicated from Subhro Bose original https://aws.amazon.com/blogs/big-data/automate-the-archival-and-deletion-of-sensitive-data-using-amazon-macie/

Customers are looking for ways to securely and cost-efficiently archive and delete large volumes of sensitive data in their data lakes while complying with regulations and data protection and privacy laws such as GDPR, POPIA, and LGPD. This post describes a way to automatically identify sensitive data stored in your data lake on AWS, tag the data according to its sensitivity level, and apply appropriate lifecycle policies in a secure and cost-effective way.

Amazon Macie is a managed data security and data privacy service that uses machine learning (ML) and pattern matching to discover and protect your sensitive data stored in Amazon Simple Storage Service (Amazon S3). In this post, we show you how to develop a solution using Macie, Amazon Kinesis Data Firehose, Amazon S3, Amazon EventBridge, and AWS Lambda to identify sensitive data across a large number of S3 buckets, tag the objects that contain it, and apply lifecycle policies for transition and deletion.

Solution overview

The following diagram illustrates the architecture of our solution.

The flow of the solution is as follows:

  1. An Amazon S3 bucket contains multiple objects with different sensitivities of data.
  2. A Macie job analyses the S3 bucket to identify the different sensitivities.
    1. An EventBridge rule is triggered for each finding that Macie generates from the job.
    2. The rule copies the results created by the Macie job to a Kinesis Data Firehose delivery stream.
    3. The delivery stream copies the results to an S3 bucket as a JSON file for future audit requirements.
  3. The arrival of the results triggers a Lambda function that parses the sensitivity metadata from the JSON file.
  4. The function tags the objects in the bucket mentioned in Step 1 and creates an S3 Lifecycle policy based on the sensitivity level of each object, overwriting any existing S3 Lifecycle policy.
  5. The S3 Lifecycle policy moves data to different classes and deletes data based on the configured rules. For example, we implement the following rules:
    1. Archive objects with high sensitivity, tagged as High, after 700 days.
    2. Delete objects tagged as High after 3,000 days.
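
The two sample rules above can be sketched as an S3 Lifecycle configuration that filters on the object tag. This is a minimal sketch, not the Lambda function from the post: the helper name, the rule ID, and the GLACIER storage class are assumptions (the post only says "archive"), while the dictionary follows the shape accepted by the S3 `put_bucket_lifecycle_configuration` API.

```python
def build_lifecycle_configuration(
    tag_key="Sensitivity",
    tag_value="High",
    archive_after_days=700,
    delete_after_days=3000,
    storage_class="GLACIER",  # assumption: the post does not name the archive class
):
    """Build a lifecycle configuration that only affects objects with one tag."""
    if delete_after_days <= archive_after_days:
        raise ValueError("objects must expire after they are archived")
    return {
        "Rules": [
            {
                "ID": f"archive-and-expire-{tag_value.lower()}",  # hypothetical rule ID
                "Status": "Enabled",
                # Only objects tagged Sensitivity=High are matched by this rule.
                "Filter": {"Tag": {"Key": tag_key, "Value": tag_value}},
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": storage_class}
                ],
                "Expiration": {"Days": delete_after_days},
            }
        ]
    }


lifecycle_configuration = build_lifecycle_configuration()
print(lifecycle_configuration["Rules"][0]["ID"])

# With boto3 installed and credentials configured, the configuration could be
# applied roughly like this (bucket name is a placeholder):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="<subject-bucket>", LifecycleConfiguration=lifecycle_configuration)
```

Note that `put_bucket_lifecycle_configuration` replaces the bucket's entire lifecycle configuration, so any rules you want to keep need to be included in the same call.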

Create resources with AWS CloudFormation

We provide an AWS CloudFormation template to create the following resources:

  • An S3 bucket named archival-blog-<account_number>-<region_name> as a sample subject bucket as described above.
  • An S3 bucket named archival-blog-results-<account_number>-<region_name> to store the results generated by the Macie job.
  • A Firehose delivery stream to send the results of the Macie job to the S3 bucket.
  • An EventBridge rule that matches the incoming event of a result generated by the Macie job and routes the result to the Firehose delivery stream.
  • A Lambda function to apply tags and S3 Lifecycle policies on the data objects of the subject S3 bucket based on the result generated by the Macie job.
  • AWS Identity and Access Management (IAM) roles and policies with appropriate permissions.
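
To make the EventBridge rule in this list more concrete, the following sketch shows an event pattern that matches Macie findings, together with a deliberately simplified matcher for trying it out locally. The pattern uses the `aws.macie` source and the `Macie Finding` detail type that Macie publishes; the `matches` helper and the sample events are illustrative assumptions and cover only exact matches on top-level fields, not the full EventBridge pattern language.

```python
import json

# Event pattern for the rule: match every finding event that Macie publishes.
MACIE_FINDING_PATTERN = {
    "source": ["aws.macie"],
    "detail-type": ["Macie Finding"],
}


def matches(pattern, event):
    """Simplified matcher: every pattern key must list the event's value."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


# Hypothetical, heavily trimmed events for illustration only.
finding_event = {
    "source": "aws.macie",
    "detail-type": "Macie Finding",
    "detail": {"type": "SensitiveData:S3Object/Personal"},
}
other_event = {"source": "aws.s3", "detail-type": "Object Created"}

print(json.dumps(MACIE_FINDING_PATTERN))
print(matches(MACIE_FINDING_PATTERN, finding_event))
print(matches(MACIE_FINDING_PATTERN, other_event))
```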

Launch the following stack, providing your stack name:

After the CloudFormation stack is deployed, copy sample_pii.xlsx and recipe.xlsx into archival-blog-<account_number>-<region_name> as sample data for Macie to scan for sensitive data.

Next, we scan the subject S3 bucket for sensitive data to tag and attach the appropriate lifecycle policies.

Configure a Macie job

Macie uses ML and pattern matching to discover and protect your sensitive data in AWS. To configure a Macie job, complete the following steps:

  1. On the Macie console, create a new job by choosing Create Job.
  2. Select the bucket that you want to analyze and choose Next.
  3. Select a schedule if you want to run the job on a schedule, or One-time job if you want to run the job one time. For this post, we select One-time job. You can choose Scheduled job for periodic jobs in production, but that is out of the scope of this post.
  4. Choose Next.
  5. Enter a name for the job and choose Next.
  6. Review the job details and confirm they’re correct before choosing Submit.

The job starts immediately after you submit it.
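
If you prefer to script the console steps above, a one-time job can also be created through the Macie API. The sketch below only builds the request for the `create_classification_job` operation; the helper name, account ID, bucket name, and job name are placeholders, and the actual call is left as a comment because it needs boto3 and AWS credentials.

```python
import uuid


def build_one_time_job_request(account_id, bucket_name, job_name):
    """Build the arguments for a one-time Macie classification job on one bucket."""
    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token for the request
        "jobType": "ONE_TIME",
        "name": job_name,
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": [bucket_name]}
            ]
        },
    }


# Placeholder values for illustration only.
job_request = build_one_time_job_request(
    account_id="111122223333",
    bucket_name="archival-blog-111122223333-us-east-1",
    job_name="archival-blog-one-time-scan",
)
print(job_request["jobType"], job_request["s3JobDefinition"]["bucketDefinitions"])

# With boto3 installed and credentials configured:
#   boto3.client("macie2").create_classification_job(**job_request)
```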

Review the results

Whenever the Macie job runs, it generates the following results.

Secondly, the Lambda function tags each sensitive object with Sensitivity: High.

Thirdly, the function creates a corresponding S3 Lifecycle policy.
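
The tagging step can be sketched as follows. This is not the Lambda function shipped with the CloudFormation template; it is a minimal illustration that assumes the results file contains EventBridge events written back to back by the delivery stream, and that a finding's severity description decides the tag. The field names under `detail` come from the Macie finding format, while the helper names, the bucket name, and the sample severities are hypothetical.

```python
import json


def iter_json_objects(text):
    """Yield JSON objects from a string that holds several objects back to back."""
    decoder = json.JSONDecoder()
    position = 0
    while position < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while position < len(text) and text[position].isspace():
            position += 1
        if position >= len(text):
            break
        obj, position = decoder.raw_decode(text, position)
        yield obj


def tagging_requests(results_text, sensitive_label="High"):
    """Yield put_object_tagging arguments for objects with a matching finding."""
    for event in iter_json_objects(results_text):
        detail = event.get("detail", {})
        resources = detail.get("resourcesAffected", {})
        bucket = resources.get("s3Bucket", {}).get("name")
        key = resources.get("s3Object", {}).get("key")
        severity = detail.get("severity", {}).get("description")
        if bucket and key and severity == sensitive_label:
            yield {
                "Bucket": bucket,
                "Key": key,
                "Tagging": {"TagSet": [{"Key": "Sensitivity", "Value": severity}]},
            }


def _sample_event(key, severity):
    # Hypothetical, heavily trimmed finding event for illustration only.
    return {
        "detail": {
            "severity": {"description": severity},
            "resourcesAffected": {
                "s3Bucket": {"name": "subject-bucket"},
                "s3Object": {"key": key},
            },
        }
    }


sample_results = json.dumps(_sample_event("sample_pii.xlsx", "High")) + json.dumps(
    _sample_event("recipe.xlsx", "Low")
)
requests_to_send = list(tagging_requests(sample_results))
print(requests_to_send)

# With boto3 available, each entry could be passed to
#   boto3.client("s3").put_object_tagging(**entry)
```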

Clean up

When you’re done with this solution, delete the sample data from the subject S3 bucket, delete all the data objects that are stored as Macie results in the S3 bucket for this post, and then delete the CloudFormation stack to remove all the service resources used in the solution.

Conclusion

In this post, we highlighted how you can use Macie to scan all your data stored on Amazon S3 and how to store the resulting findings in Amazon S3 using EventBridge and Kinesis Data Firehose. We also explored how you can use Lambda to tag the relevant objects in your subject S3 bucket using the Macie job’s findings. An S3 Lifecycle transition policy moves tagged objects across different storage classes to help save cost, and an S3 Lifecycle expiration policy deletes objects that have reached the end of their lifecycle.

According to privacy laws like GDPR and POPIA, personal data should be retained only for as long as it is needed for legal purposes or for processing. In this post, we provide a mechanism to archive sensitive data that you might not want to delete immediately, reduce the related storage costs, and delete the data after a certain period when it is no longer required. The data archival and deletion periods used above are sample numbers that can be customised within the Lambda function based on your requirements. Additionally, explore Macie’s different types of findings. You can use these findings to build capabilities such as sending notifications or searching AWS CloudTrail S3 object logs to see who is accessing specific objects.

If you have any questions, comments, or concerns, please reach out to AWS Support. If you have feedback about this post, submit it in the comments section.


About the Authors

Subhro Bose is a Data Architect in Emergent Technologies and Intelligence Platform in Amazon. He loves working on ways for emergent technologies such as AI/ML, big data, quantum, and more to help businesses across different industry verticals succeed within their innovation journey.

Akshay Chandiramani is a Data Analytics Consultant for AWS Professional Services. He is passionate about solving complex big data and MLOps problems for customers to create tangible business outcomes. In his spare time, he enjoys playing snooker and table tennis, and capturing trends on the stock market.

Ikenna Uzoh is a Senior Manager, Business Intelligence (BI) and Analytics at AWS. Before his current role, he was a Senior Big Data and Analytics Architect with AWS Professional Services. He is passionate about building data platforms (data lakes, BI, machine learning) that help organizations achieve their desired outcomes.

Creating a notification workflow from sensitive data discover with Amazon Macie, Amazon EventBridge, AWS Lambda, and Slack

Post Syndicated from Bruno Silveira original https://aws.amazon.com/blogs/security/creating-a-notification-workflow-from-sensitive-data-discover-with-amazon-macie-amazon-eventbridge-aws-lambda-and-slack/

Following the example of the EU in implementing the General Data Protection Regulation (GDPR), many countries are implementing similar data protection laws. In response, many companies are forming teams that are responsible for data protection. Considering the volume of information that companies maintain, it’s essential that these teams are alerted when sensitive data is at risk.

This post shows how to deploy a solution that uses Amazon Macie to discover sensitive data. The solution uses Amazon EventBridge and AWS Lambda to automatically notify your company’s designated data protection team via a Slack channel when sensitive data that needs to be protected is discovered.

The challenge

Let’s imagine that you’re part of a team that’s responsible for classifying your organization’s data but the data structure isn’t documented. Amazon Macie provides you the ability to run a scheduled classification job that examines your data, and you want to notify the data protection team when there’s new sensitive data to classify. Let’s build a solution to automatically notify the data protection team.

Solution overview

To be scalable and cost-effective, this solution uses serverless technologies and managed AWS services, including:

  • Macie – A fully managed data security and data privacy service that uses machine learning and pattern matching to discover and protect your sensitive data in Amazon Web Services (AWS).
  • EventBridge – A serverless event bus that connects application data from your apps, SaaS, and AWS services. EventBridge can respond to specific events or run according to a schedule. The solution presented in this post uses EventBridge to initiate a custom Lambda function in response to a specific event.
  • Lambda – Runs code in response to events such as changes in data, changes in application state, or user actions. In this solution, a Lambda function is initiated by EventBridge.

Solution architecture

The architecture workflow is shown in Figure 1 and includes the following steps:

  1. Macie runs a classification job and publishes its findings to EventBridge as a JSON object.
  2. The EventBridge rule captures the findings and invokes a Lambda function as a target.
  3. The Lambda function parses the JSON object. The function then sends a custom message to a Slack channel with the sensitive data finding for the data protection team to evaluate and respond to.
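
The last step of this workflow can be sketched as a small Lambda handler that turns a finding event into a Slack incoming-webhook request. This is a simplified sketch rather than the function deployed by the template: the message layout, the `SLACK_WEBHOOK_URL` environment variable name, and the sample event are assumptions, while the fields read from `detail` follow the Macie finding format.

```python
import json
import os
import urllib.request


def build_slack_request(webhook_url, event):
    """Build (but do not send) the webhook request for one Macie finding event."""
    detail = event["detail"]
    resources = detail.get("resourcesAffected", {})
    bucket = resources.get("s3Bucket", {}).get("name", "unknown")
    key = resources.get("s3Object", {}).get("key", "unknown")
    text = (
        f"*Macie finding:* {detail.get('type', 'unknown')}\n"
        f"*Severity:* {detail.get('severity', {}).get('description', 'unknown')}\n"
        f"*Object:* s3://{bucket}/{key}"
    )
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def lambda_handler(event, context):
    # SLACK_WEBHOOK_URL is a hypothetical environment variable name.
    request = build_slack_request(os.environ["SLACK_WEBHOOK_URL"], event)
    with urllib.request.urlopen(request, timeout=10) as response:
        return {"statusCode": response.status}


# Hypothetical, heavily trimmed event used to preview the message locally.
sample_event = {
    "detail": {
        "type": "SensitiveData:S3Object/Personal",
        "severity": {"description": "Medium"},
        "resourcesAffected": {
            "s3Bucket": {"name": "example-bucket"},
            "s3Object": {"key": "reports/customers.csv"},
        },
    }
}
preview = build_slack_request("https://hooks.slack.com/services/EXAMPLE", sample_event)
print(json.loads(preview.data)["text"])
```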

 

Figure 1: Solution architecture workflow

Set up Slack

For this solution, you need a Slack workspace and an incoming webhook. The workspace must be in place before you create the webhook.

Create a Slack workspace

If you already have a Slack workspace in your environment, you can skip ahead to creating the webhook.

If you don’t have a Slack workspace, follow the steps in Create a Slack Workspace to create one.

Create an incoming webhook in Slack API

  1. Go to your Slack API.
  2. Choose Start Building to create an app.
  3. Enter the following details for your app:
    • App Name – macie-to-slack.
    • Development Slack Workspace – Choose the Slack workspace—either an existing workspace or one you created for this solution—to receive the Macie findings.
  4. Choose the Create App button.
  5. In the left menu, choose Incoming Webhooks.
  6. At the Activate Incoming Webhooks screen, move the slider from OFF to ON.
  7. Scroll down and choose Add New Webhook to Workspace.
  8. In the screen asking where your app should post, enter the name of the Slack channel from your workspace that you want to send notifications to and choose Authorize.
  9. On the next screen, scroll down to the Webhook URL section. Make a note of the URL to use later.

Deploy the CloudFormation template with the solution

The deployment of the CloudFormation template automatically creates the following resources:

  • A Lambda function whose name begins with macie-to-slack-lambdafindingsToSlack-.
  • An EventBridge rule named MacieFindingsToSlack.
  • An IAM role named MacieFindingsToSlackkRole.
  • A permission to invoke the Lambda function named LambdaInvokePermission.

Note: Before you proceed, make sure you’re deploying the template to the same Region where your production Macie is running.

To deploy the CloudFormation template

  1. Download the YAML template to your computer.

    Note: To save the template, you can right click the Raw button at the top of the code and then select Save link as if you’re using Chrome, or the equivalent in your browser. This file is used in Step 4.

  2. Open CloudFormation in the AWS Management Console.
  3. On the Welcome page, choose Create stack and then choose With new resources.
  4. On Step 1 — Specify template, choose Upload a template file, select Choose file and then select the file template.yaml (the file extension might be .YML), then choose Next.
  5. On Step 2 — Specify stack details:
    1. Enter macie-to-slack as the Stack name.
    2. At the Slack Incoming Web Hook URL, paste the webhook URL you copied earlier.
    3. At Slack channel, enter the name of the channel in your workspace that will receive the alerts and choose Next.
    Figure 2: Defining stack details

  6. On Step 3 – Configure Stack options, you can leave the default settings, or change them for your environment. Choose Next to continue.
  7. At the bottom of Step 4 – Review, select I acknowledge that AWS CloudFormation might create IAM resources, and choose Create stack.

    Figure 3: Confirmation before stack creation

  8. Wait for the stack to reach status CREATE_COMPLETE.

Running the solution

At this point, you’ve deployed the solution and your resources are created.

To test the solution, you can schedule a Macie job targeting a bucket that contains a file with sensitive information that Macie can detect.

Note: You can check the Amazon Macie documentation to see the list of supported managed data identifiers.

When the Macie job is complete, any findings are sent to the Slack channel.

Figure 4: Macie finding delivered to Slack channel

Select the link in the message sent to the Slack channel to open that finding in the Macie console, as shown in Figure 5.

Figure 5: Finding details

And you’re done!

Now your Macie finding results are delivered to your Slack channel where they can be easily monitored, reducing response time and risk exposure.

If you deployed this for testing purposes, or want to clean this up and move to your production account, you can delete the CloudFormation stack:

  1. Open the CloudFormation console.
  2. Select the stack and choose Delete.

Conclusion

In this blog post we walked through the steps to configure a notification workflow using Macie, Lambda, and EventBridge to send sensitive data findings to your data protection team via a Slack channel.

Your data protection team will appreciate the timely notifications of sensitive data findings, giving you the ability to focus on creating controls to improve data security and compliance with regulations related to protection and treatment of personal data.

For more information about data privacy on AWS, see Data Privacy FAQ.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Bruno Silveira

Bruno is a Solutions Architect Manager in the Public Sector team with focus on educational institutions in Brazil. His previous career was in government, financial services, utilities, and nonprofit institutions. Bruno is an enthusiast of cloud security and an appreciator of good rock’n roll with a good beer.

Author

Julio Carvalho

Julio is a Principal Security Solutions Architect at AWS for the Latin American financial market. As a security specialist, he helps customers solve protection and compliance challenges on their cloud journey.

Deploy an automated ChatOps solution for remediating Amazon Macie findings

Post Syndicated from Nick Cuneo original https://aws.amazon.com/blogs/security/deploy-an-automated-chatops-solution-for-remediating-amazon-macie-findings/

The amount of data being collected, stored, and processed by Amazon Web Services (AWS) customers is growing at an exponential rate. In order to keep pace with this growth, customers are turning to scalable cloud storage services like Amazon Simple Storage Service (Amazon S3) to build data lakes at the petabyte scale. Customers are looking for new, automated, and scalable ways to address their data security and compliance requirements, including the need to identify and protect their sensitive data. Amazon Macie helps customers address this need by offering a managed data security and data privacy service that uses machine learning and pattern matching to discover and protect your sensitive data that is stored in Amazon S3.

In this blog post, I show you how to deploy a solution that establishes an automated event-driven workflow for notification and remediation of sensitive data findings from Macie. Administrators can review and approve remediation of findings through a ChatOps-style integration with Slack. Slack is a business communication tool that provides messaging functionality, including persistent chat rooms known as channels. With this solution, you can streamline the notification, investigation, and remediation of sensitive data findings in your AWS environment.

Prerequisites

Before you deploy the solution, make sure that your environment is set up with the following prerequisites:

Important: This solution uses various AWS services, and there are costs associated with these resources after the Free Tier usage. See the AWS pricing page for details.

Solution overview

The solution architecture and workflow are detailed in Figure 1.

Figure 1: Solution overview

This solution allows for the configuration of auto-remediation behavior based on finding type and finding severity. For each finding type, you can define whether you want the offending S3 object to be automatically quarantined, or whether you want the finding details to be reviewed and approved by a human in Slack prior to being quarantined. In a similar manner, you can define the minimum severity level (Low, Medium, High) that a finding must have before the solution will take action. By adjusting these parameters, you can manage false positives and tune the volume and type of findings about which you want to be notified and take action. This configurability is important because customers have different security, risk, and regulatory requirements.

Figure 1 details the services used in the solution and the integration points between them. Let’s walk through the full sequence from the detection of sensitive data to the remediation (quarantine) of the offending object.

  1. Macie is configured with sensitive data discovery jobs (scheduled or one-time), which you create and run to detect sensitive data within S3 buckets. When Macie runs a job, it uses a combination of criteria and techniques to analyze objects in S3 buckets that you specify. For a full list of the categories of sensitive data Macie can detect, see the Amazon Macie User Guide.
  2. For each sensitive data finding, an event is sent to Amazon EventBridge that contains the finding details. An EventBridge rule triggers a Lambda function for processing.
  3. The Finding Handler Lambda function parses the event and examines the type of the finding. Based on the auto-remediation configuration, the function either invokes the Finding Remediator function for immediate remediation, or sends the finding details for manual review and remediation approval through Slack.
  4. Delegated security and compliance administrators monitor the configured Slack channel for notifications. Notifications provide high-level finding information, remediation status, and a link to the Macie console for the finding in question. For findings configured for manual review, administrators can choose to approve the remediation in Slack by using an action button on the notification.
  5. After an administrator chooses the Remediate button, Slack issues an API call to an Amazon API Gateway endpoint, supplying both the unique identifier of the finding to be remediated and that of the Slack user. API Gateway proxies the request to a Remediation Handler Lambda function.
  6. The Remediation Handler Lambda function validates the request and request signature, extracts the offending object’s location from the finding, and makes an asynchronous call to the Finding Remediator Lambda function.
  7. The Finding Remediator Lambda function moves the offending object from the source bucket to a designated S3 quarantine bucket with restricted access.
  8. Finally, the Finding Remediator Lambda function uses a callback URL to update the original finding notification in Slack, indicating that the offending object has now been quarantined.
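
The request validation in step 6 relies on Slack's signed-request scheme: Slack signs each request with the app's signing secret and sends the result in the `X-Slack-Signature` header, alongside an `X-Slack-Request-Timestamp` header. The following is a minimal sketch of that check, not the project's actual handler; the function name, the five-minute freshness window, and the sample values are assumptions.

```python
import hashlib
import hmac
import time


def is_valid_slack_request(
    signing_secret, timestamp, body, signature, now=None, max_age_seconds=300
):
    """Return True if the signature matches and the timestamp is recent."""
    now = time.time() if now is None else now
    try:
        if abs(now - int(timestamp)) > max_age_seconds:
            return False  # stale request, possibly a replay
    except (TypeError, ValueError):
        return False
    if not isinstance(signature, str):
        return False
    # Slack signs the string "v0:<timestamp>:<raw request body>".
    base_string = f"v0:{timestamp}:{body}".encode("utf-8")
    digest = hmac.new(
        signing_secret.encode("utf-8"), base_string, hashlib.sha256
    ).hexdigest()
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest("v0=" + digest, signature)


# Hypothetical values for a local check.
demo_secret = "example-signing-secret"
demo_timestamp = "1700000000"
demo_body = "payload=%7B%7D"
demo_signature = "v0=" + hmac.new(
    demo_secret.encode("utf-8"),
    f"v0:{demo_timestamp}:{demo_body}".encode("utf-8"),
    hashlib.sha256,
).hexdigest()
print(
    is_valid_slack_request(
        demo_secret, demo_timestamp, demo_body, demo_signature, now=1700000010
    )
)
```

The check must run against the raw request body exactly as received, before any JSON or form decoding, because re-encoding the body can change the bytes that were signed.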

Deploy the solution

Now we’ll walk through the steps for configuring Slack and deploying the solution into your AWS environment by using the AWS CDK. The AWS CDK is a software development framework that you can use to define cloud infrastructure in code and provision through AWS CloudFormation.

The deployment steps can be summarized as follows:

  1. Configure a Slack channel and app
  2. Check the project out from GitHub
  3. Set the configuration parameters
  4. Build and deploy the solution
  5. Configure Slack with an API Gateway endpoint

To configure a Slack channel and app

  1. In your browser, make sure you’re logged into the Slack workspace where you want to integrate the solution.
  2. Create a new channel where you will send the notifications, as follows:
    1. Choose the + icon next to the Channels menu, and select Create a channel.
    2. Give your channel a name, for example macie-findings, and make sure you turn on the Make private setting.

      Important: By providing Slack users with access to this configured channel, you’re providing implicit access to review Macie finding details and approve remediations. To avoid unwanted user access, it’s strongly recommended that you make this channel private and by invite only.

  3. On your Apps page, create a new app by selecting Create New App, and then enter the following information:
    1. For App Name, enter a name of your choosing, for example MacieRemediator.
    2. Select your chosen development Slack workspace that you logged into in step 1.
    3. Choose Create App.
    Figure 2: Create a Slack app

  4. You will then see the Basic Information page for your app. Scroll down to the App Credentials section, and note down the Signing Secret. This secret will be used by the Lambda function that handles all remediation requests from Slack. The function uses the secret with Hash-based Message Authentication Code (HMAC) authentication to validate that requests to the solution are legitimate and originated from your trusted Slack channel.

    Figure 3: Signing secret

  5. Scroll back to the top of the Basic Information page, and under Add features and functionality, select the Incoming Webhooks tile. Turn on the Activate Incoming Webhooks setting.
  6. At the bottom of the page, choose Add New Webhook to Workspace.
    1. Select the macie-findings channel you created in step 2, and choose Allow.
    2. You should now see webhook URL details under Webhook URLs for Your Workspace. Use the Copy button to note down the URL, which you will need later.

      Figure 4: Webhook URL

To check the project out from GitHub

The solution source is available on GitHub in AWS Samples. Clone the project to your local machine or download and extract the available zip file.

To set the configuration parameters

In the root directory of the project you’ve just cloned, there’s a file named cdk.json. This file contains configuration parameters to allow integration with the macie-findings channel you created earlier, and also to allow you to control the auto-remediation behavior of the solution. Open this file and make sure that you review and update the following parameters:

  • autoRemediateConfig – This nested attribute allows you to specify for each sensitive data finding type whether you want to automatically remediate and quarantine the offending object, or first send the finding to Slack for human review and authorization. Note that you will still be notified through Slack that auto-remediation has taken place if this attribute is set to AUTO. Valid values are either AUTO or REVIEW. You can use the default values.
  • minSeverityLevel – Macie assigns all findings a Severity level. With this parameter, you can define a minimum severity level that must be met before the solution will trigger action. For example, if the parameter is set to MEDIUM, the solution won’t take any action or send any notifications when a finding has a LOW severity, but will take action when a finding is classified as MEDIUM or HIGH. Valid values are: LOW, MEDIUM, and HIGH. The default value is set to LOW.
  • slackChannel – The name of the Slack channel you created earlier (macie-findings).
  • slackWebHookUrl – For this parameter, enter the webhook URL that you noted down during Slack app setup in the “Configure a Slack channel and app” step.
  • slackSigningSecret – For this parameter, enter the signing secret that you noted down during Slack app setup.
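
To illustrate how autoRemediateConfig and minSeverityLevel interact, here is a small decision function. It is a sketch under assumptions: the real shape of the nested autoRemediateConfig attribute in cdk.json is not shown in this post, so the flat category-to-mode mapping, the function name, and the fallback to REVIEW for unlisted categories are hypothetical.

```python
SEVERITY_ORDER = {"LOW": 1, "MEDIUM": 2, "HIGH": 3}


def decide_action(finding_type, severity, auto_remediate_config, min_severity_level="LOW"):
    """Return IGNORE, AUTO, or REVIEW for a single finding."""
    if SEVERITY_ORDER[severity.upper()] < SEVERITY_ORDER[min_severity_level.upper()]:
        return "IGNORE"  # below the threshold: no action and no notification
    # Macie finding types look like "SensitiveData:S3Object/Financial".
    category = finding_type.rsplit("/", 1)[-1]
    # Assumption: categories missing from the configuration go to human review.
    return auto_remediate_config.get(category, "REVIEW")


# Hypothetical configuration shape for illustration only.
example_config = {"Financial": "AUTO", "Personal": "REVIEW"}

print(decide_action("SensitiveData:S3Object/Financial", "High", example_config))
print(decide_action("SensitiveData:S3Object/Personal", "Medium", example_config))
print(decide_action("SensitiveData:S3Object/Personal", "Low", example_config, "MEDIUM"))
```

Defaulting unknown categories to REVIEW keeps a person in the loop for anything the configuration did not anticipate, which is the more conservative choice when remediation moves objects.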

Save your changes to the configuration file.

To build and deploy the solution

  1. From the command line, make sure that your current working directory is the root directory of the project that you cloned earlier. Run the following commands:
    • npm install – Installs all Node.js dependencies.
    • npm run build – Compiles the CDK TypeScript source.
    • cdk bootstrap – Initializes the CDK environment in your AWS account and Region, as shown in Figure 5.

      Figure 5: CDK bootstrap output

    • cdk deploy – Generates a CloudFormation template and deploys the solution resources.

    The resources created can be reviewed in the CloudFormation console and can be summarized as follows:

    • Lambda functions – Finding Handler, Remediation Handler, and Remediator
    • IAM execution roles and associated policy – The roles and policy associated with each Lambda function and the API Gateway
    • S3 bucket – The quarantine S3 bucket
    • EventBridge rule – The rule that triggers the Lambda function for Macie sensitive data findings
    • API Gateway – A single remediation API with proxy integration to the Lambda handler
  2. After you run the deploy command, you’ll be prompted to review the IAM resources deployed as part of the solution. Press y to continue.
  3. Once the deployment is complete, you’ll be presented with an output parameter, shown in Figure 6, which is the endpoint for the API Gateway that was deployed as part of the solution. Copy this URL.

    Figure 6: CDK deploy output

To configure Slack with the API Gateway endpoint

  1. Open Slack and return to the Basic Information page for the Slack app you created earlier.
  2. Under Add features and functionality, select the Interactive Components tile.
  3. Turn on the Interactivity setting.
  4. In the Request URL box, enter the API Gateway endpoint URL you copied earlier.
  5. Choose Save Changes.

    Figure 7: Slack app interactivity
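
Once interactivity is enabled, Slack sends a form-encoded POST to the Request URL whenever someone presses a button, with the interaction details in a `payload` field that contains JSON. The sketch below shows how a handler might extract the pieces it needs. The function name, the returned keys, and the choice to carry the finding identifier in the button's `value` are assumptions, and the sketch assumes the body has already been base64-decoded if API Gateway delivered it that way.

```python
import json
from urllib.parse import parse_qs, urlencode


def parse_interaction(body):
    """Extract the finding ID, Slack user, and callback URL from an interaction."""
    fields = parse_qs(body)
    payload = json.loads(fields["payload"][0])
    action = payload["actions"][0]
    return {
        "finding_id": action["value"],  # assumption: the button value holds the ID
        "user_id": payload["user"]["id"],
        "response_url": payload["response_url"],
    }


# Hypothetical interaction body for a local check.
sample_body = urlencode(
    {
        "payload": json.dumps(
            {
                "type": "block_actions",
                "user": {"id": "U0EXAMPLE"},
                "response_url": "https://hooks.slack.com/actions/EXAMPLE",
                "actions": [{"action_id": "remediate", "value": "finding-123"}],
            }
        )
    }
)
interaction = parse_interaction(sample_body)
print(interaction)
```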

Now that you have the solution components deployed and Slack configured, it’s time to test things out.

Test the solution

The testing steps can be summarized as follows:

  1. Upload dummy files to S3
  2. Run the Macie sensitive data discovery job
  3. Review and act upon Slack notifications
  4. Confirm that S3 objects are quarantined

To upload dummy files to S3

Two sample text files containing dummy financial and personal data are available in the project you cloned from GitHub. If you haven’t changed the default auto-remediation configurations, these two files will exercise both the auto-remediation and manual remediation review flows.

Find the files under sensitive-data-samples/dummy-financial-data.txt and sensitive-data-samples/dummy-personal-data.txt. Take these two files and upload them to S3 by using either the console, as shown in Figure 8, or AWS CLI. You can choose to use any new or existing bucket, but make sure that the bucket is in the same AWS account and Region that was used to deploy the solution.

Figure 8: Dummy files uploaded to S3

To run a Macie sensitive data discovery job

  1. Navigate to the Amazon Macie console, and make sure that your selected Region is the same as the one that was used to deploy the solution.
    1. If this is your first time using Macie, choose the Get Started button, and then choose Enable Macie.
  2. On the Macie Summary dashboard, you will see a Create Job button at the top right. Choose this button to launch the Job creation wizard. Configure each step as follows:
    1. Select S3 buckets: Select the bucket where you uploaded the dummy sensitive data files. Choose Next.
    2. Review S3 buckets: No changes are required. Choose Next.
    3. Scope: For Job type, choose One-time job. Make sure Sampling depth is set to 100%. Choose Next.
    4. Custom data identifiers: No changes are required. Choose Next.
    5. Name and description: For Job name, enter any name you like, such as Dummy job, and then choose Next.
    6. Review and create: Review your settings; they should look like the following sample. Choose Submit.
Figure 9: Configure the Macie sensitive data discovery job

Macie will launch the sensitive data discovery job. You can track its status from the Jobs page within the Macie console.

To review and take action on Slack notifications

Within five minutes of submitting the data discovery job, you should expect to see two notifications appear in your configured Slack channel. One notification, similar to the one in Figure 10, is informational only and is related to an auto-remediation action that has taken place.

Figure 10: Slack notification of auto-remediation for the file containing dummy financial data

The other notification, similar to the one in Figure 11, requires end user action and is for a finding that requires administrator review. All notifications will display key information such as the offending S3 object, a description of the finding, the finding severity, and other relevant metadata.

Figure 11: Slack notification for human review of the file containing dummy personal data
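
A review notification of this kind can be expressed with Slack's Block Kit as a text section followed by an actions block that holds the button. The sketch below is illustrative only: the wording, the `remediate` action ID, and the use of the button value to carry the finding identifier are assumptions rather than the project's actual message format.

```python
def build_review_message(finding_id, finding_type, severity, bucket, key):
    """Build a Slack message with finding details and a Remediate button."""
    summary = (
        f"*Finding type:* {finding_type}\n"
        f"*Severity:* {severity}\n"
        f"*Object:* s3://{bucket}/{key}"
    )
    return {
        "text": f"Macie finding requires review: {finding_type}",  # plain-text fallback
        "blocks": [
            {"type": "section", "text": {"type": "mrkdwn", "text": summary}},
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Remediate"},
                        "style": "danger",
                        "action_id": "remediate",  # hypothetical action ID
                        "value": finding_id,
                    }
                ],
            },
        ],
    }


# Hypothetical values for illustration only.
review_message = build_review_message(
    finding_id="finding-123",
    finding_type="SensitiveData:S3Object/Personal",
    severity="Medium",
    bucket="example-bucket",
    key="dummy-personal-data.txt",
)
print(review_message["blocks"][1]["elements"][0]["text"]["text"])
```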

(Optional) You can review the finding details by choosing the View Macie Finding in Console link in the notification.

In the Slack notification, choose the Remediate button to quarantine the object. The notification will be updated with confirmation of the quarantine action, as shown in Figure 12.

Figure 12: Slack notification of authorized remediation

To confirm that S3 objects are quarantined

Finally, navigate to the S3 console and validate that the objects have been removed from their original bucket and placed into the quarantine bucket listed in the notification details, as shown in Figure 13. Note that you may need to refresh your S3 object listing in the browser.

Figure 13: Slack notification of authorized remediation
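
Because S3 has no native move operation, a quarantine step like the one you just verified is typically a copy followed by a delete. The sketch below demonstrates that sequence against a tiny in-memory stand-in so it runs without AWS access. The stand-in class, the function name, and the destination key layout are assumptions; the method names and keyword arguments mirror the boto3 S3 client's `copy_object` and `delete_object` calls.

```python
class InMemoryS3:
    """Tiny stand-in for an S3 client so the sketch runs without AWS access."""

    def __init__(self):
        self.objects = {}

    def put_object(self, Bucket, Key, Body):
        self.objects[(Bucket, Key)] = Body

    def copy_object(self, Bucket, Key, CopySource):
        source = (CopySource["Bucket"], CopySource["Key"])
        self.objects[(Bucket, Key)] = self.objects[source]

    def delete_object(self, Bucket, Key):
        self.objects.pop((Bucket, Key), None)


def quarantine_object(s3, source_bucket, key, quarantine_bucket):
    """Copy the object into the quarantine bucket, then remove the original."""
    # Assumption: prefix the key with the source bucket to avoid name clashes.
    destination_key = f"{source_bucket}/{key}"
    s3.copy_object(
        Bucket=quarantine_bucket,
        Key=destination_key,
        CopySource={"Bucket": source_bucket, "Key": key},
    )
    # Delete only after the copy has succeeded, so a failure never loses data.
    s3.delete_object(Bucket=source_bucket, Key=key)
    return destination_key


s3 = InMemoryS3()
s3.put_object(Bucket="source-bucket", Key="dummy-financial-data.txt", Body=b"dummy")
moved_to = quarantine_object(
    s3, "source-bucket", "dummy-financial-data.txt", "quarantine-bucket"
)
print(moved_to, sorted(s3.objects))
```

With a real client, a single `copy_object` call handles objects up to 5 GB; larger objects would need a multipart copy instead.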

Congratulations! You now have a fully operational solution to detect and respond to Macie sensitive data findings through a Slack ChatOps workflow.

Solution cleanup

To remove the solution and avoid incurring additional charges from the AWS resources that you deployed, complete the following steps.

To remove the solution and associated resources

  1. Navigate to the Macie console. Under Settings, choose Suspend Macie.
  2. Navigate to the S3 console and delete all objects in the quarantine bucket.
  3. Run the command cdk destroy from the command line within the root directory of the project. You will be prompted to confirm that you want to remove the solution. Press y.

Summary

In this blog post, I showed you how to integrate Amazon Macie sensitive data findings with an auto-remediation and Slack ChatOps workflow. We reviewed the AWS services used, how they are integrated, and the steps to configure, deploy, and test the solution. With Macie and the solution in this blog post, you can substantially reduce the heavy lifting associated with detecting and responding to sensitive data in your AWS environment.

I encourage you to take this solution and customize it to your needs. Further enhancements could include supporting policy findings, adding additional remediation actions, or integrating with additional findings from AWS Security Hub.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon Macie forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Nick Cuneo

Nick is an Enterprise Solutions Architect at AWS who works closely with Australia’s largest financial services organisations. His previous roles span operations, software engineering, and design. Nick is passionate about application and network security, automation, microservices, and event driven architectures. Outside of work, he enjoys motorsport and is found most weekends in his garage wrenching on cars.

Detecting sensitive data in DynamoDB with Macie

Post Syndicated from Sheldon Sides original https://aws.amazon.com/blogs/security/detecting-sensitive-data-in-dynamodb-with-macie/

Amazon Macie is a fully managed data security and data privacy service that uses machine learning and pattern matching to discover and protect your sensitive data in Amazon Web Services (AWS). It gives you the ability to automatically scan for sensitive data and get an inventory of your Amazon Simple Storage Service (Amazon S3) buckets. Macie also gives you the added ability to detect which buckets are public, unencrypted, and accessible from other AWS accounts.

In this post, we’ll walk through how to use Macie to detect sensitive data in Amazon DynamoDB tables by exporting the data to Amazon S3 so that Macie can scan the data. This approach is useful if you have potentially sensitive data stored in DynamoDB tables. When we’re finished, you’ll have a solution that can set up on-demand or scheduled Macie discovery jobs to detect sensitive data exported from DynamoDB to S3.

Architecture

In figure 1, you can see an architectural diagram explaining the flow of the solution that you’ll be deploying.

Figure 1: Solution architecture


Here’s a brief overview of the steps that you’ll take to deploy the solution. Some steps you will do manually, while others will be handled by the provided AWS CloudFormation template. The following outline describes the steps taken to extract the data from DynamoDB and store it in S3, which allows Macie to run a discovery job against the data.

  1. Enable Amazon Macie, if it isn’t already enabled.
  2. Deploy a test DynamoDB dataset.
  3. Create an S3 bucket to export DynamoDB data to.
  4. Configure an AWS Identity and Access Management (IAM) policy and role. (These are used by the Lambda function to access the S3 bucket and DynamoDB tables.)
  5. Deploy an AWS Lambda function to export DynamoDB data to S3.
  6. Set up an Amazon EventBridge rule to schedule export of the DynamoDB data.
  7. Create a Macie discovery job to discover sensitive data from the DynamoDB data export.
  8. View the results of the Macie discovery job.

The goal is that when you finish, you have a solution that you can use to set up either on-demand or scheduled Macie discovery jobs to detect sensitive data that was exported from DynamoDB to S3.

Prerequisite: Enable Macie

If Macie hasn’t been enabled in your account, complete Step 1 in Getting started with Amazon Macie to enable Macie. Once you’ve enabled Macie, you can proceed with the deployment of the CloudFormation template.

Deploy the CloudFormation template

In this section, you start by deploying the CloudFormation template that will deploy all the resources needed for the solution. You can then review the output of the resources that have been deployed.

To deploy the CloudFormation template

  1. Download the CloudFormation template: https://github.com/aws-samples/macie-dynamodb-blog/blob/main/src/cft.yaml
  2. Sign in to the AWS Management Console and navigate to the CloudFormation console.
  3. Choose Upload a template file, and then select the CloudFormation template that you downloaded in the previous step. Choose Next.

    Figure 2: Uploading the CloudFormation template to be deployed

  4. For Stack Name, name your stack macie-blog, and then choose Next.

    Figure 3: Naming your CloudFormation stack


  5. For Configure stack options, keep the default values and choose Next.
  6. At the bottom of the Review screen, select the I acknowledge that AWS CloudFormation might create IAM resources check box, and then choose Create stack.

    Figure 4: Acknowledging that this CloudFormation template will create IAM roles


You should then see the following screen. It may take several minutes for the CloudFormation template to finish deploying.

Figure 5: CloudFormation stack creation in progress


View CloudFormation output

Once the CloudFormation template has been completely deployed, choose the Outputs tab, and you will see the following screen. Here you’ll find the names and URLs for all the AWS resources that are needed to complete the remainder of the solution.

Figure 6: Completed CloudFormation stack output


For easier reference, keep this tab open and open the AWS Management Console in a new browser tab. This makes it easier to copy and paste the resource URLs as you navigate to different resources during this walkthrough.

Import DynamoDB data

In this section, we walk through importing the test dataset to DynamoDB. You start by downloading the test CSV datasets, then upload those datasets to S3 and run the Lambda function that imports the data to DynamoDB. Finally, you review the data that was imported into DynamoDB.

Test datasets

Download the following test datasets:

  1. Accounts Info test dataset (accounts.csv): https://github.com/aws-samples/macie-dynamodb-blog/blob/main/datasets/accounts.csv
  2. People test dataset (people.csv): https://github.com/aws-samples/macie-dynamodb-blog/blob/main/datasets/people.csv

Upload data to the S3 import bucket

Now that you’ve downloaded the test datasets, you’ll need to navigate to the data import S3 bucket and upload the data.

To upload the datasets to the S3 import bucket

  1. Navigate to the CloudFormation Outputs tab, where you’ll find the bucket information.

    Figure 7: S3 bucket output values for the CloudFormation stack


  2. Copy the ImportS3BucketURL link and navigate to the URL.
  3. Upload the two test CSV datasets, people.csv and accounts.csv, to your S3 bucket.
  4. After the upload is complete, you should see the two CSV files in the S3 bucket. You’ll use these files as your test DynamoDB data.

    Figure 8: Test S3 datasets in the S3 bucket


View the data import Lambda function

Now that you have your test data staged for loading, you’ll import it into DynamoDB by using a Lambda function that was deployed with the CloudFormation template. To start, navigate to the CloudFormation console and get the URL to the Lambda function that will handle the data import to DynamoDB, as shown in figure 9.

Figure 9: CloudFormation output information for the People DynamoDB table


To run the data import Lambda function

  1. Copy the LambdaImportS3DataToDynamoURL link and navigate to the URL. You will see the Import-Data-To-DynamoDB Lambda function, as shown in figure 10.

    Figure 10: The Lambda function that imports data to DynamoDB


  2. Choose the Test button in the upper right-hand corner. In the dialog screen, for Event name, enter Test, and replace the event body with {}.
  3. Your screen should now look as shown in figure 11. Choose Create.

    Figure 11: Configuring a test event to manually run the Lambda function


  4. Choose the Test button again in the upper right-hand corner. You should now see the Lambda function running, as shown in figure 12.

    Figure 12: View of the Lambda function running


  5. Once the Lambda function is finished running, you can expand the Details section. You should see a screen similar to the one in figure 13. When you see this screen, the test datasets have successfully been imported into the DynamoDB tables.

    Figure 13: View of the data import Lambda function after it runs successfully

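To give a feel for what a CSV import function like this might do, here is a minimal sketch. It is not the code deployed by the CloudFormation template: the function names are hypothetical, and `table` is assumed to be a boto3 DynamoDB `Table` resource, whose `batch_writer()` buffers and sends `put_item` calls in batches.

```python
import csv
import io


def csv_to_items(csv_text):
    """Turn CSV text with a header row into a list of dicts, one per row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]


def import_items(table, items):
    """Write items to DynamoDB through the Table resource's batch writer."""
    with table.batch_writer() as batch:
        for item in items:
            batch.put_item(Item=item)
    return len(items)


# Example wiring (bucket, key, and table names are placeholders):
#   import boto3
#   body = boto3.client("s3").get_object(Bucket="<import-bucket>", Key="people.csv")["Body"]
#   table = boto3.resource("dynamodb").Table("<table-name>")
#   import_items(table, csv_to_items(body.read().decode("utf-8")))
```

Every CSV value is kept as a string in this sketch. A real import would also need to make sure that each row contains the table's key attributes.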

View the DynamoDB test dataset

Now that you have the datasets imported, you can look at the data in the console.

To view the test dataset

  1. Navigate to the two DynamoDB tables. You can do this by getting the URL values from the CloudFormation Outputs tab. Figure 14 shows the URL for the accounts table.

    Figure 14: Output values for CloudFormation stack DynamoDB account tables

    Figure 15 shows the URL for the people table.

    Figure 15: Output values for CloudFormation stack DynamoDB people tables


  2. Copy the AccountsDynamoDBTableURL link value and navigate to it in the browser. Then choose the Items tab.

    Figure 16: View of DynamoDB account-info-macie table data

    You should now see a screen showing data similar to the screen in figure 16. This DynamoDB table stores the test account data that you will use to run a Macie discovery job against after the data has been exported to S3.

  3. Navigate to the PeopleDynamoDBTableURL link that is in the CloudFormation output. Then choose the Items tab.

    Figure 17: View of DynamoDB people table data


You should now see a screen showing data similar to the screen in figure 17. This DynamoDB table stores the test people data that you will use to run a Macie discovery job against after the data has been exported to S3.

Export DynamoDB data to S3

In the previous section, you set everything up and staged the data to DynamoDB. In this section, you will export data from DynamoDB to S3.

View the EventBridge rule

The EventBridge rule that was deployed earlier allows you to automatically schedule the export of DynamoDB data to S3. You can schedule the export at an interval of minutes, hours, or days. The purpose of the EventBridge rule is to allow you to set up an automated data pipeline from DynamoDB to S3. For demonstration purposes, you’ll manually run the Lambda function that the EventBridge rule uses, so that you can see the data exported to S3 without having to wait.

To view the EventBridge rule

  1. Navigate to the CloudFormation Outputs tab for the CloudFormation stack you deployed earlier.

    Figure 18: CloudFormation output information for the EventBridge rule


  2. Navigate to the EventBridgeRule link. You should see the following screen.

    Figure 19: EventBridge rule configuration details page


On this screen, you can see that we’ve set the event schedule to run every hour. We set it to 1 hour for demonstration purposes only; you can change the interval to fit your business needs. To change the interval, choose the Edit button, update the schedule, and then save the rule.

In the Target(s) section, we’ve configured a Lambda function named Export-DynamoDB-Data-To-S3 to handle the process of exporting data to the S3 bucket the Macie discovery job will run against. We will cover the Lambda function that handles the export of the data from DynamoDB next.
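If you prefer to change the schedule in code rather than in the console, the interval is expressed as an EventBridge rate expression. The helper below is a small sketch that builds such an expression; the function name `rate_expression` is hypothetical, and the rule name in the usage comment is a placeholder that you would take from your own stack.

```python
def rate_expression(value, unit):
    """Build an EventBridge rate expression such as 'rate(1 hour)'."""
    if unit not in ("minute", "hour", "day"):
        raise ValueError("unit must be 'minute', 'hour', or 'day'")
    if value < 1:
        raise ValueError("value must be a positive integer")
    # EventBridge expects a singular unit for 1 and a plural unit otherwise.
    suffix = "" if value == 1 else "s"
    return f"rate({value} {unit}{suffix})"


# Example call against an existing scheduled rule (rule name is a placeholder):
#   import boto3
#   boto3.client("events").put_rule(
#       Name="<rule-name>",
#       ScheduleExpression=rate_expression(6, "hour"),
#   )
```

Calling `put_rule` with the name of an existing rule updates that rule, so review the rule's other settings before you run anything like this against a deployed stack.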

View the data export Lambda function

In this section, you’ll take a look at the Lambda function that handles the exporting of DynamoDB data to the S3 bucket that Macie will run its discovery job against.

To view the Lambda function

  1. Navigate to the CloudFormation Outputs tab for the CloudFormation stack you deployed earlier.

    Figure 20: CloudFormation output information for the Lambda function that exports DynamoDB data to S3


  2. Copy the link value for LambdaExportDynamoDBDataToS3URL and navigate to the URL in your browser. You should see the Python code that will handle the exporting of data to S3. The code has been commented so that you can easily follow it and refactor it for your needs.
  3. Scroll to the Environment variables section.

    Figure 21: Environment variables used by the Lambda function

    You will see two environment variables:

  • bucket_to_export_to – This environment variable is used by the function as the S3 bucket location to save the DynamoDB data to. This is the bucket that the Macie discovery will run against.
  • dynamo_db_tables – This environment variable is a comma-delimited list of DynamoDB tables that will be read and have data exported to S3. If there was another table that you wanted to export data from, you would simply add it to the comma-delimited list and it would be part of the export.
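To illustrate how a function could use these two environment variables, here is a minimal sketch. It is not the deployed function's code: the helper names and the object key are hypothetical, and the clients are assumed to be boto3 DynamoDB and S3 clients.

```python
import json
import os


def tables_from_env(environ=os.environ):
    """Split the comma-delimited dynamo_db_tables value into table names."""
    raw = environ.get("dynamo_db_tables", "")
    return [name.strip() for name in raw.split(",") if name.strip()]


def scan_all_items(dynamodb_client, table_name):
    """Read every item in a table, following scan pagination."""
    items, kwargs = [], {"TableName": table_name}
    while True:
        page = dynamodb_client.scan(**kwargs)
        items.extend(page.get("Items", []))
        if "LastEvaluatedKey" not in page:
            return items
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]


def export_tables(dynamodb_client, s3_client, environ=os.environ):
    """Write one JSON object per table to the bucket_to_export_to bucket."""
    bucket = environ["bucket_to_export_to"]
    keys = []
    for table in tables_from_env(environ):
        body = json.dumps(scan_all_items(dynamodb_client, table))
        key = f"{table}.json"  # hypothetical key, not the deployed naming convention
        s3_client.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
        keys.append(key)
    return keys
```

The low-level client returns items in DynamoDB's typed JSON form (for example, `{"S": "value"}`), and binary attributes would need extra handling before `json.dumps`. A full table scan also consumes read capacity, which matters more for large tables than for the test datasets used here.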

Export DynamoDB data

In this section, you will manually run the Lambda function to export the DynamoDB tables data to S3. As stated previously, you would normally allow the EventBridge rule to handle the automated export of the data to S3. In order to see the export in action, you’re going to manually run the function.

To run the export Lambda function

  1. In the console, scroll back to the top of the screen and choose the Test button.
  2. Name the test dynamoDBExportTest, and for the test data create an empty JSON object “{}” as shown in figure 22.

    Figure 22: Configuring a test event to manually test the data export Lambda function


  3. Choose Create.
  4. Choose the Test button again to run the Lambda function to export the DynamoDB data to S3.

    Figure 23: View of the screen where you run the Lambda function to export data


  5. It could take about one minute to export the data from DynamoDB to S3. Once the Lambda function exports the data, you should see a screen similar to the following one.

    Figure 24: The result after you successfully run the data export Lambda function


View the exported DynamoDB data

Now that the DynamoDB data has been exported for Macie to run discovery jobs against, you can navigate to S3 to verify that the files were exported to the bucket.

To view the data, navigate to the CloudFormation stack Outputs tab. Find the ExportS3BucketURL, shown in figure 25, and navigate to the link.

Figure 25: CloudFormation output information for the S3 buckets that the DynamoDB data was exported to


You should then see two different JSON files for the two DynamoDB tables that data was exported from, as shown in figure 26.

Figure 26: View of S3 objects that were exported to S3


This is the file naming convention that’s used for the files:

<Service-name>-<DynamoDB-Table-Name>-<AWS-Region>-<DateAndTime>.json
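As an illustration of this convention, the following sketch builds an object key from its four parts. The function name is hypothetical, and the timestamp layout is an assumption, because the post does not say how the date and time are formatted.

```python
from datetime import datetime, timezone


def export_object_key(service, table, region, when=None):
    """Build '<service>-<table>-<region>-<timestamp>.json'."""
    when = when or datetime.now(timezone.utc)
    # Assumed timestamp layout; the deployed function may format it differently.
    stamp = when.strftime("%Y-%m-%dT%H-%M-%S")
    return f"{service}-{table}-{region}-{stamp}.json"
```

Including the timestamp in the key means that each scheduled export creates a new object instead of overwriting the previous one.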

Next, you’ll create a Macie discovery job to run against the files in this S3 bucket to discover sensitive data.

Create the Macie discovery job

In this section, you’ll create a Macie discovery job and view the results after the job has finished running.

To create the discovery job

  1. In the AWS Management Console, navigate to Macie. In the left-hand menu, choose Jobs.

    Figure 27: Navigation menu to Macie discovery jobs


  2. Choose the Create job button.

    Figure 28: Macie discovery job list screen


  3. Using the Bucket Name filter, search for the S3 bucket that the DynamoDB data was exported to. This can be found in the CloudFormation stack output, as shown in figure 29.

    Figure 29: CloudFormation stack output


  4. Select the value you see for ExportS3BucketName, as shown in figure 30.

    Note: The value you see for your bucket name will be slightly different, based on the random characters added to the end of the bucket name generated by CloudFormation.

    Figure 30: Selecting the S3 bucket to include in the Macie discovery job


  5. Once you’ve found the S3 bucket, select the check box next to it, and then choose Next.
  6. On the Review S3 Buckets screen, if you’re satisfied with the selected buckets, choose Next.

Following are some important options when setting up Macie data discovery jobs.

Scheduling
You have the following scheduling options for the data discovery job:

  • Daily
  • Weekly
  • Monthly

Data Sampling
This allows you to randomly sample a percentage of the data that the Macie discovery job will run against.

Object criteria
This enables you to target objects based on certain metadata values. The values are:

  • Tags – Target objects with certain tags.
  • Last modified – Target objects based on when they were last modified.
  • File extensions – Target objects based on file extensions.
  • Object size – Target objects based on the file size.

You can include or exclude objects based on these object criteria filters.

Set the discovery job scope

For demonstration purposes, this will be a one-time discovery job.

To set the discovery job scope

  1. On the Scope page, set the following options for the job scope:
    1. Select the One-time job option.
    2. Leave Sampling depth set to 100%, and choose Next.

      Figure 31: Selecting the objects that should be in scope for this discovery job


  2. On the Custom data identifiers screen, select account_number, and then choose Next. With the custom identifier, you can create custom business logic to look for certain patterns in files stored in S3. In this example, the job generates a finding for any file that contains data with the following format:

    Account Number Format: Starts with “XYZ-” followed by 11 digits

    The logic to create a custom data identifier can be found in the CloudFormation template.

    Figure 32: Custom data identifiers


  3. Give your discovery job the name dynamodb-macie-discovery-job. For Description, enter Discovery job to detect sensitive data exported from DynamoDB, and choose Next.

    Figure 33: Giving the Macie discovery job a name and description

    You will then see the Review and create screen, as shown in figure 34.

    Figure 34: The Macie discovery job review screen


    Note: Macie must have proper permissions to decrypt objects that are part of the Macie discovery job. The CloudFormation template that you deployed during the initial setup has already deployed an AWS Key Management Service (AWS KMS) key with the proper permissions.

    For this proof of concept you won’t store the results, so you can select the check box next to Override this requirement. If you wanted to store detailed results of the discovery job long term, you would configure a repository for data discovery results. To view detailed steps for setting this up, see Storing and retaining discovery results with Amazon Macie.
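The account number format used by the custom data identifier in step 2 can be expressed as a regular expression. The pattern below is an assumption based on the format described in this post ("XYZ-" followed by 11 digits); the expression actually defined in the CloudFormation template may differ. You can use a sketch like this to check sample data locally before you run a job.

```python
import re

# Assumed pattern: "XYZ-" followed by exactly 11 digits, on word boundaries.
ACCOUNT_NUMBER = re.compile(r"\bXYZ-\d{11}\b")


def find_account_numbers(text):
    """Return every substring of text that matches the assumed format."""
    return ACCOUNT_NUMBER.findall(text)
```

The trailing word boundary keeps the pattern from matching the first 11 digits of a longer number.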

Submit the discovery job

Next, you can submit the discovery job. On the Review and create screen, choose the Submit button to start the discovery job. You should see a screen similar to the following.

Figure 35: A Macie discovery job run that is in progress


The amount of data that is being scanned dictates how long the job will take to run. You can choose the Refresh button at the top of the screen to see the updated status of the job. This job, based on the size of the test dataset, will take about seven minutes to complete.
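The same one-time job can also be submitted through the Macie API instead of the console. The following sketch builds the request for the boto3 `macie2` client's `create_classification_job` operation by using the name, description, and sampling depth from this walkthrough. The helper name is hypothetical, and the account ID, bucket name, and custom data identifier ID are values that you would supply from your own account.

```python
import uuid


def build_job_request(account_id, bucket_name, custom_identifier_ids):
    """Build keyword arguments for macie2.create_classification_job."""
    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token for the request
        "jobType": "ONE_TIME",
        "name": "dynamodb-macie-discovery-job",
        "description": "Discovery job to detect sensitive data exported from DynamoDB",
        "samplingPercentage": 100,
        "customDataIdentifierIds": list(custom_identifier_ids),
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": [bucket_name]}
            ]
        },
    }


# Example submission (all three arguments are placeholders):
#   import boto3
#   request = build_job_request("<account-id>", "<export-bucket-name>", ["<identifier-id>"])
#   response = boto3.client("macie2").create_classification_job(**request)
```

Building the request separately from the API call makes it easy to review or log the job definition before it is submitted.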

Review the job results

Now that the Macie discovery job has run, you can review the results to see what sensitive data was discovered in the data exported from DynamoDB.

You should see the following screen once the job has successfully run.

Figure 36: View of the completed Macie discovery job


On the right, you should see another pane with more information related to the discovery job. The pane should look like the following screen.

Figure 37: Summary showing which S3 bucket the discovery job ran against and start and complete time


Note: If you don’t see this pane, choose the discovery job to display this information.

To review the job results

  1. On the page for the discovery job, in the Show Results list, select Show findings.

    Figure 38: Option to view discovery job findings


  2. The Findings screen appears, as follows.

    Figure 39: Viewing the list of findings generated by the Macie discovery job

    The discovery job that you ran has two different “High Severity” finding types:

    SensitiveData:S3Object/Personal – The object contains personal information, such as full names or identification numbers.

    SensitiveData:S3Object/Multiple – The object contains more than one type of sensitive data.

    Learn more about Macie findings types.

  3. Choose the SensitiveData:S3Object/Personal finding type, and you will see an information pane appear to the right, as shown in figure 40. Some of the key information that you can find here:

    Severity – What the severity of the finding is: Low, Medium, or High.
    Resource – The S3 bucket where the S3 object exists that caused the finding to be generated.
    Region – The Region where the S3 bucket exists.

    Figure 40: Viewing the severity of the discovery job finding


    Since the finding is based on the detection of personal information in the S3 object, you get the number of times and type of personal data that was discovered, as shown in figure 41.

    Figure 41: Viewing the number of social security numbers that were discovered in the finding


    Here you can see that 10 names were detected in the data that you exported from the DynamoDB table. The occurrences value for the name detection is 10 line ranges, which tells you that the names were found on 10 different lines in the file. If you choose the 10 line ranges link, you are given the starting line and column in the document where each name was discovered.

    The S3 object that triggered the finding is displayed in the Resource affected section, as shown in figure 42.

    Figure 42: The S3 object that generated the Macie finding


Now that you know which S3 object contains the sensitive data, you can investigate further to take appropriate action to protect the data.
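The details shown in the console are also available as JSON, for example from the `get_findings` operation of the boto3 `macie2` client. The sketch below condenses one finding into the fields discussed above: finding type, severity, affected object, and detection counts. The function name is hypothetical, and it reads only a small subset of the finding's fields.

```python
def summarize_finding(finding):
    """Condense a Macie sensitive data finding into a small summary dict."""
    resources = finding["resourcesAffected"]
    result = finding.get("classificationDetails", {}).get("result", {})
    counts = {}
    # Managed data identifier detections, grouped by category in the finding.
    for group in result.get("sensitiveData", []):
        for detection in group.get("detections", []):
            name = detection["type"]
            counts[name] = counts.get(name, 0) + detection["count"]
    # Custom data identifier detections are reported separately.
    for detection in result.get("customDataIdentifiers", {}).get("detections", []):
        name = detection["name"]
        counts[name] = counts.get(name, 0) + detection["count"]
    bucket = resources["s3Bucket"]["name"]
    key = resources["s3Object"]["key"]
    return {
        "type": finding["type"],
        "severity": finding["severity"]["description"],
        "object": f"s3://{bucket}/{key}",
        "detections": counts,
    }
```

A summary like this can be a convenient input for a notification or ticketing step, because it names the object to investigate without carrying the full finding payload.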

View the Macie finding details

In this section, you will walk through how to read and download the objects related to the Macie discovery job.

To download and view the S3 object that contains the finding

  1. In the Overview section of the finding details, select the value for the Resource link. You will then be taken to the object in the S3 bucket.

    Figure 43: Viewing the S3 bucket where the object is located that generated the Macie finding


  2. You can then download the S3 object from the S3 bucket to view the file content and further investigate the file content for sensitive data. Select the check box next to the S3 object, and choose the Download button at the top of the screen.

    Next, we will look at the SensitiveData:S3Object/Multiple finding type that was generated. This finding type lets us know that there are multiple types of potentially sensitive data related to an object stored in S3.
  3. In the left navigation menu, navigate back to the Jobs menu.
  4. Choose the job that you created in the previous steps. In the Show Results list, select Show Findings.
  5. Select the SensitiveData:S3Object/Multiple finding type. An information pane appears to the right. As with the previous finding, you will see the severity, Region, S3 bucket location, and other relevant information about the finding. For this finding, we will focus on the Custom data identifiers and Personal info sections.

    Figure 44: Details about the sensitive data that was discovered by the Macie discovery job

    Here you can see that the discovery job found 10 names on 10 different lines in the file. Also, you can see that 10 account numbers were discovered on 10 different lines in the file, based on the custom identifier that was included as part of the discovery job.

    This finding demonstrates how you can use the built-in Macie identifiers, such as names, and also include custom business logic based on your organization’s needs by using Macie custom data identifiers.

    To view the data and investigate further, follow the same steps as in the previous finding you investigated.

  6. Navigate to the top of the screen and in the Overview section, locate the Resource.

    Figure 45: Viewing the S3 bucket where the object is located that generated the Macie finding


  7. Choose Resource, which will take you to the S3 object to download. You can now view the contents of the file and investigate further.

You’ve now created a Macie discovery job to scan for sensitive data stored in an S3 bucket that originated in DynamoDB. You can also automate this solution further by using EventBridge rules to detect Macie findings and take action on the objects that contain sensitive data.
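As a starting point for that kind of automation, Macie publishes findings to EventBridge with the source `aws.macie` and the detail type `Macie Finding`. The sketch below shows an event pattern that matches those events, plus a small check that a Lambda target could apply to keep only sensitive data findings. The names `MACIE_FINDING_PATTERN` and `is_sensitive_data_finding` are hypothetical, and the rule name in the usage comment is a placeholder.

```python
import json

# Event pattern that matches every finding event that Macie publishes.
MACIE_FINDING_PATTERN = {
    "source": ["aws.macie"],
    "detail-type": ["Macie Finding"],
}


def is_sensitive_data_finding(event):
    """Return True for Macie events whose finding type starts with 'SensitiveData:'."""
    return (
        event.get("source") == "aws.macie"
        and event.get("detail-type") == "Macie Finding"
        and event.get("detail", {}).get("type", "").startswith("SensitiveData:")
    )


# Example rule creation (rule name is a placeholder):
#   import boto3
#   boto3.client("events").put_rule(
#       Name="<rule-name>",
#       EventPattern=json.dumps(MACIE_FINDING_PATTERN),
#   )
```

Filtering inside the function keeps the rule simple; you could instead narrow the event pattern itself if you only ever want a subset of finding types.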

Solution cleanup

In order to clean up the solution that you just deployed, complete the following steps. Note that you need to do these steps to stop data from being exported from DynamoDB to S3 every hour.

To perform cleanup

  1. Navigate to the S3 buckets used to import and export data. You can find the bucket names in the CloudFormation Outputs tab in the console, as shown in figure 7 and figure 25.
  2. After you’ve navigated to each of the buckets, delete all objects from the bucket.
  3. Navigate to the CloudFormation console, and then delete the CloudFormation stack named macie-blog. After the stack is deleted, the solution will no longer be deployed in your AWS account.
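If the buckets contain many objects, you might prefer to empty them with a script before you delete the stack. The following is a minimal sketch that assumes the buckets are not versioned; the function name is hypothetical, and `s3_client` is assumed to be a boto3 S3 client.

```python
def empty_bucket(s3_client, bucket):
    """Delete every object in a bucket and return the number deleted.

    Assumes an unversioned bucket. For a versioned bucket, object versions
    and delete markers would also have to be removed.
    """
    deleted = 0
    kwargs = {"Bucket": bucket}
    while True:
        page = s3_client.list_objects_v2(**kwargs)
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            # Each page holds at most 1,000 keys, the delete_objects limit.
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
        if not page.get("IsTruncated"):
            return deleted
        kwargs["ContinuationToken"] = page["NextContinuationToken"]


# Example call (bucket name is a placeholder):
#   import boto3
#   empty_bucket(boto3.client("s3"), "<bucket-name>")
```

Deletion is permanent, so double-check the bucket names from the CloudFormation Outputs tab before you run anything like this.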

Summary

After deploying the solution, we hope you have a better understanding of how you can use Macie to detect sensitive data from other data sources, such as DynamoDB, as outlined in this post. The following are links to resources that you can use to further expand your knowledge of Amazon Macie capabilities and features.

Additional resources

If you have feedback about this post, submit comments in the Comments section below.


Author

Sheldon Sides

Sheldon is a Senior Solutions Architect, focused on helping customers implement native AWS security services. He enjoys using his experience as a consultant and running a cloud security startup to help customers build secure AWS Cloud solutions. His interests include working out, software development, and learning about the latest technologies.

Use Macie to discover sensitive data as part of automated data pipelines

Post Syndicated from Brandon Wu original https://aws.amazon.com/blogs/security/use-macie-to-discover-sensitive-data-as-part-of-automated-data-pipelines/

Data is a crucial part of every business and is used for strategic decision making at all levels of an organization. To extract value from their data more quickly, Amazon Web Services (AWS) customers are building automated data pipelines—from data ingestion to transformation and analytics. As part of this process, my customers often ask how to prevent sensitive data, such as personally identifiable information, from being ingested into data lakes when it’s not needed. They highlight that this challenge is compounded when ingesting unstructured data—such as files from process reporting, text files from chat transcripts, and emails. They also mention that identifying sensitive data inadvertently stored in structured data fields—such as in a comment field stored in a database—is also a challenge.

In this post, I show you how to integrate Amazon Macie as part of the data ingestion step in your data pipeline. This solution provides an additional checkpoint that sensitive data has been appropriately redacted or tokenized prior to ingestion. Macie is a fully managed data security and privacy service that uses machine learning and pattern matching to discover sensitive data in AWS.

When Macie discovers sensitive data, the solution notifies an administrator to review the data and decide whether to allow the data pipeline to continue ingesting the objects. If allowed, the objects will be tagged with an Amazon Simple Storage Service (Amazon S3) object tag to identify that sensitive data was found in the object before progressing to the next stage of the pipeline.

This combination of automation and manual review helps reduce the risk that sensitive data—such as personally identifiable information—will be ingested into a data lake. This solution can be extended to fit your use case and workflows. For example, you can define custom data identifiers as part of your scans, add additional validation steps, create Macie suppression rules to archive findings automatically, or only request manual approvals for findings that meet certain criteria (such as high severity findings).

Solution overview

Many of my customers are building serverless data lakes with Amazon S3 as the primary data store. Their data pipelines commonly use different S3 buckets at each stage of the pipeline. I refer to the S3 bucket for the first stage of ingestion as the raw data bucket. A typical pipeline might have separate buckets for raw, curated, and processed data representing different stages as part of their data analytics pipeline.

Typically, customers validate and clean their data before moving it to a raw data zone. This solution adds validation steps to that pipeline after preliminary quality checks and data cleaning are performed, noted in blue (in layer 3) of Figure 1. The layers outlined in the pipeline are:

  1. Ingestion – Brings data into the data lake.
  2. Storage – Provides durable, scalable, and secure components to store the data—typically using S3 buckets.
  3. Processing – Transforms data into a consumable state through data validation, cleanup, normalization, transformation, and enrichment. This processing layer is where the additional validation steps are added to identify instances of sensitive data that haven’t been appropriately redacted or tokenized prior to consumption.
  4. Consumption – Provides tools to gain insights from the data in the data lake.

 

Figure 1: Data pipeline with sensitive data scan


The application runs on a scheduled basis (four times a day, every 6 hours by default) to process data that is added to the raw data S3 bucket. You can customize the application to perform a sensitive data discovery scan during any stage of the pipeline. Because most customers do their extract, transform, and load (ETL) daily, the application scans for sensitive data on a scheduled basis before any crawler jobs run to catalog the data and after typical validation and data redaction or tokenization processes complete.

You can expect that this additional validation will add 5–10 minutes to your pipeline execution at a minimum. The validation processing time will scale linearly based on object size, but there is a start-up time per job that is constant.

If sensitive data is found in the objects, an email is sent to the designated administrator requesting an approval decision, which they indicate by selecting the link corresponding to their decision to approve or deny the next step. In most cases, the reviewer will choose to adjust the sensitive data cleanup processes to remove the sensitive data, deny the progression of the files, and re-ingest the files in the pipeline.

Additional considerations for deploying this application for regular use are discussed at the end of the blog post.

Application components

The following resources are created as part of the application:

Note: the application uses various AWS services, and there are costs associated with these resources after the Free Tier usage. See AWS Pricing for details. The primary drivers of the solution cost will be the amount of data ingested through the pipeline, both for Amazon S3 storage and data processed for sensitive data discovery with Macie.

The architecture of the application is shown in Figure 2 and described in the text that follows.
 

Figure 2: Application architecture and logic


Application logic

  1. Objects are uploaded to the raw data S3 bucket as part of the data ingestion process.
  2. A scheduled EventBridge rule runs the sensitive data scan Step Functions workflow.
  3. triggerMacieScan Lambda function moves objects from the raw data S3 bucket to the scan stage S3 bucket.
  4. triggerMacieScan Lambda function creates a Macie sensitive data discovery job on the scan stage S3 bucket.
  5. checkMacieStatus Lambda function checks the status of the Macie sensitive data discovery job.
  6. isMacieStatusCompleteChoice Step Functions Choice state checks whether the Macie sensitive data discovery job is complete.
    1. If yes, the getMacieFindingsCount Lambda function runs.
    2. If no, the Step Functions Wait state waits 60 seconds and then restarts Step 5.
  7. getMacieFindingsCount Lambda function counts all of the findings from the Macie sensitive data discovery job.
  8. isSensitiveDataFound Step Functions Choice state checks whether sensitive data was found in the Macie sensitive data discovery job.
    1. If there was sensitive data discovered, run the triggerManualApproval Lambda function.
    2. If there was no sensitive data discovered, run the moveAllScanStageS3Files Lambda function.
  9. moveAllScanStageS3Files Lambda function moves all of the objects from the scan stage S3 bucket to the scanned data S3 bucket.
  10. triggerManualApproval Lambda function tags and moves objects with sensitive data discovered to the manual review S3 bucket, and moves objects with no sensitive data discovered to the scanned data S3 bucket. The function then sends a notification to the ApprovalRequestNotification Amazon SNS topic as a notification that manual review is required.
  11. Email is sent to the email address that’s subscribed to the ApprovalRequestNotification Amazon SNS topic (from the application deployment template) for the manual review user with the option to Approve or Deny pipeline ingestion for these objects.
  12. Manual review user assesses the objects with sensitive data in the manual review S3 bucket and selects the Approve or Deny links in the email.
  13. The decision request is sent from the Amazon API Gateway to the receiveApprovalDecision Lambda function.
  14. manualApprovalChoice Step Functions Choice state checks the decision from the manual review user.
    1. If denied, run the deleteManualReviewS3Files Lambda function.
    2. If approved, run the moveToScannedDataS3Files Lambda function.
  15. deleteManualReviewS3Files Lambda function deletes the objects from the manual review S3 bucket.
  16. moveToScannedDataS3Files Lambda function moves the objects from the manual review S3 bucket to the scanned data S3 bucket.
  17. The next step of the automated data pipeline will begin with the objects in the scanned data S3 bucket.
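The polling and branching in Steps 5 through 8 can be sketched in plain Python. This is a minimal illustration, not the sample application's code: the `get_job_status` and `count_findings` callables are hypothetical stand-ins for the checkMacieStatus and getMacieFindingsCount Lambda functions, and the status strings assume the Macie job status values `COMPLETE` and `CANCELLED`.

```python
import time
from typing import Callable


def run_scan_workflow(
    get_job_status: Callable[[], str],
    count_findings: Callable[[], int],
    poll_seconds: float = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return the name of the function that Steps 5-8 would run next."""
    # Steps 5-6: poll until the discovery job reports completion.
    while True:
        status = get_job_status()
        if status == "COMPLETE":
            break
        if status == "CANCELLED":
            raise RuntimeError("discovery job was cancelled")
        sleep(poll_seconds)  # stands in for the Step Functions Wait state

    # Steps 7-8: branch on whether the job produced any findings.
    if count_findings() > 0:
        return "triggerManualApproval"
    return "moveAllScanStageS3Files"
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without actually waiting 60 seconds between checks.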

Prerequisites

For this application, you need the following prerequisites:

You can use AWS Cloud9 to deploy the application. AWS Cloud9 includes the AWS CLI and AWS SAM CLI to simplify setting up your development environment.

Deploy the application with AWS SAM CLI

You can deploy this application using the AWS SAM CLI. AWS SAM uses AWS CloudFormation as the underlying deployment mechanism. AWS SAM is an open-source framework that you can use to build serverless applications on AWS.

To deploy the application

  1. Initialize the serverless application using the AWS SAM CLI from the GitHub project in the aws-samples repository. This clones the project locally, including the source code for the Lambda functions, the Step Functions state machine definition file, and the AWS SAM template. On the command line, run the following:
    sam init --location gh:aws-samples/amazonmacie-datapipeline-scan
    

    Alternatively, you can clone the GitHub project directly.

  2. Deploy your application to your AWS account. On the command line, run the following:
    sam deploy --guided
    

    Complete the prompts during the guided interactive deployment. The first deployment prompt is shown in the following example.

    Configuring SAM deploy
    ======================
    
            Looking for config file [samconfig.toml] :  Found
            Reading default arguments  :  Success
    
            Setting default arguments for 'sam deploy'
            =========================================
            Stack Name [maciepipelinescan]:
    

  3. Settings:
    • Stack Name – Name of the CloudFormation stack to be created.
    • AWS Region – Region (for example, us-west-2, eu-west-1, or ap-southeast-1) to deploy the application to. This application was tested in the us-west-2 and ap-southeast-1 Regions. Before selecting a Region, verify that the services you need are available in that Region (for example, Macie and Step Functions).
    • Parameter StepFunctionName – Name of the Step Functions state machine to be created (for example, maciepipelinescanstatemachine).
    • Parameter BucketNamePrefix – Prefix to apply to the S3 buckets to be created (S3 bucket names are globally unique, so choosing a random prefix helps ensure uniqueness).
    • Parameter ApprovalEmailDestination – Email address to receive the manual review notification.
    • Parameter EnableMacie – Whether you need Macie enabled in your account or Region. Select yes if you need Macie to be enabled for you as part of this template; select no if you already have Macie enabled.
  4. Confirm changes and provide approval for AWS SAM CLI to deploy the resources to your AWS account by responding y to prompts, as shown in the following example. You can accept the defaults for the SAM configuration file and SAM configuration environment prompts.
    #Shows you resources changes to be deployed and require a 'Y' to initiate deploy
    Confirm changes before deploy [y/N]: y
    #SAM needs permission to be able to create roles to connect to the resources in your template
    Allow SAM CLI IAM role creation [Y/n]: y
    ReceiveApprovalDecisionAPI may not have authorization defined, Is this okay? [y/N]: y
    ReceiveApprovalDecisionAPI may not have authorization defined, Is this okay? [y/N]: y
    Save arguments to configuration file [Y/n]: y
    SAM configuration file [samconfig.toml]: 
    SAM configuration environment [default]:
    

    Note: This application deploys an Amazon API Gateway with two REST API resources without authorization defined to receive the decision from the manual review step. You will be prompted to accept each resource without authorization. A token (Step Functions taskToken) is used to authenticate the requests.

  5. This creates an AWS CloudFormation changeset. Once the changeset creation is complete, you must provide a final confirmation of y to Deploy the changeset? [y/N] when prompted as shown in the following example.
    Changeset created successfully. arn:aws:cloudformation:ap-southeast-1:XXXXXXXXXXXX:changeSet/samcli-deploy1605213119/db681961-3635-4305-b1c7-dcc754c7XXXX
    
    
    Previewing CloudFormation changeset before deployment
    ======================================================
    Deploy this changeset? [y/N]:
    

Your application is deployed to your account using AWS CloudFormation. You can track the deployment events in the command prompt or via the AWS CloudFormation console.

After the application deployment is complete, you must confirm the subscription to the Amazon SNS topic. An email will be sent to the email address entered in Step 3 with a link that you need to select to confirm the subscription. This confirmation provides opt-in consent for AWS to send emails to you via the specified Amazon SNS topic. The emails will be notifications of potentially sensitive data that need to be approved. If you don’t see the verification email, be sure to check your spam folder.

Test the application

The application uses an EventBridge scheduled rule to start the sensitive data scan workflow, which runs every 6 hours. You can manually start an execution of the workflow to verify that it’s working. To test the function, you will need a file that contains data that matches your rules for sensitive data. For example, it is easy to create a spreadsheet, document, or text file that contains names, addresses, and numbers formatted like credit card numbers. You can also use this generated sample data to test Macie.

We will test by uploading a file to our S3 bucket via the AWS web console. If you know how to copy objects from the command line, that also works.

Upload test objects to the S3 bucket

  1. Navigate to the Amazon S3 console and upload one or more test objects to the <BucketNamePrefix>-data-pipeline-raw bucket. <BucketNamePrefix> is the prefix you entered when deploying the application in the AWS SAM CLI prompts. You can use any objects as long as they’re a supported file type for Amazon Macie. I suggest uploading multiple objects, some with and some without sensitive data, in order to see how the workflow processes each.
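If you prefer to script this step, the following sketch builds a small CSV test file and uploads it. It is only an illustration under stated assumptions: the names and addresses are made up, the card numbers are widely published test numbers, the object key `test/sample.csv` is arbitrary, and `s3_client` is assumed to be a boto3 S3 client (for example, `boto3.client("s3")`). Whether Macie reports a finding for this exact file depends on its detection rules, so treat the content as a starting point.

```python
import csv
import io


def build_sample_csv() -> str:
    """Build CSV text containing made-up names, addresses, and test card numbers."""
    rows = [
        ("name", "address", "card_number"),
        ("Jane Example", "100 Main Street, Anytown", "4111 1111 1111 1111"),
        ("John Sample", "200 Oak Avenue, Anytown", "5555 5555 5555 4444"),
    ]
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


def upload_sample(s3_client, bucket_prefix: str, key: str = "test/sample.csv") -> str:
    """Upload the sample file to the raw data bucket and return the bucket name."""
    bucket = f"{bucket_prefix}-data-pipeline-raw"
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=build_sample_csv().encode("utf-8"),
    )
    return bucket
```

A call such as `upload_sample(boto3.client("s3"), "<BucketNamePrefix>")` would place the file in the raw data bucket, assuming your credentials allow writing to it.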

Start the Scan State Machine

  1. Navigate to the Step Functions state machines console. If you don’t see your state machine, make sure you’re connected to the same Region that you deployed your application to.
  2. Choose the state machine you created using the AWS SAM CLI as seen in Figure 3. The example state machine is maciepipelinescanstatemachine, but you might have used a different name in your deployment.
     
    Figure 3: AWS Step Functions state machines console


  3. Select the Start execution button and copy the value from the Enter an execution name – optional box. Change the Input – optional value replacing <execution id> with the value just copied as follows:
    {
        "id": "<execution id>"
    }
    

    In my example, the <execution id> is fa985a4f-866b-b58b-d91b-8a47d068aa0c from the Enter an execution name – optional box as shown in Figure 4. You can choose a different ID value if you prefer. This ID is used by the workflow to tag the objects being processed to ensure that only objects that are scanned continue through the pipeline. When the EventBridge scheduled event starts the workflow as scheduled, an ID is included in the input to the Step Functions workflow. Then select Start execution again.
     

    Figure 4: New execution dialog box


  4. You can see the status of your workflow execution in the Graph inspector as shown in Figure 5. In the figure, the workflow is at the pollForCompletionWait step.
     
    Figure 5: AWS Step Functions graph inspector
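The console steps above can also be approximated from a script. The sketch below assumes `sfn_client` is a boto3 Step Functions client and that you supply the ARN of your deployed state machine; it reuses one generated value as both the execution name and the `id` field in the input, mirroring what the console steps do by hand.

```python
import json
import uuid
from typing import Optional


def start_scan(sfn_client, state_machine_arn: str, execution_id: Optional[str] = None) -> dict:
    """Start the scan workflow with an input of the form {"id": "<execution id>"}."""
    # Generate an ID when none is given, as the console does for the execution name.
    execution_id = execution_id or str(uuid.uuid4())
    return sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        name=execution_id,
        input=json.dumps({"id": execution_id}),
    )
```

Using the same value for the name and the `id` is a convenience, not a requirement; as noted above, you can choose a different ID value if you prefer.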


The sensitive data discovery job should run for about five to ten minutes. Job duration scales linearly with object size, but each job has a constant start-up time. If sensitive data is found in the objects uploaded to the <BucketNamePrefix>-data-pipeline-raw S3 bucket, an email is sent to the address provided during the AWS SAM deployment step, notifying the recipient that an approval decision is needed. The recipient indicates the decision by selecting the Approve or Deny link in the email, as shown in Figure 6.
 

Figure 6: Sensitive data identified email


When you receive this notification, you can investigate the findings by reviewing the objects in the <BucketNamePrefix>-data-pipeline-manual-review S3 bucket. Based on your review, you can either apply remediation steps to remove any sensitive data or allow the data to proceed to the next step of the data ingestion pipeline. You should define a standard response process to address discovery of sensitive data in the data pipeline. Common remediation steps include review of the files for sensitive data, deleting the files that you do not want to progress, and updating the ETL process to redact or tokenize sensitive data when re-ingesting into the pipeline. When you re-ingest the files into the pipeline without sensitive data, the files will not be flagged by Macie.

The workflow performs the following:

  • If you select Approve, the files are moved to the <BucketNamePrefix>-data-pipeline-scanned-data S3 bucket with an Amazon S3 SensitiveDataFound object tag with a value of true.
  • If you select Deny, the files are deleted from the <BucketNamePrefix>-data-pipeline-manual-review S3 bucket.
  • If no action is taken, the Step Functions workflow execution times out after five days and the file will automatically be deleted from the <BucketNamePrefix>-data-pipeline-manual-review S3 bucket after 10 days.
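The Approve and Deny outcomes can be sketched as follows. This is an illustration of the behavior described above, not the code of the moveToScannedDataS3Files or deleteManualReviewS3Files functions: `s3_client` is assumed to be a boto3 S3 client, and the function name, the `decision` strings, and the parameter names are hypothetical.

```python
from typing import Iterable


def apply_decision(
    s3_client,
    decision: str,
    keys: Iterable[str],
    review_bucket: str,
    scanned_bucket: str,
) -> None:
    """Move approved objects to the scanned data bucket, or delete denied ones."""
    if decision not in ("approve", "deny"):
        raise ValueError(f"unknown decision: {decision}")

    for key in keys:
        if decision == "approve":
            # Copy the object forward, then mark it as containing sensitive data.
            s3_client.copy_object(
                Bucket=scanned_bucket,
                Key=key,
                CopySource={"Bucket": review_bucket, "Key": key},
            )
            # Note: put_object_tagging replaces any existing tag set on the object.
            s3_client.put_object_tagging(
                Bucket=scanned_bucket,
                Key=key,
                Tagging={"TagSet": [{"Key": "SensitiveDataFound", "Value": "true"}]},
            )
        # In both cases the object leaves the manual review bucket.
        s3_client.delete_object(Bucket=review_bucket, Key=key)
```

S3 has no native move operation, so a move is expressed here as a copy followed by a delete of the source object.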

Clean up the application

You’ve successfully deployed and tested the sensitive data pipeline scan workflow. To avoid ongoing charges for resources you created, you should delete all associated resources by deleting the CloudFormation stack. In order to delete the CloudFormation stack, you must first delete all objects that are stored in the S3 buckets that you created for the application.

To delete the application

  1. Empty the S3 buckets created in this application (<BucketNamePrefix>-data-pipeline-raw, <BucketNamePrefix>-data-pipeline-scan-stage, <BucketNamePrefix>-data-pipeline-manual-review, and <BucketNamePrefix>-data-pipeline-scanned-data).
  2. Delete the CloudFormation stack used to deploy the application.

Considerations for regular use

Before using this application in a production data pipeline, you will need to stop and consider some practical matters. First, the notification mechanism used when sensitive data is identified in the objects is email. Email doesn’t scale: you should expand this solution to integrate with your ticketing or workflow management system. If you choose to use email, subscribe a mailing list so that the work of reviewing and responding to alerts is shared across a team.

Second, the application is run on a scheduled basis (every 6 hours by default). You should consider starting the application when your preliminary validations have completed and the data is ready for a sensitive data scan as part of your pipeline. You can modify the EventBridge rule to run in response to an Amazon EventBridge event instead of on a schedule.

Third, the application currently uses a 60-second Step Functions Wait state when polling for the Macie discovery job completion. In real-world scenarios, the discovery scan will take 10 minutes at a minimum, and likely much longer. You should evaluate the typical execution times for your application and tune the polling period accordingly. This will help reduce costs related to running Lambda functions and log storage within CloudWatch Logs. The polling period is defined in the Step Functions state machine definition file (macie_pipeline_scan.asl.json) under the pollForCompletionWait state.
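As a rough sketch of that tuning, the helper below rewrites the wait time in a state machine definition. It assumes that pollForCompletionWait is a top-level entry under `States` and that it is a Wait state using the standard Amazon States Language `Seconds` field; check your copy of macie_pipeline_scan.asl.json before relying on either assumption.

```python
import json


def set_poll_seconds(
    definition_json: str,
    seconds: int,
    state_name: str = "pollForCompletionWait",
) -> str:
    """Return the definition with the named Wait state's Seconds value replaced."""
    definition = json.loads(definition_json)
    state = definition["States"][state_name]
    if state.get("Type") != "Wait":
        raise ValueError(f"{state_name} is not a Wait state")
    state["Seconds"] = seconds
    return json.dumps(definition, indent=2)
```

For example, reading the file, calling `set_poll_seconds(text, 300)`, and writing the result back would change the polling period to five minutes before you redeploy.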

Fourth, the application currently doesn’t account for false positives in the sensitive data discovery job results. Also, the application will progress or delete all objects identified based on the decision by the reviewer. You should consider expanding the application to handle false positives through automation rather than manual review / intervention (such as deleting the files from the manual review bucket or removing the sensitive data tags applied).

Last, the solution will stop the ingestion of a subset of objects into your pipeline. This behavior is similar to other validation and data quality checks that most customers perform as part of the data pipeline. However, you should test to ensure that this will not cause unexpected outcomes and address them in your downstream application logic accordingly.

Conclusion

In this post, I showed you how to integrate sensitive data discovery using Macie as an additional validation step in an automated data pipeline. You’ve reviewed the components of the application, deployed it using the AWS SAM CLI, tested to validate that the application functions as expected, and cleaned up by removing deployed resources.

You now know how to integrate sensitive data scanning into your ETL pipeline. You can use automation and, where required, manual review to help reduce the risk of sensitive data, such as personally identifiable information, being inadvertently ingested into a data lake. You can take this application and customize it to fit your use case and workflows, such as using custom data identifiers as part of your scans, adding additional validation steps, creating Macie suppression rules to define cases to archive findings automatically, or only requesting manual approvals for findings that meet certain criteria (such as high severity findings).

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon Macie forum.


Author

Brandon Wu

Brandon is a security solutions architect helping financial services organizations secure their critical workloads on AWS. In his spare time, he enjoys exploring outdoors and experimenting in the kitchen.

Discover sensitive data by using custom data identifiers with Amazon Macie

Post Syndicated from Kayla Jing original https://aws.amazon.com/blogs/security/discover-sensitive-data-by-using-custom-data-identifiers-with-amazon-macie/

As you put more and more data in the cloud, you need to rely on security automation to keep it secure at scale. AWS recently launched Amazon Macie, a fully managed service that uses machine learning and pattern matching to help you detect, classify, and better protect your sensitive data stored in the AWS Cloud.

Many data breaches are not the result of malicious activity from unauthorized users, but rather from mistakes made by authorized users. To monitor and manage the security of sensitive data, you must first be able to identify it. In this post, we show you how to use custom data identifiers with Macie to identify sensitive data. Once you know what’s sensitive, you can start designing security controls that operate at scale to monitor and remediate risk automatically.

Macie comes with a set of managed data identifiers that you can use to discover many types of sensitive data. These are somewhat generic and broadly applicable to many organizations. What makes Macie unique is its ability to help you address specific data needs. Macie enables you to expand your sensitive data detection through the new custom data identifiers. Custom data identifiers can be used to highlight organizational proprietary data, intellectual property, and specific scenarios.

Custom Data Identifiers in Macie help you find and identify sensitive data based on your own organization’s specific needs. In this post, we show you a step-by-step walkthrough of how to define and run custom data identifiers to automatically discover specific, sensitive data. Before you begin using Custom Data Identifiers, you need to enable Macie and configure detailed logging. Follow these instructions to enable Macie and follow these instructions to configure detailed logging, if you haven’t done that already.

When to use the Custom Data Identifier resource

To begin, imagine you’re an IT administrator for a manufacturing company that’s headquartered in France. Your company has acquired a few additional local subsidiaries, including an R&D facility in São Paulo, Brazil. The company is migrating to AWS, and in the process is classifying registration information, employee information, and product data into encrypted and non-encrypted storage.

You want to identify sensitive data for the following three scenarios:

  • SIRET-NIC: SIRET-NIC is a unique number assigned to businesses in France. This number is issued by their National Institute of Statistics (INSEE) when a business is registered. A sample file that contains SIRET-NIC information is shown in the following figure. Each record in the file includes the GUID, employee name, employee email, the company name, the date it was issued, and the SIRET-NIC number.

    Figure 1: SIRET-NIC dataset


  • Brazil CPF (Cadastro de Pessoas Físicas – Natural Persons Register): CPF is a unique number assigned by the Brazilian revenue agency to people subject to taxes in the country. Each of your employees residing in the Brazilian office has a CPF.
  • Prototyping naming convention: Your company has products that are publicly available, but also products that are still in the prototyping stage and should be kept confidential. A sample file that contains Brazil CPF numbers and the prototype names is shown in the following figure.

    Figure 2: Brazil CPF and prototype number dataset


Configure the Custom Data Identifier resource in the Macie console

To use custom data identifiers to identify your organization’s sensitive information, you must:

  1. Create custom data identifiers.
  2. Create a job to scan your Amazon Simple Storage Service (Amazon S3) bucket to locate the data patterns that match your custom data identifiers.
  3. Respond to the returned results.

The following steps introduce you to the Custom Data Identifier resource in Macie.

Designing Custom Data Identifiers for use with Amazon Macie

In the previous section, you identified three scenarios that your company would like to protect: SIRET-NIC, Brazil CPF, and your prototyping naming convention. You first need to create a specific regex pattern for each of these scenarios. There are different syntaxes and dialects of regular expression languages. Amazon Macie supports a subset of the Perl Compatible Regular Expressions (PCRE) library, and you can learn more about it in the Regex support in custom data identifiers section. Once the patterns are ready, follow the instructions below to create the custom data identifiers.

Creating Custom Data Identifiers in Amazon Macie

  1. Sign in to the AWS Management Console.
  2. Enter Amazon Macie in the AWS services search box.
  3. Choose Amazon Macie.
  4. In the navigation pane on the left-hand side, under Settings, choose Custom data identifiers as shown in the following figure.

    Figure 3: Custom data identifiers console


Create a custom data identifier

  1. Choose Create on the custom data identifier console.
  2. Name: Enter a name for your custom data identifier. Make it descriptive so you know what it does. For example, enter SIRET-NIC for the SIRET-NIC number you use.
  3. Description: Enter a description of the custom data identifier.
  4. Regular expression (regex): Define the pattern you want to identify. Use a regular expression (“regex”) to create the desired pattern. For example, a SIRET-NIC number contains 14 digits: 9 digits followed by a hyphen and then 5 more digits. The first 9 digits can be written together or separated by spaces into 3 groups of 3. The specific regex pattern for this is \b(\d{3}\s?){2}\d{3}\-\d{5}\b
  5. Keywords: Define expressions that identify the text to match. The SIRET-NIC number itself is publicly accessible information. But in your case, you want to encrypt the information about the company that was registered during the month the acquisition happened (April 2020), so that the information doesn’t leak to your competitors. So, the keywords here will be all the days in April.
  6. (Optional) Ignore words: Use this box to enter text that you want to be ignored. In this example scenario, you know your security training materials always use the example SIRET-NICs 123456789-12345 and 000000000-00000. You can enter these values here, so that your security training materials are not flagged as sensitive data containing SIRET-NICs.
  7. Maximum match distance: Use this box to define the proximity between the result and the keywords. If you enter 20, Macie reports only matches that have one of the specified keywords within 20 characters of the matched text.

Note: Do not select Submit yet. After entering the settings and before selecting Submit, you should test your custom data identifier with sample data to confirm that it works.

With all the attributes set, your console will look like what is shown in Figure 4.

Figure 4: SIRET-NIC custom data identifier creation
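The same settings can also be expressed as a request for the Macie `CreateCustomDataIdentifier` API. The sketch below only builds the request dictionary. Two things in it are assumptions for illustration: the keyword date format (`%Y-%m-%d`), which should be changed to whatever format your data actually uses, and the description text. The ignore words are passed in by the caller.

```python
from datetime import date, timedelta
from typing import List, Sequence

# Pattern from the walkthrough: 9 digits (optionally in 3 groups), hyphen, 5 digits.
SIRET_NIC_REGEX = r"\b(\d{3}\s?){2}\d{3}\-\d{5}\b"


def april_2020_keywords(fmt: str = "%Y-%m-%d") -> List[str]:
    """Return one keyword per day in April 2020, using an assumed date format."""
    start = date(2020, 4, 1)
    return [(start + timedelta(days=offset)).strftime(fmt) for offset in range(30)]


def siret_nic_identifier_request(ignore_words: Sequence[str]) -> dict:
    """Build keyword arguments for create_custom_data_identifier."""
    return {
        "name": "SIRET-NIC",
        "description": "SIRET-NIC numbers issued in April 2020",  # illustrative text
        "regex": SIRET_NIC_REGEX,
        "keywords": april_2020_keywords(),
        "ignoreWords": list(ignore_words),
        "maximumMatchDistance": 20,
    }
```

With a boto3 Macie client (`boto3.client("macie2")`), the request could then be submitted as `client.create_custom_data_identifier(**siret_nic_identifier_request(["000000000-00000"]))`.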


Test your SIRET-NIC custom data identifier

Use the Evaluate section on the right-hand panel of the Macie console to confirm that the regex pattern and other configurations for your custom data identifier are correct.

Follow the steps below to use the Evaluate section.

  1. Enter test data in the sample data box.
  2. Select Submit. If the configurations are correct and your custom data identifier is ready, there will be one match per record in the file. The following figure is an example of the Evaluate section using test data. The test data has 3 records, and each record has 6 fields: GUID, employee name, employee email, company name, date the SIRET-NIC was issued, and the SIRET-NIC number.

    Figure 5: Evaluate, showing sample data


  3. After verifying your SIRET-NIC custom data identifier works in the Evaluate section, now select Submit on the New custom data identifier window to create the custom data identifier.
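You can also sanity-check the pattern locally before pasting it into the console. The sketch below uses Python's `re` module, which is not identical to the PCRE subset that Macie supports, so treat it as a quick pre-check rather than a substitute for the Evaluate section. The sample values and the ignore-word handling (skipping any match that contains an ignore word) are illustrative assumptions.

```python
import re
from typing import List, Sequence

SIRET_NIC = re.compile(r"\b(\d{3}\s?){2}\d{3}\-\d{5}\b")


def find_siret_nic(text: str, ignore_words: Sequence[str] = ()) -> List[str]:
    """Return matched SIRET-NIC strings, skipping matches that contain an ignore word."""
    matches = []
    for match in SIRET_NIC.finditer(text):
        value = match.group(0)
        if any(word in value for word in ignore_words):
            continue
        matches.append(value)
    return matches


if __name__ == "__main__":
    sample = "issued 2020-04-15, SIRET-NIC 732 829 320-00074; training value 000000000-00000"
    print(find_siret_nic(sample, ignore_words=["000000000-00000"]))
```

Both the grouped form (`732 829 320-00074`) and the ungrouped form (`732829320-00074`) match, while a value with too few or too many leading digits does not.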

Create a Brazil CPF Custom Data Identifier

Congrats on creating your first custom data identifier! Now use the same steps to create and test custom data identifiers for the Brazil CPF and prototyping naming convention scenarios. The Brazil CPF number usually shows up in the format of 000.000.000-00.

Use the following values for the Brazil CPF scenario, as shown in the following figure:

  • Name: Brazil CPF
  • Description: The format for Brazil CPF in our sample data is 000.000.000-00
  • Regular expression: \b(\d{3}\.){2}\d{3}\-\d{2}\b

    Figure 6: Brazil CPF custom data identifier
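As with the SIRET-NIC pattern, a quick local check can confirm that the CPF pattern accepts the 000.000.000-00 format and rejects near misses. This sketch uses Python's `re` module, which may differ in details from the regex subset Macie supports, and the sample numbers are made up.

```python
import re
from typing import List

BRAZIL_CPF = re.compile(r"\b(\d{3}\.){2}\d{3}\-\d{2}\b")


def find_cpf(text: str) -> List[str]:
    """Return all substrings formatted like a Brazil CPF number."""
    return [match.group(0) for match in BRAZIL_CPF.finditer(text)]


if __name__ == "__main__":
    print(find_cpf("CPF: 123.456.789-09, backup 987.654.321-00"))
    print(find_cpf("unformatted 123456789-09"))
```

The pattern checks only the format. It does not validate the CPF check digits, so any correctly formatted number is reported.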


Create a Prototype Name Custom Data Identifier

Assume that your company has a very strict and regular naming scheme for prototype part numbers. It is P, followed by a hyphen, and then 2 letters and 4 digits (for example, P-AB1234). You want to identify objects in S3 that contain references to private prototype parts. This is a small pattern, so if we’re not careful it will cause Macie to flag objects that do not actually contain one of our prototype numbers. We suggest adding \b at the beginning and the end of the regular expression. The \b symbol means a “word boundary”: the position between a word character (a letter, digit, or underscore) and anything else, such as whitespace or punctuation. With \b, you limit the pattern so that you only match if the entire word matches the pattern. For example, P-AB1234 will match the pattern, but STEP-AB123456 and P-XY123 will not match the pattern. This gives you finer grained control and reduces false positives.

Use the following values for the prototyping name scenario, as shown in the following figure:

  • Name: Prototyping Naming
  • Description: Any prototype name start with P means it’s private. The format for private prototype name is P-2 capital letters and 4 numbers
  • Regular expression: \bP\-[A-Z]{2}\d{4}\b
Figure 7: Prototyping naming custom data identifier
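The effect of the word boundaries can be checked with the three examples from the text. The sketch below uses Python's `re` module as a stand-in for Macie's regex engine, and also shows a hypothetical variant of the pattern without \b to illustrate the false positive it would produce.

```python
import re

WITH_BOUNDARIES = re.compile(r"\bP\-[A-Z]{2}\d{4}\b")
WITHOUT_BOUNDARIES = re.compile(r"P\-[A-Z]{2}\d{4}")  # for comparison only


def contains_prototype_name(text: str, pattern: "re.Pattern[str]" = WITH_BOUNDARIES) -> bool:
    """Report whether the text contains a prototype name under the given pattern."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    for sample in ("P-AB1234", "STEP-AB123456", "P-XY123"):
        print(
            sample,
            contains_prototype_name(sample),
            contains_prototype_name(sample, WITHOUT_BOUNDARIES),
        )
```

Without the boundaries, STEP-AB123456 is reported because the substring P-AB1234 appears inside it; with them, only the standalone P-AB1234 is reported.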


You should now see a page like the following figure, indicating that the SIRET-NIC, Brazil CPF, and Prototyping Naming custom data identifiers are successfully configured.

Figure 8: Successfully configured custom data identifier


Set up a Test Bucket to Demonstrate Macie

Before we can see Macie do its work, we have to create a bucket with some test data that we can scan. We’ve provided some sample data files that you can download. Follow these instructions to create a test bucket and load our test data into the test bucket.

  1. Download the sample data and unzip it.
  2. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.
  3. Choose Create bucket. The Create bucket wizard opens.
  4. In Bucket name, enter a DNS-compliant name for your bucket. The bucket name must:
    • Be unique across all of Amazon S3.
    • Be between 3 and 63 characters long.
    • Not contain uppercase characters.
    • Start with a lowercase letter or number.

    We created a bucket called bucketformacieuse; you must choose a different name because bucket names are unique across all of Amazon S3 and this one is already in use.

  5. In Region, choose the AWS Region where you want the bucket to reside.
  6. Select Create to finish creating the bucket.
  7. Open the bucket you just created and upload the two sample data files you downloaded in step 1.
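
The naming rules in step 4 can be checked locally before you try to create the bucket. The sketch below covers only the three rules listed above that can be tested without calling AWS; uniqueness across Amazon S3 can only be confirmed by the service, and S3 enforces further rules that this sketch does not cover. The function name check_bucket_name is hypothetical.

```python
import re


def check_bucket_name(name: str) -> list[str]:
    """Return a list of problems with a proposed bucket name.

    An empty list means the name passes the locally checkable rules
    from step 4. It does not prove the name is available.
    """
    problems = []
    if not 3 <= len(name) <= 63:
        problems.append("must be between 3 and 63 characters long")
    if any(ch.isupper() for ch in name):
        problems.append("must not contain uppercase characters")
    if not re.match(r"[a-z0-9]", name):
        problems.append("must start with a lowercase letter or number")
    return problems


if __name__ == "__main__":
    for candidate in ["bucketformacieuse", "MyBucket", "ab", "-bucket"]:
        print(candidate, "->", check_bucket_name(candidate) or "ok")
```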

Use Macie to create a job to scan your data

Now you can create a job to scan your Amazon S3 bucket to detect and locate the data patterns defined in the SIRET-NIC, Brazil CPF, and Prototyping Naming custom data identifiers.

To create a job

  1. In the navigation pane, choose Jobs, and then select Create Job on the upper right.
  2. Select Amazon S3 buckets: Select the S3 bucket you want to analyze. In this case, we are using the bucket previously created, bucketformacieuse.
  3. Review Amazon S3 buckets: Verify that you selected the S3 bucket you want the job to scan and analyze.
  4. Scope: Select your scope. For this example, choose the One-time job option as your scope. The scope specifies how often you want the job to run. This can be either a one-time job or a scheduled job. If you choose a scheduled job, you can define how often you want your job to scan your Amazon S3 bucket.
  5. Custom data identifiers: Select the three custom data identifiers you created to associate them with this job, and then select Next. This is shown in the following figure.

    Figure 9: Select your custom data identifiers

  6. Name and description: Enter a name and description for the job.
  7. Review and create: Review and verify all your settings, and then select Create.

You now have a job in Macie that scans the Amazon S3 bucket you chose, using the three custom data identifiers you created. More information about creating jobs is available in Running sensitive data discovery jobs in Amazon Macie.
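
The same one-time job can also be created programmatically. The following is a minimal sketch, assuming the boto3 macie2 client and its create_classification_job operation. The account ID, identifier IDs, and job name are placeholders, and the helper names build_job_request and create_job are hypothetical.

```python
import uuid


def build_job_request(account_id, bucket_name, identifier_ids, name, description=None):
    """Assemble keyword arguments for a one-time Macie classification job."""
    request = {
        "clientToken": str(uuid.uuid4()),  # idempotency token for the request
        "jobType": "ONE_TIME",             # matches the One-time job scope in step 4
        "name": name,
        "customDataIdentifierIds": list(identifier_ids),
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": [bucket_name]}
            ]
        },
    }
    if description:
        request["description"] = description
    return request


def create_job(macie_client, **job_args):
    """Send the request with a macie2 client and return the new job's ID."""
    response = macie_client.create_classification_job(**build_job_request(**job_args))
    return response["jobId"]


if __name__ == "__main__":
    # Placeholder values; replace them with your own account, bucket, and IDs.
    print(build_job_request(
        account_id="111122223333",
        bucket_name="bucketformacieuse",
        identifier_ids=["example-id-1", "example-id-2", "example-id-3"],
        name="custom-identifier-scan",
    ))
```

To run it against your account, you would pass boto3.client("macie2") as macie_client, with credentials that are allowed to create Macie jobs.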

Respond to results

Macie helps you stay secure only when you respond effectively to the findings that it produces. For our example, we’ll show you how to review your findings manually. You can look at your findings by bucket, type, or job, or see a collective summary of all findings. In this example, let’s look at all findings.

To review your results

  1. In the navigation pane on the left-hand side, choose Findings. Findings include the severity, the type, the resources affected, and when the findings were last updated.
  2. The following figure shows an example of the results you might see on the findings page. There are two findings for the selected job. The compagnie_français.csv and the empresa_brasileira.csv files contain data that matches the custom data identifiers that you created earlier and added to the job.

    Figure 10: Findings

  3. Let’s look at the details of one of the findings so you can review the results. From the page showing the findings, select the file that contains matches for your Brazil CPF custom data identifier: empresa_brasileira.csv. The number of matches found in the file for each custom data identifier is shown in the Result section on the right, as shown in the following figure.

    Figure 11: Findings detail page for the Brazil CPF custom data identifiers

  4. Now look at the findings details for the compagnie_français.csv file. It shows the number of custom data identifiers found in the file. In this case Macie found 13 SIRET-NIC numbers as shown in the following figure.

    Figure 12: Findings page for the French company file

  5. If you configured detailed logging, the results will be saved in the Amazon S3 bucket you specified. The S3 bucket location can be found in the Details section after Detailed result location as shown in the preceding figure.
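
If you export findings as JSON instead of reading them in the console, the same per-identifier counts can be pulled out with a few lines of code. This is a sketch that assumes the finding follows the Macie sensitive data finding layout (resourcesAffected and classificationDetails.result.customDataIdentifiers.detections). The sample finding is a hand-written illustration that mirrors step 4, not real output, and summarize_finding is a hypothetical helper.

```python
def summarize_finding(finding: dict) -> dict:
    """Extract the bucket, object key, and per-identifier match counts."""
    resources = finding.get("resourcesAffected", {})
    result = finding.get("classificationDetails", {}).get("result", {})
    detections = result.get("customDataIdentifiers", {}).get("detections", [])
    return {
        "bucket": resources.get("s3Bucket", {}).get("name"),
        "key": resources.get("s3Object", {}).get("key"),
        "counts": {d["name"]: d["count"] for d in detections},
    }


if __name__ == "__main__":
    # Illustrative finding, trimmed to the fields this sketch reads.
    sample_finding = {
        "resourcesAffected": {
            "s3Bucket": {"name": "bucketformacieuse"},
            "s3Object": {"key": "compagnie_français.csv"},
        },
        "classificationDetails": {
            "result": {
                "customDataIdentifiers": {
                    "detections": [{"name": "SIRET-NIC", "count": 13}],
                    "totalCount": 13,
                }
            }
        },
    }
    print(summarize_finding(sample_finding))
```

Using .get with empty defaults keeps the helper from failing on findings that have no custom data identifier detections, such as policy findings.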

Now that you’ve used Macie and the Custom Data Identifiers resource to obtain these findings, you can identify what data to place in encrypted storage, and what can be placed in non-encrypted storage when migrating to AWS. Macie and custom data identifiers provide an automated tool to help you enhance protection of your sensitive data by providing you the information to help detect and classify your data in the AWS Cloud.

Using Macie at scale

Custom data identifiers help you tell Macie what to look for. As you move more data to the cloud, you’ll need to create new identifiers and new rules. As the number of rules and identifiers grows, you will need automation that responds to what Macie finds. For example, a Lambda function could turn on encryption for a bucket when Macie finds sensitive data in it. Or a function could automatically apply tags to buckets where sensitive data is found, so that those buckets and their owners start to appear on audit and compliance reports. Once you’ve done this at small scale, think about how you will automate responses at larger scale.
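
As one possible shape for the tagging idea, the sketch below shows a Lambda handler that reads the bucket name from a Macie finding delivered by EventBridge and tags that bucket. It assumes the finding arrives under the event's detail key and that the function's role allows s3:PutBucketTagging. The tag key SensitiveData and the helper names are hypothetical.

```python
def bucket_from_event(event: dict):
    """Return the affected bucket name from a Macie finding event, or None."""
    detail = event.get("detail", {})
    return detail.get("resourcesAffected", {}).get("s3Bucket", {}).get("name")


def tag_bucket_as_sensitive(event: dict, s3_client):
    """Tag the bucket named in the finding; return the bucket name or None."""
    bucket = bucket_from_event(event)
    if bucket is None:
        return None
    # put_bucket_tagging replaces the bucket's entire tag set. A production
    # function should read the existing tags first and merge them in.
    s3_client.put_bucket_tagging(
        Bucket=bucket,
        Tagging={"TagSet": [{"Key": "SensitiveData", "Value": "true"}]},
    )
    return bucket


def lambda_handler(event, context):
    """Lambda entry point; boto3 is imported here so the helpers stay testable."""
    import boto3

    return tag_bucket_as_sensitive(event, boto3.client("s3"))
```

Passing the S3 client in as a parameter lets you exercise the tagging logic with a stub client before you deploy the function.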

Conclusion

The new Custom Data Identifier resource in the newly enhanced Macie can help you detect, classify, and protect sensitive data types unique to your organization. This post focused on the functionality and use of custom data identifiers to automatically discover sensitive data stored in Amazon S3. You can also review the managed data identifiers to see a list of personally identifiable information (PII) that Macie can detect by default. Visit What is Amazon Macie? to learn more.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon Macie forum or contact AWS Support.

Author

Kayla Jing

Kayla is a Solutions Architect at Amazon Web Services based out of Seattle. She has experience in data science with a focus on Data Analytics and Machine Learning.

Author

Joshua Choung

Joshua is a Solutions Architect based out of Seattle. He works with customers to provide architectural and technical guidance and training on their AWS cloud journey.

Author

Laura Reith

Laura is a Solutions Architect at Amazon Web Services. Before AWS, she worked as a Solutions Architect in Taiwan focusing on physical security and retail analytics.