Amazon Macie is a fully managed data security service that uses machine learning and pattern matching to discover and help protect your sensitive data, such as personally identifiable information (PII), payment card data, and Amazon Web Services (AWS) credentials. Analyzing large volumes of data for sensitive information can be expensive because of the compute-intensive operations involved.
Macie offers several capabilities that help customers reduce the cost of discovering sensitive data, including automated data discovery, which can reduce your spend by using new data sampling techniques that are custom-built for Amazon Simple Storage Service (Amazon S3). In this post, we walk through these capabilities and related best practices so that you can discover sensitive data with Macie cost-efficiently.
Overview of the Macie pricing plan
Let’s do a quick recap of how customers pay for the Macie service. With Macie, you are charged based on three dimensions: the number of S3 buckets evaluated for bucket inventory and monitoring, the number of S3 objects monitored for automated data discovery, and the quantity of data inspected for sensitive data discovery. You can read more about these dimensions on the Macie pricing page.
The majority of the cost incurred by customers is driven by the quantity of data inspected for sensitive data discovery. For this reason, we will limit the scope of this post to techniques that you can use to optimize the quantity of data that you scan with Macie.
Not all security use cases require the same quantity of data for scanning
Broadly speaking, you can scan your data in two ways: run full scans or sample a portion of it. Which method to use depends on your use cases and business objectives. Full scans are useful when you have already identified what you want to scan. A few examples of such use cases are scanning buckets that are open to the internet, monitoring a bucket for unintentionally added sensitive data by scanning every new object, and performing analysis on a bucket after a security incident.
The other option is to use sampling techniques for sensitive data discovery. This method is useful when security teams want to reduce data security risks without inspecting every object, because partial knowledge is often enough to act on. For example, just knowing that an S3 bucket contains credit card numbers is enough information to prioritize it for remediation activities.
Macie offers both options, and you can discover sensitive data either by creating and running sensitive data discovery jobs that perform full scans on targeted locations, or by configuring Macie to perform automated sensitive data discovery for your account or organization. You can also use both options simultaneously in Macie.
Use automated data discovery as a best practice
Automated data discovery in Macie reduces the quantity of data that needs to be scanned to a fraction of your S3 estate.
When you enable Macie for the first time, automated data discovery is enabled by default. If you already use Macie in your organization, you can enable automated data discovery in the management console of the Amazon Macie administrator account. This capability automatically starts discovering sensitive data in your S3 buckets and builds a sensitive data profile for each bucket. The profiles are organized in a visual, interactive data map, and you can use the data map to identify data security risks that need immediate attention.
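If you manage Macie programmatically, you can also turn this capability on outside the console. The following is a minimal sketch, assuming the boto3 macie2 client and the UpdateAutomatedDiscoveryConfiguration API; run it from your Macie administrator account and verify the parameters against the current API reference.

import boto3

# Sketch: enable automated sensitive data discovery from the Macie
# administrator account.
macie = boto3.client('macie2')

# Turn the capability on; pass status='DISABLED' to turn it off again.
macie.update_automated_discovery_configuration(status='ENABLED')

# Confirm the current configuration.
print(macie.get_automated_discovery_configuration())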
Automated data discovery in Macie evaluates the sensitivity level of each of your buckets by using intelligent, fully managed data sampling techniques that minimize the quantity of data scanning needed. During evaluation, objects with similar S3 metadata, such as bucket names, object-key prefixes, file-type extensions, and storage class, are organized into groups that are likely to have similar content. Macie then selects small, but representative, samples from each identified group of objects and scans them to detect the presence of sensitive data. Macie has a feedback loop that uses the results of previously scanned samples to prioritize the next set of samples to inspect.
The automated sensitive data discovery feature is designed to detect sensitive data at scale across hundreds of buckets and accounts, which makes it easier to identify the S3 buckets that need to be prioritized for more focused scanning. Because the amount of data that needs to be scanned is reduced, this task can be done at fraction of the cost of running a full data inspection across all your S3 buckets. The Macie console displays the scanning results as a heat map (Figure 1), which shows the consolidated information grouped by account, and whether a bucket is sensitive, not sensitive, or not analyzed yet.
Figure 1: A heat map showing the results of automated sensitive data discovery
There is a 30-day free trial period when you enable automated data discovery for your AWS account. During the trial period, the Macie console shows the estimated cost of running automated sensitive data discovery after the trial period ends. After the trial period, you are charged based on the total quantity of S3 objects in your account, as well as the bytes that are scanned for sensitive content. Charges are prorated per day. You can disable this capability at any time.
Tune your monthly spend on automated sensitive data discovery
To further reduce your monthly spend on automated sensitive data discovery, Macie allows you to exclude buckets from automated discovery. For example, you might consider excluding buckets that are used for storing operational logs, if you're sure they don't contain any sensitive information. Your monthly spend is reduced roughly by the percentage of data in those excluded buckets relative to your total S3 estate.
Figure 2 shows the setting in the heatmap area of the Macie console that you can use to exclude a bucket from automated discovery.
Figure 2: Excluding buckets from automated sensitive data discovery from the heatmap
You can also use the automated data discovery settings page to specify multiple buckets to be excluded, as shown in Figure 3.
Figure 3: Excluding buckets from the automated sensitive data discovery settings page
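If you prefer to script these exclusions, the classification scope APIs in the macie2 client (ListClassificationScopes and UpdateClassificationScope) are intended for this purpose. The following is a rough sketch only; the bucket name is a placeholder, and the request shape and response fields should be verified against the current Macie API reference.

import boto3

macie = boto3.client('macie2')

# Assumption: the account has a single classification scope that governs
# automated sensitive data discovery.
scope_id = macie.list_classification_scopes()['classificationScopes'][0]['id']

# Add a log bucket (placeholder name) to the exclusion list so that
# automated discovery no longer samples objects from it.
macie.update_classification_scope(
    id=scope_id,
    s3={
        'excludes': {
            'bucketNames': ['DOC-EXAMPLE-LOG-BUCKET'],
            'operation': 'ADD'
        }
    }
)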
How to run targeted, cost-efficient sensitive data discovery jobs
Making your sensitive data discovery jobs more targeted makes them more cost-efficient, because it reduces the quantity of data scanned. Consider using the following strategies:
Make your sensitive data discovery jobs as targeted and specific as possible in their scope by using the Object criteria settings on the Refine the scope page, shown in Figure 4.
Figure 4: Adjusting the scope of a sensitive data discovery job
Options to make discovery jobs more targeted include the following; a short API sketch after Figure 5 shows how several of these criteria can be combined:
Include objects by using the “last modified” criterion — If you know how frequently your classifiable S3-hosted objects are modified, and you want to scan only the resources that changed after a particular point in time, use the “last modified” criterion to include in your scope the objects that were modified on or after a certain date or time.
Consider using random object sampling — With this option, you specify the percentage of eligible S3 objects that you want Macie to analyze when a sensitive data discovery job runs. If this value is less than 100%, Macie selects eligible objects to analyze at random, up to the specified percentage, and analyzes the data in those objects. If your data is highly consistent and you want to determine whether a specific S3 bucket, rather than each object, contains sensitive information, adjust the sampling depth accordingly.
Include objects with specific extensions, tags, or storage size — To fine-tune the scope of a sensitive data discovery job, you can also define custom criteria that determine which S3 objects Macie includes in or excludes from a job’s analysis. These criteria consist of one or more conditions that derive from properties of S3 objects. For example, you can include or exclude objects with specific file name extensions, objects with specific tags, or objects above or below a certain storage size. You can also use a criteria-based job to scan the buckets associated with specific tag key/value pairs, such as Environment: Production.
Specify S3 bucket criteria in your job — Use a criteria-based job to scan only buckets that have public read/write access. For example, if you have 100 buckets with 10 TB of data, but only two of those buckets containing 100 GB are public, you could reduce your overall Macie cost by 99% by using a criteria-based job to classify only the public buckets.
Consider scheduling jobs based on how long objects live in your S3 buckets. Running jobs more frequently than needed can result in unnecessary costs when objects are added and deleted frequently. For example, if you’ve determined that the S3 objects involved contain high-velocity data that is expected to reside in your S3 bucket for only a few days, and you’re concerned that sensitive data might remain, scheduling your jobs to run at a lower frequency will help drive down costs. In addition, you can deselect the Include existing objects checkbox to scan only new objects.
Figure 5: Specifying the frequency of a sensitive data discovery job
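To illustrate how several of the preceding options fit together, the following sketch creates a one-time discovery job with the boto3 macie2 CreateClassificationJob API. The account ID, bucket name, date, and sampling percentage are placeholders, and the scoping structure should be checked against the current API reference before use.

import boto3

macie = boto3.client('macie2')

# Hypothetical targeted job: scan roughly 10 percent of the objects in one
# bucket, and only objects modified on or after January 1, 2023.
macie.create_classification_job(
    jobType='ONE_TIME',
    name='targeted-discovery-example',
    samplingPercentage=10,  # random object sampling
    s3JobDefinition={
        'bucketDefinitions': [
            {
                'accountId': '111122223333',        # placeholder account
                'buckets': ['DOC-EXAMPLE-BUCKET']   # placeholder bucket
            }
        ],
        'scoping': {
            'includes': {
                'and': [
                    {
                        # "Last modified" criterion
                        'simpleScopeTerm': {
                            'comparator': 'GTE',
                            'key': 'OBJECT_LAST_MODIFIED_DATE',
                            'values': ['2023-01-01T00:00:00Z']
                        }
                    }
                ]
            }
        }
    }
)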
As a best practice, review your scheduled jobs periodically to verify that they are still meaningful to your organization. If you aren’t sure whether one of your periodic jobs continues to be fit for purpose, you can pause it so that you can investigate whether it is still needed, without incurring potentially unnecessary costs in the meantime. If you determine that a periodic job is no longer required, you can cancel it completely.
Figure 6: Pausing a scheduled sensitive data discovery job
If you don’t know where to start to make your jobs more targeted, use the results of Macie automated data discovery to plan your scanning strategy. Start with small buckets and the ones that have policy findings associated with them.
In multi-account environments, you can monitor Macie usage across your organization in AWS Organizations through the Usage page of the delegated administrator account. This enables you to identify member accounts that are incurring higher costs than expected, so that you can investigate and take appropriate actions to keep expenditure low.
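One way to review per-account spend programmatically is the GetUsageStatistics API, called from the delegated administrator account. The following sketch assumes the boto3 macie2 client and the response fields as we understand them (records, usage, estimatedCost); verify them against the current API reference.

import boto3

macie = boto3.client('macie2')

# Summarize estimated Macie costs per member account for the default time range.
response = macie.get_usage_statistics()
for record in response.get('records', []):
    total = sum(float(item['estimatedCost']) for item in record.get('usage', []))
    print(f"{record.get('accountId')}: estimated ${total:.2f}")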
Take advantage of the Macie pricing calculator to estimate your Macie fees in advance.
Conclusion
In this post, we highlighted the best practices to keep in mind and configuration options to use when you discover sensitive data with Amazon Macie. We hope that you will walk away with a better understanding of when to use the automated data discovery capability and when to run targeted sensitive data discovery jobs. You can use the pointers in this post to tune the quantity of data you want to scan with Macie, so that you can continuously optimize your Macie spend.
Given the pace of Amazon Web Services (AWS) innovation, it can be challenging to stay up to date on the latest AWS service and feature launches. AWS provides services and tools to help you protect your data, accounts, and workloads from unauthorized access. AWS data protection services provide encryption capabilities, key management, and sensitive data discovery. Last year, we saw growth and evolution in AWS data protection services as we continue to give customers features and controls to help meet their needs. Protecting data in the AWS Cloud is a top priority because we know you trust us to help protect your most critical and sensitive asset: your data. This post will highlight some of the key AWS data protection launches in the last year that security professionals should be aware of.
AWS Key Management Service: Create and control keys to encrypt or digitally sign your data
In April, AWS Key Management Service (AWS KMS) launched hash-based message authentication code (HMAC) APIs. This feature introduced the ability to create AWS KMS keys that can be used to generate and verify HMACs. HMACs are a powerful cryptographic building block that incorporate symmetric key material within a hash function to create a unique keyed message authentication code. HMACs provide a fast way to tokenize or sign data such as web API requests, credit card numbers, bank routing information, or personally identifiable information (PII). This technology is used to verify the integrity and authenticity of data and communications. HMACs are often a higher performing alternative to asymmetric cryptographic methods like RSA or elliptic curve cryptography (ECC) and should be used when both message senders and recipients can use AWS KMS.
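As a simple illustration of the HMAC APIs, the following sketch creates an HMAC KMS key, generates a MAC over a message, and verifies it. The key description and message are placeholders; treat this as a minimal example rather than a production pattern.

import boto3

kms = boto3.client('kms')

# Create a symmetric HMAC key (HMAC_256) that can generate and verify MACs.
key_id = kms.create_key(
    KeySpec='HMAC_256',
    KeyUsage='GENERATE_VERIFY_MAC',
    Description='Example HMAC key'
)['KeyMetadata']['KeyId']

message = b'example-api-request-payload'

# Generate a keyed MAC for the message.
mac = kms.generate_mac(
    KeyId=key_id,
    Message=message,
    MacAlgorithm='HMAC_SHA_256'
)['Mac']

# Later, verify that the message has not been tampered with.
result = kms.verify_mac(
    KeyId=key_id,
    Message=message,
    Mac=mac,
    MacAlgorithm='HMAC_SHA_256'
)
print(result['MacValid'])  # True if the MAC matches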
At AWS re:Invent in November, AWS KMS introduced the External Key Store (XKS), a new feature for customers who want to protect their data with encryption keys that are stored in an external key management system under their control. This capability gives customers new flexibility to encrypt or decrypt data with cryptographic keys, combined with independent authorization and auditing, in an external key management system outside of AWS. XKS can help you address compliance needs where encryption keys for regulated workloads must be outside AWS and solely under your control. To provide customers with a broad range of external key manager options, AWS KMS developed the XKS specification with feedback from leading key management and hardware security module (HSM) manufacturers, as well as service providers that can help customers deploy and integrate XKS into their AWS projects.
AWS Nitro System: A combination of dedicated hardware and a lightweight hypervisor enabling faster innovation and enhanced security
In November, we published The Security Design of the AWS Nitro System whitepaper. The AWS Nitro System is a combination of purpose-built server designs, data processors, system management components, and specialized firmware that serves as the underlying virtualization technology that powers all Amazon Elastic Compute Cloud (Amazon EC2) instances launched since early 2018. This new whitepaper provides you with a detailed design document that covers the inner workings of the AWS Nitro System and how it is used to help secure your most critical workloads. The whitepaper discusses the security properties of the Nitro System, provides a deeper look into how it is designed to eliminate the possibility of AWS operator access to a customer’s EC2 instances, and describes its passive communications design and its change management process. Finally, the paper surveys important aspects of the overall system design of Amazon EC2 that provide mitigations against potential side-channel vulnerabilities that can exist in generic compute environments.
AWS Secrets Manager: Centrally manage the lifecycle of secrets
In February, AWS Secrets Manager added the ability to schedule secret rotations within specific time windows. Previously, Secrets Manager supported automated rotation of secrets within the last 24 hours of a specified rotation interval. This new feature added the ability to limit a given secret rotation to specific hours on specific days of a rotation interval. This helps you avoid having to choose between the convenience of managed rotations and the operational safety of application maintenance windows. In November, Secrets Manager also added the capability to rotate secrets as often as every four hours, while providing the same managed rotation experience.
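For example, a rotation window can be expressed with a schedule expression and a duration on the RotateSecret call. The following sketch is illustrative only; the secret name, Lambda function ARN, and schedule are placeholders.

import boto3

secretsmanager = boto3.client('secretsmanager')

# Rotate the secret during a maintenance window: the first Sunday of each
# month, starting at 01:00 UTC, within a two-hour window.
secretsmanager.rotate_secret(
    SecretId='example/app/db-credentials',                      # placeholder
    RotationLambdaARN='arn:aws:lambda:us-east-1:111122223333:'
                      'function:ExampleRotationFunction',       # placeholder
    RotationRules={
        'ScheduleExpression': 'cron(0 1 ? * SUN#1 *)',
        'Duration': '2h'
    }
)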
In May, Secrets Manager started publishing secrets usage metrics to Amazon CloudWatch. With this feature, you have a streamlined way to view how many secrets you are using in Secrets Manager over time. You can also set alarms for an unexpected increase or decrease in the number of secrets.
At the end of December, Secrets Manager added support for managed credential rotation for service-linked secrets. This feature helps eliminate the need for you to manage rotation Lambda functions and enables you to set up rotation without additional configuration. Amazon Relational Database Service (Amazon RDS) has integrated with this feature to streamline how you manage your master user password for your RDS database instances. Using this feature can improve your database’s security by preventing the RDS master user password from being visible during the database creation workflow. Amazon RDS fully manages the master user password’s lifecycle and stores it in Secrets Manager whenever your RDS database instances are created, modified, or restored. To learn more about how to use this feature, see Improve security of Amazon RDS master database credentials using AWS Secrets Manager.
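When you create or modify a database instance programmatically, this integration is a single parameter. The following sketch assumes the boto3 RDS client and uses placeholder identifiers; it shows the ManageMasterUserPassword option rather than a complete production configuration.

import boto3

rds = boto3.client('rds')

# Create a small example instance whose master user password is generated,
# stored, and rotated by Secrets Manager instead of being supplied here.
rds.create_db_instance(
    DBInstanceIdentifier='example-instance',   # placeholder
    DBInstanceClass='db.t3.micro',
    Engine='mysql',
    AllocatedStorage=20,
    MasterUsername='admin',
    ManageMasterUserPassword=True              # no MasterUserPassword needed
)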
AWS Private Certificate Authority: Create private certificates to identify resources and protect data
In September, AWS Private Certificate Authority (AWS Private CA) launched as a standalone service. AWS Private CA was previously a feature of AWS Certificate Manager (ACM). One goal of this launch was to help customers differentiate between ACM and AWS Private CA. ACM and AWS Private CA have distinct roles in the process of creating and managing the digital certificates used to identify resources and secure network communications over the internet, in the cloud, and on private networks. This launch coincided with the launch of an updated console for AWS Private CA, which includes accessibility improvements to enhance screen reader support and additional tab key navigation for people with motor impairment.
In October, AWS Private CA introduced a short-lived certificate mode, a lower-cost mode of AWS Private CA that is designed for issuing short-lived certificates. With this new mode, public key infrastructure (PKI) administrators, builders, and developers can save money when issuing certificates where a validity period of 7 days or fewer is desired. To learn more about how to use this feature, see How to use AWS Private Certificate Authority short-lived certificate mode.
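The usage mode is selected when the CA is created. The following sketch assumes the boto3 acm-pca client; the subject name is a placeholder, and a production hierarchy would typically use a subordinate CA plus revocation and permissions configuration.

import boto3

acmpca = boto3.client('acm-pca')

# Create a private CA in short-lived certificate mode, intended for issuing
# certificates with validity periods of 7 days or fewer.
response = acmpca.create_certificate_authority(
    CertificateAuthorityConfiguration={
        'KeyAlgorithm': 'EC_prime256v1',
        'SigningAlgorithm': 'SHA256WITHECDSA',
        'Subject': {'CommonName': 'example-short-lived-ca'}   # placeholder
    },
    CertificateAuthorityType='ROOT',
    UsageMode='SHORT_LIVED_CERTIFICATE'
)
print(response['CertificateAuthorityArn'])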
Additionally, AWS Private CA supported the launches of certificate-based authentication with Amazon AppStream 2.0 and Amazon WorkSpaces to remove the logon prompt for the Active Directory domain password. AppStream 2.0 and WorkSpaces certificate-based authentication integrates with AWS Private CA to automatically issue short-lived certificates when users sign in to their sessions. When you configure your private CA as a third-party root CA in Active Directory or as a subordinate to your Active Directory Certificate Services enterprise CA, AppStream 2.0 or WorkSpaces with AWS Private CA can enable rapid deployment of end-user certificates to seamlessly authenticate users. To learn more about how to use this feature, see How to use AWS Private Certificate Authority short-lived certificate mode.
AWS Certificate Manager: Provision and manage SSL/TLS certificates with AWS services and connected resources
In early November, ACM launched the ability to request and use Elliptic Curve Digital Signature Algorithm (ECDSA) P-256 and P-384 TLS certificates to help secure your network traffic. You can use ACM to request ECDSA certificates and associate the certificates with AWS services like Application Load Balancer or Amazon CloudFront. Previously, you could only request certificates with an RSA 2048 key algorithm from ACM. Now, AWS customers who need to use TLS certificates with at least 120-bit security strength can use these ECDSA certificates to help meet their compliance needs. The ECDSA certificates have a higher security strength—128 bits for P-256 certificates and 192 bits for P-384 certificates—when compared to 112-bit RSA 2048 certificates that you can also issue from ACM. The smaller file footprint of ECDSA certificates makes them ideal for use cases with limited processing capacity, such as small Internet of Things (IoT) devices.
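Requesting an ECDSA certificate looks the same as requesting an RSA certificate, with the key algorithm specified explicitly. The following sketch uses a placeholder domain name and assumes DNS validation.

import boto3

acm = boto3.client('acm')

# Request a public P-256 (EC_prime256v1) certificate; use EC_secp384r1 for P-384.
response = acm.request_certificate(
    DomainName='www.example.com',        # placeholder domain
    ValidationMethod='DNS',
    KeyAlgorithm='EC_prime256v1'
)
print(response['CertificateArn'])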
Amazon Macie: Discover and protect your sensitive data at scale
Amazon Macie introduced two major features at AWS re:Invent. The first is a new capability that allows for one-click, temporary retrieval of up to 10 samples of sensitive data found in Amazon Simple Storage Service (Amazon S3). With this new capability, you can more readily view and understand which contents of an S3 object were identified as sensitive, so you can review, validate, and quickly take action as needed without having to review every object that a Macie job returned. Sensitive data samples captured with this new capability are encrypted by using customer-managed AWS KMS keys and are temporarily viewable within the Amazon Macie console after retrieval.
Additionally, Amazon Macie introduced automated sensitive data discovery, a new feature that provides continual, cost-efficient, organization-wide visibility into where sensitive data resides across your Amazon S3 estate. With this capability, Macie automatically samples and analyzes objects across your S3 buckets, inspecting them for sensitive data such as personally identifiable information (PII) and financial data; builds an interactive data map of where your sensitive data in S3 resides across accounts; and provides a sensitivity score for each bucket. Macie uses multiple automated techniques, including resource clustering by attributes such as bucket name, file types, and prefixes, to minimize the data scanning needed to uncover sensitive data in your S3 buckets. This helps you continuously identify and remediate data security risks without manual configuration and lowers the cost to monitor for and respond to data security risks.
Support for new open source encryption libraries
In February, we announced the availability of s2n-quic, an open source Rust implementation of the QUIC protocol, in our AWS encryption open source libraries. QUIC is a transport layer network protocol used by many web services to provide lower latencies than classic TCP. AWS has long supported open source encryption libraries for network protocols; in 2015 we introduced s2n-tls, a library that implements the TLS protocol. The name s2n is short for signal to noise and is a nod to the act of encryption—disguising meaningful signals, like your critical data, as seemingly random noise. Similar to s2n-tls, s2n-quic is designed to be small and fast, with simplicity as a priority. It is written in Rust, so it gains some of the benefits of that programming language, such as performance along with thread and memory safety.
Cryptographic computing for AWS Clean Rooms (preview)
At re:Invent, we also announced AWS Clean Rooms, currently in preview, which includes a cryptographic computing feature that allows you to run a subset of queries on encrypted data. Clean rooms help customers and their partners to match, analyze, and collaborate on their combined datasets—without sharing or revealing underlying data. If you have data handling policies that require encryption of sensitive data, you can pre-encrypt your data by using a common collaboration-specific encryption key so that data is encrypted even when queries are run. With cryptographic computing, data that is used in collaborative computations remains encrypted at rest, in transit, and in use (while being processed).
With AWS, you control your data by using powerful AWS services and tools to determine where your data is stored, how it is secured, and who has access to it. In 2023, we will further the AWS Digital Sovereignty Pledge, our commitment to offering AWS customers the most advanced set of sovereignty controls and features available in the cloud.
You can join us at our security learning conference, AWS re:Inforce 2023, in Anaheim, CA, June 13–14, for the latest advancements in AWS security, compliance, identity, and privacy solutions.
AWS re:Invent returned to Las Vegas, Nevada, November 28 to December 2, 2022. After a virtual event in 2020 and a hybrid 2021 edition, spirits were high as over 51,000 in-person attendees returned to network and learn about the latest AWS innovations.
Now in its 11th year, the conference featured 5 keynotes, 22 leadership sessions, and more than 2,200 breakout sessions and hands-on labs at 6 venues over 5 days.
With well over 100 service and feature announcements—and innumerable best practices shared by AWS executives, customers, and partners—distilling highlights is a challenge. From a security perspective, three key themes emerged.
Turn data into actionable insights
Security teams are always looking for ways to increase visibility into their security posture and uncover patterns to make more informed decisions. However, as AWS Vice President of Data and Machine Learning, Swami Sivasubramanian, pointed out during his keynote, data often exists in silos; it isn’t always easy to analyze or visualize, which can make it hard to identify correlations that spark new ideas.
“Data is the genesis for modern invention.” – Swami Sivasubramanian, AWS VP of Data and Machine Learning
At AWS re:Invent, we launched new features and services that make it simpler for security teams to store and act on data. One such service is Amazon Security Lake, which brings together security data from cloud, on-premises, and custom sources in a purpose-built data lake stored in your account. The service, which is now in preview, automates the sourcing, aggregation, normalization, enrichment, and management of security-related data across an entire organization for more efficient storage and query performance. It empowers you to use the security analytics solutions of your choice, while retaining control and ownership of your security data.
Amazon Security Lake has adopted the Open Cybersecurity Schema Framework (OCSF), which AWS cofounded with a number of organizations in the cybersecurity industry. The OCSF helps standardize and combine security data from a wide range of security products and services, so that it can be shared and ingested by analytics tools. More than 37 AWS security partners have announced integrations with Amazon Security Lake, enhancing its ability to transform security data into a powerful engine that helps drive business decisions and reduce risk. With Amazon Security Lake, analysts and engineers can gain actionable insights from a broad range of security data and improve threat detection, investigation, and incident response processes.
Strengthen security programs
According to Gartner, by 2026, at least 50% of C-Level executives will have performance requirements related to cybersecurity risk built into their employment contracts. Security is top of mind for organizations across the globe, and as AWS CISO CJ Moses emphasized during his leadership session, we are continuously building new capabilities to help our customers meet security, risk, and compliance goals.
In addition to Amazon Security Lake, several new AWS services announced during the conference are designed to make it simpler for builders and security teams to improve their security posture in multiple areas.
Identity and networking
Authorization is a key component of applications. Amazon Verified Permissions is a scalable, fine-grained permissions management and authorization service for custom applications that simplifies policy-based access for developers and centralizes access governance. The new service gives developers a simple-to-use policy and schema management system to define and manage authorization models. The policy-based authorization system that Amazon Verified Permissions offers can shorten development cycles by months, provide a consistent user experience across applications, and facilitate integrated auditing to support stringent compliance and regulatory requirements.
Additional services that make it simpler to define authorization and service communication include Amazon VPC Lattice, an application-layer service that consistently connects, monitors, and secures communications between your services, and AWS Verified Access, which provides secure access to corporate applications without a virtual private network (VPN).
Threat detection and monitoring
Monitoring for malicious activity and anomalous behavior just got simpler. Amazon GuardDuty RDS Protection expands the threat detection capabilities of GuardDuty by using tailored machine learning (ML) models to detect suspicious logins to Amazon Aurora databases. You can enable the feature with a single click in the GuardDuty console, with no agents to manually deploy, no data sources to enable, and no permissions to configure. When RDS Protection detects a potentially suspicious or anomalous login attempt that indicates a threat to your database instance, GuardDuty generates a new finding with details about the potentially compromised database instance. You can view GuardDuty findings in AWS Security Hub, Amazon Detective (if enabled), and Amazon EventBridge, allowing for integration with existing security event management or workflow systems.
To bolster vulnerability management processes, Amazon Inspector now supports AWS Lambda functions, adding automated vulnerability assessments for serverless compute workloads. With this expanded capability, Amazon Inspector automatically discovers eligible Lambda functions and identifies software vulnerabilities in application package dependencies used in the Lambda function code. Actionable security findings are aggregated in the Amazon Inspector console, and pushed to Security Hub and EventBridge to automate workflows.
Data protection and privacy
The first step to protecting data is to find it. Amazon Macie now automatically discovers sensitive data, providing continual, cost-effective, organization-wide visibility into where sensitive data resides across your Amazon Simple Storage Service (Amazon S3) estate. With this new capability, Macie automatically and intelligently samples and analyzes objects across your S3 buckets, inspecting them for sensitive data such as personally identifiable information (PII), financial data, and AWS credentials. Macie then builds and maintains an interactive data map of your sensitive data in S3 across your accounts and Regions, and provides a sensitivity score for each bucket. This helps you identify and remediate data security risks without manual configuration and reduce monitoring and remediation costs.
Encryption is a critical tool for protecting data and building customer trust. The launch of the end-to-end encrypted enterprise communication service AWS Wickr offers advanced security and administrative controls that can help you protect sensitive messages and files from unauthorized access, while working to meet data retention requirements.
Management and governance
Maintaining compliance with regulatory, security, and operational best practices as you provision cloud resources is key. AWS Config rules, which evaluate the configuration of your resources, have now been extended to support proactive mode, so that they can be incorporated into infrastructure-as-code continuous integration and continuous delivery (CI/CD) pipelines to help identify noncompliant resources prior to provisioning. This can significantly reduce time spent on remediation.
Managing the controls needed to meet your security objectives and comply with frameworks and standards can be challenging. To make it simpler, we launched comprehensive controls management with AWS Control Tower. You can use it to apply managed preventative, detective, and proactive controls to accounts and organizational units (OUs) by service, control objective, or compliance framework. You can also use AWS Control Tower to turn on Security Hub detective controls across accounts in an OU. This new set of features reduces the time that it takes to define and manage the controls required to meet specific objectives, such as supporting the principle of least privilege, restricting network access, and enforcing data encryption.
Do more with less
As we work through macroeconomic conditions, security leaders are facing increased budgetary pressures. In his opening keynote, AWS CEO Adam Selipsky emphasized the effects of the pandemic, inflation, supply chain disruption, energy prices, and geopolitical events that continue to impact organizations.
Now more than ever, it is important to maintain your security posture despite resource constraints. Citing specific customer examples, Selipsky underscored how the AWS Cloud can help organizations move faster and more securely. By moving to the cloud, agricultural machinery manufacturer Agco reduced costs by 78% while increasing data retrieval speed, and multinational HVAC provider Carrier Global experienced a 40% reduction in the cost of running mission-critical ERP systems.
“If you’re looking to tighten your belt, the cloud is the place to do it.” – Adam Selipsky, AWS CEO
Security teams can do more with less by maximizing the value of existing controls, and bolstering security monitoring and analytics capabilities. Services and features announced during AWS re:Invent—including Amazon Security Lake, sensitive data discovery with Amazon Macie, support for Lambda functions in Amazon Inspector, Amazon GuardDuty RDS Protection, and more—can help you get more out of the cloud and address evolving challenges, no matter the economic climate.
Security is our top priority
AWS re:Invent featured many more highlights on a variety of topics, such as Amazon EventBridge Pipes and the pre-announcement of GuardDuty EKS Runtime protection, as well as Amazon CTO Dr. Werner Vogels’ keynote, and the security partnerships showcased on the Expo floor. It was a whirlwind week, but one thing is clear: AWS is working harder than ever to make our services better and to collaborate on solutions that ease the path to proactive security, so that you can focus on what matters most—your business.
Amazon Macie is a fully managed data security service that uses machine learning and pattern matching to help you discover and protect sensitive data in Amazon Simple Storage Service (Amazon S3). With Macie, you can analyze objects in your S3 buckets to detect occurrences of sensitive data, such as personally identifiable information (PII), financial information, personal health information, and access credentials.
In this post, we walk you through a solution to gain comprehensive and organization-wide visibility into which types of sensitive data are present in your S3 storage, where the data is located, and how much is present. Once enabled, Macie automatically starts discovering sensitive data in your S3 storage and builds a sensitive data profile for each bucket. The profiles are organized in a visual, interactive data map, and you can use the data map to run targeted sensitive data discovery jobs. Both automated data discovery and targeted jobs produce rich, detailed sensitive data discovery results. This solution uses Amazon Athena and Amazon QuickSight to deep-dive on the Macie results, and to help you analyze, visualize, and report on sensitive data discovered by Macie, even when the data is distributed across millions of objects, thousands of S3 buckets, and thousands of AWS accounts. Athena is an interactive query service that makes it simpler to analyze data directly in Amazon S3 using standard SQL. QuickSight is a cloud-scale business intelligence tool that connects to multiple data sources, including Athena databases and tables.
This solution is relevant to data security, data governance, and security operations engineering teams.
The challenge: how to summarize sensitive data discovered in your growing S3 storage
Macie issues findings when an object is found to contain sensitive data. In addition to findings, Macie keeps a record of each S3 object analyzed in a bucket of your choice for long-term storage. These records are known as sensitive data discovery results, and they include additional context about your data in Amazon S3. Due to the large size of the results file, Macie exports the sensitive data discovery results to an S3 bucket, so you need to take additional steps to query and visualize the results. We discuss the differences between findings and results in more detail later in this post.
With the increasing number of data privacy guidelines and compliance mandates, customers need to scale their monitoring to encompass thousands of S3 buckets across their organization. The growing volume of data to assess, and the growing list of findings from discovery jobs, can make it difficult to review and remediate issues in a timely manner. In addition to viewing individual findings for specific objects, customers need a way to comprehensively view, summarize, and monitor sensitive data discovered across their S3 buckets.
To illustrate this point, we ran a Macie sensitive data discovery job on a dataset created by AWS. The dataset contains about 7,500 files that have sensitive information, and Macie generated a finding for each sensitive file analyzed, as shown in Figure 1.
Figure 1: Macie findings from the dataset
Your security team could spend days, if not months, analyzing these individual findings manually. Instead, we outline how you can use Athena and QuickSight to query and visualize the Macie sensitive data discovery results to understand your data security posture.
The additional information in the sensitive data discovery results will help you gain comprehensive visibility into your data security posture. With this visibility, you can answer questions such as the following:
What are the top 5 most commonly occurring sensitive data types?
Which AWS accounts have the most findings?
How many S3 buckets are affected by each of the sensitive data types?
Your security team can write their own customized queries to answer questions such as the following:
Is there sensitive data in AWS accounts that are used for development purposes?
Is sensitive data present in S3 buckets that previously did not contain sensitive information?
Was there a change in configuration for S3 buckets containing the greatest amount of sensitive data?
Findings versus results
Findings provide a report of potential policy violations for an S3 bucket, or the presence of sensitive data in a specific S3 object. Each finding provides a severity rating, information about the affected resource, and additional details, such as when Macie found the issue. Findings are published to the Macie console, AWS Security Hub, and Amazon EventBridge.
In contrast, results are a collection of records for each S3 object that a Macie job analyzed. These records contain information about objects that do and do not contain sensitive data, including up to 1,000 occurrences of each sensitive data type that Macie found in a given object, and whether Macie was unable to analyze an object because of issues such as permissions settings or use of an unsupported format. If an object contains sensitive data, the results record includes detailed information that isn’t available in the finding for the object.
One of the key benefits of querying results is to uncover gaps in your data protection initiatives—these gaps can occur when data in certain buckets can’t be analyzed because Macie was denied access to those buckets, or was unable to decrypt specific objects. The following table maps some of the key differences between findings and results.
| | Findings | Results |
| Enabled by default | Yes | No |
| Location of published results | Macie console, Security Hub, and EventBridge | S3 bucket |
| Details of S3 objects that couldn't be scanned | No | Yes |
| Details of S3 objects in which no sensitive data was found | No | Yes |
| Identification of files inside compressed archives that contain sensitive data | No | Yes |
| Number of occurrences reported per object | Up to 15 | Up to 1,000 |
| Retention period | 90 days in Macie console | Defined by customer |
Architecture
As shown in Figure 2, you can build out the solution in three steps:
Enable the results and publish them to an S3 bucket
Build out the Athena table to query the results by using SQL
Visualize the results with QuickSight
Figure 2: Architecture diagram showing the flow of the solution
Prerequisites
To implement the solution in this blog post, you must first complete the following prerequisites:
To follow along with the examples in this post, download the sample dataset. The dataset is a single .ZIP file that contains three directories (fk, rt, and mkro). For this post, we used three accounts in our organization, created an S3 bucket in each of them, and then copied each directory to an individual bucket, as shown in Figure 3.
Figure 3: Sample data loaded into three different AWS accounts
Note: All data in this blog post has been artificially created by AWS for demonstration purposes and has not been collected from any individual person. Similarly, such data does not, nor is it intended to, relate back to any individual person.
Step 1: Enable the results and publish them to an S3 bucket
Publication of the discovery results to Amazon S3 is not enabled by default. The setup requires that you specify an S3 bucket to store the results (we also refer to this as the discovery results bucket), and use an AWS Key Management Service (AWS KMS) key to encrypt the bucket.
If you are analyzing data across multiple accounts in your organization, then you need to enable the results in your delegated Macie administrator account. You do not need to enable results in individual member accounts. However, if you’re running Macie jobs in a standalone account, then you should enable the Macie results directly in that account.
Open the Amazon Macie console, and select the AWS Region from the upper right of the page.
From the left navigation pane, select Discovery results.
Select Configure now.
Select Create Bucket, and enter a unique bucket name. This will be the discovery results bucket name. Make note of this name because you will use it when you configure the Athena tables later in this post.
Under Encryption settings, select Create new key. This takes you to the AWS KMS console in a new browser tab.
In the AWS KMS console, do the following:
For Key type, choose symmetric, and for Key usage, choose Encrypt and Decrypt.
Enter a meaningful key alias (for example, macie-results-key) and description.
(Optional) For simplicity, set your current user or role as the Key Administrator.
Set your current user/role as a user of this key in the key usage permissions step. This will give you the right permissions to run the Athena queries later.
Return to the Macie console browser tab, and from the AWS KMS Key dropdown, select the new key.
To view KMS key policy statements that were automatically generated for your specific key, account, and Region, select View Policy. Copy these statements in their entirety to your clipboard.
Navigate back to the browser tab with the AWS KMS console and then do the following:
Select Customer managed keys.
Choose the KMS key that you created, choose Switch to policy view, and under Key policy, select Edit.
In the key policy, paste the statements that you copied. When you add the statements, do not delete any existing statements and make sure that the syntax is valid. Policies are in JSON format.
Review the inputs in the Settings page for Discovery results and then choose Save. Macie will perform a check to make sure that it has the right access to the KMS key, and then it will create a new S3 bucket with the required permissions.
If you haven’t run a Macie discovery job in the last 90 days, you will need to run a new discovery job to publish the results to the bucket.
In this step, you created a new S3 bucket and KMS key that you are using only for Macie. For instructions on how to enable and configure the results using existing resources, see Storing and retaining sensitive data discovery results with Amazon Macie. Make sure to review Macie pricing details before creating and running a sensitive data discovery job.
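If you prefer to configure the export location programmatically rather than through the console steps above, the PutClassificationExportConfiguration API accepts the same bucket and KMS key. The following sketch uses placeholder names; run it in the delegated Macie administrator account (or in the standalone account where Macie runs).

import boto3

macie = boto3.client('macie2')

# Point Macie's sensitive data discovery results at the discovery results
# bucket, encrypted with the KMS key created earlier (both placeholders).
macie.put_classification_export_configuration(
    configuration={
        's3Destination': {
            'bucketName': 'DOC-EXAMPLE-RESULTS-BUCKET',
            'keyPrefix': 'discovery-results/',
            'kmsKeyArn': 'arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID'
        }
    }
)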
Step 2: Build out the Athena table to query the results using SQL
Now that you have enabled the discovery results, Macie will begin publishing them into your discovery results bucket in the form of jsonl.gz files. Depending on the amount of data, there could be thousands of individual files, with each file containing multiple records. To identify the top five most commonly occurring sensitive data types in your organization, you would need to query all of these files together.
In this step, you will configure Athena so that it can query the results using SQL syntax. Before you can run an Athena query, you must specify a query result bucket location in Amazon S3. This is different from the Macie discovery results bucket that you created in the previous step.
If you haven’t set up Athena previously, we recommend that you create a separate S3 bucket, and specify a query result location using the Athena console. After you’ve set up the query result location, you can configure Athena.
To create a new Athena database and table for the Macie results
Open the Athena console, and in the query editor, enter the following data definition language (DDL) statement. In the context of SQL, a DDL statement is a syntax for creating and modifying database objects, such as tables. For this example, we named our database macie_results.
CREATE DATABASE macie_results;
After running this step, you’ll see a new database in the Database dropdown. Make sure that the new macie_results database is selected for the next queries.
Figure 4: Create database in the Athena console
Create a table in the database by using the following DDL statement. Make sure to replace <RESULTS-BUCKET-NAME> with the name of the discovery results bucket that you created previously.
CREATE EXTERNAL TABLE maciedetail_all_jobs(
accountid string,
category string,
classificationdetails struct<jobArn:string,result:struct<status:struct<code:string,reason:string>,sizeClassified:string,mimeType:string,sensitiveData:array<struct<category:string,totalCount:string,detections:array<struct<type:string,count:string,occurrences:struct<lineRanges:array<struct<start:string,`end`:string,`startColumn`:string>>,pages:array<struct<pageNumber:string>>,records:array<struct<recordIndex:string,jsonPath:string>>,cells:array<struct<row:string,`column`:string,`columnName`:string,cellReference:string>>>>>>>,customDataIdentifiers:struct<totalCount:string,detections:array<struct<arn:string,name:string,count:string,occurrences:struct<lineRanges:array<struct<start:string,`end`:string,`startColumn`:string>>,pages:array<string>,records:array<string>,cells:array<string>>>>>>,detailedResultsLocation:string,jobId:string>,
createdat string,
description string,
id string,
partition string,
region string,
resourcesaffected struct<s3Bucket:struct<arn:string,name:string,createdAt:string,owner:struct<displayName:string,id:string>,tags:array<string>,defaultServerSideEncryption:struct<encryptionType:string,kmsMasterKeyId:string>,publicAccess:struct<permissionConfiguration:struct<bucketLevelPermissions:struct<accessControlList:struct<allowsPublicReadAccess:boolean,allowsPublicWriteAccess:boolean>,bucketPolicy:struct<allowsPublicReadAccess:boolean,allowsPublicWriteAccess:boolean>,blockPublicAccess:struct<ignorePublicAcls:boolean,restrictPublicBuckets:boolean,blockPublicAcls:boolean,blockPublicPolicy:boolean>>,accountLevelPermissions:struct<blockPublicAccess:struct<ignorePublicAcls:boolean,restrictPublicBuckets:boolean,blockPublicAcls:boolean,blockPublicPolicy:boolean>>>,effectivePermission:string>>,s3Object:struct<bucketArn:string,key:string,path:string,extension:string,lastModified:string,eTag:string,serverSideEncryption:struct<encryptionType:string,kmsMasterKeyId:string>,size:string,storageClass:string,tags:array<string>,embeddedFileDetails:struct<filePath:string,fileExtension:string,fileSize:string,fileLastModified:string>,publicAccess:boolean>>,
schemaversion string,
severity struct<description:string,score:int>,
title string,
type string,
updatedat string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'paths'='accountId,category,classificationDetails,createdAt,description,id,partition,region,resourcesAffected,schemaVersion,severity,title,type,updatedAt')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://<RESULTS-BUCKET-NAME>/AWSLogs/'
After you complete this step, you will see a new table named maciedetail_all_jobs in the Tables section of the query editor.
Query the results to start gaining insights. For example, to identify the top five most common sensitive data types, run the following query:
select sensitive_data.category,
detections_data.type,
sum(cast(detections_data.count as INT)) total_detections
from maciedetail_all_jobs,
unnest(classificationdetails.result.sensitiveData) as t(sensitive_data),
unnest(sensitive_data.detections) as t(detections_data)
where classificationdetails.result.sensitiveData is not null
and resourcesaffected.s3object.embeddedfiledetails is null
group by sensitive_data.category, detections_data.type
order by total_detections desc
LIMIT 5
Running this query on the sample dataset gives the following output.
Figure 5: Results of a query showing the five most common sensitive data types in the dataset
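If you want to run these queries outside the Athena console, for example on a schedule, you can submit them with the Athena API. The following sketch reuses the query above, assumes the database and table names from this post, and writes query output to a placeholder Athena results bucket.

import time
import boto3

athena = boto3.client('athena')

QUERY = """
select sensitive_data.category, detections_data.type,
       sum(cast(detections_data.count as INT)) total_detections
from maciedetail_all_jobs,
     unnest(classificationdetails.result.sensitiveData) as t(sensitive_data),
     unnest(sensitive_data.detections) as t(detections_data)
where classificationdetails.result.sensitiveData is not null
  and resourcesaffected.s3object.embeddedfiledetails is null
group by sensitive_data.category, detections_data.type
order by total_detections desc
LIMIT 5
"""

# Start the query; the output location is a placeholder Athena results bucket.
execution_id = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={'Database': 'macie_results'},
    ResultConfiguration={'OutputLocation': 's3://DOC-EXAMPLE-ATHENA-RESULTS/'}
)['QueryExecutionId']

# Wait for the query to finish, then print each result row.
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)
    status = state['QueryExecution']['Status']['State']
    if status in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(2)

if status == 'SUCCEEDED':
    rows = athena.get_query_results(QueryExecutionId=execution_id)['ResultSet']['Rows']
    for row in rows:
        print([col.get('VarCharValue') for col in row['Data']])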
(Optional) The previous query ran on all of the results available for Macie. You can further query which accounts have the greatest amount of sensitive data detected.
select accountid,
sum(cast(detections_data.count as INT)) total_detections
from maciedetail_all_jobs,
unnest(classificationdetails.result.sensitiveData) as t(sensitive_data),
unnest(sensitive_data.detections) as t(detections_data)
where classificationdetails.result.sensitiveData is not null
and resourcesaffected.s3object.embeddedfiledetails is null
group by accountid
order by total_detections desc
To test this query, we distributed the synthetic dataset across three member accounts in our organization, ran the query, and received the following output. If you enable Macie in just a single account, then you will only receive results for that one account.
Figure 6: Query results for total number of sensitive data detections across all accounts in an organization
Step 3: Visualize the results with QuickSight
In the previous step, you used Athena to query your Macie discovery results. Although the queries were powerful, they only produced tabular data as their output. In this step, you will use QuickSight to visualize the results of your Macie jobs.
Before creating the visualizations, you first need to grant QuickSight the right permissions to access Athena, the results bucket, and the KMS key that you used to encrypt the results.
To allow QuickSight access to the KMS key
In the AWS KMS console, open the key policy of the KMS key that you used to encrypt the discovery results, and paste the following statement into the key policy. When you add the statement, do not delete any existing statements, and make sure that the syntax is valid. Replace <QUICKSIGHT_SERVICE_ROLE_ARN> and <KMS_KEY_ARN> with your own information. Policies are in JSON format.
{ "Sid": "Allow Quicksight Service Role to use the key",
"Effect": "Allow",
"Principal": {
"AWS": <QUICKSIGHT_SERVICE_ROLE_ARN>
},
"Action": "kms:Decrypt",
"Resource": <KMS_KEY_ARN>
}
To allow QuickSight access to Athena and the discovery results S3 bucket
In QuickSight, in the upper right, choose your user icon to open the profile menu, and choose US East (N. Virginia). You can only modify these permissions in this Region.
In the upper right, open the profile menu again, and select Manage QuickSight.
Select Security & permissions.
Under QuickSight access to AWS services, choose Manage.
Make sure that the S3 checkbox is selected, choose Select S3 buckets, and then do the following:
Choose the discovery results bucket.
You do not need to check the box under Write permissions for Athena workgroup. The write permissions are not required for this post.
Select Finish.
Make sure that the Amazon Athena checkbox is selected.
Review the selections and be careful that you don’t inadvertently disable AWS services and resources that other users might be using.
Select Save.
In QuickSight, in the upper right, open the profile menu, and choose the Region where your results bucket is located.
Now that you’ve granted QuickSight the right permissions, you can begin creating visualizations.
To create a new dataset referencing the Athena table
On the QuickSight start page, choose Datasets.
On the Datasets page, choose New dataset.
From the list of data sources, select Athena.
Enter a meaningful name for the data source (for example, macie_datasource) and choose Create data source.
Select the database that you created in Athena (for example, macie_results).
Select the table that you created in Athena (for example, maciedetail_all_jobs), and choose Select.
You can either import the data into SPICE or query the data directly. We recommend that you use SPICE for improved performance, but the visualizations will still work if you query the data directly.
To create an analysis using the data as-is, choose Visualize.
You can then visualize the Macie results in the QuickSight console. The following example shows a delegated Macie administrator account that is running a visualization, with account IDs on the y axis and the count of affected resources on the x axis.
Figure 7: Visualize query results to identify total number of sensitive data detections across accounts in an organization
You can also visualize the aggregated data in QuickSight. For example, you can view the number of findings for each sensitive data category in each S3 bucket. The Athena table doesn’t provide aggregated data necessary for visualization. Instead, you need to query the table and then visualize the output of the query.
To query the table and visualize the output in QuickSight
On the Amazon QuickSight start page, choose Datasets.
On the Datasets page, choose New dataset.
Select the data source that you created in Athena (for example, macie_datasource) and then choose Create Dataset.
Select the database that you created in Athena (for example, macie_results).
Choose Use custom SQL, enter the following query, and choose Confirm query.
select resourcesaffected.s3bucket.name as bucket_name,
sensitive_data.category,
detections_data.type,
sum(cast(detections_data.count as INT)) total_detections
from macie_results.maciedetail_all_jobs,
unnest(classificationdetails.result.sensitiveData) as t(sensitive_data),unnest(sensitive_data.detections) as t(detections_data)
where classificationdetails.result.sensitiveData is not null
and resourcesaffected.s3object.embeddedfiledetails is null
group by resourcesaffected.s3bucket.name, sensitive_data.category, detections_data.type
order by total_detections desc
You can either import the data into SPICE or query the data directly.
To create an analysis using the data as-is, choose Visualize.
Now you can visualize the output of the query that aggregates data across your S3 buckets. For example, we used the name of the S3 bucket to group the results, and then we created a donut chart of the output, as shown in Figure 8.
Figure 8: Visualize query results for total number of sensitive data detections across each S3 bucket in an organization
From the visualizations, we can identify which buckets or accounts in our organizations contain the most sensitive data, for further action. Visualizations can also act as a dashboard to track remediation.
You can replicate the preceding steps by using the sample queries from the amazon-macie-results-analytics GitHub repo to view data that is aggregated across S3 buckets, AWS accounts, or individual Macie jobs. Using these queries with your own Macie results will help you get started with tracking the security posture of your data in Amazon S3.
Conclusion
In this post, you learned how to enable sensitive data discovery results for Macie, query those results with Athena, and visualize the results in QuickSight.
Because Macie sensitive data discovery results provide more granular data than the findings, you can pursue a more comprehensive incident response when sensitive data is discovered. The sample queries in this post provide answers to some generic questions that you might have. After you become familiar with the structure, you can run other interesting queries on the data.
We hope that you can use this solution to write your own queries to gain further insights into sensitive data discovered in S3 buckets, according to the business needs and regulatory requirements of your organization. You can consider using this solution to better understand and identify data security risks that need immediate attention. For example, you can use this solution to answer questions such as the following:
Is financial information present in an AWS account where it shouldn’t be?
Are S3 buckets that contain PII properly hardened with access controls and encryption?
You can also use this solution to understand gaps in your data security initiatives by tracking files that Macie couldn't analyze due to encryption or permission issues. To further expand your knowledge of Macie capabilities and features, see the Amazon Macie User Guide.
Security teams use Amazon Macie to discover and protect sensitive data, such as names, payment card data, and AWS credentials, in Amazon Simple Storage Service (Amazon S3). When Macie discovers sensitive data, these teams will want to see examples of the actual sensitive data found. Reviewing a sampling of the discovered data helps them quickly confirm that the object is truly sensitive according to their data protection and privacy policies.
In this post, we walk you through how your data security teams can use a new capability in Amazon Macie to retrieve up to 10 examples of sensitive data found in your S3 objects, so that you can confirm the nature of the data at a glance. Additionally, we will discuss how you can control who is able to use this capability, so that only authorized personnel have permissions to view these examples.
The challenge customers face
After a Macie sensitive data discovery job is run, security teams start their work. The security team will review the Macie findings to investigate the discovered sensitive data and decide what actions to take to protect such data. The findings provide details that include the severity of the finding, information on the affected S3 object, and a summary of the type, location, and amount of sensitive data found. However, Macie findings only contain pointers to data that Macie found in the object. To complete their investigation, customers in the past had to do additional work to extract the contents of a sensitive object, such as navigating to a different AWS account where the object is located, downloading the file and manually searching it for keywords in a file editor, or writing and refining SQL queries by using Amazon S3 Select. The investigations are further slowed down when the object type is one that is not easily readable without additional tooling, such as big-data file types like Avro and Parquet. By using the Macie capability to retrieve sensitive data samples, you can review the discovered data and make decisions about remediating the finding.
Prerequisites
To implement the ability to retrieve and reveal samples of sensitive data, you’ll need the following prerequisites:
Note: The detailed classification results contain a record for each Amazon S3 object that you configure the job to analyze, and include the location of up to 1,000 occurrences of each type of sensitive data that Macie found in an object. Macie uses the location information in the detailed classification results to retrieve the examples of sensitive data. The detailed classification results are stored in an S3 bucket of your choice. In this post, we will refer to this bucket as DOC-EXAMPLE-BUCKET1.
Create an S3 bucket that contains sensitive data in Account B. In this post, we will refer to this bucket as DOC-EXAMPLE-BUCKET2.
(Optional) Add sensitive data to DOC-EXAMPLE-BUCKET2. This post uses a sample dataset that contains fake sensitive data. You can download this sample dataset, extract the .zip archive, and follow these steps to upload the objects to S3. This is a synthetic dataset generated by AWS that we will use for the examples in this post. All data in this blog post has been artificially created by AWS for demonstration purposes and has not been collected from any individual person. Similarly, such data does not relate back to any individual person, nor is it intended to.
Create and run a sensitive data discovery job from Account A to analyze the contents of DOC-EXAMPLE-BUCKET2.
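If you prefer to script this last prerequisite, the following AWS CLI sketch shows one way to create a one-time sensitive data discovery job that targets DOC-EXAMPLE-BUCKET2. The job name and the account ID placeholder are assumptions for illustration; run the command from the Macie delegated administrator account (Account A).

# Create a one-time Macie classification job scoped to DOC-EXAMPLE-BUCKET2 (owned by Account B).
aws macie2 create-classification-job \
  --job-type ONE_TIME \
  --name "discover-sensitive-data-in-doc-example-bucket2" \
  --s3-job-definition '{
    "bucketDefinitions": [
      {"accountId": "<ACCOUNT-B-ID>", "buckets": ["DOC-EXAMPLE-BUCKET2"]}
    ]
  }'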
Configure Macie to retrieve and reveal examples of sensitive data
In this section, we’ll describe how to configure Macie so that you can retrieve and view examples of sensitive data from Macie findings.
To configure Macie (console)
In the AWS Management Console, in the Macie delegated administrator account (Account A), follow these steps from the Amazon Macie User Guide.
To configure Macie (AWS CLI)
Confirm that you have Macie enabled.
$ aws macie2 get-macie-session --query 'status'
// The expected response is "ENABLED"
Confirm that you have configured the detailed classification results bucket.
$ aws macie2 get-classification-export-configuration
// The expected response is:
{"configuration": {
"s3Destination": {"bucketName": " DOC-EXAMPLE-BUCKET1 ",
"kmsKeyArn": "arn:aws:kms:<YOUR-REGION>:<YOUR-ACCOUNT-ID>:key/<KEY-USED-TO-ENCRYPT-DOC-EXAMPLE-BUCKET1>"}}}
Create a new KMS key to encrypt the retrieved examples of sensitive data. Make sure that the key is created in the same AWS Region where you are operating Macie.
Enable the feature in Macie and configure it to encrypt the data by using REVEAL-KMS-KEY. You do not specify a key policy for your new KMS key in this step. The key policy will be discussed later in the post.
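As a sketch of these two steps with the AWS CLI, you might create the key, alias it as REVEAL-KMS-KEY, and then enable the capability roughly as follows. The alias name is an assumption used throughout this post; confirm that the Region matches where you operate Macie.

# Create a customer managed KMS key in the same Region as Macie and capture its key ID.
KEY_ID=$(aws kms create-key \
  --description "Encrypts sensitive data samples revealed by Macie" \
  --query 'KeyMetadata.KeyId' --output text)

# Give the key a friendly alias (hypothetical name).
aws kms create-alias --alias-name alias/REVEAL-KMS-KEY --target-key-id "$KEY_ID"

# Enable the retrieve-and-reveal capability in Macie and point it at the new key.
aws macie2 update-reveal-configuration \
  --configuration "status=ENABLED,kmsKeyId=$KEY_ID"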
Control access to read sensitive data and protect data displayed in Macie
This new Macie capability uses the AWS Identity and Access Management (IAM) policies, S3 bucket policies, and AWS KMS key policies that you have defined in your accounts. This means that in order to see examples through the Macie console or by invoking the Macie API, the IAM principal needs read access to the S3 object and permission to decrypt the object if it is server-side encrypted. It’s important to note that Macie uses the IAM permissions of the AWS principal to locate, retrieve, and reveal the samples and does not use the Macie service-linked role to perform these tasks.
Using the setup discussed in the previous section, we’ll walk through how to control access to the ability to retrieve and reveal sensitive data examples. To recap, you created and ran a discovery job from the Amazon Macie delegated administrator account (Account A) to analyze the contents of DOC-EXAMPLE-BUCKET2 in a member account (Account B). You configured Macie to retrieve examples and to encrypt the examples of sensitive data with the REVEAL-KMS-KEY.
The next step is to create and use an IAM role that will be assumed by other users in Account A to retrieve and reveal examples of sensitive data discovered by Macie. In this post, we’ll refer to this role as MACIE-REVEAL-ROLE.
To apply the principle of least privilege and allow only authorized personnel to view the sensitive data samples, grant the following permissions so that Macie users who assume MACIE-REVEAL-ROLE will be able to successfully retrieve and reveal examples of sensitive data:
Step 1 – Update the IAM policy for MACIE-REVEAL-ROLE.
Step 2 – Update the KMS key policy for REVEAL-KMS-KEY.
Step 3 – Update the S3 bucket policy for DOC-EXAMPLE-BUCKET2 and the KMS key policy used for its server-side encryption in Account B.
After you grant these permissions, MACIE-REVEAL-ROLE can successfully retrieve and reveal examples of sensitive data in DOC-EXAMPLE-BUCKET2, as shown in Figure 1.
Figure 1: Macie runs the discovery job from the delegated administrator account in a member account, and MACIE-REVEAL-ROLE retrieves examples of sensitive data
Step 1: Update the IAM policy
Provide the following required permissions to MACIE-REVEAL-ROLE:
Allow GetObject from DOC-EXAMPLE-BUCKET2 in Account B.
Allow decryption of objects in DOC-EXAMPLE-BUCKET2 if they are server-side encrypted with a customer managed key (SSE-KMS).
Allow GetObject from DOC-EXAMPLE-BUCKET1.
Allow decryption of the Macie discovery results.
Allow the necessary Macie actions to retrieve and reveal sensitive data examples.
To set up the required permissions
Use the following commands to provide the permissions. Make sure to replace the placeholders with your own data.
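A minimal sketch of what these commands might look like follows; the inline policy name is hypothetical, and you should confirm the exact list of macie2 actions against the Amazon Macie User Guide.

cat > macie-reveal-role-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadTargetObjectsAndDiscoveryResults",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": [
        "arn:aws:s3:::DOC-EXAMPLE-BUCKET2/*",
        "arn:aws:s3:::DOC-EXAMPLE-BUCKET1/*"
      ]
    },
    {
      "Sid": "DecryptObjectsAndDiscoveryResults",
      "Effect": "Allow",
      "Action": "kms:Decrypt",
      "Resource": [
        "arn:aws:kms:<YOUR-REGION>:<ACCOUNT-B-ID>:key/<KEY-USED-TO-ENCRYPT-DOC-EXAMPLE-BUCKET2>",
        "arn:aws:kms:<YOUR-REGION>:<ACCOUNT-A-ID>:key/<KEY-USED-TO-ENCRYPT-DOC-EXAMPLE-BUCKET1>"
      ]
    },
    {
      "Sid": "RetrieveAndRevealSensitiveDataSamples",
      "Effect": "Allow",
      "Action": [
        "macie2:GetMacieSession",
        "macie2:GetFindings",
        "macie2:GetSensitiveDataOccurrences",
        "macie2:GetSensitiveDataOccurrencesAvailability"
      ],
      "Resource": "*"
    }
  ]
}
EOF

# Attach the policy inline to the role that analysts will assume.
aws iam put-role-policy \
  --role-name MACIE-REVEAL-ROLE \
  --policy-name macie-reveal-permissions \
  --policy-document file://macie-reveal-role-policy.json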
Next, update the KMS key policy that is used to encrypt sensitive data samples that you retrieve and reveal in your delegated administrator account.
To update the key policy
Allow the MACIE-REVEAL-ROLE access to the KMS key that you created for protecting the retrieved sensitive data, using the following commands. Make sure to replace the placeholders with your own data.
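A sketch of what this might look like follows. The statement IDs and the exact set of kms actions are assumptions, and because put-key-policy replaces the entire key policy, the sketch keeps an administrative statement for the account.

cat > reveal-kms-key-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAccountAdministrationOfTheKey",
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::<ACCOUNT-A-ID>:root"},
      "Action": "kms:*",
      "Resource": "*"
    },
    {
      "Sid": "AllowMacieRevealRoleToUseTheKey",
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::<ACCOUNT-A-ID>:role/MACIE-REVEAL-ROLE"},
      "Action": [
        "kms:Decrypt",
        "kms:DescribeKey",
        "kms:GenerateDataKey"
      ],
      "Resource": "*"
    }
  ]
}
EOF

# Apply the policy to REVEAL-KMS-KEY (the only supported policy name is "default").
aws kms put-key-policy \
  --key-id <REVEAL-KMS-KEY-ID> \
  --policy-name default \
  --policy file://reveal-kms-key-policy.json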
Use the following commands to update the bucket policy of DOC-EXAMPLE-BUCKET2 to allow cross-account access for MACIE-REVEAL-ROLE to get objects from this bucket.
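A sketch of that bucket policy statement follows; the statement ID is hypothetical, and you should merge it with any statements that DOC-EXAMPLE-BUCKET2 already has, because put-bucket-policy replaces the existing policy. If the bucket uses SSE-KMS with a customer managed key, add a similar kms:Decrypt grant for MACIE-REVEAL-ROLE to that key’s policy in Account B.

cat > doc-example-bucket2-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowMacieRevealRoleCrossAccountRead",
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::<ACCOUNT-A-ID>:role/MACIE-REVEAL-ROLE"},
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::DOC-EXAMPLE-BUCKET2/*"
    }
  ]
}
EOF

# Apply the policy in Account B, which owns DOC-EXAMPLE-BUCKET2.
aws s3api put-bucket-policy \
  --bucket DOC-EXAMPLE-BUCKET2 \
  --policy file://doc-example-bucket2-policy.json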
Now that you’ve put in place the necessary permissions, users who assume MACIE-REVEAL-ROLE will be able to conveniently retrieve and reveal sensitive data samples.
To retrieve and reveal sensitive data samples
In the Macie console, in the left navigation pane, choose Findings, and select a specific finding. Under Sensitive Data, choose Review.
Figure 2: The finding details panel
On the Reveal sensitive data page, choose Reveal samples.
Figure 3: The Reveal sensitive data page
Under Sensitive data, you can view up to 10 examples of the sensitive data found by Amazon Macie.
Figure 4: Examples of sensitive data revealed in the Amazon Macie console
You can find additional information on setting up the Macie Reveal function in the Amazon Macie User Guide.
Conclusion
In this post, we showed how you can retrieve and review examples of sensitive data that Macie found in Amazon S3. This capability will make it easier for your data protection teams to review the sensitive contents found in S3 buckets across the accounts in your AWS environment. With this information, security teams can quickly take remediation actions, such as updating the configuration of sensitive buckets, quarantining files with sensitive information, or sending a notification to the owner of the account where the sensitive data resides. In certain cases, you can add the examples to an allow list in Macie if you don’t want Macie to report those as sensitive data (for example, corporate addresses or sample data that is used for testing).
The following are links to additional resources that you can use to expand your knowledge of Amazon Macie capabilities and features:
Today, we announce automated data discovery for Amazon Macie. This new capability allows you to gain visibility into where your sensitive data resides on Amazon Simple Storage Service (Amazon S3) at a fraction of the cost of running a full data inspection across all your S3 buckets.
At AWS, security is our first priority: the security of the infrastructure itself, and also the security of your data. We give you access to services to manage identities and access, to protect the network and your applications, to detect suspicious activities, to protect your data, and to report on and monitor your compliance status.
Amazon Macie is a data security service that discovers sensitive data using machine learning and pattern matching and enables visibility and automated protection against data security risks. You use Amazon Macie to protect your data in S3 by scanning for the presence of sensitive data, such as names, addresses, and credit card numbers, and continually monitoring for properly configured preventative controls, such as encryption and access policies. Amazon Macie generates alerts when it detects publicly accessible buckets, unencrypted buckets, or buckets shared with an AWS account outside of your organization. You can also configure Amazon Macie to run full sensitive data discovery scans on your S3 buckets to provide visibility into where sensitive data resides.
But customers operating at scale told us it is difficult to know where to start. When employees and applications add new buckets and generate petabytes of data on a daily basis, what should be scanned first?
Automated data discovery automates the continual discovery of sensitive data and potential data security risks across your entire set of buckets, aggregated at the AWS Organizations level.
When you enable automated discovery in the console, Macie starts to evaluate the level of sensitivity of each of your buckets and highlights any data security risks. Automated data discovery introduces intelligent and fully managed data sampling to provide an optimized sample rate that meaningfully reduces the amount of data that needs to be analyzed. This reduces the cost of discovering S3 buckets containing sensitive data compared to the cost of full data inspection.
You can tune automated data discovery to only identify the types of sensitive data that are relevant for your use case by choosing from over 100 managed sensitive data types, such as personally identifiable information (PII) and financial records with specific formats for multiple countries. For example, you can enable detection of Spanish or Swedish driving license numbers and choose to ignore US Social Security numbers, depending on your use cases. When the specific type of data you manage is not on our list, you can create custom data types that may be unique to your business, such as employee or patient identification numbers.
Let’s See It in Action
Automated data discovery is on by default for all new Amazon Macie customers, and existing Macie customers can enable it with one click in the AWS Management Console of the Amazon Macie administrator account. There is a 30-day free trial, and you can always opt out at the administrator level.
I can enable or disable the capability from the Automated discovery entry, under Settings, in the left-side navigation menu. The Status section reveals the current status.
On the same page, I can configure the list of managed data identifiers. I can turn on or off individual types of data among more than one hundred managed data identifier types. I can also configure new ones. I select Edit on the Managed data identifiers section to include or exclude additional data identifiers.
If I have some buckets with lots of objects and others with a few, Macie won’t spend all its time inspecting one really large bucket at the expense of other smaller ones. Macie also prioritizes buckets that it knows the least about. For example, if it looked at the majority of objects in a small bucket, that bucket will be deprioritized compared to larger buckets where it has seen proportionally fewer objects.
Automated data discovery can provide an interactive data map of sensitive data distribution in S3 buckets within days of the feature being enabled. This data map refreshes daily as it intelligently picks and scans S3 objects in buckets and spreads the scan effort across the entire S3 estate in a given month.
Here is the Summary section of the Amazon Macie page. It looks like my set of buckets is secured. I have no bucket with public access, and 31 of my buckets might contain sensitive data.
When selecting the S3 buckets section of the navigation menu on the left side, I can see a data map of my buckets. The more red a square is, the more sensitive data has been detected in that bucket. The squares in blue represent buckets with no sensitive data detected so far. From there, I can drill down at the bucket level to investigate the details.
Pricing and Availability
When you are new to Amazon Macie, automated data discovery is enabled by default. When you already use Amazon Macie in your organization, you can enable automatic data discovery with one click in the Management Console of the Amazon Macie administrator account.
There is a 30-day free trial period when you enable automatic data discovery on your AWS account. After the evaluation period, we charge based on the total quantity of S3 objects in your account as well as the bytes scanned for sensitive content. Charges are prorated per day. You can disable this capability at any time. The pricing page has all the details.
When building serverless applications using AWS Lambda, there are a number of considerations regarding security, governance, and compliance. This post highlights how Lambda, as a serverless service, simplifies cloud security and compliance so you can concentrate on your business logic. It covers controls that you can implement for your Lambda workloads to ensure that your applications conform to your organizational requirements.
The Shared Responsibility Model
The AWS Shared Responsibility Model distinguishes between what AWS is responsible for and what customers are responsible for with cloud workloads. AWS is responsible for “Security of the Cloud” where AWS protects the infrastructure that runs all the services offered in the AWS Cloud. Customers are responsible for “Security in the Cloud”, managing and securing their workloads. When building traditional applications, you take on responsibility for many infrastructure services, including operating systems and network configuration.
Traditional application shared responsibility
One major benefit when building serverless applications is shifting more responsibility to AWS so you can concentrate on your business applications. AWS handles managing and patching the underlying servers, operating systems, and networking as part of running the services.
Serverless application shared responsibility
For Lambda, AWS manages the application platform where your code runs, which includes patching and updating the managed language runtimes. This reduces the attack surface while making cloud security simpler. You are responsible for the security of your code and AWS Identity and Access Management (IAM) to the Lambda service and within your function.
Lambda functions run in separate isolated AWS accounts that are dedicated to the Lambda service. Lambda invokes your code in a secure and isolated runtime environment within the Lambda service account. A runtime environment is a collection of resources running in a dedicated hardware-virtualized micro virtual machine (MVM) on a Lambda worker node.
Lambda workers are bare metal EC2 Nitro instances, which are managed and patched by the Lambda service team. They have a maximum lease lifetime of 14 hours to keep the underlying infrastructure secure and fresh. MVMs are created by Firecracker, an open source virtual machine monitor (VMM) that uses Linux’s Kernel-based Virtual Machine (KVM) to create and manage MVMs securely at scale.
MVMs maintain a strong separation between runtime environments at the virtual machine hardware level, which increases security. Runtime environments are never reused across functions, function versions, or AWS accounts.
Isolation model for AWS Lambda workers
Network security
Lambda functions always run inside secure Amazon Virtual Private Clouds (VPCs) owned by the Lambda service. This gives the Lambda function access to AWS services and the public internet. There is no direct network inbound access to Lambda workers, runtime environments, or Lambda functions. All inbound access to a Lambda function only comes via the Lambda Invoke API, which sends the event object to the function handler.
You can configure a Lambda function to connect to private subnets in a VPC in your account if necessary, which you can control with IAM condition keys. The Lambda function still runs inside the Lambda service VPC but sends all network traffic through your VPC. Function outbound traffic comes from your own network address space.
AWS Lambda service VPC with VPC-to-VPC NAT to customer VPC
To give your VPC-connected function access to the internet, route outbound traffic to a NAT gateway in a public subnet. Connecting a function to a public subnet doesn’t give it internet access or a public IP address, as the function is still running in the Lambda service VPC and then routing network traffic into your VPC.
All internal AWS traffic uses the AWS Global Backbone rather than traversing the internet. You do not need to connect your functions to a VPC to avoid connectivity to AWS services over the internet. VPC connected functions allow you to control and audit outbound network access.
You can use security groups to control outbound traffic for VPC-connected functions and network ACLs to block access to CIDR IP ranges or ports. VPC endpoints allow you to enable private communications with supported AWS services without internet access.
You can use VPC Flow Logs to audit traffic going to and from network interfaces in your VPC.
Runtime environment re-use
Each runtime environment processes a single request at a time. After Lambda finishes processing the request, the runtime environment is ready to process an additional request for the same function version. For more information on how Lambda manages runtime environments, see Understanding AWS Lambda scaling and throughput.
Data can persist in the local temporary filesystem path, in globally scoped variables, and in environment variables across subsequent invocations of the same function version. Ensure that you only handle sensitive information within individual invocations of the function by processing it in the function handler, or using local variables. Do not re-use files in the local temporary filesystem to process unencrypted sensitive data. Do not put sensitive or confidential information into Lambda environment variables, tags, or other freeform fields such as Name fields.
AWS recommends using multiple accounts to isolate your resources because they provide natural boundaries for security, access, and billing. Use AWS Organizations to manage and govern individual member accounts centrally. You can use AWS Control Tower to automate many of the account build steps and apply managed guardrails to govern your environment. These include preventative guardrails to limit actions and detective guardrails to detect and alert on non-compliance resources for remediation.
Lambda access controls
Lambda permissions define what a Lambda function can do, and who or what can invoke the function. Consider the following areas when applying access controls to your Lambda functions to ensure least privilege:
Execution role
Lambda functions have permission to access other AWS resources using execution roles. An execution role is an IAM role that the Lambda service assumes, and it grants permissions through the identity policy statements attached to the role. The Lambda service uses this role to fetch and cache temporary security credentials, which are then available as environment variables during a function’s invocation. It may re-use them across different runtime environments that use the same execution role.
Ensure that each function has its own unique role with the minimum set of permissions.
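As an illustration of this principle, the following sketch creates a dedicated execution role for a single function that can write its logs and read from one DynamoDB table, and nothing else. The role, policy, and table names are hypothetical.

# Trust policy that lets the Lambda service assume the role.
cat > lambda-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Service": "lambda.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

aws iam create-role \
  --role-name order-lookup-function-role \
  --assume-role-policy-document file://lambda-trust-policy.json

# Basic permissions to write function logs to CloudWatch Logs.
aws iam attach-role-policy \
  --role-name order-lookup-function-role \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

# Scope data access to read-only operations on a single table.
cat > order-lookup-permissions.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["dynamodb:GetItem", "dynamodb:Query"],
      "Resource": "arn:aws:dynamodb:<REGION>:<ACCOUNT-ID>:table/orders"
    }
  ]
}
EOF

aws iam put-role-policy \
  --role-name order-lookup-function-role \
  --policy-name order-lookup-least-privilege \
  --policy-document file://order-lookup-permissions.json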
Identity/user policies
IAM identity policies are attached to IAM users, groups, or roles. These policies allow users or callers to perform operations on Lambda functions. You can restrict who can create functions, or control what functions particular users can manage.
Resource policies
Resource policies define what identities have fine-grained inbound access to managed services. For example, you can restrict which Lambda function versions can add events to a specific Amazon EventBridge event bus. You can use resource-based policies on Lambda resources to control what AWS IAM identities and event sources can invoke a specific version or alias of your function. You also use a resource-based policy to allow an AWS service to invoke your function on your behalf. To see which services support resource-based policies, see “AWS services that work with IAM”.
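For example, the following sketch adds a resource-based policy statement that lets Amazon S3 invoke one specific function, and only for events from a single bucket in your own account; the function and bucket names are hypothetical.

# Allow S3 event notifications from one bucket to invoke the function.
aws lambda add-permission \
  --function-name process-uploads \
  --statement-id AllowS3InvokeFromUploadsBucket \
  --action lambda:InvokeFunction \
  --principal s3.amazonaws.com \
  --source-arn arn:aws:s3:::DOC-EXAMPLE-UPLOADS-BUCKET \
  --source-account <ACCOUNT-ID>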
Attribute-based access control (ABAC)
With attribute-based access control (ABAC), you can use tags to control access to your Lambda functions. With ABAC, you can scale an access control strategy by setting granular permissions with tags without requiring permissions updates for every new user or resource as your organization scales. You can also use tag policies with AWS Organizations to standardize tags across resources.
Permissions boundaries
Permissions boundaries are a way to delegate permission management safely. The boundary places a limit on the maximum permissions that a policy can grant. For example, you can use boundary permissions to limit the scope of the execution role to allow only read access to databases. A builder with permission to manage a function or with write access to the application’s code repository cannot escalate the permissions beyond the boundary to allow write access.
Service control policies
When using AWS Organizations, you can use Service control policies (SCPs) to manage permissions in your organization. These provide guardrails for what actions IAM users and roles within the organization root or OUs can do. For more information, see the AWS Organizations documentation, which includes example service control policies.
Code signing
As you are responsible for the code that runs in your Lambda functions, you can ensure that only trusted code runs by using code signing with the AWS Signer service. AWS Signer digitally signs your code packages and Lambda validates the code package before accepting the deployment, which can be part of your automated software deployment process.
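A rough sketch of wiring this up with the AWS CLI follows; the signing profile, function name, and ARN placeholders are hypothetical, and in practice the signing and deployment steps usually run inside your CI/CD pipeline.

# Create a signing profile on the AWS-managed Lambda signing platform.
aws signer put-signing-profile \
  --profile-name ReleaseSigningProfile \
  --platform-id AWSLambda-SHA384-ECDSA

# Trust only that profile and reject unsigned or invalid deployment packages.
aws lambda create-code-signing-config \
  --allowed-publishers SigningProfileVersionArns=<SIGNING-PROFILE-VERSION-ARN> \
  --code-signing-policies UntrustedArtifactOnDeployment=Enforce

# Attach the configuration so Lambda validates signatures when this function is deployed.
aws lambda put-function-code-signing-config \
  --function-name process-uploads \
  --code-signing-config-arn <CODE-SIGNING-CONFIG-ARN>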
Auditing Lambda configuration, permissions and access
You should audit access and permissions regularly to ensure that your workloads are secure. Use the IAM console to view when an IAM role was last used.
IAM last used
IAM access advisor
Use IAM access advisor on the Access Advisor tab in the IAM console to review when an AWS service was last used by a specific IAM user or role. You can use this information to remove IAM policies and access from your IAM roles.
You can validate policies using IAM Access Analyzer, which provides over 100 policy checks with security warnings for overly permissive policies. To learn more about policy checks provided by IAM Access Analyzer, see “IAM Access Analyzer policy validation”.
You can also generate IAM policies based on access activity from CloudTrail logs, which contain the permissions that the role used in your specified date range.
IAM Access Analyzer
AWS Config
AWS Config provides you with a record of the configuration history of your AWS resources. AWS Config monitors the resource configuration and includes rules to alert when they fall into a non-compliant state.
For Lambda, you can track and alert on changes to your function configuration, along with the IAM execution role. This allows you to gather Lambda function lifecycle data for potential audit and compliance requirements. For more information, see the Lambda Operators Guide.
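As one example of this kind of check, the sketch below enables an AWS managed Config rule that flags Lambda functions whose resource-based policies allow public access. The rule name is an assumption, and the sketch presumes a configuration recorder is already running in the account.

cat > lambda-public-access-rule.json <<'EOF'
{
  "ConfigRuleName": "lambda-function-public-access-prohibited",
  "Description": "Flags Lambda functions whose resource-based policies allow public access",
  "Source": {
    "Owner": "AWS",
    "SourceIdentifier": "LAMBDA_FUNCTION_PUBLIC_ACCESS_PROHIBITED"
  },
  "Scope": {"ComplianceResourceTypes": ["AWS::Lambda::Function"]}
}
EOF

aws configservice put-config-rule --config-rule file://lambda-public-access-rule.json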
Lambda makes cloud security simpler by taking on more responsibility using the AWS Shared Responsibility Model. Lambda implements strict workload security at scale to isolate your code and prevent network intrusion to your functions. This post provides guidance on assessing and implementing best practices and tools for Lambda to improve your security, governance, and compliance controls. These include permissions, access controls, multiple accounts, and code security. Learn how to audit your function permissions, configuration, and access to ensure that your applications conform to your organizational requirements.
For more serverless learning resources, visit Serverless Land.
AWS re:Inforce returned to Boston last week, kicking off with a keynote from Amazon Chief Security Officer Steve Schmidt and AWS Chief Information Security Officer C.J. Moses:
Be sure to take some time to watch this video and the other leadership sessions, and to use what you learn to take some proactive steps to improve your security posture.
Last Week’s Launches
Here are some launches that caught my eye last week:
AWS Wickr uses 256-bit end-to-end encryption to deliver secure messaging, voice, and video calling, including file sharing and screen sharing, across desktop and mobile devices. Each call, message, and file is encrypted with a new random key and can be decrypted only by the intended recipient. AWS Wickr supports logging to a secure, customer-controlled data store for compliance and auditing, and offers full administrative control over data: permissions, ephemeral messaging options, and security groups. You can now sign up for the preview.
AWS Marketplace Vendor Insights helps AWS Marketplace sellers to make security and compliance data available through AWS Marketplace in the form of a unified, web-based dashboard. Designed to support governance, risk, and compliance teams, the dashboard also provides evidence that is backed by AWS Config and AWS Audit Manager assessments, external audit reports, and self-assessments from software vendors. To learn more, read the What’s New post.
GuardDuty Malware Protection protects Amazon Elastic Block Store (EBS) volumes from malware. As Danilo describes in his blog post, a malware scan is initiated when Amazon GuardDuty detects that a workload running on an EC2 instance or in a container appears to be doing something suspicious. The new malware protection feature creates snapshots of the attached EBS volumes, restores them within a service account, and performs an in-depth scan for malware. The scanner supports many types of file systems and file formats and generates actionable security findings when malware is detected.
Amazon Neptune Global Database lets you build graph applications that run across multiple AWS Regions using a single graph database. You can deploy a primary Neptune cluster in one region and replicate its data to up to five secondary read-only database clusters, with up to 16 read replicas each. Clusters can recover in minutes in the event of an (unlikely) regional outage, with a Recovery Point Objective (RPO) of 1 second and a Recovery Time Objective (RTO) of 1 minute. To learn a lot more and see this new feature in action, read Introducing Amazon Neptune Global Database.
Amazon Detective now Supports Kubernetes Workloads, with the ability to scale to thousands of container deployments and millions of configuration changes per second. It ingests EKS audit logs to capture API activity from users, applications, and the EKS control plane, and correlates user activity with information gleaned from Amazon VPC flow logs. As Channy notes in his blog post, you can enable Amazon Detective and take advantage of a free 30 day trial of the EKS capabilities.
AWS SSO is Now AWS IAM Identity Center in order to better represent the full set of workforce and account management capabilities that are part of IAM. You can create user identities directly in IAM Identity Center, or you can connect your existing Active Directory or standards-based identity provider. To learn more, read this post from the AWS Security Blog.
AWS Config Conformance Packs now provide you with percentage-based scores that will help you track resource compliance within the scope of the resources addressed by the pack. Scores are computed based on the product of the number of resources and the number of rules, and are reported to Amazon CloudWatch so that you can track compliance trends over time. To learn more about how scores are computed, read the What’s New post.
Amazon Macie now lets you perform one-click temporary retrieval of sensitive data that Macie has discovered in an S3 bucket. You can retrieve up to ten examples at a time, and use these findings to accelerate your security investigations. All of the data that is retrieved and displayed in the Macie console is encrypted using customer-managed AWS Key Management Service (AWS KMS) keys. To learn more, read the What’s New post.
AWS Control Tower was updated multiple times last week. CloudTrail Organization Logging creates an org-wide trail in your management account to automatically log the actions of all member accounts in your organization. Control Tower now reduces redundant AWS Config items by limiting recording of global resources to home regions. To take advantage of this change you need to update to the latest landing zone version and then re-register each Organizational Unit, as detailed in the What’s New post. Lastly, Control Tower’s region deny guardrail now includes AWS API endpoints for AWS Chatbot, Amazon S3 Storage Lens, and Amazon S3 Multi Region Access Points. This allows you to limit access to AWS services and operations for accounts enrolled in your AWS Control Tower environment.
Other AWS News
Here are some other news items and customer stories that you may find interesting:
AWS Open Source News and Updates – My colleague Ricardo Sueiras writes a weekly open source newsletter and highlights new open source projects, tools, and demos from the AWS community. Read installment #122 here.
Growy Case Study – This Netherlands-based company is building fully-automated robot-based vertical farms that grow plants to order. Read the case study to learn how they use AWS IoT and other services to monitor and control light, temperature, CO2, and humidity to maximize yield and quality.
Cutting Cardboard Waste – Bin packing is almost certainly a part of every computer science curriculum! In the linked article from the Amazon Science site, you can learn how an Amazon Principal Research Scientist developed PackOpt to figure out the optimal set of boxes to use for shipments from Amazon’s global network of fulfillment centers. This is an NP-hard problem, and the article describes how they built a parallelized solution that explores a multitude of alternative solutions, all running on AWS.
Upcoming Events
Check your calendar and sign up for these online and in-person AWS events:
AWS Global Summits – AWS Global Summits are free events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Registrations are open for the following AWS Summits in August:
IMAGINE 2022 – The IMAGINE 2022 conference will take place on August 3 at the Seattle Convention Center, Washington, USA. It’s a no-cost event that brings together education, state, and local leaders to learn about the latest innovations and best practices in the cloud. You can register here.
That’s all for this week. Check back next Monday for another Week in Review!
It’s critical for your enterprise to understand where sensitive data is stored in your organization and how and why it is shared. The ability to efficiently find data that is shared with entities outside your account and the contents of that data is paramount. You need a process to quickly detect and report which accounts have access to sensitive data. Amazon Macie is an AWS service that can detect many sensitive data types. Macie is a fully managed data security and data privacy service that uses machine learning and pattern matching to discover and help protect your sensitive data in AWS.
AWS Identity and Access Management (IAM) Access Analyzer helps to identify resources in your organization and accounts, such as S3 buckets or IAM roles, that are shared with an external entity. When you enable IAM Access Analyzer, you create an analyzer for your entire organization or your account. The organization or account you choose is known as the zone of trust for the analyzer. The analyzer monitors the supported resources within your zone of trust. This enables IAM Access Analyzer to detect each instance of a resource shared outside the zone of trust and to generate a finding about the resource and the external principals that have access to it.
Currently, you can use IAM Access Analyzer and Macie to detect external access and discover sensitive data as separate processes. You can join the findings from both to best evaluate the risk. The solution in this post integrates IAM Access Analyzer, Macie, and AWS Security Hub to automate the process of correlating findings between the services and presenting them in Security Hub.
How does the solution work?
First, IAM Access Analyzer discovers S3 buckets that are shared outside the zone of trust. Next, the solution schedules a Macie sensitive data discovery job for each of these buckets to determine if the bucket contains sensitive data. Upon discovery of shared sensitive data in S3, a custom high severity finding is created in Security Hub for review and incident response.
Solution architecture
This solution is based on a serverless architecture, and uses the following services:
IAM Access Analyzer detects shared S3 buckets outside of the zone of trust—the organization or account you choose is known as a zone of trust for the analyzer—and creates the event Access Analyzer Finding in EventBridge.
EventBridge triggers the Lambda function sda-aa-save-findings.
The sda-aa-save-findings function records each finding in DynamoDB.
An EventBridge scheduled event periodically starts a new cycle of the Step Function state machine, which immediately runs the Lambda function sda-macie-submit-scan. The template sets a 15-minute interval, but this is configurable.
The sda-macie-submit-scan function reads the IAM Access Analyzer findings that were created by sda-aa-save-findings from DynamoDB.
sda-macie-submit-scan launches a Macie classification job for each distinct S3 bucket that is related to one or more recent IAM Access Analyzer findings.
Macie performs a sensitive data discovery scan on each requested S3 bucket.
The sda-macie-submit-scan function initiates the Lambda function sda-macie-check-status.
sda-macie-check-status periodically checks the status of each Macie classification job, waiting for all the Macie jobs initiated by this solution to complete.
Upon completion of the sda-macie-check-status function, the step function runs the Lambda function sda-sh-create-findings.
sda-sh-create-findings joins the resulting IAM Access Analyzer and Macie datasets for each S3 bucket.
sda-sh-create-findings publishes a finding to Security Hub for each bucket that has both external access and sensitive data.
Note: The Macie scan is skipped if the S3 bucket is tagged to be excluded or if it was recently scanned by Macie. See the Cost considerations section for more information on custom configurations.
Information security can review and act on the findings shown in Security Hub.
Sample Security Hub output
Figure 2 shows the sample findings that Security Hub will present. Each finding includes:
Severity
Workflow status
Record state
Company
Product
Title
Resource
Figure 2: Sample Security Hub findings
The output to Security Hub will display a severity of HIGH with workflow NEW, because this is the first time the event has been observed. The record state is ACTIVE because the workflow state is NEW. The title explains the reason for the event.
For example, if potentially sensitive data is discovered in a bucket that is shared outside a zone of trust, selecting an event will display the resources involved in the finding so you can investigate. For more information, see the Security Hub User Guide.
Notes:
Detection of public S3 buckets by IAM Access Analyzer will still occur through Security Hub and will be marked as critical severity. This solution does not add to or augment this finding in Security Hub.
If a finding in IAM Access Analyzer is archived, the solution does not update the related finding in Security Hub.
The Application code location, S3 Bucket and S3 Key fields will be pre-filled.
Under Service Activations, modify the activations based on the services you presently have running in your account.
Modify the Logging and Monitoring settings if required.
(Optional) Set an alert email address for errors.
Choose Next, then choose Next again.
Under Capabilities, select the check box.
Choose Create Stack. The solution will begin deploying; watch for the CREATE_COMPLETE message.
Figure 3: Sample CloudFormation deployment status
The solution is now deployed and will start monitoring for sensitive data that is being shared. It will send the findings to Security Hub for your teams to investigate.
Cost considerations
When you scan large S3 buckets with sensitive data, remember that Macie cost is based on the amount of data scanned. For more information on Macie costs, see Amazon Macie pricing.
This solution allows the following options, which you can use to help manage costs:
Use environment variables in Lambda to skip specific tagged buckets
Skip recently scanned S3 buckets and reuse prior findings
Figure 4: Screen shot of configurable environment variable
Conclusion
In this post, we discussed how the solution uses Lambda, Step Functions and EventBridge to integrate IAM Access Analyzer with Macie discovery jobs. We reviewed the components of the application, deployed it by using CloudFormation, and reviewed the output a security team would use to take the appropriate actions. We also provided two ways that you can manage the costs associated with the solution.
After you deploy this project, you can modify it to meet your organization’s needs. For example, you can modify the tags to skip specific S3 buckets your organization has already classified to hold sensitive data. Customers who use multiple AWS accounts can designate a centralized Security Hub administrator account to receive the solution alerts from each member account. For more information on this option, see Designating a Security Hub administrator account.
Other resources
For more information on correlating security findings with AWS Security Hub and Amazon EventBridge, refer to this blog post.
In Part 4 of this series, Governing Security at Scale and IAM Baselining, we discussed building a multi-account strategy and improving access management and least privilege to prevent unwanted access and to enforce security controls.
As a refresher from previous posts in this series, our example e-commerce company’s “Shoppers” application runs in the cloud. The company experienced hypergrowth, which posed a number of platform and technology challenges, including enforcing security and governance controls to mitigate security risks.
With the pace of new infrastructure and software deployments, we had to ensure that we maintained strong security. This post, Part 5, shows how we detect security misconfigurations, indicators of compromise, and other anomalous activity. We also show how we developed and iterated on our incident response processes.
Threat detection and data protection
With our newly acquired customer base from hypergrowth, we had to make sure we maintained customer trust. We also needed to detect and respond to security events quickly to reduce the scope and impact of any unauthorized activity. We were concerned about vulnerabilities on our public-facing web servers, accidental sensitive data exposure, and other security misconfigurations.
Prior to hypergrowth, application teams scanned for vulnerabilities and maintained the security of their applications. After hypergrowth, we established a dedicated security team and identified tools to simplify the management of our cloud security posture. This allowed us to easily identify and prioritize security risks.
Use AWS security services to detect threats and misconfigurations
We use the following AWS security services to simplify the management of cloud security risks and reduce the burden of third-party integrations. This also minimizes the amount of engineering work required by our security team.
Detect threats with Amazon GuardDuty
We use Amazon GuardDuty to keep up with the newest threat actor tactics, techniques, and procedures (TTPs) and indicators of compromise (IOCs).
GuardDuty saves us time and reduces complexity, because we don’t have to continuously engineer our own detections for new TTPs and IOCs; GuardDuty provides both static and machine learning based detections. This allows our security analysts to focus on building runbooks and quickly responding to security findings.
Discover sensitive data with Amazon Macie for Amazon S3
To host our external website, we use a few public Amazon Simple Storage Service (Amazon S3) buckets with static assets. We don’t want developers to accidentally put sensitive data in these buckets, and we wanted to understand which S3 buckets contain sensitive information, such as financial or personally identifiable information (PII).
We explored building a custom scanner to search for sensitive data, but maintaining the search patterns was too complex. It was also costly to continuously re-scan files each month. Therefore, we use Amazon Macie to continuously scan our S3 buckets for sensitive data. After Macie makes its initial scan, it will only scan new or updated objects in those S3 buckets, which reduces our costs significantly. We added filter rules to exclude larger files, used S3 prefixes to scan only the required objects, and set a sampling rate to further cost-optimize scanning of large S3 buckets (in our case, S3 buckets greater than 1 TB).
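As a rough illustration of that kind of job configuration (not our exact setup), the following sketch creates a monthly Macie job that scans only objects under an assumed prefix, excludes objects larger than 1 GB, and samples 10 percent of the remaining objects; the names, prefix, and thresholds are hypothetical.

# Monthly job: scan only the static/ prefix, skip objects over 1 GB, sample 10% of what remains.
aws macie2 create-classification-job \
  --job-type SCHEDULED \
  --schedule-frequency '{"monthlySchedule": {"dayOfMonth": 1}}' \
  --name "monthly-static-assets-scan" \
  --sampling-percentage 10 \
  --s3-job-definition '{
    "bucketDefinitions": [
      {"accountId": "<ACCOUNT-ID>", "buckets": ["DOC-EXAMPLE-WEBSITE-BUCKET"]}
    ],
    "scoping": {
      "includes": {
        "and": [
          {"simpleScopeTerm": {"comparator": "STARTS_WITH", "key": "OBJECT_KEY", "values": ["static/"]}}
        ]
      },
      "excludes": {
        "and": [
          {"simpleScopeTerm": {"comparator": "GT", "key": "OBJECT_SIZE", "values": ["1073741824"]}}
        ]
      }
    }
  }'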
Scan for vulnerabilities with Amazon Inspector
Because we use a wide variety of operating systems and software, we must scan our Amazon Elastic Compute Cloud (Amazon EC2) instances for known software vulnerabilities, such as Log4J.
We use Amazon Inspector to run continuous vulnerability scans on our EC2 instances and Amazon Elastic Container Registry (Amazon ECR) container images. With Amazon Inspector, we can continuously detect if our developers are deploying and releasing vulnerable software on our EC2 instances and ECR images without setting up a third-party vulnerability scanner and installing additional endpoint agents.
Aggregate security findings with AWS Security Hub
We don’t want our security analysts to arbitrarily act on one security finding over another. This is time-consuming and does not properly prioritize the highest risks to address. We also need to track ownership, view progress of various findings, and build consistent responses for common security findings.
With AWS Security Hub, our analysts can seamlessly prioritize findings from GuardDuty, Macie, Amazon Inspector, and many other AWS services. Our analysts also use Security Hub’s built-in security checks and insights to identify AWS resources and accounts that have a high number of findings and act on them.
Setting up the threat detection services
This is how we set up these services:
Assigned a security tooling account (covered in a previous blog post in this series) as the delegated administrator for these services. The delegated administrator configures the services and aggregates findings from other member accounts.
Our security analysts use Security Hub-generated Jira tickets to view, prioritize, and respond to all security findings and misconfigurations across our AWS environment.
Through this configuration, our analysts no longer need to pivot between various AWS accounts, security tool consoles, and Regions, which makes the day-to-day management and operations much easier. Figure 1 depicts the data flow to Security Hub.
Figure 1. Aggregation of security services in security tooling account
Figure 2. Delegated administrator setup
Incident response
Before hypergrowth, there was no formal way to respond to security incidents. To prevent future security issues, we built incident response plans and processes to quickly address potential security incidents and minimize the impact and exposure. Following the AWS Security Incident Response Guide and NIST framework, we adopted the following best practices.
Playbooks and runbooks for repeatability
We developed incident response playbooks and runbooks for repeatable responses for security events that include:
Playbooks for more strategic scenarios and responses based on some of the sample playbooks found here.
Runbooks that provide step-by-step guidance for our security analysts to follow in case an event occurs. We used Amazon SageMaker notebooks and AWS Systems Manager Incident Manager runbooks to develop repeatable responses for pre-identified incidents, such as suspected command and control activity on an EC2 instance.
Automation for quicker response time
After developing our repeatable processes, we identified areas where we could accelerate responses to security threats by automating the response. We used the AWS Security Hub Automated Response and Remediation solution as a starting point.
By using this solution, we didn’t need to build our own automated response and remediation workflow. The code is also easy to read, repeat, and centrally deploy through AWS CloudFormation StackSets. We used some of the built-in remediations like disabling active keys that have not been rotated for more than 90 days, making all Amazon Elastic Block Store (Amazon EBS) snapshots private, and many more. With automatic remediation, our analysts can respond quicker and in a more holistic and repeatable way.
Simulations to improve incident response capabilities
We implemented quarterly incident response simulations. These simulations test how well prepared our people, processes, and technologies are for an incident. We included some cloud-specific simulations like an S3 bucket exposure and an externally shared Amazon Relational Database Service (Amazon RDS) snapshot to ensure our security staff are prepared for an incident in the cloud. We use the results of the simulations to iterate on our incident response processes.
Conclusion
In this blog post, we discussed how to prepare for, detect, and respond to security events in an AWS environment. We identified security services to detect security events, vulnerabilities, and misconfigurations. We then discussed how to develop incident response processes through building playbooks and runbooks, performing simulations, and automation. With these new capabilities, we can detect and respond to a security incident throughout hypergrowth.
Looking for more architecture content? The AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!
Given the speed of Amazon Web Services (AWS) innovation, it can sometimes be challenging to keep up with AWS Security service and feature launches. To help you stay current, here’s an overview of some of the most important 2021 AWS Security launches that security professionals should be aware of. This is the first of two related posts; Part 2 will highlight some of the important 2021 launches that security professionals should be aware of across all AWS services.
Amazon GuardDuty
In 2021, the threat detection service Amazon GuardDuty expanded the internal AWS security intelligence it consumes to use more of the intel that AWS internal threat detection teams collect, including additional nation-state threat intelligence. Sharing more of the important intel that internal AWS teams collect lets you quickly improve your protection. GuardDuty also launched domain reputation modeling. These machine learning models take all the domain requests from across all of AWS, and feed them into a model that allows AWS to categorize previously unseen domains as highly likely to be malicious or benign based on their behavioral characteristics. In practice, AWS is seeing that these models often deliver high-fidelity threat detections, identifying malicious domains 7–14 days before they are identified and available on commercial threat feeds.
AWS also launched second generation anomaly detection for GuardDuty. Shortly after the original GuardDuty launch in 2017, AWS added additional anomaly detection for user behavior analytics and monitoring for unusual activity of AWS Identity and Access Management (IAM) users. After receiving customer feedback that the original feature was a little too noisy, and that it was difficult to understand why some findings were generated, the GuardDuty analytics team rebuilt this functionality on an entirely new machine learning model, considerably reducing the number of detections and generating a more accurate positive-detection rate. The new model also added additional context that security professionals (such as analysts) can use to understand why the model shows findings as suspicious or unusual.
Since its introduction, GuardDuty has detected when AWS EC2 Role credentials are used to call AWS APIs from IP addresses outside of AWS. Beginning in early 2022, GuardDuty now supports detection when credentials are used from other AWS accounts, inside the AWS network. This is a complex problem for customers to solve on their own, which is why the GuardDuty team added this enhancement. The solution considers that there are legitimate reasons why a source IP address that is communicating with AWS services APIs might be different than the Amazon Elastic Compute Cloud (Amazon EC2) instance IP address, or a NAT gateway associated with the instance’s VPC. The enhancement also considers complex network topologies that route traffic to one or multiple VPCs—for example, AWS Transit Gateway or AWS Direct Connect.
Our customers are increasingly running container workloads in production; helping to raise the security posture of these workloads became an AWS development priority in 2021. GuardDuty for EKS Protection is one recent feature that has resulted from this investment. This new GuardDuty feature monitors Amazon Elastic Kubernetes Service (Amazon EKS) cluster control plane activity by analyzing Kubernetes audit logs. GuardDuty is integrated with Amazon EKS, giving it direct access to the Kubernetes audit logs without requiring you to turn on or store these logs. Once a threat is detected, GuardDuty generates a security finding that includes container details such as pod ID, container image ID, and associated tags. See below for details on how the new Amazon Inspector is also helping to protect containers.
Amazon Inspector
At AWS re:Invent 2021, we launched the new Amazon Inspector, a vulnerability management service that continually scans AWS workloads for software vulnerabilities and unintended network exposure. The original Amazon Inspector was completely re-architected in this release to automate vulnerability management and to deliver near real-time findings to minimize the time needed to discover new vulnerabilities. This new Amazon Inspector has simple one-click enablement and multi-account support using AWS Organizations, similar to our other AWS Security services. This launch also introduces a more accurate vulnerability risk score, called the Inspector score. The Inspector score is a highly contextualized risk score that is generated for each finding by correlating Common Vulnerability and Exposures (CVE) metadata with environmental factors for resources such as network accessibility. This makes it easier for you to identify and prioritize your most critical vulnerabilities for immediate remediation. One of the most important new capabilities is that Amazon Inspector automatically discovers running EC2 instances and container images residing in Amazon Elastic Container Registry (Amazon ECR), at any scale, and immediately starts assessing them for known vulnerabilities. Now you can consolidate your vulnerability management solutions for both Amazon EC2 and Amazon ECR into one fully managed service.
AWS Security Hub
In addition to a significant number of smaller enhancements throughout 2021, in October AWS Security Hub, an AWS cloud security posture management service, addressed a top customer enhancement request by adding support for cross-Region finding aggregation. You can now view all your findings from all accounts and all selected Regions in a single console view, and act on them from an Amazon EventBridge feed in a single account and Region. Looking back at 2021, Security Hub added 72 additional best practice checks, four new AWS service integrations, and 13 new external partner integrations. A few of these integrations are Atlassian Jira Service Management, Forcepoint Cloud Security Gateway (CSG), and Amazon Macie. Security Hub also achieved FedRAMP High authorization to enable security posture management for high-impact workloads.
Amazon Macie
Based on customer feedback, data discovery tool Amazon Macie launched a number of enhancements in 2021. One new feature, which made it easier to manage Amazon Simple Storage Service (Amazon S3) buckets for sensitive data, was criteria-based bucket selection. This Macie feature allows you to define runtime criteria to determine which S3 buckets should be included in a sensitive data-discovery job. When a job runs, Macie identifies the S3 buckets that match your criteria, and automatically adds or removes them from the job’s scope. Before this feature, once a job was configured, it was immutable. Now, for example, you can create a policy where if a bucket becomes public in the future, it’s automatically added to the scan, and similarly, if a bucket is no longer public, it will no longer be included in the daily scan.
Originally Macie included all managed data identifiers available for all scans. However, customers wanted more surgical search criteria. For example, they didn’t want to be informed if there were exposed data types in a particular environment. In September 2021, Macie launched the ability to enable/disable managed data identifiers. This allows you to customize the data types you deem sensitive and would like Macie to alert on, in accordance with your organization’s data governance and privacy needs.
Amazon Detective
Amazon Detective is a service to analyze and visualize security findings and related data to rapidly get to the root cause of potential security issues. In January 2021, Amazon Detective added a convenient, time-saving integration that allows you to start security incident investigation workflows directly from the GuardDuty console. This new hyperlink pivot takes you from a finding in the GuardDuty console directly into the Detective console. Another time-saving addition was the IP address drill-down functionality. This capability can be useful to security forensics teams performing incident investigations, because it helps them quickly determine the communications that took place from an EC2 instance under investigation before, during, and after an event.
In December 2021, Detective added support for AWS Organizations to simplify management for security operations and investigations across all existing and future accounts in an organization. This launch allows new and existing Detective customers to onboard and centrally manage the Detective graph database for up to 1,200 AWS accounts.
AWS Key Management Service
In June 2021, AWS Key Management Service (AWS KMS) introduced multi-Region keys, a capability that lets you replicate keys from one AWS Region into another. With multi-Region keys, you can more easily move encrypted data between Regions without having to decrypt and re-encrypt with different keys for each Region. Multi-Region keys are supported for client-side encryption using direct AWS KMS API calls, or in a simplified manner with the AWS Encryption SDK and Amazon DynamoDB Encryption Client.
AWS Secrets Manager
Last year was a busy year for AWS Secrets Manager, with four feature launches to make it easier to manage secrets at scale, not just for client applications, but also for platforms. In March 2021, Secrets Manager launched multi-Region secrets to automatically replicate secrets for multi-Region workloads. Also in March, Secrets Manager added three new rules to AWS Config, to help administrators verify that secrets in Secrets Manager are configured according to organizational requirements. Then in April 2021, Secrets Manager added a CSI driver plug-in, to make it easy to consume secrets from Amazon EKS by using Kubernetes’s standard Secrets Store interface. In November, Secrets Manager introduced a higher secret limit of 500,000 per account to simplify secrets management for independent software vendors (ISVs) that rely on unique secrets for a large number of end customers. Although launched in January 2022, it’s also worth mentioning Secrets Manager’s release of rotation windows to align automatic rotation of secrets with application maintenance windows.
Amazon CodeGuru and Secrets Manager
In November 2021, AWS announced a new secrets detector feature in Amazon CodeGuru that searches your codebase for hardcoded secrets. Amazon CodeGuru is a developer tool powered by machine learning that provides intelligent recommendations to detect security vulnerabilities, improve code quality, and identify an application’s most expensive lines of code.
This new feature can pinpoint locations in your code that contain usernames and passwords; database connection strings; and tokens and API keys from AWS and other service providers. When a secret is found in your code, CodeGuru Reviewer provides an actionable recommendation that links to AWS Secrets Manager, where developers can secure the secret with a point-and-click experience.
Looking ahead for 2022
AWS will continue to deliver experiences in 2022 that meet administrators where they govern, developers where they code, and applications where they run. A lot of customers are moving to container and serverless workloads; you can expect to see more work on this in 2022. You can also expect to see more work around integrations, like CodeGuru Secrets Detector identifying plaintext secrets in code (as noted previously).
To stay up-to-date in the year ahead on the latest product and feature launches and security use cases, be sure to read the Security service launch announcements. Additionally, stay tuned to the AWS Security Blog for Part 2 of this blog series, which will provide an overview of some of the important 2021 launches that security professionals should be aware of across all AWS services.
If you’re looking for more opportunities to learn about AWS security services, check out AWS re:Inforce, the AWS conference focused on cloud security, identity, privacy, and compliance, which will take place June 28-29 in Houston, Texas.
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
Want more AWS Security news? Follow us on Twitter.
Let's suppose you need to find sensitive data in an RDS-hosted database using Macie. Because Macie currently supports only Amazon S3 as a data source, you first need to extract the data from RDS and store it in S3. You also need an interface that audit teams can use to review the findings.
Solution overview
Figure 1: Solution architecture workflow
The architecture of the solution in Figure 1 can be described as:
A MySQL engine running on RDS is populated with the Sakila sample database.
An AWS Database Migration Service (AWS DMS) task connects to the Sakila database, transforms the data into a set of compressed Parquet files, and loads them into the dcp-macie bucket.
A Macie classification job analyzes the objects in the dcp-macie bucket using a combination of techniques such as machine learning and pattern matching to determine whether the objects contain sensitive data and to generate detailed reports on the findings.
Kinesis Data Firehose stores these reports in the dcp-glue bucket.
S3 event notification triggers an AWS Lambda function whenever an object is created in the dcp-glue bucket.
The Lambda function named Start Glue Workflow starts a Glue Workflow.
The Glue workflow transforms the data from JSON Lines (JSONL) to the Apache Parquet file format and places it in the dcp-athena bucket. Because Parquet is a binary, column-oriented format, this improves query performance and reduces storage usage.
Athena is used to query and visualize data generated by Macie.
Note: For better readability, the S3 bucket nomenclature omits the suffix containing the AWS Region and AWS account ID used to meet the global uniqueness naming requirement (for example, dcp-athena-us-east-1-123456789012).
The Sakila database schema consists of the following tables:
You can find an IAM policy with the required permissions here.
Step 1 – Deploying the CloudFormation template
You'll use CloudFormation to quickly and consistently provision the AWS resources illustrated in Figure 1. Using a pre-built template file, it creates the infrastructure through an Infrastructure-as-Code (IaC) approach.
Select Create stack and choose With new resources (standard).
On Step 1 – Specify template, choose Upload a template file, select Choose file, and select the file template.yaml downloaded previously.
On Step 2 – Specify stack details, enter a name of your preference for Stack name. You might also adjust the parameters as needed, like the parameter CreateRDSServiceRole to create a service role for RDS if it does not exist in the current account.
On Step 3 – Configure stack options, select Next.
On Step 4 – Review, check the box for I acknowledge that AWS CloudFormation might create IAM resources with custom names, and then select Create Stack.
Wait for the stack to show status CREATE_COMPLETE.
Note: It is expected that provisioning will take around 10 minutes to complete.
Step 2 – Running an AWS DMS task
To extract the data from the Amazon RDS instance, you need to run an AWS DMS task. This makes the data available for Amazon Macie in an S3 bucket in Parquet format.
In the left menu, select Database migration tasks.
Select the task Identifier named rdstos3task.
Select Actions.
Select the option Restart/Resume.
When the Status changes to Load complete, the task has finished and you will be able to see the migrated data in your target bucket (dcp-macie).
Inside each folder, you can see one or more Parquet files with names similar to LOAD00000001.parquet. Now you can use Macie to discover whether the database contents exported to S3 contain sensitive data.
Step 3 – Running a classification job with Amazon Macie
Now you need to create a data classification job so you can assess the contents of your S3 bucket. The job you create will run once and evaluate the complete contents of your S3 bucket to determine whether it can identify PII among the data. As mentioned earlier, this job only uses the managed identifiers available with Macie – you could also add your own custom identifiers.
Choose the S3 bucket dcp-macie containing the output data from the DMS task. You may need to wait a minute and select the Refresh icon if the bucket names do not display.
Select Create job.
Select Next to continue.
Create a job with the following scope.
Sensitive data discovery options: One-time job
Sampling depth: 100%
Leave all other settings with their default values
Select Next to continue.
Select Next again to skip past the Custom data identifiers section.
Give the job a name and description.
Select Next to continue.
Verify the details of the job you have created and select Submit to continue.
You will see a green banner stating that the job was successfully created. The job can take up to 15 minutes to complete, and the Status will change from Active to Complete. To open the findings from the job, select the job's check box, choose Show results, and then select Show findings.
Figure 2: Macie Findings screen
Note: You can navigate in the findings and select each checkbox to see the details.
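If you prefer to script this step rather than use the console, the same kind of one-time, 100% sampling job can be created through the Macie API. The following is a minimal sketch using boto3; the account ID and bucket name are placeholders that follow the post's dcp-macie naming convention and must be replaced with your own values.

```python
import uuid
import boto3

macie = boto3.client("macie2")

ACCOUNT_ID = "123456789012"                       # placeholder: your AWS account ID
BUCKET = "dcp-macie-us-east-1-123456789012"       # placeholder: your dcp-macie bucket

# Create a one-time classification job that scans every object (100% sampling)
response = macie.create_classification_job(
    clientToken=str(uuid.uuid4()),                # idempotency token
    jobType="ONE_TIME",
    name="dcp-rds-export-scan",
    samplingPercentage=100,
    s3JobDefinition={
        "bucketDefinitions": [
            {"accountId": ACCOUNT_ID, "buckets": [BUCKET]}
        ]
    },
)
print("Created Macie job:", response["jobId"])
```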
Step 4 – Enabling querying on classification job results with Amazon Athena
If it's your first time using Athena, you will see the message "Before you run your first query, you need to set up a query result location in Amazon S3. Learn more." Select the link presented with this message.
In the Settings window, choose Select and then choose the bucket dcp-assets to store the Athena query results.
(Optional) To store the query results encrypted, check the box for Encrypt query results and select your preferred encryption type. To learn more about Amazon S3 encryption types, see Protecting data using encryption.
Select Save.
Step 5 – Query Amazon Macie results with Amazon Athena
It might take a few minutes for the data to complete the flow between Amazon Macie and AWS Glue. After it's finished, you'll be able to see the table dcp_athena within the database dcp in the Athena console.
Then, select the three dots next to the table dcp_athena and select the Preview table option to see a data preview, or run your own custom queries.
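You can also run queries programmatically instead of through the console. The sketch below assumes the dcp database and dcp_athena table created by the Glue workflow, plus a query result location in the dcp-assets bucket; because the exact columns depend on the Macie finding schema produced by the workflow, a simple SELECT * is used here.

```python
import boto3

athena = boto3.client("athena")

# Placeholder: the dcp-assets bucket configured as the Athena query result location
OUTPUT_LOCATION = "s3://dcp-assets-us-east-1-123456789012/athena-results/"

# Preview the Macie findings table produced by the Glue workflow
response = athena.start_query_execution(
    QueryString="SELECT * FROM dcp_athena LIMIT 10",
    QueryExecutionContext={"Database": "dcp"},
    ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
)
print("Query execution ID:", response["QueryExecutionId"])
```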
To clean up, select the stack in the CloudFormation console, select Delete, and then select Delete stack in the pop-up window.
Conclusion
In this blog post, we showed how you can find personally identifiable information (PII), and other data types that Macie's managed data identifiers define as sensitive, in an RDS-hosted MySQL database. You can use this solution with other relational databases like PostgreSQL, SQL Server, or Oracle, whether hosted on RDS or EC2. If you're using Amazon DynamoDB, you may also find the blog post Detecting sensitive data in DynamoDB with Macie useful.
By classifying your data, you can inform your management of appropriate data protection and handling controls for the use of that data.
If you have feedback about this post, submit comments in the Comments section below.
Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.
In this post, we describe the AWS services that you can use to both detect and protect your data stored in Amazon Simple Storage Service (Amazon S3). When you analyze security in depth for your Amazon S3 storage, consider doing the following:
Audit and restrict Amazon S3 access with IAM Access Analyzer
Classify and secure sensitive data with Macie
Detect malicious access patterns with GuardDuty
Monitor and remediate configuration changes with AWS Config
Using these additional AWS services along with Amazon S3 can improve your security posture across your accounts.
Audit and restrict Amazon S3 access with IAM Access Analyzer
IAM Access Analyzer allows you to identify unintended access to your resources and data. Users and developers need access to Amazon S3, but it’s important for you to keep users and privileges accurate and up to date.
Amazon S3 can often house sensitive and confidential information. To help secure your data within Amazon S3, you should be using AWS Key Management Service (AWS KMS) with server-side encryption at rest for Amazon S3. It is also important that you secure the S3 buckets so that you only allow access to the developers and users who require that access. Bucket policies and access control lists (ACLs) are the foundation of Amazon S3 security. Your configuration of these policies and lists determines the accessibility of objects within Amazon S3, and it is important to audit them regularly to properly secure and maintain the security of your Amazon S3 bucket.
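As one illustration of the encryption-at-rest guidance above, the following sketch enables default SSE-KMS encryption on a bucket with boto3. The bucket name and KMS key ARN are placeholders, and this is only one way to apply the setting (you can also do it in the console or with IaC).

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "example-sensitive-data-bucket"                         # placeholder
KMS_KEY_ARN = "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE"   # placeholder

# Apply SSE-KMS as the default encryption for new objects in the bucket
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": KMS_KEY_ARN,
                },
                "BucketKeyEnabled": True,  # reduces KMS request costs
            }
        ]
    },
)
```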
IAM Access Analyzer can scan all the supported resources within a zone of trust. Access Analyzer then provides you with insight when a bucket policy or ACL allows access to any external entities that are not within your organization or your AWS account’s zone of trust.
The example in Figure 1 shows creating an analyzer with the zone of trust as the current account, but you can also create an analyzer with the organization as the zone of trust.
Figure 1: Creating IAM Access Analyzer and zone of trust
After you create your analyzer, IAM Access Analyzer automatically scans the resources in your zone of trust and returns the findings from your Amazon S3 storage environment. The initial scan shown in Figure 2 shows the findings of an unsecured S3 bucket.
Figure 2: Example of unsecured S3 bucket findings
For each finding, you can decide which action you would like to take. As shown in Figure 3, you are given the option to archive (if the finding indicates intended access) or take action to modify bucket permissions (if the finding indicates unintended access).
Figure 3: Displays choice of actions to take
After you address the initial findings, Access Analyzer monitors your bucket policies for changes, and notifies you of access issues it finds. Access Analyzer is regional and must be enabled in each AWS Region independently.
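If you would rather create the analyzer programmatically, for example once per Region, a minimal boto3 sketch looks like the following. The analyzer name is arbitrary, and the type can be ORGANIZATION instead of ACCOUNT if you want to use the organization as the zone of trust. Note that findings may take a short while to appear after the analyzer is created.

```python
import boto3

access_analyzer = boto3.client("accessanalyzer")

# Create an analyzer with the current account as the zone of trust
response = access_analyzer.create_analyzer(
    analyzerName="s3-account-analyzer",  # arbitrary name
    type="ACCOUNT",                      # or "ORGANIZATION"
)
print("Analyzer ARN:", response["arn"])

# List any findings the analyzer has produced so far
findings = access_analyzer.list_findings(analyzerArn=response["arn"])
for finding in findings.get("findings", []):
    print(finding["id"], finding.get("resource"), finding["status"])
```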
Classify and secure sensitive data with Macie
Organizational compliance standards often require the identification and securing of sensitive data. Your organization’s sensitive data might contain personally identifiable information (PII), which includes things such as credit card numbers, birthdates, and addresses.
Macie is a data security and privacy service offered by AWS that uses machine learning and pattern matching to discover the sensitive data stored within Amazon S3. You can define your own custom sensitive data types that might be unique to your business or use case. Macie will automatically provide an inventory of your S3 buckets and alert you to unprotected sensitive data.
Figure 4 shows a sample result from a Macie scan in which you can see important information regarding Amazon S3 public access, encryption settings, and sharing.
Figure 4: Sample results from a Macie scan
In addition to finding potential sensitive data, Macie also gives you a severity score based on the privacy risk, as shown in the example data in Figure 5.
Figure 5: Example Macie severity scores
When you use Macie in conjunction with AWS Step Functions, you can also automatically remediate any issues found. You can use this combination to help meet regulations such as General Data Protection Regulation (GDPR) and Health Insurance Portability and Accountability Act (HIPAA). Macie allows you to have constant visibility of sensitive data within your Amazon S3 storage environment.
When you deploy Macie in a multi-account configuration, your usage is rolled up to the Macie administrator account to provide the total usage for all accounts and a breakdown across the entire organization.
Detect malicious access patterns with GuardDuty
Your customers and users can perform thousands of actions each day on S3 buckets. Discerning access patterns manually can be extremely time consuming as the volume of data increases. GuardDuty uses machine learning, anomaly detection, and integrated threat intelligence to analyze billions of events across multiple accounts, using data collected from AWS CloudTrail logs (including S3 data events), VPC Flow Logs, and DNS logs. GuardDuty can be configured to analyze these logs and notify you of suspicious activity, such as unusual data access patterns, unusual discovery API calls, and more. After you receive a list of findings on these activities, you will be able to make informed decisions to secure your S3 buckets.
Figure 6 shows a sample list of findings returned by GuardDuty which shows the finding type, resource affected, and count of occurrences.
Figure 6: Example GuardDuty list of findings
You can select one of the results in Figure 6 to see the IP address and details associated with this potentially malicious IP caller, as shown in Figure 7.
Figure 7: GuardDuty Malicious IP Caller detailed findings
Monitor and remediate configuration changes with AWS Config
Configuration management is important when securing Amazon S3, to prevent unauthorized users from gaining access. It is important that you monitor the configuration changes of your S3 buckets, whether the changes are intentional or unintentional. AWS Config can track all configuration changes that are made to an S3 bucket. For example, if an S3 bucket had its permissions and configurations unexpectedly changed, using AWS Config allows you to see the changes made, as well as who made them.
AWS Config can be used in conjunction with AWS Lambda. If an S3 bucket is noncompliant, AWS Config can trigger a preprogrammed Lambda function that resolves the issue. This combination can be used to reduce your operational overhead in maintaining compliance within your S3 buckets.
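There are several ways to wire this up (custom Config rules, Config remediation actions, or an EventBridge rule on compliance-change events). The sketch below assumes the EventBridge approach: a rule matching Config Rules Compliance Change events invokes this hypothetical Lambda function, which re-applies the S3 Block Public Access settings to a noncompliant bucket. The event field names follow the documented Config compliance-change event shape, but treat this as an illustration rather than a drop-in remediation.

```python
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Triggered by an EventBridge rule for AWS Config compliance changes."""
    detail = event.get("detail", {})
    resource_type = detail.get("resourceType")
    compliance = detail.get("newEvaluationResult", {}).get("complianceType")

    # Only act on noncompliant S3 buckets
    if resource_type != "AWS::S3::Bucket" or compliance != "NON_COMPLIANT":
        return {"action": "none"}

    bucket = detail["resourceId"]  # for S3 buckets, resourceId is the bucket name

    # Remediate by re-applying Block Public Access on the bucket
    s3.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )
    return {"action": "remediated", "bucket": bucket}
```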
Figure 8 shows a sample of AWS Config managed rules selected for configuration monitoring and gives a brief description of what the rule does.
Figure 8: Sample selections of AWS Managed Rules
Figure 9 shows a sample result of a non-compliant configuration and resource inventory listing the type of resource affected and the number of occurrences.
Figure 9: Example of AWS Config non-compliant resources
Conclusion
AWS has many offerings to help you audit and secure your storage environment. In this post, we discussed the particular combination of AWS services that together will help reduce the amount of time and focus your business devotes to security practices. This combination of services will also enable you to automate your responses to any unwanted permission and configuration changes, saving you valuable time and resources to dedicate elsewhere in your organization.
Securing sensitive information is a high priority for organizations for many reasons. At the same time, organizations are looking for ways to empower development teams to stay agile and innovative. Centralized security teams strive to create systems that align to the needs of the development teams, rather than mandating how those teams must operate.
Security teams who create automation for the discovery of sensitive data have some issues to consider. If development teams are able to self-provision data storage, how does the security team protect that data? If teams have a business need to store sensitive data, they must consider how, where, and with what safeguards that data is stored.
Data classification is part of the security pillar of a Well-Architected application. Following the guidelines provided in the AWS Well-Architected Framework, we can develop a resource-tagging scheme that fits our needs.
Overview of decentralized data validation system
In our example, we have multiple levels of data classification that represent different levels of risk associated with each classification. When a software development team creates a new Amazon Simple Storage Service (S3) bucket, they are responsible for labeling that bucket with a tag. This tag represents the classification of data stored in that bucket. The security team must maintain a system to validate that the data in those buckets meets the classification specified by the development teams.
This separation of roles and responsibilities for development and security teams who work independently requires a validation system that’s decoupled from S3 bucket creation. It should automatically detect new buckets or data in the existing buckets, and validate the data against the assigned classification tags. It should also notify the appropriate development teams of misclassified or unclassified buckets in a timely manner. These notifications can be through standard notification channels, such as email or Slack channel notifications.
Validation and alerts with AWS services
Figure 1. Validation system for data classification
We assume that teams are permitted to create S3 buckets and we will use AWS Config to enforce the following required tags: DataClassification and SupportSNSTopic. The DataClassification tag indicates what type of data is allowed in the bucket. The SupportSNSTopic tag indicates an Amazon Simple Notification Service (SNS) topic. If there are issues found with the data in the bucket, a message is published to the topic, and Amazon SNS will deliver an alert. For example, if there is personally identifiable information (PII) data in a bucket that is classified as non-sensitive, the system will alert the owners of the bucket.
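As a sketch of how the security team might enforce those two tags, the following uses the AWS Config managed rule REQUIRED_TAGS scoped to S3 buckets. This assumes an AWS Config recorder is already set up in the account; the rule name is arbitrary.

```python
import json
import boto3

config = boto3.client("config")

# Require the DataClassification and SupportSNSTopic tags on every S3 bucket
config.put_config_rule(
    ConfigRule={
        "ConfigRuleName": "s3-bucket-required-classification-tags",  # arbitrary name
        "Description": "S3 buckets must carry DataClassification and SupportSNSTopic tags",
        "Scope": {"ComplianceResourceTypes": ["AWS::S3::Bucket"]},
        "Source": {"Owner": "AWS", "SourceIdentifier": "REQUIRED_TAGS"},
        "InputParameters": json.dumps(
            {"tag1Key": "DataClassification", "tag2Key": "SupportSNSTopic"}
        ),
    }
)
```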
Macie is configured to scan all S3 buckets on a scheduled basis. This configuration ensures that any new bucket and data placed in the buckets is analyzed the next time the Macie job runs.
Macie provides several managed data identifiers for discovering and classifying the data. These include bank account numbers, credit card information, authentication credentials, PII, and more. You can also create custom identifiers (or rules) to gather information not covered by the managed identifiers.
Macie integrates with Amazon EventBridge to allow us to capture data classification events and route them to one or more destinations for reporting and alerting needs. In our configuration, the event initiates an AWS Lambda function. The Lambda function validates the data classification inferred by Macie against the classification specified in the DataClassification tag, using custom business logic. If a data classification violation is found, the Lambda function then sends a message to the Amazon SNS topic specified in the SupportSNSTopic tag.
The Lambda function also creates custom metrics and sends those to Amazon CloudWatch. The metrics are organized by engineering team and severity. This allows the security team to create a dashboard of metrics based on the Macie findings. The findings can also be filtered per engineering team and severity to determine which teams need to be contacted to ensure remediation.
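A minimal sketch of that validation Lambda function is shown below. It assumes the Macie finding event shape delivered by EventBridge, a simple example rule that any sensitive-data finding on a bucket classified NonSensitive (or unclassified) is a violation, a hypothetical Team tag used as the metric dimension, and an assumed CloudWatch namespace. Your business logic and tag values will differ.

```python
import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")
cloudwatch = boto3.client("cloudwatch")

def lambda_handler(event, context):
    """Invoked by EventBridge for each Macie finding."""
    detail = event["detail"]
    bucket = detail["resourcesAffected"]["s3Bucket"]["name"]
    severity = detail["severity"]["description"]        # e.g. Low / Medium / High
    finding_type = detail["type"]

    # The required tags are enforced by AWS Config, so we expect them to exist
    tags = {t["Key"]: t["Value"] for t in s3.get_bucket_tagging(Bucket=bucket)["TagSet"]}
    classification = tags.get("DataClassification", "Unclassified")
    topic_arn = tags.get("SupportSNSTopic")
    team = tags.get("Team", "unknown")                   # assumed tag for the metric dimension

    # Example business rule: sensitive data found in a bucket not classified as sensitive
    violation = classification in ("NonSensitive", "Unclassified")

    if violation and topic_arn:
        sns.publish(
            TopicArn=topic_arn,
            Subject=f"Data classification violation in {bucket}",
            Message=(f"Macie finding {finding_type} ({severity}) conflicts with "
                     f"DataClassification={classification}."),
        )

    # Custom metric, organized by team and severity, for the security dashboard
    cloudwatch.put_metric_data(
        Namespace="DataClassification",                  # assumed namespace
        MetricData=[{
            "MetricName": "ClassificationViolations" if violation else "ValidatedFindings",
            "Dimensions": [{"Name": "Team", "Value": team},
                           {"Name": "Severity", "Value": severity}],
            "Value": 1,
            "Unit": "Count",
        }],
    )
```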
Conclusion
This solution provides a centralized security team with the tools it needs. The team can validate the data classification of an Amazon S3 bucket that is self-provisioned by a development team. New Amazon S3 buckets are automatically included in the Macie jobs, and alerts are sent only if the data in a bucket does not conform to the classification specified by the development team. The data auditing process is loosely coupled with the Amazon S3 bucket creation process, enabling self-service capabilities for development teams while ensuring proper data classification. Your teams can stay agile and innovative, while maintaining a strong security posture.
Customers are looking for ways to securely and cost-efficiently manage the archival and deletion of large volumes of sensitive data in their data lakes, in accordance with regulations and data protection and privacy laws such as GDPR, POPIA, and LGPD. This post describes a way to automatically identify sensitive data stored in your data lake within AWS, tag the data according to its sensitivity level, and apply appropriate lifecycle policies in a secured and cost-effective way.
Amazon Macie is a managed data security and data privacy service that uses machine learning (ML) and pattern matching to discover and protect your sensitive data stored in Amazon Simple Storage Service (Amazon S3). In this post, we show you how to develop a solution using Macie, Amazon Kinesis Data Firehose, Amazon S3, Amazon EventBridge, and AWS Lambda to identify sensitive data across a large number of S3 buckets, tag the data, and apply lifecycle policies for transition and deletion.
Solution overview
The following diagram illustrates the architecture of our solution.
The flow of the solution is as follows:
An Amazon S3 bucket contains multiple objects with different sensitivities of data.
A Macie job analyzes the S3 bucket to identify the different sensitivities.
An EventBridge rule is triggered for each finding that Macie generates from the job.
The rule copies the results created by the Macie job to a Kinesis Data Firehose delivery stream.
The delivery stream copies the results to an S3 bucket as a JSON file for future audit requirements.
The arrival of the results triggers a Lambda function that parses the sensitivity metadata from the JSON file.
The function tags the objects in the bucket mentioned in Step 1 according to the sensitivity level of each object, and creates or overwrites the corresponding S3 Lifecycle policy.
The S3 Lifecycle policy moves data to different storage classes and deletes data based on the configured rules (a sketch of the tagging and lifecycle step follows this list). For example, we implement the following rules:
Archive objects with high sensitivity, tagged as High, after 700 days.
Delete objects tagged as High after 3,000 days.
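A rough sketch of that tagging and lifecycle step is shown below. The bucket and object key are placeholders, the 700-day archive and 3,000-day expiration values mirror the sample rules above, and GLACIER is used as the archive storage class; the Lambda function deployed by the CloudFormation template may differ in detail.

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "archival-blog-123456789012-us-east-1"  # placeholder subject bucket
KEY = "sample_pii.xlsx"                          # placeholder object flagged by Macie

# Tag the object that Macie reported as containing highly sensitive data
s3.put_object_tagging(
    Bucket=BUCKET,
    Key=KEY,
    Tagging={"TagSet": [{"Key": "Sensitivity", "Value": "High"}]},
)

# Lifecycle rule for objects tagged Sensitivity=High:
# archive after 700 days, delete after 3,000 days.
# Note: this call replaces the bucket's existing lifecycle configuration.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "high-sensitivity-archival",
                "Filter": {"Tag": {"Key": "Sensitivity", "Value": "High"}},
                "Status": "Enabled",
                "Transitions": [{"Days": 700, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 3000},
            }
        ]
    },
)
```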
Create resources with AWS CloudFormation
We provide an AWS CloudFormation template to create the following resources:
An S3 bucket named archival-blog-<account_number>-<region_name> as a sample subject bucket as described above.
An S3 bucket named archival-blog-results-<account_number>-<region_name> to store the results generated by the Macie job.
A Firehose delivery stream to send the results of the Macie job to the S3 bucket.
An EventBridge rule that matches the incoming event of a result generated by the Macie job and routes the result to the Firehose delivery stream.
A Lambda function to apply tags and S3 Lifecycle policies on the data objects of the subject S3 bucket based on the result generated by the Macie job.
Launch the following stack, providing your stack name:
After the CloudFormation stack is deployed, copy sample_pii.xlsx and recipe.xlsx into archival-blog-<account_number>-<region_name> as sample data for Macie to detect as sensitive.
Next, we scan the subject S3 bucket for sensitive data to tag and attach the appropriate lifecycle policies.
Configure a Macie job
Macie uses ML and pattern matching to discover and protect your sensitive data in AWS. To configure a Macie job, complete the following steps:
On the Macie console, create a new job by choosing Create Job.
Select the bucket that you want to analyze and choose Next.
Select a schedule if you want to run the job periodically, or One-time job if you want to run the job once. For this post, we select One-time job. You can also choose Scheduled job for periodic jobs in production, but that is out of the scope of this post.
Choose Next.
Enter a name for the job and choose Next.
Review the job details and confirm they’re correct before choosing Submit.
The job immediately starts after you submit it.
Review the results
Whenever the Macie job runs, it generates the following results. First, Macie creates a finding for each object in which it detects sensitive data. Second, the Lambda function tags the sensitive object with Sensitivity: High. Third, the function creates a corresponding S3 Lifecycle policy.
Clean up
When you’re done with this solution, delete the sample data from the subject S3 bucket, delete all the data objects that are stored as Macie results in the S3 bucket for this post, and then delete the CloudFormation stack to remove all the service resources used in the solution.
Conclusion
In this post, we highlighted how you can use Macie to scan all your data stored on Amazon S3 and how to store your security findings in Amazon S3 using EventBridge and Kinesis Data Firehose. We also explored how you can use Lambda to tag the relevant objects in your subject S3 bucket using the Macie job's security findings. An S3 Lifecycle transition policy moves tagged objects across different storage classes to help save cost, and an S3 Lifecycle expiration policy deletes objects that have reached the end of their lifecycle. Under privacy laws like GDPR and POPIA, personal data should be retained only as long as it needs to be kept for legal purposes or processing. In this post, we provided a mechanism to archive sensitive data that you might not want to delete immediately, reducing related storage costs and deleting the data after a defined period when it is no longer required. The archival and deletion periods used above are sample numbers that can be customized within the Lambda function based on your requirements. Additionally, explore Macie's different types of findings. You can use these findings to build capabilities such as sending notifications, or creating searches in CloudTrail S3 object-level logs to determine who is accessing specific objects.
If you have any questions, comments, or concerns, please reach out to AWS Support. If you have feedback about this post, submit it in the comments section.
About the Authors
Subhro Bose is a Data Architect in Emergent Technologies and Intelligence Platform in Amazon. He loves working on ways for emergent technologies such as AI/ML, big data, quantum, and more to help businesses across different industry verticals succeed within their innovation journey.
Akshay Chandiramani is Data Analytics Consultant for AWS Professional Services. He is passionate about solving complex big data and MLOps problems for customers to create tangible business outcomes. In his spare time, he enjoys playing snooker and table tennis, and capturing trends on the stock market.
Ikenna Uzoh is a Senior Manager, Business Intelligence (BI) and Analytics at AWS. Before his current role, he was a Senior Big Data and Analytics Architect with AWS Professional Services. He is passionate about building data platforms (data lakes, BI, Machine learning) that help organizations achieve their desired outcomes.
Following the example of the EU in implementing the General Data Protection Regulation (GDPR), many countries are implementing similar data protection laws. In response, many companies are forming teams that are responsible for data protection. Considering the volume of information that companies maintain, it’s essential that these teams are alerted when sensitive data is at risk.
This post shows how to deploy a solution that uses Amazon Macie to discover sensitive data. The solution uses Amazon EventBridge and AWS Lambda to automatically notify your company's designated data protection team via a Slack channel when Macie discovers sensitive data that needs to be protected.
The challenge
Let’s imagine that you’re part of a team that’s responsible for classifying your organization’s data but the data structure isn’t documented. Amazon Macie provides you the ability to run a scheduled classification job that examines your data, and you want to notify the data protection team when there’s new sensitive data to classify. Let’s build a solution to automatically notify the data protection team.
Solution overview
To be scalable and cost-effective, this solution uses serverless technologies and managed AWS services, including:
Macie – A fully managed data security and data privacy service that uses machine learning and pattern matching to discover and protect your sensitive data in Amazon Web Services (AWS).
EventBridge – A serverless event bus that connects application data from your apps, SaaS, and AWS services. EventBridge can respond to specific events or run according to a schedule. The solution presented in this post uses EventBridge to initiate a custom Lambda function in response to a specific event.
Lambda – Runs code in response to events such as changes in data, changes in application state, or user actions. In this solution, a Lambda function is initiated by EventBridge.
Solution architecture
The architecture workflow is shown in Figure 1 and includes the following steps:
Macie runs a classification job and publishes its findings to EventBridge as a JSON object.
The EventBridge rule captures the findings and invokes a Lambda function as a target.
The Lambda function parses the JSON object. The function then sends a custom message to a Slack channel with the sensitive data finding for the data protection team to evaluate and respond to (a sketch of this function follows Figure 1).
Figure 1: Solution architecture workflow
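A minimal sketch of that Lambda function follows. It assumes the Macie finding event shape delivered by EventBridge and a webhook URL supplied through an environment variable named SLACK_WEBHOOK_URL; the deployed solution's actual parameter names and message formatting may differ.

```python
import json
import os
import urllib.request

SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]  # assumed environment variable

def lambda_handler(event, context):
    """Invoked by the EventBridge rule for each Macie finding."""
    detail = event["detail"]
    bucket = detail["resourcesAffected"]["s3Bucket"]["name"]
    severity = detail["severity"]["description"]
    title = detail.get("title", detail.get("type", "Macie finding"))

    message = {
        "text": (
            f":rotating_light: Macie finding ({severity}) in bucket *{bucket}*\n"
            f"{title}"
        )
    }

    # Post the message to the Slack incoming webhook
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return {"slack_status": response.status}
```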
Set up Slack
For this solution, you need a Slack workspace and an incoming webhook. The workspace must be in place before you create the webhook.
Create a Slack workspace
If you already have a Slack workspace in your environment, you can skip forward to creating the webhook.
If you don’t have a Slack workspace, follow the steps in Create a Slack Workspace to create one.
Development Slack Workspace – Choose the Slack workspace—either an existing workspace or one you created for this solution—to receive the Macie findings.
Choose the Create App button.
In the left menu, choose Incoming Webhooks.
At the Activate Incoming Webhooks screen, move the slider from OFF to ON.
Scroll down and choose Add New Webhook to Workspace.
In the screen asking where your app should post, enter the name of the Slack channel from your Workspace that you want to send notification to and choose Authorize.
On the next screen, scroll down to the Webhook URL section. Make a note of the URL to use later.
Deploy the CloudFormation template with the solution
The deployment of the CloudFormation template automatically creates the following resources:
A Lambda function whose name begins with macie-to-slack-lambdafindingsToSlack-.
An EventBridge rule named MacieFindingsToSlack.
An IAM role named MacieFindingsToSlackRole.
A permission to invoke the Lambda function named LambdaInvokePermission.
Note: Before you proceed, make sure you're deploying the template in the same Region where your production Macie is running.
Note: To save the template, you can right click the Raw button at the top of the code and then select Save link as if you’re using Chrome, or the equivalent in your browser. This file is used in Step 4.
On the Welcome page, choose Create stack and then choose With new resources.
On Step 1 — Specify template, choose Upload a template file, select Choose file and then select the file template.yaml (the file extension might be .YML), then choose Next.
On Step 2 — Specify stack details:
Enter macie-to-slack as the Stack name.
At the Slack Incoming Web Hook URL, paste the webhook URL you copied earlier.
At Slack channel, enter the name of the channel in your workspace that will receive the alerts and choose Next.
Figure 2: Defining stack details
On Step 3 – Configure Stack options, you can leave the default settings, or change them for your environment. Choose Next to continue.
At the bottom of Step 4 – Review, select I acknowledge that AWS CloudFormation might create IAM resources, and choose Create stack.
Figure 3: Confirmation before stack creation
Wait for the stack to reach status CREATE_COMPLETE.
Running the solution
At this point, you’ve deployed the solution and your resources are created.
To test the solution, you can schedule a Macie job targeting a bucket that contains a file with sensitive information that Macie can detect.
Note: You can check the Amazon Macie documentation to see the list of supported managed data identifiers.
When the Macie job is complete, any findings are sent to the Slack channel.
Figure 4: Macie finding delivered to Slack channel
Select the link in the message sent to the Slack channel to open that finding in the Macie console, as shown in Figure 5.
Figure 5: Finding details
And you’re done!
Now your Macie finding results are delivered to your Slack channel where they can be easily monitored, reducing response time and risk exposure.
If you deployed this for testing purposes, or want to clean this up and move to your production account, you can delete the CloudFormation stack.
In this blog post, we walked through the steps to configure a notification workflow using Macie, Lambda, and EventBridge to send sensitive data findings to your data protection team via a Slack channel.
Your data protection team will appreciate the timely notifications of sensitive data findings, giving you the ability to focus on creating controls to improve data security and compliance with regulations related to protection and treatment of personal data.
For more information about data privacy on AWS, see Data Privacy FAQ.
If you have feedback about this post, submit comments in the Comments section below.
Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.
The amount of data being collected, stored, and processed by Amazon Web Services (AWS) customers is growing at an exponential rate. In order to keep pace with this growth, customers are turning to scalable cloud storage services like Amazon Simple Storage Service (Amazon S3) to build data lakes at the petabyte scale. Customers are looking for new, automated, and scalable ways to address their data security and compliance requirements, including the need to identify and protect their sensitive data. Amazon Macie helps customers address this need by offering a managed data security and data privacy service that uses machine learning and pattern matching to discover and protect your sensitive data that is stored in Amazon S3.
In this blog post, I show you how to deploy a solution that establishes an automated event-driven workflow for notification and remediation of sensitive data findings from Macie. Administrators can review and approve remediation of findings through a ChatOps-style integration with Slack. Slack is a business communication tool that provides messaging functionality, including persistent chat rooms known as channels. With this solution, you can streamline the notification, investigation, and remediation of sensitive data findings in your AWS environment.
Prerequisites
Before you deploy the solution, make sure that your environment is set up with the following prerequisites:
You have access to an AWS account with permissions to create the resources listed in the solution overview through AWS CloudFormation.
You have a Slack account with permissions to add apps and integrations in your desired workspace and channel. If you aren’t already a Slack user, it’s free to sign up and create a workspace and channel of your own.
Important: This solution uses various AWS services, and there are costs associated with these resources after the Free Tier usage. See the AWS pricing page for details.
Solution overview
The solution architecture and workflow are detailed in Figure 1.
Figure 1: Solution overview
This solution allows for the configuration of auto-remediation behavior based on finding type and finding severity. For each finding type, you can define whether you want the offending S3 object to be automatically quarantined, or whether you want the finding details to be reviewed and approved by a human in Slack prior to being quarantined. In a similar manner, you can define the minimum severity level (Low, Medium, High) that a finding must have before the solution will take action. By adjusting these parameters, you can manage false positives and tune the volume and type of findings about which you want to be notified and take action. This configurability is important because customers have different security, risk, and regulatory requirements.
Figure 1 details the services used in the solution and the integration points between them. Let’s walk through the full sequence from the detection of sensitive data to the remediation (quarantine) of the offending object.
Macie is configured with sensitive data discovery jobs (scheduled or one-time), which you create and run to detect sensitive data within S3 buckets. When Macie runs a job, it uses a combination of criteria and techniques to analyze objects in S3 buckets that you specify. For a full list of the categories of sensitive data Macie can detect, see the Amazon Macie User Guide.
For each sensitive data finding, an event is sent to Amazon EventBridge that contains the finding details. An EventBridge rule triggers a Lambda function for processing.
The Finding Handler Lambda function parses the event and examines the type of the finding. Based on the auto-remediation configuration, the function either invokes the Finding Remediator function for immediate remediation, or sends the finding details for manual review and remediation approval through Slack.
Delegated security and compliance administrators monitor the configured Slack channel for notifications. Notifications provide high-level finding information, remediation status, and a link to the Macie console for the finding in question. For findings configured for manual review, administrators can choose to approve the remediation in Slack by using an action button on the notification.
After an administrator chooses the Remediate button, Slack issues an API call to an Amazon API Gateway endpoint, supplying both the unique identifier of the finding to be remediated and that of the Slack user. API Gateway proxies the request to a Remediation Handler Lambda function.
The Remediation Handler Lambda function validates the request and request signature, extracts the offending object’s location from the finding, and makes an asynchronous call to the Finding Remediator Lambda function.
The Finding Remediator Lambda function moves the offending object from the source bucket to a designated S3 quarantine bucket with restricted access (a simplified sketch of this function appears after these steps).
Finally, the Finding Remediator Lambda function uses a callback URL to update the original finding notification in Slack, indicating that the offending object has now been quarantined.
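The simplified sketch below copies the offending object into the quarantine bucket and removes it from the source bucket. The event field names and the QUARANTINE_BUCKET environment variable are assumptions, and the deployed function also updates the Slack message via the callback URL, which is omitted here.

```python
import os
import boto3

s3 = boto3.client("s3")
QUARANTINE_BUCKET = os.environ["QUARANTINE_BUCKET"]  # assumed environment variable

def lambda_handler(event, context):
    """Quarantines the S3 object referenced by a Macie finding."""
    source_bucket = event["bucket"]   # assumed payload from the Remediation Handler
    object_key = event["key"]

    # Copy the offending object into the restricted quarantine bucket...
    s3.copy_object(
        Bucket=QUARANTINE_BUCKET,
        Key=f"{source_bucket}/{object_key}",
        CopySource={"Bucket": source_bucket, "Key": object_key},
    )
    # ...then remove it from the original location.
    s3.delete_object(Bucket=source_bucket, Key=object_key)

    return {"quarantined": f"s3://{QUARANTINE_BUCKET}/{source_bucket}/{object_key}"}
```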
Deploy the solution
Now we’ll walk through the steps for configuring Slack and deploying the solution into your AWS environment by using the AWS CDK. The AWS CDK is a software development framework that you can use to define cloud infrastructure in code and provision through AWS CloudFormation.
The deployment steps can be summarized as follows:
Configure a Slack channel and app
Check the project out from GitHub
Set the configuration parameters
Build and deploy the solution
Configure Slack with an API Gateway endpoint
To configure a Slack channel and app
In your browser, make sure you’re logged into the Slack workspace where you want to integrate the solution.
Create a new channel where you will send the notifications, as follows:
Choose the + icon next to the Channels menu, and select Create a channel.
Give your channel a name, for example macie-findings, and make sure you turn on the Make private setting.
Important: By providing Slack users with access to this configured channel, you’re providing implicit access to review Macie finding details and approve remediations. To avoid unwanted user access, it’s strongly recommended that you make this channel private and by invite only.
On your Apps page, create a new app by selecting Create New App, and then enter the following information:
For App Name, enter a name of your choosing, for example MacieRemediator.
Select your chosen development Slack workspace that you logged into in step 1.
Choose Create App.
Figure 2: Create a Slack app
You will then see the Basic Information page for your app. Scroll down to the App Credentials section, and note down the Signing Secret. This secret will be used by the Lambda function that handles all remediation requests from Slack. The function uses the secret with Hash-based Message Authentication Code (HMAC) authentication to validate that requests to the solution are legitimate and originated from your trusted Slack channel (a sketch of this check appears after the Slack setup steps below).
Figure 3: Signing secret
Scroll back to the top of the Basic Information page, and under Add features and functionality, select the Incoming Webhooks tile. Turn on the Activate Incoming Webhooks setting.
At the bottom of the page, choose Add New Webhook to Workspace.
Select the macie-findings channel you created in step 2, and choose Allow.
You should now see webhook URL details under Webhook URLs for Your Workspace. Use the Copy button to note down the URL, which you will need later.
Figure 4: Webhook URL
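For reference, Slack's documented request-signing scheme works as sketched below: concatenate the version, timestamp, and raw request body, compute an HMAC-SHA256 digest with the signing secret, and compare it to the X-Slack-Signature header. This is only an illustration of the check the solution performs, not its exact code.

```python
import hashlib
import hmac
import time

def is_valid_slack_request(signing_secret: str, timestamp: str, body: str, signature: str) -> bool:
    """Validate a Slack request signature (X-Slack-Signature header)."""
    # Reject requests older than five minutes to limit replay attacks
    if abs(time.time() - int(timestamp)) > 60 * 5:
        return False

    base_string = f"v0:{timestamp}:{body}"
    expected = "v0=" + hmac.new(
        signing_secret.encode("utf-8"),
        base_string.encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()

    # Constant-time comparison to avoid timing side channels
    return hmac.compare_digest(expected, signature)
```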
To check the project out from GitHub
The solution source is available on GitHub in AWS Samples. Clone the project to your local machine or download and extract the available zip file.
To set the configuration parameters
In the root directory of the project you’ve just cloned, there’s a file named cdk.json. This file contains configuration parameters to allow integration with the macie-findings channel you created earlier, and also to allow you to control the auto-remediation behavior of the solution. Open this file and make sure that you review and update the following parameters:
autoRemediateConfig – This nested attribute allows you to specify for each sensitive data finding type whether you want to automatically remediate and quarantine the offending object, or first send the finding to Slack for human review and authorization. Note that you will still be notified through Slack that auto-remediation has taken place if this attribute is set to AUTO. Valid values are either AUTO or REVIEW. You can use the default values.
minSeverityLevel – Macie assigns all findings a Severity level. With this parameter, you can define a minimum severity level that must be met before the solution will trigger action. For example, if the parameter is set to MEDIUM, the solution won’t take any action or send any notifications when a finding has a LOW severity, but will take action when a finding is classified as MEDIUM or HIGH. Valid values are: LOW, MEDIUM, and HIGH. The default value is set to LOW.
slackChannel – The name of the Slack channel you created earlier (macie-findings).
slackWebHookUrl – For this parameter, enter the webhook URL that you noted down during Slack app setup in the “Configure a Slack channel and app” step.
slackSigningSecret – For this parameter, enter the signing secret that you noted down during Slack app setup.
Save your changes to the configuration file.
To build and deploy the solution
From the command line, make sure that your current working directory is the root directory of the project that you cloned earlier. Run the following commands:
npm install – Installs all Node.js dependencies.
npm run build – Compiles the CDK TypeScript source.
cdk bootstrap – Initializes the CDK environment in your AWS account and Region, as shown in Figure 5.
Figure 5: CDK bootstrap output
cdk deploy – Generates a CloudFormation template and deploys the solution resources.
The resources created can be reviewed in the CloudFormation console and can be summarized as follows:
Lambda functions – Finding Handler, Remediation Handler, and Remediator
IAM execution roles and associated policy – The roles and policy associated with each Lambda function and the API Gateway
S3 bucket – The quarantine S3 bucket
EventBridge rule – The rule that triggers the Lambda function for Macie sensitive data findings
API Gateway – A single remediation API with proxy integration to the Lambda handler
After you run the deploy command, you’ll be prompted to review the IAM resources deployed as part of the solution. Press y to continue.
Once the deployment is complete, you’ll be presented with an output parameter, shown in Figure 6, which is the endpoint for the API Gateway that was deployed as part of the solution. Copy this URL.
Figure 6: CDK deploy output
To configure Slack with the API Gateway endpoint
Open Slack and return to the Basic Information page for the Slack app you created earlier.
Under Add features and functionality, select the Interactive Components tile.
Turn on the Interactivity setting.
In the Request URL box, enter the API Gateway endpoint URL you copied earlier.
Choose Save Changes.
Figure 7: Slack app interactivity
Now that you have the solution components deployed and Slack configured, it’s time to test things out.
Test the solution
The testing steps can be summarized as follows:
Upload dummy files to S3
Run the Macie sensitive data discovery job
Review and act upon Slack notifications
Confirm that S3 objects are quarantined
To upload dummy files to S3
Two sample text files containing dummy financial and personal data are available in the project you cloned from GitHub. If you haven’t changed the default auto-remediation configurations, these two files will exercise both the auto-remediation and manual remediation review flows.
Find the files under sensitive-data-samples/dummy-financial-data.txt and sensitive-data-samples/dummy-personal-data.txt. Take these two files and upload them to S3 by using either the console, as shown in Figure 8, or AWS CLI. You can choose to use any new or existing bucket, but make sure that the bucket is in the same AWS account and Region that was used to deploy the solution.
Figure 8: Dummy files uploaded to S3
To run a Macie sensitive data discovery job
Navigate to the Amazon Macie console, and make sure that your selected Region is the same as the one that was used to deploy the solution.
If this is your first time using Macie, choose the Get Started button, and then choose Enable Macie.
On the Macie Summary dashboard, you will see a Create Job button at the top right. Choose this button to launch the Job creation wizard. Configure each step as follows:
Select S3 buckets: Select the bucket where you uploaded the dummy sensitive data file. Choose Next.
Review S3 buckets: No changes are required, choose Next.
Scope: For Job type, choose One-time job. Make sure Sampling depth is set to 100%. Choose Next.
Custom data identifiers: No changes are required, choose Next.
Name and description: For Job name, enter any name you like, such as Dummy job, and then choose Next.
Review and create: Review your settings; they should look like the following sample. Choose Submit.
Figure 9: Configure the Macie sensitive data discovery job
Macie will launch the sensitive data discovery job. You can track its status from the Jobs page within the Macie console.
To review and take action on Slack notifications
Within five minutes of submitting the data discovery job, you should expect to see two notifications appear in your configured Slack channel. One notification, similar to the one in Figure 10, is informational only and is related to an auto-remediation action that has taken place.
Figure 10: Slack notification of auto-remediation for the file containing dummy financial data
The other notification, similar to the one in Figure 11, requires end user action and is for a finding that requires administrator review. All notifications will display key information such as the offending S3 object, a description of the finding, the finding severity, and other relevant metadata.
Figure 11: Slack notification for human review of the file containing dummy personal data
(Optional) You can review the finding details by choosing the View Macie Finding in Console link in the notification.
In the Slack notification, choose the Remediate button to quarantine the object. The notification will be updated with confirmation of the quarantine action, as shown in Figure 12.
Figure 12: Slack notification of authorized remediation
To confirm that S3 objects are quarantined
Finally, navigate to the S3 console and validate that the objects have been removed from their original bucket and placed into the quarantine bucket listed in the notification details, as shown in Figure 13. Note that you may need to refresh your S3 object listing in the browser.
Figure 13: Objects moved to the quarantine S3 bucket
Congratulations! You now have a fully operational solution to detect and respond to Macie sensitive data findings through a Slack ChatOps workflow.
Solution cleanup
To remove the solution and avoid incurring additional charges from the AWS resources that you deployed, complete the following steps.
To remove the solution and associated resources
Navigate to the Macie console. Under Settings, choose Suspend Macie.
Navigate to the S3 console and delete all objects in the quarantine bucket.
Run the command cdk destroy from the command line within the root directory of the project. You will be prompted to confirm that you want to remove the solution. Press y.
Summary
In this blog post, I showed you how to integrate Amazon Macie sensitive data findings with an auto-remediation and Slack ChatOps workflow. We reviewed the AWS services used, how they are integrated, and the steps to configure, deploy, and test the solution. With Macie and the solution in this blog post, you can substantially reduce the heavy lifting associated with detecting and responding to sensitive data in your AWS environment.
I encourage you to take this solution and customize it to your needs. Further enhancements could include supporting policy findings, adding additional remediation actions, or integrating with additional findings from AWS Security Hub.
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon Macie forum or contact AWS Support.
Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.
Amazon Macie is a fully managed data security and data privacy service that uses machine learning and pattern matching to discover and protect your sensitive data in Amazon Web Services (AWS). It gives you the ability to automatically scan for sensitive data and get an inventory of your Amazon Simple Storage Service (Amazon S3) buckets. Macie also gives you the added ability to detect which buckets are public, unencrypted, and accessible from other AWS accounts.
In this post, we’ll walk through how to use Macie to detect sensitive data in Amazon DynamoDB tables by exporting the data to Amazon S3 so that Macie can scan the data. An example of why you would deploy a solution like this is if you have potentially sensitive data stored in DynamoDB tables. When we’re finished, you’ll have a solution that can set up on-demand or scheduled Macie discovery jobs to detect sensitive data exported from DynamoDB to S3.
Architecture
In figure 1, you can see an architectural diagram explaining the flow of the solution that you’ll be deploying.
Figure 1: Solution architecture
Here’s a brief overview of the steps that you’ll take to deploy the solution. Some steps you will do manually, while others will be handled by the provided AWS CloudFormation template. The following outline describes the steps taken to extract the data from DynamoDB and store it in S3, which allows Macie to run a discovery job against the data.
Deploy an AWS Lambda function to export DynamoDB data to S3.
Set up an Amazon EventBridge rule to schedule export of the DynamoDB data.
Create a Macie discovery job to discover sensitive data from the DynamoDB data export.
View the results of the Macie discovery job.
The goal is that when you finish, you have a solution that you can use to set up either on-demand or scheduled Macie discovery jobs to detect sensitive data that was exported from DynamoDB to S3.
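To make the export step (step 1 in the list above) concrete, here is a rough sketch of what an export Lambda function might do: scan a DynamoDB table and write the items to the S3 bucket that Macie will scan. The table name, bucket, and key are placeholders, and the function deployed by this post's CloudFormation template may differ in detail.

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

TABLE_NAME = "People"                        # placeholder table
EXPORT_BUCKET = "macie-blog-export-bucket"   # placeholder bucket scanned by Macie

def lambda_handler(event, context):
    """Scan a DynamoDB table and write all items to S3 as a JSON file."""
    table = dynamodb.Table(TABLE_NAME)
    items = []

    response = table.scan()
    items.extend(response["Items"])
    # Keep scanning until all pages have been read
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response["Items"])

    s3.put_object(
        Bucket=EXPORT_BUCKET,
        Key=f"exports/{TABLE_NAME}.json",
        Body=json.dumps(items, default=str),  # default=str handles DynamoDB Decimal values
    )
    return {"exported_items": len(items)}
```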
Prerequisite: Enable Macie
If Macie hasn’t been enabled in your account, complete Step 1 in Getting started with Amazon Macie to enable Macie. Once you’ve enabled Macie, you can proceed with the deployment of the CloudFormation template.
Deploy the CloudFormation template
In this section, you start by deploying the CloudFormation template that will deploy all the resources needed for the solution. You can then review the output of the resources that have been deployed.
Sign in to the AWS Management Console and navigate to the CloudFormation console.
Choose Upload a template file, and then select the CloudFormation template that you downloaded in the previous step. Choose Next.
Figure 2 – Uploading the CloudFormation template to be deployed
For Stack Name, name your stack macie-blog, and then choose Next.
Figure 3: Naming your CloudFormation stack
For Configure stack options, keep the default values and choose Next.
At the bottom of the Review screen, select the I acknowledge that AWS CloudFormation might create IAM resources check box, and then choose Create stack.
Figure 4: Acknowledging that this CloudFormation template will create IAM roles
You should then see the following screen. It may take several minutes for the CloudFormation template to finish deploying.
Figure 5: CloudFormation stack creation in progress
View CloudFormation output
Once the CloudFormation template has been completely deployed, choose the Outputs tab, and you will see the following screen. Here you’ll find the names and URLs for all the AWS resources that are needed to complete the remainder of the solution.
Figure 6: Completed CloudFormation stack output
For easier reference, open a new browser tab to your AWS Management Console and leave this tab open. This will make it easier to quickly copy and paste the resource URLs as you navigate to different resources during this walkthrough.
Import DynamoDB data
In this section, we walk through importing the test dataset to DynamoDB. You start by downloading the test CSV datasets, then upload those datasets to S3 and run the Lambda function that imports the data to DynamoDB. Finally, you review the data that was imported into DynamoDB.
Now that you’ve downloaded the test datasets, you’ll need to navigate to the data import S3 bucket and upload the data.
To upload the datasets to the S3 import bucket
Navigate to the CloudFormation Outputs tab, where you’ll find the bucket information.
Figure 7: S3 bucket output values for the CloudFormation stack
Copy the ImportS3BucketURL link and navigate to the URL.
Upload the two test CSV datasets, people.csv and accounts.csv, to your S3 bucket.
After the upload is complete, you should see the two CSV files in the S3 bucket. You’ll use these files as your test DynamoDB data.
Figure 8: Test S3 datasets in the S3 bucket
View the data import Lambda function
Now that you have your test data staged for loading, you’ll import it into DynamoDB by using a Lambda function that was deployed with the CloudFormation template. To start, navigate to the CloudFormation console and get the URL to the Lambda function that will handle the data import to DynamoDB, as shown in figure 9.
Figure 9: CloudFormation output information for the People DynamoDB table
To run the data import Lambda function
Copy the LambdaImportS3DataToDynamoURL link and navigate to the URL. You will see the Import-Data-To-DynamoDB Lambda function, as shown in figure 10.
Figure 10: The Lambda function that imports data to DynamoDB
Choose the Test button in the upper right-hand corner. In the dialog box, for Event name, enter Test, and replace the event JSON with an empty object: {}.
Your screen should now look as shown in figure 11. Choose Create.
Figure 11: Configuring a test event to manually run the Lambda function
Choose the Test button again in the upper right-hand corner. You should now see the Lambda function running, as shown in figure 12.
Figure 12: View of the Lambda function running
Once the Lambda function is finished running, you can expand the Details section. You should see a screen similar to the one in figure 13. When you see this screen, the test datasets have successfully been imported into the DynamoDB tables.
Figure 13: View of the data import Lambda function after it runs successfully
View the DynamoDB test dataset
Now that you have the datasets imported, you can look at the data in the console.
To view the test dataset
Navigate to the two DynamoDB tables. You can do this by getting the URL values from the CloudFormation Outputs tab. Figure 14 shows the URL for the accounts table.
Figure 14: Output values for CloudFormation stack DynamoDB account tables
Figure 15 shows the URL for the people table.
Figure 15: Output values for CloudFormation stack DynamoDB people tables
Copy the AccountsDynamoDBTableURL link value and navigate to it in the browser. Then choose the Items tab.
Figure 16: View of DynamoDB account-info-macie table data
You should now see a screen showing data similar to the screen in figure 16. This DynamoDB table stores the test account data that you will use to run a Macie discovery job against after the data has been exported to S3.
Navigate to the PeopleDynamoDBTableURL link that is in the CloudFormation output. Then choose the Items tab.
Figure 17: View of DynamoDB people table data
You should now see a screen showing data similar to the screen in figure 17. This DynamoDB table stores the test people data that you will use to run a Macie discovery job against after the data has been exported to S3.
Export DynamoDB data to S3
In the previous section, you set everything up and staged the data to DynamoDB. In this section, you will export data from DynamoDB to S3.
View the EventBridge rule
The EventBridge rule that was deployed earlier allows you to automatically schedule the export of DynamoDB data to S3. You can schedule the export to run at intervals of minutes, hours, or days. The purpose of the EventBridge rule is to set up an automated data pipeline from DynamoDB to S3. For demonstration purposes, you’ll manually run the Lambda function that the EventBridge rule uses, so that you can see the data exported to S3 without having to wait.
To view the EventBridge rule
Navigate to the CloudFormation Outputs tab for the CloudFormation stack you deployed earlier.
Figure 18: CloudFormation output information for the EventBridge rule
Navigate to the EventBridgeRule link. You should see the following screen.
On this screen, you can see that the event schedule is set to run every hour. We set it to 1 hour for demonstration purposes only; the interval can be changed to fit your business needs by choosing the Edit button, making your changes, and then saving the rule.
In the Target(s) section, we’ve configured a Lambda function named Export-DynamoDB-Data-To-S3 to handle the process of exporting data to the S3 bucket the Macie discovery job will run against. We will cover the Lambda function that handles the export of the data from DynamoDB next.
View the data export Lambda function
In this section, you’ll take a look at the Lambda function that handles the exporting of DynamoDB data to the S3 bucket that Macie will run its discovery job against.
To view the Lambda function
Navigate to the CloudFormation Outputs tab for the CloudFormation stack you deployed earlier.
Figure 20: CloudFormation output information for the Lambda function that exports DynamoDB data to S3
Copy the link value for LambdaExportDynamoDBDataToS3URL and navigate to the URL in your browser. You should see the Python code that will handle the exporting of data to S3. The code has been commented so that you can easily follow it and refactor it for your needs.
Scroll to the Environment variables section.
Figure 21: Environment variables used by the Lambda function
You will see two environment variables:
bucket_to_export_to – This environment variable is used by the function as the S3 bucket location to save the DynamoDB data to. This is the bucket that the Macie discovery job will run against.
dynamo_db_tables – This environment variable is a comma-delimited list of DynamoDB tables that will be read and have their data exported to S3. To export data from another table, simply add it to the comma-delimited list and it will be part of the export. A minimal sketch of this export logic follows.
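The following is a minimal, illustrative sketch of what such an export function might look like, using the two environment variables described above. The object key naming is an assumption for illustration and may differ from the deployed function’s naming convention.

import json
import os

import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Environment variables described above
    bucket = os.environ["bucket_to_export_to"]
    tables = [t.strip() for t in os.environ["dynamo_db_tables"].split(",")]

    for table_name in tables:
        table = dynamodb.Table(table_name)

        # Paginate through a full table scan
        items = []
        response = table.scan()
        items.extend(response["Items"])
        while "LastEvaluatedKey" in response:
            response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
            items.extend(response["Items"])

        # Illustrative object key; the deployed function may name objects differently
        key = f"{table_name}-export.json"
        s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(items, default=str))

    return {"exported_tables": tables}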
Export DynamoDB data
In this section, you will manually run the Lambda function to export the DynamoDB tables data to S3. As stated previously, you would normally allow the EventBridge rule to handle the automated export of the data to S3. In order to see the export in action, you’re going to manually run the function.
To run the export Lambda function
In the console, scroll back to the top of the screen and choose the Test button.
Name the test dynamoDBExportTest, and for the test data enter an empty JSON object {} as shown in figure 22.
Figure 22: Configuring a test event to manually test the data export Lambda function
Choose Create.
Choose the Test button again to run the Lambda function to export the DynamoDB data to S3.
Figure 23: View of the screen where you run the Lambda function to export data
It could take about one minute to export the data from DynamoDB to S3. Once the Lambda function exports the data, you should see a screen similar to the following one.
Figure 24: The result after you successfully run the data export Lambda function
View the exported DynamoDB data
Now that the DynamoDB data has been exported for Macie to run discovery jobs against, you can navigate to S3 to verify that the files were exported to the bucket.
To view the data, navigate to the CloudFormation stack Outputs tab. Find the ExportS3BucketURL, shown in figure 25, and navigate to the link.
Figure 25: CloudFormation output information for the S3 buckets that the DynamoDB data was exported to
You should then see two different JSON files for the two DynamoDB tables that data was exported from, as shown in figure 26.
Figure 26: View of S3 objects that were exported to S3
This is the file naming convention that’s used for the files:
Next, you’ll create a Macie discovery job to run against the files in this S3 bucket to discover sensitive data.
Create the Macie discovery job
In this section, you’ll create a Macie discovery job and view the results after the job has finished running.
To create the discovery job
In the AWS Management Console, navigate to Macie. In the left-hand menu, choose Jobs.
Figure 27: Navigation menu to Macie discovery jobs
Choose the Create job button.
Figure 28: Macie discovery job list screen
Using the Bucket Name filter, search for the S3 bucket that the DynamoDB data was exported to. This can be found in the CloudFormation stack output, as shown in figure 29.
Figure 29: CloudFormation stack output
Select the value you see for ExportS3BucketName, as shown in figure 30.
Note: The value you see for your bucket name will be slightly different, based on the random characters added to the end of the bucket name generated by CloudFormation.
Figure 30: Selecting the S3 bucket to include in the Macie discovery job
Once you’ve found the S3 bucket, select the check box next to it, and then choose Next.
On the Review S3 Buckets screen, if you’re satisfied with the selected buckets, choose Next.
Following are some important options when setting up Macie data discovery jobs.
Scheduling – You have the following scheduling options for the data discovery job:
Daily
Weekly
Monthly
Data sampling – This allows you to randomly sample a percentage of the data that the Macie discovery job will run against.
Object criteria – This enables you to target objects based on certain metadata values. The values are:
Tags – Target objects with certain tags.
Last modified – Target objects based on when they were last modified.
File extensions – Target objects based on file extensions.
Object size – Target objects based on the file size.
You can include or exclude objects based on these object criteria filters. The sketch after this list shows how these options map to the Macie API.
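As a reference for how these options map to the Macie API, the following is a minimal sketch that creates a one-time job with boto3. The account ID, bucket name, and custom data identifier ID are placeholders, not values taken from this walkthrough.

import boto3

macie = boto3.client("macie2")

# Illustrative one-time job; replace the placeholder values with your own.
response = macie.create_classification_job(
    jobType="ONE_TIME",  # Scheduling: ONE_TIME or SCHEDULED
    name="dynamodb-macie-discovery-job",
    description="Discovery job to detect sensitive data exported from DynamoDB",
    samplingPercentage=100,  # Data sampling option
    customDataIdentifierIds=["<custom-data-identifier-id>"],
    s3JobDefinition={
        "bucketDefinitions": [
            {"accountId": "<account-id>", "buckets": ["<export-bucket-name>"]}
        ]
        # Object criteria (tags, last modified, file extensions, object size)
        # can be expressed with the "scoping" property of s3JobDefinition.
    },
)
print(response["jobId"])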
Set the discovery job scope
For demonstration purposes, this will be a one-time discovery job.
To set the discovery job scope
On the Scope page that appears after you create the job, set the following options for the job scope:
Select the One-time job option.
Leave Sampling depth set to 100%, and choose Next.
Figure 31: Selecting the objects that should be in scope for this discovery job
On the Custom data identifiers screen, select account_number, and then choose Next. With custom data identifiers, you can create custom business logic to look for certain patterns in files stored in S3. In this example, the job generates a finding for any file that contains data with the following format:
Account Number Format: Starts with “XYZ-” followed by 11 numbers
The logic to create the custom data identifier can be found in the CloudFormation template; a minimal API-based sketch follows figure 32.
Figure 32: Custom data identifiers
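If you prefer to create an identifier like this through the API rather than the CloudFormation template, the following is a minimal sketch. The regex below is an assumption based on the format described above (“XYZ-” followed by 11 numbers); the authoritative definition lives in the CloudFormation template.

import boto3

macie = boto3.client("macie2")

# Assumed pattern for the account number format described above.
response = macie.create_custom_data_identifier(
    name="account_number",
    description="Account numbers that start with XYZ- followed by 11 digits",
    regex=r"\bXYZ-\d{11}\b",
)
print(response["customDataIdentifierId"])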
Give your discovery job the name dynamodb-macie-discovery-job. For Description, enter Discovery job to detect sensitive data exported from DynamoDB, and choose Next.
Figure 33: Giving the Macie discovery job a name and description
You will then see the Review and create screen, as shown in figure 34.
Figure 34: The Macie discovery job review screen
Note: Macie must have proper permissions to decrypt objects that are part of the Macie discovery job. The CloudFormation template that you deployed during the initial setup has already deployed an AWS Key Management Service (AWS KMS) key with the proper permissions.
For this proof of concept you won’t store the results, so you can select the check box next to Override this requirement. If you wanted to store detailed results of the discovery job long term, you would configure a repository for data discovery results. To view detailed steps for setting this up, see Storing and retaining discovery results with Amazon Macie.
Submit the discovery job
Next, you can submit the discovery job. On the Review and create screen, choose the Submit button to start the discovery job. You should see a screen similar to the following.
Figure 35: A Macie discovery job run that is in progress
The amount of data that is being scanned dictates how long the job will take to run. You can choose the Refresh button at the top of the screen to see the updated status of the job. This job, based on the size of the test dataset, will take about seven minutes to complete.
Review the job results
Now that the Macie discovery job has run, you can review the results to see what sensitive data was discovered in the data exported from DynamoDB.
You should see the following screen once the job has successfully run.
Figure 36: View of the completed Macie discovery job
On the right, you should see another pane with more information related to the discovery job. The pane should look like the following screen.
Figure 37: Summary showing which S3 bucket the discovery job ran against and start and complete time
Note: If you don’t see this pane, choose the discovery job to have this information displayed.
To review the job results
On the page for the discovery job, in the Show Results list, select Show findings.
Figure 38: Option to view discovery job findings
The Findings screen appears, as follows.
Figure 39: Viewing the list of findings generated by the Macie discovery job
The discovery job that you ran has two different “High Severity” finding types:
SensitiveData:S3Object/Personal – The object contains personal information, such as full names or identification numbers.
SensitiveData:S3Object/Multiple – The object contains more than one type of sensitive data.
Choose the SensitiveData:S3Object/Personal finding type, and you will see an information pane appear to the right, as shown in figure 40. Some of the key information that you can find here includes:
Severity – The severity of the finding: Low, Medium, or High.
Resource – The S3 bucket that contains the S3 object that caused the finding to be generated.
Region – The Region where the S3 bucket exists.
Figure 40: Viewing the severity of the discovery job finding
Since the finding is based on the detection of personal information in the S3 object, you get the number of times and type of personal data that was discovered, as shown in figure 41.
Figure 41: Viewing the number of social security numbers that were discovered in the finding
Here you can see that 10 names were detected in the data that you exported from the DynamoDB table. The Occurrences of name value shows 10 line ranges, which tells you that the names were found on 10 different lines in the file. If you choose the 10 line ranges link, you are given the starting line and column in the document where each name was discovered.
The S3 object that triggered the finding is displayed in the Resource affected section, as shown in figure 42.
Figure 42: The S3 object that generated the Macie finding
Now that you know which S3 object contains the sensitive data, you can investigate further to take appropriate action to protect the data.
View the Macie finding details
In this section, you will walk through how to read and download the objects related to the Macie discovery job.
To download and view the S3 object that contains the finding
In the Overview section of the finding details, select the value for the Resource link. You will then be taken to the object in the S3 bucket.
Figure 43: Viewing the S3 bucket where the object is located that generated the Macie finding
You can then download the S3 object from the S3 bucket to view the file content and further investigate it for sensitive data. Select the check box next to the S3 object, and choose the Download button at the top of the screen. Next, we will look at the SensitiveData:S3Object/Multiple finding type that was generated. This finding type lets us know that there are multiple types of potentially sensitive data related to an object stored in S3.
In the left navigation menu, navigate back to the Jobs menu.
Choose the job that you created in the previous steps. In the Show Results list, select Show Findings.
Select the SensitiveData:S3Object/Multiple finding type. An information pane appears to the right. As with the previous finding, you will see the severity, Region, S3 bucket location, and other relevant information about the finding. For this finding, we will focus on the Custom data identifiers and Personal info sections.
Figure 44: Details about the sensitive data that was discovered by the Macie discovery job
Here you can see that the discovery job found 10 names on 10 different lines in the file. Also, you can see that 10 account numbers were discovered on 10 different lines in the file, based on the custom identifier that was included as part of the discovery job.
This finding demonstrates how you can use the built-in Macie identifiers, such as names, and also include custom business logic based on your organization’s needs by using Macie custom data identifiers.
To view the data and investigate further, follow the same steps as in the previous finding you investigated.
Navigate to the top of the screen and in the Overview section, locate the Resource.
Figure 45: Viewing the S3 bucket where the object is located that generated the Macie finding
Choose Resource, which will take you to the S3 object to download. You can now view the contents of the file and investigate further.
You’ve now created a Macie discovery job to scan for sensitive data stored in an S3 bucket that originated in DynamoDB. You can also automate this solution further by using EventBridge rules to detect new Macie findings and take action on the objects that contain sensitive data.
Solution cleanup
In order to clean up the solution that you just deployed, complete the following steps. You need to do these steps to stop data from being exported from DynamoDB to S3 every hour.
To perform cleanup
Navigate to the S3 buckets used to import and export data. You can find the bucket names in the CloudFormation Outputs tab in the console, as shown in figure 7 and figure 25.
After you’ve navigated to each of the buckets, delete all objects from the bucket.
Navigate to the CloudFormation console, and then delete the CloudFormation stack named macie-blog. After the stack is deleted, the solution will no longer be deployed in your AWS account. If you prefer to script these cleanup steps, see the sketch that follows.
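If you prefer to script the cleanup, the following is a minimal sketch using boto3. The bucket names are placeholders; use the values shown on the CloudFormation Outputs tab.

import boto3

s3 = boto3.resource("s3")
cloudformation = boto3.client("cloudformation")

# Replace with the import and export bucket names from the CloudFormation Outputs tab.
for bucket_name in ["<import-bucket-name>", "<export-bucket-name>"]:
    # Delete every object so that stack deletion can remove the bucket.
    s3.Bucket(bucket_name).objects.all().delete()

# Delete the CloudFormation stack deployed for this solution.
cloudformation.delete_stack(StackName="macie-blog")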
Summary
After deploying the solution, we hope you have a better understanding of how you can use Macie to detect sensitive data from other data sources, such as DynamoDB, as outlined in this post.
Data is a crucial part of every business and is used for strategic decision making at all levels of an organization. To extract value from their data more quickly, Amazon Web Services (AWS) customers are building automated data pipelines—from data ingestion to transformation and analytics. As part of this process, my customers often ask how to prevent sensitive data, such as personally identifiable information, from being ingested into data lakes when it’s not needed. They highlight that this challenge is compounded when ingesting unstructured data—such as files from process reporting, text files from chat transcripts, and emails. They also mention that identifying sensitive data inadvertently stored in structured data fields—such as in a comment field stored in a database—is also a challenge.
In this post, I show you how to integrate Amazon Macie as part of the data ingestion step in your data pipeline. This solution provides an additional checkpoint that sensitive data has been appropriately redacted or tokenized prior to ingestion. Macie is a fully managed data security and privacy service that uses machine learning and pattern matching to discover sensitive data in AWS.
When Macie discovers sensitive data, the solution notifies an administrator to review the data and decide whether to allow the data pipeline to continue ingesting the objects. If allowed, the objects will be tagged with an Amazon Simple Storage Service (Amazon S3) object tag to identify that sensitive data was found in the object before progressing to the next stage of the pipeline.
This combination of automation and manual review helps reduce the risk that sensitive data—such as personally identifiable information—will be ingested into a data lake. This solution can be extended to fit your use case and workflows. For example, you can define custom data identifiers as part of your scans, add additional validation steps, create Macie suppression rules to archive findings automatically, or only request manual approvals for findings that meet certain criteria (such as high severity findings).
Solution overview
Many of my customers are building serverless data lakes with Amazon S3 as the primary data store. Their data pipelines commonly use different S3 buckets at each stage of the pipeline. I refer to the S3 bucket for the first stage of ingestion as the raw data bucket. A typical pipeline might have separate buckets for raw, curated, and processed data representing different stages as part of their data analytics pipeline.
Typically, customers will perform validation and clean their data before moving it to a raw data zone. This solution adds validation steps to that pipeline after preliminary quality checks and data cleaning is performed, noted in blue (in layer 3) of Figure 1. The layers outlined in the pipeline are:
Ingestion – Brings data into the data lake.
Storage – Provides durable, scalable, and secure components to store the data—typically using S3 buckets.
Processing – Transforms data into a consumable state through data validation, cleanup, normalization, transformation, and enrichment. This processing layer is where the additional validation steps are added to identify instances of sensitive data that haven’t been appropriately redacted or tokenized prior to consumption.
Consumption – Provides tools to gain insights from the data in the data lake.
Figure 1: Data pipeline with sensitive data scan
The application runs on a scheduled basis (four times a day, every 6 hours by default) to process data that is added to the raw data S3 bucket. You can customize the application to perform a sensitive data discovery scan during any stage of the pipeline. Because most customers do their extract, transform, and load (ETL) daily, the application scans for sensitive data on a scheduled basis before any crawler jobs run to catalog the data and after typical validation and data redaction or tokenization processes complete.
You can expect that this additional validation will add 5–10 minutes to your pipeline execution at a minimum. The validation processing time will scale linearly based on object size, but there is a start-up time per job that is constant.
If sensitive data is found in the objects, an email is sent to the designated administrator requesting an approval decision, which they indicate by selecting the link corresponding to their decision to approve or deny the next step. In most cases, the reviewer will choose to adjust the sensitive data cleanup processes to remove the sensitive data, deny the progression of the files, and re-ingest the files in the pipeline.
Additional considerations for deploying this application for regular use are discussed at the end of the blog post.
Application components
The following resources are created as part of the application:
S3 buckets store data in various stages of processing: A raw data bucket for uploading objects for the data pipeline, a scanning bucket where objects are scanned for sensitive data, a manual review bucket holding objects where sensitive data was discovered, and a scanned data bucket for starting the next ingestion step of the data pipeline.
Lambda functions execute the logic to run the sensitive data scans and workflow.
Note: the application uses various AWS services, and there are costs associated with these resources after the Free Tier usage. See AWS Pricing for details. The primary drivers of the solution cost will be the amount of data ingested through the pipeline, both for Amazon S3 storage and data processed for sensitive data discovery with Macie.
The architecture of the application is shown in Figure 2 and described in the text that follows.
Figure 2: Application architecture and logic
Application logic
Objects are uploaded to the raw data S3 bucket as part of the data ingestion process.
A scheduled EventBridge rule runs the sensitive data scan Step Functions workflow.
triggerMacieScan Lambda function moves objects from the raw data S3 bucket to the scan stage S3 bucket.
triggerMacieScan Lambda function creates a Macie sensitive data discovery job on the scan stage S3 bucket.
checkMacieStatus Lambda function checks the status of the Macie sensitive data discovery job.
isMacieStatusCompleteChoice Step Functions Choice state checks whether the Macie sensitive data discovery job is complete (a minimal sketch of this status check appears after this list).
If yes, the getMacieFindingsCount Lambda function runs.
getMacieFindingsCount Lambda function counts all of the findings from the Macie sensitive data discovery job.
isSensitiveDataFound Step Functions Choice state checks whether sensitive data was found in the Macie sensitive data discovery job.
If there was sensitive data discovered, run the triggerManualApproval Lambda function.
If there was no sensitive data discovered, run the moveAllScanStageS3Files Lambda function.
moveAllScanStageS3Files Lambda function moves all of the objects from the scan stage S3 bucket to the scanned data S3 bucket.
triggerManualApproval Lambda function tags and moves objects with sensitive data discovered to the manual review S3 bucket, and moves objects with no sensitive data discovered to the scanned data S3 bucket. The function then sends a notification to the ApprovalRequestNotification Amazon SNS topic as a notification that manual review is required.
Email is sent to the email address that’s subscribed to the ApprovalRequestNotification Amazon SNS topic (from the application deployment template) for the manual review user with the option to Approve or Deny pipeline ingestion for these objects.
Manual review user assesses the objects with sensitive data in the manual review S3 bucket and selects the Approve or Deny links in the email.
The decision request is sent from the Amazon API Gateway to the receiveApprovalDecision Lambda function.
manualApprovalChoice Step Functions Choice state checks the decision from the manual review user.
If denied, run the deleteManualReviewS3Files Lambda function.
If approved, run the moveToScannedDataS3Files Lambda function.
deleteManualReviewS3Files Lambda function deletes the objects from the manual review S3 bucket.
moveToScannedDataS3Files Lambda function moves the objects from the manual review S3 bucket to the scanned data S3 bucket.
The next step of the automated data pipeline will begin with the objects in the scanned data S3 bucket.
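To make the polling step concrete, the following is a minimal sketch of what a status-check function such as checkMacieStatus might look like. The input and output field names are assumptions for illustration; the deployed function’s exact contract is defined in the project source.

import boto3

macie = boto3.client("macie2")

def lambda_handler(event, context):
    # The workflow passes the Macie job ID between states; the field name
    # used here is an assumption for illustration.
    job_id = event["macie_job_id"]

    job = macie.describe_classification_job(jobId=job_id)
    status = job["jobStatus"]  # For example RUNNING, COMPLETE, CANCELLED, USER_PAUSED

    # The isMacieStatusCompleteChoice state branches on this value.
    return {**event, "macie_job_status": status}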
Prerequisites
For this application, you need the following prerequisites:
You can use AWS Cloud9 to deploy the application. AWS Cloud9 includes the AWS CLI and AWS SAM CLI to simplify setting up your development environment.
Deploy the application with AWS SAM CLI
You can deploy this application using the AWS SAM CLI. AWS SAM uses AWS CloudFormation as the underlying deployment mechanism. AWS SAM is an open-source framework that you can use to build serverless applications on AWS.
To deploy the application
Initialize the serverless application using the AWS SAM CLI from the GitHub project in the aws-samples repository. This will clone the project locally, which includes the source code for the Lambda functions, the Step Functions state machine definition file, and the AWS SAM template. On the command line, run the following:
sam init --location gh:aws-samples/amazonmacie-datapipeline-scan
Alternatively, you can clone the GitHub project directly.
Deploy your application to your AWS account. On the command line, run the following:
sam deploy --guided
Complete the prompts during the guided interactive deployment. The first deployment prompt is shown in the following example.
Configuring SAM deploy
======================
Looking for config file [samconfig.toml] : Found
Reading default arguments : Success
Setting default arguments for 'sam deploy'
=========================================
Stack Name [maciepipelinescan]:
Settings:
Stack Name – Name of the CloudFormation stack to be created.
AWS Region – Region—for example, us-west-2, eu-west-1, ap-southeast-1—to deploy the application to. This application was tested in the us-west-2 and ap-southeast-1 Regions. Before selecting a Region, verify that the services you need are available in those Regions (for example, Macie and Step Functions).
Parameter StepFunctionName – Name of the Step Functions state machine to be created—for example, maciepipelinescanstatemachine.
Parameter BucketNamePrefix – Prefix to apply to the S3 buckets to be created (S3 bucket names are globally unique, so choosing a random prefix helps ensure uniqueness).
Parameter ApprovalEmailDestination – Email address to receive the manual review notification.
Parameter EnableMacie – Whether you need Macie enabled in your account or Region. Select yes if you need Macie to be enabled for you as part of this template, or select no if you already have Macie enabled.
Confirm changes and provide approval for AWS SAM CLI to deploy the resources to your AWS account by responding y to prompts, as shown in the following example. You can accept the defaults for the SAM configuration file and SAM configuration environment prompts.
#Shows you resources changes to be deployed and require a 'Y' to initiate deploy
Confirm changes before deploy [y/N]: y
#SAM needs permission to be able to create roles to connect to the resources in your template
Allow SAM CLI IAM role creation [Y/n]: y
ReceiveApprovalDecisionAPI may not have authorization defined, Is this okay? [y/N]: y
ReceiveApprovalDecisionAPI may not have authorization defined, Is this okay? [y/N]: y
Save arguments to configuration file [Y/n]: y
SAM configuration file [samconfig.toml]:
SAM configuration environment [default]:
Note: This application deploys an Amazon API Gateway with two REST API resources without authorization defined to receive the decision from the manual review step. You will be prompted to accept each resource without authorization. A token (Step Functions taskToken) is used to authenticate the requests.
This creates an AWS CloudFormation changeset. Once the changeset creation is complete, you must provide a final confirmation of y to the Deploy this changeset? [y/N] prompt, as shown in the following example.
Changeset created successfully. arn:aws:cloudformation:ap-southeast-1:XXXXXXXXXXXX:changeSet/samcli-deploy1605213119/db681961-3635-4305-b1c7-dcc754c7XXXX
Previewing CloudFormation changeset before deployment
======================================================
Deploy this changeset? [y/N]:
Your application is deployed to your account using AWS CloudFormation. You can track the deployment events in the command prompt or via the AWS CloudFormation console.
After the application deployment is complete, you must confirm the subscription to the Amazon SNS topic. An email will be sent to the email address entered in Step 3 with a link that you need to select to confirm the subscription. This confirmation provides opt-in consent for AWS to send emails to you via the specified Amazon SNS topic. The emails will be notifications of potentially sensitive data that need to be approved. If you don’t see the verification email, be sure to check your spam folder.
Test the application
The application uses an EventBridge scheduled rule to start the sensitive data scan workflow, which runs every 6 hours. You can manually start an execution of the workflow to verify that it’s working. To test the function, you will need a file that contains data that matches your rules for sensitive data. For example, it is easy to create a spreadsheet, document, or text file that contains names, addresses, and numbers formatted like credit card numbers. You can also use this generated sample data to test Macie.
We will test by uploading a file to our S3 bucket via the AWS web console. If you know how to copy objects from the command line, that also works.
Upload test objects to the S3 bucket
Navigate to the Amazon S3 console and upload one or more test objects to the <BucketNamePrefix>-data-pipeline-raw bucket. <BucketNamePrefix> is the prefix you entered when deploying the application in the AWS SAM CLI prompts. You can use any objects as long as they’re a supported file type for Amazon Macie. I suggest uploading multiple objects, some with and some without sensitive data, in order to see how the workflow processes each.
Start the Scan State Machine
Navigate to the Step Functions state machines console. If you don’t see your state machine, make sure you’re connected to the same Region that you deployed your application to.
Choose the state machine you created using the AWS SAM CLI as seen in Figure 3. The example state machine is maciepipelinescanstatemachine, but you might have used a different name in your deployment.
Figure 3: AWS Step Functions state machines console
Select the Start execution button and copy the value from the Enter an execution name – optional box. Change the Input – optional value, replacing <execution id> with the value you just copied, as follows:
{
  "id": "<execution id>"
}
In my example, the <execution id> is fa985a4f-866b-b58b-d91b-8a47d068aa0c from the Enter an execution name – optional box as shown in Figure 4. You can choose a different ID value if you prefer. This ID is used by the workflow to tag the objects being processed to ensure that only objects that are scanned continue through the pipeline. When the EventBridge scheduled event starts the workflow as scheduled, an ID is included in the input to the Step Functions workflow. Then select Start execution again.
Figure 4: New execution dialog box
You can see the status of your workflow execution in the Graph inspector as shown in Figure 5. In the figure, the workflow is at the pollForCompletionWait step.
Figure 5: AWS Step Functions graph inspector
The sensitive data discovery job should run for about five to ten minutes. Jobs scale linearly based on object size, but there is a constant start-up time per job. If sensitive data is found in the objects uploaded to the <BucketNamePrefix>-data-pipeline-raw S3 bucket, an email is sent to the address provided during the AWS SAM deployment step, notifying the recipient that an approval decision is needed, which they indicate by selecting the link corresponding to their decision to approve or deny the next step, as shown in Figure 6.
Figure 6: Sensitive data identified email
When you receive this notification, you can investigate the findings by reviewing the objects in the <BucketNamePrefix>-data-pipeline-manual-review S3 bucket. Based on your review, you can either apply remediation steps to remove any sensitive data or allow the data to proceed to the next step of the data ingestion pipeline. You should define a standard response process to address discovery of sensitive data in the data pipeline. Common remediation steps include review of the files for sensitive data, deleting the files that you do not want to progress, and updating the ETL process to redact or tokenize sensitive data when re-ingesting into the pipeline. When you re-ingest the files into the pipeline without sensitive data, the files will not be flagged by Macie.
The workflow performs the following (a minimal sketch of the approve path appears after this list):
If you select Approve, the files are moved to the <BucketNamePrefix>-data-pipeline-scanned-data S3 bucket with an Amazon S3 SensitiveDataFound object tag with a value of true.
If you select Deny, the files are deleted from the <BucketNamePrefix>-data-pipeline-manual-review S3 bucket.
If no action is taken, the Step Functions workflow execution times out after five days and the file will automatically be deleted from the <BucketNamePrefix>-data-pipeline-manual-review S3 bucket after 10 days.
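The following is a minimal sketch of the approve path: the object is copied to the scanned data bucket with the SensitiveDataFound tag and then removed from the manual review bucket. The function name and parameters are illustrative and are not the deployed Lambda code.

import boto3

s3 = boto3.client("s3")

def approve_object(manual_review_bucket, scanned_data_bucket, key):
    # Copy the approved object to the scanned data bucket, tagging it so that
    # downstream pipeline stages know sensitive data was found in it.
    s3.copy_object(
        Bucket=scanned_data_bucket,
        Key=key,
        CopySource={"Bucket": manual_review_bucket, "Key": key},
        Tagging="SensitiveDataFound=true",
        TaggingDirective="REPLACE",
    )
    # Remove the original from the manual review bucket.
    s3.delete_object(Bucket=manual_review_bucket, Key=key)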
Clean up the application
You’ve successfully deployed and tested the sensitive data pipeline scan workflow. To avoid ongoing charges for resources you created, you should delete all associated resources by deleting the CloudFormation stack. In order to delete the CloudFormation stack, you must first delete all objects that are stored in the S3 buckets that you created for the application.
To delete the application
Empty the S3 buckets created in this application (<BucketNamePrefix>-data-pipeline-raw S3 bucket, <BucketNamePrefix>-data-pipeline-scan-stage, <BucketNamePrefix>-data-pipeline-manual-review, and <BucketNamePrefix>-data-pipeline-scanned-data).
Delete the CloudFormation stack that was created when you deployed the application with the AWS SAM CLI.
Considerations for regular use
Before using this application in a production data pipeline, you will need to stop and consider some practical matters. First, the notification mechanism used when sensitive data is identified in the objects is email. Email doesn’t scale: you should expand this solution to integrate with your ticketing or workflow management system. If you choose to use email, subscribe a mailing list so that the work of reviewing and responding to alerts is shared across a team.
Second, the application is run on a scheduled basis (every 6 hours by default). You should consider starting the application when your preliminary validations have completed and are ready to perform a sensitive data scan on the data as part of your pipeline. You can modify the EventBridge Event Rule to run in response to an Amazon EventBridge event instead of a scheduled basis.
Third, the application currently uses a 60-second Step Functions Wait state when polling for the Macie discovery job completion. In real-world scenarios, the discovery scan will take 10 minutes at a minimum, and likely much longer depending on the amount of data. You should evaluate the typical execution times for your jobs and tune the polling period accordingly. This will help reduce costs related to running Lambda functions and log storage within CloudWatch Logs. The polling period is defined in the Step Functions state machine definition file (macie_pipeline_scan.asl.json) under the pollForCompletionWait state.
Fourth, the application currently doesn’t account for false positives in the sensitive data discovery job results. Also, the application will progress or delete all objects identified based on the decision by the reviewer. You should consider expanding the application to handle false positives through automation rather than manual review / intervention (such as deleting the files from the manual review bucket or removing the sensitive data tags applied).
Last, the solution will stop the ingestion of a subset of objects into your pipeline. This behavior is similar to other validation and data quality checks that most customers perform as part of the data pipeline. However, you should test to ensure that this will not cause unexpected outcomes and address them in your downstream application logic accordingly.
Conclusion
In this post, I showed you how to integrate sensitive data discovery using Macie as an additional validation step in an automated data pipeline. You’ve reviewed the components of the application, deployed it using the AWS SAM CLI, tested to validate that the application functions as expected, and cleaned up by removing deployed resources.
You now know how to integrate sensitive data scanning into your ETL pipeline. You can use automation and—where required—manual review to help reduce the risk of sensitive data, such as personally identifiable information, being inadvertently ingested into a data lake. You can take this application and customize it to fit your use case and workflows, such as using custom data identifiers as part of your scans, adding additional validation steps, creating Macie suppression rules to define cases to archive findings automatically, or only request manual approvals for findings that meet certain criteria (such as high severity findings).
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon Macie forum.
Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.
As you put more and more data in the cloud, you need to rely on security automation to keep it secure at scale. AWS recently launched Amazon Macie, a fully managed service that uses machine learning and pattern matching to help you detect, classify, and better protect your sensitive data stored in the AWS Cloud.
Many data breaches are not the result of malicious activity from unauthorized users, but rather from mistakes made by authorized users. To monitor and manage the security of sensitive data, you must first be able to identify it. In this post, we show you how to use custom data identifiers with Macie to identify sensitive data. Once you know what’s sensitive, you can start designing security controls that operate at scale to monitor and remediate risk automatically.
Macie comes with a set of managed data identifiers that you can use to discover many types of sensitive data. These are somewhat generic and broadly applicable to many organizations. What makes Macie unique is its ability to help you address specific data needs. Macie enables you to expand your sensitive data detection through the new custom data identifiers. Custom data identifiers can be used to highlight organizational proprietary data, intellectual property, and specific scenarios.
Custom Data Identifiers in Macie help you find and identify sensitive data based on your own organization’s specific needs. In this post, we show you a step-by-step walkthrough of how to define and run custom data identifiers to automatically discover specific, sensitive data. Before you begin using Custom Data Identifiers, you need to enable Macie and configure detailed logging. Follow these instructions to enable Macie and follow these instructions to configure detailed logging, if you haven’t done that already.
When to use the Custom Data Identifier resource
To begin, imagine you’re an IT administrator for a manufacturing company that’s headquartered in France. Your company has acquired a few additional local subsidiaries, including an R&D facility in São Paulo, Brazil. The company is migrating to AWS, and in the process is classifying registration information, employee information, and product data into encrypted and non-encrypted storage.
You want to identify sensitive data for the following three scenarios:
SIRET-NIC: SIRET-NIC is a unique number assigned to businesses in France. This number is issued by the French National Institute of Statistics (INSEE) when a business is registered. A sample file that contains SIRET-NIC information is shown in the following figure. Each record in the file includes the GUID, employee name, employee email, the company name, the date it was issued, and the SIRET-NIC number.
Figure 1: SIRET-NIC dataset
Brazil CPF (Cadastro de Pessoas Físicas – Natural Persons Register): CPF is a unique number assigned by the Brazilian revenue agency to people subject to taxes in the country. Each of your employees residing in the Brazilian office has a CPF.
Prototyping naming convention: Your company has products that are publicly available, but also products that are still in the prototyping stage and should be kept confidential. A sample file that contains Brazil CPF numbers and the prototype names is shown in the following figure.
Figure 2: Brazil CPF and prototype number dataset
Configure the Custom Data Identifier resource in the Macie console
To use custom data identifiers to identify your organization’s sensitive information, you must first create the custom data identifiers and then include them in a sensitive data discovery job. The following steps introduce you to the Custom Data Identifier resource in Macie.
Designing Custom Data Identifiers for use with Amazon Macie
In the previous section, you reviewed the three scenarios that your company would like to protect: SIRET-NIC, Brazil CPF, and the prototype naming convention. You now need to create a specific regex pattern for each of these scenarios. There are different syntaxes and dialects of regular expression languages; Amazon Macie supports a subset of the Perl Compatible Regular Expressions (PCRE) library, and you can learn more about it in the Regex support in custom data identifiers section. Once the patterns are ready, follow the instructions below to create the custom data identifiers.
Enter Amazon Macie in the AWS services search box.
Choose Amazon Macie.
In the navigation pane on the left-hand side, under Settings, choose Custom data identifiers as shown in the following figure.
Figure 3: Custom data identifiers console
Create a custom data identifier
Choose Create on the custom data identifier console.
Name: Enter a name for your custom data identifier. Make it descriptive so you know what it does. For example, enter SIRET-NIC for the SIRET-NIC number you use.
Description: Enter a description of the custom data identifier.
Regular expression (regex): Define the pattern you want to identify. Use a Regular Expression (“regex”) to create the desired pattern. For example, a SIRET-NIC number contains 14 digits—9 numbers followed by a hyphen and then 5 more numbers. The first part, 9 numbers, can stay together or separated by spaces into 3 groups of 3. The specific regex pattern for this is \b(\d{3}\s?){2}\d{3}\-\d{5}\b
Keywords: Define expressions that identify the text to match. The SIRET-NIC number itself is publicly accessible information. In your case, however, you want to encrypt the information about companies that were registered during the month the acquisition happened (April 2020), so that the information doesn’t leak to your competitors. The keywords here will therefore be all the days in April.
(Optional) Ignore words: Use this box to enter text that you want to be ignored. In this example scenario, you know your security training materials always use the example SIRET-NICs 12345789-12345 and 000000000-00000. You can enter these values here, so that your security training materials are not flagged as sensitive data containing SIRET-NICs.
Maximum match distance: Use this box to define the proximity between the result and the keywords. If you enter 20, Macie will provide results that include the specified keyword and 20 characters on either side of it.
Note: Do not select Submit yet. After entering the settings and before selecting Submit, you should test your custom data identifier with sample data to confirm that it works.
With all the attributes set, your console will look like what is shown in Figure 4.
Figure 4: SIRET-NIC custom data identifier creation
Test your SIRET-NIC custom data identifier
Use the Evaluate section on the right-hand panel of the Macie console to confirm that the regex pattern and other configurations for your custom data identifier are correct.
Follow the steps below to use the Evaluate section.
Enter test data in the sample data box.
Select Submit. There will be one match per record in the file if the configurations are correct and your custom data identifier is ready. The following figure is an example of the Evaluate section using test data. The test data has 3 records; each record has 6 fields, which are the GUID, employee name, employee email, company name, date the SIRET-NIC was issued, and the SIRET-NIC number.
Figure 5: Evaluate, showing sample data
After verifying your SIRET-NIC custom data identifier works in the Evaluate section, now select Submit on the New custom data identifier window to create the custom data identifier.
Create a Brazil CPF Custom Data Identifier
Congrats on creating your first custom data identifier! Now use the same steps to create and test custom data identifiers for the Brazil CPF and prototyping naming convention scenarios. The Brazil CPF number usually shows up in the format of 000.000.000-00.
Use the following values for the Brazil CPF scenario, as shown in the following figure:
Name: Brazil CPF
Description: The format for Brazil CPF in our sample data is 000.000.000-00
Regular expression: \b(\d{3}\.){2}\d{3}\-\d{2}\b
Figure 6: Brazil CPF custom data identifier
Create a Prototype Name Custom Data Identifier
Assume that your company has a very strict and regular naming scheme for prototype part numbers. It is P, followed by a hyphen, and then 2 letters and 4 digits. E.g., P-AB1234. You want to identify objects in S3 that contain references to private prototype parts. This is a small pattern, and so if we’re not careful it will cause Macie to flag objects that do not actually contain one of our prototype numbers. We suggest adding \b at the beginning and the end of the regular expression. The \b symbol means a “word boundary” and word boundaries are basically whitespace, punctuation, or other things that are not letters and numbers. With \b, you limit the pattern so that you only match if the entire word matches the pattern. For example, P-AB1234 will match the pattern, but STEP-AB123456 and P-XY123 will not match the pattern. This gives you finer grained control and reduces false positives.
Use the following values for the prototyping name scenario, as shown in the following figure:
Name: Prototyping Naming
Description: Any prototype name that starts with P is private. The format for a private prototype name is P- followed by 2 capital letters and 4 numbers
Regular expression: \bP\-[A-Z]{2}\d{4}\b
Figure 7: Prototyping naming custom data identifier
You should now see a page like the following figure, indicating that the SIRET-NIC, Brazil CPF, and Prototyping Naming custom data identifiers are successfully configured.
Figure 8: Successfully configured custom data identifier
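Before relying on these patterns in a discovery job, you can also sanity-check them locally. Macie supports a subset of PCRE, and the three patterns above also work with Python’s re module; the sample values below are made up for illustration only.

import re

patterns = {
    "SIRET-NIC": r"\b(\d{3}\s?){2}\d{3}\-\d{5}\b",
    "Brazil CPF": r"\b(\d{3}\.){2}\d{3}\-\d{2}\b",
    "Prototyping Naming": r"\bP\-[A-Z]{2}\d{4}\b",
}

samples = {
    "SIRET-NIC": "123 456 789-12345",
    "Brazil CPF": "123.456.789-01",
    "Prototyping Naming": "P-AB1234",
}

for name, pattern in patterns.items():
    result = "match" if re.search(pattern, samples[name]) else "no match"
    print(f"{name}: {result}")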
Set up a Test Bucket to Demonstrate Macie
Before we can see Macie do its work, we have to create a bucket with some test data that we can scan. We’ve provided some sample data files that you can download. Follow these instructions to create a test bucket and load our test data into the test bucket.
Choose Create bucket. The Create bucket wizard opens.
In Bucket name, enter a DNS-compliant name for your bucket. The bucket name must:
Be unique across all of Amazon S3.
Be between 3 and 63 characters long.
Not contain uppercase characters.
Start with a lowercase letter or number.
We created a bucket called bucketformacieuse; you have to choose another name because this one is already taken by us.
In Region, choose the AWS Region where you want the bucket to reside.
Select Create to finish creating the bucket.
Open the bucket you just created and upload the two sample data files that you downloaded earlier.
Use Macie to create a job to scan your data
Now you can create a job to scan your Amazon S3 bucket to detect and locate the data patterns defined in the SIRET-NIC, Brazil CPF, and Prototyping Naming custom data identifiers.
To create a job
In the navigation pane, choose Jobs, and then select Create Job on the upper right.
Select Amazon S3 buckets: Select the S3 bucket you want to analyze. In this case, we are using the bucket previously created, bucketformacieuse.
Review Amazon S3 buckets: Verify that you selected the S3 bucket you want the job to scan and analyze.
Scope: Select your scope. For this example, choose the One-time job option as your scope. The scope specifies how often you want the job to run. This can be either a one-time job or a scheduled job. If you choose a scheduled job, you can define how often you want your job to scan your Amazon S3 bucket.
Custom data identifiers: Select the 3 custom data identifiers you created to be associated with this job, and then select Next. This is shown in the following figure.
Figure 9: Select your custom data identifiers
Name and description: Enter a name and description for the job.
Review and create: Review and verify all your settings, and then select Create.
You now have a job in Macie to scan the Amazon S3 buckets you’ve chosen using the 3 custom data identifiers you created. More information about creating jobs is available in Running sensitive data discovery jobs in Amazon Macie.
Respond to results
Macie helps improve your security posture when you respond effectively to the findings that it produces. For our example, we’ll show you how to review your findings manually. You can look at your findings by bucket, type, or job, or see a collective summary of all findings. In this example, let’s look at all findings.
To review your results
In the navigation pane on the left-hand side, choose Findings. Findings include the severity, the type, the resources affected, and when the findings were last updated.
The following figure shows an example of the results you might see on the findings page. There are two findings for the selected job. The compagnie_français.csv and the empresa_brasileira.csv files contain the custom data identifiers that you created earlier and added to the job.
Figure 10: Findings
Let’s look at the details of one of the findings so you can review the results. From the Findings page, select the file that contains your custom data identifier for the Brazil CPF: empresa_brasileira.csv. The number of custom data identifiers found in the document is shown in the Result section on the right, as shown in the following figure.
Figure 11: Findings detail page for the Brazil CPF custom data identifiers
Now look at the findings details for the compagnie_français.csv file. It shows the number of custom data identifiers found in the file. In this case Macie found 13 SIRET-NIC numbers as shown in the following figure.
Figure 12: Findings page for the French company file
If you configured detailed logging, the results will be saved in the Amazon S3 bucket you specified. The S3 bucket location can be found in the Details section after Detailed result location as shown in the preceding figure.
Now that you’ve used Macie and the Custom Data Identifiers resource to obtain these findings, you can identify what data to place in encrypted storage, and what can be placed in non-encrypted storage when migrating to AWS. Macie and custom data identifiers provide an automated tool to help you enhance protection of your sensitive data by providing you the information to help detect and classify your data in the AWS Cloud.
Using Macie at Scale
Custom data identifiers help you tell Macie what to look for. As you move more and more data to the cloud, you’ll need to create new identifiers and new rules. As your rules and identifiers grow, you will need to create automation that responds to what is found. For example, perhaps a Lambda function turns on encryption in a bucket when it finds sensitive data in that bucket. Or perhaps a function automatically applies tags to buckets where sensitive data is found, and those buckets and their owners start to appear on reports for audit and compliance. Once you’ve done this at small scale, think about how you will automate responses at larger scale; a minimal sketch of one such responder follows.
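The following is a minimal sketch of a Lambda function that an EventBridge rule for Macie findings could invoke to tag the affected bucket. The event field names reflect the Macie finding structure as delivered through EventBridge and should be verified against your own findings; also note that put_bucket_tagging replaces the bucket’s existing tag set.

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Assumed location of the bucket name within a Macie finding event
    # delivered through EventBridge; verify against your own findings.
    bucket_name = event["detail"]["resourcesAffected"]["s3Bucket"]["name"]

    # Tag the bucket so it appears on audit and compliance reports.
    # Caution: this call replaces any existing tags on the bucket.
    s3.put_bucket_tagging(
        Bucket=bucket_name,
        Tagging={"TagSet": [{"Key": "SensitiveDataFound", "Value": "true"}]},
    )
    return {"tagged_bucket": bucket_name}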
Conclusion
The new Custom Data Identifier resource in the newly enhanced Macie can help you detect, classify, and protect sensitive data types unique to your organization. This post focused on the functionality and use of custom data identifiers to automatically discover sensitive data stored in Amazon S3. You can also review the managed data identifiers to see a list of personally identifiable information (PII) that Macie can detect by default. Visit What is Amazon Macie? to learn more.
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon Macie forum or contact AWS Support.
Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.