Tag Archives: Amazon S3

How to delete user data in an AWS data lake

Post Syndicated from George Komninos original https://aws.amazon.com/blogs/big-data/how-to-delete-user-data-in-an-aws-data-lake/

General Data Protection Regulation (GDPR) is an important aspect of today’s technology world, and processing data in compliance with GDPR is a necessity for those who implement solutions within the AWS public cloud. One article of GDPR is the “right to erasure” or “right to be forgotten” which may require you to implement a solution to delete specific users’ personal data.

In the context of the AWS big data and analytics ecosystem, every architecture, regardless of the problem it targets, uses Amazon Simple Storage Service (Amazon S3) as the core storage service. Despite its versatility and feature completeness, Amazon S3 doesn’t come with an out-of-the-box way to map a user identifier to the S3 keys of objects that contain that user’s data.

This post walks you through a framework that helps you purge individual user data within your organization’s AWS hosted data lake, and an analytics solution that uses different AWS storage layers, along with sample code targeting Amazon S3.

Reference architecture

To address the challenge of implementing a data purge framework, we reduced the problem to the straightforward use case of deleting a user’s data from a platform that uses AWS for its data pipeline. The following diagram illustrates this use case.

We’re introducing the idea of building and maintaining an index metastore that keeps track of the location of each user’s records and allows us to locate them efficiently, reducing the search space.

You can use the following architecture diagram to delete a specific user’s data within your organization’s AWS data lake.

For this initial version, we created three user flows that map each task to a fitting AWS service:

Flow 1: Real-time metastore update

The S3 ObjectCreated or ObjectRemoved events trigger an AWS Lambda function that parses the object and performs an add/update/delete operation to keep the metadata index up to date. You can implement a simple workflow for any other storage layer, such as Amazon Relational Database Service (RDS), Amazon Aurora, or Amazon Elasticsearch Service (ES). We use Amazon DynamoDB and Amazon RDS for PostgreSQL as the index metadata storage options, but our approach is flexible to any other technology.
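As a rough illustration of this flow, the following minimal sketch shows how a Lambda handler might turn an S3 event notification into index operations. The function name index_actions and the "upsert"/"delete" labels are hypothetical and are not taken from the sample repository; the sketch assumes the standard S3 notification layout, in which deletions arrive with event names that begin with ObjectRemoved and object keys are URL-encoded.

```python
from urllib.parse import unquote_plus


def index_actions(event):
    """Translate an S3 event notification into (action, bucket, key) tuples."""
    actions = []
    for record in event.get("Records", []):
        name = record["eventName"]  # for example "ObjectCreated:Put"
        bucket = record["s3"]["bucket"]["name"]
        # Object keys are URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if name.startswith("ObjectCreated"):
            actions.append(("upsert", bucket, key))
        elif name.startswith("ObjectRemoved"):
            actions.append(("delete", bucket, key))
    return actions


def lambda_handler(event, context):
    # A real handler would parse each object and write to the index store here
    # (hypothetical; the storage call is intentionally left out of this sketch).
    return index_actions(event)
```

Keeping the event parsing separate from the storage call makes it possible to swap DynamoDB, PostgreSQL, or another index store without touching the trigger logic.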

Flow 2: Purge data

When a user asks for their data to be deleted, we trigger an AWS Step Functions state machine through Amazon CloudWatch to orchestrate the workflow. Its first step triggers a Lambda function that queries the metadata index to identify the storage layers that contain user records and generates a report that’s saved to an S3 report bucket. A Step Functions activity is created and picked up by a Node.js-based Lambda worker that sends an email to the approver through Amazon Simple Email Service (SES) with approve and reject links.

The following diagram shows a graphical representation of the Step Functions state machine as seen on the AWS Management Console.

The approver selects one of the two links, which then calls an Amazon API Gateway endpoint that invokes Step Functions to resume the workflow. If the approver chooses the approve link, Step Functions triggers a Lambda function that takes the report stored in the bucket as input, deletes the objects or records from the storage layer, and updates the index metastore. When the purging job is complete, Amazon Simple Notification Service (SNS) sends a success or fail email to the user.
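One way to picture the resume step is a small function that reports the approver's decision back to Step Functions using the task token handed out with the activity. This is only a sketch under assumptions: the function name resume_workflow, the decision strings, and the idea of placing this logic in a Lambda function behind the API are all hypothetical. The client is expected to be a boto3 Step Functions client, whose send_task_success and send_task_failure calls accept the parameters shown.

```python
import json


def resume_workflow(sfn_client, task_token, decision):
    """Resume a paused Step Functions activity based on the approver's choice."""
    if decision == "approve":
        sfn_client.send_task_success(
            taskToken=task_token,
            output=json.dumps({"approved": True}),
        )
        return "approved"
    # Anything other than an explicit approval is treated as a rejection.
    sfn_client.send_task_failure(
        taskToken=task_token,
        error="Rejected",
        cause="The approver rejected the purge request",
    )
    return "rejected"
```

In a deployment, the call would look something like resume_workflow(boto3.client("stepfunctions"), token, "approve"), with the token taken from the link the approver selected.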

The following diagram represents the Step Functions flow on the console if the purge flow completed successfully.

For the complete code base, see step-function-definition.json in the GitHub repo.

Flow 3: Batch metastore update

This flow refers to the use case of an existing data lake for which an index metastore needs to be created. You can orchestrate the flow through AWS Step Functions, which takes historical data as input and updates the metastore through a batch job. Our current implementation doesn’t include a sample script for this user flow.

Our framework

We now walk you through the two use cases we followed for our implementation:

  • You have multiple user records stored in each Amazon S3 file
  • A user has records stored in homogenous AWS storage layers

Within these two approaches, we demonstrate alternatives that you can use to store your index metastore.

Indexing by S3 URI and row number

For this use case, we use a free tier RDS Postgres instance to store our index. We created a simple table with the following code:

CREATE UNLOGGED TABLE IF NOT EXISTS user_objects (
    userid TEXT,
    s3path TEXT,
    recordline INTEGER
);

You can index on userid to optimize query performance. On object upload, for each row, you need to insert into the user_objects table a row that indicates the user ID, the URI of the target Amazon S3 object, and the line number of the record. For instance, consider uploading the following JSON input:

{"user_id":"V34qejxNsCbcgD8C0HVk-Q","body":"…"}
{"user_id":"ofKDkJKXSKZXu5xJNGiiBQ","body":"…"}
{"user_id":"UgMW8bLE0QMJDCkQ1Ax5Mg","body":"…"}

If the object is uploaded to the Amazon S3 location s3://gdpr-demo/year=2018/month=2/day=26/input.json, we insert the following tuples into user_objects:

("V34qejxNsCbcgD8C0HVk-Q", "s3://gdpr-demo/year=2018/month=2/day=26/input.json", 0)
("ofKDkJKXSKZXu5xJNGiiBQ", "s3://gdpr-demo/year=2018/month=2/day=26/input.json", 1)
("UgMW8bLE0QMJDCkQ1Ax5Mg", "s3://gdpr-demo/year=2018/month=2/day=26/input.json", 2)
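The tuples can be derived directly from the uploaded file. The following is a minimal sketch, assuming newline-delimited JSON with a user_id field and zero-based line numbers as in the example; the helper name index_rows is hypothetical, and writing the rows to PostgreSQL is left out.

```python
import json


def index_rows(s3_uri, body):
    """Return (userid, s3path, recordline) for each JSON line in `body`."""
    rows = []
    for line_number, line in enumerate(body.splitlines()):
        if not line.strip():
            continue  # skip blank lines but keep numbering aligned with the file
        record = json.loads(line)
        rows.append((record["user_id"], s3_uri, line_number))
    return rows
```

Each returned tuple maps onto one row of the user_objects table, so a Lambda function could pass the list to a batched INSERT.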

You can implement the index update operation by using a Lambda function triggered on any Amazon S3 ObjectCreated event.

When we get a delete request from a user, we need to query our index to get some information about where we have stored the data to delete. See the following code:

SELECT s3path,
       ARRAY_AGG(recordline)
FROM user_objects
WHERE userid = 'V34qejxNsCbcgD8C0HVk-Q'
GROUP BY s3path;

The preceding example SQL query returns rows like the following:

("s3://gdpr-review/year=2015/month=12/day=21/review-part-0.json", {2102,529})

The output indicates that lines 529 and 2102 of S3 object s3://gdpr-review/year=2015/month=12/day=21/review-part-0.json contain the requested user’s data and need to be purged. We then need to download the object, remove those rows, and overwrite the object. For a Python implementation of the Lambda function that implements this functionality, see deleteUserRecords.py in the GitHub repo.

Having the record line available allows you to perform the deletion efficiently in byte format. For implementation simplicity, we purge the rows by replacing the deleted rows with an empty JSON object. You pay a slight storage overhead, but you don’t need to update subsequent row metadata in your index, which would be costly. To eliminate empty JSON objects, we can implement an offline vacuum and index update process.
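The replace-with-empty-object idea can be sketched in a few lines. This is an illustrative stand-in rather than the code in deleteUserRecords.py: the function name purge_lines is hypothetical, it works on the whole object body as a string instead of at the byte level, and it assumes the zero-based line numbers stored in the index.

```python
def purge_lines(body, line_numbers):
    """Replace the given zero-based lines with an empty JSON object."""
    targets = set(line_numbers)
    lines = body.split("\n")
    # Untouched lines keep their position, so the index stays valid for them.
    return "\n".join(
        "{}" if i in targets else line for i, line in enumerate(lines)
    )
```

Because every surviving record stays on its original line, no other row in the index needs to be rewritten after the purge.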

Indexing by file name and grouping by index key

For this use case, we created a DynamoDB table to store our index. We chose DynamoDB because of its ease of use and scalability; you can use its on-demand pricing model so you don’t need to guess how many capacity units you might need. When files are uploaded to the data lake, a Lambda function parses the file name (for example, 1001-.csv) to identify the user identifier and populates the DynamoDB metadata table. Userid is the partition key, and each different storage layer has its own attribute. For example, if user 1001 had data in Amazon S3 and Amazon RDS, their record looks like the following:

{"userid": 1001, "s3":{"s3://path1", "s3://path2"}, "RDS":{"db1.table1.column1"}}

For a sample Python implementation of this functionality, see update-dynamo-metadata.py in the GitHub repo.
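To make the grouping concrete, here is a small sketch that uses an in-memory dictionary as a stand-in for the DynamoDB table. The helper names and the file name pattern (a user ID, a hyphen, then an arbitrary suffix such as part1.csv) are assumptions made for illustration and may differ from update-dynamo-metadata.py.

```python
import posixpath


def user_id_from_key(key):
    """Extract the user ID prefix from a key such as 'data/1001-part1.csv'."""
    file_name = posixpath.basename(key)
    user_id, separator, _ = file_name.partition("-")
    if not separator or not user_id:
        raise ValueError(f"no user ID prefix in {file_name!r}")
    return user_id


def add_location(index, user_id, layer, location):
    """Record one storage location for a user in an in-memory stand-in index."""
    # Each storage layer is its own attribute holding a set of locations.
    index.setdefault(user_id, {}).setdefault(layer, set()).add(location)
    return index
```

Against a real table, a similar effect is possible with a DynamoDB update expression that uses ADD on a string set attribute, so that repeated uploads for the same user accumulate in one item.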

On delete request, we query the metastore table, which is DynamoDB, and generate a purge report that contains details on what storage layers contain user records, and storage layer specifics that can speed up locating the records. We store the purge report to Amazon S3. For a sample Lambda function that implements this logic, see generate-purge-report.py in the GitHub repo.

After the purging is approved, we use the report as input to delete the required resources. For a sample Lambda function implementation, see gdpr-purge-data.py in the GitHub repo.

Implementation and technology alternatives

We explored and evaluated multiple implementation options, all of which present tradeoffs, such as implementation simplicity, efficiency, critical data compliance, and feature completeness:

  • Scan every record of the data file to create an index – Whenever a file is uploaded, we iterate through its records and generate tuples (userid, s3Uri, row_number) that are then inserted to our metadata storing layer. On delete request, we fetch the metadata records for requested user IDs, download the corresponding S3 objects, perform the delete in place, and re-upload the updated objects, overwriting the existing object. This is the most flexible approach because it supports a single object to store multiple users’ data, which is a very common practice. The flexibility comes at a cost because it requires downloading and re-uploading the object, which introduces a network bottleneck in delete operations. User activity datasets such as customer product reviews are a good fit for this approach, because it’s unexpected to have multiple records for the same user within each partition (such as a date partition), and it’s preferable to combine multiple users’ activity in a single file. It’s similar to what was described in the section “Indexing by S3 URI and row number” and sample code is available in the GitHub repo.
  • Store metadata as file name prefix – Adding the user ID as the prefix of the uploaded object under the different partitions that are defined based on query pattern enables you to reduce the required search operations on delete request. The metadata handling utility finds the user ID from the file name and maintains the index accordingly. This approach is efficient in locating the resources to purge but assumes a single user per object, and requires you to store user IDs within the filename, which might require InfoSec considerations. Clickstream data, where you would expect to have multiple click events for a single customer on a single date partition during a session, is a good fit. We covered this approach in the section “Indexing by file name and grouping by index key” and you can download the codebase from the GitHub repo.
  • Use a metadata file – Along with uploading a new object, we also upload a metadata file that’s picked up by an indexing utility to create the index and keep it up to date. On delete request, we query the index, which points us to the records to purge. A good fit for this approach is a use case that already involves uploading a metadata file whenever a new object is uploaded, such as uploading multimedia data, along with their metadata. Otherwise, uploading a metadata file on every object upload might introduce too much of an overhead.
  • Use the tagging feature of AWS services – Whenever a new file is uploaded to Amazon S3, we use the Put Object Tagging Amazon S3 operation to add a key-value pair for the user identifier. Whenever there is a user data delete request, we fetch the objects with that tag and delete them. This option is straightforward to implement using the existing Amazon S3 API and can therefore serve as an initial version of your implementation. However, it has significant limitations. It assumes a 1:1 cardinality between Amazon S3 objects and users (each object only contains data for a single user), searching objects based on a tag is limited and inefficient, and storing user identifiers as tags might not be compliant with your organization’s InfoSec policy.
  • Use Apache Hudi – Apache Hudi is becoming a very popular option to perform record-level data deletion on Amazon S3. Its current version is restricted to Amazon EMR, and you can use it if you start to build your data lake from scratch, because you need to store your data as Hudi datasets. Hudi is a very active project and additional features and integrations with more AWS services are expected.

The key implementation decision of our approach is separating the storage layer we use for our data and the one we use for our metadata. As a result, our design is versatile and can be plugged into any existing data pipeline. Similar to deciding what storage layer to use for your data, there are many factors to consider when deciding how to store your index:

  • Concurrency of requests – If you don’t expect too many simultaneous inserts, even something as simple as Amazon S3 could be a starting point for your index. However, if you get multiple concurrent writes for multiple users, you need to look into a service that copes better with transactions.
  • Existing team knowledge and infrastructure – In this post, we demonstrated using DynamoDB and RDS Postgres for storing and querying the metadata index. If your team has no experience with either of those but is comfortable with Amazon ES, Amazon DocumentDB (with MongoDB compatibility), or any other storage layer, use those. Furthermore, if you’re already running (and paying for) a MySQL database that’s not used to capacity, you could use that for your index for no additional cost.
  • Size of index – The volume of your metadata is orders of magnitude lower than your actual data. However, if your dataset grows significantly, you might need to consider going for a scalable, distributed storage solution rather than, for instance, a relational database management system.

Conclusion

GDPR has transformed best practices and introduced several extra technical challenges in designing and implementing a data lake. The reference architecture and scripts in this post may help you delete data in a manner that’s compliant with GDPR.

Let us know your feedback in the comments and how you implemented this solution in your organization, so that others can learn from it.

 


About the Authors

George Komninos is a Data Lab Solutions Architect at AWS. He helps customers convert their ideas to a production-ready data product. Before AWS, he spent 3 years at Alexa Information domain as a data engineer. Outside of work, George is a football fan and supports the greatest team in the world, Olympiacos Piraeus.

Sakti Mishra is a Data Lab Solutions Architect at AWS. He helps customers architect data analytics solutions, which gives them an accelerated path towards modernization initiatives. Outside of work, Sakti enjoys learning new technologies, watching movies, and travel.

Discover sensitive data by using custom data identifiers with Amazon Macie

Post Syndicated from Kayla Jing original https://aws.amazon.com/blogs/security/discover-sensitive-data-by-using-custom-data-identifiers-with-amazon-macie/

As you put more and more data in the cloud, you need to rely on security automation to keep it secure at scale. AWS recently launched Amazon Macie, a fully managed service that uses machine learning and pattern matching to help you detect, classify, and better protect your sensitive data stored in the AWS Cloud.

Many data breaches are not the result of malicious activity from unauthorized users, but rather from mistakes made by authorized users. To monitor and manage the security of sensitive data, you must first be able to identify it. In this post, we show you how to use custom data identifiers with Macie to identify sensitive data. Once you know what’s sensitive, you can start designing security controls that operate at scale to monitor and remediate risk automatically.

Macie comes with a set of managed data identifiers that you can use to discover many types of sensitive data. These are somewhat generic and broadly applicable to many organizations. What makes Macie unique is its ability to help you address specific data needs. Macie enables you to expand your sensitive data detection through the new custom data identifiers. Custom data identifiers can be used to highlight organizational proprietary data, intellectual property, and specific scenarios.

Custom data identifiers in Macie help you find and identify sensitive data based on your organization’s specific needs. In this post, we walk you step by step through defining and running custom data identifiers to automatically discover specific sensitive data. Before you begin using custom data identifiers, you need to enable Macie and configure detailed logging, if you haven’t done that already.

When to use the Custom Data Identifier resource

To begin, imagine you’re an IT administrator for a manufacturing company that’s headquartered in France. Your company has acquired a few additional local subsidiaries, including an R&D facility in São Paulo, Brazil. The company is migrating to AWS, and in the process is classifying registration information, employee information, and product data into encrypted and non-encrypted storage.

You want to identify sensitive data for the following three scenarios:

  • SIRET-NIC: SIRET-NIC is a unique number assigned to businesses in France. This number is issued by the French National Institute of Statistics (INSEE) when a business is registered. A sample file that contains SIRET-NIC information is shown in the following figure. Each record in the file includes the GUID, employee name, employee email, the company name, the date it was issued, and the SIRET-NIC number.

    Figure 1: SIRET-NIC dataset

  • Brazil CPF (Cadastro de Pessoas Físicas – Natural Persons Register): CPF is a unique number assigned by the Brazilian revenue agency to people subject to taxes in the country. Each of your employees residing in the Brazilian office has a CPF.
  • Prototyping naming convention: Your company has products that are publicly available, but also products that are still in the prototyping stage and should be kept confidential. A sample file that contains Brazil CPF numbers and the prototype names is shown in the following figure.

    Figure 2: Brazil CPF and prototype number dataset

Configure the Custom Data Identifier resource in the Macie console

To use custom data identifiers to identify your organization’s sensitive information, you must:

  1. Create custom data identifiers.
  2. Create a job to scan your Amazon Simple Storage Service (Amazon S3) bucket to locate the data patterns that match your custom data identifiers.
  3. Respond to the returned results.

The following steps introduce you to the Custom Data Identifier resource in Macie.

Designing Custom Data Identifiers for use with Amazon Macie

In the previous section, you identified three scenarios that your company would like to protect: SIRET-NIC, Brazil CPF, and your prototyping naming convention. You first need to create a specific regex pattern for each of these scenarios. There are different syntaxes and dialects of regular expression languages. Amazon Macie supports a subset of the Perl Compatible Regular Expressions (PCRE) library, and you can learn more about it in the Regex support in custom data identifiers section. Once the patterns are ready, follow the instructions below to create the custom data identifiers.

Creating Custom Data Identifiers in Amazon Macie

  1. Sign in to the AWS Management Console.
  2. Enter Amazon Macie in the AWS services search box.
  3. Choose Amazon Macie.
  4. In the navigation pane on the left-hand side, under Settings, choose Custom data identifiers as shown in the following figure.

    Figure 3: Custom data identifiers console

Create a custom data identifier

  1. Choose Create on the custom data identifier console.
  2. Name: Enter a name for your custom data identifier. Make it descriptive so you know what it does. For example, enter SIRET-NIC for the SIRET-NIC number you use.
  3. Description: Enter a description of the custom data identifier.
  4. Regular expression (regex): Define the pattern you want to identify using a regular expression (“regex”). For example, a SIRET-NIC number contains 14 digits: 9 digits followed by a hyphen and then 5 more digits. The first 9 digits can be written together or separated by spaces into 3 groups of 3. The specific regex pattern for this is \b(\d{3}\s?){2}\d{3}\-\d{5}\b
  5. Keywords: Define expressions that identify the text to match. The SIRET-NIC number itself is publicly accessible information. But in your case, you want to encrypt the information about the company that was registered during the month the acquisition happened (April 2020), so that the information does not leak to your competitors. So, the keywords here will be all the days in April.
  6. (Optional) Ignore words: Use this box to enter text that you want to be ignored. In this example scenario, you know your security training materials always use the example SIRET-NICs 123456789-12345 and 000000000-00000. You can enter these values here, so that your security training materials are not flagged as sensitive data containing SIRET-NICs.
  7. Maximum match distance: Use this box to define the proximity between the result and the keywords. If you enter 20, Macie will provide results that include the specified keyword and 20 characters on either side of it.

Note: Do not select Submit yet. After entering the settings and before selecting Submit, you should test your custom data identifier with sample data to confirm that it works.

With all the attributes set, your console will look like what is shown in Figure 4.

Figure 4: SIRET-NIC custom data identifier creation

Test your SIRET-NIC custom data identifier

Use the Evaluate section on the right-hand panel of the Macie console to confirm that the regex pattern and other configurations for your custom data identifier are correct.

Follow the steps below to use the Evaluate section.

  1. Enter test data in the sample data box.
  2. Select Submit. If the configuration is correct, there is one match per record in the file, and your custom data identifier is ready. The following figure is an example of the Evaluate section using test data. The test data has 3 records, and each record has 6 fields: GUID, employee name, employee email, company name, date SIRET-NIC was issued, and the SIRET-NIC number.

    Figure 5: Evaluate, showing sample data

  3. After verifying that your SIRET-NIC custom data identifier works in the Evaluate section, select Submit on the New custom data identifier window to create the custom data identifier.
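The pattern can also be sanity-checked locally before it is pasted into the console. The sketch below uses Python's re module, which is an assumption of convenience: Macie supports a subset of PCRE, so behavior may differ in edge cases. The ignore-word handling here (dropping any match that contains an ignored value) is a simplified imitation, and the sample number in the usage note is made up.

```python
import re

# The SIRET-NIC pattern from the walkthrough.
SIRET_NIC = re.compile(r"\b(\d{3}\s?){2}\d{3}\-\d{5}\b")

# One of the training-material values mentioned in the walkthrough.
IGNORE_WORDS = {"000000000-00000"}


def find_siret_nic(text):
    """Return SIRET-NIC-like matches in `text`, skipping ignored values."""
    matches = (m.group(0) for m in SIRET_NIC.finditer(text))
    return [m for m in matches if not any(w in m for w in IGNORE_WORDS)]
```

For instance, find_siret_nic("issued 732 829 320-00074 in April") returns the single spaced number, while a line that contains only 000000000-00000 returns an empty list.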

Create a Brazil CPF Custom Data Identifier

Congrats on creating your first custom data identifier! Now use the same steps to create and test custom data identifiers for the Brazil CPF and prototyping naming convention scenarios. The Brazil CPF number usually shows up in the format of 000.000.000-00.

Use the following values for the Brazil CPF scenario, as shown in the following figure:

  • Name: Brazil CPF
  • Description: The format for Brazil CPF in our sample data is 000.000.000-00
  • Regular expression: \b(\d{3}\.){2}\d{3}\-\d{2}\b

    Figure 6: Brazil CPF custom data identifier

Create a Prototype Name Custom Data Identifier

Assume that your company has a very strict and regular naming scheme for prototype part numbers. It is P, followed by a hyphen, and then 2 capital letters and 4 digits (for example, P-AB1234). You want to identify objects in S3 that contain references to private prototype parts. This is a small pattern, and so if we’re not careful it will cause Macie to flag objects that do not actually contain one of our prototype numbers. We suggest adding \b at the beginning and the end of the regular expression. The \b symbol means a “word boundary” and word boundaries are basically whitespace, punctuation, or other things that are not letters and numbers. With \b, you limit the pattern so that you only match if the entire word matches the pattern. For example, P-AB1234 will match the pattern, but STEP-AB123456 and P-XY123 will not match the pattern. This gives you finer grained control and reduces false positives.

Use the following values for the prototyping name scenario, as shown in the following figure:

  • Name: Prototyping Naming
  • Description: Any prototype name that starts with P is private. The format for a private prototype name is P-, followed by 2 capital letters and 4 digits
  • Regular expression: \bP\-[A-Z]{2}\d{4}\b

    Figure 7: Prototyping naming custom data identifier
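The effect of the word boundaries is easy to see by comparing the pattern with and without \b. The following sketch uses Python's re module as a stand-in for Macie's regex engine (an assumption; Macie supports a subset of PCRE), and the helper name compare is hypothetical.

```python
import re

WITH_BOUNDARIES = re.compile(r"\bP\-[A-Z]{2}\d{4}\b")
WITHOUT_BOUNDARIES = re.compile(r"P\-[A-Z]{2}\d{4}")


def compare(text):
    """Return (matches with \\b, matches without \\b) for one sample string."""
    return (
        bool(WITH_BOUNDARIES.search(text)),
        bool(WITHOUT_BOUNDARIES.search(text)),
    )


for sample in ["P-AB1234", "STEP-AB123456", "P-XY123"]:
    print(sample, compare(sample))
```

Without the boundaries, STEP-AB123456 is flagged because it contains the substring P-AB1234; with them, only the standalone P-AB1234 matches. P-XY123 fails both patterns because it has only three digits.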

You should now see a page like the following figure, indicating that the SIRET-NIC, Brazil CPF, and Prototyping Naming custom data identifiers are successfully configured.

Figure 8: Successfully configured custom data identifier

Set up a Test Bucket to Demonstrate Macie

Before we can see Macie do its work, we have to create a bucket with some test data that we can scan. We’ve provided some sample data files that you can download. Follow these instructions to create a test bucket and load our test data into the test bucket.

  1. Download the sample data and unzip it.
  2. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.
  3. Choose Create bucket. The Create bucket wizard opens.
  4. In Bucket name, enter a DNS-compliant name for your bucket. The bucket name must:
    • Be unique across all of Amazon S3.
    • Be between 3 and 63 characters long.
    • Not contain uppercase characters.
    • Start with a lowercase letter or number.

    We created a bucket called bucketformacieuse; you must choose a different name because bucket names are unique across Amazon S3 and this one is already taken.

  5. In Region, choose the AWS Region where you want the bucket to reside.
  6. Select Create to finish the bucket creation.
  7. Open the bucket you just created and upload the two sample data files you downloaded in step 1.
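If you script the bucket creation, a quick local check of the name can save a failed request. The sketch below covers the rules listed in step 4 plus the general S3 rule that names consist of lowercase letters, numbers, dots, and hyphens and begin and end with a letter or number. It is not exhaustive (for example, it does not reject names formatted like IP addresses), it cannot test global uniqueness, and the function name is hypothetical.

```python
import re

# 3 to 63 characters, lowercase letters, digits, dots, and hyphens,
# starting and ending with a letter or digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name):
    """Return True if `name` passes the basic S3 bucket naming checks."""
    return _BUCKET_NAME.fullmatch(name) is not None
```

For example, is_valid_bucket_name("bucketformacieuse") passes, while a name with uppercase characters or fewer than 3 characters does not.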

Use Macie to create a job to scan your data

Now you can create a job to scan your Amazon S3 bucket to detect and locate the data patterns defined in the SIRET-NIC, Brazil CPF, and Prototyping Naming custom data identifiers.

To create a job

  1. In the navigation pane, choose Jobs, and then select Create Job on the upper right.
  2. Select Amazon S3 buckets: Select the S3 bucket you want to analyze. In this case, we are using the bucket previously created, bucketformacieuse.
  3. Review Amazon S3 buckets: Verify that you selected the S3 bucket you want the job to scan and analyze.
  4. Scope: Select your scope. For this example, choose the One-time job option as your scope. The scope specifies how often you want the job to run. This can be either a one-time job or a scheduled job. If you choose a scheduled job, you can define how often you want your job to scan your Amazon S3 bucket.
  5. Custom data identifiers: Select the 3 custom data identifiers you created to be associated with this job, and then select Next. This is shown in the following figure.

    Figure 9: Select your custom data identifiers

  6. Name and description: Enter a name and description for the job.
  7. Review and create: Review and verify all your settings, and then select Create.

You now have a job in Macie to scan the Amazon S3 buckets you’ve chosen using the 3 custom data identifiers you created. More information about creating jobs is available in Running sensitive data discovery jobs in Amazon Macie.
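The same job can be described programmatically. The following sketch only builds the request parameters; it assumes the boto3 macie2 client and its create_classification_job operation, and the helper name, account ID, and identifier IDs in the usage note are placeholders rather than values from this walkthrough.

```python
import uuid


def build_job_request(account_id, bucket, identifier_ids, name):
    """Build parameters for a one-time Macie classification job."""
    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token for the request
        "jobType": "ONE_TIME",
        "name": name,
        "customDataIdentifierIds": list(identifier_ids),
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": [bucket]}
            ]
        },
    }
```

With credentials in place, the request could be submitted with something like boto3.client("macie2").create_classification_job(**build_job_request("111122223333", "bucketformacieuse", ids, "cdi-scan")), where ids holds the IDs of the three custom data identifiers.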

Respond to results

Macie helps you stay secure only if you respond effectively to the findings that it produces. For our example, we’ll show you how to review your findings manually. You can look at your findings by bucket, type, or job, or see a collective summary of all findings. In this example, let’s look at all findings.

To review your results

  1. In the navigation pane on the left-hand side, choose Findings. Findings include the severity, the type, the resources affected, and when the findings were last updated.
  2. The following figure shows an example of the results you might see on the findings page. There are two findings for the selected job. The compagnie_français.csv and the empresa_brasileira.csv files contain the custom data identifiers that you created earlier and added to the job.

    Figure 10: Findings

  3. Let’s look at the details of one of the findings so you can review the results. From the page showing the findings, select the file that contains your custom data identifier for the Brazil CPF: empresa_brasileira.csv. The number of custom data identifiers found in the document is shown in the Result section on the right, as shown in the following figure.

    Figure 11: Findings detail page for the Brazil CPF custom data identifiers

  4. Now look at the findings details for the compagnie_français.csv file. It shows the number of custom data identifiers found in the file. In this case Macie found 13 SIRET-NIC numbers as shown in the following figure.

    Figure 12: Findings page for the French company file

  5. If you configured detailed logging, the results will be saved in the Amazon S3 bucket you specified. The S3 bucket location can be found in the Details section after Detailed result location as shown in the preceding figure.

Now that you’ve used Macie and the Custom Data Identifiers resource to obtain these findings, you can identify what data to place in encrypted storage, and what can be placed in non-encrypted storage when migrating to AWS. Macie and custom data identifiers provide an automated tool to help you enhance protection of your sensitive data by providing you the information to help detect and classify your data in the AWS Cloud.

Using Macie at Scale

Custom data identifiers help you tell Macie what to look for. As you move more and more data to the cloud, you’ll need to make new identifiers and new rules. As your rules and identifiers grow, you will need to create automation that responds to what is found. For example, an AWS Lambda function could turn on encryption for a bucket when Macie finds sensitive data in it. Or a function could automatically apply tags to buckets where sensitive data is found, so that those buckets and their owners start to appear on reports for audit and compliance. Once you’ve done this at small scale, think about how you will automate responses at larger scale.
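As a starting point for that kind of automation, the sketch below shows the two pieces such a function would need: pulling the bucket name out of a Macie finding event and turning on default encryption. It assumes the finding arrives through Amazon EventBridge with the bucket under detail.resourcesAffected.s3Bucket.name, and that the client passed in is a boto3 S3 client; the function names are hypothetical.

```python
def bucket_from_finding(event):
    """Return the affected bucket name from a Macie finding event."""
    return event["detail"]["resourcesAffected"]["s3Bucket"]["name"]


def enable_default_encryption(s3_client, bucket):
    """Turn on SSE-S3 default encryption for `bucket`."""
    s3_client.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration={
            "Rules": [
                {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
            ]
        },
    )
    return bucket


def lambda_handler(event, context, s3_client=None):
    # The client is injectable so the logic can be exercised without AWS access.
    if s3_client is None:
        import boto3  # imported lazily; only needed when running in AWS
        s3_client = boto3.client("s3")
    return enable_default_encryption(s3_client, bucket_from_finding(event))
```

The same structure works for the tagging variant: replace the encryption call with one that applies tags to the bucket.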

Conclusion

The new Custom Data Identifier resource in the newly enhanced Macie can help you detect, classify, and protect sensitive data types unique to your organization. This post focused on the functionality and use of custom data identifiers to automatically discover sensitive data stored in Amazon S3. You can also review the managed data identifiers to see a list of personally identifiable information (PII) that Macie can detect by default. Visit What is Amazon Macie? to learn more.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon Macie forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Kayla Jing

Kayla is a Solutions Architect at Amazon Web Services based out of Seattle. She has experience in data science with a focus on Data Analytics and Machine Learning.

Author

Joshua Choung

Joshua is a Solutions Architect based out of Seattle. He works with customers to provide architectural and technical guidance and training on their AWS cloud journey.

Author

Laura Reith

Laura is a Solutions Architect at Amazon Web Services. Before AWS, she worked as a Solutions Architect in Taiwan focusing on physical security and retail analytics.

Anonymize and manage data in your data lake with Amazon Athena and AWS Lake Formation

Post Syndicated from Manos Samatas original https://aws.amazon.com/blogs/big-data/anonymize-and-manage-data-in-your-data-lake-with-amazon-athena-and-aws-lake-formation/

Organizations collect and analyze more data than ever before. They move as fast as they can on their journey to become more data driven by using the insights from their data.

Different roles use data for different purposes. For example, data engineers transform the data before further processing, data analysts access the data and produce reports, and data scientists with domain and technical expertise can train machine learning algorithms. Those roles require access to the data, and access has never been easier to grant.

At the same time, most organizations have to comply with regulations when dealing with their customer data. For that reason, datasets that contain personally identifiable information (PII) are often anonymized. Common examples of PII are tables and columns that contain personal information about an individual (such as first name and last name), or tables with columns that, if joined with another table, can be traced back to an individual.

You can use AWS Analytics services to anonymize your datasets. In this post, I describe how to use Amazon Athena to anonymize a dataset. You can then use AWS Lake Formation to provide the right access to the right personas.

Use case

To better understand the concept, we use a straightforward use case: analysts in your organization need access to a dataset with sales data, some of which contains PII. As the data lake admin, you’re not comfortable with all personnel having access to customers’ PII. To address this, you can use an anonymized dataset.

This use case has two users:

  • datalake_admin – Responsible for data anonymization and making sure the right permissions are enforced. They classify the data, generate anonymized datasets, and configure the required permissions.
  • datalake_analyst – Only has access to the anonymized dataset. They can extract patterns for users without tracing the request back to an individual customer.

The following AWS CloudFormation template generates the AWS Glue tables that you use later in this post:

However, the template doesn’t create the datalake_admin and datalake_analyst users. For more information about personas in Lake Formation, see Lake Formation Personas and IAM Permissions Reference.

Solution architecture

For this solution, you use the following services:

  • Lake Formation – Lake Formation makes it easy to set up a secure data lake—a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis. The data lake admin can easily label the data and give users permission to access authorized datasets.
  • Athena – Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries you run. For this use case, the data lake admin uses Athena to anonymize the data, after which the data analyst can use Athena for interactive analytics over anonymized datasets.
  • Amazon S3 – Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. For this use case, you use Amazon S3 as storage for the data lake.

The following diagram illustrates the architecture for this solution.

In this architecture, there are no servers to manage, and you pay only for what you use. You can use the same solution for small or large datasets because the scaling happens transparently behind the scenes.

In the following sections, you look in more detail at how to do the following:

  • Label sensitive data with AWS Lake Formation
  • Anonymize data with Athena
  • Apply permissions with Lake Formation
  • Analyze the anonymized datasets

Labeling the sensitive data with Lake Formation

As a data lake admin, your first task is to label the personal information. Tags don’t enforce any security controls, but applying a good tagging strategy is a great way to describe the data. Tags are key-value pairs that you can apply to your AWS resources, including tables and columns in your data lake. For this use case, you apply a very simple tagging strategy: for the columns that contain PII, you give the value PII.

You interact with the following tables from the TPC-DS dataset (spelled tcp-ds in the database names used in this post), both of which have their data stored in Amazon S3 in CSV format:

  • store_sales – Stores sales data and references other tables that you can join together for more sophisticated business queries. The table has a foreign key to the customer table in the ss_customer_sk column. This key, when joined with the customer table, can uniquely identify a user. For that reason, treat this column as personal information.
  • customer – Stores customer data, a lot of which is PII. In addition to c_customer_sk, you could use data such as customer ID (c_customer_id), first name (c_first_name), last name (c_last_name), login (c_login), and email (c_email_address) to uniquely identify a customer.

To start tagging your columns (starting with the store_sales table), complete the following steps:

  1. As the data lake admin user, log in to the Lake Formation console.
  2. Choose Data Catalog Tables.
  3. Select store_sales.
  4. Choose Edit schema.
  5. Select the column you want to edit (ss_customer_sk).
  6. Choose Edit.
  7. For Key, enter Classification.
  8. For Value, enter PII.
  9. Choose Save.

To verify that the added column properties were applied, view the table description on the Lake Formation console.

  1. On the Data Catalog Tables page, select store_sales.
  2. Choose View properties.

The table properties look like the following JSON object:

{
  "Name": "store_sales",
  "DatabaseName": "tcp-ds-1tb",
  "Owner": "owner",
  "CreateTime": "2019-09-13T10:15:04.000Z",
  "UpdateTime": "2020-03-18T16:10:34.000Z",
  "LastAccessTime": "2019-09-13T10:15:03.000Z",
  "Retention": 0,
  "StorageDescriptor": {
    "Columns": [
      {
        "Name": "ss_sold_date_sk",
        "Type": "bigint",
        "Parameters": {}
      },
      ...
      {
        "Name": "ss_customer_sk",
        "Type": "bigint",
        "Parameters": {
          "Classification": "PII"
        }
      },
      ...
}

The additional column properties are now in the table metadata.

  1. Repeat the preceding steps for the customer table and label the following columns:
    • c_customer_sk
    • c_customer_id
    • c_first_name
    • c_last_name
    • c_login
    • c_email_address

Adding a tag also allows you to perform metadata searches by tag attributes. For more information, see Discovering metadata with AWS Lake Formation: Part 1 and Discover metadata with AWS Lake Formation: Part 2.
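
The console steps above change the Parameters map of each column in the table description. As a minimal sketch, assuming you have already fetched a table description as a Python dictionary with the same shape as the JSON shown earlier, the following helper functions (their names are hypothetical) add the Classification property and list the labeled columns. Writing the change back to the Data Catalog is deliberately left out.

```python
import copy


def label_columns(table, column_names, key="Classification", value="PII"):
    """Return a copy of a table description with a property set on columns."""
    labeled = copy.deepcopy(table)
    for column in labeled["StorageDescriptor"]["Columns"]:
        if column["Name"] in column_names:
            column.setdefault("Parameters", {})[key] = value
    return labeled


def columns_with_label(table, key="Classification", value="PII"):
    """List the names of columns whose Parameters carry the given label."""
    return [
        column["Name"]
        for column in table["StorageDescriptor"]["Columns"]
        if column.get("Parameters", {}).get(key) == value
    ]


# Trimmed-down stand-in for the store_sales table description.
store_sales = {
    "Name": "store_sales",
    "StorageDescriptor": {
        "Columns": [
            {"Name": "ss_sold_date_sk", "Type": "bigint", "Parameters": {}},
            {"Name": "ss_customer_sk", "Type": "bigint", "Parameters": {}},
        ]
    },
}

labeled = label_columns(store_sales, {"ss_customer_sk"})
print(columns_with_label(labeled))
```

A helper like columns_with_label can also drive later steps, for example deciding which columns to hash or omit.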

Anonymizing data with Athena

The data lake admin now needs to provide the data analyst anonymized datasets for analytics. For this use case, you want to extract patterns on the customer table and the store_sales table separately, but you also want to join the two tables so you can perform more sophisticated queries.

The first step is to create a database in Lake Formation to organize tables in AWS Glue.

  1. On the Lake Formation console, under Data Catalog, choose Databases.
  2. Choose Create database.
  3. For Name, enter a name, such as anonymised_tcp_ds_1tb.
  4. Optionally, enter an Amazon S3 path for the database and a description.
  5. Choose Create database.

The next step is to create the tables that contain the anonymized data. Before you do so, consider the significance of each anonymized column from an analytics point of view. For columns that have little or no value in the analytics process, omitting the column altogether might be the right approach. You might use other columns as primary keys to join with other tables. To make sure that you can join the tables, you can apply a hash function to the table foreign keys.

A common approach to anonymize sensitive information is hashing. A hash function is any function that you can use to map data of arbitrary size to fixed-size values. For more information, see Hash function.

The following table summarizes your strategy for each column.

Table          Column             Strategy
customer       c_first_name       hash
customer       c_last_name        hash
customer       c_login            omit
customer       c_customer_id      hash
customer       c_email_address    omit
customer       c_customer_sk      hash
store_sales    ss_customer_sk     hash

If you use the same value as the input of your hash function, it always returns the same result. In addition, and contrary to encryption, you can’t reverse hashing.
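
You can see the first property, determinism, outside Athena. The following Python sketch uses the standard hashlib module to hash a key after converting it to text, as a rough analogue of the approach in this post. The function name and sample keys are made up for illustration, and the sketch prints digests as hexadecimal text rather than the binary values a query engine may return.

```python
import hashlib


def anonymize(value):
    """Hash a value with SHA-256 after converting it to UTF-8 text."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


# The same input always yields the same digest, so hashed keys still join.
customer_keys = [1001, 1002]
sales_keys = [1002, 1002, 1001]

hashed_customers = {anonymize(key) for key in customer_keys}
matched = [key for key in sales_keys if anonymize(key) in hashed_customers]

print(len(matched))          # number of sales rows that still find a customer
print(len(anonymize(1001)))  # digest length in hexadecimal characters
```

Because the digest depends only on the input, two tables hashed independently can still be joined on the hashed key.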

  1. Use Athena functions to hash individual columns and generate the anonymized datasets. After you create those datasets, you can use Lake Formation to apply security controls. See the following code:
CREATE table "tcp-ds-anonymized".customer
WITH (format='parquet',external_location = 's3://tcp-ds-eu-west-1-1tb-anonymised/2/customer_parquet/')
AS SELECT       
         sha256(to_utf8(cast(c_customer_sk AS varchar))) AS c_customer_sk_anonym,
         sha256(to_utf8(cast(c_customer_id AS varchar))) AS c_customer_id_anonym,
         sha256(to_utf8(cast(c_first_name AS varchar))) AS c_first_name_anonym,
         sha256(to_utf8(cast(c_last_name AS varchar))) AS c_last_name_anonym,
         c_current_cdemo_sk,
         c_current_hdemo_sk,
         c_first_shipto_date_sk,
         c_first_sales_date_sk,
         c_salutation,
         c_preferred_cust_flag,
         c_current_addr_sk,
         c_birth_day,
         c_birth_month,
         c_birth_year,
         c_birth_country,
         c_last_review_date_sk
FROM customer;
  1. To preview the data, enter the following code:
SELECT c_first_name_anonym, c_last_name_anonym FROM "tcp-ds-anonymized"."customer" limit 10;

The following screenshot shows the output of your query.

  1. To repeat these steps for the store_sales table, enter the following code:
CREATE table "tcp-ds-anonymized".store_sales
WITH (format='parquet',external_location = 's3://tcp-ds-eu-west-1-1tb-anonymised/1/store_sales/')
AS SELECT sha256(to_utf8(cast(ss_customer_sk AS varchar))) AS ss_customer_sk_anonym,
         ss_sold_date_sk,
         ss_sales_price,
         ss_sold_time_sk,
         ss_item_sk,
         ss_hdemo_sk,
         ss_addr_sk,
         ss_store_sk,
         ss_promo_sk,
         ss_ticket_number,
         ss_quantity,
         ss_wholesale_cost,
         ss_list_price,
         ss_ext_discount_amt,
         ss_ext_sales_price,
         ss_ext_wholesale_cost,
         ss_ext_list_price,
         ss_ext_tax,
         ss_coupon_amt,
         ss_net_paid,
         ss_net_paid_inc_tax,
         ss_net_profit
FROM store_sales;

One of the challenges you need to overcome when working with CTAS queries is that the query’s Amazon S3 location should be unique for the table you’re creating. You can add an incremental value or timestamp to the path of the table, for example, s3://<bucket>/<table_name>/<version>, and make sure you use a different version number every time.
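
One way to get a different version on every run is to derive it from the current time. The following Python sketch builds such a location; the function name and the timestamp format are arbitrary choices for illustration, not part of the original solution.

```python
from datetime import datetime, timezone


def versioned_location(bucket, table_name, now=None):
    """Return an S3 location whose last path segment is a UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    version = now.strftime("%Y%m%dT%H%M%SZ")
    return f"s3://{bucket}/{table_name}/{version}/"


# Placeholder bucket name; each call made at a different second differs.
print(versioned_location("example-bucket", "store_sales"))
```

A timestamp also sorts chronologically, which makes it easier to find and delete older versions later.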

You can delete older data programmatically using the Amazon S3 APIs or SDKs. You can also use an Amazon S3 lifecycle configuration to tell Amazon S3 to expire objects or transition them to another Amazon S3 storage class. For more information, see Object lifecycle management.

You can automate the anonymization CTAS queries with AWS Glue jobs. AWS Glue provides a lightweight Python shell job option that can call the Amazon Athena API programmatically.
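
As a minimal sketch of what such a job might contain, the following Python code submits a query through the Athena StartQueryExecution API using boto3, which is assumed to be available in the job environment. The shortened CTAS statement, the bucket names, and the function names are placeholders for illustration; the sketch doesn't wait for the query to finish or handle failures.

```python
def start_query_request(sql, database, output_location):
    """Build keyword arguments for the Athena StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(sql, database, output_location):
    """Submit a query to Athena and return its execution ID."""
    import boto3  # Imported lazily; assumed to be installed where the job runs.

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        **start_query_request(sql, database, output_location)
    )
    return response["QueryExecutionId"]


# Hypothetical, shortened CTAS statement; bucket names are placeholders.
CTAS = """
CREATE TABLE "tcp-ds-anonymized".store_sales
WITH (format='parquet', external_location='s3://example-bucket/store_sales/1/')
AS SELECT sha256(to_utf8(cast(ss_customer_sk AS varchar))) AS ss_customer_sk_anonym,
          ss_sales_price
FROM store_sales
"""

request = start_query_request(
    CTAS, "tcp-ds-1tb", "s3://example-bucket/athena-results/"
)
```

In a scheduled job, you would call run_query with a fresh external location on each run so that the CTAS target stays unique.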

Applying permissions with Lake Formation

Now that you have the table structures and anonymized datasets, you can apply the required permissions using Lake Formation.

  1. On the Lake Formation console, under Data Catalog, choose Tables.
  2. Select the tables that contain the anonymized data.
  3. From the Actions drop-down menu, under Permissions, choose Grant.
  4. For IAM users and roles, choose the IAM user for the data analyst.
  5. For Table permissions, select Select.
  6. Choose Grant.

You can now view all table permissions and verify the permissions granted to a particular principal.
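
If you prefer to script the grant instead of using the console, the Lake Formation GrantPermissions API accepts the same information. The following Python sketch builds one request per anonymized table. The account ID, the user ARN, and the function names are placeholders, and the database name is taken from the CTAS queries earlier in this post.

```python
def grant_select_request(principal_arn, database, table):
    """Build keyword arguments for the Lake Formation GrantPermissions call."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["SELECT"],
    }


def grant_select(principal_arn, database, tables):
    """Grant SELECT on each table to the principal."""
    import boto3  # Imported lazily so the request builder has no dependencies.

    lakeformation = boto3.client("lakeformation")
    for table in tables:
        lakeformation.grant_permissions(
            **grant_select_request(principal_arn, database, table)
        )


# Placeholder account ID and user name for the data analyst.
ANALYST_ARN = "arn:aws:iam::111122223333:user/datalake_analyst"
requests = [
    grant_select_request(ANALYST_ARN, "tcp-ds-anonymized", name)
    for name in ("customer", "store_sales")
]
```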

Analyzing the anonymized datasets

To verify that the role can access the right tables and query the anonymized datasets, complete the following steps:

  1. Sign in to the AWS Management Console as the data analyst.
  2. Under Analytics, choose Amazon Athena.

You should see a query field, similar to the following screenshot.

You can now test your access with queries. To see the top customers by revenue and last name, enter the following code:

SELECT c_last_name_anonym,
sum(ss_sales_price) AS total_sales
FROM store_sales
JOIN customer
ON store_sales.ss_customer_sk_anonym = customer.c_customer_sk_anonym
GROUP BY c_last_name_anonym
ORDER BY total_sales DESC limit 10;

The following screenshot shows the query output.

You can also try to query a table that you don’t have access to. You should receive an error message.

Conclusion

Anonymizing a dataset is often a prerequisite before users can start analyzing it. In this post, we discussed how data lake admins can use Athena and Lake Formation to label and anonymize data stored in Amazon S3. You can then use Lake Formation to apply permissions to the dataset and allow other users to access the data.

The services we discussed in this post are serverless. Building serverless applications means that your developers can focus on their core product instead of worrying about managing and operating servers or runtimes, either in the cloud or on-premises. This reduced overhead lets developers reclaim time and energy that they can spend on developing great products that scale and that are reliable.

 


About the Author

Manos Samatas is a Specialist Solutions Architect in Big Data and Analytics with Amazon Web Services. Manos lives and works in London. He specializes in architecting big data and analytics solutions for public sector customers in the EMEA region.

How to retroactively encrypt existing objects in Amazon S3 using S3 Inventory, Amazon Athena, and S3 Batch Operations

Post Syndicated from Adam Kozdrowicz original https://aws.amazon.com/blogs/security/how-to-retroactively-encrypt-existing-objects-in-amazon-s3-using-s3-inventory-amazon-athena-and-s3-batch-operations/

Amazon Simple Storage Service (S3) is an object storage service that offers industry-leading scalability, performance, security, and data availability. With Amazon S3, you can choose from three different server-side encryption configurations when uploading objects:

  • SSE-S3 – uses Amazon S3-managed encryption keys
  • SSE-KMS – uses customer master keys (CMKs) stored in AWS Key Management Service (KMS)
  • SSE-C – uses master keys provided by the customer in each PUT or GET request

These options allow you to choose the right encryption method for the job. But as your organization evolves and new requirements arise, you might find that you need to change the encryption configuration for all objects. For example, you might be required to use SSE-KMS instead of SSE-S3 because you need more control over the lifecycle and permissions of the encryption keys in order to meet compliance goals.

You could change the settings on your buckets to use SSE-KMS rather than SSE-S3, but the switch only impacts newly uploaded objects, not objects that existed in the buckets before the change in encryption settings. Manually re-encrypting older objects under master keys in KMS may be time-prohibitive depending on how many objects there are. Automating this effort is possible using the right combination of features in AWS services.

In this post, I’ll show you how to use Amazon S3 Inventory, Amazon Athena, and Amazon S3 Batch Operations to provide insights on the encryption status of objects in S3 and to remediate incorrectly encrypted objects in a massively scalable, resilient, and cost-effective way. The solution uses a similar approach to the one mentioned in this blog post, but it has been designed with automation and multi-bucket scalability in mind. Tags are used to target individual noncompliant buckets in an account, and any encrypted (or unencrypted) object can be re-encrypted using SSE-S3 or SSE-KMS. Versioned buckets are also supported, and the solution operates on a regional level.

Note: You can’t re-encrypt to or from objects encrypted under SSE-C. This is because the master key material must be provided during the PUT or GET request, and cannot be provided as a parameter for S3 Batch Operations.

Moreover, the entire solution can be deployed in under 5 minutes using AWS CloudFormation. Simply tag your buckets targeted for encryption, upload the solution artifacts into S3, and deploy the artifact template through the CloudFormation console. In the following sections, you will see that the architecture has been built to be easy to use and operate, while at the same time containing a large number of customizable features for more advanced users.

Solution overview

At a high level, the core features of the architecture consist of 3 services interacting with one another: S3 Inventory reports (1) are delivered for targeted buckets, the report delivery events trigger an AWS Lambda function (2), and the Lambda function then executes S3 Batch (3) jobs using the reports as input to encrypt targeted buckets. Figure 1 below and the remainder of this section provide a more detailed look at what is happening underneath the surface. If this is not of high interest for you, feel free to skip ahead to the Prerequisites and Solution Deployment sections.

Figure 1: Solution architecture overview

Here’s a detailed overview of how the solution works, as shown in Figure 1 above:

  1. When the CloudFormation template is first launched, a number of resources are created, including:
    • An S3 bucket to store the S3 Inventory reports
    • An S3 bucket to store S3 Batch Job completion reports
    • A CloudWatch event that is triggered by changes to tags on S3 buckets
    • An AWS Glue Database and AWS Glue Tables that can be used by Athena to query S3 Inventory and S3 Batch report findings
    • A Lambda function that is used as a Custom Resource during template launch, and afterwards as a target for S3 event notifications and CloudWatch events
  2. During deployment of the CloudFormation template, a Lambda-backed Custom Resource lists all S3 buckets within the AWS Region specified and checks to see if any has a configurable tag present (configured via an AWS CloudFormation parameter). When a bucket with the specified tag is discovered, the Lambda configures an S3 Inventory report for the discovered bucket to be delivered to the newly-created central report destination bucket.
  3. When a new S3 Inventory report arrives in the central report destination bucket (which can take 1 to 2 days) from any of the tagged buckets, an S3 Event Notification triggers the Lambda function to process it.
  4. The Lambda function first adds the path of the report CSV file as a partition to the AWS Glue table. This means that as each bucket delivers its report, it becomes instantly queryable by Athena, and any queries executed return the most recent information available on the status of the S3 buckets in the account.
  5. The Lambda function then checks the value of the EncryptBuckets parameter in the CloudFormation launch template to assess whether any re-encryption action should be taken. If it is set to yes, the Lambda function creates an S3 Batch job and executes it. The job takes each object listed in the manifest report and copies it over itself in the exact same location. When the copy occurs, SSE-KMS or SSE-S3 encryption is specified in the job parameters, which re-encrypts all identified objects.
  6. Once the batch job finishes for the S3 Inventory report, a completion report is sent to the central batch job report bucket. The CloudFormation template provides a parameter that controls whether the report includes all processed objects or only objects that failed to process. These reports can also be queried with Athena, because they are added as partitions to the AWS Glue batch reports tables as they arrive.
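
To make step 5 more concrete, the following Python sketch builds the kind of request that the S3 Control CreateJob API expects for a copy job driven by an S3 Inventory manifest. It is not the code in the solution's encrypt.py; the function name, the priority value, and the choice to report failed tasks only are assumptions for illustration.

```python
import uuid


def batch_copy_job_request(account_id, bucket, manifest_arn, manifest_etag,
                           role_arn, report_bucket_arn, kms_key_id=None):
    """Build keyword arguments for the S3 Control CreateJob call.

    The job copies every object in the inventory manifest onto itself and
    asks for SSE-KMS (when a key is given) or SSE-S3 on the new copy.
    """
    copy_operation = {"TargetResource": f"arn:aws:s3:::{bucket}"}
    if kms_key_id:
        copy_operation["SSEAwsKmsKeyId"] = kms_key_id
    else:
        copy_operation["NewObjectMetadata"] = {"SSEAlgorithm": "AES256"}

    return {
        "AccountId": account_id,
        "ConfirmationRequired": False,
        "Operation": {"S3PutObjectCopy": copy_operation},
        "Manifest": {
            "Spec": {"Format": "S3InventoryReport_CSV_20161130"},
            "Location": {"ObjectArn": manifest_arn, "ETag": manifest_etag},
        },
        "Report": {
            "Enabled": True,
            "Bucket": report_bucket_arn,
            "Format": "Report_CSV_20180820",
            "ReportScope": "FailedTasksOnly",
        },
        "Priority": 10,
        "RoleArn": role_arn,
        "ClientRequestToken": str(uuid.uuid4()),
    }
```

You could pass the resulting dictionary to the create_job method of a boto3 s3control client, given an IAM role that S3 Batch Operations can assume.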

Prerequisites

To follow along with the sample deployment, your AWS Identity and Access Management (IAM) principal (user or role) needs administrator access or equivalent.

Solution deployment

For this walkthrough, the solution will be configured to encrypt objects using SSE-KMS, rather than SSE-S3, when an inventory report is delivered for a bucket. Please note that the key policy of the KMS key will be automatically updated by the custom resource during launch to allow S3 to use it to encrypt inventory reports. No key policies are changed if SSE-S3 encryption is selected instead. The configuration in this walkthrough also adds a tag to all newly encrypted objects. You’ll learn how to use this tag to restrict access to unencrypted objects in versioned buckets. I’ll make callouts throughout the deployment guide for when you can choose a different configuration from what is deployed in this post.

To deploy the solution architecture and validate its functionality, you’ll perform five steps:

  1. Tag target buckets for encryption
  2. Deploy the CloudFormation template
  3. Validate delivery of S3 Inventory reports
  4. Confirm that reports are queryable with Athena
  5. Validate that objects are correctly encrypted

If you are only interested in deploying the solution and encrypting your existing environment, Steps 1 and 2 are all that you need to complete. Steps 3 through 5, on the other hand, are optional and outline procedures that you would perform to validate the solution’s functionality. They are primarily for users who are looking to dive deep and take advantage of all of the features available.

With that being said, let’s get started with deploying the architecture!

Step 1: Tag target buckets

Navigate to the Amazon S3 console and identify which buckets should be targeted for inventorying and encryption. For each identified bucket, tag it with a designated key value pair by selecting Properties > Tags > Add tag. This demo uses the tag __Inventory: true and tags only one bucket called adams-lambda-functions, as shown in Figure 2.

Figure 2: Tagging a bucket targeted for encryption in Amazon S3
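
If you have many buckets to tag, you might script this step instead. One detail to keep in mind is that the S3 PutBucketTagging call replaces a bucket's entire tag set, so existing tags need to be read and merged first. The following Python sketch shows one way to do that; the function names are hypothetical, and the tag defaults follow the __Inventory: true tag used in this demo.

```python
def merge_tag(tag_set, key, value):
    """Return a new tag set with key set to value, keeping other tags."""
    merged = [tag for tag in tag_set if tag["Key"] != key]
    merged.append({"Key": key, "Value": value})
    return merged


def has_tag(tag_set, key, value):
    """Check whether a tag set contains the exact key-value pair."""
    return any(tag["Key"] == key and tag["Value"] == value for tag in tag_set)


def tag_bucket(bucket, key="__Inventory", value="true"):
    """Add the tag to a bucket without dropping its existing tags."""
    import boto3  # Imported lazily so the helpers above work without the SDK.
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    try:
        existing = s3.get_bucket_tagging(Bucket=bucket)["TagSet"]
    except ClientError as error:
        # S3 reports an error, rather than an empty list, for untagged buckets.
        if error.response["Error"]["Code"] != "NoSuchTagSet":
            raise
        existing = []
    s3.put_bucket_tagging(
        Bucket=bucket, Tagging={"TagSet": merge_tag(existing, key, value)}
    )
```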

Step 2: Deploy the CloudFormation template

  1. Download the S3 encryption solution. There will be two files that make up the backbone of the solution:
    • encrypt.py, which contains the Lambda microservices logic;
    • deploy.yml, which is the CloudFormation template that deploys the solution.
  2. Zip the file encrypt.py, rename it to encrypt.zip, and then upload it into any S3 bucket that is in the same Region as the one in which the CloudFormation template will be deployed. Your bucket should look like Figure 3:

    Figure 3: encrypt.zip uploaded into an S3 bucket

  3. Navigate to the CloudFormation console and then create the CloudFormation stack using the deploy.yml template. For more information, see Getting Started with AWS CloudFormation in the CloudFormation User Guide. Figure 4 shows the parameters used to achieve the configuration specified for this walkthrough, with the fields outlined in red requiring input. You can choose your own configuration by altering the appropriate parameters if the ones specified do not fit your use case.

    Figure 4: Set the parameters in the CloudFormation stack

Step 3: Validate delivery of S3 Inventory reports

After you’ve successfully deployed the CloudFormation template, select any of your tagged S3 buckets and check that it now has an S3 Inventory report configuration. To do this, navigate to the S3 console, select a tagged bucket, select the Management tab, and then select Inventory, as shown in Figure 5. You should see that an inventory configuration exists. An inventory report will be delivered automatically to this bucket within 1 to 2 days, depending on the number of objects in the bucket. Make a note of the name of the bucket where the inventory report will be delivered. The bucket is given a semi-random name during creation through the CloudFormation template, so making a note of this will help you find the bucket more easily when you check for report delivery later.

Figure 5: Check that the tagged S3 bucket has an S3 Inventory report configuration
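
For reference, an inventory configuration that reports encryption status can be expressed as a dictionary for the S3 PutBucketInventoryConfiguration API, as in the following Python sketch. The configuration ID, the prefix, the daily schedule, and the other field values are illustrative assumptions and may differ from what the solution's template actually creates.

```python
def inventory_configuration(config_id, destination_bucket_arn, prefix):
    """Build an S3 Inventory configuration that reports encryption status."""
    return {
        "Id": config_id,
        "IsEnabled": True,
        "IncludedObjectVersions": "All",
        "Schedule": {"Frequency": "Daily"},
        "OptionalFields": ["Size", "LastModifiedDate", "EncryptionStatus"],
        "Destination": {
            "S3BucketDestination": {
                "Bucket": destination_bucket_arn,
                "Format": "CSV",
                "Prefix": prefix,
            }
        },
    }


def configure_inventory(bucket, config):
    """Apply the inventory configuration to a source bucket."""
    import boto3  # Imported lazily so the builder above has no dependencies.

    boto3.client("s3").put_bucket_inventory_configuration(
        Bucket=bucket, Id=config["Id"], InventoryConfiguration=config
    )


# Placeholder destination bucket ARN and prefix.
config = inventory_configuration(
    "example-inventory", "arn:aws:s3:::example-report-bucket", "inventory"
)
```

The EncryptionStatus optional field is what makes the encryption_status column available to query later.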

Step 4: Confirm that reports are queryable with Athena

  1. After 1 to 2 days, navigate to the inventory reports destination bucket and confirm that reports have been delivered for buckets with the __Inventory: true tag. As shown in Figure 6, a report has been delivered for the adams-lambda-functions bucket.

    Figure 6: Confirm delivery of reports to the S3 reports destination bucket

  2. Next, navigate to the Athena console and select the AWS Glue database that contains the table holding the schema and partition locations for all of your reports. If you used the default values for the parameters when you launched the CloudFormation stack, the AWS Glue database will be named s3_inventory_database, and the table will be named s3_inventory_table. Run the following query in Athena:
    
    SELECT encryption_status, count(*) FROM s3_inventory_table GROUP BY encryption_status;
    

    The outputs of the query will be a snapshot aggregate count of objects in the categories of SSE-S3, SSE-C, SSE-KMS, or NOT-SSE across your tagged bucket environment, before encryption took place, as shown in Figure 7.

    Figure 7: Query results in Athena

    From the query results, you can see that the adams-lambda-functions bucket had only two items in it, both of which were unencrypted. At this point, you can choose to perform any other analytics with Athena on the delivered inventory reports.

Step 5: Validate that objects are correctly encrypted

  1. Navigate to any of your target buckets in Amazon S3 and check the encryption status of a few sample objects by selecting the Properties tab of each object. The objects should now be encrypted using the specified KMS CMK. Because you set the AddTagToEncryptedObjects parameter to yes during the CloudFormation stack launch, these objects should also have the __ObjectEncrypted: true tag present. As an example, Figure 8 shows the rules_present_rule.zip object from the adams-lambda-functions bucket. This object has been properly encrypted using the correct KMS key, which has an alias of blog in this example, and it has been tagged with the specified key value pair.

    Figure 8: Checking the encryption status of an object in S3

  2. For further validation, navigate back to the Athena console and select the s3_batch_table from the s3_inventory_database, assuming that you left the default names unchanged. Then, run the following query:
    
    SELECT * FROM s3_batch_table;
    

    If encryption was successful, this query should result in zero items being returned because the solution by default only delivers S3 batch job completion reports on items that failed to copy. After validating by inspecting both the objects themselves and the batch completion reports, you can now safely say that the contents of the targeted S3 buckets are correctly encrypted.

Next steps

Congratulations! You’ve successfully deployed and operated a solution for rectifying S3 buckets with incorrectly encrypted and unencrypted objects. The architecture is massively scalable because it uses S3 Batch Operations and Lambda, it’s fully serverless, and it’s cost effective to run.

Please note that if you selected no for the EncryptBuckets parameter during the initial launch of the CloudFormation template, you can retroactively perform encryption on targeted buckets by simply doing a stack update. During the stack update, switch the EncryptBuckets parameter to yes, and proceed with deployment as normal. The update will reconfigure S3 inventory reports for all target S3 buckets to get the most up-to-date inventory. After the reports are delivered, encryption will proceed as desired.

Moreover, with the solution deployed, you can target new buckets for encryption just by adding the __Inventory: true tag. CloudWatch Events will register the tagging action and automatically configure an S3 Inventory report to be delivered for the newly tagged bucket.

Finally, now that your S3 buckets are properly encrypted, you should take a few more manual steps to help maintain your newfound account hygiene:

  • Perform remediation on unencrypted objects that may have failed to copy during the S3 Batch Operations job. The most common reason that objects fail to copy is that the object size exceeds 5 GiB. S3 Batch Operations uses the standard CopyObject API call underneath the surface, but this API call can only handle objects less than 5 GiB in size. To successfully copy these objects, you can modify the solution you learned in this post to launch an S3 Batch Operations job that invokes Lambda functions. In the Lambda function logic, you can make CreateMultipartUpload API calls on objects that failed with a standard copy. The original batch job completion reports provide detail on exactly which objects failed to encrypt due to size.
  • Prohibit the retrieval of unencrypted object versions for buckets that had versioning enabled. When the object is copied over itself during the encryption process, the old unencrypted version of the object still exists. This is where the option in the solution to specify a tag on all newly encrypted objects becomes useful—you can now use that tag to draft a bucket policy that prohibits the retrieval of old unencrypted objects in your versioned buckets. For the solution that you deployed in this post, such a policy would look like this:
    
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect":     "Deny",
          "Action":     "s3:GetObject",
          "Resource":    "arn:aws:s3:::adams-lambda-functions/*",
          "Principal":   "*",
          "Condition": {  "StringNotEquals": {"s3:ExistingObjectTag/__ObjectEncrypted": "true" } }
        }
      ]
    }
    

  • Update bucket policies to prevent the upload of unencrypted or incorrectly encrypted objects. By updating bucket policies, you help ensure that newly uploaded objects are correctly encrypted, which helps maintain account hygiene. The S3 encryption solution presented here is meant to be a one-time remediation tool, whereas updating bucket policies is a preventative action. Proper use of bucket policies will help ensure that the S3 encryption solution is not needed again, unless another encryption requirement change occurs in the future. To learn more, see How to Prevent Uploads of Unencrypted Objects to Amazon S3.
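
For the first item in the list above, a multipart copy splits a large object into byte ranges and copies each range as a separate part. The following Python sketch computes those ranges; the function name and the default part size are assumptions for illustration. Each range string could be supplied as the CopySourceRange of an S3 UploadPartCopy call after CreateMultipartUpload has returned an upload ID.

```python
GIB = 1024 ** 3


def copy_part_ranges(object_size, part_size=GIB):
    """Split an object into byte ranges for a multipart copy.

    Returns (part_number, range_string) pairs. Part numbers start at 1 and
    the byte ranges are inclusive on both ends.
    """
    if object_size <= 0 or part_size <= 0:
        raise ValueError("object_size and part_size must be positive")
    ranges = []
    start = 0
    part_number = 1
    while start < object_size:
        end = min(start + part_size, object_size) - 1
        ranges.append((part_number, f"bytes={start}-{end}"))
        start = end + 1
        part_number += 1
    return ranges


# A 6 GiB object copied in 2 GiB parts.
parts = copy_part_ranges(6 * GIB, part_size=2 * GIB)
print(len(parts))
```

After all parts are copied, the upload still has to be completed with the list of part numbers and ETags, which this sketch doesn't cover.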

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon S3 forum.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Adam Kozdrowicz

Adam is a Data and Machine Learning Engineer for AWS Professional Services. He works closely with enterprise customers building big data applications on AWS, and he enjoys working with frameworks such as AWS Amplify, SAM, and CDK. During his free time, Adam likes to surf, travel, practice photography, and build machine learning models.

How Wind Mobility built a serverless data architecture

Post Syndicated from Pablo Giner original https://aws.amazon.com/blogs/big-data/how-wind-mobility-built-a-serverless-data-architecture/

Guest post by Pablo Giner, Head of BI, Wind Mobility.

Over the past few years, urban micro-mobility has become a trending topic. With the contamination indexes hitting historic highs, cities and companies worldwide have been introducing regulations and working on a wide spectrum of solutions to alleviate the situation.

We at Wind Mobility strive to make commuters’ lives more sustainable and convenient by bringing short-distance urban transportation to cities worldwide.

At Wind Mobility, we scale our services at the same pace as our users demand them, and we do it in an economically and environmentally viable way. We optimize our fleet distribution to avoid overcrowding cities with more scooters than those that are actually going to be used, and we position them just meters away from where our users need them and at the time of the day when they want them.

How do we do that? By optimizing our operations to their fullest. To do so, we need to be very well informed about our users’ behavior under varying conditions and understand our fleet’s potential.

Scalability and flexibility for rapid growth

We knew that before we could solve this challenge, we needed to collect data from many different sources, such as user interactions with our application, user demand, IoT signals from our scooters, and operational metrics. To analyze the numerous datasets collected and extract actionable insights, we needed to build a data lake. While the high-level goal was clear, the scope was less so. We were working hard to scale our operation as we continued to launch new markets. The rapid growth and expansion made it very difficult to predict the volume of data we would need to consume. We were also launching new microservices to support our growth, which resulted in more data sources to ingest. We needed an architecture that allowed us to be agile and quickly adapt to meet our growth. It became clear that a serverless architecture was best positioned to meet those needs, so we started to design our 100% serverless infrastructure.

The first challenge was ingesting and storing data from our scooters in the field, events from our mobile app, operational metrics, and partner APIs. We use AWS Lambda to capture changes in our operational databases and mobile app and push the events to Amazon Kinesis Data Streams, which allows us to take action in real time. We also use Amazon Kinesis Data Firehose to write the data to Amazon Simple Storage Service (Amazon S3), which we use for analytics.

After our data was in Amazon S3 and adequately partitioned for its most common use cases (we partition by date, region, and business line, depending on the data source), we had to find a way to query this data for both data profiling (understanding structure, content, and interrelationships) and ad hoc analysis. For that we chose AWS Glue crawlers to catalog our data and Amazon Athena to read from the AWS Glue Data Catalog and run queries. However, ad hoc analysis and data profiling are relatively sporadic tasks in our team, because most of the data processing computing hours are actually dedicated to transforming the multiple data sources into our data warehouse: consolidating the raw data, modeling it, adding new attributes, and picking the data elements that constitute 95% of our analytics and predictive needs.
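To make the partitioning scheme concrete, here is a minimal sketch of how an S3 key prefix partitioned by date, region, and business line could be built. The source name, partition key names, and values are hypothetical and are not Wind Mobility's actual layout; the `key=value` folder style is the Hive-style convention that AWS Glue crawlers and Athena recognize as partitions.

```python
from datetime import date


def partition_prefix(source, event_date, region, business_line):
    """Build a Hive-style S3 key prefix so that queries filtering on date,
    region, or business line only scan the matching folders."""
    return (
        f"{source}/"
        f"date={event_date.isoformat()}/"
        f"region={region}/"
        f"business_line={business_line}/"
    )


# Hypothetical values, for illustration only.
prefix = partition_prefix("scooter_events", date(2020, 5, 17), "eu-west", "sharing")
print(prefix)
```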

This is where all the heavy lifting takes place. We parse through millions of scooter and user events generated daily (over 300 events per second) to extract actionable insight. We selected AWS Glue to perform this task. Our primary ETL job reads the newly added raw event data from Amazon S3, processes it using Apache Spark, and writes the results to our Amazon Redshift data warehouse. AWS Glue plays a critical role in our ability to scale on demand. After careful evaluation and testing, we concluded that AWS Glue ETL jobs meet all our needs and free us from procuring and managing infrastructure.

Architecture overview

The following diagram represents our current data architecture, showing two serverless data collection, processing, and reporting pipelines:

  • Operational databases from Amazon Relational Database Service (Amazon RDS) and MongoDB
  • IoT and application events, followed by Athena for data profiling and Amazon Redshift for reporting

Our data is curated and transformed multiple times a day using an automated pipeline running on AWS Glue. The team can now focus on analyzing the data and building machine learning (ML) applications.

We chose Amazon QuickSight as our business intelligence tool to help us visualize and better understand our operational KPIs. Additionally, we use Amazon Elastic Container Registry (Amazon ECR) to store our Docker images containing our custom ML algorithms and Amazon Elastic Container Service (Amazon ECS) where we train, evaluate, and host our ML models. We schedule our models to be trained and evaluated multiple times a day. Taking as input curated data about demand, conversion, and flow of scooters, we run the models to help us optimize fleet utilization for a particular city at any given time.

The following diagram represents how data from the data lake is incorporated into our ML training, testing, and serving system. First, our developers work on the application code and commit their changes, which are built into new Docker images by our CI/CD pipeline and stored in the Amazon ECR registry. These images are pushed into Amazon ECS and tested in DEV and UAT environments before moving to PROD (where they are triggered by the Amazon ECS task scheduler). During their execution, the Amazon ECS tasks (some train the demand and usage forecasting models, some produce the daily and hourly predictions, and others optimize the fleet distribution to satisfy the forecast) read their configuration and pull data from Amazon S3 (which has been previously produced by scheduled AWS Glue jobs), finally storing their results back into Amazon S3. Executions of these pipelines are tracked via MLFlow (on a dedicated Amazon Elastic Compute Cloud (Amazon EC2) server), and the final result indicating the fleet operations required is rendered on a Kepler map, which is then consumed by the operators in the field.

Conclusion

We at Wind Mobility place data at the forefront of our operations. For that, we need our data infrastructure to be as flexible as the industry and the context we operate in, which is why we chose serverless. Over the course of a year, we have built a data lake, a data warehouse, a BI suite, and a variety of (production) data science applications. All of that with a very small team.

Also, within the last 12 months, we have scaled up several of our data pipelines by a factor of 10, without slowing our momentum or redesigning any part of our architecture. When it came time to double our fleet in 1 week and increase the frequency at which we capture data from scooters by a factor of 10, our serverless data architecture scaled with no issues. This allowed us to focus on adding value by simplifying our operation, reacting to changes quickly, and delighting our users.

We have measured our success in multiple dimensions:

  • Speed – Serverless is faster to deploy and expand; we believe we have reduced our time to market for the entire infrastructure by a factor of 2
  • Visibility – We have 360 degree visibility of our operations worldwide, accessible by our city managers, finance team, and management board
  • Optimized fleet deployment – We know, at any minute of the day, the number of scooters that our customers need over the next few hours, which reduces unsatisfied demand by more than 50%

If you face a similar challenge, our advice is clear: go fully serverless and use the spectrum of solutions available from AWS.

Follow us and discover more about Wind Mobility on Facebook, Instagram and LinkedIn.

 


About the Author

Pablo Giner is Head of BI at Wind Mobility. Pablo’s background is in wheels (motorcycle racing > vehicle engineering > collision insurance > eScooters sharing…) and for the last few years he has specialized in forming and developing data teams. At Wind Mobility, he leads the data function (data engineering + analytics + data science), and the project he is most proud of is what they call smart fleet rebalancing, an AI-backed solution to reposition their fleet in real time. “In God we trust. All others must bring data.” – W. Edwards Deming

 

 

 

Adding voice to a CircuitPython project using Amazon Polly

Post Syndicated from Moheeb Zara original https://aws.amazon.com/blogs/compute/adding-voice-to-a-circuitpython-project-using-amazon-polly/

An Adafruit PyPortal displaying a quote while synthesizing and playing speech using Amazon Polly.


As a natural means of communication, voice is a powerful way to humanize an experience. What if you could make anything talk? This guide walks through how to leverage the cloud to add voice to an off-the-shelf microcontroller. Use it to develop more advanced ideas, like a talking toaster that encourages healthy breakfast habits or a house plant that can express its needs.

This project uses an Adafruit PyPortal, an open-source IoT touch display programmed using CircuitPython, a lightweight version of Python that works on embedded hardware. You copy your code to the PyPortal like you would to a thumb drive and it runs. Random quotes from the PaperQuotes API are periodically displayed on the PyPortal LCD.

A microcontroller can’t do speech synthesis on its own so I use Amazon Polly, a natural text to speech synthesis service, to generate audio. Adding speech also extends accessibility to the visually impaired. This project includes an example for requesting arbitrary speech in addition to random quotes. Use this example to add a voice to any CircuitPython project.

An Adafruit PyPortal, an external speaker, and a microSD card.


I deploy the backend to the AWS Cloud using the AWS Serverless Application Repository. The code on the PyPortal makes a REST call to the backend to fetch a quote and synthesize speech audio for playback on the device.

Prerequisites

You need the following to complete the project:

Deploy the backend application

An architecture diagram of the serverless backend when requesting speech synthesis of a text string.


The serverless backend consists of an Amazon API Gateway endpoint that invokes an AWS Lambda function. If called with a JSON object containing text and voiceId attributes, it uses Amazon Polly to synthesize speech and uploads an MP3 file as a public object to Amazon S3. Upon completion, it returns the URL for downloading the audio file. It also processes the submitted text and adds return lines so that it can appear text-wrapped when displayed on the PyPortal. For a full list of voices, see the Amazon Polly documentation. An example response:

To fetch quotes instead of a text field, call the endpoint with a comma-separated list of tags as shown in the following diagram. The Lambda function then calls the PaperQuotes API. It fetches up to 50 quotes per tag and selects a random one to synthesize as speech. As with arbitrary text, it returns a URL and a text-wrapped representation of the quote.

An architecture diagram of the serverless backend when requesting a random quote from the PaperQuotes API to synthesize as speech.
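The text-wrapping step that the backend performs before returning the quote can be sketched with Python's standard `textwrap` module. This is a minimal illustration, not the deployed Lambda code: the function name `wrap_for_display` and the line width are assumptions.

```python
import textwrap


def wrap_for_display(text, max_chars=30):
    """Insert newline characters so that no line exceeds max_chars,
    breaking only at word boundaries."""
    return "\n".join(textwrap.wrap(text, width=max_chars))


wrapped = wrap_for_display("What if you could make anything talk?", max_chars=20)
print(wrapped)
```

Wrapping on the server keeps the device code simple: the PyPortal only has to draw the string it receives.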


I use the AWS Serverless Application Model (AWS SAM) to create the backend template. While it can be deployed using the AWS SAM CLI, you can also deploy from the AWS Management Console:

  1. Generate a free PaperQuotes API key at paperquotes.com. The serverless backend requires this to fetch quotes.
  2. Navigate to the aws-serverless-pyportal-polly application in the AWS Serverless Application Repository.
  3. Under Application settings, enter the parameter, PaperQuotesAPIKey.
  4. Choose Deploy.
  5. Once complete, choose View CloudFormation Stack.
  6. Select the Outputs tab and make a note of the SpeechApiUrl. This is required for configuring the PyPortal.
  7. Click the link listed for SpeechApiKey in the Outputs tab.
  8. Click Show to reveal the API key. Make a note of this. This is required for authenticating requests from the PyPortal to the SpeechApiUrl.

PyPortal setup

The following instructions walk through installing the latest version of the Adafruit CircuitPython libraries and firmware. They also show how to enable an external speaker module.

  1. Follow these instructions from Adafruit to install the latest version of the CircuitPython bootloader. At the time of writing, the latest version is 5.3.0.
  2. Follow these instructions to install the latest Adafruit CircuitPython library bundle. I use bundle version 5.x.
  3. Insert the microSD card in the slot located on the back of the device.
  4. Cut the jumper pad on the back of the device labeled A0. This enables you to use an external speaker instead of the built-in speaker.
  5. Plug the external speaker connector into the port labeled SPEAKER on the back of the device.
  6. Optionally install the Mu Editor, a multi-platform code editor and serial debugger compatible with Adafruit CircuitPython boards. This can help with troubleshooting issues.
  7. Optionally if you have a 3D printer at home, you can print a case for your PyPortal. This can protect and showcase your project.

Code PyPortal

As with regular Python, CircuitPython does not need to be compiled to execute. You can flash new firmware on the PyPortal by copying a Python file and necessary assets to a mounted volume. The bootloader runs code.py anytime the device starts or any files are updated.

  1. Use a USB cable to plug the PyPortal into your computer and wait until a new mounted volume CIRCUITPY is available.
  2. Download the project from GitHub. Inside the project, copy the contents of /circuit-python on to the CIRCUITPY volume.
  3. Inside the volume, open and edit the secrets.py file. Include your Wi-Fi credentials along with the SpeechApiKey and SpeechApiUrl API Gateway endpoint. These can be found under Outputs in the AWS CloudFormation stack created by the AWS Serverless Application Repository.
  4. Save the file, and the device restarts. It takes a moment to connect to Wi-Fi and make the first request.
    Optionally, if you installed the Mu Editor, you can click on “Serial” to follow along the device log.

The PyPortal takes a few moments to connect to the Wi-Fi network and make its first request. On success, you hear it greet you and describe itself. The default interval is set to then display and read a quote every five minutes.

Understanding the CircuitPython code

See the bottom of circuit-python/code.py from the GitHub project. When the PyPortal connects to Wi-Fi, the first thing it does is synthesize an arbitrary “hello world” text for display. It then begins periodically displaying and “speaking” quotes.

# Connect to WiFi
print("Connecting to WiFi...")
wifi.connect()
print("Connected!")

displayQuote("Ready!")

speakText('Hello world! I am an Adafruit PyPortal running Circuit Python speaking to you using AWS Serverless', 'Joanna')

while True:
    speakQuote('equality, humanity', 'Joanna')
    time.sleep(60*secrets['interval'])

Both the speakText and speakQuote functions call the synthesizeSpeech function. The difference is whether text or tags are passed to the API.

def speakText(text, voice):
    data = { "text": text, "voiceId": voice }
    synthesizeSpeech(data)

def speakQuote(tags, voice):
    data = { "tags": tags, "voiceId": voice }
    synthesizeSpeech(data)

The synthesizeSpeech function posts the data to the API Gateway endpoint, which invokes the Lambda function and returns the MP3 URL and the formatted text. The downloadfile function is then called to fetch the MP3 file and store it on the SD card, and displayQuote is called to display the quote on the LCD. Finally, the playMP3 function opens the file and plays the speech audio using the built-in or external speaker.

def synthesizeSpeech(data):
    response = postToAPI(secrets['endpoint'], data)
    downloadfile(response['url'], '/sd/cache.mp3')
    displayQuote(response['text'])
    playMP3("/sd/cache.mp3")
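The postToAPI helper is not shown in this post. To illustrate what such a call involves, the following sketch builds the equivalent request in desktop Python with the standard library, which is useful for testing the backend from a computer before flashing the device. It is not the CircuitPython code from the project: the function name `build_speech_request`, the endpoint URL, and the key value are placeholders. API Gateway expects the API key in the `x-api-key` header.

```python
import json
import urllib.request


def build_speech_request(endpoint, api_key, data):
    """Build (but do not send) a JSON POST request for the speech endpoint."""
    body = json.dumps(data).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


request = build_speech_request(
    "https://example.execute-api.us-east-1.amazonaws.com/Prod/speech",  # placeholder URL
    "EXAMPLE_KEY",  # placeholder for the SpeechApiKey value
    {"text": "Hello world!", "voiceId": "Joanna"},
)

# To send it, pass the request to urllib.request.urlopen() and decode the
# JSON body; per the code above, the result carries 'url' and 'text' fields.
```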

Modifying the Lambda function

The serverless application includes a Lambda function, SynthesizeSpeechFunction, which can be modified directly in the Lambda console. The AWS SAM template used to deploy the AWS Serverless Application Repository application adds policies for accessing the S3 bucket where audio is stored and grants access to Amazon Polly for synthesizing speech. It also adds the PaperQuotes API token as an environment variable and sets API Gateway as an event source.

SynthesizeSpeechFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: lambda_functions/SynthesizeSpeech/
      Handler: app.lambda_handler
      Runtime: python3.8
      Policies:
        - S3FullAccessPolicy:
            BucketName: !Sub "${AWS::StackName}-audio"
        - Version: '2012-10-17'
          Statement:
            - Effect: Allow
              Action:
                - polly:*
              Resource: '*'
      Environment:
        Variables:
          BUCKET_NAME: !Sub "${AWS::StackName}-audio"
          PAPER_QUOTES_TOKEN: !Ref PaperQuotesAPIKey
      Events:
        Speech:
          Type: Api
          Properties:
            RestApiId: !Ref SpeechApi
            Path: /speech
            Method: post

To edit the Lambda function, navigate back to the CloudFormation stack and click on the SynthesizeSpeechFunction under the Resources tab.

From here, you can edit the Lambda function code directly. Clicking Save deploys the new code.

The getQuotes function is called to fetch quotes from the PaperQuotes API. You can change this to call from a different source, such as a custom selection of quotes. Try modifying it to fetch social media posts or study questions.
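As one possible starting point for such a modification, the sketch below picks a random entry from a custom, tagged list instead of calling an external API. Everything here is hypothetical: the actual signature and return shape of getQuotes in the deployed function are not shown in this post, so treat `get_custom_quote` and `CUSTOM_QUOTES` as illustrative names and adapt them to the real code.

```python
import random

# Hypothetical custom selection; replace with your own content.
CUSTOM_QUOTES = [
    {"quote": "Which AWS service synthesizes speech from text?", "tags": ["aws", "study"]},
    {"quote": "What does S3 stand for?", "tags": ["aws", "study"]},
    {"quote": "Remember to water the plants.", "tags": ["home"]},
]


def get_custom_quote(tags, quotes=CUSTOM_QUOTES, rng=random):
    """Return a random quote matching any tag in a comma-separated string,
    or None when nothing matches."""
    wanted = {t.strip().lower() for t in tags.split(",") if t.strip()}
    matches = [q for q in quotes if wanted & set(q["tags"])]
    if not matches:
        return None
    return rng.choice(matches)["quote"]


print(get_custom_quote("study, home"))
```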

Conclusion

I show how to add natural sounding text to speech on a microcontroller using a serverless backend. This is accomplished by deploying an application through the AWS Serverless Application Repository. The deployed API uses API Gateway to securely invoke a Lambda function that fetches quotes from the PaperQuotes API and generates speech using Amazon Polly. The speech audio is uploaded to S3.

I then show how to program a microcontroller, the Adafruit PyPortal, using CircuitPython. The code periodically calls the serverless API to fetch a quote and to download speech audio for playback. The sample code also demonstrates synthesizing arbitrary text to speech, meaning it can be used for any project you can conceive. Check out my previous guide on using the PyPortal to create a Martian weather display for inspiration.

Moovit embraces data lake architecture by extending their Amazon Redshift cluster to analyze billions of data points every day

Post Syndicated from Yonatan Dolan original https://aws.amazon.com/blogs/big-data/moovit-embraces-data-lake-architecture-by-extending-their-amazon-redshift-cluster-to-analyze-billions-of-data-points-every-day/

Amazon Redshift is a fast, fully managed, cloud-native data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence tools.

Moovit is a leading Mobility as a Service (MaaS) solutions provider and maker of the top urban mobility app. Guiding over 800 million users in more than 3,200 cities across 103 countries to get around town effectively and conveniently, Moovit has experienced exponential growth of their service in the last few years. The company amasses up to 6 billion anonymous data points a day to add to the world’s largest repository of transit and urban mobility data, aided by Moovit’s network of more than 685,000 local editors that help map and maintain local transit information in cities that would otherwise be unserved.

Like Moovit, many companies today are using Amazon Redshift to analyze data and perform various transformations on the data. However, as data continues to grow and become even more important, companies are looking for more ways to extract valuable insights from the data, such as big data analytics, numerous machine learning (ML) applications, and a range of tools to drive new use cases and business processes. Companies are looking to access all their data, all the time, by all users and get fast answers. The best solution for all those requirements is for companies to build a data lake, which is a centralized repository that allows you to store all your structured, semi-structured, and unstructured data at any scale.

With a data lake built on Amazon Simple Storage Service (Amazon S3), you can easily run big data analytics using services such as Amazon EMR and AWS Glue. You can also query structured data (such as CSV, Avro, and Parquet) and semi-structured data (such as JSON and XML) by using Amazon Athena and Amazon Redshift Spectrum. You can also use a data lake with ML services such as Amazon SageMaker to gain insights.

Moovit uses an Amazon Redshift cluster to allow different company teams to analyze vast amounts of data. They wanted a way to extend the collected data into the data lake and allow additional analytical teams to access more data to explore new ideas and business cases.

Additionally, Moovit was looking to manage their storage costs and evolve to a model that allowed cooler data to be maintained at the lowest cost in S3, and maintain the hottest data in Redshift for the most efficient query performance. The proposed solution implemented a hot/cold storage pattern using Amazon Redshift Spectrum and reduced the local disk utilization on the Amazon Redshift cluster to make sure costs are maintained. Moovit is currently evaluating the new RA3 node with managed storage as an additional level of flexibility that will allow them to easily scale the amount of hot/cold storage without limit.

In this post we demonstrate how Moovit, with the support of AWS, implemented a lake house architecture by employing the following best practices:

  • Unloading data into Amazon Simple Storage Service (Amazon S3)
  • Instituting a hot/cold pattern using Amazon Redshift Spectrum
  • Using AWS Glue to crawl and catalog the data
  • Querying data using Athena

Solution overview

The following diagram illustrates the solution architecture.

The solution includes the following steps:

  1. Unload data from Amazon Redshift to Amazon S3
  2. Create an AWS Glue Data Catalog using an AWS Glue crawler
  3. Query the data lake in Amazon Athena
  4. Query Amazon Redshift and the data lake with Amazon Redshift Spectrum

Prerequisites

To complete this walkthrough, you must have the following prerequisites:

  1. An AWS account.
  2. An Amazon Redshift cluster.
  3. The following AWS services and access: Amazon Redshift, Amazon S3, AWS Glue, and Athena.
  4. The appropriate AWS Identity and Access Management (IAM) permissions for Amazon Redshift Spectrum and AWS Glue to access Amazon S3 buckets. For more information, see IAM policies for Amazon Redshift Spectrum and Setting up IAM Permissions for AWS Glue.

Walkthrough

To demonstrate the process Moovit used to build their data architecture, we use the industry-standard TPC-H dataset provided publicly by the TPC organization.

The Orders table has the following columns:

Column             Type
O_ORDERKEY         int4
O_CUSTKEY          int4
O_ORDERSTATUS      varchar
O_TOTALPRICE       numeric
O_ORDERDATE        date
O_ORDERPRIORITY    varchar
O_CLERK            varchar
O_SHIPPRIORITY     int4
O_COMMENT          varchar
SKIP               varchar

Unloading data from Amazon Redshift to Amazon S3

Amazon Redshift allows you to unload your data using a data lake export to an Apache Parquet file format. Parquet is an efficient open columnar storage format for analytics. Parquet format is up to twice as fast to unload and consumes up to six times less storage in Amazon S3, compared with text formats.

To unload cold or historical data from Amazon Redshift to Amazon S3, you need to run an UNLOAD statement similar to the following code (substitute your IAM role ARN):

UNLOAD ('select o_orderkey, o_custkey, o_orderstatus, o_totalprice, o_orderdate, o_orderpriority, o_clerk, o_shippriority, o_comment, skip
FROM tpc.orders
ORDER BY o_orderkey, o_orderdate') 
TO 's3://tpc-bucket/orders/' 
CREDENTIALS 'aws_iam_role=arn:aws:iam::<account_number>:role/<Role>'
FORMAT AS parquet allowoverwrite PARTITION BY (o_orderdate);

It is important to define a partition key or column that minimizes Amazon S3 scans as much as possible based on the query patterns intended. The query pattern is often by date ranges; for this use case, use the o_orderdate field as the partition key.

Another important recommendation when unloading is to have file sizes between 128 MB and 512 MB. By default, the UNLOAD command splits the results to one or more files per node slice (virtual worker in the Amazon Redshift cluster) which allows you to use the Amazon Redshift MPP architecture. However, this can potentially cause files created by every slice to be small. In Moovit’s use case, the default UNLOAD using PARALLEL ON yielded dozens of small (MBs) files. For Moovit, PARALLEL OFF yielded the best results because it aggregated all the slices’ work into the LEADER node and wrote it out as a single stream controlling the file size using the MAXFILESIZE option.
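To show how the PARALLEL OFF and MAXFILESIZE options fit into the statement, the following sketch renders an UNLOAD command as a string. It is an illustration under assumptions, not Moovit's actual tooling: the function name, the 256 MB default (chosen from the 128 MB to 512 MB range recommended above), and the role ARN are placeholders. Verify the option combination against the UNLOAD documentation for your cluster version before running it.

```python
def build_unload_sql(select_sql, s3_prefix, iam_role_arn, partition_column, max_file_size_mb=256):
    """Render an UNLOAD statement that writes Parquet serially and caps file size."""
    # Inside UNLOAD ('...'), single quotes in the query must be doubled.
    escaped = select_sql.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS PARQUET\n"
        f"PARTITION BY ({partition_column})\n"
        f"PARALLEL OFF\n"
        f"MAXFILESIZE {max_file_size_mb} MB\n"
        f"ALLOWOVERWRITE;"
    )


unload_sql = build_unload_sql(
    "select * from tpc.orders where o_orderdate < '1995-01-01' order by o_orderdate",
    "s3://tpc-bucket/orders/",
    "arn:aws:iam::123456789012:role/ExampleUnloadRole",  # placeholder ARN
    "o_orderdate",
)
print(unload_sql)
```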

Another performance enhancement applied in this use case was the use of Parquet’s min and max statistics. Parquet files have min_value and max_value column statistics for each row group that allow Amazon Redshift Spectrum to prune (skip) row groups that are out of scope for a query (range-restricted scan). To use row group pruning, you should sort the data by frequently-used columns. Min/max pruning helps scan less data from Amazon S3, which results in improved performance and reduced cost.
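The pruning logic can be illustrated with a small, self-contained sketch. It models each row group by its min and max value for one column and keeps only the groups whose range overlaps the query's range. This is a simplified model of the idea, not Amazon Redshift Spectrum's implementation, and the row-group boundaries are made-up sample data.

```python
from datetime import date


def prune_row_groups(row_groups, lo, hi):
    """Keep only row groups whose [min, max] range overlaps the query range [lo, hi]."""
    return [rg for rg in row_groups if rg["max"] >= lo and rg["min"] <= hi]


# Data sorted by date: each row group covers a narrow, non-overlapping range.
sorted_groups = [
    {"id": 0, "min": date(1995, 1, 1), "max": date(1995, 3, 31)},
    {"id": 1, "min": date(1995, 4, 1), "max": date(1995, 6, 30)},
    {"id": 2, "min": date(1995, 7, 1), "max": date(1995, 9, 30)},
]

# Unsorted data: every row group spans almost the whole date range.
unsorted_groups = [
    {"id": 0, "min": date(1995, 1, 2), "max": date(1995, 9, 29)},
    {"id": 1, "min": date(1995, 1, 1), "max": date(1995, 9, 30)},
    {"id": 2, "min": date(1995, 1, 5), "max": date(1995, 9, 20)},
]

query_range = (date(1995, 5, 1), date(1995, 5, 31))
scanned_sorted = [rg["id"] for rg in prune_row_groups(sorted_groups, *query_range)]
scanned_unsorted = [rg["id"] for rg in prune_row_groups(unsorted_groups, *query_range)]
print(scanned_sorted, scanned_unsorted)
```

With sorted data only one of three row groups is read; with unsorted data none can be skipped, which is why sorting by the frequently filtered column matters.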

After unloading the data to your data lake, you can view your Parquet file’s content in Amazon S3 (assuming it’s under 128 MB). From the Actions drop-down menu, choose Select from.

You’re now ready to populate your Data Catalog using an AWS Glue crawler.

Creating a Data Catalog with an AWS Glue crawler

To query your data lake using Athena, you must catalog the data. The Data Catalog is an index of the location, schema, and runtime metrics of the data.

An AWS Glue crawler accesses your data store, extracts metadata (such as field types), and creates a table schema in the Data Catalog. For instructions, see Working with Crawlers on the AWS Glue Console.

Querying the data lake in Athena

After you create the crawler, you can view the schema and tables in AWS Glue and Athena, and can immediately start querying the data in Athena. The following screenshot shows the table in the Athena Query Editor.

Querying Amazon Redshift and the data lake using a unified view with Amazon Redshift Spectrum

Amazon Redshift Spectrum is a feature of Amazon Redshift that allows multiple Amazon Redshift clusters to query the same data in the lake. It enables the lake house architecture and allows data warehouse queries to reference data in the data lake as they would any other table. Amazon Redshift clusters transparently use the Amazon Redshift Spectrum feature when the SQL query references an external table stored in Amazon S3. Multiple large queries can run in parallel, because Amazon Redshift Spectrum scans, filters, and aggregates the external tables’ data in Amazon S3 and returns only the resulting rows to the Amazon Redshift cluster.

Following best practices, Moovit decided to persist all their data in their Amazon S3 data lake and only store hot data in Amazon Redshift. They could query both hot and cold datasets in a single query with Amazon Redshift Spectrum.

The first step is creating an external schema in Amazon Redshift that maps a database in the Data Catalog. See the following code:

CREATE EXTERNAL SCHEMA spectrum 
FROM data catalog 
DATABASE 'datalake' 
iam_role 'arn:aws:iam::<account_number>:role/mySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

After the crawler creates the external table, you can start querying in Amazon Redshift using the mapped schema that you created earlier. See the following code:

SELECT * FROM spectrum.orders;

Lastly, create a late binding view that unions the hot and cold data:

CREATE OR REPLACE VIEW lake_house_joint_view AS (SELECT * FROM public.orders WHERE o_orderdate >= dateadd('day',-90,date_trunc('day',getdate())) 
UNION ALL SELECT * FROM spectrum.orders WHERE o_orderdate < dateadd('day',-90,date_trunc('day',getdate()))) WITH NO SCHEMA BINDING;
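The view's two predicates split rows at a single cutoff, 90 days before the current day, so every order date falls into exactly one branch of the UNION ALL. The following sketch mirrors that rule in Python to make the boundary behavior explicit. The function name and constant are illustrative, and the 90-day window is simply the value used in the view above.

```python
from datetime import date, timedelta

HOT_WINDOW_DAYS = 90  # same window as the view definition


def storage_tier(order_date, today):
    """Return 'hot' for rows served from the cluster's local table and
    'cold' for rows served from the external (Amazon S3) table."""
    cutoff = today - timedelta(days=HOT_WINDOW_DAYS)
    return "hot" if order_date >= cutoff else "cold"


today = date(2020, 6, 30)
on_cutoff = storage_tier(date(2020, 4, 1), today)       # exactly 90 days back
before_cutoff = storage_tier(date(2020, 3, 31), today)  # one day older
print(on_cutoff, before_cutoff)
```

Because the comparison is `>=` on one side and `<` on the other, a row dated exactly on the cutoff is read from the hot table only, and no row is returned twice even if the same data exists in both places.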

Summary

In this post, we showed how Moovit unloaded data from Amazon Redshift to a data lake. By doing that, they exposed the data to many additional groups within the organization and democratized the data. These benefits of data democratization are substantial because various teams within Moovit can access the data, analyze it with various tools, and come up with new insights.

As an additional benefit, Moovit reduced their Amazon Redshift utilized storage, which allowed them to maintain cluster size and avoid additional spending by keeping all historical data within the data lake and only hot data in the Amazon Redshift cluster. Because the full history is preserved in the data lake, Moovit no longer has to delete data from the cluster frequently to free up space, which saves IT resources, time, and effort.

If you are looking to extend your data warehouse to a data lake and leverage various tools for big data analytics and machine learning (ML) applications, we invite you to try out this walkthrough.

 


About the Authors

Yonatan Dolan is a Business Development Manager at Amazon Web Services. He is located in Israel and helps customers harness AWS analytical services to leverage data, gain insights, and derive value.

 

 

 

 

Alon Gendler is a Startup Solutions Architect at Amazon Web Services. He works with AWS customers to help them architect secure, resilient, scalable and high performance applications in the cloud.

 

 

 

 

Vincent Gromakowski is a Specialist Solutions Architect for Amazon Web Services.

 

 

Tighten S3 permissions for your IAM users and roles using access history of S3 actions

Post Syndicated from Mathangi Ramesh original https://aws.amazon.com/blogs/security/tighten-s3-permissions-iam-users-and-roles-using-access-history-s3-actions/

Customers tell us that when their teams and projects are just getting started, administrators may grant broad access to inspire innovation and agility. Over time administrators need to restrict access to only the permissions required and achieve least privilege. Some customers have told us they need information to help them determine the permissions an application really needs, and which permissions they can remove without impacting applications. To help with this, AWS Identity and Access Management (IAM) reports the last time users and roles used each service, so you can know whether you can restrict access. This helps you to refine permissions to specific services, but we learned that customers also need to set more granular permissions to meet their security requirements.

We are happy to announce that we now include action-level last accessed information for Amazon Simple Storage Service (Amazon S3). This means you can tighten permissions to only the specific S3 actions that your application requires. The action-level last accessed information is available for S3 management actions. As you try it out, let us know how you’re using action-level information and what additional information would be valuable as we consider supporting more services.

The following is an example snapshot of S3 action last accessed information.

Figure 1: S3 action last accessed information snapshot

You can use the new action last accessed information for Amazon S3 in conjunction with other features that help you to analyze access and tighten S3 permissions. AWS IAM Access Analyzer generates findings when your resource policies allow access to your resources from outside your account or organization. Specifically for Amazon S3, when an S3 bucket policy changes, Access Analyzer alerts you if the bucket is accessible by users from outside the account, which helps you to protect your data from unintended access. You can use action last accessed information for your user or role, in combination with Access Analyzer findings, to improve the security posture of your S3 permissions. You can review the action last accessed information in the IAM console, or programmatically using the AWS Command Line Interface (AWS CLI) or a programmatic client.

Example use case for reviewing action last accessed details

Now I’ll walk you through an example to demonstrate how you identify unused S3 actions and reduce permissions for your IAM principals. In this example a system administrator, Martha Rivera, is responsible for managing access for her IAM principals. She periodically reviews permissions to ensure that teams follow security best practices. Specifically, she ensures that the team has only the minimum S3 permissions required to work on their application and achieve their use cases. To do this, Martha reviews the last accessed timestamp for each supported S3 action that the roles in her account have access to. Martha then uses this information to identify the S3 actions that are not used, and she restricts access to those actions by updating the policies.

To view action last accessed information in the AWS Management Console

  1. Open the IAM Console.
  2. In the navigation pane, select Roles, then choose the role that you want to analyze (for example, PaymentAppTestRole).
  3. Select the Access Advisor tab. This tab displays all the AWS services to which the role has permissions, as shown in Figure 2.

    Figure 2: List of AWS services to which the role has permissions

  4. On the Access Advisor tab, select Amazon S3 to view all the supported actions to which the role has permissions, when each action was last used by the role, and the AWS Region in which it was used, as shown in Figure 3.

    Figure 3: List of S3 actions with access data

In this example, Martha notices that PaymentAppTestRole has read and write S3 permissions. From the information in Figure 3, she sees that the role is using read actions for GetBucketLogging, GetBucketPolicy, and GetBucketTagging. She also sees that the role hasn’t used write permissions for CreateAccessPoint, CreateBucket, PutBucketPolicy, and others in the last 30 days. Based on this information, Martha updates the policies to remove write permissions. To learn more about updating permissions, see Modifying a Role in the AWS IAM User Guide.

At launch, you can review 50 days of access data, that is, any use of S3 actions in the preceding 50 days will show up as a last accessed timestamp. As this tracking period continues to increase, you can start making permissions decisions that apply to use cases with longer period requirements (for example, when 60 or 90 days is available).

Martha sees that the GetAccessPoint action shows Not accessed in the tracking period, which means that the action was not used since IAM started tracking access for the service, action, and AWS Region. Based on this information, Martha confidently removes this permission to further reduce permissions for the role.

Additionally, Martha notices that an action she expected does not show up in the list in Figure 3. This can happen for two reasons: either PaymentAppTestRole does not have permissions to the action, or IAM doesn’t yet track access for the action. In such a situation, do not update permissions for those actions based on action last accessed information. To learn more, see Refining Permissions Using Last Accessed Data in the AWS IAM User Guide.

To view action last accessed information programmatically

The action last accessed data is available through updates to the following existing APIs. These APIs now generate action last accessed details, in addition to service last accessed details:

  • generate-service-last-accessed-details: Call this API to generate the service and action last accessed data for a user or role. You call this API first to start a job that generates the action last accessed data for a user or role. This API returns a JobID that you will then use with get-service-last-accessed-details to determine the status of the job completion.
  • get-service-last-accessed-details: Call this API to retrieve the service and action last accessed data for a user or role based on the JobID you pass in. This API is paginated at the service level.
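The two calls above can be combined into a small helper. The following is a minimal sketch, not an official sample: it assumes `iam` behaves like `boto3.client("iam")`, that the job is requested with `Granularity="ACTION_LEVEL"`, and that a tracked action with no `LastAccessedTime` field is one that was not accessed in the tracking period. The function name and the polling parameters are hypothetical.

```python
import time


def unused_s3_actions(iam, role_arn, poll_seconds=2.0, max_polls=30):
    """Return tracked S3 actions that show no access in the tracking period.

    `iam` is expected to behave like boto3.client("iam"). It is passed in so
    that the sketch can be exercised without AWS credentials.
    """
    # Start the job. ACTION_LEVEL asks for action data as well as service data.
    job_id = iam.generate_service_last_accessed_details(
        Arn=role_arn, Granularity="ACTION_LEVEL"
    )["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    for _ in range(max_polls):
        page = iam.get_service_last_accessed_details(JobId=job_id)
        if page["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError("last accessed job did not finish in time")
    if page["JobStatus"] == "FAILED":
        raise RuntimeError("last accessed job failed")

    unused = []
    while True:
        for service in page.get("ServicesLastAccessed", []):
            if service["ServiceNamespace"] != "s3":
                continue
            for action in service.get("TrackedActionsLastAccessed", []):
                # Assumption: no LastAccessedTime means "not accessed".
                if "LastAccessedTime" not in action:
                    unused.append(action["ActionName"])
        # Results are paginated at the service level.
        if not page.get("IsTruncated"):
            break
        page = iam.get_service_last_accessed_details(
            JobId=job_id, Marker=page["Marker"]
        )
    return sorted(unused)
```

With real credentials, you would create the client with `boto3.client("iam")` and pass the ARN of the role you want to review. Treat the returned list as a starting point for review, because actions that IAM doesn’t yet track never appear in it.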

To learn more, see GenerateServiceLastAccessedDetails in the AWS IAM User Guide.

Conclusion

By using action last accessed information for S3, you can review access for supported S3 actions, remove unused actions, and restrict access to S3 to achieve least privilege. To learn more about how to use action last accessed information, see Refining Permissions Using Last Accessed Data in the AWS IAM User Guide.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS IAM forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Mathangi Ramesh

Mathangi is the product manager for AWS Identity and Access Management. She enjoys talking to customers and working with data to solve problems. Outside of work, Mathangi is a fitness enthusiast and a Bharatanatyam dancer. She holds an MBA degree from Carnegie Mellon University.

Running a high-performance SAS Grid Manager cluster on AWS with Amazon FSx for Lustre

Post Syndicated from Neelam original https://aws.amazon.com/blogs/big-data/running-a-high-performance-sas-grid-manager-cluster-on-aws-with-amazon-fsx-for-lustre/

SAS® is a software provider of data science and analytics used by enterprises and government organizations. SAS Grid is a highly available, fast processing analytics platform that offers centralized management that balances workloads across different compute nodes. This application suite is capable of data management, visual analytics, governance and security, forecasting and text mining, statistical analysis, and environment management. SAS and AWS recently performed testing using the Amazon FSx for Lustre shared file system to determine how well a standard workload performs on AWS using SAS Grid Manager. For more information about the results, see the whitepaper Accelerating SAS Using High-Performing File Systems on Amazon Web Services.

In this post, we take a look at an approach to deploy underlying AWS infrastructure to run SAS Grid with FSx for Lustre that you can also apply to similar applications with demanding I/O requirements.

System design overview

Running high-performance workloads that use throughput heavily, with sensitivity to network latency, requires approaches outside of typical applications. AWS generally recommends that applications span multiple Availability Zones for high availability. For latency-sensitive, high-throughput applications, however, traffic should stay local for optimal performance. To maximize throughput, you can do the following:

  • Run in a virtual private cloud (VPC), using instance types that support enhanced networking
  • Run instances in the same Availability Zone
  • Run instances within a placement group

The following diagram illustrates the SAS Grid with FSx for Lustre architecture on AWS.

SAS Grid architecture consists of mid-tier nodes, metadata servers, and Grid compute nodes. The mid-tier nodes are responsible for running the Platform Web Services (PWS) and Load Sharing Facility (LSF) components. These components dispatch jobs submitted and return the status of each job.

To effectively run PWS and LSF on mid-tier nodes, you need Amazon Elastic Compute Cloud (Amazon EC2) instances with high memory. For this use case, the r5 instance family would meet this requirement.

Metadata servers contain the metadata repository that stores the metadata definitions of all SAS Grid manager products, which the r5 instance family can also serve effectively. We recommend either meeting or exceeding the recommended memory requirement of 24 GB of RAM or 8 MB per physical core (whichever is larger). Metadata servers don’t need compute-intensive resources or high I/O bandwidth; therefore, you can choose the r5 instance family for a balance of price and performance.

SAS Grid nodes are responsible for executing the jobs received by the grid, and EC2 instances capable of handling these jobs depend on the size, complexity, and volume of the work the grid performs. To meet the minimum requirements of SAS Grid workloads, we recommend having a minimum of 8 GB of physical RAM per core and a robust I/O throughput of 100–125 MB/second per physical core. For this use case, EC2 instance families of m5n and r5n suffice in meeting the RAM and throughput requirements. You can host SASDATA, SASWORK, and UTILLOC libraries in a shared file system. If you choose to offload SASWORK to instance storage, the i3en instance family meets this need because they support instance storage over 1.2 TB. In the next section, we take a look at how throughput testing was performed to arrive at the EC2 instance recommendations with FSx for Lustre.

Steps to maximize storage I/O performance

SAS Grid requires a shared file system, and we wanted to benchmark the performance of FSx for Lustre as the chosen shared file system against various EC2 instance families that meet the minimum requirements of 8 GB of physical RAM per core and 100–125 MB/second throughput per physical core.

FSx for Lustre is a fully managed file storage service designed for applications that require fast storage. As a POSIX-compliant file system, you can use FSx for Lustre with current Linux-based applications without having to make any changes. Although FSx for Lustre offers a choice between scratch and persistent type file systems, we recommend for SAS Grid to use persistent type FSx for Lustre file system because you need to store the SASWORK, SASDATA, and UTILLOC data and libraries for longer periods with high availability and data durability. To meet I/O throughput, make sure to select the appropriate storage capacity for throughput per unit of storage to achieve the desired range of 100–125 MB/second.
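The sizing arithmetic behind that recommendation can be sketched as follows. This is an illustration only: the function names are hypothetical, the throughput per unit of storage is an input that you supply for the file system type you choose, and the sketch ignores any provisioning increments the service imposes on storage capacity.

```python
import math


def required_throughput_mb_s(physical_cores, per_core_mb_s=125.0):
    """Aggregate I/O throughput needed for a given number of physical cores."""
    return physical_cores * per_core_mb_s


def required_capacity_tib(physical_cores, per_tib_mb_s, per_core_mb_s=125.0):
    """Smallest whole-TiB capacity whose provisioned throughput covers the grid.

    per_tib_mb_s is the throughput per unit of storage chosen for the file
    system. Its value is an input here, not something this sketch looks up.
    """
    needed = required_throughput_mb_s(physical_cores, per_core_mb_s)
    return math.ceil(needed / per_tib_mb_s)


# Example: 64 physical cores at the upper bound of 125 MB/second per core
# need 8,000 MB/second in aggregate. At a hypothetical 200 MB/second per TiB,
# that works out to 40 TiB of storage.
print(required_throughput_mb_s(64))    # 8000.0
print(required_capacity_tib(64, 200))  # 40
```

Sizing against the upper bound of the 100–125 MB/second range leaves some headroom. Pass `per_core_mb_s=100` to size against the lower bound instead.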

After setting up the file system, we recommend mounting FSx for Lustre with the flock mount option. The following code example is a mount command and mount option for FSx for Lustre:

$ sudo mount -t lustre -o noatime,flock fs-0123456789abcd.fsx.us-west- [email protected]:/za3atbmv /fsx
$ mount -t lustre
[email protected]:/za3atbmv on /fsx type lustre (rw,noatime,seclabel,flock,lazystatfs)

Throughput testing and results

To select the best-placed EC2 instances for running SAS Grid with FSx for Lustre, we ran a series of highly parallel network throughput tests from individual EC2 instances against a 100.8 TiB persistent file system that had an aggregate throughput capacity of 19.688 GB/second. We ran these tests in four Regions using multiple EC2 instance families (c5, c5n, i3, i3en, m5, m5a, m5ad, m5n, m5dn, r5, r5a, r5ad, r5n, and r5dn). The tests ran for 3 hours for each instance, and the DataWriteBytes metric of the file system was recorded every 1 minute. Only one instance was accessing the file system at a time, and the p99.9 results were captured. The metrics were consistent across all four Regions.
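To make the measurement concrete, the following sketch shows one way to turn per-minute DataWriteBytes samples into a p99.9 throughput figure. It is not the tooling used for the tests: the function names are hypothetical, the nearest-rank percentile method is an assumption, and a megabyte is taken as 1,000,000 bytes.

```python
import math


def to_mb_per_second(bytes_per_period, period_seconds=60, bytes_per_mb=1_000_000):
    """Convert per-period byte totals into average MB/second for each period."""
    return [b / period_seconds / bytes_per_mb for b in bytes_per_period]


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest sample that covers pct% of the data."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


# A 3-hour test sampled every minute yields 180 samples.
samples = [60_000_000 * k for k in range(1, 181)]  # synthetic byte totals
rates = to_mb_per_second(samples)
print(percentile_nearest_rank(rates, 99.9))  # 180.0
```

With only 180 samples, the nearest-rank p99.9 is the largest observed sample, so it effectively reports the peak per-minute throughput.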

We observed that the i3en, m5n, m5dn, r5n, and r5dn EC2 instance families meet or exceed the minimum network performance and memory recommendations. For more information about the performance results, see the whitepaper Accelerating SAS Using High-Performing File Systems on Amazon Web Services. The i3 instance family is just shy of meeting the minimum network performance. If you want to use the instance storage for SASWORK and UTILLOC libraries, you can consider i3en instances.

M5n and r5n are a good blend of price and performance, and we recommend the m5n instance family for SAS Grid nodes. However, if your workload is memory bound, consider using r5n instances, which provide higher memory per physical core for a higher price point than m5n instances.

We also ran rhel_iotest.sh, which is available from the SAS technical support samples tool repository (SASTSST), using the same FSx for Lustre configuration as mentioned earlier. The following table shows the read and write performance per physical core for a variety of instance sizes in the m5n and r5n families.

Variable Network Performance Peak per Physical Core

Instance Type    Read (MB/second)    Write (MB/second)
m5n.large        850.20              357.07
m5n.xlarge       519.46              386.25
m5n.2xlarge      283.01              446.84
m5n.4xlarge      202.89              376.57
m5n.8xlarge      154.98              297.71
r5n.large        906.88              429.93
r5n.xlarge       488.36              455.76
r5n.2xlarge      256.96              471.65
r5n.4xlarge      203.31              390.03
r5n.8xlarge      149.63              299.45

To take advantage of the elasticity, scalability, and flexibility of the cloud, we recommend spreading the SAS Grid and compute workload over a larger number of smaller instances versus using a smaller number of larger instances. For mid-tier, use a minimum of two instances, and for metadata servers, we recommend a minimum of three instances for the SAS Grid architecture.

Conclusions

Before FSx for Lustre, you had to use either Amazon Elastic File System (Amazon EFS) or a third-party file system from AWS Marketplace and Amazon Elastic Block Store (Amazon EBS) for the SASWORK, SASDATA, and UTILLOC libraries and storage data. Each storage option came with its own settings and limitations, which caused a loss in performance. With FSx for Lustre, you have a single solution for all SAS Grid storage requirements, which allows you to focus on running your business instead of maintaining a file system. We recommend that SAS administrators deploy SAS Grid with m5n and r5n instances for SAS Grid compute nodes when accessing an FSx for Lustre file system.

If you have questions or suggestions, please leave a comment.

Build an AWS Well-Architected environment with the Analytics Lens

Post Syndicated from Nikki Rouda original https://aws.amazon.com/blogs/big-data/build-an-aws-well-architected-environment-with-the-analytics-lens/

Building a modern data platform on AWS enables you to collect data of all types, store it in a central, secure repository, and analyze it with purpose-built tools. Yet you may be unsure of how to get started and of the impact of certain design decisions. To address the need to provide advice tailored to specific technology and application domains, AWS added the concept of well-architected lenses in 2017. AWS is now happy to announce the Analytics Lens for the AWS Well-Architected Framework. This post provides an introduction to its purpose, topics covered, common scenarios, and services included.

The new Analytics Lens offers comprehensive guidance to make sure that your analytics applications are designed in accordance with AWS best practices. The goal is to give you a consistent way to design and evaluate cloud architectures, based on the following five pillars:

  • Operational excellence
  • Security
  • Reliability
  • Performance efficiency
  • Cost optimization

The tool can help you assess the analytics workloads you have deployed in AWS by identifying potential risks and offering suggestions for improvements.

Using the Analytics Lens to address common requirements

The Analytics Lens models both the data architecture at the core of the analytics applications and the application behavior itself. These models are organized into the following six areas, which encompass the vast majority of analytics workloads deployed on AWS:

  1. Data ingestion
  2. Security and governance
  3. Catalog and search
  4. Central storage
  5. Processing and analytics
  6. User access

The following diagram illustrates these areas and their related AWS services.

There are a number of common scenarios where the Analytics Lens applies, such as the following:

  • Building a data lake as the foundation for your data and analytics initiatives
  • Efficient batch data processing at scale
  • Building a platform for streaming ingest and real-time event processing
  • Handling big data processing and streaming
  • Data-preparation operations

Whichever of these scenarios fits your needs, building to the principles of the Analytics Lens in the AWS Well-Architected Framework can help you implement best practices for success.

The Analytics Lens explains when and how to use the core services in the AWS analytics portfolio. These include Amazon Kinesis, Amazon Redshift, Amazon EMR, Amazon Athena, AWS Glue, and AWS Lake Formation. It also explains how Amazon Simple Storage Service (Amazon S3) can serve as the storage for your data lake and how to integrate with relevant AWS security services. With reference architectures, best practices advice, and answers to common questions, the Analytics Lens can help you make the right design decisions.

Conclusion

Applying the lens to your existing architectures can validate the stability and efficiency of your design (or provide recommendations to address the gaps that are identified). AWS is committed to the Analytics Lens as a living tool; as the analytics landscape evolves and new AWS services come on line, we’ll update the Analytics Lens appropriately. Our mission will always be to help you design and deploy well-architected applications.

For more information about building your own Well-Architected environment using the Analytics Lens, see the Analytics Lens whitepaper.

Special thanks to the following individuals who contributed to building this resource, among many others who helped with review and implementation: Radhika Ravirala, Laith Al-Saadoon, Wallace Printz, Ujjwal Ratan, and Neil Mukerje.

Are there questions you’d like to see answered in the tool? Share your thoughts and questions in the comments.

About the Authors

Nikki Rouda is the principal product marketing manager for data lakes and big data at Amazon Web Services. Nikki has spent 20+ years helping enterprises in 40+ countries develop and implement solutions to their analytics and IT infrastructure challenges. Nikki holds an MBA from the University of Cambridge and an ScB in geophysics and math from Brown University.

Radhika Ravirala is a specialist solutions architect at Amazon Web Services, where she helps customers craft distributed analytics applications on the AWS platform. Prior to her cloud journey, she worked as a software engineer and designer for technology companies in Silicon Valley.

Simplify data pipelines with AWS Glue automatic code generation and Workflows

Post Syndicated from Mohit Saxena original https://aws.amazon.com/blogs/big-data/simplify-data-pipelines-with-aws-glue-automatic-code-generation-and-workflows/

In the previous post of the series, we discussed how AWS Glue job bookmarks help you to incrementally load data from Amazon S3 and relational databases. We also saw how using the AWS Glue optimized Apache Parquet writer can help improve performance and manage schema evolution.

In the third post of the series, we’ll discuss three topics. First, we’ll look at how AWS Glue can automatically generate code to help transform data in common use cases such as selecting specific columns, flattening deeply nested records, efficiently parsing nested fields, and handling column data type evolution.

Second, we’ll outline how to use AWS Glue Workflows to build and orchestrate data pipelines using different Glue components such as Crawlers, Apache Spark and Python Shell ETL jobs.

Third, we’ll see how to leverage SparkSQL in your ETL jobs to perform SQL based transformations on datasets stored in Amazon S3 and relational databases.

Automatic Code Generation & Transformations: ApplyMapping, Relationalize, Unbox, ResolveChoice

AWS Glue can automatically generate code to help perform a variety of useful data transformation tasks. These transformations provide a simple-to-use interface for working with complex and deeply nested datasets. For example, some relational databases or data warehouses do not natively support nested data structures. AWS Glue can automatically generate the code necessary to flatten those nested data structures before loading them into the target database, saving time and enabling non-technical users to work with data.

The following is a list of the popular transformations AWS Glue provides to simplify data processing:

  1. ApplyMapping is a transformation used to perform column projection and convert between data types. In this example, we use it to unnest several fields, such as actor.id, which we map to the top-level actor.id field. We also cast the id column to a long.
    # Each tuple is (source field, source type, target field, target type).
    medicare_output = medicare_src.apply_mapping(
        [('id', 'string', 'id', 'long'),  # cast id from string to long
        ('type', 'string', 'type', 'string'),
        ('actor.id', 'int', 'actor.id', 'int'),
        ('actor.login', 'string', 'actor.login', 'string'),
        ('actor.display_login', 'string', 'actor.display_login', 'string'),
        ('actor.gravatar_id', 'long', 'actor.gravatar_id', 'long'),
        ('actor.url', 'string', 'actor.url', 'string'),
        ('actor.avatar_url', 'string', 'actor.avatar_url', 'string')]
    )

  2. Relationalize converts a nested dataset stored in a DynamicFrame to a relational (rows and columns) format. Nested structures are unnested into top-level columns, and arrays are decomposed into different tables with appropriate primary and foreign keys inserted. The result is a collection of DynamicFrames representing a set of tables that can be directly inserted into a relational database. More detail about relationalize can be found here.
    ## An example relationalizing and writing to Redshift
    dfc = history.relationalize("hist_root", redshift_temp_dir)
    ## Cycle through results and write to Redshift.
    for df_name in dfc.keys():
        df = dfc.select(df_name)
        print("Writing to Redshift table: ", df_name, " ...")
        glueContext.write_dynamic_frame.from_jdbc_conf(frame = df, 
            catalog_connection = "redshift3", 
            connection_options = {"dbtable": df_name, "database": "testdb"}, 
            redshift_tmp_dir = redshift_temp_dir)

  3. Unbox parses a string field of a certain type, such as JSON, into individual fields with their corresponding data types and stores the result in a DynamicFrame. For example, you may have a CSV file with one field that is in JSON format: {"a": 3, "b": "foo", "c": 1.2}. Unbox reformats the JSON string into three distinct fields: an int, a string, and a double. The Unbox transformation is commonly used to replace costly Python user-defined functions that reformat data and that may result in Apache Spark out-of-memory exceptions. The following example shows how to use Unbox:
    df_result = df_json.unbox('json', "json")

  4. ResolveChoice: AWS Glue Dynamic Frames support data where a column can have fields with different types. These columns are represented with the Dynamic Frame’s choice type. For example, the Dynamic Frame schema for the medicare dataset shows up as follows:
    root
     |-- drg definition: string
     |-- provider id: choice
     |    |-- long
     |    |-- string
     |-- provider name: string
     |-- provider street address: string

    This is because the “provider id” column could be either a long or a string type. The Apache Spark DataFrame considers the whole dataset and is forced to cast it to the most general type, namely string. Dynamic Frames allow you to cast the type using the ResolveChoice transform. For example, you can cast the column to long type as follows.

    medicare_res = medicare_dynamicframe.resolveChoice(specs = [('provider id','cast:long')])
    
    medicare_res.printSchema()
     
    root
     |-- drg definition: string
     |-- provider id: long
     |-- provider name: string
     |-- provider street address: string

    This transform also inserts a null where the value was a string that could not be cast. As a result, you can now identify the records whose string values were cast to null. Alternatively, the choice type can be cast to struct, which keeps values of both types.
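To build intuition for what Unbox and ResolveChoice do to individual values, here is a plain-Python illustration of the semantics described above. It does not use or reproduce the AWS Glue API. The function names and sample records are hypothetical, and the real transforms operate on DynamicFrames rather than on dictionaries.

```python
import json


def unbox_json(record, field):
    """Replace a JSON-encoded string field with its parsed, typed contents."""
    out = dict(record)
    out[field] = json.loads(record[field])
    return out


def resolve_choice_cast_long(value):
    """Cast a long-or-string value to an integer, or None when it cannot be cast."""
    if isinstance(value, bool):
        return None
    if isinstance(value, int):
        return value
    try:
        return int(str(value).strip())
    except ValueError:
        return None


# Unbox: one string field becomes an int, a string, and a double.
row = {"id": 1, "json": '{"a": 3, "b": "foo", "c": 1.2}'}
print(unbox_json(row, "json"))
# {'id': 1, 'json': {'a': 3, 'b': 'foo', 'c': 1.2}}

# ResolveChoice with cast:long: strings that are not numbers become None (null).
print([resolve_choice_cast_long(v) for v in [10001, "10005", "n/a"]])
# [10001, 10005, None]
```

The null results are what make the uncastable records easy to find afterward, for example by filtering on the cast column.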

Build and orchestrate data pipelines using AWS Glue Workflows

AWS Glue Workflows provide a visual tool to author data pipelines by combining Glue crawlers for schema discovery with Glue Spark and Python jobs to transform the data. Relationships can be defined and parameters passed between task nodes to enable users to build pipelines of varying complexity. Workflows can run on a schedule or be triggered programmatically. You can track the progress of each node independently or of the entire workflow, making it easier to troubleshoot your pipelines.

A typical workflow for ETL workloads is organized as follows:

  1. A Glue Python command is triggered manually, on a schedule, or on an external CloudWatch event. It pre-processes or lists the partitions in Amazon S3 for a table under a base location. For example, a CloudTrail logs partition to process could be: s3://AWSLogs/ACCOUNTID/CloudTrail/REGION/YEAR/MONTH/DAY/HOUR/. The Python command can list all the Regions and schedule crawlers to create different Glue Data Catalog tables for each Region.
  2. Glue Crawlers are triggered next to populate new partitions for every hour in the Glue Data Catalog for data recently ingested into Amazon S3.
  3. Concurrent Glue ETL jobs are triggered to separately filter and process each partition or a group of partitions. For example, CloudTrail events corresponding to the last week can be read by a Glue ETL job by passing in the partition prefix as Glue job parameters and using Glue ETL push down predicates to read only the partitions in that prefix. Partitioning and orchestrating concurrent Glue ETL jobs allows you to scale and reliably execute individual Apache Spark applications by processing only a subset of partitions in the Glue Data Catalog table. The transformed data can then be concurrently written back by all individual Glue ETL jobs to a common target table in an Amazon S3 data lake, Amazon Redshift, or other databases.

Finally, a Glue Python command can be triggered to capture the completion status of the different Glue entities, including Glue Crawlers and parallel Glue ETL jobs, and to post-process or retry any failed components.
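The first three steps of this workflow can be wired together programmatically. The following is a minimal sketch, assuming `glue` behaves like `boto3.client("glue")` and that the named jobs and crawler already exist. The function name, trigger names, and job names are hypothetical, and the final post-processing step is left out for brevity.

```python
def build_workflow(glue, workflow, lister_job, crawler, etl_jobs):
    """Create a workflow: lister job -> crawler -> concurrent ETL jobs.

    `glue` is expected to behave like boto3.client("glue").
    Returns the names of the triggers that were created.
    """
    glue.create_workflow(Name=workflow)

    triggers = [
        # Step 1: an on-demand trigger starts the Python job that lists partitions.
        dict(
            Name=f"{workflow}-start",
            WorkflowName=workflow,
            Type="ON_DEMAND",
            Actions=[{"JobName": lister_job}],
        ),
        # Step 2: when the lister job succeeds, run the crawler.
        dict(
            Name=f"{workflow}-crawl",
            WorkflowName=workflow,
            Type="CONDITIONAL",
            StartOnCreation=True,
            Predicate={"Conditions": [
                {"LogicalOperator": "EQUALS", "JobName": lister_job,
                 "State": "SUCCEEDED"},
            ]},
            Actions=[{"CrawlerName": crawler}],
        ),
        # Step 3: when the crawler succeeds, start all ETL jobs concurrently.
        dict(
            Name=f"{workflow}-etl",
            WorkflowName=workflow,
            Type="CONDITIONAL",
            StartOnCreation=True,
            Predicate={"Conditions": [
                {"LogicalOperator": "EQUALS", "CrawlerName": crawler,
                 "CrawlState": "SUCCEEDED"},
            ]},
            Actions=[{"JobName": job} for job in etl_jobs],
        ),
    ]
    for trigger in triggers:
        glue.create_trigger(**trigger)
    return [trigger["Name"] for trigger in triggers]
```

Because each ETL job is a separate action on the same trigger, the jobs start together once the crawler finishes. To run the workflow on a schedule instead, the first trigger could use the `SCHEDULED` type with a schedule expression.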

Executing SQL using SparkSQL in AWS Glue

AWS Glue Data Catalog as Hive Compatible Metastore

The AWS Glue Data Catalog is a managed metadata repository compatible with the Apache Hive Metastore API. You can follow the detailed instructions here to configure your AWS Glue ETL jobs and development endpoints to use the Glue Data Catalog. You also need to add the Hive SerDes to the class path of AWS Glue Jobs to serialize/deserialize data for the corresponding formats. You can then natively run Apache Spark SQL queries against your tables stored in the Data Catalog.

The following example assumes that you have crawled the US legislators dataset available at s3://awsglue-datasets/examples/us-legislators. We’ll use the Spark shell running on an AWS Glue development endpoint to execute SparkSQL queries directly on the legislators’ tables cataloged in the AWS Glue Data Catalog.

>>> spark.sql("use legislators")
DataFrame[]
>>> spark.sql("show tables").show()
+-----------+------------------+-----------+
|   database|         tableName|isTemporary|
+-----------+------------------+-----------+
|legislators|        areas_json|      false|
|legislators|    countries_json|      false|
|legislators|       events_json|      false|
|legislators|  memberships_json|      false|
|legislators|organizations_json|      false|
|legislators|      persons_json|      false|
+-----------+------------------+-----------+

>>> spark.sql("select distinct organization_id from memberships_json").show()
+--------------------+
|     organization_id|
+--------------------+
|d56acebe-8fdc-47b...|
|8fa6c3d2-71dc-478...|
+--------------------+

A similar approach is to use the AWS Glue DynamicFrame API to read the data from S3. The DynamicFrame is then converted to a Spark DataFrame using the toDF method. Next, a temporary view can be registered for the DataFrame, which can be queried using SparkSQL. The key difference between the two approaches is the use of Hive SerDes for the first approach, and native Glue/Spark readers for the second approach. The native Glue/Spark readers provide performance and flexibility benefits such as computation of the schema at runtime, schema evolution, and job bookmarks support for Glue Dynamic Frames.

>>> memberships = glueContext.create_dynamic_frame.from_catalog(database="legislators", table_name="memberships_json")
>>> memberships.toDF().createOrReplaceTempView("memberships")
>>> spark.sql("select distinct organization_id from memberships").show()
+--------------------+
|     organization_id|
+--------------------+
|d56acebe-8fdc-47b...|
|8fa6c3d2-71dc-478...|
+--------------------+

Workflows and S3 Consistency

If you have a workflow of external processes ingesting data into S3, or upstream AWS Glue jobs generating input for a table used by downstream jobs in a workflow, you can encounter the following Apache Spark errors.

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 16.0 failed 4 times, most recent failure: Lost task 10.3 in stage 16.0 (TID 761, ip-<>.ec2.internal, executor 1): 
java.io.FileNotFoundException: No such file or directory 's3://<bucket>/fileprefix-c000.snappy.parquet'
It is possible the underlying files have been updated.
You can explicitly invalidate the cache in Spark by running 
'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.

These errors happen when the upstream jobs overwrite the same S3 objects that the downstream jobs are concurrently listing or reading. This can also happen due to the eventual consistency of S3, where overwritten or deleted objects are updated at a later time, while the downstream jobs are reading them. A common manifestation of this error occurs when you create a SparkSQL view and execute SQL queries in the downstream job. To avoid these errors, the best practice is to set up a workflow with upstream and downstream jobs scheduled at different times, and to read from and write to different S3 partitions based on time.
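One way to keep upstream writes and downstream reads apart is to derive both S3 prefixes from the clock. The sketch below assumes an hourly YEAR/MONTH/DAY/HOUR layout similar to the CloudTrail example earlier in this post. The function names and the bucket name are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def hourly_prefix(base, when):
    """S3 prefix for the hour containing `when`, laid out as base/YYYY/MM/DD/HH/."""
    return f"{base.rstrip('/')}/{when:%Y/%m/%d/%H}/"


def upstream_and_downstream_prefixes(base, now, lag_hours=1):
    """Upstream writes the current hour. Downstream reads an hour that is closed."""
    if lag_hours < 1:
        raise ValueError("lag_hours must be at least 1 so the prefixes differ")
    write_prefix = hourly_prefix(base, now)
    read_prefix = hourly_prefix(base, now - timedelta(hours=lag_hours))
    return write_prefix, read_prefix


now = datetime(2020, 6, 1, 0, 30, tzinfo=timezone.utc)
print(upstream_and_downstream_prefixes("s3://example-bucket/events", now))
# ('s3://example-bucket/events/2020/06/01/00/', 's3://example-bucket/events/2020/05/31/23/')
```

Because the downstream job only ever reads a partition that the upstream job has finished writing, the two jobs never list or read the same objects at the same time.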

You can also enable the S3-optimized output committer for your Glue jobs by passing in a special job parameter: “--enable-s3-parquet-optimized-committer” set to true. This committer improves application performance by avoiding list and rename operations in Amazon S3 during job and task commit phases. It also avoids issues that can occur with Amazon S3’s eventual consistency during job and task commit phases, and helps to minimize task failures.
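As a sketch of how that parameter might be passed when starting a job run, the following helper assumes `glue` behaves like `boto3.client("glue")`. The helper name and the job name are hypothetical. The parameter is written with two leading hyphens, as Glue job parameters are.

```python
def start_job_with_optimized_committer(glue, job_name, extra_args=None):
    """Start a Glue job run with the S3-optimized Parquet committer enabled.

    `glue` is expected to behave like boto3.client("glue").
    Returns the ID of the job run that was started.
    """
    # Copy the caller's arguments so the input dictionary is not modified.
    arguments = dict(extra_args or {})
    # Glue job parameter values are passed as strings.
    arguments["--enable-s3-parquet-optimized-committer"] = "true"
    response = glue.start_job_run(JobName=job_name, Arguments=arguments)
    return response["JobRunId"]
```

You can also set the same parameter once in the job definition so that every run picks it up, rather than passing it on each run.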

Conclusion

In this post, we discussed how to leverage the automatic code generation process in AWS Glue ETL to simplify common data manipulation tasks such as data type conversion and flattening complex structures. We also explored using AWS Glue Workflows to build and orchestrate data pipelines of varying complexity. Lastly, we looked at how you can leverage the power of SQL, with the use of AWS Glue ETL and Glue Data Catalog, to query and transform your data.

In the final post, we will explore specific capabilities in AWS Glue and best practices to help you better manage the performance, scalability and operation of AWS Glue Apache Spark jobs.

About the Authors

Mohit Saxena is a technical lead manager at AWS Glue. His passion is building scalable distributed systems for efficiently managing data on cloud. He also enjoys watching movies, and reading about the latest technology.

 

 

IAM Access Analyzer flags unintended access to S3 buckets shared through access points

Post Syndicated from Andrea Nedic original https://aws.amazon.com/blogs/security/iam-access-analyzer-flags-unintended-access-to-s3-buckets-shared-through-access-points/

Customers use Amazon Simple Storage Service (S3) buckets to store critical data and manage access to data at scale. With Amazon S3 Access Points, customers can easily manage shared data sets by creating separate access points for individual applications. Access points are unique hostnames attached to a bucket and customers can set distinct permissions using access point policies. To help you identify buckets that can be accessed publicly or from other AWS accounts or organizations, AWS Identity and Access Management (IAM) Access Analyzer mathematically analyzes resource policies. Now, Access Analyzer analyzes access point policies in addition to bucket policies and bucket ACLs. This helps you find unintended access to S3 buckets that use access points. Access Analyzer makes it easier to identify and remediate unintended public, cross-account, or cross-organization sharing of your S3 buckets that use access points. This enables you to restrict bucket access and adhere to the security best practice of least privilege.

In this post, first I review Access Analyzer and how to enable it. Then I walk through an example of how to use Access Analyzer to identify an S3 bucket that is shared through an access point. Finally, I show you how to view Access Analyzer bucket findings in the S3 Management Console.

IAM Access Analyzer overview

Access Analyzer helps you determine which resources can be accessed publicly or from other accounts or organizations. Access Analyzer determines this by mathematically analyzing access control policies attached to resources. This form of analysis, called automated reasoning, applies logic and mathematical inference to determine all possible access paths allowed by a resource policy. This is how IAM Access Analyzer uses provable security to deliver comprehensive findings for unintended bucket access. You can enable Access Analyzer by navigating to the IAM console. From there, select Access Analyzer to create an analyzer for an account or an organization.
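If you prefer the API over the console, an analyzer can also be created with the AWS SDK for Python (boto3). The following is a minimal sketch; the function name and the analyzer name account-analyzer are assumptions made for this example.

```python
def create_account_analyzer(client, name="account-analyzer"):
    """Create an account-level analyzer and return its ARN.

    client is assumed to be boto3.client("accessanalyzer").
    Use type="ORGANIZATION" instead to analyze a whole organization.
    """
    response = client.create_analyzer(analyzerName=name, type="ACCOUNT")
    return response["arn"]


# Example (requires AWS credentials):
#   import boto3
#   print(create_account_analyzer(boto3.client("accessanalyzer")))
```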

How to use IAM Access Analyzer to identify an S3 bucket shared through an access point

Once you’ve created your analyzer, you can view findings for resources that can be accessed publicly or from other AWS accounts or organizations. For your S3 bucket findings, the Shared through column indicates whether a bucket is shared through its S3 bucket policy, one of its access points, or the bucket ACL. Looking at the Shared through column in the image below, we see the first finding is shared through an Access point.

Figure 1: IAM Access Analyzer report of findings for resources shared outside of my account

If you use access points to manage bucket access and one of your buckets is shared through an access point, the bucket finding indicates ‘Access Point’. In this example, I select the first finding to learn more. In the detail image below, you can see that the Shared through field lists the Amazon Resource Name (ARN) of the access point that grants access to the bucket and the details of the resources and principals. If this access wasn’t your intent, you can review the access point details in the S3 console. There you can modify the access point policy to remove access.
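The same information is available programmatically. The sketch below assumes the finding structure returned by the Access Analyzer ListFindings API, where each finding carries a sources list whose entries have a type (such as POLICY, BUCKET_ACL, or S3_ACCESS_POINT) and, for access points, a detail with the access point ARN. The function names and display labels are my own.

```python
def shared_through(finding):
    """Return display labels for the mechanisms that share a bucket finding."""
    labels = []
    for source in finding.get("sources", []):
        if source["type"] == "S3_ACCESS_POINT":
            arn = source.get("detail", {}).get("accessPointArn", "unknown")
            labels.append(f"Access point ({arn})")
        elif source["type"] == "BUCKET_ACL":
            labels.append("Bucket ACL")
        elif source["type"] == "POLICY":
            labels.append("Bucket policy")
    return labels


def list_bucket_findings(client, analyzer_arn):
    """Yield active S3 bucket findings; client is boto3.client("accessanalyzer")."""
    paginator = client.get_paginator("list_findings")
    pages = paginator.paginate(
        analyzerArn=analyzer_arn,
        filter={
            "resourceType": {"eq": ["AWS::S3::Bucket"]},
            "status": {"eq": ["ACTIVE"]},
        },
    )
    for page in pages:
        yield from page["findings"]
```

Calling shared_through on each finding from list_bucket_findings gives a rough equivalent of the Shared through column.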

Figure 2: IAM Access Analyzer finding details for a bucket shared through an access point

How to use Access Analyzer for S3 to identify an S3 bucket shared through an access point

You can also view Access Analyzer findings for S3 buckets in the S3 Management Console with Access Analyzer for S3. This view reports S3 buckets that are configured to allow access to anyone on the internet or to other AWS accounts, including accounts outside of your AWS organization. For each public or shared bucket, Access Analyzer for S3 displays whether the bucket is shared through the bucket policy, access points, or the bucket ACL. In the example below, we see that my-test-public-bucket allows public access through a bucket policy and a bucket ACL. Additionally, my-test-bucket is shared with other AWS accounts through a bucket policy and one or more access points. After you identify a bucket with unintended access using Access Analyzer for S3, you can block public access to the bucket. Amazon S3 block public access settings override the bucket policies that are applied to the bucket. The settings also override the access point policies applied to the bucket’s access points.
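For reference, blocking public access on a single bucket can also be done with the AWS SDK for Python (boto3). This is a sketch rather than a recommendation for every bucket: it turns on all four settings, and the helper name is an assumption of this example.

```python
# All four Block Public Access settings enabled.
BLOCK_ALL_PUBLIC_ACCESS = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}


def block_public_access(s3_client, bucket_name):
    """Apply all Block Public Access settings to one bucket.

    s3_client is assumed to be boto3.client("s3").
    """
    s3_client.put_public_access_block(
        Bucket=bucket_name,
        PublicAccessBlockConfiguration=BLOCK_ALL_PUBLIC_ACCESS,
    )


# Example (requires AWS credentials):
#   import boto3
#   block_public_access(boto3.client("s3"), "my-test-public-bucket")
```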

Figure 3: Access Analyzer for S3 findings report in the S3 Management Console

Next steps

To turn on IAM Access Analyzer at no additional cost, head over to the IAM console. IAM Access Analyzer is available in the IAM console and through APIs in all commercial AWS Regions and AWS GovCloud (US). To learn more about IAM Access Analyzer, visit the feature page.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS IAM Forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Andrea Nedic

Andrea is a Senior Technical Program Manager in the AWS Automated Reasoning Group. She enjoys hearing from customers about how they build on AWS. Outside of work, Andrea likes to ski, dance, and be outdoors. She holds a PhD from Princeton University.

How Siemens built a fully managed scheduling mechanism for updates on Amazon S3 data lakes

Post Syndicated from Pedro Bento original https://aws.amazon.com/blogs/big-data/how-siemens-built-a-fully-managed-scheduling-mechanism-for-consistent-updates-on-amazon-s3-data-lakes/

Siemens is a global technology leader with more than 370,000 employees and 170 years of experience. To protect Siemens from cybercrime, the Siemens Cyber Defense Center (CDC) continuously monitors Siemens’ networks and assets. To handle the resulting enormous data load, the CDC built a next-generation threat detection and analysis platform called ARGOS. ARGOS is a hybrid-cloud solution that makes heavy use of fully managed AWS services for streaming, big data processing, and machine learning.

Users such as security analysts, data scientists, threat intelligence teams, and incident handlers continuously access data in the ARGOS platform. Further, various automated components update, extend, and remove data to enrich information, improve data quality, enforce PII requirements, or mutate data due to schema evolution or additional data normalization requirements. Keeping the data always available and consistent presents multiple challenges.

While object-based data lakes are highly beneficial from a cost perspective compared to traditional transactional databases in such scenarios, they hardly allow for atomic updates or require highly complex and costly extensions. To overcome this problem, Siemens designed a solution that enables atomic file updates on Amazon S3-based data lakes without compromising query performance and availability.

This post presents this solution, which is an easy-to-use scheduling service for S3 data update tasks. Siemens uses it for multiple purposes, including pseudonymization, anonymization, and removal of sensitive data. This post demonstrates how to use the solution to remove values from a dataset after a predefined amount of time. Adding further data processing tasks is straightforward because the solution has a well-defined architecture and the whole stack consists of fewer than 200 lines of source code. It is solely based on fully managed AWS services and therefore achieves minimal operational overhead.

Architecture overview

This post uses an S3-based data lake with continuous data ingestion and Amazon Athena as the query mechanism. The goal is to automatically remove certain values a predefined time after ingestion. Applications and users consuming the data via Athena are not impacted (for example, they do not observe downtimes or data quality issues like duplication).

The following diagram illustrates the architecture of this solution.

Siemens built the solution with the following services and components:

  1. Scheduling trigger – New data (for example, in JSON format) is continuously uploaded to an S3 bucket.
  2. Task scheduling – As soon as new files land, an AWS Lambda function processes the resulting S3 bucket notification events. As part of the processing, it creates a new item on Amazon DynamoDB that specifies a Time to Live (TTL) and the path to that S3 object.
  3. Task execution trigger – When the TTL expires, the DynamoDB item is deleted from the table and the DynamoDB stream triggers a Lambda function that processes the S3 object at that path.
  4. Task execution – The Lambda function derives meta information (like the relevant S3 path) from the TTL expiration event and processes the S3 object. Finally, the new S3 object replaces the older version.
  5. Data usage – The updated data is available for querying from Athena without further manual processing, and uses S3’s eventual consistency on read operations.

About DynamoDB Streams and TTL

TTL for DynamoDB lets you define when items in a table expire so they can be deleted from the database automatically. TTL comes at no extra cost and reduces the cost of storing irrelevant data without consuming provisioned throughput. After you enable TTL on a table, you can set a deletion timestamp on a per-item basis, which limits storage usage to only those records that are relevant.

Solution overview

To implement this solution manually, complete the following steps:

  1. Create a DynamoDB table and configure DynamoDB Streams.
  2. Create a Lambda function to insert TTL records.
  3. Configure an S3 event notification on the target bucket.
  4. Create a Lambda function that performs data processing tasks.
  5. Use Athena to query the processed data.

If you want to deploy the solution automatically, you may skip these steps and use the provided AWS CloudFormation template.

Prerequisites

To complete this walkthrough, you must have the following:

  • An AWS account with access to the AWS Management Console.
  • A role with access to S3, DynamoDB, Lambda, and Athena.

Creating a DynamoDB table and configuring DynamoDB Streams

Start with the time-based trigger setup. For this, you use S3 notifications, DynamoDB Streams, and a Lambda function to integrate both services. The DynamoDB table stores the items to process after a predefined time.

Complete the following steps:

  1. On the DynamoDB console, create a table.
  2. For Table name, enter objects-to-process.
  3. For Primary key, enter path and choose String.
  4. Select the table and click on Manage TTL next to “Time to live attribute” under table details.
  5. For TTL attribute, enter ttl.
  6. For DynamoDB Streams, choose Enable with view type New and old images.

Note that DynamoDB lets you enable TTL on a non-numeric attribute, but expiration only works when the attribute holds a number (a Unix epoch timestamp in seconds).

The DynamoDB TTL is not minute-precise. Expired items are typically deleted within 48 hours of expiration. However, you may experience shorter deviations of only 10–30 minutes from the actual TTL value. For more information, see Time to Live: How It Works.
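The console steps above can be approximated with the AWS SDK for Python (boto3). This is a sketch under a few assumptions: on-demand billing is my choice (the post does not specify a capacity mode), and the function name is invented for the example.

```python
TABLE_NAME = "objects-to-process"


def create_table_with_ttl(dynamodb_client, table_name=TABLE_NAME):
    """Create the table with a stream, then enable TTL on the 'ttl' attribute.

    dynamodb_client is assumed to be boto3.client("dynamodb").
    """
    dynamodb_client.create_table(
        TableName=table_name,
        AttributeDefinitions=[{"AttributeName": "path", "AttributeType": "S"}],
        KeySchema=[{"AttributeName": "path", "KeyType": "HASH"}],
        BillingMode="PAY_PER_REQUEST",  # assumption: on-demand capacity
        StreamSpecification={
            "StreamEnabled": True,
            "StreamViewType": "NEW_AND_OLD_IMAGES",
        },
    )
    # TTL can only be configured once the table is active.
    dynamodb_client.get_waiter("table_exists").wait(TableName=table_name)
    dynamodb_client.update_time_to_live(
        TableName=table_name,
        TimeToLiveSpecification={"Enabled": True, "AttributeName": "ttl"},
    )
```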

Creating a Lambda function to insert TTL records

The first Lambda function you create is for scheduling tasks. It receives an S3 notification as input, recreates the S3 path (for example, s3://<bucket>/<key>), and creates a new item on DynamoDB with two attributes: the S3 path and the TTL (an epoch timestamp in seconds). For more information about a similar S3 notification event structure, see Test the Lambda Function.

To deploy the Lambda function, on the Lambda console, create a function named NotificationFunction with the Python 3.7 runtime and the following code:

import time
from urllib.parse import unquote_plus

import boto3

# Time to wait before the object is processed, in seconds (default: 300, or 5 minutes)
default_ttl = 300

s3_client = boto3.client('s3')
table = boto3.resource('dynamodb').Table('objects-to-process')

def parse_bucket_and_key(s3_notif_event):
    s3_record = s3_notif_event['Records'][0]['s3']
    # S3 URL-encodes object keys in notification events, so decode them first
    return s3_record['bucket']['name'], unquote_plus(s3_record['object']['key'])

def lambda_handler(event, context):
    try:
        bucket_name, key = parse_bucket_and_key(event)
        head_obj = s3_client.head_object(Bucket=bucket_name, Key=key)
        tags = s3_client.get_object_tagging(Bucket=bucket_name, Key=key)
        # Schedule only non-empty objects without tags (processed objects are tagged)
        if head_obj['ContentLength'] > 0 and len(tags['TagSet']) == 0:
            record_path = f"s3://{bucket_name}/{key}"
            table.put_item(Item={'path': record_path, 'ttl': int(time.time()) + default_ttl})
    except Exception as error:
        # Log and skip events that cannot be scheduled instead of failing the invocation
        print(f"Skipping event: {error}")
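One detail worth checking locally is how object keys arrive in the event. S3 URL-encodes keys in notification events, so a key containing a space is delivered with a plus sign. The sketch below uses a hand-written, heavily trimmed event (real events carry many more fields) and a helper name invented for this example.

```python
from urllib.parse import unquote_plus


def s3_path_from_event(event):
    """Build the s3://<bucket>/<key> path from the first record of an S3 event."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])  # decode '+' and %XX sequences
    return f"s3://{bucket}/{key}"


# Trimmed sample event for an object uploaded as "dataset/my file.json"
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "data-bucket"},
                "object": {"key": "dataset/my+file.json"}}}
    ]
}

print(s3_path_from_event(sample_event))  # s3://data-bucket/dataset/my file.json
```

Without the decoding step, the stored path would not match the real object key, and later reads of that object would fail.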

Configuring S3 event notifications on the target bucket

You can take advantage of the scalability, security, and performance of S3 by using it as a data lake for storing your datasets. Additionally, you can use S3 event notifications to capture S3-related events, such as the creation or deletion of objects within a bucket. You can forward these events to other AWS services, such as Lambda.

To configure S3 event notifications, complete the following steps:

  1. On the S3 console, create an S3 bucket named data-bucket.
  2. Click on the bucket and go to the Properties tab.
  3. Under Advanced Settings, choose Events and add a notification.
  4. For Name, enter MyEventNotification.
  5. For Events, select All object create events.
  6. For Prefix, enter dataset/.
  7. For Send to, choose Lambda Function.
  8. For Lambda, choose NotificationFunction.

This configuration restricts the scheduling to events that happen within your previously defined dataset. For more information, see How Do I Enable and Configure Event Notifications for an S3 Bucket?
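The same notification can be expressed through the AWS SDK for Python (boto3). The following sketch mirrors the console values above; the function names are assumptions, and the Lambda function ARN has to be supplied by you.

```python
def notification_configuration(lambda_arn):
    """Build the notification settings used in the console steps above."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "Id": "MyEventNotification",
                "LambdaFunctionArn": lambda_arn,
                "Events": ["s3:ObjectCreated:*"],  # all object create events
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": "dataset/"}]}
                },
            }
        ]
    }


def configure_bucket(s3_client, bucket_name, lambda_arn):
    """Attach the notification to the bucket; s3_client is boto3.client("s3")."""
    s3_client.put_bucket_notification_configuration(
        Bucket=bucket_name,
        NotificationConfiguration=notification_configuration(lambda_arn),
    )
```

Unlike the console, this API call does not grant S3 permission to invoke the function, so a resource-based permission on the Lambda function is needed separately. The call also replaces any notification configuration that already exists on the bucket.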

Creating a Lambda function that performs data processing tasks

You have now created a time-based trigger for the deletion of the record in the DynamoDB table. However, when the system delete occurs and the change is recorded in DynamoDB Streams, no further action is taken. Lambda can poll the stream to detect these change records and trigger a function to process them according to the activity (INSERT, MODIFY, REMOVE).

This post is only concerned with deleted items because it uses the DynamoDB TTL feature together with DynamoDB Streams to trigger task executions. Lambda gives you the flexibility to either process the item by itself or to forward the processing effort to somewhere else (such as an AWS Glue job or an Amazon SQS queue).

This post uses Lambda directly to process the S3 objects. The Lambda function performs the following tasks:

  1. Gets the S3 object from the DynamoDB item’s S3 path attribute.
  2. Modifies the object’s data.
  3. Overwrites the old S3 object with the updated content and tags the object as processed.

Complete the following steps:

  1. On the Lambda console, create a function named JSONProcessingFunction with Python 3.7 as the runtime and the following code:
    import json
    from functools import partial
    from urllib.parse import urlparse

    import boto3

    s3 = boto3.resource('s3')

    def parse_bucket_and_key(s3_url_as_string):
        # Split s3://<bucket>/<key> into the bucket name and the key
        s3_path = urlparse(s3_url_as_string)
        return s3_path.netloc, s3_path.path[1:]

    def extract_s3path_from_dynamo_event(event):
        # Only REMOVE records (such as TTL expirations) are relevant; returns None otherwise
        if event["Records"][0]["eventName"] == "REMOVE":
            return event["Records"][0]["dynamodb"]["Keys"]["path"]["S"]

    def modify_json(json_dict, column_name, value):
        json_dict[column_name] = value
        return json_dict

    def get_obj_contents(bucketname, key):
        # Stream the object line by line (one JSON document per line)
        obj = s3.Object(bucketname, key)
        return obj.get()['Body'].iter_lines()

    clean_column_2_func = partial(modify_json, column_name="file_contents", value="")

    def lambda_handler(event, context):
        s3_url_as_string = extract_s3path_from_dynamo_event(event)
        if s3_url_as_string:
            bucket_name, key = parse_bucket_and_key(s3_url_as_string)
            records = map(json.loads, get_obj_contents(bucket_name, key))
            updated_json = "\n".join(json.dumps(clean_column_2_func(record)) for record in records)
            # Overwrite the object in place and tag it so it is not scheduled again
            s3.Object(bucket_name, key).put(Body=updated_json, Tagging="PROCESSED=True")
        else:
            print(f"Invalid event: {str(event)}")

  2. On the Lambda function configuration webpage, click on Add trigger.
  3. For Trigger configuration, choose DynamoDB.
  4. For DynamoDB table, choose objects-to-process.
  5. For Batch size, enter 1.
  6. For Batch window, enter 0.
  7. For Starting position, choose Trim horizon.
  8. Select Enable trigger.

You use batch size = 1 because each S3 object represented on the DynamoDB table is typically large. If these files are small, you can use a larger batch size. The batch size is essentially the number of files that your Lambda function processes at a time.
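If you do raise the batch size, the handler has to look at every record in the event rather than only the first one. The sketch below shows one way to collect the paths. It also narrows the selection to deletions performed by TTL, which DynamoDB marks in the stream record with a userIdentity of type Service and principal dynamodb.amazonaws.com. The function names are hypothetical, and the sample event is trimmed to the fields used here.

```python
def is_ttl_expiry(record):
    """True when a stream record describes a deletion performed by DynamoDB TTL."""
    identity = record.get("userIdentity", {})
    return (
        record.get("eventName") == "REMOVE"
        and identity.get("type") == "Service"
        and identity.get("principalId") == "dynamodb.amazonaws.com"
    )


def expired_paths(event):
    """Return the S3 paths of all TTL-expired items in a stream event."""
    return [
        record["dynamodb"]["Keys"]["path"]["S"]
        for record in event.get("Records", [])
        if is_ttl_expiry(record)
    ]


ttl_identity = {"type": "Service", "principalId": "dynamodb.amazonaws.com"}
sample_event = {
    "Records": [
        {"eventName": "REMOVE", "userIdentity": ttl_identity,
         "dynamodb": {"Keys": {"path": {"S": "s3://data-bucket/dataset/a.json"}}}},
        {"eventName": "INSERT",
         "dynamodb": {"Keys": {"path": {"S": "s3://data-bucket/dataset/b.json"}}}},
        {"eventName": "REMOVE",  # manual delete, no service identity
         "dynamodb": {"Keys": {"path": {"S": "s3://data-bucket/dataset/c.json"}}}},
    ]
}

print(expired_paths(sample_event))  # ['s3://data-bucket/dataset/a.json']
```

Whether manually deleted items should also trigger processing is a design decision; dropping the identity check restores the behavior of treating every REMOVE record alike.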

Because any new object on S3 (in a versioning-enabled bucket) triggers an object creation event, even if its key already exists, you must make sure that your task scheduling Lambda function ignores any object creation events that your task execution function generates. Otherwise, the two functions create an infinite loop. This post uses tags on S3 objects: when the task execution function processes an object, it adds a processed tag. The task scheduling function ignores those objects in subsequent executions.

Using Athena to query the processed data

The final step is to create a table for Athena to query the data. You can do this manually or by using an AWS Glue crawler that infers the schema directly from the data and automatically creates the table for you. This post uses a crawler because it can handle schema changes and add new partitions automatically. To create this crawler, use the following code:

aws glue create-crawler --name data-crawler \
--role <AWSGlueServiceRole-crawler> \
--database-name data_db \
--description 'crawl data bucket!' \
--targets \
"{\
  \"S3Targets\": [\
    {\
      \"Path\": \"s3://<data-bucket>/dataset/\"\
    }\
  ]\
}"

Replace <AWSGlueServiceRole-crawler> and <data-bucket> with the name of your AWSGlueServiceRole and S3 bucket, respectively.
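For consistency with the Lambda code, the crawler can also be created and started with the AWS SDK for Python (boto3). This sketch reuses the names from the CLI command above; the function name is an assumption, and the role and bucket are values you supply.

```python
CRAWLER_NAME = "data-crawler"


def create_and_start_crawler(glue_client, role, bucket_name):
    """Create the crawler for the dataset/ prefix and start a first run.

    glue_client is assumed to be boto3.client("glue").
    """
    glue_client.create_crawler(
        Name=CRAWLER_NAME,
        Role=role,
        DatabaseName="data_db",
        Description="crawl data bucket!",
        Targets={"S3Targets": [{"Path": f"s3://{bucket_name}/dataset/"}]},
    )
    # Creating a crawler does not run it; start it explicitly.
    glue_client.start_crawler(Name=CRAWLER_NAME)
```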

When the crawling process is complete, you can start querying the data. You can use the Athena console to interact with the table while its underlying data is being transparently updated. See the following code:

SELECT * FROM data_db.dataset LIMIT 1000
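To run the same query outside the console, a small polling helper around the Athena API is enough. The following is a sketch, not production code: the function name, the polling interval, and the results location are assumptions, and only the first page of results is returned.

```python
import time


def run_query(athena_client, sql, output_location, poll_seconds=1.0):
    """Run a query against data_db and return the first page of result rows.

    athena_client is assumed to be boto3.client("athena"); output_location is
    an S3 URI where Athena may write query results.
    """
    query_id = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "data_db"},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Query {query_id} ended in state {state}")
    return athena_client.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]


# Example (requires AWS credentials; the results bucket is a placeholder):
#   import boto3
#   rows = run_query(boto3.client("athena"),
#                    "SELECT * FROM data_db.dataset LIMIT 1000",
#                    "s3://<query-results-bucket>/")
```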

Automated setup

You can use the following AWS CloudFormation template to create the solution described in this post in your AWS account. To launch the template, choose the following link:

This CloudFormation stack requires the following parameters:

  • Stack name – A meaningful name for the stack, for example, data-updater-solution.
  • Bucket name – The name of the S3 bucket to use for the solution. The stack creation process creates this bucket.
  • Time to Live – The number of seconds after which items on the DynamoDB table expire. Referenced S3 objects are processed on item expiration.

Stack creation takes up to a few minutes. Check and refresh the AWS CloudFormation Resources tab to monitor the process while it is running.

When the stack shows the state CREATE_COMPLETE, you can start using the solution.

Testing the solution

To test the solution, download the mock_uploaded_data.json dataset created with the Mockaroo data generator. The use case is a web service in which users can upload files. The goal is to delete those files some predefined time after the upload to reduce storage and query costs. To this end, the provided code looks for the attribute file_contents and replaces its value with an empty string.
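Before uploading anything, you can check the transformation locally. The sketch below applies the same replacement to newline-delimited JSON in memory; the sample records and the id field are made up for this example and are not taken from the mock dataset.

```python
import json


def blank_file_contents(json_lines):
    """Replace file_contents with an empty string in newline-delimited JSON."""
    cleaned = []
    for line in json_lines:
        if not line.strip():
            continue  # skip empty lines
        record = json.loads(line)
        record["file_contents"] = ""
        cleaned.append(json.dumps(record))
    return "\n".join(cleaned)


# Made-up sample records
sample = [
    '{"id": 1, "file_contents": "abc"}',
    '{"id": 2, "file_contents": "def"}',
]

print(blank_file_contents(sample))
# {"id": 1, "file_contents": ""}
# {"id": 2, "file_contents": ""}
```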

You can now upload new data into your data-bucket S3 bucket under the dataset/ prefix. Your NotificationFunction Lambda function processes the resulting bucket notification event for the upload, and a new item appears on your DynamoDB table. Shortly after the predefined TTL time, the JSONProcessingFunction Lambda function processes the data and you can check the resulting changes via an Athena query.

You can also confirm that an S3 object was processed successfully if the DynamoDB item corresponding to this S3 object is no longer present in the DynamoDB table and the S3 object has the processed tag.

Conclusion

This post showed how to automatically re-process objects on S3 after a predefined amount of time by using a simple and fully managed scheduling mechanism. Because you use S3 for storage, you automatically benefit from S3’s eventual consistency model, simply by using identical keys (names) both for the original and processed objects. This way, you avoid query results with duplicate or missing data. Also, incomplete or only partially uploaded objects do not result in data inconsistencies because S3 only creates new object versions for successfully completed file transfers.

You may have previously used Spark to process objects hourly. This requires you to monitor objects that must be processed, to move and process them in a staging area, and to move them back to their actual destination. The main drawback is the final step because, due to Spark’s parallel nature, files are generated with different names and contents. That prevents direct file replacement in the dataset and leads to downtimes or potential data duplicates when data is queried during a move operation. Additionally, because each copy/delete operation could potentially fail, you have to deal with possible partially processed data manually.

From an operations perspective, AWS serverless services simplify your infrastructure. You can combine the scalability of these services with a pay-as-you-go plan to start with a low-cost POC and scale to production quickly—all with a minimal code base.

Compared to hourly Spark jobs, you could potentially reduce costs by up to 80%, which makes this solution both cheaper and simpler.

Special thanks to Karl Fuchs, Stefan Schmidt, Carlos Rodrigues, João Neves, Eduardo Dixo and Marco Henriques for their valuable feedback on this post’s content.

 


About the Authors

Pedro Completo Bento is a senior big data engineer working at Siemens CDC. He holds a Master in Computer Science from the Instituto Superior Técnico in Lisbon. He started his career as a full-stack developer, specializing later on big data challenges. Working with AWS, he builds highly reliable, performant and scalable systems on the cloud, while keeping the costs at bay. In his free time, he enjoys playing board games with his friends.

 

 

Arturo Bayo is a big data consultant at Amazon Web Services. He promotes a data-driven culture in enterprise customers around EMEA, providing specialized guidance on business intelligence and data lake projects while working with AWS customers and partners to build innovative solutions around data and analytics.

 

 

 

 

How to use KMS and IAM to enable independent security controls for encrypted data in S3

Post Syndicated from Paco Hope original https://aws.amazon.com/blogs/security/how-to-use-kms-and-iam-to-enable-independent-security-controls-for-encrypted-data-in-s3/

Typically, when you protect data in Amazon Simple Storage Service (Amazon S3), you use a combination of Identity and Access Management (IAM) policies and S3 bucket policies to control access, and you use the AWS Key Management Service (AWS KMS) to encrypt the data. This approach is well-understood, documented, and widely implemented. However, many customers want to extend the value of encryption beyond basic protection against unauthorized access to the storage layer where the data resides. They want to enforce a separation of duties between which team manages access to the storage layer and which team manages access to the encryption keys. This model ensures that configuration errors made by only one of these teams won’t compromise the data in ways that grant unauthorized access to plaintext data. For example, if the team that owns permissions to the S3 bucket mistakenly grants access to unauthorized users, when those users attempt to access objects in S3 they will fail. Why? Because the separate team who manages access to the keys didn’t grant those users access to use the keys for decryption.

You can create this kind of independent access control by combining KMS encryption with IAM policies and S3 bucket policies. When data is encrypted with a customer-managed KMS customer master key (CMK), the key’s policy acts as an independent access control. Users can be prevented from accessing the data, even though the IAM permissions and the S3 bucket policy would permit the access. Figure 1 shows a Venn diagram of the access that is required. The bucket policy, the IAM policy, and the KMS key policy all play a role. Users have permission for the data only when they are granted permissions in all three policies.
 

Figure 1: Venn diagram showing the required permissions for access

This exercise builds the resources shown in Figure 2:

  • Three AWS IAM roles
    1. A role (1) with permission to create and manage permissions on an S3 bucket (secure-bucket-admin)
    2. A role (2) with permission to create and manage permissions on a KMS master key (secure-key-admin)
    3. A role (3) with permissions to access (but not manage) a specific S3 bucket and to use (but not manage) a specific AWS KMS customer master key (authorized-users).
  • An S3 bucket (4) with a custom bucket policy (5) that only allows data to be stored if that data is encrypted with a specific KMS key. The ability to write to or read from this bucket will be restricted to the IAM role authorized-users.
  • A KMS key (6) with a specific key policy (7) that can only be used by the IAM role authorized-users and only managed by the IAM role secure-key-admin.

 

Figure 2: Architecture diagram

When you have completed this exercise, you will have:

  • Created an S3 bucket protected by IAM policies, and a bucket policy that enforces encryption.
  • Attached the IAM role authorized-users to an EC2 instance so your applications in that instance can assume that role and access encrypted data in the S3 bucket.
  • Uploaded and downloaded data from the bucket that is protected by the KMS key.
  • Demonstrated that when the KMS key policy is modified, removing access for the IAM role authorized-users, the applications on the EC2 instance no longer have access to the data in the S3 bucket.

Set things up

For simplicity, I create the S3 bucket, KMS keys, and EC2 instances all in the same region and in the same AWS account. It’s possible to use KMS keys that are owned by a different AWS account, to assume roles across accounts, and to have instances in different regions from the buckets and the keys. I discuss those variations at the end.

I assume you have at least one administrator identity available to you already: one that has broad rights for creating users, creating roles, managing KMS keys, and launching EC2 instances. I will refer to this as your “Admin identity” throughout these instructions. This can be a federated identity (for example, from your corporate identity provider or from a social identity), or it can be an AWS IAM user.

Assuming Roles

Throughout this exercise I will use IAM roles to acquire and release privileges. If you’re working from the AWS command line, you’ll need to configure your command line environment to use profiles. If you’re working from the AWS Management Console, then you’ll follow these instructions to switch role. If you haven’t worked with roles before, take a minute to follow those instructions and become familiar with it before continuing.

Step 1: Create IAM policies

First, I will create 3 policies that grant very specific sets of rights. Then, I will attach those policies to roles: two roles for administrators, and one for software running on EC2 instances. You’re going to create an S3 bucket in Step 3. That bucket, like all S3 buckets, needs a globally unique name. You will reference that bucket’s name in these policies, even though you will create the bucket later. Decide the name of your bucket now. When you reach steps that require you to type or paste a JSON policy document for your bucket policy, remember to use the name of your bucket where I have written secure-demo-bucket.

Step 1a: Create the S3 bucket management policy

While logged in to the console as your Admin user, create an IAM policy in the web console using the JSON tab. Name the policy secure-bucket-admin. When you reach the step to type or paste a JSON policy document, paste the JSON from Listing 1 below. This policy allows broad S3 administration rights (creating, deleting, and modifying policies), so it is a high privilege policy. In an effort to be concise, it grants all permissions to S3 and then takes a few away by explicitly denying them. The intention is to permit managing all aspects of the bucket’s operation, while denying all access to the contents of the bucket. The explicit deny mechanism is important because, due to IAM’s policy evaluation logic, an explicit deny cannot be overridden by subsequent “allow” statements or by attaching additional policies. As the S3 service evolves over time and new features are added, the policy will permit using those new features, without any change to this policy. If you prefer to enable features explicitly, you’ll need to rewrite this policy to explicitly allow only the features you want, and then come back and revise the policy every so often, as S3 features are added that your role needs to use.


{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAllActions",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Sid": "DenyObjectAccess",
      "Action": [
        "s3:DeleteObject",
        "s3:DeleteObjectVersion",
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:PutObject",
        "s3:PutObjectAcl",
        "s3:PutObjectVersionAcl"
      ],
      "Effect": "Deny",
      "Resource": "arn:aws:s3:::secure-demo-bucket/*"
    }
  ]
}

Listing 1: secure-bucket-admin IAM policy
 
Your policy will have an ARN (it will look something like arn:aws:iam::111122223333:policy/secure-bucket-admin). Make a note of this ARN. You will use it later to attach to the secure-bucket-admin role you’ll create in step 2.
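To make the effect of the explicit deny concrete, here is a deliberately simplified model of policy evaluation. It is an illustration only, not IAM’s actual engine: it ignores conditions, principals, NotAction, and case-insensitive action matching, and it uses shell-style wildcards. The sample policy is a trimmed variant of Listing 1 in which the Deny statement targets object ARNs (secure-demo-bucket/*), the resource form that object-level actions are evaluated against.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    return value if isinstance(value, list) else [value]


def _matches(statement, action, resource):
    """True when the statement's Action and Resource patterns both match."""
    return any(fnmatchcase(action, p) for p in _as_list(statement["Action"])) and any(
        fnmatchcase(resource, p) for p in _as_list(statement["Resource"])
    )


def evaluate(policy, action, resource):
    """Return 'ExplicitDeny', 'Allow', or 'ImplicitDeny' for one request."""
    decision = "ImplicitDeny"
    for statement in policy["Statement"]:
        if _matches(statement, action, resource):
            if statement["Effect"] == "Deny":
                return "ExplicitDeny"  # a matching Deny always wins
            decision = "Allow"
    return decision


policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
        {"Effect": "Deny",
         "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
         "Resource": "arn:aws:s3:::secure-demo-bucket/*"},
    ],
}

print(evaluate(policy, "s3:PutBucketPolicy", "arn:aws:s3:::secure-demo-bucket"))
# Allow
print(evaluate(policy, "s3:GetObject", "arn:aws:s3:::secure-demo-bucket/report.csv"))
# ExplicitDeny
```

The bucket administrator can change the bucket policy but cannot read the object, no matter how many additional Allow statements are attached.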

Step 1b: Create the KMS administrator policy

While logged in to the console as your Admin user, create an IAM policy in the web console using the JSON tab. Name the policy secure-key-admin. When you reach the step to type or paste a JSON policy document, paste the JSON from Listing 2 below. Be sure to add your own 12-digit AWS account number where I have written 111122223333. This policy allows broad KMS administration rights (creating keys, granting access to keys, and modifying key policies), so it is a high privilege policy. In an effort to be concise, this policy grants all permissions to the KMS service and then denies certain rights through an explicit deny statement. The intention is to permit managing all aspects of KMS keys, while denying all access to perform encryption and decryption using KMS keys. As the KMS service evolves over time and new features are added, the policy will permit using those new features, without any change to this policy. If you prefer to enable features explicitly, you’ll need to rewrite this policy to explicitly allow only the features you want, and then come back and revise the policy every so often, as KMS features are added that your role needs to use.


{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAllKMS",
      "Action": "kms:*",
      "Effect": "Allow",
      "Resource": "arn:aws:kms:*:111122223333:key/*"
    },
    {
      "Sid": "DenyKMSKeyUsage",
      "Action": [
        "kms:Decrypt",
        "kms:Encrypt",
        "kms:GenerateDataKey",
        "kms:ReEncryptFrom",
        "kms:ReEncryptTo"
      ],
      "Effect": "Deny",
      "Resource": "arn:aws:kms:*:111122223333:key/*"
    }
  ]
}

Listing 2: secure-key-admin IAM policy
 
Your policy will have an ARN (it will look something like arn:aws:iam::111122223333:policy/secure-key-admin). Make a note of this ARN. You will use it later to attach to the secure-key-admin role you’ll create in step 2.
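
The allow-everything-then-deny pattern used in this policy relies on the IAM rule that an explicit deny overrides any allow. The following Python sketch illustrates that rule with a deliberately simplified evaluator. The helper is_allowed is hypothetical (it is not an AWS API), the key ARN is a placeholder, and the sketch ignores conditions, principals, case-insensitive action matching, and the other policy types that real IAM evaluation considers.

```python
from fnmatch import fnmatchcase

# Statements modeled on the secure-key-admin policy.
SECURE_KEY_ADMIN = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "AllowAllKMS", "Action": "kms:*", "Effect": "Allow",
         "Resource": "arn:aws:kms:*:111122223333:key/*"},
        {"Sid": "DenyKMSKeyUsage",
         "Action": ["kms:Decrypt", "kms:Encrypt", "kms:GenerateDataKey",
                    "kms:ReEncryptFrom", "kms:ReEncryptTo"],
         "Effect": "Deny",
         "Resource": "arn:aws:kms:*:111122223333:key/*"},
    ],
}

# Placeholder key ARN used only for this illustration.
KEY = "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab"


def _as_list(value):
    """Policy fields may hold a single string or a list of strings."""
    return value if isinstance(value, list) else [value]


def is_allowed(policy, action, resource):
    """Simplified evaluation: explicit Deny wins, then any Allow, else implicit deny."""
    allowed = False
    for stmt in policy["Statement"]:
        if not any(fnmatchcase(action, p) for p in _as_list(stmt["Action"])):
            continue
        if not any(fnmatchcase(resource, p) for p in _as_list(stmt["Resource"])):
            continue
        if stmt["Effect"] == "Deny":
            return False  # an explicit deny cannot be overridden
        allowed = True
    return allowed


print(is_allowed(SECURE_KEY_ADMIN, "kms:PutKeyPolicy", KEY))  # management action
print(is_allowed(SECURE_KEY_ADMIN, "kms:Decrypt", KEY))       # key usage action
```

In this model the management action is permitted by the wildcard allow, while the usage action is blocked by the explicit deny even though kms:* also matches it.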

Step 1c: Create the S3 bucket usage policy

This final policy grants access to read and write encrypted data in the target S3 bucket. This is a narrowly-scoped policy that only grants rights to a single bucket. While logged in to the console as your Admin user, create an IAM policy in the web console using the JSON tab. Name the policy secure-bucket-access.

When you reach the step to type or paste a JSON policy document, paste the JSON from Listing 3 below, substituting the name of your bucket on the two lines where I have secure-demo-bucket.


{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "BasicList",
            "Effect": "Allow",
            "Action": [ "s3:ListAllMyBuckets", "s3:HeadBucket" ],
            "Resource": "*"
        },
        {
            "Sid": "AllowSecureBucket",
            "Effect": "Allow",
            "Action": [ "s3:PutObject", "s3:GetObjectAcl",
                "s3:GetObject", "s3:DeleteObjectVersion",
                "s3:DeleteObject", "s3:GetBucketLocation",
                "s3:GetObjectVersion" ],
            "Resource": [
                "arn:aws:s3:::secure-demo-bucket/*",
                "arn:aws:s3:::secure-demo-bucket"
            ]
        }
    ]
}

Listing 3: secure-bucket-access IAM policy

Note: In an effort to grant a minimal, but realistic, set of permissions, this IAM policy only grants access to basic get, put, and delete operations. You might have a use for other features, like tagging objects. If so, you will need to change the policy to enable the features you want to use.

Your policy will have an ARN (it will look something like arn:aws:iam::111122223333:policy/secure-bucket-access). Make a note of this ARN. You will use it later to attach to the authorized-users role you’ll create in step 2.

You might ask why this policy designed to control access to encrypted objects has no KMS permissions in it. Wouldn’t that prevent the users that assume this IAM role from using the encryption keys? It normally would, except that you can list the authorized-users IAM role in the resource policy attached to the KMS key you’re about to create. Placing the authorized-users role in the KMS key resource policy further enforces the separation of duties, so administrators in the account who can modify IAM policies don’t inadvertently escalate privilege to other IAM users or roles by giving them permission to use KMS keys for decryption.

Step 2: Create IAM roles

An AWS IAM role is an identity that you can create in an AWS account that has specific permissions. An IAM role is similar to an IAM user, because it has permission policies that determine what the identity can and cannot do in AWS. It’s different from an IAM user because it’s not associated with a single person. A role can be used by users, by EC2 instances, by AWS services, or by other entities like AWS Lambda functions that you allow to use it. The IAM policies we created in step 1 do not grant permissions until we assign them to roles and assign the roles to users or entities.

Step 2a: Create the S3 bucket management role

This role will be used by administrators who need to manage the properties of the bucket.

  1. Follow the online instructions for creating an IAM role.
  2. Choose Another AWS account under the section labeled Select type of trusted entity.
  3. For the authorized AWS account ID, enter the 12-digit account number for the account that you’re working in. If you intend to authorize IAM users that are defined in a different AWS account to access the S3 bucket and decrypt objects, enter that AWS account’s ID number instead.
  4. Name the IAM role secure-bucket-admin and attach the customer-managed policy named secure-bucket-admin that you created in step 1a to the role.

    Your AWS IAM role will have an ARN (it will look something like arn:aws:iam::111122223333:role/secure-bucket-admin). Make a note of this ARN. You will use it in step 3 when you create your S3 bucket.

Step 2b: Create the KMS key management role

This role will be used by administrators who need to manage the KMS customer master keys that protect the data. The actions you take to manage the keys will be authorized by this role. Importantly, this role has no ability to modify the bucket, grant access to the bucket, or access any of the data in the bucket.

  1. Follow the online instructions for creating an IAM role.
  2. In the Select type of trusted entity section, select Another AWS account.
  3. For the authorized AWS account ID, enter the 12-digit account number for the account that you’re working in. If you intend to authorize IAM users that are defined in a different AWS account, enter that AWS account’s ID number instead.
  4. Name the IAM role secure-key-admin and attach the customer-managed policy named secure-key-admin that you created in step 1b to the role.

    Your AWS IAM role will have an ARN (it will look something like arn:aws:iam::111122223333:role/secure-key-admin). Make a note of this ARN. You will use it in step 4 when you create your KMS key.

Step 2c: Create the bucket usage role

This role will grant permissions to EC2 instances. An EC2 instance running with this role will be able to create and read encrypted data in the protected S3 bucket.

  1. Follow the online instructions for creating an IAM role.
  2. In the Select type of trusted entity section, select AWS service.
  3. Choose EC2 as the service that you will authorize. This authorizes all applications running on that EC2 instance to use credentials with permissions attached to the role.
  4. Name the IAM role authorized-users and attach the customer-managed secure-bucket-access policy that you created in step 1c to the role.

This role is not for users trying to access the S3 bucket from any arbitrary application that happens to have the role’s credentials. It will only be used by users operating within applications running in AWS EC2 instances.

Step 3: Create an S3 bucket for the encrypted data

Log in to the console using your secure-bucket-admin role. (Either log in with the correct federated identity, or with the AWS IAM user you created in step 1d). Follow the instructions to create a bucket that will hold the encrypted data. In my example, I call my bucket secure-demo-bucket. You chose your own unique bucket name back in step 1. Type that bucket name throughout these steps where I use secure-demo-bucket. You will set a bucket policy and properties on that bucket later.

Step 4: Create a KMS key to encrypt and decrypt the data in the S3 bucket

Log out of the console and log back in using your secure-key-admin role. Create a customer-managed customer master key (CMK) to encrypt and decrypt the data in the S3 bucket you just created. If you already have a customer-managed CMK that you want to use for this purpose, you can do that. To use your own CMK, skip steps 1-5 below about creating a key and, instead, select your existing key in the KMS console and then follow steps 6-7 to change the key policy so that the authorized-users role has permission to use the key.

  1. In the AWS console, go to Key Management Service.
  2. Select the Create Key button.
  3. On the Step 1 screen, set a display name (called an “Alias”) for the key and a description. I recommend a meaningful description that tells others what the key is for.
  4. On the Step 2 screen, set tags if you need them to track usage of keys for billing purposes. Tags won’t have a functional impact in this exercise so you can skip this step if you want by selecting Next.
  5. On the Step 3 screen, select key administrators. Pick only the secure-key-admin IAM role. You must not pick the secure-bucket-admin role or the authorized-users role as key administrators to ensure separation of duties. For example, if you were to pick the authorized-users IAM role, then any user that assumed that role could escalate their own (or others’) privileges to use this key to decrypt any other data encrypted under this key in your account. If you were to pick the secure-bucket-admin user, then that user could modify permissions both on the S3 bucket and the KMS key in ways that allowed unauthorized users access to decrypt data.
  6. On the Step 4 screen, select key users. Pick only the authorized-users IAM role you created in step 2c.
  7. On the Step 5 screen, select Finish.

    After you have created the key, make note of the key’s ARN. It will look something like this:

    arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab

    You will need it for the next step where you enforce all objects uploaded into the S3 bucket to be encrypted under this key.
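
Selecting a key user in the console adds that role to a statement in the key's resource policy. The Python sketch below builds a statement of that general shape so you can see what the separation looks like in the policy document. The function name key_user_statement is hypothetical, and the Sid and action list are assumptions modeled on the console's default key policy; check the policy the console actually generates for your key.

```python
import json


def key_user_statement(role_arn):
    """Build a key policy statement that lets one role use (not manage) the key."""
    return {
        "Sid": "Allow use of the key",  # assumed Sid, for illustration
        "Effect": "Allow",
        "Principal": {"AWS": role_arn},
        # Usage actions only; no key administration actions appear here.
        "Action": [
            "kms:Encrypt",
            "kms:Decrypt",
            "kms:ReEncrypt*",
            "kms:GenerateDataKey*",
            "kms:DescribeKey",
        ],
        # In a key policy, "*" refers to the key the policy is attached to.
        "Resource": "*",
    }


statement = key_user_statement("arn:aws:iam::111122223333:role/authorized-users")
print(json.dumps(statement, indent=2))
```

Because the permission lives in the key policy rather than in the role's IAM policy, only someone who can edit the key policy can grant or revoke it.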

Step 5: Modify the bucket policy

Log out of the console and log back in with the secure-bucket-admin role. You’re going to attach a bucket policy to the bucket that does two things: it requires objects to be encrypted and it requires them to be encrypted with a specific KMS key. You will accomplish this by explicitly denying any attempt to call PutObject unless the correct conditions are true. This helps you increase your confidence that you will not store unencrypted data in this bucket.

Find the secure-demo-bucket bucket in the S3 web console, and then modify its bucket policy. Use the code from Listing 4 below as the entire bucket policy. Be sure to change secure-demo-bucket to the actual name of the bucket that you’re using in both places where it appears in the policy. You recorded the key’s ARN in step 4. Make sure you insert that ARN for your KMS key where I use an example key ARN below.


{
    "Version": "2012-10-17",
    "Id": "PutObjPolicy",
    "Statement": [
      {
        "Sid": "DenyUnencryptedObjectUploads",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::secure-demo-bucket/*",
        "Condition": {
          "StringNotEquals": {
            "s3:x-amz-server-side-encryption": "aws:kms"
          }
        }
      },
      {
        "Sid": "DenyWrongKMSKey",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::secure-demo-bucket/*",
        "Condition": {
          "StringNotEquals": {
            "s3:x-amz-server-side-encryption-aws-kms-key-id": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab"
          }
        }
      }
    ]
  }

Listing 4: Bucket policy requiring encryption

Note: This bucket policy is not retroactive: If you apply this policy to a bucket that already exists and already has unencrypted objects, nothing happens to the objects that are already in the bucket. They remain unencrypted. They can be fetched or deleted. Once the policy is applied, however, new objects cannot be put in the bucket unless they are correctly encrypted.
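
To make the effect of the two Deny statements concrete, here is a minimal Python sketch that models how they treat the encryption headers of a PutObject request. The function put_object_denied is a hypothetical helper and the key ARN is a placeholder. The sketch assumes the documented behavior that a negated operator such as StringNotEquals evaluates to true when the condition key is absent from the request, which is why a request with no encryption headers is denied.

```python
# Placeholder for the key ARN you recorded when you created the key.
REQUIRED_KEY = "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab"


def put_object_denied(headers, required_key=REQUIRED_KEY):
    """Return the Sid of the Deny statement that blocks the request, or None."""
    # StringNotEquals is also true when the header is missing entirely,
    # so dict.get() returning None correctly leads to a deny.
    if headers.get("x-amz-server-side-encryption") != "aws:kms":
        return "DenyUnencryptedObjectUploads"
    if headers.get("x-amz-server-side-encryption-aws-kms-key-id") != required_key:
        return "DenyWrongKMSKey"
    return None


# No encryption headers at all: blocked by the first statement.
print(put_object_denied({}))

# KMS encryption requested with the expected key: neither statement applies.
print(put_object_denied({
    "x-amz-server-side-encryption": "aws:kms",
    "x-amz-server-side-encryption-aws-kms-key-id": REQUIRED_KEY,
}))
```

A request that passes both checks is not automatically allowed; it still needs an Allow from the caller's IAM policy and permission to use the key.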

Instead of applying a bucket policy, you could consider turning on S3 default encryption. This feature forces all new objects uploaded to an S3 bucket to be encrypted using the KMS key you created in step 4 unless the user specifies a different key. This feature doesn’t prohibit callers from encrypting objects under other KMS keys, but it ensures that the data is protected even if the user does not specify KMS encryption when putting the object. The bucket policy in Listing 4 is a bit stricter than S3 default encryption because it ensures that no object is ever encrypted by any key other than the CMK created in step 4. That strictness means the attempt to put an object fails, unless the caller explicitly names the KMS keyId in every S3 PUT request. With S3 default encryption, attempts to put an object without specifying encryption will succeed, and the data will be protected by the named KMS CMK.
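
If you prefer S3 default encryption, the setting can also be applied programmatically. The following Python sketch builds the configuration document and passes it to the boto3 put_bucket_encryption call. The helper names are hypothetical, the bucket name and key ARN are placeholders, and the sketch assumes boto3 is installed and that the caller has permission to change the bucket's encryption settings.

```python
def default_encryption_config(kms_key_arn):
    """Build the default-encryption configuration that names one KMS key."""
    return {
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_arn,
                }
            }
        ]
    }


def enable_default_encryption(bucket, kms_key_arn):
    """Apply the configuration to a bucket (requires boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so the helper above has no dependency

    s3 = boto3.client("s3")
    s3.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration=default_encryption_config(kms_key_arn),
    )


# Example call (not executed here):
# enable_default_encryption(
#     "secure-demo-bucket",
#     "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab",
# )
```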

Step 7: Launch an EC2 instance to demonstrate the solution

The final step to showing how this solution works is to launch an EC2 instance and show that applications running in that instance can write and read data in the S3 bucket you created. If you launch an EC2 instance that has your authorized-users role attached and log in on that instance, you will be able to upload and download objects from the bucket, encrypting and decrypting transparently as you do it. No other identity (for example, other IAM users, other IAM roles, other EC2 instances, and Lambda functions) will be able to upload and download data to this S3 bucket because these other identities don’t have the permissions to use the KMS key that protects the data.

Start by logging out of the console and logging back in as your Admin user. Follow the instructions to launch an EC2 instance:

  1. Choose an Amazon Linux AMI.
  2. Choose an instance type. Any instance type will work. If you launch an Amazon Linux t2.micro instance, it might qualify for free tier pricing.
  3. For IAM Role, select the authorized-users role from the drop-down menu.
  4. Make sure you specify an SSH key that you have access to, and make sure that you have a way to reach the EC2 instance over the network.

Satisfy yourself that it works as expected

At this point, the solution is complete and is running. I want to demonstrate that the KMS key is providing the independent access controls the way I said it would. I will modify the key policy to remove the instance’s rights to use the KMS key. Then, I will confirm that the commands that had succeeded before now fail after the key policy change. This shows how the KMS key and its policy are completely independent of the S3 bucket policies and the IAM policies.

Test 1: Uploading encrypted objects

Using SSH, log in on the EC2 instance you launched that has the authorized-users role attached.

You will need to download a file onto the EC2 instance that you can then upload, encrypted, to the S3 bucket. If you don’t have a file that you want to use, you can use the AWS Cryptographic Details whitepaper as a reasonable test file.

On the instance, run the following command to download a local copy of the AWS Cryptographic Details whitepaper that you can use as test data:


curl -O 'https://d1.awsstatic.com/whitepapers/KMS-Cryptographic-Details.pdf'

Side note: You should also read this whitepaper. It’s very informative on how AWS KMS is built and operated to secure your encryption keys.

On the EC2 instance, use the AWS command line to upload the file to the S3 bucket. Note all the options that tell S3 to use KMS encryption and to use the correct key ID. Remember to insert the bucket name for the bucket that you’re using and the ARN of your KMS key from step 4 above.


aws s3 cp KMS-Cryptographic-Details.pdf s3://secure-demo-bucket/ \
  --sse aws:kms --sse-kms-key-id arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab

If all went well, you should see a message like the following, showing that the object was uploaded successfully:


upload: ./KMS-Cryptographic-Details.pdf to s3://secure-demo-bucket/KMS-Cryptographic-Details.pdf
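
An application on the instance would typically make the same request through an SDK rather than the command line. As a rough equivalent, this Python sketch passes the two encryption settings to the boto3 upload_file call through ExtraArgs. The helper names are hypothetical, the bucket name and key ARN are placeholders, and the sketch assumes boto3 is installed on the instance.

```python
def kms_upload_args(kms_key_arn):
    """Extra arguments that ask S3 to encrypt the object under a specific KMS key."""
    return {
        "ServerSideEncryption": "aws:kms",  # same role as --sse aws:kms
        "SSEKMSKeyId": kms_key_arn,         # same role as --sse-kms-key-id
    }


def upload_encrypted(path, bucket, object_key, kms_key_arn):
    """Upload a local file with KMS encryption (requires boto3 and credentials)."""
    import boto3  # third-party; on EC2 the credentials come from the instance role

    s3 = boto3.client("s3")
    s3.upload_file(path, bucket, object_key, ExtraArgs=kms_upload_args(kms_key_arn))


# Example call (not executed here):
# upload_encrypted(
#     "KMS-Cryptographic-Details.pdf",
#     "secure-demo-bucket",
#     "KMS-Cryptographic-Details.pdf",
#     "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab",
# )
```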

Test 2: Uploading an unencrypted object

You can now prove that an attempt to upload an unencrypted object from this instance fails. Run this command to upload a second copy of the PDF file to be called test2.pdf. Be sure to substitute your bucket’s name into the command.


aws s3 cp KMS-Cryptographic-Details.pdf s3://secure-demo-bucket/test2.pdf

You’ll notice this command doesn’t include the options instructing S3 to use KMS to encrypt the file. You should see this error message:


An error occurred (AccessDenied) when calling the PutObject operation: Access Denied

If you see no error, then double-check that your bucket policy in Step 5 above is correct.

Test 3: Downloading encrypted objects

You’ve now proven that the EC2 instance can upload encrypted objects and that unencrypted objects are refused. Now, you can prove that the EC2 instance has access to cause S3 to decrypt the encrypted object in the bucket using the KMS keys. Here’s how: While still on your EC2 instance, run this command, substituting your bucket name, to download a copy of the PDF file:


aws s3 cp s3://secure-demo-bucket/KMS-Cryptographic-Details.pdf test3.pdf

If this command succeeds, then you will have a file in your current directory on your EC2 instance named test3.pdf. That shows that you have successfully decrypted and downloaded the PDF file.

Test 4: Demonstrate that the key policy regulates access

Now, I will demonstrate the independence of access control provided by the KMS key policy. Leaving the bucket policy and IAM role/policy as they are, you will disable the EC2 instance’s access to the objects using the KMS key policy. The IAM policy for S3 and the bucket policy on the bucket would still normally permit the EC2 instance to access the data. But, because the KMS key policy will prevent use of the key by the authorized-users IAM role, S3 will fail to encrypt or decrypt the object. This means that any commands that execute on the EC2 instance will no longer be able to upload or download data from the S3 bucket.

First, modify the key policy.

  1. Log out of the console and log back in under the secure-key-admin role. Go to the Key Management Service console.
  2. In the left-hand navigation, select Customer managed keys and look for the key with the alias or Key ID that you’re using. The Key ID is the last 36 characters of the full key ARN.
  3. Select the Key ID for the key that you’re using to get to the screen where you can edit the key policy.
  4. In the list of Key users, you will see your authorized-users role listed. Select that role, and then select the Remove button to remove its access to use the KMS key.

At this point, the EC2 instance no longer has permission to use the KMS key because the key policy no longer grants that permission to its role.

Repeat the command that you did in Test 1 that uploaded a PDF file to the bucket. In this case, try to make a second copy of the PDF file into an object named test4.pdf. Run this command, substituting your bucket name and your KMS key ID as required:


aws s3 cp KMS-Cryptographic-Details.pdf s3://secure-demo-bucket/test4.pdf --sse aws:kms --sse-kms-key-id 1234abcd-12ab-34cd-56ef-1234567890ab

You should see an error like this:


An error occurred (AccessDenied) when calling the PutObject operation: Access Denied

Now, try to download the copy of the KMS-Cryptographic-Details.pdf file from the bucket, again using the command that worked before, substituting the bucket name as required:


aws s3 cp s3://secure-demo-bucket/KMS-Cryptographic-Details.pdf test4.pdf

You should see an error message like this:


An error occurred (AccessDenied) when calling the GetObject operation: Access Denied

These two commands are denied because when S3 tried to invoke KMS to encrypt or decrypt data, the EC2 instance role did not have permission to use the KMS key and thus the request failed. Note that there is no situation where the API call returns the KMS-encrypted data from S3. Either the API call succeeds, and you receive the decrypted data, or the API call fails, and you receive an error. All AWS services that use KMS to encrypt data behave this way—you either get the decrypted data, or you get an error message.

Restoring access to the key

To restore the EC2 instance’s access to the data, you authorize its role again in the KMS key policy:

  1. Go to Key Management Service in the AWS Console.
  2. Select Customer managed keys.
  3. Find the key that you’re using and select it.
  4. Find your authorized-users role in the list of roles, or type “authorized-users” in the search box to find it.
  5. Select the checkbox next to the authorized-users role, and then select Add to add that role as a key user.

The role will now have permission to use the key as it did before.

Useful variations on this solution

Variation 1: Using KMS keys in different AWS accounts

You can use a KMS key that is in a different AWS account for encrypting and decrypting. This allows administrators in a central AWS account to manage KMS keys, while the data itself resides in other AWS accounts. This can offer further separation of roles from the example above because even a highly privileged user (for example, root) in the account in which the authorized-users role exists won’t be able to modify the key policy. The account ID in which authorized-users role exists must be listed in the key policy. For more information, follow the instructions on sharing KMS keys across accounts.

Note that the KMS key and the S3 bucket must always be in the same region. The EC2 instance does not need to be in the same region as the S3 bucket. You will experience higher latency when your EC2 instance is not in the same region as the S3 bucket.

Variation 2: Granting KMS key usage permissions to other AWS services

EC2 is not the only service that can be granted a role this way. Lambda functions can be granted AWS IAM roles that allow them to use KMS keys. That would permit the Lambda functions with the correct roles to manipulate the S3 data, while other entities (users, EC2 instances) could not. Likewise, AWS services such as Amazon Athena might require access to a KMS key if you want to use it to search data stored in S3 that has been encrypted using KMS. If Athena is given permission to assume a role with permissions to use the KMS key, then Athena can successfully execute its search queries because S3 will be allowed to decrypt objects on behalf of Athena, which is acting on your behalf when assuming the authorized-users role.

Variation 3: Creating isolated authorization to encrypt vs decrypt

You can use the KMS key policy to isolate authorization to encrypt versus decrypt data between two identities. For example, if a role has the kms:Encrypt or kms:GenerateDataKey permissions for a key, that role can write encrypted data directly or ask an AWS service to do it on its behalf (for example, during an upload to an S3 bucket). If the role does not also have kms:Decrypt permission, it can’t read encrypted data. This write-only permission might be appropriate for data acquisition, security log delivery, or other functions that should not be allowed to read the data they have written. Likewise, if a role has the kms:Decrypt permission, then the role has the ability to read data. But if it lacks the kms:Encrypt and kms:GenerateDataKey permissions, it cannot write or modify encrypted data. This kind of isolated authorization is suitable for audit functions and log aggregation functions that need to read data but typically are prohibited from modifying the data or logs that they read. The complete set of permissions for KMS key policies can be found in the KMS Developer Guide.

Cost of this solution

Three services with charges are used in this solution: EC2, S3, and KMS. The EC2 instance hours are charged according to standard EC2 pricing. Likewise, storing data in S3 will incur costs according to standard S3 pricing. There is no difference in S3 pricing for storing encrypted versus unencrypted data. Finally, KMS has a fixed price per month for each customer-managed CMK you create, which is described in the KMS pricing page. Each encryption and decryption of an object is a KMS API call and a certain number of KMS API calls are free each month. The number of free KMS API calls, and the price for API calls beyond the free tier, are described on the KMS pricing page.

Summary

The combination of IAM policies, S3 bucket policies, and KMS key policies gives you a powerful way to apply independent access control mechanisms on data. This mechanism means that one set of users can be granted rights to do maintenance operations on the buckets themselves, while not having rights to access or manipulate the data itself. Even a user or function with full privileges in S3 would be denied access to this encrypted data unless it also had the rights to use the KMS keys. It gives you an approach to access control that allows key policies to serve as an additional control when IAM policies or S3 bucket policies alone are not sufficient.

Author bio

Paco Hope

Paco Hope is a Principal Security Consultant with AWS Professional Services working to help enterprise customers secure their workloads in the cloud. He has helped secure migration landing zones, design customer security architectures, and has mentored a number of AWS partners in the UK on AWS Security. He frequently speaks at information security conferences and security meetups.

Reducing custom code by using advanced rules in Amazon EventBridge

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/reducing-custom-code-by-using-advanced-rules-in-amazon-eventbridge/

Amazon EventBridge allows you to route events between AWS services, integrated software as a service (SaaS) applications, and your own applications. Event producers publish events onto an event bus, which uses rules to determine where to send those events. The rules can specify one or more targets, which can be other AWS services or Lambda functions. This model makes it easy to develop scalable, distributed serverless applications by handling event routing and filtering.

EventBridge Content Filtering

EventBridge recently introduced additional content filtering functionality, which creates new possibilities for building sophisticated rules. This blog post explores how to use event patterns to build rules that make this routing process more powerful without needing custom code. I show how this could work with a sample ATM banking application integrating into an AWS service.

Events, rules, and filtering

In EventBridge, an event is simply a JSON structure. It contains some top-level envelope fields, such as the source, event, and timestamp, followed by a detail field containing the body of the event. Events generated from AWS services always contain a number of descriptive fields and are identifiable by the source attribute prefix “aws”.

You can also generate events from your own applications. EventBridge requires specific envelope fields, but otherwise you are free to add additional attributes as needed. A typical event structure for a custom application looks like this:

{
  "Source": string,
  "EventBusName": string,
  "DetailType": string,
  "Detail": string
}

If your application uses nested attributes, you must convert the Detail attribute into a string. In programming languages such as Node.js, you can do this using JSON.stringify to send an event, and JSON.parse when receiving it. For example, for a banking application where an ATM application sends events to EventBridge, a cash withdrawal event may look like this:

{
  "Source": "custom.myATMapp",
  "EventBusName": "default",
  "DetailType": "transaction",
  "Detail": "{\"action\":\"withdrawal\",\"amount\":300}"
}
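
The same stringify-and-parse round trip can be sketched in Python with the standard json module. The function build_entry is a hypothetical helper; the field names follow the event structure above, and the resulting dictionary is the kind of entry you would pass to an SDK call that publishes events.

```python
import json


def build_entry(action, amount):
    """Build one event entry whose Detail is a JSON-encoded string."""
    detail = {"action": action, "amount": amount}
    return {
        "Source": "custom.myATMapp",
        "EventBusName": "default",
        "DetailType": "transaction",
        "Detail": json.dumps(detail),  # nested attributes must be sent as a string
    }


entry = build_entry("withdrawal", 300)
print(entry["Detail"])  # {"action": "withdrawal", "amount": 300}

# A consumer reverses the encoding to get the nested attributes back.
received = json.loads(entry["Detail"])
print(received["amount"])  # 300
```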

EventBridge rules use event patterns that are JSON structures. These match against the attributes in the events. In the rules, you only specify the fields where you want to apply filtering logic.

To see all events for a single application using an event bus, you can filter by source. Any incoming event with this source matches regardless of the content of other fields. In the ATM application example, a rule that accepts all events looks like this:

{ 
  "source": [ "custom.myATMapp" ]
}

EventBridge examines the incoming event and compares it against this rule. The rule specifies a source value of custom.myATMapp and, as this exists in the event, the pattern matches. It then routes the event to the rule’s targets:

EventBridge rules

The example above shows a static, exact match pattern – the attribute is either present, or it’s not. There are now additional operators available for dynamic matching based on specific comparison conditions. This provides functionality that’s similar to what you use in a SQL where clause for filtering records in a database.

Here is a summary of all the comparison operators available in EventBridge:

Comparison | Example | Rule syntax
Null | UserID is null | "UserID": [ null ]
Empty | LastName is empty | "LastName": [ "" ]
Equals | Name is "Alice" | "Name": [ "Alice" ]
And | Location is "New York" and Day is "Monday" | "Location": [ "New York" ], "Day": [ "Monday" ]
Or | PaymentType is "Credit" or "Debit" | "PaymentType": [ "Credit", "Debit" ]
Not | Weather is anything but "Raining" | "Weather": [ { "anything-but": [ "Raining" ] } ]
Numeric (equals) | Price is 100 | "Price": [ { "numeric": [ "=", 100 ] } ]
Numeric (range) | Price is more than 10, and less than or equal to 20 | "Price": [ { "numeric": [ ">", 10, "<=", 20 ] } ]
Exists | ProductName exists | "ProductName": [ { "exists": true } ]
Does not exist | ProductName does not exist | "ProductName": [ { "exists": false } ]
Begins with | Region is in the US | "Region": [ { "prefix": "us-" } ]

Filtering events in a custom application

In this example, a bank runs software on a network of ATMs that forwards transactional information to EventBridge. This software sends all events to EventBridge, but downstream systems only want to receive a subset of ATM events:

ATM example application

The events from the ATMs have the following structure:

      {
        "Source": "custom.myATMapp",
        "EventBusName": "default",
        "DetailType": "transaction",
        "Time": "Wed Jan 29 2020 08:03:18 GMT-0500",
        "Detail":{
          "action": "withdrawal",
          "location": "NY-NYC-001",
          "amount": 300,
          "result": "approved",
          "transactionId": "123456",
          "cardPresent": true,
          "partnerBank": "Example Bank",
          "remainingFunds": 722.34
        }
      }

The downstream services can use the event patterns in EventBridge rules to ensure that they only receive specific events.

1. Transactions where the amount is over $300

The following event pattern filters for ATM transactions over $300.

{
  "source": [ "custom.myATMapp" ],
  "detail-type": [ "transaction" ],
  "detail": {
    "amount": [ { "numeric": [ ">", 300 ] } ]
  }
}

2. All ATMs in New York City

The ATM location attribute uses the format state-city-id, so NY-NYC-001 indicates that the machine is located in New York City in New York state. To filter events from only ATMs in the New York City area, I use a prefix in the filter:

{
  "source": [ "custom.myATMapp" ],
  "detail-type": [ "transaction" ],
  "detail": {
    "location": [ { "prefix": "NY-NYC-" } ]
  }
}

3. ATM customers using a third-party bank account

To filter for transactions that show a partnerBank attribute, the following event pattern checks for the existence of this attribute:

{
  "source": [ "custom.myATMapp" ],
  "detail-type": [ "transaction" ],
  "detail": {
    "partnerBank": [ { "exists": true } ]
  }
}

4. Combined filter

I can combine filters in a single event pattern to create use-cases that are more complex. For example, this filters on approved transactions where no partnerBank attribute exists, reporting from any ATM with a location different to NY-NYC-002:

{
  "source": [ "custom.myATMapp" ],
  "detail-type": [ "transaction" ],
  "detail": {
    "result": [ "approved" ],
    "partnerBank": [ { "exists": false } ],
    "location": [ { "anything-but": "NY-NYC-002" }]
  }
}

In each of these cases, EventBridge matches incoming events against the event patterns in these rules. If there is no match, it does not route the event. This eliminates custom code that otherwise exists to filter incoming events and terminate if necessary.
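To make the matching rules concrete, the following is a minimal local sketch of how the four ATM patterns above could be evaluated. It is not EventBridge's implementation: the `matches` and `_leaf_matches` helpers are hypothetical names, only the `numeric`, `prefix`, `exists`, and `anything-but` filters are covered, and the sketch assumes that `anything-but` requires the attribute to be present. The sample event uses the lowercase envelope keys that the patterns refer to.

```python
import operator

# Comparison operators accepted inside a "numeric" filter.
_COMPARE = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
            ">": operator.gt, ">=": operator.ge}
_MISSING = object()  # sentinel for "attribute not present in the event"


def _leaf_matches(value, candidates):
    """Return True if value satisfies at least one entry of a pattern array."""
    for cand in candidates:
        if not isinstance(cand, dict):
            if value is not _MISSING and value == cand:  # exact match
                return True
        elif "exists" in cand:
            if cand["exists"] == (value is not _MISSING):
                return True
        elif value is _MISSING:
            continue  # the remaining filters need a value to inspect
        elif "prefix" in cand:
            if isinstance(value, str) and value.startswith(cand["prefix"]):
                return True
        elif "anything-but" in cand:
            excluded = cand["anything-but"]
            excluded = excluded if isinstance(excluded, list) else [excluded]
            if value not in excluded:
                return True
        elif "numeric" in cand:
            ops = cand["numeric"]  # alternating operator/bound, e.g. [">", 300]
            pairs = zip(ops[0::2], ops[1::2])
            if isinstance(value, (int, float)) and all(
                    _COMPARE[op](value, bound) for op, bound in pairs):
                return True
    return False


def matches(pattern, event):
    """Every key in the pattern must match; other event keys are ignored."""
    for key, expected in pattern.items():
        actual = event.get(key, _MISSING)
        if isinstance(expected, dict):  # nested object such as "detail"
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif not _leaf_matches(actual, expected):
            return False
    return True


event = {
    "source": "custom.myATMapp",
    "detail-type": "transaction",
    "detail": {
        "action": "withdrawal", "location": "NY-NYC-001", "amount": 300,
        "result": "approved", "transactionId": "123456", "cardPresent": True,
        "partnerBank": "Example Bank", "remainingFunds": 722.34,
    },
}
base = {"source": ["custom.myATMapp"], "detail-type": ["transaction"]}
over_300 = {**base, "detail": {"amount": [{"numeric": [">", 300]}]}}
nyc = {**base, "detail": {"location": [{"prefix": "NY-NYC-"}]}}
partner_bank = {**base, "detail": {"partnerBank": [{"exists": True}]}}
combined = {**base, "detail": {
    "result": ["approved"],
    "partnerBank": [{"exists": False}],
    "location": [{"anything-but": "NY-NYC-002"}],
}}

for name, pattern in [("over_300", over_300), ("nyc", nyc),
                      ("partner_bank", partner_bank), ("combined", combined)]:
    print(name, matches(pattern, event))
```

With the sample event, only the location and partnerBank patterns match: the amount is exactly 300, which is not greater than 300, and the combined pattern rejects the event because a partnerBank attribute is present.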

Filtering AWS events to create a custom S3-to-Lambda integration

EventBridge uses a variety of AWS services as native event sources. For other AWS services, such as Amazon S3, it consumes events via AWS CloudTrail. You must first enable CloudTrail logging for the service you want to use with EventBridge. Once enabled, you can filter on any of the attributes available in an AWS event. This allows you to create dynamic, flexible integrations in your event-driven applications.

The standard S3-to-Lambda trigger allows developers to subscribe a Lambda function to an event on a single bucket. Although these events can filter on prefixes and suffixes of object keys in S3, you cannot use multiple configurations that overlap. Beyond the prefix and suffix of the key name, you cannot filter further on any other attributes of the event before invoking the Lambda function. To examine the S3 event further, you must do this within the code in the function itself.

Using EventBridge, you can configure a rule between one or more S3 buckets, and one or more Lambda functions, based upon any of the attributes available. This enables you to create much more granular filters for routing events to downstream consumers. Using a declarative approach results in greater flexibility and less custom code. In this section, I show four use-cases where this could be useful.

S3 to EventBridge

(a) Invoking a single Lambda function from events in multiple buckets

This example uses multiple buckets with a common prefix in the bucket name (for example, buckets with the names “myApp-images”, “myApp-uploads”, and “myApp-archive”). You can use all these buckets as an event source to trigger the same Lambda function. This event pattern matches for all put events in those buckets:

{
  "source": [ "aws.s3" ],
  "detail-type": [ "AWS API Call via CloudTrail" ],
  "detail": {
    "eventSource": [ "s3.amazonaws.com" ],
    "eventName": [ "PutObject" ],
    "requestParameters": {
      "bucketName": [ { "prefix": "myApp-" } ]
    }
  }
}

(b) Invoking multiple consumers as targets

EventBridge allows up to five targets per rule, so you can specify up to five separate Lambda functions to receive the event. All five functions are invoked in parallel when the event pattern matches. To use this, add the targets in the rule – no changes to the event pattern are required.

If you need more than five targets, use Amazon Simple Notification Service (SNS). You can define an SNS topic as the EventBridge rule target, and then fan out from SNS to a much larger number of subscribers.

{
  "source": [ "aws.s3" ],
  "detail-type": [ "AWS API Call via CloudTrail" ],
  "detail": {
    "eventSource": [ "s3.amazonaws.com" ],
    "eventName": [ "GetObject" ],
    "userAgent": [ "userAgent" ],
    "requestParameters": {
      "bucketName": [ "mybucket" ]
    }
  }
}

Conclusion

The new content filtering syntax in EventBridge enables precise filtering of events using comparison operators and ranges of values. This allows you to filter declaratively at the event bus rather than filtering downstream using custom code. For custom applications, like the ATM example, it enables you to build precise rules for specific use-cases, reducing the number of calls to targets.

This approach enables you to route events more precisely based upon any of the attributes reported in an event. This makes it easier to handle complex routing at the EventBridge level and reduces the need for custom code across your application.

To learn more about content filtering, see the Amazon EventBridge documentation.

Analyze your Amazon S3 spend using AWS Glue and Amazon Redshift

Post Syndicated from Shayon Sanyal original https://aws.amazon.com/blogs/big-data/analyze-your-amazon-s3-spend-using-aws-glue-and-amazon-redshift/

The AWS Cost & Usage Report (CUR) tracks your AWS usage and provides estimated charges associated with that usage. You can configure this report to present the data at hourly or daily intervals, and it is updated at least one time per day until it is finalized at the end of the billing period. The Cost & Usage Report is delivered automatically to an Amazon S3 bucket that you specify, and you can download it from there directly. You can also integrate the report into Amazon Redshift, query it with Amazon Athena, or upload it to Amazon QuickSight. For more information, see Query and Visualize AWS Cost and Usage Data Using Amazon Athena and Amazon QuickSight.

This post presents a solution that uses AWS Glue Data Catalog and Amazon Redshift to analyze S3 usage and spend by combining the AWS CUR, S3 inventory reports, and S3 server access logs.

Prerequisites

Before you begin, complete the following prerequisites:

  • You need an S3 bucket for your S3 inventory and server access log data files. For more information, see Create a Bucket and What is Amazon S3?
  • You must have the appropriate IAM permissions for Amazon Redshift to access the S3 buckets. This post uses two non-restrictive AWS managed policies (AmazonS3FullAccess and AWSGlueConsoleFullAccess); restrict access appropriately for your own scenarios.

Amazon S3 inventory

Amazon S3 inventory is one of the tools S3 provides to help manage your storage. You can use it to audit and report on the replication and encryption status of your objects for business, compliance, and regulatory needs. Amazon S3 inventory provides comma-separated values (CSV), Apache optimized row columnar (ORC), or Apache Parquet output files that list your objects and their corresponding metadata on a daily or weekly basis for a given S3 bucket.

Amazon S3 server access logs

Server access logging provides detailed records for the requests you make to a bucket. Server access logs are useful for many applications, for example in security and access audits. They can also help you learn about your customer base and understand your S3 bill.

AWS Glue

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. AWS Glue consists of a central metadata repository known as the Data Catalog, a crawler to populate the Data Catalog with tables, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there’s no infrastructure to set up or manage. This post uses AWS Glue to catalog S3 inventory data and server access logs, which makes them available for you to query with Amazon Redshift Spectrum.

Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can use Amazon Redshift to efficiently query and retrieve structured and semi-structured data from files in S3 without having to load the data into Amazon Redshift native tables. You can create Amazon Redshift external tables by defining the structure for files and registering them as tables in the AWS Glue Data Catalog.

Setting up S3 inventory reports for analysis

This post uses the Parquet file format for its inventory reports and delivers the files daily to S3 buckets. You can select both the frequency of delivery and output file formats under Advanced settings as shown in the screenshot below:

For more information about configuring your S3 inventory, see How Do I Configure Amazon S3 Inventory?

The following diagram shows the data flow for this solution:

The following steps summarize the data flow diagram above:

  • S3 Inventory Reports are delivered to an S3 bucket that you configure.
  • The AWS Glue crawler then crawls this S3 bucket and populates the metadata in the AWS Glue Data Catalog.
  • The AWS Glue Data Catalog is then accessible through an external schema in Redshift.
  • The S3 Inventory Reports (available in the AWS Glue Data Catalog) and the Cost and Usage Reports (available in another S3 bucket) are now ready to be joined and queried for analysis.

The inventory reports are delivered to an S3 bucket. The following screenshot shows the S3 bucket structure for the S3 inventory reports:

There is a data folder in this bucket. This folder contains the Parquet data you want to analyze. The following screenshot shows the content of the folder.

Because these are daily files, there is one file per day.

Configuring an AWS Glue crawler

You can use an AWS Glue crawler to discover this dataset in your S3 bucket and create the table schemas in the Data Catalog. After you create these tables, you can query them directly from Amazon Redshift.

To configure your crawler to read S3 inventory files from your S3 bucket, complete the following steps:

  1. Choose a crawler name.
  2. Choose S3 as the data store and specify the S3 path up to the data folder.
  3. Choose an IAM role that can read data from S3. This post uses a role with the AmazonS3FullAccess and AWSGlueConsoleFullAccess policies attached.
  4. Set a frequency schedule for the crawler to run.
  5. Configure the crawler’s output by selecting a database and adding a prefix (if any).

This post uses the database s3spendanalysis.

The following screenshot shows the completed crawler configuration.

Run this crawler to add tables to your Glue Data Catalog. After the crawler has completed successfully, go to the Tables section on your AWS Glue console to verify the table details and table metadata. The following screenshot shows the table details and table metadata after your AWS Glue crawler has completed successfully:

Creating an external schema

Before you can query the S3 inventory reports, you need to create an external schema (and subsequently, external tables) in Amazon Redshift. An Amazon Redshift external schema references an external database in an external data catalog. Because you are using an AWS Glue Data Catalog as your external catalog, after you create an external schema in Amazon Redshift, you can see all the external tables in your Data Catalog in Amazon Redshift. To create the external schema, enter the following code:

create external schema spectrum_schema from data catalog
database 's3spendanalysis'
iam_role 'arn:aws:iam::<AWS_IAM_ROLE>';

Querying the table

On the Amazon Redshift dashboard, under Query editor, you can see the data table. You can also query the svv_external_schemas system table to verify that your external schema has been created successfully. See the following screenshot.

You can now query the S3 inventory reports directly from Amazon Redshift without having to move the data into Amazon Redshift first. The following screenshot shows how to do this using the Query Editor in the Amazon Redshift console:

Setting up S3 server access logs for analysis

The following diagram shows the data flow for this solution.

The following steps summarize the data flow diagram above:

  • S3 Server Access Logs are delivered to an S3 bucket that you configure.
  • These server access logs are then directly accessible to be queried from Amazon Redshift (note that we’ll be using CREATE EXTERNAL TABLE in Redshift Spectrum for this purpose, explained below).
  • The S3 Server Access Logs and the Cost and Usage Reports (available in another S3 bucket) are now ready to be joined and queried for analysis.

The S3 server access logs are delivered to an S3 bucket. For more information about setting up server access logging, see Amazon S3 Server Access Logging.

The following screenshot shows the S3 bucket structure for the server access logs.

The server access log files consist of a sequence of new-line delimited log records. Each log record represents one request and consists of space-delimited fields. The following code is an example log record:

b8ad5f5cfd3c09418536b47b157851fb7bea4a00486471093a7d765e35a4f8ef s3spendanalysisblog [23/Sep/2018:22:10:52 +0000] 72.21.196.65 arn:aws:iam::<AWS Account #>:user/shayons D5633DAD1063C5CA REST.GET.LIFECYCLE - "GET /s3spendanalysisblog?lifecycle= HTTP/1.1" 404 NoSuchLifecycleConfiguration 332 - 105 - "-" "S3Console/0.4, aws-internal/3 aws-sdk-java/1.11.408 Linux/4.9.119-0.1.ac.277.71.329.metal1.x86_64 OpenJDK_64-Bit_Server_VM/25.181-b13 java/1.8.0_181" -

Creating an external table

You can define the S3 server access logs as an external table. Because you already have an external schema, you can create the external table in it. This post uses the RegEx SerDe so that all the fields present in the S3 server access logs are parsed correctly. See the following code:

CREATE EXTERNAL TABLE spectrum_schema.s3accesslogs(
BucketOwner                   varchar(256), 
Bucket                        varchar(256), 
RequestDateTime               varchar(256), 
RemoteIP                      varchar(256), 
Requester                     varchar(256), 
RequestID                     varchar(256), 
Operation                     varchar(256), 
Key                           varchar(256), 
RequestURI_operation          varchar(256),
RequestURI_key                varchar(256),
RequestURI_httpProtoversion   varchar(256),
HTTPstatus                    varchar(256), 
ErrorCode                     varchar(256), 
BytesSent                     varchar(256), 
ObjectSize                    varchar(256), 
TotalTime                     varchar(256), 
TurnAroundTime                varchar(256), 
Referrer                      varchar(256), 
UserAgent                     varchar(256), 
VersionId                     varchar(256))
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^ ]*) ([^ ]*) ([^ ]*)\" (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\") ([^ ]*)'
  )
STORED AS TEXTFILE
LOCATION
  's3://s3spendanalysisblog/accesslogs/';
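To see how the `input.regex` value splits a record into the 20 columns, you can try the same expression locally. The following is a sketch in Python, with the SQL string escaping removed from the expression. The sample record is the one shown earlier, except that the account placeholder is replaced with a made-up account ID (`123456789012`), because a placeholder containing spaces would not match the space-delimited pattern. The `parse_record` helper is a hypothetical name.

```python
import re

# Column names in the same order as the CREATE EXTERNAL TABLE statement.
COLUMNS = [
    "BucketOwner", "Bucket", "RequestDateTime", "RemoteIP", "Requester",
    "RequestID", "Operation", "Key", "RequestURI_operation", "RequestURI_key",
    "RequestURI_httpProtoversion", "HTTPstatus", "ErrorCode", "BytesSent",
    "ObjectSize", "TotalTime", "TurnAroundTime", "Referrer", "UserAgent",
    "VersionId",
]

# The input.regex value from the table definition, one capture group per column.
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) \[(.*?)\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) '
    r'"([^ ]*) ([^ ]*) ([^ ]*)" (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) '
    r'([^ ]*) ([^ ]*) ("[^"]*") ([^ ]*)'
)

# Sample record from this post; the account ID is a made-up stand-in.
SAMPLE = (
    'b8ad5f5cfd3c09418536b47b157851fb7bea4a00486471093a7d765e35a4f8ef '
    's3spendanalysisblog [23/Sep/2018:22:10:52 +0000] 72.21.196.65 '
    'arn:aws:iam::123456789012:user/shayons D5633DAD1063C5CA '
    'REST.GET.LIFECYCLE - "GET /s3spendanalysisblog?lifecycle= HTTP/1.1" '
    '404 NoSuchLifecycleConfiguration 332 - 105 - "-" '
    '"S3Console/0.4, aws-internal/3 aws-sdk-java/1.11.408 '
    'Linux/4.9.119-0.1.ac.277.71.329.metal1.x86_64 '
    'OpenJDK_64-Bit_Server_VM/25.181-b13 java/1.8.0_181" -'
)


def parse_record(line):
    """Map one access log record to column names, or None if it does not match."""
    match = LOG_PATTERN.fullmatch(line)
    return dict(zip(COLUMNS, match.groups())) if match else None


record = parse_record(SAMPLE)
for column in ("Bucket", "Operation", "RequestURI_key", "HTTPstatus", "ErrorCode"):
    print(f"{column:16} {record[column]}")
```

Note that the Referrer and UserAgent values keep their surrounding double quotes, because the quotes are part of what the corresponding groups capture.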

Validating the data

You can validate the external table data in Amazon Redshift. The following screenshot shows how to do this using the Query Editor in the Amazon Redshift console:

You are now ready to analyze the data.

Analyzing the data using Amazon Redshift

In this post, you have a CUR file per day in your S3 bucket. The files themselves are organized in a monthly hierarchy. See the following screenshot.

Each day’s delivery consists of the following files for CUR data:

  • myCURReport-1.csv.gz – A gzip-compressed file of the data itself
  • myCURReport-Manifest.json – A JSON file that contains the metadata for the file
  • myCURReport-RedshiftCommands.sql – Amazon Redshift table creation scripts and a COPY command to create the CUR table from a Redshift manifest file
  • myCURReport-RedshiftManifest.json – The Amazon Redshift manifest file to create the CUR table

Using Amazon Redshift is one of the many ways to carry out this analysis. Amazon Redshift is a fast, fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools. Amazon Redshift gives you fast querying capabilities over structured data using familiar SQL-based clients and BI tools using standard ODBC and JDBC connections. Queries are distributed and parallelized across multiple physical resources.

You are now ready to run SQL queries with the Amazon Redshift SQL Query Editor. This post also uses the psql client tool, a terminal-based front end from PostgreSQL, to query the data in the cluster.

To query the data, complete the following steps:

  1. Create a custom schema to contain your tables for analysis. See the following code:
    create schema if not exists redshift_schema;

    You should create your table in a schema other than public to control user access to database objects.

  2. Create a CUR table for the latest month in Amazon Redshift using the CUR SQL file in S3. See the following code:
    create table redshift_schema.AWSBilling201910 (
    identity_LineItemId VARCHAR(256),
    identity_TimeInterval VARCHAR(100),
    bill_InvoiceId VARCHAR(100),
    bill_BillingEntity VARCHAR(10),
    bill_BillType VARCHAR(100),
    bill_PayerAccountId VARCHAR(100),
    bill_BillingPeriodStartDate TIMESTAMPTZ,
    bill_BillingPeriodEndDate TIMESTAMPTZ,
    lineItem_UsageAccountId VARCHAR(100),
    lineItem_LineItemType VARCHAR(100),
    lineItem_UsageStartDate TIMESTAMPTZ,
    lineItem_UsageEndDate TIMESTAMPTZ,
    lineItem_ProductCode VARCHAR(100),
    lineItem_UsageType VARCHAR(100),
    lineItem_Operation VARCHAR(100),
    lineItem_AvailabilityZone VARCHAR(100),
    lineItem_ResourceId VARCHAR(256),
    lineItem_UsageAmount DECIMAL(11,2),
    lineItem_NormalizationFactor VARCHAR(10),
    lineItem_NormalizedUsageAmount DECIMAL(11,2),
    lineItem_CurrencyCode VARCHAR(10),
    lineItem_UnblendedRate DECIMAL(11,2),
    lineItem_UnblendedCost DECIMAL(11,2),
    lineItem_BlendedRate DECIMAL(11,2),
    lineItem_BlendedCost DECIMAL(11,2),
    lineItem_LineItemDescription VARCHAR(100),
    lineItem_TaxType VARCHAR(100),
    lineItem_LegalEntity VARCHAR(100),
    product_ProductName VARCHAR(100),
    product_alarmType VARCHAR(100),
    product_automaticLabel VARCHAR(100),
    product_availability VARCHAR(100),
    product_availabilityZone VARCHAR(100),
    product_clockSpeed VARCHAR(100),
    product_currentGeneration VARCHAR(100),
    product_databaseEngine VARCHAR(100),
    product_dedicatedEbsThroughput VARCHAR(100),
    product_deploymentOption VARCHAR(100),
    product_durability VARCHAR(100),
    product_ecu VARCHAR(100),
    product_edition VARCHAR(100),
    product_engineCode VARCHAR(100),
    product_enhancedNetworkingSupported VARCHAR(100),
    product_eventType VARCHAR(100),
    product_feeCode VARCHAR(100),
    product_feeDescription VARCHAR(100),
    product_fromLocation VARCHAR(100),
    product_fromLocationType VARCHAR(100),
    product_gpu VARCHAR(100),
    product_gpuMemory VARCHAR(100),
    product_group VARCHAR(100),
    product_groupDescription VARCHAR(100),
    product_instanceFamily VARCHAR(100),
    product_instanceType VARCHAR(100),
    product_instanceTypeFamily VARCHAR(100),
    product_io VARCHAR(100),
    product_labelingTaskType VARCHAR(100),
    product_licenseModel VARCHAR(100),
    product_location VARCHAR(100),
    product_locationType VARCHAR(100),
    product_maxThroughputvolume VARCHAR(100),
    product_maxVolumeSize VARCHAR(100),
    product_memory VARCHAR(100),
    product_messageDeliveryFrequency VARCHAR(100),
    product_messageDeliveryOrder VARCHAR(100),
    product_minVolumeSize VARCHAR(100),
    product_networkPerformance VARCHAR(100),
    product_normalizationSizeFactor VARCHAR(100),
    product_operation VARCHAR(100),
    product_physicalCpu VARCHAR(100),
    product_physicalGpu VARCHAR(100),
    product_physicalProcessor VARCHAR(100),
    product_processorArchitecture VARCHAR(100),
    product_processorFeatures VARCHAR(100),
    product_productFamily VARCHAR(100),
    product_protocol VARCHAR(100),
    product_queueType VARCHAR(100),
    product_region VARCHAR(100),
    product_servicecode VARCHAR(100),
    product_servicename VARCHAR(100),
    product_sku VARCHAR(100),
    product_storage VARCHAR(100),
    product_storageClass VARCHAR(100),
    product_storageMedia VARCHAR(100),
    product_subscriptionType VARCHAR(100),
    product_toLocation VARCHAR(100),
    product_toLocationType VARCHAR(100),
    product_transferType VARCHAR(100),
    product_usageFamily VARCHAR(100),
    product_usagetype VARCHAR(100),
    product_vcpu VARCHAR(100),
    product_version VARCHAR(100),
    product_volumeType VARCHAR(100),
    product_workforceType VARCHAR(100),
    pricing_RateId VARCHAR(100),
    pricing_publicOnDemandCost DECIMAL(11,2),
    pricing_publicOnDemandRate DECIMAL(11,2),
    pricing_term VARCHAR(100),
    pricing_unit VARCHAR(100),
    reservation_AmortizedUpfrontCostForUsage DECIMAL(11,2),
    reservation_AmortizedUpfrontFeeForBillingPeriod DECIMAL(11,2),
    reservation_EffectiveCost DECIMAL(11,2),
    reservation_EndTime TIMESTAMPTZ,
    reservation_ModificationStatus VARCHAR(100),
    reservation_NormalizedUnitsPerReservation BIGINT,
    reservation_RecurringFeeForUsage DECIMAL(11,2),
    reservation_StartTime TIMESTAMPTZ,
    reservation_SubscriptionId VARCHAR(100),
    reservation_TotalReservedNormalizedUnits BIGINT,
    reservation_TotalReservedUnits BIGINT,
    reservation_UnitsPerReservation BIGINT,
    reservation_UnusedAmortizedUpfrontFeeForBillingPeriod DECIMAL(11,2),
    reservation_UnusedNormalizedUnitQuantity BIGINT,
    reservation_UnusedQuantity BIGINT,
    reservation_UnusedRecurringFee DECIMAL(11,2),
    reservation_UpfrontValue BIGINT
    );

  3. Load the data into Amazon Redshift for the latest month, using the provided CUR Manifest file. See the following code:
    copy redshift_schema.AWSBilling201910 from 's3://ss-cur//myCURReport/20191001-20191101/fd76beee-0709-42d5-bcb2-bb45f8ba1aae/myCURReport-RedshiftManifest.json'
    iam_role 'arn:aws:iam::<AWS_IAM_ROLE>'
    GZIP CSV IGNOREHEADER 1 TIMEFORMAT 'auto' manifest;

  4. Validate the data loaded in the Amazon Redshift table. See the following code:
    select * from redshift_schema.AWSBilling201910
    where lineItem_ProductCode = 'AmazonS3'
    and lineItem_ResourceId = 's3spendanalysisblog' limit 10;

    The following screenshot shows that data has been loaded correctly in the Amazon Redshift table:
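If you repeat steps 2 and 3 every month, it can help to derive the table name and the manifest location from the billing month. The following Python sketch assumes the S3 layout seen in the COPY command of step 3 (bucket, report prefix, report name, billing period folder, assembly ID folder) and the `AWSBilling<YYYYMM>` table naming used in this post. The function names are hypothetical, and the layout is inferred from that single path, so check it against your own report bucket.

```python
from datetime import date


def billing_period(year, month):
    """Folder name for a billing month, e.g. 20191001-20191101."""
    start = date(year, month, 1)
    # The period ends on the first day of the following month.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return f"{start:%Y%m%d}-{end:%Y%m%d}"


def billing_table(year, month, schema="redshift_schema"):
    """Schema-qualified table name following the naming used in this post."""
    return f"{schema}.AWSBilling{year}{month:02d}"


def manifest_path(bucket, prefix, report, year, month, assembly_id):
    """S3 location of the Redshift manifest file (layout is an assumption)."""
    return (f"s3://{bucket}/{prefix}/{report}/{billing_period(year, month)}/"
            f"{assembly_id}/{report}-RedshiftManifest.json")


print(billing_table(2019, 10))
print(manifest_path("ss-cur", "", "myCURReport", 2019, 10,
                    "fd76beee-0709-42d5-bcb2-bb45f8ba1aae"))
```

With an empty report prefix, the second call reproduces the manifest path used in step 3, including the double slash after the bucket name.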

Managing database security

You can manage database security in Amazon Redshift by controlling which users have access to which database objects. To make sure your objects are secure, create two groups: FINANCE and ADMIN, with two users in FINANCE and one user in ADMIN. Complete the following steps:

  1. Create the groups where the user accounts are assigned. The following code creates two different user groups:
    create group finance;
    create group admin;

    To view all user groups, query the PG_GROUP system catalog table (you should see finance and admin here):

    select * from pg_group;

  2. Create three database users with different privileges and add them to the groups. See the following code:
    create user finance1 password 'finance1Pass'
    in group finance;
    
    create user finance2 password 'finance2Pass'
    in group finance;
    
    create user admin1 password 'admin1Pass'
    in group admin;

    Validate that the users have been successfully created by querying the PG_USER catalog table, which lists all database users.

  3. Grant SELECT privileges to the FINANCE group and ALL privileges to the ADMIN group for your table AWSBilling201910 in redshift_schema. See the following code:
    grant select on table redshift_schema.AWSBilling201910 to group finance; 
    grant all on table redshift_schema.AWSBilling201910 to group admin;

    You can verify that database security is enforced correctly. For example, when the user finance1 tries to rename the table AWSBilling201910 in redshift_schema, the attempt fails with a permission denied error message (due to restricted access). The following screenshot shows this scenario and the subsequent error message:

Example S3 inventory analysis

S3 charges split per bucket. The following query identifies the data storage and transfer costs for each separate S3 bucket:

SELECT
  "lineitem_productcode",
  "lineitem_usagetype",
  "lineitem_resourceid",
  b."storage_class",
  SUM(CASE
    WHEN "lineitem_usagetype" like '%Byte%' THEN "lineitem_usageamount"/1024
    ELSE "lineitem_usageamount"
  END) as "Usage",
  CASE
    WHEN "lineitem_usagetype" like '%Byte%' THEN 'TBs'
    ELSE 'Requests'
  END as "Usage Units",
  sum("lineitem_blendedcost") as cost
from awsbilling201902 a
  join spectrum_schema.data b
    on a.lineItem_ResourceId = b.bucket
where "product_productname" = 'Amazon Simple Storage Service'
group by
  "lineitem_productcode",
  "lineitem_usagetype",
  "lineitem_resourceid",
  b."storage_class"
order by
  sum("lineitem_blendedcost") desc;

The following screenshot shows the results of executing the above query:
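The two CASE expressions in the query above can be easier to follow outside SQL. The following Python sketch mirrors them: usage types whose name contains "Byte" are divided by 1024 and labeled TBs (which implies that the report records that usage in GB, an inference from the query rather than something this post states), and everything else is counted as requests. The line items are made-up values, the helper names are hypothetical, and the grouping is simplified to bucket and usage type, without the join to the inventory table.

```python
from collections import defaultdict


def normalize_usage(usage_type, usage_amount):
    """Mirror the CASE logic: byte-based usage is scaled by 1024 and labeled TBs."""
    if "Byte" in usage_type:  # same test as: like '%Byte%'
        return usage_amount / 1024, "TBs"
    return usage_amount, "Requests"


def summarize(line_items):
    """Sum usage and blended cost per group, highest cost first."""
    totals = defaultdict(lambda: [0.0, 0.0])  # key -> [usage, cost]
    for item in line_items:
        usage, units = normalize_usage(item["usage_type"], item["usage_amount"])
        key = (item["resource_id"], item["usage_type"], units)
        totals[key][0] += usage
        totals[key][1] += item["blended_cost"]
    rows = [(key, tuple(values)) for key, values in totals.items()]
    return sorted(rows, key=lambda row: row[1][1], reverse=True)


# Illustrative line items only; these are not real billing data.
items = [
    {"resource_id": "s3spendanalysisblog", "usage_type": "TimedStorage-ByteHrs",
     "usage_amount": 1536.0, "blended_cost": 35.0},
    {"resource_id": "s3spendanalysisblog", "usage_type": "TimedStorage-ByteHrs",
     "usage_amount": 512.0, "blended_cost": 12.0},
    {"resource_id": "s3spendanalysisblog", "usage_type": "Requests-Tier1",
     "usage_amount": 4000.0, "blended_cost": 0.02},
]

for (bucket, usage_type, units), (usage, cost) in summarize(items):
    print(bucket, usage_type, usage, units, cost)
```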

Costs are split by type of storage (for example, Glacier versus standard storage).

The following query identifies S3 data transfer costs (intra-region and inter-region) by S3 storage class (usage amount, unblended cost, blended cost):

SELECT
 lineitem_productcode
 ,product_fromlocation
 ,product_tolocation
 ,b.storage_class
 ,sum(lineitem_usageamount) usageamount
 ,sum(lineitem_unblendedcost) unblendedcost
 ,sum(lineitem_blendedcost) blendedcost
FROM
awsbilling201902 a
  join spectrum_schema.data b
ON
    a.lineItem_ResourceId = b.bucket
WHERE
 a.lineitem_productcode = 'AmazonS3'
 AND a.product_productfamily = 'Data Transfer'
GROUP BY
 1,2,3,4
ORDER BY
 usageamount desc;

The following screenshot shows the result of executing the above query:

The following query identifies S3 fee, API request, and storage charges:

SELECT
 lineitem_productcode
 ,product_productfamily
 ,b.storage_class
 ,sum(lineitem_usageamount) usageamount
 ,sum(lineitem_unblendedcost) unblendedcost
 ,sum(lineitem_blendedcost) blendedcost
FROM
awsbilling201902 a
  join spectrum_schema.data b
ON
   a.lineItem_ResourceId = b.bucket
WHERE
 a.lineitem_productcode = 'AmazonS3'
  and a.product_productfamily <> 'Data Transfer'
GROUP BY
 1,2,3
ORDER BY
 usageamount desc;

The following screenshot shows the result of executing the above query:

Server access logs sample analysis queries

S3 access log charges per operation type. The following query identifies the data storage and transfer costs for each separate HTTP operation:

SELECT
  "lineitem_productcode",
  "lineitem_usagetype",
  "lineitem_resourceid",
  b."operation",
  b."httpstatus",
  b."bytessent",
  SUM(CASE
      WHEN "lineitem_usagetype" like '%Byte%'
        THEN "lineitem_usageamount" / 1024
      ELSE "lineitem_usageamount"
      END) as "Usage",
  CASE
  WHEN "lineitem_usagetype" like '%Byte%'
    THEN 'TBs'
  ELSE 'Requests'
  END  as "Usage Units",
  sum("lineitem_blendedcost") as cost
from awsbilling201902 a
  join spectrum_schema.s3accesslogs b
    on a.lineItem_ResourceId = b.bucket
where "product_productname" = 'Amazon Simple Storage Service'
group by
  1, 2, 3, 4, 5, 6
order by
  sum("lineitem_blendedcost") desc;

The following screenshot shows the result of executing the above query:

The following query identifies S3 data transfer costs (intra-region and inter-region) by S3 operation and HTTP status (usage amount, unblended cost, blended cost):

SELECT
 lineitem_productcode
 ,product_fromlocation
 ,product_tolocation
 ,b.operation
 ,b.httpstatus
 ,sum(lineitem_usageamount) usageamount
 ,sum(lineitem_unblendedcost) unblendedcost
 ,sum(lineitem_blendedcost) blendedcost
FROM
awsbilling201902 a
  JOIN spectrum_schema.s3accesslogs b
ON
   a.lineItem_ResourceId = b.bucket
WHERE
 a.lineitem_productcode = 'AmazonS3'
 AND a.product_productfamily = 'Data Transfer'
GROUP BY
 1,2,3,4,5
ORDER BY
 usageamount desc;

The following screenshot shows the result of executing the above query:

The following query identifies S3 fee, API request, and storage charges:

SELECT
 lineitem_productcode
 ,product_productfamily
 ,b.operation
 ,b.httpstatus
 ,sum(lineitem_usageamount) usageamount
 ,sum(lineitem_unblendedcost) unblendedcost
 ,sum(lineitem_blendedcost) blendedcost
FROM
awsbilling201902 a
  JOIN spectrum_schema.s3accesslogs b
ON
   a.lineItem_ResourceId = b.bucket
WHERE
 a.lineitem_productcode = 'AmazonS3'
  and a.product_productfamily <> 'Data Transfer'
GROUP BY
 1,2,3,4
ORDER BY
 usageamount desc;

The following screenshot shows the result of executing the above query:

Overall data flow diagram

The following diagram shows the complete data flow for this solution.

Conclusion

AWS Glue provides an easy and convenient way to discover data stored in your S3 buckets automatically in a cloud-native, secure, and efficient way. This post demonstrated how to use AWS Glue and Amazon Redshift to analyze your S3 spend using Cost and Usage Reports. You also learned best practices for managing database security in Amazon Redshift through users and groups. Using this framework, you can start analyzing your S3 bucket spend with a few clicks in a matter of minutes on the AWS Management Console!

If you have questions or suggestions, please leave your thoughts in the comments section below.

About the Author

Shayon Sanyal is a Data Architect, Data Lake for Global Financial Services at AWS.

Collect and distribute high-resolution crypto market data with ECS, S3, Athena, Lambda, and AWS Data Exchange

Post Syndicated from Jared Katz original https://aws.amazon.com/blogs/big-data/collect-and-distribute-high-resolution-crypto-market-data-with-ecs-s3-athena-lambda-and-aws-data-exchange/

This is a guest post by Floating Point Group. In their own words, “Floating Point Group is on a mission to bring institutional-grade trading services to the world of cryptocurrency.”

The need and demand for financial infrastructure designed specifically for trading digital assets may not be obvious. There’s a rather pervasive narrative that these coins and tokens are effectively natively digital counterparts to traditional assets such as currencies, commodities, equities, and fixed income. This narrative often manifests in the form of pithy one-liners recycled by pundits attempting to communicate the value proposition of various projects in the space (such as, “Bitcoin is just a currency with an algorithmically controlled, tamper-proof monetary policy,” or, “Ether is just a commodity like gasoline that you can use to pay for computational work on a global computer.”). Unsurprisingly, we at FPG often hear the question, “What’s so special about cryptocurrencies that they warrant dedicated financial services? Why do we need solutions for problems that have already been solved?”

The truth is that these assets and the widespread public interest surrounding them are entirely unprecedented. The decentralized ledger technology that serves as an immutable record of network transactions, the clever use of proof-of-work algorithms to economically incentivize rational actors to help uphold the security of the network (the proof-of-work concept dates back at least as far as 1993, but it was not until bitcoin that the technology showed potential for widespread adoption), the irreversible nature of transactions that poses unique legal challenges in cases such as human error or extortion, the precariousness of self-custody (third-party custody solutions don’t exactly have track records that inspire trust), the regulatory uncertainties that come with the difficulty of both classifying these assets as well as arbitrating their exchange which must ultimately be reconciled by entities like the IRS, SEC, and CFTC—it is all very new, and very weird. With 24-hour market volume regularly exceeding $100 billion, we decided to direct our focus towards problems related specifically to trading these assets. Granted, crypto trading has undoubtedly matured since the days of bartering for bitcoin in web forums and witnessing 10% price spreads between international exchanges. But there is still a long path ahead.

One major pain point we are aiming to address for institutional traders involves liquidity (or, more precisely, the lack thereof). Simply put, the buying and selling of cryptocurrencies occurs across many different trading venues (exchanges), and liquidity (the offers to buy or sell a certain quantity of an asset at a certain price) continues to become more fragmented as new exchanges emerge. So say you’re trying to buy 100 bitcoins. You must buy from people who are willing to sell. As you take the best (cheapest) offers, you’re left with increasingly expensive offers. By the time you fill your order (in this example, buy all 100 bitcoins), you may have paid a much higher average price than, say, the price you paid for the first bitcoin of your order. This phenomenon is referred to as slippage. One easy way to minimize slippage is by expanding your search for offers. So rather than looking at the offers on just one exchange, look at the offers across hundreds of exchanges. This process, traditionally referred to as smart order routing (SOR), is one of the core services we provide. Our SOR service allows traders to easily submit orders that our system can match against the best offers available across multiple trading venues by actively monitoring liquidity across dozens of exchanges.
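A small worked example makes the slippage arithmetic concrete. The sketch below is illustrative only: the prices and sizes are invented, `fill_order` is a hypothetical helper, and it ignores fees, latency, and transfer costs that a real order router has to account for. It fills a 100 bitcoin buy order by taking the cheapest offers first, once on a single exchange and once across two exchanges.

```python
def fill_order(asks, quantity):
    """Buy `quantity` units by taking the cheapest offers first.

    asks is an iterable of (price, size) offers.
    Returns (total_cost, average_price).
    """
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    remaining = quantity
    cost = 0
    for price, size in sorted(asks):  # cheapest offers first
        take = min(size, remaining)
        cost += take * price
        remaining -= take
        if remaining == 0:
            break
    if remaining > 0:
        raise ValueError("not enough liquidity to fill the order")
    return cost, cost / quantity


# Invented order books: (price in USD, size in BTC).
exchange_a = [(10000, 40), (10050, 40), (10200, 40)]
exchange_b = [(10010, 50), (10040, 30)]

best_price = min(price for price, _ in exchange_a + exchange_b)
_, avg_single = fill_order(exchange_a, 100)
_, avg_routed = fill_order(exchange_a + exchange_b, 100)

print("single exchange:", avg_single, "slippage:", avg_single - best_price)
print("two exchanges:  ", avg_routed, "slippage:", avg_routed - best_price)
```

In this made-up scenario, the average price falls from 10,060 to 10,009 per bitcoin once the second exchange's offers are included, so the slippage relative to the best offer shrinks from 60 to 9.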

Fanning out large orders in search of the best prices is a rather intuitive and widely applicable concept—roughly 75% of equities are purchased and sold via SOR. But the value of such a service for crypto markets is particularly salient: a perpetual cycle of new exchanges surging in popularity while incumbents falter has resulted in a seemingly incessant fragmentation of liquidity across trading venues—yet traders tend to assume an exchange-agnostic mindset, concerned exclusively with finding the best price for a given quantity of an asset.

Access to both real-time and historical market data is essential to the functionality of our SOR service. The highest resolution data we could hope to obtain for a given market would include every trade and every change applied to the order book, effectively allowing us to recreate the state of a market at any given point in time. The updates provided through the WebSocket streams are not sufficient for reconstructing order books. We also need to periodically fetch snapshots of the order books and store those, which we can do using an exchange’s REST API. We can fetch a snapshot and apply the corresponding updates from the streams to “replay” the order book.
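The replay idea can be sketched roughly as follows. The record shapes, the `update_id` and `last_update_id` field names, and the convention that a quantity of zero removes a price level are all assumptions made for this example; each exchange defines its own message format.

```python
def replay_order_book(snapshot, updates):
    """Apply incremental updates on top of a snapshot and return the resulting book.

    snapshot: {"last_update_id": int, "bids": {price: qty}, "asks": {price: qty}}
    updates:  iterable of {"update_id": int, "bids": [(price, qty)], "asks": [(price, qty)]}
    """
    book = {"bids": dict(snapshot["bids"]), "asks": dict(snapshot["asks"])}
    last_id = snapshot["last_update_id"]
    for update in sorted(updates, key=lambda u: u["update_id"]):
        if update["update_id"] <= last_id:
            continue  # already reflected in the snapshot
        for side in ("bids", "asks"):
            for price, qty in update[side]:
                if float(qty) == 0:
                    book[side].pop(price, None)  # assumed convention: zero removes the level
                else:
                    book[side][price] = qty
        last_id = update["update_id"]
    book["last_update_id"] = last_id
    return book


# Hypothetical snapshot and stream updates.
snapshot = {"last_update_id": 10, "bids": {"7405.00": "1.5"}, "asks": {"7406.00": "2.0"}}
updates = [
    {"update_id": 9, "bids": [("7405.00", "9.9")], "asks": []},  # older than the snapshot
    {"update_id": 11, "bids": [("7404.50", "0.7")], "asks": [("7406.00", "0")]},
    {"update_id": 12, "bids": [("7405.00", "1.0")], "asks": [("7407.00", "3.0")]},
]
book = replay_order_book(snapshot, updates)
```

Copying the snapshot before applying updates leaves the stored snapshot untouched, so the same snapshot can be replayed to any later point in time.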

Fortunately, this data is freely available, because many exchanges offer real-time feeds of market data via WebSocket APIs. We found several third-party vendors selling subscriptions to these data sets, typically in the form of CSV dumps delivered at a weekly or monthly cadence. This presented the question of build vs. buy. Given that we felt capable of building a robust and reliable system for ingesting real-time market data in a relatively short amount of time and at a fraction of the cost of purchasing the data from a vendor, we were already leaning in favor of building. Further investigation made buying look like an increasingly unattractive option. Disclaimers that multiple vendors issued about their inability to guarantee data quality and consistency did not inspire confidence. Inspecting sample data sets revealed that some essential fields provided in the original data streams were missing—fields necessary for achieving our goal of recreating the state of a market at an arbitrary point in time. We also recognized that a weekly or monthly delivery schedule would restrict our ability to explore relatively recent market data.

This post provides a high-level overview of how we ingest and store real-time market data and how we use the AWS Data Exchange API to organize and publish our data sets programmatically. Our system’s functionality extends well beyond data ingestion, normalization, and persistence; we run dedicated services for data validation, caching the most recent trade and order book for every market, computing and storing derivative metrics, and other services that help safeguard data accuracy and minimize the latency of our trading systems.

Data ingestion

The WebSocket streams we connect to for data consumption are often the same APIs responsible for providing real-time updates to an exchange’s trading dashboard.

WebSocket connections transmit data as discrete messages. We can inspect the content of individual messages as they stream into the browser. For example, the following screenshot shows a batch of order book updates.

The updates are expressed as arrays of bids and asks that were either added to the book or removed from it. Client-side code processes each update, resulting in a real-time rendering of the market’s order book. In practice, our data ingestion service (Ingester) does not read a single stream, but rather thousands of different streams, covering various data feeds for all markets across multiple exchanges. All the connections required for such broad coverage and the resulting flood of incoming data raise some obvious concerns about data loss. We’ve taken several measures to mitigate such concerns, including a redundant system design that allows us to spin up an arbitrary number of instances of the Ingester service. Like most of our microservices, Ingester is a Dockerized service run on Amazon ECS and deployed via Terraform.

All these instances consume the same data feeds as each other while a downstream mechanism handles deduplication (this is covered in more detail later in this post). We also set up Amazon CloudWatch alerts to notify us when we detect non-contiguous messages, indicating a gap in the incoming data. The alerts don’t directly mitigate data loss, but they do serve the important function of prompting an investigation.
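A gap check of this kind can be sketched as follows, assuming the feed's sequence IDs increase by exactly one per message (not every exchange guarantees this). The function name is hypothetical, and publishing the result as a CloudWatch metric is left out.

```python
def find_sequence_gaps(sequence_ids):
    """Return (first_missing_id, next_received_id) pairs wherever the IDs skip ahead."""
    gaps = []
    previous = None
    for seq in sequence_ids:
        if previous is not None and seq > previous + 1:
            gaps.append((previous + 1, seq))
        if previous is None or seq > previous:
            previous = seq  # ignore duplicates and late arrivals
    return gaps


gaps = find_sequence_gaps([101, 102, 103, 106, 107])
print(gaps)
```

Here the stream jumps from 103 to 106, so the check reports one gap starting at 104.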

Ingester builds up separate buffers of incoming messages, split out by data-type/exchange/market. Then, after a fixed time interval, each buffer is flushed into Amazon S3 as a gzipped JSON file. The buffer-flush cycle repeats.

The following screenshot shows a portion of the file content.

This code snippet is a single, pretty-printed JSON record from the file in the screenshot above.

{
   "event_type":"trade",
   "timestamp":1571980320422,
   "ticker_pair":"BTCUSDT",
   "trade_id":194230159,
   "price":"7405.69000000",
   "quantity":"3.20285300",
   "buyer_order_id":730178987,
   "seller_order_id":730178953,
   "trade_timestamp":1571980320417,
   "buyer_market_maker":false,
   "M":true
}

Ingester handles additional functionality, such as applying pre-defined mappings of venue-specific field names to our internal field names. Data normalization is one of many processes necessary to enable our systems to build a holistic understanding of market dynamics.

As with most distributed system designs, our services are written with horizontal scalability as a first-order priority. We took the same approach in designing our data ingestion service, but it has some features that make it a bit different than the archetypical horizontally scalable microservice. The most common motivations for adjusting the number of instances of a given service are load-balancing and throttling throughput. Either your system is experiencing backpressure and a consumer service scales to alleviate that pressure, or the consumer is over-provisioned and you scale down the number of instances for the sake of parsimony. For our data ingestion service, however, our motivation for running multiple instances is to minimize data loss via redundancy. The CPU usage for each instance is independent of instance count, because each instance does identical work.

For example, rather than helping alleviate backpressure by pulling messages from a single queue, each instance of our data ingestion service connects to the same WebSocket streams and performs the same amount of work. Another somewhat unusual and confounding aspect of horizontally scaling our data ingestion service is related to state: we batch records in memory and flush the records to S3 every minute (based on the incoming message’s timestamp, not the system timestamp, because system timestamps would be inconsistent across instances).

Redundancy is our primary measure for minimizing data loss, but we also need each instance to write the files to S3 in such a way that we don’t end up with duplicate records. Our first thought was that we’d need a mechanism for coordinating activity across the instances, such as maintaining a cache that would allow us to check if a record had already been persisted. But we realized that we could perform this deduplication without any coordination between instances at all.

Most of the message streams we consume publish messages with sequence IDs. We combine the sequence IDs with the incoming message timestamp to achieve deduplication. The service checks that each message added to a batch has the appropriate sequence ID relative to the previous message in the batch, and it uses the timestamp on the incoming message to determine the exact start and end of each batch (we typically get a UNIX timestamp and check when we’ve rolled over to the next clock minute). As a result, every instance deterministically generates the exact same file names containing the exact same data, which allows us to simply rely on a key collision in S3 for deduplication.

AWS suggests a similar solution for a slightly different problem, relating to Amazon Kinesis Data Streams. For more information, see Handling Duplicate Records.

With this scheme, even if records are processed more than one time, the resulting Amazon S3 file has the same name and has the same data. The retries only result in writing the same data to the same file more than one time.
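The naming scheme described above can be sketched like this. The key layout and the function name are assumptions for illustration; the essential point is that the key depends only on the message's own timestamp, never on the clock of the instance that writes it.

```python
from datetime import datetime, timezone


def batch_key(data_type, exchange, market, message_ts_ms):
    """Derive the S3 key of the one-minute batch that a message belongs to."""
    # Truncate the server-side timestamp (milliseconds) to the start of its clock minute.
    minute_start_ms = message_ts_ms - (message_ts_ms % 60_000)
    minute = datetime.fromtimestamp(minute_start_ms // 1000, tz=timezone.utc)
    return f"{data_type}/{exchange}/{market}/{minute:%Y/%m/%d/%H%M}.json.gz"


# Timestamp taken from the sample trade record shown earlier.
key = batch_key("trades", "binance", "BTCUSDT", 1571980320422)
print(key)
```

Any instance that sees a message from the same clock minute derives the same key, so a redundant write simply overwrites an identical object.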

After we store the data, we can perform simple analytics queries on the billions of records we’ve stored in S3 using Amazon Athena, a query service that requires minimal configuration and zero infrastructure overhead. Athena has a concept of partitions (inherited from one of its underlying services, Apache Hive). Partitions are mappings between virtual columns (in our case: pair, year, month, and day) and the S3 directories in which the corresponding data is stored.

S3’s file system is not actually hierarchical. Files are prepended with long key prefixes that are rendered as directories in the AWS console when browsing a bucket’s contents. This has some non-trivial performance consequences when querying or filtering on large data sets.

The following screenshot illustrates a typical directory path.

By pointing Athena directly to a particular subset of data, a well-defined partitioning scheme can drastically reduce query run times and costs. Though the ability to perform ad hoc business analytics queries is primarily a convenience, taking time to choose a sane multi-level partitioning scheme for Athena based on some of our most common access patterns seemed worthwhile. A poorly designed partition structure can result in Athena unnecessarily scanning huge swaths of data and ultimately render the service unusable.
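As a rough illustration of how the virtual columns map to S3 prefixes, the sketch below builds a partition prefix and the matching Athena DDL statement. The Hive-style `key=value` layout, the table name, and the bucket name are assumptions; the actual directory layout may differ.

```python
def partition_prefix(pair, year, month, day):
    """Build a Hive-style prefix for one (pair, year, month, day) partition."""
    return f"pair={pair}/year={year:04d}/month={month:02d}/day={day:02d}/"


def add_partition_ddl(table, bucket, base_prefix, pair, year, month, day):
    """Build an ALTER TABLE statement that registers the partition with Athena."""
    location = f"s3://{bucket}/{base_prefix}/{partition_prefix(pair, year, month, day)}"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (pair='{pair}', year='{year:04d}', month='{month:02d}', day='{day:02d}') "
        f"LOCATION '{location}'"
    )


# Hypothetical table, bucket, and base prefix.
ddl = add_partition_ddl("trades", "example-bucket", "binance/trades", "BTCUSDT", 2019, 10, 5)
print(ddl)
```

A query that filters on `pair`, `year`, `month`, and `day` then only scans the objects under the matching prefix.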

Data publication

Our pipeline for transforming thousands of small gzipped JSON files into clean CSVs and loading them into AWS Data Exchange involves three distinct jobs, each expressed as an AWS Lambda function.

Job 1

Job 1 is initiated shortly after midnight UTC by a cron-scheduled CloudWatch event. As mentioned previously, our data ingestion service’s batching mechanism flushes each batch to S3 at a regular time interval. A timestamp on the incoming message (applied server-side) determines the rollover from one interval to the next, as opposed to the ingestion service’s system timestamp. In rare cases, a non-trivial amount of time elapses between the consumption of the final message of batch n and the first message of batch n+1, which delays the write of batch n. To minimize the likelihood of omitting data that is still pending write, we kick off the first Lambda function 20 minutes after midnight UTC.

Job 1 formats values for the date and data source into an Athena query template and outputs the query results as a CSV to a specified prefix path in S3. (Every Athena query produces a .metadata file and a CSV file of the query results, though DDL statements do not output a CSV.) This PUT request to S3 triggers an S3 event notification.

We run a full replica data ingestion system as an additional layer of redundancy. Using the coalesce conditional expression, the Athena query in Job 1 merges data from our primary system with the corresponding data from our replica system, and fills in any gaps while deduplicating redundant records.
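The merge logic can be illustrated in plain Python. This sketch assumes the query behaves like a full outer join on a record ID followed by a column-wise COALESCE that prefers the primary system; the real work happens in Athena SQL, and the function names and sample records here are hypothetical.

```python
def coalesce(*values):
    """Return the first value that is not None, like SQL COALESCE."""
    return next((v for v in values if v is not None), None)


def merge_primary_and_replica(primary, replica, key="trade_id"):
    """Merge two record lists by `key`, preferring primary values and filling gaps from replica."""
    primary_by_id = {r[key]: r for r in primary}
    replica_by_id = {r[key]: r for r in replica}
    merged = []
    for record_id in sorted(primary_by_id.keys() | replica_by_id.keys()):
        p = primary_by_id.get(record_id, {})
        r = replica_by_id.get(record_id, {})
        merged.append({f: coalesce(p.get(f), r.get(f)) for f in p.keys() | r.keys()})
    return merged


# Hypothetical records: the primary system missed trade 2.
primary = [{"trade_id": 1, "price": "7405.69"}, {"trade_id": 3, "price": "7405.80"}]
replica = [
    {"trade_id": 1, "price": "7405.69"},
    {"trade_id": 2, "price": "7405.70"},
    {"trade_id": 3, "price": "7405.80"},
]
merged = merge_primary_and_replica(primary, replica)
```

Each trade appears exactly once in the result, and the record missing from the primary system is supplied by the replica.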

We experimented fairly extensively with AWS Glue and PySpark for the ETL-related work performed in Job 1. When we realized that we could merge all the small source files into one, join the primary and replica data sets, and sort the results with a single Athena query, we decided to stick with this seemingly simpler and more elegant approach.

The following code shows one of our Athena query templates.

Job 2

Job 2 is triggered by the S3 event notification from Job 1. Job 2 simply copies the query results CSV file to a different key within the same S3 bucket.

The motivation for this step is twofold. First, we cannot dictate the name of an Athena query results CSV file; it is automatically set to the Athena query ID. Second, when adding an S3 object as an asset to an AWS Data Exchange revision, the asset’s name is automatically set to the S3 object’s key. So to dictate how the CSV file name appears in AWS Data Exchange, we must first rename it, which we accomplish by copying it to a specified S3 key.
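A copy step along these lines could look like the following sketch. The destination key layout and the function names are assumptions made for this example; `copy_object` is the standard boto3 S3 client call, and the client is passed in as a parameter so the logic can be exercised without AWS access.

```python
def destination_key(data_source, date_str):
    """Build the key (and therefore the asset name) under which the CSV is published."""
    return f"published/{data_source}/{data_source}_{date_str}.csv"


def copy_query_result(s3_client, bucket, source_key, data_source, date_str):
    """Copy an Athena result CSV to a predictable key within the same bucket."""
    target_key = destination_key(data_source, date_str)
    s3_client.copy_object(
        Bucket=bucket,
        Key=target_key,
        CopySource={"Bucket": bucket, "Key": source_key},
    )
    return target_key


# In a Lambda handler, the client would typically be created once with
# boto3.client("s3"), and the bucket and source key read from the S3 event.
```

Keeping the key construction in its own function makes the naming convention easy to test and to change in one place.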

Job 3

Job 3 handles all work related to AWS Data Exchange and AWS Marketplace Catalog via their respective APIs. We use boto3, AWS’s Python SDK, to interface with these APIs. The AWS Marketplace Catalog API is necessary for adding data set revisions to products that have already been published. For more information, see Tutorial: Adding New Data Set Revisions to a Published Data Product.

Our code explicitly defines mappings with the following structure:

data source / DataSet / Product

The following code shows how we configure relationships between data sources, data sets, and products.

Our data sources are typically represented by a trading venue and data type combination (such as Binance trades or CoinbasePro order books). Each new file for a given data source is delivered as a single asset within a single new revision for a particular data set.
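As a rough sketch of what such mappings could look like, the example below defines the three levels as small data classes. The attribute names mirror the ones the wrapper class in this post reads (`lambda_s3_trigger_target_data_sets`, `products`, `id`, `arn`, `validate_object_name`), but the class definitions, the object name pattern, and every ID, ARN, bucket, and prefix are hypothetical placeholders.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    id: str  # AWS Marketplace entity ID of the published product


@dataclass(frozen=True)
class DataSet:
    id: str
    arn: str
    products: tuple = ()


@dataclass(frozen=True)
class S3DataSource:
    name: str
    lambda_s3_trigger_target_data_sets: tuple
    # Assumed naming convention: <source>_<YYYY-MM-DD>.csv
    object_name_pattern: str = r"^[\w-]+_\d{4}-\d{2}-\d{2}\.csv$"

    def validate_object_name(self, obj_name):
        if not re.match(self.object_name_pattern, obj_name):
            raise ValueError(f"Unexpected object name: {obj_name}")


# Placeholder identifiers, not real resources.
binance_trades_product = Product(id="prod-EXAMPLE")
binance_trades_data_set = DataSet(
    id="example-data-set-id",
    arn="arn:aws:dataexchange:us-east-1:123456789012:data-sets/example-data-set-id",
    products=(binance_trades_product,),
)

# Keyed by (bucket, prefix) so that an S3 event can be resolved to its data source.
LambdaS3TriggerMappings = {
    ("example-bucket", "publish/binance/trades"): S3DataSource(
        name="binance_trades",
        lambda_s3_trigger_target_data_sets=(binance_trades_data_set,),
    ),
}
```

Keying the mapping on the bucket and prefix pair lets a single Lambda function serve every data set, because the S3 event alone is enough to find the target data sets and products.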

An S3 trigger kicks off the Lambda function. The trigger is scoped to a specified prefix that maps to a single data set. The function alias feature of AWS Lambda allows us to define the unique S3 triggers for each data set while reusing the same underlying Lambda function. Job 3 carries out the following steps (note that steps 1 through 5 refer to the AWS Data Exchange API while steps 6 and 7 refer to the AWS Marketplace Catalog API):

  1. Submits a request to create a new revision for the corresponding data set via CreateRevision.
  2. Adds the file that was responsible for triggering the Lambda function to the newly created revision via CreateJob using the IMPORT_ASSETS_FROM_S3 job type. To submit this job, we need to supply a few values: the S3 bucket and key values for the file are pulled from the Lambda event message, while the RevisionId argument comes from the response to the CreateRevision call in the previous step.
  3. Kicks off the job with StartJob, sourcing the JobId argument from the response to the CreateJob call in the previous step.
  4. Polls the job’s status via GetJob (using the same job ID from the CreateJob response) to check that our file (the asset) was successfully added to the revision.
  5. Finalizes the revision via UpdateRevision.
  6. Requests a description of the marketplace entity using DescribeEntity, passing in the product ID stored in our hardcoded mappings as the EntityId.
  7. Kicks off the entity ChangeSet via StartChangeSet, passing in the entity ID from the DescribeEntity response in the previous step as the entity Identifier, the revision ARN parsed from the response to our earlier call to CreateRevision in RevisionArns, and the data set ARN as DataSetArn, which we fetch at the start of the code’s runtime using the AWS Data Exchange API’s GetDataSet.

Here’s a thin wrapper class we wrote to carry out the steps detailed above:

from time import sleep
import logging
import json

import boto3

from config import (
    DATA_EXCHANGE_REGION,
    MARKETPLACE_CATALOG_REGION,
    LambdaS3TriggerMappings
)

logger = logging.getLogger()


class CustomDataExchangeClient:
    def __init__(self):
        self._de_client = boto3.client('dataexchange', region_name=DATA_EXCHANGE_REGION)
        self._mc_client = boto3.client('marketplace-catalog', region_name=MARKETPLACE_CATALOG_REGION)
    
    def _get_s3_data_source(self, bucket, prefix):
        return LambdaS3TriggerMappings[(bucket, prefix)]

    # Job State can be one of: WAITING | IN_PROGRESS | ERROR | COMPLETED | CANCELLED | TIMED_OUT
    def _wait_for_de_job_completion(self, job_id):
        while True:
            get_job_resp = self._de_client.get_job(JobId=job_id)
            if get_job_resp['State'] == 'COMPLETED':
                logger.info(f"Job '{job_id}' succeeded:\n\t{get_job_resp}")
                break
            elif get_job_resp['State'] in ('ERROR', 'CANCELLED', 'TIMED_OUT'):
                raise Exception(f"Job '{job_id}' failed:\n\t{get_job_resp}")
            else:
                sleep(5)
                logger.info(f"Still waiting on job {job_id}...")
        return get_job_resp

    # ChangeSet Status can be one of: PREPARING | APPLYING | SUCCEEDED | CANCELLED | FAILED
    def _wait_for_mc_change_set_completion(self, change_set_id):
        while True:
            describe_change_set_resp = self._mc_client.describe_change_set(
                Catalog='AWSMarketplace',
                ChangeSetId=change_set_id
            )
            if describe_change_set_resp['Status'] == 'SUCCEEDED':
                logger.info(
                    f"ChangeSet '{change_set_id}' succeeded:\n\t{describe_change_set_resp}"
                )
                break
            elif describe_change_set_resp['Status'] in ('FAILED', 'CANCELLED'):
                raise Exception(
                    f"ChangeSet '{change_set_id}' failed:\n\t{describe_change_set_resp}"
                )
            else:
                sleep(1)
                logger.info(f"Still waiting on ChangeSet {change_set_id}...")
        return describe_change_set_resp

    def process_s3_event(self, s3_event):
        source_bucket = s3_event['Records'][0]['s3']['bucket']['name']
        source_key = s3_event['Records'][0]['s3']['object']['key']
        source_prefix = '/'.join(source_key.split('/')[0:-1])
        s3_data_source = self._get_s3_data_source(source_bucket, source_prefix)
        obj_name = source_key.split('/')[-1]
        
        s3_data_source.validate_object_name(obj_name)
        
        for data_set in s3_data_source.lambda_s3_trigger_target_data_sets:
            # Create revision
            create_revision_resp = self._de_client.create_revision(
                DataSetId=data_set.id,
                Comment=obj_name
            )
            logger.debug(create_revision_resp)
            revision_id = create_revision_resp['Id']
            revision_arn = create_revision_resp['Arn']

            # Create job
            create_job_resp = self._de_client.create_job(
                Type='IMPORT_ASSETS_FROM_S3',
                Details={
                    'ImportAssetsFromS3': {
                      'AssetSources': [
                          {
                              'Bucket': source_bucket,
                              'Key': source_key
                          },
                      ],
                      'DataSetId': data_set.id,
                      'RevisionId': revision_id
                    }
                }
            )
            logger.debug(create_job_resp)

            # Start job
            job_id = create_job_resp['Id']
            start_job_resp = self._de_client.start_job(JobId=job_id)
            logger.debug(start_job_resp)

            # Wait for Data Exchange job completion
            get_job_resp = self._wait_for_de_job_completion(job_id)
            logger.debug(get_job_resp)

            # Finalize revision
            update_revision_resp = self._de_client.update_revision(
                DataSetId=data_set.id,
                RevisionId=revision_id,
                Finalized=True
            )
            logger.debug(update_revision_resp)

            # Ensure revision finalization succeeded
            finalized_status = update_revision_resp['Finalized']
            if finalized_status is not True:
                raise Exception(f"Failed to finalize revision:\n{update_revision_resp}")

            # Publish the new revision to each product associated with the data set
            for product in data_set.products:
                # Describe the AWS Marketplace entity corresponding to the Data Exchange product
                describe_entity_resp = self._mc_client.describe_entity(
                    Catalog='AWSMarketplace',
                    EntityId=product.id
                )
                logger.debug(describe_entity_resp)

                entity_type = describe_entity_resp['EntityType']
                entity_id = describe_entity_resp['EntityIdentifier']

                # Isolate the target data set in the DescribeEntity response
                describe_entity_resp_data_sets = json.loads(describe_entity_resp['Details'])['DataSets']
                describe_entity_resp_data_set = list(
                    filter(lambda ds: ds['DataSetArn'] == data_set.arn, describe_entity_resp_data_sets)
                )
                # We should get the data set of interest in describe_entity_resp and only that data set
                assert len(describe_entity_resp_data_set) == 1

                # Start a ChangeSet to add the newly finalized revision to an existing product
                start_change_set_resp = self._mc_client.start_change_set(
                    Catalog='AWSMarketplace',
                    ChangeSet=[
                        {
                            "ChangeType": "AddRevisions",
                            "Entity": {
                                "Identifier": entity_id,
                                "Type": entity_type
                            },
                            "Details": json.dumps({
                                "DataSetArn": data_set.arn,
                                "RevisionArns": [revision_arn]
                            })
                        }
                    ]
                )
                logger.debug(start_change_set_resp)

                # Wait for the ChangeSet workflow to complete
                change_set_id = start_change_set_resp['ChangeSetId']
                describe_change_set_resp = self._wait_for_mc_change_set_completion(change_set_id)
                logger.debug(describe_change_set_resp)

The following screenshot shows the S3 trigger for Job 3.

The following screenshot shows an example of CloudWatch logs for Job 3.

The following screenshot shows a CloudWatch alarm for Job 3.

Finally, we can verify that our revisions were successfully added to their corresponding data sets and products through the AWS console.

AWS Data Exchange allows you to create private offers for your AWS account IDs, providing a convenient means of checking that revisions show up in each product as expected.

Conclusion

This post demonstrated how you can integrate AWS Data Exchange into an existing data pipeline frictionlessly. We’re pleased to have been invited to participate in the AWS Data Exchange private preview, and even more pleased with the service itself, which has proven to be a sophisticated yet natural extension of our system.

I want to offer special thanks to both Kyle Patsen and Rafic Melhem of the AWS Data Exchange team for generously fielding my questions (and patiently enduring my ramblings) for the better part of the past year. I also want to thank Lucas Adams for helping me design the system discussed in this post and, more importantly, for his unwavering vote of confidence.

If you are interested in learning more about FPG, don’t hesitate to contact us.

 

Easily Manage Shared Data Sets with Amazon S3 Access Points

Post Syndicated from Brandon West original https://aws.amazon.com/blogs/aws/easily-manage-shared-data-sets-with-amazon-s3-access-points/

Storage that is secure, scalable, durable, and highly available is a fundamental component of cloud computing. That’s why Amazon Simple Storage Service (S3) was the first service launched by AWS, back in 2006. It has been a building block of many of the more than 175 services that AWS now offers. As we approach the beginning of a new decade, capabilities like Amazon Redshift, Amazon Athena, Amazon EMR and AWS Lake Formation have made S3 not just a way to store objects but an engine for turning that data into insights. These capabilities mean that access patterns and requirements for the data stored in buckets have evolved.

Today we’re launching a new way to manage data access at scale for shared data sets in S3: Amazon S3 Access Points. S3 Access Points are unique hostnames with dedicated access policies that describe how data can be accessed using that endpoint. Before S3 Access Points, shared access to data meant managing a single policy document on a bucket. These policies could represent hundreds of applications with many differing permissions, making audits and updates a potential bottleneck affecting many systems.

With S3 Access Points, you can add access points as you add additional applications or teams, keeping your policies specific and easier to manage. A bucket can have multiple access points, and each access point has its own AWS Identity and Access Management (IAM) policy. Access point policies are similar to bucket policies, but associated with the access point. S3 Access Points can also be restricted to only allow access from within an Amazon Virtual Private Cloud (VPC). And because each access point has a unique DNS name, you can now address your buckets with any name that is unique within your AWS account and region.

Creating S3 Access Points

Let’s add an access point to a bucket using the S3 Console. You can also create and manage your S3 Access Points using the AWS Command Line Interface (CLI), AWS SDKs, or via the API. I’ve selected a bucket that contains artifacts generated by an AWS Lambda function, and clicked on the access points tab.

Access points tab in S3 Console

Let’s create a new access point. I want to give an IAM user Alice permission to GET and PUT objects with the prefix Alice. I’m going to name this access point alices-access-point. There are options for restricting access to a Virtual Private Cloud, which just requires a Virtual Private Cloud ID. In this case, I want to allow access from outside the VPC as well, so after I took this screenshot, I selected Internet and moved on to the next step.

Creating an Access Point

S3 Access Points makes it easy to block public access. I’m going to block all public access to this access point.

Public access settings

And now I can attach my policy. In this policy, our Principal is our user Alice, and the resource is our access point combined with every object with the prefix /Alice. For more examples of the kinds of policies you might want to attach to your S3 Access Points, take a look at the docs.

Creating access point policy
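For readers who want a text version, a policy along those lines can be sketched as follows. This is an illustration built from the description above rather than a copy of the policy in the screenshot; the account ID is a placeholder, and the helper function is hypothetical.

```python
import json


def access_point_policy(region, account_id, access_point_name, user_name, prefix):
    """Build a policy that lets one IAM user GET and PUT objects under a prefix."""
    access_point_arn = f"arn:aws:s3:{region}:{account_id}:accesspoint/{access_point_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:user/{user_name}"},
                "Action": ["s3:GetObject", "s3:PutObject"],
                # Objects reached through an access point are addressed as <arn>/object/<key>.
                "Resource": f"{access_point_arn}/object/{prefix}/*",
            }
        ],
    }


# 123456789012 is a placeholder account ID.
policy = access_point_policy("us-east-1", "123456789012", "alices-access-point", "Alice", "Alice")
print(json.dumps(policy, indent=2))
```

Scoping the resource to `Alice/*` keeps the access point useful only for that user's slice of the bucket.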

After I create the access point, I can access it by hostname using the format https://[access_point_name]-[accountID].s3-accesspoint.[region].amazonaws.com. Via the SDKs and CLI, I can use it the same way I would use a bucket once I’ve updated to the latest version. For example, assuming I were authenticated as Alice, I could do the following:

$ aws s3api get-object --key /Alice/object.zip --bucket arn:aws:s3:us-east-1:[my-account-id]:accesspoint/alices-access-point download.zip

Access points that are not restricted to VPCs can also be used via the S3 Console.

Things to Know

When it comes to software design, keeping scopes small and focused on a specific task is almost always a good decision. With S3 Access Points, you can customize hostnames and permissions for any user or application that needs access to your shared data set. Let us know how you like this new capability, and happy building!

— Brandon

Provisioning the Intuit Data Lake with Amazon EMR, Amazon SageMaker, and AWS Service Catalog

Post Syndicated from Michael Sambol original https://aws.amazon.com/blogs/big-data/provisioning-the-intuit-data-lake-with-amazon-emr-amazon-sagemaker-and-aws-service-catalog/

This post shares Intuit’s learnings and recommendations for running a data lake on AWS. The Intuit Data Lake is built and operated by numerous teams in Intuit Data Platform. Thanks to Tristan Baker (Chief Architect), Neil Lamka (Principal Product Manager), Achal Kumar (Development Manager), Nicholas Audo, and Jimmy Armitage for their feedback and support.

A data lake is a centralized repository for storing structured and unstructured data at any scale. At Intuit, creating such a pile of raw data is easy. However, more interesting challenges present themselves:

  1. How should AWS accounts be organized?
  2. What ingestion methods will be used? How will analysts find the data they need?
  3. Where should data be stored? How should access be managed?
  4. What security measures are needed to protect Intuit’s sensitive data?
  5. Which parts of this ecosystem can be automated?

This post outlines the approach taken by Intuit, though it is important to remember that there are many ways to build a data lake (for example, AWS Lake Formation).

We’ll cover the technologies and processes involved in creating the Intuit Data Lake at a high level, including the overall structure and the automation used in provisioning accounts and resources. Watch this space in the future for more detailed blog posts on specific aspects of the system, from the other teams and engineers who worked together to build the Intuit Data Lake.

Architecture

Account Structure

Data lakes typically follow a hub-and-spoke model, with the hub account containing shared services that control access to data sources. For the purposes of this post, we’ll refer to the hub account as Central Data Lake.

In this pattern, access to Central Data Lake is apportioned to spoke accounts called Processing Accounts. This model maintains separation between end users and allows for division of billing among distinct business units.

 

 

It is common to maintain two ecosystems: pre-production (Pre-Prod) and production (Prod). This allows data lake administrators to silo access to data by preventing connectivity between Pre-Prod and Prod.

To enable experimentation and testing, it may also be advisable to maintain separate VPC-based environments within Pre-Prod accounts, such as dev, qa, and e2e. Processing Account VPCs would then be connected to the corresponding VPC in Central Data Lake.

Note that at first, we connected accounts via VPC Peering. However, as we scaled we quickly approached the hard limit of 125 VPC peering connections, requiring us to migrate to AWS Transit Gateway. As of this writing, we connect multiple new Processing Accounts weekly.

 

 

Central Data Lake

There may be numerous services running in a hub account, but we’ll focus on the aspects that are most relevant to this blog: ingestion, sanitization, storage, and a data catalog.

 

 

Ingestion, Sanitization, and Storage

A key component of Central Data Lake is a uniform ingestion pattern for streaming data. One example is an Apache Kafka cluster running on Amazon EC2. (You can read about how Intuit engineers do this in another AWS blog.) As we deal with hundreds of data sources, we’ve enabled access to ingestion mechanisms via AWS PrivateLink.

Note: Amazon Managed Streaming for Apache Kafka (Amazon MSK) is an alternative to running Apache Kafka on Amazon EC2, but was not available at the start of Intuit’s migration.

In addition to stream processing, another method of ingestion is batch processing, such as jobs running on Amazon EMR. After data is ingested by one of these methods, it can be stored in Amazon S3 for further processing and analysis.

Intuit deals with a large volume of customer data, and each field is carefully considered and classified with a sensitivity level. All sensitive data that enters the lake is encrypted at the source. The ingestion systems retrieve the encrypted data and move it into the lake. Before it is written to S3, the data is sanitized by a proprietary RESTful service. Analysts and engineers operating within the data lake consume this masked data.

Data Catalog

A data catalog is a common way to give end users information about the data and where it lives. One example is a Hive Metastore backed by Amazon Aurora. Another alternative is the AWS Glue Data Catalog.

Processing Accounts

When Processing Accounts are delivered to end users, they include an identical set of resources. We’ll discuss the automation of Processing Accounts below, but the primary components are as follows:

 

 

Figure: Processing Account structure upon delivery to the customer

 

Data Storage Mechanisms

One reasonable question is whether all data should reside in Central Data Lake, or if it’s acceptable to distribute data across multiple accounts. A data lake might employ a combination of the two approaches, and classify data locations as primary or secondary.

The primary location for data is Central Data Lake, and it arrives there via the ingestion pipelines discussed previously. Processing Accounts can read from the primary source, either directly from the ingestion pipelines or from S3. Processing Accounts can contribute their transformed data back into Central Data Lake (primary), or store it in their own accounts (secondary). The proper storage location depends on the type of data, and who needs to consume it.

One rule worth enforcing is that no cross-account writes should be permitted. In other words, the IAM principal (in most cases, an IAM role assumed by EC2 via an instance profile) must be in the same account as the destination S3 bucket. This is because cross-account delegation is not supported—specifically, S3 bucket policies in Central Data Lake cannot grant Processing Account A access to objects written by a role in Processing Account B.
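
As an illustration of this rule, a pre-deployment check could compare the account ID embedded in the writing principal’s ARN with the account that owns the destination bucket. The following is a minimal Python sketch, not Intuit’s tooling; the function names and the role ARNs used with it are hypothetical.

```python
def account_id_from_arn(arn: str) -> str:
    """Return the account ID embedded in an IAM or STS ARN."""
    # ARN layout: arn:partition:service:region:account-id:resource
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    return parts[4]


def is_same_account_write(writer_arn: str, bucket_owner_account: str) -> bool:
    """True when the writing principal lives in the bucket owner's account."""
    return account_id_from_arn(writer_arn) == bucket_owner_account
```

A pipeline could reject any job definition for which is_same_account_write returns False before the job is ever deployed.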

Another possibility is for EMR to assume different IAM roles via a custom credentials provider (see this AWS blog), but we chose not to go down this path at Intuit because it would have required many EMR jobs to be rewritten.

Data Access Patterns

The majority of end users are interested in the data that resides in S3. In Central Data Lake and some Processing Accounts, there may be a set of read-only S3 buckets: any account in the data lake ecosystem can read data from this type of bucket.

To facilitate management of S3 access for read-only buckets, we built a mechanism to control S3 bucket policies, administered entirely via code. Our deployment pipelines use account metadata to dynamically generate the correct S3 bucket policy based on the type of account (Pre-Prod or Prod). These policies are committed back into our code repository for auditability and ease of management.

We employ the same method for managing KMS key policies, as we use KMS with customer managed customer master keys (CMKs) for at-rest encryption in S3.

Here’s an example of a generated S3 bucket policy for a read-only bucket:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ProcessingAccountReadOnly",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::111111111111:root",
                    "arn:aws:iam::222222222222:root",
                    "arn:aws:iam::333333333333:root",
                    "arn:aws:iam::444444444444:root",
                    "arn:aws:iam::555555555555:root",
                    ...
                    ...
                    ...
                    "arn:aws:iam::999999999999:root"
                ]
            },
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::intuit-data-lake-example/*",
                "arn:aws:s3:::intuit-data-lake-example"
            ]
        }
    ]
}

Note that we grant access at the account level, rather than using explicit IAM principal ARNs. Because the reads are cross-account, permissions are also required on the IAM principals in Processing Accounts. Maintaining these policies—with automation, at that level of granularity—is untenable at scale. Furthermore, using specific IAM principal ARNs would create an external dependency on foreign accounts. For example, if a Processing Account deletes an IAM role that is referenced in an S3 bucket policy in Central Data Lake, the bucket policy can no longer be saved, causing interruptions to deployment pipelines.
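
To make the idea of generating these policies from account metadata concrete, here is a small Python sketch. It is an illustration under stated assumptions rather than our actual pipeline code: the function name and the shape of the metadata (a list of dictionaries with "id" and "env" keys) are hypothetical.

```python
import json
from typing import Dict, Iterable, List


def build_read_only_policy(bucket: str,
                           accounts: Iterable[Dict[str, str]],
                           env: str) -> dict:
    """Build a read-only bucket policy for every account in one environment."""
    # Grant at the account level (":root") rather than to individual IAM
    # principals, so the policy does not depend on roles in foreign accounts.
    principals: List[str] = sorted(
        {f"arn:aws:iam::{acct['id']}:root" for acct in accounts if acct["env"] == env}
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ProcessingAccountReadOnly",
                "Effect": "Allow",
                "Principal": {"AWS": principals},
                "Action": ["s3:ListBucket", "s3:GetObject"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}/*",
                    f"arn:aws:s3:::{bucket}",
                ],
            }
        ],
    }


# Hypothetical account metadata; real metadata would come from the pipeline.
ACCOUNTS = [
    {"id": "111111111111", "env": "prod"},
    {"id": "222222222222", "env": "pre-prod"},
    {"id": "333333333333", "env": "prod"},
]

policy = build_read_only_policy("intuit-data-lake-example", ACCOUNTS, "prod")
print(json.dumps(policy, indent=4))
```

Because the output is deterministic (principals are de-duplicated and sorted), committing the generated JSON back to a repository produces clean diffs when an account is added or removed.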

Security

Security is mission critical for any data lake. We’ll mention a subset of the controls we use, but not dive deep.

Encryption

Encryption can be enforced both in transit and at rest, using multiple methods:

  1. Traffic within the lake should use the latest version of TLS (1.2 as of this writing)
  2. Data can be encrypted with application-level (client-side) encryption
  3. KMS keys can be used for at-rest encryption of S3, EBS, and RDS
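
For S3, one common way to enforce encryption in transit is a bucket policy statement that denies any request not sent over TLS, using the aws:SecureTransport condition key. The Python sketch below builds such a statement; the helper name is hypothetical, and note that this condition requires TLS but does not pin a specific TLS version.

```python
def deny_insecure_transport_statement(bucket: str) -> dict:
    """Return a bucket policy statement that rejects non-TLS requests."""
    return {
        "Sid": "DenyInsecureTransport",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [
            f"arn:aws:s3:::{bucket}",
            f"arn:aws:s3:::{bucket}/*",
        ],
        # aws:SecureTransport is "false" when the request was not sent over TLS.
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }
```

The statement can be appended to the Statement list of an existing bucket policy; because it is an explicit Deny, it overrides any Allow in the same policy.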

Ingress and Egress

There’s nothing out of the ordinary in our approach to ingress and egress, but it’s worth mentioning the standard patterns we’ve found important:

Policies restricting ingress and egress are the primary points at which a data lake can guarantee quality (ingress) and prevent loss (egress).

Authorization

Access to the Intuit Data Lake is controlled via IAM roles, meaning no IAM users (with long-term credentials) are created. End users are granted access via an internal service that manages role-based, federated access to AWS accounts. Regular reviews are conducted to remove nonessential users.

Configuration Management

We use an internal fork of Cloud Custodian, which is a suite of preventative, detective, and responsive controls consisting of Amazon CloudWatch Events and AWS Config rules. Some of the violations it reports and (optionally) mitigates include:

  • Unauthorized CIDRs in inbound security group rules
  • Public S3 bucket policies and ACLs
  • IAM user console access
  • Unencrypted S3 buckets, EBS volumes, and RDS instances

Lastly, Amazon GuardDuty is enabled in all Intuit Data Lake accounts and is monitored by Intuit Security.

Automation

If there is one thing we’ve learned building the Intuit Data Lake, it is to automate everything.

There are four areas of automation we’ll discuss in this blog:

  1. Creation of Processing Accounts
  2. Processing Account Orchestration Pipeline
  3. Processing Account Terraform Pipeline
  4. EMR and SageMaker deployment via Service Catalog

Creation of Processing Accounts

The first step in creating a Processing Account is to make a request through an internal tool. This triggers automation that provisions an Intuit-stamped AWS account under the correct business unit.

 

Note: AWS Control Tower’s Account Factory was not available at the start of our journey, but it can be leveraged to provision new AWS accounts in a secured, best practice, self-service way.

Account setup also includes automated VPC creation (with optional VPN), fully automated using Service Catalog. End users simply specify subnet sizes.

It’s worth noting that Intuit leverages Service Catalog for self-service deployment of other common patterns, including ingress security groups, VPC endpoints, and VPC peering. Here’s an example portfolio:

Processing Account Orchestration Pipeline

After account creation and VPC provisioning, the Processing Account Orchestration Pipeline runs. This pipeline executes one-time tasks required for Processing Accounts. These tasks include:

  • Bootstrapping an IAM role for use in further configuration management
  • Creation of KMS keys for S3, EBS, and RDS encryption
  • Creation of variable files for the new account
  • Updating the master configuration file with account metadata
  • Generation of scripts to orchestrate the Terraform pipeline discussed below
  • Sharing Transit Gateways via Resource Access Manager

Processing Account Terraform Pipeline

This pipeline manages the lifecycle of dynamic, frequently-updated resources, including IAM roles, S3 buckets and bucket policies, KMS key policies, security groups, NACLs, and bastion hosts.

There is one pipeline for every Processing Account, and each pipeline deploys a series of layers into the account, using a set of parameterized deployment jobs. A layer is a logical grouping of Terraform modules and AWS resources, providing a way to shrink Terraform state files and reduce blast radius if redeployment of specific resources is required.

EMR and SageMaker Deployment via Service Catalog

AWS Service Catalog facilitates the provisioning of Amazon EMR and Amazon SageMaker, allowing end users to launch EMR clusters and SageMaker notebook instances that work out of the box, with embedded security.

Service Catalog allows data scientists and data engineers to launch EMR clusters in a self-service fashion with user-friendly parameters, and provides them with the following:

  • Bootstrap action to enable connectivity to services in Central Data Lake
  • EC2 instance profile to control S3, KMS, and other granular permissions
  • Security configuration that enables at-rest and in-transit encryption
  • Configuration classifications for optimal EMR performance
  • Encrypted AMI with monitoring and logging enabled
  • Custom Kerberos connection to LDAP

For SageMaker, we use Service Catalog to launch notebook instances with custom lifecycle configurations that set up connections or initialize the following: Hive Metastore, Kerberos, security, Splunk logging, and OpenDNS. You can read more about lifecycle configurations in this AWS blog. Launching a SageMaker notebook instance with best-practice configuration is as easy as follows:

Conclusion

This post illustrates the building blocks we used in creating the Intuit Data Lake. Our solution isn’t wholly unique; it is composed of common-sense approaches we’ve gleaned from dozens of engineers across Intuit, representing decades of experience. These practices have enabled us to push petabytes of data into the lake, and serve hundreds of Processing Accounts with varying needs. We are still building, but we hope our story helps you in your data lake journey.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

 


About the Authors

Michael Sambol is a senior consultant at AWS. He holds an MS in computer science from Georgia Tech. Michael enjoys working out, playing tennis, traveling, and watching Western movies.

Ben Covi is a staff software engineer at Intuit. At any given moment, he’s probably losing a game of Catan.

Secure your data on Amazon EMR using native EBS and per bucket S3 encryption options

Post Syndicated from Duncan Chan original https://aws.amazon.com/blogs/big-data/secure-your-data-on-amazon-emr-using-native-ebs-and-per-bucket-s3-encryption-options/

Data encryption is an effective way to bolster data security. You can make sure that only authorized users or applications read your sensitive data by encrypting your data and managing access to the encryption key. One of the main reasons that customers from regulated industries such as healthcare and finance choose Amazon EMR is that it provides them with a compliant environment to store and access data securely.

This post provides a detailed walkthrough of two new encryption options to help you secure an EMR cluster that handles sensitive data. The first option is native EBS encryption to encrypt volumes attached to EMR clusters. The second option is an Amazon S3 encryption option that allows you to use different encryption modes and customer master keys (CMKs) for individual S3 buckets with Amazon EMR.

Local disk encryption on Amazon EMR

Previously you could only choose Linux Unified Key Setup (LUKS) for at-rest encryption. You now have a choice of using LUKS or native EBS encryption to encrypt EBS volumes attached to an EMR cluster. EBS encryption provides the following benefits:

  • End-to-end encryption – When you enable EBS encryption for Amazon EMR, all data on EBS volumes, including intermediate disk spills from applications and disk I/O between the nodes and EBS volumes, is encrypted. The snapshots that you take of an encrypted EBS volume are also encrypted and you can move them between AWS Regions as needed.
  • Amazon EMR root volumes encryption – There is no need to create a custom Amazon Linux Image for encrypting root volumes.
  • Easy auditing for encryption – When you use LUKS encryption, your EBS volumes are encrypted along with any instance store volumes, but you still see EBS with Not Encrypted status when you use an Amazon EC2 API or the EC2 console to check the encryption status. This is because the API doesn’t look into the EMR cluster to check the disk status; your auditors would need to SSH into the cluster to check for disk encryption compliance. However, with EBS encryption, you can check the encryption status from the EC2 console or through an EC2 API call.
  • Transparent Encryption – EBS encryption is transparent to any applications running on Amazon EMR and doesn’t require you to modify any code.
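
As a sketch of the auditing benefit, the Python helper below lists volumes that report as unencrypted, given a response in the shape returned by the EC2 DescribeVolumes API (a "Volumes" list whose entries carry "VolumeId" and "Encrypted"). The helper name is hypothetical, and the sample response is made up; in practice the dictionary might come from a boto3 call such as ec2.describe_volumes(), with pagination handled separately.

```python
from typing import List


def unencrypted_volume_ids(describe_volumes_response: dict) -> List[str]:
    """Return the IDs of volumes whose Encrypted flag is missing or False."""
    return [
        vol["VolumeId"]
        for vol in describe_volumes_response.get("Volumes", [])
        if not vol.get("Encrypted", False)
    ]


# Made-up response fragment for illustration only.
sample_response = {
    "Volumes": [
        {"VolumeId": "vol-0aaa", "Encrypted": True},
        {"VolumeId": "vol-0bbb", "Encrypted": False},
    ]
}
print(unencrypted_volume_ids(sample_response))
```

With LUKS, volumes encrypted inside the cluster would still appear in this list, which is the auditing gap that native EBS encryption closes.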

Amazon EBS encryption integrates with AWS KMS to provide the encryption keys that protect your data. To use this feature, you have to use a CMK in your account and Region. A CMK gives you control to create and manage the key, including enabling and disabling the key, controlling access, rotating the key, and deleting it. For more information, see Customer Master Keys.

Enabling EBS encryption on Amazon EMR

To enable EBS encryption on Amazon EMR, complete the following steps:

  1. Create your CMK in AWS KMS.
    You can do this either through the AWS KMS console, AWS CLI, or the AWS KMS CreateKey API. Create keys in the same Region as your EMR cluster. For more information, see Creating Keys.
  2. Give the Amazon EMR service role and EC2 instance profile permission to use your CMK on your behalf.
    If you are using the EMR_DefaultRole, add the policy with the following steps:

    • Open the AWS KMS console.
    • Choose your AWS Region.
    • Choose the key ID or alias of the CMK you created.
    • On the key details page, under Key Users, choose Add.
    • Choose the Amazon EMR service role. The name of the default role is EMR_DefaultRole.
    • Choose Attach.
    • Choose the Amazon EC2 instance profile. The name of the default role for the instance profile is EMR_EC2_DefaultRole.
    • Choose Attach.
      If you are using a customized policy, add the following code to the service role to allow Amazon EMR to create and use the CMK, with the resource being the CMK ARN:

      { 
      "Version": "2012-10-17", 
      "Statement": [ 
         { 
         "Sid": "EmrDiskEncryptionPolicy", 
         "Effect": "Allow", 
         "Action": [ 
            "kms:Encrypt", 
            "kms:Decrypt", 
            "kms:ReEncrypt*", 
            "kms:CreateGrant", 
            "kms:GenerateDataKeyWithoutPlaintext", 
            "kms:DescribeKey" 
            ],
         "Resource": [
            "arn:aws:kms:region:account-id:key/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
            ] 
         } 
      ] 
      } 

       

  3. Create and configure the Amazon EMR security configuration template. You can do this through the console, the AWS CLI, or an SDK. On the console, complete the following steps:
    • Open the Amazon EMR console.
    • Choose Security Configuration.
    • Under Local disk encryption, choose Enable at-rest encryption for local disks.
    • For Key provider type, choose AWS KMS.
    • For AWS KMS customer master key, choose the key ARN of your CMK. This post uses the key ARN ebsEncryption_emr_default_role.
    • Select Encrypt EBS volumes with EBS encryption.
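
If you prefer the CLI or an SDK to the console steps above, the security configuration is supplied as a JSON document. The Python sketch below assembles one for local disk encryption with EBS encryption enabled. Treat it as an illustration: the function name is hypothetical, the key ARN is a placeholder, and you should confirm the field names against the current Amazon EMR documentation before relying on them.

```python
import json


def local_disk_encryption_config(kms_key_arn: str) -> str:
    """Return a security configuration JSON string for EBS encryption."""
    config = {
        "EncryptionConfiguration": {
            "EnableInTransitEncryption": False,
            "EnableAtRestEncryption": True,
            "AtRestEncryptionConfiguration": {
                "LocalDiskEncryptionConfiguration": {
                    "EncryptionKeyProviderType": "AwsKms",
                    "AwsKmsKey": kms_key_arn,
                    # Use native EBS encryption for the attached volumes.
                    "EnableEbsEncryption": True,
                }
            },
        }
    }
    return json.dumps(config, indent=2)


# Placeholder ARN; substitute the ARN of the CMK you created in step 1.
KEY_ARN = "arn:aws:kms:region:account-id:key/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
print(local_disk_encryption_config(KEY_ARN))
```

The resulting string can be saved to a file and passed to aws emr create-security-configuration, or supplied as the SecurityConfiguration argument of the equivalent SDK call.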

Default encryption with EC2 vs. Amazon EMR EBS encryption

EC2 has a similar feature called default encryption. With this feature, all EBS volumes in your account are encrypted without exception using a single CMK that you specify per Region. With EBS encryption from Amazon EMR, you can use a different KMS key per EMR cluster to secure your EBS volumes. You can use both EBS encryption provided by Amazon EMR and default encryption provided by EC2.

For this post, EBS encryption provided by Amazon EMR takes precedence, and you encrypt the EBS volumes attached to the cluster with the CMK that you selected in the security configuration.

S3 encryption

Amazon S3 encryption also works with Amazon EMR File System (EMRFS) objects read from and written to S3. You can use either server-side encryption (SSE) or client-side encryption (CSE) mode to encrypt objects in S3 buckets. The following table summarizes the different encryption modes available for S3 encryption in Amazon EMR.

Mode       | Encryption location            | Key storage | Key management
SSE-S3     | Server side on S3              | S3          | S3
SSE-KMS    | Server side on S3              | KMS         | Choose the AWS managed CMK for Amazon S3 with the alias aws/s3, or create a custom CMK.
CSE-KMS    | Client side on the EMR cluster | KMS         | A custom CMK that you create.
CSE-Custom | Client side on the EMR cluster | You         | Your own key provider.

The encryption choice you make depends on your specific workload requirements. SSE-S3 is the most straightforward option: by selecting a check box, you fully delegate the encryption of S3 objects to Amazon S3. SSE-KMS and CSE-KMS are better options when you need granular control over CMKs in KMS by using policies. With AWS KMS, you can see when, where, and by whom your customer managed CMKs were used, because AWS CloudTrail logs API calls for key access and key management. These logs provide you with full audit capabilities for your keys. For more information, see Encryption at Rest for EMRFS Data in Amazon S3.

Encrypting your S3 buckets with different encryption modes and keys

With S3 encryption on Amazon EMR, all the encryption modes use a single CMK by default to encrypt objects in S3. If you have highly sensitive content in specific S3 buckets, you may want to manage the encryption of these buckets separately by using different CMKs or encryption modes for individual buckets. You can accomplish this using the per bucket encryption overrides option in Amazon EMR. To do so, complete the following steps:

  1. Open the Amazon EMR console.
  2. Choose Security Configuration.
  3. Under S3 encryption, select Enable at-rest encryption for EMRFS data in Amazon S3.
  4. For Default encryption mode, choose your encryption mode. This post uses SSE-KMS.
  5. For AWS KMS customer master key, choose your key. The key you provide here encrypts all S3 buckets used with Amazon EMR. This post uses ebsEncryption_emr_default_role.
  6. Choose Per bucket encryption overrides. You can set different encryption modes for different buckets.
  7. For S3 bucket, add the S3 bucket that you want to encrypt differently.
  8. For Encryption mode, choose an encryption mode.
  9. For Encryption materials, enter your CMK.
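
The same settings can be expressed as security configuration JSON. The following Python sketch builds the S3 encryption section with a default mode plus per bucket overrides. It is a sketch under assumptions: the function name, bucket name, and key ARNs are hypothetical placeholders, and the field names (EncryptionMode, AwsKmsKey, Overrides, BucketName) should be checked against the current Amazon EMR documentation.

```python
import json
from typing import Dict, List, Optional


def s3_encryption_config(default_mode: str,
                         default_key_arn: Optional[str],
                         overrides: List[Dict[str, str]]) -> dict:
    """Build an at-rest S3 encryption config with per bucket overrides."""
    s3_config: dict = {"EncryptionMode": default_mode}
    if default_key_arn:
        # SSE-S3 needs no key; the KMS-based modes do.
        s3_config["AwsKmsKey"] = default_key_arn
    if overrides:
        s3_config["Overrides"] = overrides
    return {
        "EncryptionConfiguration": {
            "EnableInTransitEncryption": False,
            "EnableAtRestEncryption": True,
            "AtRestEncryptionConfiguration": {
                "S3EncryptionConfiguration": s3_config
            },
        }
    }


# Placeholder key ARNs and bucket name for illustration only.
DEFAULT_KEY = "arn:aws:kms:region:account-id:key/default-key-id"
SENSITIVE_KEY = "arn:aws:kms:region:account-id:key/sensitive-key-id"

config = s3_encryption_config(
    "SSE-KMS",
    DEFAULT_KEY,
    [
        {
            "BucketName": "example-sensitive-bucket",
            "EncryptionMode": "SSE-KMS",
            "AwsKmsKey": SENSITIVE_KEY,
        }
    ],
)
print(json.dumps(config, indent=2))
```

Here every bucket uses the default key except example-sensitive-bucket, which gets its own CMK so that access to its contents can be controlled through a separate key policy.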

If you have already enabled default encryption for S3 buckets directly in Amazon S3, you can also choose to bypass the S3 encryption options in the security configuration setting in Amazon EMR. This allows Amazon EMR to delegate encrypting objects in the buckets to Amazon S3, which uses the encryption key specified in the bucket’s default encryption settings to encrypt objects before persisting them on S3.

Summary

This post walked through the native EBS and S3 encryption options available with Amazon EMR to encrypt and secure your data. Please share your feedback on how these optimizations benefit your real-world workloads.

 


About the Author

Duncan Chan is a software development engineer for Amazon EMR. He enjoys learning and working on big data technologies. When he is not working, he will be playing with his dogs.