Encryption in Transit
Today we are making EFS even more useful with the addition of support for encryption of data in transit. When used in conjunction with the existing support for encryption of data at rest, you now have the ability to protect your stored files using a defense-in-depth security strategy.
In order to make it easy for you to implement encryption in transit, we are also releasing an EFS mount helper. The helper (available in source code and RPM form) takes care of setting up a TLS tunnel to EFS, and also allows you to mount file systems by ID. The two features are independent; you can use the helper to mount file systems by ID even if you don’t make use of encryption in transit. The helper also supplies a recommended set of default options to the actual mount command.
Setting up Encryption
I start by installing the EFS mount helper on my Amazon Linux instance:
$ sudo yum install -y amazon-efs-utils
Next, I visit the EFS Console and capture the file system ID:
Then I specify the ID (and the TLS option) to mount the file system:
$ sudo mount -t efs fs-92758f7b -o tls /mnt/efs
And that’s it! The encryption is transparent and has an almost negligible impact on data transfer speed.
Available Now
You can start using encryption in transit today in all AWS Regions where EFS is available.
The mount helper is available for Amazon Linux. If you are running another distribution of Linux you will need to clone the GitHub repo and build your own RPM, as described in the README.
Amazon S3 provides comprehensive security and compliance capabilities that meet even the most stringent regulatory requirements. It gives you flexibility in the way you manage data for cost optimization, access control, and compliance. However, because the service is flexible, a user could accidentally configure buckets in a manner that is not secure. For example, let’s say you uploaded files to an Amazon S3 bucket with public read permissions, even though you intended only to share this file with a colleague or a partner. Although this might have accomplished your task to share the file internally, the file is now available to anyone on the internet, even without authentication.
In this blog post, we show you how to prevent your Amazon S3 buckets and objects from allowing public access. We discuss how to secure data in Amazon S3 with a defense-in-depth approach, where multiple security controls are put in place to help prevent data leakage. This approach helps prevent you from allowing public access to confidential information, such as personally identifiable information (PII) or protected health information (PHI).
Preventing your Amazon S3 buckets and objects from allowing public access
Every call to an Amazon S3 service becomes a REST API request. When your request is transformed via a REST call, the permissions are converted into parameters included in the HTTP header or as URL parameters. The Amazon S3 bucket policy allows or denies access to the Amazon S3 bucket or Amazon S3 objects based on policy statements, and then evaluates conditions based on those parameters. To learn more, see Using Bucket Policies and User Policies.
With this in mind, let’s say multiple AWS Identity and Access Management (IAM) users at Example Corp. have access to an Amazon S3 bucket and the objects in the bucket. Example Corp. wants to share the objects among its IAM users, while at the same time preventing the objects from being made available publicly.
To demonstrate how to do this, we start by creating an Amazon S3 bucket named examplebucket. After creating this bucket, we must apply the following bucket policy. This policy denies any uploaded object (PutObject) with the attribute x-amz-acl having the values public-read, public-read-write, or authenticated-read. This means authenticated users cannot upload objects to the bucket if the objects have public permissions.
“Deny any Amazon S3 request to PutObject or PutObjectAcl in the bucket examplebucket when the request includes one of the following access control lists (ACLs): public-read, public-read-write, or authenticated-read.”
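A statement matching the description above could look roughly like the following. This is a minimal sketch written as a Python dict and serialized to JSON, not necessarily the exact policy used for examplebucket; the Sid value and the helper name deny_public_acl_statement are made up for illustration.

```python
import json

# Canned ACLs that the statement refuses, taken from the description above.
PUBLIC_ACLS = ["public-read", "public-read-write", "authenticated-read"]


def deny_public_acl_statement(bucket: str) -> dict:
    """Build a Deny statement for object requests that carry a public canned ACL."""
    return {
        "Sid": "DenyPublicObjectAcl",  # hypothetical statement ID
        "Effect": "Deny",
        "Principal": "*",
        "Action": ["s3:PutObject", "s3:PutObjectAcl"],
        "Resource": f"arn:aws:s3:::{bucket}/*",  # objects, hence the trailing /*
        "Condition": {"StringEquals": {"s3:x-amz-acl": PUBLIC_ACLS}},
    }


policy = {
    "Version": "2012-10-17",
    "Statement": [deny_public_acl_statement("examplebucket")],
}
print(json.dumps(policy, indent=2))
```

The values listed under s3:x-amz-acl are alternatives: the Deny applies when the request's x-amz-acl header matches any one of them.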
Remember that IAM policies are not evaluated in a first-match-and-exit model. Instead, IAM first checks whether there is an explicit Deny. If there is not, IAM checks whether there is an explicit Allow; if there is none, the request falls through to an implicit Deny.
The above policy creates an explicit Deny. Even when any authenticated user tries to upload (PutObject) an object with public read or write permissions, such as public-read or public-read-write or authenticated-read, the action will be denied. To understand how S3 Access Permissions work, you must understand what Access Control Lists (ACL) and Grants are. You can find the documentation here.
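The evaluation order described above (explicit Deny first, then explicit Allow, otherwise implicit Deny) can be sketched in a few lines of Python. This is a deliberately simplified model: the function name evaluate is made up, it only looks at the effects of the statements that already matched a request, and it ignores everything else that real policy evaluation takes into account.

```python
from typing import Iterable


def evaluate(matching_effects: Iterable[str]) -> str:
    """Return the final decision given the effects of all statements that match a request."""
    effects = list(matching_effects)
    if "Deny" in effects:
        # An explicit Deny always wins, no matter how many Allows also match.
        return "Deny"
    if "Allow" in effects:
        # No explicit Deny, and at least one explicit Allow.
        return "Allow"
    # Nothing matched at all: the request is implicitly denied.
    return "Deny"


print(evaluate(["Allow", "Deny"]))
print(evaluate(["Allow"]))
print(evaluate([]))
```

The order of the statements does not matter in this model, which is the point of contrast with a first-match-and-exit evaluation.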
Now let’s continue our bucket policy explanation by examining the next statement.
This statement is very similar to the first statement, except that instead of checking the ACLs, we are checking specific user groups’ grants that represent the following groups:
AuthenticatedUsers group. Represented by http://acs.amazonaws.com/groups/global/AuthenticatedUsers, this group represents all AWS accounts. Access permissions to this group allow any AWS account to access the resource. However, all requests must be signed (authenticated).
AllUsers group. Represented by http://acs.amazonaws.com/groups/global/AllUsers, access permissions to this group allow anyone on the internet access to the resource. The requests can be signed (authenticated) or unsigned (anonymous). Unsigned requests omit the Authentication header in the request.
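A grant-based Deny for the two groups above might be sketched as follows. The statement IDs and the helper name are hypothetical, and the sketch assumes you want to cover all five x-amz-grant-* request headers. It emits one statement per header because multiple condition keys inside a single condition block must all match at once, which would only catch requests that set every grant header together.

```python
import json

# Group URIs from the list above.
GROUP_URIS = [
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
]

# Condition keys that correspond to the x-amz-grant-* request headers.
GRANT_KEYS = [
    "s3:x-amz-grant-read",
    "s3:x-amz-grant-write",
    "s3:x-amz-grant-read-acp",
    "s3:x-amz-grant-write-acp",
    "s3:x-amz-grant-full-control",
]


def deny_public_grant_statements(bucket: str) -> list:
    """Build one Deny statement per grant header that names a public group."""
    # Wildcards on both sides because the header value wraps the URI, e.g. uri="...".
    patterns = [f"*{uri}*" for uri in GROUP_URIS]
    return [
        {
            "Sid": f"DenyPublicGrant{index}",  # hypothetical statement IDs
            "Effect": "Deny",
            "Principal": "*",
            "Action": ["s3:PutObject", "s3:PutObjectAcl"],
            "Resource": f"arn:aws:s3:::{bucket}/*",
            "Condition": {"StringLike": {key: patterns}},
        }
        for index, key in enumerate(GRANT_KEYS, start=1)
    ]


print(json.dumps(deny_public_grant_statements("examplebucket"), indent=2))
```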
Now that you know how to deny object uploads with permissions that would make the object public, only two more policy statements are needed. They prevent users from changing the bucket permissions (denying s3:PutBucketAcl based on ACLs and denying s3:PutBucketAcl based on grants).
Below is how we’re preventing users from changing the bucket permissions.
As you can see above, the statement is very similar to the object statements, except that now we use s3:PutBucketAcl instead of s3:PutObjectAcl, and the Resource is just the bucket ARN, without the “/*” that the object statements have at the end of the ARN.
In this section, we showed how to prevent IAM users from accidentally uploading Amazon S3 objects with public permissions to buckets. In the next section, we show you how to enforce multiple layers of security controls, such as encryption of data at rest and in transit while serving traffic from Amazon S3.
Securing data on Amazon S3 with defense-in-depth
Let’s say that Example Corp. wants to serve files securely from Amazon S3 to its users with the following requirements:
The data must be encrypted at rest and during transit.
The data must be accessible only by a limited set of public IP addresses.
All requests for data should be handled only by Amazon CloudFront (which is a content delivery network) instead of being directly available from an Amazon S3 URL. If you’re using an Amazon S3 bucket as the origin for a CloudFront distribution, you can grant public permission to read the objects in your bucket. This allows anyone to access your objects either through CloudFront or the Amazon S3 URL. CloudFront doesn’t expose Amazon S3 URLs, but your users still might have access to those URLs if your application serves any objects directly from Amazon S3, or if anyone gives out direct links to specific objects in Amazon S3.
A domain name is required to consume the content. Custom SSL certificate support lets you deliver content over HTTPS by using your own domain name and your own SSL certificate. This gives visitors to your website the security benefits of CloudFront over an SSL connection that uses your own domain name, in addition to lower latency and higher reliability.
To represent defense-in-depth visually, the following diagram contains several Amazon S3 objects (A) in a single Amazon S3 bucket (B). You can encrypt these objects on the server side or the client side. You also can configure the bucket policy such that objects are accessible only through CloudFront, which you can accomplish through an origin access identity (C). You then can configure CloudFront to deliver content only over HTTPS in addition to using your own domain name (D).
Defense-in-depth requirement 1: Data must be encrypted at rest and during transit
Let’s start with the objects themselves. Amazon S3 objects—files in this case—can range from zero bytes to multiple terabytes in size (see service limits for the latest information). Each Amazon S3 bucket includes a collection of objects, and the objects can be uploaded via the Amazon S3 console, AWS CLI, or AWS API.
If you choose to use server-side encryption, Amazon S3 encrypts your objects before saving them on disks in AWS data centers. You can use Amazon S3 managed keys (SSE-S3), AWS KMS managed keys (SSE-KMS), or customer-provided keys (SSE-C). To encrypt an object at the time of upload with SSE-S3 or SSE-KMS, you need to add the x-amz-server-side-encryption header to the request. There are two possible values for the x-amz-server-side-encryption header: AES256, which tells Amazon S3 to use Amazon S3 managed keys, and aws:kms, which tells Amazon S3 to use AWS KMS managed keys. SSE-C uses a separate set of x-amz-server-side-encryption-customer-* headers instead.
The following code example shows a Put request using SSE-S3.
PUT /example-object HTTP/1.1
Date: Wed, 8 Jun 2016 17:50:00 GMT
Authorization: authorization string
x-amz-server-side-encryption: AES256
[11434 bytes of object data]
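To make the header choice concrete, here is a small Python sketch that builds the server-side encryption headers for an upload request. The function name sse_headers and the mode labels are assumptions made for this example. It covers SSE-S3 and SSE-KMS only, and the optional KMS key ID is passed through the x-amz-server-side-encryption-aws-kms-key-id header.

```python
from typing import Dict, Optional


def sse_headers(mode: str, kms_key_id: Optional[str] = None) -> Dict[str, str]:
    """Return the request headers that ask Amazon S3 to encrypt an uploaded object."""
    if mode == "SSE-S3":
        # Amazon S3 managed keys.
        return {"x-amz-server-side-encryption": "AES256"}
    if mode == "SSE-KMS":
        # AWS KMS managed keys, optionally naming a specific key.
        headers = {"x-amz-server-side-encryption": "aws:kms"}
        if kms_key_id:
            headers["x-amz-server-side-encryption-aws-kms-key-id"] = kms_key_id
        return headers
    raise ValueError(f"unsupported mode: {mode!r}")


print(sse_headers("SSE-S3"))
print(sse_headers("SSE-KMS", kms_key_id="example-key-id"))  # placeholder key ID
```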
If you choose to use client-side encryption, you can encrypt data on the client side and upload the encrypted data to Amazon S3. In this case, you manage the encryption process, the encryption keys, and related tools. You encrypt data on the client side by using AWS KMS managed keys or a customer-supplied, client-side master key.
Defense-in-depth requirement 2: Data must be accessible only by a limited set of public IP addresses
At the Amazon S3 bucket level, you can configure permissions through a bucket policy. For example, you can limit access to the objects in a bucket by IP address range or specific IP addresses. Alternatively, you can make the objects accessible only through HTTPS.
The following bucket policy allows access to Amazon S3 objects only through HTTPS (the policy was generated with the AWS Policy Generator). Here the bucket policy explicitly denies ("Effect": "Deny") all read access ("Action": "s3:GetObject") from anybody who browses ("Principal": "*") to Amazon S3 objects within an Amazon S3 bucket if they are not accessed through HTTPS ("aws:SecureTransport": "false").
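The HTTPS-only statement described above, together with an IP restriction of the kind mentioned earlier in this section, could be sketched as follows. The Sid values, helper names, and the 203.0.113.0/24 range (a documentation-only address block) are placeholders, not values from Example Corp.'s setup. Note that aws:SourceIp is matched against whatever address Amazon S3 sees as the caller.

```python
import ipaddress
import json
from typing import List


def https_only_statement(bucket: str) -> dict:
    """Deny object reads that do not arrive over HTTPS."""
    return {
        "Sid": "DenyInsecureTransport",  # hypothetical statement ID
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{bucket}/*",
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }


def ip_allowlist_statement(bucket: str, cidrs: List[str]) -> dict:
    """Deny object reads from any address outside the given CIDR ranges."""
    for cidr in cidrs:
        ipaddress.ip_network(cidr)  # raises ValueError on a malformed range
    return {
        "Sid": "DenyOutsideAllowedIps",  # hypothetical statement ID
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{bucket}/*",
        "Condition": {"NotIpAddress": {"aws:SourceIp": cidrs}},
    }


policy = {
    "Version": "2012-10-17",
    "Statement": [
        https_only_statement("examplebucket"),
        ip_allowlist_statement("examplebucket", ["203.0.113.0/24"]),
    ],
}
print(json.dumps(policy, indent=2))
```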
Defense-in-depth requirement 3: Data must not be publicly accessible directly from an Amazon S3 URL
Next, configure Amazon CloudFront to serve traffic from within the bucket. The use of CloudFront serves several purposes:
CloudFront is a content delivery network that acts as a cache to serve static files quickly to clients.
Depending on the number of requests, the cost of delivery is less than if objects were served directly via Amazon S3.
Objects served through CloudFront can be limited to specific countries.
Access to these Amazon S3 objects is available only through CloudFront. We do this by creating an origin access identity (OAI) for CloudFront and granting access to objects in the respective Amazon S3 bucket only to that OAI. As a result, access to Amazon S3 objects from the internet is possible only through CloudFront; all other means of accessing the objects—such as through an Amazon S3 URL—are denied. CloudFront acts not only as a content distribution network, but also as a host that denies access based on geographic restrictions. You apply these restrictions by updating your CloudFront web distribution and adding a whitelist that contains only a specific country’s name (let’s say Liechtenstein). Alternatively, you could add a blacklist that contains every country except that country. Learn more about how to use CloudFront geographic restriction to whitelist or blacklist a country to restrict or allow users in specific locations from accessing web content in the AWS Support Knowledge Center.
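As a rough sketch of the two pieces described above, the following Python builds a bucket policy statement that grants read access to an OAI, and the geographic restriction fragment of a CloudFront distribution configuration. The OAI ID EXAMPLEOAIID and the helper names are placeholders; LI is the ISO 3166-1 country code for Liechtenstein.

```python
import json
from typing import List


def oai_read_statement(bucket: str, oai_id: str) -> dict:
    """Allow a CloudFront origin access identity to read objects in the bucket."""
    return {
        "Sid": "AllowCloudFrontOaiRead",  # hypothetical statement ID
        "Effect": "Allow",
        "Principal": {
            "AWS": f"arn:aws:iam::cloudfront:user/CloudFront Origin Access Identity {oai_id}"
        },
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{bucket}/*",
    }


def geo_whitelist(country_codes: List[str]) -> dict:
    """Build the geo restriction fragment that serves content only to the listed countries."""
    return {
        "GeoRestriction": {
            "RestrictionType": "whitelist",
            "Quantity": len(country_codes),
            "Items": list(country_codes),
        }
    }


bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [oai_read_statement("examplebucket", "EXAMPLEOAIID")],
}
print(json.dumps(bucket_policy, indent=2))
print(json.dumps(geo_whitelist(["LI"]), indent=2))
```

Because the policy contains no Allow for anyone else, requests that do not come from the OAI fall through to the implicit Deny, provided the bucket and its objects carry no other public grants.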
Defense-in-depth requirement 4: A domain name is required to consume the content
To serve content from CloudFront, you must use a domain name in the URLs for objects on your webpages or in your web application. The domain name can be either of the following:
The domain name that CloudFront automatically assigns when you create a distribution, such as d111111abcdef8.cloudfront.net
Your own domain name, such as example.com
For example, you might use one of the following URLs to return the file image.jpg:
You use the same URL format whether you store the content in Amazon S3 buckets or at a custom origin, like one of your own web servers.
Instead of using the default domain name that CloudFront assigns for you when you create a distribution, you can add an alternate domain name that’s easier to work with, like example.com. By setting up your own domain name with CloudFront, you can use a URL like this for objects in your distribution: http://example.com/images/image.jpg.
Let’s say that you already have a domain name hosted on Amazon Route 53. You would like to serve traffic from the domain name, request an SSL certificate, and add this to your CloudFront web distribution. The SSL offloading occurs in CloudFront by serving traffic securely from each CloudFront location. You also can configure CloudFront to deliver your content over HTTPS by using your custom domain name and your own SSL certificate. Serving web content through CloudFront reduces response time because requests are routed to the nearest edge location. This results in faster download times than if the visitor had requested the content from a data center that is located farther away.
In this post, we demonstrated how you can apply policies to Amazon S3 buckets so that only users with appropriate permissions are allowed to access the buckets. Anonymous users (with public-read/public-read-write permissions) and authenticated users without the appropriate permissions are prevented from accessing the buckets.
We also examined how to secure access to objects in Amazon S3 buckets. The objects in Amazon S3 buckets can be encrypted at rest and during transit. Doing so helps provide end-to-end security from the source (in this case, Amazon S3) to your users.
If you have feedback about this blog post, submit comments in the “Comments” section below. If you have questions about this blog post, start a new thread on the Amazon S3 forum or contact AWS Support.
Today, AWS announced Amazon DynamoDB encryption at rest, a new DynamoDB feature that gives you enhanced security of your data at rest by encrypting it using your associated AWS Key Management Service encryption keys. Encryption at rest can help you meet your security requirements for regulatory compliance.
You now can create an encrypted DynamoDB table anytime with a single click in the AWS Management Console or a single API call. Encrypting DynamoDB data has no impact on table performance. DynamoDB encryption at rest is available starting today in the US East (N. Virginia), US East (Ohio), US West (Oregon), and Europe (Ireland) Regions for no additional fees.
I recently heard my manager (Ariel Kelman, VP of Marketing for AWS) talk about the important role that education plays in our work. In fact, he assigned it a significantly higher priority than traditional marketing activities that focus on leads or conversions. I’ve also heard our other leaders talk about their work to create highly scalable education programs that will allow developers, architects, and other IT professionals to improve their skills and to earn AWS Certifications.
AWS Developer Professional Series
Today I would like to tell you about the new AWS Developer Professional Series. The AWS Training and Certification team has teamed up with edX to create this new three-part series. Founded by MIT and Harvard, edX is the leading non-profit online learning destination, with a global community of over 14 million learners, backed up by 130 global partners including universities, non-profits, and institutions. This collaboration expands our offerings, and gives you another training option!
The new series is designed to help you and your colleagues to build development and DevOps skills on AWS. The courses are self-paced and build on each other in order to help you to create Python applications that run on AWS by way of the AWS SDK for Python (also known as Boto). Here are the courses:
AWS Developer: Optimizing on AWS – This course focuses on performance optimization and tuning of the application that you built in the predecessor courses. You will learn how to use caching and content distribution to increase performance and to improve the end-user experience for your app. You’ll also learn how to use AWS Key Management Service (KMS) to encrypt data at rest and in transit.
The courses are built with the expectation that you already have one to three years of software development experience, including some Python skills. Each course runs for six weeks and requires three to four hours of work per week on your part. Courses start in February (Building), April (Deploying), and May (Optimizing), and you can enroll now at no charge. You can also pursue a Verified Certificate for a fee of $149 per course.
The interesting thing that I can share after the meetup and after meeting with potential clients is that everyone (maybe unsurprisingly) has a very specific question that doesn’t get an immediate answer even after you follow the general guidelines. That is maybe a problem on the Regulation’s side, as it has not brought sufficient clarity to businesses.
As I said during the presentation – in technology we’re used to binary questions. In law and legal compliance an answer is somewhere on a scale between 1 and 10. “Do I have to encrypt my data at rest?” Well, I guess yes, but in terms of compliance I’d say “6 out of 10”, as it is not strict and depends on multiple people’s interpretation of the sensitivity of the data and on other factors like access control.
So communication between legal and technical people is key to understanding exactly what implementation changes are needed.
The following 10 posts were the most viewed AWS Security Blog posts that we published during 2017. You can use this list as a guide to catch up on your AWS Security Blog reading or read a post again that you found particularly useful.
You’ve probably heard about GDPR, the new European data protection regulation that applies to practically everyone. Especially if you are working in a big company, it’s most likely that there’s already a process for getting your systems in compliance with the regulation.
The regulation is basically a law that must be followed in all European countries (but also applies to non-EU companies that have users in the EU). In this particular case, it applies to companies that are not registered in Europe but have European customers. So that’s most companies. I will not go into yet another “12 facts about GDPR” or “7 myths about GDPR” post/whitepaper, as they are often aimed at managers or legal people. Instead, I’ll focus on what GDPR means for developers.
Why am I qualified to do that? A few reasons – I was advisor to the deputy prime minister of an EU country, and because of that I’ve both been exposed to legislation and written some myself. I’m familiar with the “legalese” and how the regulatory framework operates in general. I’m also a privacy advocate and I’ve been writing about GDPR-related stuff in the past, i.e. “before it was cool” (protecting sensitive data, the right to be forgotten). And finally, I’m currently working on a project that (among other things) aims to help with covering some GDPR aspects.
I’ll try to be a bit more comprehensive this time and cover as many aspects of the regulation that concern developers as I can. And while developers will mostly be concerned about how the systems they are working on have to change, it’s not unlikely that a less informed manager storms in in late spring, realizing GDPR is going to be in force tomorrow, asking “what should we do to get our system/website compliant”.
The rights of the user/client (referred to as “data subject” in the regulation) that I think are relevant for developers are: the right to erasure (the right to be forgotten/deleted from the system), the right to restriction of processing (you still keep the data, but mark it as “restricted” and don’t touch it without further consent by the user), the right to data portability (the user should be able to get a machine-readable dump of their data), the right to rectification (the ability to get personal data fixed), the right to be informed (getting human-readable information, rather than long terms and conditions), and the right of access (the user should be able to see all the data you have about them).
Additionally, the relevant basic principles are: data minimization (one should not collect more data than necessary), integrity and confidentiality (all security measures to protect data that you can think of + measures to guarantee that the data has not been inappropriately modified).
Even further, the regulation requires certain processes to be in place within an organization (of more than 250 employees or if a significant amount of data is processed), and those include keeping a record of all types of processing activities carried out, including transfers to processors (3rd parties), which includes cloud service providers. None of the other requirements of the regulation have an exception depending on the organization size, so “I’m small, GDPR does not concern me” is a myth.
It is important to know what “personal data” is. Basically, it’s every piece of data that can be used to uniquely identify a person or data that is about an already identified person. It’s data that the user has explicitly provided, but also data that you have collected about them from either 3rd parties or based on their activities on the site (what they’ve been looking at, what they’ve purchased, etc.)
Having said that, I’ll list a number of features that will have to be implemented and some hints on how to do that, followed by some do’s and don’t’s.
“Forget me” – you should have a method that takes a userId and deletes all personal data about that user (in case they have been collected on the basis of consent, and not due to contract enforcement or legal obligation). It is actually useful for integration tests to have that feature (to clean up after the test), but it may be hard to implement depending on the data model. In a regular data model, deleting a record may be easy, but some foreign keys may be violated. That means you have two options – either make sure you allow nullable foreign keys (for example an order usually has a reference to the user that made it, but when the user requests his data be deleted, you can set the userId to null), or make sure you delete all related data (e.g. via cascades). This may not be desirable, e.g. if the order is used to track available quantities or for accounting purposes. It’s a bit trickier for event-sourcing data models, or in extreme cases, ones that include some sort of blockchain/hash chain/tamper-evident data structure. With event sourcing you should be able to remove a past event and re-generate intermediate snapshots. For blockchain-like structures – be careful what you put in there and avoid putting personal data of users. There is an option to use a chameleon hash function, but that’s suboptimal. Overall, you must constantly think of how you can delete the personal data. And “our data model doesn’t allow it” isn’t an excuse.
Notify 3rd parties for erasure – deleting things from your system may be one thing, but you are also obligated to inform all third parties that you have pushed that data to. So if you have sent personal data to, say, Salesforce, Hubspot, twitter, or any cloud service provider, you should call an API of theirs that allows for the deletion of personal data. If you are such a provider, obviously, your “forget me” endpoint should be exposed. Calling the 3rd party APIs to remove data is not the full story, though. You also have to make sure the information does not appear in search results. Now, that’s tricky, as Google doesn’t have an API for removal, only a manual process. Fortunately, it’s only about public profile pages that are crawlable by Google (and other search engines, okay…), but you still have to take measures. Ideally, you should make the personal data page return a 404 HTTP status, so that it can be removed.
Restrict processing – in your admin panel where there’s a list of users, there should be a button “restrict processing”. The user settings page should also have that button. When clicked (after reading the appropriate information), it should mark the profile as restricted. That means it should no longer be visible to the backoffice staff, or publicly. You can implement that with a simple “restricted” flag in the users table and a few if-clauses here and there.
Export data – there should be another button – “export data”. When clicked, the user should receive all the data that you hold about them. What exactly that data is depends on the particular use case. Usually it’s at least the data that you delete with the “forget me” functionality, but may include additional data (e.g. the orders the user has made may not be deleted, but should be included in the dump). The structure of the dump is not strictly defined, but my recommendation would be to reuse schema.org definitions as much as possible, for either JSON or XML. If the data is simple enough, a CSV/XLS export would also be fine. Sometimes data export can take a long time, so the button can trigger a background process, which would then notify the user via email when his data is ready (twitter, for example, does that already – you can request all your tweets and you get them after a while).
Allow users to edit their profile – this seems an obvious rule, but it isn’t always followed. Users must be able to fix all data about them, including data that you have collected from other sources (e.g. using a “login with facebook” you may have fetched their name and address). Rule of thumb – all the fields in your “users” table should be editable via the UI. Technically, rectification can be done via a manual support process, but that’s normally more expensive for a business than just having the form to do it. There is one other scenario, however, when you’ve obtained the data from other sources (i.e. the user hasn’t provided their details to you directly). In that case there should still be a page where they can identify somehow (via email and/or sms confirmation) and get access to the data about them.
Consent checkboxes – this is in my opinion the biggest change that the regulation brings. “I accept the terms and conditions” would no longer be sufficient to claim that the user has given their consent for processing their data. So, for each particular processing activity there should be a separate checkbox on the registration (or user profile) screen. You should keep these consent checkboxes in separate columns in the database, and let the users withdraw their consent (by unchecking these checkboxes from their profile page – see the previous point). Ideally, these checkboxes should come directly from the register of processing activities (if you keep one). Note that the checkboxes should not be preselected, as this does not count as “consent”.
Re-request consent – if the consent users have given was not clear (e.g. if they simply agreed to terms & conditions), you’d have to re-obtain that consent. So prepare a functionality for mass-emailing your users to ask them to go to their profile page and check all the checkboxes for the personal data processing activities that you have.
“See all my data” – this is very similar to the “Export” button, except data should be displayed in the regular UI of the application rather than an XML/JSON format. For example, Google Maps shows you your location history – all the places that you’ve been to. It is a good implementation of the right to access. (Though Google is very far from perfect when privacy is concerned)
Age checks – you should ask for the user’s age, and if the user is a child (below 16), you should ask for parent permission. There’s no clear way to do that, but my suggestion is to introduce a flow, where the child should specify the email of a parent, who can then confirm. Obviously, children will just cheat with their birthdate, or provide a fake parent email, but you will most likely have done your job according to the regulation (this is one of the “wishful thinking” aspects of the regulation).
Encrypt the data in transit. That means that communication between your application layer and your database (or your message queue, or whatever component you have) should be over TLS. The certificates could be self-signed (and possibly pinned), or you could have an internal CA. Different databases have different configurations, just google “X encrypted connections”. Some databases need gossiping among the nodes – that should also be configured to use encryption.
Encrypt the data at rest – this again depends on the database (some offer table-level encryption), but can also be done at machine level, e.g. using LUKS. The private key can be stored in your infrastructure, or in some cloud service like AWS KMS.
Encrypt your backups – kind of obvious
Implement pseudonymisation – the most obvious use-case is when you want to use production data for the test/staging servers. You should change the personal data to some “pseudonym”, so that the people cannot be identified. When you push data for machine learning purposes (to third parties or not), you can also do that. Technically, that could mean that your User object can have a “pseudonymize” method which applies hash+salt/bcrypt/PBKDF2 for some of the data that can be used to identify a person
Protect data integrity – this is a very broad thing, and could simply mean “have authentication mechanisms for modifying data”. But you can do something more, even as simple as a checksum, or a more complicated solution (like the one I’m working on). It depends on the stakes, on the way data is accessed, on the particular system, etc. The checksum can be in the form of a hash of all the data in a given database record, which should be updated each time the record is updated through the application. It isn’t a strong guarantee, but it is at least something.
Have your GDPR register of processing activities in something other than Excel – Article 30 says that you should keep a record of all the types of activities that you use personal data for. That sounds like bureaucracy, but it may be useful – you will be able to link certain aspects of your application with that register (e.g. the consent checkboxes, or your audit trail records). It wouldn’t take much time to implement a simple register, but the business requirements for that should come from whoever is responsible for the GDPR compliance. But you can advise them that having it in Excel won’t make it easy for you as a developer (imagine having to fetch the excel file internally, so that you can parse it and implement a feature). Such a register could be a microservice/small application deployed separately in your infrastructure.
Log access to personal data – every read operation on a personal data record should be logged, so that you know who accessed what and for what purpose
Register all API consumers – you shouldn’t allow anonymous API access to personal data. I’d say you should request the organization name and contact person for each API user upon registration, and add those to the data processing register. Note: some have treated article 30 as a requirement to keep an audit log. I don’t think it is saying that – instead it requires companies with 250+ employees to keep a register of the types of processing activities (i.e. what you use the data for). There are other articles in the regulation that imply that keeping an audit log is a best practice (for protecting the integrity of the data as well as to make sure it hasn’t been processed without a valid reason).
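Two of the measures in the list above, pseudonymisation and a per-record checksum, can be sketched with Python's standard library. The function names and the sample record are hypothetical. The pseudonym uses PBKDF2 with a fixed salt, so the same input always maps to the same pseudonym; the salt therefore has to be kept out of the test/staging environment, otherwise values can be guessed and re-hashed.

```python
import hashlib
import json


def pseudonymize(value: str, salt: bytes, iterations: int = 100_000) -> str:
    """Replace an identifying value with a salted PBKDF2-HMAC-SHA256 digest."""
    digest = hashlib.pbkdf2_hmac("sha256", value.encode("utf-8"), salt, iterations)
    return digest.hex()


def record_checksum(record: dict) -> str:
    """Hash a canonical JSON form of the record, to be stored next to it."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


salt = b"example-salt-keep-secret"  # placeholder; load from a secret store in practice
user = {"id": 1, "name": "Alice", "email": "alice@example.com"}

# Only the identifying fields are replaced; the ID stays usable for joins.
staging_user = {
    "id": user["id"],
    "name": pseudonymize(user["name"], salt),
    "email": pseudonymize(user["email"], salt),
}
print(staging_user)

# Recompute and compare the checksum whenever the record is read or updated.
stored = record_checksum(user)
print(stored == record_checksum(user))
```

As the text says, a plain checksum is not a strong guarantee: anyone who can modify the record can usually also recompute the hash, so it mainly catches accidental or out-of-band changes.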
Finally, some “don’t’s”.
Don’t use data for purposes that the user hasn’t agreed with – that’s supposed to be the spirit of the regulation. If you want to expose a new API to a new type of clients, or you want to use the data for some machine learning, or you decide to add ads to your site based on users’ behaviour, or sell your database to a 3rd party – think twice. I would imagine your register of processing activities could have a button to send notification emails to users to ask them for permission when a new processing activity is added (or if you use a 3rd party register, it should probably give you an API). So upon adding a new processing activity (and adding that to your register), mass email all users from whom you’d like consent.
Don’t log personal data – getting rid of the personal data from log files (especially if they are shipped to a 3rd party service) can be tedious or even impossible. So log just identifiers if needed. And make sure old log files are cleaned up, just in case.
Don’t put fields on the registration/profile form that you don’t need – it’s always tempting to just throw as many fields as the usability person/designer agrees on, but unless you absolutely need the data for delivering your service, you shouldn’t collect it. Names you should probably always collect, but unless you are delivering something, a home address or phone is unnecessary.
Don’t assume 3rd parties are compliant – you are responsible if there’s a data breach in one of the 3rd parties (e.g. “processors”) to which you send personal data. So before you send data via an API to another service, make sure they have at least a basic level of data protection. If they don’t, raise a flag with management.
Don’t assume having ISO XXX makes you compliant – information security standards and even personal data standards are a good start and they will probably cover 70% of what the regulation requires, but they are not sufficient – most of the things listed above are not covered in any of those standards.
Overall, the purpose of the regulation is to make you take conscious decisions when processing personal data. It imposes best practices in a legal way. If you follow the above advice and design your data model, storage, data flow, and API calls with data protection in mind, then you shouldn’t worry about the huge fines that the regulation prescribes – they are for extreme cases, like Equifax for example. Regulators (data protection authorities) will most likely have some checklists into which you’d have to somehow fit, but if you follow best practices, that shouldn’t be an issue.
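As a small illustration of the “don’t log personal data” item above, here is a Python sketch that strips personal fields before anything reaches a log line. The set of field names is an assumption; each application would need its own list.

```python
# Assumed set of field names that count as personal data in this application.
PERSONAL_FIELDS = {"name", "email", "phone", "address"}


def loggable(record: dict) -> dict:
    """Return a copy of the record that is safe to write to a log file.

    Personal fields are dropped entirely; identifiers and other fields remain.
    """
    return {key: value for key, value in record.items() if key not in PERSONAL_FIELDS}


order = {"user_id": 42, "email": "jane@example.com", "order_id": "A-1001", "phone": "555-0100"}
safe = loggable(order)
```

Only `user_id` and `order_id` survive, which is usually enough to trace a problem without having to purge the logs later.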
I think all of the above features can be implemented in a few weeks by a small team. Be suspicious when a big vendor offers you a generic plug-and-play “GDPR compliance” solution. GDPR is not just about the technical aspects listed above – it does have organizational/process implications. But also be suspicious if a consultant claims GDPR is complicated. It’s not – it relies on a few basic principles that are in fact best practices anyway. Just don’t ignore them.
Data security is paramount in many industries. Organizations that shift their IT infrastructure to the cloud must ensure that their data is protected and that the attack surface is minimized. This post focuses on a method of securely loading a subset of data from one Amazon Redshift cluster to another Amazon Redshift cluster that is located in a different AWS account. You can accomplish this by dynamically controlling the security group ingress rules that are attached to the clusters.
The case for creating a segregated data loading account
From a security perspective, it is easier to restrict access to sensitive infrastructure if the respective stages (dev, QA, staging, and prod) are each located in their own isolated AWS account. Another common method for isolating resources is to set up separate virtual private clouds (VPCs) for each stage, all within a single AWS account. Because many services live outside the VPC (for example, Amazon S3, Amazon DynamoDB, and Amazon Kinesis), it requires careful thought to isolate the resources that should be associated with dev, QA, staging, and prod.
The segregated account model setup does create more overhead. But it gives administrators more control without them having to create tags and use cumbersome naming conventions to define a logical stage. In the segregated account model, all the data and infrastructure that are located in an account belong to that particular stage of the release pipeline (dev, QA, staging, or prod).
But where should you put infrastructure that does not belong to one particular stage?
Infrastructure to support deployments or to load data across accounts is best located in another segregated account. By deploying infrastructure or loading data from a separate account, you can’t depend on any existing roles, VPCs, subnets, etc. Any information that is necessary to deploy your infrastructure or load the data must be captured up front. This allows you to perform repeatable processes in a predictable and secure manner. With the recent addition of the StackSets feature in AWS CloudFormation, you can provision and manage infrastructure in multiple AWS accounts and Regions from a single template. This four-part blog series discusses different ways of automating the creation of cross-account roles and capturing account-specific information.
Loading OpenFDA data into Amazon Redshift
Before you get started with loading data from one Amazon Redshift cluster to another, you first need to create an Amazon Redshift cluster and load some data into it. You can use the following AWS CloudFormation template to create an Amazon Redshift cluster. You need to create Amazon Redshift clusters in both the source and target accounts.
Description: This template creates a Redshift cluster given with the supplied username and password.
Description: The master username for the Redshift cluster.
Description: The master password for the Redshift cluster.
Description: The endpoint address of the Redshift cluster.
After you create your Amazon Redshift clusters, you can go ahead and load some data into the cluster that is located in your source account. One of the great benefits of AWS is the ability to host and share public datasets on Amazon S3. When you test different architectures, these datasets serve as useful resources to get up and running without a lot of effort. For this post, we use the OpenFDA food enforcement dataset because it is a relatively small file and is easy to work with.
In the source account, you need to spin up an Amazon EMR cluster with Apache Spark so that you can unzip the file and format it properly before loading it into Amazon Redshift. The following AWS CloudFormation template provides the EMR cluster that you need.
Description: This template creates an EMR cluster to load OpenFDA data into the source Redshift cluster.
Description: The name of the KeyPair to SSH into the EMR instances.
- Name: Hadoop
- Name: Spark
- Name: Zeppelin
- Name: Livy
Note: As an alternative, you can load the data using AWS Glue, which now supports Scala.
Now that your EMR cluster is up and running, you can submit this Scala code over a REST API call to Apache Livy. You also have the option of running this code inside of an Apache Zeppelin notebook.
Connect to your source Amazon Redshift cluster in your source account, and verify that the data is present by running a quick query:
select count(*) from public.food_enforcement;
Opening up the security groups
Now that the data has been loaded in the source Amazon Redshift cluster, it can be moved over to the target Amazon Redshift cluster. Because the security groups that are associated with the two clusters are very restrictive, there is no way to load the data from the centralized data loading AWS account without modifying the ingress rules on both security groups. Here are a few possible options:
Add an ingress rule to allow all traffic to port 5439 (the default Amazon Redshift port).
This option is not recommended because you are widening your attack surface significantly and exposing yourself to a potential attack.
Peer the VPC in the data loader account to the source and target Amazon Redshift VPCs, and modify the ingress rule to allow all traffic from the private IP range of the data loader VPC.
This solution is reasonably secure but does require some manual setup. Because the ingress rules in the source and target Amazon Redshift clusters allow access from the VPC private IP range, any resources in the data loader account can access both clusters, which is suboptimal.
Leave long-running Amazon EC2 instances or EMR clusters in the data loader AWS account and manually create specific ingress rules in the source and target Amazon Redshift security groups to allow for those specific IPs.
This option creates a lot of wasted cost because it requires leaving EC2 instances or an EMR cluster running indefinitely whether or not they are actually being used.
None of these three options is ideal, so let’s explore another option. One of the more powerful features of running EC2 instances in the cloud is the ability to dynamically manage and configure your environment using instance metadata. The AWS Cloud is dynamic by nature and incentivizes you to reduce costs by terminating instances when they are not being used. Therefore, instance metadata can serve as the glue for performing repeatable processes on these dynamic instances.
To load the data from the source Amazon Redshift cluster to the target Amazon Redshift cluster, perform the following steps:
Spin up an EC2 instance in the data loader account.
Use instance metadata to look up the IP of the EC2 instance.
Run a simple Python or Java program to perform a simple transformation and unload the data from the source Amazon Redshift cluster. Then load the results into the target Amazon Redshift cluster.
UNLOAD('select case when product_description ilike ''%milk%'' then 1 else 0 end as milk_flag
where left(recall_initiation_date, 4) >= 2016')
TO 's3://<Your S3 Bucket>/milk-food-enforcement.csv'
IAM_ROLE '<Your Redshift Role>'
FROM 's3://<Your S3 Bucket>/milk-food-enforcement.csv'
IAM_ROLE '<Your Redshift Role>'
Assume roles in the source and target accounts using AWS STS, and remove the ingress rules that were created in step 3.
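The open-then-close pattern in these steps can be sketched in Python as follows. The helper name, the example IP, and the security group ID are hypothetical. The dictionary follows the keyword-argument shape accepted by boto3's EC2 `authorize_security_group_ingress` and `revoke_security_group_ingress` calls, but the calls themselves are left as comments because they need credentials obtained by assuming a role in each account.

```python
import ipaddress

REDSHIFT_PORT = 5439  # default Amazon Redshift port


def single_ip_ingress(group_id: str, ip: str, port: int = REDSHIFT_PORT) -> dict:
    """Build an ingress rule that admits exactly one IPv4 address on one TCP port."""
    address = ipaddress.IPv4Address(ip)  # raises ValueError on a malformed IP
    return {
        "GroupId": group_id,
        "IpPermissions": [
            {
                "IpProtocol": "tcp",
                "FromPort": port,
                "ToPort": port,
                "IpRanges": [{"CidrIp": f"{address}/32"}],
            }
        ],
    }


# Hypothetical values: the IP would come from the EC2 instance metadata lookup.
rule = single_ip_ingress("sg-0123456789abcdef0", "203.0.113.25")

# With an EC2 client created from assumed-role credentials, the same
# arguments open the group before the load and close it afterwards:
#   ec2.authorize_security_group_ingress(**rule)
#   ... unload from the source cluster, load into the target cluster ...
#   ec2.revoke_security_group_ingress(**rule)
```

Building the rule once and reusing it for both calls makes it harder to revoke something other than what was opened.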
Once step 5 is completed, you should see that the security groups for both Amazon Redshift clusters don’t allow traffic from any IP. Manually add your IP as an ingress rule to the target Amazon Redshift cluster’s security group on port 5439. When you run the following query, you should see that the data has been populated within the target Amazon Redshift cluster.
This post highlighted the importance of loading data in a secure manner across accounts. It mentioned reasons why you might want to provision infrastructure and load data from a centralized account. Several candidate solutions were discussed. Ultimately, the solution that we chose involved opening up security groups for a single IP and then closing them back up after the data was loaded. This solution minimizes the attack surface to a single IP and can be completely automated.
Our Health Customer Stories page lists just a few of the many customers that are building and running healthcare and life sciences applications that run on AWS. Customers like Verge Health, Care Cloud, and Orion Health trust AWS with Protected Health Information (PHI) and Personally Identifying Information (PII) as part of their efforts to comply with HIPAA and HITECH.
Sixteen More Services
In my last HIPAA Eligibility Update I shared the news that we added eight additional services to our list of HIPAA eligible services. Today I am happy to let you know that we have added another sixteen services to the list, bringing the total up to 46. Here are the newest additions, along with some short descriptions and links to some of my blog posts to jog your memory:
Amazon EMR lets you have complete control over your cluster, giving you the flexibility to customize a cluster and install additional applications easily. EMR customers often use bootstrap actions to install and configure custom software in a cluster. However, bootstrap actions only run during the cluster or node startup. This makes it difficult for you to make configuration changes after a cluster is already running.
EMR clusters can also use a custom Amazon Machine Image (AMI). With the new support for launching clusters with custom Amazon Linux AMIs, customizing an EMR cluster is now even easier. However, the task of creating and managing custom AMIs can become increasingly difficult as the number of AMIs in your environment starts to increase.
Amazon EC2 Systems Manager helps you automate various management tasks such as automating AMI creation or running a command or script across hundreds of instances. In this post, I show how Systems Manager Automation can be used to automate the creation and patching of custom Amazon Linux AMIs for EMR.
Systems Manager Run Command lets you remotely manage the configuration of Amazon EC2 instances or on-premises machines. Run Command can be used to help you perform the following types of tasks on your EMR cluster nodes: install applications, restart daemons (HDFS, YARN, Presto, etc.), and make configuration changes. I also show how you can use Run Command to send commands to all nodes of a running EMR cluster.
Benefits of using a custom AMI
Although you can easily customize an EMR cluster using bootstrap actions, there can be benefits to using a custom AMI.
Reduction of cluster start time
There are certain scenarios where a bootstrap action may affect your cluster start time. For example, your bootstrap action could be doing something like downloading a large program over the internet and delaying the time for your cluster to be ready. By adding and installing a program directly in the AMI, the time to complete a cluster launch may be reduced.
Prevent unexpected bootstrap action failures
There are also scenarios where installing and configuring custom software directly in the AMI reduces the risk of unexpected failures. For example, a mirror or repo used by your bootstrap action to download a program might be offline or inaccessible. This could cause your bootstrap action to fail, which could cause a cluster launch failure.
Support for Amazon EBS root volume encryption
A number of security and encryption features are available with EMR security configurations. This includes the ability to encrypt data at rest for HDFS (local volumes/Amazon EBS) and Amazon S3. However, certain regulatory/compliance policies may require that the root (boot) volume is also encrypted. By bringing your own Amazon Linux AMI, you can create AMIs that use encrypted EBS root volumes and use those AMIs for your EMR clusters.
Bring your own AMI requirements
Custom AMIs for EMR must meet the following requirements:
For the examples in this post, I show how you can set up the following solutions:
Automate a workflow of creating custom AMIs with pre-installed software
Run commands or make application configuration changes on all nodes of a running EMR cluster
Before you begin
In this post, the AWS CLI is used to execute the examples and steps shown. However, having the AWS CLI installed is not a requirement and the AWS Management Console can be used to perform the same tasks.
The region used for the examples is us-east-1 (N. Virginia).
Building a custom AMI with Systems Manager Automation
In this section, I show how you can use Automation to create a custom AMI. The following diagram shows an overview of the actions that the Automation will perform:
1) Configure roles for Automation
Before getting started, you have to configure an IAM instance profile role and a service role that Automation can use. The instance profile role gives Automation permission to perform actions on your instances, such as executing commands or starting and stopping services. The service role (or assume role) gives Automation permissions to perform actions on your behalf.
Configuring the required IAM roles for Automation is usually one of the hardest parts of setting up Automation. Luckily, you only do this step one time. We also have an AWS CloudFormation template that can be used to create and configure the required roles for Automation. For more information, see Method 1: Using AWS CloudFormation to Configure Roles for Automation.
An Automation document defines the actions that Systems Manager performs. In this step, you create a custom Automation document (customEmrAmiDocument) that performs the following steps:
Launch an EC2 instance from a base Amazon Linux AMI
Update installed software on the instance
Run additional Linux commands (optional)
Shut down the instance
Create an AMI of the instance
Terminate the instance
To create a custom Automation document, first download the customEmrAmiDocument.json document to your local machine. You can then use the console, AWS CLI, or AWS SDKs to create (upload) that Automation document in your account. The following example shows how to create an Automation document called “customEmrAmiDocument” using the AWS CLI:
Note: Creating an Automation document does not cause that document to be executed. You execute this document in the next step. Also note that file:// must be referenced followed by the path of the content file.
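If you prefer the AWS SDKs over the CLI, the upload can be sketched in Python like this. The helper name is hypothetical and the file written here is only a stand-in so that the example runs on its own; the keyword arguments follow the shape of boto3's SSM `create_document` call, which is left as a comment.

```python
import json
import tempfile
from pathlib import Path


def create_document_request(name: str, content_path: str) -> dict:
    """Build keyword arguments for an SSM create_document call.

    The file is parsed first so that malformed JSON fails locally
    instead of being rejected by the service.
    """
    content = Path(content_path).read_text(encoding="utf-8")
    json.loads(content)  # raises json.JSONDecodeError if the file is not valid JSON
    return {"Name": name, "Content": content, "DocumentType": "Automation"}


# Stand-in file for the example; the real customEmrAmiDocument.json is the
# one downloaded to your local machine.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "customEmrAmiDocument.json"
    path.write_text(json.dumps({"schemaVersion": "0.3", "mainSteps": []}), encoding="utf-8")
    request = create_document_request("customEmrAmiDocument", str(path))

# With boto3 available: boto3.client("ssm").create_document(**request)
```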
The “customEmrAmiDocument” Automation document created in the previous step has a list of parameters (SourceAmiId, InstanceIamRole, etc.), along with the description of each parameter. To describe the document parameters, run the following command:
The preceding command returns an output similar to the following:
"Description": "(Required) The source Amazon Machine Image ID.",
"Description": "(Required) The name of the role that enables Systems Manager (SSM) to manage the instance.",
When you start an Automation execution, you must pass the required parameters (SourceAmiId) along with any additional parameters for which you would like to overwrite the default value. For example, if you used CloudFormation to create the required IAM roles, you do not need to specify the InstanceRole and AutomationAssumeRole parameters.
To execute the document without including the InstanceRole and AutomationAssumeRole parameters, run the following command:
If your role names or ARNs have different values than the defaults, make sure that you specify those parameters accordingly. For example, if your instance profile/role is called “MyManagedInstanceProfile” and the Automation service role ARN is “arn:aws:iam::012345678910:role/MyAutomationServiceRole”, then your parameters to execute the Automation should be similar to the following:
I chose “ami-4fffc834” for the SourceAmiId parameter because it’s the latest Amazon Linux AMI in the us-east-1 (N. Virginia) region at the time of publication. It also has all the requirements needed for EMR custom AMIs. If you’re running your Automation document in a different region, set the SourceAmiId parameter to an AMI that’s available in that particular region (ex: “ami-aa5ebdd2” for us-west-2).
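For readers using the SDK, here is a hedged Python sketch of how the parameters for the two cases above might be assembled. The helper name is hypothetical, and the parameter names are taken from the text of this post (which refers to the instance role parameter as both InstanceIamRole and InstanceRole; InstanceIamRole is assumed here). The one detail worth noticing is that Systems Manager expects each parameter value as a list of strings.

```python
def automation_parameters(source_ami_id, instance_role=None, assume_role_arn=None):
    """Build the Parameters mapping for start_automation_execution.

    SSM expects every parameter value as a list of strings. Optional
    parameters are omitted so the document's defaults apply.
    """
    params = {"SourceAmiId": [source_ami_id]}
    if instance_role is not None:
        params["InstanceIamRole"] = [instance_role]
    if assume_role_arn is not None:
        params["AutomationAssumeRole"] = [assume_role_arn]
    return params


defaults_only = automation_parameters("ami-4fffc834")
custom_roles = automation_parameters(
    "ami-4fffc834",
    instance_role="MyManagedInstanceProfile",
    assume_role_arn="arn:aws:iam::012345678910:role/MyAutomationServiceRole",
)

# With boto3 available:
#   boto3.client("ssm").start_automation_execution(
#       DocumentName="customEmrAmiDocument", Parameters=custom_roles)
```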
4) Finding details about the Automation execution
After the Automation execution is complete, you can view the steps that were executed in addition to the status of each step and their output. To view all Automation executions that used the “customEmrAmiDocument” document, you can run the following command:
The output of the preceding command contains details about each step executed by the Automation execution. To easily find the AMI ID/imageID of the AMI created during the Automation createImage step, run the following command:
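The same lookup can be done in Python once you have the execution details as a dictionary. This is a sketch under assumptions: the helper name is hypothetical, the sample response is made up and heavily trimmed, and I am assuming that the image-creation step is named createImage and reports its result under an ImageId output key.

```python
def created_image_id(execution: dict, step_name: str = "createImage"):
    """Return the AMI ID produced by the image-creation step, or None.

    `execution` is the "AutomationExecution" part of a
    get_automation_execution response.
    """
    for step in execution.get("StepExecutions", []):
        if step.get("StepName") == step_name and step.get("StepStatus") == "Success":
            image_ids = step.get("Outputs", {}).get("ImageId", [])
            return image_ids[0] if image_ids else None
    return None


# Trimmed, made-up response for illustration only.
sample = {
    "AutomationExecutionStatus": "Success",
    "StepExecutions": [
        {"StepName": "launchInstance", "StepStatus": "Success", "Outputs": {"InstanceIds": ["i-0abc"]}},
        {"StepName": "createImage", "StepStatus": "Success", "Outputs": {"ImageId": ["ami-0example"]}},
    ],
}
image_id = created_image_id(sample)
```

Returning None when the step is missing or did not succeed lets the caller decide whether to retry or stop, instead of launching a cluster with an empty AMI ID.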
For information about how to find the AMI ID of the custom AMI created by Automation, see step 4.
Using Run Command with EMR
In this section, I show how you can use Run Command to send commands to the nodes of a running EMR cluster. The following diagram shows an overview of a Run Command execution:
1) Configure the instance IAM role for Systems Manager
EC2 instances (EMR cluster nodes) need an IAM role to be able to communicate with the Systems Manager API. Because EMR already assigns an IAM role (usually called EMR_EC2_DefaultRole) to each cluster node, you can attach an additional managed policy (Systems Manager policy) to that role.
The following command attaches the “AmazonEC2RoleforSSM” managed policy to the EMR_EC2_DefaultRole role:
$ aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM --role-name EMR_EC2_DefaultRole
If you’re not using the default EC2 role, replace the --role-name parameter value with the role name that you’re using for your role.
Skip this step if your custom AMI was created by Automation. The customEmrAmiDocument Automation document that you used to create the custom AMI installs the SSM agent by default.
The Systems Manager (SSM) agent is used to process Systems Manager requests and configure your instances as specified in the request. For more information, see Installing SSM Agent on Linux.
3) Running a command with Run Command
You should now be able to run commands or Linux scripts on the instances that have the SSM agent running and the IAM role for SSM configured (Step 1 in this section). To view a list of instances that are ready to receive commands, run the following command:
The easiest way to send a command to all cluster nodes is by using a resource tag as the target for Run Command. If you didn’t add any tags to your EMR cluster during launch, you can add tags using the following command:
The preceding command is sent (executed) to all EC2 instances that have the following tags: environment=”emr-ssm”.
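Here is a hedged Python sketch of what a tag-targeted request looks like when built for the SDK. The helper name is hypothetical, and the use of the AWS-RunShellScript document is my assumption; the tag and the two shell commands are the ones used in this post. The dictionary follows the keyword-argument shape of boto3's SSM `send_command` call, which is left as a comment.

```python
def send_command_request(tag_key: str, tag_value: str, commands: list) -> dict:
    """Build keyword arguments for an SSM send_command call targeted by tag."""
    return {
        "DocumentName": "AWS-RunShellScript",
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        "Parameters": {"commands": list(commands)},
    }


request = send_command_request("environment", "emr-ssm", ["hostname -f", "python3 -V"])

# With boto3 available:
#   response = boto3.client("ssm").send_command(**request)
#   command_id = response["Command"]["CommandId"]
```

Targeting by tag rather than by instance ID means nodes added to the cluster later are picked up by the next command without any change to the request.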
4) Finding details on a Run Command execution
The send-command call in the previous step runs commands that show the hostname (hostname -f) of each instance and its Python 3 version (python3 -V).
The output of send-command should include a “CommandId” field. You can use that command ID to gather information on the instances that the command was sent to and to view the status of the command execution:
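Once you have the invocation details, a few lines of Python can turn them into a per-node status overview. This is a sketch: the helper name is hypothetical and the sample data is made up, shaped like the CommandInvocations list that boto3's SSM `list_command_invocations` call returns for a command ID.

```python
from collections import Counter


def invocation_summary(response: dict) -> dict:
    """Map each instance ID to its status and count the statuses overall."""
    invocations = response.get("CommandInvocations", [])
    by_instance = {inv["InstanceId"]: inv["Status"] for inv in invocations}
    return {"by_instance": by_instance, "counts": dict(Counter(by_instance.values()))}


# Made-up response for illustration only.
sample = {
    "CommandInvocations": [
        {"InstanceId": "i-0aaa", "Status": "Success"},
        {"InstanceId": "i-0bbb", "Status": "Success"},
        {"InstanceId": "i-0ccc", "Status": "Failed"},
    ]
}
summary = invocation_summary(sample)
```

On a large cluster the counts make it easy to see at a glance whether a configuration change reached every node.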
This post showed you some of the benefits of using custom AMIs for Amazon EMR and how you can use Automation to automate the management and creation of custom AMIs. I also showed how Run Command can be used to send commands and make configuration changes on all nodes of a running EMR cluster.
If you have questions or suggestions, please comment below.
Bruno Faria is an EMR Solution Architect with AWS. He works with our customers to provide them architectural guidance for running complex applications on Amazon EMR. In his spare time, he enjoys spending time with his family and learning about new big data solutions.
Whew – what a week! Tara, Randall, Ana, and I have been working around the clock to create blog posts for the announcements that we made at the AWS Summit in New York. Here’s a summary to help you to get started:
Amazon Macie – This new service helps you to discover, classify, and secure content at scale. Powered by machine learning and making use of Natural Language Processing (NLP), Macie looks for patterns and alerts you to suspicious behavior, and can help you with governance, compliance, and auditing. You can read Tara’s post to see how to put Macie to work; you select the buckets of interest, customize the classification settings, and review the results in the Macie Dashboard.
AWS Glue – Randall’s post (with deluxe animated GIFs) introduces you to this new extract, transform, and load (ETL) service. Glue is serverless and fully managed. As you can see from the post, Glue crawls your data, infers schemas, and generates ETL scripts in Python. You define jobs that move data from place to place, with a wide selection of transforms, each expressed as code and stored in human-readable form. Glue uses Development Endpoints and notebooks to provide you with a testing environment for the scripts you build. We also announced that Amazon Athena now integrates with AWS Glue, as do Apache Spark and Hive on Amazon EMR.
AWS Migration Hub – This new service will help you to migrate your application portfolio to AWS. My post outlines the major steps and shows you how the Migration Hub accelerates, tracks, and simplifies your migration effort. You can begin with a discovery step, or you can jump right in and migrate directly. Migration Hub integrates with tools from our migration partners and builds upon the Server Migration Service and the Database Migration Service.
CloudHSM Update – We made a major upgrade to AWS CloudHSM, making the benefits of hardware-based key management available to a wider audience. The service is offered on a pay-as-you-go basis, and is fully managed. It is open and standards compliant, with support for multiple APIs, programming languages, and cryptography extensions. CloudHSM is an integral part of AWS and can be accessed from the AWS Management Console, AWS Command Line Interface (CLI), and through API calls. Read my post to learn more and to see how to set up a CloudHSM cluster.
Managed Rules to Secure S3 Buckets – We added two new rules to AWS Config that will help you to secure your S3 buckets. The s3-bucket-public-write-prohibited rule identifies buckets that have public write access and the s3-bucket-public-read-prohibited rule identifies buckets that have global read access. As I noted in my post, you can run these rules in response to configuration changes or on a schedule. The rules make use of some leading-edge constraint solving techniques, as part of a larger effort to use automated formal reasoning about AWS.
CloudTrail for All Customers – Tara’s post revealed that AWS CloudTrail is now available and enabled by default for all AWS customers. As a bonus, Tara reviewed the principal benefits of CloudTrail and showed you how to review your event history and to deep-dive on a single event. She also showed you how to create a second trail, for use with CloudWatch Events.
Encryption of Data at Rest for EFS – When you create a new file system, you now have the option to select a key that will be used to encrypt the contents of the files on the file system. The encryption is done using an industry-standard AES-256 algorithm. My post shows you how to select a key and to verify that it is being used.
Watch the Keynote
My colleagues Adrian Cockcroft and Matt Wood talked about these services and others on the stage, and also invited some AWS customers to share their stories. Here’s the video:
When Jeff and I heard about this service, we were both curious about the meaning of the name Macie. Of course Jeff, being a great researcher, looked up the name and found that it has two meanings. It has both French and English (UK) origins and is typically a girl’s name. The first meaning that he found said that the name meant “weapon”. The second noted that the name is representative of a person who is bold, sporty, and sweet. In a way, these definitions are appropriate, as today I am happy to announce that we are launching Amazon Macie, a new security service that uses machine learning to help identify and protect sensitive data stored in AWS from breaches, data leaks, and unauthorized access, with Amazon Simple Storage Service (S3) being the initial data store. Therefore, I can imagine that Amazon Macie could be described as a bold weapon for AWS customers, providing a sweet service with a sporty user interface that helps to protect against malicious access of your data at rest. Whew, that was a mouthful, but I unbelievably got all the Macie descriptions out in a single sentence! Nevertheless, I am thrilled to share with you the power of the new Amazon Macie service.
Amazon Macie is a service powered by machine learning that can automatically discover and classify your data stored in Amazon S3. But Macie doesn’t stop there: once your data has been classified, Macie assigns each data item a business value, and then continuously monitors the data in order to detect any suspicious activity based upon access patterns. Key features of the Macie service include:
Data Security Automation: analyzes, classifies, and processes data to understand the historical patterns, user authentications to data, data access locations, and times of access.
Data Security & Monitoring: actively monitors usage log data for anomalies, along with automatic resolution of reported issues through CloudWatch Events and Lambda.
Data Visibility for Proactive Loss Prevention: provides management visibility into details of stored data while providing immediate protection without the need for manual customer input.
Data Research and Reporting: allows administrative configuration for reporting and alert management requirements.
How does Amazon Macie accomplish this, you ask?
Using machine learning algorithms for natural language processing (NLP), Macie can automate the classification of data in your S3 buckets. In addition, Amazon Macie takes advantage of predictive analytics algorithms that enable data access patterns to be dynamically analyzed. Learnings are then used to inform and to alert you on possible suspicious behavior. Macie also runs an engine specifically to detect common sources of personally identifiable information (PII) or sensitive personal information (SPI). Macie takes advantage of AWS CloudTrail: it continuously checks CloudTrail events for PUT requests in S3 buckets and automatically classifies new objects in almost real time.
While Macie is a powerful tool to use for security and data protection in the AWS cloud, it also can aid you with governance, compliance requirements, and/or audit standards. Many of you may already be aware of the EU’s most stringent privacy regulation to date – the General Data Protection Regulation (GDPR), which becomes enforceable on May 25, 2018. As Amazon Macie recognizes personally identifiable information (PII) and provides customers with dashboards and alerts, it will enable customers to comply with GDPR regulations around encryption and pseudonymization of data. When combined with Lambda queries, Macie becomes a powerful tool to help remediate GDPR concerns.
Tour of the Amazon Macie Service
Let’s take a tour of the service and look at Amazon Macie up close and personal.
First, I will log onto the Macie console and start the process of setting up Macie, so that I can start my data classification and protection, by clicking the Get Started button.
As you can see, to enable the Amazon Macie service, I must have the appropriate IAM roles created for the service, and additionally I will need to have AWS CloudTrail enabled in my account.
I will create these roles and turn on the AWS CloudTrail service in my account. To make it easier for you to set up Macie, you can take advantage of a sample template for CloudFormation, provided in the Macie User Guide, that will set up the required IAM roles and policies for you. You then would only need to set up a trail as noted in the CloudTrail documentation.
If you have multiple AWS accounts, you should note that the account you use to enable the Macie service will be noted as the master account. You can integrate other accounts with the Macie service, but they will have the member account designation. Users from member accounts will need to use an IAM role to federate access to the master account in order to access the Macie console.
Now that my IAM roles are created and CloudTrail is enabled, I will click the Enable Macie button to start Macie’s data monitoring and protection.
Once Macie is finished starting the service in your account, you will be brought to the service main screen and any existing alerts in your account will be presented to you. Since I have just started the service, I currently have no existing alerts at this time.
Since we are touring the Macie service, I will now integrate some of my S3 buckets with Macie. However, you do not have to specify any S3 buckets for Macie to start monitoring, since the service already uses the AWS CloudTrail Management API to analyze and process information. For this tour of Macie, I have decided to monitor object-level API events from certain buckets in CloudTrail.
In order to integrate with S3, I will go to the Integrations tab of the Macie console. Once on the Integrations tab, I will see two options: Accounts and Services. The Accounts option is used to integrate member accounts with Macie and to set your data retention policy. Since I want to integrate specific S3 buckets with Macie, I’ll click the Services option to go to the Services tab.
When I integrate Macie with the S3 service, a trail and an S3 bucket will be created to store logs about S3 data events. To get started, I will use the Select an account drop-down to choose an account. Once my account is selected, the services available for integration are presented. I’ll select the Amazon S3 service by clicking the Add button.
Now I can select the buckets that I want Macie to analyze. Selecting the Review and Save button takes me to a screen on which I confirm that I want object-level logging by clicking the Save button.
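To make "object-level logging" concrete, here is a minimal sketch of the kind of CloudTrail event selector that records S3 data events for specific buckets. Macie creates the trail for you during integration, so you would not normally write this yourself. The bucket names are hypothetical, and the function name is my own.

```python
import json


def s3_object_event_selector(bucket_names, read_write_type="All"):
    """Build a CloudTrail event selector that logs object-level (data) events."""
    if read_write_type not in ("All", "ReadOnly", "WriteOnly"):
        raise ValueError("read_write_type must be All, ReadOnly, or WriteOnly")
    return {
        "ReadWriteType": read_write_type,
        "IncludeManagementEvents": True,
        "DataResources": [
            {
                "Type": "AWS::S3::Object",
                # A bucket ARN with a trailing slash covers every object in the bucket.
                "Values": [f"arn:aws:s3:::{name}/" for name in bucket_names],
            }
        ],
    }


# Hypothetical bucket names, used only for illustration.
selector = s3_object_event_selector(["example-bucket-a", "example-bucket-b"])
print(json.dumps(selector, indent=2))
```

If you manage a trail yourself, a dictionary shaped like this can be passed in the EventSelectors list of the CloudTrail PutEventSelectors API (for example, put_event_selectors in boto3).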
Next on our Macie tour, let’s look at how we can customize data classification with Macie.
As we discussed, Macie will automatically monitor and classify your data. Once Macie identifies your data, it will classify your data objects by file and content type. Macie will also use a support vector machine (SVM) classifier to classify the content within S3 objects in addition to the metadata of the file. In machine learning, support vector machines are supervised learning models with associated learning algorithms used for classification and regression analysis of data. Macie trained the SVM classifier by using data of varying content types, optimized to support accurate detection of data content, including source code you may write.
Macie will assign only one content type per data object or file; however, you can enable or disable content types and file extensions in order to include or exclude them when Macie classifies these objects. Once Macie classifies the data, it will assign the object a risk level between 1 and 10, with 10 being the highest risk and 1 being the lowest.
To customize the classification of our data with Macie, I’ll go to the Settings tab. I am now presented with the choices available to enable or disable the Macie classification settings.
As an example during our tour of Macie, I will choose File extension. I am then presented with the list of file extensions that Macie tracks and uses for classification.
As a test, I’ll edit the apk file extension for Android application install files, and disable monitoring of this file type by selecting No – disabled from the dropdown and clicking the Save button. Of course, later I will turn this back on since I want to keep my entire collection of data files safe, including my Android development binaries.
One last thing I want to note about data classification using Macie is that the service provides visibility into how your data objects are being classified, and highlights the data assets that you have stored according to how critical or important the information is for compliance, for your personal data, and for your business.
Now that we have explored the data that Macie classifies and monitors, the last stop on our service tour is the Macie dashboard.
The Macie Dashboard provides us with a complete picture of all of the data and activity that has been gathered as Macie monitors and classifies our data. The dashboard displays Metrics and Views grouped by categories to provide different visual perspectives of your data. Within these dashboard screens, you can also go from a metric perspective directly to the Research tab to build and run queries based on the metric. These queries can be used to set up customized alerts for notification of any possible security issues or problems. We won’t have an opportunity to tour the Research or Alerts tab, but you can find out more information about these features in the Macie user guide.
Turning back to the Dashboard, there are so many great resources in the Macie Dashboard that we will not be able to stop at each view, metric, and feature during our tour, so let me give you an overview of all the features of the dashboard that you can take advantage of using.
Dashboard Metrics – monitored data grouped by the following categories:
High-risk S3 objects: data objects with risk levels of 8 through 10.
Total event occurrences: total count of all event occurrences since Macie was enabled.
Total user sessions: 5-minute snapshot of CloudTrail data.
Dashboard Views – views to display various points of the monitored data and activity:
S3 objects for a selected time range
S3 objects by personally identifiable information (PII)
S3 objects by ACL
CloudTrail events and associated users
CloudTrail errors and associated users
AWS CloudTrail events
AWS CloudTrail user identity types
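As a rough illustration of the High-risk S3 objects metric, the sketch below filters a list of classified objects down to those with risk levels 8 through 10. It is not Macie's implementation. The object keys and content type labels are hypothetical sample data, and only the 1 to 10 scale and the 8 through 10 threshold come from the description above.

```python
from dataclasses import dataclass

HIGH_RISK_MIN = 8  # the dashboard counts risk levels 8 through 10 as high risk


@dataclass(frozen=True)
class ClassifiedObject:
    key: str           # S3 object key (hypothetical sample data)
    content_type: str  # the single content type assigned to the object
    risk: int          # 1 (lowest risk) through 10 (highest risk)

    def __post_init__(self):
        if not 1 <= self.risk <= 10:
            raise ValueError(f"risk must be between 1 and 10, got {self.risk}")


def high_risk(objects):
    """Return the objects whose risk level falls in the high-risk band."""
    return [obj for obj in objects if obj.risk >= HIGH_RISK_MIN]


sample = [
    ClassifiedObject("reports/payroll.csv", "financial", 9),
    ClassifiedObject("builds/app.apk", "binary", 3),
    ClassifiedObject("src/deploy_keys.py", "source code", 8),
]

for obj in high_risk(sample):
    print(obj.key, obj.risk)
```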
Well, that concludes our tour of the new and exciting Amazon Macie service. Amazon Macie is a sensational new service that uses the power of machine learning and deep learning to aid you in securing, identifying, and protecting your data stored in Amazon S3. Using natural language processing (NLP) to automate data classification, Amazon Macie enables you to easily get started with high accuracy classification and immediate protection of your data by simply enabling the service. The interactive dashboards give visibility to the where, what, who, and when of your information, allowing you to proactively analyze massive streams of data, data accesses, and API calls in your environment. Learn more about Amazon Macie by visiting the product page or the documentation in the Amazon Macie User Guide.
Encryption at Rest Today we are adding support for encryption of data at rest. When you create a new file system, you can select a key that will be used to encrypt the contents of the files that you store on the file system. The key can be a built-in key that is managed by AWS or a key that you created yourself using AWS Key Management Service (KMS). File metadata (file names, directory names, and directory contents) will be encrypted using a key managed by AWS. Both forms of encryption are implemented using an industry-standard AES-256 algorithm.
You can set this up in seconds when you create a new file system. You simply choose the built-in key (aws/elasticfilesystem) or one of your own:
EFS will take care of the rest! You can select the file system in the console to verify that it is encrypted as desired:
A cryptographic algorithm that meets the approval of FIPS 140-2 is used to encrypt data and metadata. The encryption is transparent and has a minimal effect on overall performance.
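The same choice can be made programmatically. The following is a minimal sketch, assuming the boto3 SDK and its EFS create_file_system call. The creation token and the helper names are my own, and the network call is isolated in a function that is defined but never invoked here.

```python
def encrypted_file_system_request(creation_token, kms_key_id=None):
    """Build request parameters for an encrypted EFS file system.

    When kms_key_id is omitted, EFS falls back to the AWS-managed
    aws/elasticfilesystem key.
    """
    params = {"CreationToken": creation_token, "Encrypted": True}
    if kms_key_id is not None:
        params["KmsKeyId"] = kms_key_id
    return params


def create_encrypted_file_system(creation_token, kms_key_id=None, region_name=None):
    """Create the file system; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helper above works without boto3

    efs = boto3.client("efs", region_name=region_name)
    return efs.create_file_system(
        **encrypted_file_system_request(creation_token, kms_key_id)
    )


# "demo-encrypted-fs" is a hypothetical creation token.
print(encrypted_file_system_request("demo-encrypted-fs"))
```

Encryption at rest is chosen when the file system is created, which is why the Encrypted flag is part of the creation request.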
You can use AWS Identity and Access Management (IAM) to control access to the Customer Master Key (CMK). The CMK must be enabled in order to grant access to the file system; disabling the key prevents it from being used to create new file systems and blocks access (after a period of time) to existing file systems that it protects. To learn more about your options, read Managing Access to Encrypted File Systems.
Available Now Encryption of data at rest is available now in all regions where EFS is supported, at no additional charge.
Our customers run an incredible variety of mission-critical workloads on AWS, many of which process and store sensitive data. As detailed in our Overview of Security Processes document, AWS customers have access to an ever-growing set of options for encrypting and protecting this data. For example, Amazon Relational Database Service (RDS) supports encryption of data at rest and in transit, with options tailored for each supported database engine (MySQL, SQL Server, Oracle, MariaDB, PostgreSQL, and Aurora).
Major CloudHSM Update Today, building on what we have learned from our first-generation product, we are making a major update to CloudHSM, with a set of improvements designed to make the benefits of hardware-based key management available to a much wider audience while reducing the need for specialized operating expertise. Here’s a summary of the improvements:
Pay As You Go – CloudHSM is now offered under a pay-as-you-go model that is simpler and more cost-effective, with no up-front fees.
Fully Managed – CloudHSM is now a scalable managed service; provisioning, patching, high availability, and backups are all built-in and taken care of for you. Scheduled backups extract an encrypted image of your HSM from the hardware (using keys that only the HSM hardware itself knows) that can be restored only to identical HSM hardware owned by AWS. For durability, those backups are stored in Amazon Simple Storage Service (S3), and for an additional layer of security, encrypted again with server-side S3 encryption using an AWS KMS master key.
Open & Compatible – CloudHSM is open and standards-compliant, with support for multiple APIs, programming languages, and cryptography extensions such as PKCS #11, Java Cryptography Extension (JCE), and Microsoft CryptoNG (CNG). The open nature of CloudHSM gives you more control and simplifies the process of moving keys (in encrypted form) from one CloudHSM to another, and also allows migration to and from other commercially available HSMs.
More Secure – CloudHSM Classic (the original model) supports the generation and use of keys that comply with FIPS 140-2 Level 2. We’re stepping that up a notch today with support for FIPS 140-2 Level 3, with security mechanisms that are designed to detect and respond to physical attempts to access or modify the HSM. Your keys are protected with exclusive, single-tenant access to tamper-resistant HSMs that appear within your Virtual Private Clouds (VPCs). CloudHSM supports quorum authentication for critical administrative and key management functions. This feature allows you to define a list of N possible identities that can access the functions, and then require at least M of them to authorize the action. It also supports multi-factor authentication using tokens that you provide.
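The M-of-N idea behind quorum authentication can be sketched in a few lines. This only illustrates the counting rule. It is not how CloudHSM implements quorum approval, and the identity names are hypothetical.

```python
def quorum_met(authorized, approvals, required):
    """Return True when at least `required` distinct authorized identities approved.

    authorized: the N identities allowed to approve the action
    approvals:  identities that have approved so far (may contain repeats
                or identities that are not on the list)
    required:   M, the minimum number of distinct approvals needed
    """
    authorized = set(authorized)
    if not 1 <= required <= len(authorized):
        raise ValueError("required must be between 1 and the number of authorized identities")
    valid_approvals = set(approvals) & authorized
    return len(valid_approvals) >= required


officers = ["alice", "bob", "carol", "dave"]  # N = 4 hypothetical identities

print(quorum_met(officers, ["alice", "carol"], required=2))
print(quorum_met(officers, ["alice", "alice", "mallory"], required=2))
```

Repeated approvals from one identity, or approvals from identities outside the list, do not count toward M.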
Diving In You can create CloudHSM clusters that contain 1 to 32 HSMs, each in a separate Availability Zone in a particular AWS Region. Spreading HSMs across AZs gives you high availability (including built-in load balancing); adding more HSMs gives you additional throughput. The HSMs within a cluster are kept in sync: performing a task or operation on one HSM in a cluster automatically updates the others. Each HSM in a cluster has its own Elastic Network Interface (ENI).
All interaction with an HSM takes place via the AWS CloudHSM client. It runs on an EC2 instance and uses certificate-based mutual authentication to create secure (TLS) connections to the HSMs.
At the hardware level, each HSM includes hardware-enforced isolation of crypto operations and key storage. Each customer HSM runs on dedicated processor cores.
The next step is to apply the signed certificate to the cluster using the console or the CLI. After this has been done, the cluster can be activated by changing the password for the HSM’s administrative user, otherwise known as the Crypto Officer (CO).
Once the cluster has been created, initialized and activated, it can be used to protect data. Applications can use the APIs in AWS CloudHSM SDKs to manage keys, encrypt & decrypt objects, and more. The SDKs provide access to the CloudHSM client (running on the same instance as the application). The client, in turn, connects to the cluster across an encrypted connection.
Available Today The new HSM is available today in the US East (Northern Virginia), US West (Oregon), US East (Ohio), and EU (Ireland) Regions, with more in the works. Pricing starts at $1.45 per HSM per hour.
In case you missed any AWS Security Blog posts published so far in 2017, they are summarized and linked to below. The posts are shown in reverse chronological order (most recent first), and the subject matter ranges from protecting dynamic web applications against DDoS attacks to monitoring AWS account configuration changes and API calls to Amazon EC2 security groups.
March 22: How to Help Protect Dynamic Web Applications Against DDoS Attacks by Using Amazon CloudFront and Amazon Route 53 Using a content delivery network (CDN) such as Amazon CloudFront to cache and serve static text and images or downloadable objects such as media files and documents is a common strategy to improve webpage load times, reduce network bandwidth costs, lessen the load on web servers, and mitigate distributed denial of service (DDoS) attacks. AWS WAF is a web application firewall that can be deployed on CloudFront to help protect your application against DDoS attacks by giving you control over which traffic to allow or block by defining security rules. When users access your application, the Domain Name System (DNS) translates human-readable domain names (for example, www.example.com) to machine-readable IP addresses (for example, 192.0.2.44). A DNS service, such as Amazon Route 53, can effectively connect users’ requests to a CloudFront distribution that proxies requests for dynamic content to the infrastructure hosting your application’s endpoints. In this blog post, I show you how to deploy CloudFront with AWS WAF and Route 53 to help protect dynamic web applications (with dynamic content such as a response to user input) against DDoS attacks. The steps shown in this post are key to implementing the overall approach described in AWS Best Practices for DDoS Resiliency and enable the built-in, managed DDoS protection service, AWS Shield.
March 21: New AWS Encryption SDK for Python Simplifies Multiple Master Key Encryption The AWS Cryptography team is happy to announce a Python implementation of the AWS Encryption SDK. This new SDK helps manage data keys for you, and it simplifies the process of encrypting data under multiple master keys. As a result, this new SDK allows you to focus on the code that drives your business forward. It also provides a framework you can easily extend to ensure that you have a cryptographic library that is configured to match and enforce your standards. The SDK also includes ready-to-use examples. If you are a Java developer, you can refer to this blog post to see specific Java examples for the SDK. In this blog post, I show you how you can use the AWS Encryption SDK to simplify the process of encrypting data and how to protect your encryption keys in ways that help improve application availability by not tying you to a single region or key management solution.
March 21: Updated CJIS Workbook Now Available by Request The need for guidance when implementing Criminal Justice Information Services (CJIS)–compliant solutions has become of paramount importance as more law enforcement customers and technology partners move to store and process criminal justice data in the cloud. AWS services allow these customers to easily and securely architect a CJIS-compliant solution when handling criminal justice data, creating a durable, cost-effective, and secure IT infrastructure that better supports local, state, and federal law enforcement in carrying out their public safety missions. AWS has created several documents (collectively referred to as the CJIS Workbook) to assist you in aligning with the FBI’s CJIS Security Policy. You can use the workbook as a framework for developing CJIS-compliant architecture in the AWS Cloud. The workbook helps you define and test the controls you operate, and document the dependence on the controls that AWS operates (compute, storage, database, networking, regions, Availability Zones, and edge locations).
March 9: New Cloud Directory API Makes It Easier to Query Data Along Multiple Dimensions Today, we made available a new Cloud Directory API, ListObjectParentPaths, that enables you to retrieve all available parent paths for any directory object across multiple hierarchies. Use this API when you want to fetch all parent objects for a specific child object. The order of the paths and objects returned is consistent across iterative calls to the API, unless objects are moved or deleted. In case an object has multiple parents, the API allows you to control the number of paths returned by using a paginated call pattern. In this blog post, I use an example directory to demonstrate how this new API enables you to retrieve data across multiple dimensions to implement powerful applications quickly.
March 8: How to Access the AWS Management Console Using AWS Microsoft AD and Your On-Premises Credentials AWS Directory Service for Microsoft Active Directory, also known as AWS Microsoft AD, is a managed Microsoft Active Directory (AD) hosted in the AWS Cloud. Now, AWS Microsoft AD makes it easy for you to give your users permission to manage AWS resources by using on-premises AD administrative tools. With AWS Microsoft AD, you can grant your on-premises users permissions to resources such as the AWS Management Console instead of adding AWS Identity and Access Management (IAM) user accounts or configuring AD Federation Services (AD FS) with Security Assertion Markup Language (SAML). In this blog post, I show how to use AWS Microsoft AD to enable your on-premises AD users to sign in to the AWS Management Console with their on-premises AD user credentials to access and manage AWS resources through IAM roles.
March 7: How to Protect Your Web Application Against DDoS Attacks by Using Amazon Route 53 and an External Content Delivery Network Distributed Denial of Service (DDoS) attacks are attempts by a malicious actor to flood a network, system, or application with more traffic, connections, or requests than it is able to handle. To protect your web application against DDoS attacks, you can use AWS Shield, a DDoS protection service that AWS provides automatically to all AWS customers at no additional charge. You can use AWS Shield in conjunction with DDoS-resilient web services such as Amazon CloudFront and Amazon Route 53 to improve your ability to defend against DDoS attacks. Learn more about architecting for DDoS resiliency by reading the AWS Best Practices for DDoS Resiliency whitepaper. You also have the option of using Route 53 with an externally hosted content delivery network (CDN). In this blog post, I show how you can help protect the zone apex (also known as the root domain) of your web application by using Route 53 to perform a secure redirect to prevent discovery of your application origin.
February 23: s2n Is Now Handling 100 Percent of SSL Traffic for Amazon S3 Today, we’ve achieved another important milestone for securing customer data: we have replaced OpenSSL with s2n for all internal and external SSL traffic in Amazon Simple Storage Service (Amazon S3) commercial regions. This was implemented with minimal impact to customers, and multiple means of error checking were used to ensure a smooth transition, including client integration tests, catching potential interoperability conflicts, and identifying memory leaks through fuzz testing.
February 13: How to Create an Organizational Chart with Separate Hierarchies by Using Amazon Cloud Directory Amazon Cloud Directory enables you to create directories for a variety of use cases, such as organizational charts, course catalogs, and device registries. Cloud Directory offers you the flexibility to create directories with hierarchies that span multiple dimensions. For example, you can create an organizational chart that you can navigate through separate hierarchies for reporting structure, location, and cost center. In this blog post, I show how to use Cloud Directory APIs to create an organizational chart with two separate hierarchies in a single directory. I also show how to navigate the hierarchies and retrieve data. I use the Java SDK for all the sample code in this post, but you can use other language SDKs or the AWS CLI.
February 9: New! Attach an AWS IAM Role to an Existing Amazon EC2 Instance by Using the AWS CLI AWS Identity and Access Management (IAM) roles enable your applications running on Amazon EC2 to use temporary security credentials that AWS creates, distributes, and rotates automatically. Using temporary credentials is an IAM best practice because you do not need to maintain long-term keys on your instance. Using IAM roles for EC2 also eliminates the need to use long-term AWS access keys that you have to manage manually or programmatically. Starting today, you can enable your applications to use temporary security credentials provided by AWS by attaching an IAM role to an existing EC2 instance. You can also replace the IAM role attached to an existing EC2 instance. In this blog post, I show how you can attach an IAM role to an existing EC2 instance by using the AWS CLI.
January 30: How to Protect Data at Rest with Amazon EC2 Instance Store Encryption Encrypting data at rest is vital for regulatory compliance to ensure that sensitive data saved on disks is not readable by any user or application without a valid key. Some compliance regulations such as PCI DSS and HIPAA require that data at rest be encrypted throughout the data lifecycle. To this end, AWS provides data-at-rest options and key management to support the encryption process. For example, you can encrypt Amazon EBS volumes and configure Amazon S3 buckets for server-side encryption (SSE) using AES-256 encryption. Additionally, Amazon RDS supports Transparent Data Encryption (TDE). Instance storage provides temporary block-level storage for Amazon EC2 instances. This storage is located on disks attached physically to a host computer. Instance storage is ideal for temporary storage of information that frequently changes, such as buffers, caches, and scratch data. By default, files stored on these disks are not encrypted. In this blog post, I show a method for encrypting data on Linux EC2 instance stores by using Linux built-in libraries. This method encrypts files transparently, which protects confidential data. As a result, applications that process the data are unaware of the disk-level encryption.
January 27: How to Detect and Automatically Remediate Unintended Permissions in Amazon S3 Object ACLs with CloudWatch Events Amazon S3 Access Control Lists (ACLs) enable you to specify permissions that grant access to S3 buckets and objects. When S3 receives a request for an object, it verifies whether the requester has the necessary access permissions in the associated ACL. For example, you could set up an ACL for an object so that only the users in your account can access it, or you could make an object public so that it can be accessed by anyone. If the number of objects and users in your AWS account is large, ensuring that you have attached correctly configured ACLs to your objects can be a challenge. For example, what if a user were to call the PutObjectAcl API call on an object that is supposed to be private and make it public? Or, what if a user were to call the PutObject with the optional Acl parameter set to public-read, therefore uploading a confidential file as publicly readable? In this blog post, I show a solution that uses Amazon CloudWatch Events to detect PutObject and PutObjectAcl API calls in near-real time and helps ensure that the objects remain private by making automatic PutObjectAcl calls, when necessary.
January 24: New SOC 2 Report Available: Confidentiality As with everything at Amazon, the success of our security and compliance program is primarily measured by one thing: our customers’ success. Our customers drive our portfolio of compliance reports, attestations, and certifications that support their efforts in running a secure and compliant cloud environment. As a result of our engagement with key customers across the globe, we are happy to announce the publication of our new SOC 2 Confidentiality report. This report is available now through AWS Artifact in the AWS Management Console.
January 18: Compliance in the Cloud for New Financial Services Cybersecurity Regulations Financial regulatory agencies are focused more than ever on ensuring responsible innovation. Consequently, if you want to achieve compliance with financial services regulations, you must be increasingly agile and employ dynamic security capabilities. AWS enables you to achieve this by providing you with the tools you need to scale your security and compliance capabilities on AWS. The following breakdown of the most recent cybersecurity regulations, NY DFS Rule 23 NYCRR 500, demonstrates how AWS continues to focus on your regulatory needs in the financial services sector.
January 9: New Amazon GameDev Blog Post: Protect Multiplayer Game Servers from DDoS Attacks by Using Amazon GameLift In online gaming, distributed denial of service (DDoS) attacks target a game’s network layer, flooding servers with requests until performance degrades considerably. These attacks can limit a game’s availability to players and limit the player experience for those who can connect. Today’s new Amazon GameDev Blog post uses a typical game server architecture to highlight DDoS attack vulnerabilities and discusses how to stay protected by using built-in AWS Cloud security, AWS security best practices, and the security features of Amazon GameLift. Read the post to learn more.
January 6: FedRAMP Compliance Update: AWS GovCloud (US) Region Receives a JAB-Issued FedRAMP High Baseline P-ATO for Three New Services Three new services in the AWS GovCloud (US) region have received a Provisional Authority to Operate (P-ATO) from the Joint Authorization Board (JAB) under the Federal Risk and Authorization Management Program (FedRAMP). JAB issued the authorization at the High baseline, which enables US government agencies and their service providers the capability to use these services to process the government’s most sensitive unclassified data, including Personal Identifiable Information (PII), Protected Health Information (PHI), Controlled Unclassified Information (CUI), criminal justice information (CJI), and financial data.
January 4: The Top 20 Most Viewed AWS IAM Documentation Pages in 2016 The following 20 pages were the most viewed AWS Identity and Access Management (IAM) documentation pages in 2016. I have included a brief description with each link to give you a clearer idea of what each page covers. Use this list to see what other people have been viewing and perhaps to pique your own interest about a topic you’ve been meaning to research.
January 3: The Most Viewed AWS Security Blog Posts in 2016 The following 10 posts were the most viewed AWS Security Blog posts that we published during 2016. You can use this list as a guide to catch up on your blog reading or even read a post again that you found particularly useful.
January 3: How to Monitor AWS Account Configuration Changes and API Calls to Amazon EC2 Security Groups You can use AWS security controls to detect and mitigate risks to your AWS resources. The purpose of each security control is defined by its control objective. For example, the control objective of an Amazon VPC security group is to permit only designated traffic to enter or leave a network interface. Let’s say you have an Internet-facing e-commerce website, and your security administrator has determined that only HTTP (TCP port 80) and HTTPS (TCP 443) traffic should be allowed access to the public subnet. As a result, your administrator configures a security group to meet this control objective. What if, though, someone were to inadvertently change this security group’s rules and enable FTP or other protocols to access the public subnet from any location on the Internet? That expanded access could weaken the security posture of your assets. Consequently, your administrator might need to monitor the integrity of your company’s security controls so that the controls maintain their desired effectiveness. In this blog post, I explore two methods for detecting unintended changes to VPC security groups. The two methods address not only control objectives but also control failures.
If you have questions about or issues with implementing the solutions in any of these posts, please start a new thread on the forum identified near the end of each post.
In the last few years, there has been a rapid rise in enterprises adopting the Apache Hadoop ecosystem for critical workloads that process sensitive or highly confidential data. Due to the highly critical nature of the workloads, the enterprises implement organization-wide or industry-wide policies as well as regulatory or compliance policies. Such policy requirements are designed to protect sensitive data from unauthorized access.
A common requirement within such policies is about encrypting data at-rest and in-flight. Amazon EMR uses “security configurations” to make it easy to specify the encryption keys and certificates, ranging from AWS Key Management Service to supplying your own custom encryption materials provider.
You create a security configuration that specifies encryption settings and then use the configuration when you create a cluster. This makes it easy to build the security configuration one time and use it for any number of clusters.
In this post, I go through the process of setting up the encryption of data at multiple levels using security configurations with EMR. Before I dive deep into encryption, here are the different phases where data needs to be encrypted.
Data at rest
Data residing on Amazon S3—S3 client-side encryption with EMR
Data residing on disk—the Amazon EC2 instance store volumes (except boot volumes) and the attached Amazon EBS volumes of cluster instances are encrypted using Linux Unified Key Setup (LUKS)
Data in transit
Data in transit from EMR to S3, or vice versa—S3 client-side encryption with EMR
Data in transit between nodes in a cluster—in-transit encryption via Secure Sockets Layer (SSL) for MapReduce and Simple Authentication and Security Layer (SASL) for Spark shuffle encryption
Data being spilled to disk or cached during a shuffle phase—Spark shuffle encryption or LUKS encryption
For this post, you create a security configuration that implements encryption in transit and at rest. To achieve this, you create the following resources:
KMS keys for LUKS encryption and S3 client-side encryption for data exiting EMR to S3
SSL certificates to be used for MapReduce shuffle encryption
The environment into which the EMR cluster is launched. For this post, you launch EMR in private subnets and set up an S3 VPC endpoint to get the data from S3.
An EMR security configuration
All of the scripts and code snippets used for this walkthrough are available on the aws-blog-emrencryption GitHub repo.
Generate KMS keys
For this walkthrough, you use AWS KMS, a managed service that makes it easy for you to create and control the encryption keys used to encrypt your data and disks.
You generate two KMS master keys, one for S3 client-side encryption to encrypt data going out of EMR and the other for LUKS encryption to encrypt the local disks. The Hadoop MapReduce framework uses HDFS. Spark uses the local file system on each slave instance for intermediate data throughout a workload, where data could be spilled to disk when it overflows memory.
To generate the keys, use the kms.json AWS CloudFormation script. As part of this script, provide an alias name, or display name, for the keys. An alias must be in the “alias/aliasname” format, and can only contain alphanumeric characters, an underscore, or a dash.
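Before submitting the stack, it can help to check the alias locally. The sketch below encodes only the rule stated above ("alias/" followed by alphanumeric characters, underscores, or dashes). The function name and the sample aliases are my own.

```python
import re

# "alias/" followed by one or more letters, digits, underscores, or dashes.
ALIAS_PATTERN = re.compile(r"alias/[A-Za-z0-9_-]+")


def is_valid_alias(name):
    """Return True when `name` follows the alias/aliasname format."""
    return ALIAS_PATTERN.fullmatch(name) is not None


# Hypothetical alias names, used only for illustration.
for candidate in ["alias/emr-s3-key", "alias/emr_luks_key", "emr-s3-key", "alias/bad name"]:
    print(candidate, is_valid_alias(candidate))
```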
After you finish generating the keys, the ARNs are available as part of the outputs.
Generate SSL certificates
The SSL certificates allow the encryption of the MapReduce shuffle using HTTPS while the data is in transit between nodes.
For this walkthrough, use OpenSSL to generate a self-signed X.509 certificate with a 2048-bit RSA private key that allows access to the issuer’s EMR cluster instances. This prompts you to provide subject information to generate the certificates.
Use the cert-create.sh script to generate SSL certificates that are compressed into a zip file. Upload the zipped certificates to S3 and keep a note of the S3 prefix. You use this S3 prefix when you build your security configuration.
This example is a proof-of-concept demonstration only. Using self-signed certificates is not recommended and presents a potential security risk. For production systems, use a trusted certification authority (CA) to issue certificates.
To implement certificates from custom providers, use the TLSArtifacts provider interface.
Build the environment
For this walkthrough, launch an EMR cluster into a private subnet. If you already have a VPC and would like to launch this cluster into a public subnet, skip this section and jump to the Create a security configuration section.
To launch the cluster into a private subnet, the environment must include the following resources:
Managed NAT gateway
S3 VPC endpoint
Because the EMR cluster is launched into a private subnet, you need a bastion or a jump server to SSH onto the cluster. After the cluster is running, it needs access to the Internet to request the data keys from KMS. Private subnets do not have direct access to the Internet, so route this traffic through the managed NAT gateway. Use an S3 VPC endpoint to provide a highly reliable and secure connection to S3.
As part of the parameters, pick an instance family for the bastion and an EC2 key pair to be used to SSH onto the bastion. Provide an appropriate stack name and add the appropriate tags. For example, the following screenshot is the review step for a stack that I created.
After creating the environment stack, look at the Output tab and make a note of the VPC ID, bastion, and private subnet IDs, as you will use them when you launch the EMR cluster resources.
Create a security configuration
The final step before launching the secure EMR cluster is to create a security configuration. For this walkthrough, create a security configuration with S3 client-side encryption using EMR, and LUKS encryption for local volumes using the KMS keys created earlier. You also use the SSL certificates generated and uploaded to S3 earlier for encrypting the MapReduce shuffle.
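The security configuration can also be expressed as JSON and created from the AWS CLI. The sketch below prints such a document for the three inputs gathered so far. The field names reflect the EMR security configuration format as I understand it, so verify them against the current EMR documentation before relying on them. The function name, the configuration name, and the placeholder values in the usage comment are assumptions.

```shell
# Print an EMR security configuration as JSON.
#   $1: S3 URI of the zipped PEM certificates
#   $2: ARN of the KMS key used for S3 client-side encryption
#   $3: ARN of the KMS key used for local disk (LUKS) encryption
emr_security_config_json() {
  cat <<EOF
{
  "EncryptionConfiguration": {
    "EnableInTransitEncryption": true,
    "EnableAtRestEncryption": true,
    "InTransitEncryptionConfiguration": {
      "TLSCertificateConfiguration": {
        "CertificateProviderType": "PEM",
        "S3Object": "$1"
      }
    },
    "AtRestEncryptionConfiguration": {
      "S3EncryptionConfiguration": {
        "EncryptionMode": "CSE-KMS",
        "AwsKmsKey": "$2"
      },
      "LocalDiskEncryptionConfiguration": {
        "EncryptionKeyProviderType": "AwsKms",
        "AwsKmsKey": "$3"
      }
    }
  }
}
EOF
}

# Usage (all values are placeholders):
#   aws emr create-security-configuration \
#     --name emr-encryption-demo \
#     --security-configuration "$(emr_security_config_json \
#         s3://<your-bucket>/<your-prefix>/my-certs.zip \
#         <s3-cse-key-arn> <luks-key-arn>)"
```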
From the Build the environment section, you have the VPC ID and the subnet ID for the private subnet into which the EMR cluster should be launched. Select those values for the Network and EC2 Subnet fields. In the next step, provide a name and tags for the cluster.
The last step is to select the private key, assign the security configuration that was created in the Create a security configuration section, and choose Create Cluster.
Now that you have the environment and the cluster up and running, you can get onto the master node to run scripts. You need the IP address, which you can retrieve from the EMR console page. Choose Hardware, Master Instance group and note the private IP address of the master node.
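Because the master node only has a private IP address, the SSH connection has to hop through the bastion. One way to do that is the ProxyJump option (-J) in recent OpenSSH clients. The helper below only prints the command. It assumes the bastion login user is ec2-user (typical for Amazon Linux) and that the EC2 key pair has been loaded into ssh-agent; the function name and the addresses in the usage comment are made up for illustration.

```shell
# Print the ssh command that reaches the EMR master node through the bastion.
#   $1: bastion public DNS name or IP address
#   $2: master node private IP address (from the EMR console)
#   $3: bastion login user (optional; assumption: ec2-user)
master_ssh_command() {
  bastion_user=${3:-ec2-user}
  # EMR cluster nodes accept SSH logins as the hadoop user.
  printf 'ssh -J %s@%s hadoop@%s\n' "$bastion_user" "$1" "$2"
}

# Usage (addresses are placeholders):
#   ssh-add <path-to-your-key-pair>.pem
#   master_ssh_command 203.0.113.10 10.0.1.25
#   # prints: ssh -J ec2-user@203.0.113.10 hadoop@10.0.1.25
```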
After you are on the master node, bring your own Hive or Spark scripts. For testing purposes, the GitHub /code directory includes the test.py PySpark and test.q Hive scripts.
As part of this post, I’ve identified the different phases where data needs to be encrypted and walked through how data in each phase can be encrypted. Then, I described a step-by-step process to achieve all the encryption prerequisites, such as building the KMS keys, building SSL certificates, and launching the EMR cluster with a strong security configuration. As part of this walkthrough, you also secured the data by launching your cluster in a private subnet within a VPC, and used a bastion instance for access to the EMR cluster.
If you have questions or suggestions, please comment below.
About the Author
Sai Sriparasa is a Big Data Consultant for AWS Professional Services. He works with our customers to provide strategic & tactical big data solutions with an emphasis on automation, operations & security on AWS. In his spare time, he follows sports and current affairs.
Encrypting data at rest is vital for regulatory compliance to ensure that sensitive data saved on disks is not readable by any user or application without a valid key. Some compliance regulations such as PCI DSS and HIPAA require that data at rest be encrypted throughout the data lifecycle. To this end, AWS provides data-at-rest options and key management to support the encryption process. For example, you can encrypt Amazon EBS volumes and configure Amazon S3 buckets for server-side encryption (SSE) using AES-256 encryption. Additionally, Amazon RDS supports Transparent Data Encryption (TDE).
Instance storage provides temporary block-level storage for Amazon EC2 instances. This storage is located on disks attached physically to a host computer. Instance storage is ideal for temporary storage of information that frequently changes, such as buffers, caches, and scratch data. By default, files stored on these disks are not encrypted.
In this blog post, I show a method for encrypting data on Linux EC2 instance stores by using Linux built-in libraries. This method encrypts files transparently, which protects confidential data. As a result, applications that process the data are unaware of the disk-level encryption.
First, though, I will provide some background information required for this solution.
Disk and file system encryption
You can use two methods to encrypt files on instance stores. The first method is disk encryption, in which the entire disk or a block within the disk is encrypted by using one or more encryption keys. Disk encryption operates below the file-system level, is operating-system agnostic, and hides directory and file information such as name and size.
The second method is file-system-level encryption. Files and directories are encrypted, but not the entire disk or partition. File-system-level encryption operates on top of the file system and is portable across operating systems. Encrypting File System, for example, is a Microsoft extension to the Windows NT operating system’s New Technology File System (NTFS) that provides file-level encryption.
The Linux dm-crypt Infrastructure
Dm-crypt is a Linux kernel-level encryption mechanism that allows users to mount an encrypted file system. Mounting a file system is the process in which a file system is attached to a directory (mount point), making it available to the operating system. After mounting, all files in the file system are available to applications without any additional interaction; however, these files are encrypted when stored on disk.
Device mapper is an infrastructure in the Linux 2.6 and 3.x kernel that provides a generic way to create virtual layers of block devices. The device mapper crypt target provides transparent encryption of block devices using the kernel crypto API. The solution in this post uses dm-crypt in conjunction with a disk-backed file system mapped to a logical volume by the Logical Volume Manager (LVM). LVM provides logical volume management for the Linux kernel.
The following diagram depicts the relationship between an application, file system, and dm-crypt. Dm-crypt sits between the physical disk and the file system, and data written from the operating system to the disk is encrypted. The application is unaware of such disk-level encryption. Applications use a specific mount point in order to store and retrieve files, and these files are encrypted when stored to disk. If the disk is lost or stolen, the data on the disk is useless.
Overview of the solution
In this post, I create a new file system called secretfs. This file system is encrypted using dm-crypt. This example uses LVM and Linux Unified Key Setup (LUKS) to encrypt a file system. The encrypted file system sits on the EC2 instance store disk. Note that the instance store’s existing file system is not itself encrypted; only the newly created file system is.
The following diagram shows how the newly encrypted file system resides in the EC2 internal store disk. Applications that need to save sensitive data temporarily will use the secretfs mount point (‘/mnt/secretfs’) directory to store temporary or scratch files.
This solution has three requirements. First, you need to configure the related items on boot using EC2 launch configuration because the encrypted file system is created at boot time. An administrator should have full control over every step and should be able to grant and revoke the encrypted file system creation or access to keys. Second, you must enable logging for every encryption or decryption request by using AWS CloudTrail. In particular, logging is critical when the keys are created and when an EC2 instance requests password decryption to unlock an encrypted file system. Lastly, you should integrate the solution with other AWS services, as described in the next section.
AWS services used in this solution
I use the following AWS services in this solution:
AWS Key Management Service (KMS) – AWS KMS is a managed service that enables easy creation and control of encryption keys used to encrypt data. KMS uses envelope encryption in which data is encrypted using a data key that is then encrypted using a master key. Master keys can also be used to encrypt and decrypt up to 4 kilobytes of data. In our solution, I use KMS encrypt/decrypt APIs to encrypt the encrypted file system’s password. See more information about envelope encryption.
AWS CloudTrail – CloudTrail records AWS API calls for your account. KMS and CloudTrail are fully integrated, which means CloudTrail logs each request to and from KMS for future auditing. This post’s solution enables CloudTrail for monitoring and audit.
Amazon S3 – S3 is an AWS storage service. I use S3 in this post to save the encrypted file system password.
AWS Identity and Access Management (IAM) – AWS IAM enables you to control access securely to AWS services. In this post, I configure and attach a policy to EC2 instances that allows access to the S3 bucket to read the encrypted password file and to KMS to decrypt the file system password.
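Because CloudTrail records the KMS calls, you can look up recent decryption requests from the AWS CLI once the solution is running. The following is a small sketch using the cloudtrail lookup-events command. The function name and the default of 20 results are assumptions, and the lookup matches on the event name Decrypt only.

```shell
# List recent Decrypt calls recorded by CloudTrail.
#   $1: AWS region
#   $2: maximum number of events to return (optional; assumption: 20)
list_kms_decrypt_events() {
  aws --region "$1" cloudtrail lookup-events \
    --lookup-attributes AttributeKey=EventName,AttributeValue=Decrypt \
    --max-results "${2:-20}" \
    --query 'Events[].[EventTime,EventName,Username]' \
    --output text
}

# Usage:
#   list_kms_decrypt_events us-east-1 10
```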
The following diagram illustrates the steps in the process of encrypting the EC2 instance store.
In this architectural diagram:
The administrator encrypts a secret password by using KMS. The encrypted password is stored in a file.
The administrator puts the file containing the encrypted password in an S3 bucket.
At instance boot time, the instance copies the encrypted file to an internal disk.
The EC2 instance then decrypts the file using KMS and retrieves the plaintext password. The password is used to configure the Linux encrypted file system with LUKS. All data written to the encrypted file system is encrypted by using an AES-128 encryption algorithm when stored on disk.
Implementing the solution
Create an S3 bucket
First, you create a bucket to store the encrypted password file. This file contains the password (key) used to encrypt the file system. Each EC2 instance upon boot copies the encrypted password file, decrypts the file, and retrieves the plaintext password, which is used to encrypt the file system on the instance store disk.
In this step, you create the S3 bucket that stores the encrypted password file, and apply the necessary permissions. If you are using an Amazon VPC endpoint for Amazon S3, you also need to add permissions to the bucket to allow access from the endpoint. (For a detailed example, see Example Bucket Policies for VPC Endpoints for Amazon S3.)
To create a new bucket:
Sign in to the S3 console and choose Create Bucket.
In the Bucket Name box, type your bucket name and then choose Create.
You should see the details about your new bucket in the right pane.
Configure IAM roles and permission for the S3 bucket
When an EC2 instance boots, it must read the encrypted password file from S3 and then decrypt the password using KMS. In this section, I configure an IAM policy that allows the EC2 instance to assume a role with the right access permissions to the S3 bucket. The following policy grants the correct access permissions, in which your-bucket-name is the S3 bucket that stores the encrypted password file.
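A minimal sketch of such a policy follows, wrapped in a small shell function so that the bucket name is a parameter. It assumes the encrypted password is stored under the object key LuksInternalStorageKey (the file name used later in this post). The function name and the policy name in the usage comment are illustrative.

```shell
# Print an IAM policy document that allows reading a single S3 object.
#   $1: bucket name (your-bucket-name)
#   $2: object key of the encrypted password file
s3_read_policy_json() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::$1/$2"]
    }
  ]
}
EOF
}

# Usage (policy name is a placeholder):
#   aws iam create-policy \
#     --policy-name ReadEncryptedLuksPassword \
#     --policy-document "$(s3_read_policy_json your-bucket-name LuksInternalStorageKey)"
```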
The preceding policy grants read access to the bucket where the encrypted password is stored. This policy is used by the EC2 instance, which requires you to configure an IAM role. You will configure KMS permissions later in this post.
In the IAM console, choose Roles, and then choose Create New Role.
In Step 1: Role Name, type your role name, and choose Next Step.
In Step 2: Select Role Type, choose Amazon EC2 and choose Next Step.
In Step 3: Established Trust, choose Next Step.
In Step 4: Attach Policy, choose the policy you created in Step 1, as shown in the following screenshot.
In Step 5: Review, review the configuration and complete the steps. The newly created IAM role is now ready. You will use it when launching new EC2 instances, which will have the permission to access the encrypted password file in the S3 bucket.
You now should have a new IAM role listed on the Roles page. Choose Roles to list all roles in your account and then select the role you just created as shown in the following screenshot.
Encrypt a secret password with KMS and store it in the S3 bucket
Next, you use KMS to encrypt a secret password. To encrypt text by using KMS, you must use AWS CLI. AWS CLI is installed by default on EC2 Amazon Linux instances and you can install it on Linux, Windows, or Mac computers.
To encrypt a secret password with KMS and store it in the S3 bucket:
From the AWS CLI, type the following command to encrypt a secret password by using KMS (replace the region name with your region). You must have the right permissions in order to create keys and put objects in S3 (for more details, see Using IAM Policies with AWS KMS). In this example, I have used AWS CLI on the Linux OS to encrypt and generate the encrypted password file.
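The commands might look like the following sketch, written as two shell functions. It reads the password from a local file with fileb:// so that the CLI sends the raw bytes, and it decodes the Base64 output of kms encrypt into a binary file. The function names and the plaintext file name are assumptions; the alias EncFSForEC2InternalStorageKey and the output file LuksInternalStorageKey are the names used in this post.

```shell
# Create a KMS key with an alias and print the new key ID.
#   $1: AWS region
#   $2: alias name without the "alias/" prefix
create_luks_kms_key() {
  key_id=$(aws --region "$1" kms create-key \
    --query KeyMetadata.KeyId --output text) || return 1
  aws --region "$1" kms create-alias \
    --alias-name "alias/$2" --target-key-id "$key_id" || return 1
  printf '%s\n' "$key_id"
}

# Encrypt the password stored in a file and write the binary ciphertext.
#   $1: AWS region
#   $2: alias name without the "alias/" prefix
#   $3: file containing the plaintext password
#   $4: output file for the ciphertext
encrypt_password_file() {
  aws --region "$1" kms encrypt --key-id "alias/$2" \
    --plaintext "fileb://$3" \
    --query CiphertextBlob --output text | base64 --decode > "$4"
}

# Usage (replace the region and bucket name with your own):
#   create_luks_kms_key us-east-1 EncFSForEC2InternalStorageKey
#   encrypt_password_file us-east-1 EncFSForEC2InternalStorageKey password.txt LuksInternalStorageKey
#   aws s3 cp LuksInternalStorageKey s3://<your-bucket-name>/LuksInternalStorageKey
#   rm password.txt   # do not leave the plaintext password on disk
```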
The preceding commands encrypt the password (Base64 is used to decode the cipher text). The command outputs the results to a file called LuksInternalStorageKey. It also creates a key alias (key name) that makes it easy to identify different keys; the alias is called EncFSForEC2InternalStorageKey. The file is then copied to the S3 bucket I created earlier in this post.
Configure permissions to allow the role to access the KMS key
Next, you grant the role access to the key you just created with KMS:
From the IAM console, choose Encryption keys from the navigation pane.
Select EncFSForEC2InternalStorageKey (this is the key alias you configured in the previous section). To add a new role that can use the key, scroll down to the Key Policy and then choose Add under Key Users.
Choose the new role you created earlier in this post and then choose Attach.
The role now has permission to use the key.
Configure EC2 with role and launch configurations
In this section, you launch a new EC2 instance with the new IAM role and a bootstrap script that executes the steps to encrypt the file system, as described earlier in the “Architectural overview” section:
In the EC2 console, launch a new instance (see this tutorial for more details). In Step 3: Configure Instance Details, choose the IAM role you configured earlier, as shown in the following screenshot.
Expand the Advanced Details section (see previous screenshot) and paste the following script in the EC2 instance’s User data box. Keep the As text check box selected. The script will be executed at EC2 boot time.
#!/bin/bash
## Initial setup to be executed on boot
# Create an empty file. This file will be used to host the file system.
# In this example we create a 2 GB file called secretfs (Secret File System).
dd of=secretfs bs=1G count=0 seek=2
# Lock down normal access to the file.
chmod 600 secretfs
# Associate a loopback device with the file.
losetup /dev/loop0 secretfs
# Copy encrypted password file from S3. The password is used to configure LUKS later on.
aws s3 cp s3://an-internalstoragekeybucket/LuksInternalStorageKey .
# Decrypt the password from the file with KMS, save the secret password in LuksClearTextKey
LuksClearTextKey=$(aws --region us-east-1 kms decrypt --ciphertext-blob fileb://LuksInternalStorageKey --output text --query Plaintext | base64 --decode)
# Encrypt storage in the device. cryptsetup will use the Linux
# device mapper to create, in this case, /dev/mapper/secretfs.
# Initialize the volume and set an initial key.
echo "$LuksClearTextKey" | cryptsetup -y luksFormat /dev/loop0
# Open the partition, and create a mapping to /dev/mapper/secretfs.
echo "$LuksClearTextKey" | cryptsetup luksOpen /dev/loop0 secretfs
# Clear the LuksClearTextKey variable because we don't need it anymore.
unset LuksClearTextKey
# Check its status (optional).
cryptsetup status secretfs
# Zero out the new encrypted device.
dd if=/dev/zero of=/dev/mapper/secretfs
# Create a file system and verify its status.
mke2fs -j -O dir_index /dev/mapper/secretfs
# List file system configuration (optional).
tune2fs -l /dev/mapper/secretfs
# Create the mount point and mount the new file system to /mnt/secretfs.
mkdir -p /mnt/secretfs
mount /dev/mapper/secretfs /mnt/secretfs
If you have not enabled it already, be sure to enable CloudTrail on your account. Using CloudTrail, you will be able to monitor and audit access to the KMS key.
Launch the EC2 instance, which copies the password file from S3, decrypts the file using KMS, and configures an encrypted file system. The file system is mounted on /mnt/secretfs. Therefore, every file written to this mount point is encrypted when stored to disk. Applications that process sensitive data and need temporary storage should use the encrypted file system by writing and reading files from the mount point, ‘/mnt/secretfs’. The rest of the file system (for example, /home/ec2-user) is not encrypted.
You can list the encrypted file system’s status. First, SSH to the EC2 instance using the key pair you used to launch the EC2 instance. (For more information about logging in to an EC2 instance using a key pair, see Getting Started with Amazon EC2 Linux Instances.) Then, run the following command as root.
[root@<instance-hostname> ec2-user]# cryptsetup status secretfs
/dev/mapper/secretfs is active and is in use.
keysize: 256 bits
offset: 4096 sectors
size: 4190208 sectors
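As the command’s results should show, the file system is encrypted with AES in XTS mode using a 256-bit key. Because XTS splits the key into two halves, a 256-bit key corresponds to AES-128, which matches the algorithm named in the architecture description earlier in this post. XTS is a block cipher mode of operation designed for encrypting data on block storage devices.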
This blog post shows you how to encrypt a file system on EC2 instance storage by using built-in Linux libraries and drivers with LVM and LUKS, in conjunction with AWS services such as S3 and KMS. If your applications need temporary storage, you can use an EC2 internal disk that is physically attached to the host computer. The data on instance stores persists only during the lifetime of its associated instance. However, instance store volumes are not encrypted by default. This post provides a simple solution that balances the speed and availability of instance stores with the need for encryption at rest when dealing with sensitive data.
If you have comments about this blog post, submit them in the “Comments” section below. If you have implementation questions about the solution in this post, please start a new thread on the EC2 forum.