AWS Config enables continuous monitoring of your AWS resources, making it simple to assess, audit, and record resource configurations and changes. AWS Config does this through the use of rules that define the desired configuration state of your AWS resources. AWS Config provides a number of AWS managed rules that address a wide range of security concerns such as checking if you encrypted your Amazon Elastic Block Store (Amazon EBS) volumes, tagged your resources appropriately, and enabled multi-factor authentication (MFA) for root accounts. You can also create custom rules to codify your compliance requirements through the use of AWS Lambda functions.
In this post we’ll show you how to use AWS Config to monitor our Amazon Simple Storage Service (S3) bucket ACLs and policies for violations which allow public read or public write access. If AWS Config finds a policy violation, we’ll have it trigger an Amazon CloudWatch Event rule to trigger an AWS Lambda function which either corrects the S3 bucket ACL, or notifies you via Amazon Simple Notification Service (Amazon SNS) that the policy is in violation and allows public read or public write access. We’ll show you how to do this in five main steps.
Enable AWS Config to monitor Amazon S3 bucket ACLs and policies for compliance violations.
Create an IAM Role and Policy that grants a Lambda function permissions to read S3 bucket policies and send alerts through SNS.
Create and configure a CloudWatch Events rule that triggers the Lambda function when AWS Config detects an S3 bucket ACL or policy violation.
Create a Lambda function that uses the IAM role to review S3 bucket ACLs and policies, correct the ACLs, and notify your team of out-of-compliance policies.
Verify the monitoring solution.
Note: This post assumes your compliance policies require the buckets you monitor not allow public read or write access. If you have intentionally open buckets serving static content, for example, you can use this post as a jumping-off point for a solution tailored to your needs.
At the end of this post, we provide an AWS CloudFormation template that implements the solution outlined. The template enables you to deploy the solution in multiple regions quickly.
Important: The use of some of the resources deployed, including those deployed using the provided CloudFormation template, will incur costs as long as they are in use. AWS Config Rules incur costs in each region they are active.
Architecture
Here’s an architecture diagram of what we’ll implement:
Figure 1: Architecture diagram
Step 1: Enable AWS Config and Amazon S3 Bucket monitoring
The following steps demonstrate how to set up AWS Config to monitor Amazon S3 buckets.
If this is your first time using AWS Config, select Get started. If you’ve already used AWS Config, select Settings.
In the Settings page, under Resource types to record, clear the All resources checkbox. In the Specific types list, select Bucket under S3.
Figure 2: The Settings dialog box showing the “Specific types” list
Choose the Amazon S3 bucket for storing configuration history and snapshots. We’ll create a new Amazon S3 bucket.
Figure 3: Creating an S3 bucket
If you prefer to use an existing Amazon S3 bucket in your account, select the Choose a bucket from your account radio button and, using the dropdown, select an existing bucket.
Figure 4: Selecting an existing S3 bucket
Under Amazon SNS topic, check the box next to Stream configuration changes and notifications to an Amazon SNS topic, and then select the radio button to Create a topic.
Alternatively, you can choose a topic that you have previously created and subscribed to.
Figure 5: Selecting a topic that you’ve previously created and subscribed to
If you created a new SNS topic you need to subscribe to it to receive notifications. We’ll cover this in a later step.
Under AWS Config role, choose Create a role (unless you already have a role you want to use). We’re using the auto-suggested role name.
Figure 6: Creating a role
Select Next.
Configure Amazon S3 bucket monitoring rules:
On the AWS Config rules page, search for S3 and choose the s3-bucket-publice-read-prohibited and s3-bucket-public-write-prohibited rules, then click Next.
Figure 7: AWS Config rules dialog
On the Review page, select Confirm. AWS Config is now analyzing your Amazon S3 buckets, capturing their current configurations, and evaluating the configurations against the rules we selected.
If you created a new Amazon SNS topic, open the Amazon SNS Management Console and locate the topic you created:
Figure 8: Amazon SNS topic list
Copy the ARN of the topic (the string that begins with arn:) because you’ll need it in a later step.
Select the checkbox next to the topic, and then, under the Actions menu, select Subscribe to topic.
Select Email as the protocol, enter your email address, and then select Create subscription.
After several minutes, you’ll receive an email asking you to confirm your subscription for notifications for this topic. Select the link to confirm the subscription.
Step 2: Create a Role for Lambda
Our Lambda will need permissions that enable it to inspect and modify Amazon S3 bucket ACLs and policies, log to CloudWatch Logs, and publishing to an Amazon SNS topic. We’ll now set up a custom AWS Identity and Access Management (IAM) policy and role to support these actions and assign them to the Lambda function we’ll create in the next section.
In the AWS Management Console, under Services, select IAM to access the IAM Console.
Create a policy with the following permissions, or copy the following policy:
Select Lambda from the list of services that will use this role.
Select the check box next to the policy you created previously, and then select Next: Review
Name your role, give it a description, and then select Create Role. In this example, we’re naming the role LambdaS3PolicySecuringRole.
Step 3: Create and Configure a CloudWatch Rule
In this section, we’ll create a CloudWatch Rule to trigger the Lambda function when AWS Config determines that your Amazon S3 buckets are non-compliant.
In the AWS Management Console, under Services, select CloudWatch.
On the left-hand side, under Events, select Rules.
Click Create rule.
In Step 1: Create rule, under Event Source, select the dropdown list and select Build custom event pattern.
Copy the following pattern and paste it into the text box:
The pattern matches events generated by AWS Config when it checks the Amazon S3 bucket for public accessibility.
We’ll add a Lambda target later. For now, select your Amazon SNS topic created earlier, and then select Configure details.
Figure 9: The “Create rule” dialog
Give your rule a name and description. For this example, we’ll name ours AWSConfigFoundOpenBucket
Click Create rule.
Step 4: Create a Lambda Function
In this section, we’ll create a new Lambda function to examine an Amazon S3 bucket’s ACL and bucket policy. If the bucket ACL is found to allow public access, the Lambda function overwrites it to be private. If a bucket policy is found, the Lambda function creates an SNS message, puts the policy in the message body, and publishes it to the Amazon SNS topic we created. Bucket policies can be complex, and overwriting your policy may cause unexpected loss of access, so this Lambda function doesn’t attempt to alter your policy in any way.
Get the ARN of the Amazon SNS topic created earlier.
In the AWS Management Console, under Services, select Lambda to go to the Lambda Console.
From the Dashboard, select Create Function. Or, if you were taken directly to the Functions page, select the Create Function button in the upper-right.
On the Create function page:
Choose Author from scratch.
Provide a name for the function. We’re using AWSConfigOpenAccessResponder.
The Lambda function we’ve written is Python 3.6 compatible, so in the Runtime dropdown list, select Python 3.6.
Under Role, select Choose an existing role. Select the role you created in the previous section, and then select Create function.
Figure 10: The “Create function” dialog
We’ll now add a CloudWatch Event based on the rule we created earlier.
In the Add triggers section, select CloudWatch Events. A CloudWatch Events box should appear connected to the left side of the Lambda Function and have a note that states Configuration required.
Figure 11: CloudWatch Events in the “Add triggers” section
From the Rule dropdown box, choose the rule you created earlier, and then select Add.
Scroll up to the Designer section and select the name of your Lambda function.
Delete the default code and paste in the following code:
import boto3
from botocore.exceptions import ClientError
import json
import os
ACL_RD_WARNING = "The S3 bucket ACL allows public read access."
PLCY_RD_WARNING = "The S3 bucket policy allows public read access."
ACL_WRT_WARNING = "The S3 bucket ACL allows public write access."
PLCY_WRT_WARNING = "The S3 bucket policy allows public write access."
RD_COMBO_WARNING = ACL_RD_WARNING + PLCY_RD_WARNING
WRT_COMBO_WARNING = ACL_WRT_WARNING + PLCY_WRT_WARNING
def policyNotifier(bucketName, s3client):
try:
bucketPolicy = s3client.get_bucket_policy(Bucket = bucketName)
# notify that the bucket policy may need to be reviewed due to security concerns
sns = boto3.client('sns')
subject = "Potential compliance violation in " + bucketName + " bucket policy"
message = "Potential bucket policy compliance violation. Please review: " + json.dumps(bucketPolicy['Policy'])
# send SNS message with warning and bucket policy
response = sns.publish(
TopicArn = os.environ['TOPIC_ARN'],
Subject = subject,
Message = message
)
except ClientError as e:
# error caught due to no bucket policy
print("No bucket policy found; no alert sent.")
def lambda_handler(event, context):
# instantiate Amazon S3 client
s3 = boto3.client('s3')
resource = list(event['detail']['requestParameters']['evaluations'])[0]
bucketName = resource['complianceResourceId']
complianceFailure = event['detail']['requestParameters']['evaluations'][0]['annotation']
if(complianceFailure == ACL_RD_WARNING or complianceFailure == PLCY_RD_WARNING):
s3.put_bucket_acl(Bucket = bucketName, ACL = 'private')
elif(complianceFailure == PLCY_RD_WARNING or complianceFailure == PLCY_WRT_WARNING):
policyNotifier(bucketName, s3)
elif(complianceFailure == RD_COMBO_WARNING or complianceFailure == WRT_COMBO_WARNING):
s3.put_bucket_acl(Bucket = bucketName, ACL = 'private')
policyNotifier(bucketName, s3)
return 0 # done
Scroll down to the Environment variables section. This code uses an environment variable to store the Amazon SNS topic ARN.
For the key, enter TOPIC_ARN.
For the value, enter the ARN of the Amazon SNS topic created earlier.
Under Execution role, select Choose an existing role, and then select the role created earlier from the dropdown.
Leave everything else as-is, and then, at the top, select Save.
Step 5: Verify it Works
We now have the Lambda function, an Amazon SNS topic, AWS Config watching our Amazon S3 buckets, and a CloudWatch Rule to trigger the Lambda function if a bucket is found to be non-compliant. Let’s test them to make sure they work.
We have an Amazon S3 bucket, myconfigtestbucket that’s been created in the region monitored by AWS Config, as well as the associated Lambda function. This bucket has no public read or write access set in an ACL or a policy, so it’s compliant.
Figure 12: The “Config Dashboard”
Let’s change the bucket’s ACL to allow public listing of objects:
Figure 13: Screen shot of “Permissions” tab showing Everyone granted list access
After saving, the bucket now has public access. After several minutes, the AWS Config Dashboard notes that there is one non-compliant resource:
Figure 14: The “Config Dashboard” shown with a non-compliant resource
In the Amazon S3 Console, we can see that the bucket no longer has public listing of objects enabled after the invocation of the Lambda function triggered by the CloudWatch Rule created earlier.
Figure 15: The “Permissions” tab showing list access no longer allowed
Notice that the AWS Config Dashboard now shows that there are no longer any non-compliant resources:
Figure 16: The “Config Dashboard” showing zero non-compliant resources
Now, let’s try out the Amazon S3 bucket policy check by configuring a bucket policy that allows list access:
Figure 17: A bucket policy that allows list access
A few minutes after setting this bucket policy on the myconfigtestbucket bucket, AWS Config recognizes the bucket is no longer compliant. Because this is a bucket policy rather than an ACL, we publish a notification to the SNS topic we created earlier that lets us know about the potential policy violation:
Figure 18: Notification about potential policy violation
Knowing that the policy allows open listing of the bucket, we can now modify or delete the policy, after which AWS Config will recognize that the resource is compliant.
Conclusion
In this post, we demonstrated how you can use AWS Config to monitor for Amazon S3 buckets with open read and write access ACLs and policies. We also showed how to use Amazon CloudWatch, Amazon SNS, and Lambda to overwrite a public bucket ACL, or to alert you should a bucket have a suspicious policy. You can use the CloudFormation template to deploy this solution in multiple regions quickly. With this approach, you will be able to easily identify and secure open Amazon S3 bucket ACLs and policies. Once you have deployed this solution to multiple regions you can aggregate the results using an AWS Config aggregator. See this post to learn more.
If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on the AWS Config forum or contact AWS Support.
Want more AWS Security news? Follow us on Twitter.
AWS CloudHSM provides fully managed, single-tenant hardware security modules (HSMs) in the AWS cloud. A CloudHSM cluster contains either one or multiple HSMs. Multiple HSMs support higher throughput levels for cryptographic operations and provide redundancy. For clusters with multiple HSMs, the CloudHSM service supports server-side automated synchronization of keys and policies. Users, however, are synchronized from the client-side and the synchronization is driven by configuration files which must be refreshed when the cluster size changes. If you do not refresh the configuration files, your CloudHSM user configurations could become unsynchronized and affect the ability of your CloudHSM cluster to provide consistent support of cryptographic information.
In this blog post, I’ll provide a general overview of a CloudHSM architecture, discuss the cluster synchronization process, build a CloudHSM environment, show how the cluster users can become unsynchronized, and then restore user synchronization to bring your cluster back to a consistent state to meet your needs for consistency and redundancy.
CloudHSM Architectural Overview
When you provision an HSM instance in CloudHSM, the HSM instance provides an elastic network interface (ENI) in yourAmazon VPC while the HSM itself resides in a separate VPC managed by AWS CloudHSM. Your applications use the CloudHSM cluster ID to add or remove HSMs from the cluster and the ENI(s) of the HSM instance(s) to access the HSM instances.
You configure your cluster and its HSM instances using CloudHSM client software you deploy on Amazon EC2 instances in your VPC. You only need one such EC2 instance to manage a CloudHSM cluster, but it’s common to deploy additional EC2 instances in other availability zones to provide for client redundancy. Your applications communicate with the HSM instances using the client daemon. You manage and configure the cluster with command line tools including cloudhsm_mgmt_util, key_mgmt_util, and configure. An example of a CloudHSM architecture appears below.
Figure 1: A 3-Node CloudHSM architecture
The diagram shows a three-node CloudHSM cluster deployed in the us-west-2 (Oregon) region with three Amazon EC2 instances with the CloudHSM software. The client in Availability Zone 2 is communicating with the cluster through the elastic network interfaces in each availability zone.
CloudHSM Synchronization Process
Having discussed the architecture of AWS CloudHSM, let’s turn our attention to the matter of cluster synchronization. There are three events that require synchronization: cluster expansion, key management operations, and user management operations. Let’s look at each of these in more detail.
Cluster Expansion
When you add an HSM to an existing cluster, AWS CloudHSM clones all users, keys, and policies from another HSM in the cluster. No additional steps are required on your part.
Key Management Operations
Key management with the key_mgmt_util tool uses the CloudHSM client to communicate with the HSM cluster. Additionally, a fallback, HSM-based synchronization protocol keeps keys in sync.
User Management
You perform user management tasks, such as adding users or changing passwords, using the cloudhsm_mgmt_util tool. This tool communicates directly with the HSMs, bypassing the client daemon. cloudhsm_mgmt_util uses its own configuration files to determine the HSMs that it should connect to within the cluster. These configuration files aren’t updated dynamically when HSM instances are added. To prevent user synchronization errors, you must update the configuration files before running cloudhsm_mgmt_util. You must also not add new HSM instances to the cluster while you’re using the tool. This helps ensure that no HSM instances are accidentally left out of user updates that would in turn result in user synchronization problems.
Again, these safeguards are only necessary when using cloudhsm_mgmt_util. For all other applications and utilities using CloudHSM, the client daemon automatically reconfigures itself as you add and remove HSM instances from your cluster. In the remainder of this post, I will build a CloudHSM infrastructure as shown in the above diagram. I’ll then show you how users on your CloudHSM instances can become unsynchronized, and how to restore proper synchronization.
Prerequisites and Assumptions
You’ll need to have an AWS account that allows you to provision Amazon VPCs, Amazon EC2 instances, and CloudHSMs.
I’ll use the us-west-2 (Oregon) region, but you can use any region that offers CloudHSM.
You’ll need an Amazon EC2 key pair in the region.
You should have a working knowledge of the services I’ve mentioned.
Important: You’ll incur charges for the resources used in this example. You can find the cost of each service on that service’s pricing page.
Building a CloudHSM Infrastructure
Create an Amazon VPC with subnets in the us-west-2a, us-west-2b, and us-east-2c availability zones. I’ll use the Amazon VPC Architecture Quick Start, which is an AWS CloudFormation template that will do this on your behalf. Make sure you select the correct region after you load the Quick Start. Select the following parameters:
Parameter
Value
Availability Zones
us-west-2a, us-west-2b, us-west-2c
Number of Availability Zones
3
Create private subnets
False
Create additional private subnets with dedicated network ACLs
False
Key pair name
The name of your Amazon EC2 key pair
Accept the default values for all other parameters.
Follow these instructions to create a CloudHSM cluster in your new VPC in the us-west-2a, us-west-2b and us-west-2c availability zones. Note that the cluster will not have any HSMs after it’s created.
Follow these instructions to initialize the cluster with an HSM in the us-west-2a availability zone. After the cluster is initialized, note the ENI IP address from the cluster details section in the console as shown here:
Install the client software on the EC2 instance you launched in step 4.
Add the IP of the EC2 instance that you identified in step 4 to the security group you identified in step 3.
Activate the cluster. The activation instructions will guide you through connecting to the EC2 instance you launched in step 4. Remain logged into the EC2 instance following the activation of the cluster for the steps below.
While you are still logged into the EC2 instance you just launched, follow the steps below to add a crypto user named example_user to the cluster:
Ensure the CloudHSM daemon is stopped:$ sudo stop cloudhsm-client
Configure the IP address of the initial HSM using the ENI IP address from step 3:$ sudo /opt/cloudhsm/bin/configure –a 10.0.129.209
Note: the configure tool updates two configuration files: one for the CloudHSM client, and the other for the cloudhsm_mgmt_util program that is used to administer users.
Start the CloudHSM client:$ sudo start cloudhsm-client
Ensure the cloudhsm_mgmt_util configuration file is up to date. We need to do this to ensure cloudhsm_mgmt_util is aware of all the HSM instances in the cluster:$ sudo /opt/cloudhsm/bin/configure –m
Connect to the HSM instances, enable end-to-end encryption, and log in to the HSM instances. Enabling end-to-end encryption encrypts the communication between cloudhsm_mgmt_util and the HSM to prevent interception of sensitive information such as passwords:$ /opt/cloudhsm/bin/cloudhsm_mgmt_util /opt/cloudhsm/etc/cloudhsm_mgmt_util.cfg
aws-cloudhsm> enable_e2e
aws-cloudhsm> loginHSM CO admin
Figure 4: Connecting to a Single CloudHSM
Note: The connection or log in is automatically executed on every HSM instance that cloudhsm_mgmt_util is aware of. Note also that for each of the commands that you enter, the cloudhsm_mgmt_util program identifies the IP address of the HSM to which it is communicating.
Add the user example_user and then confirm the addition by listing the users in the HSM:aws-cloudhsm> createUser CU example_user yourpassword
aws-cloudhsm> listUsers
Use the quit command to log out and exit the program:aws-cloudhsm> quit
Now that we’ve added a user to the CloudHSM, let’s add a key so we can see how users and keys are synchronized as the cluster changes.
Start the key_mgmt_util program:$ /opt/cloudhsm/bin/key_mgmt_util
Log in to the HSM:Command: loginHSM –u CU –s example_user
Notice that key_mgmt_util displays the node id to which it is communicating.
Use the exit command to leave the program:exit
Add another HSM to the cluster in the us-west-2b availability zone and note the ENI IP address from the cluster details section in the console, as shown here:
Figure 6: The ENI IP address
Update the cluster configuration files and use cloud_mgmt_util to examine the user configuration: $ sudo stop cloudhsm-client$ sudo /opt/cloudhsm/bin/configure –a 10.0.129.209
Figure 7: Connecting to the 2-node CloudHSM cluster
Note that cloudhsm_mgmt_utilcloudhsm_mgmt_util now sends commands to both of the HSMs in the cluster. You can see the same thing when we list the users in the cluster.
Figure 8: Showing proper user synchronization across two CloudHSMs
Now, use key_mgmt_util to examine the keys:Command: findKey
Figure 9: Showing that keys are properly synchronized across a 2-node CloudHSM cluster
This command confirms that when we added the second HSM, CloudHSM used cluster-initiated synchronization to load the users and keys into the new HSM.
The CloudHSM Cluster Users Become Unsynchronized
Start cloudhsm_mgmt_util and enable end-to-end encryption:$ /opt/cloudhsm/bin/cloudhsm_mgmt_util /opt/cloudhsm/etc/cloudhsm_mgmt_util.cfg
aws-cloudhsm> enable_e2e
Figure 10: Connecting to the 2-node CloudHSM cluster
While cloudhsm_mgmt_util is left running, add a third HSM in us-west-2c through the console and note the ENI IP address, as shown here:
Figure 11: Connecting to the 2-node CloudHSM cluster
Going back to cloudhsm_mgmt_util, let’s add a user named newest_user to our cluster. Note that we have not exited cloudhsm_mgmt_util and refreshed its configuration file. So it’s still connected only to the first two HSM instances.aws-cloudhsm> enable_e2e
aws-cloudhsm> loginHSM CO admin yourpassword
aws-cloudhsm> createUser CU newest_user yourpassword
Figure 12: Adding a User to only two nodes of a 3-node CloudHSM Cluster and breaking synchronization
The cloudhsm_mgmt_util command adds the user to the two HSMs it already knows about and had connected to. It doesn’t communicate with the newly added HSM.
Let’s fix this by exiting cloudhsm_mgmt_util. Refresh the configuration, and then run the management utility again.$sudo stop cloudhsm-client
You can now see cloudhsm_mgmt_util is communicating with all of the cluster nodes.
Figure 13: Connecting to a 3-node CloudHSM cluster
Let’s see what happens when we list the users:aws-cloudhsm> listUsers
Figure 14: Showing that users are now unsynchronized
You can see from the results that one of the HSMs (server 1) is missing the user named newest_user. The reason this happened is that cloudhsm_mgmt_util was unaware of the HSM instance that was added while it was running (recall that cloudhsm_mgmt_util doesn’t use the cloudhsm_client daemon and, therefore, doesn’t get automatic cluster configuration updates).
Restoring User Synchronization to the CloudHSM Cluster
We now want to add the user newest_user to the single HSM (server 1) that is out of sync. Normally, cloudhsm_mgmt_util works in cluster mode and applies your commands to all HSMs in the cluster. Since we want to work on a single HSM, we’re going to enter the server command to tell cloudhsm_mgmt_util to work in server mode and apply our commands just to that one HSM.
In the server command below, we specify the number of the HSM that we want to change based on the figure above. In the createUser command, you must use the same password that you used in step 3 (in the section titled “The CloudHSM Cluster Users Become Unsynchronized”) on the other HSMs in the cluster so that all HSMs in the cluster have identical user names and passwords. After we make this change, we use the exit command to transition from server mode back to cluster mode.aws-cloudhsm> server 1
server1> createUser CU newest_user yourpassword
exit
Figure 15: Adding a user to a single-node of a 3-node CloudHSM cluster
Now that we have transitioned back to cluster mode, let’s confirm that the HSM user tables are now synchronized by listing the users:aws-cloudhsm> listUsers
Figure 16: Showing that users are now synchronized across the 3-node CloudHSM cluster
Let’s take a look at the keys using key_mgmt_util:Command: loginHSM –u CU –s example_user –p yourpassword
Command: findKey
Figure 17: Showing that keys continued to be synchronized across a 3-node CloudHSM Cluster
You can see that CloudHSM kept the keys in sync because key synchronization is cluster-initiated. No additional actions are required on our part.
Conclusion
AWS CloudHSM provides the ability to create scalable clusters of HSM instances to support the high volumes of cryptographic operations and provide resiliency by supporting multiple availability zones. As mentioned, it’s important to be aware of the various modes of synchronization used in CloudHSM so that each HSM can provide consistent service. In particular, users are synchronized only by the client. Since cloudhsm_mgmt_util doesn’t rely on the client daemon to talk to HSM instances in your cluster, it doesn’t automatically update its configuration. By following the steps above and refreshing the configuration information before changing users or passwords, CloudHSM will keep users and passwords synchronized within the cluster and provide consistent responses to cryptographic operations if the level of redundancy within the HSM cluster changes.
If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on the Amazon CloudHSM forum or contact AWS Support.
Want more AWS Security news? Follow us on Twitter.
There’s often tension between distributed and centralized control, especially in larger organizations. While a distributed control model allows teams to move fast and to respond to specialized local needs, a central model can provide the right level of oversight for global initiatives and challenges that span all teams.
We’ve seen this challenge arise first-hand when AWS customers grow to the point where their application footprint encompasses a plethora of AWS regions, AWS accounts, development teams, and applications. They love the fact that AWS increases their agility and responsiveness, while letting them deploy resources in the most appropriate location. This diversity and scale brings new challenges when it comes to security and compliance. The freedom to innovate must be balanced by the need to protect important data and to respond quickly when threats emerge.
Over the last couple of years we have provided our customers with an increasingly broad set of options for protection including AWS WAF and AWS Shield. Our customers are making great use of all of these options, and have asked for the ability to manage them from a single, central location.
Meet AWS Firewall Manager AWS Firewall Manager is designed to help these customers! It gives them the freedom to use multiple AWS accounts and to host applications in any desired region while maintaining centralized control over their organization’s security settings and profile. Developers can develop and innovators can innovate, while the security team gains the ability to respond quickly, uniformly, and globally to potential threats and actual attacks.
With automated policy enforcement across accounts & applications, your security team can be confident that new and existing applications comply with organization-wide security policies when they use Firewall Manager. They can find applications and AWS resources that don’t measure up, and bring them into compliance in minutes.
Firewall Manager is built around named policies that contain WAF rule sets and optional AWS Shield advanced protection. Each policy applies to a specific set of AWS resources, specified by account, resource type, resource identifier, or tag. Policies can be applied automatically to all matching resources, or to a subset that you select. Policies can include WAF rules drawn from within the organization, and also those created by AWS Partners such as Imperva, F5, Trend Micro, and other AWS Marketplace vendors. This gives your security team the power to duplicate their existing on-premises security posture in the cloud.
Take the Tour Firewall Manager has three prerequisites:
Firewall Administrator – You must designate one of the AWS accounts in your organization as the administrator for Firewall Manager. This gives the account permission to deploy AWS WAF rules across the organization.
AWS Config – You must enable AWS Config for all of the accounts in the Organization so that Firewall Manager can detect newly created resources (you can use the Enable AWS Config template on the StackSets Sample Templates page to take care of this). To learn more, read Getting Started with AWS Config.
Since I don’t own an enterprise, my colleagues were kind enough to create some test accounts for me! When I open the Firewall Manager Console in the master account, I can see where I stand with respect to the first two prerequisites:
The Learn more about… button reveals the Account ID of the administrator:
I switch to that account (in a a real-world situation it is unlikely that I would have access to the master account and this one), open the console, and see that I now meet the prerequisites. I click Create policy to move ahead:
The console outlines the process for me. I need to create rules and a rule group, define a policy with the rule group, define the scope of the policy, and then actually create the policy.
At the bottom of the page I choose to create a new policy and rule group, for resources in the US East (N. Virginia) Region, and click Next:
Then I specify the conditions for my rule, choosing from the following options:
Cross-site scripting
Geographic origin
SQL injection
IP address or range
Size constraint
String or regular expression
For example, I can create a condition that blocks malicious IP addresses (this AWS Solution shows you how to use a third-party reputation list with WAF, and may be helpful):
I’ll keep this one simple, but a rule can include multiple conditions. After I have added all of them, I click Next to proceed. Now I am ready to create my rule, and I click Createrule (I can add more conditions to it later if I want):
I give my rule a name (BlockExcludedIPs), enter a CloudWatch metric name, and add my condition (ExcludeIPs), then click Create:
I can create more rules, and include them in the same rule group. Again, I’ll keep this one simple, and click Next to move ahead:
I enter a name for my group, choose the rules that will make up the group, and click Create:
I now have two rule groups (testRuleGroup was already present in the account). I name my policy and click Next to proceed:
Now I define the scope of my policy. I choose the type of resource to be protected, and indicate when the policy should be applied:
I can also use tags to include or exclude resources:
Once I have defined the scope of my policy I click Next and review it, then click Create policy:
Now that the policy is in force, the ALBs within its scope are initially noncompliant:
Within minutes, Firewall Manager applies the policy and provides me with a status report:
Start Using AWS Firewall Manager Today You can start using AWS Firewall Manager today!
If you are using AWS Shield Advanced, you have access to AWS Firewall Manager and AWS WAF at no extra charge. If not, you are charged a monthly fee for each policy in each region, along with the usual charges for WAF WebACLs, WAF Rules, and AWS Config Rules.
We’re relentlessly innovating on your behalf at AWS, especially when it comes to security. Last November, we launched Amazon GuardDuty, a continuous security monitoring and threat detection service that incorporates threat intelligence, anomaly detection, and machine learning to help protect your AWS resources, including your AWS accounts. Many large customers, including General Electric, Autodesk, and MapBox, discovered these benefits and have quickly adopted the service for its ease of use and improved threat detection. In this post, I want to show you how easy it is for everyone to get started—large and small—and discuss our rapid iteration on the service.
After more than seven years at AWS, I still find myself staying up at night obsessing about unnecessary complexity. Sounds fun, right? Well, I don’t have to tell you that there’s a lot of unnecessary complexity and undifferentiated heavy lifting in security. Most security tooling requires significant care and feeding by humans. It’s often difficult to configure and manage, it’s hard to know if it’s working properly, and it’s costly to procure and run. As a result, it’s not accessible to all customers, and for those that do get their hands on it, they spend a lot of highly-skilled resources trying to keep it operating at its potential.
Even for the most skilled security teams, it can be a struggle to ensure that all resources are covered, especially in the age of virtualization, where new accounts, new resources, and new users can come and go across your organization at a rapid pace. Furthermore, attackers have come up with ingenious ways of giving you the impression your security solution is working when, in fact, it has been completely disabled.
I’ve spent a lot of time obsessing about these problems. How can we use the Cloud to not just innovate in security, but also make it easier, more affordable, and more accessible to all? Our ultimate goal is to help you better protect your AWS resources, while also freeing you up to focus on the next big project.
With GuardDuty, we really turned the screws on unnecessary complexity, distilling continuous security monitoring and threat detection down to a binary decision—it’s either on or off. That’s it. There’s no software, virtual appliances, or agents to deploy, no data sources to enable, and no complex permissions to create. You don’t have to write custom rules or become an expert at machine learning. All we ask of you is to simply turn the service on with a single-click or API call.
GuardDuty operates completely on our infrastructure, so there’s no risk of disrupting your workloads. By providing a hard hypervisor boundary between the code running in your AWS accounts and the code running in GuardDuty, we can help ensure full coverage while making it harder for a misconfiguration or an ingenious attacker to change that. When we detect something interesting, we generate a security finding and deliver it to you through the GuardDuty console and AWS CloudWatch Events. This makes it possible to simply view findings in GuardDuty or push them to an existing SIEM or workflow system. We’ve already seen customers take it a step further using AWS Lambda to automate actions such as changing security groups, isolating instances, or rotating credentials.
Now… are you ready to get started? It’s this simple:
So, you’ve got it enabled, now what can GuardDuty detect?
As soon as you enable the service, it immediately starts consuming multiple metadata streams at scale, including AWS CloudTrail, VPC Flow Logs, and DNS logs. It compares what it finds to fully managed threat intelligence feeds containing the latest malicious IPs and domains. In parallel, GuardDuty profiles all activity in your account, which allows it to learn the behavior of your resources so it can identify highly suspicious activity that suggests a threat.
The threat-intelligence-based detections can identify activity such as an EC2 instance being probed or brute-forced by an attacker. If an instance is compromised, it can detect attempts at lateral movement, communication with a known malware or command-and-control server, crypto-currency mining, or an attempt to exfiltrate data through DNS.
Where it gets more interesting is the ability to detect AWS account-focused threats. For example, if an attacker gets a hold of your AWS account credentials—say, one of your developers exposes credentials on GitHub—GuardDuty will identify unusual account behavior. For example, an unusual instance type being deployed in a region that has never been used, suspicious attempts to inventory your resources by calling unusual patterns of list APIs or describe APIs, or an effort to obscure user activity by disabling CloudTrail logging.
Our obsession with removing complexity meant making these detections fully-managed. We take on all the heavy lifting of building, maintaining, measuring, and improving the detections so that you can focus on what to do when an event does occur.
What’s new?
When we launched at the end of November, we had thirty-four distinct detections in GuardDuty, but we weren’t stopping there. Many of these detections are already on their second or third continuous improvement iteration. In less than three months, we’ve also added twelve more, including nine CloudTrail-based anomaly detections that identify highly suspicious activity in your accounts. These new detections intelligently catch changes to, or reconnaissance of, network, resource, user permissions, and anomalous activity in EC2, CloudTrail, and AWS console log-ins. These are detections we’ve built based on what we’ve learned from observed attack patterns across the scale of AWS.
The intelligence in these detections is built around the identification of highly sensitive AWS API calls that are invoked under one or more highly suspicious circumstances. The combination of “highly sensitive” and “highly suspicious” is important. Highly sensitive APIs are those that either change the security posture of an account by adding or elevating users, user policies, roles, or account-key IDs (AKIDs). Highly suspicious circumstances are determined from underlying models profiled at the API level by GuardDuty. The result is the ability to catch real threats, while decreasing false positives, limiting false negatives, and reducing alert-noise.
It’s still day one
As we like to say in Amazon, it’s still day one. I’m excited about what we’ve built with GuardDuty, but we’re not going to stop improving, even if you’re already happy with what we’ve built. Check out the list of new detections below and all of the GuardDuty detections in our online documentation. Keep the feedback coming as it’s what powers us at AWS.
Now, I have to stop writing because my wife tells me I have some unnecessary complexity to remove from our closet.
New GuardDuty CloudTrail-based anomaly detections
Recon:IAMUser/NetworkPermissions Situation: An IAM user invoked an API commonly used to discover the network access permissions of existing security groups, ACLs, and routes in your AWS account. Description: This finding is triggered when network configuration settings in your AWS environment are probed under suspicious circumstances. For example, if an IAM user in your AWS environment invoked the DescribeSecurityGroups API with no prior history of doing so. An attacker might use stolen credentials to perform this reconnaissance of network configuration settings before executing the next stage of their attack, which might include changing network permissions or making use of existing openings in the network configuration.
Recon:IAMUser/ResourcePermissions Situation: An IAM user invoked an API commonly used to discover the permissions associated with various resources in your AWS account. Description: This finding is triggered when resource access permissions in your AWS account are probed under suspicious circumstances. For example, if an IAM user with no prior history of doing so, invoked the DescribeInstances API. An attacker might use stolen credentials to perform this reconnaissance of your AWS resources in order to find valuable information or determine the capabilities of the credentials they already have.
Recon:IAMUser/UserPermissions Situation: An IAM user invoked an API commonly used to discover the users, groups, policies, and permissions in your AWS account. Description: This finding is triggered when user permissions in your AWS environment are probed under suspicious circumstances. For example, if an IAM user invoked the ListInstanceProfilesForRole API with no prior history of doing so. An attacker might use stolen credentials to perform this reconnaissance of your IAM users and roles to determine the capabilities of the credentials they already have or to find more permissive credentials that are vulnerable to lateral movement.
Persistence:IAMUser/NetworkPermissions Situation: An IAM user invoked an API commonly used to change the network access permissions for security groups, routes, and ACLs in your AWS account. Description: This finding is triggered when network configuration settings are changed under suspicious circumstances. For example, if an IAM user in your AWS environment invoked the CreateSecurityGroup API with no prior history of doing so. Attackers often attempt to change security groups, allowing certain inbound traffic on various ports to improve their ability to access the bot they might have planted on your EC2 instance.
Persistence:IAMUser/ResourcePermissions Situation: An IAM user invoked an API commonly used to change the security access policies of various resources in your AWS account. Description: This finding is triggered when a change is detected to policies or permissions attached to AWS resources. For example, if an IAM user in your AWS environment invoked the PutBucketPolicy API with no prior history of doing so. Some services, such as Amazon S3, support resource-attached permissions that grant one or more IAM principals access to the resource. With stolen credentials, attackers can change the policies attached to a resource, granting themselves future access to that resource.
Persistence:IAMUser/UserPermissions Situation: An IAM user invoked an API commonly used to add, modify, or delete IAM users, groups, or policies in your AWS account. Description: This finding is triggered by suspicious changes to the user-related permissions in your AWS environment. For example, if an IAM user in your AWS environment invoked the AttachUserPolicy API with no prior history of doing so. In an effort to maximize their ability to access the account even after they’ve been discovered, attackers can use stolen credentials to create new users, add access policies to existing users, create access keys, and so on. The owner of the account might notice that a particular IAM user or password was stolen and delete it from the account, but might not delete other users that were created by the fraudulently created admin IAM user, leaving their AWS account still accessible to the attacker.
ResourceConsumption:IAMUser/ComputeResources Situation: An IAM user invoked an API commonly used to launch compute resources like EC2 Instances. Description: This finding is triggered when EC2 instances in your AWS environment are launched under suspicious circumstances. For example, if an IAM user invoked the RunInstances API with no prior history of doing so. This might be an indication of an attacker using stolen credentials to access compute time (possibly for cryptocurrency mining or password cracking). It can also be an indication of an attacker using an EC2 instance in your AWS environment and its credentials to maintain access to your account.
Stealth:IAMUser/LoggingConfigurationModified Situation: An IAM user invoked an API commonly used to stop CloudTrail logging, delete existing logs, and otherwise eliminate traces of activity in your AWS account. Description: This finding is triggered when the logging configuration in your AWS account is modified under suspicious circumstances. For example, if an IAM user invoked the StopLogging API with no prior history of doing so. This can be an indication of an attacker trying to cover their tracks by eliminating any trace of their activity.
UnauthorizedAccess:IAMUser/ConsoleLogin Situation: An unusual console login by an IAM user in your AWS account was observed. Description: This finding is triggered when a console login is detected under suspicious circumstances. For example, if an IAM user invoked the ConsoleLogin API from a never-before- used client or an unusual location. This could be an indication of stolen credentials being used to gain access to your AWS account, or a valid user accessing the account in an invalid or less secure manner (for example, not over an approved VPN).
New GuardDuty threat intelligence based detections
Trojan:EC2/PhishingDomainRequest!DNS This detection occurs when an EC2 instance queries domains involved in phishing attacks.
Trojan:EC2/BlackholeTraffic!DNS This detection occurs when an EC2 instance connects to a black hole domain. Black holes refer to places in the network where incoming or outgoing traffic is silently discarded without informing the source that the data didn’t reach its intended recipient.
Trojan:EC2/DGADomainRequest.C!DNS This detection occurs when an EC2 instance queries algorithmically generated domains. Such domains are commonly used by malware and could be an indication of a compromised EC2 instance.
If you have feedback about this blog post, submit comments in the “Comments” section below. If you have questions about this blog post, start a new thread on the Amazon GuardDuty forum or contact AWS Support.
Amazon S3 provides comprehensive security and compliance capabilities that meet even the most stringent regulatory requirements. It gives you flexibility in the way you manage data for cost optimization, access control, and compliance. However, because the service is flexible, a user could accidentally configure buckets in a manner that is not secure. For example, let’s say you uploaded files to an Amazon S3 bucket with public read permissions, even though you intended only to share this file with a colleague or a partner. Although this might have accomplished your task to share the file internally, the file is now available to anyone on the internet, even without authentication.
In this blog post, we show you how to prevent your Amazon S3 buckets and objects from allowing public access. We discuss how to secure data in Amazon S3 with a defense-in-depth approach, where multiple security controls are put in place to help prevent data leakage. This approach helps prevent you from allowing public access to confidential information, such as personally identifiable information (PII) or protected health information (PHI).
Preventing your Amazon S3 buckets and objects from allowing public access
Every call to an Amazon S3 service becomes a REST API request. When your request is transformed via a REST call, the permissions are converted into parameters included in the HTTP header or as URL parameters. The Amazon S3 bucket policy allows or denies access to the Amazon S3 bucket or Amazon S3 objects based on policy statements, and then evaluates conditions based on those parameters. To learn more, see Using Bucket Policies and User Policies.
With this in mind, let’s say multiple AWS Identity and Access Management (IAM) users at Example Corp. have access to an Amazon S3 bucket and the objects in the bucket. Example Corp. wants to share the objects among its IAM users, while at the same time preventing the objects from being made available publicly.
To demonstrate how to do this, we start by creating an Amazon S3 bucket named examplebucket. After creating this bucket, we must apply the following bucket policy. This policy denies any uploaded object (PutObject) with the attribute x-amz-acl having the values public-read, public-read-write, or authenticated-read. This means authenticated users cannot upload objects to the bucket if the objects have public permissions.
“Deny any Amazon S3 request to PutObject or PutObjectAcl in the bucket examplebucket when the request includes one of the following access control lists (ACLs): public-read, public-read-write, or authenticated-read.”
Remember that IAM policies are evaluated not in a first-match-and-exit model. Instead, IAM evaluates first if there is an explicit Deny. If there is not, IAM continues to evaluate if you have an explicit Allow and then you have an implicit Deny.
The above policy creates an explicit Deny. Even when any authenticated user tries to upload (PutObject) an object with public read or write permissions, such as public-read or public-read-write or authenticated-read, the action will be denied. To understand how S3 Access Permissions work, you must understand what Access Control Lists (ACL) and Grants are. You can find the documentation here.
Now let’s continue our bucket policy explanation by examining the next statement.
This statement is very similar to the first statement, except that instead of checking the ACLs, we are checking specific user groups’ grants that represent the following groups:
AuthenticatedUsers group. Represented by http://acs.amazonaws.com/groups/global/AuthenticatedUsers, this group represents all AWS accounts. Access permissions to this group allow any AWS account to access the resource. However, all requests must be signed (authenticated).
AllUsers group. Represented by http://acs.amazonaws.com/groups/global/AllUsers, access permissions to this group allow anyone on the internet access to the resource. The requests can be signed (authenticated) or unsigned (anonymous). Unsigned requests omit the Authentication header in the request.
Now that you know how to deny object uploads with permissions that would make the object public, you just have two statement policies that prevent users from changing the bucket permissions (Denying s3:PutBucketACL from ACL and Denying s3:PutBucketACL from Grants).
Below is how we’re preventing users from changing the bucket permisssions.
As you can see above, the statement is very similar to the Object statements, except that now we use s3:PutBucketAcl instead of s3:PutObjectAcl, the Resource is just the bucket ARN, and the objects have the “/*” in the end of the ARN.
In this section, we showed how to prevent IAM users from accidently uploading Amazon S3 objects with public permissions to buckets. In the next section, we show you how to enforce multiple layers of security controls, such as encryption of data at rest and in transit while serving traffic from Amazon S3.
Securing data on Amazon S3 with defense-in-depth
Let’s say that Example Corp. wants to serve files securely from Amazon S3 to its users with the following requirements:
The data must be encrypted at rest and during transit.
The data must be accessible only by a limited set of public IP addresses.
All requests for data should be handled only by Amazon CloudFront (which is a content delivery network) instead of being directly available from an Amazon S3 URL. If you’re using an Amazon S3 bucket as the origin for a CloudFront distribution, you can grant public permission to read the objects in your bucket. This allows anyone to access your objects either through CloudFront or the Amazon S3 URL. CloudFront doesn’t expose Amazon S3 URLs, but your users still might have access to those URLs if your application serves any objects directly from Amazon S3, or if anyone gives out direct links to specific objects in Amazon S3.
A domain name is required to consume the content. Custom SSL certificate support lets you deliver content over HTTPS by using your own domain name and your own SSL certificate. This gives visitors to your website the security benefits of CloudFront over an SSL connection that uses your own domain name, in addition to lower latency and higher reliability.
To represent defense-in-depth visually, the following diagram contains several Amazon S3 objects (A) in a single Amazon S3 bucket (B). You can encrypt these objects on the server side or the client side. You also can configure the bucket policy such that objects are accessible only through CloudFront, which you can accomplish through an origin access identity (C). You then can configure CloudFront to deliver content only over HTTPS in addition to using your own domain name (D).
Defense-in-depth requirement 1: Data must be encrypted at rest and during transit
Let’s start with the objects themselves. Amazon S3 objects—files in this case—can range from zero bytes to multiple terabytes in size (see service limits for the latest information). Each Amazon S3 bucket includes a collection of objects, and the objects can be uploaded via the Amazon S3 console, AWS CLI, or AWS API.
If you choose to use server-side encryption, Amazon S3 encrypts your objects before saving them on disks in AWS data centers. To encrypt an object at the time of upload, you need to add the x-amz-server-side-encryption header to the request to tell Amazon S3 to encrypt the object using Amazon S3 managed keys (SSE-S3), AWS KMS managed keys (SSE-KMS), or customer-provided keys (SSE-C). There are two possible values for the x-amz-server-side-encryption header: AES256, which tells Amazon S3 to use Amazon S3 managed keys, and aws:kms, which tells Amazon S3 to use AWS KMS managed keys.
The following code example shows a Put request using SSE-S3.
PUT /example-object HTTP/1.1
Host: myBucket.s3.amazonaws.com
Date: Wed, 8 Jun 2016 17:50:00 GMT
Authorization: authorization string
Content-Type: text/plain
Content-Length: 11434
x-amz-meta-author: Janet
Expect: 100-continue
x-amz-server-side-encryption: AES256
[11434 bytes of object data]
If you choose to use client-side encryption, you can encrypt data on the client side and upload the encrypted data to Amazon S3. In this case, you manage the encryption process, the encryption keys, and related tools. You encrypt data on the client side by using AWS KMS managed keys or a customer-supplied, client-side master key.
Defense-in-depth requirement 2: Data must be accessible only by a limited set of public IP addresses
At the Amazon S3 bucket level, you can configure permissions through a bucket policy. For example, you can limit access to the objects in a bucket by IP address range or specific IP addresses. Alternatively, you can make the objects accessible only through HTTPS.
The following bucket policy allows access to Amazon S3 objects only through HTTPS (the policy was generated with the AWS Policy Generator). Here the bucket policy explicitly denies ("Effect": "Deny") all read access ("Action": "s3:GetObject") from anybody who browses ("Principal": "*") to Amazon S3 objects within an Amazon S3 bucket if they are not accessed through HTTPS ("aws:SecureTransport": "false").
Defense-in-depth requirement 3: Data must not be publicly accessible directly from an Amazon S3 URL
Next, configure Amazon CloudFront to serve traffic from within the bucket. The use of CloudFront serves several purposes:
CloudFront is a content delivery network that acts as a cache to serve static files quickly to clients.
Depending on the number of requests, the cost of delivery is less than if objects were served directly via Amazon S3.
Objects served through CloudFront can be limited to specific countries.
Access to these Amazon S3 objects is available only through CloudFront. We do this by creating an origin access identity (OAI) for CloudFront and granting access to objects in the respective Amazon S3 bucket only to that OAI. As a result, access to Amazon S3 objects from the internet is possible only through CloudFront; all other means of accessing the objects—such as through an Amazon S3 URL—are denied. CloudFront acts not only as a content distribution network, but also as a host that denies access based on geographic restrictions. You apply these restrictions by updating your CloudFront web distribution and adding a whitelist that contains only a specific country’s name (let’s say Liechtenstein). Alternatively, you could add a blacklist that contains every country except that country. Learn more about how to use CloudFront geographic restriction to whitelist or blacklist a country to restrict or allow users in specific locations from accessing web content in the AWS Support Knowledge Center.
Defense-in-depth requirement 4: A domain name is required to consume the content
To serve content from CloudFront, you must use a domain name in the URLs for objects on your webpages or in your web application. The domain name can be either of the following:
The domain name that CloudFront automatically assigns when you create a distribution, such as d111111abcdef8.cloudfront.net
Your own domain name, such as example.com
For example, you might use one of the following URLs to return the file image.jpg:
You use the same URL format whether you store the content in Amazon S3 buckets or at a custom origin, like one of your own web servers.
Instead of using the default domain name that CloudFront assigns for you when you create a distribution, you can add an alternate domain name that’s easier to work with, like example.com. By setting up your own domain name with CloudFront, you can use a URL like this for objects in your distribution: http://example.com/images/image.jpg.
Let’s say that you already have a domain name hosted on Amazon Route 53. You would like to serve traffic from the domain name, request an SSL certificate, and add this to your CloudFront web distribution. The SSL offloading occurs in CloudFront by serving traffic securely from each CloudFront location. You also can configure CloudFront to deliver your content over HTTPS by using your custom domain name and your own SSL certificate. Serving web content through CloudFront reduces response from the origin as requests are redirected to the nearest edge location. This results in faster download times than if the visitor had requested the content from a data center that is located farther away.
Summary
In this post, we demonstrated how you can apply policies to Amazon S3 buckets so that only users with appropriate permissions are allowed to access the buckets. Anonymous users (with public-read/public-read-write permissions) and authenticated users without the appropriate permissions are prevented from accessing the buckets.
We also examined how to secure access to objects in Amazon S3 buckets. The objects in Amazon S3 buckets can be encrypted at rest and during transit. Doing so helps provide end-to-end security from the source (in this case, Amazon S3) to your users.
If you have feedback about this blog post, submit comments in the “Comments” section below. If you have questions about this blog post, start a new thread on the Amazon S3 forum or contact AWS Support.
This post courtesy of Shane Baldacchino, Solutions Architect at Amazon Web Services.
Many customers ask for guidance on migrating end-to-end solutions running on virtual machines over to AWS. This post provides an overview of moving a common WordPress blog running on a virtualized platform to AWS, including re-pointing the DNS records associated to with the website.
AWS Server Migration Service (AWS SMS) is an agentless service that makes it easier and faster for you to migrate thousands of on-premises workloads to AWS. AWS SMS allows you to automate, schedule, and track incremental replications of live server volumes, making it easier for you to coordinate large-scale server migrations.
Walkthrough
The key elements of this migration process include the following steps:
Establish your AWS environment.
Replicate your database.
Download the SMS Connector from the AWS Management Console.
Configure AWS SMS and Hypervisor permissions.
Install and configure the SMS Connector appliance.
Import your virtual machine inventory and create a replication job.
Launch your Amazon EC2 instance.
Change your DNS records to resolve the WordPress blog to your EC2 instance.
Before you start, ensure that your source systems OS and vCenter version are supported by AWS. For more information, see the Server Migration Service FAQ.
Establish your AWS environment
For this walkthrough, your WordPress blog is currently running as a two-tier LAMP stack in a corporate data center. You have a frontend running Apache and PHP, plus a backend database running on MySQL. All systems are hosted on a virtualized platform.
First, establish your AWS environment. If your organization is new to AWS, this may include account or subaccount creation, a new virtual private cloud (VPC), and associated subnets, route tables, internet gateways, and so on. Think of this phase as setting up your software-defined data center. For more information, see Getting Started with Amazon EC2.
The blog is a two-tier stack, so go with two private subnets. Because you want it to be highly available, use multiple Availability Zones. A zone resides within an AWS Region. Each zone is isolated, but the zones within a region are connected through low-latency links. This allows architects and solution designers to build highly available solutions.
Replicate your database
WordPress uses a MySQL relational database. You could continue to manage MySQL and the associated EC2 instances associated with maintaining and scaling a database. For this walkthrough, use this opportunity to migrate to an RDS instance of Amazon Aurora, as it is a MySQL compliant database. Not only is Amazon Aurora a high-performant database engine but it frees you up to focus on application development by managing time-consuming database administration tasks, including backups, software patching, monitoring, scaling, and replication.
Use AWS Database Migration Service to migrate your MySQL database to Amazon Aurora easily and securely. After a database migration instance has been instantiated, configure the source and destination endpoints and create a replication task.
By attaching to the MySQL binlog, you can seed in the current data in the database and also capture all future state changes in near real time. For more information, see Migrating a MySQL-Compatible Database to Amazon Aurora.
Finally, the task shows that you are replicating current data in your WordPress blog database and future changes from MySQL into Amazon Aurora.
Download the SMS Connector from the AWS Management Console
Now, use AWS SMS to migrate your Apache PHP frontend to EC2. AWS SMS is delivered as an appliance for your hypervisor.
To download the SMS Connector, log in to the console and choose Server Migration Service, Connectors, SMS Connector setup guide.
Configure AWS SMS
Your hypervisor and AWS SMS will need an appropriate user with sufficient privileges to perform migrations.
AWS SMS – Use the AWS CLI or the IAM console to create an IAM user with the ServerMigrationConnector policy attached.
Launch a new VM based on the SMS Connector that you downloaded. To configure the connector, connect to it via HTTPS. You can obtain the SMS Connector IP address from your hypervisor.
Connect to the SMS Connector via HTTPS. In the example above, the connector IP address is 10.0.0.31. In your browser, enter https://10.0.0.31.
Configure the connector with the IAM and hypervisor credentials that you created earlier.
After it’s configured, and the associated connectivity and authentication checks have passed, return to the console and view your connector in AWS SMS.
Import your virtual machine inventory and create a replication job After validating that the SMS Connector is in a “HEALTHY” state, import your server catalog to AWS SMS. This process can take up to a minute.
Select the server to migrate and choose Create replication job. The console guides you through the process. The time that the initial replication task takes to complete is dependent on the available bandwidth and the size of your VM. After the initial seed replication, network bandwidth is minimized as AWS SMS replicates only incremental changes occurring on the VM.
Launch your EC2 instance
When your replication task is complete, the artifact created by AWS SMS is a custom AMI that you can use to deploy an EC2 instance. Follow the usual process to launch your EC2 instance, noting that you may need to replace any host-based firewalls with security groups and NACLs.
When you create an EC2 instance, ensure that you pick the most suitable EC2 instance type and size to match your performance requirements while optimizing for cost.
While your new EC2 instance is a replica of your on-premises VM, you should always validate that applications are functioning. How you do this differs on an application-by-application basis. You can use a combination of approaches, such as editing a local host file and testing your application, SSH, or Telnet.
From the RDS console, get your connection string details and update your WordPress configuration file to point to the Amazon Aurora database. As WordPress is expecting a MySQL database and Amazon Aurora is MySQL-compliant, this change of database engine is transparent to WordPress.
Change your DNS records to resolve the WordPress blog to your EC2 instance
You have validated that your WordPress application is running correctly, as you are still receiving changes from your on-premises data center via AWS DMS into your Amazon Aurora database.
You can now update your DNS zone file using Amazon Route 53. Route 53 can be driven by multiple methods: console, SDK, or AWS CLI.
For this walkthrough, update your DNS zone file via the AWS CLI. The JSON example shows upserting the A record in your zone to resolve to your EC2 instance.
Use the AWS CLI to execute the request and update the record in your zone file. The cut-over period between the original off-cloud location and AWS is defined by the TTL in the SOA (statement of authority) in your DNS zone. During this period, any requests resolving to your off-cloud server that result in database writes are automatically replicated to your Amazon Aurora instance via AWS DMS.
You have now successfully migrated your WordPress blog to AWS. Based on the TTL of your DNS zone file, end users slowly resolve the WordPress blog to AWS.
After you have validated your successful migration, be sure to delete your AWS DMS task and your AWS SMS replication job.
Summary
In this post, you moved a WordPress blog to AWS, using AWS SMS and AWS DMS to re-point the associated DNS records.
Many architectures can be extended to use many of the inherent benefits of AWS, with little effort. For example, by using Amazon CloudWatch metrics to drive Auto Scaling policies, you can use an Application Load Balancer as your frontend. This removes the single point of failure for a single Amazon EC2 instance and ensures that your deployed capacity closely follows customer demand. Think big and get building!
It was just about three years ago that AWS announced Amazon Elastic Container Service (Amazon ECS), to run and manage containers at scale on AWS. With Amazon ECS, you’ve been able to run your workloads at high scale and availability without having to worry about running your own cluster management and container orchestration software.
Today, AWS announced the availability of AWS Fargate – a technology that enables you to use containers as a fundamental compute primitive without having to manage the underlying instances. With Fargate, you don’t need to provision, configure, or scale virtual machines in your clusters to run containers. Fargate can be used with Amazon ECS today, with plans to support Amazon Elastic Container Service for Kubernetes (Amazon EKS) in the future.
Fargate has flexible configuration options so you can closely match your application needs and granular, per-second billing.
Amazon ECS with Fargate
Amazon ECS enables you to run containers at scale. This service also provides native integration into the AWS platform with VPC networking, load balancing, IAM, Amazon CloudWatch Logs, and CloudWatch metrics. These deep integrations make the Amazon ECS task a first-class object within the AWS platform.
To run tasks, you first need to stand up a cluster of instances, which involves picking the right types of instances and sizes, setting up Auto Scaling, and right-sizing the cluster for performance. With Fargate, you can leave all that behind and focus on defining your application and policies around permissions and scaling.
The same container management capabilities remain available so you can continue to scale your container deployments. With Fargate, the only entity to manage is the task. You don’t need to manage the instances or supporting software like Docker daemon or the Amazon ECS agent.
Fargate capabilities are available natively within Amazon ECS. This means that you don’t need to learn new API actions or primitives to run containers on Fargate.
Using Amazon ECS, Fargate is a launch type option. You continue to define the applications the same way by using task definitions. In contrast, the EC2 launch type gives you more control of your server clusters and provides a broader range of customization options.
For example, a RunTask command example is pasted below with the Fargate launch type:
Resource-based pricing and per second billing You pay by the task size and only for the time for which resources are consumed by the task. The price for CPU and memory is charged on a per-second basis. There is a one-minute minimum charge.
Flexible configurations options Fargate is available with 50 different combinations of CPU and memory to closely match your application needs. You can use 2 GB per vCPU anywhere up to 8 GB per vCPU for various configurations. Match your workload requirements closely, whether they are general purpose, compute, or memory optimized.
Networking All Fargate tasks run within your own VPC. Fargate supports the recently launched awsvpc networking mode and the elastic network interface for a task is visible in the subnet where the task is running. This provides the separation of responsibility so you retain full control of networking policies for your applications via VPC features like security groups, routing rules, and NACLs. Fargate also supports public IP addresses.
Load Balancing ECS Service Load Balancing for the Application Load Balancer and Network Load Balancer is supported. For the Fargate launch type, you specify the IP addresses of the Fargate tasks to register with the load balancers.
Permission tiers Even though there are no instances to manage with Fargate, you continue to group tasks into logical clusters. This allows you to manage who can run or view services within the cluster. The task IAM role is still applicable. Additionally, there is a new Task Execution Role that grants Amazon ECS permissions to perform operations such as pushing logs to CloudWatch Logs or pulling image from Amazon Elastic Container Registry (Amazon ECR).
Container Registry Support Fargate provides seamless authentication to help pull images from Amazon ECR via the Task Execution Role. Similarly, if you are using a public repository like DockerHub, you can continue to do so.
Amazon ECS CLI The Amazon ECS CLI provides high-level commands to help simplify to create and run Amazon ECS clusters, tasks, and services. The latest version of the CLI now supports running tasks and services with Fargate.
EC2 and Fargate Launch Type Compatibility All Amazon ECS clusters are heterogeneous – you can run both Fargate and Amazon ECS tasks in the same cluster. This enables teams working on different applications to choose their own cadence of moving to Fargate, or to select a launch type that meets their requirements without breaking the existing model. You can make an existing ECS task definition compatible with the Fargate launch type and run it as a Fargate service, and vice versa. Choosing a launch type is not a one-way door!
Logging and Visibility With Fargate, you can send the application logs to CloudWatch logs. Service metrics (CPU and Memory utilization) are available as part of CloudWatch metrics. AWS partners for visibility, monitoring and application performance management including Datadog, Aquasec, Splunk, Twistlock, and New Relic also support Fargate tasks.
Conclusion
Fargate enables you to run containers without having to manage the underlying infrastructure. Today, Fargate is availabe for Amazon ECS, and in 2018, Amazon EKS. Visit the Fargate product page to learn more, or get started in the AWS Console.
Starting today, AWS Shield Advanced can help protect your Amazon EC2 instances and Network Load Balancers against infrastructure-layer Distributed Denial of Service (DDoS) attacks. Enable AWS Shield Advanced on an AWS Elastic IP address and attach the address to an internet-facing EC2 instance or Network Load Balancer. AWS Shield Advanced automatically detects the type of AWS resource behind the Elastic IP address and mitigates DDoS attacks.
AWS Shield Advanced also ensures that all your Amazon VPCnetwork access control lists (ACLs) are automatically executed on AWS Shield at the edge of the AWS network, giving you access to additional bandwidth and scrubbing capacity as well as mitigating large volumetric DDoS attacks. You also can customize additional mitigations on AWS Shield by engaging the AWS DDoS Response Team, which can preconfigure the mitigations or respond to incidents as they happen. For every incident detected by AWS Shield Advanced, you also get near-real-time visibility via Amazon CloudWatch metrics and details about the incident, such as the geographic origin and source IP address of the attack.
AWS Shield Advanced for Elastic IP addresses extends the coverage of DDoS cost protection, which safeguards against scaling charges as a result of a DDoS attack. DDoS cost protection now allows you to request service credits for Elastic Load Balancing, Amazon CloudFront, Amazon Route 53, and your EC2 instance hours in the event that these increase as the result of a DDoS attack.
Get started protecting EC2 instances and Network Load Balancers
Activate AWS Shield Advanced by choosing Activate AWS Shield Advanced and accepting the terms.
Navigate to Protected Resources through the navigation pane.
Choose the Elastic IP addresses that you want to protect (these can point to EC2 instances or Network Load Balancers).
If AWS Shield Advanced detects a DDoS attack, you can get details about the attack by checking CloudWatch, or the Incidents tab on the AWS WAF and AWS Shield console. To learn more about this new feature and AWS Shield Advanced, see the AWS Shield home page.
If you have comments or questions about this post, submit them in the “Comments” section below, start a new thread in the AWS Shield forum, or contact AWS Support.
This post courtesy of ECS Sr. Software Dev Engineer Anirudh Aithal.
Today, AWS announced Task Networking for Amazon ECS. This feature brings Amazon EC2 networking capabilities to tasks using elastic network interfaces.
An elastic network interface is a virtual network interface that you can attach to an instance in a VPC. When you launch an EC2 virtual machine, an elastic network interface is automatically provisioned to provide networking capabilities for the instance.
A task is a logical group of running containers. Previously, tasks running on Amazon ECS shared the elastic network interface of their EC2 host. Now, the new awsvpc networking mode lets you attach an elastic network interface directly to a task.
This simplifies network configuration, allowing you to treat each container just like an EC2 instance with full networking features, segmentation, and security controls in the VPC.
In this post, I cover how awsvpc mode works and show you how you can start using elastic network interfaces with your tasks running on ECS.
Background: Elastic network interfaces in EC2
When you launch EC2 instances within a VPC, you don’t have to configure an additional overlay network for those instances to communicate with each other. By default, routing tables in the VPC enable seamless communication between instances and other endpoints. This is made possible by virtual network interfaces in VPCs called elastic network interfaces. Every EC2 instance that launches is automatically assigned an elastic network interface (the primary network interface). All networking parameters—such as subnets, security groups, and so on—are handled as properties of this primary network interface.
Furthermore, an IPv4 address is allocated to every elastic network interface by the VPC at creation (the primary IPv4 address). This primary address is unique and routable within the VPC. This effectively makes your VPC a flat network, resulting in a simple networking topology.
Elastic network interfaces can be treated as fundamental building blocks for connecting various endpoints in a VPC, upon which you can build higher-level abstractions. This allows elastic network interfaces to be leveraged for:
VPC-native IPv4 addressing and routing (between instances and other endpoints in the VPC)
Network traffic isolation
Network policy enforcement using ACLs and firewall rules (security groups)
IPv4 address range enforcement (via subnet CIDRs)
Why use awsvpc?
Previously, ECS relied on the networking capability provided by Docker’s default networking behavior to set up the network stack for containers. With the default bridge network mode, containers on an instance are connected to each other using the docker0 bridge. Containers use this bridge to communicate with endpoints outside of the instance, using the primary elastic network interface of the instance on which they are running. Containers share and rely on the networking properties of the primary elastic network interface, including the firewall rules (security group subscription) and IP addressing.
This means you cannot address these containers with the IP address allocated by Docker (it’s allocated from a pool of locally scoped addresses), nor can you enforce finely grained network ACLs and firewall rules. Instead, containers are addressable in your VPC by the combination of the IP address of the primary elastic network interface of the instance, and the host port to which they are mapped (either via static or dynamic port mapping). Also, because a single elastic network interface is shared by multiple containers, it can be difficult to create easily understandable network policies for each container.
The awsvpc networking mode addresses these issues by provisioning elastic network interfaces on a per-task basis. Hence, containers no longer share or contend use these resources. This enables you to:
Run multiple copies of the container on the same instance using the same container port without needing to do any port mapping or translation, simplifying the application architecture.
Extract higher network performance from your applications as they no longer contend for bandwidth on a shared bridge.
Enforce finer-grained access controls for your containerized applications by associating security group rules for each Amazon ECS task, thus improving the security for your applications.
Associating security group rules with a container or containers in a task allows you to restrict the ports and IP addresses from which your application accepts network traffic. For example, you can enforce a policy allowing SSH access to your instance, but blocking the same for containers. Alternatively, you could also enforce a policy where you allow HTTP traffic on port 80 for your containers, but block the same for your instances. Enforcing such security group rules greatly reduces the surface area of attack for your instances and containers.
ECS manages the lifecycle and provisioning of elastic network interfaces for your tasks, creating them on-demand and cleaning them up after your tasks stop. You can specify the same properties for the task as you would when launching an EC2 instance. This means that containers in such tasks are:
Addressable by IP addresses and the DNS name of the elastic network interface
Attachable as ‘IP’ targets to Application Load Balancers and Network Load Balancers
Observable from VPC flow logs
Access controlled by security groups
This also enables you to run multiple copies of the same task definition on the same instance, without needing to worry about port conflicts. You benefit from higher performance because you don’t need to perform any port translations or contend for bandwidth on the shared docker0 bridge, as you do with the bridge networking mode.
Getting started
If you don’t already have an ECS cluster, you can create one using the create cluster wizard. In this post, I use “awsvpc-demo” as the cluster name. Also, if you are following along with the command line instructions, make sure that you have the latest version of the AWS CLI or SDK.
Registering the task definition
The only change to make in your task definition for task networking is to set the networkMode parameter to awsvpc. In the ECS console, enter this value for Network Mode.
If you plan on registering a container in this task definition with an ECS service, also specify a container port in the task definition. This example specifies an NGINX container exposing port 80:
This creates a task definition named “nginx-awsvpc" with networking mode set to awsvpc. The following commands illustrate registering the task definition from the command line:
To run a task with this task definition, navigate to the cluster in the Amazon ECS console and choose Run new task. Specify the task definition as “nginx-awsvpc“. Next, specify the set of subnets in which to run this task. You must have instances registered with ECS in at least one of these subnets. Otherwise, ECS can’t find a candidate instance to attach the elastic network interface.
You can use the console to narrow down the subnets by selecting a value for Cluster VPC:
Next, select a security group for the task. For the purposes of this example, create a new security group that allows ingress only on port 80. Alternatively, you can also select security groups that you’ve already created.
Next, run the task by choosing Run Task.
You should have a running task now. If you look at the details of the task, you see that it has an elastic network interface allocated to it, along with the IP address of the elastic network interface:
When you describe an “awsvpc” task, details of the elastic network interface are returned via the “attachments” object. You can also get this information from the “containers” object. For example:
The nginx container is now addressable in your VPC via the 10.0.0.35 IPv4 address. You did not have to modify the security group on the instance to allow requests on port 80, thus improving instance security. Also, you ensured that all ports apart from port 80 were blocked for this application without modifying the application itself, which makes it easier to manage your task on the network. You did not have to interact with any of the elastic network interface API operations, as ECS handled all of that for you.
Back in 2006, when I announced S3, I wrote ” Further, each block is protected by an ACL (Access Control List) allowing the developer to keep the data private, share it for reading, or share it for reading and writing, as desired.”
Today we are adding five new encryption and security features to S3:
Default Encryption – You can now mandate that all objects in a bucket must be stored in encrypted form without having to construct a bucket policy that rejects objects that are not encrypted.
Permission Checks – The S3 Console now displays a prominent indicator next to each S3 bucket that is publicly accessible.
Cross-Region Replication ACL Overwrite – When you replicate objects across AWS accounts, you can now specify that the object gets a new ACL that gives full access to the destination account.
Cross-Region Replication with KMS – You can now replicate objects that are encrypted with keys that are managed by AWS Key Management Service (KMS).
Detailed Inventory Report – The S3 Inventory report now includes the encryption status of each object. The report itself can also be encrypted.
Let’s take a look at each one…
Default Encryption You have three server-side encryption options for your S3 objects: SSE-S3 with keys that are managed by S3, SSE-KMS with keys that are managed by AWS KMS, and SSE-C with keys that you manage. Some of our customers, particularly those who need to meet compliance requirements that dictate the use of encryption at rest, have used bucket policies to ensure that every newly stored object is encrypted. While this helps them to meet their requirements, simply rejecting the storage of unencrypted objects is an imperfect solution.
You can now mandate that all objects in a bucket must be stored in encrypted form by installing a bucket encryption configuration. If an unencrypted object is presented to S3 and the configuration indicates that encryption must be used, the object will be encrypted using encryption option specified for the bucket (the PUT request can also specify a different option).
Here’s how to enable this feature using the S3 Console when you create a new bucket. Enter the name of the bucket as usual and click on Next. Then scroll down and click on Default encryption:
Select the desired option and click on Save (if you select AWS-KMS you also get to designate a KMS key):
You can also make this change via a call to the PUT Bucket Encryption function. It must specify an SSE algorithm (either SSE-S3 or SSE-KMS) and can optionally reference a KMS key.
Keep the following in mind when you implement this feature:
SigV4 – Access to the bucket policy via the S3 REST API must be signed with SigV4 and made over an SSL connection.
Updating Bucket Policies – You should examine and then carefully modify existing bucket policies that currently reject unencrypted objects.
High-Volume Use – If you are using SSE-KMS and are uploading many hundreds or thousands of objects per second, you may bump in to the KMS Limit on the Encrypt and Decrypt operations. Simply file a Support Case and ask for a higher limit:
Cross-Region Replication – Unencrypted objects will be encrypted according to the configuration of the destination bucket. Encrypted objects will remain as such.
Permission Checks The combination of bucket policies, bucket ACLs, and Object ACLs give you very fine-grained control over access to your buckets and the objects within them. With the goal of making sure that your policies and ACLs combine to create the desired effect, we recently launched a set of Managed Config Rules to Secure Your S3 Buckets. As I mentioned in the post, these rules make use of some of our work to put Automated Formal Reasoning to use.
We are now using the same underlying technology to help you see the impact of changes to your bucket policies and ACLs as soon as you make them. You will know right away if you open up a bucket for public access, allowing you to make changes with confidence.
Here’s what it looks like on the main page of the S3 Console (I’ve sorted by the Access column for convenience):
The Public indicator is also displayed when you look inside a single bucket:
You can also see which Permission elements (ACL, Bucket Policy, or both) are enabling the public access:
Cross-Region Replication ACL Overwrite Our customers often use S3’s Cross-Region Replication to copy their mission-critical objects and data to a destination bucket in a separate AWS account. In addition to copying the object, the replication process copies the object ACL and any tags associated with the object.
We’re making this feature even more useful by allowing you to enable replacement of the ACL as it is in transit so that it grants full access to the owner of the destination bucket. With this change, ownership of the source and the destination data is split across AWS accounts, allowing you to maintain separate and distinct stacks of ownership for the original objects and their replicas.
To enable this feature when you are setting up replication, choose a destination bucket in a different account and Region by specifying the Account ID and Bucket name, and clicking on Save:
Then click on Change object ownership…:
We’ve also made it easier for you to set up the key policy for the destination bucket in the destination account. Simply log in to the account and find the bucket, then click on Management and Replication, then choose Receive objects… from the More menu:
Enter the source Account ID, enable versioning, inspect the policies, apply them, and click on Done:
Cross-Region Replication with KMS Replicating objects that have been encrypted using SSE-KMS across regions poses an interesting challenge that we are addressing today. Because the KMS keys are specific to a particular region, simply replicating the encrypted object would not work.
You can now choose the destination key when you set up cross-region replication. During the replication process, encrypted objects are replicated to the destination over an SSL connection. At the destination, the data key is encrypted with the KMS master key you specified in the replication configuration. The object remains in its original, encrypted form throughout; only the envelope containing the keys is actually changed.
Here’s how you enable this feature when you set up a replication rule:
As I mentioned earlier, you may need to request an increase in your KMS limits before you start to make heavy use of this feature.
Detailed Inventory Report Last but not least, you can now request that the daily or weekly S3 inventory reports include information on the encryption status of each object:
As you can see, you can also request SSE-S3 or SSE-KMS encryption for the report.
Now Available All of these features are available now and you can start using them today! There is no charge for the features, but you will be charged the usual rates for calls to KMS, S3 storage, S3 requests, and inter-region data transfer.
Success in the popular music industry is typically measured in terms of the number of Top 10 hits artists have to their credit. The music industry is a highly competitive multi-billion dollar business, and record labels incur various costs in exchange for a percentage of the profits from sales and concert tickets.
Predicting the success of an artist’s release in the popular music industry can be difficult. One release may be extremely popular, resulting in widespread play on TV, radio and social media, while another single may turn out quite unpopular, and therefore unprofitable. Record labels need to be selective in their decision making, and predictive analytics can help them with decision making around the type of songs and artists they need to promote.
In this walkthrough, you leverage H2O.ai, Amazon Athena, and RStudio to make predictions on whether a song might make it to the Top 10 Billboard charts. You explore the GLM, GBM, and deep learning modeling techniques using H2O’s rapid, distributed and easy-to-use open source parallel processing engine. RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. This is ideal if you don’t want to connect to a server via SSH and use code editors such as vi to do analytics. RStudio is available in a desktop version, or a server version that allows you to access R via a web browser. RStudio’s Notebooks feature is used to demonstrate the execution of code and output. In addition, this post showcases how you can leverage Athena for query and interactive analysis during the modeling phase. A working knowledge of statistics and machine learning would be helpful to interpret the analysis being performed in this post.
Walkthrough
Your goal is to predict whether a song will make it to the Top 10 Billboard charts. For this purpose, you will be using multiple modeling techniques―namely GLM, GBM and deep learning―and choose the model that is the best fit.
This solution involves the following steps:
Install and configure RStudio with Athena
Log in to RStudio
Install R packages
Connect to Athena
Create a dataset
Create models
Install and configure RStudio with Athena
Use the following AWS CloudFormation stack to install, configure, and connect RStudio on an Amazon EC2 instance with Athena.
Launching this stack creates all required resources and prerequisites:
Amazon EC2 instance with Amazon Linux (minimum size of t2.large is recommended)
Provisioning of the EC2 instance in an existing VPC and public subnet
Installation of Java 8
Assignment of an IAM role to the EC2 instance with the required permissions for accessing Athena and Amazon S3
Security group allowing access to the RStudio and SSH ports from the internet (I recommend restricting access to these ports)
S3 staging bucket required for Athena (referenced within RStudio as ATHENABUCKET)
RStudio username and password
Setup logs in Amazon CloudWatch Logs (if needed for additional troubleshooting)
Amazon EC2 Systems Manager agent, which makes it easy to manage and patch
All AWS resources are created in the US-East-1 Region. To avoid cross-region data transfer fees, launch the CloudFormation stack in the same region. To check the availability of Athena in other regions, see Region Table.
Log in to RStudio
The instance security group has been automatically configured to allow incoming connections on the RStudio port 8787 from any source internet address. You can edit the security group to restrict source IP access. If you have trouble connecting, ensure that port 8787 isn’t blocked by subnet network ACLS or by your outgoing proxy/firewall.
In the CloudFormation stack, choose Outputs, Value, and then open the RStudio URL. You might need to wait for a few minutes until the instance has been launched.
Log in to RStudio with the and password you provided during setup.
Install R packages
Next, install the required R packages from the RStudio console. You can download the R notebook file containing just the code.
#install pacman – a handy package manager for managing installs
if("pacman" %in% rownames(installed.packages()) == FALSE)
{install.packages("pacman")}
library(pacman)
p_load(h2o,rJava,RJDBC,awsjavasdk)
h2o.init(nthreads = -1)
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 hours 42 minutes
## H2O cluster version: 3.10.4.6
## H2O cluster version age: 4 months and 4 days !!!
## H2O cluster name: H2O_started_from_R_rstudio_hjx881
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.30 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## R Version: R version 3.3.3 (2017-03-06)
## Warning in h2o.clusterInfo():
## Your H2O cluster version is too old (4 months and 4 days)!
## Please download and install the latest version from http://h2o.ai/download/
#install aws sdk if not present (pre-requisite for using Athena with an IAM role)
if (!aws_sdk_present()) {
install_aws_sdk()
}
load_sdk()
## NULL
Connect to Athena
Next, establish a connection to Athena from RStudio, using an IAM role associated with your EC2 instance. Use ATHENABUCKET to specify the S3 staging directory.
URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
list.files()
## [1] "AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")
con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
s3_staging_dir=Sys.getenv("ATHENABUCKET"),
aws_credentials_provider_class="com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
Verify the connection. The results returned depend on your specific Athena setup.
For this analysis, you use a sample dataset combining information from Billboard and Wikipedia with Echo Nest data in the Million Songs Dataset. Upload this dataset into your own S3 bucket. The table below provides a description of the fields used in this dataset.
Field
Description
year
Year that song was released
songtitle
Title of the song
artistname
Name of the song artist
songid
Unique identifier for the song
artistid
Unique identifier for the song artist
timesignature
Variable estimating the time signature of the song
timesignature_confidence
Confidence in the estimate for the timesignature
loudness
Continuous variable indicating the average amplitude of the audio in decibels
tempo
Variable indicating the estimated beats per minute of the song
tempo_confidence
Confidence in the estimate for tempo
key
Variable with twelve levels indicating the estimated key of the song (C, C#, B)
key_confidence
Confidence in the estimate for key
energy
Variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
pitch
Continuous variable that indicates the pitch of the song
timbre_0_min thru timbre_11_min
Variables that indicate the minimum values over all segments for each of the twelve values in the timbre vector
timbre_0_max thru timbre_11_max
Variables that indicate the maximum values over all segments for each of the twelve values in the timbre vector
top10
Indicator for whether or not the song made it to the Top 10 of the Billboard charts (1 if it was in the top 10, and 0 if not)
Create an Athena table based on the dataset
In the Athena console, select the default database, sampled, or create a new database.
Run the following create table statement.
create external table if not exists billboard
(
year int,
songtitle string,
artistname string,
songID string,
artistID string,
timesignature int,
timesignature_confidence double,
loudness double,
tempo double,
tempo_confidence double,
key int,
key_confidence double,
energy double,
pitch double,
timbre_0_min double,
timbre_0_max double,
timbre_1_min double,
timbre_1_max double,
timbre_2_min double,
timbre_2_max double,
timbre_3_min double,
timbre_3_max double,
timbre_4_min double,
timbre_4_max double,
timbre_5_min double,
timbre_5_max double,
timbre_6_min double,
timbre_6_max double,
timbre_7_min double,
timbre_7_max double,
timbre_8_min double,
timbre_8_max double,
timbre_9_min double,
timbre_9_max double,
timbre_10_min double,
timbre_10_max double,
timbre_11_min double,
timbre_11_max double,
Top10 int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://aws-bigdata-blog/artifacts/predict-billboard/data'
;
Inspect the table definition for the ‘billboard’ table that you have created. If you chose a database other than sampledb, replace that value with your choice.
Next, run a sample query to obtain a list of all songs from Janet Jackson that made it to the Billboard Top 10 charts.
dbGetQuery(con, " SELECT songtitle,artistname,top10 FROM sampledb.billboard WHERE lower(artistname) = 'janet jackson' AND top10 = 1")
## songtitle artistname top10
## 1 Runaway Janet Jackson 1
## 2 Because Of Love Janet Jackson 1
## 3 Again Janet Jackson 1
## 4 If Janet Jackson 1
## 5 Love Will Never Do (Without You) Janet Jackson 1
## 6 Black Cat Janet Jackson 1
## 7 Come Back To Me Janet Jackson 1
## 8 Alright Janet Jackson 1
## 9 Escapade Janet Jackson 1
## 10 Rhythm Nation Janet Jackson 1
Determine how many songs in this dataset are specifically from the year 2010.
dbGetQuery(con, " SELECT count(*) FROM sampledb.billboard WHERE year = 2010")
## _col0
## 1 373
The sample dataset provides certain song properties of interest that can be analyzed to gauge the impact to the song’s overall popularity. Look at one such property, timesignature, and determine the value that is the most frequent among songs in the database. Timesignature is a measure of the number of beats and the type of note involved.
Running the query directly may result in an error, as shown in the commented lines below. This error is a result of trying to retrieve a large result set over a JDBC connection, which can cause out-of-memory issues at the client level. To address this, reduce the fetch size and run again.
#t<-dbGetQuery(con, " SELECT timesignature FROM sampledb.billboard")
#Note: Running the preceding query results in the following error:
#Error in .jcall(rp, "I", "fetch", stride, block): java.sql.SQLException: The requested #fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try #again. Refer to the Athena documentation for valid fetchSize values.
# Use the dbSendQuery function, reduce the fetch size, and run again
r <- dbSendQuery(con, " SELECT timesignature FROM sampledb.billboard")
dftimesignature<- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
table(dftimesignature)
## dftimesignature
## 0 1 3 4 5 7
## 10 143 503 6787 112 19
nrow(dftimesignature)
## [1] 7574
From the results, observe that 6787 songs have a timesignature of 4.
Next, determine the song with the highest tempo.
dbGetQuery(con, " SELECT songtitle,artistname,tempo FROM sampledb.billboard WHERE tempo = (SELECT max(tempo) FROM sampledb.billboard) ")
## songtitle artistname tempo
## 1 Wanna Be Startin' Somethin' Michael Jackson 244.307
Create the training dataset
Your model needs to be trained such that it can learn and make accurate predictions. Split the data into training and test datasets, and create the training dataset first. This dataset contains all observations from the year 2009 and earlier. You may face the same JDBC connection issue pointed out earlier, so this query uses a fetch size.
#BillboardTrain <- dbGetQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
#Running the preceding query results in the following error:-
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", : Unable to retrieve #JDBC result set for SELECT * FROM sampledb.billboard WHERE year <= 2009 (Internal error)
#Follow the same approach as before to address this issue.
r <- dbSendQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
BillboardTrain <- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
BillboardTrain[1:2,c(1:3,6:10)]
## year songtitle artistname timesignature
## 1 2009 The Awkward Goodbye Athlete 3
## 2 2009 Rubik's Cube Athlete 3
## timesignature_confidence loudness tempo tempo_confidence
## 1 0.732 -6.320 89.614 0.652
## 2 0.906 -9.541 117.742 0.542
nrow(BillboardTrain)
## [1] 7201
Create the test dataset
BillboardTest <- dbGetQuery(con, "SELECT * FROM sampledb.billboard where year = 2010")
BillboardTest[1:2,c(1:3,11:15)]
## year songtitle artistname key
## 1 2010 This Is the House That Doubt Built A Day to Remember 11
## 2 2010 Sticks & Bricks A Day to Remember 10
## key_confidence energy pitch timbre_0_min
## 1 0.453 0.9666556 0.024 0.002
## 2 0.469 0.9847095 0.025 0.000
nrow(BillboardTest)
## [1] 373
Convert the training and test datasets into H2O dataframes
You need to designate the independent and dependent variables prior to applying your modeling algorithms. Because you’re trying to predict the ‘top10’ field, this would be your dependent variable and everything else would be independent.
Create your first model using GLM. Because GLM works best with numeric data, you create your model by dropping non-numeric variables. You only use the variables in the dataset that describe the numerical attributes of the song in the logistic regression model. You won’t use these variables: “year”, “songtitle”, “artistname”, “songid”, or “artistid”.
Create Model 1 with the training dataset, using GLM as the modeling algorithm and H2O’s built-in h2o.glm function.
modelh1 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
##
|
| | 0%
|
|===== | 8%
|
|=================================================================| 100%
Measure the performance of Model 1, using H2O’s built-in performance function.
h2o.performance(model=modelh1,newdata=test.h2o)
## H2OBinomialMetrics: glm
##
## MSE: 0.09924684
## RMSE: 0.3150347
## LogLoss: 0.3220267
## Mean Per-Class Error: 0.2380168
## AUC: 0.8431394
## Gini: 0.6862787
## R^2: 0.254663
## Null Deviance: 326.0801
## Residual Deviance: 240.2319
## AIC: 308.2319
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 255 59 0.187898 =59/314
## 1 17 42 0.288136 =17/59
## Totals 272 101 0.203753 =76/373
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.192772 0.525000 100
## 2 max f2 0.124912 0.650510 155
## 3 max f0point5 0.416258 0.612903 23
## 4 max accuracy 0.416258 0.879357 23
## 5 max precision 0.813396 1.000000 0
## 6 max recall 0.037579 1.000000 282
## 7 max specificity 0.813396 1.000000 0
## 8 max absolute_mcc 0.416258 0.455251 23
## 9 max min_per_class_accuracy 0.161402 0.738854 125
## 10 max mean_per_class_accuracy 0.124912 0.765006 155
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `
h2o.auc(h2o.performance(modelh1,test.h2o))
## [1] 0.8431394
The AUC metric provides insight into how well the classifier is able to separate the two classes. In this case, the value of 0.8431394 indicates that the classification is good. (A value of 0.5 indicates a worthless test, while a value of 1.0 indicates a perfect test.)
Next, inspect the coefficients of the variables in the dataset.
Typically, songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”). This knowledge is helpful for interpreting the modeling results.
You can make the following observations from the results:
The coefficient estimates for the confidence values associated with the time signature, key, and tempo variables are positive. This suggests that higher confidence leads to a higher predicted probability of a Top 10 hit.
The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs with heavier instrumentation.
The coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those songs with light instrumentation.
These coefficients lead to contradictory conclusions for Model 1. This could be due to multicollinearity issues. Inspect the correlation between the variables “loudness” and “energy” in the training set.
This number indicates that these two variables are highly correlated, and Model 1 does indeed suffer from multicollinearity. Typically, you associate a value of -1.0 to -0.5 or 1.0 to 0.5 to indicate strong correlation, and a value of 0.1 to 0.1 to indicate weak correlation. To avoid this correlation issue, omit one of these two variables and re-create the models.
You build two variations of the original model:
Model 2, in which you keep “energy” and omit “loudness”
Model 3, in which you keep “loudness” and omit “energy”
You compare these two models and choose the model with a better fit for this use case.
Inspecting the coefficient of the variable energy, Model 2 suggests that songs with high energy levels tend to be more popular. This is as per expectation.
As H2O orders variables by significance, the variable energy is not significant in this model.
You can conclude that Model 2 is not ideal for this use , as energy is not significant.
From the confusion matrix, the model correctly predicts that 33 songs will be top 10 hits (true positives). However, it has 26 false positives (songs that the model predicted would be Top 10 hits, but ended up not being Top 10 hits).
Loudness has a positive coefficient estimate, meaning that this model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion from Model 2.
Loudness is significant in this model.
Overall, Model 3 predicts a higher number of top 10 hits with an accuracy rate that is acceptable. To choose the best fit for production runs, record labels should consider the following factors:
Desired model accuracy at a given threshold
Number of correct predictions for top10 hits
Tolerable number of false positives or false negatives
Next, make predictions using Model 3 on the test dataset.
The first set of output results specifies the probabilities associated with each predicted observation. For example, observation 1 is 96.54739% likely to not be a Top 10 hit, and 3.4526052% likely to be a Top 10 hit (predict=1 indicates Top 10 hit and predict=0 indicates not a Top 10 hit). The second set of results list the actual predictions made. From the third set of results, this model predicts that 61 songs will be top 10 hits.
Compute the baseline accuracy, by assuming that the baseline predicts the most frequent outcome, which is that most songs are not Top 10 hits.
table(BillboardTest$top10)
##
## 0 1
## 314 59
Now observe that the baseline model would get 314 observations correct, and 59 wrong, for an accuracy of 314/(314+59) = 0.8418231.
It seems that Model 3, with an accuracy of 0.8552, provides you with a small improvement over the baseline model. But is this model useful for record labels?
View the two models from an investment perspective:
A production company is interested in investing in songs that are more likely to make it to the Top 10. The company’s objective is to minimize the risk of financial losses attributed to investing in songs that end up unpopular.
How many songs does Model 3 correctly predict as a Top 10 hit in 2010? Looking at the confusion matrix, you see that it predicts 33 top 10 hits correctly at an optimal threshold, which is more than half the number
It will be more useful to the record label if you can provide the production company with a list of songs that are highly likely to end up in the Top 10.
The baseline model is not useful, as it simply does not label any song as a hit.
Considering the three models built so far, you can conclude that Model 3 proves to be the best investment choice for the record label.
GBM model
H2O provides you with the ability to explore other learning models, such as GBM and deep learning. Explore building a model using the GBM technique, using the built-in h2o.gbm function.
Before you do this, you need to convert the target variable to a factor for multinomial classification techniques.
train.h2o$top10=as.factor(train.h2o$top10)
gbm.modelh <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 500, max_depth = 4, learn_rate = 0.01, seed = 1122,distribution="multinomial")
##
|
| | 0%
|
|=== | 5%
|
|===== | 7%
|
|====== | 9%
|
|======= | 10%
|
|====================== | 33%
|
|===================================== | 56%
|
|==================================================== | 79%
|
|================================================================ | 98%
|
|=================================================================| 100%
perf.gbmh<-h2o.performance(gbm.modelh,test.h2o)
perf.gbmh
## H2OBinomialMetrics: gbm
##
## MSE: 0.09860778
## RMSE: 0.3140188
## LogLoss: 0.3206876
## Mean Per-Class Error: 0.2120263
## AUC: 0.8630573
## Gini: 0.7261146
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 266 48 0.152866 =48/314
## 1 16 43 0.271186 =16/59
## Totals 282 91 0.171582 =64/373
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.189757 0.573333 90
## 2 max f2 0.130895 0.693717 145
## 3 max f0point5 0.327346 0.598802 26
## 4 max accuracy 0.442757 0.876676 14
## 5 max precision 0.802184 1.000000 0
## 6 max recall 0.049990 1.000000 284
## 7 max specificity 0.802184 1.000000 0
## 8 max absolute_mcc 0.169135 0.496486 104
## 9 max min_per_class_accuracy 0.169135 0.796610 104
## 10 max mean_per_class_accuracy 0.169135 0.805948 104
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `
h2o.sensitivity(perf.gbmh,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501205344484314. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.1355932
h2o.auc(perf.gbmh)
## [1] 0.8630573
This model correctly predicts 43 top 10 hits, which is 10 more than the number predicted by Model 3. Moreover, the AUC metric is higher than the one obtained from Model 3.
As seen above, H2O’s API provides the ability to obtain key statistical measures required to analyze the models easily, using several built-in functions. The record label can experiment with different parameters to arrive at the model that predicts the maximum number of Top 10 hits at the desired level of accuracy and threshold.
H2O also allows you to experiment with deep learning models. Deep learning models have the ability to learn features implicitly, but can be more expensive computationally.
Now, create a deep learning model with the h2o.deeplearning function, using the same training and test datasets created before. The time taken to run this model depends on the type of EC2 instance chosen for this purpose. For models that require more computation, consider using accelerated computing instances such as the P2 instance type.
system.time(
dlearning.modelh <- h2o.deeplearning(y = y.dep,
x = x.indep,
training_frame = train.h2o,
epoch = 250,
hidden = c(250,250),
activation = "Rectifier",
seed = 1122,
distribution="multinomial"
)
)
##
|
| | 0%
|
|=== | 4%
|
|===== | 8%
|
|======== | 12%
|
|========== | 16%
|
|============= | 20%
|
|================ | 24%
|
|================== | 28%
|
|===================== | 32%
|
|======================= | 36%
|
|========================== | 40%
|
|============================= | 44%
|
|=============================== | 48%
|
|================================== | 52%
|
|==================================== | 56%
|
|======================================= | 60%
|
|========================================== | 64%
|
|============================================ | 68%
|
|=============================================== | 72%
|
|================================================= | 76%
|
|==================================================== | 80%
|
|======================================================= | 84%
|
|========================================================= | 88%
|
|============================================================ | 92%
|
|============================================================== | 96%
|
|=================================================================| 100%
## user system elapsed
## 1.216 0.020 166.508
perf.dl<-h2o.performance(model=dlearning.modelh,newdata=test.h2o)
perf.dl
## H2OBinomialMetrics: deeplearning
##
## MSE: 0.1678359
## RMSE: 0.4096778
## LogLoss: 1.86509
## Mean Per-Class Error: 0.3433013
## AUC: 0.7568822
## Gini: 0.5137644
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 290 24 0.076433 =24/314
## 1 36 23 0.610169 =36/59
## Totals 326 47 0.160858 =60/373
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.826267 0.433962 46
## 2 max f2 0.000000 0.588235 239
## 3 max f0point5 0.999929 0.511811 16
## 4 max accuracy 0.999999 0.865952 10
## 5 max precision 1.000000 1.000000 0
## 6 max recall 0.000000 1.000000 326
## 7 max specificity 1.000000 1.000000 0
## 8 max absolute_mcc 0.999929 0.363219 16
## 9 max min_per_class_accuracy 0.000004 0.662420 145
## 10 max mean_per_class_accuracy 0.000000 0.685334 224
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.sensitivity(perf.dl,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.496293348880151. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.3898305
h2o.auc(perf.dl)
## [1] 0.7568822
The AUC metric for this model is 0.7568822, which is less than what you got from the earlier models. I recommend further experimentation using different hyper parameters, such as the learning rate, epoch or the number of hidden layers.
H2O’s built-in functions provide many key statistical measures that can help measure model performance. Here are some of these key terms.
Metric
Description
Sensitivity
Measures the proportion of positives that have been correctly identified. It is also called the true positive rate, or recall.
Specificity
Measures the proportion of negatives that have been correctly identified. It is also called the true negative rate.
Threshold
Cutoff point that maximizes specificity and sensitivity. While the model may not provide the highest prediction at this point, it would not be biased towards positives or negatives.
Precision
The fraction of the documents retrieved that are relevant to the information needed, for example, how many of the positively classified are relevant
AUC
Provides insight into how well the classifier is able to separate the two classes. The implicit goal is to deal with situations where the sample distribution is highly skewed, with a tendency to overfit to a single class.
0.90 – 1 = excellent (A)
0.8 – 0.9 = good (B)
0.7 – 0.8 = fair (C)
.6 – 0.7 = poor (D)
0.5 – 0.5 = fail (F)
Here’s a summary of the metrics generated from H2O’s built-in functions for the three models that produced useful results.
Metric
Model 3
GBM Model
Deep Learning Model
Accuracy
(max)
0.882038
(t=0.435479)
0.876676
(t=0.442757)
0.865952
(t=0.999999)
Precision
(max)
1.0
(t=0.821606)
1.0
(t=0802184)
1.0
(t=1.0)
Recall
(max)
1.0
1.0
1.0
(t=0)
Specificity
(max)
1.0
1.0
1.0
(t=1)
Sensitivity
0.2033898
0.1355932
0.3898305
(t=0.5)
AUC
0.8492389
0.8630573
0.756882
Note: ‘t’ denotes threshold.
Your options at this point could be narrowed down to Model 3 and the GBM model, based on the AUC and accuracy metrics observed earlier. If the slightly lower accuracy of the GBM model is deemed acceptable, the record label can choose to go to production with the GBM model, as it can predict a higher number of Top 10 hits. The AUC metric for the GBM model is also higher than that of Model 3.
Record labels can experiment with different learning techniques and parameters before arriving at a model that proves to be the best fit for their business. Because deep learning models can be computationally expensive, record labels can choose more powerful EC2 instances on AWS to run their experiments faster.
Conclusion
In this post, I showed how the popular music industry can use analytics to predict the type of songs that make the Top 10 Billboard charts. By running H2O’s scalable machine learning platform on AWS, data scientists can easily experiment with multiple modeling techniques and interactively query the data using Amazon Athena, without having to manage the underlying infrastructure. This helps record labels make critical decisions on the type of artists and songs to promote in a timely fashion, thereby increasing sales and revenue.
If you have questions or suggestions, please comment below.
Gopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.
Bob Strahan, a Senior Consultant with AWS Professional Services, contributed to this post.
Many AWS customers run their applications within a Amazon Virtual Private Cloud (VPC) for security or isolation reasons. Previously, if you wanted your EC2 instances in your VPC to be able to access DynamoDB, you had two options. You could use an Internet Gateway (with a NAT Gateway or assigning your instances public IPs) or you could route all of your traffic to your local infrastructure via VPN or AWS Direct Connect and then back to DynamoDB. Both of these solutions had security and throughput implications and it could be difficult to configure NACLs or security groups to restrict access to just DynamoDB. Here is a picture of the old infrastructure.
Creating an Endpoint
Let’s create a VPC Endpoint for DynamoDB. We can make sure our region supports the endpoint with the DescribeVpcEndpointServices API call.
Great, so I know my region supports these endpoints and I know what my regional endpoint is. I can grab one of my VPCs and provision an endpoint with a quick call to the CLI or through the console. Let me show you how to use the console.
First I’ll navigate to the VPC console and select “Endpoints” in the sidebar. From there I’ll click “Create Endpoint” which brings me to this handy console.
For now I’ll give full access to my instances within this VPC and click “Next Step”.
This brings me to a list of route tables in my VPC and asks me which of these route tables I want to assign my endpoint to. I’ll select one of them and click “Create Endpoint”.
Keep in mind the note of warning in the console: if you have source restrictions to DynamoDB based on public IP addresses the source IP of your instances accessing DynamoDB will now be their private IP addresses.
After adding the VPC Endpoint for DynamoDB to our VPC our infrastructure looks like this.
That’s it folks! It’s that easy. It’s provided at no cost. Go ahead and start using it today. If you need more details you can read the docs here.
In the world of data science, users must often sacrifice cluster set-up time to allow for complex usability scenarios. Amazon EMR allows data scientists to spin up complex cluster configurations easily, and to be up and running with complex queries in a matter of minutes.
Data scientists often use scheduling applications such as Oozie to run jobs overnight. However, Oozie can be difficult to configure when you are trying to use popular Python packages (such as “pandas,” “numpy,” and “statsmodels”), which are not included by default.
One such popular platform that contains these types of packages (and more) is Anaconda. This post focuses on setting up an Anaconda platform on EMR, with an intent to use its packages with Oozie. I describe how to run jobs using a popular open source scheduler like Oozie.
Walkthrough
For this post, you walk through the following tasks:
Create an EMR cluster.
Download Anaconda on your master node.
Configure Oozie.
Test the steps.
Create an EMR cluster
Spin up an Amazon EMR cluster using the console or the AWS CLI. Use the latest release, and include Apache Hadoop, Apache Spark, Apache Hive, and Oozie.
To create a three-node cluster in the us-east-1 region, issue an AWS CLI command such as the following. This command must be typed as one line, as shown below. It is shown here separated for readability purposes only.
At the time of publication, Anaconda 4.4 is the most current version available. For the download link location for the latest Python 2.7 version (Python 3.6 may encounter issues), see https://www.continuum.io/downloads. Open the context (right-click) menu for the Python 2.7 download link, choose Copy Link Location, and use this value in the previous wget command.
This post used the Anaconda 4.4 installation. If you have a later version, it is shown in the downloaded filename: “anaconda2-<version number>-Linux-x86_64.sh”.
Run this downloaded script and follow the on-screen installer prompts.
For an installation directory, select somewhere with enough space on your cluster, such as “/mnt/anaconda/”.
The process should take approximately 1–2 minutes to install. When prompted if you “wish the installer to prepend the Anaconda2 install location”, select the default option of [no].
After you are done, export the PATH to include this new Anaconda installation:
export PATH=/mnt/anaconda/bin:$PATH
Zip up the Anaconda installation:
cd /mnt/anaconda/
zip -r anaconda.zip .
The zip process may take 4–5 minutes to complete.
(Optional) Upload this anaconda.zip file to your S3 bucket for easier inclusion into future EMR clusters. This removes the need to repeat the previous steps for future EMR clusters.
Configure Oozie
Next, you configure Oozie to use Pyspark and the Anaconda platform.
Get the location of your Oozie sharelibupdate folder. Issue the following command and take note of the “sharelibDirNew” value:
oozie admin -sharelibupdate
For this post, this value is “hdfs://ip-192-168-4-200.us-east-1.compute.internal:8020/user/oozie/share/lib/lib_20170616133136”.
Pass in the required Pyspark files into Oozies sharelibupdate location. The following files are required for Oozie to be able to run Pyspark commands:
pyspark.zip
py4j-0.10.4-src.zip
These are located on the EMR master instance in the location “/usr/lib/spark/python/lib/”, and must be put into the Oozie sharelib spark directory. This location is the value of the sharelibDirNew parameter value (shown above) with “/spark/” appended, that is, “hdfs://ip-192-168-4-200.us-east-1.compute.internal:8020/user/oozie/share/lib/lib_20170616133136/spark/”.
(Optional) Verify that it was transferred successfully with the following command:
hdfs dfs -ls /tmp/myLocation/
On your master node, execute the following command:
export PYSPARK_PYTHON=/mnt/anaconda/bin/python
Set the PYSPARK_PYTHON environment variable on the executor nodes. Put the following configurations in your “spark-opts” values in your Oozie workflow.xml file:
Put the “myPysparkProgram.py” into the location mentioned between the “<jar>xxxxx</jar>” tags in your workflow.xml. In this example, the location is “hdfs:///user/oozie/apps/”. Use the following command to move the “myPysparkProgram.py” file to the correct location:
This is a sentence. 12
So is this 12
This is also a sentence 12
Summary
The myPysparkProgram.py has successfully imported the numpy library from the Anaconda platform and has produced some output with it. If you tried to run this using standard Python, you’d encounter the following error:
Now when your Python job runs in Oozie, any imported packages that are implicitly imported by your Pyspark script are imported into your job within Oozie directly from the Anaconda platform. Simple!
If you have questions or suggestions, please leave a comment below.
John Ohle is an AWS BigData Cloud Support Engineer II for the BigData team in Dublin. He works to provide advice and solutions to our customers on their Big Data projects and workflows on AWS. In his spare time, he likes to play music, learn, develop tools and write documentation to further help others – both colleagues and customers alike.
In this blog post, I demonstrate the step-by-step process for end-to-end account creation in Organizations as well as how to automate the entire process. I also show how to move a new account into an organizational unit (OU).
Process overview
The following process flow diagram illustrates the steps required to create an account, configure the account, and then move it into an OU so that the account can take advantage of the centralized SCP functionality in Organizations. The tasks in the blue nodes occur in the master account in the organization in question, and the task in the orange node occurs in the new member account I create. In this post, I provide a script (in both Bash/CLI and Python) that you can use to automate this account creation process, and I walk through each step shown in the diagram to explain the process in detail. For the purposes of this post, I use the AWS CLI in combination with CloudFormation to create and configure an account.
The account creation process
Follow the steps in this section to create an account, configure it, and move it into an OU. I am also providing a script and CloudFormation templates that you can use to automate the entire process.
1. Sign in to AWS Organizations
In order to create an account, you must sign in to your organization’s master account with a minimum of the following permissions:
organizations:DescribeOrganization
organizations:CreateAccount
2. Create a new member account
After signing in to your organization’s master account, create a new member account. Before you can create the member account, you need three pieces of information:
An account name – The friendly name of the member account, which you can find on the Accounts tab in the master account.
An email address – The email address of the owner of the new member account. This email address is used by AWS when we need to contact the account owner.
An IAM role name – The name of an IAM role that Organizations automatically preconfigures in the new member account. This role trusts the master account, allowing users in the master account to assume the role, as permitted by the master account administrator. The role also has administrator permissions in the new member account. If you do not change the role’s name, the name defaults to OrganizationAccountAccessRole.
The following AWS CLI command creates a new member account.
To explain the placeholder values in the preceding command that you must update with your own values:
newAccEmail – The email address of the owner of the new member account. This email address must not already be associated with another AWS account.
newAccName – The friendly name of the new member account.
roleName – The name of an IAM role that Organizations automatically preconfigures in the new member account. The default name is OrganizationAccountAccessRole.
This CLI command returns a request_id that uniquely identifies the request, a value that is required for in Step 3.
Important: When you create an account using Organizations, you currently cannot remove this account from your organization. This, in turn, can prevent you from later deleting the organization.
3. Verify account creation
Account creation may take a few seconds to complete, so before doing anything with the newly created account, you must first verify the account creation status. To check the status, you must have at least the following permission:
organizations:DescribeCreateAccountStatus
The following CLI command, with the request_id returned in the previous step as an input parameter, verifies that the account was created:
The command returns the state of your account creation request and can have three different values: IN_PROGRESS, SUCCEEDED, and FAILED.
4. Assume a role
After you have verified that the new account has been created, configure the account. In order to configure the newly created account, you must sign in with a user who has permission to assume the role submitted in the createAccount API call. In the example in Step 1, I named the role OrganizationAccountAccessRole; however, if you revised the name of the role, you must use that revised name when assuming the role. Note that when an account is created from within an organization, cross-account trust between the master and programmatically created accounts is automatically established.
role-session-name – An identifier for the assumed role session.
5. Configure the new account
After you assume the role, build the new account’s networking, IAM, and governance resources as explained in this section. Again, to learn more about and download the account creation script and the templates that can automate this process, see “Automating the entire end-to-end process” later in this post.
Networking – Amazon VPC, web access control lists (ACLs), and Internet gateway:
Create a new Amazon VPC to enable you to launch AWS resources in a virtual network that you define.
Run the script at the end of this post to create a VPC with two subnets (one public subnet and one private subnet) in each of two Availability Zones.
Set up web ACLs to control traffic in and out of the subnets. You can set up network ACLs with rules similar to your security groups in order to add an additional layer of security to your VPC.
Connect your VPC to remote networks by using a VPN connection.
If the resources in the VPC require access to the Internet, create an Internet gateway to allow communication between instances in your VPC and the Internet.
IAM – Identity provider (IdP), IAM policies, and IAM roles:
Many customers use enterprise federated authentication (such as Active Directory) to manage users and permissions centrally outside AWS. If you use federated authentication, set up an IdP.
Use AWS managed policies or your customer managed policies to manage access to your AWS resources.
Governance – AWS Config Rules:
Create AWS Config rules to help manage and enforce standards for resources deployed on AWS.
Develop a tagging strategy that specifies a minimum set of tags required on every taggable resource. A tagging rule checks that all resources created or edited fulfill this requirement. A noncompliance report is created to document resources that do not meet the AWS Config rule. AWS Lambda scripts can also be launched as a result of AWS Config rules.
6. Move the new account into an OU
Before allowing your development teams to access the new member account that you configured in the previous steps, apply an SCP to the account to limit the API calls that can be made by all users. To do this, you must move the member account into an OU that has an SCP attached to it.
An OU is a container for accounts. It can contain other OUs, allowing you to create a hierarchy that resembles an upside-down tree with a “root” at the top and OU “branches” that reach down, ending with accounts that are the “leaves” of the tree. When you attach a policy to one of the nodes in the hierarchy, it affects all the branches (OUs) and leaves (accounts) under it. An OU can have exactly one parent, and currently, each account can be a member of exactly one OU.
The following CLI command moves an account into an OU.
To explain the placeholder values in the preceding command that you must update with your own values:
account_id – The unique identifier (ID) of the account you want to move.
source_parent_id – The unique ID of the root or OU from which you want to move the account.
destination_parent_id – The unique ID of the root or OU to which you want to move the account.
7. Reduce the IAM role permissions
The OrganizationAccountAccessRole is created with full administrative permissions to enable the creation and development of the new member account. After you complete the development process and you have moved the member account into an OU, reduce the permissions of OrganizationAccountAccessRole to match your anticipated use of this role going forward.
Automating the entire end-to-end process
To help you fully automate the process of creating new member accounts, setting up those accounts, and moving new member accounts into an OU, I am providing a script in both Bash/CLI and Python. You can modify or call additional CloudFormation templates as needed.
Download the script and CloudFormation templates
Download the script and CloudFormation templates to help you automate this end-to-end process. The global variables in the script are set in the opening lines of code. Update these variables’ values, and they will flow as input parameters to the API commands when the script is executed. I have prepopulated the roleName by using AWS best practices nomenclature, but you can use a custom name.
I am including the following descriptions of the elements of the script to give you a better idea of how the script works.
Bash/CLI:
Organization-new-acc.sh – An example shell script that includes parameters, account creation, and a call to the JSON sample templates for each of three subtasks in Step 5 earlier in this post.
CF-VPC.json – An example Cloud Formation template that creates and configures a VPC in the new member account. Each AWS account must have at least one VPC as a networking construct where you can deploy customer resources. Though AWS does create a default VPC when an account is created, you must configure that VPC to meet your needs. This includes creating subnets with specific IP Classless Inter-Domain Routing (CIDR) blocks, creating gateways (including an Internet gateway, a customer gateway, a VPN tunnel, AWS Storage Gateway, Amazon API Gateway, and a NAT gateway), and VPC peering connections. Web ACLs are also part of this process to limit the source IP addresses and ports that can access the VPC. The VPC created by this script includes four subnets across two Availability Zones. Two of the subnets are public and two are private.
CF-IAM.json – An example Cloud Formation template that creates IAM roles in the new member account. As part of a security baseline, you should develop a standard set of IAM roles and related policies. Update this template with the IAM role definitions and policies you want to create in the member account to controls privilege and access.
CF-ConfigRules.json – An example Cloud Formation template that creates an AWS Config rule to enforce tagging standards on resources created in the new account.
Organization_Output.docx – Example output of the results from running Organization-new-acc.sh.
Python:
Create_account_with_iam.py – An example Python template that creates an account, moves it into an OU, applies an SCP, and then calls additional templates to deploy resources. CF-VPC.JSON can be called by this template if you first customize the .json file.
Baseline.yml – An example CloudFormation template for creating a new IAM administrative user, IAM user group, IAM role, and IAM policy in the account.
Summary
In this post, I have demonstrated the step-by-step process for end-to-end account creation in Organizations as well as how to automate the entire process. I also showed how to move a new account into an OU. This solution should save you some time and help you avoid common issues that tend to crop up in the manual account-creation process. To learn more about the features of Organizations, see the AWS Organizations User Guide. For more information about the APIs used in this post, see the Organizations API Reference.
If you have comments about this blog post, submit them in the “Comments” section below. If you have implementation or troubleshooting questions, start a new thread on the Organizations forum.
In the past months I have been working on a new project: casync. casync takes inspiration from the popular rsync file synchronization tool as well as the probably even more popular git revision control system. It combines the idea of the rsync algorithm with the idea of git-style content-addressable file systems, and creates a new system for efficiently storing and delivering file system images, optimized for high-frequency update cycles over the Internet. Its current focus is on delivering IoT, container, VM, application, portable service or OS images, but I hope to extend it later in a generic fashion to become useful for backups and home directory synchronization as well (but more about that later).
The basic technological building blocks casync is built from are neither new nor particularly innovative (at least not anymore), however the way casync combines them is different from existing tools, and that’s what makes it useful for a variety of usecases that other tools can’t cover that well.
Why?
I created casync after studying how today’s popular tools store and deliver file system images. To very incomprehensively and briefly name a few: Docker has a layered tarball approach, OSTree serves the individual files directly via HTTP and maintains packed deltas to speed up updates, while other systems operate on the block layer and place raw squashfs images (or other archival file systems, such as IS09660) for download on HTTP shares (in the better cases combined with zsync data).
Neither of these approaches appeared fully convincing to me when used in high-frequency update cycle systems. In such systems, it is important to optimize towards a couple of goals:
Most importantly, make updates cheap traffic-wise (for this most tools use image deltas of some form)
Put boundaries on disk space usage on servers (keeping deltas between all version combinations clients might want to run updates between, would suggest keeping an exponentially growing amount of deltas on servers)
Put boundaries on disk space usage on clients
Be friendly to Content Delivery Networks (CDNs), i.e. serve neither too many small nor too many overly large files, and only require the most basic form of HTTP. Provide the repository administrator with high-level knobs to tune the average file size delivered.
Simplicity to use for users, repository administrators and developers
I don’t think any of the tools mentioned above are really good on more than a small subset of these points.
Specifically: Docker’s layered tarball approach dumps the “delta” question onto the feet of the image creators: the best way to make your image downloads minimal is basing your work on an existing image clients might already have, and inherit its resources, maintaing full history. Here, revision control (a tool for the developer) is intermingled with update management (a concept for optimizing production delivery). As container histories grow individual deltas are likely to stay small, but on the other hand a brand-new deployment usually requires downloading the full history onto the deployment system, even though there’s no use for it there, and likely requires substantially more disk space and download sizes.
OSTree’s serving of individual files is unfriendly to CDNs (as many small files in file trees cause an explosion of HTTP GET requests). To counter that OSTree supports placing pre-calculated delta images between selected revisions on the delivery servers, which means a certain amount of revision management, that leaks into the clients.
Delivering direct squashfs (or other file system) images is almost beautifully simple, but of course means every update requires a full download of the newest image, which is both bad for disk usage and generated traffic. Enhancing it with zsync makes this a much better option, as it can reduce generated traffic substantially at very little cost of history/metadata (no explicit deltas between a large number of versions need to be prepared server side). On the other hand server requirements in disk space and functionality (HTTP Range requests) are minus points for the usecase I am interested in.
(Note: all the mentioned systems have great properties, and it’s not my intention to badmouth them. They only point I am trying to make is that for the use case I care about — file system image delivery with high high frequency update-cycles — each system comes with certain drawbacks.)
Security & Reproducability
Besides the issues pointed out above I wasn’t happy with the security and reproducability properties of these systems. In today’s world where security breaches involving hacking and breaking into connected systems happen every day, an image delivery system that cannot make strong guarantees regarding data integrity is out of date. Specifically, the tarball format is famously undeterministic: the very same file tree can result in any number of different valid serializations depending on the tool used, its version and the underlying OS and file system. Some tar implementations attempt to correct that by guaranteeing that each file tree maps to exactly one valid serialization, but such a property is always only specific to the tool used. I strongly believe that any good update system must guarantee on every single link of the chain that there’s only one valid representatin of the data to deliver, that can easily be verified.
What casync Is
So much about the background why I created casync. Now, let’s have a look what casync actually is like, and what it does. Here’s the brief technical overview:
Encoding: Let’s take a large linear data stream, split it into variable-sized chunks (the size of each being a function of the chunk’s contents), and store these chunks in individual, compressed files in some directory, each file named after a strong hash value of its contents, so that the hash value may be used to as key for retrieving the full chunk data. Let’s call this directory a “chunk store”. At the same time, generate a “chunk index” file that lists these chunk hash values plus their respective chunk sizes in a simple linear array. The chunking algorithm is supposed to create variable, but similarly sized chunks from the data stream, and do so in a way that the same data results in the same chunks even if placed at varying offsets. For more information see this blog story.
Decoding: Let’s take the chunk index file, and reassemble the large linear data stream by concatenating the uncompressed chunks retrieved from the chunk store, keyed by the listed chunk hash values.
As an extra twist, we introduce a well-defined, reproducible, random-access serialization format for file trees (think: a more modern tar), to permit efficient, stable storage of complete file trees in the system, simply by serializing them and then passing them into the encoding step explained above.
Finally, let’s put all this on the network: for each image you want to deliver, generate a chunk index file and place it on an HTTP server. Do the same with the chunk store, and share it between the various index files you intend to deliver.
Why bother with all of this? Streams with similar contents will result in mostly the same chunk files in the chunk store. This means it is very efficient to store many related versions of a data stream in the same chunk store, thus minimizing disk usage. Moreover, when transferring linear data streams chunks already known on the receiving side can be made use of, thus minimizing network traffic.
Why is this different from rsync or OSTree, or similar tools? Well, one major difference between casync and those tools is that we remove file boundaries before chunking things up. This means that small files are lumped together with their siblings and large files are chopped into pieces, which permits us to recognize similarities in files and directories beyond file boundaries, and makes sure our chunk sizes are pretty evenly distributed, without the file boundaries affecting them.
The “chunking” algorithm is based on a the buzhash rolling hash function. SHA256 is used as strong hash function to generate digests of the chunks. xz is used to compress the individual chunks.
Here’s a diagram, hopefully explaining a bit how the encoding process works, wasn’t it for my crappy drawing skills:
The diagram shows the encoding process from top to bottom. It starts with a block device or a file tree, which is then serialized and chunked up into variable sized blocks. The compressed chunks are then placed in the chunk store, while a chunk index file is written listing the chunk hashes in order. (The original SVG of this graphic may be found here.
Details
Note that casync operates on two different layers, depending on the usecase of the user:
You may use it on the block layer. In this case the raw block data on disk is taken as-is, read directly from the block device, split into chunks as described above, compressed, stored and delivered.
You may use it on the file system layer. In this case, the file tree serialization format mentioned above comes into play: the file tree is serialized depth-first (much like tar would do it) and then split into chunks, compressed, stored and delivered.
The fact that it may be used on both the block and file system layer opens it up for a variety of different usecases. In the VM and IoT ecosystems shipping images as block-level serializations is more common, while in the container and application world file-system-level serializations are more typically used.
Chunk index files referring to block-layer serializations carry the .caibx suffix, while chunk index files referring to file system serializations carry the .caidx suffix. Note that you may also use casync as direct tar replacement, i.e. without the chunking, just generating the plain linear file tree serialization. Such files carry the .catar suffix. Internally .caibx are identical to .caidx files, the only difference is semantical: .caidx files describe a .catar file, while .caibx files may describe any other blob. Finally, chunk stores are directories carrying the .castr suffix.
Features
Here are a couple of other features casync has:
When downloading a new image you may use casync‘s --seed= feature: each block device, file, or directory specified is processed using the same chunking logic described above, and is used as preferred source when putting together the downloaded image locally, avoiding network transfer of it. This of course is useful whenever updating an image: simply specify one or more old versions as seed and only download the chunks that truly changed since then. Note that using seeds requires no history relationship between seed and the new image to download. This has major benefits: you can even use it to speed up downloads of relatively foreign and unrelated data. For example, when downloading a container image built using Ubuntu you can use your Fedora host OS tree in /usr as seed, and casync will automatically use whatever it can from that tree, for example timezone and locale data that tends to be identical between distributions. Example: casync extract http://example.com/myimage.caibx --seed=/dev/sda1 /dev/sda2. This will place the block-layer image described by the indicated URL in the /dev/sda2 partition, using the existing /dev/sda1 data as seeding source. An invocation like this could be typically used by IoT systems with an A/B partition setup. Example 2: casync extract http://example.com/mycontainer-v3.caidx --seed=/srv/container-v1 --seed=/srv/container-v2 /src/container-v3, is very similar but operates on the file system layer, and uses two old container versions to seed the new version.
When operating on the file system level, the user has fine-grained control on the metadata included in the serialization. This is relevant since different usecases tend to require a different set of saved/restored metadata. For example, when shipping OS images, file access bits/ACLs and ownership matter, while file modification times hurt. When doing personal backups OTOH file ownership matters little but file modification times are important. Moreover different backing file systems support different feature sets, and storing more information than necessary might make it impossible to validate a tree against an image if the metadata cannot be replayed in full. Due to this, casync provides a set of --with= and --without= parameters that allow fine-grained control of the data stored in the file tree serialization, including the granularity of modification times and more. The precise set of selected metadata features is also always part of the serialization, so that seeding can work correctly and automatically.
casync tries to be as accurate as possible when storing file system metadata. This means that besides the usual baseline of file metadata (file ownership and access bits), and more advanced features (extended attributes, ACLs, file capabilities) a number of more exotic data is stored as well, including Linux chattr(1) file attributes, as well as FAT file attributes (you may wonder why the latter? — EFI is FAT, and /efi is part of the comprehensive serialization of any host). In the future I intend to extend this further, for example storing btrfs subvolume information where available. Note that as described above every single type of metadata may be turned off and on individually, hence if you don’t need FAT file bits (and I figure it’s pretty likely you don’t), then they won’t be stored.
The user creating .caidx or .caibx files may control the desired average chunk length (before compression) freely, using the --chunk-size= parameter. Smaller chunks increase the number of generated files in the chunk store and increase HTTP GET load on the server, but also ensure that sharing between similar images is improved, as identical patterns in the images stored are more likely to be recognized. By default casync will use a 64K average chunk size. Tweaking this can be particularly useful when adapting the system to specific CDNs, or when delivering compressed disk images such as squashfs (see below).
Emphasis is placed on making all invocations reproducible, well-defined and strictly deterministic. As mentioned above this is a requirement to reach the intended security guarantees, but is also useful for many other usecases. For example, the casync digest command may be used to calculate a hash value identifying a specific directory in all desired detail (use --with= and --without to pick the desired detail). Moreover the casync mtree command may be used to generate a BSD mtree(5) compatible manifest of a directory tree, .caidx or .catar file.
The file system serialization format is nicely composable. By this I mean that the serialization of a file tree is the concatenation of the serializations of all files and file subtrees located at the top of the tree, with zero metadata references from any of these serializations into the others. This property is essential to ensure maximum reuse of chunks when similar trees are serialized.
When extracting file trees or disk image files, casync will automatically create reflinks from any specified seeds if the underlying file system supports it (such as btrfs, ocfs, and future xfs). After all, instead of copying the desired data from the seed, we can just tell the file system to link up the relevant blocks. This works both when extracting .caidx and .caibx files — the latter of course only when the extracted disk image is placed in a regular raw image file on disk, rather than directly on a plain block device, as plain block devices do not know the concept of reflinks.
Optionally, when extracting file trees, casync can create traditional UNIX hardlinks for identical files in specified seeds (--hardlink=yes). This works on all UNIX file systems, and can save substantial amounts of disk space. However, this only works for very specific usecases where disk images are considered read-only after extraction, as any changes made to one tree will propagate to all other trees sharing the same hardlinked files, as that’s the nature of hardlinks. In this mode, casync exposes OSTree-like behaviour, which is built heavily around read-only hardlink trees.
casync tries to be smart when choosing what to include in file system images. Implicitly, file systems such as procfs and sysfs are exluded from serialization, as they expose API objects, not real files. Moreover, the “nodump” (+d) chattr(1) flag is honoured by default, permitting users to mark files to exclude from serialization.
When creating and extracting file trees casync may apply am automatic or explicit UID/GID shift. This is particularly useful when transferring container image for use with Linux user namespacing.
In addition to local operation, casync currently supports HTTP, HTTPS, FTP and ssh natively for downloading chunk index files and chunks (the ssh mode requires installing casync on the remote host, though, but an sftp mode not requiring that should be easy to add). When creating index files or chunks, only ssh is supported as remote backend.
When operating on block-layer images, you may expose locally or remotely stored images as local block devices. Example: casync mkdev http://example.com/myimage.caibx exposes the disk image described by the indicated URL as local block device in /dev, which you then may use the usual block device tools on, such as mount or fdisk (only read-only though). Chunks are downloaded on access with high priority, and at low priority when idle in the background. Note that in this mode, casync also plays a role similar to “dm-verity”, as all blocks are validated against the strong digests in the chunk index file before passing them on to the kernel’s block layer. This feature is implemented though Linux’ NBD kernel facility.
Similar, when operating on file-system-layer images, you may mount locally or remotely stored images as regular file systems. Example: casync mount http://example.com/mytree.caidx /srv/mytree mounts the file tree image described by the indicated URL as a local directory /srv/mytree. This feature is implemented though Linux’ FUSE kernel facility. Note that special care is taken that the images exposed this way can be packed up again with casync make and are guaranteed to return the bit-by-bit exact same serialization again that it was mounted from. No data is lost or changed while passing thrings through FUSE (OK, strictly speaking this is a lie, we do lose ACLs, but that’s hopefully just a temporary gap to be fixed soon).
In IoT A/B fixed size partition setups the file systems placed in the two partitions are usually much shorter than the partition size, in order to keep some room for later, larger updates. casync is able to analyze the superblock of a number of common file systems in order to determine the actual size of a file system stored on a block device, so that writing a file system to such a partition and reading it back again will result in reproducible data. Moreover this speeds up the seeding process, as there’s little point in seeding the whitespace after the file system within the partition.
Example Command Lines
Here’s how to use casync, explained with a few examples:
$ casync make foobar.caidx /some/directory
This will create a chunk index file foobar.caidx in the local directory, and populate the chunk store directory default.castr located next to it with the chunks of the serialization (you can change the name for the store directory with --store= if you like). This command operates on the file-system level. A similar command operating on the block level:
$ casync make foobar.caibx /dev/sda1
This command creates a chunk index file foobar.caibx in the local directory describing the current contents of the /dev/sda1 block device, and populates default.castr in the same way as above. Note that you may as well read a raw disk image from a file instead of a block device:
$ casync make foobar.caibx myimage.raw
To reconstruct the original file tree from the .caidx file and the chunk store of the first command, use:
This extracts the specified .caidx onto a local directory. This of course assumes that foobar.caidx was uploaded to the HTTP server in the first place, along with the chunk store. You can use any command you like to accomplish that, for example scp or rsync. Alternatively, you can let casync do this directly when generating the chunk index:
$ casync make ssh.example.com:images/foobar.caidx /some/directory
This will use ssh to connect to the ssh.example.com server, and then places the .caidx file and the chunks on it. Note that this mode of operation is “smart”: this scheme will only upload chunks currently missing on the server side, and not retransmit what already is available.
Note that you can always configure the precise path or URL of the chunk store via the --store= option. If you do not do that, then the store path is automatically derived from the path or URL: the last component of the path or URL is replaced by default.castr.
Of course, when extracting .caidx or .caibx files from remote sources, using a local seed is advisable:
When creating chunk indexes on the file system layer casync will by default store metadata as accurately as possible. Let’s create a chunk index with reduced metadata:
$ casync make foobar.caidx --with=sec-time --with=symlinks --with=read-only /some/dir
This command will create a chunk index for a file tree serialization that has three features above the absolute baseline supported: 1s granularity timestamps, symbolic links and a single read-only bit. In this mode, all the other metadata bits are not stored, including nanosecond timestamps, full unix permission bits, file ownership or even ACLs or extended attributes.
Now let’s make a .caidx file available locally as a mounted file system, without extracting it:
$ casync mount http://example.comf/images/foobar.caidx /mnt/foobar
And similar, let’s make a .caibx file available locally as a block device:
This will create a block device in /dev and print the used device node path to STDOUT.
As mentioned, casync is big about reproducability. Let’s make use of that to calculate the a digest identifying a very specific version of a file tree:
$ casync digest .
This digest will include all metadata bits casync and the underlying file system know about. Usually, to make this useful you want to configure exactly what metadata to include:
$ casync digest --with=unix .
This makes use of the --with=unix shortcut for selecting metadata fields. Specifying --with-unix= selects all metadata that traditional UNIX file systems support. It is a shortcut for writing out: --with=16bit-uids --with=permissions --with=sec-time --with=symlinks --with=device-nodes --with=fifos --with=sockets.
Note that when calculating digests or creating chunk indexes you may also use the negative --without= option to remove specific features but start from the most precise:
$ casync digest --without=flag-immutable
This generates a digest with the most accurate metadata, but leaves one feature out: chattr(1)‘s immutable (+i) file flag.
To list the contents of a .caidx file use a command like the following:
$ casync list http://example.com/images/foobar.caidx
The former command will generate a brief list of files and directories, not too different from tar t or ls -al in its output. The latter command will generate a BSD mtree(5) compatible manifest. Note that casync actually stores substantially more file metadata than mtree files can express, though.
What casync isn’t
casync is not an attempt to minimize serialization and downloaded deltas to the extreme. Instead, the tool is supposed to find a good middle ground, that is good on traffic and disk space, but not at the price of convenience or requiring explicit revision control. If you care about updates that are absolutely minimal, there are binary delta systems around that might be an option for you, such as Google’s Courgette.
casync is not a replacement for rsync, or git or zsync or anything like that. They have very different usecases and semantics. For example, rsync permits you to directly synchronize two file trees remotely. casync just cannot do that, and it is unlikely it every will.
Where next?
casync is supposed to be a generic synchronization tool. Its primary focus for now is delivery of OS images, but I’d like to make it useful for a couple other usecases, too. Specifically:
To make the tool useful for backups, encryption is missing. I have pretty concrete plans how to add that. When implemented, the tool would might become an alternative to restic or tarsnap.
Right now, if you want to deploy casync in real-life, you still need to validate the downloaded .caidx or .caibx file yourself, for example with some gpg signature. It is my intention to integrate with gpg in a minimal way so that signing and verifying chunk index files is done automatically.
In the longer run, I’d like to build an automatic synchronizer for $HOME between systems from this. Each $HOME instance would be stored automatically in regular intervals in the cloud using casync, and conflicts would be resolved locally.
casync is written in a shared library style, but it is not yet built as one. Specifically this means that almost all of casync‘s functionality is supposed to be available as C API soon, and applications can process casync files on every level. It is my intention to make this library useful enough so that it will be easy to write a module for GNOME’s gvfs subsystem in order to make remote or local .caidx files directly available to applications (as an alternative to casync mount). In fact the idea is to make this all flexible enough that even the remoting backends can be replaced easily, for example to replace casync‘s default HTTP/HTTPS backends built on CURL with GNOME’s own HTTP implementation, in order to share cookies, certificates, … There’s also an alternative method to integrate with casync in place already: simply invoke casync as a subprocess. casync will inform you about a certain set of state changes using a mechanism compatible with sd_notify(3). In future it will also propagate progress data this way and more.
I intend to a add a new seeding back-end that sources chunks from the local network. After downloading the new .caidx file off the Internet casync would then search for the listed chunks on the local network first before retrieving them from the Internet. This should speed things up on all installations that have multiple similar systems deployed in the same network.
Further plans are listed tersely in the TODO file.
FAQ:
Is this a systemd project? — casync is hosted under the github systemd umbrella, and the projects share the same coding style. However, the codebases are distinct and without interdependencies, and casync works fine both on systemd systems and systems without it.
Is casync portable? — At the moment: no. I only run Linux and that’s what I code for. That said, I am open to accepting portability patches (unlike for systemd, which doesn’t really make sense on non-Linux systems), as long as they don’t interfere too much with the way casync works. Specifically this means that I am not too enthusiastic about merging portability patches for OSes lacking the openat(2) family of APIs.
Does casync require reflink-capable file systems to work, such as btrfs? No it doesn’t. The reflink magic in casync is employed when the file system permits it, and it’s good to have it, but it’s not a requirement, and casync will implicitly fall back to copying when it isn’t available. Note that casync supports a number of file system features on a variety of file systems that aren’t available everywhere, for example FAT’s system/hidden file flags or xfs‘s projinherit file flag.
Is casync stable? — I just tagged the first, initial release. While I have been working on it since quite some time and it is quite featureful, this is the first time I advertise it publicly, and it hence received very little testing outside of its own test suite. I am also not fully ready to commit to the stability of the current serialization or chunk index format. I don’t see any breakages coming for it though. casync is pretty light on documentation right now, and does not even have a man page. I also intend to correct that soon.
Are the .caidx/.caibx and .catar file formats open and documented? — casync is Open Source, so if you want to know the precise format, have a look at the sources for now. It’s definitely my intention to add comprehensive docs for both formats however. Don’t forget this is just the initial version right now.
casync is just like $SOMEOTHERTOOL! Why are you reinventing the wheel (again)? — Well, because casyncisn’t “just like” some other tool. I am pretty sure I did my homework, and that there is no tool just like casync right now. The tools coming closest are probably rsync, zsync, tarsnap, restic, but they are quite different beasts each.
Why did you invent your own serialization format for file trees? Why don’t you just use tar? That’s a good question, and other systems — most prominently tarsnap — do that. However, as mentioned above tar doesn’t enforce reproducability. It also doesn’t really do random access: if you want to access some specific file you need to read every single byte stored before it in the tar archive to find it, which is of course very expensive. The serialization casync implements places a focus on reproducability, random access, and metadata control. Much like traditional tar it can still be generated and extracted in a stream fashion though.
Does casync save/restore SELinux/SMACK file labels? At the moment not. That’s not because I wouldn’t want it to, but simply because I am not a guru of either of these systems, and didn’t want to implement something I do not fully grok nor can test. If you look at the sources you’ll find that there’s already some definitions in place that keep room for them though. I’d be delighted to accept a patch implementing this fully.
What about delivering squashfs images? How well does chunking work on compressed serializations? – That’s a very good point! Usually, if you apply the a chunking algorithm to a compressed data stream (let’s say a tar.gz file), then changing a single bit at the front will propagate into the entire remainder of the file, so that minimal changes will explode into major changes. Thankfully this doesn’t apply that strictly to squashfs images, as it provides random access to files and directories and thus breaks up the compression streams in regular intervals to make seeking easy. This fact is beneficial for systems employing chunking, such as casync as this means single bit changes might affect their vicinity but will not explode in an unbounded fashion. In order achieve best results when delivering squashfs images through casync the block sizes of squashfs and the chunks sizes of casync should be matched up (using casync‘s --chunk-size= option). How precisely to choose both values is left to reasearch by the user, for now.
What does the name casync mean? – It’s a synchronizing tool, hence the -sync suffix, following rsync‘s naming. It makes use of the content-addressable concept of git hence the ca- prefix.
Well, that’s up to you really. If you are involved with projects that need to deliver IoT, VM, container, application or OS images, then maybe this is a great tool for you — but other options exist, some of which are linked above.
Note that casync is an Open Source project: if it doesn’t do exactly what you need, prepare a patch that adds what you need, and we’ll consider it.
If you are interested in the project and would like to talk about this in person, I’ll be presenting casync soon at Kinvolk’s Linux Technologies Meetup in Berlin, Germany. You are invited. I also intend to talk about it at All Systems Go!, also in Berlin.
In the past months I have been working on a new project: casync. casync takes inspiration from the popular rsync file synchronization tool as well as the probably even more popular git revision control system. It combines the idea of the rsync algorithm with the idea of git-style content-addressable file systems, and creates a new system for efficiently storing and delivering file system images, optimized for high-frequency update cycles over the Internet. Its current focus is on delivering IoT, container, VM, application, portable service or OS images, but I hope to extend it later in a generic fashion to become useful for backups and home directory synchronization as well (but more about that later).
The basic technological building blocks casync is built from are neither new nor particularly innovative (at least not anymore), however the way casync combines them is different from existing tools, and that’s what makes it useful for a variety of use-cases that other tools can’t cover that well.
Why?
I created casync after studying how today’s popular tools store and deliver file system images. To briefly name a few: Docker has a layered tarball approach, OSTree serves the individual files directly via HTTP and maintains packed deltas to speed up updates, while other systems operate on the block layer and place raw squashfs images (or other archival file systems, such as IS09660) for download on HTTP shares (in the better cases combined with zsync data).
Neither of these approaches appeared fully convincing to me when used in high-frequency update cycle systems. In such systems, it is important to optimize towards a couple of goals:
Most importantly, make updates cheap traffic-wise (for this most tools use image deltas of some form)
Put boundaries on disk space usage on servers (keeping deltas between all version combinations clients might want to run updates between, would suggest keeping an exponentially growing amount of deltas on servers)
Put boundaries on disk space usage on clients
Be friendly to Content Delivery Networks (CDNs), i.e. serve neither too many small nor too many overly large files, and only require the most basic form of HTTP. Provide the repository administrator with high-level knobs to tune the average file size delivered.
Simplicity to use for users, repository administrators and developers
I don’t think any of the tools mentioned above are really good on more than a small subset of these points.
Specifically: Docker’s layered tarball approach dumps the “delta” question onto the feet of the image creators: the best way to make your image downloads minimal is basing your work on an existing image clients might already have, and inherit its resources, maintaining full history. Here, revision control (a tool for the developer) is intermingled with update management (a concept for optimizing production delivery). As container histories grow individual deltas are likely to stay small, but on the other hand a brand-new deployment usually requires downloading the full history onto the deployment system, even though there’s no use for it there, and likely requires substantially more disk space and download sizes.
OSTree’s serving of individual files is unfriendly to CDNs (as many small files in file trees cause an explosion of HTTP GET requests). To counter that OSTree supports placing pre-calculated delta images between selected revisions on the delivery servers, which means a certain amount of revision management, that leaks into the clients.
Delivering direct squashfs (or other file system) images is almost beautifully simple, but of course means every update requires a full download of the newest image, which is both bad for disk usage and generated traffic. Enhancing it with zsync makes this a much better option, as it can reduce generated traffic substantially at very little cost of history/meta-data (no explicit deltas between a large number of versions need to be prepared server side). On the other hand server requirements in disk space and functionality (HTTP Range requests) are minus points for the use-case I am interested in.
(Note: all the mentioned systems have great properties, and it’s not my intention to badmouth them. They only point I am trying to make is that for the use case I care about — file system image delivery with high high frequency update-cycles — each system comes with certain drawbacks.)
Security & Reproducibility
Besides the issues pointed out above I wasn’t happy with the security and reproducibility properties of these systems. In today’s world where security breaches involving hacking and breaking into connected systems happen every day, an image delivery system that cannot make strong guarantees regarding data integrity is out of date. Specifically, the tarball format is famously nondeterministic: the very same file tree can result in any number of different valid serializations depending on the tool used, its version and the underlying OS and file system. Some tar implementations attempt to correct that by guaranteeing that each file tree maps to exactly one valid serialization, but such a property is always only specific to the tool used. I strongly believe that any good update system must guarantee on every single link of the chain that there’s only one valid representation of the data to deliver, that can easily be verified.
What casync Is
So much about the background why I created casync. Now, let’s have a look what casync actually is like, and what it does. Here’s the brief technical overview:
Encoding: Let’s take a large linear data stream, split it into variable-sized chunks (the size of each being a function of the chunk’s contents), and store these chunks in individual, compressed files in some directory, each file named after a strong hash value of its contents, so that the hash value may be used to as key for retrieving the full chunk data. Let’s call this directory a “chunk store”. At the same time, generate a “chunk index” file that lists these chunk hash values plus their respective chunk sizes in a simple linear array. The chunking algorithm is supposed to create variable, but similarly sized chunks from the data stream, and do so in a way that the same data results in the same chunks even if placed at varying offsets. For more information see this blog story.
Decoding: Let’s take the chunk index file, and reassemble the large linear data stream by concatenating the uncompressed chunks retrieved from the chunk store, keyed by the listed chunk hash values.
As an extra twist, we introduce a well-defined, reproducible, random-access serialization format for file trees (think: a more modern tar), to permit efficient, stable storage of complete file trees in the system, simply by serializing them and then passing them into the encoding step explained above.
Finally, let’s put all this on the network: for each image you want to deliver, generate a chunk index file and place it on an HTTP server. Do the same with the chunk store, and share it between the various index files you intend to deliver.
Why bother with all of this? Streams with similar contents will result in mostly the same chunk files in the chunk store. This means it is very efficient to store many related versions of a data stream in the same chunk store, thus minimizing disk usage. Moreover, when transferring linear data streams chunks already known on the receiving side can be made use of, thus minimizing network traffic.
Why is this different from rsync or OSTree, or similar tools? Well, one major difference between casync and those tools is that we remove file boundaries before chunking things up. This means that small files are lumped together with their siblings and large files are chopped into pieces, which permits us to recognize similarities in files and directories beyond file boundaries, and makes sure our chunk sizes are pretty evenly distributed, without the file boundaries affecting them.
The “chunking” algorithm is based on a the buzhash rolling hash function. SHA256 is used as strong hash function to generate digests of the chunks. xz is used to compress the individual chunks.
Here’s a diagram, hopefully explaining a bit how the encoding process works, wasn’t it for my crappy drawing skills:
The diagram shows the encoding process from top to bottom. It starts with a block device or a file tree, which is then serialized and chunked up into variable sized blocks. The compressed chunks are then placed in the chunk store, while a chunk index file is written listing the chunk hashes in order. (The original SVG of this graphic may be found here.)
Details
Note that casync operates on two different layers, depending on the use-case of the user:
You may use it on the block layer. In this case the raw block data on disk is taken as-is, read directly from the block device, split into chunks as described above, compressed, stored and delivered.
You may use it on the file system layer. In this case, the file tree serialization format mentioned above comes into play: the file tree is serialized depth-first (much like tar would do it) and then split into chunks, compressed, stored and delivered.
The fact that it may be used on both the block and file system layer opens it up for a variety of different use-cases. In the VM and IoT ecosystems shipping images as block-level serializations is more common, while in the container and application world file-system-level serializations are more typically used.
Chunk index files referring to block-layer serializations carry the .caibx suffix, while chunk index files referring to file system serializations carry the .caidx suffix. Note that you may also use casync as direct tar replacement, i.e. without the chunking, just generating the plain linear file tree serialization. Such files carry the .catar suffix. Internally .caibx are identical to .caidx files, the only difference is semantical: .caidx files describe a .catar file, while .caibx files may describe any other blob. Finally, chunk stores are directories carrying the .castr suffix.
Features
Here are a couple of other features casync has:
When downloading a new image you may use casync‘s --seed= feature: each block device, file, or directory specified is processed using the same chunking logic described above, and is used as preferred source when putting together the downloaded image locally, avoiding network transfer of it. This of course is useful whenever updating an image: simply specify one or more old versions as seed and only download the chunks that truly changed since then. Note that using seeds requires no history relationship between seed and the new image to download. This has major benefits: you can even use it to speed up downloads of relatively foreign and unrelated data. For example, when downloading a container image built using Ubuntu you can use your Fedora host OS tree in /usr as seed, and casync will automatically use whatever it can from that tree, for example timezone and locale data that tends to be identical between distributions. Example: casync extract http://example.com/myimage.caibx --seed=/dev/sda1 /dev/sda2. This will place the block-layer image described by the indicated URL in the /dev/sda2 partition, using the existing /dev/sda1 data as seeding source. An invocation like this could be typically used by IoT systems with an A/B partition setup. Example 2: casync extract http://example.com/mycontainer-v3.caidx --seed=/srv/container-v1 --seed=/srv/container-v2 /src/container-v3, is very similar but operates on the file system layer, and uses two old container versions to seed the new version.
When operating on the file system level, the user has fine-grained control on the meta-data included in the serialization. This is relevant since different use-cases tend to require a different set of saved/restored meta-data. For example, when shipping OS images, file access bits/ACLs and ownership matter, while file modification times hurt. When doing personal backups OTOH file ownership matters little but file modification times are important. Moreover different backing file systems support different feature sets, and storing more information than necessary might make it impossible to validate a tree against an image if the meta-data cannot be replayed in full. Due to this, casync provides a set of --with= and --without= parameters that allow fine-grained control of the data stored in the file tree serialization, including the granularity of modification times and more. The precise set of selected meta-data features is also always part of the serialization, so that seeding can work correctly and automatically.
casync tries to be as accurate as possible when storing file system meta-data. This means that besides the usual baseline of file meta-data (file ownership and access bits), and more advanced features (extended attributes, ACLs, file capabilities) a number of more exotic data is stored as well, including Linux chattr(1) file attributes, as well as FAT file attributes (you may wonder why the latter? — EFI is FAT, and /efi is part of the comprehensive serialization of any host). In the future I intend to extend this further, for example storing btrfs sub-volume information where available. Note that as described above every single type of meta-data may be turned off and on individually, hence if you don’t need FAT file bits (and I figure it’s pretty likely you don’t), then they won’t be stored.
The user creating .caidx or .caibx files may control the desired average chunk length (before compression) freely, using the --chunk-size= parameter. Smaller chunks increase the number of generated files in the chunk store and increase HTTP GET load on the server, but also ensure that sharing between similar images is improved, as identical patterns in the images stored are more likely to be recognized. By default casync will use a 64K average chunk size. Tweaking this can be particularly useful when adapting the system to specific CDNs, or when delivering compressed disk images such as squashfs (see below).
Emphasis is placed on making all invocations reproducible, well-defined and strictly deterministic. As mentioned above this is a requirement to reach the intended security guarantees, but is also useful for many other use-cases. For example, the casync digest command may be used to calculate a hash value identifying a specific directory in all desired detail (use --with= and --without to pick the desired detail). Moreover the casync mtree command may be used to generate a BSD mtree(5) compatible manifest of a directory tree, .caidx or .catar file.
The file system serialization format is nicely composable. By this I mean that the serialization of a file tree is the concatenation of the serializations of all files and file sub-trees located at the top of the tree, with zero meta-data references from any of these serializations into the others. This property is essential to ensure maximum reuse of chunks when similar trees are serialized.
When extracting file trees or disk image files, casync will automatically create reflinks from any specified seeds if the underlying file system supports it (such as btrfs, ocfs, and future xfs). After all, instead of copying the desired data from the seed, we can just tell the file system to link up the relevant blocks. This works both when extracting .caidx and .caibx files — the latter of course only when the extracted disk image is placed in a regular raw image file on disk, rather than directly on a plain block device, as plain block devices do not know the concept of reflinks.
Optionally, when extracting file trees, casync can create traditional UNIX hard-links for identical files in specified seeds (--hardlink=yes). This works on all UNIX file systems, and can save substantial amounts of disk space. However, this only works for very specific use-cases where disk images are considered read-only after extraction, as any changes made to one tree will propagate to all other trees sharing the same hard-linked files, as that’s the nature of hard-links. In this mode, casync exposes OSTree-like behavior, which is built heavily around read-only hard-link trees.
casync tries to be smart when choosing what to include in file system images. Implicitly, file systems such as procfs and sysfs are excluded from serialization, as they expose API objects, not real files. Moreover, the “nodump” (+d) chattr(1) flag is honored by default, permitting users to mark files to exclude from serialization.
When creating and extracting file trees casync may apply an automatic or explicit UID/GID shift. This is particularly useful when transferring container image for use with Linux user name-spacing.
In addition to local operation, casync currently supports HTTP, HTTPS, FTP and ssh natively for downloading chunk index files and chunks (the ssh mode requires installing casync on the remote host, though, but an sftp mode not requiring that should be easy to add). When creating index files or chunks, only ssh is supported as remote back-end.
When operating on block-layer images, you may expose locally or remotely stored images as local block devices. Example: casync mkdev http://example.com/myimage.caibx exposes the disk image described by the indicated URL as local block device in /dev, which you then may use the usual block device tools on, such as mount or fdisk (only read-only though). Chunks are downloaded on access with high priority, and at low priority when idle in the background. Note that in this mode, casync also plays a role similar to “dm-verity”, as all blocks are validated against the strong digests in the chunk index file before passing them on to the kernel’s block layer. This feature is implemented though Linux’ NBD kernel facility.
Similar, when operating on file-system-layer images, you may mount locally or remotely stored images as regular file systems. Example: casync mount http://example.com/mytree.caidx /srv/mytree mounts the file tree image described by the indicated URL as a local directory /srv/mytree. This feature is implemented though Linux’ FUSE kernel facility. Note that special care is taken that the images exposed this way can be packed up again with casync make and are guaranteed to return the bit-by-bit exact same serialization again that it was mounted from. No data is lost or changed while passing things through FUSE (OK, strictly speaking this is a lie, we do lose ACLs, but that’s hopefully just a temporary gap to be fixed soon).
In IoT A/B fixed size partition setups the file systems placed in the two partitions are usually much shorter than the partition size, in order to keep some room for later, larger updates. casync is able to analyze the super-block of a number of common file systems in order to determine the actual size of a file system stored on a block device, so that writing a file system to such a partition and reading it back again will result in reproducible data. Moreover this speeds up the seeding process, as there’s little point in seeding the white-space after the file system within the partition.
Example Command Lines
Here’s how to use casync, explained with a few examples:
$ casync make foobar.caidx /some/directory
This will create a chunk index file foobar.caidx in the local directory, and populate the chunk store directory default.castr located next to it with the chunks of the serialization (you can change the name for the store directory with --store= if you like). This command operates on the file-system level. A similar command operating on the block level:
$ casync make foobar.caibx /dev/sda1
This command creates a chunk index file foobar.caibx in the local directory describing the current contents of the /dev/sda1 block device, and populates default.castr in the same way as above. Note that you may as well read a raw disk image from a file instead of a block device:
$ casync make foobar.caibx myimage.raw
To reconstruct the original file tree from the .caidx file and the chunk store of the first command, use:
This extracts the specified .caidx onto a local directory. This of course assumes that foobar.caidx was uploaded to the HTTP server in the first place, along with the chunk store. You can use any command you like to accomplish that, for example scp or rsync. Alternatively, you can let casync do this directly when generating the chunk index:
$ casync make ssh.example.com:images/foobar.caidx /some/directory
This will use ssh to connect to the ssh.example.com server, and then places the .caidx file and the chunks on it. Note that this mode of operation is “smart”: this scheme will only upload chunks currently missing on the server side, and not re-transmit what already is available.
Note that you can always configure the precise path or URL of the chunk store via the --store= option. If you do not do that, then the store path is automatically derived from the path or URL: the last component of the path or URL is replaced by default.castr.
Of course, when extracting .caidx or .caibx files from remote sources, using a local seed is advisable:
When creating chunk indexes on the file system layer casync will by default store meta-data as accurately as possible. Let’s create a chunk index with reduced meta-data:
$ casync make foobar.caidx --with=sec-time --with=symlinks --with=read-only /some/dir
This command will create a chunk index for a file tree serialization that has three features above the absolute baseline supported: 1s granularity time-stamps, symbolic links and a single read-only bit. In this mode, all the other meta-data bits are not stored, including nanosecond time-stamps, full UNIX permission bits, file ownership or even ACLs or extended attributes.
Now let’s make a .caidx file available locally as a mounted file system, without extracting it:
$ casync mount http://example.comf/images/foobar.caidx /mnt/foobar
And similar, let’s make a .caibx file available locally as a block device:
This will create a block device in /dev and print the used device node path to STDOUT.
As mentioned, casync is big about reproducibility. Let’s make use of that to calculate the a digest identifying a very specific version of a file tree:
$ casync digest .
This digest will include all meta-data bits casync and the underlying file system know about. Usually, to make this useful you want to configure exactly what meta-data to include:
$ casync digest --with=unix .
This makes use of the --with=unix shortcut for selecting meta-data fields. Specifying --with-unix= selects all meta-data that traditional UNIX file systems support. It is a shortcut for writing out: --with=16bit-uids --with=permissions --with=sec-time --with=symlinks --with=device-nodes --with=fifos --with=sockets.
Note that when calculating digests or creating chunk indexes you may also use the negative --without= option to remove specific features but start from the most precise:
$ casync digest --without=flag-immutable
This generates a digest with the most accurate meta-data, but leaves one feature out: chattr(1)‘s immutable (+i) file flag.
To list the contents of a .caidx file use a command like the following:
$ casync list http://example.com/images/foobar.caidx
The former command will generate a brief list of files and directories, not too different from tar t or ls -al in its output. The latter command will generate a BSD mtree(5) compatible manifest. Note that casync actually stores substantially more file meta-data than mtree files can express, though.
What casync isn’t
casync is not an attempt to minimize serialization and downloaded deltas to the extreme. Instead, the tool is supposed to find a good middle ground, that is good on traffic and disk space, but not at the price of convenience or requiring explicit revision control. If you care about updates that are absolutely minimal, there are binary delta systems around that might be an option for you, such as Google’s Courgette.
casync is not a replacement for rsync, or git or zsync or anything like that. They have very different use-cases and semantics. For example, rsync permits you to directly synchronize two file trees remotely. casync just cannot do that, and it is unlikely it every will.
Where next?
casync is supposed to be a generic synchronization tool. Its primary focus for now is delivery of OS images, but I’d like to make it useful for a couple other use-cases, too. Specifically:
To make the tool useful for backups, encryption is missing. I have pretty concrete plans how to add that. When implemented, the tool might become an alternative to restic, BorgBackup or tarsnap.
Right now, if you want to deploy casync in real-life, you still need to validate the downloaded .caidx or .caibx file yourself, for example with some gpg signature. It is my intention to integrate with gpg in a minimal way so that signing and verifying chunk index files is done automatically.
In the longer run, I’d like to build an automatic synchronizer for $HOME between systems from this. Each $HOME instance would be stored automatically in regular intervals in the cloud using casync, and conflicts would be resolved locally.
casync is written in a shared library style, but it is not yet built as one. Specifically this means that almost all of casync‘s functionality is supposed to be available as C API soon, and applications can process casync files on every level. It is my intention to make this library useful enough so that it will be easy to write a module for GNOME’s gvfs subsystem in order to make remote or local .caidx files directly available to applications (as an alternative to casync mount). In fact the idea is to make this all flexible enough that even the remoting back-ends can be replaced easily, for example to replace casync‘s default HTTP/HTTPS back-ends built on CURL with GNOME’s own HTTP implementation, in order to share cookies, certificates, … There’s also an alternative method to integrate with casync in place already: simply invoke casync as a sub-process. casync will inform you about a certain set of state changes using a mechanism compatible with sd_notify(3). In future it will also propagate progress data this way and more.
I intend to a add a new seeding back-end that sources chunks from the local network. After downloading the new .caidx file off the Internet casync would then search for the listed chunks on the local network first before retrieving them from the Internet. This should speed things up on all installations that have multiple similar systems deployed in the same network.
Further plans are listed tersely in the TODO file.
FAQ:
Is this a systemd project? — casync is hosted under the github systemd umbrella, and the projects share the same coding style. However, the code-bases are distinct and without interdependencies, and casync works fine both on systemd systems and systems without it.
Is casync portable? — At the moment: no. I only run Linux and that’s what I code for. That said, I am open to accepting portability patches (unlike for systemd, which doesn’t really make sense on non-Linux systems), as long as they don’t interfere too much with the way casync works. Specifically this means that I am not too enthusiastic about merging portability patches for OSes lacking the openat(2) family of APIs.
Does casync require reflink-capable file systems to work, such as btrfs? — No it doesn’t. The reflink magic in casync is employed when the file system permits it, and it’s good to have it, but it’s not a requirement, and casync will implicitly fall back to copying when it isn’t available. Note that casync supports a number of file system features on a variety of file systems that aren’t available everywhere, for example FAT’s system/hidden file flags or xfs‘s projinherit file flag.
Is casync stable? — I just tagged the first, initial release. While I have been working on it since quite some time and it is quite featureful, this is the first time I advertise it publicly, and it hence received very little testing outside of its own test suite. I am also not fully ready to commit to the stability of the current serialization or chunk index format. I don’t see any breakages coming for it though. casync is pretty light on documentation right now, and does not even have a man page. I also intend to correct that soon.
Are the .caidx/.caibx and .catar file formats open and documented? — casync is Open Source, so if you want to know the precise format, have a look at the sources for now. It’s definitely my intention to add comprehensive docs for both formats however. Don’t forget this is just the initial version right now.
casync is just like $SOMEOTHERTOOL! Why are you reinventing the wheel (again)? — Well, because casyncisn’t “just like” some other tool. I am pretty sure I did my homework, and that there is no tool just like casync right now. The tools coming closest are probably rsync, zsync, tarsnap, restic, but they are quite different beasts each.
Why did you invent your own serialization format for file trees? Why don’t you just use tar? — That’s a good question, and other systems — most prominently tarsnap — do that. However, as mentioned above tar doesn’t enforce reproducibility. It also doesn’t really do random access: if you want to access some specific file you need to read every single byte stored before it in the tar archive to find it, which is of course very expensive. The serialization casync implements places a focus on reproducibility, random access, and meta-data control. Much like traditional tar it can still be generated and extracted in a stream fashion though.
Does casync save/restore SELinux/SMACK file labels? — At the moment not. That’s not because I wouldn’t want it to, but simply because I am not a guru of either of these systems, and didn’t want to implement something I do not fully grok nor can test. If you look at the sources you’ll find that there’s already some definitions in place that keep room for them though. I’d be delighted to accept a patch implementing this fully.
What about delivering squashfs images? How well does chunking work on compressed serializations? – That’s a very good point! Usually, if you apply the a chunking algorithm to a compressed data stream (let’s say a tar.gz file), then changing a single bit at the front will propagate into the entire remainder of the file, so that minimal changes will explode into major changes. Thankfully this doesn’t apply that strictly to squashfs images, as it provides random access to files and directories and thus breaks up the compression streams in regular intervals to make seeking easy. This fact is beneficial for systems employing chunking, such as casync as this means single bit changes might affect their vicinity but will not explode in an unbounded fashion. In order achieve best results when delivering squashfs images through casync the block sizes of squashfs and the chunks sizes of casync should be matched up (using casync‘s --chunk-size= option). How precisely to choose both values is left a research subject for the user, for now.
What does the name casync mean? – It’s a synchronizing tool, hence the -sync suffix, following rsync‘s naming. It makes use of the content-addressable concept of git hence the ca- prefix.
Where can I get this stuff? Is it already packaged? – Check out the sources on GitHub. I just tagged the first version. Martin Pitt has packaged casync for Ubuntu. There is also an ArchLinux package. Zbigniew Jędrzejewski-Szmek has prepared a Fedora RPM that hopefully will soon be included in the distribution.
Should you care? Is this a tool for you?
Well, that’s up to you really. If you are involved with projects that need to deliver IoT, VM, container, application or OS images, then maybe this is a great tool for you — but other options exist, some of which are linked above.
Note that casync is an Open Source project: if it doesn’t do exactly what you need, prepare a patch that adds what you need, and we’ll consider it.
If you are interested in the project and would like to talk about this in person, I’ll be presenting casync soon at Kinvolk’s Linux Technologies Meetup in Berlin, Germany. You are invited. I also intend to talk about it at All Systems Go!, also in Berlin.
My colleague Mia Champion is a scientist (check out her publications), an AWS Certified Solutions Architect, and an AWS Certified Developer. The time that she spent doing research on large-data datasets gave her an appreciation for the value of cloud computing in the bioinformatics space, which she summarizes and explains in the guest post below!
Technological advances in scientific research continue to enable the collection of exponentially growing datasets that are also increasing in the complexity of their content. The global pace of innovation is now also fueled by the recent cloud-computing revolution, which provides researchers with a seemingly boundless scalable and agile infrastructure. Now, researchers can remove the hindrances of having to own and maintain their own sequencers, microscopes, compute clusters, and more. Using the cloud, scientists can easily store, manage, process and share datasets for millions of patient samples with gigabytes and more of data for each individual. As American physicist, John Bardeen once said: “Science is a collaborative effort. The combined results of several people working together is much more effective than could be that of an individual scientist working alone”.
Prioritizing Reproducible Innovation, Democratization, and Data Protection Today, we have many individual researchers and organizations leveraging secure cloud enabled data sharing on an unprecedented scale and producing innovative, customized analytical solutions using the AWS cloud. But, can secure data sharing and analytics be done on such a collaborative scale as to revolutionize the way science is done across a domain of interest or even across discipline/s of science? Can building a cloud-enabled consortium of resources remove the analytical variability that leads to diminished reproducibility, which has long plagued the interpretability and impact of research discoveries? The answers to these questions are ‘yes’ and initiatives such as the Neuro Cloud Consortium, The Global Alliance for Genomics and Health (GA4GH), and The Sage Bionetworks Synapse platform, which powers many research consortiums including the DREAM challenges, are starting to put into practice model cloud-initiatives that will not only provide impactful discoveries in the areas of neuroscience, infectious disease, and cancer, but are also revolutionizing the way in which scientific research is done.
Bringing Crowd Developed Models, Algorithms, and Functions to the Data Collaborative projects have traditionally allowed investigators to download datasets such as those used for comparative sequence analysis or for training a deep learning algorithm on medical imaging data. Investigators were then able to develop and execute their analysis using institutional clusters, local workstations, or even laptops:
This method of collaboration is problematic for many reasons. The first concern is data security, since dataset download essentially permits “chain-data-sharing” with any number of recipients. Second, analytics done using compute environments that are not templated at some level introduces the risk of variable analytics that itself is not reproducible by a different investigator, or even the same investigator using a different compute environment. Third, the required data dump, processing, and then re-upload or distribution to the collaborative group is highly inefficient and dependent upon each individual’s networking and compute capabilities. Overall, traditional methods of scientific collaboration have introduced methods in which security is compromised and time to discovery is hampered.
Using the AWS cloud, collaborative researchers can share datasets easily and securely by taking advantage of Identity and Access Management (IAM) policy restrictions for user bucket access as well as S3 bucket policies or Access Control Lists (ACLs). To streamline analysis and ensure data security, many researchers are eliminating the necessity to download datasets entirely by leveraging resources that facilitate moving the analytics to the data source and/or taking advantage of remote API requests to access a shared database or data lake. One way our customers are accomplishing this is to leverage container based Docker technology to provide collaborators with a way to submit algorithms or models for execution on the system hosting the shared datasets:
Docker container images have all of the application’s dependencies bundled together, and therefore provide a high degree of versatility and portability, which is a significant advantage over using other executable-based approaches. In the case of collaborative machine learning projects, each docker container will contain applications, language runtime, packages and libraries, as well as any of the more popular deep learning frameworks commonly used by researchers including: MXNet, Caffe, TensorFlow, and Theano.
A common feature in these frameworks is the ability to leverage a host machine’s Graphical Processing Units (GPUs) for significant acceleration of the matrix and vector operations involved in the machine learning computations. As such, researchers with these objectives can leverage EC2’s new P2 instance types in order to power execution of submitted machine learning models. In addition, GPUs can be mounted directly to containers using the NVIDIA Docker tool and appear at the system level as additional devices. By leveraging Amazon EC2 Container Service and the EC2 Container Registry, collaborators are able to execute analytical solutions submitted to the project repository by their colleagues in a reproducible fashion as well as continue to build on their existing environment. Researchers can also architect a continuous deployment pipeline to run their docker-enabled workflows.
In conclusion, emerging cloud-enabled consortium initiatives serve as models for the broader research community for how cloud-enabled community science can expedite discoveries in Precision Medicine while also providing a platform where data security and discovery reproducibility is inherent to the project execution.
To help you secure your AWS resources, we recommend that you adopt a layered approach that includes the use of preventative and detective controls. For example, incorporating host-based controls for your Amazon EC2 instances can restrict access and provide appropriate levels of visibility into system behaviors and access patterns. These controls often include a host-based intrusion detection system (HIDS) that monitors and analyzes network traffic, log files, and file access on a host. A HIDS typically integrates with alerting and automated remediation solutions to detect and address attacks, unauthorized or suspicious activities, and general errors in your environment.
In this blog post, I show how you can use Amazon CloudWatch Logs to collect and aggregate alerts from an open-source security (OSSEC) HIDS. I use a CloudWatch Logs subscription to deliver the alerts to Amazon Elasticsearch Service (Amazon ES) for analysis and visualization with Kibana – a popular open-source visualization tool. To make it easier for you to see this solution in action, I provide a CloudFormation template to handle most of the deployment work. You can use this solution to gain improved visibility and insights across your EC2 fleet and help drive security remediation activities. For example, if specific hosts are scanning your EC2 instances and triggering OSSEC alerts, you can implement a VPC network access control list (ACL) or AWS WAF rule to block those source IP addresses or CIDR blocks.
Solution overview
The following diagram depicts a high-level overview of this post’s solution.
Here is how the solution works:
On the target EC2 instances, the OSSEC HIDS generates alerts that the CloudWatch Logs agent captures. The HIDS performs log analysis, integrity checking, Windows registry monitoring, rootkit detection, real-time alerting, and active response. For more information, see Getting started with OSSEC.
The CloudWatch Logs group receives the alerts as events.
A CloudWatch Logs subscription is applied to the target log group to forward the events through AWS Lambda to Amazon ES.
Amazon ES loads the logged alert data.
Kibana visualizes the alerts in near-real time. Amazon ES provides a default installation of Kibana with every Amazon ES domain.
Deployment considerations
For the purposes of this post, the primary OSSEC HIDS deployment consists of a Linux-based installation for which the alerts are generated locally within each system. Note that this solution depends on Amazon ES and Lambda in the target region for deployment. You can find the latest information about AWS service availability in the Region table. You also must identify an Amazon Virtual Private Cloud (VPC) subnet that has Internet access and DNS resolution for your EC2 instances to provision the required components properly.
To simplify the deployment process, I created a test environment AWS CloudFormation template. You can use this template to provision a test environment stack automatically into an existing Amazon VPC subnet. You will use CloudFormation to provision the core components of this solution and then configure Kibana for alert analysis. The source code for this solution is available on GitHub.
This post’s template performs the following high-level steps in the region you choose:
Creates two EC2 instances running Amazon Linux with an AWS Identity and Access Management (IAM) role for CloudWatch Logs access. Note: To provide sample HIDS alert data, the two EC2 instances are configured automatically to generate simulated HIDS alerts locally.
Installs and configures OSSEC, the CloudWatch Logs agent, and additional packages used for the test environment.
Creates the target HIDS Amazon ES domain.
Creates the target HIDS CloudWatch Logs group.
Creates the Lambda function and CloudWatch Logs subscription to send HIDS alerts to Amazon ES.
After the CloudFormation stack has been deployed, you can access the Kibana instance on the Amazon ES domain to complete the final steps of the setup for the test environment, which I show later in the post.
Although out of scope for this blog post, when deploying OSSEC into your existing EC2 environment, you should determine the desired configuration, including target log files for monitoring, directories for integrity checking, and active response. This typically also requires time for testing and tuning of the system to optimize it for your environment. The OSSEC documentation is a good place to start to familiarize yourself with this process. You could take another approach to OSSEC deployment, which involves an agent installation and a separate OSSEC manager to process events centrally before exporting them to CloudWatch Logs. This deployment requires an additional server component and network communication between the agent and the manager. Note that although Windows Server is supported by OSSEC, it requires an agent-based installation and therefore requires an OSSEC manager to be present. Review OSSEC Architecture for additional information about OSSEC architecture and deployment options.
Deploy the solution
This solution’s high-level steps are:
Launch the CloudFormation stack.
Configure a Kibana index pattern and begin exploring alerts.
Configure a Kibana HIDS dashboard and visualize alerts.
1. Launch the CloudFormation stack
You will launch your test environment by using a CloudFormation template that automates the provisioning process. For the following input parameters, you must identify a target VPC and subnet (which requires Internet access) for deployment. If the target subnet uses an Internet gateway, set the AssignPublicIP parameter to true. If the target subnet uses a NAT gateway, you can leave the default setting of AssignPublicIP as false.
First, you will need to stage the Lambda function deployment package in an S3 bucket located in the region into which you are deploying. To do this, download the zipped deployment package and upload it to your in-region bucket. For additional information about uploading objects to S3, see Uploading Object into Amazon S3.
You also must provide a trusted source IP address or CIDR block for access to the environment following the creation of the stack and an EC2 key pair to associate with the instances. For information about creating an EC2 key pair, see Creating a Key Pair Using Amazon EC2. Note that the trusted IP address or CIDR block also is used to create the Amazon ES access policy automatically for Kibana access. We recommend that you use a specific IP address or CIDR range rather than using 0.0.0.0/0, which would allow all IPv4 addresses to access your instances. For more information about authorizing inbound traffic to your instances, see Authorizing Inbound Traffic for Your Linux Instances.
After you have confirmed the input parameters (see the following screenshot and table for more details), create the CloudFormation stack.
Input parameter
Input parameter description
1. HIDSInstanceSize
EC2 instance size for test server
2. ESInstanceSize
Amazon ES instance size
3. MyKeyPair
A public/private key pair that allows you to connect securely to your instance after it launches
A SubnetId with outbound connectivity within the VPC you selected (requires Internet access)
8. AssignPublicIP
Set to true if your subnet is configured to connect through an Internet gateway; set to false if your subnet is configured to connect through a NAT gateway
9. MyTrustedNetwork
Your trusted source IP or CIDR block that is used to whitelist access to the EC2 instances and the Amazon ES endpoint
To finish creating the CloudFormation stack:
Enter the input parameters and choose Next.
On the Options page, accept the defaults and choose Next.
On the Review page, confirm the details, select the I acknowledge thatAWS CloudFormation might create IAM resources check box, and then choose Create. (The stack will be created in approximately 10 minutes.)
After the stack has been created, note the HIDSESKibanaURL on the CloudFormation Outputs tab. Then, proceed to the Kibana configuration instructions in the next section.
2. Configure a Kibana index pattern and begin exploring alerts
In this section, you perform the initial setup of Kibana. To access Kibana, find the HIDSESKibanaURL in the CloudFormation stack outputs (see the previous section) and choose it. This will bring you to the Kibana instance, which is automatically provisioned to your Amazon ES instance. The source IP you provided in the CloudFormation input parameters is used to automatically populate the Amazon ES access policy. If you receive an error similar to the following error, you must confirm that your Amazon ES access policy is correct.
{"Message":"User: anonymous is not authorized to perform: es:ESHttpGet on resource: hids-alerts"}
The OSSEC HIDS alerts now are being processed into Amazon ES. To use Kibana to analyze the alert data interactively, you must configure an index pattern that identifies the data you wish to analyze in Amazon ES. You can read additional information about index patterns in the Kibana documentation.
In the Index name or pattern box, type cwl-2017.*. The index pattern is generated within the Lambda function as cwl-YYYY.MM.DD, so you can use a wildcard character for the month and day to match data from 2017. From the Time-field name drop-down list, choose @timestamp, and then choose Create.
In Kibana, you should now be able to choose the Discover pane and see alerts being populated. To set the refresh rate for the display of near-real-time alerts, choose your desired time range in the top right (such as Last 15 minutes).
Choose Auto-refresh, and then choose an interval, such as 5 seconds.
Kibana should now be configured to auto-refresh at a 5-second interval within the timeframe you configured. You should now see your alerts updating along with a count graph, as shown in the following screenshot.
The EC2 instances are automatically configured by CloudFormation to simulate activity to display several types of alerts, including:
Successful sudo to ROOT executed – The Linux sudo command was successfully executed.
Web server 400 error code – The server cannot process the request due to an apparent client error (such as malformed request syntax, too large size, invalid request message framing, or deceptive request routing).
SSH insecure connection attempt (scan) – Invalid connection attempt to the SSH listener.
Login session opened – Opened login session on the system.
Login session closed – Closed login session on the system.
New Yum package installed – Package installed on the system.
Yum package deleted – Package deleted from the system.
Let’s take a closer look at some of the alert fields, as shown in the following screenshot.
The numbered alert fields in the preceding screenshot are defined as follows:
@log_group – The source CloudWatch Logs group
@log_stream – The CloudWatch Logs stream name (InstanceID)
@message – The JSON payload from the source alerts.json OSSEC log
@owner – The AWS account ID where the alert originated
@timestamp – The time stamp applied by the consumer Lambda function
full_log – The log event from the source file
location – The source log file path and file name
rule.comment – A brief description of the OSSEC rule that was matched
rule.level – The OSSEC rule classification from 0 to 16 (see Rules Classification for more information)
rule.sidid – The rule ID of the OSSEC rule that was matched
srcip – The source IP address that triggered the alert; in this case, the simulated alerts contain the local IP of the server
You can enter search criteria in the Kibana query bar to explore HIDS alert data interactively. For example, you can run the following query to see all the rule.level 6 alerts for the EC2 InstanceID i-0e427a8594852eca2 where the source IP is 10.10.10.10.
“rule.level: 6 AND @log_stream: "i-0e427a8594852eca2" AND srcip: 10.10.10.10”
You can perform searches including simple text, Lucene query syntax, or use the full JSON-based Elasticsearch Query DSL. You can find additional information on searching your data in the Elasticsearch documentation.
3. Configure a Kibana HIDS dashboard and visualize alerts
To analyze alert trends and patterns over time, it can be helpful to use charts and graphs to represent the alert data. I have configured a basic dashboard template that you can import into your Kibana instance.
Save the template locally and then choose Management in the Kibana navigation pane.
Choose Saved Objects, Import, and the HIDS dashboard template.
Choose the eye icon to the right of the HIDS Alerts dashboard entry. This will take you to the imported dashboard.
After importing the Kibana dashboard template and selecting it, you will see the HIDS dashboard, as shown in the following screenshot. This sample HIDS dashboard includes Alerts Over Time, Top 20 Alert Types, Rule Level Breakdown, Top 10 Rule Source ID, and Top 10 Source IPs.
To explore the alert data in more detail, you can choose an alert type on which to filter, as shown in the following two screenshots.
You can see more details about the alerts based on criteria such as source IP address or time range. For more information about using Kibana to visualize alert data, see the Kibana User Guide.
Summary
In this blog post, I showed how to use CloudWatch Logs to collect alerts in near-real time from an OSSEC HIDS and use a CloudWatch Logs subscription to pass the alerts into Amazon ES for analysis and visualization with Kibana. The dashboard deployed by this solution can help you improve the security monitoring of your EC2 fleet as part of a defense-in-depth security strategy in your AWS environment.
You can use this solution to help detect attacks, anomalous activities, and error trends across your EC2 fleet. You can also use it to help prioritize remediation efforts for your systems or help determine where to introduce additional security controls such as VPC security group rules, VPC network ACLs, or AWS WAF rules.
If you have comments about this post, add them to the “Comments” section below. If you have questions about or issues implementing this solution, start a new thread on the CloudWatch or Amazon ES forum. The source code for this solution is available on GitHub. If you need OSSEC-specific support, see OSSEC Support Options.
In case you missed any AWS Security Blog posts published so far in 2017, they are summarized and linked to below. The posts are shown in reverse chronological order (most recent first), and the subject matter ranges from protecting dynamic web applications against DDoS attacks to monitoring AWS account configuration changes and API calls to Amazon EC2 security groups.
March
March 22:How to Help Protect Dynamic Web Applications Against DDoS Attacks by Using Amazon CloudFront and Amazon Route 53 Using a content delivery network (CDN) such as Amazon CloudFront to cache and serve static text and images or downloadable objects such as media files and documents is a common strategy to improve webpage load times, reduce network bandwidth costs, lessen the load on web servers, and mitigate distributed denial of service (DDoS) attacks. AWS WAF is a web application firewall that can be deployed on CloudFront to help protect your application against DDoS attacks by giving you control over which traffic to allow or block by defining security rules. When users access your application, the Domain Name System (DNS) translates human-readable domain names (for example, www.example.com) to machine-readable IP addresses (for example, 192.0.2.44). A DNS service, such as Amazon Route 53, can effectively connect users’ requests to a CloudFront distribution that proxies requests for dynamic content to the infrastructure hosting your application’s endpoints. In this blog post, I show you how to deploy CloudFront with AWS WAF and Route 53 to help protect dynamic web applications (with dynamic content such as a response to user input) against DDoS attacks. The steps shown in this post are key to implementing the overall approach described in AWS Best Practices for DDoS Resiliency and enable the built-in, managed DDoS protection service, AWS Shield.
March 21:New AWS Encryption SDK for Python Simplifies Multiple Master Key Encryption The AWS Cryptography team is happy to announce a Python implementation of the AWS Encryption SDK. This new SDK helps manage data keys for you, and it simplifies the process of encrypting data under multiple master keys. As a result, this new SDK allows you to focus on the code that drives your business forward. It also provides a framework you can easily extend to ensure that you have a cryptographic library that is configured to match and enforce your standards. The SDK also includes ready-to-use examples. If you are a Java developer, you can refer to this blog post to see specific Java examples for the SDK. In this blog post, I show you how you can use the AWS Encryption SDK to simplify the process of encrypting data and how to protect your encryption keys in ways that help improve application availability by not tying you to a single region or key management solution.
March 21:Updated CJIS Workbook Now Available by Request The need for guidance when implementing Criminal Justice Information Services (CJIS)–compliant solutions has become of paramount importance as more law enforcement customers and technology partners move to store and process criminal justice data in the cloud. AWS services allow these customers to easily and securely architect a CJIS-compliant solution when handling criminal justice data, creating a durable, cost-effective, and secure IT infrastructure that better supports local, state, and federal law enforcement in carrying out their public safety missions. AWS has created several documents (collectively referred to as the CJIS Workbook) to assist you in aligning with the FBI’s CJIS Security Policy. You can use the workbook as a framework for developing CJIS-compliant architecture in the AWS Cloud. The workbook helps you define and test the controls you operate, and document the dependence on the controls that AWS operates (compute, storage, database, networking, regions, Availability Zones, and edge locations).
March 9:New Cloud Directory API Makes It Easier to Query Data Along Multiple Dimensions Today, we made available a new Cloud Directory API, ListObjectParentPaths, that enables you to retrieve all available parent paths for any directory object across multiple hierarchies. Use this API when you want to fetch all parent objects for a specific child object. The order of the paths and objects returned is consistent across iterative calls to the API, unless objects are moved or deleted. In case an object has multiple parents, the API allows you to control the number of paths returned by using a paginated call pattern. In this blog post, I use an example directory to demonstrate how this new API enables you to retrieve data across multiple dimensions to implement powerful applications quickly.
March 8:How to Access the AWS Management Console Using AWS Microsoft AD and Your On-Premises Credentials AWS Directory Service for Microsoft Active Directory, also known as AWS Microsoft AD, is a managed Microsoft Active Directory (AD) hosted in the AWS Cloud. Now, AWS Microsoft AD makes it easy for you to give your users permission to manage AWS resources by using on-premises AD administrative tools. With AWS Microsoft AD, you can grant your on-premises users permissions to resources such as the AWS Management Console instead of adding AWS Identity and Access Management (IAM) user accounts or configuring AD Federation Services (AD FS) with Security Assertion Markup Language (SAML). In this blog post, I show how to use AWS Microsoft AD to enable your on-premises AD users to sign in to the AWS Management Console with their on-premises AD user credentials to access and manage AWS resources through IAM roles.
March 7:How to Protect Your Web Application Against DDoS Attacks by Using Amazon Route 53 and an External Content Delivery Network Distributed Denial of Service (DDoS) attacks are attempts by a malicious actor to flood a network, system, or application with more traffic, connections, or requests than it is able to handle. To protect your web application against DDoS attacks, you can use AWS Shield, a DDoS protection service that AWS provides automatically to all AWS customers at no additional charge. You can use AWS Shield in conjunction with DDoS-resilient web services such as Amazon CloudFront and Amazon Route 53 to improve your ability to defend against DDoS attacks. Learn more about architecting for DDoS resiliency by reading the AWS Best Practices for DDoS Resiliency whitepaper. You also have the option of using Route 53 with an externally hosted content delivery network (CDN). In this blog post, I show how you can help protect the zone apex (also known as the root domain) of your web application by using Route 53 to perform a secure redirect to prevent discovery of your application origin.
February 23:s2n Is Now Handling 100 Percent of SSL Traffic for Amazon S3 Today, we’ve achieved another important milestone for securing customer data: we have replaced OpenSSL with s2n for all internal and external SSL traffic in Amazon Simple Storage Service (Amazon S3) commercial regions. This was implemented with minimal impact to customers, and multiple means of error checking were used to ensure a smooth transition, including client integration tests, catching potential interoperability conflicts, and identifying memory leaks through fuzz testing.
February 22:Easily Replace or Attach an IAM Role to an Existing EC2 Instance by Using the EC2 Console AWS Identity and Access Management (IAM) roles enable your applications running on Amazon EC2 to use temporary security credentials. IAM roles for EC2 make it easier for your applications to make API requests securely from an instance because they do not require you to manage AWS security credentials that the applications use. Recently, we enabled you to use temporary security credentials for your applications by attaching an IAM role to an existing EC2 instance by using the AWS CLI and SDK. To learn more, see New! Attach an AWS IAM Role to an Existing Amazon EC2 Instance by Using the AWS CLI. Starting today, you can attach an IAM role to an existing EC2 instance from the EC2 console. You can also use the EC2 console to replace an IAM role attached to an existing instance. In this blog post, I will show how to attach an IAM role to an existing EC2 instance from the EC2 console.
February 22:How to Audit Your AWS Resources for Security Compliance by Using Custom AWS Config Rules AWS Config Rules enables you to implement security policies as code for your organization and evaluate configuration changes to AWS resources against these policies. You can use Config rules to audit your use of AWS resources for compliance with external compliance frameworks such as CIS AWS Foundations Benchmark and with your internal security policies related to the US Health Insurance Portability and Accountability Act (HIPAA), the Federal Risk and Authorization Management Program (FedRAMP), and other regimes. AWS provides some predefined, managed Config rules. You also can create custom Config rules based on criteria you define within an AWS Lambda function. In this post, I show how to create a custom rule that audits AWS resources for security compliance by enabling VPC Flow Logs for an Amazon Virtual Private Cloud (VPC). The custom rule meets requirement 4.3 of the CIS AWS Foundations Benchmark: “Ensure VPC flow logging is enabled in all VPCs.”
February 13:How to Enable Multi-Factor Authentication for AWS Services by Using AWS Microsoft AD and On-Premises Credentials You can now enable multi-factor authentication (MFA) for users of AWS services such as Amazon WorkSpaces and Amazon QuickSight and their on-premises credentials by using your AWS Directory Service for Microsoft Active Directory (Enterprise Edition) directory, also known as AWS Microsoft AD. MFA adds an extra layer of protection to a user name and password (the first “factor”) by requiring users to enter an authentication code (the second factor), which has been provided by your virtual or hardware MFA solution. These factors together provide additional security by preventing access to AWS services, unless users supply a valid MFA code.
February 13:How to Create an Organizational Chart with Separate Hierarchies by Using Amazon Cloud Directory Amazon Cloud Directory enables you to create directories for a variety of use cases, such as organizational charts, course catalogs, and device registries. Cloud Directory offers you the flexibility to create directories with hierarchies that span multiple dimensions. For example, you can create an organizational chart that you can navigate through separate hierarchies for reporting structure, location, and cost center. In this blog post, I show how to use Cloud Directory APIs to create an organizational chart with two separate hierarchies in a single directory. I also show how to navigate the hierarchies and retrieve data. I use the Java SDK for all the sample code in this post, but you can use other language SDKs or the AWS CLI.
February 10:How to Easily Log On to AWS Services by Using Your On-Premises Active Directory AWS Directory Service for Microsoft Active Directory (Enterprise Edition), also known as Microsoft AD, now enables your users to log on with just their on-premises Active Directory (AD) user name—no domain name is required. This new domainlesslogon feature makes it easier to set up connections to your on-premises AD for use with applications such as Amazon WorkSpaces and Amazon QuickSight, and it keeps the user logon experience free from network naming. This new interforest trusts capability is now available when using Microsoft AD with Amazon WorkSpaces and Amazon QuickSight Enterprise Edition. In this blog post, I explain how Microsoft AD domainless logon works with AD interforest trusts, and I show an example of setting up Amazon WorkSpaces to use this capability.
February 9:New! Attach an AWS IAM Role to an Existing Amazon EC2 Instance by Using the AWS CLI AWS Identity and Access Management (IAM) roles enable your applications running on Amazon EC2 to use temporary security credentials that AWS creates, distributes, and rotates automatically. Using temporary credentials is an IAM best practice because you do not need to maintain long-term keys on your instance. Using IAM roles for EC2 also eliminates the need to use long-term AWS access keys that you have to manage manually or programmatically. Starting today, you can enable your applications to use temporary security credentials provided by AWS by attaching an IAM role to an existing EC2 instance. You can also replace the IAM role attached to an existing EC2 instance. In this blog post, I show how you can attach an IAM role to an existing EC2 instance by using the AWS CLI.
January 30:How to Protect Data at Rest with Amazon EC2 Instance Store Encryption Encrypting data at rest is vital for regulatory compliance to ensure that sensitive data saved on disks is not readable by any user or application without a valid key. Some compliance regulations such as PCI DSS and HIPAA require that data at rest be encrypted throughout the data lifecycle. To this end, AWS provides data-at-rest options and key management to support the encryption process. For example, you can encrypt Amazon EBS volumes and configure Amazon S3 buckets for server-side encryption (SSE) using AES-256 encryption. Additionally, Amazon RDS supports Transparent Data Encryption (TDE). Instance storage provides temporary block-level storage for Amazon EC2 instances. This storage is located on disks attached physically to a host computer. Instance storage is ideal for temporary storage of information that frequently changes, such as buffers, caches, and scratch data. By default, files stored on these disks are not encrypted. In this blog post, I show a method for encrypting data on Linux EC2 instance stores by using Linux built-in libraries. This method encrypts files transparently, which protects confidential data. As a result, applications that process the data are unaware of the disk-level encryption.
January 27:How to Detect and Automatically Remediate Unintended Permissions in Amazon S3 Object ACLs with CloudWatch Events Amazon S3Access Control Lists (ACLs) enable you to specify permissions that grant access to S3 buckets and objects. When S3 receives a request for an object, it verifies whether the requester has the necessary access permissions in the associated ACL. For example, you could set up an ACL for an object so that only the users in your account can access it, or you could make an object public so that it can be accessed by anyone. If the number of objects and users in your AWS account is large, ensuring that you have attached correctly configured ACLs to your objects can be a challenge. For example, what if a user were to call the PutObjectAcl API call on an object that is supposed to be private and make it public? Or, what if a user were to call the PutObject with the optional Acl parameter set to public-read, therefore uploading a confidential file as publicly readable? In this blog post, I show a solution that uses Amazon CloudWatch Events to detect PutObject and PutObjectAcl API calls in near-real time and helps ensure that the objects remain private by making automatic PutObjectAcl calls, when necessary.
January 24:New SOC 2 Report Available: Confidentiality As with everything at Amazon, the success of our security and compliance program is primarily measured by one thing: our customers’ success. Our customers drive our portfolio of compliance reports, attestations, and certifications that support their efforts in running a secure and compliant cloud environment. As a result of our engagement with key customers across the globe, we are happy to announce the publication of our new SOC 2 Confidentiality report. This report is available now through AWS Artifact in the AWS Management Console.
January 18:Compliance in the Cloud for New Financial Services Cybersecurity Regulations Financial regulatory agencies are focused more than ever on ensuring responsible innovation. Consequently, if you want to achieve compliance with financial services regulations, you must be increasingly agile and employ dynamic security capabilities. AWS enables you to achieve this by providing you with the tools you need to scale your security and compliance capabilities on AWS. The following breakdown of the most recent cybersecurity regulations, NY DFS Rule 23 NYCRR 500, demonstrates how AWS continues to focus on your regulatory needs in the financial services sector.
January 9:New Amazon GameDev Blog Post: Protect Multiplayer Game Servers from DDoS Attacks by Using Amazon GameLift In online gaming, distributed denial of service (DDoS) attacks target a game’s network layer, flooding servers with requests until performance degrades considerably. These attacks can limit a game’s availability to players and limit the player experience for those who can connect. Today’s new Amazon GameDev Blog post uses a typical game server architecture to highlight DDoS attack vulnerabilities and discusses how to stay protected by using built-in AWS Cloud security, AWS security best practices, and the security features of Amazon GameLift. Read the post to learn more.
January 6:The Top 10 Most Downloaded AWS Security and Compliance Documents in 2016 The following list includes the 10 most downloaded AWS security and compliance documents in 2016. Using this list, you can learn about what other people found most interesting about security and compliance last year.
January 6:FedRAMP Compliance Update: AWS GovCloud (US) Region Receives a JAB-Issued FedRAMP High Baseline P-ATO for Three New Services Three new services in the AWS GovCloud (US) region have received a Provisional Authority to Operate (P-ATO) from the Joint Authorization Board (JAB) under the Federal Risk and Authorization Management Program (FedRAMP). JAB issued the authorization at the High baseline, which enables US government agencies and their service providers the capability to use these services to process the government’s most sensitive unclassified data, including Personal Identifiable Information (PII), Protected Health Information (PHI), Controlled Unclassified Information (CUI), criminal justice information (CJI), and financial data.
January 4:The Top 20 Most Viewed AWS IAM Documentation Pages in 2016 The following 20 pages were the most viewed AWS Identity and Access Management (IAM) documentation pages in 2016. I have included a brief description with each link to give you a clearer idea of what each page covers. Use this list to see what other people have been viewing and perhaps to pique your own interest about a topic you’ve been meaning to research.
January 3:The Most Viewed AWS Security Blog Posts in 2016 The following 10 posts were the most viewed AWS Security Blog posts that we published during 2016. You can use this list as a guide to catch up on your blog reading or even read a post again that you found particularly useful.
January 3:How to Monitor AWS Account Configuration Changes and API Calls to Amazon EC2 Security Groups You can use AWS security controls to detect and mitigate risks to your AWS resources. The purpose of each security control is defined by its control objective. For example, the control objective of an Amazon VPC security group is to permit only designated traffic to enter or leave a network interface. Let’s say you have an Internet-facing e-commerce website, and your security administrator has determined that only HTTP (TCP port 80) and HTTPS (TCP 443) traffic should be allowed access to the public subnet. As a result, your administrator configures a security group to meet this control objective. What if, though, someone were to inadvertently change this security group’s rules and enable FTP or other protocols to access the public subnet from any location on the Internet? That expanded access could weaken the security posture of your assets. Consequently, your administrator might need to monitor the integrity of your company’s security controls so that the controls maintain their desired effectiveness. In this blog post, I explore two methods for detecting unintended changes to VPC security groups. The two methods address not only control objectives but also control failures.
If you have questions about or issues with implementing the solutions in any of these posts, please start a new thread on the forum identified near the end of each post.
Using a content delivery network (CDN) such as Amazon CloudFront to cache and serve static text and images or downloadable objects such as media files and documents is a common strategy to improve webpage load times, reduce network bandwidth costs, lessen the load on web servers, and mitigate distributed denial of service (DDoS) attacks. AWS WAF is a web application firewall that can be deployed on CloudFront to help protect your application against DDoS attacks by giving you control over which traffic to allow or block by defining security rules. When users access your application, the Domain Name System (DNS) translates human-readable domain names (for example, www.example.com) to machine-readable IP addresses (for example, 192.0.2.44). A DNS service, such as Amazon Route 53, can effectively connect users’ requests to a CloudFront distribution that proxies requests for dynamic content to the infrastructure hosting your application’s endpoints.
In this blog post, I show you how to deploy CloudFront with AWS WAF and Route 53 to help protect dynamic web applications (with dynamic content such as a response to user input) against DDoS attacks. The steps shown in this post are key to implementing the overall approach described in AWS Best Practices for DDoS Resiliency and enable the built-in, managed DDoS protection service, AWS Shield.
Background
AWS hosts CloudFront and Route 53 services on a distributed network of proxy servers in data centers throughout the world called edge locations. Using the global Amazon network of edge locations for application delivery and DNS service plays an important part in building a comprehensive defense against DDoS attacks for your dynamic web applications. These web applications can benefit from the increased security and availability provided by CloudFront and Route 53 as well as improving end users’ experience by reducing latency.
The following screenshot of an Amazon.com webpage shows how static and dynamic content can compose a dynamic web application that is delivered via HTTPS protocol for the encryption of user page requests as well as the pages that are returned by a web server.
The following map shows the global Amazon network of edge locations available to serve static content and proxy requests for dynamic content back to the origin as of the writing of this blog post. For the latest list of edge locations, see AWS Global Infrastructure.
How AWS Shield, CloudFront, and Route 53 work to help protect against DDoS attacks
To help keep your dynamic web applications available when they are under DDoS attack, the steps in this post enable AWS Shield Standard by configuring your applications behind CloudFront and Route 53. AWS Shield Standard protects your resources from common, frequently occurring network and transport layer DDoS attacks. Attack traffic can be geographically isolated and absorbed using the capacity in edge locations close to the source. Additionally, you can configure geographical restrictions to help block attacks originating from specific countries.
The request-routing technology in CloudFront connects each client to the nearest edge location, as determined by continuously updated latency measurements. HTTP and HTTPS requests sent to CloudFront can be monitored, and access to your application resources can be controlled at edge locations using AWS WAF. Based on conditions that you specify in AWS WAF, such as the IP addresses that requests originate from or the values of query strings, traffic can be allowed, blocked, or allowed and counted for further investigation or remediation. The following diagram shows how static and dynamic web application content can originate from endpoint resources within AWS or your corporate data center. For more details, see How CloudFront Delivers Content and How CloudFront Works with Regional Edge Caches.
Route 53 DNS requests and subsequent application traffic routed through CloudFront are inspected inline. Always-on monitoring, anomaly detection, and mitigation against common infrastructure DDoS attacks such as SYN/ACK floods, UDP floods, and reflection attacks are built into both Route 53 and CloudFront. For a review of common DDoS attack vectors, see How to Help Prepare for DDoS Attacks by Reducing Your Attack Surface. When the SYN flood attack threshold is exceeded, SYN cookies are activated to avoid dropping connections from legitimate clients. Deterministic packet filtering drops malformed TCP packets and invalid DNS requests, only allowing traffic to pass that is valid for the service. Heuristics-based anomaly detection evaluates attributes such as type, source, and composition of traffic. Traffic is scored across many dimensions, and only the most suspicious traffic is dropped. This method allows you to avoid false positives while protecting application availability.
Route 53 is also designed to withstand DNS query floods, which are real DNS requests that can continue for hours and attempt to exhaust DNS server resources. Route 53 uses shuffle sharding and anycast striping to spread DNS traffic across edge locations and help protect the availability of the service.
The next four sections provide guidance about how to deploy CloudFront, Route 53, AWS WAF, and, optionally, AWS Shield Advanced.
Deploy CloudFront
To take advantage of application delivery with DDoS mitigations at the edge, start by creating a CloudFront distribution and configuring origins:
Specify cache behavior settings for the distribution, as shown in the following screenshot. You can configure each URL path pattern with a set of associated cache behaviors. For dynamic web applications, set the Minimum TTL to 0 so that CloudFront will make a GET request with an If-Modified-Since header back to the origin. When CloudFront proxies traffic to the origin from edge locations and back, multiple concurrent requests for the same object are collapsed into a single request. The request is sent over a persistent connection from the edge location to the region over networks monitored by AWS. The use of a large initial TCP window size in CloudFront maximizes the available bandwidth, and TCP Fast Open (TFO) reduces latency.
To ensure that all traffic to CloudFront is encrypted and to enable SSL termination from clients at global edge locations, specify Redirect HTTP to HTTPS for Viewer Protocol Policy. Moving SSL termination to CloudFront offloads computationally expensive SSL negotiation, helps mitigate SSL abuse, and reduces latency with the use of OCSP stapling and session tickets. For more information about options for serving HTTPS requests, see Choosing How CloudFront Serves HTTPS Requests. For dynamic web applications, set Allowed HTTP Methods to include all methods, set Forward Headers to All, and for Query String Forwarding and Caching, choose Forward all, cache based on all.
Specify distribution settings for the distribution, as shown in the following screenshot. Enter your domain names in the Alternate Domain Names box and choose Custom SSL Certificate.
Choose Create Distribution. Note the x.cloudfront.net Domain Name of the distribution. In the next section, you will configure Route 53 to route traffic to this CloudFront distribution domain name.
Configure Route 53
When you created a web distribution in the previous section, CloudFront assigned a domain name to the distribution, such as d111111abcdef8.cloudfront.net. You can use this domain name in the URLs for your content, such as: http://d111111abcdef8.cloudfront.net/logo.jpg.
Alternatively, you might prefer to use your own domain name in URLs, such as: http://example.com/logo.jpg. You can accomplish this by creating a Route 53 alias resource record set that routes dynamic web application traffic to your CloudFront distribution by using your domain name. Alias resource record sets are virtual records specific to Route 53 that are used to map alias resource record sets for your domain to your CloudFront distribution. Alias resource record sets are similar to CNAME records except there is no charge for DNS queries to Route 53 alias resource record sets mapped to AWS services. Alias resource record sets are also not visible to resolvers, and they can be created for the root domain (zone apex) as well as subdomains.
A hosted zone, similar to a DNS zone file, is a collection of records that belongs to a single parent domain name. Each hosted zone has four nonoverlapping name servers in a delegation set. If a DNS query is dropped, the client automatically retries the next name server. If you have not already registered a domain name and have not configured a hosted zone for your domain, complete these two prerequisite steps before proceeding:
Choose the name of the hosted zone for the domain that you want to use to route traffic to your CloudFront distribution.
Choose Create Record Set.
Specify the following values:
Name – Type the domain name that you want to use to route traffic to your CloudFront distribution. The default value is the name of the hosted zone. For example, if the name of the hosted zone is example.com and you want to use acme.example.com to route traffic to your distribution, type acme.
Type – Choose A – IPv4 address. If IPv6 is enabled for the distribution and you are creating a second resource record set, choose AAAA – IPv6 address.
Alias – Choose Yes.
Alias Target – In the CloudFront distributions section, choose the name that CloudFront assigned to the distribution when you created it.
Routing Policy – Accept the default value of Simple.
Evaluate Target Health – Accept the default value of No.
Choose Create.
If IPv6 is enabled for the distribution, repeat Steps 4 through 6. Specify the same settings except for the Type field, as explained in Step 5.
The following screenshot of the Route 53 console shows a Route 53 alias resource record set that is configured to map a domain name to a CloudFront distribution.
If your dynamic web application requires geo redundancy, you can use latency-based routing in Route 53 to run origin servers in different AWS regions. Route 53 is integrated with CloudFront to collect latency measurements from each edge location. With Route 53 latency-based routing, each CloudFront edge location goes to the region with the lowest latency for the origin fetch.
Enable AWS WAF
AWS WAF is a web application firewall that helps detect and mitigate web application layer DDoS attacks by inspecting traffic inline. Application layer DDoS attacks use well-formed but malicious requests to evade mitigation and consume application resources. You can define custom security rules (also called web ACLs) that contain a set of conditions, rules, and actions to block attacking traffic. After you define web ACLs, you can apply them to CloudFront distributions, and web ACLs are evaluated in the priority order you specified when you configured them. Real-time metrics and sampled web requests are provided for each web ACL.
The following diagram shows how the (a) flow of CloudFront access logs files to an Amazon S3 bucket (b) provides the source data for the Lambda log parser function (c) to identify HTTP flood traffic and update AWS WAF web ACLs. As CloudFront receives requests on behalf of your dynamic web application, it sends access logs to an S3 bucket, triggering the Lambda log parser. The Lambda function parses CloudFront access logs to identify suspicious behavior, such as an unusual number of requests or errors, and it automatically updates your AWS WAF rules to block subsequent requests from the IP addresses in question for a predefined amount of time that you specify.
In addition to automated rate-based blacklisting to help protect against HTTP flood attacks, prebuilt AWS CloudFormation templates are available to simplify the configuration of AWS WAF for a proactive application-layer security defense. The following diagram provides an overview of CloudFormation template input into the creation of the CommonAttackProtection stack that includes AWS WAF web ACLs used to block, allow, or count requests that meet the criteria defined in each rule.
Choose the link under the ID column for your CloudFront distribution.
Choose Edit under the General
Choose your AWS WAF Web ACL from the drop-down
Choose Yes, Edit.
Activate AWS Shield Advanced (optional)
Deploying CloudFront, Route 53, and AWS WAF as described in this post enables the built-in DDoS protections for your dynamic web applications that are included with AWS Shield Standard. (There is no upfront cost or charge for AWS Shield Standard beyond the normal pricing for CloudFront, Route 53, and AWS WAF.) AWS Shield Standard is designed to meet the needs of many dynamic web applications.
For dynamic web applications that have a high risk or history of frequent, complex, or high volume DDoS attacks, AWS Shield Advanced provides additional DDoS mitigation capacity, attack visibility, cost protection, and access to the AWS DDoS Response Team (DRT). For more information about AWS Shield Advanced pricing, see AWS Shield Advanced pricing. To activate advanced protection services, follow these steps:
Sign in to the AWS Management Console and open the AWS WAF console.
If this is your first time signing in to the AWS WAF console, choose Get started with AWS Shield Advanced. Otherwise, choose Protected resources.
Choose Activate AWS Shield Advanced.
Choose the resource type and resource to protect.
For Name, enter a friendly name that will help you identify the AWS resources that are protected. For example, My CloudFront AWS Shield Advanced distributions.
(Optional) For Web DDoS attack, select Enable. You will be prompted to associate an existing web ACL with these resources, or create a new ACL if you don’t have any yet.
Choose Add DDoS protection.
Summary
In this blog post, I outline the steps to deploy CloudFront and configure Route 53 in front of your dynamic web application to leverage the global Amazon network of edge locations for DDoS resiliency. The post also provides guidance about enabling AWS WAF for application layer traffic monitoring and automated rules creation to block malicious traffic. I also cover the optional steps to activate AWS Shield Advanced, which helps build a more comprehensive defense against DDoS attacks for your dynamic web applications.
If you have comments about this post, submit them in the “Comments” section below. If you have questions about or issues implementing this solution, please open a new thread on the AWS WAF forum.
– Holly
The collective thoughts of the interwebz
By continuing to use the site, you agree to the use of cookies. more information
The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.