Amazon Web Services (AWS) serves more than a million private and public sector organizations all over the world from its extensive and expanding global infrastructure.
Like their counterparts around the world, organizations across New Zealand are using AWS to change the way they operate. For example, Xero, a Wellington-based online accounting software vendor, now serves customers in more than 100 countries, while the Department of Conservation provides its end users with virtual desktops running in Amazon WorkSpaces.
New Zealand doesn’t currently have a dedicated AWS Region. Geographically, the closest is Asia Pacific (Sydney), roughly 2,000 kilometers (km) away across the Tasman Sea. Customers nonetheless rely on AWS for business-critical workloads, and they are well served by New Zealand’s international connectivity.
To connect to Amazon’s network, our New Zealand customers have a range of options:
Public internet endpoints
Managed or software Virtual Private Networks (VPN)
AWS Direct Connect (DX)
All rely on the extensive internet infrastructure connecting New Zealand to the world.
International Connectivity
The vast majority of internet traffic is carried over physical cables, while the percentage of traffic moving over satellite or wireless links is small by comparison.
Historically, cables were funded and managed by consortia of telecommunication providers. More recently, large infrastructure and service providers like AWS have contributed to or are building their own cable networks.
There are currently about 400 submarine cables in service globally. Modern submarine cables are fiber-optic, run for thousands of kilometers, and are protected by steel strands, plastic sheathing, copper, and a chemical water barrier. Over that distance the signal weakens—or attenuates—so signal repeaters are installed approximately every 50 km to boost it. Repeaters are powered by direct current carried over the copper conductor in the cable.
An example of submarine cable composition. Source: Wikimedia Commons
For most of their run, these cables are about as thick as a standard garden hose. They are thicker, however, closer to shore and in areas where there’s a greater risk of damage by fishing nets, boat anchors, etc.
Cables can—and do—break, but redundancy is built into the network. According to TeleGeography, there are around 100 submarine cable faults globally every year. However, most faults don’t meaningfully impact users.
Four submarine cables currently connect New Zealand to the rest of the world. For example, Southern Cross B runs from Takapuna (Auckland, New Zealand) to Spencer Beach (Hawaii, USA) at 1.2 Tbps.
A map of major submarine cables connecting to New Zealand. Source: submarinecablemap.com
The four cables combined currently deliver 66 Tbps of available capacity. The Southern Cross NEXT cable is due to come online in 2020, which will add another 72 Tbps. These are, of course, potential capacities; it’s likely the “lit” capacity—the proportion of the cables’ overall capacity that is actually in use—is much lower.
Connecting to AWS from New Zealand
While the underlying physical infrastructure is interesting, in practice customers don’t deal with these details directly. Instead, connectivity options are evaluated on the basis of what AWS and its partners offer.
Customers connect to AWS in three main ways: over public endpoints, via site-to-site VPNs, and via Direct Connect (DX), all typically provided by partners.
Public Internet Endpoints
Customers can connect to public endpoints for AWS services over the public internet. Some services, like Amazon CloudFront, Amazon API Gateway, and Amazon WorkSpaces are generally used in this way.
Many organizations use a VPN to connect to AWS. It’s the simplest and lowest cost entry point to expose resources deployed in private ranges in an Amazon VPC. Amazon VPC allows customers to provision a logically isolated network segment, with fine-grained control of IP ranges, filtering rules, and routing.
AWS offers a managed site-to-site VPN service, which creates secure, redundant Internet Protocol Security (IPSec) VPNs, and also handles maintenance and high-availability while integrating with Amazon CloudWatch for robust monitoring.
If using an AWS managed VPN, the AWS endpoints have publicly routable IPs. They can be connected to over the public internet or via a Public Virtual Interface over DX (outlined below).
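As an illustrative sketch (the resource IDs and the customer gateway IP below are placeholders), provisioning a managed site-to-site VPN with the AWS CLI looks roughly like this:

$ aws ec2 create-customer-gateway --type ipsec.1 --public-ip 203.0.113.10 --bgp-asn 65000
$ aws ec2 create-vpn-gateway --type ipsec.1
$ aws ec2 attach-vpn-gateway --vpn-gateway-id vgw-0123456789abcdef0 --vpc-id vpc-0123456789abcdef0
$ aws ec2 create-vpn-connection --type ipsec.1 --customer-gateway-id cgw-0123456789abcdef0 --vpn-gateway-id vgw-0123456789abcdef0

The create-vpn-connection response includes the tunnel configuration needed to set up the customer side of the IPSec tunnels.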
Customers can also deploy VPN appliances onto Amazon Elastic Compute Cloud (EC2) instances running in their VPC. These may be self-managed or provided by AWS Marketplace sellers.
AWS also offers AWS Client VPN, for direct user access to AWS resources.
AWS Direct Connect
While connectivity over the internet is secure and flexible, it has one major disadvantage: it’s unpredictable. By design, traffic traversing the internet can take any path to reach its destination. Most of the time it works but occasionally routing conditions may reduce capacity or increase latency.
DX connections are either 1 or 10 Gigabits per second (Gbps). This capacity is dedicated to the customer; it isn’t shared, as other network users are never routed over the connection. This means customers can rely on consistent latency and bandwidth. The DX per-Gigabit transfer cost is lower than other egress mechanisms. For customers transferring large volumes of data, DX may be more cost effective than other means of connectivity.
Customers may publish their own 802.1Q Virtual Local Area Network (VLAN) tags across the DX, and advertise routes via Border Gateway Protocol (BGP). A dedicated connection supports up to 50 private or public virtual interfaces. New Zealand does not have a physical point-of-presence for DX—users must procure connectivity to our Sydney Region. Many AWS Partner Network (APN) members in New Zealand offer this connectivity.
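As a sketch (the connection ID, VLAN, ASN, and virtual gateway ID are placeholders), a private virtual interface over an existing DX connection could be created with the AWS CLI like this:

$ aws directconnect create-private-virtual-interface --connection-id dxcon-fgexample --new-private-virtual-interface virtualInterfaceName=nz-sydney-vif,vlan=101,asn=65000,virtualGatewayId=vgw-0123456789abcdef0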
For customers who don’t want or need to manage VLANs to AWS—or prefer 1 Gbps or smaller links —APN partners offer hosted connections or hosted virtual interfaces. For more detail, please review our AWS Direct Connect Partners page.
Performance
There are physical limits to latency dictated by the speed of light, and the medium through which optical signals travel. Southern Cross publishes latency statistics, and it sees one-way latency of approximately 11 milliseconds (ms) over the 2,276km Alexandria to Whenuapai link. Double that for a round-trip to 22 ms.
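That published figure is close to the physical floor: light in optical fiber travels at roughly the speed of light divided by the fiber’s refractive index (about 1.47 is assumed here), so a quick back-of-the-envelope check gives:

$ awk 'BEGIN { km = 2276; v = 299792 / 1.47; printf "one-way latency: %.1f ms\n", km / v * 1000 }'
one-way latency: 11.2 ms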
In practice, we see customers achieving round-trip times from user workstations to Sydney in approximately 30-50 ms, assuming fair-weather internet conditions or DX links. Latency in Auckland (the largest city) tends to be on the lower end of that spectrum, while the rest of the country tends towards the higher end.
Bandwidth constraints are more often dictated by client hardware, but AWS and our partners offer up to 10 Gbps links, or smaller as required. For customers that require more than 10 Gbps over a single link, AWS supports Link Aggregation Groups (LAG).
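For example (the location code and names below are placeholders), a LAG of two 10 Gbps connections could be requested with the AWS CLI like this:

$ aws directconnect create-lag --location <DX-location-code> --number-of-connections 2 --connections-bandwidth 10Gbps --lag-name nz-sydney-lag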
As outlined above, there are a range of ways for customers to adopt AWS via secure, reliable, and performant networks. To discuss your use case, please contact an AWS Solutions Architect.
Learn about AWS Services & Solutions – September AWS Online Tech Talks
Join us this September to learn about AWS services and solutions. The AWS Online Tech Talks are live, online presentations that cover a broad range of topics at varying technical levels. These tech talks, led by AWS solutions architects and engineers, feature technical deep dives, live demonstrations, customer examples, and Q&A with AWS experts. Register Now!
Note – All sessions are free and in Pacific Time.
Tech talks this month:
Compute:
September 23, 2019 | 11:00 AM – 12:00 PM PT – Build Your Hybrid Cloud Architecture with AWS – Learn about the extensive range of services AWS offers to help you build a hybrid cloud architecture best suited for your use case.
September 25, 2019 | 1:00 PM – 2:00 PM PT – What’s New in Amazon DocumentDB (with MongoDB compatibility) – Learn what’s new in Amazon DocumentDB, a fully managed MongoDB compatible database service designed from the ground up to be fast, scalable, and highly available.
September 23, 2019 | 9:00 AM – 10:00 AM PT – Training Machine Learning Models Faster – Learn how to train machine learning models quickly and with a single click using Amazon SageMaker.
September 30, 2019 | 11:00 AM – 12:00 PM PT – Using Containers for Deep Learning Workflows – Learn how containers can help address challenges in deploying deep learning environments.
September 24, 2019 | 11:00 AM – 12:00 PM PT – Application Migrations Using AWS Server Migration Service (SMS) – Learn how to use AWS Server Migration Service (SMS) for automating application migration and scheduling continuous replication, from your on-premises data centers or Microsoft Azure to AWS.
September 30, 2019 | 1:00 PM – 2:00 PM PT – AWS Office Hours: Amazon CloudFront – Just getting started with Amazon CloudFront and Lambda@Edge? Get answers directly from our experts during AWS Office Hours.
Robotics:
October 1, 2019 | 11:00 AM – 12:00 PM PT – Robots and STEM: AWS RoboMaker and AWS Educate Unite! – Come join members of the AWS RoboMaker and AWS Educate teams as we provide an overview of our education initiatives and walk you through the newly launched RoboMaker Badge.
October 2, 2019 | 9:00 AM – 10:00 AM PT – Deep Dive on Amazon EventBridge – Learn how to optimize event-driven applications, and use rules and policies to route, transform, and control access to these events that react to data from SaaS apps.
This post courtesy of Thiago Morais, AWS Solutions Architect
When you build web applications or expose any data externally, you probably look for a platform where you can build highly scalable, secure, and robust REST APIs. As APIs are publicly exposed, there are a number of best practices for providing a secure mechanism to consumers using your API.
Amazon API Gateway handles all the tasks involved in accepting and processing up to hundreds of thousands of concurrent API calls, including traffic management, authorization and access control, monitoring, and API version management.
In this post, I show you how to take advantage of the regional API endpoint feature in API Gateway, so that you can create your own Amazon CloudFront distribution and secure your API using AWS WAF.
AWS WAF is a web application firewall that helps protect your web applications from common web exploits that could affect application availability, compromise security, or consume excessive resources.
As you make your APIs publicly available, you are exposed to attackers trying to exploit your services in several ways. The AWS security team published a whitepaper solution using AWS WAF, How to Mitigate OWASP’s Top 10 Web Application Vulnerabilities.
Regional API endpoints
Edge-optimized APIs are endpoints that are accessed through a CloudFront distribution created and managed by API Gateway. Before the launch of regional API endpoints, this was the default option when creating APIs using API Gateway. It primarily helped to reduce latency for API consumers that were located in different geographical locations than your API.
When API requests predominantly originate from an Amazon EC2 instance or other services within the same AWS Region as the API is deployed, a regional API endpoint typically lowers the latency of connections. It is recommended for such scenarios.
For better control around caching strategies, customers can use their own CloudFront distribution for regional APIs. They also have the ability to use AWS WAF protection, as I describe in this post.
Edge-optimized API endpoint
The following diagram is an illustrated example of the edge-optimized API endpoint where your API clients access your API through a CloudFront distribution created and managed by API Gateway.
Regional API endpoint
For the regional API endpoint, your customers access your API from the same Region in which your REST API is deployed. This helps you to reduce request latency and particularly allows you to add your own content delivery network, as needed.
Walkthrough
In this section, you implement the following steps:
Create the regional API.
Create a CloudFront distribution.
Test the CloudFront distribution.
Deploy the AWS WAF Security Automations solution.
Attach the web ACL to the CloudFront distribution.
Test AWS WAF protection.
Create the regional API
For this walkthrough, use an existing PetStore API. All new APIs launch by default as the regional endpoint type. To change the endpoint type for your existing API, choose the cog icon on the top right corner:
After you have created the PetStore API on your account, deploy a stage called “prod” for the PetStore API.
On the API Gateway console, select the PetStore API and choose Actions, Deploy API.
For Stage name, type prod and add a stage description.
Choose Deploy and the new API stage is created.
Use the following AWS CLI command to update your API from edge-optimized to regional:
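For example, a patch operation along these lines performs the change (this is a sketch; substitute your own API ID), and the response should look similar to the following:

$ aws apigateway update-rest-api --rest-api-id {api-id} --patch-operations op=replace,path=/endpointConfiguration/types/EDGE,value=REGIONAL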
{
"description": "Your first API with Amazon API Gateway. This is a sample API that integrates via HTTP with your demo Pet Store endpoints",
"createdDate": 1511525626,
"endpointConfiguration": {
"types": [
"REGIONAL"
]
},
"id": "{api-id}",
"name": "PetStore"
}
After you change your API endpoint to regional, you can now assign your own CloudFront distribution to this API.
Create a CloudFront distribution
To make things easier, I have provided an AWS CloudFormation template to deploy a CloudFront distribution pointing to the API that you just created. Click the button to deploy the template in the us-east-1 Region.
For Stack name, enter RegionalAPI. For APIGWEndpoint, enter your API FQDN in the following format:
{api-id}.execute-api.us-east-1.amazonaws.com
After you fill out the parameters, choose Next to continue the stack deployment. It takes a couple of minutes to finish the deployment. After it finishes, the Output tab lists the following items:
A CloudFront domain URL
An S3 bucket for CloudFront access logs
Output from CloudFormation
Test the CloudFront distribution
To see if the CloudFront distribution was configured correctly, use a web browser and enter the URL of your distribution, appending the stage and resource path (for example, /prod/pets).
With the new CloudFront distribution in place, you can now start setting up AWS WAF to protect your API.
For this demo, you deploy the AWS WAF Security Automations solution, which provides fine-grained control over the requests attempting to access your API.
For more information about deployment, see Automated Deployment. If you prefer, you can launch the solution directly into your account using the following button.
For CloudFront Access Log Bucket Name, add the name of the bucket created during the deployment of the CloudFormation stack for your CloudFront distribution.
The solution allows you to adjust thresholds and also choose which automations to enable to protect your API. After you finish configuring these settings, choose Next.
To start the deployment process in your account, follow the creation wizard and choose Create. It takes a few minutes to finish the deployment. You can follow the creation process through the CloudFormation console.
After the deployment finishes, you can see the new web ACL deployed on the AWS WAF console, AWSWAFSecurityAutomations.
Attach the AWS WAF web ACL to the CloudFront distribution
With the solution deployed, you can now attach the AWS WAF web ACL to the CloudFront distribution that you created earlier.
To assign the newly created AWS WAF web ACL, go back to your CloudFront distribution. After you open your distribution for editing, choose General, Edit.
Select the new AWS WAF web ACL that you created earlier, AWSWAFSecurityAutomations.
Save the changes to your CloudFront distribution and wait for the deployment to finish.
Test AWS WAF protection
To validate the AWS WAF Web ACL setup, use Artillery to load test your API and see AWS WAF in action.
To install Artillery on your machine, run the following command:
$ npm install -g artillery
After the installation completes, you can check if Artillery installed successfully by running the following command:
$ artillery -V
1.6.0-12
At the time of publication, Artillery is on version 1.6.0-12.
One of the WAF web ACL rules that you have set up is a rate-based rule. By default, it is set up to block any requester that exceeds 2,000 requests within a 5-minute period. Try this out.
First, use cURL to query your distribution and see the API output:
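$ curl -s https://{distribution-name}.cloudfront.net/prod/pets

You should see the PetStore sample JSON response. Next, generate load against the same URL with Artillery. The flags below are one way to produce the load described next (10 virtual users making 200 requests each); adjust them to suit your test:

$ artillery quick --count 10 -n 200 "https://{distribution-name}.cloudfront.net/prod/pets"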
What you are doing is firing 2000 requests to your API from 10 concurrent users. For brevity, I am not posting the Artillery output here.
After Artillery finishes its execution, try to run the cURL request again and see what happens:
$ curl -s https://{distribution-name}.cloudfront.net/prod/pets
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<TITLE>ERROR: The request could not be satisfied</TITLE>
</HEAD><BODY>
<H1>ERROR</H1>
<H2>The request could not be satisfied.</H2>
<HR noshade size="1px">
Request blocked.
<BR clear="all">
<HR noshade size="1px">
<PRE>
Generated by cloudfront (CloudFront)
Request ID: [removed]
</PRE>
<ADDRESS>
</ADDRESS>
</BODY></HTML>
As you can see from the output above, the request was blocked by AWS WAF. Your IP address is removed from the blocked list after it falls below the request limit rate.
Conclusion
In this first part, you saw how to use the new API Gateway regional API endpoint together with Amazon CloudFront and AWS WAF to secure your API from a series of attacks.
In the second part, I will demonstrate some other techniques to protect your API using API keys and Amazon CloudFront custom headers.
In 2015, we announced Backblaze B2 Cloud Storage — the most affordable, high performance storage cloud on the planet. The decision to release B2 as a service was in direct response to customers asking us if they could use the same cloud storage infrastructure we use for our Computer Backup service. With B2, we entered a market in direct competition with Amazon S3, Google Cloud Services, and Microsoft Azure Storage. Today, we have over 500 petabytes of data from customers in over 150 countries. At $0.005 / GB / month for storage (1/4th of S3) and $0.01 / GB for downloads (1/5th of S3), it turns out there’s a healthy market for cloud storage that’s easy and affordable.
As B2 has grown, customers wanted to use our cloud storage for a variety of use cases that required not only storage but compute. We’re happy to say that through partnerships with Packet & ServerCentral, today we’re announcing that compute is now available for B2 customers.
Cloud Compute and Storage
Backblaze has directly connected B2 with the compute servers of Packet and ServerCentral, thereby allowing near-instant (< 10 ms) data transfers between services. Also, transferring data between B2 and both our compute partners is free.
Storing data in B2 and want to run an AI analysis on it? — There are no fees to move the data to our compute partners.
Generating data in an application? — Run the application with one of our partners and store it in B2.
Transfers are free and you’ll save more than 50% off of the equivalent set of services from AWS.
These partnerships enable B2 customers to use compute, give our compute partners’ customers access to cloud storage, and introduce new customers to industry-leading storage and compute — all with high-performance, low-latency, and low-cost.
Is This a Big Deal? We Think So
Compute is one of the most requested services from our customers. Why? Because it unlocks a number of use cases for them. Let’s look at three popular examples:
Transcoding Media Files
B2 has earned wide adoption in the Media & Entertainment (“M&E”) industry. Our affordable storage and download pricing make B2 great for a wide variety of M&E use cases. But many M&E workflows require compute. Content syndicators, like American Public Television, need the ability to transcode files to meet localization and distribution management requirements.
There are a multitude of reasons that transcoding is needed — thumbnail and proxy generation, for example, enable M&E professionals to work efficiently. Without compute, the act of transcoding files remains cumbersome. Either the files need to be brought down from the cloud, transcoded, and then pushed back up, or they must be kept locally until the project is complete. Both scenarios are inefficient.
Starting today, any content producer can spin up compute with one of our partners, pay by the hour for their transcode processing, and return the new media files to B2 for storage and distribution. The company saves money, moves faster, and ensures their files are safe and secure.
Disaster Recovery
Backblaze’s heritage is based on providing outstanding backup services. When you have incredibly affordable cloud storage, it ends up being a great destination for your backup data.
Most enterprises have virtual machines (“VMs”) running in their infrastructure and those VMs need to be backed up. In a disaster scenario, a business wants to know they can get back up and running quickly.
With all data stored in B2, a business can get up and running quickly. Simply restore your backed up VM to one of our compute providers, and your business will be able to get back online.
Since B2 does not place restrictions, delays, or penalties on getting data out, customers can get back up and running quickly and affordably.
Saving $74 Million (aka “The Dropbox Effect”)
Ten years ago, Backblaze decided that S3 was too costly a platform to build its cloud storage business. Instead, we created the Backblaze Storage Pod and our own cloud storage infrastructure. That decision enabled us to offer our customers storage at a previously unavailable price point and maintain those prices for over a decade. It also laid the foundation for Netflix Open Connect and Facebook Open Compute.
Dropbox recently migrated the majority of their cloud services off of AWS and onto Dropbox’s own infrastructure. By leaving AWS, Dropbox was able to build out their own data centers and still save over $74 Million. They achieved those savings by avoiding the fees AWS charges for storing and downloading data, which, incidentally, are five times higher than Backblaze B2.
For Dropbox, being able to realize savings was possible because they have access to enough capital and expertise that they can build out their own infrastructure. For companies that have such resources and scale, that’s a great answer.
“Before this offering, the economics of the cloud would have made our business simply unviable.” — Gabriel Menegatti, SlicingDice
The question Backblaze and our compute partners pondered was: “How can we democratize the Dropbox effect for our storage and compute customers? How can we help customers do more and pay less?” The answer we came up with was to connect Backblaze’s B2 storage with strategic compute partners and remove any transfer fees between them. You may not save $74 million as Dropbox did, but you can choose the optimal providers for your use case and realize significant savings in the process.
This Sounds Good — Tell Me More About Your Partners
We’re very fortunate to be launching our compute program with two fantastic partners in Packet and ServerCentral. These partners allow us to offer a range of computing services.
Packet
We recommend Packet for customers that need on-demand, high performance, bare metal servers available by the hour. They also have robust offerings for private / customized deployments. Their offerings end up costing 50-75% of the equivalent offerings from EC2.
To get started with Packet and B2, visit our partner page on Packet.net.
ServerCentral
ServerCentral is the right partner for customers that have business and IT challenges that require more than “just” hardware. They specialize in fully managed, custom cloud solutions that solve complex business and IT challenges. ServerCentral also has expertise in managed network solutions to address global connectivity and content delivery.
To get started with ServerCentral and B2, visit our partner page on ServerCentral.com.
What’s Next?
We’re excited to find out. The combination of B2 and compute unlocks use cases that were previously impossible or at least unaffordable.
“The combination of performance and price offered by this partnership enables me to create an entirely new business line. Before this offering, the economics of the cloud would have made our business simply unviable,” noted Gabriel Menegatti, co-founder at SlicingDice, a serverless data warehousing service. “Knowing that transfers between compute and B2 are free means I don’t have to worry about my business being successful. And, with download pricing from B2 at just $0.01 GB, I know I’m avoiding a 400% tax from AWS on data I retrieve.”
What can you do with B2 & compute? Please share your ideas with us in the comments. And, for those attending NAB 2018 in Las Vegas next week, please come by and say hello!
Amazon S3 provides comprehensive security and compliance capabilities that meet even the most stringent regulatory requirements. It gives you flexibility in the way you manage data for cost optimization, access control, and compliance. However, because the service is flexible, a user could accidentally configure buckets in a manner that is not secure. For example, let’s say you uploaded files to an Amazon S3 bucket with public read permissions, even though you intended only to share this file with a colleague or a partner. Although this might have accomplished your task to share the file internally, the file is now available to anyone on the internet, even without authentication.
In this blog post, we show you how to prevent your Amazon S3 buckets and objects from allowing public access. We discuss how to secure data in Amazon S3 with a defense-in-depth approach, where multiple security controls are put in place to help prevent data leakage. This approach helps prevent you from allowing public access to confidential information, such as personally identifiable information (PII) or protected health information (PHI).
Preventing your Amazon S3 buckets and objects from allowing public access
Every call to an Amazon S3 service becomes a REST API request. When your request is transformed via a REST call, the permissions are converted into parameters included in the HTTP header or as URL parameters. The Amazon S3 bucket policy allows or denies access to the Amazon S3 bucket or Amazon S3 objects based on policy statements, and then evaluates conditions based on those parameters. To learn more, see Using Bucket Policies and User Policies.
With this in mind, let’s say multiple AWS Identity and Access Management (IAM) users at Example Corp. have access to an Amazon S3 bucket and the objects in the bucket. Example Corp. wants to share the objects among its IAM users, while at the same time preventing the objects from being made available publicly.
To demonstrate how to do this, we start by creating an Amazon S3 bucket named examplebucket. After creating this bucket, we must apply the following bucket policy. This policy denies any uploaded object (PutObject) with the attribute x-amz-acl having the values public-read, public-read-write, or authenticated-read. This means authenticated users cannot upload objects to the bucket if the objects have public permissions.
“Deny any Amazon S3 request to PutObject or PutObjectAcl in the bucket examplebucket when the request includes one of the following access control lists (ACLs): public-read, public-read-write, or authenticated-read.”
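A bucket policy statement matching that description might look like the following sketch (using the examplebucket name from this walkthrough):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyPublicReadACL",
            "Effect": "Deny",
            "Principal": "*",
            "Action": ["s3:PutObject", "s3:PutObjectAcl"],
            "Resource": "arn:aws:s3:::examplebucket/*",
            "Condition": {
                "StringEquals": {
                    "s3:x-amz-acl": ["public-read", "public-read-write", "authenticated-read"]
                }
            }
        }
    ]
}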
Remember that IAM policies are not evaluated in a first-match-and-exit model. Instead, IAM first checks whether there is an explicit Deny. If there is not, it then checks for an explicit Allow; if neither applies, the request falls through to the implicit Deny.
The above policy creates an explicit Deny. Even when an authenticated user tries to upload (PutObject) an object with public permissions, such as public-read, public-read-write, or authenticated-read, the action is denied. To understand how Amazon S3 access permissions work, you must understand what access control lists (ACLs) and grants are; see the Amazon S3 access control documentation for details.
Now let’s continue our bucket policy explanation by examining the next statement.
This statement is very similar to the first statement, except that instead of checking the ACLs, we are checking specific user groups’ grants that represent the following groups:
AuthenticatedUsers group. Represented by http://acs.amazonaws.com/groups/global/AuthenticatedUsers, this group represents all AWS accounts. Access permissions to this group allow any AWS account to access the resource. However, all requests must be signed (authenticated).
AllUsers group. Represented by http://acs.amazonaws.com/groups/global/AllUsers, access permissions to this group allow anyone on the internet access to the resource. The requests can be signed (authenticated) or unsigned (anonymous). Unsigned requests omit the Authentication header in the request.
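As a sketch, the condition portion of such a statement could use the s3:x-amz-grant-read condition key together with the group URIs above:

"Condition": {
    "StringLike": {
        "s3:x-amz-grant-read": [
            "*http://acs.amazonaws.com/groups/global/AllUsers*",
            "*http://acs.amazonaws.com/groups/global/AuthenticatedUsers*"
        ]
    }
}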
Now that you know how to deny object uploads with permissions that would make the object public, you just need two more policy statements that prevent users from changing the bucket permissions: one denying s3:PutBucketAcl requests with public ACLs, and one denying s3:PutBucketAcl requests with public grants.
Below is how we’re preventing users from changing the bucket permissions.
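A sketch of one of those statements (again using examplebucket) might look like this:

{
    "Sid": "DenyPublicBucketACL",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:PutBucketAcl",
    "Resource": "arn:aws:s3:::examplebucket",
    "Condition": {
        "StringEquals": {
            "s3:x-amz-acl": ["public-read", "public-read-write", "authenticated-read"]
        }
    }
}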
As you can see, the statement is very similar to the object statements, except that it uses s3:PutBucketAcl instead of s3:PutObjectAcl, and the Resource is just the bucket ARN, without the "/*" suffix that the object statements append.
In this section, we showed how to prevent IAM users from accidentally uploading Amazon S3 objects with public permissions to buckets. In the next section, we show you how to enforce multiple layers of security controls, such as encryption of data at rest and in transit, while serving traffic from Amazon S3.
Securing data on Amazon S3 with defense-in-depth
Let’s say that Example Corp. wants to serve files securely from Amazon S3 to its users with the following requirements:
The data must be encrypted at rest and during transit.
The data must be accessible only by a limited set of public IP addresses.
All requests for data should be handled only by Amazon CloudFront (which is a content delivery network) instead of being directly available from an Amazon S3 URL. If you’re using an Amazon S3 bucket as the origin for a CloudFront distribution, you can grant public permission to read the objects in your bucket. This allows anyone to access your objects either through CloudFront or the Amazon S3 URL. CloudFront doesn’t expose Amazon S3 URLs, but your users still might have access to those URLs if your application serves any objects directly from Amazon S3, or if anyone gives out direct links to specific objects in Amazon S3.
A domain name is required to consume the content. Custom SSL certificate support lets you deliver content over HTTPS by using your own domain name and your own SSL certificate. This gives visitors to your website the security benefits of CloudFront over an SSL connection that uses your own domain name, in addition to lower latency and higher reliability.
To represent defense-in-depth visually, the following diagram contains several Amazon S3 objects (A) in a single Amazon S3 bucket (B). You can encrypt these objects on the server side or the client side. You also can configure the bucket policy such that objects are accessible only through CloudFront, which you can accomplish through an origin access identity (C). You then can configure CloudFront to deliver content only over HTTPS in addition to using your own domain name (D).
Defense-in-depth requirement 1: Data must be encrypted at rest and during transit
Let’s start with the objects themselves. Amazon S3 objects—files in this case—can range from zero bytes to multiple terabytes in size (see service limits for the latest information). Each Amazon S3 bucket includes a collection of objects, and the objects can be uploaded via the Amazon S3 console, AWS CLI, or AWS API.
If you choose to use server-side encryption, Amazon S3 encrypts your objects before saving them on disks in AWS data centers. To encrypt an object at the time of upload, you need to add the x-amz-server-side-encryption header to the request to tell Amazon S3 to encrypt the object using Amazon S3 managed keys (SSE-S3), AWS KMS managed keys (SSE-KMS), or customer-provided keys (SSE-C). There are two possible values for the x-amz-server-side-encryption header: AES256, which tells Amazon S3 to use Amazon S3 managed keys, and aws:kms, which tells Amazon S3 to use AWS KMS managed keys.
The following code example shows a Put request using SSE-S3.
PUT /example-object HTTP/1.1
Host: myBucket.s3.amazonaws.com
Date: Wed, 8 Jun 2016 17:50:00 GMT
Authorization: authorization string
Content-Type: text/plain
Content-Length: 11434
x-amz-meta-author: Janet
Expect: 100-continue
x-amz-server-side-encryption: AES256
[11434 bytes of object data]
If you choose to use client-side encryption, you can encrypt data on the client side and upload the encrypted data to Amazon S3. In this case, you manage the encryption process, the encryption keys, and related tools. You encrypt data on the client side by using AWS KMS managed keys or a customer-supplied, client-side master key.
Defense-in-depth requirement 2: Data must be accessible only by a limited set of public IP addresses
At the Amazon S3 bucket level, you can configure permissions through a bucket policy. For example, you can limit access to the objects in a bucket by IP address range or specific IP addresses. Alternatively, you can make the objects accessible only through HTTPS.
The following bucket policy allows access to Amazon S3 objects only through HTTPS (the policy was generated with the AWS Policy Generator). Here the bucket policy explicitly denies ("Effect": "Deny") all read access ("Action": "s3:GetObject") from anybody who browses ("Principal": "*") to Amazon S3 objects within an Amazon S3 bucket if they are not accessed through HTTPS ("aws:SecureTransport": "false").
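A sketch of such a policy (the bucket name is a placeholder) looks like this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::examplebucket/*",
            "Condition": {
                "Bool": {
                    "aws:SecureTransport": "false"
                }
            }
        }
    ]
}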
Defense-in-depth requirement 3: Data must not be publicly accessible directly from an Amazon S3 URL
Next, configure Amazon CloudFront to serve traffic from within the bucket. The use of CloudFront serves several purposes:
CloudFront is a content delivery network that acts as a cache to serve static files quickly to clients.
Depending on the number of requests, the cost of delivery is less than if objects were served directly via Amazon S3.
Objects served through CloudFront can be limited to specific countries.
Access to these Amazon S3 objects is available only through CloudFront. We do this by creating an origin access identity (OAI) for CloudFront and granting access to objects in the respective Amazon S3 bucket only to that OAI. As a result, access to Amazon S3 objects from the internet is possible only through CloudFront; all other means of accessing the objects—such as through an Amazon S3 URL—are denied. CloudFront acts not only as a content distribution network, but also as a host that denies access based on geographic restrictions. You apply these restrictions by updating your CloudFront web distribution and adding a whitelist that contains only a specific country’s name (let’s say Liechtenstein). Alternatively, you could add a blacklist that contains every country except that country. Learn more about how to use CloudFront geographic restriction to whitelist or blacklist a country to restrict or allow users in specific locations from accessing web content in the AWS Support Knowledge Center.
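For illustration only, the bucket policy statement that grants read access to the OAI might look like the following (the OAI ID and bucket name are placeholders):

{
    "Sid": "AllowCloudFrontOAIReadOnly",
    "Effect": "Allow",
    "Principal": {
        "AWS": "arn:aws:iam::cloudfront:user/CloudFront Origin Access Identity EXXXXXXXXXXXXX"
    },
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::examplebucket/*"
}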
Defense-in-depth requirement 4: A domain name is required to consume the content
To serve content from CloudFront, you must use a domain name in the URLs for objects on your webpages or in your web application. The domain name can be either of the following:
The domain name that CloudFront automatically assigns when you create a distribution, such as d111111abcdef8.cloudfront.net
Your own domain name, such as example.com
For example, you might use one of the following URLs to return the file image.jpg:
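Using the default CloudFront domain name shown above, such a URL might look like this:

http://d111111abcdef8.cloudfront.net/images/image.jpg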
You use the same URL format whether you store the content in Amazon S3 buckets or at a custom origin, like one of your own web servers.
Instead of using the default domain name that CloudFront assigns for you when you create a distribution, you can add an alternate domain name that’s easier to work with, like example.com. By setting up your own domain name with CloudFront, you can use a URL like this for objects in your distribution: http://example.com/images/image.jpg.
Let’s say that you already have a domain name hosted on Amazon Route 53. You would like to serve traffic from the domain name, request an SSL certificate, and add this to your CloudFront web distribution. The SSL offloading occurs in CloudFront by serving traffic securely from each CloudFront location. You also can configure CloudFront to deliver your content over HTTPS by using your custom domain name and your own SSL certificate. Serving web content through CloudFront reduces response times because requests are routed to the nearest edge location. This results in faster download times than if the visitor had requested the content from a data center that is located farther away.
Summary
In this post, we demonstrated how you can apply policies to Amazon S3 buckets so that only users with appropriate permissions are allowed to access the buckets. Anonymous users (with public-read/public-read-write permissions) and authenticated users without the appropriate permissions are prevented from accessing the buckets.
We also examined how to secure access to objects in Amazon S3 buckets. The objects in Amazon S3 buckets can be encrypted at rest and during transit. Doing so helps provide end-to-end security from the source (in this case, Amazon S3) to your users.
If you have feedback about this blog post, submit comments in the “Comments” section below. If you have questions about this blog post, start a new thread on the Amazon S3 forum or contact AWS Support.
I often encounter people experiencing frustration as they attempt to scale their e-commerce or WordPress site—particularly around the cost and complexity related to scaling. When I talk to customers about their scaling plans, they often mention phrases such as horizontal scaling and microservices, but usually people aren’t sure about how to dive in and effectively scale their sites.
Now let’s talk about different scaling options. For instance if your current workload is in a traditional data center, you can leverage the cloud for your on-premises solution. This way you can scale to achieve greater efficiency with less cost. It’s not necessary to set up a whole powerhouse to light a few bulbs. If your workload is already in the cloud, you can use one of the available out-of-the-box options.
Designing your API in microservices and adding horizontal scaling might seem like the best choice, unless your web application is already running in an on-premises environment and you’ll need to quickly scale it because of unexpected large spikes in web traffic.
So how to handle this situation? Take things one step at a time when scaling and you may find horizontal scaling isn’t the right choice, after all.
For example, assume you have a tech news website where you did an early-look review of an upcoming—and highly anticipated—smartphone launch, which went viral. The review, a blog post on your website, includes both video and pictures. Comments are enabled for the post and readers can also rate it. If your website is hosted on a traditional Linux server with a LAMP stack, you may find yourself with immediate scaling problems.
Let’s dig into the details of the current scenario:
Where are images and videos stored?
How many read/write requests are received per second? Per minute?
What is the level of security required?
Are these synchronous or asynchronous requests?
We’ll also want to consider the following if your website has a transactional load like e-commerce or banking:
How is the website handling sessions?
Do you have any compliance requirements—like the Payment Card Industry Data Security Standard (PCI DSS)—if your website is using its own payment gateway?
How are you recording customer behavior data and fulfilling your analytics needs?
What are your load balancing considerations (scaling, caching, session maintenance, etc.)?
So, if we take this one step at a time:
Step 1: Ease server load. We need to quickly handle spikes in traffic generated by activity on the blog post, so let’s reduce server load by moving images and video to a third-party content delivery network (CDN). AWS provides Amazon CloudFront as a CDN solution, which is highly scalable and has built-in security to verify origin access identity and handle any DDoS attacks. CloudFront can direct traffic to your on-premises or cloud-hosted server with its 113 Points of Presence (102 Edge Locations and 11 Regional Edge Caches) in 56 cities across 24 countries, which provides efficient caching.
Step 2: Reduce read load by adding more read replicas. MySQL provides mirror replication for databases, Oracle offers its own replication plug-in, and Amazon RDS provides up to five read replicas, which can even span Regions. Amazon Aurora can have 15 read replicas, with Amazon Aurora Auto Scaling support. If a workload is highly variable, you should consider Amazon Aurora Serverless to achieve high efficiency and reduced cost. While most mirror technologies do asynchronous replication, Amazon RDS can provide synchronous multi-AZ replication, which is good for disaster recovery but not for scalability. Asynchronous replication to a mirror instance means replicated data can sometimes be stale if network bandwidth is low, so you need to plan and design your application accordingly.
I recommend that you always use a read replica for any reporting needs, and try to move non-critical GET services to a read replica to reduce the load on the master database. In this case, the comments associated with a blog post can be fetched from a read replica—as they can tolerate some delay—in case there is any lag with asynchronous replication.
Step 3: Reduce write requests. This can be achieved by introducing queue to process the asynchronous message. Amazon Simple Queue Service (Amazon SQS) is a highly-scalable queue, which can handle any kind of work-message load. You can process data, like rating and review; or calculate Deal Quality Score (DQS) using batch processing via an SQS queue. If your workload is in AWS, I recommend using a job-observer pattern by setting up Auto Scaling to automatically increase or decrease the number of batch servers, using the number of SQS messages, with Amazon CloudWatch, as the trigger. For on-premises workloads, you can use SQS SDK to create an Amazon SQS queue that holds messages until they’re processed by your stack. Or you can use Amazon SNS to fan out your message processing in parallel for different purposes like adding a watermark in an image, generating a thumbnail, etc.
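As a rough sketch (the queue name, account ID, and message body are placeholders), the producer/consumer flow with the AWS CLI looks like this:

$ aws sqs create-queue --queue-name post-ratings-queue
$ aws sqs send-message --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/post-ratings-queue --message-body '{"postId": 42, "rating": 5}'
$ aws sqs receive-message --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/post-ratings-queue --max-number-of-messages 10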
Step 4: Introduce a more robust caching engine. You can use Amazon ElastiCache for Memcached or Redis to reduce the load on your database. Memcached and Redis have different use cases: if you can afford to lose and rebuild your cache from your database, use Memcached; if you are looking for more robust data persistence and complex data structures, use Redis. In AWS, these are managed services, which means AWS takes care of the operational work for you, and you can also deploy them on your on-premises instances or use a hybrid approach.
Step 5: Scale your server. If there are still issues, it’s time to scale your server. For the greatest cost-effectiveness and unlimited scalability, I suggest always using horizontal scaling. However, for some use cases, such as databases, vertical scaling may be a better choice until you are ready to implement sharding; or use Amazon Aurora Serverless for variable workloads. For horizontal scaling, it is wise to use Auto Scaling to manage your workload effectively. To achieve that, you also need to persist sessions outside the instances; Amazon DynamoDB can handle session persistence across instances.
If your server is on premises, consider creating a multisite architecture, which will help you achieve quick scalability as required and provide a good disaster recovery solution. You can pick and choose individual services like Amazon Route 53, AWS CloudFormation, Amazon SQS, Amazon SNS, Amazon RDS, etc. depending on your needs.
Your multisite architecture will look like the following diagram:
In this architecture, you can run your regular workload on premises, and use your AWS workload as required for scalability and disaster recovery. Using Route 53, you can direct a precise percentage of users to an AWS workload.
If you decide to move all of your workloads to AWS, the recommended multi-AZ architecture would look like the following:
In this architecture, you are using a multi-AZ distributed workload for high availability. You can have a multi-Region setup and use Route 53 to distribute your workload between AWS Regions. CloudFront helps you scale and distribute static content via an S3 bucket, and DynamoDB maintains your application state so that Auto Scaling can apply horizontal scaling without loss of session data. At the database layer, RDS with a multi-AZ standby provides high availability, and read replicas help achieve scalability.
This is a high-level strategy to help you think through the scalability of your workload by using AWS, even if your workload is on premises and not in the cloud…yet.
I highly recommend creating a hybrid, multisite model by placing a replica of your on-premises environment in a public cloud like AWS, and using the Amazon Route 53 DNS service and Elastic Load Balancing to route traffic between the on-premises and cloud environments. AWS now supports load balancing between AWS and on-premises environments to help you scale your cloud environment quickly whenever required, and scale it back down by applying Auto Scaling while placing a threshold on your on-premises traffic using Route 53.
Last week I attended a talk given by Bryan Mistele, president of Seattle-based INRIX. Bryan’s talk provided a glimpse into the future of transportation, centering around four principle attributes, often abbreviated as ACES:
Autonomous – Cars and trucks are gaining the ability to scan and to make sense of their environments and to navigate without human input.
Connected – Vehicles of all types have the ability to take advantage of bidirectional connections (either full-time or intermittent) to other cars and to cloud-based resources. They can upload road and performance data, communicate with each other to run in packs, and take advantage of traffic and weather data.
Electric – Continued development of battery and motor technology will make electric vehicles more convenient, cost-effective, and environmentally friendly.
Shared – Ride-sharing services will change usage from an ownership model to an as-a-service model (sound familiar?).
Individually and in combination, these emerging attributes mean that the cars and trucks we will see and use in the decade to come will be markedly different than those of the past.
On the Road with AWS
AWS customers are already using our AWS IoT, edge computing, Amazon Machine Learning, and Alexa products to bring this future to life – vehicle manufacturers, their tier 1 suppliers, and AutoTech startups all use AWS for their ACES initiatives. AWS Greengrass is playing an important role here, attracting design wins and helping our customers to add processing power and machine learning inferencing at the edge.
AWS customer Aptiv (formerly Delphi) talked about their Automated Mobility on Demand (AMoD) smart vehicle architecture in an AWS re:Invent session. Aptiv’s AMoD platform will use Greengrass and microservices to drive the onboard user experience, along with edge processing, monitoring, and control. Here’s an overview:
Another customer, Denso of Japan (one of the world’s largest suppliers of auto components and software) is using Greengrass and AWS IoT to support their vision of Mobility as a Service (MaaS). Here’s a video:
AWS at CES
The AWS team will be out in force at CES in Las Vegas and would love to talk to you. They’ll be running demos that show how AWS can help to bring innovation and personalization to connected and autonomous vehicles.
Personalized In-Vehicle Experience – This demo shows how AWS AI and Machine Learning can be used to create a highly personalized and branded in-vehicle experience. It makes use of Amazon Lex, Polly, and Amazon Rekognition, but the design is flexible and can be used with other services as well. The demo encompasses driver registration, login and startup (including facial recognition), voice assistance for contextual guidance, personalized e-commerce, and vehicle control. Here’s the architecture for the voice assistance:
Connected Vehicle Solution – This demo shows how a connected vehicle can combine local and cloud intelligence, using edge computing and machine learning at the edge. It handles intermittent connections and uses AWS DeepLens to train a model that responds to distracted drivers. Here’s the overall architecture, as described in our Connected Vehicle Solution:
Digital Content Delivery – This demo will show how a customer uses a web-based 3D configurator to build and personalize their vehicle. It will also show high resolution (4K) 3D image and an optional immersive AR/VR experience, both designed for use within a dealership.
Autonomous Driving – This demo will showcase the AWS services that can be used to build autonomous vehicles. There’s a 1/16th scale model vehicle powered and driven by Greengrass and an overview of a new AWS Autonomous Toolkit. As part of the demo, attendees drive the car, training a model via Amazon SageMaker for subsequent on-board inferencing, powered by Greengrass ML Inferencing.
To speak to one of my colleagues or to set up a time to see the demos, check out the Visit AWS at CES 2018 page.
Some Resources
If you are interested in this topic and want to learn more, the AWS for Automotive page is a great starting point, with discussions on connected vehicles & mobility, autonomous vehicle development, and digital customer engagement.
When you are ready to start building a connected vehicle, the AWS Connected Vehicle Solution contains a reference architecture that combines local computing, sophisticated event rules, and cloud-based data processing and storage. You can use this solution to accelerate your own connected vehicle projects.
Amazon CloudFront is a web service that speeds up distribution of your static and dynamic web content to end users through a worldwide network of edge locations. CloudFront provides a number of benefits and capabilities that can help you secure your applications and content while meeting compliance requirements. For example, you can configure CloudFront to help enforce secure, end-to-end connections using HTTPS SSL/TLS encryption. You also can take advantage of CloudFront integration with AWS Shield for DDoS protection and with AWS WAF (a web application firewall) for protection against application-layer attacks, such as SQL injection and cross-site scripting.
Now, CloudFront field-level encryption helps secure sensitive data such as customer phone numbers by adding another security layer to CloudFront HTTPS. Using this functionality, you can help ensure that sensitive information in a POST request is encrypted at CloudFront edge locations. This information remains encrypted as it flows to and beyond your origin servers that terminate HTTPS connections with CloudFront, and throughout the application environment. In this blog post, we demonstrate how you can enhance the security of sensitive data by using CloudFront field-level encryption.
Note: This post assumes that you understand concepts and services such as content delivery networks, HTTP forms, public-key cryptography, CloudFront, AWS Lambda, and the AWS CLI. If necessary, you should familiarize yourself with these concepts and review the solution overview in the next section before proceeding with the deployment of this post’s solution.
How field-level encryption works
Many web applications collect and store data from users as those users interact with the applications. For example, a travel-booking website may ask for your passport number and less sensitive data such as your food preferences. This data is transmitted to web servers and also might travel among a number of services to perform tasks. However, this also means that your sensitive information may need to be accessed by only a small subset of these services (most other services do not need to access your data).
User data is often stored in a database for retrieval at a later time. One approach to protecting stored sensitive data is to configure and code each service to protect that sensitive data. For example, you can develop safeguards in logging functionality to ensure sensitive data is masked or removed. However, this can add complexity to your code base and limit performance.
Field-level encryption addresses this problem by ensuring sensitive data is encrypted at CloudFront edge locations. Sensitive data fields in HTTPS form POSTs are automatically encrypted with a user-provided public RSA key. After the data is encrypted, other systems in your architecture see only ciphertext. If this ciphertext unintentionally becomes externally available, the data is cryptographically protected and only designated systems with access to the private RSA key can decrypt the sensitive data.
It is critical to secure private RSA key material to prevent unauthorized access to the protected data. Management of cryptographic key material is a larger topic that is out of scope for this blog post, but should be carefully considered when implementing encryption in your applications. For example, in this blog post we store private key material as a secure string in the Amazon EC2 Systems Manager Parameter Store. The Parameter Store provides a centralized location for managing your configuration data such as plaintext data (such as database strings) or secrets (such as passwords) that are encrypted using AWS Key Management Service (AWS KMS). You may have an existing key management system in place that you can use, or you can use AWS CloudHSM. CloudHSM is a cloud-based hardware security module (HSM) that enables you to easily generate and use your own encryption keys in the AWS Cloud.
To illustrate field-level encryption, let’s look at a simple form submission where Name and Phone values are sent to a web server using an HTTP POST. A typical form POST would contain data such as the following.
POST / HTTP/1.1
Host: example.com
Content-Type: application/x-www-form-urlencoded
Content-Length:60
Name=Jane+Doe&Phone=404-555-0150
Instead of taking this typical approach, field-level encryption converts this data similar to the following.
POST / HTTP/1.1
Host: example.com
Content-Type: application/x-www-form-urlencoded
Content-Length: 1713
Name=Jane+Doe&Phone=AYABeHxZ0ZqWyysqxrB5pEBSYw4AAA...
To further demonstrate field-level encryption in action, this blog post includes a sample serverless application that you can deploy by using a CloudFormation template, which creates an application environment using CloudFront, Amazon API Gateway, and Lambda. The sample application is only intended to demonstrate field-level encryption functionality and is not intended for production use. The following diagram depicts the architecture and data flow of this sample application.
Sample application architecture and data flow
Here is how the sample solution works:
An application user submits an HTML form page with sensitive data, generating an HTTPS POST to CloudFront.
Field-level encryption intercepts the form POST and encrypts sensitive data with the public RSA key and replaces fields in the form post with encrypted ciphertext. The form POST ciphertext is then sent to origin servers.
The serverless application accepts the form post data containing ciphertext where sensitive data would normally be. If a malicious user were able to compromise your application and gain access to your data, such as the contents of a form, that user would see encrypted data.
Lambda stores data in a DynamoDB table, leaving sensitive data to remain safely encrypted at rest.
An administrator uses the AWS Management Console and a Lambda function to view the sensitive data.
During the session, the administrator retrieves ciphertext from the DynamoDB table.
Decrypted sensitive data is transmitted over SSL/TLS via the AWS Management Console to the administrator for review.
Deployment walkthrough
The high-level steps to deploy this solution are as follows:
Stage the required artifacts: When deployment packages are used with Lambda, the zipped artifacts have to be placed in an S3 bucket in the target AWS Region for deployment. This step is not required if you are deploying in the US East (N. Virginia) Region because the package has already been staged there.
Generate an RSA key pair: Create a public/private key pair that will be used to perform the encrypt/decrypt functionality.
Upload the public key to CloudFront and associate it with the field-level encryption configuration: After you create the key pair, the public key is uploaded to CloudFront so that it can be used by field-level encryption.
Launch the CloudFormation stack: Deploy the sample application for demonstrating field-level encryption by using AWS CloudFormation.
Add the field-level encryption configuration to the CloudFront distribution: After you have provisioned the application, this step associates the field-level encryption configuration with the CloudFront distribution.
Store the RSA private key in the Parameter Store: Store the private key in the Parameter Store as a SecureString data type, which uses AWS KMS to encrypt the parameter value.
Deploy the solution
1. Stage the required artifacts
(If you are deploying in the US East [N. Virginia] Region, skip to Step 2, “Generate an RSA key pair.”)
Stage the Lambda function deployment package in an Amazon S3 bucket located in the AWS Region you are using for this solution. To do this, download the zipped deployment package and upload it to your in-region bucket. For additional information about uploading objects to S3, see Uploading Objects into Amazon S3.
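One way to copy the package with the AWS CLI; the local file name and bucket name here are placeholders, not the names used by the actual sample:
$ aws s3 cp field-level-encryption-sample.zip s3://<your-artifact-bucket>/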
2. Generate an RSA key pair
In this section, you will generate an RSA key pair by using OpenSSL:
Confirm access to OpenSSL.
$ openssl version
You should see version information similar to the following.
OpenSSL <version> <date>
Create a private key using the following command.
$ openssl genrsa -out private_key.pem 2048
The command generates a file named private_key.pem containing the 2048-bit private key.
Restrict access to the private key.
$ chmod 600 private_key.pem
Note: You will use the public and private key material in Steps 3 and 6 to configure the sample application.
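The later steps reference a public_key.pem file. The exact command isn't shown in this walkthrough, but one common way to derive the public key from the private key with OpenSSL is:
$ openssl rsa -pubout -in private_key.pem -out public_key.pem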
3. Upload the public key to CloudFront and associate it with the field-level encryption configuration
Now that you have created the RSA key pair, you will use the AWS Management Console to upload the public key to CloudFront for use by field-level encryption. Complete the following steps to upload and configure the public key.
Note: Do not include spaces or special characters when providing the configuration values in this section.
From the AWS Management Console, choose Services > CloudFront.
In the navigation pane, choose Public Key and choose Add Public Key.
Complete the Add Public Key configuration boxes:
Key Name: Type a name such as DemoPublicKey.
Encoded Key: Paste the contents of the public_key.pem file you created in Step 2, including the -----BEGIN PUBLIC KEY----- and -----END PUBLIC KEY----- lines.
Comment: Optionally add a comment.
Choose Create.
After adding at least one public key to CloudFront, the next step is to create a profile to tell CloudFront which fields of input you want to be encrypted. While still on the CloudFront console, choose Field-level encryption in the navigation pane.
Under Profiles, choose Create profile.
Complete the Create profile configuration boxes:
Name: Type a name such as FLEDemo.
Comment: Optionally add a comment.
Public key: Select the public key you configured in Step 4.b.
Provider name: Type a provider name such as FLEDemo. This information will be used when the form data is encrypted, and must be provided to applications that need to decrypt the data, along with the appropriate private key.
Pattern to match: Type phone. This configures field-level encryption to match the phone field in the form POST.
Choose Save profile.
Configurations include options for whether to block or forward a query to your origin in scenarios where CloudFront can’t encrypt the data. Under Encryption Configurations, choose Create configuration.
Complete the Create configuration boxes:
Comment: Optionally add a comment.
Content type: Enter application/x-www-form-urlencoded. This is a common media type for encoding form data.
Default profile ID: Select the profile you added in Step 3e.
Choose Save configuration.
4. Launch the CloudFormation stack
Launch the sample application by using a CloudFormation template that automates the provisioning process.
The CloudFormation template accepts the following input parameters:
ProviderID: Enter the Provider name you assigned in Step 3e. The ProviderID is used in the field-level encryption configuration in CloudFront (letters and numbers only, no special characters).
PublicKeyName: Enter the Key Name you assigned in Step 3b. This name is assigned to the public key in the field-level encryption configuration in CloudFront (letters and numbers only, no special characters).
PrivateKeySSMPath: Leave as the default: /cloudfront/field-encryption-sample/private-key
The remaining parameter specifies the path in the S3 bucket containing the artifact files; leave it as the default if you are deploying in us-east-1.
To finish creating the CloudFormation stack:
Choose Next on the Select Template page, enter the input parameters and choose Next. Note: The Artifacts configuration needs to be updated only if you are deploying outside of us-east-1 (US East [N. Virginia]). See Step 1 for artifact staging instructions.
On the Options page, accept the defaults and choose Next.
On the Review page, confirm the details, choose the I acknowledge that AWS CloudFormation might create IAM resources check box, and then choose Create. (The stack will be created in approximately 15 minutes.)
5. Add the field-level encryption configuration to the CloudFront distribution
In the Outputs section of the FLE-Sample-App CloudFormation stack, look for CloudFrontDistribution and click the URL to open that distribution in the CloudFront console. Then complete the following steps:
Choose Behaviors, choose the Default (*) behavior, and then choose Edit.
For Field-level Encryption Config, choose the configuration you created in Step 3g.
Choose Yes, Edit.
While still in the CloudFront distribution configuration, choose the General tab, choose Edit, scroll down to Distribution State, and change it to Enabled.
Choose Yes, Edit.
6. Store the RSA private key in the Parameter Store
In this step, you store the private key in the EC2 Systems Manager Parameter Store as a SecureString data type, which uses AWS KMS to encrypt the parameter value. For more information about AWS KMS, see the AWS Key Management Service Developer Guide. You will need a working installation of the AWS CLI to complete this step.
Store the private key in the Parameter Store with the AWS CLI by running the following command. You will find the <KMSKeyID> in the KMSKeyID entry of the CloudFormation stack Outputs. Substitute it for the placeholder in the following command.
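A sketch of that command, assuming the parameter path shown in the table above and the KMSKeyID value from the stack Outputs:
$ aws ssm put-parameter --type "SecureString" --name /cloudfront/field-encryption-sample/private-key --value file://private_key.pem --key-id "<KMSKeyID>"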
Verify the parameter. Your private key material should be returned in the Value field when you run the get-parameter command shown below. (The key material in the sample output has been truncated.)
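One way to run that check, assuming the same parameter path:
$ aws ssm get-parameter --name /cloudfront/field-encryption-sample/private-key --with-decryption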
Notice that we use the --with-decryption argument in this command. This returns the private key as cleartext.
Delete the private key from local storage. On Linux, for example, you can use the shred command to securely delete the private key material from your workstation, as shown below. You may also wish to store the private key material in AWS CloudHSM or another protected location suitable for your security requirements. For production implementations, you should also implement key rotation policies.
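For example, on systems with GNU coreutils, a possible invocation (the iteration count is arbitrary):
$ shred -zvu -n 10 private_key.pem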
This completes the sample application deployment. Next, we show you how to see field-level encryption in action.
Use the following steps to test the sample application with field-level encryption:
Open the sample application in your web browser by clicking the ApplicationURL link in the CloudFormation stack Outputs (for example, https://d199xe5izz82ea.cloudfront.net/prod/). Note that it may take several minutes for the CloudFront distribution to reach Deployed status after the previous step, during which time you may not be able to access the sample application.
Fill out and submit the HTML form on the page:
Complete the three form fields: Full Name, Email Address, and Phone Number.
Choose Submit. Notice that the application response includes the form values, with the phone number replaced by ciphertext encrypted using your public key. This ciphertext has been stored in DynamoDB (you can inspect the stored item as shown after these steps).
Execute the Lambda decryption function to download ciphertext from DynamoDB and decrypt the phone number using the private key:
In the CloudFormation stack Outputs, locate DecryptFunction and click the URL to open the Lambda console.
Configure a test event using the “Hello World” template.
Choose the Test button.
View the encrypted and decrypted phone number data.
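If you also want to inspect the stored item directly, one option is a table scan with the AWS CLI; the table name below is a placeholder for whatever table the stack actually created:
$ aws dynamodb scan --table-name <SampleTableName>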
Summary
In this blog post, we showed you how to use CloudFront field-level encryption to encrypt sensitive data at edge locations and help prevent access from unauthorized systems. The source code for this solution is available on GitHub. For additional information about field-level encryption, see the documentation.
If you have comments about this post, submit them in the “Comments” section below. If you have questions about or issues implementing this solution, please start a new thread on the CloudFront forum.
This op-ed by a “net neutrality expert” claims the FCC has always defended “net neutrality”. It’s garbage.
This is wrong on its face. It imagines that decades ago the FCC enshrined some plaque on the wall stating principles that subsequent FCC commissioners have diligently followed. The opposite is true. FCC commissioners are a chaotic bunch, with different interests, influenced (i.e. “lobbied” or “bribed”) by different telecommunications/Internet companies. Rather than following a principle, their Internet regulatory actions have been ad hoc and arbitrary — for decades.
Sure, you can cherry pick some of those regulatory actions as fitting a “net neutrality” narrative, but most actions don’t fit that narrative, and there have been gross net neutrality violations that the FCC has ignored.
There are gross violations going on right now that the FCC is allowing. Most egregious is the “zero-rating” of video traffic on T-Mobile. This is a clear violation of the principles of net neutrality, yet the FCC is allowing it — despite official “net neutrality” rules being in place.
The op-ed above claims that “this [net neutrality] principle was built into the architecture of the Internet”. The opposite is true. Traffic discrimination was built into the architecture since the beginning. If you don’t believe me, read RFC 791 and the “precedence” field.
More concretely, from the beginning of the Internet as we know it (the 1990s), CDNs (content delivery networks) have provided a fast-lane for customers willing to pay for it. These CDNs are so important that the Internet wouldn’t work without them.
I just traced the route of my CNN live stream. It comes from a server 5 miles away, instead of CNN’s headquarters 2500 miles away. That server is located inside Comcast’s network, because CNN pays Comcast a lot of money to get a fast-lane to Comcast’s customers.
The reason these egregious net neutrality violations exist is that they're in the interests of customers. Moving content closer to customers helps. Re-prioritizing (and charging less for) high-bandwidth video over cell networks helps customers.
You might say it’s okay that the FCC bends net neutrality rules when it benefits consumers, but that’s garbage. Net neutrality claims these principles are sacred and should never be violated. Obviously, that’s not true — they should be violated when it benefits consumers. This means what net neutrality is really saying is that ISPs can’t be trusted to always act to benefit consumers, and therefore need government oversight. Well, if that’s your principle, then what you are really saying is that you are a left-winger, not that you believe in net neutrality.
Anyway, my point is that the above op-ed cherry picks a few data points in order to build a narrative that the FCC has always regulated net neutrality. A larger view is that the FCC has never defended this on principle, and is indeed, not defending it right now, even with “net neutrality” rules officially in place.
With just 40 days remaining before AWS re:Invent begins, my colleagues and I want to share some tips that will help you to make the most of your time in Las Vegas. As always, our focus is on training and education, mixed in with some after-hours fun and recreation for balance.
Locations, Locations, Locations
The re:Invent Campus will span the length of the Las Vegas strip, with events taking place at the MGM Grand, Aria, Mirage, Venetian, Palazzo, the Sands Expo Hall, the Linq Lot, and the Encore. Each venue will host tracks devoted to specific topics:
MGM Grand – Business Apps, Enterprise, Security, Compliance, Identity, Windows.
Aria – Analytics & Big Data, Alexa, Container, IoT, AI & Machine Learning, and Serverless.
If your interests span more than one topic, plan to take advantage of the re:Invent shuttles that will be making the rounds between the venues.
Lots of Content
The re:Invent Session Catalog is now live, and you should start choosing the sessions of interest to you.
With more than 1100 sessions on the agenda, planning is essential! Some of the most popular “deep dive” sessions will be run more than once and others will be streamed to overflow rooms at other venues. We’ve analyzed a lot of data, run some simulations, and are doing our best to provide you with multiple opportunities to build an action-packed schedule.
We’re just about ready to let you reserve seats for your sessions (follow me and/or @awscloud on Twitter for a heads-up). Based on feedback from earlier years, we have fine-tuned our seat reservation model. This year, 75% of the seats for each session will be reserved and the other 25% are for walk-up attendees. We’ll start to admit walk-in attendees 10 minutes before the start of the session.
Las Vegas never sleeps and neither should you! This year we have a host of late-night sessions, workshops, chalk talks, and hands-on labs to keep you busy after dark.
See You in Vegas
As always, I am looking forward to meeting as many AWS users and blog readers as possible. Never hesitate to stop me and say hello!
The powerhouse Amazon retail store is set to launch in Australia toward the end of 2018 and Aussie ecommerce retailers need to ready themselves for the competition storm ahead.
2018 may seem a while away but getting your ecommerce site in tip top shape and ready to compete can take time. Check out these helpful hints from the Anchor crew.
Speed kills
If you’ve ever heard of the tale of the tortoise and the hare, the moral is that “slow and steady wins the race”. This is definitely not the place for that phrase, because if your site loads as slowly as a 1995 dial up connection, your ecommerce store will not, I repeat, will not win the race.
Site speed can be impacted by a number of factors, and it takes work to strike the right balance between a site that loads at lightning speed and one that delivers engaging content to your audience. There are many ways to check the performance of your site, including Anchor’s free hosting check up or Pingdom.
Taking action on these factors can noticeably boost the performance of your site.
As an ecommerce store, getting credit card details as fast as possible is probably at the top of your list, but it’s important to remember that it’s an actual person that needs to hand over the details.
Consider the customer’s experience whilst checking out. Making people log in to their account before checkout can lead to abandoned carts as customers try to remember the vital details. Similarly, making a customer enter all their details before displaying shipping costs is more of an annoyance than a benefit.
Built for growth
Before you blast out a promo email to your entire database or spend up big on PPC, consider what happens when a five-fold increase in traffic all jumps onto your site at around the same time.
Will your site come screeching to a sudden halt with a 504 or 408 error message, or ride high on the wave of increased traffic? If you have fixed infrastructure such as a dedicated server, or are utilising a VPS, then consider the maximum concurrent users that your site can handle.
Consider this. Amazon.com.au will be built on the scalable cloud infrastructure of Amazon Web Services and will utilise all the microservices and data mining technology to offer customers a seamless, personalised shopping experience. How will your business compete?
Search ready
Being found online is important for any business, but for ecommerce sites, it’s essential. Gaining results from SEO practices can take time so beware of ‘quick fix guarantees’ from outsourced agencies.
Search Engine Optimisation (SEO) practices can have lasting effects. Good practices can ensure your site is found via organic search without huge advertising budgets, on the other hand ‘black hat’ practices can push your ecommerce store into search oblivion.
SEO takes discipline and focus to get right. Here are some of our favourite hints for SEO greatness from those who live and breathe SEO:
Leverage descriptive alt tags and image file names
Create content for people, not bots (keyword stuffing is a no no!)
SEO best practices are continually evolving, but the foundation is creating a site that is designed to give users a great experience and the content they expect to find.
Google My Business is a free service that EVERY business should take advantage of. It is a listing service where your business can provide details such as address, phone number, website, and trading hours. It’s easy to update and manage, you can add photos, a physical address (if applicable), and display shopper reviews.
Get your site ship shape
Overwhelmed by these starter tips? If you are ready to get your site into tip top shape, get in touch. We work with awesome partners like eWave who can help create a seamless online shopping experience.
In the past months I have been working on a new project: casync. casync takes inspiration from the popular rsync file synchronization tool as well as the probably even more popular git revision control system. It combines the idea of the rsync algorithm with the idea of git-style content-addressable file systems, and creates a new system for efficiently storing and delivering file system images, optimized for high-frequency update cycles over the Internet. Its current focus is on delivering IoT, container, VM, application, portable service or OS images, but I hope to extend it later in a generic fashion to become useful for backups and home directory synchronization as well (but more about that later).
The basic technological building blocks casync is built from are neither new nor particularly innovative (at least not anymore), however the way casync combines them is different from existing tools, and that’s what makes it useful for a variety of usecases that other tools can’t cover that well.
Why?
I created casync after studying how today’s popular tools store and deliver file system images. To briefly (and far from comprehensively) name a few: Docker has a layered tarball approach, OSTree serves the individual files directly via HTTP and maintains packed deltas to speed up updates, while other systems operate on the block layer and place raw squashfs images (or other archival file systems, such as ISO 9660) for download on HTTP shares (in the better cases combined with zsync data).
Neither of these approaches appeared fully convincing to me when used in high-frequency update cycle systems. In such systems, it is important to optimize towards a couple of goals:
Most importantly, make updates cheap traffic-wise (for this most tools use image deltas of some form)
Put boundaries on disk space usage on servers (keeping deltas between all version combinations clients might want to run updates between, would suggest keeping an exponentially growing amount of deltas on servers)
Put boundaries on disk space usage on clients
Be friendly to Content Delivery Networks (CDNs), i.e. serve neither too many small nor too many overly large files, and only require the most basic form of HTTP. Provide the repository administrator with high-level knobs to tune the average file size delivered.
Simplicity to use for users, repository administrators and developers
I don’t think any of the tools mentioned above are really good on more than a small subset of these points.
Specifically: Docker’s layered tarball approach dumps the “delta” question onto the feet of the image creators: the best way to make your image downloads minimal is basing your work on an existing image clients might already have, and inherit its resources, maintaining full history. Here, revision control (a tool for the developer) is intermingled with update management (a concept for optimizing production delivery). As container histories grow individual deltas are likely to stay small, but on the other hand a brand-new deployment usually requires downloading the full history onto the deployment system, even though there’s no use for it there, and likely requires substantially more disk space and larger downloads.
OSTree’s serving of individual files is unfriendly to CDNs (as many small files in file trees cause an explosion of HTTP GET requests). To counter that OSTree supports placing pre-calculated delta images between selected revisions on the delivery servers, which means a certain amount of revision management that leaks into the clients.
Delivering direct squashfs (or other file system) images is almost beautifully simple, but of course means every update requires a full download of the newest image, which is both bad for disk usage and generated traffic. Enhancing it with zsync makes this a much better option, as it can reduce generated traffic substantially at very little cost of history/metadata (no explicit deltas between a large number of versions need to be prepared server side). On the other hand server requirements in disk space and functionality (HTTP Range requests) are minus points for the usecase I am interested in.
(Note: all the mentioned systems have great properties, and it’s not my intention to badmouth them. The only point I am trying to make is that for the use case I care about — file system image delivery with high-frequency update cycles — each system comes with certain drawbacks.)
Security & Reproducibility
Besides the issues pointed out above I wasn’t happy with the security and reproducibility properties of these systems. In today’s world where security breaches involving hacking and breaking into connected systems happen every day, an image delivery system that cannot make strong guarantees regarding data integrity is out of date. Specifically, the tarball format is famously non-deterministic: the very same file tree can result in any number of different valid serializations depending on the tool used, its version and the underlying OS and file system. Some tar implementations attempt to correct that by guaranteeing that each file tree maps to exactly one valid serialization, but such a property is always only specific to the tool used. I strongly believe that any good update system must guarantee on every single link of the chain that there’s only one valid representation of the data to deliver, that can easily be verified.
What casync Is
So much about the background why I created casync. Now, let’s have a look what casync actually is like, and what it does. Here’s the brief technical overview:
Encoding: Let’s take a large linear data stream, split it into variable-sized chunks (the size of each being a function of the chunk’s contents), and store these chunks in individual, compressed files in some directory, each file named after a strong hash value of its contents, so that the hash value may be used as a key for retrieving the full chunk data. Let’s call this directory a “chunk store”. At the same time, generate a “chunk index” file that lists these chunk hash values plus their respective chunk sizes in a simple linear array. The chunking algorithm is supposed to create variable, but similarly sized chunks from the data stream, and do so in a way that the same data results in the same chunks even if placed at varying offsets. For more information see this blog story.
Decoding: Let’s take the chunk index file, and reassemble the large linear data stream by concatenating the uncompressed chunks retrieved from the chunk store, keyed by the listed chunk hash values.
As an extra twist, we introduce a well-defined, reproducible, random-access serialization format for file trees (think: a more modern tar), to permit efficient, stable storage of complete file trees in the system, simply by serializing them and then passing them into the encoding step explained above.
Finally, let’s put all this on the network: for each image you want to deliver, generate a chunk index file and place it on an HTTP server. Do the same with the chunk store, and share it between the various index files you intend to deliver.
Why bother with all of this? Streams with similar contents will result in mostly the same chunk files in the chunk store. This means it is very efficient to store many related versions of a data stream in the same chunk store, thus minimizing disk usage. Moreover, when transferring linear data streams chunks already known on the receiving side can be made use of, thus minimizing network traffic.
Why is this different from rsync or OSTree, or similar tools? Well, one major difference between casync and those tools is that we remove file boundaries before chunking things up. This means that small files are lumped together with their siblings and large files are chopped into pieces, which permits us to recognize similarities in files and directories beyond file boundaries, and makes sure our chunk sizes are pretty evenly distributed, without the file boundaries affecting them.
The “chunking” algorithm is based on the buzhash rolling hash function. SHA256 is used as the strong hash function to generate digests of the chunks. xz is used to compress the individual chunks.
Here’s a diagram, hopefully explaining a bit how the encoding process works, were it not for my crappy drawing skills:
The diagram shows the encoding process from top to bottom. It starts with a block device or a file tree, which is then serialized and chunked up into variable sized blocks. The compressed chunks are then placed in the chunk store, while a chunk index file is written listing the chunk hashes in order. (The original SVG of this graphic may be found here.)
Details
Note that casync operates on two different layers, depending on the usecase of the user:
You may use it on the block layer. In this case the raw block data on disk is taken as-is, read directly from the block device, split into chunks as described above, compressed, stored and delivered.
You may use it on the file system layer. In this case, the file tree serialization format mentioned above comes into play: the file tree is serialized depth-first (much like tar would do it) and then split into chunks, compressed, stored and delivered.
The fact that it may be used on both the block and file system layer opens it up for a variety of different usecases. In the VM and IoT ecosystems shipping images as block-level serializations is more common, while in the container and application world file-system-level serializations are more typically used.
Chunk index files referring to block-layer serializations carry the .caibx suffix, while chunk index files referring to file system serializations carry the .caidx suffix. Note that you may also use casync as direct tar replacement, i.e. without the chunking, just generating the plain linear file tree serialization. Such files carry the .catar suffix. Internally .caibx are identical to .caidx files, the only difference is semantical: .caidx files describe a .catar file, while .caibx files may describe any other blob. Finally, chunk stores are directories carrying the .castr suffix.
Features
Here are a couple of other features casync has:
When downloading a new image you may use casync‘s --seed= feature: each block device, file, or directory specified is processed using the same chunking logic described above, and is used as preferred source when putting together the downloaded image locally, avoiding network transfer of it. This of course is useful whenever updating an image: simply specify one or more old versions as seed and only download the chunks that truly changed since then. Note that using seeds requires no history relationship between seed and the new image to download. This has major benefits: you can even use it to speed up downloads of relatively foreign and unrelated data. For example, when downloading a container image built using Ubuntu you can use your Fedora host OS tree in /usr as seed, and casync will automatically use whatever it can from that tree, for example timezone and locale data that tends to be identical between distributions. Example: casync extract http://example.com/myimage.caibx --seed=/dev/sda1 /dev/sda2. This will place the block-layer image described by the indicated URL in the /dev/sda2 partition, using the existing /dev/sda1 data as seeding source. An invocation like this could be typically used by IoT systems with an A/B partition setup. Example 2: casync extract http://example.com/mycontainer-v3.caidx --seed=/srv/container-v1 --seed=/srv/container-v2 /src/container-v3, is very similar but operates on the file system layer, and uses two old container versions to seed the new version.
When operating on the file system level, the user has fine-grained control on the metadata included in the serialization. This is relevant since different usecases tend to require a different set of saved/restored metadata. For example, when shipping OS images, file access bits/ACLs and ownership matter, while file modification times hurt. When doing personal backups OTOH file ownership matters little but file modification times are important. Moreover different backing file systems support different feature sets, and storing more information than necessary might make it impossible to validate a tree against an image if the metadata cannot be replayed in full. Due to this, casync provides a set of --with= and --without= parameters that allow fine-grained control of the data stored in the file tree serialization, including the granularity of modification times and more. The precise set of selected metadata features is also always part of the serialization, so that seeding can work correctly and automatically.
casync tries to be as accurate as possible when storing file system metadata. This means that besides the usual baseline of file metadata (file ownership and access bits), and more advanced features (extended attributes, ACLs, file capabilities), a number of more exotic metadata items are stored as well, including Linux chattr(1) file attributes, as well as FAT file attributes (you may wonder why the latter? — EFI is FAT, and /efi is part of the comprehensive serialization of any host). In the future I intend to extend this further, for example storing btrfs subvolume information where available. Note that as described above every single type of metadata may be turned off and on individually, hence if you don’t need FAT file bits (and I figure it’s pretty likely you don’t), then they won’t be stored.
The user creating .caidx or .caibx files may control the desired average chunk length (before compression) freely, using the --chunk-size= parameter. Smaller chunks increase the number of generated files in the chunk store and increase HTTP GET load on the server, but also ensure that sharing between similar images is improved, as identical patterns in the images stored are more likely to be recognized. By default casync will use a 64K average chunk size. Tweaking this can be particularly useful when adapting the system to specific CDNs, or when delivering compressed disk images such as squashfs (see below).
Emphasis is placed on making all invocations reproducible, well-defined and strictly deterministic. As mentioned above this is a requirement to reach the intended security guarantees, but is also useful for many other usecases. For example, the casync digest command may be used to calculate a hash value identifying a specific directory in all desired detail (use --with= and --without= to pick the desired detail). Moreover the casync mtree command may be used to generate a BSD mtree(5) compatible manifest of a directory tree, .caidx or .catar file.
The file system serialization format is nicely composable. By this I mean that the serialization of a file tree is the concatenation of the serializations of all files and file subtrees located at the top of the tree, with zero metadata references from any of these serializations into the others. This property is essential to ensure maximum reuse of chunks when similar trees are serialized.
When extracting file trees or disk image files, casync will automatically create reflinks from any specified seeds if the underlying file system supports it (such as btrfs, ocfs, and future xfs). After all, instead of copying the desired data from the seed, we can just tell the file system to link up the relevant blocks. This works both when extracting .caidx and .caibx files — the latter of course only when the extracted disk image is placed in a regular raw image file on disk, rather than directly on a plain block device, as plain block devices do not know the concept of reflinks.
Optionally, when extracting file trees, casync can create traditional UNIX hardlinks for identical files in specified seeds (--hardlink=yes). This works on all UNIX file systems, and can save substantial amounts of disk space. However, this only works for very specific usecases where disk images are considered read-only after extraction, as any changes made to one tree will propagate to all other trees sharing the same hardlinked files, as that’s the nature of hardlinks. In this mode, casync exposes OSTree-like behaviour, which is built heavily around read-only hardlink trees.
casync tries to be smart when choosing what to include in file system images. Implicitly, file systems such as procfs and sysfs are excluded from serialization, as they expose API objects, not real files. Moreover, the “nodump” (+d) chattr(1) flag is honoured by default, permitting users to mark files to exclude from serialization.
When creating and extracting file trees casync may apply an automatic or explicit UID/GID shift. This is particularly useful when transferring container images for use with Linux user namespacing.
In addition to local operation, casync currently supports HTTP, HTTPS, FTP and ssh natively for downloading chunk index files and chunks (the ssh mode requires installing casync on the remote host, though, but an sftp mode not requiring that should be easy to add). When creating index files or chunks, only ssh is supported as remote backend.
When operating on block-layer images, you may expose locally or remotely stored images as local block devices. Example: casync mkdev http://example.com/myimage.caibx exposes the disk image described by the indicated URL as local block device in /dev, which you then may use the usual block device tools on, such as mount or fdisk (only read-only though). Chunks are downloaded on access with high priority, and at low priority when idle in the background. Note that in this mode, casync also plays a role similar to “dm-verity”, as all blocks are validated against the strong digests in the chunk index file before passing them on to the kernel’s block layer. This feature is implemented through Linux’ NBD kernel facility.
Similarly, when operating on file-system-layer images, you may mount locally or remotely stored images as regular file systems. Example: casync mount http://example.com/mytree.caidx /srv/mytree mounts the file tree image described by the indicated URL as a local directory /srv/mytree. This feature is implemented through Linux’ FUSE kernel facility. Note that special care is taken that the images exposed this way can be packed up again with casync make and are guaranteed to return the bit-by-bit exact same serialization again that it was mounted from. No data is lost or changed while passing things through FUSE (OK, strictly speaking this is a lie, we do lose ACLs, but that’s hopefully just a temporary gap to be fixed soon).
In IoT A/B fixed size partition setups the file systems placed in the two partitions are usually much smaller than the partition size, in order to keep some room for later, larger updates. casync is able to analyze the superblock of a number of common file systems in order to determine the actual size of a file system stored on a block device, so that writing a file system to such a partition and reading it back again will result in reproducible data. Moreover this speeds up the seeding process, as there’s little point in seeding the empty space after the file system within the partition.
Example Command Lines
Here’s how to use casync, explained with a few examples:
$ casync make foobar.caidx /some/directory
This will create a chunk index file foobar.caidx in the local directory, and populate the chunk store directory default.castr located next to it with the chunks of the serialization (you can change the name for the store directory with --store= if you like). This command operates on the file-system level. A similar command operating on the block level:
$ casync make foobar.caibx /dev/sda1
This command creates a chunk index file foobar.caibx in the local directory describing the current contents of the /dev/sda1 block device, and populates default.castr in the same way as above. Note that you may as well read a raw disk image from a file instead of a block device:
$ casync make foobar.caibx myimage.raw
To reconstruct the original file tree from the .caidx file and the chunk store of the first command, use:
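The extract commands themselves are not shown above; as a sketch (paths and URL are illustrative), first against the local index, then against a copy served over HTTP:
$ casync extract foobar.caidx /some/other/directory
$ casync extract http://example.com/images/foobar.caidx /some/other/directory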
This extracts the specified .caidx onto a local directory. This of course assumes that foobar.caidx was uploaded to the HTTP server in the first place, along with the chunk store. You can use any command you like to accomplish that, for example scp or rsync. Alternatively, you can let casync do this directly when generating the chunk index:
$ casync make ssh.example.com:images/foobar.caidx /some/directory
This will use ssh to connect to the ssh.example.com server, and then places the .caidx file and the chunks on it. Note that this mode of operation is “smart”: this scheme will only upload chunks currently missing on the server side, and not retransmit what already is available.
Note that you can always configure the precise path or URL of the chunk store via the --store= option. If you do not do that, then the store path is automatically derived from the path or URL: the last component of the path or URL is replaced by default.castr.
Of course, when extracting .caidx or .caibx files from remote sources, using a local seed is advisable:
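A sketch of such an invocation (URL and seed path are illustrative):
$ casync extract http://example.com/images/foobar.caidx --seed=/some/old/directory /some/new/directory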
When creating chunk indexes on the file system layer casync will by default store metadata as accurately as possible. Let’s create a chunk index with reduced metadata:
$ casync make foobar.caidx --with=sec-time --with=symlinks --with=read-only /some/dir
This command will create a chunk index for a file tree serialization that has three features above the absolute baseline supported: 1s granularity timestamps, symbolic links and a single read-only bit. In this mode, all the other metadata bits are not stored, including nanosecond timestamps, full unix permission bits, file ownership or even ACLs or extended attributes.
Now let’s make a .caidx file available locally as a mounted file system, without extracting it:
$ casync mount http://example.com/images/foobar.caidx /mnt/foobar
And similar, let’s make a .caibx file available locally as a block device:
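For example (URL illustrative, mirroring the mkdev example earlier):
$ casync mkdev http://example.com/images/foobar.caibx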
This will create a block device in /dev and print the used device node path to STDOUT.
As mentioned, casync is big about reproducibility. Let’s make use of that to calculate a digest identifying a very specific version of a file tree:
$ casync digest .
This digest will include all metadata bits casync and the underlying file system know about. Usually, to make this useful you want to configure exactly what metadata to include:
$ casync digest --with=unix .
This makes use of the --with=unix shortcut for selecting metadata fields. Specifying --with=unix selects all metadata that traditional UNIX file systems support. It is a shortcut for writing out: --with=16bit-uids --with=permissions --with=sec-time --with=symlinks --with=device-nodes --with=fifos --with=sockets.
Note that when calculating digests or creating chunk indexes you may also use the negative --without= option to remove specific features, starting from the most precise set:
$ casync digest --without=flag-immutable
This generates a digest with the most accurate metadata, but leaves one feature out: chattr(1)‘s immutable (+i) file flag.
To list the contents of a .caidx file, or to generate a manifest of it, use commands like the following:
$ casync list http://example.com/images/foobar.caidx
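And, as described in the feature list earlier, a manifest can be generated from the same index (same illustrative URL):
$ casync mtree http://example.com/images/foobar.caidx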
The former command will generate a brief list of files and directories, not too different from tar t or ls -al in its output. The latter command will generate a BSD mtree(5) compatible manifest. Note that casync actually stores substantially more file metadata than mtree files can express, though.
What casync isn’t
casync is not an attempt to minimize serialization and downloaded deltas to the extreme. Instead, the tool is supposed to find a good middle ground, that is good on traffic and disk space, but not at the price of convenience or requiring explicit revision control. If you care about updates that are absolutely minimal, there are binary delta systems around that might be an option for you, such as Google’s Courgette.
casync is not a replacement for rsync, or git or zsync or anything like that. They have very different usecases and semantics. For example, rsync permits you to directly synchronize two file trees remotely. casync just cannot do that, and it is unlikely it ever will.
Where next?
casync is supposed to be a generic synchronization tool. Its primary focus for now is delivery of OS images, but I’d like to make it useful for a couple other usecases, too. Specifically:
To make the tool useful for backups, encryption is missing. I have pretty concrete plans how to add that. When implemented, the tool might become an alternative to restic or tarsnap.
Right now, if you want to deploy casync in real-life, you still need to validate the downloaded .caidx or .caibx file yourself, for example with some gpg signature. It is my intention to integrate with gpg in a minimal way so that signing and verifying chunk index files is done automatically.
In the longer run, I’d like to build an automatic synchronizer for $HOME between systems from this. Each $HOME instance would be stored automatically in regular intervals in the cloud using casync, and conflicts would be resolved locally.
casync is written in a shared library style, but it is not yet built as one. Specifically this means that almost all of casync‘s functionality is supposed to be available as C API soon, and applications can process casync files on every level. It is my intention to make this library useful enough so that it will be easy to write a module for GNOME’s gvfs subsystem in order to make remote or local .caidx files directly available to applications (as an alternative to casync mount). In fact the idea is to make this all flexible enough that even the remoting backends can be replaced easily, for example to replace casync‘s default HTTP/HTTPS backends built on CURL with GNOME’s own HTTP implementation, in order to share cookies, certificates, … There’s also an alternative method to integrate with casync in place already: simply invoke casync as a subprocess. casync will inform you about a certain set of state changes using a mechanism compatible with sd_notify(3). In future it will also propagate progress data this way and more.
I intend to add a new seeding back-end that sources chunks from the local network. After downloading the new .caidx file off the Internet casync would then search for the listed chunks on the local network first before retrieving them from the Internet. This should speed things up on all installations that have multiple similar systems deployed in the same network.
Further plans are listed tersely in the TODO file.
FAQ:
Is this a systemd project? — casync is hosted under the github systemd umbrella, and the projects share the same coding style. However, the codebases are distinct and without interdependencies, and casync works fine both on systemd systems and systems without it.
Is casync portable? — At the moment: no. I only run Linux and that’s what I code for. That said, I am open to accepting portability patches (unlike for systemd, which doesn’t really make sense on non-Linux systems), as long as they don’t interfere too much with the way casync works. Specifically this means that I am not too enthusiastic about merging portability patches for OSes lacking the openat(2) family of APIs.
Does casync require reflink-capable file systems to work, such as btrfs? No it doesn’t. The reflink magic in casync is employed when the file system permits it, and it’s good to have it, but it’s not a requirement, and casync will implicitly fall back to copying when it isn’t available. Note that casync supports a number of file system features on a variety of file systems that aren’t available everywhere, for example FAT’s system/hidden file flags or xfs‘s projinherit file flag.
Is casync stable? — I just tagged the first, initial release. While I have been working on it for quite some time and it is quite featureful, this is the first time I advertise it publicly, and it hence received very little testing outside of its own test suite. I am also not fully ready to commit to the stability of the current serialization or chunk index format. I don’t see any breakages coming for it though. casync is pretty light on documentation right now, and does not even have a man page. I also intend to correct that soon.
Are the .caidx/.caibx and .catar file formats open and documented? — casync is Open Source, so if you want to know the precise format, have a look at the sources for now. It’s definitely my intention to add comprehensive docs for both formats however. Don’t forget this is just the initial version right now.
casync is just like $SOMEOTHERTOOL! Why are you reinventing the wheel (again)? — Well, because casync isn’t “just like” some other tool. I am pretty sure I did my homework, and that there is no tool just like casync right now. The tools coming closest are probably rsync, zsync, tarsnap, restic, but they are quite different beasts each.
Why did you invent your own serialization format for file trees? Why don’t you just use tar? That’s a good question, and other systems — most prominently tarsnap — do that. However, as mentioned above tar doesn’t enforce reproducibility. It also doesn’t really do random access: if you want to access some specific file you need to read every single byte stored before it in the tar archive to find it, which is of course very expensive. The serialization casync implements places a focus on reproducibility, random access, and metadata control. Much like traditional tar it can still be generated and extracted in a stream fashion though.
Does casync save/restore SELinux/SMACK file labels? At the moment not. That’s not because I wouldn’t want it to, but simply because I am not a guru of either of these systems, and didn’t want to implement something I do not fully grok nor can test. If you look at the sources you’ll find that there’s already some definitions in place that keep room for them though. I’d be delighted to accept a patch implementing this fully.
What about delivering squashfs images? How well does chunking work on compressed serializations? – That’s a very good point! Usually, if you apply a chunking algorithm to a compressed data stream (let’s say a tar.gz file), then changing a single bit at the front will propagate into the entire remainder of the file, so that minimal changes will explode into major changes. Thankfully this doesn’t apply that strictly to squashfs images, as it provides random access to files and directories and thus breaks up the compression streams in regular intervals to make seeking easy. This fact is beneficial for systems employing chunking, such as casync, as this means single bit changes might affect their vicinity but will not explode in an unbounded fashion. In order to achieve best results when delivering squashfs images through casync the block sizes of squashfs and the chunk sizes of casync should be matched up (using casync‘s --chunk-size= option), as sketched below. How precisely to choose both values is left to research by the user, for now.
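For example, a rough sketch of matching the two up, assuming a 128 KiB squashfs block size and that --chunk-size= accepts a plain byte count (file and directory names are illustrative):
$ mksquashfs rootfs/ myimage.squashfs -b 131072
$ casync make --chunk-size=131072 myimage.caibx myimage.squashfs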
What does the name casync mean? – It’s a synchronizing tool, hence the -sync suffix, following rsync‘s naming. It makes use of the content-addressable concept of git hence the ca- prefix.
Should I use casync? — Well, that’s up to you really. If you are involved with projects that need to deliver IoT, VM, container, application or OS images, then maybe this is a great tool for you — but other options exist, some of which are linked above.
Note that casync is an Open Source project: if it doesn’t do exactly what you need, prepare a patch that adds what you need, and we’ll consider it.
If you are interested in the project and would like to talk about this in person, I’ll be presenting casync soon at Kinvolk’s Linux Technologies Meetup in Berlin, Germany. You are invited. I also intend to talk about it at All Systems Go!, also in Berlin.
In the past months I have been working on a new project: casync. casync takes inspiration from the popular rsync file synchronization tool as well as the probably even more popular git revision control system. It combines the idea of the rsync algorithm with the idea of git-style content-addressable file systems, and creates a new system for efficiently storing and delivering file system images, optimized for high-frequency update cycles over the Internet. Its current focus is on delivering IoT, container, VM, application, portable service or OS images, but I hope to extend it later in a generic fashion to become useful for backups and home directory synchronization as well (but more about that later).
The basic technological building blocks casync is built from are neither new nor particularly innovative (at least not anymore), however the way casync combines them is different from existing tools, and that’s what makes it useful for a variety of use-cases that other tools can’t cover that well.
Why?
I created casync after studying how today’s popular tools store and deliver file system images. To briefly name a few: Docker has a layered tarball approach, OSTree serves the individual files directly via HTTP and maintains packed deltas to speed up updates, while other systems operate on the block layer and place raw squashfs images (or other archival file systems, such as IS09660) for download on HTTP shares (in the better cases combined with zsync data).
Neither of these approaches appeared fully convincing to me when used in high-frequency update cycle systems. In such systems, it is important to optimize towards a couple of goals:
Most importantly, make updates cheap traffic-wise (for this most tools use image deltas of some form)
Put boundaries on disk space usage on servers (keeping deltas between all version combinations clients might want to run updates between, would suggest keeping an exponentially growing amount of deltas on servers)
Put boundaries on disk space usage on clients
Be friendly to Content Delivery Networks (CDNs), i.e. serve neither too many small nor too many overly large files, and only require the most basic form of HTTP. Provide the repository administrator with high-level knobs to tune the average file size delivered.
Simplicity to use for users, repository administrators and developers
I don’t think any of the tools mentioned above are really good on more than a small subset of these points.
Specifically: Docker’s layered tarball approach dumps the “delta” question onto the feet of the image creators: the best way to make your image downloads minimal is basing your work on an existing image clients might already have, and inherit its resources, maintaining full history. Here, revision control (a tool for the developer) is intermingled with update management (a concept for optimizing production delivery). As container histories grow individual deltas are likely to stay small, but on the other hand a brand-new deployment usually requires downloading the full history onto the deployment system, even though there’s no use for it there, and likely requires substantially more disk space and download sizes.
OSTree’s serving of individual files is unfriendly to CDNs (as many small files in file trees cause an explosion of HTTP GET requests). To counter that OSTree supports placing pre-calculated delta images between selected revisions on the delivery servers, which means a certain amount of revision management, that leaks into the clients.
Delivering direct squashfs (or other file system) images is almost beautifully simple, but of course means every update requires a full download of the newest image, which is both bad for disk usage and generated traffic. Enhancing it with zsync makes this a much better option, as it can reduce generated traffic substantially at very little cost of history/meta-data (no explicit deltas between a large number of versions need to be prepared server side). On the other hand server requirements in disk space and functionality (HTTP Range requests) are minus points for the use-case I am interested in.
(Note: all the mentioned systems have great properties, and it’s not my intention to badmouth them. They only point I am trying to make is that for the use case I care about — file system image delivery with high high frequency update-cycles — each system comes with certain drawbacks.)
Security & Reproducibility
Besides the issues pointed out above I wasn’t happy with the security and reproducibility properties of these systems. In today’s world where security breaches involving hacking and breaking into connected systems happen every day, an image delivery system that cannot make strong guarantees regarding data integrity is out of date. Specifically, the tarball format is famously nondeterministic: the very same file tree can result in any number of different valid serializations depending on the tool used, its version and the underlying OS and file system. Some tar implementations attempt to correct that by guaranteeing that each file tree maps to exactly one valid serialization, but such a property is always only specific to the tool used. I strongly believe that any good update system must guarantee on every single link of the chain that there’s only one valid representation of the data to deliver, that can easily be verified.
What casync Is
So much about the background why I created casync. Now, let’s have a look what casync actually is like, and what it does. Here’s the brief technical overview:
Encoding: Let’s take a large linear data stream, split it into variable-sized chunks (the size of each being a function of the chunk’s contents), and store these chunks in individual, compressed files in some directory, each file named after a strong hash value of its contents, so that the hash value may be used to as key for retrieving the full chunk data. Let’s call this directory a “chunk store”. At the same time, generate a “chunk index” file that lists these chunk hash values plus their respective chunk sizes in a simple linear array. The chunking algorithm is supposed to create variable, but similarly sized chunks from the data stream, and do so in a way that the same data results in the same chunks even if placed at varying offsets. For more information see this blog story.
Decoding: Let’s take the chunk index file, and reassemble the large linear data stream by concatenating the uncompressed chunks retrieved from the chunk store, keyed by the listed chunk hash values.
As an extra twist, we introduce a well-defined, reproducible, random-access serialization format for file trees (think: a more modern tar), to permit efficient, stable storage of complete file trees in the system, simply by serializing them and then passing them into the encoding step explained above.
Finally, let’s put all this on the network: for each image you want to deliver, generate a chunk index file and place it on an HTTP server. Do the same with the chunk store, and share it between the various index files you intend to deliver.
Why bother with all of this? Streams with similar contents will result in mostly the same chunk files in the chunk store. This means it is very efficient to store many related versions of a data stream in the same chunk store, thus minimizing disk usage. Moreover, when transferring linear data streams chunks already known on the receiving side can be made use of, thus minimizing network traffic.
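To make that concrete, here’s a hypothetical pair of invocations (the directory and file names are made up) that store two related image trees in one shared chunk store via the --store= option described further below:
$ casync make --store=shared.castr image-v1.caidx /srv/image-v1
$ casync make --store=shared.castr image-v2.caidx /srv/image-v2
Each .caidx file lists the chunks its image needs, but chunks common to both trees end up in shared.castr only once, and a client that already has version 1 only needs to fetch the chunks that differ in version 2.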
Why is this different from rsync or OSTree, or similar tools? Well, one major difference between casync and those tools is that we remove file boundaries before chunking things up. This means that small files are lumped together with their siblings and large files are chopped into pieces, which permits us to recognize similarities in files and directories beyond file boundaries, and makes sure our chunk sizes are pretty evenly distributed, without the file boundaries affecting them.
The “chunking” algorithm is based on the buzhash rolling hash function. SHA256 is used as the strong hash function to generate digests of the chunks. xz is used to compress the individual chunks.
Here’s a diagram that hopefully explains a bit how the encoding process works, despite my crappy drawing skills:
The diagram shows the encoding process from top to bottom. It starts with a block device or a file tree, which is then serialized and chunked up into variable sized blocks. The compressed chunks are then placed in the chunk store, while a chunk index file is written listing the chunk hashes in order. (The original SVG of this graphic may be found here.)
Details
Note that casync operates on two different layers, depending on the use-case of the user:
You may use it on the block layer. In this case the raw block data on disk is taken as-is, read directly from the block device, split into chunks as described above, compressed, stored and delivered.
You may use it on the file system layer. In this case, the file tree serialization format mentioned above comes into play: the file tree is serialized depth-first (much like tar would do it) and then split into chunks, compressed, stored and delivered.
The fact that it may be used on both the block and file system layer opens it up for a variety of different use-cases. In the VM and IoT ecosystems shipping images as block-level serializations is more common, while in the container and application world file-system-level serializations are more typically used.
Chunk index files referring to block-layer serializations carry the .caibx suffix, while chunk index files referring to file system serializations carry the .caidx suffix. Note that you may also use casync as a direct tar replacement, i.e. without the chunking, just generating the plain linear file tree serialization. Such files carry the .catar suffix. Internally, .caibx and .caidx files are identical; the only difference is semantic: .caidx files describe a .catar file, while .caibx files may describe any other blob. Finally, chunk stores are directories carrying the .castr suffix.
Features
Here are a couple of other features casync has:
When downloading a new image you may use casync's --seed= feature: each block device, file, or directory specified is processed using the same chunking logic described above, and is used as a preferred source when putting together the downloaded image locally, avoiding network transfer of it. This of course is useful whenever updating an image: simply specify one or more old versions as seed and only download the chunks that truly changed since then. Note that using seeds requires no history relationship between seed and the new image to download. This has major benefits: you can even use it to speed up downloads of relatively foreign and unrelated data. For example, when downloading a container image built using Ubuntu you can use your Fedora host OS tree in /usr as seed, and casync will automatically use whatever it can from that tree, for example timezone and locale data that tends to be identical between distributions. Example: casync extract http://example.com/myimage.caibx --seed=/dev/sda1 /dev/sda2. This will place the block-layer image described by the indicated URL in the /dev/sda2 partition, using the existing /dev/sda1 data as seeding source. An invocation like this could typically be used by IoT systems with an A/B partition setup. Example 2: casync extract http://example.com/mycontainer-v3.caidx --seed=/srv/container-v1 --seed=/srv/container-v2 /srv/container-v3 is very similar but operates on the file system layer, and uses two old container versions to seed the new version.
When operating on the file system level, the user has fine-grained control on the meta-data included in the serialization. This is relevant since different use-cases tend to require a different set of saved/restored meta-data. For example, when shipping OS images, file access bits/ACLs and ownership matter, while file modification times hurt. When doing personal backups OTOH file ownership matters little but file modification times are important. Moreover different backing file systems support different feature sets, and storing more information than necessary might make it impossible to validate a tree against an image if the meta-data cannot be replayed in full. Due to this, casync provides a set of --with= and --without= parameters that allow fine-grained control of the data stored in the file tree serialization, including the granularity of modification times and more. The precise set of selected meta-data features is also always part of the serialization, so that seeding can work correctly and automatically.
casync tries to be as accurate as possible when storing file system meta-data. This means that besides the usual baseline of file meta-data (file ownership and access bits), and more advanced features (extended attributes, ACLs, file capabilities), a number of more exotic pieces of data are stored as well, including Linux chattr(1) file attributes, as well as FAT file attributes (you may wonder why the latter? — EFI is FAT, and /efi is part of the comprehensive serialization of any host). In the future I intend to extend this further, for example storing btrfs sub-volume information where available. Note that as described above every single type of meta-data may be turned off and on individually, hence if you don’t need FAT file bits (and I figure it’s pretty likely you don’t), then they won’t be stored.
The user creating .caidx or .caibx files may control the desired average chunk length (before compression) freely, using the --chunk-size= parameter. Smaller chunks increase the number of generated files in the chunk store and increase HTTP GET load on the server, but also ensure that sharing between similar images is improved, as identical patterns in the images stored are more likely to be recognized. By default casync will use a 64K average chunk size. Tweaking this can be particularly useful when adapting the system to specific CDNs, or when delivering compressed disk images such as squashfs (see below).
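For example (an illustrative invocation; I’m assuming the value is specified in bytes), to drop the average chunk size from the default 64K to 16K when creating an index:
$ casync make --chunk-size=16384 foobar.caidx /some/directory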
Emphasis is placed on making all invocations reproducible, well-defined and strictly deterministic. As mentioned above this is a requirement to reach the intended security guarantees, but is also useful for many other use-cases. For example, the casync digest command may be used to calculate a hash value identifying a specific directory in all desired detail (use --with= and --without= to pick the desired detail). Moreover the casync mtree command may be used to generate a BSD mtree(5) compatible manifest of a directory tree, .caidx or .catar file.
The file system serialization format is nicely composable. By this I mean that the serialization of a file tree is the concatenation of the serializations of all files and file sub-trees located at the top of the tree, with zero meta-data references from any of these serializations into the others. This property is essential to ensure maximum reuse of chunks when similar trees are serialized.
When extracting file trees or disk image files, casync will automatically create reflinks from any specified seeds if the underlying file system supports it (such as btrfs, ocfs, and future xfs). After all, instead of copying the desired data from the seed, we can just tell the file system to link up the relevant blocks. This works both when extracting .caidx and .caibx files — the latter of course only when the extracted disk image is placed in a regular raw image file on disk, rather than directly on a plain block device, as plain block devices do not know the concept of reflinks.
Optionally, when extracting file trees, casync can create traditional UNIX hard-links for identical files in specified seeds (--hardlink=yes). This works on all UNIX file systems, and can save substantial amounts of disk space. However, this only works for very specific use-cases where disk images are considered read-only after extraction, as any changes made to one tree will propagate to all other trees sharing the same hard-linked files, as that’s the nature of hard-links. In this mode, casync exposes OSTree-like behavior, which is built heavily around read-only hard-link trees.
casync tries to be smart when choosing what to include in file system images. Implicitly, file systems such as procfs and sysfs are excluded from serialization, as they expose API objects, not real files. Moreover, the “nodump” (+d) chattr(1) flag is honored by default, permitting users to mark files to exclude from serialization.
When creating and extracting file trees casync may apply an automatic or explicit UID/GID shift. This is particularly useful when transferring container images for use with Linux user namespacing.
In addition to local operation, casync currently supports HTTP, HTTPS, FTP and ssh natively for downloading chunk index files and chunks (the ssh mode requires installing casync on the remote host, though an sftp mode not requiring that should be easy to add). When creating index files or chunks, only ssh is supported as remote back-end.
When operating on block-layer images, you may expose locally or remotely stored images as local block devices. Example: casync mkdev http://example.com/myimage.caibx exposes the disk image described by the indicated URL as a local block device in /dev, which you may then use the usual block device tools on, such as mount or fdisk (only read-only though). Chunks are downloaded on access with high priority, and at low priority when idle in the background. Note that in this mode, casync also plays a role similar to “dm-verity”, as all blocks are validated against the strong digests in the chunk index file before passing them on to the kernel’s block layer. This feature is implemented through Linux’s NBD kernel facility.
Similarly, when operating on file-system-layer images, you may mount locally or remotely stored images as regular file systems. Example: casync mount http://example.com/mytree.caidx /srv/mytree mounts the file tree image described by the indicated URL as a local directory /srv/mytree. This feature is implemented through Linux’s FUSE kernel facility. Note that special care is taken that the images exposed this way can be packed up again with casync make and are guaranteed to return the bit-by-bit exact same serialization that they were mounted from. No data is lost or changed while passing things through FUSE (OK, strictly speaking this is a lie, we do lose ACLs, but that’s hopefully just a temporary gap to be fixed soon).
In IoT A/B fixed-size partition setups the file systems placed in the two partitions are usually much smaller than the partition size, in order to keep some room for later, larger updates. casync is able to analyze the super-block of a number of common file systems in order to determine the actual size of a file system stored on a block device, so that writing a file system to such a partition and reading it back again will result in reproducible data. Moreover this speeds up the seeding process, as there’s little point in seeding the empty space after the file system within the partition.
Example Command Lines
Here’s how to use casync, explained with a few examples:
$ casync make foobar.caidx /some/directory
This will create a chunk index file foobar.caidx in the local directory, and populate the chunk store directory default.castr located next to it with the chunks of the serialization (you can change the name for the store directory with --store= if you like). This command operates on the file-system level. A similar command operating on the block level:
$ casync make foobar.caibx /dev/sda1
This command creates a chunk index file foobar.caibx in the local directory describing the current contents of the /dev/sda1 block device, and populates default.castr in the same way as above. Note that you may as well read a raw disk image from a file instead of a block device:
$ casync make foobar.caibx myimage.raw
To reconstruct the original file tree from the .caidx file and the chunk store of the first command, use:
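Following the pattern of the other examples, the extraction would look something like this — first from the local chunk store, then from a remote HTTP server (the URL is illustrative):
$ casync extract foobar.caidx /some/other/directory
$ casync extract http://example.com/images/foobar.caidx /some/other/directory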
Either form extracts the specified .caidx onto a local directory. The HTTP variant of course assumes that foobar.caidx was uploaded to the HTTP server in the first place, along with the chunk store. You can use any command you like to accomplish that, for example scp or rsync. Alternatively, you can let casync do this directly when generating the chunk index:
$ casync make ssh.example.com:images/foobar.caidx /some/directory
This will use ssh to connect to the ssh.example.com server, and then places the .caidx file and the chunks on it. Note that this mode of operation is “smart”: this scheme will only upload chunks currently missing on the server side, and not re-transmit what already is available.
Note that you can always configure the precise path or URL of the chunk store via the --store= option. If you do not do that, then the store path is automatically derived from the path or URL: the last component of the path or URL is replaced by default.castr.
Of course, when extracting .caidx or .caibx files from remote sources, using a local seed is advisable:
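Something along these lines, reusing an existing local directory as the seed (the paths are illustrative):
$ casync extract http://example.com/images/foobar.caidx --seed=/some/existing/directory /some/other/directory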
When creating chunk indexes on the file system layer casync will by default store meta-data as accurately as possible. Let’s create a chunk index with reduced meta-data:
$ casync make foobar.caidx --with=sec-time --with=symlinks --with=read-only /some/dir
This command will create a chunk index for a file tree serialization that has three features above the absolute baseline supported: 1s granularity time-stamps, symbolic links and a single read-only bit. In this mode, none of the other meta-data bits are stored, including nanosecond time-stamps, full UNIX permission bits, file ownership or even ACLs or extended attributes.
Now let’s make a .caidx file available locally as a mounted file system, without extracting it:
$ casync mount http://example.com/images/foobar.caidx /mnt/foobar
And similarly, let’s make a .caibx file available locally as a block device:
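Presumably via the mkdev verb shown earlier (the URL is illustrative):
$ casync mkdev http://example.com/images/foobar.caibx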
This will create a block device in /dev and print the used device node path to STDOUT.
As mentioned, casync is big about reproducibility. Let’s make use of that to calculate a digest identifying a very specific version of a file tree:
$ casync digest .
This digest will include all meta-data bits casync and the underlying file system know about. Usually, to make this useful you want to configure exactly what meta-data to include:
$ casync digest --with=unix .
This makes use of the --with=unix shortcut for selecting meta-data fields. Specifying --with=unix selects all meta-data that traditional UNIX file systems support. It is a shortcut for writing out: --with=16bit-uids --with=permissions --with=sec-time --with=symlinks --with=device-nodes --with=fifos --with=sockets.
Note that when calculating digests or creating chunk indexes you may also use the negative --without= option to remove specific features but start from the most precise:
$ casync digest --without=flag-immutable
This generates a digest with the most accurate meta-data, but leaves one feature out: chattr(1)‘s immutable (+i) file flag.
To list the contents of a .caidx file use a command like the following:
$ casync list http://example.com/images/foobar.caidx
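The corresponding manifest command — presumably the casync mtree verb mentioned earlier — would be:
$ casync mtree http://example.com/images/foobar.caidx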
The former command will generate a brief list of files and directories, not too different from tar t or ls -al in its output. The latter command will generate a BSD mtree(5) compatible manifest. Note that casync actually stores substantially more file meta-data than mtree files can express, though.
What casync isn’t
casync is not an attempt to minimize serialization and downloaded deltas to the extreme. Instead, the tool is supposed to find a good middle ground, that is good on traffic and disk space, but not at the price of convenience or requiring explicit revision control. If you care about updates that are absolutely minimal, there are binary delta systems around that might be an option for you, such as Google’s Courgette.
casync is not a replacement for rsync, or git or zsync or anything like that. They have very different use-cases and semantics. For example, rsync permits you to directly synchronize two file trees remotely. casync just cannot do that, and it is unlikely it ever will.
Where next?
casync is supposed to be a generic synchronization tool. Its primary focus for now is delivery of OS images, but I’d like to make it useful for a couple other use-cases, too. Specifically:
To make the tool useful for backups, encryption is missing. I have pretty concrete plans for how to add that. When implemented, the tool might become an alternative to restic, BorgBackup or tarsnap.
Right now, if you want to deploy casync in real-life, you still need to validate the downloaded .caidx or .caibx file yourself, for example with some gpg signature. It is my intention to integrate with gpg in a minimal way so that signing and verifying chunk index files is done automatically.
In the longer run, I’d like to build an automatic synchronizer for $HOME between systems from this. Each $HOME instance would be stored automatically in regular intervals in the cloud using casync, and conflicts would be resolved locally.
casync is written in a shared library style, but it is not yet built as one. Specifically this means that almost all of casync‘s functionality is supposed to be available as C API soon, and applications can process casync files on every level. It is my intention to make this library useful enough so that it will be easy to write a module for GNOME’s gvfs subsystem in order to make remote or local .caidx files directly available to applications (as an alternative to casync mount). In fact the idea is to make this all flexible enough that even the remoting back-ends can be replaced easily, for example to replace casync‘s default HTTP/HTTPS back-ends built on CURL with GNOME’s own HTTP implementation, in order to share cookies, certificates, … There’s also an alternative method to integrate with casync in place already: simply invoke casync as a sub-process. casync will inform you about a certain set of state changes using a mechanism compatible with sd_notify(3). In future it will also propagate progress data this way and more.
I intend to add a new seeding back-end that sources chunks from the local network. After downloading the new .caidx file off the Internet casync would then search for the listed chunks on the local network first before retrieving them from the Internet. This should speed things up on all installations that have multiple similar systems deployed in the same network.
Further plans are listed tersely in the TODO file.
FAQ:
Is this a systemd project? — casync is hosted under the github systemd umbrella, and the projects share the same coding style. However, the code-bases are distinct and without interdependencies, and casync works fine both on systemd systems and systems without it.
Is casync portable? — At the moment: no. I only run Linux and that’s what I code for. That said, I am open to accepting portability patches (unlike for systemd, which doesn’t really make sense on non-Linux systems), as long as they don’t interfere too much with the way casync works. Specifically this means that I am not too enthusiastic about merging portability patches for OSes lacking the openat(2) family of APIs.
Does casync require reflink-capable file systems to work, such as btrfs? — No it doesn’t. The reflink magic in casync is employed when the file system permits it, and it’s good to have it, but it’s not a requirement, and casync will implicitly fall back to copying when it isn’t available. Note that casync supports a number of file system features on a variety of file systems that aren’t available everywhere, for example FAT’s system/hidden file flags or xfs‘s projinherit file flag.
Is casync stable? — I just tagged the first, initial release. While I have been working on it for quite some time and it is quite featureful, this is the first time I advertise it publicly, and it hence received very little testing outside of its own test suite. I am also not fully ready to commit to the stability of the current serialization or chunk index format. I don’t see any breakages coming for it though. casync is pretty light on documentation right now, and does not even have a man page. I intend to correct that soon.
Are the .caidx/.caibx and .catar file formats open and documented? — casync is Open Source, so if you want to know the precise format, have a look at the sources for now. It’s definitely my intention to add comprehensive docs for both formats however. Don’t forget this is just the initial version right now.
casync is just like $SOMEOTHERTOOL! Why are you reinventing the wheel (again)? — Well, because casync isn’t “just like” some other tool. I am pretty sure I did my homework, and that there is no tool just like casync right now. The tools coming closest are probably rsync, zsync, tarsnap, restic, but they are quite different beasts each.
Why did you invent your own serialization format for file trees? Why don’t you just use tar? — That’s a good question, and other systems — most prominently tarsnap — do that. However, as mentioned above tar doesn’t enforce reproducibility. It also doesn’t really do random access: if you want to access some specific file you need to read every single byte stored before it in the tar archive to find it, which is of course very expensive. The serialization casync implements places a focus on reproducibility, random access, and meta-data control. Much like traditional tar it can still be generated and extracted in a stream fashion though.
Does casync save/restore SELinux/SMACK file labels? — At the moment not. That’s not because I wouldn’t want it to, but simply because I am not a guru of either of these systems, and didn’t want to implement something I do not fully grok nor can test. If you look at the sources you’ll find that there’s already some definitions in place that keep room for them though. I’d be delighted to accept a patch implementing this fully.
What about delivering squashfs images? How well does chunking work on compressed serializations? – That’s a very good point! Usually, if you apply a chunking algorithm to a compressed data stream (let’s say a tar.gz file), then changing a single bit at the front will propagate into the entire remainder of the file, so that minimal changes will explode into major changes. Thankfully this doesn’t apply that strictly to squashfs images, as it provides random access to files and directories and thus breaks up the compression streams in regular intervals to make seeking easy. This fact is beneficial for systems employing chunking, such as casync, as this means single bit changes might affect their vicinity but will not explode in an unbounded fashion. In order to achieve best results when delivering squashfs images through casync the block sizes of squashfs and the chunk sizes of casync should be matched up (using casync’s --chunk-size= option). How precisely to choose both values is left a research subject for the user, for now.
What does the name casync mean? – It’s a synchronizing tool, hence the -sync suffix, following rsync‘s naming. It makes use of the content-addressable concept of git hence the ca- prefix.
Where can I get this stuff? Is it already packaged? – Check out the sources on GitHub. I just tagged the first version. Martin Pitt has packaged casync for Ubuntu. There is also an ArchLinux package. Zbigniew Jędrzejewski-Szmek has prepared a Fedora RPM that hopefully will soon be included in the distribution.
Should you care? Is this a tool for you?
Well, that’s up to you really. If you are involved with projects that need to deliver IoT, VM, container, application or OS images, then maybe this is a great tool for you — but other options exist, some of which are linked above.
Note that casync is an Open Source project: if it doesn’t do exactly what you need, prepare a patch that adds what you need, and we’ll consider it.
If you are interested in the project and would like to talk about this in person, I’ll be presenting casync soon at Kinvolk’s Linux Technologies Meetup in Berlin, Germany. You are invited. I also intend to talk about it at All Systems Go!, also in Berlin.
Nowadays, it’s common for a web server to be fronted by a global content delivery service, like Amazon CloudFront. This type of front end accelerates delivery of websites, APIs, media content, and other web assets to provide a better experience to users across the globe.
The insights gained by analysis of Amazon CloudFront access logs help improve website availability through bot detection and mitigation, optimizing web content based on the devices and browsers used to view your webpages, reducing perceived latency by caching popular objects closer to their viewers, and so on. This results in a significant improvement in the overall perceived experience for the user.
This blog post provides a way to build a serverless architecture to generate some of these insights. To do so, we analyze Amazon CloudFront access logs both at rest and in transit through the stream. This serverless architecture uses Amazon Athena to analyze large volumes of CloudFront access logs (on the scale of terabytes per day), and Amazon Kinesis Analytics for streaming analysis.
The analytic queries in this blog post focus on three common use cases:
Detection of common bots using the user agent string
Calculation of current bandwidth usage per Amazon CloudFront distribution per edge location
Determination of the current top 50 viewers
However, you can easily extend the architecture described here to power dashboards for monitoring and reporting, and to trigger alarms based on deeper insights gained by processing and analyzing the logs. Some examples are dashboards for cache performance, usage and viewer patterns, and so on.
Following is a diagram of this architecture.
Prerequisites
Before you set up this architecture, install the AWS Command Line Interface (AWS CLI) tool on your local machine, if you don’t have it already.
Setup summary
The following steps are involved in setting up the serverless architecture on the AWS platform:
Create an Amazon S3 bucket for your Amazon CloudFront access logs to be delivered to and stored in.
Create a second Amazon S3 bucket to receive processed logs and store the partitioned data for interactive analysis.
Create an Amazon Kinesis Firehose delivery stream to batch, compress, and deliver the preprocessed logs for analysis.
Create an AWS Lambda function to preprocess the logs for analysis.
Configure Amazon S3 event notification on the CloudFront access logs bucket, which contains the raw logs, to trigger the Lambda preprocessing function.
Create an Amazon DynamoDB table to look up partition details, such as partition specification and partition location.
Create an Amazon Athena table for interactive analysis.
Create a second AWS Lambda function to add new partitions to the Athena table based on the log delivered to the processed logs bucket.
Configure Amazon S3 event notification on the processed logs bucket to trigger the Lambda partitioning function.
Configure Amazon Kinesis Analytics application for analysis of the logs directly from the stream.
ETL and preprocessing
In this section, we parse the CloudFront access logs as they are delivered, which occurs multiple times in an hour. We filter out commented records and use the user agent string to decipher the browser name, the name of the operating system, and whether the request has been made by a bot. For more details on how to decipher the preceding information based on the user agent string, see user-agents 1.1.0 in the Python documentation.
We use the Lambda preprocessing function to perform these tasks on individual rows of the access log. On successful completion, the rows are pushed to an Amazon Kinesis Firehose delivery stream to be persistently stored in an Amazon S3 bucket, the processed logs bucket.
To create a Firehose delivery stream with a new or existing S3 bucket as the destination, follow the steps described in Create a Firehose Delivery Stream to Amazon S3 in the Amazon Kinesis Firehose documentation. Keep most of the default settings, but select an AWS Identity and Access Management (IAM) role that has write access to your S3 bucket and specify GZIP compression. Name the delivery stream CloudFrontLogsToS3.
Another prerequisite for this setup is to create an IAM role that provides the necessary permissions for our AWS Lambda function to get the data from S3, process it, and deliver it to the CloudFrontLogsToS3 delivery stream.
Let’s use the AWS CLI to create the IAM role using the following steps:
Create the IAM policy (lambda-exec-policy) for the Lambda execution role to use.
Create the Lambda execution role (lambda-cflogs-exec-role) and assign the service to use this role.
Attach the policy created in step 1 to the Lambda execution role.
First, download the policy document for lambda-exec-policy to your local machine.
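For steps 1 and 2, the commands mirror the partitioning-function commands shown later in this post; they would presumably look something like the following (the lambda-exec-policy.json file name is a placeholder for the downloaded policy document):
aws iam create-policy --policy-name lambda-exec-policy --policy-document file://<path>/lambda-exec-policy.json
aws iam create-role --role-name lambda-cflogs-exec-role --assume-role-policy-document file://<path>/assume-lambda-policy.json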
To attach the policy (lambda-exec-policy) created to the AWS Lambda execution role (lambda-cflogs-exec-role), type the following command.
aws iam attach-role-policy --role-name lambda-cflogs-exec-role --policy-arn arn:aws:iam::<your-account-id>:policy/lambda-exec-policy
Now that we have created the CloudFrontLogsToS3 Firehose delivery stream and the lambda-cflogs-exec-role IAM role for Lambda, the next step is to create a Lambda preprocessing function.
This Lambda preprocessing function parses the CloudFront access logs delivered into the S3 bucket and performs a few transformation and mapping operations on the data. The Lambda function adds descriptive information, such as the browser and the operating system that were used to make this request based on the user agent string found in the logs. The Lambda function also adds information about the web distribution to support scenarios where CloudFront access logs are delivered to a centralized S3 bucket from multiple distributions. With the solution in this blog post, you can get insights across distributions and their edge locations.
Use the Lambda Management Console to create a new Lambda function with a Python 2.7 runtime and the s3-get-object-python blueprint. Open the console, and on the Configure triggers page, choose the name of the S3 bucket where the CloudFront access logs are delivered. Choose Put for Event type. For Prefix, type the name of the prefix, if any, for the folder where CloudFront access logs are delivered, for example cloudfront-logs/. To invoke Lambda to retrieve the logs from the S3 bucket as they are delivered, select Enable trigger.
Choose Next and provide a function name to identify this Lambda preprocessing function.
For Code entry type, choose Upload a file from Amazon S3. For S3 link URL, type the S3 URL of the pre-data.zip package, for example https://<bucket-name>.s3.amazonaws.com/preprocessing-lambda/pre-data.zip. In the same section, also create an environment variable with the key KINESIS_FIREHOSE_STREAM and the name of the Firehose delivery stream, CloudFrontLogsToS3, as the value.
Choose lambda-cflogs-exec-role as the IAM role for the Lambda function, and type prep-data.lambda_handler for the value for Handler.
Choose Next, and then choose Create Lambda.
Table creation in Amazon Athena
In this step, we will build the Athena table. Use the Athena console in the same region and create the table using the query editor.
A popular website with millions of requests each day routed using Amazon CloudFront can generate a large volume of logs, on the order of a few terabytes a day. We strongly recommend that you partition your data to effectively restrict the amount of data scanned by each query. Partitioning significantly improves query performance and substantially reduces cost. The Lambda partitioning function adds the partition information to the Athena table for the data delivered to the preprocessed logs bucket.
Before delivering the preprocessed Amazon CloudFront logs file into the preprocessed logs bucket, Amazon Kinesis Firehose adds a UTC time prefix in the format YYYY/MM/DD/HH. This approach supports multilevel partitioning of the data by year, month, date, and hour. You can invoke the Lambda partitioning function every time a new processed Amazon CloudFront log is delivered to the preprocessed logs bucket. To do so, configure the Lambda partitioning function to be triggered by an S3 Put event.
For a website with millions of requests, a large number of preprocessed logs can be delivered multiple times in an hour—for example, at an interval of one per second. To avoid querying the Athena table for partition information every time a preprocessed log file is delivered, you can create an Amazon DynamoDB table for fast lookup.
Based on the year, month, date, and hour in the prefix of the delivered log, the Lambda partitioning function checks if the partition specification exists in the Amazon DynamoDB table. If it doesn’t, it’s added to the table using an atomic operation, and then the Athena table is updated.
Create the Amazon DynamoDB table with the AWS CLI.
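An invocation along the following lines would work; the table name matches the DDB_TABLE_NAME value used later in this post, and the throughput values are illustrative starting points:
aws dynamodb create-table --table-name athenapartitiondetails --attribute-definitions AttributeName=PartitionSpec,AttributeType=S --key-schema AttributeName=PartitionSpec,KeyType=HASH --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5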
PartitionSpec is the hash key and is a representation of the partition signature—for example, year=”2017”; month=”05”; day=”15”; hour=”10”.
Depending on the rate at which the processed log files are delivered to the processed log bucket, you might have to increase the ReadCapacityUnits and WriteCapacityUnits values, if these are throttled.
The other attributes besides PartitionSpec are the following:
PartitionPath – The S3 path associated with the partition.
PartitionType – The type of partition used (Hour, Month, Date, Year, or ALL). In this case, ALL is used.
The next step is to create the IAM role to provide permissions for the Lambda partitioning function. The function requires permissions to do the following:
Look up and write partition information to DynamoDB.
Alter the Athena table with new partition information.
Perform Amazon CloudWatch logs operations.
Perform Amazon S3 operations.
Download the policy document (lambda-partition-function-execution-policy.json) to your local machine.
Let’s use the AWS CLI to create the IAM role using the following three steps:
Create the IAM policy (lambda-partition-exec-policy) used by the Lambda execution role.
Create the Lambda execution role (lambda-partition-execution-role) and assign the service to use this role.
Attach the policy created in step 1 to the Lambda execution role.
To create the IAM policy used by the Lambda execution role, type the following command.
aws iam create-policy --policy-name lambda-partition-exec-policy --policy-document file://<path>/lambda-partition-function-execution-policy.json
To create the Lambda execution role and assign the service to use this role, type the following command.
aws iam create-role --role-name lambda-partition-execution-role --assume-role-policy-document file://<path>/assume-lambda-policy.json
To attach the policy (lambda-partition-exec-policy) created to the AWS Lambda execution role (lambda-partition-execution-role), type the following command.
aws iam attach-role-policy --role-name lambda-partition-execution-role --policy-arn arn:aws:iam::<your-account-id>:policy/lambda-partition-exec-policy
Following is the lambda-partition-function-execution-policy.json file, which is the IAM policy used by the Lambda execution role.
From the AWS Management Console, create a new Lambda function with Java8 as the runtime. Select the Blank Function blueprint.
On the Configure triggers page, choose the name of the S3 bucket where the preprocessed logs are delivered. Choose Put for the Event Type. For Prefix, type the name of the prefix folder, if any, where preprocessed logs are delivered by Firehose—for example, out/. For Suffix, type the suffix of the compression format in which the Firehose stream (CloudFrontLogsToS3) delivers the preprocessed logs—for example, gz. To invoke Lambda to retrieve the logs from the S3 bucket as they are delivered, select Enable Trigger.
Choose Next and provide a function name to identify this Lambda partitioning function.
Choose Java8 for Runtime for the AWS Lambda function. Choose Upload a .ZIP or .JAR file for the Code entry type, and choose Upload to upload the downloaded aws-lambda-athena-1.0.0.jar file.
Next, create the following environment variables for the Lambda function:
TABLE_NAME – The name of the Athena table (for example, cf_logs).
PARTITION_TYPE – The partition granularity to create for the logs delivered to the subfolders of the S3 bucket: Year, Month, Date, or Hour. Set this to ALL to partition by year, month, date, and hour.
DDB_TABLE_NAME – The name of the DynamoDB table holding partition information (for example, athenapartitiondetails).
ATHENA_REGION – The current AWS Region for the Athena table to construct the JDBC connection string.
S3_STAGING_DIR – The Amazon S3 location where your query output is written. The JDBC driver asks Athena to read the results and provide rows of data back to the user (for example, s3://<bucketname>/<folder>/).
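If you prefer scripting this step, the same variables can also be set with the AWS CLI; the function name, region, and bucket values below are placeholders:
aws lambda update-function-configuration --function-name <your-partitioning-function> --environment "Variables={TABLE_NAME=cf_logs,PARTITION_TYPE=ALL,DDB_TABLE_NAME=athenapartitiondetails,ATHENA_REGION=<your-region>,S3_STAGING_DIR=s3://<bucketname>/<folder>/}"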
To configure the function handler and IAM role, copy and paste the following handler name into Handler: com.amazonaws.services.lambda.CreateAthenaPartitionsBasedOnS3EventWithDDB::handleRequest. Choose the existing IAM role, lambda-partition-execution-role.
Choose Next and then Create Lambda.
Interactive analysis using Amazon Athena
In this section, we analyze the historical data that’s been collected since we added the partitions to the Amazon Athena table for data delivered to the preprocessing logs bucket.
Scenario 1 is robot traffic by edge location.
SELECT COUNT(*) AS ct, requestip, location FROM cf_logs
WHERE isbot='True'
GROUP BY requestip, location
ORDER BY ct DESC;
Scenario 2 is total bytes transferred per distribution for each edge location for your website.
SELECT distribution, location, SUM(bytes) as totalBytes
FROM cf_logs
GROUP BY location, distribution;
Scenario 3 is the top 50 viewers of your website.
SELECT requestip, COUNT(*) AS ct FROM cf_logs
GROUP BY requestip
ORDER BY ct DESC
LIMIT 50;
Streaming analysis using Amazon Kinesis Analytics
In this section, you deploy a stream processing application using Amazon Kinesis Analytics to analyze the preprocessed Amazon CloudFront log streams. This application analyzes the logs directly from the Amazon Kinesis Firehose stream as it delivers them to the processed logs bucket. The stream queries in this section are focused on gaining the following insights:
The IP addresses of the bots currently sending requests to your website, identified by the Amazon CloudFront edge location serving them. The query also includes the total bytes transferred as part of the response.
The total bytes served per distribution per edge location for your website.
The top 50 viewers of your website.
Download the firehose-access-policy.json file to your local machine.
Before we create the Amazon Kinesis Analytics application, we need to create the IAM role that gives the analytics application permission to access the Amazon Kinesis Firehose stream.
Let’s use the AWS CLI to create the IAM role using the following three steps:
Create the IAM policy (firehose-access-policy) for the Analytics execution role to use.
Create the Analytics execution role (ka-execution-role) and assign the service to use this role.
Attach the policy created in step 1 to the Analytics execution role.
Following is the firehose-access-policy.json file, which is the IAM policy used by Kinesis Analytics to read the Firehose delivery stream.
To create the IAM policy used by the Analytics access role, type the following command.
aws iam create-policy --policy-name firehose-access-policy --policy-document file://<path>/firehose-access-policy.json
To create the Analytics execution role and assign the service to use this role, type the following command (the trust policy document name is a placeholder for a policy that allows kinesisanalytics.amazonaws.com to assume the role).
aws iam create-role --role-name ka-execution-role --assume-role-policy-document file://<path>/<analytics-assume-role-policy>.json
To attach the policy (firehose-access-policy) created to the Analytics execution role (ka-execution-role), type the following command.
aws iam attach-role-policy --role-name ka-execution-role --policy-arn arn:aws:iam::<your-account-id>:policy/firehose-access-policy
To deploy the Analytics application, first download the configuration file and then modify ResourceARN and RoleARN for the Amazon Kinesis Firehose input configuration.
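Once the configuration file has been edited, the application itself can be created from the AWS CLI; the file name below is a placeholder for the downloaded and edited configuration:
aws kinesisanalytics create-application --cli-input-json file://<path>/kinesis-analytics-application.json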
Scenario 1 is a query for detecting bots that are sending requests to your website.
-- Create output stream, which can be used to send to a destination
CREATE OR REPLACE STREAM "BOT_DETECTION" (requesttime TIME, destribution VARCHAR(16), requestip VARCHAR(64), edgelocation VARCHAR(64), totalBytes BIGINT);
-- Create pump to insert into output
CREATE OR REPLACE PUMP "BOT_DETECTION_PUMP" AS INSERT INTO "BOT_DETECTION"
--
SELECT STREAM
STEP("CF_LOG_STREAM_001"."request_time" BY INTERVAL '1' SECOND) as requesttime,
"distribution_name" as distribution,
"request_ip" as requestip,
"edge_location" as edgelocation,
SUM("bytes") as totalBytes
FROM "CF_LOG_STREAM_001"
WHERE "is_bot" = true
GROUP BY "request_ip", "edge_location", "distribution_name",
STEP("CF_LOG_STREAM_001"."request_time" BY INTERVAL '1' SECOND),
STEP("CF_LOG_STREAM_001".ROWTIME BY INTERVAL '1' SECOND);
Scenario 2 is a query for total bytes transferred per distribution for each edge location for your website.
-- Create output stream, which can be used to send to a destination
CREATE OR REPLACE STREAM "BYTES_TRANSFFERED" (requesttime TIME, destribution VARCHAR(16), edgelocation VARCHAR(64), totalBytes BIGINT);
-- Create pump to insert into output
CREATE OR REPLACE PUMP "BYTES_TRANSFFERED_PUMP" AS INSERT INTO "BYTES_TRANSFFERED"
-- Bytes Transffered per second per web destribution by edge location
SELECT STREAM
STEP("CF_LOG_STREAM_001"."request_time" BY INTERVAL '1' SECOND) as requesttime,
"distribution_name" as distribution,
"edge_location" as edgelocation,
SUM("bytes") as totalBytes
FROM "CF_LOG_STREAM_001"
GROUP BY "distribution_name", "edge_location", "request_date",
STEP("CF_LOG_STREAM_001"."request_time" BY INTERVAL '1' SECOND),
STEP("CF_LOG_STREAM_001".ROWTIME BY INTERVAL '1' SECOND);
Scenario 3 is a query for the top 50 viewers for your website.
-- Create output stream, which can be used to send to a destination
CREATE OR REPLACE STREAM "TOP_TALKERS" (requestip VARCHAR(64), requestcount DOUBLE);
-- Create pump to insert into output
CREATE OR REPLACE PUMP "TOP_TALKERS_PUMP" AS INSERT INTO "TOP_TALKERS"
-- Top 50 talkers
SELECT STREAM ITEM as requestip, ITEM_COUNT as requestcount FROM TABLE(TOP_K_ITEMS_TUMBLING(
CURSOR(SELECT STREAM * FROM "CF_LOG_STREAM_001"),
'request_ip', -- name of column in single quotes
50, -- number of top items
60 -- tumbling window size in seconds
)
);
Conclusion
Following the steps in this blog post, you just built an end-to-end serverless architecture to analyze Amazon CloudFront access logs. You analyzed these logs in both interactive and streaming mode, using Amazon Athena and Amazon Kinesis Analytics, respectively.
By creating a partition in Athena for the logs delivered to a centralized bucket, this architecture is optimized for performance and cost when analyzing large volumes of logs for popular websites that receive millions of requests. Here, we have focused on just three common use cases for analysis, sharing the analytic queries as part of the post. However, you can extend this architecture to gain deeper insights and generate usage reports to reduce latency and increase availability. This way, you can provide a better experience on your websites fronted with Amazon CloudFront.
In this blog post, we focused on building serverless architecture to analyze Amazon CloudFront access logs. Our plan is to extend the solution to provide rich visualization as part of our next blog post.
About the Authors
Rajeev Srinivasan is a Senior Solution Architect for AWS. He works very closely with our customers to provide big data and NoSQL solutions leveraging the AWS platform, and he enjoys coding. In his spare time he enjoys riding his motorcycle and reading books.
Sai Sriparasa is a consultant with AWS Professional Services. He works with our customers to provide strategic and tactical big data solutions with an emphasis on automation, operations & security on AWS. In his spare time, he follows sports and current affairs.
Many organizations begin their cloud journey to AWS by moving a few applications to demonstrate the power and flexibility of AWS. This initial application architecture includes building security groups that control the network ports, protocols, and IP addresses that govern access and traffic to their AWS Virtual Private Cloud (VPC). When the architecture process is complete and an application is fully functional, some organizations forget to revisit their security groups to optimize rules and help ensure the appropriate level of governance and compliance. Not optimizing security groups can create less-than-optimal security, with ports open that may not be needed or source IP ranges set that are broader than required.
Removing unused rules or limiting source IP addresses requires either an in-depth knowledge of an application’s active ports on Amazon EC2 instances or analysis of active network traffic. In this blog post, I discuss a method to:
Use VPC Flow Logs to capture information about the IP traffic in an Amazon VPC.
Enrich the VPC Flow Logs dataset with security group IDs by using Firehose and Lambda.
Demonstrate how to visualize and analyze network traffic from VPC Flow Logs by using Amazon Elasticsearch Service (Amazon ES).
Using this approach can help you remediate security group rules to necessary source IPs, ports, and nested security groups, helping to improve the security of your AWS resources while minimizing the potential risk to production environments.
As illustrated in the preceding diagram, this is how the data flows in this model:
VPC Flow Logs data is published to CloudWatch Logs, which streams the records to the Lambda ingestor function; the ingestor function passes the data to Firehose.
Firehose then passes the data to the Lambda decorator function.
The Lambda decorator function performs a number of lookups for each record and returns the data to Firehose with additional fields.
Firehose then posts the enhanced dataset to the Amazon ES endpoint and any errors to Amazon S3.
The solution
Step 1: Set up your Amazon ES cluster and VPC Flow Logs
Create an Amazon ES cluster
The first step in this solution is to create an Amazon ES cluster. Do this first because it takes some time for the cluster to become available. If you are new to Amazon ES, you can learn more about it in the Amazon ES documentation.
Type es-flowlogs for the Elasticsearch domain name.
Set Version to 1 in the drop-down list. Choose Next.
Set Instance count to 2 and select the Enable zone awareness check box. (This ensures cluster stability in the event of an Availability Zone outage.) Accept the defaults for the rest of the page.
[Optional] If you use this domain for production purposes, I recommend using dedicated master nodes. Select the Enable dedicated master check box and select medium.elasticsearch from the Instance type drop-down list. Leave the Instance count at 3, which is the default.
Choose Next.
From the Set the domain access policy to drop-down list on the next page, select Allow access to the domain from specific IP(s). In the dialog box, type or paste the comma-separated list of valid IPv4 addresses or Classless Inter-Domain Routing (CIDR) blocks you would like to be able to access the Amazon ES domain.
It will take a few minutes for the cluster to be available. In the meantime, you can begin enabling VPC Flow Logs.
Enable VPC Flow Logs
VPC Flow Logs is a feature that lets you capture information about the IP traffic going to and from network interfaces in your VPC. Flow log data is stored using Amazon CloudWatch Logs. For more information about VPC Flow Logs, see VPC Flow Logs and CloudWatch Logs.
Choose Your VPCs in the navigation pane, and select the VPC you would like to analyze. (You can also enable VPC Flow Logs on only a subnet if you do not want to enable it on the entire VPC.)
Choose the Flow Logs tab in the bottom pane, and then choose Create Flow Log.
In the text beneath the Role box, choose Set Up Permissions (this will open an IAM management page).
Choose Allow on the IAM management page. Return to the VPC Flow Logs setup page.
Choose All from the Filter drop-down list.
Choose flowlogsRole from the Role drop-down list (you created this role in steps 3 and 4 in this procedure).
Choose Flowlogs from the Destination Log Group drop-down list.
Choose Create Flow Log.
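If you prefer the AWS CLI, the equivalent call would look something like this (the VPC ID and account ID are placeholders):
aws ec2 create-flow-logs --resource-type VPC --resource-ids vpc-xxxxxxxx --traffic-type ALL --log-group-name Flowlogs --deliver-logs-permission-arn arn:aws:iam::<your-account-id>:role/flowlogsRole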
Step 2: Set up AWS Lambda to enrich the VPC Flow Logs dataset with security group IDs
If you completed Step 1, VPC Flow Logs data is now streaming to CloudWatch Logs. Next, you will deploy two Lambda functions. The first, the ingestor function, moves the data into Firehose, and the second, the decorator function, adds three new fields to the VPC Flow Logs dataset and returns records to Firehose for delivery to Amazon ES.
The new fields added by the decorator function are:
Direction – By comparing the primary IP address of the elastic network interface (ENI) with the destination IP address, you can set the direction of the IP connection.
Security group IDs – Each ENI can be associated with as many as five security groups. The security group IDs are added as an array in the record.
Source – This includes a number of fields that result from looking up srcaddr in a free geographical lookup service.
The Source includes:
source-country-code
source-country-name
source-region-code
source-region-name
source-city
source-location, latitude, and longitude.
Follow the instructions in this GitHub repository to deploy the two Lambda functions and the associated permissions that are required.
Step 3: Set up Firehose
Firehose is a fully managed service that allows you to transform flow log data and stream it into Amazon ES. The service scales automatically with load, and you only pay for the data transmitted through the service.
Choose Go to Firehose and then choose Create Delivery Stream.
Step 3.1: Define the destination
Choose Amazon Elasticsearch Service from the Destination drop-down list.
For Delivery stream name, type VPCFlowLogsToElasticSearch (the name must match the default environment variable in the ingestion Lambda function).
Choose es-flowlogs from the Elasticsearch domain drop-down list. (The Amazon ES cluster configuration state needs to be Active for es-flowlogs to be available in the drop-down list.)
For Index, type cwl.
Choose OneDay from the Index rotation drop-down list.
For Type, type log.
For Backup mode, select Failed Documents Only.
For S3 bucket, select New S3 bucket in the drop-down list and type a bucket name of your choice. Choose Create bucket.
Choose Next.
Step 3.2: Configure Lambda
Choose Enable for Data transformation.
Choose vpc-flow-log-appender-dev-FlowLogDecoratorFunction-xxxxx from the Lambda function drop-down list (make sure you select the Decorator function).
Choose Create/Update existing IAM role, Firehose delivery IAM role from the IAM role drop-down list.
Choose Allow. This takes you back to the Firehose Configuration.
Choose Next and then choose Create Delivery Stream.
Step 4: Stream data to Firehose
The next step is to enable the data to stream from CloudWatch Logs to Firehose. You will use the Lambda ingestion function you deployed earlier: vpc-flow-log-appender-dev-FlowLogIngestionFunction-xxxxxxx.
Choose Logs in the navigation pane, and select the check box next to Flowlogs under Log Groups.
From the Actions menu, choose Stream to AWS Lambda. Choose vpc-flow-log-appender-dev-FlowLogIngestionFunction-xxxxxxx (select the Ingestion function). Choose Next.
Choose Amazon VPC Flow Logs from the Log Format drop-down list. Choose Next.
Choose Start Streaming.
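The same subscription can also be created from the AWS CLI; note that CloudWatch Logs must first be granted permission to invoke the function (the filter name, region, and ARNs below are placeholders):
aws lambda add-permission --function-name vpc-flow-log-appender-dev-FlowLogIngestionFunction-xxxxxxx --statement-id cwlogs-invoke --action lambda:InvokeFunction --principal logs.<your-region>.amazonaws.com
aws logs put-subscription-filter --log-group-name Flowlogs --filter-name FlowLogsToLambda --filter-pattern "" --destination-arn arn:aws:lambda:<your-region>:<your-account-id>:function:vpc-flow-log-appender-dev-FlowLogIngestionFunction-xxxxxxx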
VPC Flow Logs will now be forwarded to Firehose, capturing information about the IP traffic going to and from network interfaces in your VPC. Firehose appends additional data fields and forwards the enriched data to your Amazon ES cluster.
Data is now flowing to your Amazon ES cluster, but be patient because it can take up to 30 minutes for the data to begin appearing in your Amazon ES cluster.
Step 5: Verify that the flow log data is streaming through Firehose to the Amazon ES cluster
You should see VPC Flow Logs with ENI IDs under Log Streams (see the following screenshot) and Stored Bytes greater than zero in the CloudWatch log group.
Do you have logs from the Lambda ingestion function in the CloudWatch log group? As shown in the following screenshot, you should see START, END and REPORT records. These show that the ingestion function is running and streaming data to Firehose.
Do you have logs from the Lambda decorator function in the CloudWatch log group? You should see START, END, and REPORT records as well as entries similar to: “Processing completed. Successful records XXX, Failed records 0.”
Do you have cwl-* indexes in the Amazon ES dashboard, as shown in the following screenshot? If you do, you are successfully streaming through Firehose and populating the Amazon ES cluster, and you are ready to proceed to Step 6. Remember, it can take up to 30 minutes for the flow logs from your workloads to begin flowing to the Amazon ES cluster.
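You can also check for the indexes from the command line using the Elasticsearch _cat API; the endpoint is a placeholder, and your IP address must be allowed by the domain access policy:
curl -s 'https://<your-es-endpoint>/_cat/indices/cwl-*?v'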
Step 6: Using the SGDashboard to analyze VPC network traffic
You now need to set up a Kibana dashboard to monitor the traffic in your VPC.
Choose es-flowlogs under Elasticsearch domain name.
Click the link next to Kibana, as shown in the following screenshot.
The first time you access Kibana, you will be asked to set the default index. To set the default index in the Amazon ES cluster:
Set the Index name or pattern to cwl-*.
For Time-field name, type @timestamp.
Choose Create.
Load the SGDashboard:
Download this JSON file and save it to your computer. The file includes a dashboard and visualizations I created for this blog post’s purposes.
In Kibana, choose Management in the navigation pane, choose Saved Objects, and then import the file you just downloaded.
Choose Dashboard and Open to load the SGDashboard you just imported. (You might have to press Enter in the top search box to have the dashboard load the first time.)
The following screenshot shows the SGDashboard after it has loaded.
The SGDashboard is composed of a set of visualizations. Each visualization contains a view or summary of the underlying data contained in the Amazon ES cluster, as shown in the preceding screenshot. You can control the timeframe for the dashboard in the upper right corner. By clicking the timeframe, the dashboard exposes alternative timeframes that you can select.
The SGDashboard includes a list of security groups, destination ports, source IP addresses, actions, protocols, and connection directions as well as raw VPC Flow Log records. This information is useful because you can compare this to your security group configurations. Ports might be open in the security group but have no network traffic flowing to the instances on those ports, which means the corresponding rules can probably be removed. Also, by evaluating IP ranges in use, you can narrow the ranges to only those IP addresses required for the application. The following screenshot on the left shows a view of the SGDashboard for a specific security group. By comparing its accepted inbound IP addresses with the security group rules in the following screenshot on the right, you can ensure the source IP ranges are sufficiently restrictive.
Analyze VPC Flow Logs data
Amazon ES allows you to quickly view and filter VPC Flow Logs data to determine what network traffic is flowing in your VPC. This analysis requires an understanding of security groups and elastic network interfaces (ENIs). If two security groups are associated with the same ENI, traffic accepted through the first security group is recorded against both groups, because both are attached to the ENI that received the traffic. Therefore, when you click a security group that you want to filter on, additional groups might still appear in the list because they are referenced by the same VPC Flow Logs records.
The following screenshot on the left is a view of the SGDashboard with a security group selected (sg-978414e8). Even though that security group has a filter, two additional security groups remain in the dashboard. The following screenshot on the right shows the raw log data where each record contains all three security groups and demonstrates that all three security groups share a common set of flow log records.
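To see exactly which security groups share an ENI (and will therefore share flow log records), you can list the groups attached to each interface. The following is a small illustrative boto3 sketch; the VPC ID is a placeholder for your own.

```python
# Show the security groups attached to each ENI in a VPC. Flow log records for an
# ENI are attributed to every group attached to it, which is why filtering on one
# group can still surface others. The VPC ID below is a placeholder.
import boto3

ec2 = boto3.client("ec2")
response = ec2.describe_network_interfaces(
    Filters=[{"Name": "vpc-id", "Values": ["vpc-0123456789abcdef0"]}]
)

for eni in response["NetworkInterfaces"]:
    group_ids = [group["GroupId"] for group in eni["Groups"]]
    print(eni["NetworkInterfaceId"], group_ids)
```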
Also, note that security groups are stateful, so if the instance itself is initiating traffic to a different location, the return traffic will be displayed in the Kibana dashboard. The best example of this is port 123 Network Time Protocol (NTP). This type of traffic can be easily removed from the display by choosing the port on the right side of the dashboard, and then reversing the filter, as shown in the following screenshot. By reversing the filter, you can exclude data from the view.
Example: Unused security groups
Let’s say that some security groups are no longer in use. First, I change the time range by clicking the current time range in the top right corner of the dashboard, as shown in the following screenshot. I select Week to date.
As the following screenshot shows, the dashboard has identified five security groups that have had traffic during the week to date.
As you can see in the following screenshot, I have many security groups in my test account that are not in use. Any security groups not in the SGDashboard are candidates for removal.
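If you want to generate that candidate list programmatically, you can compare every security group in the region against the group IDs that appear in the SGDashboard. The sketch below assumes you copy the dashboard's group IDs into the script; the IDs shown are examples only.

```python
# Flag security groups that never appear in the SGDashboard (no observed traffic).
# The seen_in_dashboard set is illustrative; populate it from the Kibana dashboard.
import boto3

ec2 = boto3.client("ec2")
seen_in_dashboard = {"sg-978414e8", "sg-63ed8c1c"}  # example IDs from the dashboard

all_group_ids = {g["GroupId"] for g in ec2.describe_security_groups()["SecurityGroups"]}
unused_candidates = sorted(all_group_ids - seen_in_dashboard)
print("Security groups with no observed traffic:", unused_candidates)
```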
Example: Unused inbound rules
Let’s take a look at security group sg-63ed8c1c from the preceding screenshot. When I click sg-63ed8c1c (the security group ID) in the dashboard, a filter is applied that reduces the security groups displayed to only the records with that security group included. We can compare the traffic associated with this security group in the SGDashboard (shown in the following screenshot) to the security group rules in the EC2 console.
As the following screenshot of the EC2 console shows, this security group has only two inbound rules: one for HTTP on port 80 and one for RDP. The SGDashboard shows that traffic is not flowing on port 80, so I can safely remove that rule from the security group.
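Removing the rule can be done in the EC2 console or scripted. The following boto3 sketch revokes an inbound HTTP rule like the one above; the group ID and CIDR range are examples and must match the rule exactly as it exists in your security group.

```python
# Revoke the unused inbound HTTP rule identified in the SGDashboard. The group ID
# and CIDR are examples; the values must match the existing rule exactly.
import boto3

ec2 = boto3.client("ec2")
ec2.revoke_security_group_ingress(
    GroupId="sg-63ed8c1c",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 80,
        "ToPort": 80,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }],
)
```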
Summary
It can be challenging to help ensure that your AWS Cloud environment allows only intended traffic and is as secure and manageable as possible. In this post, I have shown how to enable VPC Flow Logs. I then showed how to use Firehose and Lambda to add security group IDs, directions, and locations to the VPC Flow Logs dataset. The SGDashboard then enables you to analyze the flow log data and compare it with your security group configurations to improve your cloud security.
If you have comments about this blog post, submit them in the “Comments” section below. If you have implementation or troubleshooting questions about the solution in this post, please start a new thread on the AWS WAF forum.
As a hosting provider, we speak with many businesses that need a fix for their slow site speeds. There are many reasons why hosting infrastructure may be constraining your site performance: typically old infrastructure used by some hosting providers, contention issues, and even the physical location of the servers. Hosting your site in a high-speed environment with world-class managed services (such as Anchor) provides the right foundations, and utilising a Content Delivery Network (CDN) can give you that extra boost in speed and performance you desire – and deserve. One of the more popular site performance applications is Cloudflare, a global network designed to optimize security, performance, and reliability without the bloat of legacy technologies. Cloudflare has robust CDN capabilities in addition to other security services like DDoS (Distributed Denial of Service) protection and reverse proxies.
A traditional CDN is a group of web servers distributed across multiple locations around the world, which delivers content more efficiently to users. The server selected for delivering content to a specific user is typically based on a measure of network proximity. For example, the server with the fewest network hops or the server with the quickest response time is chosen.
If you are looking to take advantage of a CDN, a great place to start is Cloudflare's free plan. This basic plan can be set up in less than 5 minutes and only requires a simple change to your domain's DNS settings to get you up and running. There is no hardware or software to install or maintain, and you do not need to change any of your site's existing code. As a partner of Cloudflare, we can offer discounted pricing to customers who want some of Cloudflare's advanced performance and security features, such as image optimisations, firewalls, and PCI compliance, to name just a few.
Cloudflare utilises more than 40 data centres in almost as many countries and uses the size of its 'quietly built cloud' to process more than 5% of all web requests. Its service includes:
A Global CDN
DDoS Protection
Page Rules
DDoS Protection – Why do I need it and how do I protect against an attack?
In 2015, the internet saw the highest rate of DDoS attacks ever. Generally, attackers flood a network or service (usually from thousands of IP addresses) in order to overwhelm the server and make a network or website unavailable to its users. It is extremely important to make sure your site is protected from such an attack, especially if your site is an eCommerce site where downtime will prevent customers from completing their purchases.
What are Global CDNs?
As mentioned above, Content Delivery Networks (CDNs) are important for a number of reasons. The primary thing a CDN does is provide alternative server nodes, or locations, from which the user can download resources (usually JavaScript or static content). This means that although the origin server may be located in the US, someone in Sydney can still experience fast load and response times thanks to the reduced latency. This is extremely important for sites that have users in other countries, especially those who are shopping online, as these sites generally have large volumes of images, which can be slow to load. Overall, it improves your users' experience in terms of speed.
Page Rules
Page Rules give you the ability to control how Cloudflare works on a per-URL or per-subdomain basis, which means you can customise its functionality to match your domain's unique needs. They let you take various actions based on the page's URL, such as creating redirects, fine-tuning caching behaviour, or enabling and disabling Cloudflare's various services. This helps you to optimize speed, harden security, increase reliability, maximize bandwidth savings, and much more.
Other benefits include the added scalability and capacity that a CDN like Cloudflare provides: not only higher availability but also lower packet loss. Further, Cloudflare provides website traffic insights and other analytics, such as threat monitoring, so that you can improve your site even further.
As a partner of Cloudflare, Anchor receives discounted rates for the Pro and Business plans and can also help you install the free plan if you are a customer. The easiest part about Cloudflare, however, is that it only requires a simple change to your domain's DNS settings. There is no hardware or software to install or maintain, and you do not need to change any of your site's existing code.
If your site is running slow and you want to know how you can boost its performance, contact us for a free, no-obligation site hosting check-up.
In case you missed any AWS Security Blog posts published so far in 2017, they are summarized and linked to below. The posts are shown in reverse chronological order (most recent first), and the subject matter ranges from protecting dynamic web applications against DDoS attacks to monitoring AWS account configuration changes and API calls to Amazon EC2 security groups.
March
March 22:How to Help Protect Dynamic Web Applications Against DDoS Attacks by Using Amazon CloudFront and Amazon Route 53 Using a content delivery network (CDN) such as Amazon CloudFront to cache and serve static text and images or downloadable objects such as media files and documents is a common strategy to improve webpage load times, reduce network bandwidth costs, lessen the load on web servers, and mitigate distributed denial of service (DDoS) attacks. AWS WAF is a web application firewall that can be deployed on CloudFront to help protect your application against DDoS attacks by giving you control over which traffic to allow or block by defining security rules. When users access your application, the Domain Name System (DNS) translates human-readable domain names (for example, www.example.com) to machine-readable IP addresses (for example, 192.0.2.44). A DNS service, such as Amazon Route 53, can effectively connect users’ requests to a CloudFront distribution that proxies requests for dynamic content to the infrastructure hosting your application’s endpoints. In this blog post, I show you how to deploy CloudFront with AWS WAF and Route 53 to help protect dynamic web applications (with dynamic content such as a response to user input) against DDoS attacks. The steps shown in this post are key to implementing the overall approach described in AWS Best Practices for DDoS Resiliency and enable the built-in, managed DDoS protection service, AWS Shield.
March 21:New AWS Encryption SDK for Python Simplifies Multiple Master Key Encryption The AWS Cryptography team is happy to announce a Python implementation of the AWS Encryption SDK. This new SDK helps manage data keys for you, and it simplifies the process of encrypting data under multiple master keys. As a result, this new SDK allows you to focus on the code that drives your business forward. It also provides a framework you can easily extend to ensure that you have a cryptographic library that is configured to match and enforce your standards. The SDK also includes ready-to-use examples. If you are a Java developer, you can refer to this blog post to see specific Java examples for the SDK. In this blog post, I show you how you can use the AWS Encryption SDK to simplify the process of encrypting data and how to protect your encryption keys in ways that help improve application availability by not tying you to a single region or key management solution.
March 21:Updated CJIS Workbook Now Available by Request The need for guidance when implementing Criminal Justice Information Services (CJIS)–compliant solutions has become of paramount importance as more law enforcement customers and technology partners move to store and process criminal justice data in the cloud. AWS services allow these customers to easily and securely architect a CJIS-compliant solution when handling criminal justice data, creating a durable, cost-effective, and secure IT infrastructure that better supports local, state, and federal law enforcement in carrying out their public safety missions. AWS has created several documents (collectively referred to as the CJIS Workbook) to assist you in aligning with the FBI’s CJIS Security Policy. You can use the workbook as a framework for developing CJIS-compliant architecture in the AWS Cloud. The workbook helps you define and test the controls you operate, and document the dependence on the controls that AWS operates (compute, storage, database, networking, regions, Availability Zones, and edge locations).
March 9:New Cloud Directory API Makes It Easier to Query Data Along Multiple Dimensions Today, we made available a new Cloud Directory API, ListObjectParentPaths, that enables you to retrieve all available parent paths for any directory object across multiple hierarchies. Use this API when you want to fetch all parent objects for a specific child object. The order of the paths and objects returned is consistent across iterative calls to the API, unless objects are moved or deleted. In case an object has multiple parents, the API allows you to control the number of paths returned by using a paginated call pattern. In this blog post, I use an example directory to demonstrate how this new API enables you to retrieve data across multiple dimensions to implement powerful applications quickly.
March 8:How to Access the AWS Management Console Using AWS Microsoft AD and Your On-Premises Credentials AWS Directory Service for Microsoft Active Directory, also known as AWS Microsoft AD, is a managed Microsoft Active Directory (AD) hosted in the AWS Cloud. Now, AWS Microsoft AD makes it easy for you to give your users permission to manage AWS resources by using on-premises AD administrative tools. With AWS Microsoft AD, you can grant your on-premises users permissions to resources such as the AWS Management Console instead of adding AWS Identity and Access Management (IAM) user accounts or configuring AD Federation Services (AD FS) with Security Assertion Markup Language (SAML). In this blog post, I show how to use AWS Microsoft AD to enable your on-premises AD users to sign in to the AWS Management Console with their on-premises AD user credentials to access and manage AWS resources through IAM roles.
March 7:How to Protect Your Web Application Against DDoS Attacks by Using Amazon Route 53 and an External Content Delivery Network Distributed Denial of Service (DDoS) attacks are attempts by a malicious actor to flood a network, system, or application with more traffic, connections, or requests than it is able to handle. To protect your web application against DDoS attacks, you can use AWS Shield, a DDoS protection service that AWS provides automatically to all AWS customers at no additional charge. You can use AWS Shield in conjunction with DDoS-resilient web services such as Amazon CloudFront and Amazon Route 53 to improve your ability to defend against DDoS attacks. Learn more about architecting for DDoS resiliency by reading the AWS Best Practices for DDoS Resiliency whitepaper. You also have the option of using Route 53 with an externally hosted content delivery network (CDN). In this blog post, I show how you can help protect the zone apex (also known as the root domain) of your web application by using Route 53 to perform a secure redirect to prevent discovery of your application origin.
February 23:s2n Is Now Handling 100 Percent of SSL Traffic for Amazon S3 Today, we’ve achieved another important milestone for securing customer data: we have replaced OpenSSL with s2n for all internal and external SSL traffic in Amazon Simple Storage Service (Amazon S3) commercial regions. This was implemented with minimal impact to customers, and multiple means of error checking were used to ensure a smooth transition, including client integration tests, catching potential interoperability conflicts, and identifying memory leaks through fuzz testing.
February 22:Easily Replace or Attach an IAM Role to an Existing EC2 Instance by Using the EC2 Console AWS Identity and Access Management (IAM) roles enable your applications running on Amazon EC2 to use temporary security credentials. IAM roles for EC2 make it easier for your applications to make API requests securely from an instance because they do not require you to manage AWS security credentials that the applications use. Recently, we enabled you to use temporary security credentials for your applications by attaching an IAM role to an existing EC2 instance by using the AWS CLI and SDK. To learn more, see New! Attach an AWS IAM Role to an Existing Amazon EC2 Instance by Using the AWS CLI. Starting today, you can attach an IAM role to an existing EC2 instance from the EC2 console. You can also use the EC2 console to replace an IAM role attached to an existing instance. In this blog post, I will show how to attach an IAM role to an existing EC2 instance from the EC2 console.
February 22:How to Audit Your AWS Resources for Security Compliance by Using Custom AWS Config Rules AWS Config Rules enables you to implement security policies as code for your organization and evaluate configuration changes to AWS resources against these policies. You can use Config rules to audit your use of AWS resources for compliance with external compliance frameworks such as CIS AWS Foundations Benchmark and with your internal security policies related to the US Health Insurance Portability and Accountability Act (HIPAA), the Federal Risk and Authorization Management Program (FedRAMP), and other regimes. AWS provides some predefined, managed Config rules. You also can create custom Config rules based on criteria you define within an AWS Lambda function. In this post, I show how to create a custom rule that audits AWS resources for security compliance by enabling VPC Flow Logs for an Amazon Virtual Private Cloud (VPC). The custom rule meets requirement 4.3 of the CIS AWS Foundations Benchmark: “Ensure VPC flow logging is enabled in all VPCs.”
February 13:How to Enable Multi-Factor Authentication for AWS Services by Using AWS Microsoft AD and On-Premises Credentials You can now enable multi-factor authentication (MFA) for users of AWS services such as Amazon WorkSpaces and Amazon QuickSight and their on-premises credentials by using your AWS Directory Service for Microsoft Active Directory (Enterprise Edition) directory, also known as AWS Microsoft AD. MFA adds an extra layer of protection to a user name and password (the first “factor”) by requiring users to enter an authentication code (the second factor), which has been provided by your virtual or hardware MFA solution. These factors together provide additional security by preventing access to AWS services, unless users supply a valid MFA code.
February 13:How to Create an Organizational Chart with Separate Hierarchies by Using Amazon Cloud Directory Amazon Cloud Directory enables you to create directories for a variety of use cases, such as organizational charts, course catalogs, and device registries. Cloud Directory offers you the flexibility to create directories with hierarchies that span multiple dimensions. For example, you can create an organizational chart that you can navigate through separate hierarchies for reporting structure, location, and cost center. In this blog post, I show how to use Cloud Directory APIs to create an organizational chart with two separate hierarchies in a single directory. I also show how to navigate the hierarchies and retrieve data. I use the Java SDK for all the sample code in this post, but you can use other language SDKs or the AWS CLI.
February 10: How to Easily Log On to AWS Services by Using Your On-Premises Active Directory AWS Directory Service for Microsoft Active Directory (Enterprise Edition), also known as Microsoft AD, now enables your users to log on with just their on-premises Active Directory (AD) user name—no domain name is required. This new domainless logon feature makes it easier to set up connections to your on-premises AD for use with applications such as Amazon WorkSpaces and Amazon QuickSight, and it keeps the user logon experience free from network naming. This new interforest trusts capability is now available when using Microsoft AD with Amazon WorkSpaces and Amazon QuickSight Enterprise Edition. In this blog post, I explain how Microsoft AD domainless logon works with AD interforest trusts, and I show an example of setting up Amazon WorkSpaces to use this capability.
February 9:New! Attach an AWS IAM Role to an Existing Amazon EC2 Instance by Using the AWS CLI AWS Identity and Access Management (IAM) roles enable your applications running on Amazon EC2 to use temporary security credentials that AWS creates, distributes, and rotates automatically. Using temporary credentials is an IAM best practice because you do not need to maintain long-term keys on your instance. Using IAM roles for EC2 also eliminates the need to use long-term AWS access keys that you have to manage manually or programmatically. Starting today, you can enable your applications to use temporary security credentials provided by AWS by attaching an IAM role to an existing EC2 instance. You can also replace the IAM role attached to an existing EC2 instance. In this blog post, I show how you can attach an IAM role to an existing EC2 instance by using the AWS CLI.
January 30:How to Protect Data at Rest with Amazon EC2 Instance Store Encryption Encrypting data at rest is vital for regulatory compliance to ensure that sensitive data saved on disks is not readable by any user or application without a valid key. Some compliance regulations such as PCI DSS and HIPAA require that data at rest be encrypted throughout the data lifecycle. To this end, AWS provides data-at-rest options and key management to support the encryption process. For example, you can encrypt Amazon EBS volumes and configure Amazon S3 buckets for server-side encryption (SSE) using AES-256 encryption. Additionally, Amazon RDS supports Transparent Data Encryption (TDE). Instance storage provides temporary block-level storage for Amazon EC2 instances. This storage is located on disks attached physically to a host computer. Instance storage is ideal for temporary storage of information that frequently changes, such as buffers, caches, and scratch data. By default, files stored on these disks are not encrypted. In this blog post, I show a method for encrypting data on Linux EC2 instance stores by using Linux built-in libraries. This method encrypts files transparently, which protects confidential data. As a result, applications that process the data are unaware of the disk-level encryption.
January 27: How to Detect and Automatically Remediate Unintended Permissions in Amazon S3 Object ACLs with CloudWatch Events Amazon S3 Access Control Lists (ACLs) enable you to specify permissions that grant access to S3 buckets and objects. When S3 receives a request for an object, it verifies whether the requester has the necessary access permissions in the associated ACL. For example, you could set up an ACL for an object so that only the users in your account can access it, or you could make an object public so that it can be accessed by anyone. If the number of objects and users in your AWS account is large, ensuring that you have attached correctly configured ACLs to your objects can be a challenge. For example, what if a user were to call the PutObjectAcl API call on an object that is supposed to be private and make it public? Or, what if a user were to call the PutObject with the optional Acl parameter set to public-read, therefore uploading a confidential file as publicly readable? In this blog post, I show a solution that uses Amazon CloudWatch Events to detect PutObject and PutObjectAcl API calls in near-real time and helps ensure that the objects remain private by making automatic PutObjectAcl calls, when necessary.
January 24:New SOC 2 Report Available: Confidentiality As with everything at Amazon, the success of our security and compliance program is primarily measured by one thing: our customers’ success. Our customers drive our portfolio of compliance reports, attestations, and certifications that support their efforts in running a secure and compliant cloud environment. As a result of our engagement with key customers across the globe, we are happy to announce the publication of our new SOC 2 Confidentiality report. This report is available now through AWS Artifact in the AWS Management Console.
January 18:Compliance in the Cloud for New Financial Services Cybersecurity Regulations Financial regulatory agencies are focused more than ever on ensuring responsible innovation. Consequently, if you want to achieve compliance with financial services regulations, you must be increasingly agile and employ dynamic security capabilities. AWS enables you to achieve this by providing you with the tools you need to scale your security and compliance capabilities on AWS. The following breakdown of the most recent cybersecurity regulations, NY DFS Rule 23 NYCRR 500, demonstrates how AWS continues to focus on your regulatory needs in the financial services sector.
January 9:New Amazon GameDev Blog Post: Protect Multiplayer Game Servers from DDoS Attacks by Using Amazon GameLift In online gaming, distributed denial of service (DDoS) attacks target a game’s network layer, flooding servers with requests until performance degrades considerably. These attacks can limit a game’s availability to players and limit the player experience for those who can connect. Today’s new Amazon GameDev Blog post uses a typical game server architecture to highlight DDoS attack vulnerabilities and discusses how to stay protected by using built-in AWS Cloud security, AWS security best practices, and the security features of Amazon GameLift. Read the post to learn more.
January 6:The Top 10 Most Downloaded AWS Security and Compliance Documents in 2016 The following list includes the 10 most downloaded AWS security and compliance documents in 2016. Using this list, you can learn about what other people found most interesting about security and compliance last year.
January 6:FedRAMP Compliance Update: AWS GovCloud (US) Region Receives a JAB-Issued FedRAMP High Baseline P-ATO for Three New Services Three new services in the AWS GovCloud (US) region have received a Provisional Authority to Operate (P-ATO) from the Joint Authorization Board (JAB) under the Federal Risk and Authorization Management Program (FedRAMP). JAB issued the authorization at the High baseline, which enables US government agencies and their service providers the capability to use these services to process the government’s most sensitive unclassified data, including Personal Identifiable Information (PII), Protected Health Information (PHI), Controlled Unclassified Information (CUI), criminal justice information (CJI), and financial data.
January 4:The Top 20 Most Viewed AWS IAM Documentation Pages in 2016 The following 20 pages were the most viewed AWS Identity and Access Management (IAM) documentation pages in 2016. I have included a brief description with each link to give you a clearer idea of what each page covers. Use this list to see what other people have been viewing and perhaps to pique your own interest about a topic you’ve been meaning to research.
January 3:The Most Viewed AWS Security Blog Posts in 2016 The following 10 posts were the most viewed AWS Security Blog posts that we published during 2016. You can use this list as a guide to catch up on your blog reading or even read a post again that you found particularly useful.
January 3:How to Monitor AWS Account Configuration Changes and API Calls to Amazon EC2 Security Groups You can use AWS security controls to detect and mitigate risks to your AWS resources. The purpose of each security control is defined by its control objective. For example, the control objective of an Amazon VPC security group is to permit only designated traffic to enter or leave a network interface. Let’s say you have an Internet-facing e-commerce website, and your security administrator has determined that only HTTP (TCP port 80) and HTTPS (TCP 443) traffic should be allowed access to the public subnet. As a result, your administrator configures a security group to meet this control objective. What if, though, someone were to inadvertently change this security group’s rules and enable FTP or other protocols to access the public subnet from any location on the Internet? That expanded access could weaken the security posture of your assets. Consequently, your administrator might need to monitor the integrity of your company’s security controls so that the controls maintain their desired effectiveness. In this blog post, I explore two methods for detecting unintended changes to VPC security groups. The two methods address not only control objectives but also control failures.
If you have questions about or issues with implementing the solutions in any of these posts, please start a new thread on the forum identified near the end of each post.
Using a content delivery network (CDN) such as Amazon CloudFront to cache and serve static text and images or downloadable objects such as media files and documents is a common strategy to improve webpage load times, reduce network bandwidth costs, lessen the load on web servers, and mitigate distributed denial of service (DDoS) attacks. AWS WAF is a web application firewall that can be deployed on CloudFront to help protect your application against DDoS attacks by giving you control over which traffic to allow or block by defining security rules. When users access your application, the Domain Name System (DNS) translates human-readable domain names (for example, www.example.com) to machine-readable IP addresses (for example, 192.0.2.44). A DNS service, such as Amazon Route 53, can effectively connect users’ requests to a CloudFront distribution that proxies requests for dynamic content to the infrastructure hosting your application’s endpoints.
In this blog post, I show you how to deploy CloudFront with AWS WAF and Route 53 to help protect dynamic web applications (with dynamic content such as a response to user input) against DDoS attacks. The steps shown in this post are key to implementing the overall approach described in AWS Best Practices for DDoS Resiliency and enable the built-in, managed DDoS protection service, AWS Shield.
Background
AWS hosts CloudFront and Route 53 services on a distributed network of proxy servers in data centers throughout the world called edge locations. Using the global Amazon network of edge locations for application delivery and DNS service plays an important part in building a comprehensive defense against DDoS attacks for your dynamic web applications. These web applications can benefit from the increased security and availability provided by CloudFront and Route 53, as well as from reduced latency that improves end users' experience.
The following screenshot of an Amazon.com webpage shows how static and dynamic content can compose a dynamic web application that is delivered via HTTPS protocol for the encryption of user page requests as well as the pages that are returned by a web server.
The following map shows the global Amazon network of edge locations available to serve static content and proxy requests for dynamic content back to the origin as of the writing of this blog post. For the latest list of edge locations, see AWS Global Infrastructure.
How AWS Shield, CloudFront, and Route 53 work to help protect against DDoS attacks
To help keep your dynamic web applications available when they are under DDoS attack, the steps in this post enable AWS Shield Standard by configuring your applications behind CloudFront and Route 53. AWS Shield Standard protects your resources from common, frequently occurring network and transport layer DDoS attacks. Attack traffic can be geographically isolated and absorbed using the capacity in edge locations close to the source. Additionally, you can configure geographical restrictions to help block attacks originating from specific countries.
The request-routing technology in CloudFront connects each client to the nearest edge location, as determined by continuously updated latency measurements. HTTP and HTTPS requests sent to CloudFront can be monitored, and access to your application resources can be controlled at edge locations using AWS WAF. Based on conditions that you specify in AWS WAF, such as the IP addresses that requests originate from or the values of query strings, traffic can be allowed, blocked, or allowed and counted for further investigation or remediation. The following diagram shows how static and dynamic web application content can originate from endpoint resources within AWS or your corporate data center. For more details, see How CloudFront Delivers Content and How CloudFront Works with Regional Edge Caches.
Route 53 DNS requests and subsequent application traffic routed through CloudFront are inspected inline. Always-on monitoring, anomaly detection, and mitigation against common infrastructure DDoS attacks such as SYN/ACK floods, UDP floods, and reflection attacks are built into both Route 53 and CloudFront. For a review of common DDoS attack vectors, see How to Help Prepare for DDoS Attacks by Reducing Your Attack Surface. When the SYN flood attack threshold is exceeded, SYN cookies are activated to avoid dropping connections from legitimate clients. Deterministic packet filtering drops malformed TCP packets and invalid DNS requests, only allowing traffic to pass that is valid for the service. Heuristics-based anomaly detection evaluates attributes such as type, source, and composition of traffic. Traffic is scored across many dimensions, and only the most suspicious traffic is dropped. This method allows you to avoid false positives while protecting application availability.
Route 53 is also designed to withstand DNS query floods, which are real DNS requests that can continue for hours and attempt to exhaust DNS server resources. Route 53 uses shuffle sharding and anycast striping to spread DNS traffic across edge locations and help protect the availability of the service.
The next four sections provide guidance about how to deploy CloudFront, Route 53, AWS WAF, and, optionally, AWS Shield Advanced.
Deploy CloudFront
To take advantage of application delivery with DDoS mitigations at the edge, start by creating a CloudFront distribution and configuring origins:
Specify cache behavior settings for the distribution, as shown in the following screenshot. You can configure each URL path pattern with a set of associated cache behaviors. For dynamic web applications, set the Minimum TTL to 0 so that CloudFront will make a GET request with an If-Modified-Since header back to the origin. When CloudFront proxies traffic to the origin from edge locations and back, multiple concurrent requests for the same object are collapsed into a single request. The request is sent over a persistent connection from the edge location to the region over networks monitored by AWS. The use of a large initial TCP window size in CloudFront maximizes the available bandwidth, and TCP Fast Open (TFO) reduces latency.
To ensure that all traffic to CloudFront is encrypted and to enable SSL termination from clients at global edge locations, specify Redirect HTTP to HTTPS for Viewer Protocol Policy. Moving SSL termination to CloudFront offloads computationally expensive SSL negotiation, helps mitigate SSL abuse, and reduces latency with the use of OCSP stapling and session tickets. For more information about options for serving HTTPS requests, see Choosing How CloudFront Serves HTTPS Requests. For dynamic web applications, set Allowed HTTP Methods to include all methods, set Forward Headers to All, and for Query String Forwarding and Caching, choose Forward all, cache based on all.
Specify distribution settings for the distribution, as shown in the following screenshot. Enter your domain names in the Alternate Domain Names box and choose Custom SSL Certificate.
Choose Create Distribution. Note the x.cloudfront.net Domain Name of the distribution. In the next section, you will configure Route 53 to route traffic to this CloudFront distribution domain name.
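If you prefer to script the distribution instead of using the console, the following boto3 sketch shows the settings described above: a Minimum TTL of 0, all HTTP methods allowed, all headers, cookies, and query strings forwarded, and viewers redirected to HTTPS. The origin domain name, alternate domain name, and ACM certificate ARN are placeholders for your own values.

```python
# A minimal sketch of creating the CloudFront web distribution for a dynamic web
# application with boto3. All domain names, IDs, and the certificate ARN below are
# placeholders; substitute your own values before running.
import time
import boto3

cloudfront = boto3.client("cloudfront")

distribution_config = {
    "CallerReference": str(time.time()),
    "Comment": "Dynamic web application behind CloudFront",
    "Enabled": True,
    "Aliases": {"Quantity": 1, "Items": ["www.example.com"]},
    "Origins": {
        "Quantity": 1,
        "Items": [{
            "Id": "dynamic-origin",
            "DomainName": "origin.example.com",
            "CustomOriginConfig": {
                "HTTPPort": 80,
                "HTTPSPort": 443,
                "OriginProtocolPolicy": "https-only",
            },
        }],
    },
    "DefaultCacheBehavior": {
        "TargetOriginId": "dynamic-origin",
        "ViewerProtocolPolicy": "redirect-to-https",  # redirect HTTP to HTTPS
        "MinTTL": 0,                                  # always revalidate dynamic content
        "AllowedMethods": {
            "Quantity": 7,
            "Items": ["GET", "HEAD", "OPTIONS", "PUT", "POST", "PATCH", "DELETE"],
            "CachedMethods": {"Quantity": 2, "Items": ["GET", "HEAD"]},
        },
        "ForwardedValues": {
            "QueryString": True,
            "Cookies": {"Forward": "all"},
            "Headers": {"Quantity": 1, "Items": ["*"]},  # forward all headers
        },
        "TrustedSigners": {"Enabled": False, "Quantity": 0},
    },
    "ViewerCertificate": {
        "ACMCertificateArn": "arn:aws:acm:us-east-1:111122223333:certificate/EXAMPLE",
        "SSLSupportMethod": "sni-only",
        "MinimumProtocolVersion": "TLSv1.1_2016",
    },
}

response = cloudfront.create_distribution(DistributionConfig=distribution_config)
print(response["Distribution"]["DomainName"])  # the dxxxxxxxxxxxxx.cloudfront.net name
```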
Configure Route 53
When you created a web distribution in the previous section, CloudFront assigned a domain name to the distribution, such as d111111abcdef8.cloudfront.net. You can use this domain name in the URLs for your content, such as: http://d111111abcdef8.cloudfront.net/logo.jpg.
Alternatively, you might prefer to use your own domain name in URLs, such as: http://example.com/logo.jpg. You can accomplish this by creating a Route 53 alias resource record set that routes dynamic web application traffic to your CloudFront distribution by using your domain name. Alias resource record sets are virtual records specific to Route 53 that map queries for your domain name to your CloudFront distribution. Alias resource record sets are similar to CNAME records except there is no charge for DNS queries to Route 53 alias resource record sets mapped to AWS services. Alias resource record sets are also not visible to resolvers, and they can be created for the root domain (zone apex) as well as subdomains.
A hosted zone, similar to a DNS zone file, is a collection of records that belongs to a single parent domain name. Each hosted zone has four nonoverlapping name servers in a delegation set. If a DNS query is dropped, the client automatically retries the next name server. If you have not already registered a domain name and configured a hosted zone for your domain, complete those two prerequisite steps before proceeding. Then, to create the alias resource record set in the Route 53 console:
Choose the name of the hosted zone for the domain that you want to use to route traffic to your CloudFront distribution.
Choose Create Record Set.
Specify the following values:
Name – Type the domain name that you want to use to route traffic to your CloudFront distribution. The default value is the name of the hosted zone. For example, if the name of the hosted zone is example.com and you want to use acme.example.com to route traffic to your distribution, type acme.
Type – Choose A – IPv4 address. If IPv6 is enabled for the distribution and you are creating a second resource record set, choose AAAA – IPv6 address.
Alias – Choose Yes.
Alias Target – In the CloudFront distributions section, choose the name that CloudFront assigned to the distribution when you created it.
Routing Policy – Accept the default value of Simple.
Evaluate Target Health – Accept the default value of No.
Choose Create.
If IPv6 is enabled for the distribution, repeat Steps 4 through 6. Specify the same settings except for the Type field, as explained in Step 5.
The following screenshot of the Route 53 console shows a Route 53 alias resource record set that is configured to map a domain name to a CloudFront distribution.
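The same alias resource record set can be created with the SDK. The sketch below uses boto3; the hosted zone ID, record name, and distribution domain name are placeholders, while Z2FDTNDATAQYW2 is the fixed hosted zone ID that Route 53 uses for CloudFront alias targets.

```python
# Create (or update) the Route 53 alias resource record set that maps a domain name
# to a CloudFront distribution. The hosted zone ID and domain names are placeholders.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z1EXAMPLE",  # your example.com hosted zone
    ChangeBatch={
        "Comment": "Route acme.example.com to the CloudFront distribution",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "acme.example.com.",
                "Type": "A",
                "AliasTarget": {
                    "HostedZoneId": "Z2FDTNDATAQYW2",   # fixed value for CloudFront targets
                    "DNSName": "d111111abcdef8.cloudfront.net.",
                    "EvaluateTargetHealth": False,
                },
            },
        }],
    },
)
```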
If your dynamic web application requires geo redundancy, you can use latency-based routing in Route 53 to run origin servers in different AWS regions. Route 53 is integrated with CloudFront to collect latency measurements from each edge location. With Route 53 latency-based routing, each CloudFront edge location goes to the region with the lowest latency for the origin fetch.
Enable AWS WAF
AWS WAF is a web application firewall that helps detect and mitigate web application layer DDoS attacks by inspecting traffic inline. Application layer DDoS attacks use well-formed but malicious requests to evade mitigation and consume application resources. You can define custom security rules, grouped into web access control lists (web ACLs), that contain a set of conditions and actions to block attacking traffic. After you define web ACLs, you can apply them to CloudFront distributions; the rules in a web ACL are evaluated in the priority order you specified when you configured them. Real-time metrics and sampled web requests are provided for each web ACL.
The following diagram shows how the (a) flow of CloudFront access logs files to an Amazon S3 bucket (b) provides the source data for the Lambda log parser function (c) to identify HTTP flood traffic and update AWS WAF web ACLs. As CloudFront receives requests on behalf of your dynamic web application, it sends access logs to an S3 bucket, triggering the Lambda log parser. The Lambda function parses CloudFront access logs to identify suspicious behavior, such as an unusual number of requests or errors, and it automatically updates your AWS WAF rules to block subsequent requests from the IP addresses in question for a predefined amount of time that you specify.
In addition to automated rate-based blacklisting to help protect against HTTP flood attacks, prebuilt AWS CloudFormation templates are available to simplify the configuration of AWS WAF for a proactive application-layer security defense. The following diagram provides an overview of CloudFormation template input into the creation of the CommonAttackProtection stack that includes AWS WAF web ACLs used to block, allow, or count requests that meet the criteria defined in each rule.
To associate a web ACL with your CloudFront distribution in the CloudFront console:
Choose the link under the ID column for your CloudFront distribution.
Choose Edit on the General tab.
Choose your AWS WAF Web ACL from the drop-down list.
Choose Yes, Edit.
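These console steps can also be scripted by updating the distribution's configuration with the web ACL ID. The following boto3 sketch is illustrative; the distribution ID and web ACL ID are placeholders.

```python
# Associate an AWS WAF web ACL with an existing CloudFront distribution: fetch the
# current configuration, set WebACLId, and push the change back with the ETag.
# The distribution ID and web ACL ID below are placeholders.
import boto3

cloudfront = boto3.client("cloudfront")

current = cloudfront.get_distribution_config(Id="E1EXAMPLE")
config = current["DistributionConfig"]
config["WebACLId"] = "a1b2c3d4-5678-90ab-cdef-EXAMPLE11111"

cloudfront.update_distribution(
    Id="E1EXAMPLE",
    IfMatch=current["ETag"],
    DistributionConfig=config,
)
```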
Activate AWS Shield Advanced (optional)
Deploying CloudFront, Route 53, and AWS WAF as described in this post enables the built-in DDoS protections for your dynamic web applications that are included with AWS Shield Standard. (There is no upfront cost or charge for AWS Shield Standard beyond the normal pricing for CloudFront, Route 53, and AWS WAF.) AWS Shield Standard is designed to meet the needs of many dynamic web applications.
For dynamic web applications that have a high risk or history of frequent, complex, or high volume DDoS attacks, AWS Shield Advanced provides additional DDoS mitigation capacity, attack visibility, cost protection, and access to the AWS DDoS Response Team (DRT). For more information about AWS Shield Advanced pricing, see AWS Shield Advanced pricing. To activate advanced protection services, follow these steps:
Sign in to the AWS Management Console and open the AWS WAF console.
If this is your first time signing in to the AWS WAF console, choose Get started with AWS Shield Advanced. Otherwise, choose Protected resources.
Choose Activate AWS Shield Advanced.
Choose the resource type and resource to protect.
For Name, enter a friendly name that will help you identify the AWS resources that are protected. For example, My CloudFront AWS Shield Advanced distributions.
(Optional) For Web DDoS attack, select Enable. You will be prompted to associate an existing web ACL with these resources, or create a new ACL if you don’t have any yet.
Choose Add DDoS protection.
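If you manage protections programmatically, the same activation can be done through the AWS Shield API. The following boto3 sketch is illustrative; the distribution ARN is a placeholder, and create_subscription needs to be called only once per account.

```python
# Activate AWS Shield Advanced for the account and add protection for a CloudFront
# distribution. The account ID and distribution ID in the ARN are placeholders.
import boto3

shield = boto3.client("shield")

shield.create_subscription()  # subscribes the account to AWS Shield Advanced (one-time)
shield.create_protection(
    Name="My CloudFront AWS Shield Advanced distributions",
    ResourceArn="arn:aws:cloudfront::111122223333:distribution/E1EXAMPLE",
)
```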
Summary
In this blog post, I outline the steps to deploy CloudFront and configure Route 53 in front of your dynamic web application to leverage the global Amazon network of edge locations for DDoS resiliency. The post also provides guidance about enabling AWS WAF for application layer traffic monitoring and automated rules creation to block malicious traffic. I also cover the optional steps to activate AWS Shield Advanced, which helps build a more comprehensive defense against DDoS attacks for your dynamic web applications.
If you have comments about this post, submit them in the “Comments” section below. If you have questions about or issues implementing this solution, please open a new thread on the AWS WAF forum.
Distributed Denial of Service (DDoS) attacks are attempts by a malicious actor to flood a network, system, or application with more traffic, connections, or requests than it is able to handle. To protect your web application against DDoS attacks, you can use AWS Shield, a DDoS protection service that AWS provides automatically to all AWS customers at no additional charge. You can use AWS Shield in conjunction with DDoS-resilient web services such as Amazon CloudFront and Amazon Route 53 to improve your ability to defend against DDoS attacks. Learn more about architecting for DDoS resiliency by reading the AWS Best Practices for DDoS Resiliency whitepaper.
In this blog post, I show how you can help protect the zone apex (also known as the root domain) of your web application by using Route 53 to perform a secure redirect to your externally hosted content delivery network (CDN) distribution.
Background
When browsing the Internet, a user might type example.com instead of www.example.com. To make sure these requests are routed properly, it is necessary to create a Route 53 alias resource record set for the zone apex. For example.com, this would be an alias resource record set without any subdomain (www) defined. With Route 53, you can use an alias resource record set to point www or your zone apex directly at a CloudFront distribution. As a result, anyone resolving example.com or www.example.com will see only the CloudFront distribution. This makes it difficult for a malicious actor to find and attack your application origin.
You can also use Route 53 to route end users to a CDN outside AWS. The CDN provider will ask you to create a CNAME record to point www.example.com to your CDN distribution's hostname. Unfortunately, you cannot point your zone apex at the CDN the same way, because DNS does not allow a CNAME record at the zone apex. As a result, users who type example.com without www will not be routed to your web application unless you point the zone apex directly to your application origin.
The benefit of a secure redirect from the zone apex to www is that it helps protect your origin from being exposed to direct attacks.
Solution overview
The following solution diagram shows the AWS services this solution uses and how the solution uses them.
Here is how the process works:
A user’s browser makes a DNS request to Route 53.
Route 53 has a hosted zone for the example.com domain.
The hosted zone serves the record:
If the request is for the zone apex, the alias resource record set for the CloudFront distribution is served.
If the request is for the www subdomain, the CNAME for the externally hosted CDN is served.
S3 performs a secure redirect from example.com to www.example.com.
Note: All of the steps in this blog post’s solution use example.com as a domain name. You must replace this domain name with your own domain name.
AWS services used in this solution
You will use three AWS services in this walkthrough to build your zone apex–to–external CDN distribution redirect:
Route 53 – This post assumes that you are already using Route 53 to route users to your web application, which provides you with protection against common DDoS attacks, including DNS query floods. To learn more about migrating to Route 53, see Getting Started with Amazon Route 53.
S3 – S3 is object storage with a simple web service interface to store and retrieve any amount of data from anywhere on the web. S3 also allows you to configure a bucket for website hosting. In this walkthrough, you will use the S3 website hosting feature to redirect users from example.com to www.example.com, which points to your externally hosted CDN.
CloudFront – When architecting your application for DDoS resiliency, it is important to protect origin resources, such as S3 buckets, from discovery by a malicious actor. This is known as obfuscation. In this walkthrough, you will use a CloudFront distribution to obfuscate your S3 bucket.
Prerequisites
The solution in this blog post assumes that you already have a deployed web application, a registered domain with a Route 53 hosted zone, and an externally hosted CDN distribution for your www subdomain as part of your architecture. To deploy the solution, you will complete the following high-level steps:
Create an S3 bucket with HTTP redirection. This allows requests made to your zone apex to be redirected to your www subdomain.
Create and configure a CloudFront web distribution. I use a CloudFront distribution in front of my S3 web redirect so that I can leverage the advanced DDoS protection and scale that is native to CloudFront.
Configure an alias resource record set in your hosted zone. Alias resource record sets are similar to CNAME records, but you can set them at the zone apex.
Validate that the redirect is working.
Step 1: Create an S3 bucket with HTTP redirection
The following steps show how to configure your S3 bucket as a static website that will perform HTTP redirects to your www URL:
Configure static website hosting to redirect all requests to another host name:
Choose the S3 bucket you just created and then choose Properties.
Choose Static Website Hosting.
Choose Redirect all requests to another host name, and type your www domain name (for this walkthrough, www.example.com) in the Redirect all requests to box, as shown in the following screenshot.
Note: At the top of this tab, you will see an endpoint. Copy the endpoint because you will need it in Step 2 when you configure the CloudFront distribution. In this example, the endpoint is example-com.s3-website-us-east-1.amazonaws.com.
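As a reference, the following boto3 sketch performs the same Step 1 configuration: it creates the bucket and enables static website hosting that redirects every request to the www host name. The bucket name, region, and domain are placeholders for your own values.

```python
# Create the S3 bucket and configure it to redirect all requests to the www domain.
# Bucket name, region, and host name below are placeholders.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

s3.create_bucket(Bucket="example-com")
s3.put_bucket_website(
    Bucket="example-com",
    WebsiteConfiguration={
        "RedirectAllRequestsTo": {
            "HostName": "www.example.com",
            "Protocol": "http",
        }
    },
)
# Website endpoint needed in Step 2: example-com.s3-website-us-east-1.amazonaws.com
```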
Step 2: Create and configure a CloudFront web distribution
The following steps show how to create a CloudFront web distribution that protects the S3 bucket:
On the first page of the Create Distribution Wizard, in the Web section, choose Get Started.
The Create Distribution page has many values you can specify. For this walkthrough, you need to specify only two settings:
Origin Settings:
Origin Domain Name –When you click in this box, a menu appears with AWS resources you can choose. Choose the S3 bucket you created in Step 1, or paste the endpoint URL you copied in Step 1. In this example, the endpoint is example-com.s3-website-us-east-1.amazonaws.com.
Distribution Settings:
Alternate Domain Names (CNAMEs) – Type the root domain (for this walkthrough, it is example.com).
Click Create Distribution.
Wait for the CloudFront distribution to deploy completely before proceeding to Step 3. After CloudFront creates your distribution, the value of the Status column for your distribution will change from InProgress to Deployed. The distribution is then ready to process requests.
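Instead of watching the Status column, you can wait for deployment programmatically. This small boto3 sketch uses the built-in CloudFront waiter; the distribution ID is a placeholder.

```python
# Block until the CloudFront distribution's status changes to Deployed.
# The distribution ID is a placeholder.
import boto3

cloudfront = boto3.client("cloudfront")
waiter = cloudfront.get_waiter("distribution_deployed")
waiter.wait(Id="E1EXAMPLE")  # returns once the distribution is fully deployed
print("Distribution deployed")
```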
Step 3: Configure an alias resource record set in your hosted zone
In this step, you use Route 53 to configure an alias resource record set for your zone apex that resolves to the CloudFront distribution you made in Step 2:
On the Hosted zones page, choose your domain. This takes you to the Record sets page.
Click Create Record Set.
Leave the Name box blank and choose Alias: Yes.
Click the Alias Target box, and choose the CloudFront distribution you created in Step 2. If the distribution does not appear in the list automatically, you can copy and paste the name exactly as it appears in the CloudFront console.
Click Create.
Step 4: Validate that the redirect is working
To confirm that you have correctly configured all components of this solution and your zone apex is redirecting to the www domain as expected, open a browser and navigate to your zone apex. In this walkthrough, the zone apex is http://example.com and it should redirect automatically to http://www.example.com.
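You can also verify the redirect programmatically by requesting the zone apex without following redirects and checking the response. The following Python sketch uses the requests library; replace example.com with your own domain.

```python
# Confirm that the zone apex returns a redirect pointing at the www host name.
import requests

response = requests.get("http://example.com", allow_redirects=False)
print(response.status_code)          # expect 301 from the S3 website redirect
print(response.headers["Location"])  # expect http://www.example.com/
```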
Summary
In this post, I showed how you can help protect your web application against DDoS attacks by using Route 53 to perform a secure redirect to your externally hosted CDN distribution. This helps protect your origin from being exposed to direct DDoS attacks.
If you have comments about this blog post, submit them in the “Comments” section below. If you have questions about implementing the solution in this blog post, start a new thread in the Route 53 forum.
– Shawn