A new whitepaper is available that summarizes the results of tests by Foregenix comparing the network-layer threat detection capabilities of Amazon GuardDuty with those of traditional network intrusion detection systems (IDS). GuardDuty is a cloud-centric IDS service that uses Amazon Web Services (AWS) data sources to detect a broad range of threat behaviors. Security engineers need to understand how Amazon GuardDuty compares to traditional solutions for network threat detection. Assessors have also asked for clarity on the effectiveness of GuardDuty for meeting compliance requirements, like Payment Card Industry (PCI) Data Security Standard (DSS) requirement 11.4, which requires intrusion detection techniques to be implemented at critical points within a network.
A traditional IDS typically relies on monitoring network traffic at specific network traffic control points, like firewalls and host network interfaces. This allows the IDS to use a set of preconfigured rules to examine incoming data packet information and identify patterns that closely align with network attack types. Traditional IDS have several challenges in the cloud:
Networks are virtualized. Data traffic control points are decentralized and traffic flow management is a shared responsibility with the cloud provider. This makes it difficult or impossible to monitor all network traffic for analysis.
Cloud applications are dynamic. Features like auto-scaling and load balancing continuously change how a network environment is configured as demand fluctuates.
Most traditional IDS require experienced technicians to maintain their effective operation and avoid the common issue of receiving an overwhelming number of false positive findings. As a compliance assessor, I have often seen IDS intentionally de-tuned to address the false positive finding reporting issue when expert, continuous support isn’t available.
GuardDuty analyzes tens of billions of events across multiple AWS data sources, such as AWS CloudTrail, Amazon Virtual Private Cloud (Amazon VPC) flow logs, and Amazon Route 53 DNS logs. This gives GuardDuty the ability to analyze event data, such as AWS API calls and AWS Identity and Access Management (IAM) login events, which is beyond the capabilities of traditional IDS solutions. Monitoring AWS API calls from CloudTrail also enables threat detection for AWS serverless services, further setting it apart from traditional IDS solutions. However, without inspection of packet contents, the question remained, “Is GuardDuty truly effective in detecting network level attacks that more traditional IDS solutions were specifically designed to detect?”
AWS asked Foregenix to conduct a test comparing GuardDuty to market-leading IDS to help answer this question for us. AWS didn’t specify the attacks or architecture to be implemented within the test. It was left up to the independent tester to determine both the threat space covered by market-leading IDS and how to construct a test of the threat detection capabilities of GuardDuty and traditional IDS solutions, including open-source and commercial IDS.
Foregenix configured a lab environment to support tests that used extensive and complex attack playbooks. The lab environment simulated a real-world deployment composed of a web server, a bastion host, and an internal server used for centralized event logging. The environment was left running under normal operating conditions for more than 45 days. This allowed all tested solutions to build up a baseline of normal data traffic patterns prior to the anomaly detection testing exercises that followed this activity.
Foregenix determined that GuardDuty is at least as effective at detecting network level attacks as other market-leading IDS. They found GuardDuty simple to deploy and found that it required no specialized skills to configure the service to function effectively. Also, with its inherent capability of analyzing DNS requests, VPC flow logs, and CloudTrail events, they concluded that GuardDuty was able to effectively identify threats that other IDS could not natively detect and could only detect in the test environment after extensive manual customization. Foregenix recommended that adding a host-based IDS agent on Amazon Elastic Compute Cloud (Amazon EC2) instances would provide an enhanced level of threat defense when coupled with Amazon GuardDuty.
As a PCI Qualified Security Assessor (QSA) company, Foregenix states that they consider GuardDuty as a qualifying network intrusion technique for meeting PCI DSS requirement 11.4. This is important for AWS customers whose applications must maintain PCI DSS compliance. Customers should be aware that individual PCI QSAs might have different interpretations of the requirement, and should discuss this with their assessor before a PCI assessment.
Customer PCI QSAs can also speak with AWS Security Assurance Services, an AWS Professional Services team of PCI QSAs, to obtain more information on how customers can leverage AWS services to help them maintain PCI DSS Compliance. Customers can request Security Assurance Services support through their AWS Account Manager, Solutions Architect, or other AWS support.
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon GuardDuty forum or contact AWS Support.
Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.
Our mission in AWS Security Assurance Services is to ease Payment Card Industry Data Security Standard (PCI DSS) compliance for all Amazon Web Services (AWS) customers. We work closely with the AWS audit team to answer customer questions about understanding their compliance, finding and implementing solutions, and optimizing their controls and assessments. The most frequent and foundational questions have been compiled to create the Payment Card Industry Data Security Standard (PCI DSS) 3.2.1 on AWS Compliance Guide. The guide is an overview of concepts and principles for building PCI DSS compliant applications. Each section is thoroughly referenced to source AWS documentation to meet PCI DSS reporting requirements.
The guide helps customers who are developing payment applications, compliance teams that are preparing to manage assessments of cloud applications, internal assessment teams, and PCI Qualified Security Assessors (QSA) supporting customers who use AWS.
What’s in the guide?
The objective of the guide is to provide customers with the information they need to plan for and document the PCI DSS compliance of their AWS workloads.
The guide includes:
What AWS PCI DSS Level 1 Service Provider status means for customers
Assessment scoping of AWS applications
Required diagrams for assessments
Requirement-by-requirement guidance
The guide is most useful for people who are developing solutions on AWS, but it also will help Qualified Security Assessors (QSAs), internal security assessors (ISAs), and internal audit teams better understand the assessment of cloud applications. It provides examples of the diagrams required for assessments and includes links to AWS source documentation to support assessment evidence requirements.
Compliance at cloud scale
More customers than ever are running PCI DSS compliant workloads on AWS, with thousands of compliant applications. New security and governance tools available from AWS and the AWS Partner Network (APN) enable building business-as-usual compliance and automated security tasks so you can shift your focus to scaling and innovating your business.
If you have questions or want to learn more, contact your account executive, or submit comments in the Comments section below.
Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.
Amazon Web Services (AWS) continues to expand the scope of our PCI compliance program to support our customers’ most important workloads. We are pleased to announce that six services have been added to the scope of our Payment Card Industry Data Security Standard (PCI DSS) compliance program. These services were validated by Coalfire, our independent Qualified Security Assessor (QSA).
The Spring 2020 PCI DSS attestation of compliance covers 124 services that you can use to securely architect your Cardholder Data Environment (CDE) in AWS. You can see the full list of services on the AWS Services in Scope by Compliance Program page. The six newly added services are:
The compliance reports, including the Spring 2020 PCI DSS report, are available on demand through AWS Artifact. The PCI DSS package available in AWS Artifact includes the DSS v. 3.2.1 Attestation of Compliance (AOC) and Shared Responsibility Guide.
You can learn more about our PCI program and other compliance and security programs on the AWS Compliance Programs page.
If you have feedback about this post, submit comments in the Comments section below.
Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.
We are pleased to announce that Amazon Web Services (AWS) has achieved its first PCI 3-D Secure (3DS) certification. Financial institutions and payment providers are implementing EMV® 3-D Secure services to support application-based authentication, integration with digital wallets, and browser-based e-commerce transactions. Although AWS doesn’t perform 3DS functions directly, the AWS PCI 3DS attestation of compliance enables customers to attain their own PCI 3DS compliance for their services running on AWS.
All AWS regions in scope for PCI DSS were included in the 3DS attestation. AWS was assessed by Coalfire, an independent Qualified Security Assessor (QSA).
AWS compliance reports, including this latest PCI 3DS attestation, are available on demand through AWS Artifact. The 3DS package available in AWS Artifact includes the 3DS Attestation of Compliance (AOC) and Shared Responsibility Guide.
To learn more about our PCI program and other compliance and security programs, please visit the AWS Compliance Programs page.
We value your feedback and questions. Feel free to reach out to the team through the Contact Us page. If you have feedback about this post, submit comments in the Comments section below.
Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.
We’re pleased to announce that seven services have been added to the scope of our Payment Card Industry Data Security Standard (PCI DSS) certification, providing our customers more options to process and store their payment card data and architect their Cardholder Data Environment (CDE) securely in AWS.
In the past year we have increased the scope of our PCI DSS certification by 19%. With these additions, you can now select from a total of 118 PCI-certified services. You can see the full list on our Services in Scope by Compliance program page. The seven new services are:
We were evaluated by third-party auditors from Coalfire and their report is available through AWS Artifact.
To learn more about our PCI program and other compliance and security programs, see the AWS Compliance Programs page. As always, we value your feedback and questions; reach out to the team through the Contact Us page.
Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.
At AWS Security, continuously raising the cloud security bar for our customers is central to all that we do. Part of that work is focused on our formal compliance certifications, which enable our customers to use the AWS cloud for highly sensitive and/or regulated workloads. We see our customers constantly developing creative and innovative solutions—and in order for them to continue to do so, we need to increase the availability of services within our certifications. I’m pleased to tell you that in the past year, we’ve increased our Payment Card Industry – Data Security Standard (PCI DSS) certification scope by 79%, from 62 services to 111 services, including 12 newly added services in our latest PCI report (listed below), and we were audited by our third-party auditor, Coalfire.
The PCI DSS report and certification cover the 111 services currently in scope that are used by our customers to architect a secure Cardholder Data Environment (CDE) to protect important workloads. The full list of PCI DSS certified AWS services is available on our Services in Scope by Compliance program page. The 12 newly added services for our Spring 2019 report are:
Logging account activity within your AWS infrastructure is paramount to your security posture and could even be required by compliance standards such as the Payment Card Industry Data Security Standard (PCI DSS). Organizations often analyze these logs to adapt to changes and respond quickly to security events. For example, if users are reporting that their resources are unable to communicate with the public internet, it would be beneficial to know if a network access control list (ACL) had been changed just prior to the incident. Many of our customers ship AWS CloudTrail event logs to an Amazon Elasticsearch Service cluster for this type of analysis. However, security best practices and compliance standards could require additional considerations. Common concerns include how to analyze log data without the data leaving the security constraints of your private VPC.
In this post, I’ll show you not only how to store your logs, but how to put them to work to help you meet your compliance goals. This implementation deploys an Amazon Elasticsearch Service domain with Amazon Virtual Private Cloud (Amazon VPC) support by utilizing VPC endpoints. A VPC endpoint enables you to privately connect your VPC to Amazon Elasticsearch without requiring an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection. An AWS Lambda function is used to ship AWS CloudTrail event logs to the Elasticsearch cluster. A separate AWS Lambda function performs scheduled queries on log sets to look for patterns of concern. Amazon Simple Notification Service (SNS) generates automated reports based on a sample set of PCI guidelines discussed further in this post and notifies stakeholders when specific events occur. Kibana serves as the command center, providing visualizations of CloudTrail events that need to be logged based on the provided sample set of PCI-DSS compliance guidelines. The automated report and dashboard that are constructed around the sample PCI-DSS guidelines assist in event awareness regarding your security posture and should not be viewed as a de facto means of achieving certification. This solution serves as an additional tool to provide visibility into the actions and events within your environment. Deployment is made simple with a provided AWS CloudFormation template.
Figure 1: Architectural diagram
The figure above depicts the architecture discussed in this post. An Elasticsearch cluster with VPC support is deployed within an AWS Region and Availability Zone. This creates a VPC endpoint in a private subnet within a VPC. Kibana is an Elasticsearch plugin that resides within the Elasticsearch cluster; it is accessed through an endpoint provided in the Outputs section of the CloudFormation template. CloudTrail is enabled in the VPC and ships CloudTrail events to both an S3 bucket and a CloudWatch Log Group. The CloudWatch Log Group triggers a custom Lambda function that ships the CloudTrail event logs to the Elasticsearch domain through the VPC endpoint. An additional Lambda function performs a periodic set of Elasticsearch queries and produces a report that is sent to an SNS topic. A Windows-based EC2 instance is deployed in a public subnet so users will have the ability to view and interact with a Kibana dashboard. Access to the EC2 instance can be restricted to an allowed CIDR range through a parameter set in the CloudFormation deployment. Access to the Elasticsearch cluster and Kibana is restricted to a security group that is created and associated with the EC2 instance and custom Lambda functions.
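The CloudFormation template provisions this shipping function for you; the following is only a minimal sketch, in Python, of what such a Lambda handler might look like. It assumes the Elasticsearch domain accepts unsigned HTTPS requests from within the allowed security group, and the ES_ENDPOINT environment variable and index name are hypothetical.

import base64
import gzip
import json
import os
import urllib.request

ES_ENDPOINT = os.environ["ES_ENDPOINT"]  # for example, vpc-xxxx.us-east-1.es.amazonaws.com

def handler(event, context):
    # CloudWatch Logs delivers the payload base64-encoded and gzip-compressed.
    payload = base64.b64decode(event["awslogs"]["data"])
    log_data = json.loads(gzip.decompress(payload))

    for log_event in log_data["logEvents"]:
        # Each log event message is a CloudTrail record in JSON form.
        doc = log_event["message"].encode("utf-8")
        # Index into a cwl- prefixed index so the Kibana index pattern cwl-* picks it up.
        req = urllib.request.Request(
            url="https://{}/cwl-cloudtrail/_doc".format(ES_ENDPOINT),
            data=doc,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        urllib.request.urlopen(req)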
Sample PCI-DSS Guidelines
This solution provides a sample set of 10 PCI-DSS guidelines for events that need to be logged.
All Commands, API action taken by AWS root user
All failed logins at the AWS platform level
Action related to RDS (configuration changes)
Action related to enabling/disabling/changing of CloudTrail, CloudWatch logs
All access to S3 bucket that stores the AWS logs
Action related to VPCs (creation, deletion and changes)
Action related to changes to SGs/NACLs (creation, deletion and changes)
Action related to IAM users, roles, and groups (creation, deletion and changes)
Action related to route tables (creation, deletion and changes)
Action related to subnets (creation, deletion and changes)
Solution overview
In this walkthrough, you’ll create an Elasticsearch cluster within an Amazon VPC environment. You’ll ship AWS CloudTrail logs to both an Amazon S3 Bucket (to maintain an immutable copy of the logs) and to a custom AWS Lambda function that will stream the logs to the Elasticsearch cluster. You’ll also create an additional Lambda function that will run once a day and build a report of the number of CloudTrail events that occurred based on the example set of 10 PCI-DSS guidelines and then notify stakeholders via SNS. Here’s what you’ll need for this solution:
An S3 bucket to upload and store the sample AWS Lambda code and sample Kibana dashboards. This bucket name will be requested during the CloudFormation template deployment.
Email address for reporting to notify stakeholders via SNS. You must accept the subscription by selecting the link sent to this address before alerts will arrive.
It takes 30-45 minutes for this stack to be created. When it’s complete, the CloudFormation console will display the following resource values in the Outputs tab. These values can be referenced at any time and will be needed in the following sections.
ElasticsearchDomainEndpoint – Elasticsearch domain endpoint hostname
KibanaEndpoint – Kibana endpoint hostname
EC2Instance – Windows EC2 instance name used for Kibana access
SNSSubscriber – SNS subscriber email address
ElasticsearchDomainArn – ARN of the Elasticsearch domain
EC2InstancePublicIp – Public IP address of the Windows EC2 instance
Managing and testing the solution
Now that you’ve set up the environment, it’s time to configure the Kibana dashboard.
Kibana configuration
From the AWS CloudFormation output, gather information related to the Windows-based EC2 instance. Once you have retrieved that information, move on to the next steps.
Initial configuration and index pattern
Log into the Windows EC2 instance via Remote Desktop Protocol (RDP) from a resource that is within the allowed CIDR range for remote access to the instance.
Open a browser window and navigate to the Kibana endpoint hostname URL from the output of the AWS CloudFormation stack. Access to the Elasticsearch cluster and Kibana is restricted to the security group that is associated with the EC2 instance and custom Lambda functions during deployment.
In the Kibana dashboard, select Management from the left panel and choose the link for Index Patterns.
Add one index pattern containing the following: cwl-*
Figure 2: Define the index pattern
Select Next Step.
Select the Time Filter Field named @timestamp.
Figure 3: Select “@timestamp”
Select Create index pattern.
At this point we’ve launched our environment and have accessed the Kibana console. Within the Kibana console, we’ve configured the index pattern for the CloudWatch logs that will contain the CloudTrail events. Next, we’ll configure visualizations and a dashboard.
Importing sample PCI DSS queries and Kibana dashboard
Copy export.json from the location where you extracted the downloaded zip file to the EC2 Kibana bastion host.
Select Management on the left panel and choose the link for Saved Objects.
Select Import in the upper right corner and navigate to export.json.
Select Yes, overwrite all saved objects, then select the Index Pattern cwl-* and confirm all changes.
Once the import completes, select PCI DSS Dashboard to see the sample dashboard and queries.
Note: You might encounter an error during the import that looks like this:
Figure 4: Error message
This simply means that your streamed logs do not have login-type events in the time period since your deployment. To correct this, you can index a placeholder document that contains the missing field.
From the left panel, select Dev Tools and copy the following JSON into the left panel of the console:
POST /cwl-/default/
{
"userIdentity": {
"userName": "test"
}
}
Select the green Play triangle to execute the POST of a document with the missing field.
At this point, you should have CloudTrail events that have been streamed to the Elasticsearch cluster, with a configured Kibana dashboard that looks similar to the following graphic:
Figure 6: A configured Kibana dashboard
Automated Reports
A custom AWS Lambda function was created during the deployment of the Amazon CloudFormation stack. This function uses the sample PCI-DSS guidelines from the Kibana dashboard to build a daily report. The Lambda function is triggered every 24 hours and performs a series of Elasticsearch time-based queries of now-1day (the last 24 hours) on the sample guidelines. The results are compiled into a message that is forwarded to Amazon Simple Notification Service (SNS), which sends a report to stakeholders based on the email address you provided in the CloudFormation deployment.
The Lambda function is named <CloudFormation Stack Name>-ES-Query-LambdaFunction. Environment variables let you adjust settings such as the query time window, and you can extend the code with additional Elasticsearch queries. The sample report below lets you monitor events against the sample PCI-DSS guidelines (a minimal sketch of such a query appears after the report). These reports can then be further analyzed in the Kibana dashboard.
Logging Compliance Report - Wednesday, 11. July 2018 01:06PM
Violations for time period: 'now-1d'
All Failed login attempts
- No Alerts Found
All Commands, API action taken by AWS root user
- No Alerts Found
Action related to RDS (configuration changes)
- No Alerts Found
Action related to enabling/disabling/changing of CloudTrail CloudWatch logs
- 3 API calls indicating alteration of log sources detected
All access to S3 bucket that stores the AWS logs
- No Alerts Found
Action related to VPCs (creation, deletion and changes)
- No Alerts Found
Action related to changes to SGs/NACLs (creation, deletion and changes)
- No Alerts Found
Action related to changes to IAM roles, users, and groups (creation, deletion and changes)
- 2 API calls indicating creation, alteration or deletion of IAM roles, users, and groups
Action related to changes to Route Tables (creation, deletion and changes)
- No Alerts Found
Action related to changes to Subnets (creation, deletion and changes)
- No Alerts Found
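For reference, the following is a minimal sketch of the kind of time-bounded query the reporting function runs for one of the guidelines; it is not the stack's actual code, and the ES_ENDPOINT and SNS_TOPIC_ARN environment variables, the index pattern, and the field names are assumptions.

import json
import os
import urllib.request
import boto3

ES_ENDPOINT = os.environ["ES_ENDPOINT"]
SNS_TOPIC_ARN = os.environ["SNS_TOPIC_ARN"]

def handler(event, context):
    # Count root-user API activity in the last 24 hours (sample guideline 1).
    query = {
        "query": {
            "bool": {
                "must": [{"term": {"userIdentity.type.keyword": "Root"}}],
                "filter": [{"range": {"@timestamp": {"gte": "now-1d"}}}],
            }
        }
    }
    req = urllib.request.Request(
        url="https://{}/cwl-*/_count".format(ES_ENDPOINT),
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    count = json.loads(urllib.request.urlopen(req).read())["count"]

    # Forward the result to stakeholders through the SNS topic.
    report = "Violations for time period: 'now-1d'\nAll Commands, API action taken by AWS root user\n- {} events found".format(count)
    boto3.client("sns").publish(
        TopicArn=SNS_TOPIC_ARN,
        Subject="Logging Compliance Report",
        Message=report,
    )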
Summary
At this point, you have now created a private Elasticsearch cluster with Kibana dashboards that monitors AWS CloudTrail events on a sample set of PCI-DSS guidelines and uses Amazon SNS to send a daily report providing awareness into your environment—all isolated securely within a VPC. In addition to CloudTrail events streaming to the Elasticsearch cluster, events are also shipped to an Amazon S3 bucket to maintain an immutable source of your log files. The provided Lambda functions can be further modified to add additional or more complex search queries and to create more customized reports for your organization. With minimal effort, you could begin sending additional log data from your instances or containers to gain even more insight as to the security state of your environment. The more data you retain, the more visibility you have into your resources and the closer you are to achieving Compliance-on-Demand.
Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.
Our security culture is one of the things that sets AWS apart. Security is job zero — it is the foundation for all AWS employees and impacts the work we do every day, across the company. And that’s reflected in our services, which undergo exacting internal and external security reviews before being released. From there, we have historically waited for customer demand to begin the complex process of third-party assessment and validating services under specific compliance programs. However, we’ve heard you tell us you want every generally available (GA) service in scope to keep up with the pace of your innovation and at the same time, meet rigorous compliance and regulatory requirements.
I wanted to share how we’re meeting this challenge with a more proactive approach to service certification by certifying services at launch. For the first time, we’ve launched new GA services with PCI DSS, ISO 9001/27001/27017/27018, SOC 2, and HIPAA eligibility. That means customers who rely on or require these compliance programs can select from 10 brand new services right away, without having to wait for one or more trailing audit cycles.
This proactive compliance approach means we move upstream in the product development process. Over the last several months, we’ve made significant process improvements to deliver additional services with compliance certifications and HIPAA eligibility. Our security, compliance, and service teams have partnered in new ways to implement controls and audit earlier in a service’s development phase to demonstrate operating effectiveness. We also integrated auditing mechanisms into multiple stages of the launch process, enabling our security and compliance teams, as well as auditors, to assess controls throughout a service’s preview period. Additionally, we increased our audit frequency to meet services’ GA deadlines.
The work reflects a meaningful shift in our business. We’re excited to get these services into your hands sooner and wanted to report our overall progress. We also ask for your continued feedback since it drives our decisions and prioritization. Because going forward, we’ll continue to iterate and innovate until all of our services are certified at launch.
We continue to expand the scope of our assurance programs to support your most important workloads. I’m pleased to tell you that eight services have been added to the scope of our Payment Card Industry Data Security Standard (PCI DSS) certification. With these additions, you can now select from a total of 62 PCI-compliant services. You can see the full list on our Services in Scope by Compliance program page. The eight newly added services are:
We were evaluated by third-party auditors from Coalfire and their report is available on-demand through AWS Artifact. When you go to AWS Artifact, you’ll find something new. We’ve made the full Responsibility Summary, listing each requirement and control, available in a spreadsheet. This includes a breakdown of the shared responsibility for each control – yours and ours – with a mapping to our services. We hope this new format makes it easier to evaluate and use the information from the audit.
To learn more about our PCI program and other compliance and security programs, please go to the AWS Compliance Programs page. As always, we value your feedback and questions, reach out to the team through the Contact Us page.
The adoption of Apache Spark has increased significantly over the past few years, and running Spark-based application pipelines is the new normal. Spark jobs that are in an ETL (extract, transform, and load) pipeline have different requirements—you must handle dependencies in the jobs, maintain order during executions, and run multiple jobs in parallel. In most of these cases, you can use workflow scheduler tools like Apache Oozie, Apache Airflow, and even Cron to fulfill these requirements.
Apache Oozie is a widely used workflow scheduler system for Hadoop-based jobs. However, its limited UI capabilities, lack of integration with other services, and heavy XML dependency might not be suitable for some users. On the other hand, Apache Airflow comes with a lot of neat features, along with powerful UI and monitoring capabilities and integration with several AWS and third-party services. However, with Airflow, you do need to provision and manage the Airflow server. The Cron utility is a powerful job scheduler. But it doesn’t give you much visibility into the job details, and creating a workflow using Cron jobs can be challenging.
What if you have a simple use case, in which you want to run a few Spark jobs in a specific order, but you don’t want to spend time orchestrating those jobs or maintaining a separate application? You can do that today in a serverless fashion using AWS Step Functions. You can create the entire workflow in AWS Step Functions and interact with Spark on Amazon EMR through Apache Livy.
In this post, I walk you through a list of steps to orchestrate a serverless Spark-based ETL pipeline using AWS Step Functions and Apache Livy.
Input data
For the source data for this post, I use the New York City Taxi and Limousine Commission (TLC) trip record data. For a description of the data, see this detailed dictionary of the taxi data. In this example, we’ll work mainly with the following three columns for the Spark jobs.
RateCodeID – Represents the rate code in effect at the end of the trip (for example, 1 for standard rate, 2 for JFK airport, 3 for Newark airport, and so on).
FareAmount – Represents the time-and-distance fare calculated by the meter.
TripDistance – Represents the elapsed trip distance in miles reported by the taxi meter.
The trip data is in comma-separated values (CSV) format with the first row as a header. To shorten the Spark execution time, I trimmed the large input data to only 20,000 rows. During the deployment phase, the input file tripdata.csv is stored in Amazon S3 in the <<your-bucket>>/emr-step-functions/input/ folder.
The following image shows a sample of the trip data:
Solution overview
The next few sections describe how Spark jobs are created for this solution, how you can interact with Spark using Apache Livy, and how you can use AWS Step Functions to create orchestrations for these Spark applications.
At a high level, the solution includes the following steps:
Trigger the AWS Step Function state machine by passing the input file path.
The first stage in the state machine triggers an AWS Lambda function.
The Lambda function interacts with Apache Spark running on Amazon EMR using Apache Livy, and submits a Spark job.
The state machine waits a few seconds before checking the Spark job status.
Based on the job status, the state machine moves to the success or failure state.
Subsequent Spark jobs are submitted using the same approach.
The state machine waits a few seconds for the job to finish.
The job finishes, and the state machine updates with its final status.
Let’s take a look at the Spark application that is used for this solution.
Spark jobs
For this example, I built a Spark jar named spark-taxi.jar. It has two different Spark applications:
MilesPerRateCode – The first job that runs on the Amazon EMR cluster. This job reads the trip data from an input source and computes the total trip distance for each rate code. The output of this job consists of two columns and is stored in Apache Parquet format in the output path (a PySpark sketch of this logic appears after the job descriptions).
The following are the expected output columns:
rate_code – Represents the rate code for the trip.
total_distance – Represents the total trip distance for that rate code (for example, sum(trip_distance)).
RateCodeStatus – The second job that runs on the EMR cluster, but only if the first job finishes successfully. This job depends on two different input sets:
tripdata.csv – The same trip data that is used for the first Spark job.
miles-per-rate – The output of the first job.
This job first reads the tripdata.csv file and aggregates the fare_amount by the rate_code. After this point, you have two different datasets, both aggregated by rate_code. Finally, the job uses the rate_code field to join the two datasets and output the entire rate code status in a single CSV file.
The output columns are as follows:
rate_code_id – Represents the rate code type.
total_distance – Derived from first Spark job and represents the total trip distance.
total_fare_amount – A new field that is generated during the second Spark application, representing the total fare amount by the rate code type.
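The code packaged in spark-taxi.jar isn’t reproduced in this post. As an illustration only, here is a rough PySpark sketch of the MilesPerRateCode logic described above; the bucket name is hypothetical, and the column names come from the trip data described earlier.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("MilesPerRateCode").getOrCreate()
root_path = "s3://your-bucket"  # supplied to the real job as the rootPath argument

# Read the trip data and total the trip distance for each rate code.
trips = spark.read.csv(root_path + "/emr-step-functions/input/tripdata.csv", header=True)
miles_per_rate = (
    trips.groupBy(F.col("RateCodeID").alias("rate_code"))
         .agg(F.sum(F.col("TripDistance").cast("double")).alias("total_distance"))
)

# Store the two-column result in Parquet format under the output path.
miles_per_rate.write.parquet(root_path + "/emr-step-functions/miles-per-rate")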
Note that in this case, you don’t need to run two different Spark jobs to generate that output. The goal of setting up the jobs in this way is just to create a dependency between the two jobs and use them within AWS Step Functions.
Both Spark applications take one input argument called rootPath. It’s the S3 location where the Spark job is stored along with input and output data. Here is a sample of the final output:
The next section discusses how you can use Apache Livy to interact with Spark applications that are running on Amazon EMR.
Using Apache Livy to interact with Apache Spark
Apache Livy provides a REST interface to interact with Spark running on an EMR cluster. Livy is included in Amazon EMR release version 5.9.0 and later. In this post, I use Livy to submit Spark jobs and retrieve job status. When Amazon EMR is launched with Livy installed, the EMR master node becomes the endpoint for Livy, and it starts listening on port 8998 by default. Livy provides APIs to interact with Spark.
Let’s look at a couple of examples of how you can interact with Spark running on Amazon EMR using Livy.
To list active running jobs, you can execute the following from the EMR master node:
curl localhost:8998/sessions
If you want to do the same from a remote instance, change localhost to the EMR master node’s hostname (port 8998 must be open to that remote instance through the security group).
Through spark-submit, you can pass multiple arguments for the Spark job and Spark configuration settings. You can also do that using Livy, by passing the S3 path of the job through the args parameter of a POST /batches request, as sketched below.
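Here is a minimal sketch, using the Python requests library, of what both calls might look like; the EMR master hostname, S3 bucket, and Spark class name are hypothetical placeholders.

import json
import requests

LIVY_URL = "http://ec2-11-22-33-44.compute-1.amazonaws.com:8998"  # EMR master node

# Equivalent of running `curl localhost:8998/sessions`, but from a remote instance.
print(requests.get(LIVY_URL + "/sessions").json())

# Submit a Spark job as a Livy batch, passing the S3 root path through "args".
payload = {
    "file": "s3://your-bucket/emr-step-functions/spark-taxi.jar",
    "className": "MilesPerRateCode",
    "args": ["s3://your-bucket"],
}
response = requests.post(
    LIVY_URL + "/batches",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
print(response.json()["id"])  # Livy returns the batch ID, used later to poll status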
For a detailed list of Livy APIs, see the Apache Livy REST API page. This post uses GET /batches and POST /batches.
In the next section, you create a state machine and orchestrate Spark applications using AWS Step Functions.
Using AWS Step Functions to create a Spark job workflow
AWS Step Functions automatically triggers and tracks each step and retries when it encounters errors. So your application executes in order and as expected every time. To create a Spark job workflow using AWS Step Functions, you first create a state machine, using different types of states to build the entire workflow.
First, you use the Task state—a simple state in AWS Step Functions that performs a single unit of work. You also use the Wait state to delay the state machine from continuing for a specified time. Later, you use the Choice state to add branching logic to a state machine.
The following is a quick summary of how to use different states in the state machine to create the Spark ETL pipeline:
Task state – Invokes a Lambda function. The first Task state submits the Spark job on Amazon EMR, and the next Task state is used to retrieve the previous Spark job status.
Wait state – Pauses the state machine until a job completes execution.
Choice state – Each Spark job execution can return a failure, an error, or a success state. So, in the state machine, you use the Choice state to create a rule that specifies the next action or step based on the success or failure of the previous step.
Here is one of my Task states, MilesPerRateCode, which simply submits a Spark job:
"MilesPerRate Job": {
"Type": "Task",
"Resource":"arn:aws:lambda:us-east-1:xxxxxx:function:blog-miles-per-rate-job-submit-function",
"ResultPath": "$.jobId",
"Next": "Wait for MilesPerRate job to complete"
}
This Task state configuration specifies the Lambda function to execute. Inside the Lambda function, it submits a Spark job through Livy using Livy’s POST API. Using ResultPath, it tells the state machine where to place the result of the executing task. As discussed in the previous section, the Livy batch submission returns a session ID, which is captured with $.jobId and used in a later state.
The following code section shows the Lambda function, which is used to submit the MilesPerRateCode job. It uses the Python requests library to submit a POST against the Livy endpoint hosted on Amazon EMR and passes the required parameters in JSON format through the payload. It then parses the response, grabs the id from the response, and returns it. The Next field tells the state machine which state to go to next.
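The function below is a minimal sketch consistent with that description rather than the post’s exact code; the EMR_MASTER_DNS environment variable, the class name, and the jar path are assumptions.

import json
import os
import requests

def lambda_handler(event, context):
    livy_url = "http://{}:8998/batches".format(os.environ["EMR_MASTER_DNS"])
    root_path = event["rootPath"]

    payload = {
        "file": root_path + "/emr-step-functions/spark-taxi.jar",
        "className": "MilesPerRateCode",  # hypothetical class name
        "args": [root_path],
    }
    response = requests.post(
        livy_url,
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"},
    )
    # Return the Livy batch ID; ResultPath places it at $.jobId for later states.
    return response.json()["id"]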
Just like in the MilesPerRate job, another state submits the RateCodeStatus job, but it executes only when all previous jobs have completed successfully.
Here is the Task state in the state machine that checks the Spark job status:
Just like other states, the preceding Task executes a Lambda function, captures the result (represented by jobStatus), and passes it to the next state. The following is the Lambda function that checks the Spark job status based on a given session ID:
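A minimal sketch of such a status-check function follows; it assumes the session ID arrives in the event under jobId (matching the ResultPath shown earlier) and that the EMR_MASTER_DNS environment variable holds the Livy endpoint.

import os
import requests

def lambda_handler(event, context):
    livy_url = "http://{}:8998/batches/{}".format(os.environ["EMR_MASTER_DNS"], event["jobId"])
    # Livy reports the batch state, for example: starting, running, success, or dead.
    return requests.get(livy_url).json()["state"]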
The Choice state checks the Spark job status value, compares it with a predefined status, and transitions based on the result. For example, if the status is success, it moves to the next state (the RateCodeStatus Job state), and if the status is dead, it moves to the MilesPerRate job failed state.
To set up this entire solution, you need to create a few AWS resources. To make it easier, I have created an AWS CloudFormation template. This template creates all the required AWS resources and configures all the resources that are needed to create a Spark-based ETL pipeline on AWS Step Functions.
This CloudFormation template requires you to pass the following four parameters during initiation.
ClusterSubnetID – The subnet where the Amazon EMR cluster is deployed; Lambda is configured to talk to this subnet.
KeyName – The name of the existing EC2 key pair used to access the Amazon EMR cluster.
VPCID – The ID of the virtual private cloud (VPC) where the EMR cluster is deployed; Lambda is configured to talk to this VPC.
S3RootPath – The Amazon S3 path where all required files (input file, Spark job, and so on) are stored and where the resulting data is written.
IMPORTANT: These templates are designed only to show how you can create a Spark-based ETL pipeline on AWS Step Functions using Apache Livy. They are not intended for production use without modification. And if you try this solution outside of the us-east-1 Region, download the necessary files from s3://aws-data-analytics-blog/emr-step-functions, upload the files to the buckets in your Region, edit the script as appropriate, and then run it.
To launch the CloudFormation stack, choose Launch Stack:
Launching this stack creates the following list of AWS resources.
StepFunctionsStateExecutionRole (IAM role) – IAM role to execute the state machine; has a trust relationship with the states service.
SparkETLStateMachine (AWS Step Functions state machine) – State machine in AWS Step Functions for the Spark ETL workflow.
LambdaSecurityGroup (Amazon EC2 security group) – Security group used by the Lambda functions to call the Livy API.
RateCodeStatusJobSubmitFunction (AWS Lambda function) – Lambda function to submit the RateCodeStatus job.
MilesPerRateJobSubmitFunction (AWS Lambda function) – Lambda function to submit the MilesPerRate job.
SparkJobStatusFunction (AWS Lambda function) – Lambda function to check the Spark job status.
LambdaStateMachineRole (IAM role) – IAM role for all Lambda functions, using the Lambda trust relationship.
EMRCluster (Amazon EMR cluster) – EMR cluster where Livy is running and where jobs are placed.
During the deployment phase, the AWS CloudFormation template sets up S3 paths for input and output. Input files are stored in the <<s3-root-path>>/emr-step-functions/input/ path, whereas spark-taxi.jar is copied under <<s3-root-path>>/emr-step-functions/.
The following screenshot shows how the S3 paths are configured after deployment. In this example, I passed a bucket that I created in the AWS account, s3://tm-app-demos, as the S3 root path.
If the CloudFormation template completed successfully, you will see Spark-ETL-State-Machine in the AWS Step Functions dashboard, as follows:
Choose the Spark-ETL-State-Machine state machine to take a look at this implementation. The AWS CloudFormation template built the entire state machine along with its dependent Lambda functions, which are now ready to be executed.
On the dashboard, choose the newly created state machine, and then choose New execution to initiate the state machine. It asks you to pass input in JSON format. This input goes to the first state MilesPerRate Job, which eventually executes the Lambda function blog-miles-per-rate-job-submit-function.
Pass the S3 root path as input:
{
  "rootPath": "s3://tm-app-demos"
}
Then choose Start Execution:
The rootPath value is the same value that was passed when creating the CloudFormation stack. It can be an S3 bucket location or a bucket with prefixes, but it should be the same value that is used for AWS CloudFormation. This value tells the state machine where it can find the Spark jar and input file, and where it will write output files. After the state machine starts, each state/task is executed based on its definition in the state machine.
At a high level, the following represents the flow of events:
Execute the first Spark job, MilesPerRate.
The Spark job reads the input file from the location <<rootPath>>/emr-step-functions/input/tripdata.csv. If the job finishes successfully, it writes the output data to <<rootPath>>/emr-step-functions/miles-per-rate.
If the Spark job fails, it transitions to the error state MilesPerRate job failed, and the state machine stops. If the Spark job finishes successfully, it transitions to the RateCodeStatus Job state, and the second Spark job is executed.
If the second Spark job fails, it transitions to the error state RateCodeStatus job failed, and the state machine stops with the Failed status.
If this Spark job completes successfully, it writes the final output data to <<rootPath>>/emr-step-functions/rate-code-status/. It also transitions to the RateCodeStatus job finished state, and the state machine ends its execution with the Success status.
The following screenshot shows a successfully completed Spark ETL state machine:
The right side of the state machine diagram shows the details of individual states with their input and output.
When you execute the state machine for the second time, it fails because the S3 path already exists. The state machine turns red and stops at MilesPerRate job failed. The following image represents that failed execution of the state machine:
You can also check your Spark application status and logs by going to the Amazon EMR console and viewing the Application history tab:
I hope this walkthrough paints a picture of how you can create a serverless solution for orchestrating Spark jobs on Amazon EMR using AWS Step Functions and Apache Livy. In the next section, I share some ideas for making this solution even more elegant.
Next steps
The goal of this post is to show a simple example that uses AWS Step Functions to create an orchestration for Spark-based jobs in a serverless fashion. To make this solution robust and production ready, you can explore the following options:
In this example, I manually initiated the state machine by passing the rootPath as input. You can instead trigger the state machine automatically. To run the ETL pipeline as soon as the files arrive in your S3 bucket, you can pass the new file path to the state machine. Because CloudWatch Events supports AWS Step Functions as a target, you can create a CloudWatch rule for an S3 event. You can then set AWS Step Functions as a target and pass the new file path to your state machine. You’re all set!
You can also improve this solution by adding an alerting mechanism in case of failures. To do this, create a Lambda function that sends an alert email and assign that Lambda function to a Fail state. That way, when any part of your state machine fails, it triggers an email and notifies the user.
If you want to submit multiple Spark jobs in parallel, you can use the Parallel state type in AWS Step Functions. The Parallel state is used to create parallel branches of execution in your state machine.
With Lambda and AWS Step Functions, you can create a very robust serverless orchestration for your big data workload.
Cleaning up
When you’ve finished testing this solution, remember to clean up all those AWS resources that you created using AWS CloudFormation. Use the AWS CloudFormation console or AWS CLI to delete the stack named Blog-Spark-ETL-Step-Functions.
Summary
In this post, I showed you how to use AWS Step Functions to orchestrate your Spark jobs that are running on Amazon EMR. You used Apache Livy to submit jobs to Spark from a Lambda function and created a workflow for your Spark jobs, maintaining a specific order for job execution and triggering different AWS events based on your job’s outcome. Go ahead—give this solution a try, and share your experience with us!
Tanzir Musabbir is an EMR Specialist Solutions Architect with AWS. He is an early adopter of open source Big Data technologies. At AWS, he works with our customers to provide them architectural guidance for running analytics solutions on Amazon EMR, Amazon Athena & AWS Glue. Tanzir is a big Real Madrid fan and he loves to travel in his free time.
In a multi-account environment where you require connectivity between accounts, and perhaps connectivity between cloud and on-premises workloads, the demand for a robust Domain Name Service (DNS) that’s capable of name resolution across all connected environments will be high.
The most common solution is to implement local DNS in each account and use conditional forwarders for DNS resolutions outside of this account. While this solution might be efficient for a single-account environment, it becomes complex in a multi-account environment.
In this post, I will provide a solution to implement central DNS for multiple accounts. This solution reduces the number of DNS servers and forwarders needed to implement cross-account domain resolution. I will show you how to configure this solution in four steps:
Set up your Central DNS account.
Set up each participating account.
Create Route53 associations.
Configure on-premises DNS (if applicable).
Solution overview
In this solution, you use AWS Directory Service for Microsoft Active Directory (AWS Managed Microsoft AD) as a DNS service in a dedicated account in a Virtual Private Cloud (DNS-VPC).
The DNS service included in AWS Managed Microsoft AD uses conditional forwarders to forward domain resolution to either Amazon Route 53 (for domains in the awscloud.com zone) or to on-premises DNS servers (for domains in the example.com zone). You’ll use AWS Managed Microsoft AD as the primary DNS server for other application accounts in the multi-account environment (participating accounts).
A participating account is any application account that hosts a VPC and uses the centralized AWS Managed Microsoft AD as the primary DNS server for that VPC. Each participating account has a private hosted zone with a unique zone name to represent this account (for example, business_unit.awscloud.com).
You associate the DNS-VPC with the unique hosted zone in each of the participating accounts. This allows AWS Managed Microsoft AD to use Route 53 to resolve all registered domains in private hosted zones in participating accounts.
The following diagram shows how the various services work together:
Figure 1: Diagram showing the relationship between all the various services
In this diagram, all VPCs in participating accounts use Dynamic Host Configuration Protocol (DHCP) option sets. The option sets configure EC2 instances to use the centralized AWS Managed Microsoft AD in DNS-VPC as their default DNS Server. You also configure AWS Managed Microsoft AD to use conditional forwarders to send domain queries to Route53 or on-premises DNS servers based on query zone. For domain resolution across accounts to work, we associate DNS-VPC with each hosted zone in participating accounts.
If, for example, server.pa1.awscloud.com needs to resolve addresses in the pa3.awscloud.com domain, the sequence shown in the following diagram happens:
Figure 2: How domain resolution across accounts works
1.1: server.pa1.awscloud.com sends a domain name lookup for server.pa3.awscloud.com to its default DNS server. The request is forwarded to the DNS server defined in the DHCP option set (AWS Managed Microsoft AD in DNS-VPC).
1.2: AWS Managed Microsoft AD forwards name resolution to Route53 because it’s in the awscloud.com zone.
1.3: Route53 resolves the name to the IP address of server.pa3.awscloud.com because DNS-VPC is associated with the private hosted zone pa3.awscloud.com.
Similarly, if server.example.com needs to resolve server.pa3.awscloud.com, the following happens:
2.1: server.example.com sends a domain name lookup for server.pa3.awscloud.com to the on-premises DNS server.
2.2: The on-premises DNS server, using a conditional forwarder, forwards the domain lookup to AWS Managed Microsoft AD in DNS-VPC.
1.2: AWS Managed Microsoft AD forwards name resolution to Route53 because it’s in the awscloud.com zone.
1.3: Route53 resolves the name to the IP address of server.pa3.awscloud.com because DNS-VPC is associated with the private hosted zone pa3.awscloud.com.
Step 1: Set up a centralized DNS account
In previous AWS Security Blog posts, Drew Dennis covered a couple of options for establishing DNS resolution between on-premises networks and Amazon VPC. In this post, he showed how you can use AWS Managed Microsoft AD (provisioned with AWS Directory Service) to provide DNS resolution with forwarding capabilities.
To set up a centralized DNS account, you can follow the same steps in Drew’s post to create AWS Managed Microsoft AD and configure the forwarders to send DNS queries for awscloud.com to the default, VPC-provided DNS and to forward example.com queries to the on-premises DNS server.
Here are a few considerations while setting up central DNS:
The VPC that hosts AWS Managed Microsoft AD (DNS-VPC) will be associated with all private hosted zones in participating accounts.
To be able to resolve domain names across AWS and on-premises, connectivity through Direct Connect or VPN must be in place.
Step 2: Set up participating accounts
The steps I suggest in this section should be applied individually in each application account that’s participating in central DNS resolution.
Create the VPC(s) that will host your resources in each participating account.
Create VPC Peering between local VPC(s) in each participating account and DNS-VPC.
Create a private hosted zone in Route 53. Hosted zone domain names must be unique across all accounts. In the diagram above, we used pa1.awscloud.com / pa2.awscloud.com / pa3.awscloud.com. You could also use a combination of environment and business unit: for example, you could use pa1.dev.awscloud.com to achieve uniqueness.
Associate VPC(s) in each participating account with the local private hosted zone.
The next step is to change the default DNS servers on each VPC using DHCP option set:
Follow these steps to create a new DHCP options set. Make sure to put the private IP addresses of the two AWS Managed Microsoft AD servers that were created in DNS-VPC in the DNS Servers field:
Figure 3: The “Create DHCP options set” dialog box
Follow these steps to assign the DHCP options set to your VPC(s) in each participating account.
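If you prefer to script these two steps, the following is a minimal sketch using boto3; the DNS server IP addresses and VPC ID are hypothetical placeholders for the values in your environment.

import boto3

ec2 = boto3.client("ec2")

# Create a DHCP options set that points DNS at the two AWS Managed Microsoft AD servers in DNS-VPC.
options = ec2.create_dhcp_options(
    DhcpConfigurations=[
        {"Key": "domain-name-servers", "Values": ["10.0.0.10", "10.0.0.11"]},
    ]
)

# Attach the options set to the participating account's VPC.
ec2.associate_dhcp_options(
    DhcpOptionsId=options["DhcpOptions"]["DhcpOptionsId"],
    VpcId="vpc-0participant0example",
)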
Step 3: Associate DNS-VPC with private hosted zones in each participating account
The next steps will associate DNS-VPC with the private hosted zone in each participating account. This allows instances in DNS-VPC to resolve domain records created in these hosted zones. If you need them, here are more details on associating a private hosted zone with a VPC in a different account.
In each participating account, create the authorization using the private hosted zone ID from the previous step, the region, and the VPC ID that you want to associate (DNS-VPC).
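As a minimal sketch with hypothetical IDs, the two boto3 calls involved look like this; the authorization is created with credentials for the participating account, and the association is completed with credentials for the central DNS account.

import boto3

hosted_zone_id = "Z1EXAMPLE"  # private hosted zone in the participating account
dns_vpc = {"VPCRegion": "us-east-1", "VPCId": "vpc-0dns0example"}  # DNS-VPC

# Run with credentials for the participating account:
boto3.client("route53").create_vpc_association_authorization(
    HostedZoneId=hosted_zone_id, VPC=dns_vpc
)

# Run with credentials for the central DNS account:
boto3.client("route53").associate_vpc_with_hosted_zone(
    HostedZoneId=hosted_zone_id, VPC=dns_vpc
)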
After completing these steps, AWS Managed Microsoft AD in the centralized DNS account should be able to resolve domain records in the private hosted zone in each participating account.
Step 4: Set up on-premises DNS servers
This step is necessary if you would like to resolve AWS private domains from on-premises servers. The task comes down to configuring on-premises forwarders to send DNS queries for all domains in the awscloud.com zone to AWS Managed Microsoft AD in DNS-VPC.
The steps to implement conditional forwarders vary by DNS product. Follow your product’s documentation to complete this configuration.
Summary
I introduced a simplified solution to implement central DNS resolution in a multi-account environment that can also be extended to support DNS resolution between on-premises resources and AWS. This can help reduce operations effort and the number of resources needed to implement cross-account domain resolution.
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Directory Service forum or contact AWS Support.
Want more AWS Security news? Follow us on Twitter.
This article, pointed out by @TheGrugq, is stupid enough that it’s worth rebutting.
“The views and opinions expressed are those of the author and not necessarily the positions of the U.S. Army, Department of Defense, or the U.S. Government.” <- I sincerely hope so… “the cyber guns of August” https://t.co/xdybbr5B0E
The article starts with the question “Why did the lessons of Stuxnet, Wannacry, Heartbleed and Shamoon go unheeded?” It then proceeds to ignore the lessons of those things.
Some of the actual lessons should be things like how Stuxnet crossed air gaps, how Wannacry spread through flat Windows networking, how Heartbleed comes from technical debt, and how Shamoon furthers state aims by causing damage.
But this article doesn’t cover the technical lessons. Instead, it thinks the lesson should be the moral lesson, that we should take these things more seriously. But that’s stupid. It’s the sort of lesson taught by people who know nothing about the topic. When you have nothing of value to contribute to a topic, you can always take the moral high road and criticize everyone for being morally weak for not taking it more seriously. Obviously, since doctors haven’t cured cancer yet, it’s because they don’t take the problem seriously.
The article continues to ignore the lesson of these cyber attacks and instead regales us with a list of military lessons from WW I and WW II. This makes the same mistake that many in the military make: trying to understand cyber through analogies with the real world. It’s not that such lessons could have no value, it’s that this article contains a poor list of them. It seems to consist of a random list of events that appeal to the author rather than events that have bearing on cybersecurity.
Then, in case we don’t get the point, the article bullies us with hyperbole, cliches, buzzwords, bombastic language, famous quotes, and citations. It’s hard to see how most of them actually apply to the text. Rather, it seems like they are included simply because he really really likes them.
The article invests much effort in discussing the buzzword “OODA loop”. Most attacks in cyberspace don’t have one. Instead, attackers flail around, trying lots of random things, overcoming defense with brute-force rather than an understanding of what’s going on. That’s obviously the case with Wannacry: it was an accident, with the perpetrator experimenting with what would happen if they added the ETERNALBLUE exploit to their existing ransomware code. The consequence was beyond anybody’s ability to predict.
You might claim that this is just the first stage, that they’ll loop around, observe Wannacry’s effects, orient themselves, decide, then act upon what they learned. Nope. Wannacry burned the exploit. It’s essentially removed any vulnerable systems from the public Internet, thereby making it impossible to use what they learned. It’s still active a year later, with infected systems behind firewalls busily scanning the Internet so that if you put a new system online that’s vulnerable, it’ll be taken offline within a few hours, before any other evildoer can take advantage of it.
See what I’m doing here? Learning the actual lessons of things like Wannacry? The thing the above article fails to do??
The article has a humorous paragraph on “defense in depth”, misunderstanding the term. To be fair, it’s the cybersecurity industry’s fault: they adopted then redefined the term. That’s why there’s two separate articles on Wikipedia: one for the old military term (as used in this article) and one for the new cybersecurity term.
As used in the cybersecurity industry, “defense in depth” means having multiple layers of security. Many organizations put all their defensive efforts on the perimeter, and none inside a network. The idea of “defense in depth” is to put more defenses inside the network. For example, instead of just one firewall at the edge of the network, put firewalls inside the network to segment different subnetworks from each other, so that a ransomware infection in the customer support computers doesn’t spread to sales and marketing computers.
The article talks about exploiting WiFi chips to bypass defense in depth measures like browser sandboxes. This conflates different types of attacks. A WiFi attack is usually considered a local attack, from somebody next to you in a bar, rather than a remote attack from a server in Russia. Moreover, far from disproving “defense in depth”, such WiFi attacks highlight the need for it. Namely, phones need to be designed so that successful exploitation of other microprocessors (namely, the WiFi, Bluetooth, and cellular baseband chips) can’t directly compromise the host system. In other words, once exploited with “Broadpwn”, a hacker would need to extend the exploit chain with another vulnerability in the host’s Broadcom WiFi driver rather than immediately mounting a DMA attack across PCIe. This suggests that if PCIe is used to interface with peripherals in the phone, an IOMMU should be used, for “defense in depth”.
Cybersecurity is a young field. There are lots of useful things that outsider non-techies can teach us. Lessons from military history would be well-received.
But that’s not this story. Instead, this story is by an outsider telling us we don’t know what we are doing, claiming that they do, and then proceeding to prove that they don’t. Their argument is based on moral suasion and on bullying us with what appears on the surface to be intellectual rigor but is in fact devoid of anything smart.
My fear, here, is that I’m going to be in a meeting where somebody has read this pretentious garbage, explaining to me why “defense in depth” is wrong and how we need to OODA faster. I’d rather nip this in the bud, pointing out that if you found anything interesting in that article, you are wrong.
Amazon Simple Notification Service (SNS) now supports VPC Endpoints (VPCE) via AWS PrivateLink. You can use VPC Endpoints to privately publish messages to SNS topics, from an Amazon Virtual Private Cloud (VPC), without traversing the public internet. When you use AWS PrivateLink, you don’t need to set up an Internet Gateway (IGW), Network Address Translation (NAT) device, or Virtual Private Network (VPN) connection. You don’t need to use public IP addresses, either.
Using VPC Endpoints requires no code changes and can bring additional security to pub/sub messaging use cases that rely on SNS. VPC Endpoints help promote data privacy and are aligned with assurance programs, including the Health Insurance Portability and Accountability Act (HIPAA), FedRAMP, and others discussed below.
VPC Endpoints for SNS in action
Here’s how VPC Endpoints for SNS works. The following example is based on a banking system that processes mortgage applications. This banking system, which has been deployed to a VPC, publishes each mortgage application to an SNS topic. The SNS topic then fans out the mortgage application message to two subscribing AWS Lambda functions:
Save-Mortgage-Application stores the application in an Amazon DynamoDB table. As the mortgage application contains personally identifiable information (PII), the message must not traverse the public internet.
Save-Credit-Report checks the applicant’s credit history against an external Credit Reporting Agency (CRA), then stores the final credit report in an Amazon S3 bucket.
The following diagram depicts the underlying architecture for this banking system:
To protect applicants’ data, the financial institution responsible for developing this banking system needed a mechanism to prevent PII data from traversing the internet when publishing mortgage applications from their VPC to the SNS topic. Therefore, they created a VPC endpoint to enable their publisher Amazon EC2 instance to privately connect to the SNS API. As shown in the diagram, when the VPC endpoint is created, an Elastic Network Interface (ENI) is automatically placed in the same VPC subnet as the publisher EC2 instance. This ENI exposes a private IP address that is used as the entry point for traffic destined to SNS. This ensures that traffic between the VPC and SNS doesn’t leave the Amazon network.
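For reference, the publisher’s call itself is unchanged by the endpoint; once the endpoint exists, the same request simply resolves to the ENI’s private IP address. A minimal sketch, with a hypothetical topic ARN:

# Publish a mortgage application message to the SNS topic (topic ARN is hypothetical).
# The same call works whether traffic flows over the VPC endpoint or the public internet.
aws sns publish \
--topic-arn "arn:aws:sns:us-east-1:123456789012:mortgage-application-topic" \
--message file://mortgage-application.json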
Set up VPC Endpoints for SNS
The process for creating a VPC endpoint to privately connect to SNS doesn’t require code changes: access the VPC Management Console, navigate to the Endpoints section, and create a new Endpoint. Three attributes are required:
The SNS service name.
The VPC and Availability Zones (AZs) from which you’ll publish your messages.
The Security Group (SG) to be associated with the endpoint network interface. The Security Group controls the traffic to the endpoint network interface from resources in your VPC. If you don’t specify a Security Group, the default Security Group for your VPC will be associated.
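As a rough sketch, the same endpoint can also be created from the AWS CLI by supplying those three attributes (all IDs below are placeholders):

# Create an interface VPC endpoint for SNS in us-east-1 (IDs are hypothetical).
aws ec2 create-vpc-endpoint \
--vpc-endpoint-type Interface \
--vpc-id "vpc-0123456789abcdef0" \
--service-name "com.amazonaws.us-east-1.sns" \
--subnet-ids "subnet-0123456789abcdef0" \
--security-group-ids "sg-0123456789abcdef0"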
The SNS API is served through HTTP Secure (HTTPS), and encrypts all messages in transit with Transport Layer Security (TLS) certificates issued by Amazon Trust Services (ATS). The certificates verify the identity of the SNS API server when encrypted connections are established. The certificates help establish proof that your SNS API client (SDK, CLI) is communicating securely with the SNS API server. A Certificate Authority (CA) issues the certificate to a specific domain. Hence, when a domain presents a certificate that’s issued by a trusted CA, the SNS API client knows it’s safe to make the connection.
Summary
VPC Endpoints can increase the security of your pub/sub messaging use cases by allowing you to publish messages to SNS topics, from instances in your VPC, without traversing the internet. Setting up VPC Endpoints for SNS doesn’t require any code changes because the SNS API address remains the same.
VPC Endpoints for SNS is now available in all AWS Regions where AWS PrivateLink is available. For information on pricing and regional availability, visit the VPC pricing page. For more information and on-boarding, see Publishing to Amazon SNS Topics from Amazon Virtual Private Cloud in the SNS documentation.
David Howells recently published the latest version of his kernel lockdown patchset. This is intended to strengthen the boundary between root and the kernel by imposing additional restrictions that prevent root from modifying the kernel at runtime. It’s not the first feature of this sort – /dev/mem no longer allows you to overwrite arbitrary kernel memory, and you can configure the kernel so only signed modules can be loaded. But the present state of things is that these security features can be easily circumvented (by using kexec to modify the kernel security policy, for instance).
Why do you want lockdown? If you’ve got a setup where you know that your system is booting a trustworthy kernel (you’re running a system that does cryptographic verification of its boot chain, or you built and installed the kernel yourself, for instance) then you can trust the kernel to keep secrets safe from even root. But if root is able to modify the running kernel, that guarantee goes away. As a result, it makes sense to extend the security policy from the boot environment up to the running kernel – it’s really just an extension of configuring the kernel to require signed modules.
The patchset itself isn’t hugely conceptually controversial, although there’s disagreement over the precise form of certain restrictions. But one patch has proven controversial, because it ties whether or not lockdown is enabled to whether or not UEFI Secure Boot is enabled. There’s some backstory that’s important here.
Most kernel features get turned on or off by either build-time configuration or by passing arguments to the kernel at boot time. There’s two ways that this patchset allows a bootloader to tell the kernel to enable lockdown mode – it can either pass the lockdown argument on the kernel command line, or it can set the secure_boot flag in the bootparams structure that’s passed to the kernel. If you’re running in an environment where you’re able to verify the kernel before booting it (either through cryptographic validation of the kernel, or knowing that there’s a secret tied to the TPM that will prevent the system booting if the kernel’s been tampered with), you can turn on lockdown.
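As a minimal sketch of the first mechanism on a distribution that uses GRUB (the argument name here follows the description above and may differ between versions of the patchset):

# /etc/default/grub -- add the lockdown argument described above to the kernel command line
GRUB_CMDLINE_LINUX="lockdown"

# Regenerate the GRUB configuration (Debian/Ubuntu; use grub2-mkconfig on Fedora/RHEL)
sudo update-grub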
There’s a catch on UEFI systems, though – you can build the kernel so that it looks like an EFI executable, and then run it directly from the firmware. The firmware doesn’t know about Linux, so can’t populate the bootparam structure, and there’s no mechanism to enforce command lines so we can’t rely on that either. The controversial patch simply adds a kernel configuration option that automatically enables lockdown when UEFI secure boot is enabled and otherwise leaves it up to the user to choose whether or not to turn it on.
Why do we want lockdown enabled when booting via UEFI secure boot? UEFI secure boot is designed to prevent the booting of any bootloaders that the owner of the system doesn’t consider trustworthy[1]. But a bootloader is only software – the only thing that distinguishes it from, say, Firefox is that Firefox is running in user mode and has no direct access to the hardware. The kernel does have direct access to the hardware, and so there’s no meaningful distinction between what grub can do and what the kernel can do. If you can run arbitrary code in the kernel then you can use the kernel to boot anything you want, which defeats the point of UEFI Secure Boot. Linux distributions don’t want their kernels to be used as part of an attack chain against other distributions or operating systems, so they enable lockdown (or equivalent functionality) for kernels booted this way.
So why not enable it everywhere? There’s a couple of reasons. The first is that some of the features may break things people need – for instance, some strange embedded apps communicate with PCI devices by mmap()ing resources directly from sysfs[2]. This is blocked by lockdown, which would break them. Distributions would then have to ship an additional kernel that had lockdown disabled (it’s not possible to just have a command line argument that disables it, because an attacker could simply pass that), and users would have to disable secure boot to boot that anyway. It’s easier to just tie the two together.
The second is that it presents a promise of security that isn’t really there if your system didn’t verify the kernel. If an attacker can replace your bootloader or kernel then the ability to modify your kernel at runtime is less interesting – they can just wait for the next reboot. Appearing to give users safety assurances that are much less strong than they seem to be isn’t good for keeping users safe.
So, what about people whose work is impacted by lockdown? Right now there’s two ways to get stuff blocked by lockdown unblocked: either disable secure boot[3] (which will disable it until you enable secure boot again) or press alt-sysrq-x (which will disable it until the next boot). Discussion has suggested that having an additional secure variable that disables lockdown without disabling secure boot validation might be helpful, and it’s not difficult to implement that so it’ll probably happen.
Overall: the patchset isn’t controversial, just the way it’s integrated with UEFI secure boot. The reason it’s integrated with UEFI secure boot is because that’s the policy most distributions want, since the alternative is to enable it everywhere even when it doesn’t provide real benefits but does provide additional support overhead. You can use it even if you’re not using UEFI secure boot. We should have just called it securelevel.
[1] Of course, if the owner of a system isn’t allowed to make that determination themselves, the same technology is restricting the freedom of the user. This is abhorrent, and sadly it’s the default situation in many devices outside the PC ecosystem – most of them not using UEFI. But almost any security solution that aims to prevent malicious software from running can also be used to prevent any software from running, and the problem here is the people unwilling to provide that policy to users rather than the security features.
[2] This is how X.org used to work until the advent of kernel modesetting.
[3] If your vendor doesn’t provide a firmware option for this, run sudo mokutil --disable-validation
With the advent of AWS PrivateLink, you can provide services to AWS customers directly in their Virtual Private Networks by offering cross-account SaaS solutions on private IP addresses rather than over the Internet.
Traffic that flows to the services you provide does so over private AWS networking rather than over the Internet, offering security and performance enhancements, as well as convenience. PrivateLink can tie in with the AWS Marketplace, facilitating billing and providing a straightforward consumption model to your customers.
The use cases are myriad, but, for this blog post, we’ll demonstrate a fictional order-processing resource. The resource accepts JSON data over a RESTful API, simulating an interface. This could easily be an existing application being considered for a PrivateLink-based consumption model. Consumers of this resource send JSON payloads representing new orders and the system responds with order IDs corresponding to newly-created orders in the system. In a real-world scenario, additional APIs, such as authentication, might also represent critical aspects of the system. This example will not demonstrate these additional APIs because they could be consumed over PrivateLink in a similar fashion to the API constructed in the example.
I’ll demonstrate how to expose the resource on a private IP address in a customer’s VPC. I’ll also explain an architecture leveraging PrivateLink and provide detailed instructions for how to set up such a service. Finally, I’ll provide an example of how a customer might consume such a service. I’ll focus not only on how to architect the solution, but also the considerations that drive architectural choices.
Solution Overview
N.B.: Only two subnets and Availability Zones are shown per VPC for simplicity. Resources must cover all Availability Zones per Region, so that the application is available to all consumers in the region. The instructions in this post, which pertain to resources sitting in us-east-1, will detail the deployment of subnets in all six Availability Zones for this region.
This solution exposes an application’s HTTP-based API over PrivateLink in a provider’s AWS account. The application is a stateless web server running on Amazon Elastic Compute Cloud (EC2) instances. The provider places instances within a Virtual Private Cloud (VPC) consisting of one private subnet per Availability Zone (AZ). Instances populate each subnet inside Auto Scaling Groups (ASGs), which maintain a desired count per subnet. There is one ASG per subnet to ensure that the service is available in each AZ. An internal Network Load Balancer (NLB) sits in front of the entire fleet of application instances, and an endpoint service is connected with the NLB.
In the customer’s AWS account, they create an endpoint that consumes the endpoint service from the provider’s account. The endpoint exposes an Elastic Network Interface (ENI) in each subnet the customer desires. Each ENI is assigned an IP address within the CIDR block associated with the subnet, for any number of subnets in any number of AZs within the region, for each customer.
PrivateLink facilitates cross-account access to services so the customer can use the provider’s service, feeding it data that exist within the customer’s account while using application logic and systems that run in the provider’s account. The routing between accounts is over private networking rather than over the Internet.
Though this example shows a simple, stateless service running on EC2 and sitting behind an NLB, many kinds of AWS services can be exposed through PrivateLink and can serve as pathways into a provider’s application, such as Amazon Kinesis Streams, Amazon EC2 Container Service, Amazon EC2 Systems Manager, and more.
Using PrivateLink to Establish a Service for Consumption
Building a service to be consumed through PrivateLink involves a few steps:
Build a VPC covering all AZs in region with private subnets
Create a NLB, listener, and target group for instances
Create a launch configuration and ASGs to manage the deployment of Amazon EC2 instances in each subnet
Launch an endpoint service and connect it with the NLB
Tie endpoint-request approval with billing systems or the AWS Marketplace
Provide the endpoint service in multiple regions
Step 1: Build a VPC and private subnets
Start by determining the network you will need to serve the application. Keep in mind that you will need to serve the application out of each AZ within any region you choose. Customers will expect to consume your service in multiple AZs because AWS recommends they architect their own applications to span across AZs for fault-tolerance purposes.
Additionally, anything less than full coverage across all AZs in a single region will not facilitate straightforward consumption of your service because AWS does not guarantee that a single AZ will carry the same name across accounts. In fact, AWS randomizes AZ names across accounts to ensure even distribution of independent workloads. Telling customers, for example, that you provide a service in us-east-1a may not give them sufficient information to connect with your service.
The solution is to serve your application in all AZs within a region because this guarantees that no matter what AZs a customer chooses for endpoint creation, that customer is guaranteed to find a running instance of your application with which to connect.
You can lay the foundations for doing this by creating a subnet in each AZ within the region of your choice. The subnets can be private because the service, exposed via PrivateLink, will not provide any publicly routable APIs.
This example uses the us-east-1 region. If you use a different region, the number of AZs may vary, which will change the number of subnets required, and thus the size of the IP address range for your VPC may require adjustments.
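As a rough sketch (the exact values are illustrative), a /25 block provides the 128 addresses discussed below:

# Create the provider VPC with a /25 CIDR block (128 addresses starting at 10.3.0.0).
aws ec2 create-vpc \
--cidr-block "10.3.0.0/25"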
The example above creates a VPC with 128 IP addresses starting at 10.3.0.0. Each subnet will contain 16 IP addresses, using a total of 96 addresses in the space. Allocating a sufficient block of addresses requires some planning (though you can make adjustments later if needed). I’d suggest an equally-sized address space in each subnet because the provided service should embody the same performance, availability, and functionality regardless of which AZ your customers choose. Each subnet will need a sufficient address space to accommodate the number of instances you run within it. Additionally, you will need enough space to allow for one IP address per subnet to assign to that subnet’s NLB node’s Elastic Network Interface (ENI).
In this simple example, 16 IP addresses per subnet are enough because we will configure ASGs to maintain two instances each and the NLB requires one ENI. Each subnet reserves five IP addresses for internal purposes, for a total of eight IP addresses needed in each subnet to support the service.
Next, create the private subnets for each Availability Zone. The following demonstrates the creation of the first subnet, which sits in the us-east-1a AZ:
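Something like the following, assuming the VPC ID used elsewhere in this example and a /28 block (16 addresses):

# Private subnet in us-east-1a with 16 IP addresses.
aws ec2 create-subnet \
--vpc-id "vpc-2cadab54" \
--cidr-block "10.3.0.0/28" \
--availability-zone "us-east-1a"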
Repeat this step for each remaining AZ. If using the us-east-1 region, you will need to create private subnets in all AZs as follows:
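For illustration, the remaining subnets can carve consecutive /28 blocks out of the VPC’s range, one per AZ (the exact CIDR assignments here are an assumption of this sketch):

# One /28 subnet per remaining AZ (16 addresses each, 96 in total across all six subnets).
aws ec2 create-subnet --vpc-id "vpc-2cadab54" --cidr-block "10.3.0.16/28" --availability-zone "us-east-1b"
aws ec2 create-subnet --vpc-id "vpc-2cadab54" --cidr-block "10.3.0.32/28" --availability-zone "us-east-1c"
aws ec2 create-subnet --vpc-id "vpc-2cadab54" --cidr-block "10.3.0.48/28" --availability-zone "us-east-1d"
aws ec2 create-subnet --vpc-id "vpc-2cadab54" --cidr-block "10.3.0.64/28" --availability-zone "us-east-1e"
aws ec2 create-subnet --vpc-id "vpc-2cadab54" --cidr-block "10.3.0.80/28" --availability-zone "us-east-1f"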
For the purpose of this example, the subnets can leverage the default route table, as it contains a single rule for routing requests to private IP addresses in the VPC, as follows:
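You can verify this by describing the VPC’s route tables; the default table should contain only the local route for 10.3.0.0/25:

# Inspect the route tables of the VPC; the default table should show a single
# "local" route covering 10.3.0.0/25.
aws ec2 describe-route-tables \
--filters "Name=vpc-id,Values=vpc-2cadab54"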
In a real-world case, additional routing may be required. For example, you may need additional routes to support VPC peering to access dependencies in other VPCs, connectivity to on-premises resources over DirectConnect or VPN, Internet-accessible dependencies via NAT, or other scenarios.
Security Group Creation
Instances will need to be placed in a security group that allows traffic from the NLB nodes that sit in each subnet.
All instances running the service should be in a security group accepting TCP traffic on the traffic port from any other IP address in the VPC. This will allow the NLB to forward traffic to those instances because the NLB nodes sit in the VPC and are assigned IP addresses in the subnets. In this example, the order processing server running on each instance exposes a service on port 3000, so the security group rule covers this port.
Create a security group for instances:
aws ec2 create-security-group \
--group-name "service-sg" \
--description "Security group for service instances" \
--vpc-id "vpc-2cadab54"
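The call returns a security group ID; a follow-up rule (the group ID below is a placeholder) opens the traffic port to the VPC’s address range, as described above:

# Allow TCP traffic on the service port (3000) from any address in the VPC.
aws ec2 authorize-security-group-ingress \
--group-id "sg-0123456789abcdef0" \
--protocol tcp \
--port 3000 \
--cidr "10.3.0.0/25"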
Step 2: Create a Network Load Balancer, Listener, and Target Group
The service integrates with PrivateLink using an internal NLB which sits in front of instances that run the service.
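A sketch of the three resources follows. The names, subnet IDs, and ARNs are placeholders; in practice you would substitute the values returned by your own earlier commands:

# Internal NLB spanning the private subnets (subnet IDs are placeholders).
aws elbv2 create-load-balancer \
--name "service-nlb" \
--type network \
--scheme internal \
--subnets "subnet-aaaa1111" "subnet-bbbb2222" "subnet-cccc3333" "subnet-dddd4444" "subnet-eeee5555" "subnet-ffff6666"

# Target group for the order-processing service listening on port 3000.
aws elbv2 create-target-group \
--name "service-targets" \
--protocol TCP \
--port 3000 \
--vpc-id "vpc-2cadab54"

# Listener that forwards TCP traffic on port 80 to the target group.
aws elbv2 create-listener \
--load-balancer-arn "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/net/service-nlb/1234567890abcdef" \
--protocol TCP \
--port 80 \
--default-actions "Type=forward,TargetGroupArn=arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/service-targets/1234567890abcdef"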
Step 3: Create a Launch Configuration and Auto Scaling Groups
Each private subnet in the VPC will require its own ASG in order to ensure that there is always a minimum number of instances in each subnet.
A single ASG spanning all subnets will not guarantee that every subnet contains the appropriate number of instances. For example, while a single ASG could be configured to work across six subnets and maintain twelve instances, there is no guarantee that each of the six subnets will contain two instances. To guarantee the appropriate number of instances on a per-subnet basis, each subnet must be configured with its own ASG.
New instances should be automatically created within each ASG based on a single launch configuration. The launch configuration should be set up to use an existing Amazon Machine Image (AMI).
This post presupposes you have an AMI that can be used to create new instances that serve the application. There are only a few basic assumptions to how this image is configured:
1. The image contains a web server that serves traffic (in this case, on port 3000).
2. The image is configured to automatically launch the web server as a daemon when the instance starts.
Create a launch configuration for the ASGs, providing the AMI ID, the ID of the security group created in previous steps (above), a key for access, and an instance type:
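A hedged example, with the AMI ID, key pair name, security group ID, subnet ID, and target group ARN all as placeholders:

# Launch configuration shared by all of the ASGs.
aws autoscaling create-launch-configuration \
--launch-configuration-name "service-launch-configuration" \
--image-id "ami-0123456789abcdef0" \
--instance-type "t2.micro" \
--key-name "my-key-pair" \
--security-groups "sg-0123456789abcdef0"

# One ASG per subnet, each maintaining two instances and registering them
# with the NLB's target group.
aws autoscaling create-auto-scaling-group \
--auto-scaling-group-name "service-asg-us-east-1a" \
--launch-configuration-name "service-launch-configuration" \
--min-size 2 --max-size 2 --desired-capacity 2 \
--vpc-zone-identifier "subnet-aaaa1111" \
--target-group-arns "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/service-targets/1234567890abcdef"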
Repeat this process to create an ASG in each remaining subnet, using the same launch configuration and target group.
In this example, only two instances are created in each subnet. In a real-world scenario, additional instances would likely be recommended for both availability and scale. The ASGs use the provided launch configuration as a template for creating new instances.
When creating the ASGs, the ARN of the target group for the NLB is provided. This way, the ASGs automatically register newly-created instances with the target group so that the NLB can begin sending traffic to them.
Step 4: Launch an endpoint service and connect with NLB
Now, expose the service via PrivateLink with an endpoint service, providing the ARN of the NLB:
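For example (the NLB ARN is a placeholder):

# Create the endpoint service behind the NLB and require acceptance of
# each incoming endpoint connection request.
aws ec2 create-vpc-endpoint-service-configuration \
--network-load-balancer-arns "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/net/service-nlb/1234567890abcdef" \
--acceptance-required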
This endpoint service is configured to require acceptance. This means that new consumers who attempt to add endpoints that consume it will have to wait for the provider to allow access. This provides an opportunity to control access and integrate with billing systems that monetize the provided service.
Step 5: Tie endpoint request approval with billing system or the AWS Marketplace
If you’re maintaining your service as a private service, any account that is intended to have access must be whitelisted before it can find the endpoint service and create an endpoint to consume it.
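One way to whitelist a consumer account from the CLI (the service ID and account ID are placeholders):

# Allow a specific AWS account to discover and connect to the endpoint service.
aws ec2 modify-vpc-endpoint-service-permissions \
--service-id "vpce-svc-0123456789abcdef0" \
--add-allowed-principals "arn:aws:iam::111122223333:root"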
For more information on listing a PrivateLink service in the AWS Marketplace, see How to List Your Product in AWS Marketplace (https://aws.amazon.com/blogs/apn/how-to-list-your-product-in-aws-marketplace/).
Most production-ready services offered through PrivateLink will require acceptance of Endpoint requests before customers can consume them. Typically, some level of automation around processing approvals is helpful. PrivateLink can publish on a Simple Notification Service (SNS) topic when customers request approval.
Setting this up requires two steps:
1. Create a new SNS topic.
2. Create an endpoint connection notification that publishes to the SNS topic.
Each is discussed below.
Create an SNS Topic
First, create a new SNS Topic that can receive messages relating to endpoint service access requests:
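One way to do this from the CLI:

# Topic that will receive endpoint connection notifications.
aws sns create-topic \
--name "service-notification-topic"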
This creates a new topic with the name “service-notification-topic”. Endpoint request approval systems can subscribe to this Topic so that acceptance can be automated.
Next, create a new Endpoint Connection Notification, so that messages will be published to the topic when new Endpoints connect and need to have access requests approved:
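A sketch, with the service ID and topic ARN as placeholders:

# Publish a notification to the topic whenever an endpoint connects to the
# service or a connection is accepted.
aws ec2 create-vpc-endpoint-connection-notification \
--service-id "vpce-svc-0123456789abcdef0" \
--connection-notification-arn "arn:aws:sns:us-east-1:123456789012:service-notification-topic" \
--connection-events "Connect" "Accept"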
A billing system may ultimately tie in with request approval. This can also be done manually, which may be less useful, but is illustrative. As an example, assume that a customer account has already requested an endpoint to consume the service. The customer can be accepted manually, as follows:
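A sketch of that manual acceptance (the service and endpoint IDs are placeholders):

# Manually accept a pending endpoint connection from a consumer account.
aws ec2 accept-vpc-endpoint-connections \
--service-id "vpce-svc-0123456789abcdef0" \
--vpc-endpoint-ids "vpce-0fedcba9876543210"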
At this point, the consumer can begin consuming the service.
Step 6: Take the Service Across Regions
In distributing SaaS via PrivateLink, providers may have to think about how to make their services available in different regions, because Endpoint Services are only available within the region where they are created. Customers who attempt to consume Endpoint Services will not be able to create Endpoints across regions.
Rather than saddling consumers with the responsibility of making the jump across regions, we recommend providers work to make services available wherever their customers consume them. Providers are in a better position to adapt their architectures to multiple regions than customers, who do not know the internals of how providers have designed their services.
There are several architectural options that can support multi-region adaptation. Selection among them will depend on a number of factors, including read-to-write ratio, latency requirements, budget, amenability to re-architecture, and preference for simplicity.
Generally, the challenge in providing multi-region SaaS is in instantiating stateful components in multiple regions because the data on which such components depend are hard to replicate, synchronize, and access with low latency over large geographical distances.
Of all stateful components, perhaps the most frequently encountered will be databases. Some solutions for overcoming this challenge with respect to databases are as follows:
1. Provide a master in a single region; provide read replicas in every region.
2. Provide a master in every region; assign each tenant to one master only.
3. Create a full multi-master architecture; replicate data efficiently.
4. Rely on a managed service for replicating data cross-regionally (e.g., DynamoDB Global Tables).
Stateless components can be provisioned in multiple regions more easily. In this example, you will have to re-create all of the VPC resources—including subnets, Routing Tables, Security Groups, and Endpoint Services—as well as all EC2 resources—including instances, NLBs, Listeners, Target Groups, ASGs, and Launch Configurations—in each additional region. Because of the complexity in doing so, in addition to the significant need to keep regional configurations in-sync, you may wish to explore an orchestration tool such as CloudFormation, rather than the command line.
Regardless of what orchestration tooling you choose, you will need to copy your AMI to each region in which you wish to deploy it. Once available, you can build out your service in that region much as you did in the first one.
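For example (the AMI ID is a placeholder):

# Copy the service AMI from us-east-1 into us-east-2.
aws ec2 copy-image \
--source-region "us-east-1" \
--source-image-id "ami-0123456789abcdef0" \
--region "us-east-2" \
--name "service-ami"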
This copies the AMI used to provide the service in us-east-1 into us-east-2. Repeat the process for each region you wish to enable.
Consuming a Service via PrivateLink
To consume a service over PrivateLink, a customer must create a new Endpoint in their VPC within a Security Group that allows traffic on the traffic port.
Start by creating a Security Group to apply to a new Endpoint:
aws ec2 create-security-group \
--group-name "consumer-sg" \
--description "Security group for service consumers" \
--vpc-id "vpc-2cadab54"
Next, create an endpoint in the VPC, placing it in the Security Group:
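A sketch, with the service name, subnet ID, and security group ID as placeholders taken from the provider’s endpoint service and the consumer’s own VPC:

# Consumer-side interface endpoint for the provider's endpoint service.
aws ec2 create-vpc-endpoint \
--vpc-endpoint-type Interface \
--vpc-id "vpc-2cadab54" \
--service-name "com.amazonaws.vpce.us-east-1.vpce-svc-0123456789abcdef0" \
--subnet-ids "subnet-aaaa1111" \
--security-group-ids "sg-0123456789abcdef0"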
The response will include an attribute called VpcEndpoint.DnsEntries. The service can be accessed at any of the DNS names listed under those entries. Before the consumer can access the endpoint service, the provider has to accept the Endpoint.
Access Endpoint Via Custom DNS Names
When creating a new Endpoint, the consumer will receive named endpoint addresses in each AZ where the Endpoint is created, plus a named endpoint that is AZ-agnostic. For example:
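The entries below are purely hypothetical, but they illustrate the shape of the regional (AZ-agnostic) name and the per-AZ names:

# Hypothetical DNS entries returned for the endpoint: one regional name plus one per AZ.
vpce-0fedcba9876543210-abcd1234.vpce-svc-0123456789abcdef0.us-east-1.vpce.amazonaws.com
vpce-0fedcba9876543210-abcd1234-us-east-1a.vpce-svc-0123456789abcdef0.us-east-1.vpce.amazonaws.com
vpce-0fedcba9876543210-abcd1234-us-east-1b.vpce-svc-0123456789abcdef0.us-east-1.vpce.amazonaws.com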
The consumer can use Route53 to provide a custom DNS name for the service. This not only allows for using cleaner service names, but also enables the consumer to leverage the traffic management features of Route53, such as fail-over routing.
First, the consumer must enable DNS Hostnames and DNS Support on the VPC within which the Endpoint was created. The consumer should start by enabling DNS Hostnames:
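A sketch of both attribute changes (each attribute must be modified in its own call):

# Enable DNS hostnames and DNS support on the consumer VPC.
aws ec2 modify-vpc-attribute \
--vpc-id "vpc-2cadab54" \
--enable-dns-hostnames "{\"Value\":true}"

aws ec2 modify-vpc-attribute \
--vpc-id "vpc-2cadab54" \
--enable-dns-support "{\"Value\":true}"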
After the VPC is properly configured to work with Route53, the consumer should either select an existing hosted zone or create a new one. Assuming one has not already been created, the consumer should create one as follows:
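For example, using the domain implied by the service name used later in this post (the caller reference is arbitrary):

# Private hosted zone for the custom service name, associated with the consumer VPC.
aws route53 create-hosted-zone \
--name "endpoints.internal" \
--vpc "VPCRegion=us-east-1,VPCId=vpc-2cadab54" \
--caller-reference "order-processor-zone-001" \
--hosted-zone-config "Comment=order-processor,PrivateZone=true"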
In the request, the consumer specifies the DNS name, VPC ID, region, and flags the hosted zone as private. Additionally, the consumer must provide a “caller reference” which is a unique ID of the request that can be used to identify it in subsequent actions if the request fails.
Next, the consumer should create a JSON file corresponding to a batch of record change requests. In this file, the consumer can specify the name of the endpoint, as well as a CNAME pointing to the AZ-agnostic DNS name of the Endpoint:
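A sketch of such a change batch and the call that applies it; the vpce hostname and hosted zone ID are placeholders:

# change-batch.json -- CNAME from the custom name to the AZ-agnostic endpoint DNS name.
{
  "Changes": [
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "order-processor.endpoints.internal",
        "Type": "CNAME",
        "TTL": 300,
        "ResourceRecords": [
          { "Value": "vpce-0fedcba9876543210-abcd1234.vpce-svc-0123456789abcdef0.us-east-1.vpce.amazonaws.com" }
        ]
      }
    }
  ]
}

# Apply the change batch to the private hosted zone.
aws route53 change-resource-record-sets \
--hosted-zone-id "Z0123456789ABCDEFGHIJ" \
--change-batch file://change-batch.json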
At this point, the Endpoint can be consumed at http://order-processor.endpoints.internal.
Conclusion
AWS PrivateLink is an exciting way to expose SaaS services to customers. This article demonstrated how to expose an existing application on EC2 via PrivateLink in a customer’s VPC, as well as recommended architecture. Finally, it walked through the steps that a customer would have to go through to consume the service.
Today, our customers use AWS CloudHSM to meet corporate, contractual and regulatory compliance requirements for data security by using dedicated Hardware Security Module (HSM) instances within the AWS cloud. CloudHSM delivers all the benefits of traditional HSMs including secure generation, storage, and management of cryptographic keys used for data encryption that are controlled and accessible only by you.
As a managed service, it automates time-consuming administrative tasks such as hardware provisioning, software patching, high availability, backups and scaling for your sensitive and regulated workloads in a cost-effective manner. Backup and restore functionality is the core building block enabling scalability, reliability and high availability in CloudHSM.
You should consider using AWS CloudHSM if you require:
Keys stored in dedicated, third-party validated hardware security modules under your exclusive control
FIPS 140-2 compliance
Integration with applications using PKCS#11, Java JCE, or Microsoft CNG interfaces
Healthcare applications subject to HIPAA regulations
Streaming video solutions subject to contractual DRM requirements
We recently released a whitepaper, “Security of CloudHSM Backups” that provides in-depth information on how backups are protected in all three phases of the CloudHSM backup lifecycle process: Creation, Archive, and Restore.
About the Author
Balaji Iyer is a senior consultant in the Professional Services team at Amazon Web Services. In this role, he has helped several customers successfully navigate their journey to AWS. His specialties include architecting and implementing highly-scalable distributed systems, operational security, large scale migrations, and leading strategic AWS initiatives.
Today, I’m very pleased to announce that AWS services comply with the General Data Protection Regulation (GDPR). This means that, in addition to benefiting from all of the measures that AWS already takes to maintain services security, customers can deploy AWS services as a key part of their GDPR compliance plans.
This announcement confirms we have completed the entirety of our GDPR service readiness audit, validating that all generally available services and features adhere to the high privacy bar and data protection standards required of data processors by the GDPR. We completed this work two months ahead of the May 25, 2018 enforcement deadline in order to give customers and APN partners an environment in which they can confidently build their own GDPR-compliant products, services, and solutions.
AWS’s GDPR service readiness is only part of the story; we are continuing to work alongside our customers and the AWS Partner Network (APN) to help on their journey toward GDPR compliance. Along with this announcement, I’d like to highlight the following examples of ways AWS can help you accelerate your own GDPR compliance efforts.
Security of Personal Data
During our GDPR service readiness audit, our security and compliance experts confirmed that AWS has in place effective technical and organizational measures for data processors to secure personal data in accordance with the GDPR. Security remains our highest priority, and we continue to innovate and invest in a high bar for security and compliance across all global operations. Our industry-leading functionality provides the foundation for our long list of internationally-recognized certifications and accreditations, demonstrating compliance with rigorous international standards, such as ISO 27001 for technical measures, ISO 27017 for cloud security, ISO 27018 for cloud privacy, SOC 1, SOC 2 and SOC 3, PCI DSS Level 1, and EU-specific certifications such as BSI’s Common Cloud Computing Controls Catalogue (C5). AWS continues to pursue the certifications that assist our customers.
Compliance-enabling Services
Many requirements under the GDPR focus on ensuring effective control and protection of personal data. AWS services give you the capability to implement your own security measures in the ways you need in order to enable your compliance with the GDPR, including specific measures such as:
Encryption of personal data
Ability to ensure the ongoing confidentiality, integrity, availability, and resilience of processing systems and services
Ability to restore the availability and access to personal data in a timely manner in the event of a physical or technical incident
Processes for regularly testing, assessing, and evaluating the effectiveness of technical and organizational measures for ensuring the security of processing
This is an advanced set of security and compliance services that are designed specifically to handle the requirements of the GDPR. There are numerous AWS services that have particular significance for customers focusing on GDPR compliance, including:
Amazon GuardDuty – a security service featuring intelligent threat detection and continuous monitoring
Amazon Macie – a machine learning tool that assists in discovering and securing personal data stored in Amazon S3
Amazon Inspector – an automated security assessment service to help keep applications in conformity with best security practices
AWS Config Rules – a monitoring service that dynamically checks cloud resources for compliance with security rules
Additionally, we have published a whitepaper, “Navigating GDPR Compliance on AWS,” dedicated to this topic. This paper details how to tie GDPR concepts to specific AWS services, including those relating to monitoring, data access, and key management. Furthermore, our GDPR Center will give you access to the up-to-date resources you need to tackle requirements that directly support your GDPR efforts.
Compliant DPA
We offer a GDPR-compliant Data Processing Addendum (DPA), enabling you to comply with GDPR contractual obligations.
Conformity with a Code of Conduct
GDPR introduces adherence to a “code of conduct” as a mechanism for demonstrating sufficient guarantees of requirements that the GDPR places on data processors. In this context, we previously announced compliance with the CISPE Code of Conduct. The CISPE Code of Conduct provides customers with additional assurances regarding their ability to fully control their data in a safe, secure, and compliant environment when they use services from providers like AWS. More detail about the CISPE Code of Conduct can be found at: https://aws.amazon.com/compliance/cispe/
Training and Summits
We can provide you with training on navigating GDPR compliance using AWS services via our Professional Services team. This team has a GDPR workshop offering, which is a two-day facilitated session customized to your specific needs and challenges. We are also providing GDPR presentations during our AWS Summits in European countries, as well as San Francisco and Tokyo.
Additional Resources
Finally, we have teams of compliance, data protection, and security experts, as well as the APN, helping customers across Europe prepare for running regulated workloads in the cloud as the GDPR becomes enforceable. For additional information on this, please contact your AWS Account Manager.
As we move towards May 25 and beyond, we’ll be posting a series of blogs to dive deeper into GDPR-related concepts along with how AWS can help. Please visit our GDPR Center for more information. We’re excited about being your partner in fully addressing this important regulation.
-Chad Woolf
Vice President, AWS Security Assurance
The worst thing for a computer user has happened. The hard drive on your computer crashed, or your computer is lost or completely unusable.
Fortunately, you’re a Backblaze customer with a current backup in the cloud. That’s great. The challenge is that you’ve got a presentation to make in just 48 hours and the document and materials you need for the presentation were on the hard drive that crashed.
Relax. Backblaze has your data (and your back). The question is, how do you get what you need to make that presentation deadline?
Here are some strategies you could use.
One — The first approach is to get back the presentation file and materials you need to meet your presentation deadline as quickly as possible. You can use another computer (maybe even your smartphone) to make that presentation.
Two — The second approach is to get your computer (or a new computer, if necessary) working again and restore all the files from your Backblaze backup.
Let’s start with Option One, which gets you back to work with just the files you need now as quickly as possible.
Option One — You’ve Got a Deadline and Just Need Your Files
Getting Back to Work Immediately
You want to get your computer working again as soon as possible, but perhaps your top priority is getting access to the files you need for your presentation. The computer can wait.
Find a Computer to Use
First of all, you’re going to need a computer to use. If you have another computer handy, you’re all set. If you don’t, you’re going to need one. Here are some ideas on where to find one:
Family and Friends
Work
Neighbors
Local library
Local school
Community or religious organization
Local computer shop
Online store
If you have a smartphone that you can use to give your presentation or to print materials, that’s great. With the Backblaze app for iOS and Android, you can download files directly from your Backblaze account to your smartphone. You also have the option with your smartphone to email or share files from your Backblaze backup so you can use them elsewhere.
Download The File(s) You Need
Once you have the computer, you need to connect to your Backblaze backup through a web browser or the Backblaze smartphone app.
Backblaze Web Admin
Sign into your Backblaze account. You can download the files directly or use the share link to share files with yourself or someone else.
If you have an iOS or Android smartphone, you can use the Backblaze app and retrieve the files you need. You then could view the file on your phone, use a smartphone app with the file, or email it to yourself or someone else.
Backblaze Smartphone app (iOS)
Using one of the approaches above, you got your files back in time for your presentation. Way to go!
Now, the next step is to get the computer with the bad drive running again and restore all your files, or, if that computer is no longer usable, restore your Backblaze backup to a new computer.
Option Two — You Need a Working Computer Again
Getting the Computer with the Failed Drive Running Again (or a New Computer)
If the computer with the failed drive can’t be saved, then you’re going to need a new computer. A new computer likely will come with the operating system installed and ready to boot. If you’ve got a running computer and are ready to restore your files from Backblaze, you can skip forward to Restore the Files to the Drive.
If you need to replace the hard drive in your computer before you restore your files, you can continue reading.
Buy a New Hard Drive to Replace the Failed Drive
The hard drive is gone, so you’re going to need a new drive. If you have a computer or electronics store nearby, you could get one there. Another choice is to order a drive online and pay for one or two-day delivery. You have a few choices:
Buy a hard drive of the same type and size you had
Upgrade to a drive with more capacity
Upgrade to an SSD. SSDs cost more but they are faster, more reliable, and less susceptible to jolts, magnetic fields, and other hazards that can affect a drive. Otherwise, they work the same as a hard disk drive (HDD) and most likely will work with the same connector.
Hard Disk Drive (HDD)
Solid State Drive (SSD)
Be sure that the drive dimensions are compatible with where you’re going to install the drive in your computer, and the drive connector is compatible with your computer system (SATA, PCIe, etc.) Here’s some help.
Install the Drive
If you’re handy with computers, you can install the drive yourself. It’s not hard, and there are numerous videos on YouTube and elsewhere on how to do this. Just be sure to note how everything was connected so you can get everything connected and put back together correctly. Also, be sure that you discharge any static electricity from your body by touching something metallic before you handle anything inside the computer. If all this sounds like too much to handle, find a friend or a local computer store to help you.
Note: If the drive that failed is a boot drive for your operating system (either Macintosh or Windows), you need to make sure that the drive is bootable and has the operating system files on it. You may need to reinstall from an operating system source disk or install files.
Restore the Files to the Drive
With the computer running again, sign in to your Backblaze account through a web browser. Once logged in, you will be brought to the account Overview page. On this page, all of the computers registered for backup under your account are shown with some basic information about each. Select the backup from which you wish to restore data by using the appropriate “Restore” button.
Selecting the Type of Restore
Backblaze offers three different ways in which you can receive your restore data: downloadable ZIP file, USB flash drive, or USB hard drive. The downloadable ZIP restore option will create a ZIP file of the files you request that is made available for download for 7 days. ZIP restores do not have any additional cost and are a great option for individual files or small sets of data.
Depending on the speed of your internet connection to the Backblaze data center, downloadable restores may not always be the best option for restoring very large amounts of data. ZIP restores are limited to 500 GB per request and a maximum of 5 active requests can be submitted under a single account at any given time.
USB flash and hard drive restores are built with the data you request and then shipped to an address of your choosing via FedEx Overnight or FedEx Priority International. USB flash restores cost $99 and can contain up to 128 GB (110,000 MB of data) and USB hard drive restores cost $189 and can contain up to 4TB max (3,500,000 MB of data). Both include the cost of shipping.
You can return the USB drive within 30 days for a full refund with our Restore Return Refund Program, effectively making the process of restoring free, even with a shipped USB drive.
Selecting Files for Restore
Using the left hand file viewer, navigate to the location of the files you wish to restore. You can use the disclosure triangles to see subfolders. Clicking on a folder name will display the folder’s files in the right hand file viewer. If you are attempting to restore files that have been deleted or are otherwise missing or files from a failed or disconnected secondary or external hard drive, you may need to change the time frame parameters.
Put checkmarks next to disks, files or folders you’d like to recover. Once you have selected the files and folders you wish to restore, select the “Continue with Restore” button above or below the file viewer. Backblaze will then build the restore via the option you select (ZIP or USB drive). You’ll receive an automated email notifying you when the ZIP restore has been built and is ready for download or when the USB restore drive ships.
If you are using the downloadable ZIP option, and the restore is over 2 GB, we highly recommend using the Backblaze Downloader for better speed and reliability. We have a guide on using the Backblaze Downloader for Mac OS X or for Windows.
Recent versions of both macOS and Windows have built-in capability to extract files from a ZIP archive. If the built-in capabilities aren’t working for you, you can find additional utilities for Macintosh and Windows.
Reactivating your Backblaze Account
Now that you’ve got a working computer again, you’re going to need to reinstall Backblaze Backup (if it’s not on the system already) and connect with your existing account. Start by downloading and reinstalling Backblaze.
If you’ve restored the files from your Backblaze Backup to your new computer or drive, you don’t want to have to reupload the same files again to your Backblaze backup. To let Backblaze know that this computer is on the same account and has the same files, you need to use “Inherit Backup State.” See https://help.backblaze.com/hc/en-us/articles/217666358-Inherit-Backup-State
That’s It
You should be all set, either with the files you needed for your presentation, or with a restored computer that is again ready to do productive work.
We hope your presentation wowed ’em.
If you have any additional questions on restoring from a Backblaze backup, please ask away in the comments. Also, be sure to check out our help resources at https://www.backblaze.com/help.html.
In Part 2, we take a deeper look at the differences between HDDs and SSDs, how both HDD and SSD technologies are evolving, and how Backblaze takes advantage of SSDs in our operations and data centers.
The first time you booted a computer or opened an app on a computer with a solid-state drive (SSD), you likely were delighted. I know I was. I loved the speed, silence, and just the wow factor of this new technology that seemed better in just about every way compared to hard drives.
I was ready to fully embrace the promise of SSDs. And I have. My desktop uses an SSD for booting, applications, and for working files. My laptop has a single 512GB SSD. I still use hard drives, however. The second, third, and fourth drives in my desktop computer are HDDs. The external USB RAID I use for local backup uses HDDs in four drive bays. When my laptop is at my desk it is attached to a 1.5TB USB backup hard drive. HDDs still have a place in my personal computing environment, as they likely do in yours.
Nothing stays the same for long, however, especially in the fast-changing world of computing, so we are certain to see new storage technologies coming to the fore, perhaps with even more wow factor.
Before we get to what’s coming, let’s review the primary differences between HDDs and SSDs in a little more detail in the following table.
A Comparison of HDDs to SSDs
Power Draw / Battery Life
HDD: More power draw, averages 6–7 watts and therefore uses more battery
SSD: Less power draw, averages 2–3 watts, resulting in 30+ minute battery boost

Cost
HDD: Only around $0.03 per gigabyte, very cheap (buying a 4TB model)
SSD: Expensive, roughly $0.20–$0.30 per gigabyte (based on buying a 1TB drive)

Capacity
HDD: Typically around 500GB and 2TB maximum for notebook size drives; 10TB max for desktops
SSD: Typically not larger than 1TB for notebook size drives; 4TB for desktops

Operating System Boot Time
HDD: Around 30–40 seconds average bootup time
SSD: Around 8–13 seconds average bootup time

Noise
HDD: Audible clicks and spinning platters can be heard
SSD: There are no moving parts, hence no sound

Vibration
HDD: The spinning of the platters can sometimes result in vibration
SSD: No vibration as there are no moving parts

Heat Produced
HDD: Doesn’t produce much heat, but it will have a measurable amount more heat than an SSD due to moving parts and higher power draw
SSD: Lower power draw and no moving parts so little heat is produced

Failure Rate
HDD: Mean time between failure rate of 1.5 million hours
SSD: Mean time between failure rate of 2.0 million hours

File Copy / Write Speed
HDD: The range can be anywhere from 50–120 MB/s
SSD: Generally above 200 MB/s and up to 550 MB/s for cutting-edge drives

Encryption
HDD: Full Disk Encryption (FDE) supported on some models
SSD: Full Disk Encryption (FDE) supported on some models
The HDD has an amazing history of improvement and innovation. From its inception in 1956 the hard drive has decreased in size 57,000 times, increased storage 1 million times, and decreased cost 2,000 times. In other words, the cost per gigabyte has decreased by 2 billion times in about 60 years.
Hard drive manufacturers made these dramatic advances by reducing the size, and consequently the seek times, of platters while increasing their density, improving disk reading technologies, adding multiple arms and read/write heads, developing better bus interfaces, and increasing spin speed and reducing friction with techniques such as filling drives with helium.
In 2005, the drive industry introduced perpendicular recording technology to replace the older longitudinal recording technology, which enabled areal density to reach more than 100 gigabits per square inch. Longitudinal recording aligns data bits horizontally in relation to the drive’s spinning platter, parallel to the surface of the disk, while perpendicular recording aligns bits vertically, perpendicular to the disk surface.
Perpendicular Recording
Other technologies such as bit patterned media recording (BPMR) are contributing to increased densities, as well. Introduced by Toshiba in 2010, BPMR is a proposed hard disk drive technology that could succeed perpendicular recording. It records data using nanolithography in magnetic islands, with one bit per island. This contrasts with current disk drive technology where each bit is stored in 20 to 30 magnetic grains within a continuous magnetic film.
Shingled magnetic recording (SMR) is a magnetic storage data recording technology used in HDDs to increase storage density and overall per-drive storage capacity. Shingled recording writes new tracks that overlap part of the previously written magnetic track, leaving the previous track narrower and allowing for higher track density. Thus, the tracks partially overlap similar to roof shingles. This approach was selected because physical limitations prevent recording magnetic heads from having the same width as reading heads, leaving recording heads wider.
Track Spacing Enabled by SMR Technology (Seagate)
To increase the amount of data stored on a drive’s platter requires cramming the magnetic regions closer together, which means the grains need to be smaller so they won’t interfere with each other. In 2002, Seagate successfully demoed heat-assisted magnetic recording (HAMR). HAMR records magnetically using laser-thermal assistance that ultimately could lead to a 20 terabyte drive by 2019. (See our post on HAMR by Seagate’s CTO Mark Re, What is HAMR and How Does It Enable the High-Capacity Needs of the Future?)
Western Digital claims that its competing microwave-assisted magnetic recording (MAMR) could enable drive capacity to increase up to 40TB by the year 2025. Some industry watchers and drive manufacturers predict increases in areal density from today’s 0.86 terabits per square inch (Tbpsi) to 10 Tbpsi by 2025, resulting in as much as 100TB drive capacity in the next decade.
The future certainly does look bright for HDDs continuing to be with us for a while.
The Outlook for SSDs
SSDs are also in for some amazing advances.
Interfaces
SATA (Serial Advanced Technology Attachment) is the common hardware interface that allows the transfer of data to and from HDDs and SSDs. SATA SSDs are fine for the majority of home users, as they are generally cheaper, operate at a lower speed, and have a lower write life.
While fine for everyday computing, in a RAID (Redundant Array of Independent Disks), server array, or data center environment, often a better alternative has been to use ‘SAS’ drives, which stands for Serial Attached SCSI. This is another type of interface that, again, is usable either with HDDs or SSDs. ‘SCSI’ stands for Small Computer System Interface (which is why SAS drives are sometimes referred to as ‘scuzzy’ drives). SAS has increased IOPS (Input/Output Operations Per Second) over SATA, meaning it has the ability to read and write data faster. This has made SAS an optimal choice for systems that require high performance and availability.
On an enterprise level, SAS prevails over SATA, as SAS supports over-provisioning to prolong write life and has been specifically designed to run in environments that require constant drive usage.
Enter PCIe
PCIe (Peripheral Component Interconnect Express) is a high speed serial computer expansion bus standard that supports drastically higher data transfer rates over SATA or SAS interfaces due to the fact that there are more channels available for the flow of data.
Many leading drive manufacturers have been adopting PCIe as the standard for new home and enterprise storage and some peripherals. For example, you’ll see that the latest Apple Macbooks ship with PCIe-based flash storage, something that Apple has been adopting over the years with their consumer devices.
PCIe can also be used within data centers for RAID systems and to create high-speed networking capabilities, increasing overall performance and supporting the newer and higher capacity HDDs.
Storage Density
As we covered in Part 1, SSDs are based on a type of non-volatile flash memory called NAND. The latest trend in NAND flash is quad-level-cell (QLC) NAND. NAND is subdivided into types based on how many bits of data are stored in each physical memory cell. SLC (single-level cell) stores one bit, MLC (multi-level cell) stores two, TLC (triple-level cell) stores three, and QLC (quad-level cell) stores four.
QLC NAND
Storing more data per cell makes NAND more dense, but it also makes the memory slower — it takes more time to read and write data when so much additional information (and so many more charge states) are stored within the same cell of memory.
QLC NAND memory is built on older process nodes with larger cells that can more easily store multiple bits of data. The new NAND tech has higher overall reliability, with a higher total number of program/erase cycles (P/E cycles).
QLC NAND wafer from which individual microcircuits are made
QLC NAND promises to produce faster and denser SSDs. The effect on price also could be dramatic. Tom’s Hardware is predicting that the advent of QLC could push 512GB SSDs down to $100.
Beyond HDDs and SSDs
There is significant work being done that is pushing the bounds of data storage beyond what is possible with spinning platters and microcircuits. A team at Harvard University has used genome-editing to encode video into live bacteria.
Scientists from the University of Southampton’s Optoelectronics Research Centre (ORC) have developed processes for recording and retrieving five-dimensional (5D) digital data using femtosecond laser writing. These discs can hold 360 TB of data and are theoretically stable for up to 14 billion years, thanks to ‘5D data storage’ inscribed by laser nanostructuring in quartz silica glass. SpaceX’s Falcon Heavy recently carried this technology into space.
We’ve already discussed the benefits of SSDs. Those that apply particularly to the data center are:
Low power consumption — When you are running lots of drives, power usage adds up. Anywhere you can conserve power is a win.
Speed — Data can be accessed faster, which is especially beneficial for caching databases and other data affecting overall application or system performance.
Lack of vibration — Reducing vibration improves reliability, thereby reducing problems and maintenance. Racks housing SSDs don’t need the size and structural rigidity that racks housing HDDs require.
Low noise — Data centers will become quieter as more SSDs are deployed.
Low heat production — The less heat generated the less cooling and power required in the data center.
Faster booting — The faster a storage chassis can get online or a critical server can be rebooted after maintenance or a problem, the better.
Greater areal density — Data centers will be able to store more data in less space, which increases efficiency in all areas (power, cooling, etc.)
The top drive manufacturers say that they expect HDDs and SSDs to coexist for the foreseeable future in all areas — home, business, and data center, with customers choosing which technology and product will best fit their application.
How Backblaze Uses SSDs
In just about all respects, SSDs are superior to HDDs. So why don’t we replace the 100,000+ hard drives we have spinning in our data centers with SSDs?
Our operations team takes advantage of the benefits and savings of SSDs wherever they can, using them in every place that’s appropriate other than primary data storage. They’re particularly useful in our caching and restore layers, where we use them strategically to speed up data transfers. SSDs also speed up access to B2 Cloud Storage metadata. Our operations team is considering moving to SSDs to boot our Storage Pods, where the cost of a small SSD is competitive with hard drives, and their other attributes (small size, lack of vibration, speed, low power consumption, reliability) are all pluses.
A Future with Both HDDs and SSDs
IDC predicts that total data created will grow from approximately 33 zettabytes in 2018 to about 160 zettabytes in 2025. (See What’s a Byte? if you’d like help understanding the size of a zettabyte.)
Annual Size of the Global Datasphere
Over 90% of enterprise drive shipments today are HDD, according to IDC. By 2025, SSDs will comprise almost 20% of drive shipments. SSDs will gain share, but total growth in data created will result in massive sales of both HDDs and SSDs.
Enterprise Byte Shipments: HDD and SSD
As both HDD and SSD sales grow, so does the capacity of both technologies. Given the benefits of SSDs in many applications, we’re likely going to see SSDs replacing HDDs in all but the highest capacity uses.
It’s clear that there are merits to both HDDs and SSDs. If you’re not running a data center, and don’t have more than one or two terabytes of data to store on your home or business computer, your first choice likely should be an SSD. They provide a noticeable improvement in performance during boot-up and data transfer, and are smaller, quieter, and more reliable as well. Save the HDDs for secondary drives, NAS, RAID, and local backup devices in your system.
Perhaps some day we’ll look back at the days of spinning platters with the same nostalgia with which we look back at stereo LPs, and some of us will have an HDD paperweight on our floating anti-gravity desk as a conversation piece. Until the day that SSDs’ performance, capacity, and, finally, price push the last HDD out of the home and the data center, we can expect to live in a world that contains both solid state SSDs and magnetic platter HDDs, and as users we will reap the benefits of both technologies.
Don’t miss future posts on HDDs, SSDs, and other topics, including hard drive stats, cloud storage, and tips and tricks for backing up to the cloud. Use the Join button above to receive notification of future posts on our blog.
Amazon EC2 Spot Instances are spare compute capacity in the AWS Cloud available to you at steep discounts compared to On-Demand prices. The only difference between On-Demand Instances and Spot Instances is that Spot Instances can be interrupted by Amazon EC2 with two minutes of notification when EC2 needs the capacity back.
Customers have been taking advantage of Spot Instance interruption notices available via the instance metadata service since January 2015 to orchestrate their workloads seamlessly around any potential interruptions. Examples include saving the state of a job, detaching from a load balancer, or draining containers. Needless to say, the two-minute Spot Instance interruption notice is a powerful tool when using Spot Instances.
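As an illustration of that metadata-based approach (not part of this post’s walkthrough), here is a minimal sketch that polls the instance metadata service’s spot/instance-action item, assuming IMDSv1-style access; the item returns a 404 until an interruption notice has been issued:
import time
import urllib.error
import urllib.request

METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def wait_for_interruption_notice(poll_seconds=5):
    # Poll the instance metadata service; the item 404s until a notice exists
    while True:
        try:
            with urllib.request.urlopen(METADATA_URL, timeout=2) as resp:
                # Returns JSON describing the action and time of the interruption
                return resp.read().decode()
        except urllib.error.HTTPError as err:
            if err.code != 404:
                raise
        time.sleep(poll_seconds)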
In January 2018, the Spot Instance interruption notice also became available as an event in Amazon CloudWatch Events. This allows targets such as AWS Lambda functions or Amazon SNS topics to process Spot Instance interruption notices by creating a CloudWatch Events rule to monitor for the notice.
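For reference, the event a target receives looks roughly like the following (the values below are placeholders); the detail-type identifies the notice, and the detail block carries the instance ID and the action EC2 is about to take:
# Approximate shape of a Spot Instance interruption notice delivered via CloudWatch Events
sample_event = {
    "version": "0",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "source": "aws.ec2",
    "region": "us-east-1",
    "resources": ["arn:aws:ec2:us-east-1:123456789012:instance/i-0123456789abcdef0"],
    "detail": {
        "instance-id": "i-0123456789abcdef0",
        "instance-action": "terminate"
    }
}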
In this post, I walk through an example use case for taking advantage of Spot Instance interruption notices in CloudWatch Events to automatically deregister Spot Instances from an Elastic Load Balancing Application Load Balancer.
When any of the Spot Instances receives an interruption notice, Spot Fleet sends the event to CloudWatch Events. The CloudWatch Events rule then notifies both targets, the Lambda function and SNS topic. The Lambda function detaches the Spot Instance from the Application Load Balancer target group, taking advantage of nearly a full two minutes of connection draining before the instance is interrupted. The SNS topic also receives a message, and is provided as an example for the reader to use as an exercise.
To complete this walkthrough, you’ll need the AWS CLI installed and configured, as well as the ability to launch CloudFormation stacks.
Launch the stack
Go ahead and launch the CloudFormation stack. You can check it out from GitHub, or grab the template directly. In this post, I use the stack name “spot-spin-cwe”, but feel free to use any name you like. Just remember to change it in the instructions.
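If you prefer to launch the stack programmatically rather than from the console, a minimal boto3 sketch might look like this (the template filename is hypothetical; save a local copy of the template from GitHub first):
import boto3

cfn = boto3.client('cloudformation')

# Launch the stack from a locally saved copy of the template;
# CAPABILITY_IAM is included because the template creates IAM roles.
with open('spot-spin-cwe.yaml') as template:
    cfn.create_stack(
        StackName='spot-spin-cwe',
        TemplateBody=template.read(),
        Capabilities=['CAPABILITY_IAM']
    )

# Wait for stack creation to finish before moving on
cfn.get_waiter('stack_create_complete').wait(StackName='spot-spin-cwe')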
Here are the details of the architecture being launched by the stack.
IAM permissions
Give permissions to a few components in the architecture:
The Lambda function
The CloudWatch Events rule
The Spot Fleet
The Lambda function needs basic Lambda function execution permissions so that it can write logs to CloudWatch Logs. You can use the AWS managed policy for this. It also needs to describe EC2 tags as well as deregister targets within Elastic Load Balancing. You can create a custom policy for these.
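A sketch of what that custom policy might contain, created here with boto3 (the policy name is hypothetical, and you would still attach it to the function’s execution role):
import json
import boto3

iam = boto3.client('iam')

# Minimal custom policy covering the two extra calls the function makes:
# describing EC2 tags and deregistering Elastic Load Balancing targets.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "ec2:DescribeTags",
            "elasticloadbalancing:DeregisterTargets"
        ],
        "Resource": "*"
    }]
}

iam.create_policy(
    PolicyName='spot-spin-cwe-lambda-policy',  # hypothetical name
    PolicyDocument=json.dumps(policy_document)
)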
Finally, Spot Fleet needs permissions to request Spot Instances, tag, and register targets in Elastic Load Balancing. You can tap into an AWS managed policy for this.
Because you are taking advantage of the two-minute Spot Instance notice, you can tune the Elastic Load Balancing target group deregistration timeout delay to match. When a target is deregistered from the target group, it is put into connection draining mode for the length of the timeout delay: 120 seconds to equal the two-minute notice.
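A sketch of setting that delay with boto3, using a placeholder target group ARN (the CloudFormation template would normally configure this for you):
import boto3

elbv2 = boto3.client('elbv2')

# Set the target group's deregistration (connection draining) delay to 120 seconds
# to match the two-minute interruption notice; the ARN is a placeholder.
elbv2.modify_target_group_attributes(
    TargetGroupArn='arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/example/1234567890123456',
    Attributes=[{'Key': 'deregistration_delay.timeout_seconds', 'Value': '120'}]
)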
To capture the Spot Instance interruption notice being published to CloudWatch Events, create a rule with two targets: the Lambda function and the SNS topic.
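A minimal sketch of creating such a rule with boto3 (the rule name and target ARNs are placeholders for the stack’s resources):
import boto3

events = boto3.client('events')

# Match the Spot Instance interruption warnings published by EC2
events.put_rule(
    Name='spot-interruption-rule',
    EventPattern='{"source": ["aws.ec2"], "detail-type": ["EC2 Spot Instance Interruption Warning"]}'
)

# Point the rule at both targets
events.put_targets(
    Rule='spot-interruption-rule',
    Targets=[
        {'Id': 'lambda-target', 'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:spot-spin-cwe-function'},
        {'Id': 'sns-target', 'Arn': 'arn:aws:sns:us-east-1:123456789012:spot-spin-cwe-topic'}
    ]
)
Note that the Lambda function also needs a resource-based permission allowing events.amazonaws.com to invoke it, which the CloudFormation template would typically set up for you.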
The Lambda function does the heavy lifting for you. The details of the CloudWatch event are passed to the Lambda function, which then uses boto3 to make a couple of AWS API calls. The first call describes the EC2 tags for the Spot Instance, filtering on a key of “loadBalancerTargetGroup”. If this tag is found, the instance is deregistered from the target group ARN stored as the value of the tag.
import boto3

def handler(event, context):
    # Pull the instance ID and action (for example, "terminate") from the event detail
    instanceId = event['detail']['instance-id']
    instanceAction = event['detail']['instance-action']  # not used below, but available

    # Look up the target group ARN stored on the instance in the "loadBalancerTargetGroup" tag
    try:
        ec2client = boto3.client('ec2')
        describeTags = ec2client.describe_tags(Filters=[
            {'Name': 'resource-id', 'Values': [instanceId]},
            {'Name': 'key', 'Values': ['loadBalancerTargetGroup']}
        ])
    except Exception:
        print("No action being taken. Unable to describe tags for instance id:", instanceId)
        return

    # Deregister the instance from the target group so connection draining can begin;
    # if the tag wasn't found, the lookup below fails and we fall through to the except.
    try:
        elbv2client = boto3.client('elbv2')
        deregisterTargets = elbv2client.deregister_targets(
            TargetGroupArn=describeTags['Tags'][0]['Value'],
            Targets=[{'Id': instanceId}]
        )
    except Exception:
        print("No action being taken. Unable to deregister targets for instance id:", instanceId)
        return

    print("Detaching instance from target:")
    print(instanceId, describeTags['Tags'][0]['Value'], deregisterTargets, sep=",")
    return
SNS topic
Finally, you’ve created an SNS topic as an example target. For example, you could subscribe an email address to the SNS topic in order to receive email notifications when a Spot Instance interruption notice is received.
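As a sketch, subscribing an email address with boto3 might look like this (the topic ARN and address are placeholders); the address receives a confirmation email before notifications are delivered:
import boto3

sns = boto3.client('sns')

# Subscribe an email address to the interruption notice topic
sns.subscribe(
    TopicArn='arn:aws:sns:us-east-1:123456789012:spot-spin-cwe-topic',
    Protocol='email',
    Endpoint='you@example.com'
)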
To create your Spot Fleet request, use some of the resources that the CloudFormation stack created to populate the Spot Fleet request launch configuration. You can find these values in the outputs of the CloudFormation stack:
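You can read those outputs from the console, or programmatically, as in this sketch (using the stack name from earlier):
import boto3

cfn = boto3.client('cloudformation')

# Print the stack's output values to copy into the Spot Fleet launch configuration
stack = cfn.describe_stacks(StackName='spot-spin-cwe')['Stacks'][0]
for output in stack.get('Outputs', []):
    print(output['OutputKey'], '=', output['OutputValue'])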
You can confirm that the Spot Fleet request was fulfilled by checking that ActivityStatus is “fulfilled”, or by checking that FulfilledCapacity is greater than or equal to TargetCapacity, while describing the request:
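A minimal sketch of that check with boto3 (the Spot Fleet request ID is a placeholder):
import boto3

ec2 = boto3.client('ec2')

# Describe the Spot Fleet request and report its status and capacity
response = ec2.describe_spot_fleet_requests(
    SpotFleetRequestIds=['sfr-0123abcd-4567-89ef-0123-456789abcdef']
)
request = response['SpotFleetRequestConfigs'][0]
config = request['SpotFleetRequestConfig']
print('ActivityStatus:', request.get('ActivityStatus'))
print('FulfilledCapacity:', config.get('FulfilledCapacity'),
      '/ TargetCapacity:', config.get('TargetCapacity'))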
In order to test, you can take advantage of the fact that any interruption action that Spot Fleet takes on a Spot Instance results in a Spot Instance interruption notice being provided. Therefore, you can simply decrease the target size of your Spot Fleet from 2 to 1. The instance that is interrupted receives the interruption notice:
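For example, a sketch of scaling the fleet down with boto3 (again, the request ID is a placeholder):
import boto3

ec2 = boto3.client('ec2')

# Scale the fleet from 2 to 1; Spot Fleet interrupts one instance,
# which then receives the two-minute interruption notice
ec2.modify_spot_fleet_request(
    SpotFleetRequestId='sfr-0123abcd-4567-89ef-0123-456789abcdef',
    TargetCapacity=1
)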
As soon as the interruption notice is published to CloudWatch Events, the Lambda function triggers and detaches the instance from the target group, effectively putting the instance in a draining state.
In conclusion, Amazon EC2 Spot Instance interruption notices are an extremely powerful tool when taking advantage of Amazon EC2 Spot Instances in your workloads, for tasks such as saving state, draining connections, and much more. I’d love to hear how you are using them in your own environment!
Chad Schmutzer, Solutions Architect
Chad Schmutzer is a Solutions Architect at Amazon Web Services based in Pasadena, CA. As an extension of the Amazon EC2 Spot Instances team, Chad helps customers significantly reduce the cost of running their applications, growing their compute capacity and throughput without increasing budget, and enabling new types of cloud computing applications.