Today, I’m going to tell you about a new site we launched at re:Invent, the Amazon Builders’ Library, a collection of living articles covering topics across architecture, software delivery, and operations. You get to peek under the hood of how Amazon architects, releases, and operates the software underpinning Amazon.com and AWS.
Want to know how Amazon.com does what it does? This is for you. In this two-part series (the next one coming December 23), I’ll highlight some of the best architecture articles written by Amazon’s senior technical leaders and engineers.
Avoiding insurmountable queue backlogs
In queueing theory, the behavior of queues when they are short is relatively uninteresting. After all, when a queue is short, everyone is happy. It’s only when the queue is backlogged, when the line to an event goes out the door and around the corner, that people start thinking about throughput and prioritization.
In this article, I discuss strategies we use at Amazon to deal with queue backlog scenarios – design approaches we take to drain queues quickly and to prioritize workloads. Most importantly, I describe how to prevent queue backlogs from building up in the first place. In the first half, I describe scenarios that lead to backlogs, and in the second half, I describe many approaches used at Amazon to avoid backlogs or deal with them gracefully.
Whenever one service or system calls another, failures can happen. These failures can come from a variety of factors, including servers, networks, load balancers, software, operating systems, or even mistakes by system operators. We design our systems to reduce the probability of failure, but it is impossible to build systems that never fail. So at Amazon, we design our systems to tolerate failure and to avoid magnifying a small percentage of failures into a complete outage. To build resilient systems, we employ three essential tools: timeouts, retries, and backoff.
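The article goes into depth on each of these tools. As a rough illustration of how they work together, here is a minimal client-side sketch in Python, where the call_dependency function and the tuning constants are hypothetical, not part of the original article:

import random
import time

MAX_ATTEMPTS = 3          # hypothetical retry budget
BASE_DELAY_S = 0.1        # first backoff interval
MAX_DELAY_S = 2.0         # cap so retries don't pile up behind long waits
REQUEST_TIMEOUT_S = 1.0   # per-attempt timeout passed to the client

def call_with_retries(call_dependency, request):
    """Call a dependency with a timeout, retries, and jittered exponential backoff."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            # Every attempt gets its own timeout so a slow dependency can't stall us.
            return call_dependency(request, timeout=REQUEST_TIMEOUT_S)
        except Exception:
            if attempt == MAX_ATTEMPTS - 1:
                raise  # retry budget exhausted; surface the failure to the caller
            # Exponential backoff with full jitter spreads retries out over time.
            delay = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))
            time.sleep(random.uniform(0, delay))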
The moment we added our second server, distributed systems became the way of life at Amazon. When I started at Amazon in 1999, we had so few servers that we could give some of them recognizable names like “fishy” or “online-01”. However, even in 1999, distributed computing was not easy. Then as now, challenges with distributed systems involved latency, scaling, understanding networking APIs, marshalling and unmarshalling data, and the complexity of algorithms such as Paxos. As the systems quickly grew larger and more distributed, what had been theoretical edge cases turned into regular occurrences.
Developing distributed utility computing services, such as reliable long-distance telephone networks, or Amazon Web Services (AWS) services, is hard. Distributed computing is also weirder and less intuitive than other forms of computing because of two interrelated problems. Independent failures and nondeterminism cause the most impactful issues in distributed systems. In addition to the typical computing failures most engineers are used to, failures in distributed systems can occur in many other ways. What’s worse, it’s impossible always to know whether something failed.
At Amazon, the services we build must meet extremely high availability targets. This means that we need to think carefully about the dependencies that our systems take. We design our systems to stay resilient even when those dependencies are impaired. In this article, we’ll define a pattern that we use called static stability to achieve this level of resilience. We’ll show you how we apply this concept to Availability Zones, a key infrastructure building block in AWS and therefore a bedrock dependency on which all of our services are built.
Serverless is one of the hottest design patterns in the cloud today, allowing you to focus on building and innovating, rather than worrying about the heavy lifting of server and OS operations. In this series of posts, we’ll discuss topics that you should consider when designing your serverless architectures. First, we’ll look at architectural patterns designed to achieve massive scale with serverless.
Scaling Considerations
In general, developers in a “serverful” world need to worry about how many total requests can be served throughout the day, week, or month, and how quickly their system can scale. As you move into the serverless world, the most important question becomes: “What concurrency is your system designed to handle?”
The AWS Serverless platform allows you to scale very quickly in response to demand. Below is an example of a serverless design that is fully synchronous throughout the application. During periods of extremely high demand, Amazon API Gateway and AWS Lambda will scale in response to your incoming load. This design places extremely high load on your backend relational database because Lambda can easily scale from thousands to tens of thousands of concurrent requests. In most cases, your relational databases are not designed to accept the same number of concurrent connections.
This design risks bottlenecks at your relational database and may cause service outages. This design also risks data loss due to throttling or database connection exhaustion.
Cloud Native Design
Instead, you should consider decoupling your architecture and moving to an asynchronous model. In this architecture, you use an intermediary service, such as Amazon Kinesis or Amazon Simple Queue Service (SQS), to buffer incoming requests. You can configure Kinesis or SQS as out-of-the-box event sources for Lambda. In the design below, AWS automatically polls your Kinesis stream or SQS queue for new records and delivers them to your Lambda functions. You can control the batch size per delivery and further place throttles on a per-function basis.
This design allows you to accept an extremely high volume of requests, store them in a durable datastore, and process them at the rate your system can handle.
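As a rough sketch of the consuming side of such a design (this is not from the post itself; the save_order function is a hypothetical placeholder), a Lambda handler configured with an SQS event source receives a batch of records and processes them at a rate the backend can sustain:

import json

def save_order(order):
    # Placeholder for writing to your backend or data store at a controlled rate.
    pass

def handler(event, context):
    # Each invocation receives a batch of SQS records; the batch size is
    # configurable on the event source mapping.
    for record in event["Records"]:
        order = json.loads(record["body"])
        save_order(order)
    # Returning normally tells Lambda the whole batch succeeded and can be deleted.
    return {"processed": len(event["Records"])}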
Conclusion
Serverless computing allows you to scale much more quickly than with server-based applications, but that means application architects should always consider the effects of scaling on their downstream services. Always keep cost, speed, and reliability in mind when you’re building your serverless applications.
Our next post in this series will discuss the different ways to invoke your Lambda functions and how to design your applications appropriately.
About the Author
George Mao is a Specialist Solutions Architect at Amazon Web Services, focused on the Serverless platform. George is responsible for helping customers design and operate Serverless applications using services like Lambda, API Gateway, Cognito, and DynamoDB. He is a regular speaker at AWS Summits, re:Invent, and various tech events. George is a software engineer and enjoys contributing to open source projects, delivering technical presentations at technology events, and working with customers to design their applications in the Cloud. George holds a Bachelor of Computer Science and Masters of IT from Virginia Tech.
This post is courtesy of Otavio Ferreira, Manager, Amazon SNS, AWS Messaging.
Amazon SNS message filtering provides a set of string and numeric matching operators that allow each subscription to receive only the messages of interest. Hence, SNS message filtering can simplify your pub/sub messaging architecture by offloading the message filtering logic from your subscriber systems, as well as the message routing logic from your publisher systems.
After you set the subscription attribute that defines a filter policy, the subscribing endpoint receives only the messages that carry attributes matching this filter policy. Other messages published to the topic are filtered out for this subscription. The native integration between SNS and Amazon CloudWatch provides visibility into the number of messages delivered, as well as the number of messages filtered out.
CloudWatch metrics are captured automatically for you. To get started with SNS message filtering, see Filtering Messages with Amazon SNS.
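For example, a filter policy is just a subscription attribute. A minimal sketch with the AWS SDK for Python (boto3), using hypothetical topic and subscription ARNs and a simple one-attribute policy, looks like this:

import json
import boto3

sns = boto3.client("sns")

# Attach a filter policy so this subscription only receives "rugby" messages.
sns.set_subscription_attributes(
    SubscriptionArn="arn:aws:sns:us-east-1:123456789012:MyTopic:example-subscription-id",
    AttributeName="FilterPolicy",
    AttributeValue=json.dumps({"sport": ["rugby"]}),
)

# Publish a message whose attributes are evaluated against the filter policy.
sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:MyTopic",
    Message="match scheduled",
    MessageAttributes={
        "sport": {"DataType": "String", "StringValue": "rugby"}
    },
)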
Message Filtering Metrics
The following six CloudWatch metrics are relevant to understanding your SNS message filtering activity:
NumberOfMessagesPublished – Inbound traffic to SNS. This metric tracks all the messages that have been published to the topic.
NumberOfNotificationsDelivered – Outbound traffic from SNS. This metric tracks all the messages that have been successfully delivered to endpoints subscribed to the topic. A delivery takes place either when the incoming message attributes match a subscription filter policy, or when the subscription has no filter policy at all, which results in a catch-all behavior.
NumberOfNotificationsFilteredOut – This metric tracks all the messages that were filtered out because they carried attributes that didn’t match the subscription filter policy.
NumberOfNotificationsFilteredOut-NoMessageAttributes – This metric tracks all the messages that were filtered out because they didn’t carry any attributes at all and, consequently, didn’t match the subscription filter policy.
NumberOfNotificationsFilteredOut-InvalidAttributes – This metric keeps track of messages that were filtered out because they carried invalid or malformed attributes and, thus, didn’t match the subscription filter policy.
NumberOfNotificationsFailed – This last metric tracks all the messages that failed to be delivered to subscribing endpoints, regardless of whether a filter policy had been set for the endpoint. This metric is emitted after the message delivery retry policy is exhausted, and SNS stops attempting to deliver the message. At that moment, the subscribing endpoint is likely no longer reachable. For example, the subscribing SQS queue or Lambda function has been deleted by its owner. You may want to closely monitor this metric to address message delivery issues quickly.
Message filtering graphs
Through the AWS Management Console, you can compose graphs to display your SNS message filtering activity. The graph shows the number of messages published, delivered, and filtered out within the timeframe you specify (1h, 3h, 12h, 1d, 3d, 1w, or custom).
To compose an SNS message filtering graph with CloudWatch:
Open the CloudWatch console.
Choose Metrics, SNS, All Metrics, and Topic Metrics.
Select all metrics to add to the graph, such as:
NumberOfMessagesPublished
NumberOfNotificationsDelivered
NumberOfNotificationsFilteredOut
Choose Graphed metrics.
In the Statistic column, switch from Average to Sum.
Title your graph with a descriptive name, such as “SNS Message Filtering”
After you have your graph set up, you may want to copy the graph link for bookmarking, emailing, or sharing with co-workers. You may also want to add your graph to a CloudWatch dashboard for easy access in the future. Both actions are available to you on the Actions menu, which is found above the graph.
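If you prefer to retrieve these metrics programmatically rather than through the console, a minimal sketch with the AWS SDK for Python (boto3), assuming a hypothetical topic name, might look like this:

from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")

# Sum the filtering metrics for one topic over the last 24 hours, in 1-hour buckets.
for metric in ("NumberOfMessagesPublished",
               "NumberOfNotificationsDelivered",
               "NumberOfNotificationsFilteredOut"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/SNS",
        MetricName=metric,
        Dimensions=[{"Name": "TopicName", "Value": "MyTopic"}],
        StartTime=datetime.utcnow() - timedelta(days=1),
        EndTime=datetime.utcnow(),
        Period=3600,
        Statistics=["Sum"],
    )
    total = sum(point["Sum"] for point in stats["Datapoints"])
    print(f"{metric}: {total:.0f}")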
Summary
SNS message filtering defines how SNS topics behave in terms of message delivery. By using CloudWatch metrics, you gain visibility into the number of messages published, delivered, and filtered out. This enables you to validate the operation of filter policies and more easily troubleshoot during development phases.
SNS message filtering can be implemented easily with existing AWS SDKs by applying message and subscription attributes across all SNS supported protocols (Amazon SQS, AWS Lambda, HTTP, SMS, email, and mobile push). CloudWatch metrics for SNS message filtering are available now, in all AWS Regions.
Contributed by Shea Lutton, AWS Cloud Infrastructure Architect
Amazon Simple Queue Service (Amazon SQS) is a fully managed queuing service that helps decouple applications, distributed systems, and microservices to increase fault tolerance. SQS queues come in two distinct types:
Standard SQS queues are able to scale to enormous throughput with at-least-once delivery.
FIFO queues are designed to guarantee that messages are processed exactly once in the exact order that they are received and have a default rate of 300 transactions per second.
As customers explore SQS FIFO queues, they often have questions about how the behavior works when messages arrive and are consumed. This post walks through some common situations to identify the exact behavior that you can expect. It also covers the behavior of message groups in depth and explains why message groups are key to understanding how FIFO queues work.
The simple case
Suppose that you run a major auction platform where people buy and sell a wide range of products. Your platform requires that transactions from buyers and sellers get processed in exactly the order received. Here’s how a FIFO queue helps you keep all your transactions in one straight flow.
A seller is currently holding an auction for a laptop, and three different bids are received for the same price. Ties are awarded to the first bidder at that price, so it is important to track which arrived first. Your auction platform receives the three bids and sends them to a FIFO queue before they are processed.
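As a rough sketch of that producer side with the AWS SDK for Python (boto3), assuming a hypothetical queue name and bid payloads, the platform might send the bids like this:

import boto3

sqs = boto3.client("sqs")

# FIFO queue names must end in ".fifo". Content-based deduplication lets SQS
# derive the deduplication ID from a hash of the message body.
queue_url = sqs.create_queue(
    QueueName="auction-bids.fifo",
    Attributes={"FifoQueue": "true", "ContentBasedDeduplication": "true"},
)["QueueUrl"]

# Three tied bids for the laptop auction, sent in arrival order.
for bid_id in ("A1", "A2", "A3"):
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=f"bid {bid_id} for laptop auction",
        MessageGroupId="auction-A",  # all bids for the same auction share a group
    )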
Now observe how messages leave the queue. When your consumer asks for a batch of up to 10 messages, SQS starts filling the batch with the oldest message (bid A1). It keeps filling until either the batch is full or the queue is empty. In this case, the batch contains the three messages and the queue is now empty. After a batch has left the queue, SQS considers that batch of messages to be “in-flight” until the consumer either deletes them or the batch’s visibility timer expires.
When you have a single consumer, this is easy to envision. The consumer gets a batch of messages (now in-flight), does its processing, and deletes the messages. That consumer is then ready to ask for the next batch of messages.
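A single consumer’s receive-process-delete loop, sketched with the AWS SDK for Python (the queue URL and process_bid function are placeholders), might look like the following:

import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/auction-bids.fifo"  # hypothetical

def process_bid(body):
    print("processing", body)  # placeholder for real bid handling

while True:
    # Ask for a batch of up to 10 messages; they become in-flight once returned.
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=10,
    )
    messages = response.get("Messages", [])
    if not messages:
        break
    for message in messages:
        process_bid(message["Body"])
    # Delete the whole batch so SQS will release the next batch of messages.
    sqs.delete_message_batch(
        QueueUrl=queue_url,
        Entries=[{"Id": m["MessageId"], "ReceiptHandle": m["ReceiptHandle"]}
                 for m in messages],
    )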
The critical thing to keep in mind is that SQS won’t release the next batch of messages until the first batch has been deleted. By adding more messages to the queue, you can see more interesting behaviors. Imagine that a burst of 11 bids is sent to your FIFO queue, with two bids for Auction A arriving last.
The FIFO queue now has at least two batches of messages in it. When your single consumer requests the first batch of 10 messages, it receives a batch starting with B1 and ending with A1. Later, after the first batch has been deleted, the consumer can get the second batch of messages containing the final A2 message from the queue.
Adding complexity with multiple message groups
A new challenge arises. Your auction platform is getting busier and your dev team added a number of new features. The combination of increased messages and extra processing time for the new features means that a single consumer is too slow. The solution is to scale to have more consumers and process messages in parallel.
To work in parallel, your team realized that only the messages related to a single auction must be kept in order. All transactions for Auction A need to be kept in order and so do all transactions for Auction B. But the two auctions are independent, and it does not matter which auction’s transactions are processed first.
FIFO can handle that case with a feature called message groups. Each transaction related to Auction A is placed by your producer into message group A, and so on. In the diagram below, Auction A and Auction B each received three bid transactions, with bid B1 arriving first. The FIFO queue always keeps transactions within a message group in the order in which they arrived.
How is this any different than earlier examples? The consumer now gets the messages ordered by message groups, all the B group messages followed by all the A group messages. Multiple message groups create the possibility of using multiple consumers, which I explain in a moment. If FIFO can’t fill up a batch of messages with a single message group, FIFO can place more than one message group in a batch of messages. But whenever possible, the queue gives you a full batch of messages from the same group.
The order of messages leaving a FIFO queue is governed by three rules:
Return the oldest message where no other message in the same message group is currently in-flight.
Return as many messages from the same message group as possible.
If a message batch is still not full, go back to rule 1.
To see this behavior, add a second consumer and insert many more messages into the queue. For simplicity, the delete message action has been omitted in these diagrams but it is assumed that all messages in a batch are processed successfully by the consumer and the batch is properly deleted immediately after.
In this example, there are 11 Group A and 11 Group B transactions arriving in interleaved order and a second consumer has been added. Consumer 1 asks for a group of 10 messages and receives 10 Group A messages. Consumer 2 then asks for 10 messages but SQS knows that Group A is in flight, so it releases 10 Group B messages. The two consumers are now processing two batches of messages in parallel, speeding up throughput and then deleting their batches. When Consumer 1 requests the next batch of messages, it receives the remaining two messages, one from Group A and one from Group B.
Consider this nuanced detail from the example above. What would happen if Consumer 1 was on a faster server and processed its first batch of messages before Consumer 2 could mark its messages for deletion? See if you can predict the behavior before looking at the answer.
If Consumer 2 has not deleted its Group B messages yet when Consumer 1 asks for the next batch, then the FIFO queue considers Group B to still be in flight. It does not release any more Group B messages. Consumer 1 gets only the remaining Group A message. Later, after Consumer 2 has deleted its first batch, the remaining Group B message is released.
Conclusion
I hope this post answered your questions about how Amazon SQS FIFO queues work and why message groups are helpful. If you’re interested in exploring SQS FIFO queues further, here are a few ideas to get you started:
Create an Amazon SQS FIFO queue with three simple commands in the SQS console
This post courtesy of Massimiliano Angelino, AWS Solutions Architect
Different enterprise systems—ERP, CRM, BI, HR, etc.—need to exchange information but normally cannot do that natively because they are from different vendors. Enterprises have tried multiple ways to integrate heterogeneous systems, generally referred to as enterprise application integration (EAI).
Modern EAI systems are based on a message-oriented middleware (MoM), also known as enterprise service bus (ESB). An ESB provides data communication via a message bus, on top of which it also provides components to orchestrate, route, translate, and monitor the data exchange. Communication with the ESB is done via adapters or connectors provided by the ESB. In this way, the different applications do not have to have specific knowledge of the technology used to provide the integration.
Amazon MQ used with Apache Camel is an open-source alternative to commercial ESBs. With the launch of Amazon MQ, integration between on-premises applications and cloud services becomes much simpler. Amazon MQ provides a managed message broker service, currently supporting Apache ActiveMQ 5.15.0.
In this post, I show how a simple integration between Amazon MQ and other AWS services can be achieved by using Apache Camel.
Apache Camel provides built-in connectors for integration with a wide variety of AWS services such as Amazon MQ, Amazon SQS, Amazon SNS, Amazon SWF, Amazon S3, AWS Lambda, Amazon DynamoDB, AWS Elastic Beanstalk, and Amazon Kinesis Streams. It also provides a broad range of other connectors including Cassandra, JDBC, Spark, and even Facebook and Slack.
EAI system architecture
Different applications use different data formats, hence the need for a translation/transformation service. Such services can be provided to or from a common “normalized” format, or specifically between two applications.
The use of normalized formats simplifies the integration process when multiple applications need to share the same data: each application needs only a conversion to and from the common format, so the number of conversions grows with the number of applications (N) rather than with the number of application pairs (on the order of N² for point-to-point integration). This comes at the cost of a more complex adaptation to a common format, which must cover the needs of all the different applications, current and future.
Another characteristic of an EAI system is the support of distributed transactions to ensure data consistency across multiple applications.
EAI system architecture is normally composed of the following components:
A centralized broker that handles security, access control, and data communications. Amazon MQ provides these features through its support of multiple transport protocols (AMQP, OpenWire, MQTT, WebSocket), security (all communications are encrypted via SSL), and granular, per-destination access control.
An independent data model, also known as the canonical data model. XML is the de facto standard for the data representation.
Connectors/agents that allow the applications to communicate with the broker.
A system model to allow a standardized way for all components to interface with the EAI. Java Message Service (JMS) and Windows Communication Foundation (WCF) are standard APIs to interact with constructs such as queues and topics to implement the different messaging patterns.
Walkthrough
This solution walks you through the following steps:
Creating the broker
Writing a simple application
Adding the dependencies
Triaging files into S3
Writing the Camel route
Sending files to the AMQP queue
Setting up AMQP
Testing the code
Creating the broker
To create a new broker, log in to your AWS account and choose Amazon MQ. Amazon MQ is currently available in six AWS Regions:
US East (N. Virginia)
US East (Ohio)
US West (Oregon)
EU (Ireland)
EU (Frankfurt)
Asia Pacific (Sydney)
Make sure that you have selected one of these Regions.
The master user name and password are used to access the monitoring console of the broker and can also be used to authenticate clients when they connect to the broker. I recommend creating separate users, without console access, to authenticate the clients to the broker after it has been created.
For this example, create a single broker without failover. If your application requires a higher availability level, select the Create standby in a different zone check box. If the primary broker instance fails, the standby takes over within seconds. To make the client aware of the standby, use the failover:// protocol in the connection configuration, pointing to both broker endpoints.
Leave the other settings as is. The broker takes a few minutes to be created. After it’s done, you can see the list of endpoints available for the different protocols.
After the broker has been created, modify the security group to add the allowed ports and sources for access.
For this example, you need access to the ActiveMQ admin page and to AMQP. Open up ports 8162 and 5671 to the public address of your laptop.
You can also create a new user for programmatic access to the broker. In the Users section, choose Create User and add a new user named sdk.
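If you prefer to create the user programmatically, a minimal sketch with the AWS SDK for Python (boto3) follows; the broker ID and password below are placeholders:

import boto3

mq = boto3.client("mq")

# Create an application user without console access, for client connections only.
mq.create_user(
    BrokerId="b-1234abcd-56ef-78gh-90ij-klmnopqrstuv",  # hypothetical broker ID
    Username="sdk",
    Password="example-password-change-me",
    ConsoleAccess=False,
)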
Writing a simple application
The complete code for this walkthrough is available from the aws-amazonmq-apachecamel-sample GitHub repo. Clone the repository on your local machine to have the fully functional example. The rest of this post offers step-by-step instructions to build this solution.
To write the application, use Apache Maven and the Camel archetypes provided by Maven. If you do not have Apache Maven installed on your machine, you can follow the instructions at Installing Apache Maven.
From a terminal, run the following command:
mvn archetype:generate
You get a list of archetypes. Type camel to list only the Camel-related archetypes. In this case, use the java8 example and type the following:
Maven now generates the skeleton code in a folder named after the artifactId. In this case:
camel-aws-simple
Next, test that the environment is configured correctly to run Camel. At the prompt, run the following commands:
cd camel-aws-simple
mvn install
mvn exec:java
You should see a log appearing in the console, printing the following:
[INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ camel-aws-test ---
[ com.angmas.MainApp.main()] DefaultCamelContext INFO Apache Camel 2.20.1 (CamelContext: camel-1) is starting
[ com.angmas.MainApp.main()] ManagedManagementStrategy INFO JMX is enabled
[ com.angmas.MainApp.main()] DefaultTypeConverter INFO Type converters loaded (core: 192, classpath: 0)
[ com.angmas.MainApp.main()] DefaultCamelContext INFO StreamCaching is not in use. If using streams then its recommended to enable stream caching. See more details at http://camel.apache.org/stream-caching.html
[ com.angmas.MainApp.main()] DefaultCamelContext INFO Route: route1 started and consuming from: timer://simple?period=1000
[ com.angmas.MainApp.main()] DefaultCamelContext INFO Total 1 routes, of which 1 are started
[ com.angmas.MainApp.main()] DefaultCamelContext INFO Apache Camel 2.20.1 (CamelContext: camel-1) started in 0.419 seconds
[-1) thread #2 - timer://simple] route1 INFO Got a String body
[-1) thread #2 - timer://simple] route1 INFO Got an Integer body
[-1) thread #2 - timer://simple] route1 INFO Got a Double body
[-1) thread #2 - timer://simple] route1 INFO Got a String body
[-1) thread #2 - timer://simple] route1 INFO Got an Integer body
[-1) thread #2 - timer://simple] route1 INFO Got a Double body
[-1) thread #2 - timer://simple] route1 INFO Got a String body
[-1) thread #2 - timer://simple] route1 INFO Got an Integer body
[-1) thread #2 - timer://simple] route1 INFO Got a Double body
Adding the dependencies
Now that you have verified that the sample works, modify it to add the dependencies needed to interface with Amazon MQ/ActiveMQ and AWS.
For the following steps, you can use a normal text editor, such as vi, Sublime Text, or Visual Studio Code. Or, open the Maven project in an IDE such as Eclipse or IntelliJ IDEA.
Open pom.xml and add the following lines inside the <dependencies> tag:
The camel-aws component takes care of the interface with the supported AWS services without requiring any in-depth knowledge of the AWS Java SDK. For more information, see Camel Components for Amazon Web Services.
Triaging files into S3
Write a Camel component that receives files as the payload of messages in a queue and writes them to an S3 bucket with different prefixes depending on the extension.
Because the broker that you created is exposed via a public IP address, you can execute the code from anywhere with an internet connection that allows communication on the required ports. In this example, run the code from your own laptop. A broker can also be created without a public IP address, in which case it is only accessible from inside the VPC in which it was created, or from any peered VPC or network connected via a virtual gateway (VPN or AWS Direct Connect).
First, look at the code created by Maven. The chosen archetype created a standalone Camel context run via the helper org.apache.camel.main.Main class. This provides an easy way to run Camel routes from an IDE or the command line without needing to deploy them inside a container. Apache Camel can also be run as an OSGi module, or as a Spring or Spring Boot bean.
package com.angmas;
import org.apache.camel.main.Main;
/**
* A Camel Application
*/
public class MainApp {
/**
* A main() so you can easily run these routing rules in your IDE
*/
public static void main(String... args) throws Exception {
Main main = new Main();
main.addRouteBuilder(new MyRouteBuilder());
main.run(args);
}
}
The main method instantiates the Camel Main helper class and the routes, and runs the Camel application. The MyRouteBuilder class creates a route using Java DSL. It is also possible to define routes in Spring XML and load them dynamically in the code.
public void configure() {
// this sample sets a random body then performs content-based
// routing on the message using method references
from("timer:simple?period=1000")
.process()
.message(m -> m.setHeader("index", index++ % 3))
.transform()
.message(this::randomBody)
.choice()
.when()
.body(String.class::isInstance)
.log("Got a String body")
.when()
.body(Integer.class::isInstance)
.log("Got an Integer body")
.when()
.body(Double.class::isInstance)
.log("Got a Double body")
.otherwise()
.log("Other type message");
}
Writing the Camel route
Replace the existing route with one that fetches messages from Amazon MQ over AMQP and routes the content to different prefixes in an S3 bucket depending on the file name extension. The new route:
Reads messages from the AMQP queue named filequeue.
Processes the message and sets a new ext header using the setExtensionHeader method (see below).
Checks the value of the ext header and writes the body of the message as an object in an S3 bucket, using different key prefixes and retaining the original name of the file.
The Amazon S3 component is configured with the bucket name, and a reference to an S3 client (amazonS3client=#s3Client) that you added to the Camel registry in the Main method of the app. Adding the object to the Camel registry allows Camel to find the object at runtime. Even though you could pass the region, accessKey, and secretKey parameters directly in the component URI, this way is more secure. It can make use of EC2 instance roles, so that you never need to pass the secrets.
Sending files to the AMQP queue
To send the files to the AMQP queue for testing, add another Camel route. In a real scenario, the messages to the AMQP queue are generated by another client. You are going to create a new route builder, but you could also add this route inside the existing MyRouteBuilder.
package com.angmas;
import org.apache.camel.builder.RouteBuilder;
/**
* A Camel Java8 DSL Router
*/
public class MessageProducerBuilder extends RouteBuilder {
/**
* Configure the Camel routing rules using Java code...
*/
public void configure() {
from("file://input?delete=false&noop=true")
.log("Content ${body} ${headers.CamelFileName}")
.to("amqp:filequeue");
}
}
The code reads files from the input folder in the work directory and publishes them to the queue. The route builder is added in the main class:
By default, Camel tries to connect to a local AMQP broker. Configure it to connect to your Amazon MQ broker.
Create an AMQPConnectionDetails object that is configured to connect to the Amazon MQ broker with SSL, and pass the user name and password that you set on the broker. Adding the object to the Camel registry allows Camel to find it at runtime and use it as the default connection to AMQP.
public class MainApp {
public static String BROKER_URL = System.getenv("BROKER_URL");
public static String AMQP_URL = "amqps://"+BROKER_URL+":5671";
public static String BROKER_USERNAME = System.getenv("BROKER_USERNAME");
public static String BROKER_PASSWORD = System.getenv("BROKER_PASSWORD");
/**
* A main() so you can easily run these routing rules in your IDE
*/
public static void main(String... args) throws Exception {
Main main = new Main();
main.bind("amqp", getAMQPconnection());
main.bind("s3Client", AmazonS3ClientBuilder.standard().withRegion(Regions.US_EAST_1).build());
main.addRouteBuilder(new MyRouteBuilder());
main.addRouteBuilder(new MessageProducerBuilder());
main.run(args);
}
public static AMQPConnectionDetails getAMQPconnection() {
return new AMQPConnectionDetails(AMQP_URL, BROKER_USERNAME, BROKER_PASSWORD);
}
}
The AMQP_URL uses the amqps scheme, which indicates that you are using SSL. You then add the component to the registry, and Camel finds it by matching the class type: main.bind("amqp", getAMQPconnection());
Testing the code
Create an input folder in the project root, and create a few files with different extensions, such as txt, html, and csv.
Set the different environment variables required by the code, either in the shell or in your IDE as execution configuration.
If you are running the example from an EC2 instance, ensure that the EC2 instance role has read permission on the S3 bucket.
If you are running this on your laptop, ensure that you have configured the AWS credentials in the environment, for example, by using the aws configure command.
From the command line, execute the code:
mvn exec:java
If you are using an IDE, execute the main class. Camel outputs logging information and you should see messages listing the content and names of the files in the input folder.
Keep adding some more files to the input folder. You see that they are triaged in S3 a few seconds later. You can open the S3 console to check that they have been created.
To stop Camel, press CTRL+C in the shell.
Conclusion
In this post, I showed you how to create a publicly accessible Amazon MQ broker, and how to use Apache Camel to easily integrate AWS services with the broker. In the example, you created a Camel route that reads messages containing files from the AMQP queue and triages them by file extension into an S3 bucket.
Camel supports several components and provides blueprints for several enterprise integration patterns. Used in combination with Amazon MQ, it provides a powerful and flexible solution to extend traditional enterprise solutions to the AWS Cloud, and integrate them seamlessly with cloud-native services, such as Amazon S3, Amazon SNS, Amazon SQS, Amazon CloudWatch, and AWS Lambda.
To learn more, see the Amazon MQ website. You can try Amazon MQ for free with the AWS Free Tier, which includes up to 750 hours of a single-instance mq.t2.micro broker and up to 1 GB of storage per month for one year.
We launched AWS Support a full decade ago, with Gold and Silver plans focused on Amazon EC2, Amazon S3, and Amazon SQS. Starting from that initial offering, backed by a small team in Seattle, AWS Support now encompasses thousands of people working from more than 60 locations.
A Quick Look Back
Over the years, that offering has matured and evolved in order to meet the needs of an increasingly diverse base of AWS customers. We aim to support you at every step of your cloud adoption journey, from your initial experiments to the time you deploy mission-critical workloads and applications.
We have worked hard to make our support model helpful and proactive. We do our best to provide you with the tools, alerts, and knowledge that will help you to build systems that are secure, robust, and dependable. Here are some of our most recent efforts toward that goal:
Trusted Advisor S3 Bucket Policy Check – AWS Trusted Advisor provides you with five categories of checks and makes recommendations that are designed to improve security and performance. Earlier this year we announced that the S3 Bucket Permissions Check is now free, and available to all AWS users. If you are signed up for the Business or Professional level of AWS Support, you can also monitor this check (and many others) using Amazon CloudWatch Events. You can use this to monitor and secure your buckets without human intervention.
Personal Health Dashboard – This tool provides you with alerts and guidance when AWS is experiencing events that may affect you. You get a personalized view into the performance and availability of the AWS services that underlie your AWS resources. It also generates Amazon CloudWatch Events so that you can initiate automated failover and remediation if necessary.
Well Architected / Cloud Ops Review – We’ve learned a lot about how to architect AWS-powered systems over the years and we want to share everything we know with you! The AWS Well-Architected Framework provides proven, detailed guidance in critical areas including operational excellence, security, reliability, performance efficiency, and cost optimization. You can read the materials online and you can also sign up for the online training course. If you are signed up for Enterprise support, you can also benefit from our Cloud Ops review.
Infrastructure Event Management – If you are launching a new app, kicking off a big migration, or hosting a large-scale event similar to Prime Day, we are ready with guidance and real-time support. Our Infrastructure Event Management team will help you to assess the readiness of your environment and work with you to identify and mitigate risks ahead of time.
To learn more about how AWS customers have used AWS support to realize all of the benefits that I noted above, watch these videos (and find more on the Customer Testimonials page):
The Amazon retail site makes heavy use of AWS. You can read my post, Prime Day 2017 – Powered by AWS, to learn more about the process of preparing to sustain a record-setting amount of traffic and to accept a like number of orders.
Come and Join Us
The AWS Support Team is in continuous hiring mode and we have openings all over the world! Here are a couple of highlights:
This post courtesy of Michael Edge, Sr. Cloud Architect – AWS Professional Services
Applications built using a microservices architecture typically result in a number of independent, loosely coupled microservices communicating with each other, synchronously via their APIs and asynchronously via events. These microservices are often owned by different product teams, and these teams may segregate their resources into different AWS accounts for reasons that include security, billing, and resource isolation. This can sometimes result in the following challenges:
Cross-account deployment: A single pipeline must deploy a microservice into multiple accounts; for example, a microservice must be deployed to environments such as DEV, QA, and PROD, all in separate accounts.
Cross-account lookup: During deployment, a resource deployed into one AWS account may need to refer to a resource deployed in another AWS account.
Cross-account communication: Microservices executing in one AWS account may need to communicate with microservices executing in another AWS account.
In this post, I look at ways to address these challenges using a sample application composed of a web application supported by two serverless microservices. The microservices are owned by different product teams and deployed into different accounts using AWS CodePipeline, AWS CloudFormation, and the Serverless Application Model (SAM). At runtime, the microservices communicate using an event-driven architecture that requires asynchronous, cross-account communication via an Amazon Simple Notification Service (Amazon SNS) topic.
Sample application
First, look at the sample application I use to demonstrate these concepts. In the overview diagram below, you can see the following:
The entire application consists of three main services:
A Booking microservice, owned by the Booking account.
An Airmiles microservice, owned by the Airmiles account.
A web application that uses the services exposed by both microservices, owned by the Web Channel account.
The Booking microservice creates flight bookings and publishes booking events to an SNS topic.
The Airmiles microservice consumes booking events from the SNS topic and uses the booking event to calculate the airmiles associated with the flight booking. It also supports querying airmiles for a specific flight booking.
The web application allows an end user to make flight bookings, view flights bookings, and view the airmiles associated with a flight booking.
In the sample application, the Booking and Airmiles microservices are implemented using AWS Lambda. Together with Amazon API Gateway, Amazon DynamoDB, and SNS, the sample application is completely serverless.
Sample Serverless microservices application
The typical booking flow would be triggered by an end user making a flight booking using the web application, which invokes the Booking microservice via its REST API. The Booking microservice persists the flight booking and publishes the booking event to an SNS topic to enable sharing of the booking with other interested consumers. In this sample application, the Airmiles microservice subscribes to the SNS topic and consumes the booking event, using the booking information to calculate the airmiles. In line with microservices best practices, both the Booking and Airmiles microservices store their information in their own DynamoDB tables, and expose an API (via API Gateway) that is used by the web application.
Setup
Before you delve into the details of the sample application, get the source code and deploy it.
Cross-account deployment of Lambda functions using CodePipeline has been previously discussed by my colleague Anuj Sharma in his post, Building a Secure Cross-Account Continuous Delivery Pipeline. This sample application builds upon the solution proposed by Anuj, using some of the same scripts and a similar account structure. To make it feasible for you to deploy the sample application, I’ve reduced the number of accounts needed down to three accounts by consolidating some of the services. In the following diagram, you can see the services used by the sample application require three accounts:
Tools: A central location for the continuous delivery/deployment services such as CodePipeline and AWS CodeBuild. To reduce the number of accounts required by the sample application, also deploy the AWS CodeCommit repositories here, though typically they may belong in a separate Dev account.
Booking: Account for the Booking microservice.
Airmiles: Account for the Airmiles microservice.
Without consolidation, the sample application may require up to 10 accounts: one for Tools, and three accounts each for Booking, Airmiles and Web Application (to support the DEV, QA, and PROD environments).
Account structure for sample application
To follow the rest of this post, clone the repository in step 1 below. To deploy the application on AWS, follow steps 2 and 3:
Follow the instructions in the repository README to build the CodePipeline pipelines and deploy the microservices and web application.
Challenge 1: Cross-account deployment using CodePipeline
Though the Booking pipeline executes in the Tools account, it deploys the Booking Lambda functions into the Booking account.
In the sample application code repository, open the ToolsAcct/code-pipeline.yaml CloudFormation template.
Scroll down to the Pipeline resource and look for the DeployToTest pipeline stage (shown below). There are two AWS Identity and Access Management (IAM) service roles used in this stage that allow cross-account activity. Both of these roles exist in the Booking account:
Under Actions.RoleArn, find the service role assumed by CodePipeline to execute this pipeline stage in the Booking account. The role referred to by the parameter NonProdCodePipelineActionServiceRole allows access to the CodePipeline artifacts in the S3 bucket in the Tools account, and also access to the AWS KMS key needed to encrypt/decrypt the artifacts.
Under Actions.Configuration.RoleArn, find the service role assumed by CloudFormation when it carries out the CHANGE_SET_REPLACE action in the Booking account.
These roles are created in the CloudFormation template NonProdAccount/toolsacct-codepipeline-cloudformation-deployer.yaml.
Challenge 2: Cross-account stack lookup using custom resources
Asynchronous communication is a fairly common pattern in microservices architectures. A publishing microservice publishes an event that consumers may be interested in, without any concern for who those consumers may be.
In the case of the sample application, the publisher is the Booking microservice, publishing a flight booking event onto an SNS topic that exists in the Booking account. The consumer is the Airmiles microservice in the Airmiles account. To enable the two microservices to communicate, the Airmiles microservice must look up the ARN of the Booking SNS topic at deployment time in order to set up a subscription to it.
To enable CloudFormation templates to be reused, you do not hardcode resource names in the templates. Because you allow CloudFormation to generate a resource name for the Booking SNS topic, the Airmiles microservice CloudFormation template must look up the SNS topic name at stack creation time. It’s not possible to use cross-stack references, as these can’t be used across different accounts.
However, you can use a CloudFormation custom resource to achieve the same outcome. This is discussed in the next section. Using a custom resource to look up stack exports in another stack does not create a dependency between the two stacks, unlike cross-stack references. A stack dependency would prevent one stack being deleted if another stack depended upon it. For more information, see Fn::ImportValue.
Using a Lambda function as a custom resource to look up the booking SNS topic
The sample application uses a Lambda function as a custom resource. This approach is discussed in AWS Lambda-backed Custom Resources. Walk through the sample application and see how the custom resource is used to list the stack export variables in another account and return these values to the calling AWS CloudFormation stack. Examine each of the following aspects of the custom resource:
Deploying the custom resource.
Custom resource IAM role.
A calling CloudFormation stack uses the custom resource.
The custom resource assumes a role in another account.
The custom resource obtains the stack exports from all stacks in the other account.
The custom resource returns the stack exports to the calling CloudFormation stack.
The custom resource handles all the event types sent by the calling CloudFormation stack.
Step 1: Deploying the custom resource
The custom Lambda function is deployed using SAM, as are the Lambda functions for the Booking and Airmiles microservices. See the CustomLookupExports resource in Custom/custom-lookup-exports.yml.
Step 2: Custom resource IAM role
The CustomLookupExports resource in Custom/custom-lookup-exports.yml executes using the CustomLookupLambdaRole IAM role. This role allows the custom Lambda function to assume a cross account role that is created along with the other cross account roles in the NonProdAccount/toolsacct-codepipeline-cloudformation-deployer.yml. See the resource CustomCrossAccountServiceRole.
Step 3: The CloudFormation stack uses the custom resource
The Airmiles microservice is created by the Airmiles CloudFormation template Airmiles/sam-airmile.yml, a SAM template that uses the custom Lambda resource to look up the ARN of the Booking SNS topic. The custom resource is specified by the CUSTOMLOOKUP resource, and the PostAirmileFunction resource uses the custom resource to look up the ARN of the SNS topic and create a subscription to it.
Because the Airmiles Lambda function is going to subscribe to an SNS topic in another account, it must grant the SNS topic in the Booking account permissions to invoke the Airmiles Lambda function in the Airmiles account whenever a new event is published. Permissions are granted by the LambdaResourcePolicy resource.
Step 4: The custom resource assumes a role in another account
When the Lambda custom resource is invoked by the Airmiles CloudFormation template, it must assume a role in the Booking account (see Step 2) in order to query the stack exports in that account.
This can be seen in Custom/custom-lookup-exports.py, where the AWS Security Token Service (AWS STS) is used to obtain a temporary access key that allows access to resources in the account referred to by the environment variable CUSTOM_CROSS_ACCOUNT_ROLE_ARN. This environment variable is defined in the Custom/custom-lookup-exports.yml CloudFormation template, and refers to the role created in Step 2.
Step 5: The custom resource obtains the stack exports from all stacks in the other account
The function get_exports() in Custom/custom-lookup-exports.py uses the AWS SDK to list all the stack exports in the Booking account.
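A condensed sketch of that pattern with the AWS SDK for Python is shown below. It is not the repository’s actual code, and the session name is an arbitrary placeholder:

import os
import boto3

def get_exports():
    """Assume the cross-account role, then list all stack exports in that account."""
    sts = boto3.client("sts")
    credentials = sts.assume_role(
        RoleArn=os.environ["CUSTOM_CROSS_ACCOUNT_ROLE_ARN"],
        RoleSessionName="cross-account-stack-lookup",
    )["Credentials"]

    cloudformation = boto3.client(
        "cloudformation",
        aws_access_key_id=credentials["AccessKeyId"],
        aws_secret_access_key=credentials["SecretAccessKey"],
        aws_session_token=credentials["SessionToken"],
    )

    exports = {}
    for page in cloudformation.get_paginator("list_exports").paginate():
        for export in page["Exports"]:
            exports[export["Name"]] = export["Value"]
    return exports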
Step 6: The custom resource returns the stack exports to the calling CloudFormation stack
The calling CloudFormation template, Airmiles/sam-airmile.yml, uses the custom resource to look up a stack export with the name of booking-lambda-BookingTopicArn. This is exported by the CloudFormation template Booking/sam-booking.yml.
Step 7: The custom resource handles all the event types sent by the calling CloudFormation stack
Custom resources used in a stack are called whenever the stack is created, updated, or deleted. They must therefore handle CREATE, UPDATE, and DELETE stack events. If your custom resource fails to send a SUCCESS or FAILED notification for each of these stack events, your stack may become stuck in a state such as CREATE_IN_PROGRESS or UPDATE_ROLLBACK_IN_PROGRESS. To handle this cleanly, use the crhelper custom resource helper, which encourages you to handle the CREATE, UPDATE, and DELETE CloudFormation stack events.
Challenge 3: Cross-account SNS subscription
The SNS topic is specified as a resource in the Booking/sam-booking.yml CloudFormation template. To allow an event published to this topic to trigger the Airmiles Lambda function, permissions must be granted on both the Booking SNS topic and the Airmiles Lambda function. This is discussed in the next section.
Permissions on booking SNS topic
SNS topics support resource-based policies, which allow a policy to be attached directly to a resource specifying who can access the resource. This policy can specify which accounts can access a resource, and what actions those accounts are allowed to perform. This is the same approach as used by a small number of AWS services, such as Amazon S3, KMS, Amazon SQS, and Lambda. In Booking/sam-booking.yml, the SNS topic policy allows resources in the Airmiles account (referenced by the parameter NonProdAccount in the following snippet) to subscribe to the Booking SNS topic:
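The template snippet itself is not reproduced here. As a rough, hypothetical illustration of the same cross-account permission expressed with the AWS SDK for Python (the topic ARN and account ID are placeholders), it amounts to something like the following:

import boto3

sns = boto3.client("sns")

# Allow the Airmiles (non-prod) account to subscribe to the Booking topic.
sns.add_permission(
    TopicArn="arn:aws:sns:us-east-1:111111111111:booking-lambda-BookingTopic",  # hypothetical
    Label="AllowAirmilesAccountSubscribe",
    AWSAccountId=["222222222222"],  # hypothetical Airmiles account ID
    ActionName=["Subscribe"],
)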
The Airmiles microservice is created by the Airmiles/sam-airmile.yml CloudFormation template. The template uses SAM to specify the Lambda functions together with their associated API Gateway configurations. SAM hides much of the complexity of deploying Lambda functions from you.
For instance, by adding an ‘Events’ resource to the PostAirmileFunction in the CloudFormation template, the Airmiles Lambda function is triggered by an event published to the Booking SNS topic. SAM creates the SNS subscription, as well as the permissions necessary for the SNS subscription to trigger the Lambda function.
However, the Lambda permissions generated automatically by SAM are not sufficient for cross-account SNS subscription, which means you must specify an additional Lambda permissions resource in the CloudFormation template. LambdaResourcePolicy, in the following snippet, specifies that the Booking SNS topic is allowed to invoke the Lambda function PostAirmileFunction in the Airmiles account.
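Expressed with the AWS SDK for Python instead of CloudFormation (the function name, statement ID, and topic ARN below are placeholders), the extra permission is roughly equivalent to:

import boto3

lambda_client = boto3.client("lambda")

# Let the Booking account's SNS topic invoke the Airmiles function.
lambda_client.add_permission(
    FunctionName="PostAirmileFunction",
    StatementId="AllowBookingTopicInvoke",
    Action="lambda:InvokeFunction",
    Principal="sns.amazonaws.com",
    SourceArn="arn:aws:sns:us-east-1:111111111111:booking-lambda-BookingTopic",  # hypothetical
)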
In this post, I showed you how product teams can use CodePipeline to deploy microservices into different AWS accounts and different environments such as DEV, QA, and PROD. I also showed how, at runtime, microservices in different accounts can communicate securely with each other using an asynchronous, event-driven architecture. This allows product teams to maintain their own AWS accounts for billing and security purposes, while still supporting communication with microservices in other accounts owned by other product teams.
Acknowledgments
Thanks are due to the following people for their help in preparing this post:
Kevin Yung for the great web application used in the sample application
Jay McConnell for crhelper, the custom resource helper used with the CloudFormation custom resources
This blog was contributed by Otavio Ferreira, Software Development Manager for Amazon SNS
Message filtering simplifies the overall pub/sub messaging architecture by offloading message filtering logic from subscribers, as well as message routing logic from publishers. The initial launch of message filtering provided a basic operator that was based on exact string comparison. For more information, see Simplify Your Pub/Sub Messaging with Amazon SNS Message Filtering.
Today, AWS is announcing an additional set of filtering operators that bring even more power and flexibility to your pub/sub messaging use cases.
Message filtering operators
Amazon SNS now supports both numeric and string matching. Specifically, string matching operators allow for exact, prefix, and “anything-but” comparisons, while numeric matching operators allow for exact and range comparisons, as outlined below. Numeric matching operators work for values between -10^9 and +10^9 inclusive, with five digits of accuracy to the right of the decimal point.
Anything-but matching on string values (Blacklisting): Subscription filter policy {"sport": [{"anything-but": "rugby"}]} matches message attributes such as {"sport": "baseball"} and {"sport": "basketball"} and {"sport": "football"} but not {"sport": "rugby"}
Prefix matching on string values: Subscription filter policy {"sport": [{"prefix": "bas"}]} matches message attributes such as {"sport": "baseball"} and {"sport": "basketball"}
Exact matching on numeric values: Subscription filter policy {"balance": [{"numeric": ["=", 301.5]}]} matches message attributes {"balance": 301.500} and {"balance": 3.015e2}
Range matching on numeric values: Subscription filter policy {"balance": [{"numeric": ["<", 0]}]} matches negative numbers only, and {"balance": [{"numeric": [">", 0, "<=", 150]}]} matches any positive number up to 150.
As usual, you may apply the “AND” logic by appending multiple keys in the subscription filter policy, and the “OR” logic by appending multiple values for the same key, as follows:
AND logic: Subscription filter policy {"sport": ["rugby"], "language": ["English"]} matches only messages that carry both attributes {"sport": "rugby"} and {"language": "English"}
OR logic: Subscription filter policy {"sport": ["rugby", "football"]} matches messages that carry either the attribute {"sport": "rugby"} or {"sport": "football"}
Message filtering operators in action
Here’s how this new set of filtering operators works. The following example is based on a pharmaceutical company that develops, produces, and markets a variety of prescription drugs, with research labs located in Asia Pacific and Europe. The company built an internal procurement system to manage the purchasing of lab supplies (for example, chemicals and utensils), office supplies (for example, paper, folders, and markers) and tech supplies (for example, laptops, monitors, and printers) from global suppliers.
This distributed system is composed of the four following subsystems:
A requisition system that presents the catalog of products from suppliers, and takes orders from buyers
An approval system for orders targeted to Asia Pacific labs
Another approval system for orders targeted to European labs
A fulfillment system that integrates with shipping partners
As shown in the following diagram, the company leverages AWS messaging services to integrate these distributed systems.
Firstly, an SNS topic named “Orders” was created to take all orders placed by buyers on the requisition system.
Secondly, two Amazon SQS queues, named “Lab-Orders-AP” and “Lab-Orders-EU” (for Asia Pacific and Europe respectively), were created to backlog orders that are up for review on the approval systems.
Lastly, an SQS queue named “Common-Orders” was created to backlog orders that aren’t related to lab supplies, which can already be picked up by shipping partners on the fulfillment system.
The company also uses AWS Lambda functions to automatically process lab supply orders that don’t require approval or which are invalid.
In this example, because different types of orders have been published to the SNS topic, the subscribing endpoints have had to set advanced filter policies on their SNS subscriptions, to have SNS automatically filter out orders they can’t deal with.
As depicted in the above diagram, the following five filter policies have been created (a sketch of the first one, expressed with the AWS SDK, follows the list):
The SNS subscription that points to the SQS queue “Lab-Orders-AP” sets a filter policy that matches lab supply orders, with a total value greater than $1,000, and that target Asia Pacific labs only. These more expensive transactions require an approver to review orders placed by buyers.
The SNS subscription that points to the SQS queue “Lab-Orders-EU” sets a filter policy that matches lab supply orders, also with a total value greater than $1,000, but that target European labs instead.
The SNS subscription that points to the Lambda function “Lab-Preapproved” sets a filter policy that only matches lab supply orders that aren’t as expensive, up to $1,000, regardless of their target lab location. These orders simply don’t require approval and can be automatically processed.
The SNS subscription that points to the Lambda function “Lab-Cancelled” sets a filter policy that only matches lab supply orders with total value of $0 (zero), regardless of their target lab location. These orders carry no actual items, obviously need neither approval nor fulfillment, and as such can be automatically canceled.
The SNS subscription that points to the SQS queue “Common-Orders” sets a filter policy that blacklists lab supply orders. Hence, this policy matches only office and tech supply orders, which have a more streamlined fulfillment process, and require no approval, regardless of price or target location.
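To make the five policies above concrete, here is roughly what they could look like as filter policy documents. This is an illustrative sketch: the attribute names (category, value, location) are assumptions, since the post doesn’t list the exact keys, but the operators are the string and numeric operators described earlier.
# Illustrative filter policies for the five subscriptions (attribute names assumed).
lab_orders_ap = {
    "category": ["lab-supplies"],
    "value": [{"numeric": [">", 1000]}],
    "location": [{"prefix": "Asia-Pacific-"}],
}
lab_orders_eu = {
    "category": ["lab-supplies"],
    "value": [{"numeric": [">", 1000]}],
    "location": [{"prefix": "Europe-"}],
}
lab_preapproved = {
    "category": ["lab-supplies"],
    "value": [{"numeric": [">", 0, "<=", 1000]}],
}
lab_cancelled = {
    "category": ["lab-supplies"],
    "value": [{"numeric": ["=", 0]}],
}
common_orders = {
    "category": [{"anything-but": ["lab-supplies"]}],
}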
After the company finished building this advanced pub/sub architecture, they were able to launch their internal procurement system and allow buyers to begin placing orders. The diagram above shows six example orders published to the SNS topic. Each order carries message attributes that describe it, and those attributes determine how it is filtered, as follows:
Message #1 is a lab supply order, with a total value of $15,700 and targeting a research lab in Singapore. Because the value is greater than $1,000, and the location “Asia-Pacific-Southeast” matches the prefix “Asia-Pacific-“, this message matches the first SNS subscription and is delivered to SQS queue “Lab-Orders-AP”.
Message #2 is a lab supply order, with a total value of $1,833 and targeting a research lab in Ireland. Because the value is greater than $1,000, and the location “Europe-West” matches the prefix “Europe-“, this message matches the second SNS subscription and is delivered to SQS queue “Lab-Orders-EU”.
Message #3 is a lab supply order, with a total value of $415. Because the value is greater than $0 and less than $1,000, this message matches the third SNS subscription and is delivered to Lambda function “Lab-Preapproved”.
Message #4 is a lab supply order, but with a total value of $0. Therefore, it only matches the fourth SNS subscription, and is delivered to Lambda function “Lab-Cancelled”.
Messages #5 and #6 aren’t lab supply orders actually; one is an office supply order, and the other is a tech supply order. Therefore, they only match the fifth SNS subscription, and are both delivered to SQS queue “Common-Orders”.
Although each message only matched a single subscription, each was tested against the filter policy of every subscription in the topic. Hence, depending on which attributes are set on the incoming message, the message might actually match multiple subscriptions, and multiple deliveries will take place. Also, it is important to bear in mind that subscriptions with no filter policies catch every single message published to the topic, as a blank filter policy equates to a catch-all behavior.
Summary
Amazon SNS allows for both string and numeric filtering operators. As explained in this post, string operators allow for exact, prefix, and “anything-but” comparisons, while numeric operators allow for exact and range comparisons. These advanced filtering operators bring even more power and flexibility to your pub/sub messaging functionality and also allow you to simplify your architecture further by removing even more logic from your subscribers.
Message filtering can be implemented easily with existing AWS SDKs by applying message and subscription attributes across all SNS supported protocols (Amazon SQS, AWS Lambda, HTTP, SMS, email, and mobile push). SNS filtering operators for numeric matching, prefix matching, and blacklisting are available now in all AWS Regions, for no extra charge.
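For example, on the publishing side, attaching attributes to a message with boto3 looks roughly like this; the topic ARN and attribute values are placeholders:
import boto3

sns = boto3.client("sns")

# Attributes travel alongside the message body and are what filter policies match on.
sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:example-topic",
    Message='{"order_id": "1234"}',
    MessageAttributes={
        "category": {"DataType": "String", "StringValue": "lab-supplies"},
        "value": {"DataType": "Number", "StringValue": "415"},
    },
)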
With the explosion in virtual reality (VR) technologies over the past few years, we’ve had an increasing number of customers ask us for advice and best practices around deploying their VR-based products and service offerings on the AWS Cloud. It soon became apparent that while the VR ecosystem is large in both scope and depth of types of workloads (gaming, e-medicine, security analytics, live streaming events, etc.), many of the workloads followed repeatable patterns, with storage and delivery of live and on-demand immersive video at the top of the list.
Looking at consumer trends, the desire for live and on-demand immersive video is fairly self-explanatory. VR has ushered in convenient and low-cost access for consumers and businesses to a wide variety of options for consuming content, ranging from browser playback of live and on-demand 360º video, all the way up to positional tracking systems with a high degree of immersion. All of these scenarios contain one lowest common denominator: video.
Which brings us to the topic of this post. We set out to build a solution that could support both live and on-demand events, bring with it a high degree of scalability, be flexible enough to support transformation of video if required, run at a low cost, and use open-source software wherever possible.
In this post, we describe the reference architecture we created to solve this challenge, using Amazon EC2 Spot Instances, Amazon S3, Elastic Load Balancing, Amazon CloudFront, AWS CloudFormation, and Amazon CloudWatch, with open-source software such as NGINX, FFMPEG, and JavaScript-based client-side playback technologies. We step you through deployment of the solution and how the components work, as well as the capture, processing, and playback of the underlying live and on-demand immersive media streams.
This GitHub repository includes the source code necessary to follow along. We’ve also provided a self-paced workshop, from AWS re:Invent 2017 that breaks down this architecture even further. If you experience any issues or would like to suggest an enhancement, please use the GitHub issue tracker.
Prerequisites
In addition to an AWS account, you’ll need a few components to take best advantage of the infrastructure:
A camera/capture device capable of encoding and streaming RTMP video
A browser to consume the content.
You’re going to generate HTML5-compatible video (Apple HLS to be exact), but there are many other native iOS and Android options for consuming the media that you create. It’s also worth noting that your playback device should support projection of your input stream. We’ll talk more about that in the next section.
How does immersive media work?
At its core, any flavor of media, be it audio or video, can be viewed with some level of immersion. The ability to interact passively or actively with the content brings with it a further level of immersion. When you look at VR devices with rotational and positional tracking, you naturally need more than an ability to interact with a flat plane of video. The challenge for any creative thus becomes a tradeoff between immersion features (degrees of freedom, monoscopic 2D or stereoscopic 3D, resolution, framerate) and overall complexity.
Where can you start that is simple and effective, yet lets you build out a fairly modular solution and test it? There are a few areas where we chose to be prescriptive in our solution.
Source capture from the Ricoh Theta S
First, monoscopic 360-degree video is currently one of the most commonly consumed formats on consumer devices. We explicitly chose to focus on this format, although the infrastructure is not limited to it. More on this later.
Second, most consumer-level cameras that provide live streaming ability, and even many professional rigs, have at least two lenses or cameras. The figure above illustrates a single capture from a Ricoh Theta S in monoscopic 2D. The left image captures 180 degrees of the field of view, and the right image captures the other 180 degrees.
For this post, we chose a typical midlevel camera (the Ricoh Theta S), and used a laptop with open-source software (Open Broadcaster Software) to encode and stream the content. Again, the solution infrastructure is not limited to this particular brand of camera. Any camera or encoder that outputs 360º video and encodes to H264+AAC with an RTMP transport will work.
Third, capturing and streaming multiple camera feeds brings additional requirements around stream synchronization and cost of infrastructure. There is also a requirement to stitch media in real time, which can be CPU and GPU-intensive. Many devices and platforms do this either on the device, or via outboard processing that sits close to the camera location. If you stitch and deliver a single stream, you can save the costs of infrastructure and bitrate/connectivity requirements. We chose to keep these aspects on the encoder side to save on cost and reduce infrastructure complexity.
Last, the most common delivery format that requires little to no processing on the infrastructure side is equirectangular projection, as per the above figure. By stitching and unwrapping the spherical coordinates into a flat plane, you can easily deliver the video exactly as you would with any other live or on-demand stream. The only caveat is that resolution and bit rate are of utmost importance. The higher you can push these (high bit rate @ 4K resolution), the more immersive the experience is for viewers. This is due to the increase in sharpness and reduction of compression artifacts.
Knowing that we would be transcoding potentially at 4K on the source camera, but in a format that could be transmuxed without an encoding penalty on the origin servers, we implemented a pass-through for the highest bit rate, and elected to only transcode lower bitrates. This requires some level of configuration on the source encoder, but saves on cost and infrastructure. Because you can conform the source stream, you may as well take advantage of that!
For this post, we chose not to focus on ways to optimize projection. However, the reference architecture does support this with additional open source components compiled into the FFMPEG toolchain. A number of options are available to this end, such as open source equirectangular to cubic transformation filters. There is a tradeoff, however, in that reprojection implies that all streams must be transcoded.
Processing and origination stack
To get started, we’ve provided a CloudFormation template that you can launch directly into your own AWS account. We quickly review how it works, cover the solution’s components, key features, and processing steps, and examine the main configuration files. Following this, you launch the stack, and then proceed with camera and encoder setup.
Immersive streaming reference architecture
The event encoder publishes the RTMP source to multiple origin elastic IP addresses for packaging into the HLS adaptive bitrate.
The client requests the live stream through the CloudFront CDN.
The origin responds with the appropriate HLS stream.
The edge fleet caches media requests from clients and elastically scales across both Availability Zones to meet peak demand.
CloudFront caches media at local edge PoPs to improve performance for users and reduce the origin load.
When the live event is finished, the VOD asset is published to S3. An S3 event is then published to SQS.
The encoding fleet reads messages from the SQS queue, processes the VOD clips, and stores them in the S3 bucket.
How it works
A camera captures content, and with the help of a contribution encoder, publishes a live stream in equirectangular format. The stream is encoded at a high bit rate (at least 2.5 Mbps, but typically 16+ Mbps for 4K) using H264 video and AAC audio compression codecs, and delivered to a primary origin via the RTMP protocol. Streams may transit over the internet or dedicated links to the origins. Typically, for live events in the field, internet or bonded cellular are the most widely used.
The encoder is typically configured to push the live stream to a primary URI, with the ability (depending on the source encoding software/hardware) to roll over to a backup publishing point origin if the primary fails. Because you run across multiple Availability Zones, this architecture could handle an entire zone outage with minor disruption to live events. The primary and backup origins handle the ingestion of the live stream as well as transcoding to H264+AAC-based adaptive bit rate sets. After transcode, they package the streams into HLS for delivery and create a master-level manifest that references all adaptive bit rates.
The edge cache fleet pulls segments and manifests from the active origin on demand, and supports failover from primary to backup if the primary origin fails. By adding this caching tier, you effectively separate the encoding backend tier from the cache tier that responds to client or CDN requests. In addition to origin protection, this separation allows you to independently monitor, configure, and scale these components.
Viewers can use the sample HTML5 player (or a compatible desktop, iOS, or Android application) to view the streams. Navigation in the 360-degree view is handled either natively via the device’s gyroscope, positionally via more advanced devices such as a head-mounted display, or via mouse drag on the desktop. Adaptive bit rate is key here, as this allows you to target multiple device types, giving the player on each device the option of selecting an optimum stream based on network conditions or device profile.
Solution components
When you deploy the CloudFormation template, all the architecture services referenced above are created and launched. This includes:
The compute tier running on Spot Instances for the corresponding components:
the primary and backup ingest origins
the edge cache fleet
the transcoding fleet
the test source
The CloudFront distribution
S3 buckets for storage of on-demand VOD assets
An Application Load Balancer for load balancing the service
An Amazon ECS cluster and container for the test source
The template also provisions the underlying dependencies:
A VPC
Security groups
IAM policies and roles
Elastic network interfaces
Elastic IP addresses
The edge cache fleet instances need some way to discover the primary and backup origin locations. You use elastic network interfaces and elastic IP addresses for this purpose.
As each component of the infrastructure is provisioned, software required to transcode and process the streams across the Spot Instances is automatically deployed. This includes NGiNX-RTMP for ingest of live streams, FFMPEG for transcoding, NGINX for serving, and helper scripts to handle various tasks (potential Spot Instance interruptions, queueing, moving content to S3). Metrics and logs are available through CloudWatch and you can manage the deployment using the CloudFormation console or AWS CLI.
Key features include:
Live and video-on-demand recording
You’re supporting both live and on-demand. On-demand content is created automatically when the encoder stops publishing to the origin.
Cost-optimization and operating at scale using Spot Instances
Spot Instances are used exclusively for infrastructure to optimize cost and scale throughput.
Midtier caching
To protect the origin servers, the midtier cache fleet pulls, caches, and delivers to downstream CDNs.
Distribution via CloudFront or multi-CDN
The Application Load Balancer endpoint allows CloudFront or any third-party CDN to source content from the edge fleet and, indirectly, the origin.
FFMPEG + NGINX + NGiNX-RTMP
These three components form the core of the stream ingest, transcode, packaging, and delivery infrastructure, as well as the VOD-processing component for creating transcoded VOD content on-demand.
Simple deployment using a CloudFormation template
All infrastructure can be easily created and modified using CloudFormation.
Prototype player page
To provide an end-to-end experience right away, we’ve included a test player page hosted as a static site on S3. This page uses A-Frame, a cross-platform, open-source framework for building VR experiences in the browser. Though A-Frame provides many features, it’s used here to render a sphere that acts as a 3D canvas for your live stream.
Spot Instance considerations
At this stage, and before we discuss processing, it is important to understand how the architecture operates with Spot Instances.
Spot Instances are spare compute capacity in the AWS Cloud, available to you at steep discounts compared to On-Demand prices. Spot Instances enable you to optimize your costs on the AWS Cloud and scale your application’s throughput up to 10X for the same budget. By selecting Spot Instances, you can save up to 90% compared to On-Demand prices. This greatly reduces the cost of running the solution because, outside of S3 for storage and CloudFront for delivery, the solution depends almost entirely on Spot Instances for its infrastructure.
We also know that customers running events look to deploy streaming infrastructure at the lowest price point, so it makes sense to take advantage of it wherever possible. A potential challenge when using Spot Instances for live streaming and on-demand processing is that you need to proactively deal with potential Spot Instance interruptions. How can you best deal with this?
First, the origin is deployed in a primary/backup deployment. If a Spot Instance interruption happens on the primary origin, you can fail over to the backup with a brief interruption. Should a potential interruption not be acceptable, then either Reserved Instances or On-Demand options (or a combination) can be used at this tier.
Second, the edge cache fleet runs a job (started automatically at system boot) that periodically queries the local instance metadata to detect if an interruption is scheduled to occur. Spot Instance Interruption Notices provide a two-minute warning of a pending interruption. If you poll every 5 seconds, you have almost 2 full minutes to detach from the Load Balancer and drain or stop any traffic directed to your instance.
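A minimal sketch of that polling job, using the instance metadata service; the drain action itself is left as a placeholder:
import time
import urllib.error
import urllib.request

# The spot/instance-action document only exists once an interruption is scheduled;
# until then the metadata service returns a 404 for this path.
METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_scheduled():
    try:
        with urllib.request.urlopen(METADATA_URL, timeout=2) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        return False

while True:
    if interruption_scheduled():
        # Placeholder: deregister this instance from the load balancer and drain traffic.
        print("Spot interruption scheduled; draining instance")
        break
    time.sleep(5)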
Lastly, use an SQS queue when transcoding. If a transcode for a Spot Instance is interrupted, the stale item falls back into the SQS queue and is eventually re-surfaced into the processing pipeline. Only remove items from the queue after the transcoded files have been successfully moved to the destination S3 bucket.
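Sketched with boto3, the worker loop looks roughly like this; the queue URL is a placeholder and transcode_and_upload stands in for the real FFMPEG work and S3 copy:
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/vod-jobs"  # placeholder

def transcode_and_upload(message_body):
    # Placeholder: pull the recording from the ingest bucket, transcode with FFMPEG,
    # and push the adaptive bit rate renditions to the egress bucket. Raise on failure.
    pass

while True:
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
    )
    for message in response.get("Messages", []):
        try:
            transcode_and_upload(message["Body"])
        except Exception:
            # Leave the message on the queue; once its visibility timeout expires,
            # it becomes visible again and another worker re-processes it.
            continue
        # Delete only after the transcoded files have reached S3 successfully.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])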
Processing
As discussed in the previous sections, you pass through the highest bit rate video rather than transcoding it, which avoids having to increase the instance size to transcode 4K or similarly high bit rate or high resolution content.
We’ve selected a handful of bitrates for the adaptive bit rate stack. You can customize any of these to suit the requirements for your event. The default ABR stack includes:
2160p (4K)
1080p
540p
480p
These can be modified by editing the /etc/nginx/rtmp.d/rtmp.conf NGINX configuration file on the origin or the CloudFormation template.
It’s important to understand where and how streams are transcoded. When the source high bit rate stream enters the primary or backup origin at the /live RTMP application entry point, it is recorded on stop and start of publishing. On completion, it is moved to S3 by a cleanup script, and a message is placed in your SQS queue for workers to use. These workers transcode the media and push it to a playout location bucket.
This solution uses Spot Fleet with automatic scaling to drive the fleet size, which you can customize based on CloudWatch metrics, such as simple utilization metrics. Why use Spot Instances for the transcode option instead of Amazon Elastic Transcoder? This allows you to implement reprojection of the input stream via FFMPEG filters in the future.
The origins handle all the heavy live streaming work. Edges only store and forward the segments and manifests, and provide scaling plus reduction of burden on the origin. This lets you customize the origin to the right compute capacity without having to rely on a ‘high watermark’ for compute sizing, thus saving additional costs.
Loopback is an important concept for the live origins. The incoming stream entering /live is transcoded by FFMPEG to multiple bit rates, which are streamed back to the same host via RTMP, on a secondary publishing point /show. The secondary publishing point is transparent to the user and encoder, but handles HLS segment generation and cleanup, and keeps a sliding window of live segments and constantly updating manifests.
Configuration
Our solution provides two key points of configuration for customizing ingest, recording, transcoding, and delivery, all controlled via origin and edge configuration files, which are described later. In addition, a number of job scripts run on the instances to provide hooks into Spot Instance interruption events and the VOD SQS-based processing queue.
Origin instances
The rtmp.conf excerpt below shows additional parameters that can be customized, such as maximum recording file size in kilobytes, HLS fragment length, and playlist size. We’ve created these in accordance with general industry best practices to ensure the reliable streaming and delivery of your content.
rtmp {
    server {
        listen 1935;
        chunk_size 4000;

        application live {
            live on;
            record all;
            record_path /var/lib/nginx/rec;
            record_max_size 128000K;
            exec_record_done /usr/local/bin/record-postprocess.sh $path $basename;
            exec /usr/local/bin/ffmpeg <…parameters…>;
        }

        application show {
            live on;
            hls on;
            ...
            hls_type live;
            hls_fragment 10s;
            hls_playlist_length 60s;
            ...
        }
    }
}
This exposes a few URL endpoints for debugging and general status. In production, you would most likely turn these off:
/stat provides a statistics endpoint accessible via any standard web browser.
/control enables control of RTMP streams and publishing points.
You also control the TTLs, as previously discussed. It’s important to note here that you are setting TTLs explicitly at the origin, instead of in CloudFront’s distribution configuration. While both are valid, this approach allows you to reconfigure and restart the service on the fly without having to push changes through CloudFront. This is useful for debugging any caching or playback issues.
Helper scripts on the instances tie the recording and VOD processing workflow together:
record-postprocess.sh – Ensures that recorded files on the origin are well-formed, and transfers them to S3 for processing.
ffmpeg.sh – Transcodes content on the encoding fleet, pulling source media from your S3 ingress bucket, based on SQS queue entries, and pushing transcoded adaptive bit rate segments and manifests to your VOD playout egress bucket.
For more details, see the Delivery and Playback section later in this post.
Camera source
With the processing and origination infrastructure running, you need to configure your camera and encoder.
As discussed, we chose to use a Ricoh Theta S camera and Open Broadcaster Software (OBS) to stitch and deliver a stream into the infrastructure. Ricoh provides a free ‘blender’ driver, which allows you to transform, stitch, encode, and deliver both transformed equirectangular video (used for this post) and spherical (two-camera) video. The Theta provides an easy way to get started capturing for under $300, and OBS is a free, open-source software application for capturing and live streaming on a budget. It is quick, cheap, enjoys wide use in the gaming community, and lowers the barrier to getting started with immersive streaming.
While the resolution and bit rate of the Theta may not be 4K, it still provides us with a way to test the functionality of the entire pipeline end to end, without having to invest in a more expensive camera rig. One could also use this type of model to target smaller events, which may involve mobile devices with smaller display profiles, such as phones and potentially smaller sized tablets.
Looking for a more professional solution? Nokia, GoPro, Samsung, and many others have options ranging from $500 to $50,000. This solution is based around the Theta S capabilities, but we’d encourage you to extend it to meet your specific needs.
If your device can support equirectangular RTMP, then it can deliver media through the reference architecture (dependent on instance sizing for higher bit rate sources, of course). If additional features are required such as camera stitching, mixing, or device bonding, we’d recommend exploring a commercial solution such as Teradek Sphere.
Teradek Rig (Teradek)
Ricoh Theta (CNET)
All cameras have varied PC connectivity support. We chose the Ricoh Theta S due to the real-time video connectivity that it provides through software drivers on macOS and PC. If you plan to purchase a camera to use with a PC, confirm that it supports real-time capabilities as a peripheral device.
Encoding and publishing
Now that you have a camera, encoder, and AWS stack running, you can finally publish a live stream.
To start streaming with OBS, configure the source camera and set a publishing point. Use the RTMP application name /live on port 1935 to ingest into the primary origin’s Elastic IP address provided as the CloudFormation output: primaryOriginElasticIp.
You also need to choose a stream name or stream key in OBS. You can use any stream name, but keep the naming short and lowercase, and use only alphanumeric characters. This avoids any parsing issues on client-side player frameworks. There’s no publish point protection in your deployment, so any stream key works with the default NGiNX-RTMP configuration. For more information about stream keys, publishing point security, and extending the NGiNX-RTMP module, see the NGiNX-RTMP Wiki.
You should end up with a configuration similar to the following:
OBS Stream Settings
The Output settings dialog allows us to rescale the Video canvas and encode it for delivery to our AWS infrastructure. In the dialog below, we’ve set the Theta to encode at 5 Mbps in CBR mode using a preset optimized for low CPU utilization. We chose these settings in accordance with best practices for the stream pass-through at the origin for the initial incoming bit rate. You may notice that they largely match the FFMPEG encoding settings we use on the origin – namely constant bit rate, a single audio track, and x264 encoding with the ‘veryfast’ encoding profile.
OBS Output Settings
Live to On-Demand
As you may have noticed, an on-demand component is included in the solution architecture. When talking to customers, one frequent request that we see is that they would like to record the incoming stream with as little effort as possible.
NGINX-RTMP’s recording directives provide an easy way to accomplish this. We record any newly published stream on stream start at the primary or backup origins, using the incoming source stream, which also happens to be the highest bit rate. When the encoder stops broadcasting, NGINX-RTMP executes an exec_record_done script – record-postprocess.sh (described in the Configuration section earlier), which ensures that the content is well-formed, and then moves it to an S3 ingest bucket for processing.
Transcoding content to make it ready for VOD as adaptive bit rate is a multi-step pipeline. First, Spot Instances in the transcoding cluster periodically poll the SQS queue for new jobs. Items on the queue are pulled off on demand by processing instances and transcoded via FFMPEG into adaptive bit rate HLS. This also allows you to extend FFMPEG using filters for cubic and other bitrate-optimizing 360-specific transforms. Finally, transcoded content is moved from the ingest bucket to an egress bucket, making it ready for playback via your CloudFront distribution.
Separate ingest and egress by bucket to provide hard security boundaries between source recordings (which are highest quality and unencrypted), and destination derivatives (which may be lower quality and potentially require encryption). Bucket separation also allows you to order and archive input and output content using different taxonomies, which is common when moving content from an asset management and archival pipeline (the ingest bucket) to a consumer-facing playback pipeline (the egress bucket, and any other attached infrastructure or services, such as CMS, Mobile applications, and so forth).
Because streams are pushed over the internet, there is always the chance that an interruption could occur in the network path, or even at the origin side of the equation (primary to backup roll-over). Both of these scenarios could result in malformed or partial recordings being created. For the best level of reliability, encoding should always be recorded locally on-site as a precaution to deal with potential stream interruptions.
Delivery and playback
With the camera turned on and OBS streaming to AWS, the final step is to play the live stream. We’ve primarily tested the prototype player on the latest Chrome and Firefox browsers on macOS, so your mileage may vary on different browsers or operating systems. For those looking to try the livestream on Google Cardboard, or similar headsets, native apps for iOS (VRPlayer) and Android exist that can play back HLS streams.
The prototype player is hosted in an S3 bucket and can be found from the CloudFormation output clientWebsiteUrl. It requires a stream URL provided as a query parameter ?url=<stream_url> to begin playback. This stream URL is determined by the RTMP stream configuration in OBS. For example, if OBS is publishing to rtmp://x.x.x.x:1935/live/foo, the resulting playback URL would be:
https://<cloudFrontDistribution>/hls/foo.m3u8
Combining the player URL with the playback URL as a query parameter results in a path like this one:
<clientWebsiteUrl>?url=https://<cloudFrontDistribution>/hls/foo.m3u8
To assist in setup/debugging, we’ve provided a test source as part of the CloudFormation template. A color bar pattern with timecode and audio is being generated by FFmpeg running as an ECS task. Much like OBS, FFmpeg is streaming the test pattern to the primary origin over the RTMP protocol. The prototype player and test HLS stream can be accessed by opening the clientTestPatternUrl CloudFormation output link.
Test Stream Playback
What’s next?
In this post, we walked you through the design and implementation of a full end-to-end immersive streaming solution architecture. As you may have noticed, there are a number of areas this could expand into, and we intend to do this in follow-up posts around the topic of virtual reality media workloads in the cloud. We’ve identified a number of topics such as load testing, content protection, client-side metrics and analytics, and CI/CD infrastructure for 24/7 live streams. If you have any requests, please drop us a line.
We would like to extend extra-special thanks to Scott Malkie and Chad Neal for their help and contributions to this post and reference architecture.
The AWS team is always eager to add features that make it easier for you to protect your sensitive data and to help you to achieve your compliance objectives. For example, in 2017 we launched encryption at rest for SQS and EFS, additional encryption options for S3, and server-side encryption of Kinesis Data Streams.
Today we are giving you another data protection option with the introduction of encryption at rest for Amazon DynamoDB. You simply enable encryption when you create a new table and DynamoDB takes care of the rest. Your data (tables, local secondary indexes, and global secondary indexes) will be encrypted using AES-256 and a service-default AWS Key Management Service (KMS) key. The encryption adds no storage overhead and is completely transparent; you can insert, query, scan, and delete items as before. The team did not observe any changes in latency after enabling encryption and running several different workloads on an encrypted DynamoDB table.
Creating an Encrypted Table You can create an encrypted table from the AWS Management Console, API (CreateTable), or CLI (create-table). I’ll use the console! I enter the name and set up the primary key as usual:
Before proceeding, I uncheck Use default settings, scroll down to the Encryption section, and check Enable encryption. Then I click Create and my table is created in encrypted form:
I can see the encryption setting for the table at a glance:
When my compliance team asks me to show them how DynamoDB uses the key to encrypt the data, I can create an AWS CloudTrail trail, insert an item, and then scan the table to see the calls to the AWS KMS API. Here’s an extract from the trail:
Available Now This feature is available now in the US East (N. Virginia), US East (Ohio), US West (Oregon), and EU (Ireland) Regions and you can start using it today.
There’s no charge for the encryption; you will be charged for the calls that DynamoDB makes to AWS KMS on your behalf.
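If you’d rather create the table programmatically than through the console, here’s a minimal sketch using the AWS SDK for Python; the table name and key schema are made up for the example:
import boto3

dynamodb = boto3.client("dynamodb")

# Create a table with encryption at rest enabled (service-default KMS key).
dynamodb.create_table(
    TableName="example-encrypted-table",
    AttributeDefinitions=[{"AttributeName": "id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "id", "KeyType": "HASH"}],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    SSESpecification={"Enabled": True},
)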
I often encounter people experiencing frustration as they attempt to scale their e-commerce or WordPress site—particularly around the cost and complexity related to scaling. When I talk to customers about their scaling plans, they often mention phrases such as horizontal scaling and microservices, but usually people aren’t sure about how to dive in and effectively scale their sites.
Now let’s talk about different scaling options. For instance, if your current workload is in a traditional data center, you can leverage the cloud to augment your on-premises solution. This way you can scale to achieve greater efficiency at less cost. It’s not necessary to set up a whole powerhouse to light a few bulbs. If your workload is already in the cloud, you can use one of the available out-of-the-box options.
Designing your API in microservices and adding horizontal scaling might seem like the best choice, unless your web application is already running in an on-premises environment and you’ll need to quickly scale it because of unexpected large spikes in web traffic.
So how do you handle this situation? Take things one step at a time when scaling, and you may find that horizontal scaling isn’t the right choice after all.
For example, assume you have a tech news website where you did an early-look review of an upcoming—and highly anticipated—smartphone launch, which went viral. The review, a blog post on your website, includes both video and pictures. Comments are enabled for the post and readers can also rate it. If your website is hosted on a traditional Linux server with a LAMP stack, you may find yourself with immediate scaling problems.
Let’s dig into the details of the current scenario:
Where are images and videos stored?
How many read/write requests are received per second? Per minute?
What is the level of security required?
Are these synchronous or asynchronous requests?
We’ll also want to consider the following if your website has a transactional load like e-commerce or banking:
How is the website handling sessions?
Do you have any compliance requirements—like the Payment Card Industry Data Security Standard (PCI DSS)—if your website is using its own payment gateway?
How are you recording customer behavior data and fulfilling your analytics needs?
What are your load balancing considerations (scaling, caching, session maintenance, etc.)?
So, if we take this one step at a time:
Step 1: Ease server load. We need to quickly handle spikes in traffic generated by activity on the blog post, so let’s reduce server load by moving images and video to a third-party content delivery network (CDN). AWS provides Amazon CloudFront as a CDN solution, which is highly scalable with built-in security to verify origin access identity and handle any DDoS attacks. CloudFront can direct traffic to your on-premises or cloud-hosted server with its 113 Points of Presence (102 Edge Locations and 11 Regional Edge Caches) in 56 cities across 24 countries, which provides efficient caching.
Step 2: Reduce read load by adding more read replicas. MySQL provides mirror replication for databases, Oracle has its own replication options, and Amazon RDS provides up to five read replicas that can span the region; Amazon Aurora can have 15 read replicas, with Amazon Aurora Auto Scaling support. If a workload is highly variable, you should consider Amazon Aurora Serverless to achieve high efficiency and reduced cost. While most mirror technologies use asynchronous replication, Amazon RDS can provide synchronous multi-AZ replication, which is good for disaster recovery but not for scalability. Asynchronous replication to a mirror instance means the replicated data can sometimes be stale if network bandwidth is low, so you need to plan and design your application accordingly.
I recommend that you always use a read replica for any reporting needs, and try to move non-critical GET services to a read replica to reduce the load on the master database. In this case, comments associated with a blog post can be fetched from a read replica—as they can tolerate some delay—in case asynchronous replication lags.
Step 3: Reduce write requests. This can be achieved by introducing a queue to process asynchronous messages. Amazon Simple Queue Service (Amazon SQS) is a highly scalable queue, which can handle any kind of work-message load. You can process data, like ratings and reviews, or calculate a Deal Quality Score (DQS) using batch processing via an SQS queue. If your workload is in AWS, I recommend using a job-observer pattern by setting up Auto Scaling to automatically increase or decrease the number of batch servers, using the number of SQS messages, with Amazon CloudWatch, as the trigger. For on-premises workloads, you can use the SQS SDK to create an Amazon SQS queue that holds messages until they’re processed by your stack. Or you can use Amazon SNS to fan out your message processing in parallel for different purposes, like adding a watermark to an image, generating a thumbnail, and so on.
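As a small illustration of queueing a write instead of hitting the database synchronously, the boto3 sketch below enqueues a review for the batch fleet to process later; the queue URL and message shape are made up for the example:
import json
import boto3

sqs = boto3.client("sqs")

# Enqueue the review; the batch servers behind the job-observer pattern pick it up later.
sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/reviews",  # placeholder
    MessageBody=json.dumps({"post_id": 42, "rating": 5, "comment": "Loved the early look!"}),
)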
Step 4: Introduce a more robust caching engine. You can use Amazon ElastiCache for Memcached or Redis to reduce the load on your database. Memcached and Redis have different use cases: if you can afford to lose and then rebuild your cache from your database, use Memcached; if you are looking for more robust data persistence and complex data structures, use Redis. In AWS, these are managed services, which means AWS takes care of the operational work for you, and you can also deploy them on your on-premises instances or use a hybrid approach.
Step 5: Scale your server. If there are still issues, it’s time to scale your server. For the greatest cost-effectiveness and unlimited scalability, I suggest always using horizontal scaling. However, for some use cases, such as databases, vertical scaling may be a better choice until you are ready for sharding; or use Amazon Aurora Serverless for variable workloads. It is wise to use Auto Scaling to manage your workload effectively for horizontal scaling. To achieve that, you also need to persist the session; Amazon DynamoDB can handle session persistence across instances.
If your server is on premises, consider creating a multisite architecture, which will help you achieve quick scalability as required and provide a good disaster recovery solution. You can pick and choose individual services like Amazon Route 53, AWS CloudFormation, Amazon SQS, Amazon SNS, Amazon RDS, etc. depending on your needs.
Your multisite architecture will look like the following diagram:
In this architecture, you can run your regular workload on premises, and use your AWS workload as required for scalability and disaster recovery. Using Route 53, you can direct a precise percentage of users to an AWS workload.
If you decide to move all of your workloads to AWS, the recommended multi-AZ architecture would look like the following:
In this architecture, you are using a multi-AZ distributed workload for high availability. You can have a multi-region setup and use Route 53 to distribute your workload between AWS Regions. CloudFront helps you scale and distribute static content via an S3 bucket, and DynamoDB maintains your application state so that Auto Scaling can apply horizontal scaling without loss of session data. At the database layer, RDS with a multi-AZ standby provides high availability, and read replicas help achieve scalability.
This is a high-level strategy to help you think through the scalability of your workload by using AWS, even if your workload is on premises and not in the cloud…yet.
I highly recommend creating a hybrid, multisite model by placing a replica of your on-premises environment in a public cloud like AWS, and using the Amazon Route 53 DNS service and Elastic Load Balancing to route traffic between the on-premises and cloud environments. AWS now supports load balancing between AWS and on-premises environments to help you scale your cloud environment quickly whenever required, and to reduce it again by applying Auto Scaling and placing a threshold on your on-premises traffic using Route 53.
Today we are launching our 18th AWS Region, our fourth in Europe. Located in the Paris area, AWS customers can use this Region to better serve customers in and around France.
The Paris Region will benefit from three AWS Direct Connect locations. Telehouse Voltaire is available today. AWS Direct Connect will also become available at Equinix Paris in early 2018, followed by Interxion Paris.
All AWS infrastructure regions around the world are designed, built, and regularly audited to meet the most rigorous compliance standards and to provide high levels of security for all AWS customers. These include ISO 27001, ISO 27017, ISO 27018, SOC 1 (Formerly SAS 70), SOC 2 and SOC 3 Security & Availability, PCI DSS Level 1, and many more. This means customers benefit from all the best practices of AWS policies, architecture, and operational processes built to satisfy the needs of even the most security sensitive customers.
AWS is certified under the EU-US Privacy Shield, and the AWS Data Processing Addendum (DPA) is GDPR-ready and available now to all AWS customers to help them prepare for May 25, 2018 when the GDPR becomes enforceable. The current AWS DPA, as well as the AWS GDPR DPA, allows customers to transfer personal data to countries outside the European Economic Area (EEA) in compliance with European Union (EU) data protection laws. AWS also adheres to the Cloud Infrastructure Service Providers in Europe (CISPE) Code of Conduct. The CISPE Code of Conduct helps customers ensure that AWS is using appropriate data protection standards to protect their data, consistent with the GDPR. In addition, AWS offers a wide range of services and features to help customers meet the requirements of the GDPR, including services for access controls, monitoring, logging, and encryption.
From Our Customers Many AWS customers are preparing to use this new Region. Here’s a small sample:
Societe Generale, one of the largest banks in France and the world, has accelerated their digital transformation while working with AWS. They developed SG Research, an application that makes reports from Societe Generale’s analysts available to corporate customers in order to improve the decision-making process for investments. The new AWS Region will reduce latency between applications running in the cloud and in their French data centers.
SNCF is the national railway company of France. Their mobile app, powered by AWS, delivers real-time traffic information to 14 million riders. Extreme weather, traffic events, holidays, and engineering works can cause usage to peak at hundreds of thousands of users per second. They are planning to use machine learning and big data to add predictive features to the app.
Radio France, the French public radio broadcaster, offers seven national networks, and uses AWS to accelerate its innovation and stay competitive.
Les Restos du Coeur is a French charity that provides assistance to the needy, delivering food packages and helping with their social and economic integration back into French society. Les Restos du Coeur is using AWS for its CRM system to track the assistance given to each of their beneficiaries and the impact this is having on their lives.
AlloResto by JustEat (a leader in the French FoodTech industry) is using AWS to scale during traffic peaks and to accelerate their innovation process.
AWS Consulting and Technology Partners We are already working with a wide variety of consulting, technology, managed service, and Direct Connect partners in France. Here’s a partial list:
AWS in France We have been investing in Europe, with a focus on France, for the last 11 years. We have also been developing documentation and training programs to help our customers to improve their skills and to accelerate their journey to the AWS Cloud.
As part of our commitment to AWS customers in France, we plan to train more than 25,000 people in the coming years, helping them develop highly sought after cloud skills. They will have access to AWS training resources in France via AWS Academy, AWSome days, AWS Educate, and webinars, all delivered in French by AWS Technical Trainers and AWS Certified Trainers.
Use it Today The EU (Paris) Region is open for business now and you can start using it today!
Today we launched our 17th Region globally, and the second in China. The AWS China (Ningxia) Region, operated by Ningxia Western Cloud Data Technology Co. Ltd. (NWCD), is generally available now and provides customers another option to run applications and store data on AWS in China.
Operating Partner To comply with China’s legal and regulatory requirements, AWS has formed a strategic technology collaboration with NWCD to operate and provide services from the AWS China (Ningxia) Region. Founded in 2015, NWCD is a licensed datacenter and cloud services provider based in Ningxia, China. NWCD joins Sinnet, the operator of the AWS China (Beijing) Region, as an AWS operating partner in China. Through these relationships, AWS provides its industry-leading technology, guidance, and expertise to NWCD and Sinnet, while NWCD and Sinnet operate and provide AWS cloud services to local customers. While the cloud services offered in both AWS China Regions are the same as those available in other AWS Regions, the AWS China Regions are different in that they are isolated from all other AWS Regions and operated by AWS’s Chinese partners separately from all other AWS Regions. Customers using the AWS China Regions enter into customer agreements with Sinnet and NWCD, rather than with AWS.
Use it Today The AWS China (Ningxia) Region, operated by NWCD, is open for business, and you can start using it now! Starting today, Chinese developers, startups, and enterprises, as well as government, education, and non-profit organizations, can leverage AWS to run their applications and store their data in the new AWS China (Ningxia) Region, operated by NWCD. Customers already using the AWS China (Beijing) Region, operated by Sinnet, can select the AWS China (Ningxia) Region directly from the AWS Management Console, while new customers can request an account at www.amazonaws.cn to begin using both AWS China Regions.
Glenn Gore here, Chief Architect for AWS. I’m in Las Vegas this week — with 43K others — for re:Invent 2017. We have a lot of exciting announcements this week. I’m going to post to the AWS Architecture blog each day with my take on what’s interesting about some of the announcements from a cloud architectural perspective.
Why not start at the beginning? At the Midnight Madness launch on Sunday night, we announced Amazon Sumerian, our platform for VR, AR, and mixed reality. The hype around VR/AR has existed for many years, though for me, it is a perfect example of how a working end-to-end solution often requires innovation from multiple sources. For AR/VR to be successful, we need many components to come together in a coherent manner to provide a great experience.
First, we need lightweight, high-definition goggles with motion tracking that are comfortable to wear. Second, we need to track movement of our body and hands in a 3-D space so that we can interact with virtual objects in the virtual world. Third, we need to build the virtual world itself and populate it with assets and define how the interactions will work and connect with various other systems.
There has been rapid development of the physical devices for AR/VR, ranging from iOS devices to Oculus Rift and HTC Vive, which provide excellent capabilities for the first and second components defined above. With the launch of Amazon Sumerian we are solving for the third area, which will help developers easily build their own virtual worlds and start experimenting and innovating with how to apply AR/VR in new ways.
Already, within 48 hours of Amazon Sumerian being announced, I have had multiple discussions with customers and partners around some cool use cases where VR can help in training simulations, remote-operator controls, or with new ideas around interacting with complex visual data sets, which starts bringing concepts straight out of sci-fi movies into the real (virtual) world. I am really excited to see how Sumerian will unlock the creative potential of developers and where this will lead.
Amazon MQ I am a huge fan of distributed architectures where asynchronous messaging is the backbone of connecting the discrete components together. Amazon Simple Queue Service (Amazon SQS) is one of my favorite services due to its simplicity, scalability, performance, and the incredible flexibility of how you can use Amazon SQS in so many different ways to solve complex queuing scenarios.
While Amazon SQS is easy to use when building cloud-native applications on AWS, many of our customers running existing applications on premises required support for different messaging protocols such as Java Message Service (JMS), .Net Messaging Service (NMS), Advanced Message Queuing Protocol (AMQP), MQ Telemetry Transport (MQTT), Simple (or Streaming) Text Oriented Messaging Protocol (STOMP), and WebSockets. One of the most popular on-premises message brokers is Apache ActiveMQ. With the release of Amazon MQ, you can now run Apache ActiveMQ on AWS as a managed service, similar to what we did with Amazon ElastiCache back in 2012. For me, there are two compelling, major benefits that Amazon MQ provides:
Integrate existing applications with cloud-native applications without having to change a line of application code if using one of the supported messaging protocols. This removes one of the biggest blockers for integration between the old and the new.
Remove the complexity of configuring Multi-AZ resilient message broker services as Amazon MQ provides out-of-the-box redundancy by always storing messages redundantly across Availability Zones. Protection is provided against failure of a broker through to complete failure of an Availability Zone.
I believe that Amazon MQ is a major component in the tools required to help you migrate your existing applications to AWS. Having set up cross-data center Apache ActiveMQ clusters in the past myself and then testing to ensure they work as expected during critical failure scenarios, technical staff working on migrations to AWS benefit from the ease of deploying a fully redundant, managed Apache ActiveMQ cluster within minutes.
Who would have thought I would have been so excited to revisit Apache ActiveMQ in 2017 after using SQS for many, many years? Choice is a wonderful thing.
Amazon GuardDuty Maintaining application and information security in the modern world is increasingly complex and is constantly evolving and changing as new threats emerge. This is due to the scale, variety, and distribution of services required in a competitive online world.
At Amazon, security is our number one priority. Thus, we are always looking at how we can increase security detection and protection while simplifying the implementation of advanced security practices for our customers. As a result, we released Amazon GuardDuty, which provides intelligent threat detection by using a combination of multiple information sources, transactional telemetry, and the application of machine learning models developed by AWS. One of the biggest benefits of Amazon GuardDuty that I appreciate is that enabling this service requires zero software, agents, sensors, or network choke points, which can all impact performance or reliability of the service you are trying to protect. Amazon GuardDuty works by monitoring your VPC flow logs, AWS CloudTrail events, and DNS logs, as well as combing through other sources of security threats that AWS aggregates from our own internal and external sources.
The use of machine learning in Amazon GuardDuty allows it to identify changes in behavior, which could be suspicious and require additional investigation. Amazon GuardDuty works across all of your AWS accounts allowing for an aggregated analysis and ensuring centralized management of detected threats across accounts. This is important for our larger customers who can be running many hundreds of AWS accounts across their organization, as providing a single common threat detection of their organizational use of AWS is critical to ensuring they are protecting themselves.
Detection, though, is only the beginning of what Amazon GuardDuty enables. When a threat is identified, you can configure remediation scripts or trigger Lambda functions with custom responses, which lets you start building automated responses to a variety of common threats. Speed of response is essential when a security incident may be taking place. For example, Amazon GuardDuty detects that an Amazon Elastic Compute Cloud (Amazon EC2) instance might be compromised due to traffic from a known set of malicious IP addresses. Upon detection of a compromised EC2 instance, we could apply an access control entry restricting outbound traffic for that instance, which stops loss of data until a security engineer can assess what has occurred.
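As a hedged sketch of that kind of automated response, the Lambda handler below assumes it is invoked by a CloudWatch Events rule for GuardDuty findings and isolates the flagged instance by swapping its security groups for a pre-created quarantine group; the event field path and the group ID are assumptions for the example:
import boto3

ec2 = boto3.client("ec2")
QUARANTINE_SECURITY_GROUP = "sg-0123456789abcdef0"  # assumed pre-created, deny-by-default group

def handler(event, context):
    # Field path assumed for this sketch: the GuardDuty finding carries the affected instance.
    instance_id = event["detail"]["resource"]["instanceDetails"]["instanceId"]

    # Swap the instance onto the quarantine security group to cut off traffic
    # until a security engineer can assess what has occurred.
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=[QUARANTINE_SECURITY_GROUP])
    return {"quarantined": instance_id}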
Whether you are a customer running a single service in a single account, a global customer with hundreds of accounts and thousands of applications, or a startup with hundreds of microservices and an hourly release cycle in a DevOps world, I recommend enabling Amazon GuardDuty. We have a 30-day free trial available for all new customers of this service. As it is a monitor of events, there is no change required to your architecture within AWS.
Stay tuned for tomorrow’s post on AWS Media Services and Amazon Neptune.
Messaging holds the parts of a distributed application together, while also adding resiliency and enabling the implementation of highly scalable architectures. For example, earlier this year, Amazon Simple Queue Service (SQS) and Amazon Simple Notification Service (SNS) supported the processing of customer orders on Prime Day, collectively processing 40 billion messages at a rate of 10 million per second, with no customer-visible issues.
SQS and SNS have been used extensively for applications that were born in the cloud. However, many of our larger customers are already making use of open-sourced or commercially-licensed message brokers. Their applications are mission-critical, and so is the messaging that powers them. Our customers describe the setup and on-going maintenance of their messaging infrastructure as “painful” and report that they spend at least 10 staff-hours per week on this chore.
New Amazon MQ Today we are launching Amazon MQ – a managed message broker service for Apache ActiveMQ that lets you get started in minutes with just three clicks! As you may know, ActiveMQ is a popular open-source message broker that is fast & feature-rich. It offers queues and topics, durable and non-durable subscriptions, push-based and poll-based messaging, and filtering.
As a managed service, Amazon MQ takes care of the administration and maintenance of ActiveMQ. This includes responsibility for broker provisioning, patching, failure detection & recovery for high availability, and message durability. With Amazon MQ, you get direct access to the ActiveMQ console and industry standard APIs and protocols for messaging, including JMS, NMS, AMQP, STOMP, MQTT, and WebSocket. This allows you to move from any message broker that uses these standards to Amazon MQ–along with the supported applications–without rewriting code.
You can create a single-instance Amazon MQ broker for development and testing, or an active/standby pair that spans AZs, with quick, automatic failover. Either way, you get data replication across AZs and a pay-as-you-go model for the broker instance and message storage.
Amazon MQ is a full-fledged part of the AWS family, including the use of AWS Identity and Access Management (IAM) for authentication and authorization to use the service API. You can use Amazon CloudWatch metrics to keep a watchful eye on metrics such as queue depth, and initiate Auto Scaling of your consumer fleet as needed.
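For instance, a broker’s queue depth can be read back with a few lines of boto3. In this sketch, the namespace, metric, and dimension names are assumptions about how Amazon MQ publishes its CloudWatch metrics, so verify them in your account before wiring them into Auto Scaling:
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")

# Read the recent queue depth for one broker queue (metric and dimension names assumed).
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/AmazonMQ",
    MetricName="QueueSize",
    Dimensions=[
        {"Name": "Broker", "Value": "example-broker"},
        {"Name": "Queue", "Value": "example-queue"},
    ],
    StartTime=datetime.utcnow() - timedelta(minutes=15),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Average"],
)
print(stats["Datapoints"])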
Launching an Amazon MQ Broker To get started, I open up the Amazon MQ Console, select the desired AWS Region, enter a name for my broker, and click on Next step:
Then I choose the instance type, indicate that I want to create a standby, and click on Create broker (I can select a VPC and fine-tune other settings in the Advanced settings section):
My broker will be created and ready to use in 5-10 minutes:
The URLs and endpoints that I use to access my broker are all available at a click:
I can access the ActiveMQ Web Console at the link provided:
The broker publishes instance, topic, and queue metrics to CloudWatch. Here are the instance metrics:
Available Now
Amazon MQ is available now and you can start using it today in the US East (Northern Virginia), US East (Ohio), US West (Oregon), EU (Ireland), EU (Frankfurt), and Asia Pacific (Sydney) Regions.
The AWS Free Tier lets you use a single-AZ micro instance for up to 750 hours and store up to 1 gigabyte each month, for one year. After that, billing is based on instance-hours and message storage, plus charges for Internet data transfer if the broker is accessed from outside of AWS.
Contributed by: Stephen Liedig, Senior Solutions Architect, ANZ Public Sector, and Otavio Ferreira, Manager, Amazon Simple Notification Service
Want to make your cloud-native applications scalable, fault-tolerant, and highly available? Recently, we wrote a couple of posts about using the AWS messaging services Amazon SQS and Amazon SNS to address messaging patterns for loosely coupled communication between highly cohesive components; see those earlier posts for more information.
Today, AWS is releasing a new message filtering functionality for SNS. This new feature simplifies the pub/sub messaging architecture by offloading the filtering logic from subscribers, as well as the routing logic from publishers, to SNS.
In this post, we walk you through the new message filtering feature, and how to use it to clean up unnecessary logic in your components, and reduce the number of topics in your architecture.
Topic-based filtering
SNS is a fully managed pub/sub messaging service that lets you fan out messages to large numbers of recipients at one time, using topics. SNS topics support a variety of subscription types, allowing you to push messages to SQS queues, AWS Lambda functions, HTTP endpoints, email addresses, and mobile devices (SMS, push).
In the above scenario, every subscriber receives the same message published to the topic, allowing them to process the message independently. For many use cases, this is sufficient.
However, in more complex scenarios, the subscriber may only be interested in a subset of the messages being published. The onus, in that case, is on each subscriber to ensure that they are filtering and only processing those messages in which they are actually interested.
To avoid this additional filtering logic on each subscriber, many organizations have adopted a practice in which the publisher is now responsible for routing different types of messages to different topics. However, as depicted in the following diagram, this topic-based filtering practice can lead to overly complicated publishers, topic proliferation, and additional overhead in provisioning and managing your SNS topics.
Attribute-based filtering
To leverage the new message filtering capability, SNS requires the publisher to set message attributes and each subscriber to set a subscription attribute (a subscription filter policy). When the publisher posts a new message to the topic, SNS attempts to match the incoming message attributes to the filter policy set on each subscription, to determine whether a particular subscriber is interested in that incoming event. If there is a match, SNS then pushes the message to the subscriber in question. The new attribute-based message filtering approach is depicted in the following diagram.
Message filtering in action
Let’s look at how message filtering works. The following example is based on a sports merchandise ecommerce website, which publishes a variety of events to an SNS topic. The events range from checkout events (triggered when orders are placed or canceled) to buyers’ navigation events (triggered when product pages are visited). The code below is based on the existing AWS SDK for Python.
First, create the single SNS topic to which all shopping events are published.
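For instance, with the AWS SDK for Python (Boto 3), the topic creation might look like the following sketch (the topic name is illustrative):

import boto3

sns = boto3.client('sns')

# Create the single topic that receives all shopping events
topic = sns.create_topic(Name='ShoppingEvents')
topic_arn = topic['TopicArn']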
Next, subscribe the endpoints that will be listening to those shopping events. The first subscriber is an SQS queue that is processed by a payment gateway, while the second subscriber is a Lambda function that indexes the buyer’s shopping interests against a search engine.
A subscription filter policy is set as a subscription attribute, by the subscription owner, as a simple JSON object, containing a set of key-value pairs. This object defines the kind of event in which the subscriber is interested.
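Continuing the sketch, the two subscriptions and their filter policies could be set up as follows; the queue ARN, Lambda function ARN, and event_type values are assumptions based on the scenario above:

import json

# Subscribe the payment gateway's SQS queue and keep its subscription ARN
checkout_sub = sns.subscribe(
    TopicArn=topic_arn,
    Protocol='sqs',
    Endpoint='arn:aws:sqs:us-east-1:123456789012:payment-gateway-queue'
)

# Subscribe the search indexing Lambda function
search_sub = sns.subscribe(
    TopicArn=topic_arn,
    Protocol='lambda',
    Endpoint='arn:aws:lambda:us-east-1:123456789012:function:IndexShoppingInterests'
)

# The payment gateway only cares about checkout events
sns.set_subscription_attributes(
    SubscriptionArn=checkout_sub['SubscriptionArn'],
    AttributeName='FilterPolicy',
    AttributeValue=json.dumps({'event_type': ['order_placed', 'order_canceled']})
)

# The search engine only cares about navigation events
sns.set_subscription_attributes(
    SubscriptionArn=search_sub['SubscriptionArn'],
    AttributeName='FilterPolicy',
    AttributeValue=json.dumps({'event_type': ['product_page_visited']})
)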
You’re now ready to start publishing events with attributes!
Message attributes allow you to provide structured metadata items (such as time stamps, geospatial data, event type, signatures, and identifiers) about the message. Message attributes are optional and separate from, but sent along with, the message body. You can include up to 10 message attributes with your message.
The first message published in this example is related to an order that has been placed on the ecommerce website. The message attribute “event_type” with the value “order_placed” matches only the filter policy associated with the payment gateway subscription. Therefore, only the SQS queue subscribed to the SNS topic is notified about this checkout event.
The second message published is related to a buyer’s navigation activity on the ecommerce website. The message attribute “event_type” with the value “product_page_visited” matches only the filter policy associated with the search engine subscription. Therefore, only the Lambda function subscribed to the SNS topic is notified about this navigation event.
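A sketch of those two publish calls, continuing the example above (the message bodies are illustrative):

# Checkout event: delivered only to the payment gateway queue
sns.publish(
    TopicArn=topic_arn,
    Message=json.dumps({'order_id': 1234, 'total': 59.90}),
    MessageAttributes={
        'event_type': {'DataType': 'String', 'StringValue': 'order_placed'}
    }
)

# Navigation event: delivered only to the search indexing Lambda function
sns.publish(
    TopicArn=topic_arn,
    Message=json.dumps({'product_id': 42, 'buyer_id': 7}),
    MessageAttributes={
        'event_type': {'DataType': 'String', 'StringValue': 'product_page_visited'}
    }
)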
The following diagram represents the architecture for this ecommerce website, with the message filtering mechanism in action. As described earlier, checkout events are pushed only to the SQS queue, whereas navigation events are pushed to the Lambda function only.
Message filtering criteria
It is important to remember the following things about subscription filter policy matching:
A subscription filter policy either matches an incoming message, or it doesn’t. It’s Boolean logic.
For a filter policy to match a message, the message must contain all the attribute keys listed in the policy.
Attributes of the message not mentioned in the filtering policy are ignored.
The value of each key in the filter policy is an array containing one or more values. The policy matches if any of the values in the array match the value in the corresponding message attribute.
If the value in the message attribute is an array, then the filter policy matches if the intersection of the policy array and the message array is non-empty.
The matching is exact (character-by-character), without case-folding or any other string normalization.
The values being matched follow JSON rules: Strings enclosed in quotes, numbers, and the unquoted keywords true, false, and null.
Number matching is at the string representation level. Example: 300, 300.0, and 3.0e2 aren’t considered equal.
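To make these rules concrete, consider a hypothetical filter policy such as the following (the "store" attribute is an invented example):

{
  "event_type": ["order_placed", "order_canceled"],
  "store": ["books"]
}

A message carrying the attributes event_type = "order_placed" and store = "books" matches, because every policy key is present and at least one value in each array matches. A message carrying only event_type = "order_placed" does not match, because the "store" key is missing; nor does one carrying store = "Books", because matching is exact and case-sensitive.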
When should I use message filtering?
We recommend using message filtering and grouping subscribers into a single topic only when all of the following are true:
Subscribers are semantically related to each other
Subscribers consume similar types of events
Subscribers are supposed to share the same access permissions on the topic
Technically, you could get away with creating a single topic for your entire domain to handle all event processing, even unrelated use cases, but this wouldn’t be recommended. This option could result in an unnecessarily large topic, which could potentially impact your message delivery latency. Also, you would lose the ability to implement fine-grained access control on your topics.
Finally, if you already use SNS, but had to add filtering logic in your subscribers or routing logic in your publishers (topic-based filtering), you can now immediately benefit from message filtering. This new approach lets you clean up any unnecessary logic in your components, and reduce the number of topics in your architecture.
Summary
As we’ve shown in this post, the new message filtering capability in Amazon SNS gives you a great amount of flexibility in your messaging pattern. It allows you to really simplify your pub/sub infrastructure requirements.
Message filtering can be implemented easily with existing AWS SDKs by applying message and subscription attributes across all SNS supported protocols (Amazon SQS, AWS Lambda, HTTP, SMS, email, and mobile push). It’s now available in all AWS commercial regions, at no extra charge.
Here are a few ideas for next steps to get you started:
Add filter policies to your subscriptions on the SNS console.
When we discuss how to build applications with customers, we often align to the Well-Architected Framework pillars of security, reliability, performance efficiency, cost optimization, and operational excellence. Designing for failure is an essential component of developing well-architected applications that are resilient to the spurious errors that may occur.
There are many ways you can use AWS services to achieve high availability and resiliency of your applications. For example, you can couple Elastic Load Balancing with Auto Scaling and Amazon EC2 instances to build highly available applications. Or use Amazon API Gateway and AWS Lambda to rapidly scale out a microservices-based architecture. Many AWS services have built in solutions to help with the appropriate error handling, such as Dead Letter Queues (DLQ) for Amazon SQS or retries in AWS Batch.
AWS Step Functions is an AWS service that makes it easy for you to coordinate the components of distributed applications and microservices. Step Functions allows you to easily design for failure, by incorporating features such as error retries and custom handling of AWS Lambda exceptions. These features allow you to programmatically handle many common error modes and build robust, reliable applications.
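For example, a Task state can declare both retries and a fallback state directly in its Amazon States Language definition; the state, function, and fallback names in this snippet are hypothetical:

"ProcessOrder": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessOrder",
  "Retry": [
    {
      "ErrorEquals": ["States.TaskFailed"],
      "IntervalSeconds": 2,
      "MaxAttempts": 3,
      "BackoffRate": 2.0
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "Next": "NotifyFailure"
    }
  ],
  "Next": "FulfillOrder"
}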
In some rare cases, however, your application may fail in an unexpected manner. In these situations, you might not want to duplicate in a repeat execution those portions of your state machine that have already run. This is especially true when orchestrating long-running jobs or executing a complex state machine as part of a microservice. Here, you need to know the last successful state in your state machine from which to resume, so that you don’t duplicate previous work. In this post, we present a solution to enable you to resume from any given state in your state machine in the case of an unexpected failure.
Resuming from a given state
To resume a failed state machine execution from the state at which it failed, you first run a script that dynamically creates a new state machine. When the new state machine is executed, it resumes the failed execution from the point of failure. The script contains the following two primary steps:
Parse the execution history of the failed execution to find the name of the state at which it failed, as well as the JSON input to that state.
Create a new state machine, which adds an additional state, called "GoToState", to the failed state machine. "GoToState" is a choice state at the beginning of the state machine that branches execution directly to the failed state, allowing you to skip states that had succeeded in the previous execution.
The full script along with a CloudFormation template that creates a demo of this is available in the aws-sfn-resume-from-any-state GitHub repo.
Diving into the script
In this section, we walk you through the script and highlight the core components of its functionality. The script contains a main function, which adds a command line parameter for the failedExecutionArn so that you can easily call the script from the command line:
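A minimal sketch of such an entry point follows, assuming the helper functions parseFailureHistory and attachGoToState shown later in this post; deriving the state machine ARN from the execution ARN is an assumption of this sketch:

import argparse
import boto3

client = boto3.client('stepfunctions')

def main():
    parser = argparse.ArgumentParser(
        description='Resume a failed Step Functions execution from its point of failure')
    parser.add_argument('--failedExecutionArn', required=True,
                        help='Execution ARN of the failed state machine execution')
    args = parser.parse_args()

    # Find the name of the failed state and the input it received
    failedState, failedInput = parseFailureHistory(args.failedExecutionArn)

    # Assumption: derive the state machine ARN from the execution ARN
    # arn:...:execution:MachineName:execName  ->  arn:...:stateMachine:MachineName
    stateMachineArn = args.failedExecutionArn.replace(
        ':execution:', ':stateMachine:').rsplit(':', 1)[0]

    # Create the new state machine with the GoToState choice state attached
    response = attachGoToState(failedState, stateMachineArn)
    print('New state machine ARN:', response['stateMachineArn'])
    print('Input to failed state:', failedInput)

if __name__ == '__main__':
    main()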
First, the script extracts the name of the failed state along with the input to that state. It does so by using the failed state machine execution history, which is identified by the Amazon Resource Name (ARN) of the execution. The failed state is marked in the execution history, along with the input to that state (which is also the output of the preceding successful state). The script is able to parse these values from the log.
The script loops through the execution history of the failed state machine, and traces it backwards until it finds the failed state. If the state machine failed in a parallel state, then it must restart from the beginning of the parallel state. The script is able to capture the name of the parallel state that failed, rather than any substate within the parallel state that may have caused the failure. The following code is the Python function that does this.
def parseFailureHistory(failedExecutionArn):
    '''
    Parses the execution history of a failed state machine to get the name of the failed state and the input to the failed state.
    Input: failedExecutionArn = A string containing the execution ARN of a failed state machine
    Output: A tuple with two elements: (name of failed state, input to failed state)
    '''
    failedAtParallelState = False
    try:
        # Get the execution history, in reverse order (most recent event first)
        response = client.get_execution_history(
            executionArn=failedExecutionArn,
            reverseOrder=True
        )
        failedEvents = response['events']
    except Exception as ex:
        raise ex
    # Confirm that the execution actually failed; raise an exception if it didn't fail.
    try:
        failedEvents[0]['executionFailedEventDetails']
    except KeyError:
        raise Exception('Execution did not fail')
    '''
    If you have a 'States.Runtime' error (for example, if a task state in your state machine attempts to execute a Lambda function in a different region than the state machine), get the ID of the failed state, and use it to determine the failed state name and input.
    '''
    if failedEvents[0]['executionFailedEventDetails']['error'] == 'States.Runtime':
        failedId = int(''.join(filter(str.isdigit, str(failedEvents[0]['executionFailedEventDetails']['cause'].split()[13]))))
        failedState = failedEvents[-1 * failedId]['stateEnteredEventDetails']['name']
        failedInput = failedEvents[-1 * failedId]['stateEnteredEventDetails']['input']
        return (failedState, failedInput)
    '''
    You need to loop through the execution history, tracing back the executed steps.
    The first state you encounter is the failed state. If you failed on a parallel state, you need the name of the parallel state rather than the name of a state within the parallel state that it failed on. This is because you can only attach GoToState to the parallel state, not to a substate within the parallel state.
    This loop starts with the ID of the latest event and uses the previous event IDs to trace back the execution to the beginning (id 0). However, it returns as soon as it finds the name of the failed state.
    '''
    currentEventId = failedEvents[0]['id']
    while currentEventId != 0:
        # Multiply the event ID by -1 for indexing, because you're looking at the reversed history
        currentEvent = failedEvents[-1 * currentEventId]
        '''
        You can determine whether the failed state was a parallel state, because an event with 'type'='ParallelStateFailed' appears in the execution history before the name of the failed state
        '''
        if currentEvent['type'] == 'ParallelStateFailed':
            failedAtParallelState = True
        '''
        If the failed state is not a parallel state, then the name of the failed state to return is the name of the state in the first 'TaskStateEntered' event type you run into when tracing back the execution history
        '''
        if currentEvent['type'] == 'TaskStateEntered' and failedAtParallelState == False:
            failedState = currentEvent['stateEnteredEventDetails']['name']
            failedInput = currentEvent['stateEnteredEventDetails']['input']
            return (failedState, failedInput)
        '''
        If the failed state was a parallel state, then you need to trace execution back to the first event with 'type'='ParallelStateEntered', and return the name of that state
        '''
        if currentEvent['type'] == 'ParallelStateEntered' and failedAtParallelState:
            failedState = currentEvent['stateEnteredEventDetails']['name']
            failedInput = currentEvent['stateEnteredEventDetails']['input']
            return (failedState, failedInput)
        # Update the ID for the next iteration of the loop
        currentEventId = currentEvent['previousEventId']
Create the new state machine
The script uses the name of the failed state to create the new state machine, with "GoToState" branching execution directly to the failed state.
To do this, the script requires the Amazon States Language (ASL) definition of the failed state machine. It modifies the definition to append "GoToState", and create a new state machine from it.
The script gets the ARN of the failed state machine from the execution ARN of the failed state machine. This ARN allows it to get the ASL definition of the failed state machine by calling the DescribeStateMachine API action. It then creates a new state machine with "GoToState".
When the script creates the new state machine, it also adds an additional input variable called "resuming". When you execute this new state machine, you specify this resuming variable as true in the input JSON. This tells "GoToState" to branch execution to the state that had previously failed. Here’s the function that does this:
def attachGoToState(failedStateName, stateMachineArn):
    '''
    Given a state machine ARN and the name of a state in that state machine, create a new state machine that starts at a new choice state called 'GoToState'. 'GoToState' branches to the named state, and sends the input of the state machine to that state, when a variable called 'resuming' is set to True.
    Input: failedStateName = A string with the name of the failed state
           stateMachineArn = A string with the ARN of the state machine
    Output: The response from the create_state_machine call, which is the API call that creates the new state machine
    '''
    try:
        response = client.describe_state_machine(
            stateMachineArn=stateMachineArn
        )
    except Exception:
        raise Exception('Could not get ASL definition of state machine')
    roleArn = response['roleArn']
    stateMachine = json.loads(response['definition'])
    # Create a name for the new state machine
    newName = response['name'] + '-with-GoToState'
    # Get the StartAt state for the original state machine, because you point 'GoToState' to this state
    originalStartAt = stateMachine['StartAt']
    '''
    Create the GoToState with the variable $.resuming.
    If the new state machine is executed with $.resuming = True, then the state machine skips to the failed state.
    Otherwise, it executes the state machine from the original start state.
    '''
    goToState = {
        'Type': 'Choice',
        'Choices': [{'Variable': '$.resuming', 'BooleanEquals': False, 'Next': originalStartAt}],
        'Default': failedStateName
    }
    # Add GoToState to the set of states in the new state machine
    stateMachine['States']['GoToState'] = goToState
    # Make GoToState the new starting state
    stateMachine['StartAt'] = 'GoToState'
    # Create the new state machine
    try:
        response = client.create_state_machine(
            name=newName,
            definition=json.dumps(stateMachine),
            roleArn=roleArn
        )
    except Exception:
        raise Exception('Failed to create new state machine with GoToState')
    return response
Testing the script
Now that you understand how the script works, you can test it out.
The following screenshot shows an example state machine that has failed, called "TestMachine". This state machine successfully completed "FirstState" and "ChoiceState", but when it branched to "FirstMatchState", it failed.
Use the script to create a new state machine that allows you to rerun this state machine, but skip the "FirstState" and the "ChoiceState" steps that already succeeded. You can do this by calling the script as follows:
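Assuming the script is saved locally as gotostate.py (the actual file name in the repo may differ) and using a placeholder execution ARN, the call looks roughly like this:

python gotostate.py --failedExecutionArn "arn:aws:states:us-east-1:123456789012:execution:TestMachine:failed-run-1"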
This creates a new state machine called "TestMachine-with-GoToState", and returns its ARN, along with the input that had been sent to "FirstMatchState". You can then inspect the input to determine what caused the error. In this case, you notice that the input to "FirstMatchState" was the following:
{
  "foo": 1,
  "Message": true
}
However, this state machine expects the "Message" field of the JSON to be a string rather than a Boolean. Execute the new "TestMachine-with-GoToState" state machine, change the input to be a string, and add the "resuming" variable that "GoToState" requires:
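For example, the corrected input might look like this (any string value for "Message" works; the important changes are the string type and the added "resuming" flag):

{
  "foo": 1,
  "Message": "Hello!",
  "resuming": true
}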
When you execute the new state machine, it skips "FirstState" and "ChoiceState", and goes directly to "FirstMatchState", which was the state that failed:
Now let’s look at what happens when you have a state machine with multiple parallel steps. This example is included in the GitHub repository associated with this post. The repo contains a CloudFormation template that sets up this state machine and provides instructions to replicate this solution.
The following state machine, "ParallelStateMachine", takes an input through two subsequent parallel states before doing some final processing and exiting, along with the JSON with the ASL definition of the state machine.
{
  "Comment": "An example of the Amazon States Language using a parallel state to execute two branches at the same time.",
  "StartAt": "Parallel",
  "States": {
    "Parallel": {
      "Type": "Parallel",
      "ResultPath": "$.output",
      "Next": "Parallel 2",
      "Branches": [
        {
          "StartAt": "Parallel Step 1, Process 1",
          "States": {
            "Parallel Step 1, Process 1": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaA",
              "End": true
            }
          }
        },
        {
          "StartAt": "Parallel Step 1, Process 2",
          "States": {
            "Parallel Step 1, Process 2": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaA",
              "End": true
            }
          }
        }
      ]
    },
    "Parallel 2": {
      "Type": "Parallel",
      "Next": "Final Processing",
      "Branches": [
        {
          "StartAt": "Parallel Step 2, Process 1",
          "States": {
            "Parallel Step 2, Process 1": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaB",
              "End": true
            }
          }
        },
        {
          "StartAt": "Parallel Step 2, Process 2",
          "States": {
            "Parallel Step 2, Process 2": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaB",
              "End": true
            }
          }
        }
      ]
    },
    "Final Processing": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaC",
      "End": true
    }
  }
}
First, use an input that initially fails:
{
  "Message": "Hello!"
}
This fails because the second parallel state expects the input JSON to contain a variable called "foo" in order to run "Parallel Step 2, Process 1" and "Parallel Step 2, Process 2". Instead, the original input gets processed by the first parallel state, which produces the following output to pass to the second parallel state:
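The exact output depends on what LambdaA returns; assuming it simply echoes its input, the payload handed to "Parallel 2" would look roughly like this, with the two branch results collected under "output" and no "foo" field anywhere:

{
  "Message": "Hello!",
  "output": [
    {
      "Message": "Hello!"
    },
    {
      "Message": "Hello!"
    }
  ]
}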
Run the script on the failed state machine to create a new state machine that allows it to resume directly at the second parallel state instead of having to redo the first parallel state. This creates a new state machine called "ParallelStateMachine-with-GoToState". The following JSON was created by the script to define the new state machine in ASL. It contains the "GoToState" value that was attached by the script.
{
  "Comment": "An example of the Amazon States Language using a parallel state to execute two branches at the same time.",
  "States": {
    "Final Processing": {
      "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaC",
      "End": true,
      "Type": "Task"
    },
    "GoToState": {
      "Default": "Parallel 2",
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.resuming",
          "BooleanEquals": false,
          "Next": "Parallel"
        }
      ]
    },
    "Parallel": {
      "Branches": [
        {
          "States": {
            "Parallel Step 1, Process 1": {
              "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaA",
              "End": true,
              "Type": "Task"
            }
          },
          "StartAt": "Parallel Step 1, Process 1"
        },
        {
          "States": {
            "Parallel Step 1, Process 2": {
              "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaA",
              "End": true,
              "Type": "Task"
            }
          },
          "StartAt": "Parallel Step 1, Process 2"
        }
      ],
      "ResultPath": "$.output",
      "Type": "Parallel",
      "Next": "Parallel 2"
    },
    "Parallel 2": {
      "Branches": [
        {
          "States": {
            "Parallel Step 2, Process 1": {
              "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaB",
              "End": true,
              "Type": "Task"
            }
          },
          "StartAt": "Parallel Step 2, Process 1"
        },
        {
          "States": {
            "Parallel Step 2, Process 2": {
              "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaB",
              "End": true,
              "Type": "Task"
            }
          },
          "StartAt": "Parallel Step 2, Process 2"
        }
      ],
      "Type": "Parallel",
      "Next": "Final Processing"
    }
  },
  "StartAt": "GoToState"
}
You can then execute this state machine with the correct input by adding the "foo" and "resuming" variables:
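Continuing the earlier assumption about LambdaA's output, the execution input might look roughly like the following; the important parts are the added "foo" and "resuming" fields:

{
  "Message": "Hello!",
  "output": [
    {
      "Message": "Hello!"
    },
    {
      "Message": "Hello!"
    }
  ],
  "foo": 1,
  "resuming": true
}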
This yields the following result. Notice that this time, the state machine executed successfully to completion, and skipped the steps that had previously failed.
Conclusion
When you’re building out complex workflows, it’s important to be prepared for failure. You can do this by taking advantage of features such as automatic error retries in Step Functions and custom error handling of Lambda exceptions.
Nevertheless, state machines still have the possibility of failing. With the methodology and script presented in this post, you can resume a failed state machine from its point of failure. This allows you to skip the execution of steps in the workflow that had already succeeded, and recover the process from the point of failure.
Contributed by Otavio Ferreira, Manager, Software Development, AWS Messaging
Like other developers around the world, you may be tackling increasingly complex business problems. A key success factor, in that case, is the ability to break down a large project scope into smaller, more manageable components. A service-oriented architecture guides you toward designing systems as a collection of loosely coupled, independently scaled, and highly reusable services. Microservices take this even further. To improve performance and scalability, they promote fine-grained interfaces and lightweight protocols.
However, the communication among isolated microservices can be challenging. Services are often deployed onto independent servers and don’t share any compute or storage resources. Also, you should avoid hard dependencies among microservices, to preserve maintainability and reusability.
If you apply the pub/sub design pattern, you can effortlessly decouple and independently scale out your microservices and serverless architectures. A pub/sub messaging service, such as Amazon SNS, promotes event-driven computing that statically decouples event publishers from subscribers, while dynamically allowing for the exchange of messages between them. An event-driven architecture also introduces the responsiveness needed to deal with complex problems, which are often unpredictable and asynchronous.
What is event-driven computing?
Given the context of microservices, event-driven computing is a model in which subscriber services automatically perform work in response to events triggered by publisher services. This paradigm can be applied to automate workflows while decoupling the services that collectively and independently work to fulfil these workflows. Amazon SNS is an event-driven computing hub, in the AWS Cloud, that has native integration with several AWS publisher and subscriber services.
Which AWS services publish events to SNS natively?
Several AWS services have been integrated as SNS publishers and, therefore, can natively trigger event-driven computing for a variety of use cases. In this post, I specifically cover AWS compute, storage, database, and networking services, as depicted below.
Compute services
Auto Scaling: Helps you ensure that you have the correct number of Amazon EC2 instances available to handle the load for your application. You can configure Auto Scaling lifecycle hooks to trigger events, as Auto Scaling resizes your EC2 cluster. As an example, you may want to warm up the local cache store on newly launched EC2 instances, and also download log files from other EC2 instances that are about to be terminated. To make this happen, set an SNS topic as your Auto Scaling group’s notification target, then subscribe two Lambda functions to this SNS topic. The first function is responsible for handling scale-out events (to warm up cache upon provisioning), whereas the second is in charge of handling scale-in events (to download logs upon termination).
AWS Elastic Beanstalk: An easy-to-use service for deploying and scaling web applications and web services developed in a number of programming languages. You can configure event notifications for your Elastic Beanstalk environment so that notable events can be automatically published to an SNS topic, then pushed to topic subscribers. As an example, you may use this event-driven architecture to coordinate your continuous integration pipeline (such as Jenkins CI). That way, whenever an environment is created, Elastic Beanstalk publishes this event to an SNS topic, which triggers a subscribing Lambda function, which then kicks off a CI job against your newly created Elastic Beanstalk environment.
Elastic Load Balancing: Automatically distributes incoming application traffic across Amazon EC2 instances, containers, or other resources identified by IP addresses. You can configure CloudWatch alarms on Elastic Load Balancing metrics, to automate the handling of events derived from Classic Load Balancers. As an example, you may leverage this event-driven design to automate latency profiling in an Amazon ECS cluster behind a Classic Load Balancer. In this example, whenever your ECS cluster breaches your load balancer latency threshold, an event is posted by CloudWatch to an SNS topic, which then triggers a subscribing Lambda function. This function runs a task on your ECS cluster to trigger a latency profiling tool, hosted on the cluster itself. This can enhance your latency troubleshooting exercise by making it timely.
Amazon S3: Object storage built to store and retrieve any amount of data. You can enable S3 event notifications, and automatically get them posted to SNS topics, to automate a variety of workflows. For instance, imagine that you have an S3 bucket to store incoming resumes from candidates, and a fleet of EC2 instances to encode these resumes from their original format (such as Word or text) into a portable format (such as PDF). In this example, whenever new files are uploaded to your input bucket, S3 publishes these events to an SNS topic, which in turn pushes these messages into subscribing SQS queues. Then, encoding workers running on EC2 instances poll these messages from the SQS queues; retrieve the original files from the input S3 bucket; encode them into PDF; and finally store them in an output S3 bucket. (A sketch of this S3-to-SNS wiring appears after this list.)
Amazon EFS: Provides simple and scalable file storage, for use with Amazon EC2 instances, in the AWS Cloud. You can configure CloudWatch alarms on EFS metrics, to automate the management of your EFS systems. For example, consider a highly parallelized genomics analysis application that runs against an EFS system. By default, this file system is instantiated on the “General Purpose” performance mode. Although this performance mode allows for lower latency, it might eventually impose a scaling bottleneck. Therefore, you may leverage an event-driven design to handle it automatically. Basically, as soon as the EFS metric “Percent I/O Limit” breaches 95%, CloudWatch could post this event to an SNS topic, which in turn would push this message into a subscribing Lambda function. This function automatically creates a new file system, this time on the “Max I/O” performance mode, then switches the genomics analysis application to this new file system. As a result, your application starts experiencing higher I/O throughput rates.
Amazon Glacier: A secure, durable, and low-cost cloud storage service for data archiving and long-term backup. You can set a notification configuration on an Amazon Glacier vault so that when a job completes, a message is published to an SNS topic. Retrieving an archive from Amazon Glacier is a two-step asynchronous operation, in which you first initiate a job, and then download the output after the job completes. Therefore, SNS helps you eliminate polling your Amazon Glacier vault to check whether your job has been completed, or not. As usual, you may subscribe SQS queues, Lambda functions, and HTTP endpoints to your SNS topic, to be notified when your Amazon Glacier job is done.
AWS Snowball: A petabyte-scale data transport solution that uses secure appliances to transfer large amounts of data. You can leverage Snowball notifications to automate workflows related to importing data into and exporting data from AWS. More specifically, whenever your Snowball job status changes, Snowball can publish this event to an SNS topic, which in turn can broadcast the event to all its subscribers. As an example, imagine a Geographic Information System (GIS) that distributes high-resolution satellite images to users via Web browser. In this example, the GIS vendor could capture up to 80 TB of satellite images; create a Snowball job to import these files from an on-premises system to an S3 bucket; and provide an SNS topic ARN to be notified upon job status changes in Snowball. After Snowball changes the job status from “Importing” to “Completed”, Snowball publishes this event to the specified SNS topic, which delivers this message to a subscribing Lambda function, which finally creates a CloudFront web distribution for the target S3 bucket, to serve the images to end users.
Amazon RDS: Makes it easy to set up, operate, and scale a relational database in the cloud. RDS leverages SNS to broadcast notifications when RDS events occur. As usual, these notifications can be delivered via any protocol supported by SNS, including SQS queues, Lambda functions, and HTTP endpoints. As an example, imagine that you own a social network website that has experienced organic growth, and needs to scale its compute and database resources on demand. In this case, you could provide an SNS topic to listen to RDS DB instance events. When the “Low Storage” event is published to the topic, SNS pushes this event to a subscribing Lambda function, which in turn leverages the RDS API to increase the storage capacity allocated to your DB instance. The provisioning itself takes place within the specified DB maintenance window.
Amazon ElastiCache: A web service that makes it easy to deploy, operate, and scale an in-memory data store or cache in the cloud. ElastiCache can publish messages using Amazon SNS when significant events happen on your cache cluster. This feature can be used to refresh the list of servers on client machines connected to individual cache node endpoints of a cache cluster. For instance, an ecommerce website fetches product details from a cache cluster, with the goal of offloading a relational database and speeding up page load times. Ideally, you want to make sure that each web server always has an updated list of cache servers to which to connect. To automate this node discovery process, you can get your ElastiCache cluster to publish events to an SNS topic. Thus, when the ElastiCache event “AddCacheNodeComplete” is published, your topic then pushes this event to all subscribing HTTP endpoints that serve your ecommerce website, so that these HTTP servers can update their list of cache nodes.
Amazon Redshift: A fully managed data warehouse that makes it simple to analyze data using standard SQL and BI (Business Intelligence) tools. Amazon Redshift uses SNS to broadcast relevant events so that data warehouse workflows can be automated. As an example, imagine a news website that sends clickstream data to a Kinesis Firehose stream, which then loads the data into Amazon Redshift, so that popular news and reading preferences might be surfaced on a BI tool. At some point though, this Amazon Redshift cluster might need to be resized, and the cluster enters a read-only mode. Hence, this Amazon Redshift event is published to an SNS topic, which delivers this event to a subscribing Lambda function, which finally deletes the corresponding Kinesis Firehose delivery stream, so that clickstream data uploads can be put on hold. At a later point, after Amazon Redshift publishes the event that the maintenance window has been closed, SNS notifies a subscribing Lambda function accordingly, so that this function can re-create the Kinesis Firehose delivery stream, and resume clickstream data uploads to Amazon Redshift.
AWS DMS: Helps you migrate databases to AWS quickly and securely. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database. DMS also uses SNS to provide notifications when DMS events occur, which can automate database migration workflows. As an example, you might create data replication tasks to migrate an on-premises MS SQL database, composed of multiple tables, to MySQL. Thus, if replication tasks fail due to incompatible data encoding in the source tables, these events can be published to an SNS topic, which can push these messages into a subscribing SQS queue. Then, encoders running on EC2 can poll these messages from the SQS queue, encode the source tables into a compatible character set, and restart the corresponding replication tasks in DMS. This is an event-driven approach to a self-healing database migration process.
Amazon Route 53: A highly available and scalable cloud-based DNS (Domain Name System). Route 53 health checks monitor the health and performance of your web applications, web servers, and other resources. You can set CloudWatch alarms and get automated Amazon SNS notifications when the status of your Route 53 health check changes. As an example, imagine an online payment gateway that reports the health of its platform to merchants worldwide, via a status page. This page is hosted on EC2 and fetches platform health data from DynamoDB. In this case, you could configure a CloudWatch alarm for your Route 53 health check, so that when the alarm threshold is breached, and the payment gateway is no longer considered healthy, then CloudWatch publishes this event to an SNS topic, which pushes this message to a subscribing Lambda function, which finally updates the DynamoDB table that populates the status page. This event-driven approach avoids any kind of manual update to the status page visited by merchants.
AWS Direct Connect (AWS DX): Makes it easy to establish a dedicated network connection from your premises to AWS, which can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections. You can monitor physical DX connections using CloudWatch alarms, and send SNS messages when alarms change their status. As an example, when a DX connection state shifts to 0 (zero), indicating that the connection is down, this event can be published to an SNS topic, which can fan out this message to impacted servers through HTTP endpoints, so that they might reroute their traffic through a different connection instead. This is an event-driven approach to connectivity resilience.
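As a concrete sketch of one of the integrations above, the S3-to-SNS wiring from the resume-encoding example could be configured with Boto 3 roughly as follows; the bucket and topic names are placeholders, and the topic's access policy must already allow S3 to publish to it:

import boto3

s3 = boto3.client('s3')

# Publish an event to the topic whenever a new object lands in the input bucket
s3.put_bucket_notification_configuration(
    Bucket='incoming-resumes',
    NotificationConfiguration={
        'TopicConfigurations': [
            {
                'TopicArn': 'arn:aws:sns:us-east-1:123456789012:resume-uploads',
                'Events': ['s3:ObjectCreated:*']
            }
        ]
    }
)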
In addition to SNS, event-driven computing is also addressed by Amazon CloudWatch Events, which delivers a near real-time stream of system events that describe changes in AWS resources. With CloudWatch Events, you can route each event type to one or more targets, such as AWS Lambda functions, Amazon Kinesis streams, Amazon SQS queues, and Amazon SNS topics.
Many AWS services publish events to CloudWatch. As an example, you can get CloudWatch Events to capture events on your ETL (Extract, Transform, Load) jobs running on AWS Glue and push failed ones to an SQS queue, so that you can retry them later.
Conclusion
Amazon SNS is a pub/sub messaging service that can be used as an event-driven computing hub for AWS customers worldwide. By capturing events natively triggered by AWS services, such as EC2, S3, and RDS, you can automate and optimize all kinds of workflows, including scaling, testing, encoding, profiling, broadcasting, discovery, failover, and much more. The business use cases presented in this post ranged from recruiting websites, to scientific research, geographic systems, social networks, retail websites, and news portals.
Contributed by Zak Islam, Senior Manager, Software Development, AWS Messaging
Amazon Simple Notification Service (Amazon SNS) is a fully managed AWS service that makes it easy to decouple your application components and fan-out messages. SNS provides topics (similar to topics in message brokers such as RabbitMQ or ActiveMQ) that you can use to create 1:1, 1:N, or N:N producer/consumer design patterns. For more information about how to send messages from SNS to Amazon SQS, AWS Lambda, or HTTP(S) endpoints in the same account, see Sending Amazon SNS Messages to Amazon SQS Queues.
SNS can be used to send messages within a single account or to resources in different accounts to create administrative isolation. This enables administrators to grant only the minimum level of permissions required to process a workload (for example, limiting the scope of your application account to only send messages and to deny deletes). This approach is commonly known as the “principle of least privilege.” If you are interested, read more about AWS’s multi-account security strategy.
This is great from a security perspective, but why would you want to share messages between accounts? It may sound scary, but it’s a common practice to isolate application components (such as producer and consumer) to operate using different AWS accounts to lock down privileges in case credentials are exposed. In this post, I go slightly deeper and explore how to set up your SNS topic so that it can route messages to SQS queues that are owned by a separate AWS account.
Potential use cases
First, look at a common order processing design pattern:
This is a simple architecture. A web server submits an order directly to an SNS topic, which then fans out messages to two SQS queues. One SQS queue is used to track all incoming orders for audits (such as anti-entropy, comparing the data of all replicas and updating each replica to the newest version). The other is used to pass the request to the order processing systems.
Imagine now that a few years have passed, and your downstream processes no longer scale, so you are kicking around the idea of a re-architecture project. To thoroughly test your system, you need a way to replay your production messages in your development system. Sure, you can build a system to replicate and replay orders from your production environment in your development environment. Wouldn’t it be easier to subscribe your development queues to the production SNS topic so you can test your new system in real time? That’s exactly what you can do here.
Here’s another use case. As your business grows, you recognize the need for more metrics from your order processing pipeline. The analytics team at your company has built a metrics aggregation service and ingests data via a central SQS queue. Their architecture is as follows:
Again, it’s a fairly simple architecture. All data is ingested via SQS queues (master_ingest_queue, in this case). You subscribe the master_ingest_queue, running under the analytics team’s AWS account, to the topic that is in the order management team’s account.
Making it work
Now that you’ve seen a few scenarios, let’s dig into the details. There are a couple of ways to link an SQS queue to an SNS topic (subscribe a queue to a topic):
The queue owner can create a subscription to the topic.
The topic owner can subscribe a queue in another account to the topic.
Queue owner subscription
What happens when the queue owner subscribes to a topic? In this case, assume that the topic owner has given permission to the subscriber’s account to call the Subscribe API action using the topic ARN (Amazon Resource Name). For the examples below, also assume the following:
Topic_Owner is the identifier for the account that owns the topic MainTopic
Queue_Owner is the identifier for the account that owns the queue subscribed to the main topic
To enable the subscriber to subscribe to a topic, the topic owner must add a statement to the topic policy, via the AWS Management Console, that grants the subscriber’s account the sns:Subscribe permission on the topic ARN, as follows:
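Such a statement might look like the following; the account IDs and topic name are placeholders standing in for the Queue_Owner account and the Topic_Owner topic:

{
  "Sid": "AllowQueueOwnerToSubscribe",
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::222222222222:root"
  },
  "Action": "sns:Subscribe",
  "Resource": "arn:aws:sns:us-east-1:111111111111:MainTopic"
}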
After this has been set up, the subscriber (using account Queue_Owner) can call Subscribe to link the queue to the topic. After the queue has been successfully subscribed, SNS starts to publish notifications. In this case, neither the topic owner nor the subscriber have had to process any kind of confirmation message.
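With credentials from the Queue_Owner account, the Subscribe call itself could look like this Boto 3 sketch (the ARNs are placeholders); note that the queue’s own access policy must also allow the topic to send messages to it:

import boto3

# Run with credentials from the Queue_Owner account
sns = boto3.client('sns', region_name='us-east-1')

sns.subscribe(
    TopicArn='arn:aws:sns:us-east-1:111111111111:MainTopic',
    Protocol='sqs',
    Endpoint='arn:aws:sqs:us-east-1:222222222222:order-audit-queue'
)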
Topic owner subscription
The second way to subscribe an SQS queue to an SNS topic is to have the Topic_Owner account initiate the subscription for the queue from account Queue_Owner. In this case, SNS first sends a confirmation message to the queue. To confirm the subscription, a user who can read messages from the queue must visit the URL specified in the SubscribeURL value in the message. Until the subscription is confirmed, no notifications published to the topic are sent to the queue. To confirm a subscription, you can use the SQS console or the ReceiveMessage API action.
What’s next?
In this post, I covered a few simple use cases but the principles can be extended to complex systems as well. As you architect new systems and refactor existing ones, think about where you can leverage queues (SQS) and topics (SNS) to build a loosely coupled system that can be quickly and easily extended to meet your business need.
Customers often use public endpoints to perform cross-region replication or other application layer communication to remote regions. But a common problem is how to protect these endpoints. It can be tempting to open up the security groups to the world, due to the complexity of keeping security groups in sync across regions with a dynamically changing infrastructure.
Consider a situation where you are running large clusters of instances in different regions that all require internode connectivity. One approach would be to use a VPN tunnel between regions to provide a secure tunnel over which to send your traffic. A good example of this is the Transit VPC Solution, which is a published AWS solution to help customers quickly get up and running. However, this adds additional cost and complexity to your solution due to the newly required additional infrastructure.
Another approach, which I’ll explore in this post, is to restrict access to the nodes by whitelisting the public IP addresses of your hosts in the opposite region. Today, I’ll outline a solution that allows for cross-region security group updates, can handle remote region failures, and supports external actions such as manually terminating instances or adding instances to an existing Auto Scaling group.
Solution overview
The overview of this solution is diagrammed below. Although this post covers limiting access to your instances, you should still implement encryption to protect your data in transit.
If your entire infrastructure is running in a single region, you can reference a security group as the source, allowing your IP addresses to change without any updates required. However, if you’re going across the public internet between regions to perform things like application-level traffic or cross-region replication, this is no longer an option. Security groups are regional. When you go across regions it can be tempting to drop security to enable this communication.
Although using an Elastic IP address can provide you with a static IP address that you can define as a source for your security groups, this may not always be feasible, especially when automatic scaling is desired.
In this example scenario, you have a distributed database that requires full internode communication for replication. If you place a cluster in us-east-1 and us-west-2, you must provide a secure method of communication between the two. Because the database uses cloud best practices, you can add or remove nodes as the load varies.
To start the process of updating your security groups, you must know when an instance has come online to trigger your workflow. Auto Scaling groups have the concept of lifecycle hooks that enable you to perform custom actions as the group launches or terminates instances.
When Auto Scaling begins to launch or terminate an instance, it puts the instance into a wait state (Pending:Wait or Terminating:Wait). The instance remains in this state while you perform your various actions, until you tell Auto Scaling to continue or abandon the action, or until the timeout period ends. A lifecycle hook can trigger a CloudWatch event, publish to an Amazon SNS topic, or send to an Amazon SQS queue. For this example, you use CloudWatch Events to trigger an AWS Lambda function that updates an Amazon DynamoDB table.
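Creating such a hook on an existing Auto Scaling group might look like the following sketch (the hook and group names are placeholders):

import boto3

autoscaling = boto3.client('autoscaling')

# Pause instance launches until the security groups have been updated
autoscaling.put_lifecycle_hook(
    LifecycleHookName='sg-sync-launch-hook',
    AutoScalingGroupName='cluster-asg-us-east-1',
    LifecycleTransition='autoscaling:EC2_INSTANCE_LAUNCHING',
    HeartbeatTimeout=300,
    DefaultResult='ABANDON'
)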
Component breakdown
Here’s a quick breakdown of the components involved in this solution:
• Lambda function
• CloudWatch event
• DynamoDB table
Lambda function
The Lambda function automatically updates your security groups, in the following way:
1. Determines whether a change was triggered by your Auto Scaling group lifecycle hook or manually invoked for a “true up” functionality, which I discuss later in this post.
2. Describes the instances in the Auto Scaling group and obtains public IP addresses for each instance.
3. Updates both the local and remote DynamoDB tables.
4. Compares the list of public IP addresses for both local and remote clusters with what’s already in the local region security group, and updates the security group.
5. Compares the list of public IP addresses for both local and remote clusters with what’s already in the remote region security group, and updates the security group.
6. Signals CONTINUE back to the lifecycle hook (see the sketch after this list).
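Here is a highly simplified sketch of steps 4-6 using Boto 3; the group ID, port, and hook names are placeholders, and the real function also removes stale rules and avoids duplicates:

import boto3

ec2 = boto3.client('ec2')
autoscaling = boto3.client('autoscaling')

def add_ips_to_security_group(group_id, public_ips, port):
    # Add one /32 ingress rule per cluster member
    permissions = [{
        'IpProtocol': 'tcp',
        'FromPort': port,
        'ToPort': port,
        'IpRanges': [{'CidrIp': ip + '/32'} for ip in public_ips]
    }]
    ec2.authorize_security_group_ingress(GroupId=group_id, IpPermissions=permissions)

def signal_continue(hook_name, asg_name, instance_id):
    # Let Auto Scaling finish launching or terminating the instance
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=asg_name,
        LifecycleActionResult='CONTINUE',
        InstanceId=instance_id
    )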
CloudWatch event
The CloudWatch event triggers when an instance passes through either the launching or terminating states. When the Lambda function gets invoked, it receives an event that looks like the following:
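The exact payload depends on your hook configuration, but a lifecycle hook event delivered through CloudWatch Events generally has a shape similar to the following (all IDs and names here are placeholders):

{
  "version": "0",
  "id": "12345678-1234-1234-1234-123456789012",
  "detail-type": "EC2 Instance-launch Lifecycle Action",
  "source": "aws.autoscaling",
  "account": "123456789012",
  "time": "2017-12-01T00:00:00Z",
  "region": "us-east-1",
  "resources": [
    "arn:aws:autoscaling:us-east-1:123456789012:autoScalingGroup:11112222-3333-4444-5555-666677778888:autoScalingGroupName/cluster-asg-us-east-1"
  ],
  "detail": {
    "LifecycleActionToken": "87654321-4321-4321-4321-210987654321",
    "AutoScalingGroupName": "cluster-asg-us-east-1",
    "LifecycleHookName": "sg-sync-launch-hook",
    "EC2InstanceId": "i-0123456789abcdef0",
    "LifecycleTransition": "autoscaling:EC2_INSTANCE_LAUNCHING"
  }
}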
DynamoDB table
You use DynamoDB to store lists of remote IP addresses in a local table that is updated by the opposite region as a failsafe source of truth. Although you can describe your Auto Scaling group for the local region, you must maintain a list of IP addresses for the remote region.
To minimize the number of describe calls and prevent an issue in the remote region from blocking your local scaling actions, we keep a list of the remote IP addresses in a local DynamoDB table. Each Lambda function in each region is responsible for updating the public IP addresses of its Auto Scaling group for both the local and remote tables.
As with all the infrastructure in this solution, there is a DynamoDB table in both regions that mirror each other. For example, the following screenshot shows a sample DynamoDB table. The Lambda function in us-east-1 would update the DynamoDB entry for us-east-1 in both tables in both regions.
By updating a DynamoDB table in both regions, it allows the local region to gracefully handle issues with the remote region, which would otherwise prevent your ability to scale locally. If the remote region becomes inaccessible, you have a copy of the latest configuration from the table that you can use to continue to sync with your security groups. When the remote region comes back online, it pushes its updated public IP addresses to the DynamoDB table. The security group is updated to reflect the current status by the remote Lambda function.
Walkthrough
Note: All of the following steps are performed in both regions. The Launch Stack buttons will default to the us-east-1 region.
Here’s a quick overview of the steps involved in this process:
1. An instance is launched or terminated, which triggers an Auto Scaling group lifecycle hook, triggering the Lambda function via CloudWatch Events.
2. The Lambda function retrieves the list of public IP addresses for all instances in the local region Auto Scaling group.
3. The Lambda function updates the local and remote region DynamoDB tables with the public IP addresses just received for the local Auto Scaling group.
4. The Lambda function updates the local region security group with the public IP addresses, removing and adding to ensure that it mirrors what is present for the local and remote Auto Scaling groups.
5. The Lambda function updates the remote region security group with the public IP addresses, removing and adding to ensure that it mirrors what is present for the local and remote Auto Scaling groups.
Prerequisites
To deploy this solution, you need to have Auto Scaling groups, launch configurations, and a base security group in both regions. To expedite this process, this CloudFormation template can be launched in both regions.
Step 1: Launch the AWS SAM template in the first region
To make the deployment process easy, I’ve created an AWS Serverless Application Model (AWS SAM) template, which is a new specification that makes it easier to manage and deploy serverless applications on AWS. This template creates the following resources:
• A Lambda function, to perform the various security group actions
• A DynamoDB table, to track the state of the local and remote Auto Scaling groups
• Auto Scaling group lifecycle hooks for instance launching and terminating
• A CloudWatch event, to track the EC2 Instance-launch Lifecycle Action and EC2 Instance-terminate Lifecycle Action events
• A pointer from the CloudWatch event to the Lambda function, and the necessary permissions
Download the template, or use the Launch Stack button to deploy it directly.
Upon launching the template, you’ll be presented with a list of parameters, which include the remote/local names for your Auto Scaling groups, the AWS Region, security group IDs, DynamoDB table names, and the location of the code for the Lambda function. Because this is the first region you’re launching the stack in, fill out all the parameters except for the RemoteTable parameter, as it hasn’t been created yet (you fill this in later).
Step 2: Test the local region
After the stack has finished launching, you can test the local region. Open the EC2 console and find the Auto Scaling group that was created when launching the prerequisite stack. Change the desired number of instances from 0 to 1.
For both regions, check your security group to verify that the public IP address of the instance created is now in the security group.
Local region:
Remote region:
Now, change the desired number of instances for your group back to 0 and verify that the rules are properly removed.
Local region:
Remote region:
Step 3: Launch in the remote region
When you deploy a Lambda function using CloudFormation, the Lambda zip file needs to reside in the same region you are launching the template. Once you choose your remote region, create an Amazon S3 bucket and upload the Lambda zip file there. Next, go to the remote region and launch the same SAM template as before, but make sure you update the CodeBucket and CodeKey parameters. Also, because this is the second launch, you now have all the values and can fill out all the parameters, specifically the RemoteTable value.
Step 4: Update the local region Lambda environment variable
When you originally launched the template in the local region, you didn’t have the name of the DynamoDB table for the remote region, because you hadn’t created it yet. Now that you have launched the remote template, you can perform a CloudFormation stack update on the initial SAM template. This populates the remote DynamoDB table name into the initial Lambda function’s environment variables.
In the CloudFormation console in the initial region, select the stack. Under Actions, choose Update Stack, and select the SAM template used for both regions. Under Parameters, populate the remote DynamoDB table name, as shown below. Choose Next and let the stack update complete. This updates your Lambda function and completes the setup process.
Step 5: Final testing
You now have everything fully configured and in place to trigger security group changes based on instances being added or removed to your Auto Scaling groups in both regions. Test this by changing the desired capacity of your group in both regions.
True up functionality
If an instance is manually added or removed from the Auto Scaling group, the lifecycle hooks don’t get triggered. To account for this, the Lambda function supports a “true up” functionality in which the function can be manually invoked. If you paste in the following JSON text for your test event, it kicks off the entire workflow. For added peace of mind, you can also have this function fire via a CloudWatch event with a CRON expression for nearly continuous checking.
Now that all the resources are created in both regions, go back and break down the policy to incorporate resource-level permissions for specific security groups, Auto Scaling groups, and the DynamoDB tables.
Although this post is centered around using public IP addresses for your instances, you could instead use a VPN between regions. In this case, you would still be able to use this solution to scope down the security groups to the cluster instances. However, the code would need to be modified to support private IP addresses.
Conclusion
At this point, you now have a mechanism in place that captures when a new instance is added to or removed from your cluster and updates the security groups in both regions. This ensures that you are locking down your infrastructure securely by allowing access only to other cluster members.
Keep in mind that this architecture (lifecycle hooks, CloudWatch event, Lambda function, and DynamoDB table) requires the infrastructure to be deployed in both regions, so that synchronization goes both ways.
Because this Lambda function is modifying security group rules, it’s important to have an audit log of what has been modified and who is modifying them. The out-of-the-box function provides logs in CloudWatch for what IP addresses are being added and removed for which ports. As these are all API calls being made, they are logged in CloudTrail and can be traced back to the IAM role that you created for your lifecycle hooks. This can provide historical data that can be used for troubleshooting or auditing purposes.
Security is paramount at AWS. We want to ensure that customers are protecting access to their resources. This solution helps you keep your security groups in both regions automatically in sync with your Auto Scaling group resources. Let us know if you have any questions or other solutions you’ve come up with!