Tag Archives: microservices

Using API Gateway as a Single Entry Point for Web Applications and API Microservices

Post Syndicated from Anandprasanna Gaitonde original https://aws.amazon.com/blogs/architecture/using-api-gateway-as-a-single-entry-point-for-web-applications-and-api-microservices/

Introduction

The benefits of high availability, scalability, and elasticity that AWS offers have proven to be a boon for Software-as-a-Service (SaaS) providers. AWS has also made it seamless to adopt microservices architectures for modernizing these SaaS applications, as well as to provide API-based access for external applications.

An API management layer such as Amazon API Gateway is a natural choice for customers to expose APIs externally in a secure and highly scalable manner. However, as these providers move their software applications and services to the cloud, they may spin up redundant AWS environments, one for each customer, typically driven by requirements unique to each customer.

However, there is potential to create a multi-tenant microservices architecture using the capabilities of API Gateway. This architecture uses the same instance of a microservice to serve different customers, leading to better utilization of the environment and lower cost. This configuration requires providers to support white-labeling of domains for each of their customers, as well as identification of the customer domain so that backend microservices can apply customer-specific business logic.

This blog post describes a reference architecture that allows API Gateway to act as a single entry point for external-facing, API-based microservices and web applications across multiple external customers by leveraging a different subdomain for each one.

Amazon API Gateway: A Single Entry-Point

Using a single API Gateway in the architecture across multiple web portal applications and microservices is an important consideration towards the goal of reusability of components and cost optimization.

Amazon API Gateway provides a highly scalable solution to create and publish RESTful and WebSocket APIs. It provides flexibility in choosing backend technologies such as AWS Lambda functions, AWS Step Functions state machines, or HTTP(S) endpoints hosted on AWS Elastic Beanstalk, Amazon EC2, or even non-AWS-hosted HTTP-based services.

API Gateway allows for handling common API management tasks such as security, caching, throttling, and monitoring. While its primary objective is to provide that abstraction layer on top of your backend APIs and microservices, it can also allow backends to be simple web applications for web portal access or Amazon S3 buckets for providing access to static web content or documents.

Along with the above capabilities, the following key features of API Gateway help create the architecture described here.

  1. Custom Domain Names support:
    When an API is deployed using API Gateway, the default API endpoint domain name looks like this: https://api-id.execute-api.region.amazonaws.com/stage, where api-id is generated by API Gateway, region is specified by you when creating the API, and stage is specified by you when deploying the API. This default endpoint is hard to recall and not user friendly. To provide a simpler and more intuitive URL for your API users, API Gateway lets you specify a custom domain name such as customer1.example.com via its integration with AWS Certificate Manager (ACM), which allows for SSL certificate-based validation of the subdomains. API Gateway allows you to map multiple subdomains to a single API endpoint, letting you white-label the domains based on an external customer’s requirement. (A minimal setup sketch follows this list.)
  2. API request/response transformation:
    API Gateway allows you to specify the integration of each path of the API endpoint separately. This allows you to route API requests for each path to a separate backend endpoint and, at the same time, apply any request/response transformations, such as custom header insertion or modification of existing headers, to manage any custom handling of APIs.
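
As a rough illustration (not part of the original post), the white-labeled custom domains and base path mappings might be created with boto3 along these lines; the API ID, stage name, and certificate ARN are hypothetical placeholders.

```python
import boto3

# Hypothetical identifiers -- substitute your own API ID, stage, and ACM certificate ARN.
REST_API_ID = "a1b2c3d4e5"
STAGE = "prod"
CERT_ARN = "arn:aws:acm:us-east-1:111122223333:certificate/EXAMPLE"

apigw = boto3.client("apigateway")

# Create a white-labeled regional custom domain per external customer,
# all backed by the wildcard ACM certificate for *.example.com.
for domain in ["customer1.example.com", "customer2.example.com"]:
    apigw.create_domain_name(
        domainName=domain,
        regionalCertificateArn=CERT_ARN,
        endpointConfiguration={"types": ["REGIONAL"]},
    )
    # Map the custom domain to the shared API's deployed stage.
    apigw.create_base_path_mapping(
        domainName=domain,
        restApiId=REST_API_ID,
        stage=STAGE,
    )
```

With both subdomains mapped to the same API endpoint, the shared microservices behind it serve every customer.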

Architecture and Its Benefits

The architecture shown in the diagram below brings together the features explained above.

This architecture is an example of a typical SaaS provider who wants to offer its services to other enterprises and needs to support white-labeling domains for this web and API infrastructure. This is achieved using the following steps:

    1. A single domain, example.com, can be registered with a domain registrar, and you can create subdomains such as customer1.example.com and customer2.example.com by adding CNAME records to the domain’s DNS configuration. DNS can be handled by Amazon Route 53, AWS’s own DNS and registrar service, or by any third-party domain name provider.
    2. Next, use AWS Certificate Manager (ACM) to create a certificate for the following domains: example.com and *.example.com. The wildcard ensures that the ACM certificate, once applied to API Gateway, allows multiple subdomains to be served by it.
    3. Using the certificate created in ACM, create custom domains for the API endpoint, specifying base path mappings as needed. In this example the API endpoint serves two subdomains for two different external customers, so the following two subdomains are created as custom domains: customer1.example.com and customer2.example.com.
      Note: Make sure to add CNAME records for customer1 and customer2 at your DNS provider to point to the target domain name created within your API Gateway for each of the two customer sub-domains.
    4. The API Endpoint is then configured with the following API resources:
      1. HTTP integration of /service1 to route traffic to the ELB endpoint of a microservice hosted on an ECS cluster
      2. HTTP integration of /service2 to route traffic to the ELB endpoint of a web application hosted on an EC2 cluster
    5. API Gateway allows you to capture the FQDN of the URL and map it to custom headers or query string parameters, which are then sent to the backend service integrated with the corresponding API resource and HTTP method. For example, we can create a custom header called “Customer” to forward customer1 or customer2 to the backend application for customer-specific business logic. This is done using the Method Request parameters and Integration Request configuration within API Gateway (see the sketch following these steps).
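
The following hedged sketch (not from the original post) shows one way this header mapping might be wired up with boto3, assuming an existing ANY method on the /service1 resource; the API and resource IDs are hypothetical placeholders.

```python
import boto3

apigw = boto3.client("apigateway")

# Hypothetical identifiers for the shared API and the /service1 resource.
REST_API_ID = "a1b2c3d4e5"
RESOURCE_ID = "abc123"

# Declare the Host header on the method request so it can be mapped downstream.
apigw.update_method(
    restApiId=REST_API_ID,
    resourceId=RESOURCE_ID,
    httpMethod="ANY",
    patchOperations=[
        {"op": "add", "path": "/requestParameters/method.request.header.Host", "value": "false"},
    ],
)

# Forward the white-labeled FQDN (e.g. customer1.example.com) to the backend
# as a custom "Customer" header on the integration request.
apigw.update_integration(
    restApiId=REST_API_ID,
    resourceId=RESOURCE_ID,
    httpMethod="ANY",
    patchOperations=[
        {
            "op": "add",
            "path": "/requestParameters/integration.request.header.Customer",
            "value": "method.request.header.Host",
        },
    ],
)
```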

Summary

As you can see, this is one approach to using API Gateway as a single entry point for API-based microservices and web application assets. It allows you to use infrastructure more cost effectively without losing the advantages of scaling when demand for your applications grows. You can read more about working with API Gateway and Route 53 DNS in the AWS documentation and use these capabilities to create architectures that suit your specific requirements.

Open Sourcing Mantis: A Platform For Building Cost-Effective, Realtime, Operations-Focused…

Post Syndicated from Netflix Technology Blog original https://medium.com/netflix-techblog/open-sourcing-mantis-a-platform-for-building-cost-effective-realtime-operations-focused-5b8ff387813a?source=rss----2615bd06b42e---4

Open Sourcing Mantis: A Platform For Building Cost-Effective, Realtime, Operations-Focused Applications

By Jeff Chao on behalf of the Mantis team

Today we’re excited to announce that we’re open sourcing Mantis, a platform that helps Netflix engineers better understand the behavior of their applications to ensure the highest quality experience for our members. We believe the challenges we face here at Netflix are not necessarily unique to Netflix, which is why we’re sharing it with the broader community.

As a streaming microservices ecosystem, the Mantis platform provides engineers with capabilities to minimize the costs of observing and operating complex distributed systems without compromising on operational insights. Engineers have built cost-efficient applications on top of Mantis to quickly identify issues, trigger alerts, and apply remediations to minimize or completely avoid downtime to the Netflix service. Where other systems may take over ten minutes to process metrics accurately, Mantis reduces that from tens of minutes down to seconds, effectively reducing our Mean-Time-To-Detect. This is crucial because any amount of downtime is brutal and comes with an incredibly high impact to our members — every second counts during an outage.

As the company continues to grow our member base, and as those members use the Netflix service even more, having cost-efficient, rapid, and precise insights into the operational health of our systems is only growing in importance. For example, a five-minute outage today is equivalent to a two-hour outage at the time of our last Mantis blog post.

Mantis Makes It Easy to Answer New Questions

The traditional way of working with metrics and logs alone is not sufficient for large-scale and growing systems. Metrics and logs require that you know what you want to answer ahead of time. Mantis, on the other hand, allows us to sidestep this drawback completely by giving us the ability to answer new questions without having to add new instrumentation. Instead of logs or metrics, Mantis enables a democratization of events where developers can tap into an event stream from any instrumented application on demand. By making consumption on-demand, you’re able to freely publish all of your data to Mantis.

Mantis is Cost-Effective in Answering Questions

Publishing 100% of your operational data so that you’re able to answer new questions in the future is traditionally cost prohibitive at scale. Mantis uses an on-demand, reactive model where you don’t pay the cost for these events until something is subscribed to their stream. To further reduce cost, Mantis reissues the same data for equivalent subscribers. In this way, Mantis is differentiated from other systems by allowing us to achieve streaming-based observability on events while empowering engineers with the tooling to reduce costs that would otherwise become detrimental to the business.

From the beginning, we’ve built Mantis with this exact guiding principle in mind: Let’s make sure we minimize the costs of observing and operating our systems without compromising on required and opportunistic insights.

Guiding Principles Behind Building Mantis

The following are the guiding principles behind building Mantis.

  1. We should have access to raw events. Applications that publish events into Mantis should be free to publish every single event. If we prematurely transform events at this stage, then we’re already at a disadvantage when it comes to getting insight since data in its original form is already lost.
  2. We should be able to access these events in realtime. Operational use cases are inherently time sensitive by nature. The traditional method of publishing, storing, and then aggregating events in batch is too slow. Instead, we should process and serve events one at a time as they arrive. This becomes increasingly important with scale as the impact becomes much larger in far less time.
  3. We should be able to ask new questions of this data without having to add new instrumentation to our applications. It’s not possible to know ahead of time every single possible failure mode our systems might encounter, despite all the rigor built in to make these systems resilient. When these failures do inevitably occur, it’s important that we can derive new insights from this data. You should be able to publish as large an event with as much context as you want. That way, when you think of a new question to ask of your systems in the future, the data will be available for you to answer it.
  4. We should be able to do all of the above in a cost-effective way. As our business critical systems scale, we need to make sure the systems in support of these business critical systems don’t end up costing more than the business critical systems themselves.

With these guiding principles in mind, let’s take a look at how Mantis brings value to Netflix.

How Mantis Brings Value to Netflix

Mantis has been in production for over four years. Over this period several critical operational insight applications have been built on top of the Mantis platform.

A few noteworthy examples include:

Realtime monitoring of Netflix streaming health which examines all of Netflix’s streaming video traffic in realtime and accurately identifies negative impact on the viewing experience with fine-grained granularity. This system serves as an early warning indicator of the overall health of the Netflix service and will trigger alerts to relevant teams within seconds.

Contextual Alerting which analyzes millions of interactions between dozens of Netflix microservices in realtime to identify anomalies and provide operators with rich and relevant context. The realtime nature of these Mantis-backed aggregations allows the Mean-Time-To-Detect to be cut down from tens of minutes to a few seconds. Given the scale of Netflix this makes a huge impact.

Raven which allows users to perform ad-hoc exploration of realtime data from hundreds of streaming sources using our Mantis Query Language (MQL).

Cassandra Health check which analyzes rich operational events in realtime to generate a holistic picture of the health of every Cassandra cluster at Netflix.

Alerting on Log data which detects application errors by processing data from thousands of Netflix servers in realtime.

Chaos Experimentation monitoring which tracks user experience during a Chaos exercise in realtime and triggers an abort of the chaos exercise in case of an adverse impact.

Realtime Personally Identifiable Information (PII) data detection samples data across all streaming sources to quickly identify transmission of sensitive data.

Try It Out Today

To learn more about Mantis, you can check out the main Mantis page. You can try out Mantis today by spinning up your first Mantis cluster locally using Docker or using the Mantis CLI to bootstrap a minimal cluster in AWS. You can also start contributing to Mantis by getting the code on Github or engaging with the community on the users or dev mailing list.

Acknowledgements

A lot of work has gone into making Mantis successful at Netflix. We’d like to thank all the contributors, in alphabetical order by first name, that have been involved with Mantis at various points of its existence:

Andrei Ushakov, Ben Christensen, Ben Schmaus, Chris Carey, Cody Rioux, Daniel Jacobson, Danny Yuan, Erik Meijer, Indrajit Roy Choudhury, Jeff Chao, Josh Evans, Justin Becker, Kathrin Probst, Kevin Lew, Neeraj Joshi, Nick Mahilani, Piyush Goyal, Prashanth Ramdas, Ram Vaithalingam, Ranjit Mavinkurve, Sangeeta Narayanan, Santosh Kalidindi, Seth Katz, Sharma Podila, Zhenzhong Xu.



How Netflix microservices tackle dataset pub-sub

Post Syndicated from Netflix Technology Blog original https://medium.com/netflix-techblog/how-netflix-microservices-tackle-dataset-pub-sub-4a068adcc9a?source=rss----2615bd06b42e---4

By Ammar Khaku

Introduction

In a microservice architecture such as Netflix’s, propagating datasets from a single source to multiple downstream destinations can be challenging. These datasets can represent anything from service configuration to the results of a batch job; they are often needed in-memory to optimize access and must be updated as they change over time.

One example displaying the need for dataset propagation: at any given time Netflix runs a very large number of A/B tests. These tests span multiple services and teams, and the operators of the tests need to be able to tweak their configuration on the fly. There needs to be the ability to detect nodes that have failed to pick up the latest test configuration, and the ability to revert to older versions of configuration when things go wrong.

Another example of a dataset that needs to be disseminated is the result of a machine-learning model: the results of these models may be used by several teams, but the ML teams behind the model aren’t necessarily interested in maintaining high-availability services in the critical path. Rather than each team interested in consuming the model having to build in fallbacks to degrade gracefully, there is a lot of value in centralizing the work to allow multiple teams to leverage a single team’s effort.

Without infrastructure-level support, every team ends up building their own point solution to varying degrees of success. Datasets themselves are of varying size, from a few bytes to multiple gigabytes. It is important to build in observability and fault detection, and to provide tooling to allow operators to make quick changes without having to develop their own tools.

Dataset propagation

At Netflix we use an in-house dataset pub/sub system called Gutenberg. Gutenberg allows for propagating versioned datasets — consumers subscribe to data and are updated to the latest versions when they are published. Each version of the dataset is immutable and represents a complete view of the data — there is no dependency on previous versions of data. Gutenberg allows browsing older versions of data for use cases such as debugging, rapid mitigation of data related incidents, and re-training of machine-learning models. This post is a high level overview of the design and architecture of Gutenberg.

Data model

1 topic -> many versions

The top-level construct in Gutenberg is a “topic”. A publisher publishes to a topic and consumers consume from a topic. Publishing to a topic creates a new monotonically-increasing “version”. Topics have a retention policy that specifies a number of versions or a number of days of versions, depending on the use case. For example, you could configure a topic to retain 10 versions or 10 days of versions.

Each version contains metadata (keys and values) and a data pointer. You can think of a data pointer as special metadata that points to where the actual data you published is stored. Today, Gutenberg supports direct data pointers (where the payload is encoded in the data pointer value itself) and S3 data pointers (where the payload is stored in S3). Direct data pointers are generally used when the data is small (under 1MB) while S3 is used as a backing store when the data is large.
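
To make the data model concrete, here is a minimal, illustrative Python sketch (not from the post; all names are hypothetical) of topics, versions, and the two data pointer types described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class DirectDataPointer:
    # Small payloads (roughly under 1MB) are encoded directly in the pointer.
    payload: bytes


@dataclass
class S3DataPointer:
    # Large payloads live in S3; the pointer only records where to fetch them.
    bucket: str
    key: str


@dataclass
class Version:
    number: int                              # monotonically increasing per topic
    metadata: Dict[str, str]                 # free-form keys and values
    data_pointer: Union[DirectDataPointer, S3DataPointer]


@dataclass
class Topic:
    name: str
    retention_versions: Optional[int] = None  # e.g. keep 10 versions...
    retention_days: Optional[int] = None      # ...or 10 days of versions
    versions: List[Version] = field(default_factory=list)

    def publish(self, metadata: Dict[str, str],
                data_pointer: Union[DirectDataPointer, S3DataPointer]) -> Version:
        """Publishing creates a new immutable version; older versions stay browsable."""
        version = Version(number=len(self.versions) + 1,
                          metadata=metadata, data_pointer=data_pointer)
        self.versions.append(version)
        return version
```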

1 topic -> many publish scopes

Gutenberg provides the ability to scope publishes to a particular set of consumers — for example by region, application, or cluster. This can be used to canary data changes with a single cluster, roll changes out incrementally, or constrain a dataset so that only a subset of applications can subscribe to it. Publishers decide the scope of a particular data version publish, and they can later add scopes to a previously published version. Note that this means that the concept of a latest version depends on the scope — two applications may see different versions of data as the latest depending on the publish scopes created by the publisher. The Gutenberg service matches the consuming application with the published scopes before deciding what to advertise as the latest version.

Use cases

The most common use case of Gutenberg is to propagate varied sizes of data from a single publisher to multiple consumers. Often the data is held in memory by consumers and used as a “total cache”, where it is accessed at runtime by client code and atomically swapped out under the hood. Many of these use cases can be loosely grouped as “configuration” — for example Open Connect Appliance cache configuration, supported device type IDs, supported payment method metadata, and A/B test configuration. Gutenberg provides an abstraction between the publishing and consumption of this data — this allows publishers the freedom to iterate on their application without affecting downstream consumers. In some cases, publishing is done via a Gutenberg-managed UI, and teams do not need to manage their own publishing app at all.

Another use case for Gutenberg is as a versioned data store. This is common for machine-learning applications, where teams build and train models based on historical data, see how it performs over time, then tweak some parameters and run through the process again. More generally, batch-computation jobs commonly use Gutenberg to store and propagate the results of a computation as distinct versions of datasets. “Online” use cases subscribe to topics to serve real-time requests using the latest versions of topics’ data, while “offline” systems may instead use historical data from the same topics — for example to train machine-learned models.

An important point to note is that Gutenberg is not designed as an eventing system — it is meant purely for data versioning and propagation. In particular, rapid-fire publishes do not result in subscribed clients stepping through each version; when they ask for an update, they will be provided with the latest version, even if they are currently many versions behind. Traditional pub-sub or eventing systems are suited towards messages that are smaller in size and are consumed in sequence; consumers may build up a view of an entire dataset by consuming an entire (potentially compacted) feed of events. Gutenberg, however, is designed for publishing and consuming an entire immutable view of a dataset.

Design and architecture

Gutenberg consists of a service with gRPC and REST APIs as well as a Java client library that uses the gRPC API.

High-level architecture

Client

The Gutenberg client library handles tasks such as subscription management, S3 uploads/downloads, Atlas metrics, and knobs you can tweak using Archaius properties. It communicates with the Gutenberg service via gRPC, using Eureka for service discovery.

Publishing

Publishers generally use high-level APIs to publish strings, files, or byte arrays. Depending on the data size, the data may be published as a direct data pointer or it may get uploaded to S3 and then published as an S3 data pointer. The client can upload a payload to S3 on the caller’s behalf or it can publish just the metadata for a payload that already exists in S3.

Direct data pointers are automatically replicated globally. Data that is published to S3 is uploaded to multiple regions by the publisher by default, although that can be configured by the caller.

Subscription management

The client library provides subscription management for consumers. This allows users to create subscriptions to particular topics, where the library retrieves data (e.g. from S3) before handing it off to a user-provided listener. Subscriptions operate on a polling model — they ask the service for a new update every 30 seconds, providing the version with which they were last notified. Subscribed clients will never consume an older version of data than the one they are on unless they are pinned (see “Data resiliency” below). Retry logic is baked in and configurable — for instance, users can configure Gutenberg to try older versions of data if it fails to download or process the latest version of data on startup, often to deal with non-backwards-compatible data changes. Gutenberg also provides a pre-built subscription that holds on to the latest data and atomically swaps it out under the hood when a change comes in — this tackles a majority of subscription use cases, where callers only care about the current value at any given time. It allows callers to specify a default value — either for a topic that has never been published to (a good fit when the topic is used for configuration) or if there is an error consuming the topic (to avoid blocking service startup when there is a reasonable default).
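
The real client is a Java library; purely as an illustration of the polling-and-atomic-swap behaviour described above, here is a hedged Python sketch with hypothetical names and callbacks.

```python
import threading
from typing import Any, Callable, Optional, Tuple


class LatestValueSubscription:
    """Illustrative sketch: poll for a newer version every 30s and atomically
    swap the held value, starting from a default until the first fetch succeeds."""

    def __init__(self,
                 poll_for_update: Callable[[Optional[int]], Optional[Tuple[int, Any]]],
                 fetch_data: Callable[[Any], Any],
                 default: Any = None,
                 poll_interval_s: float = 30.0):
        self._poll_for_update = poll_for_update  # (last_version) -> (version, pointer) or None
        self._fetch_data = fetch_data             # pointer -> materialized data (e.g. from S3)
        self._value = default
        self._last_version: Optional[int] = None
        self._poll_interval_s = poll_interval_s
        self._timer: Optional[threading.Timer] = None

    def start(self) -> None:
        self._poll_once()

    def get(self) -> Any:
        # Callers only ever see a fully materialized value (the default or a complete version).
        return self._value

    def _poll_once(self) -> None:
        try:
            update = self._poll_for_update(self._last_version)
            if update is not None:
                version, pointer = update
                data = self._fetch_data(pointer)
                # Atomic swap: a single reference assignment, never a partial update.
                self._value = data
                self._last_version = version
        finally:
            # Schedule the next poll regardless of success or failure.
            self._timer = threading.Timer(self._poll_interval_s, self._poll_once)
            self._timer.daemon = True
            self._timer.start()
```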

Consumption APIs

Gutenberg also provides high-level client APIs that wrap the low-level gRPC APIs and provide additional functionality and observability. One example of this is to download data for a given topic and version — this is used extensively by components plugged into Netflix Hollow. Another example is a method to get the “latest” version of a topic at a particular time — a common use case when debugging and when training ML models.

Client resiliency and observability

Gutenberg was designed with a bias towards allowing consuming services to be able to start up successfully versus guaranteeing that they start with the freshest data. With this in mind, the client library was built with fallback logic for when it cannot communicate with the Gutenberg service. After HTTP request retries are exhausted, the client downloads a fallback cache of topic publish metadata from S3 and works based off of that. This cache contains all the information needed to decide whether an update needs to be applied, and from where data needs to be fetched (either from the publish metadata itself or from S3). This allows clients to fetch data (which is potentially stale, depending on how current that fallback cache is) without using the service.

Part of the benefit of providing a client library is the ability to expose metrics that can be used to alert on an infrastructure-wide issue or issues with specific applications. Today these metrics are used by the Gutenberg team to monitor our publish-propagation SLI and to alert in the event of widespread issues. Some clients also use these metrics to alert on app-specific errors, for example individual publish failures or a failure to consume a particular topic.

Server

The Gutenberg service is a Governator/Tomcat application that exposes gRPC and REST endpoints. It uses a globally-replicated Cassandra cluster for persistence and to propagate publish metadata to every region. Instances handling consumer requests are scaled separately from those handling publish requests — there are approximately 1000 times more consumer requests than there are publish requests. In addition, this insulates publishing from consumption — a sudden spike in publishing will not affect consumption, and vice versa.

Each instance in the consumer request cluster maintains its own in-memory cache of “latest publishes”, refreshing it from Cassandra every few seconds. This is to handle the large volume of poll requests coming from subscribed clients without passing on the traffic to the Cassandra cluster. In addition, request-pooling low-TTL caches protect against large spikes in requests that could potentially burden Cassandra enough to affect an entire region — we’ve had situations where transient errors coinciding with redeployments of large clusters have caused Gutenberg service degradation. Furthermore, we use an adaptive concurrency limiter bucketed by source application to throttle misbehaving applications without affecting others.

For cases where the data was published to S3 buckets in multiple regions, the server makes a decision on what bucket to send back to the client to download from based on where the client is. This also allows the service to provide the client with a bucket in the “closest” region, and to have clients fall back to another region if there is a region outage.

Before returning subscription data to consumers, the Gutenberg service first runs consistency checks on the data. If the checks fail and the polling client already has consumed some data the service returns nothing, which effectively means that there is no update available. If the polling client has not yet consumed any data (this usually means it has just started up), the service queries the history for the topic and returns the latest value that passes consistency checks. This is because we see sporadic replication delays at the Cassandra layer, where by the time a client polls for new data, the metadata associated with the most recently published version has only been partially replicated. This can result in incomplete data being returned to the client, which then manifests itself either as a data fetch failure or an obscure business-logic failure. Running these consistency checks on the server insulates consumers from the eventual-consistency caveats that come with the service’s choice of a data store.
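
Here is a short, illustrative sketch (hypothetical names, not actual Gutenberg code) of the server-side decision logic described above.

```python
from typing import Callable, List, Optional


def resolve_poll_response(latest_version: dict,
                          history: List[dict],
                          client_current_version: Optional[int],
                          passes_consistency_checks: Callable[[dict], bool]) -> Optional[dict]:
    """Sketch of the decision described above:
    - if the latest metadata passes consistency checks, return it;
    - if it fails and the client already has data, return nothing (no update available);
    - if it fails and the client has no data yet (e.g. it just started up),
      walk the topic history and return the newest version that passes."""
    if passes_consistency_checks(latest_version):
        return latest_version
    if client_current_version is not None:
        return None  # effectively "no update available"
    for version in sorted(history, key=lambda v: v["version"], reverse=True):
        if passes_consistency_checks(version):
            return version
    return None
```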

Visibility on topic publishes and nodes that consume a topic’s data is important for auditing and to gather usage info. To collect this data, the service intercepts requests from publishers and consumers (both subscription poll requests and others) and indexes them in Elasticsearch by way of the Keystone data pipeline. This allows us to gain visibility into topic usage and decommission topics that are no longer in use. We expose deep-links into a Kibana dashboard from an internal UI to allow topic owners to get a handle on their consumers in a self-serve manner.

In addition to the clusters serving publisher and consumer requests, the Gutenberg service runs another cluster for periodic tasks. Specifically, it runs two tasks:

  1. Every few minutes, all the latest publishes and metadata are gathered up and sent to S3. This powers the fallback cache used by the client as detailed above.
  2. A nightly janitor job purges topic versions which exceed their topic’s retention policy. This deletes the underlying data as well (e.g. S3 objects) and helps enforce a well-defined lifecycle for data.

Data resiliency

Pinning

In the world of application development bad deployments happen, and a common mitigation strategy there is to roll back the deployment. A data-driven architecture makes that tricky, since behavior is driven by data that changes over time.

Data propagated by Gutenberg influences — and in many cases drives — system behavior. This means that when things go wrong, we need a way to roll back to a last-known good version of data. To facilitate this, Gutenberg provides the ability to “pin” a topic to a particular version. Pins override the latest version of data and force clients to update to that version — this allows for quick mitigation rather than having an under-pressure operator attempt to figure out how to publish the last known good version. You can even apply a pin to a specific publish scope so that only consumers that match that scope are pinned. Pins also override data that is published while the pin is active, but when the pin is removed clients update to the latest version, which may be the latest version when the pin was applied or a version published while the pin was active.
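
As an illustration only, here is a tiny sketch of how a pin might override the latest version for consumers whose scope matches; the scope keys and version numbers are hypothetical.

```python
from typing import Optional


def effective_version(latest_version: int,
                      pin: Optional[dict],
                      consumer_scope: dict) -> int:
    """Sketch: a pin overrides the latest version for consumers in its scope,
    including versions published while the pin is active."""
    if pin and all(consumer_scope.get(k) == v for k, v in pin["scope"].items()):
        return pin["version"]
    return latest_version


# Example: everyone sees version 42 except us-east-1 consumers, pinned to 37.
pin = {"scope": {"region": "us-east-1"}, "version": 37}
print(effective_version(42, pin, {"region": "us-east-1", "app": "playback"}))  # -> 37
print(effective_version(42, pin, {"region": "eu-west-1", "app": "playback"}))  # -> 42
```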

Incremental rollout

When deploying new code, it’s often a good idea to canary new builds with a subset of traffic, roll it out incrementally, or otherwise de-risk a deployment by taking it slow. For cases where data drives behavior, a similar principle should be applied.

One feature Gutenberg provides is the ability to incrementally roll out data publishes via Spinnaker pipelines. For a particular topic, users configure what publish scopes they want their publish to go to and what the delay is between each one. Publishing to that topic then kicks off the pipeline, which publishes the same data version to each scope incrementally. Users are able to interact with the pipeline; for example they may choose to pause or cancel pipeline execution if their application starts misbehaving, or they may choose to fast-track a publish to get it out sooner. For example, for some topics we roll out a new dataset version one AWS region at a time.

Scale

Gutenberg has been in use at Netflix for the past three years. At present, Gutenberg stores low tens-of-thousands of topics in production, about a quarter of which have published at least once in the last six months. Topics are published at a variety of cadences — from tens of times a minute to once every few months — and on average we see around 1–2 publishes per second, with peaks and troughs about 12 hours apart.

In a given 24 hour period, the number of nodes that are subscribed to at least one topic is in the low six figures. The largest number of topics a single one of these nodes is subscribed to is north of 200, while the median is 7. In addition to subscribed applications, there are a large number of applications that request specific versions of specific topics, for example for ML and Hollow use cases. Currently the number of nodes that make a non-subscribe request for a topic is in the low hundreds of thousands, the largest number of topics requested is 60, and the median is 4.

Future work

Here’s a sample of work we have planned for Gutenberg:

  • Polyglot support: today Gutenberg only supports a Java client, but we’re seeing an increasing number of requests for Node.js and Python support. Some of these teams have cobbled together their own solutions built on top of the Gutenberg REST API or other systems. Rather than have different teams reinvent the wheel, we plan to provide first-class client libraries for Node.js and Python.
  • Encryption and access control: for sensitive data, Gutenberg publishers should be able to encrypt data and distribute decryption credentials to consumers out-of-band. Adding this feature opens Gutenberg up to another set of use-cases.
  • Better incremental rollout: the current implementation is still in its early days and needs a lot of work to support customization for a variety of use cases. For example, users should be able to customize the rollout pipeline to automatically accept or reject a data version based on their own tests.
  • Alert templates: the metrics exposed by the Gutenberg client are used by the Gutenberg team and a few teams that are power users. Instead, we plan to provide leverage to users by building and parameterizing templates they can use to set up alerts for themselves.
  • Topic cleanup: currently topics sit around forever unless they are explicitly deleted, even if no one is publishing to them or consuming from them. We plan on building an automated topic cleanup system based on the consumption trends indexed in Elasticsearch.
  • Data catalog integration: an ongoing issue at Netflix is the problem of cataloging data characteristics and lineage. There is an effort underway to centralize metadata around data sources and sinks, and once Gutenberg integrates with this, we can leverage the catalog to automate tools that message the owners of a dataset.

If any of this piques your interest — we’re hiring!



About being a Principal Engineer at Grab

Post Syndicated from Grab Tech original https://engineering.grab.com/about-being-a-principal-engineer-at-grab

Over the past few years Grab has grown from a small startup to one of the largest technology companies in South-East Asia. Along with the company’s growth, the number of microservices, features and teams also grew substantially. At the time of writing this blog, we have around 350 microservices powering our super-app.

A great engineering team is a critical component of our success. As an engineer you have two career paths in front of you: an individual contributor role, or a management role. While a management role is generally better understood, this article clarifies what it means to be a principal engineer at Grab, which is one of the highest levels of our engineering career ladder.

Engineering Career Ladder

Improving the Quality

“You set the standard for engineering excellence in your technical family. Your architectures are exemplary in terms of efficiency, stability, extensibility, testability and the ability to evolve over time. Your software is robust in the presence of failures, scalable, and cost-effective. Your coding practices are exemplary in terms of code organization, clarity, simplicity, error handling, and documentation. You tackle intrinsically hard problems, acquiring expertise as needed. You decompose complex problems into straightforward solutions.” – Grab’s Engineering Career Ladder

 

So, what does a principal engineer do? As your career progresses from junior to senior to lead engineer, you take on more and more responsibility and manage larger and larger systems. For example, a junior engineer might manage a specific component of a microservice, a senior engineer would be tasked with designing and operating an entire microservice or product, while a lead engineer would typically be concerned with the architecture at a team level.

The principal engineer level is akin to that of a senior manager: instead of indirectly managing people (as a manager of managers), you take care of the architecture of an entire sub-organisation, known as a Tech Family/Platform. These Tech Families usually have more than 50 engineers spread across multiple teams and function as a tiny company with their own business owners, designers, product managers, etc.

Challenging Projects

“You take engineering ownership of projects that may require the work of several teams to implement; you divide responsibilities so that each team can work independently and have the system come together into an integrated whole. Your projects often cross team, tech family, platform, and even R&D center boundaries. You solicit differing views and keep an open mind. You are adept at building consensus.” – Grab’s Engineering Career Ladder

 

As a principal engineer, your job is to solve larger problems and translate somewhat vague problems into a set of actionable items. You might be faced with a large problem such as “improve efficiency and interoperability of Grab’s transportation system.” You will need to understand the problem and the business impact, and see how it can be improved. It might require you to design new systems, change existing systems, understand the costs involved, and get the right people together to make it happen.

Solving such a problem all by yourself is pretty much impossible. You have to work with other managers and other engineers together as a team to make it happen. Help your lead/senior engineers to design the right system by giving them a clear objective but let them take care of the system-level architecture.  

You will also need to work with managers, advise them to get things done, and get the right things prioritised by the team. While you don’t need to be well-versed in project management and agile methodologies, you do need to be able to plan ahead with your teams and have an understanding of how much time a project or migration will take.

A Tech Family can easily have 20 or more micro-services. You need to have a good understanding of their functional requirements and interactions. This is challenging as learning new things is always “uncomfortable” and takes time. You must reach out to engineers, product managers, and data scientists, ideally face-to-face to build empathy. Keep asking questions and try to understand how things work. You will also need to read the existing documentation and their code.

Technical Ownership

“You are the origin of significant technical contributions to our architecture and infrastructure.  You take technical ownership of the design and quality of the security, performance, availability, and operational aspects of the software built by one or more teams. You identify where your time is needed, transitioning between coding, design, and architecture based on project and team needs. You deliver software in ways that empower teams to self-service, providing clear adoption/migration paths.” – Grab’s Engineering Career Ladder

 

As a principal engineer you work together with the Head of Engineering and managers within the Tech Family and improve the quality of systems across the board. Typically, no-one tells you what needs to be done. You need to identify gaps, raise them and keep improving the systems.

You also need to learn how to manage your own time better so you can prioritise effectively. This boils down to knowing your strengths and your weaknesses. For example, if you are really good at building distributed systems but have no clue about the latest-and-greatest design in information security, get the right InfoSec engineers into that meeting and consider skipping it yourself. Avoid trying to do everything at once and being in every single meeting you get invited to – you still have to review code, design, and focus, so plan accordingly.

You will also need to understand the business impact of your decisions. For example, if you contribute to product features, know how impactful this feature is going to be to the organisation. If you don’t know it – ask the Product Manager responsible for it. If you work on a platform feature, for example improving the build system, know how it will help: saving 30 minutes of build time for every engineer each day is a huge achievement.

More often than not, you will have to drive migrations; this is akin to code refactoring but on a system level, and it will involve a lot of collaboration with people. Understand what technical debt is and how it can be mitigated – a good architecture minimises technical debt and in turn accelerates time-to-market and helps the business flourish.

Technical Leadership

“You amplify your impact by leading design reviews for complex software and/or critical features. You probe assumptions, illuminate pitfalls, and foster shared understanding. You align teams toward coherent architectural strategies.” – Grab’s Engineering Career Ladder

 

In Grab we have a process known as RFC (Request For Comments) which allows engineers to submit designs and ideas for a larger audience to debate. This is especially important given that our organisation is spread across several continents, with research and development offices in Southeast Asia, the US, India, and China. While any engineer is welcome to comment on these RFCs, it is the duty of lead and principal engineers to review them on a regular basis. This will help you expand your knowledge of existing systems and help others improve their designs.

Communication is a key skill that you need to keep improving and it is often the Achilles’ heel of many engineers who would rather be doing work in their corner without talking to anyone else. This is perfectly fine for a junior (or even some senior engineers) but it is critical for a principal engineer to communicate. Let’s break this down to a set of specific skills that you’d need to sharpen.

You need to be able to write effectively in order to convey your ideas to others. This includes knowing your audience and wording it in such a way that readers can understand. A technical design document whose audience are engineers is not written the same way as a design proposal whose audience are product and business managers.

You need to be able to publicly present and talk about various projects that you are working on. This includes creation of slide decks with good visuals and distilling down months of work to just a couple of slides. The best way of learning this is to get out there and keep presenting your work – you will get better over time.

You also need to be able to drive meetings and discussions without wasting anyone’s time. As a technical leader, one of your key responsibilities is to get people moving in the same direction and driving consensus during meetings.

Teaching and Learning

“You educate other engineers, both at an individual level and at scale: keeping the engineering community up to date on advanced technical issues, technologies, and trends. Examples include onboarding bootcamps for new hires, interns, specific skill-gap training development, and sharing specialized knowledge to raise the technical bar for other engineers/teams/dev centers.”

 

A principal engineer is a technical leader, and as a leader you have the responsibility to mentor and coach fellow engineers, regardless of their level. In addition to code reviews, you can organise office hours in your team and knowledge-sharing sessions where everyone can present something. You could also help with bootcamps and help new hires get up to speed.

Most importantly, you will also need to keep learning in whichever way works for you – reading journals, papers, and blog posts, watching video-recorded talks, attending conferences, and browsing through a variety of open-source projects. You will also learn from other Grabbers, as even a junior engineer can teach you something; we all have our strengths and weaknesses. Keep improving and working on yourself!

Evolving Regional Evacuation

Post Syndicated from Netflix Technology Blog original https://medium.com/netflix-techblog/evolving-regional-evacuation-69e6cc1d24c6?source=rss----2615bd06b42e---4

Niosha Behnam | Demand Engineering @ Netflix

At Netflix we prioritize innovation and velocity in pursuit of the best experience for our 150+ million global customers. This means that our microservices constantly evolve and change, but what doesn’t change is our responsibility to provide a highly available service that delivers 100+ million hours of daily streaming to our subscribers.

In order to achieve this level of availability, we leverage an N+1 architecture where we treat Amazon Web Services (AWS) regions as fault domains, allowing us to withstand single region failures. In the event of an isolated failure we first pre-scale microservices in the healthy regions after which we can shift traffic away from the failing one. This pre-scaling is necessary due to our use of autoscaling, which generally means that services are right-sized to handle their current demand, not the surge they would experience once we shift traffic.

Though this evacuation capability exists today, this level of resiliency wasn’t always the standard for the Netflix API. In 2013 we first developed our multi-regional availability strategy in response to a catalyst that led us to re-architect the way our service operates. Over the last 6 years Netflix has continued to grow and evolve along with our customer base, invalidating core assumptions built into the machinery that powers our ability to pre-scale microservices. Two such assumptions were that:

  • Regional demand for all microservices (i.e. requests, messages, connections, etc.) can be abstracted by our key performance indicator, stream starts per second (SPS).
  • Microservices within healthy regions can be scaled uniformly during an evacuation.

These assumptions simplified pre-scaling, allowing us to treat microservices uniformly, ignoring the uniqueness and regionality of demand. This approach worked well in 2013 due to the existence of monolithic services and a fairly uniform customer base, but became less effective as Netflix evolved.

Invalidated Assumptions

Regional Microservice Demand

Most of our microservices are in some way related to serving a stream, so SPS seemed like a reasonable stand-in to simplify regional microservice demand. This was especially true for large monolithic services. For example, player logging, authorization, licensing, and bookmarks were initially handled by a single monolithic service whose demand correlated highly with SPS. However, in order to improve developer velocity, operability, and reliability, the monolith was decomposed into smaller, purpose-built services with dissimilar function-specific demand.

Our edge gateway (zuul) also sharded by function to achieve similar wins. The graph below captures the demand for each shard, the combined demand, and SPS. Looking at the combined demand and SPS lines, SPS roughly approximates combined demand for a majority of the day. Looking at individual shards however, the amount of error introduced by using SPS as a demand proxy varies widely.

Time of Day vs. Normalized Demand by Zuul Shard

Uniform Evacuation Scaling

Since we used SPS as a demand proxy, it also seemed reasonable to assume that we can uniformly pre-scale all microservices in the healthy regions. In order to illustrate the shortcomings of this approach, let’s look at playback licensing (DRM) & authorization.

DRM is closely aligned with device type, such that Consumer Electronics (CE), Android, & iOS use different DRM platforms. In addition, the ratio of CE to mobile streaming differs regionally; for example, mobile is more popular in South America. So, if we evacuate South American traffic to North America, demand for CE and Android DRM won’t grow uniformly.

On the other hand, playback authorization is a function used by all devices prior to requesting a license. While it does have some device specific behavior, demand during an evacuation is more a function of the overall change in regional demand.

Closing The Gap

In order to address the issues with our previous approach, we needed to better characterize microservice-specific demand and how it changes when we evacuate. The former requires that we capture regional demand for microservices versus relying on SPS. The latter necessitates a better understanding of microservice demand by device type as well as how regional device demand changes during an evacuation.

Microservice-Specific Regional Demand

Because of service decomposition, we understood that using a proxy demand metric like SPS wasn’t tenable and we needed to transition to microservice-specific demand. Unfortunately, due to the diversity of services, a mix of Java (Governator/Springboot with Ribbon/gRPC, etc.) and Node (NodeQuark), there wasn’t a single demand metric we could rely on to cover all use cases. To address this, we built a system that allows us to associate each microservice with metrics that represent their demand.

The microservice metrics are configuration-driven and self-service, and they allow for scoping such that services can have different configurations across various shards and regions. Our system then queries Atlas, our time series telemetry platform, to gather the appropriate historical data.

Microservice Demand By Device Type

Since demand is impacted by regional device preferences, we needed to deconstruct microservice demand to expose the device-specific components. The approach we took was to partition a microservice’s regional demand by aggregated device types (CE, Android, PS4, etc.). Unfortunately, the existing metrics didn’t uniformly expose demand by device type, so we leveraged distributed tracing to expose the required details. Using this sampled trace data we can explain how a microservice’s regional device type demand changes over time. The graph below highlights how relative device demand can vary throughout the day for a microservice.

Regional Microservice Demand By Device Type

Device Type Demand

We can use historical device type traffic to understand how to scale the device-specific components of a service’s demand. For example, the graph below shows how CE traffic in us-east-1 changes when we evacuate us-west-2. The nominal and evacuation traffic lines are normalized such that 1 represents the max(nominal traffic) and the demand scaling ratio represents the relative change in demand during an evacuation (i.e. evacuation traffic/nominal traffic).

Nominal vs Evacuation CE Traffic in US-East-1

Microservice-Specific Demand Scaling Ratio

We can now combine microservice demand by device and device-specific evacuation scaling ratios to better represent the change in a microservice’s regional demand during an evacuation — i.e. the microservice’s device type weighted demand scaling ratio. To calculate this ratio (for a specific time of day) we take a service’s device type percentages, multiply by device type evacuation scaling ratios, producing each device type’s contribution to the service’s scaling ratio. Summing these components then yields a device type weighted evacuation scaling ratio for the microservice. To provide a concrete example, the table below shows the evacuation scaling ratio calculation for a fictional service.

Service Evacuation Scaling Ratio Calculation
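
To make the arithmetic concrete, here is a small sketch that mirrors the calculation in the table above, with made-up numbers (the device mix and ratios below are purely illustrative).

```python
# Hypothetical device-type mix for a fictional service (fractions sum to 1.0).
device_share = {"CE": 0.50, "Android": 0.30, "iOS": 0.15, "PS4": 0.05}

# Hypothetical evacuation scaling ratios per device type
# (evacuation traffic / nominal traffic in the healthy region).
device_scaling_ratio = {"CE": 1.4, "Android": 2.0, "iOS": 1.8, "PS4": 1.2}

# Device-type-weighted evacuation scaling ratio for the service:
# multiply each device type's share by its scaling ratio, then sum the contributions.
weighted_ratio = sum(share * device_scaling_ratio[d] for d, share in device_share.items())
print(f"Service evacuation scaling ratio: {weighted_ratio:.2f}")  # -> 1.63
```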

The graph below highlights the impact of using a microservice-specific evacuation scaling ratio versus the simplified SPS-based approach used previously. In the case of Service A, the old approach would have done well in approximating the ratio, but in the case of Service B and Service C, it would have resulted in over and under predicting demand, respectively.

Device Type Weighted vs. Previous Approach

What Now?

Understanding the uniqueness of demand across our microservices improved the quality of our predictions, leading to safer and more efficient evacuations at the cost of additional computational complexity. This new approach, however, is itself an approximation with its own set of assumptions. For example, it assumes all categories of traffic for a device type have a similar shape (Android logging and playback traffic, for instance). As Netflix grows, our assumptions will again be challenged and we will have to adapt to continue to provide our customers with the availability and reliability that they have come to expect.

If this article has piqued your interest and you have a passion for solving cross-discipline distributed systems problems, our small but growing team is hiring!



Ten Things Serverless Architects Should Know

Post Syndicated from Justin Pirtle original https://aws.amazon.com/blogs/architecture/ten-things-serverless-architects-should-know/

Building on the first three parts of the AWS Lambda scaling and best practices series where you learned how to design serverless apps for massive scale, AWS Lambda’s different invocation models, and best practices for developing with AWS Lambda, we now invite you to take your serverless knowledge to the next level by reviewing the following 10 topics to deepen your serverless skills.

1: API and Microservices Design

With the move to microservices-based architectures, decomposing monolithic applications and decoupling dependencies is more important than ever. Learn more about how to design and deploy your microservices with Amazon API Gateway:

Get hands-on experience building out a serverless API with API Gateway, AWS Lambda, and Amazon DynamoDB powering a serverless web application by completing the self-paced Wild Rydes web application workshop.

Figure 1: WildRydes serverless web application workshop
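
The workshop walks through this end to end; as a rough illustration of the pattern it teaches (API Gateway invoking a Lambda function that writes to DynamoDB), here is a minimal hedged sketch. The table name, fields, and event shape are hypothetical.

```python
import json
import os
import uuid
from datetime import datetime, timezone

import boto3

# Table name supplied via environment variable; "Rides" is a hypothetical default.
TABLE = boto3.resource("dynamodb").Table(os.environ.get("TABLE_NAME", "Rides"))


def handler(event, context):
    """Minimal Lambda handler behind an API Gateway proxy integration."""
    body = json.loads(event.get("body") or "{}")
    item = {
        "RideId": str(uuid.uuid4()),
        "User": body.get("user", "anonymous"),
        "RequestTime": datetime.now(timezone.utc).isoformat(),
    }
    TABLE.put_item(Item=item)
    return {
        "statusCode": 201,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"RideId": item["RideId"]}),
    }
```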

2: Event-driven Architectures and Asynchronous Messaging Patterns

When building event-driven architectures, whether you’re looking for simple queueing and message buffering or a more intricate event-based choreography pattern, it’s valuable to learn about the mechanisms to enable asynchronous messaging and integration. These are enabled primarily through the use of queues or streams as a message buffer and topics for pub/sub messaging. Understand when to use each and the unique advantages and features of all three:

Get hands-on experience building a real-time data processing application using Amazon Kinesis Data Streams and AWS Lambda by completing the self-paced Wild Rydes data processing workshop.

3: Workflow Orchestration in a Distributed, Microservices Environment

In distributed microservices architectures, you must design coordinated transactions in different ways than traditional database-based ACID transactions, which are typically implemented using a monolithic relational database. Instead, you must implement coordinated, sequenced invocations across services along with rollback and retry mechanisms. For workloads where significant orchestration logic is required and you want to use more of an orchestrator pattern than the event choreography pattern mentioned above, AWS Step Functions enables building complex workflows and distributed transactions through integration with a variety of AWS services, including AWS Lambda. Learn about the options you have to build your business workflows and keep orchestration logic out of your AWS Lambda code (an illustrative sketch of the underlying pattern follows Figure 2 below):

Get hands-on experience building an image processing workflow using computer vision AI services with Amazon Rekognition and AWS Step Functions to orchestrate all logic and steps with the self-paced Serverless image processing workflow workshop.

Figure 2: Several AWS Lambda functions managed by an AWS Step Functions state machine
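
Step Functions expresses workflows in Amazon States Language; purely to illustrate the underlying pattern described above (sequenced invocations with retries and compensating rollbacks), here is a hedged, plain-Python sketch with hypothetical step names rather than a real state machine definition.

```python
import time


def run_with_retry(step, payload, attempts=3, backoff_s=1.0):
    """Retry a step a few times before giving up (the equivalent of a Retry policy)."""
    for attempt in range(1, attempts + 1):
        try:
            return step(payload)
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(backoff_s * attempt)


def run_saga(payload, steps):
    """Run (step, compensation) pairs in order; on failure, roll back completed
    steps in reverse order (the equivalent of Catch plus compensation states)."""
    completed = []
    try:
        for step, compensation in steps:
            payload = run_with_retry(step, payload)
            completed.append((compensation, payload))
        return payload
    except Exception:
        for compensation, snapshot in reversed(completed):
            if compensation:
                compensation(snapshot)
        raise


# Hypothetical order workflow: each step would normally invoke a separate microservice.
steps = [
    (lambda p: {**p, "order_id": "o-123"}, lambda p: print("cancel order", p["order_id"])),
    (lambda p: {**p, "charged": True},     lambda p: print("refund payment for", p["order_id"])),
    (lambda p: {**p, "shipped": True},     None),
]
print(run_saga({"user": "alice"}, steps))
```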

4: Lambda Computing Environment and Programming Model

Though AWS Lambda is a service that is quick to get started, there is value in learning more about the AWS Lambda computing environment and how to take advantage of deeper performance and cost optimization strategies with the AWS Lambda runtime. Take your understanding and skills of AWS Lambda to the next level:

5: Serverless Deployment Automation and CI/CD Patterns

When dealing with a large number of microservices or smaller components (such as AWS Lambda functions all working together as part of a broader application), it's critical to integrate automation and code management into your application early on to efficiently create, deploy, and version your serverless architectures. AWS offers several first-party deployment tools and frameworks for serverless architectures, including the AWS Serverless Application Model (SAM), the AWS Cloud Development Kit (CDK), AWS Amplify, and AWS Chalice. Additionally, there are several third-party deployment tools and frameworks available, such as the Serverless Framework, Claudia.js, Sparta, or Zappa. You can also build your own homegrown framework. The important thing is to ensure your automation strategy works for your use case and team, and supports your planned data source integrations and development workflow. Learn more about the available options:

Learn how to build a full CI/CD pipeline and other DevOps deployment automation with the following workshops:

6: Serverless Identity Management, Authentication, and Authorization

Modern application developers need to plan for and integrate identity management into their applications while implementing robust authentication and authorization functionality. With Amazon Cognito, you can deploy serverless identity management and secure sign-up and sign-in directly into your applications. Beyond authentication, Amazon API Gateway also allows developers to granularly manage authorization logic at the gateway layer and authorize requests directly, before they ever reach backend services, using several types of native authorizers.

Learn more about the options and benefits of each:

Get hands-on experience working with Amazon Cognito, AWS Identity and Access Management (IAM), and Amazon API Gateway with the Serverless Identity Management, Authentication, and Authorization Workshop.

Figure 3: Serverless Identity Management, Authentication, and Authorization Workshop

7: End-to-End Security Techniques

Beyond identity and authentication/authorization, there are many other areas to secure in a serverless application. These include:

  • Input and request validation
  • Dependency and vulnerability management
  • Secure secrets storage and retrieval
  • IAM execution roles and invocation policies
  • Data encryption at-rest/in-transit
  • Metering and throttling access
  • Regulatory compliance concerns

Thankfully, there are AWS offerings and integrations for each of these areas. Learn more about the options and benefits of each:

Get hands-on experience adding end-to-end security with the techniques mentioned above into a serverless application with the Serverless Security Workshop.

8: Application Observability with Comprehensive Logging, Metrics, and Tracing

Before taking your application to production, it's critical that you ensure your application is fully observable, both at the microservice or component level and for the application as a whole, through comprehensive logging, metrics at various granularities, and tracing to understand distributed system performance and end-user experiences end-to-end. With many different components making up modern architectures, having centralized visibility into all of your key logs, metrics, and end-to-end traces will make it much easier to monitor and understand your end users' experiences. Learn more about the options for observability of your AWS serverless application:

9: Ensuring Your Application is Well-Architected

Adding onto the considerations mentioned above, we suggest architecting your applications more holistically, in line with the AWS Well-Architected Framework. This framework includes five key pillars: security, reliability, performance efficiency, cost optimization, and operational excellence. Additionally, there is a serverless-specific lens for the Well-Architected Framework, which looks more specifically at key serverless scenarios/use cases such as RESTful microservices, Alexa skills, mobile backends, stream processing, and web applications, and how they can implement best practices to be Well-Architected. More information:

10: Continuing your Learning as Serverless Computing Continues to Evolve

As we’ve discussed, there are many opportunities to dive deeper into serverless architectures in a variety of areas. Though the resources shared above should be helpful in familiarizing yourself with key concepts and techniques, there’s nothing better than continued learning from others over time as new advancements come out and patterns evolve.

Finally, we encourage you to check back often as we’ll be continuing further blog post series on serverless architectures, with the next series focusing on API design patterns and best practices.

About the author

Justin Pirtle is a specialist Solutions Architect at Amazon Web Services, focused on the Serverless platform. He's responsible for helping customers design, deploy, and scale serverless applications using services such as AWS Lambda, Amazon API Gateway, Amazon Cognito, and Amazon DynamoDB. He is a regular speaker at AWS conferences, including re:Invent, as well as other AWS events. Justin holds a bachelor's degree in Management Information Systems from the University of Texas at Austin and a master's degree in Software Engineering from Seattle University.

Integrating AWS X-Ray with AWS App Mesh

Post Syndicated from Ignacio Riesgo original https://aws.amazon.com/blogs/compute/integrating-aws-x-ray-with-aws-app-mesh/

This post is contributed by Lulu Zhao | Software Development Engineer II, AWS

 

AWS X-Ray helps developers and DevOps engineers quickly understand how an application and its underlying services are performing. When it’s integrated with AWS App Mesh, the combination makes for a powerful analytical tool.

X-Ray helps to identify and troubleshoot the root causes of errors and performance issues. It’s capable of analyzing and debugging distributed applications, including those based on a microservices architecture. It offers insights into the impact and reach of errors and performance problems.

In this post, I demonstrate how to integrate it with App Mesh.

Overview

App Mesh is a service mesh based on the Envoy proxy that makes it easy to monitor and control microservices. App Mesh standardizes how your microservices communicate, giving you end-to-end visibility and helping to ensure high application availability.

With App Mesh, it’s easy to maintain consistent visibility and network traffic control for services built across multiple types of compute infrastructure. App Mesh configures each service to export monitoring data and implements consistent communications control logic across your application.

A service mesh is like a communication layer for microservices. All communication between services happens through the mesh. Customers use App Mesh to configure a service mesh that contains virtual services, virtual nodes, virtual routers, and corresponding routes.

However, it’s challenging to visualize the way that request traffic flows through the service mesh while attempting to identify latency and other types of performance issues. This is particularly true as the number of microservices increases.

It’s in exactly this area where X-Ray excels. To show a detailed workflow inside a service mesh, I implemented a tracing extension called X-Ray tracer inside Envoy. With it, I ensure that I’m tracing all inbound and outbound calls that are routed through Envoy.

Traffic routing with color app

The following example shows how X-Ray works with App Mesh. I used the Color App, a simple demo application, to showcase traffic routing.

This app has two Go applications instrumented with the AWS X-Ray Go SDK: color-gateway and color-teller. The color-gateway application is exposed to external clients and responds to http://service-name:port/color, which retrieves the color from color-teller. I deployed the Color App using Amazon ECS. This image illustrates how color-gateway routes traffic into a virtual router and then into separate nodes using color-teller.

 

The following image shows client interactions with App Mesh in an X-Ray service map after requests have been made to the color-gateway and to color-teller.

Integration

There are two types of service nodes:

  • AWS::AppMesh::Proxy is generated by the X-Ray tracing extension inside Envoy.
  • AWS::ECS::Container is generated by the AWS X-Ray Go SDK.

The service graph arrows show the request workflow, which you may find helpful as you try to understand the relationships between services.

To send Envoy-generated segments into X-Ray, install the X-Ray daemon. The following code example shows the container definition used to add the daemon to the ECS task.

{
    "name": "xray-daemon",
    "image": "amazon/aws-xray-daemon",
    "user": "1337",
    "essential": true,
    "cpu": 32,
    "memoryReservation": 256,
    "portMappings": [
        {
            "hostPort": 2000,
            "containerPort": 2000,
            "protocol": "udp"
        }
    ]
}

After the Color app successfully launched, I made a request to color-gateway to fetch a color.

  • First, the Envoy proxy appmesh/colorgateway-vn in front of default-gateway received the request and routed it to the server default-gateway.
  • Then, default-gateway made a request to server default-colorteller-white to retrieve the color.
  • Instead of directly calling the color-teller server, the request went to the default-gateway Envoy proxy and the proxy routed the call to color-teller.

That’s the advantage of using the Envoy proxy. Envoy is a self-contained process that is designed to run in parallel with all application servers. All of the Envoy proxies form a transparent communication mesh through which each application sends and receives messages to and from localhost while remaining unaware of the broader network topology.

For App Mesh integration, the X-Ray tracer records the mesh name and virtual node name values and injects them into the segment JSON document. Here is an example:

"aws": {
    "app_mesh": {
        "mesh_name": "appmesh",
        "virtual_node_name": "colorgateway-vn"
    }
},

To enable X-Ray tracing through App Mesh inside Envoy, you must set two environment variable configurations:

  • ENABLE_ENVOY_XRAY_TRACING
  • XRAY_DAEMON_PORT

The first one enables X-Ray tracing using 127.0.0.1:2000 as the default daemon endpoint to which generated segments are sent. If the daemon you installed listens on a different port, you can specify a port value to override the default X-Ray daemon port by using the second configuration.
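As a hedged illustration (the family, container name, image reference, and memory value are placeholders, and the other App Mesh settings the Envoy container needs are omitted), those variables would be set on the Envoy container in the ECS task definition, for example:

# Sketch: set the X-Ray tracing variables on the Envoy sidecar container
aws ecs register-task-definition \
  --family colorgateway \
  --container-definitions '[
    {
      "name": "envoy",
      "image": "<app-mesh-envoy-image>",
      "essential": true,
      "memoryReservation": 256,
      "environment": [
        { "name": "ENABLE_ENVOY_XRAY_TRACING", "value": "1" },
        { "name": "XRAY_DAEMON_PORT", "value": "2000" }
      ]
    }
  ]'

XRAY_DAEMON_PORT is only needed when the daemon listens on a port other than the default 2000.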

Conclusion

Currently, AWS X-Ray supports SDKs written in multiple languages (including Java, Python, Go, .NET and .NET Core, Node.js, and Ruby) to help you implement your services. For more information, see Getting Started with AWS X-Ray.

New: Application integration with AWS Cloud Map for service discovery

Post Syndicated from AWS Admin original https://aws.amazon.com/blogs/architecture/new-application-integration-with-aws-cloud-map-for-service-discovery/

By: Alexandr Moroz, Sr. Product Manager, Amazon Route 53; Madhuri Peri, Sr. IoT Architect, AWS Professional Services; Aaron Molitor, Sr. Infrastructure Architect, AWS Professional Services; and Sarma Palli, Sr. DevOps Architect, AWS Professional Services

AWS Cloud Map enables you to map your cloud. You can define friendly names for any resource, such as Amazon S3 buckets, Amazon DynamoDB tables, Amazon SQS queues, or custom cloud services built on Amazon EC2, Amazon ECS, Amazon EKS, or AWS Lambda. Your applications can then discover resource location and metadata by friendly name using the AWS SDK and authenticated API queries. Resources can be further filtered and discovered by custom attributes such as deployment stage or version.

What’s new with API service discovery

If you want an enterprise application component, such as a database hosted on Amazon EC2 instances, to be reachable as a named database service, you register the instances' IP addresses with AWS Cloud Map. You can also register additional metadata attributes, such as INSTANCE_STATUS, and then use that attribute to identify when the service is READY, so that querying applications attempt a connection only when they see a READY status in AWS Cloud Map. In cases where different microservices or enterprise applications have endpoints that have to be discovered, you can use AWS Cloud Map to register those as well. Examples of such endpoints include the ELB load balancer types, including ELB Classic, Application Load Balancers (ALB), and Network Load Balancers (NLB) with Auto Scaling groups.
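As a sketch (the service ID, instance ID, and IP address are placeholders; AWS_INSTANCE_IPV4 is the standard IP attribute, while INSTANCE_STATUS is a custom attribute used here for illustration), registering such an endpoint could look like this:

# Register an EC2-hosted endpoint with a custom readiness attribute
aws servicediscovery register-instance \
  --service-id srv-xxxxxxxxxxxx \
  --instance-id i-0123456789abcdef0 \
  --attributes AWS_INSTANCE_IPV4=10.0.1.25,INSTANCE_STATUS=READY

# Re-registering the same instance ID with updated attributes changes its status
aws servicediscovery register-instance \
  --service-id srv-xxxxxxxxxxxx \
  --instance-id i-0123456789abcdef0 \
  --attributes AWS_INSTANCE_IPV4=10.0.1.25,INSTANCE_STATUS=DRAINING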

Compute stack choices

Modern application architectures require a way to expose and advertise service endpoints, register and de-register those endpoints, and query them. Because applications are expected to manage their own dependencies, a service registry becomes critical.

These microservices can follow different architecture patterns, such as:

  1. Traditional workloads running on Amazon EC2 fronted by Auto Scaling groups or an ELB load balancer such as ELB Classic, Application Load Balancer, or Network Load Balancer.
  2. Amazon API Gateway and AWS Lambda for event-driven workflows.
  3. Container-based workloads on Amazon Elastic Container Service (ECS) using EC2 or Fargate launch types and Amazon Elastic Container Service for Kubernetes (EKS) for workloads that run as services (long-running) or daemons or run to completion (Batch / cron type).

This image shows a typical enterprise application composed of components that run different architectures. There is a web server running on Amazon EKS, a backend on Amazon ECS, a serverless event registration service, and payments running on EC2 Auto Scaling groups (ASG) while leveraging databases on Amazon Relational Database Service (RDS).

 

From a service discovery perspective, this is how the applications would want to be discovered and queried:

Let’s see how we can register each of these microservices (which are running on different cloud compute products) with AWS Cloud Map using both DNS-based and API-based service discovery and leveraging attributes for discovery when components are ready for traffic.

Microservice endpoints and discovery

AWS Cloud Map is a managed solution that lets you map logical names to the components/resources for an application. It allows applications to discover the resources using one of the AWS SDKs, RESTful API calls, or DNS queries. AWS Cloud Map serves registered resources, which can be Amazon DynamoDB tables, Amazon Simple Queue Service (SQS) queues, or higher-level application services built on EC2 instances, ECS tasks, or a serverless stack.

When you register a resource, you can specify attributes and clients that can use the attributes to filter which resources are to be returned. For example, an application can request resources in a particular deployment stage, like Gamma or Prod. Additionally, you can choose to enable health checking for your IP-based resources, ensuring that AWS Cloud Map returns only healthy endpoints. Each API call is authenticated, and developers can control access to service locations and configuration using AWS Identity and Access Management (IAM).  This ensures that clients always discover the services that they’re authorized to use.

Let’s cover fundamentals

There are two aspects to service discovery:

  • The microservices themselves that register/de-register
  • Other microservices that discover and query those microservices

To register a microservice, follow these steps:

  1. Create a namespace.
  2. Create a service.
  3. Register instances with the service.

Steps 1 and 2 are performed once for each service. Create a utility function for registering and de-registering a microservice; this function can be invoked for any microservice regardless of the compute stack choice and deployed through your CI/CD/DevOps processes.

Step 3 is an ongoing operation that has to be updated each time the underlying EC2 compute that powers it changes. Examples include: EC2 Amazon Machine Image (AMI) changes, code changes for the service, and version changes.

Creating a namespace

A namespace is a logical group of services that share the same domain name, such as example1.example.com or example2.example.com. If you want these namespaces to be queried only from within your VPC, opt for a private namespace. If you want them to be accessible over the Internet, create a public namespace. In our example, example1 could be a public namespace, while a tracker/reporting service that counts usage of items in example1 could remain an internal service in a private namespace.

Microservice using DNS-based service discovery:

Microservice using API-based service discovery:
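The namespace type is what separates the two cases above. As a sketch (the namespace names and VPC ID are placeholders), the namespaces themselves could be created like this:

# Private DNS namespace: queryable via DNS and the API from within your VPC
aws servicediscovery create-private-dns-namespace \
  --name example1.example.com \
  --vpc vpc-0abc123def4567890

# HTTP namespace: queryable via the AWS Cloud Map API only (no DNS records)
aws servicediscovery create-http-namespace \
  --name example1

# (For Internet-facing DNS discovery, create-public-dns-namespace is the equivalent call.)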

Creating a service

When you create a service, AWS Cloud Map creates a record in the hosted zone; the record name is a combination of the service name and the namespace name. You can optionally define a health check for the service, too.

If the service you are creating is meant for DNS-based discovery using one of the A, AAAA, or SRV records, then you can create your service using the following syntax. Examples of this could be your application code running on an EC2 instance or as a container (ECS/EKS).

For services that are meant to be used only in an API-only namespace, the API call would look like this:
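For illustration only (the service names, namespace IDs, TTL, and health-check settings are placeholders), the two variants could look roughly like this:

# DNS-discoverable service: A record with a 60-second TTL and a custom health check
aws servicediscovery create-service \
  --name payments \
  --namespace-id ns-xxxxxxxxxxxx \
  --dns-config "NamespaceId=ns-xxxxxxxxxxxx,RoutingPolicy=MULTIVALUE,DnsRecords=[{Type=A,TTL=60}]" \
  --health-check-custom-config FailureThreshold=1

# API-only service in an HTTP namespace: no DNS configuration needed
aws servicediscovery create-service \
  --name events \
  --namespace-id ns-yyyyyyyyyyyy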

Register the compute backend with the service

Container IP address register/de-register

Amazon ECS is tightly integrated with AWS Cloud Map to enable service discovery for compute workloads running in ECS. When you enable service discovery for ECS services, it automatically keeps track of all task instances in AWS Cloud Map. Your applications can then discover them using DNS queries and AWS Cloud Map DiscoverInstances API calls. The ECS control plane registers the IP address of each task (and its containers) with AWS Cloud Map.

When the task goes away – either because a newer version has been deployed or there is a crash or a restart – the ECS control plane handles the de-registration process as well.

If you are using ECS for running containers, this is done seamlessly with ECS and AWS Cloud Map API integration.
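For example (the cluster, task definition, and registry ARN are placeholders, and the network configuration is omitted for brevity), enabling this when creating an ECS service is essentially one extra parameter:

# Attach the ECS service to an existing AWS Cloud Map service registry
aws ecs create-service \
  --cluster production \
  --service-name payments \
  --task-definition payments:1 \
  --desired-count 2 \
  --service-registries "registryArn=arn:aws:servicediscovery:us-east-1:123456789012:service/srv-xxxxxxxxxxxx"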

API Gateway URL and AWS Lambda

When you create a microservice with an API namespace, you could use any attributes you prefer, without providing the IP/port information.
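A hypothetical example (the service ID, instance ID, and attribute names such as url are illustrative): registering an API Gateway endpoint in an API-only namespace using nothing but custom attributes:

# No IP or port: the endpoint is described entirely by custom attributes
aws servicediscovery register-instance \
  --service-id srv-zzzzzzzzzzzz \
  --instance-id event-registration-api \
  --attributes url=https://abc123.execute-api.us-east-1.amazonaws.com/prod,stage=prod,version=1.0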

EC2 instance IP address registration and de-registration

As the EC2 instances are coming online, the userdata section of the bootstrap configuration will issue commands to register the EC2 instance’s IP address with the service. An alternate approach would be to run a Lambda function that runs against a microservice’s Auto Scaling group, lists the IP addresses, and registers the instance against the service.

If the EC2 instance is part of an Auto Scaling group, lifecycle hooks can also be used to run the de-registration scripts. Another approach is a Lambda function that runs periodically against the Auto Scaling group, or that fires on Auto Scaling group notification events.
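A minimal sketch of such a user data registration script (the Region and service ID are placeholders; the de-registration call shown in the comment would run from a lifecycle hook or Lambda function on scale-in):

#!/bin/bash
# Register this instance's private IP with AWS Cloud Map at boot time
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
PRIVATE_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)

aws servicediscovery register-instance \
  --region us-east-1 \
  --service-id srv-xxxxxxxxxxxx \
  --instance-id "$INSTANCE_ID" \
  --attributes AWS_INSTANCE_IPV4="$PRIVATE_IP"

# On scale-in, the matching call is:
#   aws servicediscovery deregister-instance --service-id srv-xxxxxxxxxxxx --instance-id "$INSTANCE_ID"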

Query/Discovery

AWS Cloud Map now supports both DNS-based and API-based service discovery. Supported DNS record types are A, AAAA, SRV, and CNAME.

It is typical in a microservices architecture for a service to be able to discover other services. We recommend that you query only by name and/or endpoint, and do not use the IP address of the compute stack (AWS Lambda / container/ EC2) that is backing the service.

The ListServices and GetService API operations provide information about which services are available and their corresponding details.

DNS clients also cache responses, so make sure that you handle caching settings. AWS Cloud Map uses regional endpoints here. Any A records created use either a WEIGHTED response or MULTIVALUE answer policy. If you are using a Java-based compute stack, you might not want to choose DNS-based service discovery because the JVM caches DNS name lookups: when the JVM resolves a hostname to an IP address, it caches the IP address for a specified period of time (the TTL). In such cases, you can use API-based service discovery and leverage the same approach as your other microservices that use AWS Cloud Map.

DiscoverInstances API

The DiscoverInstances API discovers registered instances for a specified namespace and service using regional endpoints. Updates to your services, such as new instances registered or existing instances removed, are available in the API faster than via DNS. The API also lets you decorate resources with additional metadata (service attributes) that can be used during discovery; for example, you can request only the instances tagged blue or green, or filter on other application attributes. These attributes can complement health checks while performing discovery (such as finding out whether an instance is ready or not).
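For example (the namespace, service, and deployment attribute are illustrative), a caller could ask only for healthy instances registered as the blue deployment:

aws servicediscovery discover-instances \
  --namespace-name example1.example.com \
  --service-name payments \
  --query-parameters deployment=blue \
  --health-status HEALTHY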

Here is a screenshot that shows how the registered ECS task instances appear in the AWS Cloud Map console:

The idea is that as the container or EC2 instance comes online or goes offline, it needs to issue a call to the AWS Cloud Map API to register or de-register the compute IP address.

Get started by visiting the AWS Cloud Map page. To learn more, take a look at the demo code in the GitHub repo here. If your compute workloads use EKS, please refer to this blog post that shows how to make EKS automatically publish all services in AWS Cloud Map.

Introducing AWS App Mesh – service mesh for microservices on AWS

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/introducing-aws-app-mesh-service-mesh-for-microservices-on-aws/

AWS App Mesh is a service mesh that allows you to easily monitor and control communications across microservices applications on AWS. You can use App Mesh with microservices running on Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Container Service for Kubernetes (Amazon EKS), and Kubernetes running on Amazon EC2.

Today, App Mesh is available as a public preview. In the coming months, we plan to add new functionality and integrations.

Why App Mesh?

Many of our customers are building applications with microservices architectures, breaking applications into many separate, smaller pieces of software that are independently deployed and operated. Microservices help to increase the availability and scalability of an application by allowing each component to scale independently based on demand. Each microservice interacts with the other microservices through an API.

When you start building more than a few microservices within an application, it becomes difficult to identify and isolate issues. These can include high latencies, error rates, or error codes across the application. There is no dynamic way to route network traffic when there are failures or when new containers need to be deployed.

You can address these problems by adding custom code and libraries into each microservice and using open source tools that manage communications for each microservice. However, these solutions can be hard to install, difficult to update across teams, and complex to manage for availability and resiliency.

AWS App Mesh implements a new architectural pattern that helps solve many of these challenges and provides a consistent, dynamic way to manage the communications between microservices. With App Mesh, the logic for monitoring and controlling communications between microservices is implemented as a proxy that runs alongside each microservice, instead of being built into the microservice code. The proxy handles all of the network traffic into and out of the microservice and provides consistency for visibility, traffic control, and security capabilities to all of your microservices.

Use App Mesh to model how all of your microservices connect. App Mesh automatically computes and sends the appropriate configuration information to each microservice proxy. This gives you standardized, easy-to-use visibility and traffic controls across your entire application.  App Mesh uses Envoy, an open source proxy. That makes it compatible with a wide range of AWS partner and open source tools for monitoring microservices.

Using App Mesh, you can export observability data to multiple AWS and third-party tools, including Amazon CloudWatch, AWS X-Ray, or any third-party monitoring and tracing tool that integrates with Envoy. You can configure new traffic routing controls to enable dynamic blue/green canary deployments for your services.

Getting started

Here’s a sample application with two services, where service A receives traffic from the internet and uses service B for some backend processing. You want to route traffic dynamically between services B and B’, a new version of B deployed to act as the canary.

First, create a mesh, a namespace that groups related microservices that must interact.

Next, create virtual nodes to represent services in the mesh. A virtual node can represent a microservice or a specific microservice version. In this example, service A and B participate in the mesh and you manage the traffic to service B using App Mesh.

Now, deploy your services with the required Envoy proxy and with a mapping to the node in the mesh.

After you have defined your virtual nodes, you can define how the traffic flows between your microservices. To do this, define a virtual router and routes for communications between microservices.

A virtual router handles traffic for your microservices. After you create a virtual router, you create routes to direct traffic appropriately. These routes include the connection requests that the route should accept, where they should go, and the weighted amount of traffic to send. All of these changes to adjust traffic between services are computed and sent dynamically to the appropriate proxies by App Mesh to execute your deployment.

You now have a virtual router set up that accepts all traffic from virtual node A, sending it to the existing version of service B as well as some traffic to the new version, B'.
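At the API level, that weighted routing could be expressed roughly as follows (the mesh, router, route, and virtual node names, plus the 90/10 split, are illustrative; the preview API may differ in detail, and the virtual nodes are assumed to already exist):

aws appmesh create-mesh --mesh-name demo-mesh

aws appmesh create-virtual-router \
  --mesh-name demo-mesh \
  --virtual-router-name serviceB-router \
  --spec '{ "listeners": [ { "portMapping": { "port": 80, "protocol": "http" } } ] }'

# Send 90% of traffic to serviceB and 10% to the canary, serviceB-prime
aws appmesh create-route \
  --mesh-name demo-mesh \
  --virtual-router-name serviceB-router \
  --route-name serviceB-route \
  --spec '{
    "httpRoute": {
      "match": { "prefix": "/" },
      "action": {
        "weightedTargets": [
          { "virtualNode": "serviceB", "weight": 9 },
          { "virtualNode": "serviceB-prime", "weight": 1 }
        ]
      }
    }
  }'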

Exporting metrics, logs, and traces

One of the benefits of placing a proxy in front of every microservice is that you can automatically capture metrics, logs, and traces about the communication between your services. App Mesh enables you to easily collect and export this data to the tools of your choice. Envoy is already integrated with several tools like Prometheus and Datadog.

During the preview, we are adding support for AWS services such as Amazon CloudWatch and AWS X-Ray. We have a lot more integrations planned as well.

Available now

AWS App Mesh is available as a public preview and you can start using it today in the N. Virginia, Ohio, Oregon, and Ireland AWS Regions. During the preview, we plan to add new features and want to hear your feedback. You can check out our GitHub repository for examples and our roadmap.

— Nate

Building, deploying, and operating containerized applications with AWS Fargate

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/building-deploying-and-operating-containerized-applications-with-aws-fargate/

This post was contributed by Jason Umiker, AWS Solutions Architect.

Whether it’s helping facilitate a journey to microservices or deploying existing tools more easily and repeatably, many customers are moving toward containerized infrastructure and workflows. AWS provides many of the services and mechanisms to help you with that.

In this post, I show you how to use Amazon ECS and AWS Fargate, as well as AWS CodeBuild and AWS CodePipeline, for an end-to-end CI/CD container solution.

What is Amazon ECS?

Amazon Elastic Container Service (ECS) helps schedule and orchestrate containers across a fleet of servers. It involves installing an agent on each container host that takes instructions from the ECS control plane and relays them to the local Docker daemon on each one. ECS makes this easy by providing an ECS-optimized Amazon Machine Image (AMI) that you can launch automatically from the ECS console or CLI, or use to launch container hosts yourself.

It is up to you to choose the appropriate instance types, sizes, and quantity for your cluster fleet. You should have the capacity to deploy and scale workloads as well as to spread them across enough failure domains for high availability. Features like Auto Scaling groups help with that.

Also, while AWS provides Amazon Linux and Windows AMIs pre-configured for ECS, you are responsible for ongoing maintenance of the OS, which includes patching and security. Items that require regular patching or updating in this model are the OS, Docker, the ECS agent, and of course the contents of the container images.

Two of the key ECS concepts are Tasks and Services. A task is one or more containers that are to be scheduled together by ECS. A service is like an Auto Scaling group for tasks. It defines the quantity of tasks to run across the cluster, where they should be running (for example, across multiple Availability Zones), automatically associates them with a load balancer, and horizontally scales based on metrics that you define, like CPU or memory utilization.

What is Fargate?

AWS Fargate is a new compute engine for Amazon ECS that runs containers without requiring you to deploy or manage the underlying Amazon EC2 instances. With Fargate, you specify an image to deploy and the amount of CPU and memory it requires. Fargate handles the updating and securing of the underlying Linux OS, Docker daemon, and ECS agent as well as all the infrastructure capacity management and scaling.

How to use Fargate?

Fargate is exposed as a launch type for ECS. It uses an ECS task and service definition that is similar to the traditional EC2 launch mode, with a few minor differences. It is easy to move tasks and services back and forth between launch types. The differences include:

  • Using the awsvpc network mode
  • Specifying the CPU and memory requirements for the task in the definition

The best way to learn how to use Fargate is to walk through the process and see it in action.
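If you prefer the CLI, here is a hedged sketch of those two differences in practice (the family, image, subnet, security group, and role ARN are placeholders); the console walkthrough below covers the same ground step by step:

# Fargate task definitions require awsvpc networking and task-level CPU/memory
aws ecs register-task-definition \
  --family nginx-fargate \
  --requires-compatibilities FARGATE \
  --network-mode awsvpc \
  --cpu 256 \
  --memory 512 \
  --execution-role-arn arn:aws:iam::123456789012:role/ecsTaskExecutionRole \
  --container-definitions '[{ "name": "nginx", "image": "nginx:1.13.9-alpine", "essential": true, "portMappings": [{ "containerPort": 80, "protocol": "tcp" }] }]'

# The service then selects the FARGATE launch type and supplies the awsvpc network settings
aws ecs create-service \
  --cluster Fargate \
  --service-name NGINX \
  --task-definition nginx-fargate \
  --desired-count 1 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-0abc1234],securityGroups=[sg-0abc1234],assignPublicIp=ENABLED}"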

Walkthrough: Deploying a service with Fargate in the console

At the time of publication, Fargate for ECS is available in the N. Virginia, Ohio, Oregon, and Ireland AWS regions. This walkthrough works in any AWS region where Fargate is available.

If you'd prefer to use a CloudFormation template, this one covers Steps 1-4. After launching this template, you can skip ahead to Explore the running service after Step 4.

Step 1 – Create an ECS cluster

An ECS cluster is a logical construct for running groups of containers known as tasks. Clusters can also be used to segregate different environments or teams from each other. In the traditional EC2 launch mode, there are specific EC2 instances associated with and managed by each ECS cluster, but this is transparent to the customer with Fargate.

  1. Open the ECS console and ensure that Fargate is available in the selected Region (for example, N. Virginia).
  2. Choose Clusters, Create Cluster.
  3. Choose Networking only, Next step.
  4. For Cluster name, enter “Fargate”. If you don’t already have a VPC to use, select the Create VPC check box and accept the defaults as well. Choose Create.

Step 2 – Create a task definition, CloudWatch log group, and task execution role

A task is a collection of one or more containers that is the smallest deployable unit of your application. A task definition is a JSON document that serves as the blueprint for ECS to know how to deploy and run your tasks.

The console makes it easier to create this definition by exposing all the parameters graphically. In addition, the console creates two dependencies:

  • The Amazon CloudWatch log group to store the aggregated logs from the task
  • The task execution IAM role that gives Fargate the permissions to run the task
  1. In the left navigation pane, choose Task Definitions, Create new task definition.
  2. Under Select launch type compatibility, choose FARGATE, Next step.
  3. For Task Definition Name, enter NGINX.
  4. If you had an IAM role for your task, you would enter it in Task Role but you don’t need one for this example.
  5. The Network Mode is automatically set to awsvpc for Fargate
  6. Under Task size, for Task memory, choose 0.5 GB. For Task CPU, enter 0.25.
  7. Choose Add container.
  8. For Container name, enter NGINX.
  9. For Image, put nginx:1.13.9-alpine.
  10. For Port mappings type 80 into Container port.
  11. Choose Add, Create.

Step 3 – Create an Application Load Balancer

Sending incoming traffic through a load balancer is often a key piece of making an application both scalable and highly available. It can balance the traffic between multiple tasks, as well as ensure that traffic is only sent to healthy tasks. You can have the service manage the addition or removal of tasks from an Application Load Balancer as they come and go but that must be specified when the service is created. It’s a dependency that you create first.

  1. Open the EC2 console.
  2. In the left navigation pane, choose Load Balancers, Create Load Balancer.
  3. Under Application Load Balancer, choose Create.
  4. For Name, put NGINX.
  5. Choose the appropriate VPC (10.0.0.0/16 if you let ECS create it for you).
  6. For Availability Zones, select both and choose Next: Configure Security Settings.
  7. Choose Next: Configure Security Groups.
  8. For Assign a security group, choose Create a new security group. Choose Next: Configure Routing.
  9. For Name, enter NGINX. For Target type, choose ip.
  10. Choose Next: Register Targets, Next: Review, Create.
  11. Select the new load balancer and note its DNS name (this is the public address for the service).

Step 4 – Create an ECS service using Fargate

A service in ECS using Fargate serves a similar purpose to an Auto Scaling group in EC2. It ensures that the needed number of tasks are running both for scaling as well as spreading the tasks over multiple Availability Zones for high availability. A service creates and destroys tasks as part of its role and can optionally add or remove them from an Application Load Balancer as targets as it does so.

  1. Open the ECS console and ensure that Fargate is available in the selected Region (for example, N. Virginia).
  2. In the left navigation pane, choose Task Definitions.
  3. Select the NGINX task definition that you created and choose Actions, Create Service.
  4. For Launch Type, select Fargate.
  5. For Service name, enter NGINX.
  6. For Number of tasks, enter 1.
  7. Choose Next step.
  8. Under Subnets, choose both of the options.
  9. For Load balancer type, choose Application Load Balancer. It should then default to the NGINX version that you created earlier.
  10. Choose Add to load balancer.
  11. For Target group name, choose NGINX.
  12. Under DNS records for service discovery, for TTL, enter 60.
  13. Click Next step, Next step, and Create Service.

Explore the running service

At this point, you have a running NGINX service using Fargate. You can now explore what you have running and how it works. You can also ask it to scale up to two tasks across two Availability Zones in the console.

Go into the service and see details about the associated load balancer, tasks, events, metrics, and logs:

Scale the service from one task to multiple tasks:

  • Choose Update.
  • For Number of tasks, enter 2.
  • Choose Next step, Next step, Next step then Update Service.
  • Watch the event that is logged and the new additional task both appear.

On the service Details tab, open the NGINX Target Group Name link and see the registered IP targets spread across the two zones.

Go to the DNS name for the Application Load Balancer in your browser and see the default NGINX page. Get the value from the Load Balancers dashboard in the EC2 console.

Walkthrough: Adding a CI/CD pipeline to your service

Now, I’m going to show you how to set up a CI/CD pipeline around this service. It watches a GitHub repo for changes and rebuilds the container with CodeBuild based on the buildspec.yml file and Dockerfile in the repo. If that build is successful, it then updates your Fargate service to deploy the new image.

If you’d prefer to use a CloudFormation Template, this one covers the creation of the dependencies so that the console will pre-fill these (CodeBuild Project and IAM Roles) during the creation of the CodePipeline in the steps below.

Step 1 – Create an ECR repository for the rebuilt container image

An ECR repository is a place to store your container images in a secure and reliable manner. Scaling and self-healing of Fargate tasks requires these images to be always available to be pulled when required. This is an important part of a container platform.

  1. Open the ECS console and ensure that Fargate is available in the selected Region (for example, N. Virginia).
  2. In the left navigation pane, under Amazon ECR, choose Repositories, Get started.
  3. For Repository name, put NGINX and choose Next step.

Step 2 – Fork the nginx-codebuild example into your own GitHub account

I have created an example project that takes the Dockerfile and config files for the official NGINX Docker Hub image and adds a buildspec.yml file to tell CodeBuild how to build the container and push it to your new ECR registry on completion. You can fork it into your own GitHub account for this CI/CD demo.
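As a rough sketch of the kind of commands such a buildspec.yml runs (the account ID, Region, repository URI, and output file name are illustrative; the fork's actual buildspec is authoritative):

# Log in to ECR, then build and push the image
$(aws ecr get-login --no-include-email --region us-east-1)
REPOSITORY_URI=123456789012.dkr.ecr.us-east-1.amazonaws.com/nginx
docker build -t "$REPOSITORY_URI:latest" .
docker push "$REPOSITORY_URI:latest"

# Emit the artifact that the CodePipeline ECS deploy action consumes
printf '[{"name":"NGINX","imageUri":"%s"}]' "$REPOSITORY_URI:latest" > images.json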

  1. Go to https://github.com/jasonumiker/nginx-codebuild.
  2. In the upper right corner, choose Fork.

Step 3 – Create the pipeline and associated IAM roles

You have two complementary AWS services for building a CI/CD pipeline for your containers. CodeBuild executes the build jobs and CodePipeline kicks off those builds when it notices that the source GitHub or CodeCommit repo changes. If successful, CodePipeline then deploys the new container image to Fargate.

The CodePipeline console can create the associated CodeBuild project, in addition to other dependencies such as the required IAM roles.

  1. Open the CodePipeline console and ensure that Fargate is available in the selected Region (for example, N. Virginia).
  2. Choose Get started.
  3. For Pipeline name, enter NGINX and choose Next step.
  4. For Source provider, choose GitHub.
  5. Choose Connect to GitHub and log in.
    • For Repository, choose your forked nginx-codebuild repo. For Branch, enter master. Choose Next step.
  6. For Build provider, enter AWS CodeBuild.
  7. Select Create a new build project.
  8. For Project name, enter NGINX.
  9. For Operating system, choose Ubuntu. For Runtime, choose Docker. For Version, select the latest version.
  10. Expand Advanced and set the following environment variables:
    • AWS_ACCOUNT_ID with a value of the account number
    • IMAGE_REPO_NAME with a value of NGINX (or whatever ECR name that you used)
  11. Choose Save build project, Next step.
  12. For Deployment provider, choose Amazon ECS.
  13. For Cluster name, enter Fargate.
  14. For Service name, choose NGINX.
  15. For Image filename, enter images.json.
  16. Choose Next step.
  17. Choose Create role, Allow, Next step, and then choose Create pipeline.
  18. Open the IAM console and ensure that Fargate is available in the selected Region (for example, N. Virginia).
  19. In the left navigation pane, choose Roles.
  20. Choose the code-build-nginx-service-role that was just created and choose Attach policy.
  21. For Policy type, choose AmazonEC2ContainerRegistryPowerUser and choose Attach policy.

Step 4 – Start the pipeline

You now have CodePipeline watching the GitHub repo for changes. It kicks off a CodeBuild build job on a change and, if the build is successful, creates a new deployment of the Fargate service with the new image.

Make a change to the source repo (even just adding a new dummy file) and then commit it and push it to master on your GitHub fork. This automatically kicks off the pipeline to build and deploy the change.

Conclusion

As you’ve seen, Fargate is fast and easy to set up, integrates well with the rest of the AWS platform, and saves you from much of the heavy lifting of running containers reliably at scale.

While it is useful to go through creating things in the console to understand them better, we suggest automating them with infrastructure-as-code patterns, via tools such as AWS CloudFormation, to ensure that they are repeatable and that any changes can be managed. There are some example templates to help you get started in this post.

In addition, adding things like unit and integration testing, blue/green deployments, and/or manual approval gates into CodePipeline is often a good idea before deploying patterns like this to production in many organizations. Some additional examples to look at next include:

Solving Complex Ordering Challenges with Amazon SQS FIFO Queues

Post Syndicated from Christie Gifrin original https://aws.amazon.com/blogs/compute/solving-complex-ordering-challenges-with-amazon-sqs-fifo-queues/

Contributed by Shea Lutton, AWS Cloud Infrastructure Architect

Amazon Simple Queue Service (Amazon SQS) is a fully managed queuing service that helps decouple applications, distributed systems, and microservices to increase fault tolerance. SQS queues come in two distinct types:

  • Standard SQS queues are able to scale to enormous throughput with at-least-once delivery.
  • FIFO queues are designed to guarantee that messages are processed exactly once in the exact order that they are received and have a default rate of 300 transactions per second.

As customers explore SQS FIFO queues, they often have questions about how the behavior works when messages arrive and are consumed. This post walks through some common situations to identify the exact behavior that you can expect. It also covers the behavior of message groups in depth and explains why message groups are key to understanding how FIFO queues work.

The simple case

Suppose that you run a major auction platform where people buy and sell a wide range of products. Your platform requires that transactions from buyers and sellers get processed in exactly the order received. Here’s how a FIFO queue helps you keep all your transactions in one straight flow.

A seller currently is holding an auction for a laptop, and three different bids are received for the same price. Ties are awarded to the first bidder at that price so it is important to track which arrived first. Your auction platform receives the three bids and sends them to a FIFO queue before they are processed.

Now observe how messages leave the queue. When your consumer asks for a batch of up to 10 messages, SQS starts filling the batch with the oldest message (bid A1). It keeps filling until either the batch is full or the queue is empty. In this case, the batch contains the three messages and the queue is now empty. After a batch has left the queue, SQS considers that batch of messages to be “in-flight” until the consumer either deletes them or the batch’s visibility timer expires.

 

When you have a single consumer, this is easy to envision. The consumer gets a batch of messages (now in-flight), does its processing, and deletes the messages. That consumer is then ready to ask for the next batch of messages.

The critical thing to keep in mind is that SQS won’t release the next batch of messages until the first batch has been deleted. By adding more messages to the queue, you can see more interesting behaviors. Imagine that a burst of 11 bids is sent to your FIFO queue, with two bids for Auction A arriving last.

The FIFO queue now has at least two batches of messages in it. When your single consumer requests the first batch of 10 messages, it receives a batch starting with B1 and ending with A1. Later, after the first batch has been deleted, the consumer can get the second batch of messages containing the final A2 message from the queue.

Adding complexity with multiple message groups

A new challenge arises. Your auction platform is getting busier and your dev team added a number of new features. The combination of increased messages and extra processing time for the new features means that a single consumer is too slow. The solution is to scale to have more consumers and process messages in parallel.

To work in parallel, your team realized that only the messages related to a single auction must be kept in order. All transactions for Auction A need to be kept in order and so do all transactions for Auction B. But the two auctions are independent, and it does not matter which auction's transactions are processed first.

FIFO can handle that case with a feature called message groups. Each transaction related to Auction A is placed by your producer into message group A, and so on. In the diagram below, Auction A and Auction B each received three bid transactions, with bid B1 arriving first. The FIFO queue always keeps transactions within a message group in the order in which they arrived.

How is this any different than earlier examples? The consumer now gets the messages ordered by message groups, all the B group messages followed by all the A group messages. Multiple message groups create the possibility of using multiple consumers, which I explain in a moment. If FIFO can’t fill up a batch of messages with a single message group, FIFO can place more than one message group in a batch of messages. But whenever possible, the queue gives you a full batch of messages from the same group.
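For example (the queue URL, message body, and IDs are illustrative), the producer tags each bid with the auction it belongs to, and a consumer pulls a batch of up to 10 messages:

# Producer: all bids for Auction A share the same message group
aws sqs send-message \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/auction-bids.fifo \
  --message-body '{"auction": "A", "bid": 205.00}' \
  --message-group-id auction-A \
  --message-deduplication-id bid-A1

# Consumer: request a batch of up to 10 messages
aws sqs receive-message \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/auction-bids.fifo \
  --max-number-of-messages 10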

The order of messages leaving a FIFO queue is governed by three rules:

  1. Return the oldest message where no other message in the same message group is currently in-flight.
  2. Return as many messages from the same message group as possible.
  3. If a message batch is still not full, go back to rule 1.

To see this behavior, add a second consumer and insert many more messages into the queue. For simplicity, the delete message action has been omitted in these diagrams but it is assumed that all messages in a batch are processed successfully by the consumer and the batch is properly deleted immediately after.

In this example, there are 11 Group A and 11 Group B transactions arriving in interleaved order and a second consumer has been added. Consumer 1 asks for a group of 10 messages and receives 10 Group A messages. Consumer 2 then asks for 10 messages but SQS knows that Group A is in flight, so it releases 10 Group B messages. The two consumers are now processing two batches of messages in parallel, speeding up throughput and then deleting their batches. When Consumer 1 requests the next batch of messages, it receives the remaining two messages, one from Group A and one from Group B.

Consider this nuanced detail from the example above. What would happen if Consumer 1 was on a faster server and processed its first batch of messages before Consumer 2 could mark its messages for deletion? See if you can predict the behavior before looking at the answer.

If Consumer 2 has not deleted its Group B messages yet when Consumer 1 asks for the next batch, then the FIFO queue considers Group B to still be in flight. It does not release any more Group B messages. Consumer 1 gets only the remaining Group A message. Later, after Consumer 2 has deleted its first batch, the remaining Group B message is released.

Conclusion

I hope this post answered your questions about how Amazon SQS FIFO queues work and why message groups are helpful. If you’re interested in exploring SQS FIFO queues further, here are a few ideas to get you started:

AWS Online Tech Talks – May and Early June 2018

Post Syndicated from Devin Watson original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-may-and-early-june-2018/

Join us this month to learn about some of the exciting new services and solution best practices at AWS. We also have our first re:Invent 2018 webinar series, “How to re:Invent”. Sign up now to learn more; we look forward to seeing you.

Note – All sessions are free and in Pacific Time.

Tech talks featured this month:

Analytics & Big Data

May 21, 2018 | 11:00 AM – 11:45 AM PT – Integrating Amazon Elasticsearch with your DevOps Tooling – Learn how you can easily integrate Amazon Elasticsearch Service into your DevOps tooling and gain valuable insight from your log data.

May 23, 2018 | 11:00 AM – 11:45 AM PT – Data Warehousing and Data Lake Analytics, Together – Learn how to query data across your data warehouse and data lake without moving data.

May 24, 2018 | 11:00 AM – 11:45 AM PT – Data Transformation Patterns in AWS – Discover how to perform common data transformations on the AWS Data Lake.

Compute

May 29, 2018 | 01:00 PM – 01:45 PM PT – Creating and Managing a WordPress Website with Amazon Lightsail – Learn about Amazon Lightsail and how you can create, run and manage your WordPress websites with Amazon’s simple compute platform.

May 30, 2018 | 01:00 PM – 01:45 PM PT – Accelerating Life Sciences with HPC on AWS – Learn how you can accelerate your Life Sciences research workloads by harnessing the power of high performance computing on AWS.

Containers

May 24, 2018 | 01:00 PM – 01:45 PM PT – Building Microservices with the 12 Factor App Pattern on AWS – Learn best practices for building containerized microservices on AWS, and how traditional software design patterns evolve in the context of containers.

Databases

May 21, 2018 | 01:00 PM – 01:45 PM PT – How to Migrate from Cassandra to Amazon DynamoDB – Get the benefits, best practices and guides on how to migrate your Cassandra databases to Amazon DynamoDB.

May 23, 2018 | 01:00 PM – 01:45 PM PT – 5 Hacks for Optimizing MySQL in the Cloud – Learn how to optimize your MySQL databases for high availability, performance, and disaster resilience using RDS.

DevOps

May 23, 2018 | 09:00 AM – 09:45 AM PT – .NET Serverless Development on AWS – Learn how to build a modern serverless application in .NET Core 2.0.

Enterprise & Hybrid

May 22, 2018 | 11:00 AM – 11:45 AM PT – Hybrid Cloud Customer Use Cases on AWS – Learn how customers are leveraging AWS hybrid cloud capabilities to easily extend their datacenter capacity, deliver new services and applications, and ensure business continuity and disaster recovery.

IoT

May 31, 2018 | 11:00 AM – 11:45 AM PT – Using AWS IoT for Industrial Applications – Discover how you can quickly onboard your fleet of connected devices, keep them secure, and build predictive analytics with AWS IoT.

Machine Learning

May 22, 2018 | 09:00 AM – 09:45 AM PT – Using Apache Spark with Amazon SageMaker – Discover how to use Apache Spark with Amazon SageMaker for training jobs and application integration.

May 24, 2018 | 09:00 AM – 09:45 AM PT – Introducing AWS DeepLens – Learn how AWS DeepLens provides a new way for developers to learn machine learning by pairing the physical device with a broad set of tutorials, examples, source code, and integration with familiar AWS services.

Management Tools

May 21, 2018 | 09:00 AM – 09:45 AM PT – Gaining Better Observability of Your VMs with Amazon CloudWatch – Learn how CloudWatch Agent makes it easy for customers like Rackspace to monitor their VMs.

Mobile

May 29, 2018 | 11:00 AM – 11:45 AM PT – Deep Dive on Amazon Pinpoint Segmentation and Endpoint Management – See how segmentation and endpoint management with Amazon Pinpoint can help you target the right audience.

Networking

May 31, 2018 | 09:00 AM – 09:45 AM PT – Making Private Connectivity the New Norm via AWS PrivateLink – See how PrivateLink enables service owners to offer private endpoints to customers outside their company.

Security, Identity, & Compliance

May 30, 2018 | 09:00 AM – 09:45 AM PT – Introducing AWS Certificate Manager Private Certificate Authority (CA) – Learn how AWS Certificate Manager (ACM) Private Certificate Authority (CA), a managed private CA service, helps you easily and securely manage the lifecycle of your private certificates.

June 1, 2018 | 09:00 AM – 09:45 AM PT – Introducing AWS Firewall Manager – Centrally configure and manage AWS WAF rules across your accounts and applications.

Serverless

May 22, 2018 | 01:00 PM – 01:45 PM PT – Building API-Driven Microservices with Amazon API Gateway – Learn how to build a secure, scalable API for your application in our tech talk about API-driven microservices.

Storage

May 30, 2018 | 11:00 AM – 11:45 AM PT – Accelerate Productivity by Computing at the Edge – Learn how AWS Snowball Edge support for compute instances helps accelerate data transfers, execute custom applications, and reduce overall storage costs.

June 1, 2018 | 11:00 AM – 11:45 AM PT – Learn to Build a Cloud-Scale Website Powered by Amazon EFS – Technical deep dive where you'll learn tips and tricks for integrating WordPress, Drupal and Magento with Amazon EFS.

AWS Online Tech Talks – April & Early May 2018

Post Syndicated from Betsy Chernoff original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-april-early-may-2018/

We have several upcoming tech talks in the month of April and early May. Come join us to learn about AWS services and solution offerings. We'll have AWS experts online to help answer questions in real time. Sign up now to learn more; we look forward to seeing you.

Note – All sessions are free and in Pacific Time.

April & early May — 2018 Schedule

Compute

April 30, 2018 | 01:00 PM – 01:45 PM PT – Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR (300) – Learn about the best practices for scaling big data workloads, as well as how to process, store, and analyze big data securely and cost effectively with Amazon EMR and Amazon EC2 Spot Instances.

May 1, 2018 | 01:00 PM – 01:45 PM PT – How to Bring Microsoft Apps to AWS (300) – Learn more about how to save significant money by bringing your Microsoft workloads to AWS.

May 2, 2018 | 01:00 PM – 01:45 PM PT – Deep Dive on Amazon EC2 Accelerated Computing (300) – Get a technical deep dive on how AWS' GPU and FPGA-based compute services can help you to optimize and accelerate your ML/DL and HPC workloads in the cloud.

Containers

April 23, 2018 | 11:00 AM – 11:45 AM PT – New Features for Building Powerful Containerized Microservices on AWS (300) – Learn about how this new feature works and how you can start using it to build and run modern, containerized applications on AWS.

Databases

April 23, 2018 | 01:00 PM – 01:45 PM PT – ElastiCache: Deep Dive Best Practices and Usage Patterns (200) – Learn about Redis-compatible in-memory data store and cache with Amazon ElastiCache.

April 25, 2018 | 01:00 PM – 01:45 PM PT – Intro to Open Source Databases on AWS (200) – Learn how to tap the benefits of open source databases on AWS without the administrative hassle.

DevOps

April 25, 2018 | 09:00 AM – 09:45 AM PT – Debug your Container and Serverless Applications with AWS X-Ray in 5 Minutes (300) – Learn how AWS X-Ray makes debugging your Container and Serverless applications fun.

Enterprise & Hybrid

April 23, 2018 | 09:00 AM – 09:45 AM PT – An Overview of Best Practices of Large-Scale Migrations (300) – Learn about the tools and best practices on how to migrate to AWS at scale.

April 24, 2018 | 11:00 AM – 11:45 AM PT – Deploy your Desktops and Apps on AWS (300) – Learn how to deploy your desktops and apps on AWS with Amazon WorkSpaces and Amazon AppStream 2.0

IoT

May 2, 2018 | 11:00 AM – 11:45 AM PT – How to Easily and Securely Connect Devices to AWS IoT (200) – Learn how to easily and securely connect devices to the cloud and reliably scale to billions of devices and trillions of messages with AWS IoT.

Machine Learning

April 24, 2018 | 09:00 AM – 09:45 AM PT – Automate for Efficiency with Amazon Transcribe and Amazon Translate (200) – Learn how you can increase the efficiency and reach of your operations with Amazon Translate and Amazon Transcribe.

April 26, 2018 | 09:00 AM – 09:45 AM PT – Perform Machine Learning at the IoT Edge using AWS Greengrass and Amazon SageMaker (200) – Learn more about developing machine learning applications for the IoT edge.

Mobile

April 30, 2018 | 11:00 AM – 11:45 AM PT – Offline GraphQL Apps with AWS AppSync (300) – Come learn how to enable real-time and offline data in your applications with GraphQL using AWS AppSync.

Networking

May 2, 2018 | 09:00 AM – 09:45 AM PT – Taking Serverless to the Edge (300) – Learn how to run your code closer to your end users in a serverless fashion. Also, David Von Lehman from Aerobatic will discuss how they used Lambda@Edge to reduce latency and cloud costs for their customer's websites.

Security, Identity & Compliance

April 30, 2018 | 09:00 AM – 09:45 AM PT Amazon GuardDuty – Let’s Attack My Account! (300) – Take an Amazon GuardDuty test drive and learn practical steps for generating test findings.

May 3, 2018 | 09:00 AM – 09:45 AM PT Protect Your Game Servers from DDoS Attacks (200) – Learn how to use the new AWS Shield Advanced for EC2 to protect your internet-facing game servers against network layer DDoS attacks and application layer attacks of all kinds.

Serverless

April 24, 2018 | 01:00 PM – 01:45 PM PT Tips and Tricks for Building and Deploying Serverless Apps In Minutes (200) – Learn tips and tricks for building and deploying serverless apps in minutes.

Storage

May 1, 2018 | 11:00 AM – 11:45 AM PT Building Data Lakes That Cost Less and Deliver Results Faster (300) – Learn how Amazon S3 Select and Amazon Glacier Select increase application performance by up to 400% and reduce total cost of ownership by extending your data lake into cost-effective archive storage.

May 3, 2018 | 11:00 AM – 11:45 AM PT Integrating On-Premises Vendors with AWS for Backup (300) – Learn how to work with AWS and technology partners to build backup & restore solutions for your on-premises, hybrid, and cloud native environments.

AWS Secrets Manager: Store, Distribute, and Rotate Credentials Securely

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/aws-secrets-manager-store-distribute-and-rotate-credentials-securely/

Today we’re launching AWS Secrets Manager which makes it easy to store and retrieve your secrets via API or the AWS Command Line Interface (CLI) and rotate your credentials with built-in or custom AWS Lambda functions. Managing application secrets like database credentials, passwords, or API Keys is easy when you’re working locally with one machine and one application. As you grow and scale to many distributed microservices, it becomes a daunting task to securely store, distribute, rotate, and consume secrets. Previously, customers needed to provision and maintain additional infrastructure solely for secrets management which could incur costs and introduce unneeded complexity into systems.

AWS Secrets Manager

Imagine that I have an application that takes incoming tweets from Twitter and stores them in an Amazon Aurora database. Previously, I would have had to request a username and password from my database administrator and embed those credentials in environment variables or, in my race to production, even in the application itself. I would also need to have our social media manager create the Twitter API credentials and figure out how to store those. This is a fairly manual process, involving multiple people, that I have to restart every time I want to rotate these credentials. With Secrets Manager, my database administrator can provide the credentials in Secrets Manager once and subsequently rely on a Secrets Manager-provided Lambda function to automatically update and rotate those credentials. My social media manager can put the Twitter API keys in Secrets Manager, which I can then access with a simple API call, and I can even rotate these programmatically with a custom Lambda function calling out to the Twitter API. My secrets are encrypted with the KMS key of my choice, and each of these administrators can explicitly grant access to these secrets with granular IAM policies for individual roles or users.

Let’s take a look at how I would store a secret using the AWS Secrets Manager console. First, I’ll click Store a new secret to get to the new secrets wizard. For my RDS Aurora instance it’s straightforward to simply select the instance and provide the initial username and password to connect to the database.

Next, I’ll fill in a quick description and a name to access my secret by. You can use whatever naming scheme you want here.

Next, we’ll configure rotation to use the Secrets Manager-provided Lambda function to rotate our password every 10 days.
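The console isn't the only way to set this schedule; a rough boto3 sketch of the equivalent API call might look like the following (the rotation Lambda ARN is a placeholder, not a value from this walkthrough):

import boto3

secrets = boto3.client("secretsmanager")

# Attach the rotation Lambda function to the secret and rotate it every 10 days.
# The function ARN below is a placeholder.
secrets.rotate_secret(
    SecretId="prod/TwitterApp/Database",
    RotationLambdaARN="arn:aws:lambda:us-east-1:123456789012:function:MyRotationFunction",
    RotationRules={"AutomaticallyAfterDays": 10},
)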

Finally, we’ll review all the details and check out our sample code for storing and retrieving our secret!

Once the secret is stored, I can review it in the console.

Now, if I needed to access these secrets I’d simply call the API.

import json
import boto3

# Create a Secrets Manager client and fetch the stored database secret.
secrets = boto3.client("secretsmanager")
response = secrets.get_secret_value(SecretId="prod/TwitterApp/Database")

# SecretString contains the JSON document Secrets Manager stored for us.
rds = json.loads(response['SecretString'])
print(rds)

Which would give me the following values:


{'engine': 'mysql',
 'host': 'twitterapp2.abcdefg.us-east-1.rds.amazonaws.com',
 'password': '-)Kw>THISISAFAKEPASSWORD:lg{&sad+Canr',
 'port': 3306,
 'username': 'ranman'}

More than passwords

AWS Secrets Manager works for more than just passwords. I can store OAuth credentials, binary data, and more. Let’s look at storing my Twitter OAuth application keys.
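For illustration, here is a rough boto3 sketch of storing those keys as a JSON-encoded secret via the API; the secret name and key names are made up for this example, not taken from the walkthrough:

import json
import boto3

secrets = boto3.client("secretsmanager")

# Store the Twitter OAuth application keys as a single JSON-encoded secret.
# The secret name and field names here are illustrative placeholders.
secrets.create_secret(
    Name="prod/TwitterApp/OAuth",
    Description="Twitter OAuth application keys",
    SecretString=json.dumps({
        "consumer_key": "EXAMPLE_KEY",
        "consumer_secret": "EXAMPLE_SECRET",
    }),
)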

Now, I can define the rotation for these third-party OAuth credentials with a custom AWS Lambda function that can call out to Twitter whenever we need to rotate our credentials.

Custom Rotation

One of the niftiest features of AWS Secrets Manager is custom AWS Lambda functions for credential rotation. This allows you to define completely custom workflows for credentials. Secrets Manager will call your Lambda function with a payload that includes a Step, which specifies which step of the rotation you’re in; a SecretId, which specifies which secret the rotation is for; and, importantly, a ClientRequestToken, which is used to ensure idempotency in any changes to the underlying secret.

When you’re rotating secrets you go through a few different steps:

  1. createSecret
  2. setSecret
  3. testSecret
  4. finishSecret

The advantage of these steps is that you can add any kind of approval steps you want for each phase of the rotation. For more details on custom rotation check out the documentation.
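To make the flow concrete, a bare-bones Python rotation handler might branch on the Step value like this; the step bodies are left as placeholders rather than a working rotation implementation:

def lambda_handler(event, context):
    # Secrets Manager invokes the rotation function with these three fields.
    step = event["Step"]
    secret_id = event["SecretId"]
    token = event["ClientRequestToken"]

    if step == "createSecret":
        pass  # generate a new secret version and stage it as AWSPENDING
    elif step == "setSecret":
        pass  # apply the staged credentials to the target service
    elif step == "testSecret":
        pass  # verify that the staged credentials actually work
    elif step == "finishSecret":
        pass  # move the AWSCURRENT staging label to the new version
    else:
        raise ValueError("Unknown rotation step: " + step)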

Available Now
AWS Secrets Manager is available today in US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), EU (Frankfurt), EU (Ireland), EU (London), and South America (São Paulo). Secrets are priced at $0.40 per month per secret and $0.05 per 10,000 API calls. I’m looking forward to seeing more users adopt rotating credentials to secure their applications!

Randall

Amazon ECS Service Discovery

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/amazon-ecs-service-discovery/

Amazon ECS now includes integrated service discovery. This makes it possible for an ECS service to automatically register itself with a predictable and friendly DNS name in Amazon Route 53. As your services scale up or down in response to load or container health, the Route 53 hosted zone is kept up to date, allowing other services to look up where they need to make connections based on the state of each service. You can see a demo of service discovery in an imaginary social networking app over at: https://servicediscovery.ranman.com/.

Service Discovery


Part of the transition to microservices and modern architectures involves having dynamic, autoscaling, and robust services that can respond quickly to failures and changing loads. Your services probably have complex dependency graphs of services they rely on and services they provide. A modern architectural best practice is to loosely couple these services by allowing them to specify their own dependencies, but this can be complicated in dynamic environments as your individual services are forced to find their own connection points.

Traditional approaches to service discovery like Consul, etcd, or ZooKeeper all solve this problem well, but they require provisioning and maintaining additional infrastructure or installing agents in your containers or on your instances. Previously, to ensure that services were able to discover and connect with each other, you had to configure and run your own service discovery system or connect every service to a load balancer. Now, you can enable service discovery for your containerized services in the ECS console, AWS CLI, or using the ECS API.

Introducing Amazon Route 53 Service Registry and Auto Naming APIs

Amazon ECS Service Discovery works by communicating with the Amazon Route 53 Service Registry and Auto Naming APIs. Since we haven’t talked about it before on this blog, I want to briefly outline how these Route 53 APIs work. First, some vocabulary:

  • Namespaces – A namespace specifies a domain name you want to route traffic to (e.g. internal, local, corp). You can think of it as a logical boundary within which services should be able to discover each other. You can create a namespace with a call to the aws servicediscovery create-private-dns-namespace command or in the ECS console. Namespaces are roughly equivalent to hosted zones in Route 53. A namespace contains services, our next vocabulary word.
  • Service – A service is a specific application or set of applications in your namespace like “auth”, “timeline”, or “worker”. A service contains service instances.
  • Service Instance – A service instance contains information about how Route 53 should respond to DNS queries for a resource.

Route 53 provides APIs to create: namespaces, A records per task IP, and SRV records per task IP + port.
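As a small illustration, the namespace creation mentioned above can also be done from code; a boto3 sketch, where the VPC ID is a placeholder:

import boto3

sd = boto3.client("servicediscovery")

# Create a private DNS namespace named "corp" and associate it with a VPC.
# Equivalent to the create-private-dns-namespace CLI call; the VPC ID is a placeholder.
response = sd.create_private_dns_namespace(
    Name="corp",
    Vpc="vpc-0123456789abcdef0",
)
print(response["OperationId"])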

When we ask Route 53 for something like worker.corp, we should get back a set of possible IPs that could fulfill that request. If the application we’re connecting to exposes dynamic ports, then the calling application can easily query the SRV record to get more information.

ECS service discovery is built on top of the Route 53 APIs and manages all of the underlying API calls for you. Now that we understand how the service registry works, let’s take a look at the ECS side to see service discovery in action.

Amazon ECS Service Discovery

Let’s launch an application with service discovery! First, I’ll create two task definitions: “flask-backend” and “flask-worker”. Both are simple AWS Fargate tasks with a single container serving HTTP requests. I’ll have flask-backend ask worker.corp to do some work and I’ll return the response as well as the address Route 53 returned for worker. Something like the code below:

import os
import socket
import requests
from flask import Flask

app = Flask(__name__)
namespace = os.getenv("namespace")   # e.g. ".corp", so worker_host becomes "worker.corp"
worker_host = "worker" + namespace

@app.route("/")
def backend():
    r = requests.get("http://" + worker_host)
    worker = socket.gethostbyname(worker_host)
    return "Worker Message: {}\nFrom: {}".format(r.content, worker)

Now, with my containers and task definitions in place, I’ll create a service in the console.

As I move to step two in the service wizard I’ll fill out the service discovery section and have ECS create a new namespace for me.

I’ll also tell ECS to monitor the health of the tasks in my service and add or remove them from Route 53 as needed. Then I’ll set a TTL of 10 seconds on the A records we’ll use.

I’ll repeat those same steps for my “worker” service and after a minute or so most of my tasks should be up and running.

Over in the Route 53 console I can see all the records for my tasks!

We can use the Route 53 service discovery APIs to list all of our available services and tasks and programmatically reach out to each one, as in the sketch below. We could easily extend this to any number of services beyond just backend and worker. I’ve created a simple demo of an imaginary social network with services like “auth”, “feed”, “timeline”, “worker”, “user” and more here: https://servicediscovery.ranman.com/. You can see the code used to run that page on GitHub.
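A rough sketch of that lookup with boto3 (pagination omitted for brevity) could look like this:

import boto3

sd = boto3.client("servicediscovery")

# List every registered service, then every registered instance (task) in it.
for service in sd.list_services()["Services"]:
    print("Service:", service["Name"])
    for instance in sd.list_instances(ServiceId=service["Id"])["Instances"]:
        attrs = instance["Attributes"]
        print("  ", instance["Id"], attrs.get("AWS_INSTANCE_IPV4"), attrs.get("AWS_INSTANCE_PORT"))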

Available Now
Amazon ECS service discovery is available now in US East (N. Virginia), US East (Ohio), US West (Oregon), and EU (Ireland). AWS Fargate is currently only available in US East (N. Virginia). When you use ECS service discovery, you pay for the Route 53 resources that you consume, including each namespace that you create, and for the lookup queries your services make. Container level health checks are provided at no cost. For more information on pricing check out the documentation.

Please let us know what you’ll be building or refactoring with service discovery either in the comments or on Twitter!

Randall

P.S. Every blog post I write is made with a tremendous amount of help from numerous AWS colleagues. To everyone that helped build service discovery across all of our teams – thank you :)!

Our Newest AWS Community Heroes (Spring 2018 Edition)

Post Syndicated from Betsy Chernoff original https://aws.amazon.com/blogs/aws/our-newest-aws-community-heroes-spring-2018-edition/

The AWS Community Heroes program helps shine a spotlight on some of the innovative work being done by rockstar AWS developers around the globe. Marrying cloud expertise with a passion for community building and education, these Heroes share their time and knowledge across social media and in-person events. Heroes also actively help drive content at Meetups, workshops, and conferences.

This March, we have five Heroes that we’re happy to welcome to our network of cloud innovators:

Peter Sbarski

Peter Sbarski is VP of Engineering at A Cloud Guru and the organizer of Serverlessconf, the world’s first conference dedicated entirely to serverless architectures and technologies. His work at A Cloud Guru allows him to work with, talk about, and write about serverless architectures, cloud computing, and AWS. He has written a book called Serverless Architectures on AWS and is currently collaborating on another book called Serverless Design Patterns with Tim Wagner and Yochay Kiriaty.

Peter is always happy to talk about cloud computing and AWS, and can be found at conferences and meetups throughout the year. He helps to organize Serverless Meetups in Melbourne and Sydney in Australia, and is always keen to share his experience working on interesting and innovative cloud projects.

Peter’s passions include serverless technologies, event-driven programming, back end architecture, microservices, and orchestration of systems. Peter holds a PhD in Computer Science from Monash University, Australia and can be followed on Twitter, LinkedIn, Medium, and GitHub.

Michael Wittig

Michael Wittig is co-founder of widdix, a consulting company focused on cloud architecture, DevOps, and software development on AWS. widdix maintains several AWS-related open source projects, most notably a collection of production-ready CloudFormation templates. In 2016, widdix released marbot, a Slack bot that helps your DevOps team detect and resolve incidents on AWS.

Together with his brother Andreas Wittig, Michael actively creates AWS-related content. Their book Amazon Web Services in Action (Manning) introduces AWS with a strong focus on automation. Andreas and Michael run the blog cloudonaut.io, where they share their knowledge about AWS with the community. The Wittig brothers have also published a number of video courses with O’Reilly, Manning, Pluralsight, and A Cloud Guru. You can also find them speaking at conferences and user groups in Europe. Both brothers co-organize the AWS user group in Stuttgart.

Fernando Hönig

Fernando is an experienced Infrastructure Solutions Leader, holding 5 AWS Certifications, with extensive IT architecture and management experience in a variety of market sectors. Working as a Cloud Architect Consultant in the United Kingdom since 2014, Fernando has built an online community for Hispanic speakers worldwide.

Fernando founded a LinkedIn group, a Slack community, and a YouTube channel, all of them named “AWS en Español”, and started to run a monthly webinar via YouTube streaming where different leaders discuss aspects and challenges around the AWS Cloud.

Over the last 18 months, he has been helping to run and coach AWS User Group leaders across LATAM and Spain, and 10 new User Groups have been founded during that time.

Feel free to follow Fernando on Twitter, connect with him on LinkedIn, or join the ever-growing Hispanic Community via Slack, LinkedIn or YouTube.

Anders Bjørnestad

Anders is a consultant and cloud evangelist at Webstep AS in Norway. He finished his degree in Computer Science at the Norwegian Institute of Technology at about the same time the Internet emerged as a public service. Since then he has been an IT consultant and a passionate advocate of knowledge-sharing.

He architected and implemented his first customer solution on AWS back in 2010 and has been essential in building Webstep’s core cloud team. Anders applies his broad expert knowledge across all layers of the organizational stack. He engages with developers on technology and architectures, and with top management, where he advises about cloud strategies and new business models.

Anders enjoys helping people increase their understanding of AWS and cloud in general, and holds several AWS certifications. He co-founded and co-organizes the AWS User Groups in the largest cities in Norway (Oslo, Bergen, Trondheim and Stavanger), and also uses any opportunity to engage in events related to AWS and cloud wherever he is.

You can follow him on Twitter or connect with him on LinkedIn.

To learn more about the AWS Community Heroes Program and how to get involved with your local AWS community, click here.

Reactive Microservices Architecture on AWS

Post Syndicated from Sascha Moellering original https://aws.amazon.com/blogs/architecture/reactive-microservices-architecture-on-aws/

Microservice-application requirements have changed dramatically in recent years. These days, applications operate with petabytes of data, need almost 100% uptime, and end users expect sub-second response times. Typical N-tier applications can’t deliver on these requirements.

The Reactive Manifesto, published in 2014, describes the essential characteristics of reactive systems: responsiveness, resiliency, elasticity, and being message driven.

Being message driven is perhaps the most important characteristic of reactive systems. Asynchronous messaging helps in the design of loosely coupled systems, which is a key factor for scalability. In order to build a highly decoupled system, it is important to isolate services from each other. As already described, isolation is an important aspect of the microservices pattern. Indeed, reactive systems and microservices are a natural fit.

Implemented Use Case
This reference architecture illustrates a typical ad-tracking implementation.

Many ad-tracking companies collect massive amounts of data in near-real-time. In many cases, these workloads are very spiky and heavily depend on the success of the ad-tech companies’ customers. Typically, an ad-tracking-data use case can be separated into a real-time part and a non-real-time part. In the real-time part, it is important to collect data as fast as possible and ask several questions, including: “Is this a valid combination of parameters?” “Does this program exist?” “Is this program still valid?”

Because response time has a huge impact on conversion rate in advertising, it is important for advertisers to respond as fast as possible. The information needed for this validation should be kept in memory to reduce communication overhead with the caching infrastructure. The tracking application itself should be as lightweight and scalable as possible. For example, the application shouldn’t have any shared mutable state and it should use reactive paradigms. In our implementation, one main application is responsible for this real-time part. It collects and validates data, responds to the client as fast as possible, and asynchronously sends events to backend systems.

The non-real-time part of the application consumes the generated events and persists them in a NoSQL database. In a typical tracking implementation, clicks, cookie information, and transactions are matched asynchronously and persisted in a data store. The matching part is not implemented in this reference architecture. Many ad-tech architectures use frameworks like Hadoop for the matching implementation.

The system can be logically divided into the data collection part and the core data update part. The data collection part is responsible for collecting, validating, and persisting the data. In the core data update part, the data that is used for validation gets updated and all subscribers are notified of new data.

Components and Services

Main Application
The main application is implemented using Java 8 and uses Vert.x as the main framework. Vert.x is an event-driven, reactive, non-blocking, polyglot framework to implement microservices. It runs on the Java virtual machine (JVM) by using the low-level IO library Netty. You can write applications in Java, JavaScript, Groovy, Ruby, Kotlin, Scala, and Ceylon. The framework offers a simple and scalable actor-like concurrency model. Vert.x calls handlers by using a thread known as an event loop. To use this model, you have to write code known as “verticles.” Verticles share certain similarities with actors in the actor model. To use them, you have to implement the verticle interface. Verticles communicate with each other by sending messages on a single event bus. Those messages are sent on the event bus to a specific address, and verticles can register to this address by using handlers.

With only a few exceptions, none of the APIs in Vert.x block the calling thread. Similar to Node.js, Vert.x uses the reactor pattern. However, in contrast to Node.js, Vert.x uses several event loops. Unfortunately, not all APIs in the Java ecosystem are written asynchronously, for example, the JDBC API. Vert.x offers the possibility to run these blocking APIs without blocking the event loop. These special verticles are called worker verticles. You don’t execute worker verticles by using the standard Vert.x event loops, but by using a dedicated thread from a worker pool. This way, the worker verticles don’t block the event loop.

Our application consists of five different verticles covering different aspects of the business logic. The main entry point for our application is the HttpVerticle, which exposes an HTTP endpoint to consume HTTP requests and to support proper health checking. Data from HTTP requests such as parameters and user-agent information are collected and transformed into a JSON message. In order to validate the input data (to ensure that the program exists and is still valid), the message is sent to the CacheVerticle.

This verticle implements an LRU-cache with a TTL of 10 minutes and a capacity of 100,000 entries. Instead of adding additional functionality to a standard JDK map implementation, we use Google Guava, which has all the features we need. If the data is not in the L1 cache, the message is sent to the RedisVerticle. This verticle is responsible for data residing in Amazon ElastiCache and uses the Vert.x-redis-client to read data from Redis. In our example, Redis is the central data store. However, in a typical production implementation, Redis would just be the L2 cache with a central data store like Amazon DynamoDB. One of the most important paradigms of a reactive system is to switch from a pull- to a push-based model. To achieve this and reduce network overhead, we’ll use Redis pub/sub to push core data changes to our main application.

Vert.x also supports direct Redis pub/sub integration; the following code shows our subscriber implementation:

vertx.eventBus().<JsonObject>consumer(REDIS_PUBSUB_CHANNEL_VERTX, received -> {
    JsonObject value = received.body().getJsonObject("value");
    String message = value.getString("message");
    JsonObject jsonObject = new JsonObject(message);
    eb.send(CACHE_REDIS_EVENTBUS_ADDRESS, jsonObject);
});

redis.subscribe(Constants.REDIS_PUBSUB_CHANNEL, res -> {
    if (res.succeeded()) {
        LOGGER.info("Subscribed to " + Constants.REDIS_PUBSUB_CHANNEL);
    } else {
        LOGGER.info(res.cause());
    }
});

The verticle subscribes to the appropriate Redis pub/sub channel. If a message is sent over this channel, the payload is extracted and forwarded to the cache verticle that stores the data in the L1 cache. After storing and enriching data, a response is sent back to the HttpVerticle, which responds to the HTTP request that initially hit this verticle. In addition, the message is wrapped in protocol buffers, converted to a ByteBuffer, and sent to an Amazon Kinesis Data Stream.

The following example shows a stripped-down version of the KinesisVerticle:

public class KinesisVerticle extends AbstractVerticle {

    private static final Logger LOGGER = LoggerFactory.getLogger(KinesisVerticle.class);

    private AmazonKinesisAsync kinesisAsyncClient;
    private String eventStream = "EventStream";

    @Override
    public void start() throws Exception {
        EventBus eb = vertx.eventBus();

        kinesisAsyncClient = createClient();
        eventStream = System.getenv(STREAM_NAME) == null ? "EventStream" : System.getenv(STREAM_NAME);

        eb.consumer(Constants.KINESIS_EVENTBUS_ADDRESS, message -> {
            try {
                TrackingMessage trackingMessage = Json.decodeValue((String) message.body(), TrackingMessage.class);
                String partitionKey = trackingMessage.getMessageId();

                byte[] byteMessage = createMessage(trackingMessage);
                ByteBuffer buf = ByteBuffer.wrap(byteMessage);

                sendMessageToKinesis(buf, partitionKey);
                message.reply("OK");
            }
            catch (KinesisException exc) {
                LOGGER.error(exc);
            }
        });
    }
}

Kinesis Consumer
This AWS Lambda function consumes data from an Amazon Kinesis Data Stream and persists the data in an Amazon DynamoDB table. In order to improve testability, the invocation code is separated from the business logic. The invocation code is implemented in the class KinesisConsumerHandler and iterates over the Kinesis events pulled from the Kinesis stream by AWS Lambda. Each Kinesis event is unwrapped and transformed from a ByteBuffer into protocol buffers and then converted into a Java object. Those Java objects are passed to the business logic, which persists the data in a DynamoDB table. To improve the performance of successive Lambda calls, the DynamoDB client is instantiated lazily and reused if possible.
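The consumer in this architecture is written in Java, but the overall shape of the handler is easy to sketch in a few lines of Python; here plain JSON stands in for protocol buffers, and the table name is a placeholder:

import base64
import json
from decimal import Decimal

import boto3

# Instantiate the DynamoDB resource outside the handler so it is reused
# across successive Lambda invocations. The table name is a placeholder.
table = boto3.resource("dynamodb").Table("TrackingTable")

def lambda_handler(event, context):
    # Each Kinesis record arrives base64-encoded in the Lambda event payload.
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        item = json.loads(payload, parse_float=Decimal)  # JSON instead of protocol buffers
        table.put_item(Item=item)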

Redis Updater
From time to time, it is necessary to update core data in Redis. A very efficient implementation for this requirement is to use AWS Lambda and Amazon Kinesis. New core data is sent over the Amazon Kinesis data stream using JSON as the data format and consumed by a Lambda function. This function iterates over the Kinesis events pulled from the Kinesis stream by AWS Lambda. Each Kinesis event is unwrapped and transformed from a ByteBuffer to a String and converted into a Java object. The Java object is passed to the business logic and stored in Redis. In addition, the new core data is also sent to the main application using Redis pub/sub in order to reduce network overhead and convert from a pull- to a push-based model.

The following example shows the source code to store data in Redis and notify all subscribers:

public void updateRedisData(final TrackingMessage trackingMessage, final Jedis jedis, final LambdaLogger logger) {
    try {
        ObjectMapper mapper = new ObjectMapper();
        String jsonString = mapper.writeValueAsString(trackingMessage);
        Map<String, String> map = marshal(jsonString);
        String statusCode = jedis.hmset(trackingMessage.getProgramId(), map);
    }
    catch (Exception exc) {
        if (null == logger)
            exc.printStackTrace();
        else
            logger.log(exc.getMessage());
    }
}

public void notifySubscribers(final TrackingMessage trackingMessage, final Jedis jedis, final LambdaLogger logger) {
    try {
        ObjectMapper mapper = new ObjectMapper();
        String jsonString = mapper.writeValueAsString(trackingMessage);
        jedis.publish(Constants.REDIS_PUBSUB_CHANNEL, jsonString);
    }
    catch (final IOException e) {
        log(e.getMessage(), logger);
    }
}

Similar to our Kinesis consumer, the Redis client is instantiated lazily.

Infrastructure as Code
As already outlined, latency and response time are critical for any ad-tracking solution because response time has a huge impact on conversion rate. In order to reduce latency for customers worldwide, it is common practice to roll out the infrastructure in different AWS Regions around the world to be as close to the end customer as possible. AWS CloudFormation can help you model and set up your AWS resources so that you can spend less time managing those resources and more time focusing on your applications that run in AWS.

You create a template that describes all the AWS resources that you want (for example, Amazon EC2 instances or Amazon RDS DB instances), and AWS CloudFormation takes care of provisioning and configuring those resources for you. Our reference architecture can be rolled out in different Regions using an AWS CloudFormation template, which sets up the complete infrastructure (for example, Amazon Virtual Private Cloud (Amazon VPC), Amazon Elastic Container Service (Amazon ECS) cluster, Lambda functions, DynamoDB table, Amazon ElastiCache cluster, etc.).
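For example, deploying the same template to several Regions can be scripted; a minimal boto3 sketch, where the stack name, template URL, and Region list are placeholders:

import boto3

REGIONS = ["us-east-1", "eu-west-1", "ap-southeast-2"]  # placeholder Region list
TEMPLATE_URL = "https://s3.amazonaws.com/my-bucket/reactive-refarch.yaml"  # placeholder

for region in REGIONS:
    cfn = boto3.client("cloudformation", region_name=region)
    # Create the full stack (VPC, ECS cluster, Lambda functions, DynamoDB table,
    # ElastiCache cluster, ...) in each Region from the same template.
    cfn.create_stack(
        StackName="reactive-refarch",
        TemplateURL=TEMPLATE_URL,
        Capabilities=["CAPABILITY_IAM"],
    )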

Conclusion
In this blog post we described reactive principles and an example architecture with a common use case. We leveraged the capabilities of different frameworks in combination with several AWS services in order to implement reactive principles—not only at the application-level but also at the system-level. I hope I’ve given you ideas for creating your own reactive applications and systems on AWS.

About the Author

Sascha Moellering is a Senior Solution Architect. Sascha is primarily interested in automation, infrastructure as code, distributed computing, containers and JVM. He can be reached at [email protected]