All posts by Vittorio Denti

Let’s Architect! Open-source technologies on AWS

Post Syndicated from Vittorio Denti original https://aws.amazon.com/blogs/architecture/lets-architect-open-source-technologies-on-aws/

We brought you a Let’s Architect! blog post about open-source on AWS that covered technologies with development led by AWS/Amazon, as well as well-known solutions available on managed AWS services. Today, we’re following the same approach to share more insights into the process of developing open-source software itself. That’s why the first topic we discuss in this post is a re:Invent talk from Heitor Lessa, Principal Solutions Architect at AWS, explaining some interesting approaches for developing and scaling successful open-source projects.

This edition of Let’s Architect! also touches on observability with OpenTelemetry, Apache Kafka on AWS, and Infrastructure as Code with a hands-on workshop on AWS Cloud Development Kit (AWS CDK).

Powertools for AWS Lambda: Lessons from the road to 10 million downloads

Powertools for AWS Lambda is an open-source library that helps engineering teams implement serverless best practices. In two years, Powertools went from an initial prototype to a fast-growing project in the open-source world. Rapid growth, along with support from a wide community, led to challenges: from balancing new features with operational excellence, to triaging bug reports and RFCs, to scaling and redesigning documentation.

In this session, you can learn about Powertools for AWS Lambda to understand what it is and the problems it solves. Moreover, there are many valuable lessons on how to create and scale a successful open-source project. From managing the trade-off between releasing new features and achieving operational stability to measuring the impact of the project, there are many challenges in open-source projects that require careful thought.

Take me to this video!

Heitor Lessa describing one of the key lessons: developing and releasing new features should be as important as the other activities (governance, operational excellence, and more).

Observability the open-source way

The recent blog post Let’s Architect! Monitoring production systems at scale talks about the importance of monitoring. Setting up observability is critical to maintain application and infrastructure health, but instrumenting applications to collect monitoring signals such as metrics and logs can be challenging when using vendor-specific SDKs.

This video introduces you to OpenTelemetry, an open-source observability framework. OpenTelemetry provides a single, flexible, vendor-agnostic SDK based on open-source specifications that developers can use to instrument and collect signals from applications. This resource explains how it works in practice and how to monitor microservice-based applications with the OpenTelemetry SDK.

Take me to this video!

With AWS Distro for OpenTelemetry, you can collect data from your AWS resources.
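If you want to experiment before watching, the following is a minimal sketch of manual instrumentation with the OpenTelemetry Python SDK. The service name, span name, and console exporter are illustrative assumptions; in practice you would typically export to an OpenTelemetry collector such as ADOT.

```python
# Minimal sketch of manual instrumentation with the OpenTelemetry Python SDK.
# Service and span names are illustrative; swap the ConsoleSpanExporter for an
# OTLP exporter pointed at your collector in a real deployment.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Register a tracer provider that identifies this service in the telemetry it emits
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def handle_order(order_id: str) -> None:
    # Each unit of work becomes a span; attributes carry business context
    with tracer.start_as_current_span("handle-order") as span:
        span.set_attribute("order.id", order_id)
        # ... business logic goes here ...

handle_order("order-123")
```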

Best practices for right-sizing your Apache Kafka clusters to optimize performance and cost

Apache Kafka is an open-source streaming data store that decouples the applications producing streaming data (producers) from the applications consuming it (consumers) by persisting events in its data store. Amazon Managed Streaming for Apache Kafka (Amazon MSK) allows you to use the open-source version of Apache Kafka while the service manages infrastructure and operations for you.

This blog post explains how the underlying infrastructure configuration can affect Apache Kafka performance. You can learn strategies on how to size the clusters to meet the desired throughput, availability, and latency requirements. This resource helps you discover strategies to find the optimal sizing for your resources, and learn the mental models adopted to conduct the investigation and derive the conclusions.

Take me to this blog!

Comparisons of put latencies for three clusters with different broker sizes
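As a companion to the sizing discussion, here is a hedged sketch of a Kafka producer written with the kafka-python library. The broker address, topic name, and the batching and compression settings are placeholders to adapt to your own throughput, latency, and durability targets.

```python
# Sketch of a Kafka producer (kafka-python) with batching and compression settings
# that influence the throughput a cluster must sustain. Broker address and topic
# name are placeholders for your Amazon MSK bootstrap brokers and topic.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["b-1.example.kafka.eu-west-1.amazonaws.com:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",               # wait for all in-sync replicas, trading latency for durability
    compression_type="gzip",  # smaller payloads reduce network and storage pressure
    linger_ms=20,             # give the producer time to batch records together
    batch_size=64 * 1024,     # larger batches generally improve broker throughput
)

for i in range(1000):
    producer.send("clickstream", {"event_id": i, "action": "play"})

producer.flush()
```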

AWS Cloud Development Kit workshop

AWS Cloud Development Kit (AWS CDK) is an open-source software development framework that allows you to provision cloud resources programmatically (Infrastructure as Code, or IaC) by using familiar programming languages such as Python, TypeScript, JavaScript, Java, Go, and C#/.NET.

AWS CDK allows you to create reusable templates and assets, test your infrastructure, make deployments repeatable, and make your cloud environment stable by removing manual (and error-prone) operations. This workshop introduces you to AWS CDK, where you can learn how to provision an initial simple application as well as become familiar with more advanced concepts like CDK constructs.

Take me to this workshop!

This construct can be attached to any Lambda function that is used as an API Gateway backend. It counts how many requests were issued to each URL.
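To give a flavor of what the workshop covers, here is a minimal AWS CDK sketch in Python (CDK v2) that provisions a Lambda function behind an API Gateway endpoint. The asset path and resource names are assumptions for illustration, not the workshop’s exact code.

```python
# Minimal AWS CDK (v2, Python) sketch: a Lambda function exposed through API Gateway,
# similar in spirit to the backend described above. The asset path "lambda/" and all
# resource names are placeholders.
from aws_cdk import App, Stack
from aws_cdk import aws_apigateway as apigw
from aws_cdk import aws_lambda as _lambda
from constructs import Construct

class HelloStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Lambda function whose code lives in the local "lambda/" directory
        handler = _lambda.Function(
            self, "HelloHandler",
            runtime=_lambda.Runtime.PYTHON_3_11,
            handler="hello.handler",
            code=_lambda.Code.from_asset("lambda"),
        )

        # REST API that proxies every request to the Lambda function
        apigw.LambdaRestApi(self, "Endpoint", handler=handler)

app = App()
HelloStack(app, "HelloStack")
app.synth()
```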

See you next time!

Thanks for joining our conversation! To find all the blogs from this series, you can check out the Let’s Architect! list of content on the AWS Architecture Blog.

Let’s Architect! Monitoring production systems at scale

Post Syndicated from Vittorio Denti original https://aws.amazon.com/blogs/architecture/lets-architect-monitoring-production-systems-at-scale/

“Everything fails, all the time” is a famous quote from Amazon’s Chief Technology Officer Werner Vogels. This means that software and distributed systems may eventually fail because something can always go wrong. We have to accept this and design our systems accordingly, test our software and services, and think about all the possible edge cases.

With this in mind, we should also set our teams up for success by providing visibility in every environment for a quick turnaround when incidents happen. When a system serves traffic in production, we need to monitor it to make sure it behaves as expected and that all components are healthy. But questions arise such as:

  • How do we monitor a system?
  • What is monitoring?
  • What are some architectural and engineering approaches to implement in order to design a successful monitoring strategy?

All of these questions require complex answers. It’s not possible to cover everything in a blog post, but let’s start exploring the topic and sharing resources to guide you through this domain.

In this edition of Let’s Architect! we share some practices for monitoring used at Amazon and AWS, as well as more resources to discover how to build monitoring solutions for the workloads running on AWS.

Observability best practices at Amazon

Observability and monitoring are engineering tasks that also require putting a suitable cultural mindset in place. At Amazon, if a service doesn’t run as expected, the team writes a CoE (Correction of Errors) document to analyze the issue and answer critical questions to learn from it. There are also weekly operations meetings to analyze operational and performance dashboards for each service.

The session introduced here covers the full range of monitoring at Amazon, from how teams assess system health at a high level to how they understand the details of a single request. Use this resource to learn some best practices for metrics, logs, and tracing, and how to use these signals to achieve operational excellence.

Take me to this re:Invent video!

Observability is an iterative process which requires us to establish a feedback loop and improve based on the signals coming from the system.

Build an observability solution using managed AWS services and the OpenTelemetry standard

Visibility into what’s happening in a distributed system is key to operating workloads at scale. OpenTelemetry is the standard for observability, and AWS services are fully integrated with it. The blog post introduced in this section shows you how AWS Distro for OpenTelemetry (ADOT) works under the hood and how to use it with a Kubernetes cluster. Keep in mind that this is just one of the many implementations available for AWS compute services and OpenTelemetry, so even if you’re not using Kubernetes right now, we’ve still got you covered!

Want more? Watch this re:Invent video for an understanding of how to think about logging, tracing, metrics, and monitoring with AWS services, and the possibilities to provide the observability your distributed systems need. This is a great learning resource with many demos and examples.

Take me to this blog post!

Flow of metrics and traces from Application services to the Observability Platform.

Optimizing your AWS Batch architecture for scale with observability dashboards

We’ve explored the mental models and strategies for monitoring in previous resources. Now let’s see how these principles can be applied in a scenario where we run batch and ML computing jobs at scale. In the blog post introduced in this section, you can learn how to use runtime metrics to understand an architecture designed on AWS Batch for running batch computing jobs. AWS Batch is a fully managed service enabling you to run jobs at any scale without needing to manage underlying compute resources. This blog explains how AWS Batch works and guides you through the process used to design a monitoring framework.

Since the solution is open-source, you are free to add other custom metrics you find useful. To get started with the AWS Batch open-source observability solution, visit the project page on GitHub. Several customers have used this monitoring tool to optimize their workload for scale by reshaping their jobs, refining their instance selection, and tuning their AWS Batch architecture.

Take me to this blog!

High-level structure of AWS Batch resources and interactions. This diagram depicts a user submitting jobs based on a job definition template to a job queue, which then communicates to a compute environment that resources are needed.
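As a rough illustration of the submission flow in that diagram, the following boto3 sketch submits a job based on a job definition to a job queue, and AWS Batch then schedules it onto a compute environment. The queue name, job definition, and command are placeholders.

```python
# Hedged sketch of submitting an AWS Batch job with boto3. Queue and definition
# names are placeholders; AWS Batch schedules the job onto a compute environment.
import boto3

batch = boto3.client("batch")

response = batch.submit_job(
    jobName="nightly-simulation-001",
    jobQueue="my-job-queue",              # placeholder job queue
    jobDefinition="my-job-definition:1",  # placeholder job definition (name:revision)
    containerOverrides={
        "command": ["python", "run_simulation.py", "--shard", "1"],
    },
)

print("Submitted job:", response["jobId"])
```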

Observability workshop

This resource provides hands-on experience with the variety of toolsets AWS offers to set up monitoring and observability for your applications. Whether your workload runs on premises or on AWS, and whether your application is a monolith or built on a modern microservices architecture, these observability tools can provide deeper insights into application performance and health.

The monitoring tools covered in this workshop provide powerful capabilities that enable you to identify bottlenecks, issues, and defects without having to manually sift through various logs, metrics, and trace data.

Take me to this workshop!

The diagram illustrates the various components of the PetAdoptions architecture. In the workshop you will learn how to monitor this application.

See you next time!

Thanks for exploring architecture tools and resources with us!

Next time we’ll talk about containers on AWS.

To find all the posts from this series, check out the Let’s Architect! page of the AWS Architecture Blog.

Design a data mesh with event streaming for real-time recommendations on AWS

Post Syndicated from Vittorio Denti original https://aws.amazon.com/blogs/big-data/design-a-data-mesh-with-event-streaming-for-real-time-recommendations-on-aws/

This blog post was co-authored with Federico Piccinini.

The data landscape has been changing in recent years: there is a proliferation of entities producing and consuming large quantities of data within companies, and for most of them defining a proper data strategy has become of fundamental importance. A modern data strategy gives you a comprehensive plan to manage, access, analyze, and act on data.

As a result, more companies are considering the adoption of a data mesh architecture, a recently introduced paradigm where data is organized by domain, clear ownership of data and technology stack is enhanced, and a more agile setup is achieved. Because of this, some of your applications may need to be designed for a data-by-domain separation in order to benefit from a data mesh architecture.

In this post, we show you how to design a data mesh architecture for a scenario that requires real-time recommendations. The recommendation system is implemented through Amazon Personalize, a fully managed machine learning (ML) service, and works by consuming data by domain. For recommendations use cases, it’s important to have access to information about users, items, and interactions, often associated with different data sources within a company.

Because ML applications may have multiple types of input data, we propose a solution that works both for data at rest and for real-time streaming. Real-time recommendations require streaming data in order to adapt to the user’s current intent.

Throughout the post, we introduce the data mesh paradigm and then extend it to a real-time use case by adding event streaming capabilities. We consider the use case of a music streaming company that offers its customers the opportunity to listen to on-demand songs. The company has also started to offer, through the same platform, on-demand podcasts, and wants to take advantage of a modern data architecture to support data access for fast ML experimentation and inference.

Data mesh: A paradigm shift

Domain-driven design (DDD) represents a software design approach where complex solutions are divided into domains according to the underlying business logic. An architectural style that is often mentioned in the context of DDD is microservice architecture, a concept where software systems are structured into loosely coupled entities, namely microservices, each one owned by a small team and structured around business requirements. These paradigms, together with the advancement of cloud technologies, allowed companies to release software updates faster and continuously adapt their technology stack to evolving business requirements.

However, unlike software architectures, most data architectures were still designed around technologies rather than business domains. This changed in 2019, when Zhamak Dehghani introduced the data mesh. Data mesh is a paradigm shift towards data being treated as a product and processed as part of a domain. Data mesh applies the principles of DDD to data architectures: data is organized into data domains and the data is considered the product that the team owns and offers for consumption. This is a shift from a centralized ownership model to a decentralized one that allows companies to access data at scale. This shift also allows each team assigned to a data domain to build the data products by choosing the right technology for their job, analogous to software engineers working on a microservice.

Data mesh advocates for decentralized ownership and delivery of data management systems, while emphasizing the need for distributed governance and self-service tooling. The data mesh approach enables better autonomy of data domain owners and brings domains together to enable data sharing and federation across business units, without compromising on data security. This type of architecture supports the idea of distributed data, where all data is accessible for those with the right authority to access it. One key differentiator between a data lake and a data mesh is that in a data mesh, data doesn’t have to be consolidated into a single data lake and can remain within different databases.

For more information about the details and advantages of adopting the data mesh as a domain-driven data architecture, refer to Design a data mesh architecture using AWS Lake Formation and AWS Glue.

The components of a data mesh

Now that we have a good understanding of the data mesh paradigm, let’s look at the implementation and its components.

First, we start with data producers. These are the entities that are responsible for maintaining, owning, and exposing the specific data of their domain. Because of the domain separation, each producer can choose its own technology stack independently.

Similarly, we also have data consumers. These components, as their name indicates, use one or more data sources exposed by the producers. As before, adopting a data mesh architecture implies that each consumer is independent of the others, meaning they could implement different technology stacks as well as solve different use cases.

The data-at-rest plane is then completed by the Centralized Data Catalog, a component that works as the link between producers and consumers. This middle layer is responsible for indexing the available data producers into a centralized data catalog as well as controlling access to the different data sources.

The data catalog is used by the producers to expose the data products (steps 1a and 1b) to the organization’s data scientists and data engineers working on the consumer domains. The following figure illustrates how data products should be easily discoverable: the central data catalog allows the data consumers to find their data source of interest (steps 2a and 2b) after they have been registered with the centralized catalog by their corresponding producer domain (steps 1a and 1b).
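To make the discovery step more concrete, here is a hedged boto3 sketch of how a consumer domain could browse the data products that a producer registered in the shared AWS Glue Data Catalog. The database name is an assumption.

```python
# Sketch of how a consumer domain might discover data products that producers
# registered in the central AWS Glue Data Catalog. The database name is a placeholder.
import boto3

glue = boto3.client("glue")

# List the tables (data products) exposed under a producer's shared database
paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="songs_domain"):
    for table in page["TableList"]:
        location = table.get("StorageDescriptor", {}).get("Location", "n/a")
        print(f"{table['Name']} -> {location}")
```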

Working with real-time events

One could argue that this architecture, as it is, can only support data at rest; indeed, there is no straightforward solution to move data in real time from a producer domain to a consumer. The paradigm presented so far addresses the scenario of data at rest, where consumers pull data on demand rather than being notified when data changes.

Because several applications need to quickly respond to the changes happening in the environment, real-time data is an important consideration in data architectures. For example, an ecommerce platform or a video streaming service can extract value from the real-time user interactions with content. In these cases, it’s critical to track events as they happen, feed them into the ML model, and adapt the predictions accordingly.

In this section, we introduce some of the streaming platforms that can be used to implement this pattern, with a focus on Apache Kafka because it’s frequently used and many companies are moving their Kafka workloads to the cloud.

Apache Kafka is an open-source distributed event streaming platform that captures data in real time from sources such as microservices or databases, stores the events in streams organized into topics, and reacts to these events in real time as well as retrospectively. Event streaming architectures built on Apache Kafka follow the publish/subscribe paradigm: the producers publish events to topics via a write operation, and the consumers, which subscribe to such topics, receive the events as they happen. An alternative to Apache Kafka in this scenario could be Amazon Kinesis Data Streams, a streaming service that allows developers to collect, store, and process real-time data in the cloud.

Consider, for example, an ecommerce platform: a Payment microservice running the payment functionality of the system could publish events to a Purchases topic, tracking every transaction happening on the platform. Another component could then subscribe to the Purchases topic to receive the events and take action accordingly, for example by updating a dashboard for business intelligence. For more information on Apache Kafka, we recommend reading Introduction to Apache Kafka.
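To illustrate the subscriber side of this example, here is a hedged sketch using the kafka-python library. The broker address, consumer group, and payload fields are assumptions.

```python
# Sketch of the subscriber side of the Purchases example above, using kafka-python.
# The bootstrap broker, group id, and payload shape are placeholders.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "purchases",                      # topic published to by the Payment microservice
    bootstrap_servers=["localhost:9092"],
    group_id="bi-dashboard",          # each consumer group gets its own copy of the stream
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

for record in consumer:
    purchase = record.value
    # React to the event, for example by updating a business-intelligence dashboard
    print(f"order={purchase.get('order_id')} amount={purchase.get('amount')}")
```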

Event-streaming architecture

The data-in-motion plane is introduced to implement the publish/subscribe pattern in the context of a data mesh. Such a plane is composed of the set of producer and consumer domains connected via a central event streaming component that makes real-time events accessible. To benefit from the data-by-domain architecture, we consider each producer to have its own corresponding centralized stream, as shown in the following figure.

You can also think of the event stream as the channel for sending real-time events to the consumers, therefore each producer has its dedicated channel to send updates.

Each consumer can subscribe to multiple topics based on specific data needs. When new events are available, the corresponding producer publishes them in the associated stream (steps 1a and 1b) and the subscribers can read the events (steps 2a and 2b), process them, and take action accordingly.

The preceding figure shows a scenario with N producer domains and M consumer domains: each consumer subscribes only to the streams of interest for that domain. In this example, Consumer #1 is subscribed to the events coming from Producer #1, while Consumer #M is subscribed to the events coming from both Producer #1 and Producer #N.

You could adopt this pattern to solve several use cases and data domains. For instance, a user playing a song on a music streaming platform can generate a new event sent from the Interactions service producer to the Personalization consumer, where the recommendation system generates personalized recommendations. Similarly, a Payment producer can send a transaction request, and a Fraud Detector consumer determines whether the transaction is fraudulent or not.

For producers and consumers to communicate correctly, the event payload schema needs to be consistent. Applications depend on schemas, so no change made to events should break the implicit contract between producers and consumers. For complex use cases, you can use a schema registry to enforce compatibility in event streaming. For more information about the options for working with the AWS Glue Schema Registry, refer to Validate streaming data over Amazon MSK using schemas in cross-account AWS Glue Schema Registry.
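To make the idea of a schema contract concrete, the following sketch validates an event against an Avro schema locally with fastavro before publishing. This is only a stand-in for illustration; in a production setup the AWS Glue Schema Registry (or another registry) would own and version the schema.

```python
# Illustration of schema enforcement: validate an event against an agreed Avro schema
# before publishing it. fastavro is used locally as a stand-in for a schema registry.
from fastavro import parse_schema
from fastavro.validation import validate

interaction_schema = parse_schema({
    "type": "record",
    "name": "Interaction",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "item_id", "type": "string"},
        {"name": "event_type", "type": "string"},
        {"name": "timestamp", "type": "long"},
    ],
})

event = {"user_id": "u-42", "item_id": "song-1001", "event_type": "play", "timestamp": 1700000000}

# Raises ValidationError if the event would break the contract with consumers
if validate(event, interaction_schema):
    print("Event conforms to the Interaction schema, safe to publish")
```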

Recommendation use case

Previously, we introduced the overall idea behind the data mesh architecture without focusing on a specific use case. In this section, we present a real-world scenario where the mesh paradigm is implemented using AWS.

Let’s consider the music streaming company XYZ, which offers its customers the opportunity to listen to on-demand songs. XYZ has recently started to offer, through the same platform, on-demand podcasts as well.

The ML team is interested in adding podcasts to the catalog of personalized recommendations that are presented to users. To do so, the ML team working on the recommendation system, which in the data mesh paradigm can be seen as a consumer, needs access to multiple data domains (producers): Users, Songs, Podcasts, and Interactions.

In this post, we use Amazon Personalize as a fully managed ML service for personalized recommendations. It allows developers to train, tune, and deploy custom ML models to deliver highly customized experiences. Amazon Personalize provisions the infrastructure and manages the entire ML pipeline, including processing the data; identifying features; and training, optimizing, and hosting the models. You can learn more about Amazon Personalize in the Developer Guide.

We now dive deeper into the implementation of the solution, both for the data-at-rest and data-in-motion scenario. ML needs large amounts of data at rest to create a dataset and train the models. Additionally, the personalization scenario requires access to real-time data to adapt to the users’ current intent, so we need access to real-time events and interactions. A data mesh solution for this scenario will require both:

  • Data at rest – Historical data from users, items, and interactions. Some of this could be stored in separate systems and data sources.
  • Data in motion – This data is for the real-time events, for instance songs listened to or new items made available in the catalog.

Architecture for data at rest

In this section, we focus on the data at rest part of the solution.

The following diagram shows how we can implement the data mesh architecture in the context of personalized recommendations, and include the podcasts in the recommendation system deployed with Amazon Personalize. Each producer domain owns the data and exposes them via the data catalogs. The consumers use the data catalogs to find the data they need for their application.

First, we can identify the three main components of the mesh architecture introduced before: data producers, the centralized data catalog, and data consumers.

In this specific example, we can see how different producer domains implement different storage solutions:

  • The Users domain uses Amazon Aurora as its own line of business (LOB) database, a relational database (step 1a)
  • Songs and Podcasts use Amazon DynamoDB, a NoSQL database (steps 1b and 1c)
  • Interactions ingests the events directly into Amazon S3 (step 1d)

The producer domains are decoupling their LOB databases from the data catalogs by using Amazon Simple Storage Service (Amazon S3). With the data mesh paradigm, each producer considers the data as a product, therefore it can preprocess the data before exposing them, and store the results in a format that is suitable for the consumers. This decoupling is implemented using AWS Glue to define an extract, transform, and load (ETL) pipeline, whose results are eventually stored in S3 buckets (steps 2a, 2b, 2c).

Finally, each producer shares its respective AWS Glue Data Catalog with the Centralized Data Catalog (steps 3a, 3b, 3c, 3d).

Data consumers can now access the different data domains through the central catalog. As shown in the preceding figure, we have two consumers: the Analytics domain, which accesses certain catalogs and showcases metrics on an Amazon QuickSight dashboard (step 4), and the Personalized Recommendations domain (step 5).

The latter, which is the one of interest for this post, consists of an AWS Glue ETL job that accesses, through the central catalog, data from the different producers. The ETL job performs traditional data engineering tasks, for example merging song and podcast data. We can now generate our Amazon Personalize solution, where our items dataset includes information about both songs and podcasts, expanding the initial recommendation catalog.
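The following is a hedged sketch of what such an AWS Glue ETL job could look like in PySpark. The database, table, column, and bucket names are assumptions, and the target schema simply mirrors the idea of a unified items dataset.

```python
# Hedged sketch of the consumer-side AWS Glue ETL job: read the Songs and Podcasts data
# products through the shared catalog, merge them into a single items dataset, and write
# the result to S3 for Amazon Personalize. All names and columns are placeholders.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

songs = glue_context.create_dynamic_frame.from_catalog(
    database="songs_domain", table_name="songs").toDF()
podcasts = glue_context.create_dynamic_frame.from_catalog(
    database="podcasts_domain", table_name="podcasts").toDF()

# Align the two data products on a common items schema and union them
items = songs.selectExpr("song_id AS ITEM_ID", "'song' AS ITEM_TYPE", "genre AS GENRE") \
    .unionByName(podcasts.selectExpr("podcast_id AS ITEM_ID", "'podcast' AS ITEM_TYPE", "category AS GENRE"))

items.write.mode("overwrite").csv("s3://recommendations-domain/items/", header=True)

job.commit()
```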

Our recommendation engine is then made available for inference requests through an API deployed using Amazon API Gateway (step 6).
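As an illustration of that inference path, here is a hedged sketch of a handler (for example, a Lambda function fronted by API Gateway with proxy integration) that calls the Amazon Personalize runtime. The campaign ARN and the request shape are placeholders.

```python
# Sketch of the inference path behind the API: a handler asks the Amazon Personalize
# runtime for recommendations for a given user. The campaign ARN is a placeholder.
import json
import boto3

personalize_runtime = boto3.client("personalize-runtime")

CAMPAIGN_ARN = "arn:aws:personalize:eu-west-1:123456789012:campaign/music-and-podcasts"

def handler(event, context):
    # Assumes API Gateway proxy integration passing the user id as a query parameter
    user_id = event["queryStringParameters"]["userId"]
    response = personalize_runtime.get_recommendations(
        campaignArn=CAMPAIGN_ARN,
        userId=user_id,
        numResults=10,
    )
    items = [item["itemId"] for item in response["itemList"]]
    return {"statusCode": 200, "body": json.dumps({"recommendations": items})}
```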

The architecture is designed to work across multiple accounts: an AWS account is a natural boundary for the resources deployed into it and a single unit of billing. This approach allows us to separate the resources owned by the different domains and maintain operational agility: each team owns and controls its account. To learn more about the approaches for sharing data catalogs across different accounts while working with a data mesh, check out Design a data mesh architecture using AWS Lake Formation and AWS Glue.

We’re now able to provide users with song or podcast recommendations based on their comprehensive listening preferences across the two categories. In the next section, we explore how to improve the architecture to be reactive to continuously evolving data, such as new songs added to the catalog or new interactions made available.

Architecture for data in motion

Earlier, we introduced the theoretical framework for event streaming in the context of the data mesh, defined as the data-in-motion plane. We can now drill down into the architecture for our specific use case.

We’re using a scenario with four producers (Users, Songs, Podcasts, and Interactions), the central streaming component, and two consumer domains (Personalized Recommendations and Analytics). The data-in-motion plane is implemented by using a platform for event streaming, namely Apache Kafka, and each producer has a dedicated stream to publish its events.

In the scenario of real-time recommendations for music, the Personalized Recommendations consumer is notified about changes to Users, Songs, Podcasts, and Interactions. Similar to the at-rest example, we also consider a second consumer domain, called Analytics, used to create real-time dashboards about the trends in the interactions. Here, the analytics consumer requires only interaction events, therefore it subscribes only to the Interactions stream.

This architecture is designed to offer a loosely coupled interaction mechanism for producers and consumers: the producers don’t need to know about the consumers that are part of the system. The producers focus on emitting the events, the events are sent to the data-in-motion plane, and the delivery is guaranteed by the streaming platform.

Let’s drill down into the strategy for building this architecture in the cloud. For readability purposes, we study this part of the solution in isolation, without adding to the diagram of the data-at-rest scenario.

From a technological perspective, we use AWS Lambda to run the back-end business logic of the application: the microservice runs the logic in a Lambda function and publishes events to the event streams. We use Lambda because it fits our use case well in terms of scalability and operational efficiency, offering minimal operational overhead. However, the architecture pattern is also valid for other types of backend deployments, for example, containers running on Amazon Elastic Kubernetes Service (Amazon EKS) or Amazon Elastic Container Service (Amazon ECS).

The data-in-motion plane is implemented using Amazon Managed Streaming for Apache Kafka (Amazon MSK), a fully managed solution for running Apache Kafka in the cloud. It provisions the servers, configures the Apache Kafka clusters, replaces servers when they fail, orchestrates server patches and upgrades, and runs clusters for high availability. Kafka organizes and stores events into topics. Topics are always multi-producer and multi-consumer: this means that one or many producers can publish to the same topic, and one or many consumers can subscribe to read from the topic. We use the concept of topics to model this architecture paradigm, and we assign one topic for each producer domain.
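To show the one-topic-per-producer-domain convention in practice, here is a hedged sketch that creates the four domain topics with the kafka-python admin client. The broker address, partition count, and replication factor are placeholders to size for your own workload.

```python
# Sketch of the "one topic per producer domain" convention using the kafka-python
# admin client against an Amazon MSK cluster. Broker address, partition counts,
# and replication factor are placeholders.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers=["b-1.example.kafka.eu-west-1.amazonaws.com:9092"])

producer_domains = ["users", "songs", "podcasts", "interactions"]
topics = [
    NewTopic(name=domain, num_partitions=6, replication_factor=3)
    for domain in producer_domains
]

admin.create_topics(new_topics=topics)
admin.close()
```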

Finally, we adapt our previously introduced consumer domain, Personalized Recommendations, to take into account real-time events. This time, we use Lambda to read the events from the topics and invoke the commands to call the Amazon Personalize API through the Amazon Personalize SDK. Within the same consumer domain, we use a Lambda function per topic, which is triggered as soon as a new event is published in the monitored topic. This event-driven pattern allows us to run code only when a new event is published and we need to update the information in Amazon Personalize. Each Lambda function in the Personalized Recommendations domain uses the Amazon Personalize SDK to invoke the corresponding actions on Amazon Personalize and update the datasets.

Let’s consider a new interaction happening in the system using the following figure. This serverless implementation of the event streaming pattern extends the data mesh to respond to real-time events.

The Interactions microservice, which is running the backend logic of the application, publishes a new event (step 1), which is persisted in the Interactions topic (step 2). The publishing of a new event triggers the Lambda functions subscribed to the topic, in this case InteractionsUpdate and InteractionsIngestor (step 3). The InteractionsUpdate function invokes the PutEvents operation on the Amazon Personalize API through the Amazon Personalize SDK to add the real-time event to the recommendation system (step 4). InteractionsIngestor triggers the operations to refresh the dashboards according to the strategy adopted by the Analytics domain. Finally, other services and components can consume the recommendations through the API exposed by the Personalized Recommendation domain to make the predictions consumable (step 5).
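A minimal sketch of an InteractionsUpdate-style function is shown below, assuming the Lambda function is triggered by an Amazon MSK event source mapping. The tracking ID and the payload field names are assumptions.

```python
# Minimal sketch of an "InteractionsUpdate"-style Lambda function: it is triggered by
# records from the Interactions topic (Amazon MSK event source mapping) and forwards
# them to Amazon Personalize with PutEvents. Tracking ID and payload fields are placeholders.
import base64
import json
from datetime import datetime, timezone

import boto3

personalize_events = boto3.client("personalize-events")
TRACKING_ID = "REPLACE_WITH_EVENT_TRACKER_ID"  # from an Amazon Personalize event tracker

def handler(event, context):
    # MSK event source mappings group base64-encoded records by topic-partition
    for records in event["records"].values():
        for record in records:
            payload = json.loads(base64.b64decode(record["value"]))
            personalize_events.put_events(
                trackingId=TRACKING_ID,
                userId=payload["user_id"],        # assumed payload fields
                sessionId=payload["session_id"],
                eventList=[{
                    "eventType": payload.get("event_type", "listen"),
                    "itemId": payload["item_id"],
                    "sentAt": datetime.now(timezone.utc),
                }],
            )
```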

For the Analytics domain, which we added to showcase the scalability of this architecture, we use a Lambda function to ingest the real-time events into Amazon Kinesis Data Firehose. Then we can visualize the interactions using Amazon OpenSearch Service in conjunction with Amazon QuickSight. For more details, refer to Visualize live analytics from Amazon QuickSight connected to Amazon OpenSearch Service.
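Along the same lines, here is a hedged sketch of an InteractionsIngestor-style function forwarding the events to a Kinesis Data Firehose delivery stream. The stream name and record format are assumptions.

```python
# Sketch of the Analytics-side "InteractionsIngestor" function: it takes the interaction
# events from the topic and forwards them to a Kinesis Data Firehose delivery stream,
# which feeds the dashboards. The delivery stream name is a placeholder.
import base64
import boto3

firehose = boto3.client("firehose")
DELIVERY_STREAM = "interactions-analytics"

def handler(event, context):
    records = []
    for topic_records in event["records"].values():
        for record in topic_records:
            # Firehose expects raw bytes; newline-delimit records for easier downstream parsing
            records.append({"Data": base64.b64decode(record["value"]) + b"\n"})
    if records:
        firehose.put_record_batch(DeliveryStreamName=DELIVERY_STREAM, Records=records)
```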

Because the data producers, Kafka resources, and data consumers are all in different accounts, we need to establish cross-account connectivity to keep the traffic within the AWS infrastructure and avoid the public internet, both for security reasons and for cost optimization. The objective of this post is to show the architecture and the approach to implement this pattern. If you want to dive deeper into how to establish cross-account connectivity between producers and consumers and Amazon MSK, refer to Secure connectivity patterns to access Amazon MSK and How Goldman Sachs builds cross-account connectivity to their Amazon MSK clusters with AWS PrivateLink.

Data mesh with event streaming: Putting it all together

Earlier, we recalled the data mesh paradigm and designed a solution to emphasize the importance of adopting a data as a product strategy. Each producer domain exposes the data via the catalog, and they are made centrally discoverable through the Centralized Data Catalog. Each consumer domain has a catalog interface for connecting to the central catalog and finding the data required to build the solution the domain focuses on.

Next, we studied the scenario for data in motion, introduced Apache Kafka and Amazon MSK to implement the event streaming platform, and connected the producers and consumers with the streaming service via Lambda. This event-driven implementation allows us to decouple the producers from the consumers and makes the solution scalable as the domains change and evolve over time, without requiring significant changes to the architecture.

We can now put it all together, as shown in the following figure. The complete data mesh with event streaming architecture uses two different data planes: one is dedicated for sharing data at rest (blue); the other one is for data in motion (red).

Each domain has two interfaces required to communicate with both planes: the data catalogs and the Lambda functions. The data at rest is shared and discovered by taking advantage of the data catalogs, whereas the data in motion is emitted by the services running the backend logic in the producer domains. It is consumed by the Lambda functions subscribed to the topics, which are deployed in the consumer domains.

Conclusion

In this post, we introduced the high-level architecture paradigm that allows you to extend the concept of a data mesh to real-time events.

We first covered the fundamental concepts associated with this architectural style, and then showcased how to apply this solution to solve real-world business challenges, such as real-time personalized recommendations and analytics, in a multi-account setting on AWS.

Furthermore, the framework presented in this post can be generalized to different domains, for example other AWS AI services such as Amazon Forecast or Amazon Comprehend, or your custom ML solutions built for your specific scenario and deployed through Amazon SageMaker. With the most experience, the most reliable, scalable and secure cloud, and the most comprehensive set of services and solutions, AWS is the best place to unlock value from your data.

About the authors

Vittorio Denti is a Solutions Architect at AWS based in London. After completing his M.Sc. in Computer Science and Engineering at Politecnico di Milano (Milan) and the KTH Royal Institute of Technology (Stockholm), he joined AWS. Vittorio has a background in Distributed Systems and Machine Learning, and a strong interest in cloud technologies. He’s especially passionate about software engineering, building ML models, and putting ML into production.

Anna Grüebler is a Specialist Solutions Architect at AWS focusing on Artificial Intelligence. She has more than 10 years of experience helping customers develop and deploy machine learning applications. Her passion is taking new technologies and putting them in the hands of everyone, and solving difficult problems by leveraging the advantages of AI in the cloud.