Tag Archives: Architecture

Optimizing fleet utilization with Amazon Location Service and HERE Technologies

Post Syndicated from Mahesh Geeniga original https://aws.amazon.com/blogs/architecture/optimizing-fleet-utilization-with-amazon-location-service-and-here-technologies/

The fleet management market is expected to grow at a Compound Annual Growth Rate (CAGR) of 15.5 percent, from USD 25.5 billion in 2022 to USD 52.4 billion in 2027. Optimizing how your organization uses its vehicle fleet is important for logistics and service providers such as last mile, middle mile, and field services.

In this post, we demonstrate how to build and run a solution for many-to-many vehicle routing using HERE Tour Planning and Amazon Location Service. HERE Technologies is a data provider for Amazon Location Service, supplying map rendering, geocoding, search, and routing. HERE Tour Planning expands on functionality such as geocoding, basic routing, and matrix routing to consider parameters such as time windows, job requirements or priorities, vehicle capabilities, range, and traffic information. It also supports immediate re-planning when conditions change.

The architecture described in this post can help you optimize your fleets for delivering shipments, such as perishable items on pallets, from a central distribution center to multiple retail locations. The architecture uses the AWS Cloud Development Kit (AWS CDK) to help you provision and version control your infrastructure. It also follows an event-driven architecture (EDA) based on AWS Lambda and Amazon DynamoDB. It uses Amazon Simple Storage Service (Amazon S3) and DynamoDB to store the generated artifacts and integrates with Amazon Location Service to plot the routes visually on a map for each delivery driver.

Solution overview

This post will help you:

  • Configure the many-to-many vehicle routing architecture using HERE Tour Planning and Amazon Location Service
  • Submit a HERE Tour Planning problem
  • Generate an optimized solution file
  • Run a React app that:
    • Generates a list of routes for each vehicle in the fleet
    • Allows drivers to select and view routes in detail

The following diagram outlines how the architecture works.

Many-to-many vehicle routing architecture

Figure 1. Many-to-many vehicle routing architecture using HERE Tour Planning and Amazon Location Service

Let’s explore the steps in the diagram.

  1. The fleet operator uploads the tour requirements file to an Amazon S3 bucket.
  2. The upload invokes a Lambda function to process the new Tour Request. If the HERE API key is present, it calls the HERE Tour Planning API.
  3. The HERE Tour Planning API calculates the solution to the routing problem.
  4. The driver uses the React app to select a vehicle, which requests a route.
  5. The invoked Lambda function uses Amazon Location Service to calculate the route, which is then rendered in the React app (a sketch of this call follows the list).
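To make step 5 concrete, here is a minimal sketch of such a Lambda function calling Amazon Location Service with Boto3. The calculator name, event shape, and coordinates are assumptions for illustration; the sample repository's actual implementation may differ.

import boto3

location = boto3.client("location")

def handler(event, context):
    # Hypothetical event shape: the frontend passes the ordered stops for
    # the selected vehicle as [longitude, latitude] pairs.
    stops = event["stops"]

    response = location.calculate_route(
        CalculatorName="fleet-route-calculator",  # assumed resource name
        DeparturePosition=stops[0],
        DestinationPosition=stops[-1],
        WaypointPositions=stops[1:-1],
        TravelMode="Truck",
    )

    # Return the route legs and summary for the React app to render.
    return {"legs": response["Legs"], "summary": response["Summary"]}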

Prerequisites

This walkthrough requires the following installations and resources:

  • Have an AWS account
  • Install the AWS Amplify CLI (command-line interface)
  • Install the AWS CDK CLI
  • Install the AWS CLI
    • Configure and authenticate the AWS CLI to interact with your AWS account.
  • Have your preferred integrated development environment (IDE), such as Visual Studio Code
  • Have GitHub repository access
    • git clone https://github.com/aws-samples/aws-here-optimize-fleet-utilization
  • Have a HERE API Key (optional)
    • This is needed to invoke the HERE Tour Planning API for generating solutions to new routing problems.
    • The GitHub sample repository includes a problem and pre-solved solution file, so you don’t need to acquire a HERE API key to learn about these offerings.
    • To acquire an API key, create a free account on the HERE Platform, then follow the instructions in the HERE Tour Planning documentation to create your API key.
    • There can be additional charges based on API key use. For more details, see the HERE Tour Planning section within HERE service rates.

Walkthrough

Provision the infrastructure

  1. The architecture uses npm packages. Run the following commands to install the dependencies:
    • # AWS Lambda function dependencies
      cd lib/lambda/calculate-route
      npm install
    • # Sample Frontend application dependencies
      cd frontend/here-driver-app
      npm install
  2. If you have never used AWS CDK in your AWS account, you must first bootstrap the solution, which creates Amazon S3 buckets and metadata to support AWS CDK operations. Note: Architectural components described in this article are covered under AWS Free Tier and HERE Free monthly usage, but additional charges can occur based on usage beyond the Free Tier limits. We recommend following the Cleanup instructions after completing the walkthrough.
      • From the root of the repository, run the following command to generate the needed infrastructure:
        • cdk bootstrap
      • Output similar to the following indicates that you successfully bootstrapped the AWS account with what AWS CDK requires.

        Successful bootstrap of AWS account with AWS CDK requirements

        Figure 2. Successful bootstrap of AWS account with AWS CDK requirements

  3. Deploy the infrastructure for this solution by running the following command. This provisions much of the infrastructure required for this solution such as DynamoDB, Lambda functions, and Amazon S3 buckets:
    • cdk deploy
  4. Next, Amplify provisions the remaining resources to complete the architecture. Navigate to the folder for the frontend by running the following command:
    • cd frontend/here-driver-app
  5. Run the following series of Amplify commands to create the remaining resources: Amazon API Gateway, Amazon Cognito, and Amazon Location Service. For additional details, see Clone sample Amplify project.
  6. To accept the defaults, run the following command:
    • amplify init
  7. To push the infrastructure out to the AWS account, run the following command:
    • amplify push
  8. To publish the environment, run the following command:
    • amplify publish

Create problem and generate architecture files

The next step is to create the HERE Tour Planning problem file and submit it to the HERE Tour Planning API to solve. Note: You can either sign up for the HERE Developer program to receive an API key to test this solution live or use the problem and pre-solved solution provided in the /data folder of the repository.

  1. Open the Amazon S3 bucket that the previous step created.
  2. Upload a problem file (in JSON format) to the bucket; a minimal example follows Figure 3.
  3. An Amazon S3 event notification invokes a Lambda function that performs a synchronous call to the HERE Tour Planning API and generates a vehicle routing problem solution file in JSON format.
  4. The Lambda function saves the solution file to the Amazon S3 bucket and additional details about the solution to a DynamoDB table.
  5. The delivery drivers can use the example React app to view the list of vehicles and the routes.
Creating a problem file and submitting it to HERE Tour Planning API

Figure 3. Creating a problem file and submitting it to HERE Tour Planning API
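For orientation, here is a minimal, illustrative problem file with one truck type and a single delivery job. The field names reflect the general shape of a HERE Tour Planning problem, but treat the exact schema as an assumption; refer to the pre-solved example in the repository's /data folder or the HERE Tour Planning documentation for the authoritative format.

{
  "fleet": {
    "types": [{
      "id": "refrigerated_truck",
      "profile": "truck",
      "amount": 2,
      "capacity": [20],
      "costs": { "fixed": 50, "distance": 0.002, "time": 0.005 },
      "shifts": [{
        "start": { "time": "2023-05-01T08:00:00Z", "location": { "lat": 47.61, "lng": -122.33 } },
        "end": { "time": "2023-05-01T18:00:00Z", "location": { "lat": 47.61, "lng": -122.33 } }
      }]
    }],
    "profiles": [{ "name": "truck", "type": "truck" }]
  },
  "plan": {
    "jobs": [{
      "id": "retail_store_1",
      "tasks": {
        "deliveries": [{
          "demand": [4],
          "places": [{ "duration": 300, "location": { "lat": 47.66, "lng": -122.38 } }]
        }]
      }
    }]
  }
}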

Frontend

The next step is to run the React Frontend app to see the results. Access to the app and the API Gateway is secured with Amazon Cognito.

  1. To run the web application, run the following command:
    • npm start
  2. A local web server runs on http://localhost:3000.
  3. To use the system, the user must authenticate. Amazon Cognito allows users to sign in or create a new account.
  4. After you authenticate, the Home screen displays a list of available vehicles and routes.

    Available vehicles and routes

    Figure 4. Available vehicles and routes

Choose a vehicle to see the detail of the route. Each red marker is a stop.

Vehicle route details

Figure 5. Vehicle route details

Cleanup

To avoid incurring future charges, delete all resources created.

  1. Run the following commands to delete the application in AWS Amplify. As an alternative, you can use the Amplify console to delete Amplify resources:
    • cd frontend/here-driver-app
      amplify pull
      amplify delete
  2. With AWS CDK, run the following command to delete the AWS CloudFormation stack that was used to provision the resources. Note: You can leave the AWS CLI, Amplify CLI, and CDK CLI installed on your computer for future development:
    • cdk destroy

Conclusion

In this post, using a shipment delivery from a central distribution center as the use case, we demonstrated how you can build your own serverless solution for optimizing middle mile and last mile operations. The solution uses the multi-vehicle and multi-stop optimization services provided by HERE Tour Planning, and uses Amazon Location Service to visualize the generated routes for each delivery vehicle driver. For additional details about HERE’s offerings on AWS Marketplace, see AWS Marketplace: HERE Technologies.

Software-defined edge architecture for connected vehicles

Post Syndicated from James Simon original https://aws.amazon.com/blogs/architecture/software-defined-edge-architecture-for-connected-vehicles/

To remain competitive in a marketplace that increasingly views transportation as a service emphasizing customer experience, vehicle capabilities and mobility applications need to improve and increase in value over time, much like the Internet of Things and smartphones have done.

Vehicle manufacturers and fleet operators are responding to this change by using data to inform and operate their businesses, and by adopting software-defined vehicle features and capabilities. In the broader transportation community, insurance providers are evolving to usage-based insurance, which offers rates that are based on a customer’s risk profile rather than a general population. Telematics service providers are expanding their offerings to fleet operators to include machine learning (ML)-driven capabilities like driver coaching, compliance, and predictive maintenance. Rental car providers are developing apps that provide a personalized in-vehicle experience.

All of these features rely on in-vehicle data collection and processing, evolving vehicles from simple data-gathering sensors into fully smart devices. To meet these demands, you can adopt cloud native architectures that use microservices, containers, and declarative application programming interfaces (APIs). This blog post explores a system architecture that AWS and Luxsoft have developed together to help our customers reduce friction and accelerate time to market for the development, deployment, and operation of the edge applications required to turn vehicles into smart devices.

Software-defined edge

Software is becoming more critical to vehicle function. A modern car has approximately 70-100 Electronic Control Units (ECUs), which control most core functions in the engine, transmission, heating, ventilation, and air conditioning (HVAC), Automatic Brake System (ABS), body, and airbag hardware components. With new features such as infotainment systems, autonomous driving (AD), and advanced driver assistance systems (ADAS), modern cars run approximately 100 million lines of code, and this number is increasing rapidly. ECU and software complexity create challenges for application portability, whether due to hardware variations or to differences in CAN and other network communication interfaces. Without an abstraction layer, application software must be conformed to each operating environment.

One difficulty presented by this increasing complexity, and by the multitude of integration points and communication interfaces at the edge, is that applications typically must be redeveloped, customized, or at least cross-compiled to fit each hardware platform, often requiring a lengthy integration, development, and testing cycle for each new device. As a result, edge applications executing ML and other data-driven workloads often arrive at a pace slower than consumer expectations.

The development pace for software-defined vehicles (SDVs) relies on reusing and redeploying software applications. Software reuse is a challenge when target hardware and processing environments aren’t designed in a common way. Therefore, an early step in developing SDVs is to address these challenges at the vehicle edge, that is to say, the externally connected vehicle devices.

System architecture

To help address the challenges of creating reusable software and applications for the vehicle edge, we worked with AWS Partner DXC Luxsoft to create an end-to-end system architecture. The architecture in Figure 1 uses software-defined mobility devices at the vehicle edge to connect to multiple hardware components.

End-to-end system architecture created with AWS Partner DXC Luxsoft

Figure 1. End-to-end system architecture created with AWS Partner DXC Luxsoft

Let’s explore this architecture step by step.

  1. Edge application developers check in source code to a CI/CD system for build and test. Applications are built as containers and tested on Amazon Elastic Compute Cloud (Amazon EC2) instances that use the same Amazon Machine Images (AMIs) as the target devices.
  2. Edge application containers are stored in Amazon Elastic Container Registry (Amazon ECR).
  3. Containers are registered as AWS IoT Greengrass components for target devices.
  4. Edge applications are deployed to multiple target types using AWS IoT Greengrass.
  5. Data is published back to AWS through AWS IoT Core basic telemetry or AWS IoT Fleetwise.
  6. AWS IoT Core routes data to connected vehicle databases and services for further processing. Alternatively, AWS IoT Fleetwise-collected data is routed through AWS IoT Fleetwise.
  7. The Fleet Management portal allows fleet managers to view results produced by applications and services and data collected by AWS IoT Fleetwise. This can include geolocation, vehicle health, Usage Based Insurance (UBI), or other driver scores.
  8. Mobile clients can be created to allow end users and consumers to view application and service results and scores for applications like UBI.

How it works

This technology stack abstracts the specific hardware used, presenting a common runtime environment that you can deploy to devices based on ARM processors with ARM System Ready firmware, including Amazon EC2 Graviton instances.

This architecture uses the following:

  • Device software stack with an ARM System Ready board support package
  • Yocto custom Linux build with the SOAFEE EWAOL layer(s) and AWS IoT Greengrass
  • Instance of the AWS IoT Fleetwise edge agent
  • Application containers for pre- and post-processing of data collected by AWS IoT Fleetwise or other applications

We paired this stack with AWS IoT Greengrass V2 and a container-based version of the AWS IoT Fleetwise edge agent that has been modified to publish to a local broker first, allowing pre- and post-processing. This also makes AWS IoT Fleetwise-collected data available to other application containers running on the device. The pre- and post-processing containers prepare messages and data for use by other edge applications or for exchange with cloud services, and they will be available with the architecture source code.
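As an illustration of this local publish flow, the following Python sketch shows a post-processing component publishing a transformed message to a local topic with the AWS IoT Greengrass V2 interprocess communication (IPC) client. The topic name and message shape are assumptions, and a real component would run under a Greengrass deployment rather than standalone.

from awsiot.greengrasscoreipc.clientv2 import GreengrassCoreIPCClientV2
from awsiot.greengrasscoreipc.model import PublishMessage, JsonMessage

# Hypothetical local topic that downstream edge applications subscribe to.
LOCAL_TOPIC = "vehicle/processed/telemetry"

def publish_processed(sample: dict) -> None:
    # The IPC client authenticates using the component's Greengrass context.
    client = GreengrassCoreIPCClientV2()
    message = PublishMessage(json_message=JsonMessage(message=sample))
    client.publish_to_topic(topic=LOCAL_TOPIC, publish_message=message)
    client.close()

publish_processed({"vin": "TESTVIN0000000001", "speed_kph": 72.5})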

Benefits

Because this architecture is container-based and abstracts the hardware, deploying and updating applications becomes more efficient. Because Amazon EC2 Graviton instances are ARM-based, you can deploy specific AMIs and configurations that contain this architecture’s technology stack as part of a continuous integration and continuous deployment (CI/CD) pipeline. This means that you can develop new applications and services entirely in the cloud, test them in the cloud with bit-equivalent binaries and containers, and then deploy them to hardware components. This can save weeks to months of development and verification through the use of automation. With ARM-based systems, cloud native development and testing strategies can be applied, reducing the need for hardware test equipment and bringing new revenue streams and customer experiences to market at a pace that matches current demand.

Conclusion

With this architecture, you can develop and deliver new edge processing applications to one or more vehicle device platforms. You can also develop, test, and deploy purpose-built edge compute applications as containerized AWS IoT Greengrass V2 components, including applications like Usage Based Insurance, over-the-air update agents, and driver distraction detection.

You can add value to this architecture by developing AWS IoT Fleetwise data campaigns for specific data required to fulfill business value for consumers, operators, or fleet managers. Examples of this data include vehicle battery state of charge and health indicators.

If you’re interested in creating or contributing to a common architecture that can accelerate developing and deploying edge applications to multiple hardware components, contact Connected Vehicle Tech Strategy Lead James Simon or AWS Partner DXC Luxsoft for a working demonstration or to start a proof of concept.

Creating scalable architectures with AWS IoT Greengrass stream manager

Post Syndicated from Neil Mehta original https://aws.amazon.com/blogs/architecture/creating-scalable-architectures-with-aws-iot-greengrass-stream-manager/

Designing a scalable, global, real-time, distributed system to process millions of messages from a variety of critical devices can complicate architectures. Collecting large data streams or image recognition from the edge also requires scalable solutions.

AWS IoT Core is designed to handle large numbers of Internet of Things (IoT) devices sending a few messages per second. However, when IoT devices send large numbers of messages per second and processing occurs at the edge, managing large data streams is challenging. Further, data that is buffered or processed at the edge can increase latency.

This post describes how to create a scalable IoT architecture with AWS IoT Greengrass stream manager that can handle thousands of critical messages per second from a variety of IoT devices.

Relaying critical messages from high-throughput IoT devices

Let’s explore an example. Consider the architecture in Figure 1, where you have two types of IoT devices sending messages to AWS IoT Core. One set of IoT devices sends thousands of messages per second while the other set of devices sends tens of messages per second.

The IoT devices that are sending thousands of messages per second contain critical data that cannot be lost and must be processed at the edge. The IoT devices sending tens of messages per second are of less importance.

IoT devices sending messages to AWS IoT Core

Figure 1. IoT devices sending messages to AWS IoT Core

In considering this architecture, the critical IoT devices sending thousands of messages per second take the following path:

  1. IoT devices send data to an AWS IoT Greengrass component for processing with a Quality of Service (QoS) of 0.
  2. The AWS IoT Greengrass component processes the data and sends it to the AWS IoT Greengrass message broker.
  3. AWS IoT Greengrass message broker relays the data to AWS IoT Core with a QoS of 1.
  4. AWS IoT Core sends the data to Amazon Kinesis Data Streams for further processing.

In contrast, the IoT devices that send a few messages per second take the following path:

  1. IoT devices send data to the AWS IoT Greengrass message broker.
  2. The AWS IoT Greengrass message broker relays the data to AWS IoT Core.
  3. AWS IoT Core sends the data to Kinesis Data Streams for further processing.

Due to the critical nature of the messages being sent, the AWS IoT Greengrass message broker is configured to send messages to AWS IoT Core with a QoS of 1. However, when you configure QoS 1, the AWS IoT Greengrass message broker has to wait for an acknowledgement (ACK) before sending more data.

As you add more IoT devices, many architects choose to batch messages before sending them to AWS IoT Core. This can be a good strategy when you are dealing with many IoT devices that each send a small number of messages per second. But when you are adding IoT devices that send thousands of messages per second, the time spent waiting for an ACK can add latency and cause inconsistencies when reporting the data downstream.

This is because the AWS IoT Greengrass message broker is capable of sending 100 messages to AWS IoT Core before waiting for an ACK when QoS is set to 1. As a result, scaling this architecture to handle additional IoT devices can become challenging.

For more information about the AWS IoT Greengrass message broker’s limits, refer to the AWS IoT Core message broker and protocol limits and quotas section of the AWS General Reference Guide.

AWS IoT Greengrass stream manager for speed and reliability

To scale this type of architecture, you can use a pre-built AWS IoT Greengrass component, stream manager, to bypass the AWS IoT Greengrass message broker and AWS IoT Core and send your data directly to AWS IoT Analytics, AWS IoT SiteWise, Amazon Kinesis, or Amazon Simple Storage Service (Amazon S3).

For example, consider the earlier scenario where one set of IoT devices is sending thousands of critical messages per second and another set is sending data of less importance.

Instead, you can use AWS IoT Greengrass stream manager to create an architecture that can easily and reliably send large amounts of data from the edge directly to Kinesis, as in Figure 2.

AWS IoT Greengrass stream manager sending data directly to Kinesis

Figure 2. AWS IoT Greengrass stream manager sending data directly to Kinesis

As opposed to the Figure 1 configuration, the critical IoT devices that send thousands of messages per second can now take the following path:

  1. Critical IoT devices send data to an AWS IoT Greengrass component for processing.
  2. The AWS IoT Greengrass component processes the data and sends it to AWS IoT Greengrass stream manager.
  3. AWS IoT Greengrass stream manager sends the data directly to Amazon Kinesis Data Streams.

Note that the IoT devices sending a few messages per second can also route their data through AWS IoT Greengrass stream manager at a lower priority. You are still using AWS IoT Core, but it is no longer the main data path. By continuing to use AWS IoT Core, you can benefit from its control plane features such as managing updates, certificates, and policies. However, AWS IoT Core’s data plane features, like the Rules Engine, are no longer used in this architecture, which can help reduce costs. If you choose to bypass the AWS IoT Greengrass message broker and use AWS IoT Greengrass stream manager, any components that you have built must be moved so that processing occurs at the edge.

In the architecture in Figure 2, AWS IoT Greengrass stream manager shifts the main data path away from the AWS IoT Greengrass message broker and AWS IoT Core. Bypassing these services reduces the latency seen in Figure 1, which is caused by the AWS IoT Greengrass message broker waiting for an ACK from AWS IoT Core.

AWS IoT Greengrass stream manager can handle thousands of messages per second so you can:

  • Reliably scale your architecture
  • Create multiple data paths to send both critical and non-critical data to AWS IoT Greengrass stream manager while still leveraging AWS IoT Core control plane features
  • Prioritize your critical data paths to have specific IoT devices take higher priority
  • Create configurations to handle situations where you have IoT devices with limited or intermittent connectivity. (For example, you can create configurations to use local storage or memory to cache your data when internet connectivity is lost, and then flush data when an ACK is received from the destination.)

All of these features can help you reduce latency, costs, data inconsistencies, and the potential loss of critical data. They also provide a mechanism to scale the number of devices that your architecture can reliably handle.

Get started with stream manager by using the AWS IoT Greengrass Core SDK or the AWS IoT console.
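To get a feel for the API, here is a minimal Python sketch, using the AWS IoT Greengrass stream manager SDK, that creates a message stream exporting directly to Kinesis Data Streams and appends a record to it. The stream and Kinesis stream names are placeholders, and the code assumes it runs in a component on a Greengrass core device where stream manager is deployed.

from stream_manager import (
    ExportDefinition,
    KinesisConfig,
    MessageStreamDefinition,
    StrategyOnFull,
    StreamManagerClient,
)

client = StreamManagerClient()

# Define a local stream whose contents are exported to Kinesis Data Streams.
client.create_message_stream(
    MessageStreamDefinition(
        name="CriticalTelemetry",  # placeholder stream name
        strategy_on_full=StrategyOnFull.OverwriteOldestData,
        export_definition=ExportDefinition(
            kinesis=[KinesisConfig(
                identifier="KinesisExport",
                kinesis_stream_name="critical-telemetry-stream",  # placeholder
            )]
        ),
    )
)

# Append a message; stream manager handles buffering, retries, and export.
client.append_message("CriticalTelemetry", b'{"deviceId": "sensor-1", "value": 42}')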

Conclusion

In this blog post, we discussed how to create a scalable IoT architecture that can handle thousands of critical messages per second from a variety of IoT devices. Incorporating AWS IoT Greengrass stream manager into your architecture can help reduce latency, data inconsistencies, and the potential loss of critical data by providing a way to bypass AWS IoT Core and send large amounts of data efficiently and reliably.

Let’s Architect! Streamlining business with migration and modernization

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-streamlining-business-with-migration-and-modernization/

Many customers migrate their systems to Amazon Web Services (AWS) to increase their competitive edge and drive business value. To maximize the benefits of a cloud migration, companies tend to move their applications in conjunction with modernization initiatives. These joint efforts help your applications gain more agility, scalability, and resilience. Modernizing your portfolio of workloads with AWS means that you can re-platform, refactor, or replace these workloads by using containers, serverless technologies, purpose-built data stores, and software automation. These capabilities allow you to benefit from AWS agility and total cost optimization (TCO) benefits.

In this edition of Let’s Architect! we share hands-on activities, customer stories, and tips and tricks to migrate and modernize your applications with AWS.

Migrating to the cloud: What is the cost of doing nothing?

Would you think that small companies always migrate faster than large enterprises? Actually, cloud migration speed doesn’t necessarily depend on the size of the business! Company size is not a clear indicator of migration and modernization success, but a shift of culture and mindset is essential for successful company evolution.

When it comes to migration, the cost of doing nothing is not just financial: Businesses can also expect a slower pace of innovation and a higher security burden. This video analyzes the financial benefits of migration and shares mental models for approaching an AWS cloud migration, and Marriott team members explain how they planned their migration and the lessons learned along the way.

Take me to this re:Invent 2022 video!

Benefits of an early migration start

Modernization pathways for a legacy .NET Framework monolithic application on AWS

Organizations aim to deliver the best technological solutions based on customer needs. At any stage in their cloud adoption journey, businesses often end up managing and building monolithic applications. Let’s explore a migration path for a monolithic .NET Framework application to a modern microservices-based stack on AWS, and discuss AWS tools to break the monolith into microservices and containerize applications.

Cost optimization is another key factor for modernizing your workloads, and solutions include moving to Linux-based systems or using open-source database engines. This Migrate and Modernize enterprise workloads with AWS video walks you through the process of migrating and modernizing enterprise workloads with AWS.

Take me to this blog post with more detail!

A modernized microservices-based rearchitecture

Implementing a serverless-first strategy in an enterprise

Organizations of all sizes want to benefit from the agility, cost savings, and developer experience that serverless architectures can provide on AWS. For large enterprises, the return on investment (ROI) can be massive, but overcoming architecture inertia while ensuring security best practices and governance stay in place is a hurdle that many struggle with. In this lightning talk, learn how your organization can implement a serverless-first strategy to overcome these obstacles. Delta Air Lines shares the story of making serverless-first a reality as part of their AWS journey.

Take me to this video

Benefits of serverless

Application Migration with AWS

This workshop shows you how to migrate and modernize a fictional application to the AWS Cloud by:

  1. Performing a database migration
  2. Migrating and modernizing your web server using different migration strategies (for example, breaking down the monolith into containers)
  3. Teaching you how to improve the operational excellence, security, performance efficiency, and cost optimization of the deployed architecture by following these pillars of the AWS Well-Architected Framework

Take me to this workshop!

Different migration strategies for web servers

See you next time!

Thanks for exploring architecture tools and resources with us!

Next time we’ll talk about distributed systems with containers.

To find all the posts from this series, check out the Let’s Architect! page of the AWS Architecture Blog.

How French Broadcaster TF1 Used AWS Cloud Technology and Expertise to Bring the FIFA World Cup to Millions

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/how-french-broadcaster-tf1-used-aws-cloud-technology-and-expertise-to-bring-the-fifa-world-cup-to-millions/

Three years before millions of viewers saw, arguably, one of the most thrilling World Cup Finals ever broadcast, TF1, the leading private TV channel in France, started a project to redefine the foundations of its broadcasting platform, including adopting a new cloud-based architecture.

They, and all other broadcasters, have been observing diminishing audiences for traditional over-the-air broadcasting and the increasing popularity of digital platforms, such as smart TVs, and boxes like Fire TV, Chromecast, and Apple TV, as well as laptops, tablets, and mobile phones. According to Thierry Bonhomme, CTO of eTF1 (the group within TF1 in charge of digital platforms) whom I recently interviewed for the AWS French Podcast, digital broadcasting now accounts for 20–25 percent of TF1’s total audience.

This online and mobile usage drives very specific traffic patterns on IT systems: a huge peak of connections and authentications in the few minutes before the start of a game, and millions of video streams that must be delivered reliably over a variety of changing network qualities. In addition to these technical challenges, there is also an economic challenge: to deliver advertisements at key moments, such as before a national anthem or during a 15-minute half-time. The digital platform sells its own set of commercials, which are different from the commercials broadcast over the air, and might also be different from region to region. All these video streams have to be delivered to millions of viewers on a wide range of devices and across a variety of network conditions: from 1 Gb/s fiber at home down to 3G networks in remote areas.

TF1’s approach to readiness included redesigning its digital architecture, setting up metrics showing how the new system is performing, and defining processes, roles and responsibilities for people in the team. As part of this preparation, AWS helped TF1 prepare their system to meet their scalability, performance, and security requirements.

In my conversation with Thierry, he described the two main objectives the company had when designing its new technical architecture for the future of broadcasting: first, the scalability of the platform, and second, meeting the demand for performance. Scalability is key to absorbing the peaks of concurrent viewers. And performance is required to ensure that the video streams start quickly (in less than 3 seconds) and there is no interruption of the video player (known as re-buffering). After all, nobody wants to know their team just scored by hearing their neighbors yelling before seeing it happen on the screen they’re watching.

The Technology
In 2019, TF1 started to redesign its digital broadcasting architecture and to rewrite significant parts of the code, such as the back-end API or the front-end applications running on set-top boxes, on Android, or on iOS devices. They adopted a microservices architecture, deployed on Amazon Elastic Kubernetes Service (Amazon EKS) and written in the Go programming language for maximum performance. They designed a set of REST and GraphQL APIs to define the contracts between front- and back-end applications, and an event-driven architecture with Apache Kafka for maximum scalability. They adopted multiple content delivery networks, including Amazon CloudFront, to reliably distribute the video streams to client devices. In August 2020, TF1 got a chance to test the new platform on a large-scale sporting event when Bayern Munich beat Paris Saint-Germain 1-0 in the UEFA Champions League final.

TF1 headquarters in Paris

Here’s a peek at what happens from the moment the action is shot on the field to the moment you see it on your mobile device: The high-quality video stream first lands in the TF1 tower, located in Paris, where hardware encoders create the necessary video streams adapted to your device. AWS Elemental Live hardware encoders are able to generate up to eight different encodings: 4K for TVs, high definition (1080), standard definition (720), and a variety of other formats suited to a wide range of mobile devices and network bandwidths. (This extra video encoding step is one of the reasons why you might sometimes observe extra latency between the video you receive on your traditional TV and the feed you receive on your mobile device.) The system sends the encoded videos to AWS Elemental MediaPackage for packaging and, finally, to the CDNs where the player applications fetch the video segments. The player applications select the best video encoding depending on your device size and the network bandwidth currently available.

At the end of 2021, one year before millions watched French player Kylian Mbappé score the first hat trick (three goals) in a World Cup final since 1966, TF1 started preparing for the big event by identifying risks based on previous experiences and areas needing improvement. Thierry described how they built hypotheses of the likely audience size based on different game scenarios: the longer the French national team might stay in the competition, the higher the expected traffic. They classified risks for each phase of the tournament (selection pools, quarter-final, semi-final, and final). Based on these scenarios, they figured that the platform must be able to sustain 4.5 million viewers connecting to the platform 15 minutes before the start of a game (that’s 5,000 new viewers every second).

This level of scalability requires preparation from TF1’s team but also all external systems in use, such as the AWS cloud services, the authentication and authorization service, and the CDN services.

A viewer arrival triggers multiple flows and API calls. The viewer must authenticate, and some must create a new account or reset their passwords. Once authenticated, the viewer sees the homepage that, in turn, triggers multiple API calls, one of them to the catalog service. When the viewer selects a live stream, other API calls are made to receive the video stream URL. Then the video part kicks in. The client-side player connects to the chosen CDN and starts to download video segments. Once the video is playing, the platform must ensure the stream is delivered smoothly, with high quality and no drop that would cause a re-buffering. All these elements are key to ensuring the best possible viewer experience.

The Preparation
Six months before France made it to the final and squared off against Argentina, TF1 started to work closely with their vendors, including AWS, to define requirements, reserve capacity, and start to work on test and execution plans. At this point, TF1 engaged with AWS Infrastructure Event Management, a dedicated program of the AWS enterprise support plan. Our experts offer architecture and guidance and operational support during the preparation and execution of planned events, such as shopping holidays, product launches, migrations – and in this case, the largest football (soccer) event in the world. For these events, AWS helps customers assess operational readiness, identify and mitigate risks, and execute confidently with AWS experts by their side.

Special care was given to testing the scalability of the API. The TF1 team developed a load-testing engine to simulate users connecting to the platform, authenticating, selecting a program, and starting a video stream. To closely simulate real traffic, TF1 used another hyperscale cloud provider to send requests to their AWS infrastructure. The testing allowed them to define the correct metrics to observe in their dashboards and the correct values to generate alarms. Thierry said the first time the load simulator ran at full speed, simulating 5,000 new connections per second, it crashed the entire back end.

But like any world class team, TF1 used this to their advantage. They took 2–3 weeks to tune the system. They eliminated redundant API calls from client applications and applied aggressive caching strategies. They learned how to scale their back-end platform in response to such traffic. They also learned to identify the value of key metrics under load. After a couple of back-end deployments and new releases for their Android and iOS apps, the system successfully passed the load test. It was a month before the start of the event. At that moment, TF1 decided to freeze all new developments or deployments until the first kickoff in Qatar, unless critical bugs were found.

Monitoring and Planning
The technological platform was only one piece of the project, Thierry told me. They also designed metric dashboards using Datadog and Grafana to monitor key performance indicators and detect anomalies during the event. Thierry noted that when observing average values, they often miss parts of the picture. For example, he said, observing a P95 percentile value instead of an average shows the experience for five percent of your users. When you have three million of them, five percent represents 150K customers, so it is important to know what their experience is. (Incidentally, this percentile technique is used routinely at Amazon and AWS across all service teams, and Amazon CloudWatch has built-in support to measure percentile values.)
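As a small illustration of this percentile practice, here is a hedged Boto3 sketch that retrieves a p95 statistic from Amazon CloudWatch; the namespace and metric name are hypothetical stand-ins for a platform metric such as video startup latency.

from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical custom metric published by the streaming backend.
response = cloudwatch.get_metric_statistics(
    Namespace="VideoPlatform",
    MetricName="StartupLatencyMs",
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=60,
    ExtendedStatistics=["p95"],  # a percentile instead of an average
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["ExtendedStatistics"]["p95"])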

TF1 also prepared for the worst, he said, including the specter of having three million people staring at a black screen during a game. TF1 involved community managers and social media owners early on, and they prepared press releases and social media messages for multiple scenarios. The team also planned to gather all key team members together in a “war room” during each game to reduce communication and reaction time if something needed immediate action. This team included the AWS technical account manager, their counterpart from the authentication service, and other CDN vendors. AWS also had on-call engineers from service teams and premium support team monitoring the health of our services and ready to react in case something went wrong.

The Attacks Weren’t Just on the Field
Three key moments at the start of the tournament provided opportunities to test the platform for real: the opening ceremony, the first game, and particularly for TF1’s audience, the first game for the French team. As the tournament played out over the following weeks — with increased intensity, suspense, and load on IT systems as the French team progressed — the TF1 team would reevaluate its traffic estimates and conduct debriefs after each game. But while the intensity of the action was unfolding on the field, TF1’s team had some behind-the-scenes excitement of its own.

Starting in the quarter final, the team noticed unusual activity from a wide range of distributed IP addresses, and they determined that the system was under a large distributed denial of service (DDOS) attack from a network of compromised machines; someone was trying to take down the service and prevent millions of people from watching. TF1 is accustomed to these types of attacks, and their dashboard helped to identify the traffic patterns in real time. Services such as AWS Shield and AWS Web Application Firewall helped to mitigate the incident without impacting the viewer experience. The TF1 security team and AWS experts conducted further analysis to proactively block some patterns of traffic and IP addresses for the next game.

Still, the intensity of the attacks increased during the semi-finals and final game, when it peaked at 40 million requests over a ten-minute period. “These attacks are a cat-and-mouse game,” said Thierry: attackers try new strategies and apply new patterns, but the team in the war room detects them and dynamically updates the filtering rules to block them before viewers can even detect a change in the quality of the service. The long and detailed preparation served its purpose, and everybody knew what to do. Thierry reported that the attacks were successfully mitigated with no consequences.

The Thrilling Finale
By the time France took to the pitch on Dec. 18, 2022, TF1 knew they would break records on the platform. Thierry said the traffic was higher than estimated, but the platform absorbed it. He also described that during the first part of the game, when Argentina was leading, the TF1 team observed a slow decline of connections… that is, until the first goal scored by Mbappé 10 minutes before the end of the game. At that point, all dashboards showed a sudden return of viewers for the thrilling last moments of the game. At peak, more than 3.2 million digital players were connected at the same time, delivering 3.6 terabits per second of outgoing bandwidth through all four CDNs.

Across the globe, Amazon CloudFront also helped 18 broadcasters deliver video streams. In all, over 48 million unique client IPs connected to one of 450+ edge locations globally during the tournament, peaking at just under 23 terabits per second across these customer distributions during the final game of the tournament.

The Future
While Argentina ultimately triumphed and Lionel Messi achieved his long-sought World Cup win, the 2022 FIFA World Cup proved to the team at TF1 that their processes, their architecture, and their implementation are able to deliver a high-quality viewing experience to millions. The team is now confident the platform is ready to absorb the next planned large-scale events: the Rugby World Cup in September 2023 and the next French presidential election in 2027. Thierry concluded our conversation by predicting that digital broadcasting will eventually attain a larger audience than over-the-air, and that having 3+ million simultaneous viewers will become the new normal.

If your company is also looking to transform its business using the power of cloud computing, consult with one of our AWS Enterprise support advisors today.

— seb

Invoking asynchronous external APIs with AWS Step Functions

Post Syndicated from Jorge Fonseca original https://aws.amazon.com/blogs/architecture/invoking-asynchronous-external-apis-with-aws-step-functions/

External vendor APIs can help organizations streamline operations, reduce costs, and provide better services to their customers. But many challenges exist in integrating with third-party services such as security, reliability, and cost.

Organizations must ensure their systems can handle performance issues or downtime. In some cases, calling an external API may have associated costs such as licensing fees. If a contract exists with the external API vendor to adhere to maximum Requests Per Second (RPS), the system needs to adapt accordingly.

In this blog post, we show you how to build an architecture to invoke an external vendor API using AWS Step Functions, with specific guidance on reliability.

This orchestration is applicable to any industry that relies on technology and data and can benefit from external vendor API integration. Examples include e-commerce applications for online retailers integrating with third-party payment gateways, shipping carriers, or applications in the healthcare and banking sectors.

Invoking asynchronous external APIs overview

This solution outlines the use of AWS services to build an orchestrator controlling the invocation rate of third-party services that implement the service callback pattern to process long-running jobs. This architecture is also available in the AWS Reference Architecture Diagrams section of the AWS Architecture Center.

As in Figure 1, the architecture gives you the control to feed your calls to an external service according to its maximum RPS contract using Step Functions capabilities. Step Functions pauses main request workflows until you receive a callback from the external system indicating job completion.

Invoking Asynchronous External APIs architecture

Figure 1. Invoking Asynchronous External APIs architecture

Let’s explore each step.

  1. Set up Step Functions to handle the lifecycle of long-running requests to the third party. Inside the workflow, add a request step that pauses it using waitForTaskToken as a callback to continue. Set a timeout to throw a timeout error if a callback isn’t received.
  2. Send the task token and request payload to an Amazon Simple Queue Service (Amazon SQS) queue. Use Amazon CloudWatch to monitor its length. Consider adjusting the contract with the third-party service if the queue length grows beyond a soft limit determined on the maximum RPS with the third party.
  3. Use AWS Lambda to poll Amazon SQS and trigger an express Step Functions workflow. Control the invocation rate of Lambda using polling batch size, reserved concurrency, and maximum concurrency, discussed in more detail later in the blog.
  4. Optionally, add dynamic delay inside Lambda controlled by AWS AppConfig if the system still needs a lower invocation rate to comply with the contracted RPS.
  5. Step Functions invokes an Amazon API Gateway HTTP proxy API configured with rate limit to comply with the contracted RPS. API Gateway is a safeguard proxy to make sure your system is not breaking the RPS contract while dynamically adjusting the invocation rate parameters.
  6. Invoke the external third-party asynchronous service API sending the payload consumed from the requests queue and receiving the jobID from the service. Send failed requests to the Dead Letter Queue (DLQ) using Amazon SQS.
  7. Store the main workflow’s task token and jobID from the third-party service in an Amazon DynamoDB table. This is used as a mapping to correlate the jobID with the task token.
  8. When the external service completes, receive the completed jobID in a callback webhook endpoint implemented with API Gateway.
  9. Transform the external callbacks with API Gateway mapping templates, add the payload and jobID to an Amazon SQS queue, and respond immediately to the caller.
  10. Use Lambda to poll the callback Amazon SQS queue, then query the stored token. Use the token to unblock the waiting workflow by calling SendTaskSuccess, and use the callback DLQ to store failed messages (a sketch of this function follows the list).
  11. On the main workflow, pass the jobID to the next step and invoke a Step Functions processor to fetch the third-party results. Finally, process the external service’s results.
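To make step 10 concrete, here is a minimal Python sketch of that callback Lambda function. The DynamoDB table name, key schema, and message body fields are assumptions for illustration; failed messages are left to the queue's redrive policy and the callback DLQ.

import json

import boto3

dynamodb = boto3.resource("dynamodb")
sfn = boto3.client("stepfunctions")
table = dynamodb.Table("JobTokenMapping")  # hypothetical table from step 7

def handler(event, context):
    # Triggered by the callback Amazon SQS queue (step 9 enqueues these).
    for record in event["Records"]:
        body = json.loads(record["body"])

        # Correlate the external jobID with the stored task token.
        item = table.get_item(Key={"jobId": body["jobId"]})["Item"]

        # Unblock the main workflow paused on waitForTaskToken (step 1).
        sfn.send_task_success(
            taskToken=item["taskToken"],
            output=json.dumps(body.get("payload", {})),
        )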

Controlling external API invocation rates

To comply with third-party RPS contracts, adopt a mechanism to control your system’s invocation rate. The rate of polling the messages from the request Amazon SQS (or step 3 in the architecture) directly impacts your invocation rate.

Different parameters can be used to control the invocation rate for Lambda with Amazon SQS as its trigger “event source,” such as:

  1. Batch size: The number of records to send to the function in each batch. For a standard queue, this can be up to 10,000 records. For a first-in, first-out (FIFO) queue, the maximum is 10. Using batch size on its own will not limit the invocation rate. It should be used in conjunction with other parameters such as reserved concurrency or maximum concurrency.
  2. Batch window: The maximum amount of time to gather records before invoking the function, in seconds. This applies only to standard queues.
  3. Maximum concurrency: Sets limits on the number of concurrent instances of the function that an Amazon SQS event source can invoke. Maximum concurrency is an event source-level setting.

Trigger configuration is shown in Figure 2.

Configuration parameters for triggering Lambda

Figure 2. Configuration parameters for triggering Lambda
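As an illustration of setting these parameters programmatically, the following Boto3 sketch creates an Amazon SQS event source mapping; the queue ARN and function name are placeholders.

import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:request-queue",  # placeholder
    FunctionName="invoke-external-api",  # placeholder
    BatchSize=10,  # records per invocation
    MaximumBatchingWindowInSeconds=5,  # batch window (standard queues only)
    ScalingConfig={"MaximumConcurrency": 5},  # event source maximum concurrency
)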

Other Lambda configuration parameters can also be used to control the invocation rate, such as:

  1. Reserved concurrency: Guarantees the maximum number of concurrent instances for the function. When a function has reserved concurrency, no other function can use that concurrency. This can be used to limit and reduce the invocation rate.
  2. Provisioned concurrency: Initializes a requested number of execution environments so that they are prepared to respond immediately to your function’s invocations. Note that configuring provisioned concurrency incurs charges to your AWS account.

These additional Lambda configuration parameters are shown here in Figure 3.

Lambda concurrency configuration options - Reserved and Provisioned

Figure 3. Lambda concurrency configuration options – Reserved and Provisioned

Maximizing your external API architecture

During this architecture implementation, consider some use cases to ensure that you are building a mature orchestrator.

Let’s explore some examples:

  • If the external system fails to respond to the API request in step 8, a timeout exception will occur at step 1. A sensible timeout should be configured in the main state machine in step 1. The timeout value should consider the maximum response time of the external system.

The Error handling capabilities in Step Functions section of the AWS Step Functions Developer Guide provides the ability to implement your own logic for different error types. Catch timeout errors using the States.Timeout error name.

  • Dynamic delay inside the Lambda function—as mentioned in step 4—should only be used temporarily for burst traffic. If the external party has a very low RPS contract, consider other alternatives to introduce delay.

For example, Amazon EventBridge Scheduler can be used to trigger the Lambda function at regular intervals to consume the messages from Amazon SQS. This avoids costs for the idle/waiting state of your Lambda functions.
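A hedged Boto3 sketch of such a schedule follows; the schedule name, rate, and ARNs are placeholders.

import boto3

scheduler = boto3.client("scheduler")

scheduler.create_schedule(
    Name="poll-request-queue",  # placeholder
    ScheduleExpression="rate(1 minute)",
    FlexibleTimeWindow={"Mode": "OFF"},
    Target={
        # Placeholder ARNs for the poller function and its invocation role.
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:queue-poller",
        "RoleArn": "arn:aws:iam::123456789012:role/scheduler-invoke-role",
    },
)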

Conclusion

This blog post discussed how to build an end-to-end orchestration to manage a request’s lifecycle, five different parameters to control the invocation rate of third-party services, and how to throttle calls to an external service API according to a maximum RPS contract.

We also considered use cases for the error handling capabilities in Step Functions and for monitoring systems with CloudWatch. In addition, this architecture adopts fully managed AWS serverless services, removing the undifferentiated heavy lifting of building highly available, reliable, secure, and cost-effective systems in AWS.

Genomics workflows, Part 5: automated benchmarking

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/genomics-workflows-part-5-automated-benchmarking/

Launching and running genomics workflows can take hours and involves large pools of compute instances that process data at a petabyte scale. Benchmarking helps you evaluate workflow performance and discover faster and cheaper ways of running them.

In practice, performance evaluations happen irregularly because of the associated heavy lifting. In this blog post, we discuss how life-science research teams can automate evaluations.

Business Benefits

An automated benchmarking solution provides:

  • more accurate enterprise resource planning by performing historical analytics,
  • lower cost to the business by comparing performance on different resource types, and
  • cost transparency to the business by quantifying periodical chargeback.

We’ve used automated benchmarking to compare processing times on different services such as Amazon Elastic Compute Cloud (Amazon EC2), AWS Batch, AWS ParallelCluster, Amazon Elastic Kubernetes Service (Amazon EKS), and on-premises HPC clusters. Scientists, financiers, technical leaders, and other stakeholders can build reports and dashboards to compare consumption data by consumer, workflow type, and time period.

Design pattern

Our automated benchmarking solution measures performance on two dimensions:

  • Timing: measures the duration of a workflow launch on a specific dataset
  • Pricing: measures the associated cost

This solution can be extended to other performance metrics such as iterations per second or process/thread distribution across compute nodes.

Our requirements include the following:

  • Consistent measurement of timing based on workflow status (such as preparing, waiting, ready, running, failed, complete)
  • Extensible pricing models based on unit prices (the Amazon EC2 Spot price at a specific period of time compared to Amazon EC2 On-Demand pricing)
  • Scalable, cost-efficient, and flexible data store enabling historical benchmarking and estimations
  • Minimal infrastructure management overhead

We choose a serverless design pattern using AWS Step Functions orchestration, AWS Lambda for our application code, and Amazon DynamoDB to track workflow launch IDs and states (as described in Part 3). We assume that the genomics workflows run on AWS Batch with genomics data on Amazon FSx for Lustre (Part 1). AWS Step Functions allows us to break down processing into smaller steps and avoid monolithic application code. Our evaluation process runs in four steps:

  1. Monitor for completed workflow launches in the DynamoDB stream using an Amazon EventBridge pipe with a Step Functions workflow as target. This event-driven approach is more efficient than periodic polling and avoids custom code for parsing status and cost values in all records of the DynamoDB stream.
  2. Collect a list of all compute resources associated with the workflow launch. Design a Lambda function that queries the AWS Batch API (see Part 1) to describe compute environment parameters like the Amazon EC2 instance IDs and their details, such as processing times, instance family/size, and allocation strategy (for example, Spot Instances, Reserved Instances, On-Demand Instances).
  3. Calculate the cost of all consumed resources. We achieve this with another Lambda function, which calculates the total price based on unit prices from the AWS Price List Query API.
  4. Our state machine updates the total price in the DynamoDB table without the need for additional application code.

Figure 1 visualizes these steps.

Automated benchmarking of genomics workflows

Figure 1. Automated benchmarking of genomics workflows

Implementation considerations

AWS Step Functions orchestrates our benchmarking workflow reliably and makes our application code easy to maintain. Figure 2 summarizes the state machine transitions that we’ll describe.

AWS Step Functions state machine for automated benchmarking

Figure 2. AWS Step Functions state machine for automated benchmarking

Gather consumption details

Configure the DynamoDB stream view type to New image so that the entire item is passed through as it appears after it was changed. We set up an Amazon EventBridge pipe with event filtering and the DynamoDB stream as a source. Our event filter matches records with a status of COMPLETE but no cost entry, in order to avoid an infinite loop. Once our state machine has updated the DynamoDB item with the workflow price, the resulting record in the DynamoDB stream will not pass our event filter.

The syntax of our event filter is as follows:

{
  "dynamodb": {
    "NewImage": {
      "status": {
        "S": ["COMPLETE"]
      },
      "totalCost": {
        "S": [{
          "exists": false
        }]
      }
    }
  }
}

We use an input transformer to simplify follow-on parsing by removing unnecessary metadata from the event.

The consumed resources included in the stream record are the auto-scaling group ID for AWS Batch and the Amazon FSx for Lustre volume ID. We use the DescribeJobs API (describe_jobs in Boto3) to determine which compute resources were used. If the response is a list of EC2 instances, we then look up consumption information including start and end times using the ListJobs API (list_jobs in Boto3) for each compute node. We use describe_volumes with filters on the identified EC2 instances to obtain the size and type of Amazon Elastic Block Store (Amazon EBS) volumes.
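A condensed Boto3 sketch of this lookup follows; the job and instance IDs are placeholders, and parsing of the DescribeJobs response is abbreviated.

import boto3

batch = boto3.client("batch")
ec2 = boto3.client("ec2")

# Placeholder job ID; the response carries the compute details to parse.
job_details = batch.describe_jobs(jobs=["example-batch-job-id"])

# Placeholder instance IDs identified from the job and compute environment details.
instance_ids = ["i-0123456789abcdef0"]

# Look up the size and type of the EBS volumes attached to those instances.
volumes = ec2.describe_volumes(
    Filters=[{"Name": "attachment.instance-id", "Values": instance_ids}]
)
for volume in volumes["Volumes"]:
    print(volume["VolumeId"], volume["VolumeType"], volume["Size"])  # size in GiB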

Calculate prices

Another Lambda function obtains the associated unit prices of all consumed resources using the GetProducts request of the AWS Price List Query API (get_products in Boto3) and then parses the pricePerUnit value. For Spot Instances, we use describe_spot_price_history of the EC2 client in Boto3 and specify the time range and instance types for which we want to receive prices.
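
The following is a hedged Boto3 sketch of such a price lookup; the filter values are illustrative, and the Price List Query API is only served from selected Regions such as us-east-1:

import boto3
import json

pricing = boto3.client("pricing", region_name="us-east-1")

def on_demand_price(instance_type, location="US East (N. Virginia)"):
    """Look up the On-Demand hourly price of a Linux, shared-tenancy instance type."""
    response = pricing.get_products(
        ServiceCode="AmazonEC2",
        Filters=[
            {"Type": "TERM_MATCH", "Field": "instanceType", "Value": instance_type},
            {"Type": "TERM_MATCH", "Field": "location", "Value": location},
            {"Type": "TERM_MATCH", "Field": "operatingSystem", "Value": "Linux"},
            {"Type": "TERM_MATCH", "Field": "tenancy", "Value": "Shared"},
            {"Type": "TERM_MATCH", "Field": "preInstalledSw", "Value": "NA"},
            {"Type": "TERM_MATCH", "Field": "capacitystatus", "Value": "Used"},
        ],
    )
    # Each PriceList entry is a JSON document describing one product and its terms
    product = json.loads(response["PriceList"][0])
    on_demand_term = next(iter(product["terms"]["OnDemand"].values()))
    dimension = next(iter(on_demand_term["priceDimensions"].values()))
    return float(dimension["pricePerUnit"]["USD"])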

Calculate the price of workflow launches based on the following factors:

  • Number and size of EC2 instances in auto-scaling node groups
  • Size of EBS volumes and Amazon FSx for Lustre
  • Processing duration

Our Python-based Lambda function calculates the total, rounds it, and delivers the price breakdown in the following format:

total_cost: str, instance_cost: str, volume_cost: str, filesystem_cost: str

Lastly, we write the price breakdown to the DynamoDB table using UpdateItem directly from the Amazon States Language.
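
A state definition along these lines performs the update; the table, key, and attribute names are illustrative:

"UpdateTotalCost": {
  "Type": "Task",
  "Resource": "arn:aws:states:::dynamodb:updateItem",
  "Parameters": {
    "TableName": "WorkflowLaunches",
    "Key": {
      "launchId": {"S.$": "$.launchId"}
    },
    "UpdateExpression": "SET totalCost = :c",
    "ExpressionAttributeValues": {
      ":c": {"S.$": "$.totalCost"}
    }
  },
  "End": true
}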

Note that AWS credits and enterprise discounts might not be reflected in the responses of the AWS Price List Query API, which returns public list prices rather than account-specific pricing.

In the past, we’ve also used AWS Cost Explorer instead of the AWS Price List API. AWS Cost Explorer data is updated at least once every 24 hours. You can denote the pending price status in the DynamoDB table item and use the Wait state to delay the calculation process.

The presented solution can be extended to other compute services such as Amazon Elastic Kubernetes Service (Amazon EKS). For Amazon EKS, events are enriched with the cluster ID from the DynamoDB table and the price calculation should also include control plane costs.

Conclusion

Life-science research teams use benchmarking to compare workflow performance and inform their architectural decisions. Such evaluations are effort-intensive and therefore done irregularly.

In this blog post, we showed how life-science research teams can automate benchmarking for their scientific workflows. The insights teams gain from automated benchmarking indicate continuous optimization opportunities, such as by adjusting compute node configuration. The evaluation data is also available on demand for other purposes including chargeback.

Stay tuned for our next post in which we show how to use historical benchmarking data for price estimations of future workflow launches.

Related information

Accelerating revenue growth with real-time analytics: Poshmark’s journey

Post Syndicated from Mahesh Pasupuleti original https://aws.amazon.com/blogs/big-data/accelerating-revenue-growth-with-real-time-analytics-poshmarks-journey/

This post was co-written by Mahesh Pasupuleti and Gaurav Shah from Poshmark.

Poshmark is a leading social marketplace for new and secondhand styles for women, men, kids, pets, home, and more. By combining the human connection of physical shopping with the scale, ease, and selection benefits of e-commerce, Poshmark makes buying and selling simple, social, and sustainable. Its community of more than 80 million registered users across the US, Canada, Australia, and India is driving a more sustainable future for the fashion industry.

An important goal to achieve for any organization is to grow the top line revenue. Top line revenue refers to the total value of sales of an organization’s services or products. The two main approaches organizations employ to increase revenue are to expand geographically to enter new markets and to increase market share within a market by improving customer experience (CX).

Improving CX is a well-known guideline to attract and retain customers and thereby increase market share. In this post, we share how Poshmark improved CX and accelerated revenue growth by using a real-time analytics solution. We discuss how to create such a solution using Amazon Kinesis Data Streams, Amazon Managed Streaming for Apache Kafka (Amazon MSK), and Amazon Kinesis Data Analytics for Apache Flink; the design decisions that went into the architecture; and the business benefits Poshmark observed.

High-level challenge: The need for real-time analytics

Previous efforts at Poshmark to improve CX through analytics were based on batch processing of analytics data, applied on a daily basis. Although these batch-based efforts were successful to some extent, Poshmark saw opportunities to improve the customer experience with real-time personalization and security guidance during the customer's interaction with the Poshmark app. The customer insights gathered from batch analytics couldn't be paired with current customer activities in real time, due to the latency involved in enriching those activities with the knowledge gained through batch processes. This meant missing the opportunity to provide tailored offers or showcase products based on customers' preferences and behaviors in near-real time, which contributes to a much better customer experience. Similarly, the opportunity to catch fraud within a second, before checkout, was also missed.

To improve the customer experience, Poshmark decided to invest in building a real-time analytics platform to enable real-time capabilities, as explained further in this post. Poshmark engineers worked closely with AWS architects through the AWS Data Lab program. The AWS Data Lab offers accelerated, joint engineering engagements between customers and AWS technical resources to create tangible deliverables that accelerate data and analytics modernization initiatives. The Design Lab is a half-day to two-day engagement with the customer team that offers prescriptive guidance to arrive at the optimal solution architecture design before you embark on building the platform.

Designing the solution architecture through the AWS Data Lab process

The business and technical stakeholders from Poshmark and the AWS Data Lab architects discussed near-to-long-term business requirements along with the functional and non-functional capabilities required to decide on the architecture approach. They reviewed the current state architecture and constraints to understand data flow and technical integration points. The joint team discussed the pros and cons of various AWS services that already exist in Poshmark’s current architecture, as well as other AWS services that can meet the requirements.

Poshmark wanted to address the following business use cases via the real-time analytics platform:

  • Sessionization – Poshmark captures both server-side application events and client-side tracking events. They wanted to use these events to identify and analyze user sessions to track behavior.
  • Illegitimate sign-up and sign-in prevention – Poshmark wanted to detect and ban illegitimate sign-up or sign-in events from bots or non-human traffic in real time on the Poshmark application.
  • IP translation – The IP addresses present in events will be translated to city, state, and zip, and enriched with other information to implement near-real-time, location-aware services encompassing security-related functions as well as personalization functions.
  • Anonymization – Poshmark wanted to anonymize events and make the data available for internal users for querying in near-real time.
  • Personalized recommendations – User behavior captured from clickstream events, up to the last second, can be enriched for personalization and sent to the model to predict recommendations.
  • Other use cases – Additional aggregation and machine learning (ML) inference use cases, such as authorization to operate, listing spam detection, and preventing account takeovers (ATOs).

One common pattern identified for these use cases was the need for a central data enrichment pipeline to enrich incoming raw events before event data can be utilized for actual business processing. In the Design Lab, we focused on design for data enrichment pipelines aimed at enriching events with data from static files, dynamic data stores such as databases, APIs, or within the same event stream for the aforementioned streaming use cases. Later in this post, we cover the salient points discussed during the lab around design and architecture.

Batch analytics solution architecture

The following diagram shows the previous architecture at Poshmark. For brevity, only the flow pertaining to the real-time analytics platform is explained.

User interactions on Poshmark web and mobile applications generate server-side events. These events include add to cart, orders, transactions, and more on application servers, and the page view, clicks, and more on tracking servers. Fluentd with an Amazon Kinesis plugin is set up on both the application and tracking servers to send these events to Amazon Kinesis Data Streams. The Fluentd Kinesis plugin aggregates events before sending to Kinesis Data Streams. A single Kinesis data stream is currently set up to capture these events. A random partition key is configured in Fluentd for the events to allow even distribution of events across shards. The event data format is nested JSON. Poshmark maintains the same schema grammar at the first level of JSON for both server-side and client-side server events. The attributes at nested level can differ between server-side and client-side events.

Poshmark receives around 1 billion events per day (100 million per hour during peak hours, 10 million per hour during non-peak hours). The average size of the event record is 1.2 KB.
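
As a back-of-envelope check on what this load implies for the stream, here is a sketch assuming the standard per-shard write limits of 1 MB per second and 1,000 records per second (note that the Fluentd aggregation reduces the record count but not the byte volume):

# Rough shard sizing for the stated peak load
peak_events_per_sec = 100_000_000 / 3600              # ~27,800 events/s at peak
avg_record_kb = 1.2

ingest_mb_per_sec = peak_events_per_sec * avg_record_kb / 1024   # ~32.6 MB/s

shards_by_throughput = ingest_mb_per_sec / 1.0        # 1 MB/s per shard -> ~33 shards
shards_by_record_rate = peak_events_per_sec / 1000    # 1,000 records/s per shard -> ~28 shards

# The workload is throughput-bound, so roughly 33 or more shards are needed at peak
print(max(shards_by_throughput, shards_by_record_rate))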

The data from the Kinesis data stream is consumed by two applications:

  • A Spark streaming application on Amazon EMR is used to write data from the Kinesis data stream to a data lake hosted on Amazon Simple Storage Service (Amazon S3) in a partitioned way. The data from the S3 data lake is used for batch processing and analytics through Amazon EMR and Amazon Redshift.
  • Druid hosted on Amazon Elastic Compute Cloud (Amazon EC2) integrates with the Kinesis data stream for streaming ingestion and allows users to run slice-and-dice OLAP queries. Operational dashboards are hosted on Grafana integrated with Druid.

Desired enhancements to the initial solution

The use cases discussed during the architecture sessions fall into one or more combinations of the following stream processing requirements:

  • Stateless event processing – For example, near-real-time anonymization.
  • External lookup – Looking up a value from external stores. For example, IP address, city, zip, state, or ID.
  • Stateful data processing – Accessing past events or aggregations or ML inferences.

To meet these requirements, the streaming platform is divided into two layers:

  • Central data enrichment – This layer runs enrichments commonly required by downstream streaming applications. This will help avoid replication of the same enrichment logic in each application and enable better operational maintenance. The enrichment should strive for per-record processing in most cases.
  • Specific streaming applications – This layer will house specific streaming applications with respect to use cases and utilize enriched data from the central data enrichment pipeline.

For central data enrichment, we made the following enhancements to the platform:

  • The total latency, including ingestion and data enrichment, was critical: it needed to stay in the range of double-digit milliseconds, based on Poshmark's overall latency budget for achieving real-time ML responses to events. Kafka achieved the lowest ingestion latency, and the team decided to go with the managed version of Kafka, Amazon MSK.
  • Similarly, low-latency processing of data is also required, and an appropriate framework should be chosen accordingly.
  • Exactly-once delivery guarantees were required to avoid data duplication resulting in wrong calculations.
  • The enrichment source could be anything from static files to databases and APIs, and latencies can vary between them. A number of server-side and client-side events are generated when a user interacts with a Poshmark application, so the same information from the enrichment source is required to enrich each event. Caching this frequently accessed information in a centralized cache optimizes fetch time.

Design decisions for the new solution

Poshmark made the following design decisions for central data enrichment:

  • Kafka can support double-digit millisecond latency from producer to consumer with appropriate performance tuning. Kafka can provide exactly-once semantics at both producer and consumer applications. AWS provides Kafka as part of its Amazon MSK offering, eliminating the operational overhead of maintaining and running Kafka cluster infrastructure on AWS, thereby allowing you to focus on developing and running Kafka-based applications. Poshmark decided to use Amazon MSK for their streaming ingestion and storage requirements.
  • We also decided to use Flink for streaming data enrichment applications for the following reasons:
    • Flink can provide low-latency processing even at higher throughput, with exactly-once guarantees. Spark Structured Streaming, on the other hand, can provide low latency only at low throughput due to its microbatch-based processing. Spark Structured Streaming's continuous processing mode is an experimental feature that provides at-least-once guarantees.
    • If the enrichment request to an external store is modeled in a map function (Spark's map API or Flink's MapFunction API), the call is synchronous: processing waits for a response from the external store before handling the next event, adding delay and reducing overall throughput. Asynchronous interaction instead allows requests to be sent and responses to be received concurrently, which reduces wait time and improves overall throughput. Flink supports async I/O operators natively, allowing users to use asynchronous request clients with data streams. The API handles the integration with data streams, as well as handling order, event time, fault tolerance, and more. Spark Structured Streaming doesn't provide any such support natively and leaves it to users for custom implementation.
    • Poshmark selected Kinesis Data Analytics for Apache Flink to run the data enrichment application. Kinesis Data Analytics for Apache Flink provides the underlying infrastructure for your Apache Flink applications. It handles core capabilities like provisioning compute resources, parallel computation, automatic scaling, and application backups (implemented as checkpoints and snapshots).
  • An enrichment microservice backed by Amazon ElastiCache for Redis was set up to abstract access from data enrichment applications. The AsyncFunction in the Flink async I/O operator isn't multi-threaded and won't work in a truly asynchronous way if the call is blocked or waiting for a response. The enrichment microservice therefore handles requests and responses from the Flink async I/O operators asynchronously. The data is also cached in ElastiCache for Redis to improve the latency of the microservice.
  • The Poshmark ML applications are the consumers of this enriched data. The team has built and deployed different ML models over time. These models include a learning to rank algorithm, fraud detection, personalization and recommendations, and online spam filtering. Previously, for deploying each model into production, the Poshmark team had to go through a series of infrastructure setup steps that involved data extraction from real-time sources, building real-time aggregate features from streaming data, storing these features in a low-latency database (Redis) for sub-millisecond inferences, and finally performing inferences via Amazon SageMaker hosted endpoints.
  • We also designed an ML feature storage pipeline that consumes data from the enriched streaming sources (Kinesis or Kafka), generates single-level and aggregated-level features, and ingests these generated features into a feature store repository with a very low latency of less than 80 milliseconds.
  • The ML models are now able to extract the needed features with latency less than 10 milliseconds from the feature repository and perform real-time model inferencing.

Real-time analytics solution architecture

The following diagram illustrates the solution architecture for real-time analytics with Amazon MSK and Kinesis Data Analytics for Apache Flink.

The workflow is as follows:

  1. Users interact on Poshmark’s web or mobile application.
  2. Server-side events are captured on application servers and client-side events are captured on tracking servers. These events are written in the downstream MSK cluster.
  3. The raw events will be ingested into the MSK cluster using the Fluentd plugin to produce data for Kafka.
  4. The enrichment microservice consists of reactive (asynchronous) enrichment lookup APIs fetching data from persistent data stores. ElastiCache for Redis caches frequently accessed data, reducing fetch time for enrichment lookup APIs.
  5. The Flink application running on Kinesis Data Analytics for Apache Flink consumes raw events from Amazon MSK and runs data enrichment on a per-record basis. The Flink data enrichment application uses Flink’s async I/O to read external data from the enrichment lookup store for enriching stream events.
  6. Enriched events are written in the MSK cluster under different enriched events topics.
  7. The existing Spark streaming application consumes from the enriched events topic (or raw events topic) in Amazon MSK and writes the data into an S3 data lake.
  8. Druid streaming ingestion now reads from the enriched events topic or raw events topic in Amazon MSK depending on the requirements.

Enrichment of the captured event data

In this section, we discuss the different steps to enrich the captured event data.

Enrichment processing

Kinesis Data Analytics for Apache Flink provides the underlying infrastructure for the Apache Flink applications. It handles core capabilities like provisioning compute resources, parallel computation, automatic scaling, and application backups (implemented as checkpoints and snapshots). You can use the high-level Flink programming features (such as operators, functions, sources, and sinks) in the same way that you use them when hosting the Flink infrastructure yourself.

Flink on Amazon EMR gives the flexibility to choose your Flink version, installation, configuration, instances, and storage. However, you also have to take care of cluster management and operational requirements such as scaling, application backup, and provisioning.

Enrichment lookup store

The AsyncFunction in the Flink async I/O operator isn’t multi-threaded and won’t work in a truly asynchronous way if the call is blocked or waiting for a response. The enrichment lookup API should handle requests and responses asynchronously coming from Flink async I/O operators. The enrichment lookup API can be hosted on Amazon EC2 or containers such as Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS).

A number of server-side and client-side events are generated when a user interacts with a Poshmark application. As a result, the same information is required to enrich each event. This frequently accessed information cached in a centralized cache can optimize fetch time. The latency to the centralized cache can be further reduced by hosting the client (enrichment lookup API) and cache server in the same Availability Zone.
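
A simplified sketch of such a lookup path follows, assuming the redis-py client; fetch_from_store is a hypothetical placeholder for the call to the persistent store:

import json
import redis

cache = redis.Redis(host="enrichment-cache.example.internal", port=6379)  # illustrative endpoint

def fetch_from_store(key):
    # Placeholder for the lookup against the persistent store (database or API)
    raise NotImplementedError

def lookup(key, ttl_seconds=300):
    """Return enrichment data for a key, preferring the centralized cache."""
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    value = fetch_from_store(key)
    # Populate the cache so subsequent events carrying the same key are served quickly
    cache.setex(key, ttl_seconds, json.dumps(value))
    return value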

Reconciliation in case of pipeline errors

The event enrichment can fail in data enrichment applications for various reasons, such as the external store timing out or missing information in the store. The enriched fields may or may not be critical for downstream streaming applications. You should build your downstream streaming applications on the assumption that these failures can occur, and implement a fallback mechanism, for example, retrying on-demand enrichment from the application. The failure handling will also be governed by the latency tolerance of the application.

The processing of data is based on event time. In some situations, data can arrive late in the platform. Both Flink and Spark allow lateness and watermarks for users to handle late-arriving data by defining thresholds. Late-arriving data beyond the threshold is discarded from processing. It’s possible to get this discarded too-late data in Flink using a side output. There is no such provision in Spark Structured Streaming.

A few streaming applications require their batch counterpart to reconcile data hourly or daily to handle data mismatch or data discrepancy due to late-arriving data or missing data.

Improved customer experience

The new real-time architecture offered the following benefits for an improved customer experience:

  • Anonymization – Poshmark is now able to provide and utilize real-time anonymized data for multiple functions both internally and externally because anonymization happens in real time.
  • Fraud mitigation – Poshmark was previously able to detect and prevent 45% of ATOs with the batch-based solution. With the real-time system, Poshmark is able to prevent 80% of ATOs.
  • Personalization – By providing personalized search results, Poshmark achieved an 8% improvement on clickthrough rates for search. This is a significant increase in the top of the funnel, increasing overall search conversions.

Improvement in these three factors helped end-customers gain confidence in the Poshmark app and website, which in turn enabled customers to increase their interaction with the app and helped accelerate customer engagement and growth.

Conclusion

In this post, we discussed the ingestion of real-time clickstream and log event data into Amazon MSK. We showed how enrichment of the captured data can be performed through Kinesis Data Analytics for Apache Flink. We broke the enrichment processing into multiple components, such as Kinesis Data Analytics for Apache Flink, the enrichment microservice, the enrichment lookup store, and an enrichment cache. We discussed the downstream applications that used this enriched customer information to perform real-time security checks and offer personalized recommendations to end-users. We also discussed some of the areas that may need attention in case there are failures in the pipeline. Lastly, we showed how Poshmark improved their customer experience and gained market share by implementing this real-time analytics pipeline.


About the authors

Mahesh Pasupuleti is a VP of Data & Machine Learning Engineering at Poshmark. He has helped several startups succeed in different domains, including media streaming, healthcare, the financial sector, and marketplaces. He loves software engineering, building high performance teams, and strategy, and enjoys gardening and playing badminton in his free time.

Gaurav Shah is Director of Data Engineering and ML at Poshmark. He and his team help build data-driven solutions to drive growth at Poshmark.

Raghu Mannam is a Sr. Solutions Architect at AWS in San Francisco. He works closely with late-stage startups, many of which have had recent IPOs. His focus is end-to-end solutioning including security, DevOps automation, resilience, analytics, machine learning, and workload optimization in the cloud.

Deepesh Malviya is Solutions Architect Manager on the AWS Data Lab team. He and his team help customers architect and build data, analytics, and machine learning solutions to accelerate their key initiatives as part of the AWS Data Lab.

Realtime monitoring of microservices and cloud-native applications with IBM Instana SaaS on AWS

Post Syndicated from Eduardo Monich Fronza original https://aws.amazon.com/blogs/architecture/realtime-monitoring-of-microservices-and-cloud-native-applications-with-ibm-instana-saas-on-aws/

Customers are adopting microservices architecture to build innovative and scalable applications on Amazon Web Services (AWS). These microservices applications are deployed across multiple AWS services, and customers are looking for comprehensive observability solutions that can help them effectively monitor and manage the performance of their applications in real-time.

IBM Instana is a fully automated application performance management (APM) solution, available to customers as a fully managed software as a service (SaaS) solution on AWS. It is specifically designed to help customers address the challenges of monitoring microservices and cloud-native applications in real-time. It uses artificial intelligence and machine learning to provide detailed insights into the health and behavior of applications, allowing developers and IT teams to gain real-time insights into their microservices applications, optimize performance, and quickly identify and troubleshoot issues.

This post explains the capabilities of IBM Instana to automatically collect observability metrics, traces, and events from microservices deployed on AWS cloud, as well as on-premises, to provide full visibility into the performance of individual components and applications as a whole.

IBM Instana solution overview

IBM Instana is designed to be highly scalable and adaptable to changing microservices applications environments. Its architecture (Figure 1) consists of several components that work together to provide comprehensive monitoring for microservices and cloud-native applications.

Instana’s main building blocks are host agents and agent sensors that are deployed in a customer’s AWS account and responsible for collecting, aggregating, and sending detailed monitoring information of applications and AWS services to the Instana SaaS backend.

The Instana SaaS backend services provide several key components, including data collectors, storage services, analytics engines, and user interfaces. These components allow customers to process and analyze data in real time, generate actionable insights, and maintain a comprehensive view of their application and infrastructure performance, enabling them to quickly identify and resolve issues and improve their overall operations.

IBM Instana architecture on AWS

Figure 1. IBM Instana architecture on AWS

Monitoring data

Instana monitors and observes microservices and cloud-native applications by collecting beacons, traces, and one-second metrics:

  • Beacons are small monitoring payloads that are transmitted by a JavaScript agent to the Instana servers, modeling specific events occurring within the lifecycle of a page view of a website; for example, page loading, resource retrieval, and HTTP requests.
  • Traces are detailed records of the requests and transactions that flow through a microservice architecture. They record the sequence of events that occur when a request is processed, including the services that are involved, the duration of each service, and any errors or exceptions that occur. Instana automatically correlates traces across services to provide a complete view of an entire transaction. This allows for easy identification and diagnosis of performance issues.
  • Metrics are numerical values that represent the performance and resource utilization of a microservice or infrastructure component. Metrics are collected by Instana Agents and sent to the Instana backend at regular intervals. Instana Agents collect hundreds of different metrics, including (but not limited to) CPU usage, memory usage, network traffic, and disk I/O.

This information is captured by Instana agents and sensors, which also collect application configurations and events, plus discover application building blocks, including clusters, containers, and services.

IBM Instana agents and sensors

The Instana host agent is a lightweight software component that collects and aggregates data from various sensors before sending the data to the Instana backend. It can be deployed to AWS services, including Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Kubernetes Service (Amazon EKS), AWS Fargate, AWS Lambda, or Red Hat OpenShift Service on AWS (ROSA). A single host agent, one per host, is used to collect data from monitored systems.

Once Instana agents are running, they automatically detect applications and services, such as containers running on Amazon EKS, and processes like Nginx, NodeJS, Spring Boot, Postgres, Elasticsearch, or Cassandra. For each component detected, different Instana sensors are automatically downloaded, installed, and configured to monitor the environment.

Instana sensors are small programs that are designed to attach and monitor one specific technology and pass their data to the agent. They are automatically managed, updated, loaded, and unloaded by the host agent.

These sensors can monitor several different AWS services like Lambda, Amazon DynamoDB, Amazon Simple Storage Service (Amazon S3), Amazon Aurora, Amazon Simple Queue Service, and Amazon Managed Streaming for Apache Kafka. They collect data—like request and error rates, latency, CPU utilization—via AWS APIs and Amazon CloudWatch.

Instana also provides sensors to collect data from applications running on AWS, like IBM MQ, IBM Db2, or Red Hat OpenShift Container Platform. Review IBM’s full list of supported technologies and AWS services.

Instana also provides tracers, which are used with runtimes like Java, .NET, and NodeJS, among others. They instrument code execution to capture logs and traces at the request level and send them back to the Instana agent.

With the use of sensors, the host agent collects configuration data and monitors the applications it has detected. The host agent also handles communications with the Instana SaaS backend services. It collects and aggregates logs, traces, and metrics (such as response times, error rates, and resource utilization) and sends them to the Instana SaaS backend every second in real time, using secure and efficient communication protocols.

IBM Instana SaaS

The Instana SaaS backend is the heart of the Instana APM solution and responsible for processing, storing, and analyzing the monitoring data collected from the Instana agents and sensors installed in the customer’s infrastructure.

It consists of several components and services that work together to provide real-time monitoring and analysis of microservices applications, including:

  • Data collectors: Receive and process data from the Instana agents and sensors, and store it in the Instana backend for further analysis.
  • Analytics engine: Analyzes the data collected by the agents and sensors to provide insights into the performance and health of the microservices applications.
  • User interface: Web-based interface that customers use to view and analyze their monitoring data.
  • Alerting engine: Generates alerts when thresholds or anomalies are detected in the monitoring data.
  • Data storage: Time-series database that stores the monitoring data collected by the agents and sensors. Allows customers to query and analyze the data in real-time.
  • Integrations: Integrates with various third-party tools, such as Slack, PagerDuty, and ServiceNow, providing seamless alerting and incident management.

IBM Instana backend: making sense of the situation in real time

The Instana SaaS platform automatically ingests data from agents and continuously updates a dependency map (Figure 2). This map presents every dependency in context, giving users an easy way to understand the interrelationships between application components and services.

This understanding enables users to identify the upstream and downstream impacts of any issue, ensuring that they stay informed about any potential impacts.

An example of an IBM Instana dependency map

Figure 2. An example of an IBM Instana dependency map

Instana traces every request end-to-end without sampling. The traces are analyzed in real-time, providing metrics that make any performance problems immediately visible. In the event of an incident, Instana can illustrate how a single issue can generate a ripple effect and impact a number of directly and indirectly connected services. Using the relationship information from the Dynamic Graph, Instana’s automatic root-cause analysis can precisely aggregate the individual issues into a single incident.

Applications monitoring with IBM Instana

Figure 3. Applications monitoring with IBM Instana

Developers, IT operations, or site reliability engineers (SREs) can access the Instana application monitoring interface (Figure 3) or the end-user monitoring (EUM) interface (Figure 4) to view monitoring data of their workloads, whether websites, mobile applications, AWS services, or infrastructure levels. From this UI, these personas can access service dashboards that show key performance indicators (KPIs), like response time and error rate.

End-user monitoring with IBM Instana

Figure 4. End-user monitoring with IBM Instana

The following actions demonstrate how EUM for a JavaScript application deployed to Amazon S3 can be set up:

  • Developers inject Instana JavaScript code (Figure 5) into the static website (HTML).
  • When a user visits the website, the JavaScript agent sends beacons to the Instana backend.
  • Dashboards show specific events of the website lifecycle, including page loading, JS errors, and HTTP requests.
  • Teams access the Instana UI to check performance metrics. They can configure Smart Alerts with custom alerting policies based on specific metrics and KPIs.
  • Smart Alerts can send alerts via various channels, such as email, Slack, or IBM Watson AIOps Webhook.
  • In case of an incident, teams can use Instana to retrieve various performance metrics for root-cause analysis.
  • Developers can resolve the issues and apply the patch.

IBM Instana EUM JavaScript agent

Figure 5. IBM Instana EUM JavaScript agent

Instana also offers Smart Alerts (Figure 6) to provide a more intuitive process of managing alerts. With Smart Alerts, customers can automatically generate alerting configurations using relevant KPIs and automatic threshold detection for use cases like website slowness or website errors.

IBM Instana Smart Alerts

Figure 6. IBM Instana Smart Alerts

Conclusion

In this post, we discussed how IBM Instana provides a comprehensive monitoring solution with the right tools to help you implement a real-time observability and monitoring solution. It allows you to gain insight into your microservices and cloud-native applications, including visibility into AWS services, containers, on-premises infrastructure, and other technologies. Instana can quickly identify and resolve issues before they impact end-users, ensuring that your applications are performing optimally.

For IT administrators, developers, and business owners, IBM Instana on AWS gives a deeper understanding of your applications and helps you make data-driven decisions to improve overall performance.

Additional resources

Disaster Recovery Solutions with AWS-Managed Services, Part 3: Multi-Site Active/Passive

Post Syndicated from Brent Kim original https://aws.amazon.com/blogs/architecture/disaster-recovery-solutions-with-aws-managed-services-part-3-multi-site-active-passive/

Welcome to the third post of a multi-part series that addresses disaster recovery (DR) strategies with the use of AWS-managed services to align with customer requirements of performance, cost, and compliance. In part two of this series, we introduced a DR concept that utilizes managed services through a backup and restore strategy with multiple Regions. The post also introduces a multi-site active/passive approach.

The multi-site active/passive approach is best for customers who have business-critical workloads with higher availability requirements than other active/passive environments can provide. A warm-standby strategy (as in Figure 1) is more costly than other active/passive strategies, but provides good protection from downtime and data loss outside of an active/active (A/A) environment.

Warm standby

Figure 1. Warm standby

Implementing the multi-site active/passive strategy

By replicating across multiple Availability Zones in the same Region, your workloads become resilient to the failure of an entire data center. Using multiple Regions provides the most resilient option for deploying workloads, safeguarding against the risk of failure of multiple data centers.

Let’s explore an application that processes payment transactions and is modernized to utilize managed services in the AWS Cloud, as in Figure 2.

Warm standby with managed services

Figure 2. Warm standby with managed services

Let’s cover each of the components of this application, as well as how managed services behave in a multisite environment.

1. Amazon Route 53 – Active/Passive Failover: This configuration keeps primary resources available, with secondary resources on standby in case the primary environment fails. You create the DNS records and specify failover for the routing policy. When responding to queries, Amazon Route 53 includes only the healthy primary resources. If the primary record configured in the Route 53 health check shows as unhealthy, Route 53 responds to DNS queries using the secondary record.
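
As a sketch of what this looks like with Boto3 (the zone ID, record names, IP addresses, and health check ID are all illustrative):

import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0123456789EXAMPLE",
    ChangeBatch={"Changes": [
        {   # Primary record, returned only while its health check passes
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com", "Type": "A",
                "SetIdentifier": "primary", "Failover": "PRIMARY",
                "TTL": 60,
                "ResourceRecords": [{"Value": "203.0.113.10"}],
                "HealthCheckId": "11111111-2222-3333-4444-555555555555",
            },
        },
        {   # Secondary record, answered when the primary is unhealthy
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com", "Type": "A",
                "SetIdentifier": "secondary", "Failover": "SECONDARY",
                "TTL": 60,
                "ResourceRecords": [{"Value": "198.51.100.10"}],
            },
        },
    ]},
)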

2. Amazon EKS control plane: Amazon Elastic Kubernetes Service (Amazon EKS) control plane nodes run in an account managed by AWS. Each EKS cluster control plane is single-tenant and unique, and runs on its own set of Amazon Elastic Compute Cloud (Amazon EC2) instances. Amazon EKS is also a Regional service, so each cluster is confined to the Region where it is deployed, with each cluster being a standalone entity.

3. Amazon EKS data plane: Operating highly available and resilient applications requires a highly available and resilient data plane. It’s best practice to create worker nodes using Amazon EC2 Auto Scaling groups instead of creating individual Amazon EC2 instances and joining them to the cluster.

Figure 2 shows three nodes in the primary Region, while the secondary Region runs only a single node. In case of failover, the data plane scales up to meet the workload requirements. This strategy deploys a functional stack to the secondary Region so you can test Region readiness before failover. You can use Velero with Portworx to manage snapshots of persistent volumes. These snapshots can be stored in an Amazon Simple Storage Service (Amazon S3) bucket in the primary Region, which is replicated to an Amazon S3 bucket in another Region using Amazon S3 cross-Region replication.

During an outage in the primary Region, Velero restores volumes from the latest snapshots in the standby cluster.

4. Amazon OpenSearch Service: With cross-cluster replication in Amazon OpenSearch Service, you can replicate indexes, mappings, and metadata from one OpenSearch Service domain to another. The domain follows an active-passive replication model where the follower index (where the data is replicated) pulls data from the leader index. Using cross-cluster replication helps to ensure recovery from disaster events and allows you to replicate data across geographically distant data centers to reduce latency.

Cross-cluster replication is available on domains running Elasticsearch 7.10 or OpenSearch 1.1 or later. Full documentation for cross-cluster replication is available in the OpenSearch documentation.

If you are using any versions prior to Elasticsearch 7.10 or OpenSearch 1.1, refer to part two of our blog series for guidance on using APIs for cross-Region replication.

5. Amazon RDS for PostgreSQL: One of the managed service offerings of Amazon Relational Database Service (Amazon RDS) for PostgreSQL is cross-Region read replicas. Cross-Region read replicas give you a DR solution, scale read database workloads, and support cross-Region migration.

Amazon RDS for PostgreSQL supports the ability to create read replicas of a source database (DB). Amazon RDS uses an asynchronous replication method of the DB engine to update the read replica whenever there is a change made on the source DB instance. Although read replicas operate as DB instances that allow only read-only connections, they can be used to implement a DR solution for your production DB environment. If the source DB instance fails, you can promote your read replica to a standalone source server.

Using a cross-Region read replica helps ensure that you get back up and running if you experience a Regional availability issue. For more information on PostgreSQL cross-Region read replicas, visit the Best Practices for Amazon RDS for PostgreSQL Cross-Region Read Replicas blog post.

6. Amazon ElastiCache: AWS provides a native solution called Global Datastore that enables cross-Region replication. By using the Global Datastore for Redis feature, you can work with fully managed, fast, reliable, and secure replication across AWS Regions. This feature helps create cross-Region read replica clusters for ElastiCache for Redis to enable low-latency reads and DR across AWS Regions. Each global datastore is a collection of one or more clusters that replicate to one another. When you create a global datastore in Amazon ElastiCache, ElastiCache for Redis automatically replicates your data from the primary cluster to the secondary cluster. ElastiCache then sets up and manages automatic, asynchronous replication of data between the two clusters.
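
A hedged Boto3 sketch of setting this up, assuming an existing primary replication group (the names and Regions are illustrative):

import boto3

primary = boto3.client("elasticache", region_name="us-east-1")
secondary = boto3.client("elasticache", region_name="us-west-2")

# Promote an existing replication group to a global datastore
gds = primary.create_global_replication_group(
    GlobalReplicationGroupIdSuffix="payments",
    PrimaryReplicationGroupId="payments-cache-primary",
)["GlobalReplicationGroup"]

# Add a secondary, read-only replication group in another Region
secondary.create_replication_group(
    ReplicationGroupId="payments-cache-secondary",
    ReplicationGroupDescription="DR replica for the payments cache",
    GlobalReplicationGroupId=gds["GlobalReplicationGroupId"],
)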

7. Amazon Redshift: With Amazon Redshift, there are only two ways of deploying a true DR approach: backup and restore, or an A/A solution. We'll use the A/A solution, as it provides a better recovery time objective (RTO) for the overall approach. The recovery point objective (RPO) depends on the configured schedule of the AWS Lambda functions. The application within the primary Region sends data to both Amazon Simple Notification Service (Amazon SNS) and Amazon S3, and the data is distributed to the Redshift clusters in both Regions through Lambda functions.

Amazon EKS uploads data to an Amazon S3 bucket and publishes a message to an Amazon SNS topic with a reference to the stored S3 object. Amazon S3 acts as an intermediate data store for messages that exceed the maximum message size of Amazon SNS. Amazon SNS is configured with Amazon Simple Queue Service (Amazon SQS) endpoint subscriptions in the primary and secondary Regions; Amazon SNS supports cross-Region delivery of notifications to Amazon SQS queues. Lambda functions deployed in the primary and secondary Regions poll the Amazon SQS queue in their respective Regions to read the message. The Lambda functions then use the Amazon SQS Extended Client Library for Java to retrieve the Amazon S3 object referenced in the message. Once the Amazon S3 object is retrieved, the Lambda functions upload the data into Amazon Redshift.

For more on how to coordinate large messages across accounts and Regions with Amazon SNS and Amazon SQS, explore the Coordinating Large Messages Across Accounts and Regions with Amazon SNS and SQS blog post.

Conclusion

This active/passive approach covers how you can build a creative DR solution using a mix of native and non-native cross-Region replication methods. By using managed services, this strategy becomes simpler through automation of service updates, deployment using Infrastructure as Code (IaC), and general management of the two environments.

Related information

Want to learn more? Explore the following resources within this series and beyond!

Let’s Architect! Architecting a data mesh

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-architecting-a-data-mesh/

In the past, data architectures were designed mainly around technologies rather than business domains. This changed in 2019, when Zhamak Dehghani introduced the data mesh. Data mesh is an application of Domain-Driven Design (DDD) principles to data architectures: data is organized into data domains, and the data is the product that the team owns and offers for consumption.

A data mesh architecture unites the disparate data sources within an organization through centrally managed data-sharing and governance guidelines. Business functions can maintain control over how shared data is accessed because data mesh also solves advanced data security challenges through distributed, decentralized ownership.

This edition of Let’s Architect! introduces data mesh, highlights the foundational concepts of data architectures, and covers the patterns for designing a data mesh in the AWS cloud with supporting resources.

Data lakes, lake houses and data mesh: what, why, and how?

Let’s explore a video introduction to data lakes, lake houses, and data mesh. This resource explains how to leverage those concepts to gain greater data insights across different business segments, with a special focus on best practices to build a well-architected, modern data architecture on AWS. It also gives an overview of the AWS cloud services that can be used to create such architectures and describes the fundamental pillars of designing them.

Take me to this intro to data lakes, lake houses, and data mesh video!

Data mesh is an architecture pattern where data are organized into domains and seen as products to expose for consumption

Data mesh is an architecture pattern where data are organized into domains and seen as products to expose for consumption

Building data mesh architectures on AWS

Knowing what a data mesh architecture is, here is a step-by-step video from re:Invent 2022 on designing one. It covers a use case on how GoDaddy considered and implemented data mesh, in addition to:

  • The fundamental pillars behind a well-architected data mesh in the cloud
  • Finding an approach to build a data mesh architecture using native AWS services
  • Reasons for considering a data mesh architecture where data lakes provide limitations in some scenarios
  • How data mesh can be applied in practice to overcome them
  • The mental models to apply during the data mesh design process

Take me to this re:Invent 2022 video!

In the data mesh architecture the producers expose their data for consumption to the consumers. Access is regulated through a centralized governance layer.

In the data mesh architecture the producers expose their data for consumption to the consumers. Access is regulated through a centralized governance layer.

Amazon DataZone: Democratize data with governance

Now let’s explore data accessibility as it relates to data mesh architectures.

Amazon DataZone is a new AWS business data catalog allowing you to unlock data across organizational boundaries with built-in governance. This service provides a unified environment where everyone in an organization—from data producers to data consumers—can access, share, and consume data in a governed manner.

Here is a video to learn how to apply AWS analytics services to discover, access, and share data across organizational boundaries within the context of a data mesh architecture.

Take me to this re:Invent 2022 video!

Amazon DataZone accelerates the adoption of the data mesh pattern by making it scalable to a high number of producers and consumers.

Amazon DataZone accelerates the adoption of the data mesh pattern by making it scalable to a high number of producers and consumers.

Build a data mesh on AWS

Feeling inspired to build? Hands-on experience is a great way to learn and see how the theoretical concepts apply in practice.

This workshop teaches you a data mesh architecture building approach on AWS. Many organizations are interested in implementing this architecture to:

  1. Move away from centralized data lakes to decentralized ownership
  2. Deliver analytics solutions across business units

Learn how a data mesh architecture can be implemented with AWS native services.

Take me to this workshop!

The diagram shows how to separate the producer, consumer, and governance components through a multi-account strategy.

The diagram shows how to separate the producer, consumer, and governance components through a multi-account strategy.

See you next time!

Thanks for exploring architecture tools and resources with us!

Next time we’ll talk about monitoring and observability.

To find all the posts from this series, check out the Let’s Architect! page of the AWS Architecture Blog.

Boosting Resiliency with an ML-based Telemetry Analytics Architecture

Post Syndicated from Shibu Nair original https://aws.amazon.com/blogs/architecture/boosting-resiliency-with-an-ml-based-telemetry-analytics-architecture/

Data proliferation has become the norm, and as organizations become more data-driven, automating data pipelines that enable data ingestion, curation, and processing is vital. Since many organizations have thousands of time-bound, automated, complex pipelines, monitoring their telemetry information is critical. Keeping track of telemetry data helps businesses monitor and recover their pipelines faster, which results in better customer experiences.

In this blog post, we explain how you can collect telemetry from your data pipeline jobs and use machine learning (ML) to build lower- and upper-bound thresholds to help operators identify anomalies in near-real time.

The applications of anomaly detection on telemetry data from job pipelines are wide-ranging, including these and more:

  • Detecting abnormal runtimes
  • Detecting jobs running slower than expected
  • Proactive monitoring
  • Notifications

Key tenets of telemetry analytics

There are five key tenets of telemetry analytics, as in Figure 1.

Key tenets of telemetry analytics

Figure 1. Key tenets of telemetry analytics

The key tenets for near real-time telemetry analytics for data pipelines are:

  1. Collect the metrics
  2. Aggregate the metrics
  3. Identify anomalies
  4. Notify and resolve issues
  5. Persist the data for compliance reasons, historical trend analysis, and visualization

This blog post describes how customers can easily implement these steps by using AWS native no-code, low-code (AWS LCNC) solutions.

ML-based telemetry analytics solution architecture

The architecture defined here helps customers incrementally enable features with AWS LCNC solutions by leveraging AWS managed services to avoid the overhead of infrastructure provisioning. Most of the steps are configurations of the features provided by AWS services. This enables customers to make their applications resilient by tracking and resolving anomalies in near real time, as in Figure 2.

ML-based telemetry analytics solution architecture

Figure 2. ML-based telemetry analytics solution architecture

Let’s explore each of the architecture steps in detail.

1. Indicative AWS data analytics services: Choose from a broad range of AWS analytics services, including data movement, data storage, data lakes, big data analytics, log analytics, and streaming analytics to business intelligence, ML, and beyond. This diagram shows a subset of these data analytics services. You may use one or a combination of many, depending on your use case.

2. Amazon CloudWatch metrics for telemetry analytics: Collecting and visualizing real-time logs, metrics, and event data is a key step in any process. CloudWatch helps you accomplish these tasks without any infrastructure provisioning. Almost every AWS data analytics service is integrated with CloudWatch to enable automatic capturing of the detailed metrics needed for telemetry analytics.

3. Near real-time use case examples: Step three presents practical, near real-time use cases that represent a range of real-world applications, one or more of which may apply to your own business needs.

Use case 1: Anomaly detection

CloudWatch provides the functionality to apply anomaly detection for a metric. The key business use case of this feature is to apply statistical and ML algorithms on a per-metric basis to business-critical applications in order to proactively identify issues and raise alarms.

The focus is on a single set of metrics that will be important for the application’s functioning—for example, AWS Lambda metrics of a 24/7 credit card company’s fraud monitoring application.

Use case 2: Unified metrics using Amazon Managed Grafana

For proper insights into telemetry data, it is important to unify metrics and collaboratively identify and troubleshoot issues in analytical systems. Amazon Managed Grafana helps to visualize, query, and correlate metrics from CloudWatch in near real-time.

For example, Amazon Managed Grafana can be used to monitor container metrics for Amazon EMR running on Amazon Elastic Kubernetes Service (Amazon EKS), which supports processing high-volume data from business critical Internet of Things (IoT) applications like connected factories, offsite refineries, wind farms, and more.

Use case 3: Combined business and metrics data using Amazon OpenSearch Service

Amazon OpenSearch Service provides the capability to perform near real-time, ML-based interactive log analytics, application monitoring, and search by combining business and telemetry data.

As an example, customers can combine AWS CloudTrail logs for AWS logins, Amazon Athena and Amazon Redshift query access times, and employee reference data to detect insider threats.

This log analytics use case architecture integrates into OpenSearch, as in Figure 3.

Log analytics use case architecture overview with OpenSearch

Figure 3. Log analytics use case architecture overview with OpenSearch

Use case 4: ML-based advanced analytics

Using Amazon Simple Storage Service (Amazon S3) as data storage, data lake customers can tap into AWS analytics services such as the AWS Glue Catalog, AWS Glue DataBrew, and Athena for preparing and transforming data, as well as build trend analysis using ML models in Amazon SageMaker. This mechanism helps with performing ML-based advanced analytics to identify and resolve recurring issues.

4. Anomaly resolution: When an alert is generated either by CloudWatch alarm, OpenSearch, or Amazon Managed Grafana, you have the option to act on the alert in near-real time. Amazon Simple Notification Service (Amazon SNS) and Lambda can help build workflows. Lambda also helps integrate with ServiceNow ticket creation, Slack channel notifications, or other ticketing systems.

Simple data pipeline example

Let’s explore another practical example using an architecture that demonstrates how AWS Step Functions orchestrates Lambda, AWS Glue jobs, and crawlers.

To report an anomaly on AWS Glue jobs based on the total number of records processed, you can leverage the glue.driver.aggregate.recordsRead CloudWatch metric and set up a CloudWatch alarm based on anomaly detection, an Amazon SNS topic for notifications, and Lambda for resolution, as in Figure 4.
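
A hedged Boto3 sketch of such an alarm follows; the job name, SNS topic ARN, and band width of two standard deviations are illustrative:

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="glue-records-read-anomaly",
    ComparisonOperator="LessThanLowerOrGreaterThanUpperThreshold",
    EvaluationPeriods=1,
    ThresholdMetricId="ad1",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:pipeline-alerts"],
    Metrics=[
        {
            "Id": "m1",
            "ReturnData": True,
            "MetricStat": {
                "Metric": {
                    "Namespace": "Glue",
                    "MetricName": "glue.driver.aggregate.recordsRead",
                    "Dimensions": [
                        {"Name": "JobName", "Value": "daily-ingest-job"},
                        {"Name": "JobRunId", "Value": "ALL"},
                        {"Name": "Type", "Value": "count"},
                    ],
                },
                "Period": 300,
                "Stat": "Sum",
            },
        },
        {
            # Band of expected values learned from the metric's history
            "Id": "ad1",
            "Expression": "ANOMALY_DETECTION_BAND(m1, 2)",
        },
    ],
)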

AWS Step Functions orchestrating Lambda, AWS Glue jobs, and crawlers

Figure 4. AWS Step Functions orchestrating Lambda, AWS Glue jobs, and crawlers

Here are the steps involved in the architecture proposed:

  • CloudWatch automatically captures the metric glue.driver.aggregate.recordsRead from AWS Glue jobs.
  • Customers set a CloudWatch alarm based on the anomaly detection of glue.driver.aggregate.recordsRead metric and set a notification to Amazon SNS topic.
  • CloudWatch applies a ML algorithm to the metric’s past data and creates a model of metric’s expected values.
  • When the number of records read deviates significantly from the model's expected values, the CloudWatch anomaly detection alarm sends a notification to the Amazon SNS topic.
  • Customers can notify an email group and trigger a Lambda function to resolve the issue, or create tickets in their operational monitoring system.
  • Customers can also unify all the AWS Glue metrics using Amazon Managed Grafana. Using Amazon S3, data lake customers can crawl and catalog the data in the AWS Glue catalog and make it available for ad-hoc querying. Amazon SageMaker can be used for custom model training and inferencing.

Conclusion

In this blog post, we covered a recommended architecture to enable near-real time telemetry analytics for data pipelines, anomaly detection, notification, and resolution. This provides resiliency to the customer applications by proactively identifying and resolving issues.

Content Repository for Unstructured Data with Multilingual Semantic Search: Part 1

Post Syndicated from Patrik Nagel original https://aws.amazon.com/blogs/architecture/content-repository-for-unstructured-data-with-multilingual-semantic-search-part-1/

Unstructured data can make up as much as 80 percent of the data in the day-to-day business of financial organizations. For example, these organizations typically store and read PDFs and images for claim processing, underwriting, and know your customer (KYC) processes. Organizations need to make this ingested data accessible and searchable across different entities while logically separating data access according to role requirements.

In this two-part series, we use AWS services to build an end-to-end content repository for storing and processing unstructured data with the following features:

  • Dynamic access control-based logic over unstructured data
  • Multilingual semantic search capabilities

In part 1, we build the architectural foundation for the content repository, including the resource access control logic and a web UI to upload and list documents.

Solution overview

The content repository includes four building blocks:

Frontend and interaction: For this function, we use AWS Amplify, which is a set of purpose-built tools and features to help frontend web and mobile developers quickly build full-stack applications on AWS. The React application uses the AWS Amplify authentication feature to quickly set up a complete authentication flow integrated into Amazon Cognito. Amplify also hosts the frontend application.

Authentication and authorization: Implementing dynamic resource access control with a combination of roles and attributes is fundamental to your content repository security. Amazon Cognito provides a managed, scalable user directory, user sign-up and sign-in flows, and federation capabilities through third-party identity providers. We use Amazon Cognito user pools as the source of user identity for the content repository. You can work with user pool groups to represent different types of user collections, and you can manage their permissions using a group-associated AWS Identity and Access Management (IAM) role.

Users authenticate against the Amazon Cognito user pool. The web app will exchange the user pool tokens for AWS credentials through an Amazon Cognito identity pool in the content repository. You can complement the IAM role-based authorization model by mapping your relevant attributes to principal tags that will be evaluated as part of IAM permission policies. This allows a dynamic and flexible authorization strategy. For use cases that need federation with third-party identity providers, you can base your user collection on existing user group attributes, such as Active Directory group membership.

Backend and business logic: Authenticated users are redirected to Amazon API Gateway. API Gateway provides managed publishing for application programming interfaces (APIs) that act as the repository’s “front door.” API Gateway also interacts with the repository’s backend through RESTful APIs. This makes the business logic of the content repository extensible for future use cases, such as transcription and translation. We use AWS Lambda as a serverless, event-driven compute service to run specific business logic code, such as uploading a document to the content repository.

Content storage: Amazon Simple Storage Service (Amazon S3) provides virtually unlimited scalability and high durability. With Amazon S3, you can cost-effectively store unstructured documents in their native formats and make them accessible in a secure and scalable way. Enriching the uploaded documents with tags simplifies data governance with fine-grained access control.

Technical architecture

The technical architecture of the content repository with these four components can be found in Figure 1.

Technical architecture of the content repository

Figure 1. Technical architecture of the content repository

Let’s explore the architecture step by step.

  1. The frontend uses the Amplify JS library to add the authentication UI component to your React app, allowing authenticated users to sign in.
  2. Once the user provides their sign-in credentials, they are redirected to Amazon Cognito user pools to be authenticated.
  3. Once the authentication is successful, Amazon Cognito invokes a pre-token generation Lambda function. This function customizes the identity (ID) token with a new claim called department. This new claim is the Amazon Cognito group name from the cognito:preferred_role claim (a sketch of such a function follows this list).
  4. Amazon Cognito returns the identity, access, and refresh token in JSON format to the frontend.
  5. The Amplify client library stores the tokens and handles refreshes using the refresh token while the React frontend application calls the API Gateway with the ID token. Note: Usually, you would use the access token to grant access to authorized resources. For this architecture, we use the ID token because we have enriched it with the custom claim during step 3.
  6. API Gateway uses its native integration with Amazon Cognito and validates the ID token’s signature and expiration using Amazon Cognito user pool authorizer. For more complex authorization scenarios, you can use API Gateway Lambda authorizer with the AWS JSON Web Token (JWT) Verify library for verifying JWTs signed by Amazon Cognito.
  7. After successful validation, API Gateway passes the ID token to the backend Lambda function, which can verify it and authorize requests based on its claims for access control.
  8. Upon document upload action, the backend Lambda function calls the Amazon Cognito identity pool to exchange the ID token for the temporary AWS credentials associated with the cognito:preferred_role claim.
  9. The document upload Lambda function returns a pre-signed URL with the custom department claim in the Amazon S3 path prefix as well as the object tag. The Amazon S3 pre-signed URL is used for the document upload from the frontend application directly to Amazon S3.
  10. Upon document list action, similar to step 8, the backend Lambda function exchanges the ID token for the temporary AWS credentials. The Lambda function returns only the documents based on the user’s preferred group and associated custom department claim.
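As an illustration of step 3, a pre-token generation Lambda function that injects the department claim could look like the following TypeScript sketch. The claim-derivation logic is an assumption, not the repository’s exact code.

// Hypothetical pre-token generation trigger for the Amazon Cognito user pool.
export const handler = async (event: any) => {
  // Assumption: the user's Cognito group name doubles as the department value.
  const groups: string[] =
    event.request.groupConfiguration?.groupsToOverride ?? [];

  event.response = {
    claimsOverrideDetails: {
      claimsToAddOrOverride: {
        // Surfaced to the backend in the ID token as the "department" claim.
        department: groups[0] ?? 'unassigned',
      },
    },
  };
  return event;
};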

Prerequisites

You must have the following prerequisites for this solution:

Walkthrough

Setup

The following steps will deploy two AWS CDK stacks into your AWS account:

  • content-repo-stack (blog-content-repo-stack.ts) creates the environment detailed in Figure 1.
  • demo-data-stack (userpool-demo-data-stack.ts) deploys sample users, groups, and role mappings.

To continue setup, use the following commands:

  1. Clone the project git repository:
    git clone https://github.com/aws-samples/content-repository-with-dynamic-access-control content-repository
  2. Install the necessary dependencies:
    cd content-repository/backend-cdk
    npm install
    
  3. Configure environment variables:
    export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query 'Account' --output text)
    export CDK_DEFAULT_REGION=$(aws configure get region)
    
  4. Bootstrap your account for AWS CDK usage:
    cdk bootstrap aws://$CDK_DEFAULT_ACCOUNT/$CDK_DEFAULT_REGION
  5. Deploy the code to your AWS account:
    cdk deploy --all

Using the repository

Once you deploy the CDK stacks in your AWS account, follow these steps:

1. Access the frontend application:

a. Copy the amplifyHostedAppUrl value shown in the AWS CDK output from the content-repo-stack.

b. Use the URL with your web browser to access the frontend application.

c. A temporary page displays until the automated build and deployment of the React application completes after 4-5 minutes.

2. Application sign-in and role-based access control (RBAC):

a. The React webpage prompts you to sign in and then change the temporary password.

b. The content repository provides two demo users with credentials as part of the demo-data-stack in the AWS CDK output. In this walkthrough, we use the sales-user user, which belongs to the sales department group, to validate RBAC.

3. Upload a document to the content repository:

a. Authenticate as sales-user.

b. Select upload to upload your first document to the content repository.

c. The repository provides sample documents in the assets sub-folder.

4. List your uploaded document:

a. Select list to show the uploaded sales content.

b. To verify the dynamic access control, repeat steps 2 and 3 for the marketing-user user, which belongs to the marketing department group.

c. Sign in to the AWS Management Console and navigate to the Amazon S3 bucket with the prefix content-repo-stack-s3sourcebucket to confirm that all the uploaded content exists.

Implementation notes

Frontend deployment and cross-origin access

The content-repo-stack contains an AwsCustomResource construct. This construct uses the Amplify API to start the release job of the Amplify hosted frontend application. The preBuild step of the Amplify application build specification dynamically configures its backend for the Amazon Cognito-based authentication. The required Amazon Cognito configuration parameters are retrieved from the AWS Systems Manager Parameter Store during build time. Similarly, the Amplify application postBuild step updates the Amazon S3 cross-origin resource sharing (CORS) rule for the Amazon S3 bucket to only allow cross-origin access from the Amplify-hosted URL of the frontend application.
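The effect of that postBuild CORS update is roughly equivalent to the following AWS CDK sketch; the bucket variable and the Amplify origin URL are placeholders, since the real values are resolved at deployment and build time.

import * as s3 from 'aws-cdk-lib/aws-s3';

// Restrict cross-origin access on the source bucket to the Amplify-hosted frontend.
sourceBucket.addCorsRule({
  allowedOrigins: ['https://main.d1example2345.amplifyapp.com'], // placeholder URL
  allowedMethods: [s3.HttpMethods.GET, s3.HttpMethods.PUT, s3.HttpMethods.POST],
  allowedHeaders: ['*'],
});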

Application sign-in and access control

The Amazon Cognito identity pool configuration is set to Choose role from token for authenticated users, as in Figure 2. This setup permits authenticated users to pass the roles in the ID token that the Amazon Cognito user pool assigned. Backend Lambda functions use the roles that appear in the cognito:roles and cognito:preferred_role claims in the ID token for RBAC.

Amazon Cognito identity pool configuration – using tokens to assign roles to authenticated users

Figure 2. Amazon Cognito identity pool configuration – using tokens to assign roles to authenticated users

In the attributes for access control section, we configured a custom mapping from the augmented department token claim to a tag key, as in Figure 3. The backend logic uses the tag key to match the PrincipalTag condition in IAM policies to control access to AWS resources.

Amazon Cognito identity pool configuration – custom mapping from claim names to tag keys

Figure 3. Amazon Cognito identity pool configuration – custom mapping from claim names to tag keys

Document upload

The presigned_url.py Lambda function generates a pre-signed Amazon S3 URL using the department token claim as the key. This function automatically organizes the uploaded document into a logical structure in the Amazon S3 source bucket. Accordingly, the cognito:preferred_role used for the Amazon S3 client credentials in the Lambda function has a permission policy using the PrincipalTag department to dynamically limit access to the Amazon S3 key, as in Figure 4.

Permission policy using PrincipalTag to upload documents to Amazon S3

Figure 4. Permission policy using PrincipalTag to upload documents to Amazon S3
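For orientation, here is a minimal sketch of that pre-signed URL generation, written in TypeScript rather than the repository’s Python presigned_url.py; the bucket environment variable and helper name are illustrative assumptions.

import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';
import { getSignedUrl } from '@aws-sdk/s3-request-presigner';

// Assumption: "department" was read from the validated ID token, and the S3 client
// carries the temporary credentials obtained from the identity pool exchange.
export async function createUploadUrl(
  s3: S3Client,
  department: string,
  fileName: string,
): Promise<string> {
  const command = new PutObjectCommand({
    Bucket: process.env.SOURCE_BUCKET,    // assumption: injected by the CDK stack
    Key: `${department}/${fileName}`,     // the department claim becomes the S3 prefix
    Tagging: `department=${department}`,  // object tag used for fine-grained access control
  });
  // Short-lived URL; the frontend uploads directly to Amazon S3 with it
  // (the uploader must send the matching x-amz-tagging header).
  return getSignedUrl(s3, command, { expiresIn: 300 });
}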

Document listing

The list functionality only shows the uploaded content belonging to the preferred group of the authenticated Amazon Cognito user pool user. To only list the files that a specific user (for example, sales-user) has access to, use the s3:prefix condition with the PrincipalTag session tag, as in Figure 5.

Permission policy using s3:prefix condition with session tags to list documents

Figure 5. Permission policy using s3:prefix condition with session tags to list documents
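Expressed as an AWS CDK policy statement, the condition shown in Figure 5 is roughly the following sketch (the bucket variable is assumed to be in scope):

import * as iam from 'aws-cdk-lib/aws-iam';

// Allow listing only under the caller's department prefix. The "department" tag key
// matches the principal-tag mapping configured in the identity pool (Figure 3).
const listByDepartment = new iam.PolicyStatement({
  actions: ['s3:ListBucket'],
  resources: [sourceBucket.bucketArn],
  conditions: {
    StringLike: {
      's3:prefix': ['${aws:PrincipalTag/department}/*'],
    },
  },
});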

Cleanup

In the backend-cdk subdirectory, delete the deployed resources:

cdk destroy --all

Conclusion

In this blog, we demonstrated how to build a content repository with an easy-to-use web application for unstructured data that ingests documents while maintaining dynamic access control for users within departments. These steps provide a foundation for building your own content repository to store and process documents. As next steps, based on your organization’s security requirements, you can implement more complex access control use cases by combining IAM roles and principal tags. For example, you can use Amazon Cognito user pool custom attributes for additional dimensions such as document “clearance” with optional modification in the pre-token generation Lambda.

In the next part of this blog series, we will enrich the content repository with multilingual semantic search features while maintaining the access control fundamentals we’ve already implemented. For additional information on how you can build a solution to search for information across multiple scanned documents, PDFs, and images with compliance capabilities, please refer to our Document Understanding Solution from the AWS Solutions Library.

Introducing Client-side Evaluation for Amazon CloudWatch Evidently

Post Syndicated from Cole Thienes original https://aws.amazon.com/blogs/architecture/introducing-client-side-evaluation-for-amazon-cloudwatch-evidently/

Amazon CloudWatch Evidently enables developers to test new features on a small percentage of traffic and gauge the outcome before rolling it out to the rest of their users. Evidently feature flags are defined ahead of your release and, at runtime, your application code queries a remote service to determine whether to show the new feature to a given user. The remote call to fetch feature flags for a user is susceptible to network latency, adding several hundred milliseconds of delay in bad cases. Any additional latency added to fetching feature flags can directly impact the speed of a web page, where milliseconds matter. Our solution: client-side evaluation for Amazon CloudWatch Evidently. With client-side evaluation, developers can significantly decrease latency by fetching feature flags locally and avoiding network overhead altogether.

The term “client-side” does not refer to the browser in this case, but the operation taking place on your container application rather than through a remote API call. This removes the need for a network call by fetching feature flags for a user from the AWS AppConfig agent—a sidecar container running alongside your container application backend. The agent enables container runtimes to leverage AWS AppConfig, a service allowing customers to change the way an application behaves while running without deploying new code. In this post, we’ll walk through the solution architecture and how to instrument client-side evaluation in an Amazon Elastic Container Service (Amazon ECS) application.

Overview of solution

Figure 1 illustrates how client-side evaluation works in an application running on Amazon ECS. The webpage makes a call to the webpage backend to determine which website features to show an end-user. Let’s explore how this works.

High-level architecture of Client-side Evaluation for Amazon CloudWatch Evidently

Figure 1. High-level architecture of client-side evaluation for Amazon CloudWatch Evidently

  1. Create an Evidently project, feature, and launch through the AWS Management Console, API, or AWS CloudFormation.
  2. Create an Amazon ECS task for the backend application container and attach an AWS AppConfig agent container to the task. At runtime, the application container will invoke the EvaluateFeature API to fetch feature flags. Without client-side evaluation, this API call would perform a remote call to the Evidently cloud service.
  3. With client-side evaluation, the API call is made from the application container to the AWS AppConfig agent container on localhost, short-circuiting the network overhead.
  4. Evidently maintains a synchronized copy of the Evidently features in an AWS AppConfig configuration within your AWS account. When subsequent changes are made to the features, the configuration is updated (usually within a minute).
  5. When the backend application initializes, the agent fetches the necessary configuration profile and caches it, polling periodically to refresh the cache. When the AWS AppConfig agent is invoked from the backend application, it evaluates the requested feature flag using the cached data.
  6. After each successful EvaluateFeature call, a transaction record is generated, called an evaluation event. This useful bookkeeping mechanism helps developers view data to tell which of their users saw what feature and when. As the evaluation events are generated, they are placed in a buffer within the agent. Once the buffer reaches a certain size or age, events in the buffer are uploaded to Evidently via the PutProjectEvents API.
  7. The evaluation events are then available for offline analysis in developer-configured storage, including CloudWatch Logs, Amazon Simple Storage Service (Amazon S3), and CloudWatch Metrics.

Walkthrough

Let’s take a practical example to demonstrate client-side evaluation. I have a simple webpage with a search bar on it. I’ve implemented a newer, fancier search bar, but I only want to show it to 10 percent of my visitors to make sure it doesn’t cause issues on my existing webpage before rolling it out to everyone, as in Figure 2.

Web page experience where 10% of users see the new search bar

Figure 2. A webpage experience where 10% of users see the new search bar

We could set up the necessary AWS resources by hand, but let’s use a pre-built AWS Cloud Development Kit (AWS CDK) example to save time. The sample code for this example is available on GitHub. Here are the high-level steps:

  1. Provision the infrastructure. The infrastructure will consist of:
    • An Amazon ECS service within an Amazon Virtual Private Cloud (Amazon VPC) to serve as our backend application and return the search bar variation
    • An Evidently launch to split the traffic between the two search bars
    • An AWS AppConfig environment, which the AWS AppConfig agent will fetch Evidently data from
  2. Test our webpage. Once our code is deployed, we will visit our webpage to fetch feature flags using client-side evaluation.
  3. Clean up by removing our infrastructure.

Prerequisites

Steps

Clone the repository

First, clone the official AWS CDK example repository:

git clone https://github.com/aws-samples/aws-cdk-examples

The repository has many examples for setting up AWS infrastructure in CDK. Let’s go to a client-side evaluation example.

Code explanation

Let’s take a look at the code example. When we visit our web page, a request is routed to our application deployed on AWS Fargate, which allows us to run containers directly using Amazon ECS without having to manage Amazon Elastic Compute Cloud (Amazon EC2) instances. The application code is written in TypeScript on Node.js and leverages the Express framework:

// local-image/app.ts

import * as express from 'express';
import {Evidently} from '@aws-sdk/client-evidently';

const app = express();

const evidently = new Evidently({
    endpoint: 'http://localhost:2772',
    disableHostPrefix: true
});

app.get("/", async (_, res) => {
    try {
        console.time('latency')
        const evaluation = await evidently.evaluateFeature({
            project: 'WebPage',
            feature: 'SearchBar',
            entityId: 'WebPageVisitor43'
        })
        console.timeEnd('latency')
        res.send(evaluation.variation)
    } catch (err: any) {
        console.timeEnd('latency')
        res.send(err.toString())
    }
});

The container application will invoke the EvaluateFeature API using the AWS SDK for JavaScript and return the search bar variation, either the old or new search bar. Here we also log the latency of the operation. The EvaluateFeature request is forwarded to the endpoint we configure for the Evidently client: http://localhost:2772. This is the local address where the AWS AppConfig agent can be reached. To make this possible, we add the AWS AppConfig agent as a container to the Amazon ECS task definition:

// index.ts

service.taskDefinition.addContainer('AppConfigAgent', {
    image: ecs.ContainerImage.fromRegistry('public.ecr.aws/aws-appconfig/aws-appconfig-agent:2.x'),
    portMappings: [{
        containerPort: 2772
    }]
})

We also need to set up an AppConfig environment for the Evidently project. This tells Evidently where to create the configuration to keep a synchronized copy of the features in the project:

// index.ts

const application = new appconfig.CfnApplication(this,'AppConfigApplication', {
    name: 'MyApplication'
});

const environment = new appconfig.CfnEnvironment(this, 'AppConfigEnvironment', {
    applicationId: application.ref,
    name: 'MyEnvironment'
});

const project = new evidently.CfnProject(this, 'EvidentlyProject', {
    name: 'WebPage',
    appConfigResource: {
        applicationId: application.ref,
        environmentId: environment.ref
    }
});

Finally, we set up an Evidently feature and a launch that ensures only 10 percent of traffic receives the new search bar. The feature definition is not part of the excerpt below, so here is a minimal sketch of what it could look like; the variation constants and string values are assumptions, chosen to match the oldSearchBar output seen later in this walkthrough.
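// index.ts

const OLD_SEARCH_BAR = 'oldSearchBar';
const NEW_SEARCH_BAR = 'newSearchBar';

const feature = new evidently.CfnFeature(this, 'EvidentlyFeature', {
  project: project.name,
  name: 'SearchBar',
  variations: [
    { variationName: OLD_SEARCH_BAR, stringValue: OLD_SEARCH_BAR },
    { variationName: NEW_SEARCH_BAR, stringValue: NEW_SEARCH_BAR }
  ],
  defaultVariation: OLD_SEARCH_BAR
});

The launch references this feature and splits traffic between the two variations: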

// index.ts

const launch = new evidently.CfnLaunch(this, 'EvidentlyLaunch', {
  project: project.name,
  name: 'MyLaunch',
  executionStatus: {
    status: 'START'
  },
  groups: [
    {
      feature: feature.name,
      variation: OLD_SEARCH_BAR,
      groupName: OLD_SEARCH_BAR
    },
    {
      feature: feature.name,
      variation: NEW_SEARCH_BAR,
      groupName: NEW_SEARCH_BAR
    }
  ],
  scheduledSplitsConfig: [{
    startTime: '2022-01-01T00:00:00Z',
    groupWeights: [
      {
        groupName: OLD_SEARCH_BAR,
        splitWeight: 90000
      },
      {
        groupName: NEW_SEARCH_BAR,
        splitWeight: 10000
      }
    ]
  }]
})

We start the launch immediately by setting executionStatus to START and startTime to a timestamp in the past. If you want to wait to show the new search bar, you can specify a future start time.

Install dependencies

Install the necessary Node modules:

npm install

Build and deploy

Build the AWS CDK template from the source code:

npm run build

Before deploying the app:

  1. Ensure that AWS credentials are set up in your environment.
  2. Ensure that the AWS CDK Toolkit is bootstrapped in your AWS account.
  3. Confirm that the maximum number of VPCs has not been reached in your AWS account; the Amazon ECS service will deploy an Amazon VPC.

After that, you can deploy the AWS CDK template to your AWS account:

cdk deploy

Test the webpage

After the previous step, you should see the following output in your console:

Console output showing a successful CDK deployment

Figure 3. Console output showing a successful AWS CDK deployment

In your browser, visit the URL specified by the FargateServiceServiceURL output above and you will see oldSearchBar. We can view the application logs from the Amazon ECS task in CloudWatch Logs. Go to the AWS console, open the CloudWatch log groups page, and select the log group with the prefix EvidentlyClientSideEvaluationEcs. There, we can see that fetching feature flags took under two milliseconds, as in Figure 4.

CloudWatch Logs for the Amazon ECS task showing an EvaluateFeature latency of 1.238 milliseconds

Figure 4. CloudWatch Logs for the Amazon ECS task showing an EvaluateFeature latency of 1.238 milliseconds

Additionally, we can see how visitors have seen each version of the search bar. On the AWS console, visit the CloudWatch metrics page and see the Evidently metrics under All > Evidently > Feature, Project, Variation, as in Figure 5:

CloudWatch metrics showing the VariationCount: the number of times each feature flag variation was fetched

Figure 5. CloudWatch metrics showing the VariationCount (the number of times each feature flag variation was fetched)

We can increase or decrease the percentage of visitors seeing the new search bar at any time. On the AWS console, go to the CloudWatch Evidently page and go to Projects > WebPage > Launches > MyLaunch > Modify launch traffic and adjust the Traffic percentage, as in Figure 6.

Adjusting the traffic percentage of an Evidently launch

Figure 6. Adjusting the traffic percentage of an Evidently launch

Cleaning up

To avoid incurring future charges, delete the resources. Let’s run:

cdk destroy

You can verify the removal by going into the AWS CloudFormation console and confirming that the resources were deleted.

Conclusion

In this blog post, we learned how to set up a web page backend in Amazon ECS with client-side evaluation for Amazon CloudWatch Evidently. We easily deployed the example CloudFormation stack with AWS CDK Toolkit. Then we visited the example webpage and demonstrated the improved speed of fetching feature flags with client-side evaluation. If you’re interested in using client-side evaluation with AWS Lambda instead of Amazon ECS, check out this AWS CDK example.

Architecting for Sustainability at AWS re:Invent 2022

Post Syndicated from Thomas Burns original https://aws.amazon.com/blogs/architecture/architecting-for-sustainability-at-aws-reinvent-2022/

AWS re:Invent 2022 featured 24 breakout sessions, chalk talks, and workshops on sustainability. In this blog post, we’ll highlight the sessions and announcements and discuss their relevance to the sustainability of, in, and through the cloud.

First, we’ll look at AWS’ initiatives and progress toward delivering efficient, shared infrastructure, water stewardship, and sourcing renewable power.

We’ll then summarize breakout sessions featuring AWS customers who are demonstrating the best practices from the AWS Well-Architected Framework Sustainability Pillar.

Lastly, we’ll highlight use cases presented by customers who are solving sustainability challenges through the cloud.

Sustainability of the cloud

The re:Invent 2022 Sustainability in AWS global infrastructure (SUS204) session is a deep dive on AWS’ initiatives to optimize data centers to minimize their environmental impact. These increases in efficiency provide carbon reduction opportunities to customers who migrate workloads to the cloud. Amazon’s progress includes:

  • Amazon is on a path to power its operations with 100% renewable energy by 2025, five years ahead of the original target of 2030.
  • Amazon is the largest corporate purchaser of renewable energy with more than 400 projects globally, including recently announced projects in India, Canada, and Singapore. Once operational, the global renewable energy projects are expected to generate 56,881 gigawatt-hours (GWh) of clean energy each year.

At re:Invent, AWS announced that it will become water positive (Water+) by 2030. This means that AWS will return more water to communities than it uses in direct operations. The Water stewardship and renewable energy at scale (SUS211) session provides an excellent overview of our commitment. For more details, explore the Water Positive Methodology that governs implementation of AWS’ water positive goal, including the approach and measurement of progress.

Sustainability in the cloud

Independent of AWS efforts to make the cloud more sustainable, customers continue to influence the environmental impact of their workloads through the architectural choices they make. This is what we call sustainability in the cloud.

At re:Invent 2021, AWS launched the sixth pillar of the AWS Well-Architected Framework to explain the concepts, architectural patterns, and best practices to architect sustainably. In 2022, we extended the Sustainability Pillar best practices with a more comprehensive structure of anti-patterns to avoid, expected benefits, and implementation guidance.

Let’s explore sessions that show the Sustainability Pillar in practice. In the session Architecting sustainably and reducing your AWS carbon footprint (SUS205), Elliot Nash, Senior Manager of Software Development at Amazon Prime Video, dives deep on the exclusive streaming of Thursday Night Football on Prime Video. The teams followed the Sustainability Pillar’s improvement process from setting goals to replicating the successes to other teams. Implemented improvements include:

  • Automation of contingency switches that turn off non-critical customer features under stress to flatten demand peaks
  • Pre-caching content shown to the whole audience at the end of the game

Amazon Prime Video uses the AWS Customer Carbon Footprint Tool along with sustainability proxy metrics and key performance indicators (KPIs) to quantify and track the effectiveness of optimizations. Example KPIs are normalized Amazon Elastic Compute Cloud (Amazon EC2) instance hours per page impression or infrastructure cost per concurrent stream.

Another example of sustainability KPIs was presented in the Build a cost-, energy-, and resource-efficient compute environment (CMP204) session by Troy Gasaway, Vice President of Infrastructure and Engineering at Arm—a global semiconductor industry leader. Troy’s team wanted to measure, track, and reduce the impact of Electronic Design Automation (EDA) jobs. They used Amazon EC2 instances’ vCPU hours to calculate KPIs for Amazon EC2 Spot adoption, AWS Graviton adoption, and the resources needed per job.

The Sustainability Pillar recommends selecting Amazon EC2 instance types with the least impact and taking advantage of those designed to support specific workloads. The Sustainability and AWS silicon (SUS206) session gives an overview of the embodied carbon and energy consumption of silicon devices. The session highlights examples in which AWS silicon reduced the power consumption for machine learning (ML) inference with AWS Inferentia by 92 percent, and model training with AWS Trainium by 52 percent. Two effects contributed to the reduction in power consumption:

  • Purpose-built processors use less energy for the job
  • Due to better performance, fewer instances are needed

David Chaiken, Chief Architect at Pinterest, shared Pinterest’s sustainability journey and how they complemented rigorous cost and usage management for ML workloads with data from the AWS Customer Carbon Footprint Tool, as in Figure 1.

David Chaiken, Chief Architect at Pinterest, describes Pinterest’s sustainability journey with AWS

Figure 1. David Chaiken, Chief Architect at Pinterest, describes Pinterest’s sustainability journey with AWS

AWS announced the preview of a new generation of AWS Inferentia with the Inf2 instances, as well as new C7gn instances. C7gn instances utilize the fifth generation of AWS Nitro cards. AWS Nitro offloads the host CPU to specialized hardware for more consistent performance with lower CPU utilization. The new Nitro cards offer 40 percent better performance per watt than the previous generation.

Another best practice from the Sustainability Pillar is to use managed services. AWS is responsible for a large share of the optimization for resource efficiency for AWS managed services. We want to highlight the launch of AWS Verified Access. Traditionally, customers protect internal services from unauthorized access by placing resources into private subnets accessible through a Virtual Private Network (VPN). This often involves dedicated on-premises infrastructure that is provisioned to handle peak network usage of the staff. AWS Verified Access removes the need for a VPN. It shifts the responsibility for managing the hardware to securely access corporate applications to AWS and even improves your security posture. The service is built on AWS Zero Trust guiding principles and validates each application request before granting access. Explore the Introducing AWS Verified Access: Secure connections to your apps (NET214) session for demos and more.

In the session Provision and scale OpenSearch resources with serverless (ANT221) we announced the availability of Amazon OpenSearch Serverless. By decoupling compute and storage, OpenSearch Serverless scales resources in and out for both indexing and searching independently. This feature supports two key sustainability in the cloud design principles from the Sustainability Pillar out of the box:

  1. Maximizing utilization
  2. Scaling the infrastructure with user load

Sustainability through the cloud

Sustainability challenges are data problems that can be solved through the cloud with big data, analytics, and ML.

According to one study by PEDCA research, data centers in the EU consume approximately 3 percent of the energy generated in the EU. While it’s important to optimize IT for sustainability, we must also pay attention to reducing the other 97 percent of energy usage.

The session Serve your customers better with AWS Supply Chain (BIZ213) introduces AWS Supply Chain that generates insights into the data from your suppliers and your network to forecast and mitigate inventory risks. This service provides recommendations for stock rebalancing scored by distance to move inventory, risks, and also an estimation of the carbon emission impact.

The Easily build, train, and deploy ML models using geospatial data (AIM218) session introduces new Amazon SageMaker geospatial capabilities to analyze satellite images for forest density and land use changes and observe supply chain impacts. The AWS Solutions Library contains dedicated Guidance for Geospatial Insights for Sustainability on AWS with example code.

Some other examples for driving sustainability through the cloud as covered at re:Invent 2022 include these sessions:

Conclusion

We recommend revisiting the talks highlighted in this post to learn how you can utilize AWS to enhance your sustainability strategy. You can find all videos from the AWS re:Invent 2022 sustainability track in the Customer Enablement playlist. If you’d like to optimize your workloads on AWS for sustainability, visit the AWS Well-Architected Sustainability Pillar.

Let’s Architect! Architecture tools

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-architecture-tools/

Tools, such as diagramming software, low-code applications, and frameworks, make it possible to experiment quickly. They are essential in today’s fast-paced and technology-driven world. From improving efficiency and accuracy, to enhancing collaboration and creativity, a well-defined set of tools can make a significant impact on the quality and success of a project in the area of software architecture.

As an architect, you can take advantage of a wide range of resources to help you build solutions that meet the needs of your organization. For example, with tools such as the Amazon Web Services (AWS) Solutions Library and Serverless Land, you can boost your knowledge and productivity while working on event-driven architectures, microservices, and stateless computing.

In this Let’s Architect! edition, we explore how to incorporate these patterns into your architecture, and which tools to leverage to build solutions that are scalable, secure, and cost-effective.

How AWS Application Composer helps your team build great apps

In this re:Invent 2022 session, Chase Douglas, Principal Engineer at AWS, speaks about AWS Application Composer, a newly launched service.

This service has the potential to change the way architects design solutions—without writing a single line of code! The service is user-friendly, intuitive, and requires no prior coding experience. It allows users to scaffold a serverless architecture, defining a CloudFormation template visually with drag-and-drop. A detailed AWS Compute Blog post takes readers through the process of using AWS Application Composer.

Take me to this re:Invent 2022 video!

How an architecture can be designed with AWS Application Composer

How an architecture can be designed with AWS Application Composer

AWS design + build tools

When migrating to the cloud, we suggest referencing these four tried-and-true AWS resources that can be used to design and build projects.

  1. AWS Workshops are created by AWS teams to provide opportunities for hands-on learning to develop practical skills. Workshops are available in multiple categories and for skill levels 100-400.
  2. AWS Architecture Center contains a collection of best practices and architectural patterns for designing and deploying cloud-based solutions using AWS services. Furthermore, it includes detailed architecture diagrams, whitepapers, case studies, and other resources that provide a wealth of information on how to design and implement cloud solutions.
  3. Serverless Land (an Amazon property) brings together various patterns, workflows, code snippets, and blog posts pertaining to AWS serverless architectures.
  4. AWS Solutions Library provides customers with templates, tools, and automated workflows to easily deploy, operate, and manage common use cases on the AWS Cloud.

Inside event-driven architectures designed by David Boyne on Serverless Land

Inside event-driven architectures designed by David Boyne on Serverless Land

The Well-Architected way

In this session, the AWS Well-Architected team provides guidance on how to implement the architectural models described in the AWS Well-Architected Framework within your organization at scale.

Discover a customer story and understand how to use the features of the AWS Well-Architected Tool and APIs to receive recommendations based on your workload and measure your architectural metrics. In the Framework whitepaper, you can explore the six pillars of Well-Architected (operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability) and best practices to achieve them.

Understanding the key design pillars can help architects make informed design decisions, leading to more robust and efficient solutions. This knowledge also enables architects to identify potential problems early on in the design process and find appropriate patterns to address those issues.

Take me to the Well-Architected video!

Discover how the AWS Well-Architected Framework can help you design scalable, maintainable, and reusable solutions

Discover how the AWS Well-Architected Framework can help you design scalable, maintainable, and reusable solutions

See you next time!

Thanks for exploring architecture tools and resources with us!

Join us next time when we’ll talk about data mesh architecture!

To find all the posts from this series, check out the Let’s Architect! page of the AWS Architecture Blog.

Detecting solar panel damage with Amazon Rekognition Custom Labels

Post Syndicated from Ramakant Joshi original https://aws.amazon.com/blogs/architecture/detecting-solar-panel-damage-with-amazon-rekognition-custom-labels/

Enterprises perform quality control to ensure products meet production standards and avoid potential brand reputation damage. As the cost of sensors decreases and connectivity increases, industries adopt real-time imagery analysis to detect quality issues.

At the same time, artificial intelligence (AI) advancements enable advanced automation, reduce overall cost and project time, and produce accurate defect detection results in manufacturing plants. As these technologies mature, AI-driven inspections are becoming more common outside of the plant environment.

Overview of solution

This post describes our SOLVED (Solar Roving Eye Detector) project leveraging machine learning (ML) to identify damaged solar panels using Amazon Rekognition Custom Labels and alert operators to take corrective action.

As solar adoption increases, so does the need to detect panel damage. Applying AWS-managed AI services is a simpler, more cost-effective approach than human solar panel inspection or custom-built production applications.

Customers can capture and process videos from the field and build effective computer vision models without creating a dedicated data science team. This approach can be generalized for use cases across industries to detect defects in wind turbines, cell phone towers, automotive parts, and other field components.

Amazon Rekognition Custom Labels builds on existing service capabilities already trained to identify the objects and scenes in millions of cross-category images. You upload a small set of training images (typically a few hundred or fewer) into the console. The solution automatically loads and inspects the training data, selects the right ML algorithms, trains a model, and provides model performance metrics. You can then integrate your custom model into your applications through the Amazon Rekognition Custom Labels API.
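Once a model version is trained and started, inference is a single API call per image. Here is a hedged TypeScript sketch using the AWS SDK for JavaScript; the project version ARN, helper name, and confidence threshold are placeholders, not values from the project.

import {
  RekognitionClient,
  DetectCustomLabelsCommand,
} from '@aws-sdk/client-rekognition';

const rekognition = new RekognitionClient({});

// Placeholder ARN; copy the real project version ARN from the Rekognition console.
const MODEL_ARN =
  'arn:aws:rekognition:us-east-1:111122223333:project/solved/version/solved.2021/1';

export async function classifyImage(bucket: string, key: string) {
  const response = await rekognition.send(
    new DetectCustomLabelsCommand({
      ProjectVersionArn: MODEL_ARN,
      Image: { S3Object: { Bucket: bucket, Name: key } },
      MinConfidence: 70, // assumption: ignore low-confidence predictions
    }),
  );
  // Each custom label carries a name (for example, "good" or "defective") and a confidence score.
  return response.CustomLabels ?? [];
}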

Walkthrough

This post introduces the SOLVED project featured at the re:Invent 2021 Builders Fair. It will:

  • Review the need for solar panel damage detection
  • Discuss a cloud-based approach to ingest, store, process, analyze, and detect damaged solar panels
  • Present a diagram streaming videos from a Raspberry Pi, storing them on Amazon Simple Storage Service (Amazon S3), processing them using an AWS video-on-demand solution, and inferring damage using Amazon Rekognition
  • Introduce a console to mimic an operation center for appropriate action
  • Demonstrate the integration of AWS IoT Core with a Philips Hue bulb for operator alerts

Prerequisites

Before getting started, review the following prerequisites for this solution:

The SOLVED project

The SOLVED project leverages ML to identify damaged solar panels using Amazon Rekognition Custom Labels. It involves four steps:

  1. Data ingestion: Live solar panel video is ingested from a moving rover into an Amazon S3 bucket
  2. Pre-processing: Captured video is split into thumbnail images
  3. Processing and visualization: ML models make real-time inferences to identify defective panels, with a dashboard to review images and prediction scores
  4. Alerting: Defective panels result in notifications sent through MQTT messages to light a smart bulb

Figure 1 shows the SOLVED project system architecture.

The SOLVED project system architecture

Figure 1. The SOLVED project system architecture

Installation steps

Let’s review each of the steps in this use case.

Data ingestion

The data ingestion layer of the SOLVED project consists of a continuous video stream captured as a rover moves through a field of solar panels.

We used a Freenove 4WD Smart Car rover with Raspberry Pi. The mounted camera captures video as it moves through the field. We installed an Amazon Kinesis Video Streams Producer on the Pi and streamed the live video to a Kinesis Video Stream named reinventbuilder2021.

Figure 2 shows the Kinesis Video Stream setup window for reinventbuilder2021.

Kinesis Video Stream setup for reinventbuilder2021

Figure 2. Kinesis Video Stream setup for reinventbuilder2021

To start streaming, use the following steps.

  1. Create a new Kinesis Video Stream using this Amazon Kinesis Video Streams Developer Guide
  2. Make a note of the Amazon Resource Name (ARN)
  3. On the Pi, access the command prompt and use aws sts get-session-token for temporary credentials. The IAM user should have the permissions for Kinesis Video Streams PutMedia.
  4. Set the following environment variables:
    export AWS_DEFAULT_REGION="us-east-1"
    export AWS_ACCESS_KEY_ID="xxxxx"
    export AWS_SECRET_ACCESS_KEY="yyyyy"
    export AWS_SESSION_TOKEN="zzzzz"
  5. Start the streamer using the following command:
    cd ~/amazon-kinesis-video-streams-producer-sdk-cpp/build
    ./kvs_gstreamer_sample reinventbuilder2021
  6. Validate the captured stream by viewing the Media playback on the console.

Figure 3 shows the video stream console, including the Media playback option.

Video stream console with Media playback option

Figure 3. Video stream console with Media playback option

There are two ways to clip video snippets, which we’ll do next.

You can use the Download clip button on the video stream console as shown in Figure 4.

Choose your video streaming clip duration

Figure 4. Choose your video streaming clip duration

Alternately, you can use a script from the following command line:

# Capture the last 30 seconds of the stream (BSD date syntax on the Pi).
ONE_MIN_AGO=$(date -v -30S -u "+%FT%T+0000")
NOW=$(date -u "+%FT%T+0000")

FILE_NAME=reinventbuilder-solved-$RANDOM.mp4
echo $FILE_NAME
S3_PATH=s3://videoondemandsplitter-source-e6lyof9qjv1j/

# KVS_DATA_ENDPOINT is obtained beforehand, for example:
# aws kinesisvideo get-data-endpoint --stream-name reinventbuilder2021 --api-name GET_CLIP
aws kinesis-video-archived-media get-clip --endpoint-url $KVS_DATA_ENDPOINT \
--stream-name reinventbuilder2021 \
--clip-fragment-selector "FragmentSelectorType=SERVER_TIMESTAMP,TimestampRange={StartTimestamp=$ONE_MIN_AGO,EndTimestamp=$NOW}" \
$FILE_NAME

echo "Running get-clip for stream"

sleep 45

aws s3 cp $FILE_NAME $S3_PATH
echo "copying file $FILE_NAME TO $S3_PATH"

The clip is available in the Amazon S3 source folder created using AWS CloudFormation, as shown in Figure 5.

Access your clip in the Amazon S3 source folder

Figure 5. Access your clip in the Amazon S3 source folder

Pre-processing

To process the video, we leverage Video on Demand at AWS. This solution encodes video files with AWS Elemental MediaConvert. Out of the box, it:

1. Automatically transcodes videos uploaded to Amazon S3 into formats suitable for playback on a range of devices using MediaConvert
2. Customizes MediaConvert job settings by uploading a custom file and using different settings per input
3. Stores transcoded files in a destination Amazon S3 bucket and uses CloudFront to deliver them to end viewers
4. Provides outputs including input file metadata, job settings, and output details in addition to transcoded video. These outputs are stored in a separate JSON file, available for further processing

For our use case, we used the frame capture feature to create a set of thumbnails from the source videos. The thumbnails are stored in the Amazon S3 bucket with the video output.
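For reference, frame capture in MediaConvert is enabled through an output’s video codec settings. The fragment below is a sketch of the relevant job settings expressed as a TypeScript object; the capture rate and caps are assumptions, not the values used in the project.

// Hypothetical excerpt of MediaConvert job settings enabling frame capture.
const frameCaptureOutput = {
  ContainerSettings: { Container: 'RAW' },
  VideoDescription: {
    CodecSettings: {
      Codec: 'FRAME_CAPTURE',
      FrameCaptureSettings: {
        FramerateNumerator: 1,
        FramerateDenominator: 5, // one thumbnail every 5 seconds of video
        MaxCaptures: 100,        // stop after 100 thumbnails
        Quality: 80,
      },
    },
  },
};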

To deploy this solution, use the CloudFormation stack.

Processing and visualization

Every trained ML model requires quality training data. We began with publicly available solar panel images that were categorized as “good” or “defective” and uploaded the images to an Amazon S3 bucket into corresponding folders.

Next, we configured Amazon Rekognition Custom Labels with the folders to indicate the labels to use in training and deploying the model. Using the rover images, we tested the model.

We used the rover to record videos of good and damaged solar panels over an extended period and labeled the outcomes accordingly. The video was then split into individual frames using MediaConvert, giving us a well-labeled dataset that we trained our model with using Amazon Rekognition Custom Labels.

We used the model endpoint to infer outcomes on solar panels with varying damage footprints across multiple locations. AWS Elemental MediaConvert expedited the process of curating the training set, and creating the model and endpoint using Amazon Rekognition was straightforward.

As shown in Figure 6, we used a training set of 7,000 images with an even mix of good and damaged panels.

A training set of images

Figure 6. A training set of images

Examples of good panel images are depicted in Figure 7.

Good panel images

Figure 7. Good panel images

Examples of damaged panel images are depicted in Figure 8.

Damaged panel images

Figure 8. Damaged panel images

In this use case, 90 percent model accuracy was achieved.

To visualize the results, we leveraged AWS Amplify to provide an operator interface to identify the damaged panels.

Figure 9 shows screenshots from the operator dashboard with output from the Amazon Rekognition Custom Labels model for good and defective panels.

Operator dashboard in AWS Amplify

Figure 9. Operator dashboard in AWS Amplify

Alerting

Maintenance teams must be notified of defective panels to take corrective action. To create alerts, we configured AWS IoT Core to send MQTT messages to a Philips Hue smart bulb, with red bulbs indicating defective panels. To set up the Philips Hue API, use the How to develop for Hue guide.

For example, the following request changes the bulb color:

PUT https://192.xx.xx.xx/api/xxxxxxx/lights/1/state

A hue value of 20000 turns the bulb green:

{"on":true, "sat":254, "bri":254,"hue":20000}

A hue value of 1000 turns it red:

{"on":true, "sat":254, "bri":254,"hue":1000}

We set up a client on the Pi that listens on an AWS IoT Core MQTT topic and makes an API request to Philips Hue.
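The project’s client uses the Python Qhue wrapper (shown later in Figure 13). Purely for illustration, an equivalent listener written in TypeScript with the AWS IoT Device SDK v2 could look like the sketch below; the topic name, file paths, IoT endpoint, and Hue bridge address are all assumptions.

import { iot, mqtt } from 'aws-iot-device-sdk-v2';

// Placeholder Hue endpoint; <hue-username> is the bridge-issued API username.
const HUE_URL = 'http://192.168.1.10/api/<hue-username>/lights/1/state';

async function main() {
  // Build an mTLS connection from the device certificate and private key.
  const config = iot.AwsIotMqttConnectionConfigBuilder
    .new_mtls_builder_from_path('device.pem.crt', 'private.pem.key')
    .with_certificate_authority_from_path(undefined, 'AmazonRootCA1.pem')
    .with_client_id('solved-raspberry-pi')
    .with_endpoint('<account-prefix>-ats.iot.us-east-1.amazonaws.com') // placeholder
    .build();

  const connection = new mqtt.MqttClient().new_connection(config);
  await connection.connect();

  // Turn the bulb red whenever a defect alert arrives on the MQTT topic.
  await connection.subscribe('solved/alerts', mqtt.QoS.AtLeastOnce, async () => {
    await fetch(HUE_URL, {
      method: 'PUT',
      body: JSON.stringify({ on: true, sat: 254, bri: 254, hue: 1000 }),
    });
  });
}

main().catch(console.error);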

To connect a device to AWS IoT, complete these steps:

  1. Create an IoT thing, a device certificate, and an AWS IoT policy. An AWS IoT thing represents a physical device (in this case, Raspberry Pi) and contains static device metadata, as shown in Figure 10.

AWS IoT Thing

Figure 10. AWS IoT Thing

  2. Create a device certificate, required to connect to and authenticate with AWS IoT. An example is shown in Figure 11.

Device certificate

Figure 11. Device certificate

  3. Associate an AWS IoT policy with each device certificate. Policies determine which AWS IoT resources the device can access. In this case, we allowed iot.*, giving the device access to all IoT resources, as shown in Figure 12.

IoT policy

Figure 12. IoT policy

Devices and other clients use an AWS IoT root CA certificate to authenticate the server they’re communicating with. For more on how devices authenticate with AWS IoT Core, see Server authentication in the AWS IoT Core Developer Guide. Copy the certificate chain to the Raspberry Pi.

For communication with the Philips Hue, we used the Qhue wrapper as shown in Figure 13.

Qhue wrapper

Figure 13. Qhue wrapper

The authors presented a demo of this solution at re:Invent 2021 Builder’s Fair.

Author demo at re:Invent 2021 Builder's Fair

Figure 14. Author demo at re:Invent 2021 Builder’s Fair

Clean up

If you used the CloudFormation stack, delete it to avoid unexpected future charges. Delete Amazon S3 buckets and terminate Amazon Rekognition jobs to stop accruing charges.

Conclusion

Amazon Rekognition helps customers collect images in the field and apply AI-based analysis to interpret the condition of assets within the images.

In this post, you learned how to configure the Kinesis Video Streams producer on a Raspberry Pi to upload captured videos to Amazon Kinesis Video Streams. You also learned how to save video streams to Amazon S3 and leverage the Video on Demand at AWS solution.

Using AWS Elemental MediaConvert, we transcoded the videos and created a set of thumbnails from the source videos. We then used Amazon Rekognition Custom Labels to train and deploy models for solar panel damage detection. Finally, we configured AWS IoT Core to send MQTT messages to a Philips Hue smart bulb for notifications.

In this post, we presented a serverless architecture on AWS to detect defective solar panels. The reference architecture diagram is adaptable to solve inspection and damage detection problems across other industries.

Author Spotlight: Eduardo Monich Fronza, Senior Partner Solutions Architect, Linux and IBM

Post Syndicated from Elise Chahine original https://aws.amazon.com/blogs/architecture/author-spotlight-eduardo-monich-fronza-senior-partner-sa-linux-and-ibm/

The Author Spotlight series pulls back the curtain on some of AWS’s most prolific authors. Read on to find out more about our very own Eduardo Monich Fronza’s journey, in his own words!


I have been a Partner Solutions Architect at Amazon Web Services (AWS) for just over two years. In this period, I have had the opportunity to work on projects for different partners and customers across the globe, in multiple industry segments, using a wide variety of technologies.

At AWS, we are obsessed with our customers, and this influences all of our activities. I enjoy diving deep to understand our partners’ motivations, as well as their technical and business challenges. Plus, I work backwards from their goals, helping them build innovative solutions using AWS services—solutions that they can successfully offer to their customers and achieve their targeted business results.

Before joining AWS, I worked mainly in Brazil for many years as a middleware engineer and, later, a cloud migration architect. During this period, I travelled to my customers in North America and Europe. These experiences taught me a lot about customer-facing engagements, how to focus on customers’ problems, and how to work backwards from those.

When I joined AWS, I was exposed to so many new technologies and projects that I had never had any previous experience with! This was very exciting, as it provided me with many opportunities to dive deep and learn. A couple of the places I love to go to learn new content are our AWS Architecture Blog and AWS Reference Architecture Diagrams library.

The other thing I’ve realized during my tenure is how amazing it is to work with other people at AWS. I can say that I feel very fortunate to work with a wide range of intelligent and passionate problem-solvers. My peers are always willing to help and work together to provide the best possible solutions for our partners. I believe this collaboration is one of the reasons why AWS has been able to help partners and customers be so successful in their journeys to the cloud.

AWS encourages us to dive deep and specialize in technology domains. My background as a middleware engineer has influenced my decisions, and I am passionate about application modernization and containers areas in particular. A couple of topics that I am particularly interested in are Red Hat OpenShift Service on AWS (ROSA) and IBM software on AWS.

Eduardo presenting on the strategic partnership between AWS and IBM at IBM Think London 2022

Eduardo presenting on the strategic partnership between AWS and IBM at IBM Think London 2022

This also shows how interesting it is to work with ISVs like Red Hat and IBM. It demonstrates, yet again, how AWS is customer-obsessed and works backwards from what customers need to be successful in their own right. Regardless of whether they are using AWS native services or an ISV solution on AWS, we at AWS always focus on what is right for our customers.

I am also very fond of running workshops, called Immersion Days, for our customers. And, I have recently co-authored an AWS modernization workshop with IBM, which shows how customers can use IBM Cloud Pak for Data on AWS along with AWS services to create exciting Analytics and AI/ML workloads!

In conclusion, working as a Partner Solutions Architect at AWS has been an incredibly rewarding experience for me. I work with great people, a wide range of industries and technologies, and, most importantly, help our customers and partners innovate and find success on AWS. If you are considering a career at AWS, I would highly recommend it: it’s an unparalleled working experience, and there is no shortage of opportunities to take part in exciting projects!

Eduardo’s favorite blog posts!

Deploying IBM Cloud Pak for Data on Red Hat OpenShift Service on AWS

Alright, I will admit: I am being a bit biased. But, hey, this was my first blog at AWS! Many customers are looking to adopt IBM Data and AI solutions on AWS, particularly on how to use ROSA to deploy IBM Cloud Pak for Data.

So, I created a how-to deployment guide demonstrating how a customer can take advantage of ROSA without having to manage the lifecycle of Red Hat OpenShift Container Platform clusters, and can instead focus on developing new solutions and innovating faster using IBM’s integrated data and artificial intelligence platform on AWS.

IBM Cloud Pak for Integration on ROSA architecture

IBM Cloud Pak for Integration on ROSA architecture

Unleash Mainframe Applications by Augmenting New Channels on AWS with IBM Z and Cloud Modernization Stack

Many AWS customers use the IBM mainframe for their core business-critical applications. These customers are looking for ways to build modern cloud-native applications on AWS, that often require access to business-critical data on their IBM mainframe.

This AWS Partner Network (APN) Blog post shows how these customers can integrate cloud-native applications on AWS, with workloads running on mainframes, by exposing them as industry standard RESTful APIs with a no-code approach.

Mainframe-to-AWS integration reference architecture.

Mainframe-to-AWS integration reference architecture.

Migrate and Modernize Db2 Databases to Amazon EKS Using IBM’s Click to Containerize Tool

This blog shows how customers who are exploring ways to modernize their IBM Db2 databases can move them quickly and easily to Amazon Elastic Kubernetes Service (Amazon EKS), ROSA, and IBM’s Cloud Pak for Data products on AWS.

Scenario showing move from instance to container

Scenario showing move from instance to container

Self-service AWS native service adoption in OpenShift using ACK

This Containers Blog post demonstrates how customers can use AWS Controllers for Kubernetes (ACK) to define and create AWS resources directly from within OpenShift. It allows customers to take advantage of AWS-managed services to complement the application workloads running in OpenShift, without needing to define resources outside of the cluster or run services that provide supporting capabilities like databases or message queues.

ACK is now integrated into OpenShift and being used to provide a broad collection of AWS native services presently available on the OpenShift OperatorHub.

AWS Controllers for Kubernetes workflow

AWS Controllers for Kubernetes workflow

Building an event-driven solution for AvalonBay property leasing and search

Post Syndicated from Kausik Dey original https://aws.amazon.com/blogs/architecture/building-an-event-driven-solution-for-avalonbay-property-leasing-and-search/

In this blog post, we show you how to build an event-driven and serverless solution for property leasing and search that is scalable and resilient. This solution was created for AvalonBay Communities, Inc.—a leading residential Real Estate Investment Trust (REIT). It enables:

  • More than 150,000 multi-parameter searches per day
  • The processing of more than 3,500 lease applications and 85,000 individual rent payments per month

Introduction

AvalonBay is an equity REIT. The company has a long track record of developing, redeveloping, acquiring, and managing apartment homes in top U.S. markets. AvalonBay builds long-term value for customers using innovative technology solutions.

The company understands that data-driven insights contribute to targeted business growth. But AvalonBay found that managing the complex interdependencies between multiple data sets—from real estate and property management systems to financial and payment systems—required a new solution.

The challenge

AvalonBay owned or held a direct or indirect ownership interest in 293 apartment communities containing 88,405 apartment homes in 12 states and Washington, D.C., as of September 30, 2022.

Of these, 18 communities were under development and one was under redevelopment. This presented a unique challenge to both internal and external users looking to search and lease apartment units based on multi-parameter selection criteria in geographically dispersed regions. For example, finding units in buildings with specific amenities, lease terms, furnishings, and availability dates.

Overview of solution

AvalonBay’s fully managed leasing solution for applicants and residents is hosted by Amazon Web Services (AWS). The solution is secure, autoscaling, and multi-region, ensuring resiliency and performance with efficient resource usage.

In this event-driven solution, AvalonBay’s leasing service is hosted in multiple AWS Regions to provide low-latency responses to users across various geolocations. This blog post focuses on the use case implementation in only one Region—Region East—as shown in Figure 1.

Figure 1. AvalonBay lease processing platform

Several AWS services come together in this solution to meet key company objectives. Let’s explore each one and its purpose within the architecture.

    1. Amazon Route 53: For AvalonBay’s Lease Processing solution, any non-transient service failure is unacceptable. In addition to providing a high degree of resiliency through a Multi-AZ architecture, Lease Processing also provides regional-level high availability through its multi-Region active-active architecture. Route 53 with latency-based routing allows dynamic rerouting of requests within seconds to alternate Regions.
    2. Amazon API Gateway: Route 53 latency-based routing is configured across multiple AWS Regions to route traffic to the API Gateway endpoint in the lowest-latency Region (a CDK sketch of these records appears after this list). API Gateway authorizers were added to control access to APIs using an Amazon Cognito user pool.
    3. AWS Lambda with provisioned concurrency: The Lambda services are set up for automatic scaling, secured through a private subnet, and span multiple Availability Zones. This provides horizontal scaling, self-healing capacity, and resiliency across Availability Zones. Provisioned concurrency minimizes cold starts by keeping execution environments initialized before requests arrive, which also greatly reduces API invocation latency.
    4. Amazon Aurora Serverless v2 (PostgreSQL-Compatible Edition): Aurora Serverless v2 is an on-demand, autoscaling configuration for Amazon Aurora, and its PostgreSQL-Compatible Edition is used for the Lease Processing solution. The global database was configured with two Regions: us-east-1 as the primary cluster and us-west-2 as the secondary cluster. Automated Aurora global database endpoint management for planned and unplanned failover is configured through a Route 53 private hosted zone, Amazon EventBridge, and Lambda.
    5. Amazon RDS Proxy for Aurora: Amazon RDS Proxy allows the leasing application to pool and share database connections to improve its ability to scale. It also makes the leasing solution more resilient to database failures by automatically connecting to a standby database instance while preserving application connections.
    6. Amazon EventBridge: EventBridge supports the solution through two primary purposes:
      1. Oversees lease flow events – During the lease application process, the solution generates various events which are consumed by AvalonBay and external applications such as property management, finance portals, administration, and more. Leasing events are sent to EventBridge and various event rules are configured for multiple destinations including Lambda, Amazon Simple Notification Service (Amazon SNS), and external API endpoints.
      2. Handles global Aurora V2 failover – Aurora generates events when certain actions are taken or events occur, including any type of global database activity.
        • When a managed planned failover is initiated—either via the AWS Command Line Interface (AWS CLI), API, or console—the global database failover process starts and generates an event.
        • When an Aurora cluster is removed from a global cluster through the AWS CLI, API, or console, the Aurora cluster is promoted as a single primary cluster. Once this process is completed, it generates an event.

An EventBridge rule is created to match the event pattern emitted any time a global database managed planned failover completes successfully in a Region. When the completion event is detected, the rule invokes a Lambda function that updates the Amazon CloudFront CNAME record to the correct value.
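The following is a minimal AWS CDK (TypeScript) sketch of such a rule. The handler asset path is hypothetical, and the RDS event ID is a placeholder to look up in the RDS event documentation—these are illustrative values, not AvalonBay's actual configuration.

```typescript
// Minimal sketch: route Aurora global database failover-completion events
// to a Lambda function that updates DNS. The event ID and handler below
// are placeholders, not AvalonBay's actual values.
import { Stack, StackProps } from 'aws-cdk-lib';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';

export class FailoverDnsStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Hypothetical handler that rewrites the CloudFront CNAME record.
    const updateDnsFn = new lambda.Function(this, 'UpdateDnsFn', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda/update-dns'),
    });

    // Match RDS cluster events, filtered to the "global failover
    // completed" event ID from the RDS event catalog (placeholder below).
    new events.Rule(this, 'GlobalFailoverCompleted', {
      eventPattern: {
        source: ['aws.rds'],
        detailType: ['RDS DB Cluster Event'],
        detail: { EventID: ['RDS-EVENT-XXXX'] }, // placeholder event ID
      },
      targets: [new targets.LambdaFunction(updateDnsFn)],
    });
  }
}
```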
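Likewise, the latency-based routing from steps 1 and 2 can be sketched as one Route 53 record set per Region that shares a single name but carries a distinct set identifier. The hosted zone ID and regional API Gateway domains below are hypothetical:

```typescript
// Minimal sketch: two latency-based CNAME records for the same name.
// Route 53 answers each query with the lowest-latency Region. The zone
// ID and endpoint domains are hypothetical.
import { Stack, StackProps } from 'aws-cdk-lib';
import * as route53 from 'aws-cdk-lib/aws-route53';
import { Construct } from 'constructs';

export class LatencyRoutingStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const regionalEndpoints: Record<string, string> = {
      'us-east-1': 'abc123.execute-api.us-east-1.amazonaws.com',
      'us-west-2': 'def456.execute-api.us-west-2.amazonaws.com',
    };

    for (const [region, domain] of Object.entries(regionalEndpoints)) {
      new route53.CfnRecordSet(this, `ApiRecord-${region}`, {
        hostedZoneId: 'Z0HYPOTHETICAL', // hypothetical hosted zone ID
        name: 'api.example.com',
        type: 'CNAME',
        ttl: '60',
        region, // setting a region enables latency-based routing
        setIdentifier: region, // must be unique per record in the set
        resourceRecords: [domain],
      });
    }
  }
}
```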

Scalable search solution

Leasing professionals need to scan a huge amount of property information easily, using AvalonBay’s Search Solution to obtain the details they require.

Using the Amazon OpenSearch Service, agents can generate property profiles and other asset data to identify matching units and quickly respond to end customers. OpenSearch is a fully open-source search and analytics engine that securely unlocks real-time search, monitoring, and analysis of business and operational data. It is employed for use cases such as application monitoring, log analytics, observability, and website search.

The AvalonBay Search Service solution architecture featuring OpenSearch is shown in Figure 2.

Figure 2. AvalonBay Search Service solution architecture

AvalonBay search requires search criteria including keyword and Uniform Resource Identifier (URI) search, SQL-based search, and custom package search, all of which are detailed in the Amazon OpenSearch Service Developer Guide.

Amazon OpenSearch Service automatically detects and replaces failed OpenSearch nodes, reducing the overhead associated with self-managed infrastructure.
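To illustrate two of those search styles, here is a minimal TypeScript sketch of a keyword (query DSL) search and an SQL search against an OpenSearch domain. The endpoint, index, and field names are hypothetical, and a production client would sign requests with SigV4 rather than calling the domain directly.

```typescript
// Minimal sketch: keyword and SQL searches against an OpenSearch domain.
// Endpoint, index, and field names are hypothetical; production requests
// would be SigV4-signed. Requires Node.js 18+ for the global fetch API.
const endpoint = 'https://search-communities-abc123.us-east-1.es.amazonaws.com';

// Keyword search via the _search API using the query DSL.
async function keywordSearch(term: string): Promise<unknown> {
  const res = await fetch(`${endpoint}/communities/_search`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      query: { multi_match: { query: term, fields: ['name', 'amenities'] } },
    }),
  });
  return res.json();
}

// SQL-based search via the OpenSearch SQL plugin endpoint.
async function sqlSearch(): Promise<unknown> {
  const res = await fetch(`${endpoint}/_plugins/_sql`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      query: "SELECT name, city FROM communities WHERE amenities LIKE '%pool%' LIMIT 20",
    }),
  });
  return res.json();
}

keywordSearch('rooftop pool').then((hits) => console.log(hits));
sqlSearch().then((rows) => console.log(rows));
```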

Let’s explore this architecture step by step.

  1. Amazon Kinesis event stream – The AvalonBay community requires near real-time updates to search attributes such as amenities, features, promotions, and pricing. Events created by various producers are streamed through Kinesis and inserted into or updated in OpenSearch (see the consumer sketch after this list).
  2. Amazon OpenSearch Service – OpenSearch Service, a managed service that makes it easy to deploy, operate, and scale OpenSearch clusters in the AWS Cloud, provides the end-to-end community search. Because community search data is read-only, UltraWarm and cold storage tiers are also used based on access frequency.
  3. Amazon Simple Storage Service (Amazon S3) – Various community documents, policies, and image or video files are key search elements. They must be maintained securely and reliably for years due to contractual obligations. Amazon S3 simplifies this task with high durability, lifecycle rules, and varied controls for retention.
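As referenced in step 1, here is a minimal sketch of a Lambda consumer that reads attribute-change events from the Kinesis stream and upserts them into the OpenSearch index. The environment variable, index name, and document shape are assumptions for illustration, and the unsigned fetch call stands in for a SigV4-signed client.

```typescript
// Minimal sketch: Kinesis-triggered Lambda that upserts community
// attribute changes into OpenSearch. Names and document shape are
// hypothetical; a real client would sign requests with SigV4.
import { KinesisStreamEvent } from 'aws-lambda';

const endpoint = process.env.OPENSEARCH_ENDPOINT!; // hypothetical env var

export const handler = async (event: KinesisStreamEvent): Promise<void> => {
  for (const record of event.Records) {
    // Kinesis delivers payloads base64-encoded.
    const doc = JSON.parse(
      Buffer.from(record.kinesis.data, 'base64').toString('utf8'),
    );

    // PUT by community ID so pricing and amenity changes overwrite the
    // previous version of the document (an upsert).
    await fetch(`${endpoint}/communities/_doc/${doc.communityId}`, {
      method: 'PUT',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(doc),
    });
  }
};
```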

Conclusion

This post showed how AvalonBay has built and deployed custom leasing and search solutions on AWS serverless platforms without compromising resiliency, performance, and capacity requirements. This is a 24/7, fully managed solution with no additional equipment on-premises.

Choosing AWS for leasing and search solutions gives AvalonBay the ability to dynamically scale and meet future growth demands while introducing cost advantages. In addition, the global availability of AWS services makes it possible to deploy services across geographic locations to meet performance requirements.

Let’s Architect! Architecting for sustainability

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-architecting-for-sustainability/

Sustainability is an important topic in the tech industry, as well as in society as a whole. It is defined as the ability to continue performing a process or function over an extended period of time without depleting natural resources or harming the environment.

One of the key elements of designing a sustainable workload is software architecture. Think about how event-driven architecture can help reduce the load across multiple microservices by leveraging solutions like batching and queues. In these cases, the main traffic is absorbed at the entry point of a cloud workload and smoothed out as it moves through your system (a small sketch follows). Beyond architecture, think about data patterns, hardware optimizations, multi-environment strategies, and the many other aspects of a software development lifecycle that can contribute to your sustainability posture in the Cloud.
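As a small illustration of the batching idea (our own sketch, not from any specific AWS reference), here is an SQS-triggered Lambda handler in TypeScript that processes messages in batches and reports partial failures, so a burst of requests is absorbed by the queue and handled in fewer, fuller invocations. The handler name and payload shape are hypothetical.

```typescript
// Minimal sketch, assuming an SQS-triggered Lambda: processing messages
// in batches amortizes invocation overhead, one lever for absorbing
// bursty traffic sustainably. Partial-failure reporting requires the
// ReportBatchItemFailures setting on the event source mapping.
import { SQSEvent, SQSBatchResponse } from 'aws-lambda';

export const handler = async (event: SQSEvent): Promise<SQSBatchResponse> => {
  const failures: { itemIdentifier: string }[] = [];

  // One invocation handles a whole batch (up to 10 messages by default,
  // or more with a batching window) instead of one invocation per request.
  for (const message of event.Records) {
    try {
      handleMessage(JSON.parse(message.body));
    } catch {
      failures.push({ itemIdentifier: message.messageId });
    }
  }

  // Report partial failures so only failed messages are redelivered.
  return { batchItemFailures: failures };
};

function handleMessage(payload: unknown): void {
  // Hypothetical domain logic goes here.
}
```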

The key takeaway: designing with sustainability in mind can help you build an application that is not only durable but also flexible enough to maintain the agility your business requires.

In this edition of Let’s Architect!, we share hands-on activities, case studies, and tips and tricks for making your Cloud applications more sustainable.

Architecting sustainably and reducing your AWS carbon footprint

Amazon Web Services (AWS) launched the Sustainability Pillar of the AWS Well-Architected Framework to help organizations evaluate and optimize their use of AWS services, and built the customer carbon footprint tool so organizations can monitor, analyze, and reduce their AWS footprint.

This session provides updates on these programs and highlights the most effective techniques for optimizing your AWS architectures. Find out how Amazon Prime Video used these tools to establish baselines and drive significant efficiencies across their AWS usage.

Take me to this re:Invent 2022 video!

Prime Video case study for understanding how the architecture can be designed for sustainability

Optimize your modern data architecture for sustainability

The modern data architecture is the foundation for a sustainable and scalable platform that enables business intelligence. This AWS Architecture Blog series provides tips on how to develop a modern data architecture with sustainability in mind.

Spanning two posts, it helps you revisit and enhance your current data architecture without compromising sustainability.

Take me to Part 1! | Take me to Part 2!

An AWS data architecture; it’s now time to account for sustainability

AWS Well-Architected Labs: Sustainability

This workshop introduces participants to the AWS Well-Architected Framework, a set of best practices for designing and operating high-performing, highly scalable, and cost-efficient applications on AWS. The workshop also discusses how sustainability is critical to software architecture and how to use the AWS Well-Architected Framework to improve your application’s sustainability performance.

Take me to this workshop!

Sustainability implementation best practices and monitoring

Sustainability in the cloud with Rust and AWS Graviton

In this video, you can learn about the benefits of Rust and AWS Graviton for reducing energy consumption and increasing performance. Rust combines the resource efficiency of languages like C with the memory safety of languages like Java. The video also explains the benefits of AWS Graviton processors, which are designed to deliver performance- and cost-optimized cloud workloads. This resource is very helpful for understanding how sustainability can become a driver for cost optimization.

Take me to this re:Invent 2022 video!

Discover how Rust and AWS Graviton can help you make your workload more sustainable and performant

See you next time!

Thanks for joining us to discuss sustainability in the cloud! See you in two weeks when we’ll talk about tools for architects.

To find all the blogs from this series, you can check the Let’s Architect! list of content on the AWS Architecture Blog.