Tag Archives: Amazon EventBridge

Three ways to accelerate incident response in the cloud: insights from re:Inforce 2023

Post Syndicated from Anne Grahn original https://aws.amazon.com/blogs/security/three-ways-to-accelerate-incident-response-in-the-cloud-insights-from-reinforce-2023/

AWS re:Inforce took place in Anaheim, California, on June 13–14, 2023. AWS customers, partners, and industry peers participated in hundreds of technical and non-technical security-focused sessions across six tracks, an Expo featuring AWS experts and AWS Security Competency Partners, and keynote and leadership sessions.

The threat detection and incident response track showcased how AWS customers can get the visibility they need to help improve their security posture, identify issues before they impact business, and investigate and respond quickly to security incidents across their environment.

With dozens of service and feature announcements—and innumerable best practices shared by AWS experts, customers, and partners—distilling highlights is a challenge. From an incident response perspective, three key themes emerged.

Proactively detect, contextualize, and visualize security events

When it comes to effectively responding to security events, rapid detection is key. Among the launches announced during the keynote was the expansion of Amazon Detective finding groups to include Amazon Inspector findings in addition to Amazon GuardDuty findings.

Detective, GuardDuty, and Inspector are part of a broad set of fully managed AWS security services that help you identify potential security risks, so that you can respond quickly and confidently.

Using machine learning, Detective finding groups can help you conduct faster investigations, identify the root cause of events, and map to the MITRE ATT&CK framework to quickly run security issues to ground. The finding group visualization panel shown in the following figure displays findings and entities involved in a finding group. This interactive visualization can help you analyze, understand, and triage the impact of finding groups.

Figure 1: Detective finding groups visualization panel

Figure 1: Detective finding groups visualization panel

With the expanded threat and vulnerability findings announced at re:Inforce, you can prioritize where to focus your time by answering questions such as “was this EC2 instance compromised because of a software vulnerability?” or “did this GuardDuty finding occur because of unintended network exposure?”

In the session Streamline security analysis with Amazon Detective, AWS Principal Product Manager Rich Vorwaller, AWS Senior Security Engineer Rima Tanash, and AWS Program Manager Jordan Kramer demonstrated how to use graph analysis techniques and machine learning in Detective to identify related findings and resources, and investigate them together to accelerate incident analysis.

In addition to Detective, you can also use Amazon Security Lake to contextualize and visualize security events. Security Lake became generally available on May 30, 2023, and several re:Inforce sessions focused on how you can use this new service to assist with investigations and incident response.

As detailed in the following figure, Security Lake automatically centralizes security data from AWS environments, SaaS providers, on-premises environments, and cloud sources into a purpose-built data lake stored in your account. Security Lake makes it simpler to analyze security data, gain a more comprehensive understanding of security across an entire organization, and improve the protection of workloads, applications, and data. Security Lake automates the collection and management of security data from multiple accounts and AWS Regions, so you can use your preferred analytics tools while retaining complete control and ownership over your security data. Security Lake has adopted the Open Cybersecurity Schema Framework (OCSF), an open standard. With OCSF support, the service normalizes and combines security data from AWS and a broad range of enterprise security data sources.

Figure 2: How Security Lake works

Figure 2: How Security Lake works

To date, 57 AWS security partners have announced integrations with Security Lake, and we now have more than 70 third-party sources, 16 analytics subscribers, and 13 service partners.

In Gaining insights from Amazon Security Lake, AWS Principal Solutions Architect Mark Keating and AWS Security Engineering Manager Keith Gilbert detailed how to get the most out of Security Lake. Addressing questions such as, “How do I get access to the data?” and “What tools can I use?,” they demonstrated how analytics services and security information and event management (SIEM) solutions can connect to and use data stored within Security Lake to investigate security events and identify trends across an organization. They emphasized how bringing together logs in multiple formats and normalizing them into a single format empowers security teams to gain valuable context from security data, and more effectively respond to events. Data can be queried with Amazon Athena, or pulled by Amazon OpenSearch Service or your SIEM system directly from Security Lake.

Build your security data lake with Amazon Security Lake featured AWS Product Manager Jonathan Garzon, AWS Product Solutions Architect Ross Warren, and Global CISO of Interpublic Group (IPG) Troy Wilkinson demonstrating how Security Lake helps address common challenges associated with analyzing enterprise security data, and detailing how IPG is using the service. Wilkinson noted that IPG’s objective is to bring security data together in one place, improve searches, and gain insights from their data that they haven’t been able to before.

“With Security Lake, we found that it was super simple to bring data in. Not just the third-party data and Amazon data, but also our on-premises data from custom apps that we built.” — Troy Wilkinson, global CISO, Interpublic Group

Use automation and machine learning to reduce mean time to response

Incident response automation can help free security analysts from repetitive tasks, so they can spend their time identifying and addressing high-priority security issues.

In How LLA reduces incident response time with AWS Systems Manager, telecommunications provider Liberty Latin America (LLA) detailed how they implemented a security framework to detect security issues and automate incident response in more than 180 AWS accounts accessed by internal stakeholders and third-party partners by using AWS Systems Manager Incident Manager, AWS Organizations, Amazon GuardDuty, and AWS Security Hub.

LLA operates in over 20 countries across Latin America and the Caribbean. After completing multiple acquisitions, LLA needed a centralized security operations team to handle incidents and notify the teams responsible for each AWS account. They used GuardDuty, Security Hub, and Systems Manager Incident Manager to automate and streamline detection and response, and they configured the services to initiate alerts whenever there was an issue requiring attention.

Speaking alongside AWS Principal Solutions Architect Jesus Federico and AWS Principal Product Manager Sarah Holberg, LLA Senior Manager of Cloud Services Joaquin Cameselle noted that when GuardDuty identifies a critical issue, it generates a new finding in Security Hub. This finding is then forwarded to Systems Manager Incident Manager through an Amazon EventBridge rule. This configuration helps ensure the involvement of the appropriate individuals associated with each account.

“We have deployed a security framework in Liberty Latin America to identify security issues and streamline incident response across over 180 AWS accounts. The framework that leverages AWS Systems Manager Incident Manager, Amazon GuardDuty, and AWS Security Hub enabled us to detect and respond to incidents with greater efficiency. As a result, we have reduced our reaction time by 90%, ensuring prompt engagement of the appropriate teams for each AWS account and facilitating visibility of issues for the central security team.” — Joaquin Cameselle, senior manager, cloud services, Liberty Latin America

How Citibank (Citi) advanced their containment capabilities through automation outlined how the National Institute of Standards and Technology (NIST) Incident Response framework is applied to AWS services, and highlighted Citi’s implementation of a highly scalable cloud incident response framework designed to support the 28 AWS services in their cloud environment.

After describing the four phases of the incident response process — preparation and prevention; detection and analysis; containment, eradication, and recovery; and post-incident activity—AWS ProServe Global Financial Services Senior Engagement Manager Harikumar Subramonion noted that, to fully benefit from the cloud, you need to embrace automation. Automation benefits the third phase of the incident response process by speeding up containment, and reducing mean time to response.

Citibank Head of Cloud Security Operations Elvis Velez and Vice President of Cloud Security Damien Burks described how Citi built the Cloud Containment Automation Framework (CCAF) from the ground up by using AWS Step Functions and AWS Lambda, enabling them to respond to events 24/7 without human error, and reduce the time it takes to contain resources from 4 hours to 15 minutes. Velez described how Citi uses adversary emulation exercises that use the MITRE ATT&CK Cloud Matrix to simulate realistic attacks on AWS environments, and continuously validate their ability to effectively contain incidents.

Innovate and do more with less

Security operations teams are often understaffed, making it difficult to keep up with alerts. According to data from CyberSeek, there are currently 69 workers available for every 100 cybersecurity job openings.

Effectively evaluating security and compliance posture is critical, despite resource constraints. In Centralizing security at scale with Security Hub and Intuit’s experience, AWS Senior Solutions Architect Craig Simon, AWS Senior Security Hub Product Manager Dora Karali, and Intuit Principal Software Engineer Matt Gravlin discussed how to ease security management with Security Hub. Fortune 500 financial software provider Intuit has approximately 2,000 AWS accounts, 10 million AWS resources, and receives 20 million findings a day from AWS services through Security Hub. Gravlin detailed Intuit’s Automated Compliance Platform (ACP), which combines Security Hub and AWS Config with an internal compliance solution to help Intuit reduce audit timelines, effectively manage remediation, and make compliance more consistent.

“By using Security Hub, we leveraged AWS expertise with their regulatory controls and best practice controls. It helped us keep up to date as new controls are released on a regular basis. We like Security Hub’s aggregation features that consolidate findings from other AWS services and third-party providers. I personally call it the super aggregator. A key component is the Security Hub to Amazon EventBridge integration. This allowed us to stream millions of findings on a daily basis to be inserted into our ACP database.” — Matt Gravlin, principal software engineer, Intuit

At AWS re:Inforce, we launched a new Security Hub capability for automating actions to update findings. You can now use rules to automatically update various fields in findings that match defined criteria. This allows you to automatically suppress findings, update the severity of findings according to organizational policies, change the workflow status of findings, and add notes. With automation rules, Security Hub provides you a simplified way to build automations directly from the Security Hub console and API. This reduces repetitive work for cloud security and DevOps engineers and can reduce mean time to response.

In Continuous innovation in AWS detection and response services, AWS Worldwide Security Specialist Senior Manager Himanshu Verma and GuardDuty Senior Manager Ryan Holland highlighted new features that can help you gain actionable insights that you can use to enhance your overall security posture. After mapping AWS security capabilities to the core functions of the NIST Cybersecurity Framework, Verma and Holland provided an overview of AWS threat detection and response services that included a technical demonstration.

Bolstering incident response with AWS Wickr enterprise integrations highlighted how incident responders can collaborate securely during a security event, even on a compromised network. AWS Senior Security Specialist Solutions Architect Wes Wood demonstrated an innovative approach to incident response communications by detailing how you can integrate the end-to-end encrypted collaboration service AWS Wickr Enterprise with GuardDuty and AWS WAF. Using Wickr Bots, you can build integrated workflows that incorporate GuardDuty and third-party findings into a more secure, out-of-band communication channel for dedicated teams.

Evolve your incident response maturity

AWS re:Inforce featured many more highlights on incident response, including How to run security incident response in your Amazon EKS environment and Investigating incidents with Amazon Security Lake and Jupyter notebooks code talks, as well as the announcement of our Cyber Insurance Partners program. Content presented throughout the conference made one thing clear: AWS is working harder than ever to help you gain the insights that you need to strengthen your organization’s security posture, and accelerate incident response in the cloud.

To watch AWS re:Inforce sessions on demand, see the AWS re:Inforce playlists on YouTube.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Anne Grahn

Anne Grahn

Anne is a Senior Worldwide Security GTM Specialist at AWS based in Chicago. She has more than a decade of experience in the security industry, and focuses on effectively communicating cybersecurity risk. She maintains a Certified Information Systems Security Professional (CISSP) certification.

Author

Himanshu Verma

Himanshu is a Worldwide Specialist for AWS Security Services. In this role, he leads the go-to-market creation and execution for AWS Security Services, field enablement, and strategic customer advisement. Prior to AWS, he held several leadership roles in Product Management, engineering and development, working on various identity, information security, and data protection technologies. He obsesses brainstorming disruptive ideas, venturing outdoors, photography, and trying various “hole in the wall” food and drinking establishments around the globe.

Jesus Federico

Jesus Federico

Jesus is a Principal Solutions Architect for AWS in the telecommunications vertical, working to provide guidance and technical assistance to communication service providers on their cloud journey. He supports CSPs in designing and implementing secure, resilient, scalable, and high-performance applications in the cloud.

How GoDaddy Implemented a Multi-Region Event-Driven Platform at Scale

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/how-godaddy-implemented-a-multi-region-event-driven-platform-at-scale/

GoDaddy, a leading global provider of domain registration and web hosting services, has served over 84 million domains and 22 million customers since its establishment in 1997. Among its various internal systems, the Customer Signal Platform provides tooling to capture, analyze, and act on customer and product data to drive better business outcomes. With this platform, GoDaddy can track user visits and interactions on its website and use meaningful event data to improve its customer experience and overall business performance.

Nowadays, the Customer Signal Platform processes 400 million events every day. As GoDaddy expands its integrations, it aims to increase this number to 2 billion events per day in the near future.

When building the Customer Signal Platform, GoDaddy had three main requirements for the system architecture:

  1. Minimize their operational load.
  2. Scale automatically as traffic changes.
  3. Provide high availability and ensure that all the customer signals are captured.

Amazon EventBridge Event Bus
After evaluating many options against their requirements, GoDaddy decided to implement the customer signal platform using Amazon EventBridge Event Bus. EventBridge Event Bus is a serverless event bus that helps you receive, filter, transform, route, and deliver events. Because EventBridge is serverless, it requires minimal configuration to get started and scales automatically—GoDaddy’s first two requirements were checked.

To comply with the third requirement, the solution needed to provide business continuity and ensure that no event is lost from the moment the client produces it until it gets to the platform to be analyzed. EventBridge Event Bus comes with many features that helped GoDaddy build their application with this requirement in mind.

The main feature that GoDaddy took advantage of was global endpoints. EventBridge global endpoints provide a reliable and simple way to improve the business continuity of event-driven applications. This new feature, added in 2022, allows customers to build a multi-Region event-driven application.

EventBridge Global Endpoints
Global endpoints allow you to configure a managed DNS endpoint in EventBridge, to which your applications will send events. Then you need to configure two custom event buses in two distinct AWS Regions. One is the primary Region, and the other is the failover, or secondary Region. The failover of events is decided based on the health indicated by an Amazon Route 53 health check. When the health check is healthy, the events are routed from the global endpoint to the custom event bus in the primary Region. And if the health check is unhealthy, then the global endpoint will send the events to the event bus in the secondary Region.

Healthcheck status

The simplest configuration for global endpoints is the active/archive configuration. This configuration provides business continuity and simplicity at the same time. The active/archive configuration defines two different Regions. The primary Region is where the application is deployed and all the business processes are happening. The archive Region is where only a custom bus is deployed and all the events are archived.

In addition, there is a bidirectional replication rule between the buses in separate Regions. In the normal case, when there are no errors, whenever an event arrives at the custom bus in the primary Region, the event is automatically replicated to the archive custom bus in the secondary Region.

In the case of failover, the global endpoint redirects the events to the secondary Region, where they get archived for processing at another time.

Active/ Archive configuration

GoDaddy Implementation of Global Endpoints
GoDaddy was looking for a solution that minimized their operations load while still providing business continuity, and that is why they adopted global endpoints and the active/archive configuration. In this way, they could have the event processing logic in their primary Region and have a secondary Region in case of any issues.

In their configuration, events are archived in the secondary Region for 30 days, after which the events expire. In the case of a failover, because they don’t need to process the events in real time, they collect them in the archive. If the issue is resolved within 24 hours, the retention period for the replication rule, the events are sent automatically to the primary Region. If the issue is solved in more than 24 hours the events need to be replayed to the primary Region.

The following image shows what their current solution looks like. They are working with two Regions. US West (Oregon) is their primary Region and is the location of the data lake, which is the primary consumer of the events. US East (N. Virginia) is the secondary Region. Events are being produced in different clients; from the clients, they are sent to Amazon API Gateway. GoDaddy deployed two API Gateways in their two Regions. The events are sent to the API Gateway with the smallest latency from the client. To do that, they use latency-based routing provided by Amazon Route 53. Then events are sent to an AWS Lambda function that validates the events and forwards them to the EventBridge global endpoint at the DNS level.

GoDaddy architecture

The global endpoint is configured with the active/archive setup, and the failover is configured to be triggered via a Route 53 health check that monitors an Amazon CloudWatch alarm. That alarm observes the IngestionToInvocationStartLatency metric in the primary Region.

IngestionToInvocationStartLatency is a service-level metric that exposes the time to process events from the point at which they are ingested by EventBridge to the point the first invocation of a target in the configured rules is made. This metric is measured across all the rules in your bus and provides an indication of the health of the EventBridge service. Any extended periods of high latency over 30 seconds indicate a service disruption.

When the system is in the normal state, the events are forwarded from the global endpoint to the custom ingress event bus in the primary Region. That custom event bus has replication enabled; this means that all the events that arrive at the bus get replicated automatically in the secondary Region custom ingress event bus.

All the events received by the ingress event bus are sent to the enrichment function. This function performs basic validation and authentication, and it enriches the event data to make sure that all the events from different clients are standard.

From there, the events are forwarded to the data platform event bus to be sent to the different consumer targets. The main target is their data lake solution, which analyzes all the events.

What Was the Impact?
For GoDaddy, business continuity is important, and their customer signals are not getting lost due to any issue with their platform. This makes them confident that they can expand their customer signal platforms from 400 million events per day to 2 billion events per day without introducing any additional operations overhead.

Now, they can confidently process hundreds of millions of events per day to their system, and they can keep on growing. The following image shows the number of events ingested by global endpoints in a normal day.

Events ingested

While GoDaddy’s use of the active/archive pattern enables them to ensure they never lose any events, they’re already starting to see certain use cases where they want to minimize any delays in processing their events, even when service disruptions occur. Because they’re already replicating their events to a secondary Region, they can deploy their most critical consumers to both Regions and enable an active/active configuration for their mission-critical systems. Active/active configuration allows you to process parallel events in both the primary and secondary Regions, simplifying the processing of events even during disruptions and enabling business continuity.

The vision when building the Customer Signal Platform was to align with GoDaddy’s high bar for reliability, scalability, and maintainability and, at the same time, keep the platform self-service so that developers can focus on business needs. This led GoDaddy to choose Amazon EventBridge global endpoints and serverless technologies to build this solution.

GoDaddy Customer Signal Platform is an excellent example of what serverless technologies enable. By leveraging the cloud to handle as much of the undifferentiated heavy lifting as possible, GoDaddy has reduced the operational complexity of setting up an event bus for a multi-Region strategy, implemented failover mechanisms in the case of Regional distruptions, and ensured that events are not lost by enabling replication. Global endpoints active/archive configuration improves the availability of customer applications with the least amount of configuration changes.

If you want to get started with EventBridge global endpoints, you can check out this talk on event-driven applications. For a working demo on how to use EventBridge global endpoints for failover events, check out this Serverless Land repository.

Marcia

AWS Week in Review – AWS Documentation Updates, Amazon EventBridge is Faster, and More – May 22, 2023

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-week-in-review-aws-documentation-updates-amazon-eventbridge-is-faster-and-more-may-22-2023/

AWS Data Hero Anahit Pogosova keynote at CloudConf 2023Here are your AWS updates from the previous 7 days. Last week I was in Turin, Italy for CloudConf, a conference I’ve had the pleasure to participate in for the last 10 years. AWS Hero Anahit Pogosova was also there sharing a few serverless tips in front of a full house. Here’s a picture I took from the last row during her keynote.

On Thursday, May 25, I’ll be at the AWS Community Day in Dublin to celebrate the 10 years of the local AWS User Group. Say hi if you’re there!

Last Week’s Launches
Last week was packed with announcements! Here are the launches that got my attention:

Amazon SageMakerGeospatial capabilities are now generally available with security updates and more use case samples.

Amazon DetectiveSimplify the investigation of AWS Security Findings coming from new sources such as AWS IAM Access Analyzer, Amazon Inspector, and Amazon Macie.

Amazon EventBridge – EventBridge now delivers events up to 80% faster than before, as measured by the time an event is ingested to the first invocation attempt. No change is required on your side.

AWS Control Tower – The service has launched 28 new proactive controls that allow you to block non-compliant resources before they are provisioned for services such as AWS OpenSearch Service, AWS Auto Scaling, Amazon SageMaker, Amazon API Gateway, and Amazon Relational Database Service (Amazon RDS). Check out the original posts from when proactive controls were launched.

Amazon CloudFront – CloudFront now supports two new control directives to help improve performance and availability: stale-while-revalidate (to immediately deliver stale responses to users while it revalidates caches in the background) and the stale-if-error cache (to define how long stale responses should be reused if there’s an error).

Amazon Timestream – Timestream now enables to export query results to Amazon S3 in a cost-effective and secure manner using the new UNLOAD statement.

AWS Distro for OpenTelemetryThe tail sampling and the group-by-trace processors are now generally available in the AWS Distro for OpenTelemetry (ADOT) collector. For example, with tail sampling, you can define sampling policies such as “ingest 100% of all error cases and 5% of all success cases.”

AWS DataSync – You can now use DataSync to copy data to and from Amazon S3 compatible storage on AWS Snowball Edge Compute Optimized devices.

AWS Device Farm – Device Farm now supports VPC integration for private devices, for example, when an unreleased version of an app is accessing a staging environment and tests are accessing internal packages only accessible via private networking. Read more at Access your private network from real mobile devices using AWS Device Farm.

Amazon Kendra – Amazon Kendra now helps you search across different content repositories with new connectors for Gmail, Adobe Experience Manager Cloud, Adobe Experience Manager On-Premise, Alfresco PaaS, and Alfresco Enterprise. There is also an updated Microsoft SharePoint connector.

Amazon Omics – Omics now offers pre-built bioinformatic workflows, synchronous upload capability, integration with Amazon EventBridge, and support for Graphical Processing Units (GPUs). For more information, check out New capabilities make it easier for healthcare and life science customers to get started, build applications, and scale-up on Amazon Omics.

Amazon Braket – Braket now supports Aria, IonQ’s largest and highest fidelity publicly available quantum computing device to date. To learn more, read Amazon Braket launches IonQ Aria whith built-in error mitigation.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
A few more news items and blog posts you might have missed:

AWS Documentation home page screenshot.AWS Documentation – The AWS Documentation home page has been redesigned. Leave your feedback there to let us know what you think or to suggest future improvements. Last week we also announced that we are retiring the AWS Documentation GitHub repo to focus our resources to directly improve the documentation and the website.

Peloton case studyPeloton embraces Amazon Redshift to unlock the power of data during changing times.

Zoom case studyLearn how Zoom implemented streaming log ingestion and efficient GDPR deletes using Apache Hudi on Amazon EMR.

Nice solutionIntroducing an image-to-speech Generative AI application using SageMaker and Hugging Face.

For AWS open-source news and updates, check out the latest newsletter curated by Ricardo to bring you the most recent updates on open-source projects, posts, events, and more.

Upcoming AWS Events
Here are some opportunities to meet and learn:

AWS Data Insights Day (May 24) – A virtual event to discover how to innovate faster and more cost-effectively with data. This event focuses on customer voices, deep-dive sessions, and best practices of Amazon Redshift. You can register here.

AWS Silicon Innovation Day (June 21) – AWS has designed and developed purpose-built silicon specifically for the cloud. Join to learn AWS innovations in custom-designed Amazon EC2 chips built for high performance and scale in the cloud. Register here.

AWS re:Inforce (June 13–14) – You can still register for AWS re:Inforce. This year it is taking place in Anaheim, California.

AWS Global Summits – Sign up for the AWS Summit closest to where you live: Hong Kong (May 23), India (May 25), Amsterdam (June 1), London (June 7), Washington, DC (June 7-8), Toronto (June 14), Madrid (June 15), and Milano (June 22). If you want to meet, I’ll be at the one in London.

AWS Community Days – Join these community-led conferences where event logistics and content is planned, sourced, and delivered by community leaders: Dublin, Ireland (May 25), Shenzhen, China (May 28), Warsaw, Poland (June 1), Chicago, USA (June 15), and Chile (July 1).

That’s all from me for this week. Come back next Monday for another Week in Review!

Danilo

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Learn How to Modernize Your Applications at AWS Serverless Innovation Day

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/learn-how-to-modernize-your-applications-at-aws-serverless-innovation-day/

Join us on Wednesday, May 17, for AWS Serverless Innovation Day, a free full-day virtual event. You will learn about AWS Serverless technologies and event-driven architectures from customers, experts, and leaders.

AWS Serverless Innovation Day is an event to empower builders and technical decision-makers with different AWS Serverless technologies, including AWS Lambda, Amazon Elastic Container Service (Amazon ECS) with AWS Fargate, Amazon EventBridge, and AWS Step Functions. The talks of the day will cover three key topics: event-driven architectures, serverless containers, and serverless functions, and how they can be utilized to build and modernize applications. Application modernization is a priority for organizations this year, and serverless helps to increase the software delivery speed and reduce the total cost of ownership.

AWS Serverless Innovation Day

Eric Johnson and Jessica Deen will be the hosts for the event. Holly Mesrobian, VP of Serverless Compute at AWS, will deliver the welcome keynote and share AWS’s vision for Serverless. The day ends with closing remarks from James Beswick and Usman Khalid, Events and Workflows Director at AWS.

The event is split into three groups of talks: event-driven architecture, serverless containers, and Lambda-based applications. Each group kicks off with a fireside chat between AWS customers and an AWS leader. You can learn how organizations, such as Capital One, PostNL, Pentasoft, Delta Air Lines, and Smartsheets, are using AWS Serverless technologies to solve their most challenging problems and continue to innovate.

During the day, all the sessions include demos and use cases, where you can learn the best practices and how to build applications. If you cannot attend all day, here are some of my favorite sessions to watch:

  • Building with serverless workflows at scaleBen Smith will show you how to unleash the power of AWS Step Functions.
  • Event design and event-first development – In this session, David Boyne will show you a robust approach to event design with Amazon EventBridge.
  • Best practices for AWS Lambda – You will learn from Julian Wood how to get the most out of your functions.
  • Optimizing for cost using Amazon ECSScott Coulton will show you how to reduce operational overhead from the control plane with Amazon ECS.

There is no up-front registration required to join the AWS Serverless Innovation Day, but if you want to be notified before the event starts, get in-depth news, articles, and event updates, and get a notification when the on-demand videos are available, you can register on the event page. The event will be streamed on Twitch, LinkedIn Live, YouTube, and Twitter.

See you there.

Marcia

Automating stopping and starting Amazon MWAA environments to reduce cost

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/automating-stopping-and-starting-amazon-mwaa-environments-to-reduce-cost/

This was written by Uma Ramadoss, Specialist Integration Services, and Chandan Rupakheti, Solutions Architect.

This blog post shows how you can save cost by automating the stopping and starting of an Amazon Managed Workflows for Apache Airflow (Amazon MWAA) environment. It describes how you can retain the data stored in a metadata database and presents an automated solution you can use in your AWS account.

Customers run end to end data pipelines at scale with MWAA. It is a common best practice to run non-production environments for development and testing. A nonproduction environment often does not need to run throughout the day due to factors such as working hours of the development team. As there is no automatic way to stop an MWAA environment when not in use and deleting an environment causes metadata loss, customers often run it continually and pay the full cost.

Overview

Amazon MWAA has a distributed architecture with multiple components such as scheduler, worker, webserver, queue, and database. Customers build data pipelines as Directed Acyclic Graphs (DAGs) and run in Amazon MWAA. The DAGs use variables and connections from the Amazon MWAA metadata database. The history of DAG runs and related data are stored in the same metadata database. The database also stores other information such as user roles and permissions.

When you delete the Amazon MWAA environment, all the components including the database are deleted so that you do not incur any cost. As this normal deletion results in loss of metadata, you need a customized solution to back up the data and to automate the deletion and recreation.

The sample application deletes and recreates your MWAA environment at a scheduled interval defined by you using Amazon EventBridge Scheduler. It exports all metadata into an Amazon S3 bucket before deletion and imports the metadata back to the environment after creation. As this is a managed database and you cannot access the database outside the Amazon MWAA environment, it uses DAGs to import and export the data. The entire process is orchestrated using AWS Step Functions.

Deployment architecture

The sample application is in a GitHub repository. Use the instructions in the readme to deploy the application.

Sample architecture

The sample application deploys the following resources –

  1. A Step Functions state machine to orchestrate the steps needed to delete the MWAA environment.
  2. A Step Functions state machine to orchestrate the steps needed to recreate the MWAA environment.
  3. EventBridge Scheduler rules to trigger the state machines at the scheduled time.
  4. An S3 bucket to store metadata database backup and environment details.
  5. Two DAG files uploaded to the source S3 bucket configured with the MWAA environment. The export DAG exports metadata from the MWAA metadata database to backup S3 bucket. The import DAG restores the metadata from the backup S3 bucket to the newly created MWAA environment.
  6. AWS Lambda functions for triggering the DAGs using MWAA CLI API.
  7. A Step Functions state machine to wait for the long-running MWAA creation and deletion process.
  8. Amazon EventBridge rule to notify on state machine failures.
  9. Amazon Simple Notification Service (Amazon SNS) topic as a target to the EventBridge rule for failure notifications.
  10. Amazon Interface VPC Endpoint for Step Functions for MWAA environment deployed in the private mode.

Stop workflow

At a scheduled time, Amazon EventBridge Scheduler triggers a Step Functions state machine to stop the MWAA environment. The state machine performs the following actions:

Stop workflow

  1. Fetch Amazon MWAA environment details such as airflow configurations, IAM execution role, logging configurations and VPC details.
  2. If the environment is not in the “AVAILABLE” status, it fails the workflow by branching to the “Pausing unsuccessful” state.
  3. Otherwise, it runs the normal workflow and stores the environment details in an S3 bucket so that Start workflow can recreate the environment with this data.
  4. Trigger an MWAA DAG using AWS Lambda function to export metadata to the Amazon S3 bucket. This step uses Step Functions to wait for callback token integration.
  5. Resume the workflow when the task token is returned from the MWAA DAG.
  6. Delete Amazon MWAA environment.
  7. Wait to confirm the deletion.

Start workflow

At a scheduled time, EventBridge Scheduler triggers the Step Functions state machine to recreate the MWAA environment. The steps in the state machine perform the following actions:

Start workflow

  1. Retrieve the environment details stored in Amazon S3 bucket by the stop workflow.
  2. Create an MWAA environment with the same configuration as the original.
  3. Trigger an MWAA DAG through the Lambda function to restore the metadata from the S3 bucket to the newly created environment.

Cost savings

Consider a small MWAA environment in us-east-2 with a minimum of one worker, a maximum of one worker, and 1GB data storage. At the time of this writing, the monthly cost of the environment is $357.80. Let’s assume you use this environment between 6 am and 6 pm on weekdays.

The schedule in the env file of the sample application looks like:

MWAA_PAUSE_CRON_SCHEDULE=’0 18 ? * MON-FRI *’
MWAA_RESUME_CRON_SCHEDULE=’30 5 ? * MON-FRI *’

As MWAA environment creation takes anywhere between 20 and 30 minutes, the MWAA_RESUME_CRON_SCHEDULE is set at 5.30 pm.

Assuming 21 weekdays per month, the monthly cost of the environment is $123.48 and is 65.46% less compared to running the environment continuously:

  • 21 weekdays * 12 hours * 0.49 USD per hour = $123.48

Additional considerations

The sample application only restores at-store data. Though the deletion process pauses all the DAGs before making the backup, it cannot stop any running tasks or in-flight messages in the queue. It also does not backup tasks that are not in completed state. This can result in task history loss for the tasks that were running during the backup.

Over time, the metadata grows in size, which can increase latency in query performance. You can use a DAG as shown in the example to clean up the database regularly.

Avoid setting the catchup by default configuration flag in the environment setting to true or in the DAG definition unless it is required. Catch up feature runs all the DAG runs that are missed for any data interval. When the environment is created again, if the flag is true, it catches up with the missed DAG runs and can overload the environment.

Conclusion

Automating the deletion and recreation of Amazon MWAA environments is a powerful solution for cost optimization and efficient management of resources. By following the steps outlined in this blog post, you can ensure that your MWAA environment is deleted and recreated without losing any of the metadata or configurations. This allows you to deploy new code changes and updates more quickly and easily, without having to configure your environment each time manually.

The potential cost savings of running your MWAA environment for only 12 hours on weekdays are significant. The example shows how you can save up to 65% of your monthly costs by choosing this option. This makes it an attractive solution for organizations that are looking to reduce cost while maintaining a high level of performance.

Visit the samples repository to learn more about Amazon MWAA. It contains a wide variety of examples and templates that you can use to build your own applications.

For more serverless learning resources, visit Serverless Land.

Let’s Architect! Designing serverless solutions

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-designing-serverless-solutions/

During his re:Invent 2022 keynote, Werner Vogels, AWS Vice President and Chief Technology Officer, emphasized the asynchronous nature of our world and the challenges associated with incorporating asynchronicity into our architectures. AWS serverless services can help users concentrate on the asynchronous aspects of their workloads, easing the execution of event-driven architectures and enabling the adoption of effective integration patterns for communication both within and beyond a bounded context.

In this edition of Let’s Architect!, we offer an in-depth exploration of the architecture of serverless AWS services, such as AWS Lambda. We also present a new workshop centered on design patterns employing serverless AWS services, which ultimately delivers valuable insights on implementing event-driven architectures within systems.

A closer look at AWS Lambda

This video is the perfect companion for those seeking to learn and master a Lambda architecture, empowering you to effectively leverage its capabilities in your workloads.

With the knowledge gained from this video, you will be well-equipped to design your functions’ code in a highly optimized manner, ensuring efficient performance and resource utilization. Furthermore, a comprehensive understanding of Lambda functions can help identify and apply the most suitable approach to cloud workloads, resulting in an agile and robust cloud infrastructure that meets a project’s unique requirements.

Take me to this video!

Discover how AWS Lambda functions work under the hood

Discover how AWS Lambda functions work under the hood

Implementing an event-driven serverless story generation application with ChatGPT and DALL-E

This example of an event-driven serverless architecture showcases the power of leveraging AWS services and AI technologies to develop innovative solutions. Built upon a foundation of serverless services, including Amazon EventBridge, Amazon DynamoDB, Lambda, Amazon Simple Storage Service, and managed artificial intelligence (AI ) services like Amazon Polly, this architecture demonstrates the seamless capacity to create daily stories with a scheduled launch. By utilizing EventBridge scheduler, an Lambda function is initiated every night to generate new content. The integration of AI services, like ChatGPT and DALL-E, further elevates the solution, as their compatibility with the serverless model enables efficient and dynamic content creation. This case serves as a testament to the potential of combining event-driven serverless architectures, with cutting-edge AI technologies for inventive and impactful applications.

Take me to this Compute Blog post!

How to build an event-driven architecture with serverless AWS services integrating ChatGPT and DALL-E

How to build an event-driven architecture with serverless AWS services integrating ChatGPT and DALL-E

AWS Workshop Studio: Serverless Patterns

The AWS Serverless Patterns workshop offers a comprehensive learning experience to enhance your understanding of architectural patterns applicable to serverless projects. Throughout the workshop, participants will delve into various patterns, such as synchronous and asynchronous implementations, tailored to meet the demands of modern serverless applications. This hands-on approach ensures a production-ready understanding, encompassing crucial topics like testing serverless workloads, establishing automation pipelines, and more. Take this workshop to elevate your serverless architecture knowledge!

Take me to the serverless workshop!

The high-level architecture of the workshops modules

The high-level architecture of the workshops modules

Building Serverlesspresso: Creating event-driven architectures

Serverlesspresso is an event-driven, serverless workload that uses EventBridge and AWS Step Functions to coordinate events across microservices and support thousands of orders per day. This comprehensive session delves into design considerations, development processes, and valuable lessons learned from creating a production-ready solution. Discover practical patterns and extensibility options that contribute to a robust, scalable, and cost-effective application. Gain insights into combining EventBridge and Step Functions to address complex architectural challenges in larger applications.

Take me to this video!

How to leverage AWS Step Functions for orchestrating your workflows

How to leverage AWS Step Functions for orchestrating your workflows

See you next time!

Thanks for joining our conversation on serverless solutions! We’ll see you next time when we talk about AWS microservices.

Can’t get enough of the Let’s Architect! series? Visit the Let’s Architect! page of the AWS Architecture Blog!

Extending a serverless, event-driven architecture to existing container workloads

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/extending-a-serverless-event-driven-architecture-to-existing-container-workloads/

This post is written by Dhiraj Mahapatro, Principal Specialist SA, and Sascha Moellering, Principal Specialist SA, and Emily Shea, WW Lead, Integration Services.

Many serverless services are a natural fit for event-driven architectures (EDA), as events invoke them and only run when there is an event to process. When building in the cloud, many services emit events by default and have built-in features for managing events. This combination allows customers to build event-driven architectures easier and faster than ever before.

The insurance claims processing sample application in this blog series uses event-driven architecture principles and serverless services like AWS LambdaAWS Step FunctionsAmazon API GatewayAmazon EventBridge, and Amazon SQS.

When building an event-driven architecture, it’s likely that you have existing services to integrate with the new architecture, ideally without needing to make significant refactoring changes to those services. As services communicate via events, extending applications to new and existing microservices is a key benefit of building with EDA. You can write those microservices in different programming languages or running on different compute options.

This blog post walks through a scenario of integrating an existing, containerized service (a settlement service) to the serverless, event-driven insurance claims processing application described in this blog post.

Overview of sample event-driven architecture

The sample application uses a front-end to sign up a new user and allow the user to upload images of their car and driver’s license. Once signed up, they can file a claim and upload images of their damaged car. Previously, it did not yet integrate with a settlement service for completing the claims and settlement process.

In this scenario, the settlement service is a brownfield application that runs Spring Boot 3 on Amazon ECS with AWS Fargate. AWS Fargate is a serverless, pay-as-you-go compute engine that lets you focus on building container applications without managing servers.

The Spring Boot application exposes a REST endpoint, which accepts a POST request. It applies settlement business logic and creates a settlement record in the database for a car insurance claim. Your goal is to make settlement work with the new EDA application that is designed for claims processing without re-architecting or rewriting. Customer, claims, fraud, document, and notification are the other domains that are shown as blue-colored boxes in the following diagram:

Reference architecture

Project structure

The application uses AWS Cloud Development Kit (CDK) to build the stack. With CDK, you get the flexibility to create modular and reusable constructs imperatively using your language of choice. The sample application uses TypeScript for CDK.

The following project structure enables you to build different bounded contexts. Event-driven architecture relies on the choreography of events between domains. The object oriented programming (OOP) concept of CDK helps provision the infrastructure to separate the domain concerns while loosely coupling them via events.

You break the higher level CDK constructs down to these corresponding domains:

Comparing domains

Application and infrastructure code are present in each domain. This project structure creates a seamless way to add new domains like settlement with its application and infrastructure code without affecting other areas of the business.

With the preceding structure, you can use the settlement-service.ts CDK construct inside claims-processing-stack.ts:

const settlementService = new SettlementService(this, "SettlementService", {
  bus,
});

The only information the SettlementService construct needs to work is the EventBridge custom event bus resource that is created in the claims-processing-stack.ts.

To run the sample application, follow the setup steps in the sample application’s README file.

Existing container workload

The settlement domain provides a REST service to the rest of the organization. A Docker containerized Spring Boot application runs on Amazon ECS with AWS Fargate. The following sequence diagram shows the synchronous request-response flow from an external REST client to the service:

Settlement service

  1. External REST client makes POST /settlement call via an HTTP API present in front of an internal Application Load Balancer (ALB).
  2. SettlementController.java delegates to SettlementService.java.
  3. SettlementService applies business logic and calls SettlementRepository for data persistence.
  4. SettlementRepository persists the item in the Settlement DynamoDB table.

A request to the HTTP API endpoint looks like:

curl --location <settlement-api-endpoint-from-cloudformation-output> \
--header 'Content-Type: application/json' \
--data '{
  "customerId": "06987bc1-1234-1234-1234-2637edab1e57",
  "claimId": "60ccfe05-1234-1234-1234-a4c1ee6fcc29",
  "color": "green",
  "damage": "bumper_dent"
}'

The response from the API call is:

API response

You can learn more here about optimizing Spring Boot applications on AWS Fargate.

Extending container workload for events

To integrate the settlement service, you must update the service to receive and emit events asynchronously. The core logic of the settlement service remains the same. When you file a claim, upload damaged car images, and the application detects no document fraud, the settlement domain subscribes to Fraud.Not.Detected event and applies its business logic. The settlement service emits an event back upon applying the business logic.

The following sequence diagram shows a new interface in settlement to work with EDA. The settlement service subscribes to events that a producer emits. Here, the event producer is the fraud service that puts an event in an EventBridge custom event bus.

Sequence diagram

  1. Producer emits Fraud.Not.Detected event to EventBridge custom event bus.
  2. EventBridge evaluates the rules provided by the settlement domain and sends the event payload to the target SQS queue.
  3. SubscriberService.java polls for new messages in the SQS queue.
  4. On message, it transforms the message body to an input object that is accepted by SettlementService.
  5. It then delegates the call to SettlementService, similar to how SettlementController works in the REST implementation.
  6. SettlementService applies business logic. The flow is like the REST use case from 7 to 10.
  7. On receiving the response from the SettlementService, the SubscriberService transforms the response to publish an event back to the event bus with the event type as Settlement.Finalized.

The rest of the architecture consumes this Settlement.Finalized event.

Using EventBridge schema registry and discovery

Schema enforces a contract between a producer and a consumer. A consumer expects the exact structure of the event payload every time an event arrives. EventBridge provides schema registry and discovery to maintain this contract. The consumer (the settlement service) can download the code bindings and use them in the source code.

Enable schema discovery in EventBridge before downloading the code bindings and using them in your repository. The code bindings provide a marshaller that unmarshals the incoming event from SQS queue to a plain old Java object (POJO) FraudNotDetected.java. You download the code bindings using the choice of your IDE. AWS Toolkit for IntelliJ makes it convenient to download and use them.

Download code bindings

The final architecture for the settlement service with REST and event-driven architecture looks like:

Final architecture

Transition to become fully event-driven

With the new capability to handle events, the Spring Boot application now supports both the REST endpoint and the event-driven architecture by running the same business logic through different interfaces. In this example scenario, as the event-driven architecture matures and the rest of the organization adopts it, the need for the POST endpoint to save a settlement may diminish. In the future, you can deprecate the endpoint and fully rely on polling messages from the SQS queue.

You start with using an ALB and Fargate service CDK ECS pattern:

const loadBalancedFargateService = new ecs_patterns.ApplicationLoadBalancedFargateService(
  this,
  "settlement-service",
  {
    cluster: cluster,
    taskImageOptions: {
      image: ecs.ContainerImage.fromDockerImageAsset(asset),
      environment: {
        "DYNAMODB_TABLE_NAME": this.table.tableName
      },
      containerPort: 8080,
      logDriver: new ecs.AwsLogDriver({
        streamPrefix: "settlement-service",
        mode: ecs.AwsLogDriverMode.NON_BLOCKING,
        logRetention: RetentionDays.FIVE_DAYS,
      })
    },
    memoryLimitMiB: 2048,
    cpu: 1024,
    publicLoadBalancer: true,
    desiredCount: 2,
    listenerPort: 8080
  });

To adapt to EDA, you update the resources to retrofit the SQS queue to receive messages and EventBridge to put events. Add new environment variables to the ApplicationLoadBalancerFargateService resource:

environment: {
  "SQS_ENDPOINT_URL": queue.queueUrl,
  "EVENTBUS_NAME": props.bus.eventBusName,
  "DYNAMODB_TABLE_NAME": this.table.tableName
}

Grant the Fargate task permission to put events in the custom event bus and consume messages from the SQS queue:

props.bus.grantPutEventsTo(loadBalancedFargateService.taskDefinition.taskRole);
queue.grantConsumeMessages(loadBalancedFargateService.taskDefinition.taskRole);

When you transition the settlement service to become fully event-driven, you do not need the HTTP API endpoint and ALB anymore, as SQS is the source of events.

A better alternative is to use QueueProcessingFargateService ECS pattern for the Fargate service. The pattern provides auto scaling based on the number of visible messages in the SQS queue, besides CPU utilization. In the following example, you can also add two capacity provider strategies while setting up the Fargate service: FARGATE_SPOT and FARGATE. This means, for every one task that is run using FARGATE, there are two tasks that use FARGATE_SPOT. This can help optimize cost.

const queueProcessingFargateService = new ecs_patterns.QueueProcessingFargateService(this, 'Service', {
  cluster,
  memoryLimitMiB: 1024,
  cpu: 512,
  queue: queue,
  image: ecs.ContainerImage.fromDockerImageAsset(asset),
  desiredTaskCount: 2,
  minScalingCapacity: 1,
  maxScalingCapacity: 5,
  maxHealthyPercent: 200,
  minHealthyPercent: 66,
  environment: {
    "SQS_ENDPOINT_URL": queueUrl,
    "EVENTBUS_NAME": props?.bus.eventBusName,
    "DYNAMODB_TABLE_NAME": tableName
  },
  capacityProviderStrategies: [
    {
      capacityProvider: 'FARGATE_SPOT',
      weight: 2,
    },
    {
      capacityProvider: 'FARGATE',
      weight: 1,
    },
  ],
});

This pattern abstracts the automatic scaling behavior of the Fargate service based on the queue depth.

Running the application

To test the application, follow How to use the Application after the initial setup. Once complete, you see that the browser receives a Settlement.Finalized event:

{
  "version": "0",
  "id": "e2a9c866-cb5b-728c-ce18-3b17477fa5ff",
  "detail-type": "Settlement.Finalized",
  "source": "settlement.service",
  "account": "123456789",
  "time": "2023-04-09T23:20:44Z",
  "region": "us-east-2",
  "resources": [],
  "detail": {
    "settlementId": "377d788b-9922-402a-a56c-c8460e34e36d",
    "customerId": "67cac76c-40b1-4d63-a8b5-ad20f6e2e6b9",
    "claimId": "b1192ba0-de7e-450f-ac13-991613c48041",
    "settlementMessage": "Based on our analysis on the damage of your car per claim id b1192ba0-de7e-450f-ac13-991613c48041, your out-of-pocket expense will be $100.00."
  }
}

Cleaning up

The stack creates a custom VPC and other related resources. Be sure to clean up resources after usage to avoid the ongoing cost of running these services. To clean up the infrastructure, follow the clean-up steps shown in the sample application.

Conclusion

The blog explains a way to integrate existing container workload running on AWS Fargate with a new event-driven architecture. You use EventBridge to decouple different services from each other that are built using different compute technologies, languages, and frameworks. Using AWS CDK, you gain the modularity of building services decoupled from each other.

This blog shows an evolutionary architecture that allows you to modernize existing container workloads with minimal changes that still give you the additional benefits of building with serverless and EDA on AWS.

The major difference between the event-driven approach and the REST approach is that you unblock the producer once it emits an event. The event producer from the settlement domain that subscribes to that event is loosely coupled. The business functionality remains intact, and no significant refactoring or re-architecting effort is required. With these agility gains, you may get to the market faster

The sample application shows the implementation details and steps to set up, run, and clean up the application. The app uses ECS Fargate for a domain service, but you do not limit it to just Fargate. You can also bring container-based applications running on Amazon EKS similarly to event-driven architecture.

Learn more about event-driven architecture on Serverless Land.

How CyberCRX cut ML processing time from 8 days to 56 minutes with AWS Step Functions Distributed Map

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/how-cybercrx-cut-ml-processing-time-from-8-days-to-56-minutes-with-aws-step-functions-distributed-map/

Last December, Sébastien Stormacq wrote about the availability of a distributed map state for AWS Step Functions, a new feature that allows you to orchestrate large-scale parallel workloads in the cloud. That’s when Charles Burton, a data systems engineer for a company called CyberGRX, found out about it and refactored his workflow, reducing the processing time for his machine learning (ML) processing job from 8 days to 56 minutes. Before, running the job required an engineer to constantly monitor it; now, it runs in less than an hour with no support needed. In addition, the new implementation with AWS Step Functions Distributed Map costs less than what it did originally.

What CyberGRX achieved with this solution is a perfect example of what serverless technologies embrace: letting the cloud do as much of the undifferentiated heavy lifting as possible so the engineers and data scientists have more time to focus on what’s important for the business. In this case, that means continuing to improve the model and the processes for one of the key offerings from CyberGRX, a cyber risk assessment of third parties using ML insights from its large and growing database.

What’s the business challenge?
CyberGRX shares third-party cyber risk (TPCRM) data with their customers. They predict, with high confidence, how a third-party company will respond to a risk assessment questionnaire. To do this, they have to run their predictive model on every company in their platform; they currently have predictive data on more than 225,000 companies. Whenever there’s a new company or the data changes for a company, they regenerate their predictive model by processing their entire dataset. Over time, CyberGRX data scientists improve the model or add new features to it, which also requires the model to be regenerated.

The challenge is running this job for 225,000 companies in a timely manner, with as few hands-on resources as possible. The job runs a set of operations for each company, and every company calculation is independent of other companies. This means that in the ideal case, every company can be processed at the same time. However, implementing such a massive parallelization is a challenging problem to solve.

First iteration
With that in mind, the company built their first iteration of the pipeline using Kubernetes and Argo Workflows, an open-source container-native workflow engine for orchestrating parallel jobs on Kubernetes. These were tools they were familiar with, as they were already using them in their infrastructure.

But as soon as they tried to run the job for all the companies on the platform, they ran up against the limits of what their system could handle efficiently. Because the solution depended on a centralized controller, Argo Workflows, it was not robust, and the controller was scaled to its maximum capacity during this time. At that time, they only had 150,000 companies. And running the job with all of the companies took around 8 days, during which the system would crash and need to be restarted. It was very labor intensive, and it always required an engineer on call to monitor and troubleshoot the job.

The tipping point came when Charles joined the Analytics team at the beginning of 2022. One of his first tasks was to do a full model run on approximately 170,000 companies at that time. The model run lasted the whole week and ended at 2:00 AM on a Sunday. That’s when he decided their system needed to evolve.

Second iteration
With the pain of the last time he ran the model fresh in his mind, Charles thought through how he could rewrite the workflow. His first thought was to use AWS Lambda and SQS, but he realized that he needed an orchestrator in that solution. That’s why he chose Step Functions, a serverless service that helps you automate processes, orchestrate microservices, and create data and ML pipelines; plus, it scales as needed.

Charles got the new version of the workflow with Step Functions working in about 2 weeks. The first step he took was adapting his existing Docker image to run in Lambda using Lambda’s container image packaging format. Because the container already worked for his data processing tasks, this update was simple. He scheduled Lambda provisioned concurrency to make sure that all functions he needed were ready when he started the job. He also configured reserved concurrency to make sure that Lambda would be able to handle this maximum number of concurrent executions at a time. In order to support so many functions executing at the same time, he raised the concurrent execution quota for Lambda per account.

And to make sure that the steps were run in parallel, he used Step Functions and the map state. The map state allowed Charles to run a set of workflow steps for each item in a dataset. The iterations run in parallel. Because Step Functions map state offers 40 concurrent executions and CyberGRX needed more parallelization, they created a solution that launched multiple state machines in parallel; in this way, they were able to iterate fast across all the companies. Creating this complex solution, required a preprocessor that handled the heuristics of the concurrency of the system and split the input data across multiple state machines.

This second iteration was already better than the first one, as now it was able to finish the execution with no problems, and it could iterate over 200,000 companies in 90 minutes. However, the preprocessor was a very complex part of the system, and it was hitting the limits of the Lambda and Step Functions APIs due to the amount of parallelization.

Second iteration with AWS Step Functions

Third and final iteration
Then, during AWS re:Invent 2022, AWS announced a distributed map for Step Functions, a new type of map state that allows you to write Step Functions to coordinate large-scale parallel workloads. Using this new feature, you can easily iterate over millions of objects stored in Amazon Simple Storage Service (Amazon S3), and then the distributed map can launch up to 10,000 parallel sub-workflows to process the data.

When Charles read in the News Blog article about the 10,000 parallel workflow executions, he immediately thought about trying this new state. In a couple of weeks, Charles built the new iteration of the workflow.

Because the distributed map state split the input into different processors and handled the concurrency of the different executions, Charles was able to drop the complex preprocessor code.

The new process was the simplest that it’s ever been; now whenever they want to run the job, they just upload a file to Amazon S3 with the input data. This action triggers an Amazon EventBridge rule that targets the state machine with the distributed map. The state machine then executes with that file as an input and publishes the results to an Amazon Simple Notification Service (Amazon SNS) topic.

Final iteration with AWS Step Functions

What was the impact?
A few weeks after completing the third iteration, they had to run the job on all 227,000 companies in their platform. When the job finished, Charles’ team was blown away; the whole process took only 56 minutes to complete. They estimated that during those 56 minutes, the job ran more than 57 billion calculations.

Processing of the Distributed Map State

The following image shows an Amazon CloudWatch graph of the concurrent executions for one Lambda function during the time that the workflow was running. There are almost 10,000 functions running in parallel during this time.

Lambda concurrency CloudWatch graph

Simplifying and shortening the time to run the job opens a lot of possibilities for CyberGRX and the data science team. The benefits started right away the moment one of the data scientists wanted to run the job to test some improvements they had made for the model. They were able to run it independently without requiring an engineer to help them.

And, because the predictive model itself is one of the key offerings from CyberGRX, the company now has a more competitive product since the predictive analysis can be refined on a daily basis.

Learn more about using AWS Step Functions:

You can also check the Serverless Workflows Collection that we have available in Serverless Land for you to test and learn more about this new capability.

Marcia

Cross-account integration between SaaS platforms using Amazon AppFlow

Post Syndicated from Ramakant Joshi original https://aws.amazon.com/blogs/big-data/cross-account-integration-between-saas-platforms-using-amazon-appflow/

Implementing an effective data sharing strategy that satisfies compliance and regulatory requirements is complex. Customers often need to share data between disparate software as a service (SaaS) platforms within their organization or across organizations. On many occasions, they need to apply business logic to the data received from the source SaaS platform before pushing it to the target SaaS platform.

Let’s take an example. AnyCompany’s marketing team hosted an event at the Anaheim Convention Center, CA. The marketing team created leads based on the event in Adobe Marketo. An automated process downloaded the leads from Marketo in the marketing AWS account. These leads are then pushed to the sales AWS account. A business process picks up those leads, filters them based on a “Do Not Call” criteria, and creates entries in the Salesforce system. Now, the sales team can pursue those leads and continue to track the opportunities in Salesforce.

In this post, we show how to share your data across SaaS platforms in a cross-account structure using fully managed, low-code AWS services such as Amazon AppFlow, Amazon EventBridge, AWS Step Functions, and AWS Glue.

Solution overview

Considering our example of AnyCompany, let’s look at the data flow. AnyCompany’s Marketo instance is integrated with the producer AWS account. As the leads from Marketo land in the producer AWS account, they’re pushed to the consumer AWS account, which is integrated to Salesforce. Business logic is applied to the leads data in the consumer AWS account, and then the curated data is loaded into Salesforce.

We have used a serverless architecture to implement this use case. The following AWS services are used for data ingestion, processing, and load:

  • Amazon AppFlow is a fully managed integration service that enables you to securely transfer data between SaaS applications like Salesforce, SAP, Marketo, Slack, and ServiceNow, and AWS services like Amazon S3 and Amazon Redshift, in just a few clicks. With AppFlow, you can run data flows at nearly any scale at the frequency you choose—on a schedule, in response to a business event, or on demand. You can configure data transformation capabilities like filtering and validation to generate rich, ready-to-use data as part of the flow itself, without additional steps. Amazon AppFlow is used to download leads data from Marketo and upload the curated leads data into Salesforce.
  • Amazon EventBridge is a serverless event bus that lets you receive, filter, transform, route, and deliver events. EventBridge is used to track the events like receiving the leads data in the producer or consumer AWS accounts and then triggering a workflow.
  • AWS Step Functions is a visual workflow service that helps developers use AWS services to build distributed applications, automate processes, orchestrate microservices, and create data and machine learning (ML) pipelines. Step Functions is used to orchestrate the data processing.
  • AWS Glue is a serverless data preparation service that makes it easy to run extract, transform, and load (ETL) jobs. An AWS Glue job encapsulates a script that reads, processes, and then writes data to a new schema. This solution uses Python 3.6 AWS Glue jobs for data filtration and processing.
  • Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading scalability, data availability, security, and performance. Amazon S3 is used to store the leads data.

Let’s review the architecture in detail. The following diagram shows a visual representation of how this integration works.

The following steps outline the process for transferring and processing leads data using Amazon AppFlow, Amazon S3, EventBridge, Step Functions, AWS Glue, and Salesforce:

  1. Amazon AppFlow runs on a daily schedule and retrieves any new leads created within the last 24 hours (incremental changes) from Marketo.
  2. The leads are saved as Parquet format files in an S3 bucket in the producer account.
  3. When the daily flow is complete, Amazon AppFlow emits events to EventBridge.
  4. EventBridge triggers Step Functions.
  5. Step Functions copies the Parquet format files containing the leads from the producer account’s S3 bucket to the consumer account’s S3 bucket.
  6. Upon a successful file transfer, Step Functions publishes an event in the consumer account’s EventBridge.
  7. An EventBridge rule intercepts this event and triggers Step Functions in the consumer account.
  8. Step Functions calls an AWS Glue crawler, which scans the leads Parquet files and creates a table in the AWS Glue Data Catalog.
  9. The AWS Glue job is called, which selects records with the Do Not Call field set to false from the leads files, and creates a new set of curated Parquet files. We have used an AWS Glue job for the ETL pipeline to showcase how you can use purpose-built analytics service for complex ETL needs. However, for simple filtering requirements like Do Not Call, you can use the existing filtering feature of Amazon AppFlow.
  10. Step Functions then calls Amazon AppFlow.
  11. Finally, Amazon AppFlow populates the Salesforce leads based on the data in the curated Parquet files.

We have provided artifacts in this post to deploy the AWS services in your account and try out the solution.

Prerequisites

To follow the deployment walkthrough, you need two AWS accounts, one for the producer and other for the consumer. Use us-east-1 or us-west-2 as your AWS Region.

Consumer account setup:

Stage the data

To prepare the data, complete the following steps:

  1. Download the zipped archive file to use for this solution and unzip the files locally.

The AWS Glue job uses the glue-job.py script to perform ETL and populates the curated table in the Data Catalog.

  1. Create an S3 bucket called consumer-configbucket-<ACCOUNT_ID> via the Amazon S3 console in the consumer account, where ACCOUNT_ID is your AWS account ID.
  2. Upload the script to this location.

Create a connection to Salesforce

Follow the connection setup steps outlined in here. Please make a note of the Salesforce connector name.

Create a connection to Salesforce in the consumer account

Follow the connection setup steps outlined in Create Opportunity Object Flow.

Set up resources with AWS CloudFormation

We provided two AWS CloudFormation templates to create resources: one for the producer account, and one for the consumer account.

Amazon S3 now applies server-side encryption with Amazon S3 managed keys (SSE-S3) as the base level of encryption for every bucket in Amazon S3. Starting January 5, 2023, all new object uploads to Amazon S3 are automatically encrypted at no additional cost and with no impact on performance. We use this default encryption for both producer and consumer S3 buckets. If you choose to bring your own keys with AWS Key Management Service (AWS KMS), we recommend referring to Replicating objects created with server-side encryption (SSE-C, SSE-S3, SSE-KMS) for cross-account replication.

Launch the CloudFormation stack in the consumer account

Let’s start with creating resources in the consumer account. There are a few dependencies on the consumer account resources from the producer account. To launch the CloudFormation stack in the consumer account, complete the following steps:

  1. Sign in to the consumer account’s AWS CloudFormation console in the target Region.
  2. Choose Launch Stack.
    BDB-2063-launch-cloudformation-stack
  3. Choose Next.
  4. For Stack name, enter a stack name, such as stack-appflow-consumer.
  5. Enter the parameters for the connector name, object, and producer (source) account ID.
  6. Choose Next.
  7. On the next page, choose Next.
  8. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  9. Choose Create stack.

Stack creation takes approximately 5 minutes to complete. It will create the following resources. You can find them on the Outputs tab of the CloudFormation stack.

  • ConsumerS3Bucketconsumer-databucket-<consumer account id>
  • Consumer S3 Target Foldermarketo-leads-source
  • ConsumerEventBusArnarn:aws:events:<region>:<consumer account id>:event-bus/consumer-custom-event-bus
  • ConsumerEventRuleArnarn:aws:events:<region>:<consumer account id>:rule/consumer-custom-event-bus/consumer-custom-event-bus-rule
  • ConsumerStepFunctionarn:aws:states:<region>:<consumer account id>:stateMachine:consumer-state-machine
  • ConsumerGlueCrawlerconsumer-glue-crawler
  • ConsumerGlueJobconsumer-glue-job
  • ConsumerGlueDatabaseconsumer-glue-database
  • ConsumerAppFlowarn:aws:appflow:<region>:<consumer account id>:flow/consumer-appflow

Producer account setup:

Create a connection to Marketo

Follow the connection setup steps outlined in here. Please make a note of the Marketo connector name.

Launch the CloudFormation stack in the producer account

Now let’s create resources in the producer account. Complete the following steps:

  1. Sign in to the producer account’s AWS CloudFormation console in the source Region.
  2. Choose Launch Stack.
    BDB-2063-launch-cloudformation-stack
  3. Choose Next.
  4. For Stack name, enter a stack name, such as stack-appflow-producer.
  5. Enter the following parameters and leave the rest as default:
    • AppFlowMarketoConnectorName: name of the Marketo connector, created above
    • ConsumerAccountBucket: consumer-databucket-<consumer account id>
    • ConsumerAccountBucketTargetFolder: marketo-leads-source
    • ConsumerAccountEventBusArn: arn:aws:events:<region>:<consumer account id>:event-bus/consumer-custom-event-bus
    • DefaultEventBusArn: arn:aws:events:<region>:<producer account id>:event-bus/default


  6. Choose Next.
  7. On the next page, choose Next.
  8. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  9. Choose Create stack.

Stack creation takes approximately 5 minutes to complete. It will create the following resources. You can find them on the Outputs tab of the CloudFormation stack.

  • Producer AppFlowproducer-flow
  • Producer Bucketarn:aws:s3:::producer-bucket.<region>.<producer account id>
  • Producer Flow Completion Rulearn:aws:events:<region>:<producer account id>:rule/producer-appflow-completion-event
  • Producer Step Functionarn:aws:states:<region>:<producer account id>:stateMachine:ProducerStateMachine-xxxx
  • Producer Step Function Rolearn:aws:iam::<producer account id>:role/service-role/producer-stepfunction-role
  1. After successful creation of the resources, go to the consumer account S3 bucket, consumer-databucket-<consumer account id>, and update the bucket policy as follows:
{
    "Version": "2008-10-17",
    "Statement": [
        {
            "Sid": "AllowAppFlowDestinationActions",
            "Effect": "Allow",
            "Principal": {"Service": "appflow.amazonaws.com"},
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::consumer-databucket-<consumer-account-id>",
                "arn:aws:s3:::consumer-databucket-<consumer-account-id>/*"
            ]
        }, {
            "Sid": "Producer-stepfunction-role",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<producer-account-id>:role/service-role/producer-stepfunction-role"
            },
            "Action": [
                "s3:ListBucket",
                "s3:GetObject",
                "s3:PutObject",
                "s3:PutObjectAcl"
            ],
            "Resource": [
                "arn:aws:s3:::consumer-databucket-<consumer-account-id>",
                "arn:aws:s3:::consumer-databucket-<consumer-account-id>/*"
            ]
        }
    ]
}

Validate the workflow

Let’s walk through the flow:

  1. Review the Marketo and Salesforce connection setup in the producer and consumer account respectively.

In the architecture section, we suggested scheduling the AppFlow (producer-flow) in the producer account. However, for quick testing purposes, we demonstrate how to manually run the flow on demand.

  1. Go to the AppFlow (producer-flow) in the producer account. On the Filters tab of the flow, choose Edit filters.
  2. Choose the Created At date range for which you have data.
  3. Save the range and choose Run flow.
  4. Review the producer S3 bucket.

AppFlow generates the files in the producer-flow prefix within this bucket. The files are temporarily located in the producer S3 bucket under s3://<producer-bucket>.<region>.<account-id>/producer-flow.

  1. Review the EventBridge rule and Step Functions state machine in the producer account.

The Amazon AppFlow job completion triggers an EventBridge rule (arn:aws:events:<region>:<producer account id>:rule/producer-appflow-completion-event, as noted in the Outputs tab of the CloudFromation stack in the Producer Account), which triggers the Step Functions state machine (arn:aws:states:<region>:<producer account id>:stateMachine:ProducerStateMachine-xxxx) in the producer account. The state machine copies the files to the consumer S3 bucket from the producer-flow prefix in the producer S3 bucket. Once file copy is complete, the state machine moves the files from the producer-flow prefix to the archive prefix in the producer S3 bucket. You can find the files in s3://<producer-bucket>.<region>.<account-id>/archive.

  1. Review the consumer S3 bucket.

The Step Functions state machine in the producer account copies the files to the consumer S3 bucket and sends an event to EventBridge in the consumer account. The files are located in the consumer S3 bucket under s3://consumer-databucket-<account-id>/marketo-leads-source/.

  1. Review the EventBridge rule (arn:aws:events:<region>:<consumer account id>:rule/consumer-custom-event-bus/consumer-custom-event-bus-rule) in the consumer account, which should have triggered the Step Function workflow (arn:aws:states:<region>:<consumer account id>:stateMachine:consumer-state-machine).

The AWS Glue crawler (consumer-glue-crawler) runs to update the metadata followed by the AWS Glue job (consumer-glue-job), which curates the data by applying the Do not call filter. The curated files are placed in s3://consumer-databucket-<account-id>/marketo-leads-curated/. After data curation, the flow is started as part of the state machine.

  1. Review the Amazon AppFlow job (arn:aws:appflow:<region>:<consumer account id>:flow/consumer-appflow) run status in the consumer account.

Upon a successful run of the Amazon AppFlow job, the curated data files are moved to the s3://consumer-databucket-<account-id>/marketo-leads-processed/ folder and Salesforce is updated with the leads. Additionally, all the original source files are moved from s3://consumer-databucket-<account-id>/marketo-leads-source/ to s3://consumer-databucket-<account-id>/marketo-leads-archive/.

  1. Review the updated data in Salesforce.

You will see newly created or updated leads created by Amazon AppFlow.

Clean up

To clean up the resources created as part of this post, delete the following resources:

  1. Delete the resources in the producer account:
    • Delete the producer S3 bucket content.
    • Delete the CloudFormation stack.
  2. Delete the resources in the consumer account:
    • Delete the consumer S3 bucket content.
    • Delete the CloudFormation stack.

Summary

In this post, we showed how you can support a cross-account model to exchange data between different partners with different SaaS integrations using Amazon AppFlow. You can expand this idea to support multiple target accounts.

For more information, refer to Simplifying cross-account access with Amazon EventBridge resource policies. To learn more about Amazon AppFlow, visit Amazon AppFlow.


About the authors

Ramakant Joshi is an AWS Solutions Architect, specializing in the analytics and serverless domain. He has a background in software development and hybrid architectures, and is passionate about helping customers modernize their cloud architecture.

Debaprasun Chakraborty is an AWS Solutions Architect, specializing in the analytics domain. He has around 20 years of software development and architecture experience. He is passionate about helping customers in cloud adoption, migration and strategy.

Suraj Subramani Vineet is a Senior Cloud Architect at Amazon Web Services (AWS) Professional Services in Sydney, Australia. He specializes in designing and building scalable and cost-effective data platforms and AI/ML solutions in the cloud. Outside of work, he enjoys playing soccer on weekends.

Monitoring Amazon DevOps Guru insights using Amazon Managed Grafana

Post Syndicated from MJ Kubba original https://aws.amazon.com/blogs/devops/monitoring-amazon-devops-guru-insights-using-amazon-managed-grafana/

As organizations operate day-to-day, having insights into their cloud infrastructure state can be crucial for the durability and availability of their systems. Industry research estimates[1] that downtime costs small businesses around $427 per minute of downtime, and medium to large businesses an average of $9,000 per minute of downtime. Amazon DevOps Guru customers want to monitor and generate alerts using a single dashboard. This allows them to reduce context switching between applications, providing them an opportunity to respond to operational issues faster.

DevOps Guru can integrate with Amazon Managed Grafana to create and display operational insights. Alerts can be created and communicated for any critical events captured by DevOps Guru and notifications can be sent to operation teams to respond to these events. The key telemetry data types of logs and metrics are parsed and filtered to provide the necessary insights into observability.

Furthermore, it provides plug-ins to popular open-source databases, third-party ISV monitoring tools, and other cloud services. With Amazon Managed Grafana, you can easily visualize information from multiple AWS services, AWS accounts, and Regions in a single Grafana dashboard.

In this post, we will walk you through integrating the insights generated from DevOps Guru with Amazon Managed Grafana.

Solution Overview:

This architecture diagram shows the flow of the logs and metrics that will be utilized by Amazon Managed Grafana. Insights originate from DevOps Guru, each insight generating an event. These events are captured by Amazon EventBridge, and then saved as logs to Amazon CloudWatch Log Group DevOps Guru service metrics, and then parsed by Amazon Managed Grafana to create new dashboards.

This architecture diagram shows the flow of the logs and metrics that will be utilized by Amazon Managed Grafana, starting with DevOps Guru and then using Amazon EventBridge to save the insight event logs to Amazon CloudWatch Log Group DevOps Guru service metrics to be parsed by Amazon Managed Grafana and create new dashboards in Grafana from these logs and Metrics.

Now we will walk you through how to do this and set up notifications to your operations team.

Prerequisites:

The following prerequisites are required for this walkthrough:

  • An AWS Account
  • Enabled DevOps Guru on your account with CloudFormation stack, or tagged resources monitored.

Using Amazon CloudWatch Metrics

 

DevOps Guru sends service metrics to CloudWatch Metrics. We will use these to      track metrics for insights and metrics for your DevOps Guru usage; the DevOps Guru service reports the metrics to the AWS/DevOps-Guru namespace in CloudWatch by default.

First, we will provision an Amazon Managed Grafana workspace and then create a Dashboard in the workspace that uses Amazon CloudWatch as a data source.

Setting up Amazon CloudWatch Metrics

  1. Create Grafana Workspace
    Navigate to Amazon Managed Grafana from AWS console, then click Create workspace

a. Select the Authentication mechanism

i. AWS IAM Identity Center (AWS SSO) or SAML v2 based Identity Providers

ii. Service Managed Permission or Customer Managed

iii. Choose Next

b. Under “Data sources and notification channels”, choose Amazon CloudWatch

c. Create the Service.

You can use this post for more information on how to create and configure the Grafana workspace with SAML based authentication.

Next, we will show you how to create a dashboard and parse the Logs and Metrics to display the DevOps Guru insights and recommendations.

2. Configure Amazon Managed Grafana

a. Add CloudWatch as a data source:
From the left bar navigation menu, hover over AWS and select Data sources.

b. From the Services dropdown select and configure CloudWatch.

3. Create a Dashboard

a. From the left navigation bar, click on add a new Panel.

b. You will see a demo panel.

c. In the demo panel – Click on Data source and select Amazon CloudWatch.

The Amazon Grafana Workspace dashboard with the Grafana data source dropdown menu open. The drop down has 'Amazon CloudWatch (region name)' highlighted, other options include 'Mixed, 'Dashboard', and 'Grafana'.

d. For this panel we will use CloudWatch metrics to display the number of insights.

e. From Namespace select the AWS/DevOps-Guru name space, Insights as Metric name and Average for Statistics.

In the Amazon Grafana Workspace dashboard the user has entered values in three fields. "Grafana Query with Namespace" has the chosen value: AWS/DevOps-Guru. "Metric name" has the chosen value: Insights. "Statistic" has the chosen value: Average.

click apply

Time series graph contains a single new data point, indicting a recent event.

f. This is our first panel. We can change the panel name from the right-side bar under Title. We will name this panel “Insights

g. From the top right menu, click save dashboard and give your new dashboard a name

Using Amazon CloudWatch Logs via Amazon EventBridge

For other insights outside of the service metrics, such as a number of insights per specific service or the average for a region or for a specific AWS account, we will need to parse the event logs. These logs first need to be sent to Amazon CloudWatch Logs. We will go over the details on how to set this up and how we can parse these logs in Amazon Managed Grafana using CloudWatch Logs Query Syntax. In this post, we will show a couple of examples. For more details, please check out this User Guide documentation. This is not done by default and we will need to use Amazon EventBridge to pass these logs to CloudWatch.

DevOps Guru logs include other details that can be helpful when building Dashboards, such as region, Insight Severity (High, Medium, or Low), associated resources, and DevOps guru dashboard URL, among other things.  For more information, please check out this User Guide documentation.

EventBridge offers a serverless event bus that helps you receive, filter, transform, route, and deliver events. It provides one to many messaging solutions to support decoupled architectures, and it is easy to integrate with AWS Services and 3rd-party tools. Using Amazon EventBridge with DevOps Guru provides a solution that is easy to extend to create a ticketing system through integrations with ServiceNow, Jira, and other tools. It also makes it easy to set up alert systems through integrations with PagerDuty, Slack, and more.

 

Setting up Amazon CloudWatch Logs

  1. Let’s dive in to creating the EventBridge rule and enhance our Grafana dashboard:

a. First head to Amazon EventBridge in the AWS console.

b. Click Create rule.

     Type in rule Name and Description. You can leave the Event bus to default and Rule type to Rule with an event pattern.

c. Select AWS events or EventBridge partner events.

    For event Pattern change to Customer patterns (JSON editor) and use:

{"source": ["aws.devops-guru"]}

This filters for all events generated from DevOps Guru. You can use the same mechanism to filter out specific messages such as new insights, or insights closed to a different channel. For this demonstration, let’s consider extracting all events.

As the user configures their EventBridge Rule, for the Creation method they have chosen "Custom pattern (JSON editor) write an event pattern in JSON." For the Event pattern editor just below they have entered {"source":["aws.devops-guru"]}

d. Next, for Target, select AWS service.

    Then use CloudWatch log Group.

    For the Log Group, give your group a name, such as “devops-guru”.

In the prompt for the new Target's configurations, the user has chosen AWS service as the Target type. For the Select a target drop down, they chose CloudWatch log Group. For the log group, they selected the /aws/events radio option, and then filled in the following input text box with the kebab case group name devops-guru.

e. Click Create rule.

f. Navigate back to Amazon Managed Grafana.
It’s time to add a couple more additional Panels to our dashboard.  Click Add panel.
    Then Select Amazon CloudWatch, and change from metrics to CloudWatch Logs and select the Log Group we created previously.

In the Grafana Workspace, the user has "Data source" selected as Amazon CloudWatch us-east-1. Underneath that they have chosen to use the default region and CloudWatch Logs. Below that, for the Log Groups they have entered /aws/events/DevOpsGuru

g. For the query use the following to get the number of closed insights:

fields @detail.messageType
| filter detail.messageType="CLOSED_INSIGHT"
| count(detail.messageType)

You’ll see the new dashboard get updated with “Data is missing a time field”.

New panel suggestion with switch to table or open visualization suggestions

You can either open the suggestions and select a gauge that makes sense;

New Suggestions display a dial graph, a bar graph, and a count numerical tracker

Or choose from multiple visualization options.

Now we have 2 panels:

Two panels are shown, one is the new dial graph, and the other is the time series graph that was created earlier.

h. You can repeat the same process. To create 3rd panel for the new insights using this query:

fields @detail.messageType 
| filter detail.messageType="NEW_INSIGHT" 
| count(detail.messageType)

Now we have 3 panels:

Grafana now shows three 3 panels. Two dial graphs, and the time series graph.

Next, depending on the visualizations, you can work with the Logs and metrics data types to parse and filter the data.

Setting up a 4th panel as table. Under the Query tab, in the query editor, the user has entered the text: fields detail.messageType, detail.insightSeverity, detail.insightUrlfilter | filter detail.messageType="CLOSED_INSIGHT" or detail.messageType="NEW_INSIGHT"

i. For our fourth panel, we will add DevOps Guru dashboard direct link to the AWS Console.

Repeat the same process as demonstrated previously one more time with this query:

fields detail.messageType, detail.insightSeverity, detail.insightUrlfilter 
| filter detail.messageType="CLOSED_INSIGHT" or detail.messageType="NEW_INSIGHT"                       

                        Switch to table when prompted on the panel.

Grafana now shows 4 panels. The new panel displays a data table that contains information about the most recent DevOps Guru insights. There are also the two dial graphs, and the time series graph from before.

This will give us a direct link to the DevOps Guru dashboard and help us get to the insight details and Recommendations.

Grafana now shows 4 panels. The new panel displays a data table that contains information about the most recent DevOps Guru insights. There are also the two dial graphs, and the time series graph from before.

Save your dashboard.

  1. You can extend observability by sending notifications through alerts on dashboards of panels providing metrics. The alerts will be triggered when a condition is met. The Alerts are communicated with Amazon SNS notification mechanism. This is our SNS notification channel setup.

Screenshot: notification settings show Name: DevopsGuruAlertsFromGrafana and Type: SNS

A previously created notification is used next to communicate any alerts when the condition is met across the metrics being observed.

Screenshot: notification setting with condition when count of query is above 5, a notification is sent to DevopsGuruAlertsFromGrafana with message, "More than 5 insights in the past 1 hour"

Cleanup

To avoid incurring future charges, delete the resources.

  • Navigate to EventBridge in AWS console and delete the rule created in step 4 (a-e) “devops-guru”.
  • Navigate to CloudWatch logs in AWS console and delete the log group created as results of step 4 (a-e) named “devops-guru”.
  • Amazon Managed Grafana: Navigate to Amazon Managed Grafana service and delete the Grafana services you created in step 1.

Conclusion

In this post, we have demonstrated how to successfully incorporate Amazon DevOps Guru insights into Amazon Managed Grafana and use Grafana as the observability tool. This will allow Operations team to successfully observe the state of their AWS resources and notify them through Alarms on any preset thresholds on DevOps Guru metrics and logs. You can expand on this to create other panels and dashboards specific to your needs. If you don’t have DevOps Guru, you can start monitoring your AWS applications with AWS DevOps Guru today using this link.

[1] https://www.atlassian.com/incident-management/kpis/cost-of-downtime

About the authors:

MJ Kubba

MJ Kubba is a Solutions Architect who enjoys working with public sector customers to build solutions that meet their business needs. MJ has over 15 years of experience designing and implementing software solutions. He has a keen passion for DevOps and cultural transformation.

David Ernst

David is a Sr. Specialist Solution Architect – DevOps, with 20+ years of experience in designing and implementing software solutions for various industries. David is an automation enthusiast and works with AWS customers to design, deploy, and manage their AWS workloads/architectures.

Sofia Kendall

Sofia Kendall is a Solutions Architect who helps small and medium businesses achieve their goals as they utilize the cloud. Sofia has a background in Software Engineering and enjoys working to make systems reliable, efficient, and scalable.

Serverless ICYMI Q1 2023

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/serverless-icymi-q1-2023/

Welcome to the 21st edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share all the most recent product launches, feature enhancements, blog posts, webinars, live streams, and other interesting things that you might have missed!

ICYMI2023Q1

In case you missed our last ICYMI, check out what happened last quarter here.

Artificial intelligence (AI) technologies, ChatGPT, and DALL-E are creating significant interest in the industry at the moment. Find out how to integrate serverless services with ChatGPT and DALL-E to generate unique bedtime stories for children.

Example notification of a story hosted with Next.js and App Runner

Example notification of a story hosted with Next.js and App Runner

Serverless Land is a website maintained by the Serverless Developer Advocate team to help you build serverless applications and includes workshops, code examples, blogs, and videos. There is now enhanced search functionality so you can search across resources, patterns, and video content.

SLand-search

ServerlessLand search

AWS Lambda

AWS Lambda has improved how concurrency works with Amazon SQS. You can now control the maximum number of concurrent Lambda functions invoked.

The launch blog post explains the scaling behavior of Lambda using this architectural pattern, challenges this feature helps address, and a demo of maximum concurrency in action.

Maximum concurrency is set to 10 for the SQS queue.

Maximum concurrency is set to 10 for the SQS queue.

AWS Lambda Powertools is an open-source library to help you discover and incorporate serverless best practices more easily. Lambda Powertools for .NET is now generally available and currently focused on three observability features: distributed tracing (Tracer), structured logging (Logger), and asynchronous business and application metrics (Metrics). Powertools is also available for Python, Java, and Typescript/Node.js programming languages.

To learn more:

Lambda announced a new feature, runtime management controls, which provide more visibility and control over when Lambda applies runtime updates to your functions. The runtime controls are optional capabilities for advanced customers that require more control over their runtime changes. You can now specify a runtime management configuration for each function with three settings, Automatic (default), Function update, or manual.

There are three new Amazon CloudWatch metrics for asynchronous Lambda function invocations: AsyncEventsReceived, AsyncEventAge, and AsyncEventsDropped. You can track the asynchronous invocation requests sent to Lambda functions to monitor any delays in processing and take corrective actions if required. The launch blog post explains the new metrics and how to use them to troubleshoot issues.

Lambda now supports Amazon DocumentDB change streams as an event source. You can use Lambda functions to process new documents, track updates to existing documents, or log deleted documents. You can use any programming language that is supported by Lambda to write your functions.

There is a helpful blog post suggesting best practices for developing portable Lambda functions that allow you to port your code to containers if you later choose to.

AWS Step Functions

AWS Step Functions has expanded its AWS SDK integrations with support for 35 additional AWS services including Amazon EMR Serverless, AWS Clean Rooms, AWS IoT FleetWise, AWS IoT RoboRunner and 31 other AWS services. In addition, Step Functions also added support for 1000+ new API actions from new and existing AWS services such as Amazon DynamoDB and Amazon Athena. For the full list of added services, visit AWS SDK service integrations.

Amazon EventBridge

Amazon EventBridge has launched the AWS Controllers for Kubernetes (ACK) for EventBridge and Pipes . This allows you to manage EventBridge resources, such as event buses, rules, and pipes, using the Kubernetes API and resource model (custom resource definitions).

EventBridge event buses now also support enhanced integration with Service Quotas. Your quota increase requests for limits such as PutEvents transactions-per-second, number of rules, and invocations per second among others will be processed within one business day or faster, enabling you to respond quickly to changes in usage.

AWS SAM

The AWS Serverless Application Model (SAM) Command Line Interface (CLI) has added the sam list command. You can now show resources defined in your application, including the endpoints, methods, and stack outputs required to test your deployed application.

AWS SAM has a preview of sam build support for building and packaging serverless applications developed in Rust. You can use cargo-lambda in the AWS SAM CLI build workflow and AWS SAM Accelerate to iterate on your code changes rapidly in the cloud.

You can now use AWS SAM connectors as a source resource parameter. Previously, you could only define AWS SAM connectors as a AWS::Serverless::Connector resource. Now you can add the resource attribute on a connector’s source resource, which makes templates more readable and easier to update over time.

AWS SAM connectors now also support multiple destinations to simplify your permissions. You can now use a single connector between a single source resource and multiple destination resources.

In October 2022, AWS released OpenID Connect (OIDC) support for AWS SAM Pipelines. This improves your security posture by creating integrations that use short-lived credentials from your CI/CD provider. There is a new blog post on how to implement it.

Find out how best to build serverless Java applications with the AWS SAM CLI.

AWS App Runner

AWS App Runner now supports retrieving secrets and configuration data stored in AWS Secrets Manager and AWS Systems Manager (SSM) Parameter Store in an App Runner service as runtime environment variables.

AppRunner also now supports incoming requests based on HTTP 1.0 protocol, and has added service level concurrency, CPU and Memory utilization metrics.

Amazon S3

Amazon S3 now automatically applies default encryption to all new objects added to S3, at no additional cost and with no impact on performance.

You can now use an S3 Object Lambda Access Point alias as an origin for your Amazon CloudFront distribution to tailor or customize data to end users. For example, you can resize an image depending on the device that an end user is visiting from.

S3 has introduced Mountpoint for S3, a high performance open source file client that translates local file system API calls to S3 object API calls like GET and LIST.

S3 Multi-Region Access Points now support datasets that are replicated across multiple AWS accounts. They provide a single global endpoint for your multi-region applications, and dynamically route S3 requests based on policies that you define. This helps you to more easily implement multi-Region resilience, latency-based routing, and active-passive failover, even when data is stored in multiple accounts.

Amazon Kinesis

Amazon Kinesis Data Firehose now supports streaming data delivery to Elastic. This is an easier way to ingest streaming data to Elastic and consume the Elastic Stack (ELK Stack) solutions for enterprise search, observability, and security without having to manage applications or write code.

Amazon DynamoDB

Amazon DynamoDB now supports table deletion protection to protect your tables from accidental deletion when performing regular table management operations. You can set the deletion protection property for each table, which is set to disabled by default.

Amazon SNS

Amazon SNS now supports AWS X-Ray active tracing to visualize, analyze, and debug application performance. You can now view traces that flow through Amazon SNS topics to destination services, such as Amazon Simple Queue Service, Lambda, and Kinesis Data Firehose, in addition to traversing the application topology in Amazon CloudWatch ServiceLens.

SNS also now supports setting content-type request headers for HTTPS notifications so applications can receive their notifications in a more predictable format. Topic subscribers can create a DeliveryPolicy that specifies the content-type value that SNS assigns to their HTTPS notifications, such as application/json, application/xml, or text/plain.

EDA Visuals collection added to Serverless Land

The Serverless Developer Advocate team has extended Serverless Land and introduced EDA visuals. These are small bite sized visuals to help you understand concept and patterns about event-driven architectures. Find out about batch processing vs. event streaming, commands vs. events, message queues vs. event brokers, and point-to-point messaging. Discover bounded contexts, migrations, idempotency, claims, enrichment and more!

EDA-visuals

EDA Visuals

To learn more:

Serverless Repos Collection on Serverless Land

There is also a new section on Serverless Land containing helpful code repositories. You can search for code repos to use for examples, learning or building serverless applications. You can also filter by use-case, runtime, and level.

Serverless Repos Collection

Serverless Repos Collection

Serverless Blog Posts

January

Jan 12 – Introducing maximum concurrency of AWS Lambda functions when using Amazon SQS as an event source

Jan 20 – Processing geospatial IoT data with AWS IoT Core and the Amazon Location Service

Jan 23 – AWS Lambda: Resilience under-the-hood

Jan 24 – Introducing AWS Lambda runtime management controls

Jan 24 – Best practices for working with the Apache Velocity Template Language in Amazon API Gateway

February

Feb 6 – Previewing environments using containerized AWS Lambda functions

Feb 7 – Building ad-hoc consumers for event-driven architectures

Feb 9 – Implementing architectural patterns with Amazon EventBridge Pipes

Feb 9 – Securing CI/CD pipelines with AWS SAM Pipelines and OIDC

Feb 9 – Introducing new asynchronous invocation metrics for AWS Lambda

Feb 14 – Migrating to token-based authentication for iOS applications with Amazon SNS

Feb 15 – Implementing reactive progress tracking for AWS Step Functions

Feb 23 – Developing portable AWS Lambda functions

Feb 23 – Uploading large objects to Amazon S3 using multipart upload and transfer acceleration

Feb 28 – Introducing AWS Lambda Powertools for .NET

March

Mar 9 – Server-side rendering micro-frontends – UI composer and service discovery

Mar 9 – Building serverless Java applications with the AWS SAM CLI

Mar 10 – Managing sessions of anonymous users in WebSocket API-based applications

Mar 14 –
Implementing an event-driven serverless story generation application with ChatGPT and DALL-E

Videos

Serverless Office Hours – Tues 10AM PT

Weekly office hours live stream. In each session we talk about a specific topic or technology related to serverless and open it up to helping you with your real serverless challenges and issues. Ask us anything you want about serverless technologies and applications.

January

Jan 10 – Building .NET 7 high performance Lambda functions

Jan 17 – Amazon Managed Workflows for Apache Airflow at Scale

Jan 24 – Using Terraform with AWS SAM

Jan 31 – Preparing your serverless architectures for the big day

February

Feb 07- Visually design and build serverless applications

Feb 14 – Multi-tenant serverless SaaS

Feb 21 – Refactoring to Serverless

Feb 28 – EDA visually explained

March

Mar 07 – Lambda cookbook with Python

Mar 14 – Succeeding with serverless

Mar 21 – Lambda Powertools .NET

Mar 28 – Server-side rendering micro-frontends

FooBar Serverless YouTube channel

Marcia Villalba frequently publishes new videos on her popular serverless YouTube channel. You can view all of Marcia’s videos at https://www.youtube.com/c/FooBar_codes.

January

Jan 12 – Serverless Badge – A new certification to validate your Serverless Knowledge

Jan 19 – Step functions Distributed map – Run 10k parallel serverless executions!

Jan 26 – Step Functions Intrinsic Functions – Do simple data processing directly from the state machines!

February

Feb 02 – Unlock the Power of EventBridge Pipes: Integrate Across Platforms with Ease!

Feb 09 – Amazon EventBridge Pipes: Enrichment and filter of events Demo with AWS SAM

Feb 16 – AWS App Runner – Deploy your apps from GitHub to Cloud in Record Time

Feb 23 – AWS App Runner – Demo hosting a Node.js app in the cloud directly from GitHub (AWS CDK)

March

Mar 02 – What is Amazon DynamoDB? What are the most important concepts? What are the indexes?

Mar 09 – Choreography vs Orchestration: Which is Best for Your Distributed Application?

Mar 16 – DynamoDB Single Table Design: Simplify Your Code and Boost Performance with Table Design Strategies

Mar 23 – 8 Reasons You Should Choose DynamoDB for Your Next Project and How to Get Started

Sessions with SAM & Friends

SAMFiends

AWS SAM & Friends

Eric Johnson is exploring how developers are building serverless applications. We spend time talking about AWS SAM as well as others like AWS CDK, Terraform, Wing, and AMPT.

Feb 16 – What’s new with AWS SAM

Feb 23 – AWS SAM with AWS CDK

Mar 02 – AWS SAM and Terraform

Mar 10 – Live from ServerlessDays ANZ

Mar 16 – All about AMPT

Mar 23 – All about Wing

Mar 30 – SAM Accelerate deep dive

Still looking for more?

The Serverless landing page has more information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials.

You can also follow the Serverless Developer Advocacy team on Twitter to see the latest news, follow conversations, and interact with the team.

Build event-driven data pipelines using AWS Controllers for Kubernetes and Amazon EMR on EKS

Post Syndicated from Victor Gu original https://aws.amazon.com/blogs/big-data/build-event-driven-data-pipelines-using-aws-controllers-for-kubernetes-and-amazon-emr-on-eks/

An event-driven architecture is a software design pattern in which decoupled applications can asynchronously publish and subscribe to events via an event broker. By promoting loose coupling between components of a system, an event-driven architecture leads to greater agility and can enable components in the system to scale independently and fail without impacting other services. AWS has many services to build solutions with an event-driven architecture, such as Amazon EventBridge, Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), and AWS Lambda.

Amazon Elastic Kubernetes Service (Amazon EKS) is becoming a popular choice among AWS customers to host long-running analytics and AI or machine learning (ML) workloads. By containerizing your data processing tasks, you can simply deploy them into Amazon EKS as Kubernetes jobs and use Kubernetes to manage underlying computing compute resources. For big data processing, which requires distributed computing, you can use Spark on Amazon EKS. Amazon EMR on EKS, a managed Spark framework on Amazon EKS, enables you to run Spark jobs with benefits of scalability, portability, extensibility, and speed. With EMR on EKS, the Spark jobs run using the Amazon EMR runtime for Apache Spark, which increases the performance of your Spark jobs so that they run faster and cost less than open-source Apache Spark.

Data processes require a workflow management to schedule jobs and manage dependencies between jobs, and require monitoring to ensure that the transformed data is always accurate and up to date. One popular orchestration tool for managing workflows is Apache Airflow, which can be installed in Amazon EKS. Alternatively, you can use the AWS-managed version, Amazon Managed Workflows for Apache Airflow (Amazon MWAA). Another option is to use AWS Step Functions, which is a serverless workflow service that integrates with EMR on EKS and EventBridge to build event-driven workflows.

In this post, we demonstrate how to build an event-driven data pipeline using AWS Controllers for Kubernetes (ACK) and EMR on EKS. We use ACK to provision and configure serverless AWS resources, such as EventBridge and Step Functions. Triggered by an EventBridge rule, Step Functions orchestrates jobs running in EMR on EKS. With ACK, you can use the Kubernetes API and configuration language to create and configure AWS resources the same way you create and configure a Kubernetes data processing job. Because most of the managed services are serverless, you can build and manage your entire data pipeline using the Kubernetes API with tools such as kubectl.

Solution overview

ACK lets you define and use AWS service resources directly from Kubernetes, using the Kubernetes Resource Model (KRM). The ACK project contains a series of service controllers, one for each AWS service API. With ACK, developers can stay in their familiar Kubernetes environment and take advantage of AWS services for their application-supporting infrastructure. In the post Microservices development using AWS controllers for Kubernetes (ACK) and Amazon EKS blueprints, we show how to use ACK for microservices development.

In this post, we show how to build an event-driven data pipeline using ACK controllers for EMR on EKS, Step Functions, EventBridge, and Amazon Simple Storage Service (Amazon S3). We provision an EKS cluster with ACK controllers using Terraform modules. We create the data pipeline with the following steps:

  1. Create the emr-data-team-a namespace and bind it with the virtual cluster my-ack-vc in Amazon EMR by using the ACK controller.
  2. Use the ACK controller for Amazon S3 to create an S3 bucket. Upload the sample Spark scripts and sample data to the S3 bucket.
  3. Use the ACK controller for Step Functions to create a Step Functions state machine as an EventBridge rule target based on Kubernetes resources defined in YAML manifests.
  4. Use the ACK controller for EventBridge to create an EventBridge rule for pattern matching and target routing.

The pipeline is triggered when a new script is uploaded. An S3 upload notification is sent to EventBridge and, if it matches the specified rule pattern, triggers the Step Functions state machine. Step Functions calls the EMR virtual cluster to run the Spark job, and all the Spark executors and driver are provisioned inside the emr-data-team-a namespace. The output is saved back to the S3 bucket, and the developer can check the result on the Amazon EMR console.

The following diagram illustrates this architecture.

Prerequisites

Ensure that you have the following tools installed locally:

Deploy the solution infrastructure

Because each ACK service controller requires different AWS Identity and Access Management (IAM) roles for managing AWS resources, it’s better to use an automation tool to install the required service controllers. For this post, we use Amazon EKS Blueprints for Terraform and the AWS EKS ACK Addons Terraform module to provision the following components:

  • A new VPC with three private subnets and three public subnets
  • An internet gateway for the public subnets and a NAT Gateway for the private subnets
  • An EKS cluster control plane with one managed node group
  • Amazon EKS-managed add-ons: VPC_CNI, CoreDNS, and Kube_Proxy
  • ACK controllers for EMR on EKS, Step Functions, EventBridge, and Amazon S3
  • IAM execution roles for EMR on EKS, Step Functions, and EventBridge

Let’s start by cloning the GitHub repo to your local desktop. The module eks_ack_addons in addon.tf is for installing ACK controllers. ACK controllers are installed by using helm charts in the Amazon ECR public galley. See the following code:

cd examples/usecases/event-driven-pipeline
terraform init
terraform plan
terraform apply -auto-approve #defaults to us-west-2

The following screenshot shows an example of our output. emr_on_eks_role_arn is the ARN of the IAM role created for Amazon EMR running Spark jobs in the emr-data-team-a namespace in Amazon EKS. stepfunction_role_arn is the ARN of the IAM execution role for the Step Functions state machine. eventbridge_role_arn is the ARN of the IAM execution role for the EventBridge rule.

The following command updates kubeconfig on your local machine and allows you to interact with your EKS cluster using kubectl to validate the deployment:

region=us-west-2
aws eks --region $region update-kubeconfig --name event-driven-pipeline-demo

Test your access to the EKS cluster by listing the nodes:

kubectl get nodes
# Output should look like below
NAME                                        STATUS   ROLES    AGE     VERSION
ip-10-1-10-64.us-west-2.compute.internal    Ready    <none>   19h     v1.24.9-eks-49d8fe8
ip-10-1-10-65.us-west-2.compute.internal    Ready    <none>   19h     v1.24.9-eks-49d8fe8
ip-10-1-10-7.us-west-2.compute.internal     Ready    <none>   19h     v1.24.9-eks-49d8fe8
ip-10-1-10-73.us-west-2.compute.internal    Ready    <none>   19h     v1.24.9-eks-49d8fe8
ip-10-1-11-96.us-west-2.compute.internal    Ready    <none>   19h     v1.24.9-eks-49d8fe8
ip-10-1-12-197.us-west-2.compute.internal   Ready    <none>   19h     v1.24.9-eks-49d8fe8

Now we’re ready to set up the event-driven pipeline.

Create an EMR virtual cluster

Let’s start by creating a virtual cluster in Amazon EMR and link it with a Kubernetes namespace in EKS. By doing that, the virtual cluster will use the linked namespace in Amazon EKS for running Spark workloads. We use the file emr-virtualcluster.yaml. See the following code:

apiVersion: emrcontainers.services.k8s.aws/v1alpha1
kind: VirtualCluster
metadata:
  name: my-ack-vc
spec:
  name: my-ack-vc
  containerProvider:
    id: event-driven-pipeline-demo  # your eks cluster name
    type_: EKS
    info:
      eksInfo:
        namespace: emr-data-team-a # namespace binding with EMR virtual cluster

Let’s apply the manifest by using the following kubectl command:

kubectl apply -f ack-yamls/emr-virtualcluster.yaml

You can navigate to the Virtual clusters page on the Amazon EMR console to see the cluster record.

Create an S3 bucket and upload data

Next, let’s create a S3 bucket for storing Spark pod templates and sample data. We use the s3.yaml file. See the following code:

apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: sparkjob-demo-bucket
spec:
  name: sparkjob-demo-bucket

kubectl apply -f ack-yamls/s3.yaml

If you don’t see the bucket, you can check the log from the ACK S3 controller pod for details. The error is mostly caused if a bucket with the same name already exists. You need to change the bucket name in s3.yaml as well as in eventbridge.yaml and sfn.yaml. You also need to update upload-inputdata.sh and upload-spark-scripts.sh with the new bucket name.

Run the following command to upload the input data and pod templates:

bash spark-scripts-data/upload-inputdata.sh

The sparkjob-demo-bucket S3 bucket is created with two folders: input and scripts.

Create a Step Functions state machine

The next step is to create a Step Functions state machine that calls the EMR virtual cluster to run a Spark job, which is a sample Python script to process the New York City Taxi Records dataset. You need to define the Spark script location and pod templates for the Spark driver and executor in the StateMachine object .yaml file. Let’s make the following changes (highlighted) in sfn.yaml first:

  • Replace the value for roleARN with stepfunctions_role_arn
  • Replace the value for ExecutionRoleArn with emr_on_eks_role_arn
  • Replace the value for VirtualClusterId with your virtual cluster ID
  • Optionally, replace sparkjob-demo-bucket with your bucket name

See the following code:

apiVersion: sfn.services.k8s.aws/v1alpha1
kind: StateMachine
metadata:
  name: run-spark-job-ack
spec:
  name: run-spark-job-ack
  roleARN: "arn:aws:iam::xxxxxxxxxxx:role/event-driven-pipeline-demo-sfn-execution-role"   # replace with your stepfunctions_role_arn
  tags:
  - key: owner
    value: sfn-ack
  definition: |
      {
      "Comment": "A description of my state machine",
      "StartAt": "input-output-s3",
      "States": {
        "input-output-s3": {
          "Type": "Task",
          "Resource": "arn:aws:states:::emr-containers:startJobRun.sync",
          "Parameters": {
            "VirtualClusterId": "f0u3vt3y4q2r1ot11m7v809y6",  
            "ExecutionRoleArn": "arn:aws:iam::xxxxxxxxxxx:role/event-driven-pipeline-demo-emr-eks-data-team-a",
            "ReleaseLabel": "emr-6.7.0-latest",
            "JobDriver": {
              "SparkSubmitJobDriver": {
                "EntryPoint": "s3://sparkjob-demo-bucket/scripts/pyspark-taxi-trip.py",
                "EntryPointArguments": [
                  "s3://sparkjob-demo-bucket/input/",
                  "s3://sparkjob-demo-bucket/output/"
                ],
                "SparkSubmitParameters": "--conf spark.executor.instances=10"
              }
            },
            "ConfigurationOverrides": {
              "ApplicationConfiguration": [
                {
                 "Classification": "spark-defaults",
                "Properties": {
                  "spark.driver.cores":"1",
                  "spark.executor.cores":"1",
                  "spark.driver.memory": "10g",
                  "spark.executor.memory": "10g",
                  "spark.kubernetes.driver.podTemplateFile":"s3://sparkjob-demo-bucket/scripts/driver-pod-template.yaml",
                  "spark.kubernetes.executor.podTemplateFile":"s3://sparkjob-demo-bucket/scripts/executor-pod-template.yaml",
                  "spark.local.dir" : "/data1,/data2"
                }
              }
              ]
            }...

You can get your virtual cluster ID from the Amazon EMR console or with the following command:

kubectl get virtualcluster -o jsonpath={.items..status.id}
# result:
f0u3vt3y4q2r1ot11m7v809y6  # VirtualClusterId

Then apply the manifest to create the Step Functions state machine:

kubectl apply -f ack-yamls/sfn.yaml

Create an EventBridge rule

The last step is to create an EventBridge rule, which is used as an event broker to receive event notifications from Amazon S3. Whenever a new file, such as a new Spark script, is created in the S3 bucket, the EventBridge rule will evaluate (filter) the event and invoke the Step Functions state machine if it matches the specified rule pattern, triggering the configured Spark job.

Let’s use the following command to get the ARN of the Step Functions state machine we created earlier:

kubectl get StateMachine -o jsonpath={.items..status.ackResourceMetadata.arn}
# result
arn: arn:aws:states:us-west-2:xxxxxxxxxx:stateMachine:run-spark-job-ack # sfn_arn

Then, update eventbridge.yaml with the following values:

  • Under targets, replace the value for roleARN with eventbridge_role_arn

Under targets, replace arn with your sfn_arn

  • Optionally, in eventPattern, replace sparkjob-demo-bucket with your bucket name

See the following code:

apiVersion: eventbridge.services.k8s.aws/v1alpha1
kind: Rule
metadata:
  name: eb-rule-ack
spec:
  name: eb-rule-ack
  description: "ACK EventBridge Filter Rule to sfn using event bus reference"
  eventPattern: | 
    {
      "source": ["aws.s3"],
      "detail-type": ["Object Created"],
      "detail": {
        "bucket": {
          "name": ["sparkjob-demo-bucket"]    
        },
        "object": {
          "key": [{
            "prefix": "scripts/"
          }]
        }
      }
    }
  targets:
    - arn: arn:aws:states:us-west-2:xxxxxxxxxx:stateMachine:run-spark-job-ack # replace with your sfn arn
      id: sfn-run-spark-job-target
      roleARN: arn:aws:iam::xxxxxxxxx:role/event-driven-pipeline-demo-eb-execution-role # replace your eventbridge_role_arn
      retryPolicy:
        maximumRetryAttempts: 0 # no retries
  tags:
    - key:owner
      value: eb-ack

By applying the EventBridge configuration file, an EventBridge rule is created to monitor the folder scripts in the S3 bucket sparkjob-demo-bucket:

kubectl apply -f ack-yamls/eventbridge.yaml

For simplicity, the dead-letter queue is not set and maximum retry attempts is set to 0. For production usage, set them based on your requirements. For more information, refer to Event retry policy and using dead-letter queues.

Test the data pipeline

To test the data pipeline, we trigger it by uploading a Spark script to the S3 bucket scripts folder using the following command:

bash spark-scripts-data/upload-spark-scripts.sh

The upload event triggers the EventBridge rule and then calls the Step Functions state machine. You can go to the State machines page on the Step Functions console and choose the job run-spark-job-ack to monitor its status.

For the Spark job details, on the Amazon EMR console, choose Virtual clusters in the navigation pane, and then choose my-ack-vc. You can review all the job run history for this virtual cluster. If you choose Spark UI in any row, you’re redirected the Spark history server for more Spark driver and executor logs.

Clean up

To clean up the resources created in the post, use the following code:

aws s3 rm s3://sparkjob-demo-bucket --recursive # clean up data in S3
kubectl delete -f ack-yamls/. #Delete aws resources created by ACK
terraform destroy -target="module.eks_blueprints_kubernetes_addons" -target="module.eks_ack_addons" -auto-approve -var region=$region
terraform destroy -target="module.eks_blueprints" -auto-approve -var region=$region
terraform destroy -auto-approve -var region=$regionterraform destroy -auto-approve -var region=$region

Conclusion

This post showed how to build an event-driven data pipeline purely with native Kubernetes API and tooling. The pipeline uses EMR on EKS as compute and uses serverless AWS resources Amazon S3, EventBridge, and Step Functions as storage and orchestration in an event-driven architecture. With EventBridge, AWS and custom events can be ingested, filtered, transformed, and reliably delivered (routed) to more than 20 AWS services and public APIs (webhooks), using human-readable configuration instead of writing undifferentiated code. EventBridge helps you decouple applications and achieve more efficient organizations using event-driven architectures, and has quickly become the event bus of choice for AWS customers for many use cases, such as auditing and monitoring, application integration, and IT automation.

By using ACK controllers to create and configure different AWS services, developers can perform all data plane operations without leaving the Kubernetes platform. Also, developers only need to maintain the EKS cluster because all the other components are serverless.

As a next step, clone the GitHub repository to your local machine and test the data pipeline in your own AWS account. You can modify the code in this post and customize it for your own needs by using different EventBridge rules or adding more steps in Step Functions.


About the authors

Victor Gu is a Containers and Serverless Architect at AWS. He works with AWS customers to design microservices and cloud native solutions using Amazon EKS/ECS and AWS serverless services. His specialties are Kubernetes, Spark on Kubernetes, MLOps and DevOps.

Michael Gasch is a Senior Product Manager for AWS EventBridge, driving innovations in event-driven architectures. Prior to AWS, Michael was a Staff Engineer at the VMware Office of the CTO, working on open-source projects, such as Kubernetes and Knative, and related distributed systems research.

Peter Dalbhanjan is a Solutions Architect for AWS based in Herndon, VA. Peter has a keen interest in evangelizing AWS solutions and has written multiple blog posts that focus on simplifying complex use cases. At AWS, Peter helps with designing and architecting variety of customer workloads.

How to use Amazon GuardDuty and AWS WAF v2 to automatically block suspicious hosts

Post Syndicated from Eucke Warren original https://aws.amazon.com/blogs/security/how-to-use-amazon-guardduty-and-aws-waf-v2-to-automatically-block-suspicious-hosts/

In this post, we’ll share an automation pattern that you can use to automatically detect and block suspicious hosts that are attempting to access your Amazon Web Services (AWS) resources. The automation will rely on Amazon GuardDuty to generate findings about the suspicious hosts, and then you can respond to those findings by programmatically updating AWS WAF to block the host from accessing your workloads.

You should implement security measures across your AWS resources by using a holistic approach that incorporates controls across multiple areas. In the AWS CAF Security Perspective section of the AWS Security Incident Response Guide, we define these controls across four categories:

  • Directive controls — Establish the governance, risk, and compliance models the environment will operate within
  • Preventive controls — Protect your workloads and mitigate threats and vulnerabilities
  • Detective controls — Provide full visibility and transparency over the operation of your deployments in AWS
  • Responsive controls — Drive remediation of potential deviations from your security baselines

Security automation is a key principle outlined in the Response Guide. It helps reduce operational overhead and creates repeatable, predictable approaches to monitoring and responding to events. AWS services provide the building blocks to create powerful patterns for the automated detection and remediation of threats against your AWS environments. You can configure automated flows that use both detective and responsive controls and might also feed into preventative controls to help mitigate risks in the future. Depending on the type of source event, you can automatically invoke specific actions, such as modifying access controls, terminating instances, or revoking credentials.

The patterns highlighted in this post provide an example of how to automatically remediate detected threats. You should modify these patterns to suit your defined requirements, and test and validate them before deploying them in a production environment.

AWS services used for the example pattern

Amazon GuardDuty is a continuous security monitoring and threat detection service that incorporates threat intelligence, anomaly detection, and machine learning to help protect your AWS resources, including your AWS accounts. Amazon EventBridge delivers a near-real-time stream of system events that describe changes in AWS resources. Amazon GuardDuty sends events to Amazon CloudWatch when a change in the findings takes place. In the context of GuardDuty, such changes include newly generated findings and subsequent occurrences of these findings. You can quickly set up rules to match events generated by GuardDuty findings in EventBridge events and route those events to one or more target actions. The pattern in this post routes matched events to AWS Lambda, which then updates AWS WAF web access control lists (web ACLs) and Amazon Virtual Private Cloud (Amazon VPC) network access control lists (network ACLs). AWS WAF is a web application firewall that helps protect your web applications from common web exploits that could affect application availability, security, or excess resource consumption. It supports both managed rules as well as a powerful rule language for custom rules. A network ACL is stateless and is an optional layer of security for your VPC that helps you restrict specific inbound and outbound traffic at the subnet level.

Pattern overview

This example pattern assumes that Amazon GuardDuty is enabled in your AWS account. If it isn’t enabled, you can learn more about the free trial and pricing, and follow the steps in the GuardDuty documentation to configure the service and start monitoring your account. The example code will only work in the us-east-1 AWS Region due to the use of Amazon CloudFront and web ACLs within the template.

Figure 1 shows how the AWS CloudFormation template creates the sample pattern.

Figure 1: How the CloudFormation template works

Figure 1: How the CloudFormation template works

Here’s how the pattern works, as shown in the diagram:

  1. A GuardDuty finding is generated due to suspected malicious activity.
  2. An EventBridge event is configured to filter for GuardDuty finding types by using event patterns.
  3. A Lambda function is invoked by the EventBridge event and parses the GuardDuty finding.
  4. The Lambda function checks the Amazon DynamoDB state table for an existing entry that matches the identified host. If state data is not found in the table for the identified host, a new entry is created in the Amazon DynamoDB state table.
  5. The Lambda function creates a web ACL rule inside AWS WAF and updates a subnet network ACL.
  6. A notification email is sent through Amazon Simple Notification Service (SNS).

A second Lambda function runs on a 5-minute recurring schedule and removes entries that are past the configurable retention period from AWS WAF IPSets (an IPSet is a list that contains the blocklisted IPs or CIDRs), VPC network ACLs, and the DynamoDB table.

GuardDuty prefix patterns and findings

The EventBridge event rule provided by the example automation uses the following seven prefix patterns, which allow coverage for 36 GuardDuty finding types. These specific finding types are of a network nature, and so we can use AWS WAF to block them. Be sure to read through the full list of finding types in the GuardDuty documentation to better understand what GuardDuty can report findings for. The covered findings are as follows:

  1. UnauthorizedAccess:EC2
    • UnauthorizedAccess:EC2/MaliciousIPCaller.Custom
    • UnauthorizedAccess:EC2/MetadataDNSRebind
    • UnauthorizedAccess:EC2/RDPBruteForce
    • UnauthorizedAccess:EC2/SSHBruteForce
    • UnauthorizedAccess:EC2/TorClient
    • UnauthorizedAccess:EC2/TorRelay
  2. Recon:EC2
    • Recon:EC2/PortProbeEMRUnprotectedPort
    • Recon:EC2/PortProbeUnprotectedPort
    • Recon:EC2/Portscan
  3. Trojan:EC2
    • Trojan:EC2/BlackholeTraffic
    • Trojan:EC2/BlackholeTraffic!DNS
    • Trojan:EC2/DGADomainRequest.B
    • Trojan:EC2/DGADomainRequest.C!DNS
    • Trojan:EC2/DNSDataExfiltration
    • Trojan:EC2/DriveBySourceTraffic!DNS
    • Trojan:EC2/DropPoint
    • Trojan:EC2/DropPoint!DNS
    • Trojan:EC2/PhishingDomainRequest!DNS
  4. Backdoor:EC2
    • Backdoor:EC2/C&CActivity.B
    • Backdoor:EC2/C&CActivity.B!DNS
    • Backdoor:EC2/DenialOfService.Dns
    • Backdoor:EC2/DenialOfService.Tcp
    • Backdoor:EC2/DenialOfService.Udp
    • Backdoor:EC2/DenialOfService.UdpOnTcpPorts
    • Backdoor:EC2/DenialOfService.UnusualProtocol
    • Backdoor:EC2/Spambot
  5. Impact:EC2
    • Impact:EC2/AbusedDomainRequest.Reputation
    • Impact:EC2/BitcoinDomainRequest.Reputation
    • Impact:EC2/MaliciousDomainRequest.Reputation
    • Impact:EC2/PortSweep
    • Impact:EC2/SuspiciousDomainRequest.Reputation
    • Impact:EC2/WinRMBruteForce
  6. CryptoCurrency:EC2
    • CryptoCurrency:EC2/BitcoinTool.B
    • CryptoCurrency:EC2/BitcoinTool.B!DNS
  7. Behavior:EC2
    • Behavior:EC2/NetworkPortUnusual
    • Behavior:EC2/TrafficVolumeUnusual

When activity occurs that generates one of these GuardDuty finding types and is then matched by the EventBridge event rule, an entry is created in the target web ACLs and subnet network ACLs to deny access from the suspicious host, and then a notification is sent to an email address by this pattern’s Lambda function. Blocking traffic from the suspicious host helps to mitigate potential threats while you perform additional investigation and remediation. For more information, see Remediating a compromised EC2 instance.

Solution deployment

To deploy the solution, you’ll do the following steps. Each step is described in more detail in the sections that follow.

  1. Download the required files.
  2. Create your Amazon Simple Storage Service (Amazon S3) bucket and upload the .zip files.
  3. Deploy the CloudFormation template.
  4. Create and test the Lambda function for a GuardDuty finding event.
  5. Confirm the entry for the test event in the VPC network ACL.
  6. Confirm the entry in the AWS WAF IP sets.
  7. Confirm the SNS notification email alert.
  8. Apply the AWS WAF web ACLs to resources.

Step 1: Download the required files

Download the following four files from the amazon-guardduty-waf-acl GitHub code repository:

  1. CloudFormation template – Copy and save the linked raw text, using the file name guarddutytoacl.template on your local file system.
  2. JSON event test file – Copy and save the linked raw text, using the file name gd2acl_test_event.json on your local file system.
  3. guardduty_to_acl_lambda_wafv2.zip – Choose the Download button on the GitHub page and save the .zip file to your local file system.
  4. prune_old_entries_wafv2.zip – Choose the Download button on the GitHub page and save the .zip file to your local file system.

Step 2: Create your S3 bucket and upload .zip files

For this step, create an S3 bucket with public access blocked, and then upload the Lambda .zip files to the newly created S3 bucket.

To create your S3 bucket and upload .zip files

  1. Create an S3 bucket in the us-east-1 Region.
  2. Upload the .zip files guardduty_to_acl_lambda_wafv2.zip and prune_old_entries_wafv2.zip that you saved to your local file system in Step 1 to the newly created S3 bucket.

Step 3: Deploy the CloudFormation template

For this step, deploy the CloudFormation template only to the us-east-1 Region within the AWS account where GuardDuty findings are to be monitored.

To deploy the CloudFormation template

  1. Sign in to the AWS Management Console, choose the CloudFormation service, and set N.Virginia (us-east-1) as the Region.
  2. Choose Create stack, and then choose With new resources (standard).
  3. When the Create stack landing page is presented, make sure that Template is ready is selected in the Prepare template section. In the Template source section, choose Upload a template file.
  4. Choose the Choose file button and browse to the location where the guarddutytoacl.template file was saved on your local file system. Select the file, choose Open, and then choose Next.
  5. On the Specify stack details page, provide the following input parameters. You can modify the default values to customize the pattern for your environment.

    Input parameter Input parameter description
    Notification email The email address to receive notifications. Must be a valid email address.
    Retention time, in minutes How long to retain IP addresses in the blocklist (in minutes). The default is 12 hours.
    S3 bucket for artifacts The S3 bucket with artifact files (Lambda functions, templates, HTML files, and so on). Keep the default value for deployment into the N. Virginia Region.
    S3 path to artifacts The path in the S3 bucket that contains artifact files. Keep the default value for deployment into the N. Virginia Region.
    CloudFrontWebACL Create CloudFront Web ACL? If set to true, a CloudFront IP set will be created automatically.
    RegionalWebACL Create Regional Web ACL? If set to true, a Regional IP set will be created automatically.

    Figure 2 shows an example of the values entered on this page.

    Figure 2: CloudFormation parameters on the Specify stack details page

    Figure 2: CloudFormation parameters on the Specify stack details page

  6. Enter values for all of the input parameters, and then choose Next.
  7. On the Configure stack options page, accept the defaults, and then choose Next.
  8. On the Review page, confirm the details, check the box acknowledging that the template will require capabilities for AWS::IAM::Role, and then choose Create Stack.

    The stack normally requires no more than 3–5 minutes to complete.

  9. While the stack is being created, check the email inbox that you specified for the Notification email address parameter. Look for an email message with the subject “AWS Notification – Subscription Confirmation”. Choose the link in the email to confirm the subscription to the SNS topic. You should see a message similar to the following.
    Figure 3: Subscription confirmation

    Figure 3: Subscription confirmation

When the Status field for the CloudFormation stack changes to CREATE_COMPLETE, as shown in Figure 4, the pattern is implemented and is ready for testing.

Figure 4: The stack status is CREATE_COMPLETE

Figure 4: The stack status is CREATE_COMPLETE

Step 4: Create and test the Lambda function for a GuardDuty finding event

After the CloudFormation stack has completed deployment, you can test the functionality by using a Lambda test event.

To create and run a Lambda GuardDuty finding test event

  1. In the AWS Management Console, choose Services > VPC > Subnets and locate a subnet that is suitable for testing the pattern.
  2. On the Details tab, copy the subnet ID to the clipboard or to a text editor.
    Figure 5: The subnet ID value on the Details tab

    Figure 5: The subnet ID value on the Details tab

  3. In the AWS Management Console, choose Services > CloudFormation > GuardDutytoACL stack. On the Outputs tab for the stack, look for the GuardDutytoACLLambda entry.
    Figure 6: The GuardDutytoACLLambda entry on the Outputs tab

    Figure 6: The GuardDutytoACLLambda entry on the Outputs tab

  4. Choose the link for the entry, and you’ll be redirected to the Lambda console, with the Lambda Code source page already open.
    Figure 7: The Lambda function open in the Lambda console

    Figure 7: The Lambda function open in the Lambda console

  5. In the middle of the Code source menu, in the Test dropdown list, locate and select the Configure test event option.
    Figure 8: Select Configure test event from the dropdown list

    Figure 8: Select Configure test event from the dropdown list

  6. To facilitate testing, we’ve provided a test event file. On the Configure test event page, do the following:
    1. For Event name, enter a name.
    2. In the body of the Event JSON field, paste the provided test event JSON, overwriting the existing contents.
    3. Update the value of SubnetId key (line 35) to the value of the subnet ID that you chose in Step 1 of this procedure.
    4. Choose Save.
    Figure 9: Update the value of the subnetId key

    Figure 9: Update the value of the subnetId key

  7. Choose Test to invoke the Lambda function with the test event. You should see the message “Status: succeeded” at the top of the execution results, similar to what is shown in Figure 10.
    Figure 10: The Test button and the “succeeded” message

    Figure 10: The Test button and the “succeeded” message

Step 5: Confirm the entry in the VPC network ACL

In this step, you’ll confirm that the DENY entry was created in the network ACL. This pattern is configured to create up to 10 entries in an ACL, ranging between rule numbers 71 and 80. Because network ACL rules are processed in order, it’s important that the DENY rule is placed before the ALLOW rule.

To confirm the entry in the VPC network ACL

  1. In the AWS Management Console, choose Services > VPC > Subnets, and locate the subnet you provided for the test event.
  2. Choose the network ACL link and confirm that the new DENY entry was generated from the test event.
    Figure 11: Check the entry from the test event on the Network tab

    Figure 11: Check the entry from the test event on the Network tab

    Note that VPC network ACL entries are created in the rule number range between 71 and 80. Older entries are aged out to create a “sliding window” of blocked hosts.

Step 6: Confirm the entry in the AWS WAF IP sets and blocklists

Next, verify that the entry was added to the CloudFront AWS WAF IP set and to the Application Load Balancer (ALB) AWS WAF IP set.

To confirm the entry in the AWS WAF IP set and blocklist

  1. In the AWS Management Console, choose Services > WAF & Shield > Web ACLs, and then set the selected Region to Global (CloudFront).
  2. Find and select the web ACL name that starts with CloudFrontBlockListWeb. In the Rule view, on the Rules tab, select the rule named CloudFrontBlocklistIPSetRule. Note that 198.51.100.0/32 appears as an entry in the rule.
    Figure 12: Confirm that the IP address was added

    Figure 12: Confirm that the IP address was added

  3. In the AWS Management Console, on the left navigation menu, choose Web ACLs, and then set the selected Region to US East (N. Virginia).
  4. Find and select the web ACL name that starts with RegionalBlocklistACL. In the Rule view, on the Rules tab, select the rule named RegionalBlocklistIPSetRule. Note that 198.51.100.0/32 appears as an entry in the rule.
    Figure 13: Make sure that the IP address was added

    Figure 13: Make sure that the IP address was added

There might be specific host addresses that you want to prevent from being added to the blocklist. You can do this within GuardDuty by using a trusted IP list. Trusted IP lists consist of IP addresses that you have allowlisted for secure communication with your AWS infrastructure and applications. GuardDuty doesn’t generate findings for IP addresses on trusted IP lists. For more information, see Working with trusted IP lists and threat lists.

Step 7: Confirm the SNS notification email

Finally, verify that the SNS notification was sent to the email address you set up.

To confirm receipt of the SNS notification email

  • Review the email inbox that you specified for the AdminEmail parameter and look for a message with the subject line “AWS GD2ACL Alert”. The contents of the message from SNS should be similar to the following.
    Figure 14: SNS message example

    Figure 14: SNS message example

Step 8: Apply the AWS WAF web ACLs to resources

The final task is to associate the web ACL with the CloudFront distributions and Application Load Balancers that you want to automatically update with this pattern. To learn how to do this, see Associating or disassociating a web ACL with an AWS resource.

You can also use AWS Firewall Manager to associate the web ACLs. AWS Firewall Manager can simplify your AWS WAF administration and maintenance tasks across multiple accounts and resources. With Firewall Manager, you set up your firewall rules just once. The service automatically applies your rules across your accounts and resources, even as you add new resources.

Conclusion

In this post, you’ve learned how to use Lambda to automatically update AWS WAF and VPC network ACLs in response to GuardDuty findings. With just a few steps, you can use this sample pattern to help mitigate threats by blocking communication with suspicious hosts. You can explore additional possible patterns by using GuardDuty finding types and Amazon EventBridge target actions. This pattern’s code is available on GitHub. Feel free to play around with the code to add more GuardDuty findings to this pattern and also to build bigger and better patterns! Make sure to modify the patterns in this post to suit your defined requirements, and test and validate them before deploying them in a production environment.

If you have comments about this blog post, you can submit them in the Comments section below. If you have questions about using this pattern, start a thread in the GuardDutyAWS WAF, or CloudWatch forums, or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Eucke Warren

Eucke Warren

Eucke is a Sr Solution Architect helping ISV customers grow and mature securely. He has been fortunate to be able to work with technology for more than 30 years and counts automation, infrastructure, and security as areas of focus. When he’s not supporting customers, he enjoys time with his wife, family, and the company of a very bossy 18-pound dog.

Geoff Sweet

Geoff Sweet

Geoff has been in industry for over 20 years. He began his career in Electrical Engineering. Starting in IT during the dot-com boom, he has held a variety of diverse roles, such as systems architect, network architect, and, for the past several years, security architect. Geoff specializes in infrastructure security.

Genomics workflows, Part 5: automated benchmarking

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/genomics-workflows-part-5-automated-benchmarking/

Launching and running genomics workflows can take hours and involves large pools of compute instances that process data at a petabyte scale. Benchmarking helps you evaluate workflow performance and discover faster and cheaper ways of running them.

In practice, performance evaluations happen irregularly because of the associated heavy lifting. In this blog post, we discuss how life-science research teams can automate evaluations.

Business Benefits

An automated benchmarking solution provides:

  • more accurate enterprise resource planning by performing historical analytics,
  • lower cost to the business by comparing performance on different resource types, and
  • cost transparency to the business by quantifying periodical chargeback.

We’ve used automated benchmarking to compare processing times on different services such as Amazon Elastic Compute Cloud (Amazon EC2), AWS Batch, AWS ParallelCluster, Amazon Elastic Kubernetes Service (Amazon EKS), and on-premises HPC clusters. Scientists, financiers, technical leaders, and other stakeholders can build reports and dashboards to compare consumption data by consumer, workflow type, and time period.

Design pattern

Our automated benchmarking solution measures performance on two dimensions:

  • Timing: measures the duration of a workflow launch on a specific dataset
  • Pricing: measures the associated cost

This solution can be extended to other performance metrics such as iterations per second or process/thread distribution across compute nodes.

Our requirements include the following:

  • Consistent measurement of timing based on workflow status (such as preparing, waiting, ready, running, failed, complete)
  • Extensible pricing models based on unit prices (the Amazon EC2 Spot price at a specific period of time compared to Amazon EC2 On-Demand pricing)
  • Scalable, cost-efficient, and flexible data store enabling historical benchmarking and estimations
  • Minimal infrastructure management overhead

We choose a serverless design pattern using AWS Step Functions orchestration, AWS Lambda for our application code, and Amazon DynamoDB to track workflow launch IDs and states (as described in Part 3). We assume that the genomics workflows run on AWS Batch with genomics data on Amazon FSx for Lustre (Part 1). AWS Step Functions allows us to break down processing into smaller steps and avoid monolithic application code. Our evaluation process runs in four steps:

  1. Monitor for completed workflow launches in the DynamoDB stream using an Amazon EventBridge pipe with a Step Functions workflow as target. This event-driven approach is more efficient than periodic polling and avoids custom code for parsing status and cost values in all records of the DynamoDB stream.
  2. Collect a list of all compute resources associated with the workflow launch. Design a Lambda function that queries the AWS Batch API (see Part 1) to describe compute environment parameters like the Amazon EC2 instance IDs and their details, such as processing times, instance family/size, and allocation strategy (for example, Spot Instances, Reserved Instances, On-Demand Instances).
  3. Calculate the cost of all consumed resources. We achieve this with another Lambda function, which calculates the total price based on unit prices from the AWS Price List Query API.
  4. Our state machine updates the total price in the DynamoDB table without the need for additional application code.

Figure 1 visualizes these steps.

Automated benchmarking of genomics workflows

Figure 1. Automated benchmarking of genomics workflows

Implementation considerations

AWS Step Functions orchestrates our benchmarking workflow reliably and makes our application code easy to maintain. Figure 2 summarizes the state machine transitions that we’ll describe.

AWS Step Functions state machine for automated benchmarking

Figure 2. AWS Step Functions state machine for automated benchmarking

Gather consumption details

Configure the DynamoDB stream view type to New image so that the entire item is passed through as it appears after it was changed. We set up an Amazon EventBridge pipe with event filtering and the DynamoDB stream as a source. Our event filter uses multiple matching on records with a status of COMPLETE, but no cost entry in order to avoid an infinite loop. Once our state machine has updated the DynamoDB item with the workflow price, the resulting record in the DynamoDB stream will not pass our event filter.

The syntax of our event filter is as follows:

{
  "dynamodb": {
    "NewImage": {
      "status": {
        "S": ["COMPLETE"]
      },
      "totalCost": {
        "S": [{
          "exists": false
        }]
      }
    }
  }
}

We use an input transformer to simplify follow-on parsing by removing unnecessary metadata from the event.

The consumed resources included in the stream record are the auto-scaling group ID for AWS Batch and the Amazon FSx for Lustre volume ID. We use the DescribeJobs API (describe_jobs in Boto3) to determine which compute resources were used. If the response is a list of EC2 instances, we then look up consumption information including start and end times using the ListJobs API (list_jobs in Boto3) for each compute node. We use describe_volumes with filters on the identified EC2 instances to obtain the size and type of Amazon Elastic Block Store (Amazon EBS) volumes.

Calculate prices

Another Lambda function obtains the associated unit prices of all consumed resources using the GetProducts request of AWS Price List Query API (get_products in Boto3) and then parsing the pricePerUnit value. For Spot Instances, we use describe_spot_price_history of the EC2 client in Boto3 and specify the time range and instance types for which we want to receive prices.

Calculate the price of workflow launches based on the following factors:

  • Number and size of EC2 instances in auto-scaling node groups
  • Size of EBS volumes and Amazon FSx for Lustre
  • Processing duration

Our Python-based Lambda function calculates the total, rounds it, and delivers the price breakdown in the following format:

total_cost: str, instance_cost: str, volume_cost: str, filesystem_cost: str

Lastly, we put the price breakdown to the DynamoDB table using UpdateItem directly from the Amazon States Language.

Note that AWS credits and enterprise discounts might not be reflected in the responses of the AWS Price List Query API unless applied to the particular AWS account. This is often considered best practice in light of least-privilege considerations.

In the past, we’ve also used AWS Cost Explorer instead of the AWS Price List API. AWS Cost Explorer data is updated at least once every 24 hours. You can denote the pending price status in the DynamoDB table item and use the Wait state to delay the calculation process.

The presented solution can be extended to other compute services such as Amazon Elastic Kubernetes Service (Amazon EKS). For Amazon EKS, events are enriched with the cluster ID from the DynamoDB table and the price calculation should also include control plane costs.

Conclusion

Life-science research teams use benchmarking to compare workflow performance and inform their architectural decisions. Such evaluations are effort-intensive and therefore done irregularly.

In this blog post, we showed how life-science research teams can automate benchmarking for their scientific workflows. The insights teams gain from automated benchmarking indicate continuous optimization opportunities, such as by adjusting compute node configuration. The evaluation data is also available on demand for other purposes including chargeback.

Stay tuned for our next post in which we show how to use historical benchmarking data for price estimations of future workflow launches.

Related information

Implementing an event-driven serverless story generation application with ChatGPT and DALL-E

Post Syndicated from David Boyne original https://aws.amazon.com/blogs/compute/implementing-an-event-driven-serverless-story-generation-application-with-chatgpt-and-dall-e/

This post demonstrates how to integrate AWS serverless services with artificial intelligence (AI) technologies, ChatGPT, and DALL-E. This full stack event-driven application showcases a method of generating unique bedtime stories for children by using predetermined characters and scenes as a prompt for ChatGPT.

Every night at bedtime, the serverless scheduler triggers the application, initiating an event-driven workflow to create and store new unique AI-generated stories with AI-generated images and supporting audio.

These datasets are used to showcase the story on a custom website built with Next.js hosted with AWS App Runner. After the story is created, a notification is sent to the user containing a URL to view and read the story to the children.

Example notification of a story hosted with Next.js and App Runner

Example notification of a story hosted with Next.js and App Runner

By integrating AWS services with AI technologies, you can now create new and innovative ideas that were previously unimaginable.

The application mentioned in this blog post demonstrates examples of point-to-point messaging with Amazon EventBridge pipes, publish/subscribe patterns with Amazon EventBridge and reacting to change data capture events with DynamoDB Streams.

Understanding the architecture

The following image shows the serverless architecture used to generate stories:

Architecture diagram for Serverless bed time story generation with ChatGPT and DALL-E

Architecture diagram for Serverless bed time story generation with ChatGPT and DALL-E

A new children’s story is generated every day at configured time using Amazon EventBridge Scheduler (Step 1). EventBridge Scheduler is a service capable of scaling millions of schedules with over 200 targets and over 6000 API calls. This example application uses EventBridge scheduler to trigger an AWS Lambda function every night at the same time (7:15pm). The Lambda function is triggered to start the generation of the story.

EventBridge scheduler triggers Lambda function every day at 7:15pm (bed time)

EventBridge scheduler triggers Lambda function every day at 7:15pm (bed time)

The “Scenes” and “Characters” Amazon DynamoDB tables contain the characters involved in the story and a scene that is randomly selected during its creation. As a result, ChatGPT receives a unique prompt each time. An example of the prompt may look like this:

“`
Write a title and a rhyming story on 2 main characters called Parker and Jackson. The story needs to be set within the scene haunted woods and be at least 200 words long

“`

After the story is created, it is then saved in the “Stories” DynamoDB table (Step 2).

Scheduler triggering Lambda function to generate the story and store story into DynamoDB

Scheduler triggering Lambda function to generate the story and store story into DynamoDB

Once the story is created this initiates a change data capture event using DynamoDB Streams (Step 3). This event flows through point-to-point messaging with EventBridge pipes and directly into EventBridge. Input transforms are then used to convert the DynamoDB Stream event into a custom EventBridge event, which downstream consumers can comprehend. Adopting this pattern is beneficial as it allows us to separate contracts from the DynamoDB event schema and not having downstream consumers conform to this schema structure, this mapping allows us to remain decoupled from implementation details.

EventBridge Pipes connecting DynamoDB streams directly into EventBridge.

EventBridge Pipes connecting DynamoDB streams directly into EventBridge.

Upon triggering the StoryCreated event in EventBridge, three targets are triggered to carry out several processes (Step 4). Firstly, AI Images are processed, followed by the creation of audio for the story. Finally, the end user is notified of the completed story through Amazon SNS and email subscriptions. This fan-out pattern enables these tasks to be run asynchronously and in parallel, allowing for faster processing times.

EventBridge pub/sub pattern used to start async processing of notifications, audio, and images.

EventBridge pub/sub pattern used to start async processing of notifications, audio, and images.

An SNS topic is triggered by the `StoryCreated` event to send an email to the end user using email subscriptions (Step 6). The email consists of a URL with the id of the story that has been created. Clicking on the URL takes the user to the frontend application that is hosted with App Runner.

Using SNS to notify the user of a new story

Using SNS to notify the user of a new story

Example email sent to the user

Example email sent to the user

Amazon Polly is used to generate the audio files for the story (Step 6). Upon triggering the `StoryCreated` event, a Lambda function is triggered, and the story description is used and given to Amazon Polly. Amazon Polly then creates an audio file of the story, which is stored in Amazon S3. A presigned URL is generated and saved in DynamoDB against the created story. This allows the frontend application and browser to retrieve the audio file when the user views the page. The presigned URL has a validity of two days, after which it can no longer be accessed or listened to.

Lambda function to generate audio using Amazon Polly, store in S3 and update story with presigned URL

Lambda function to generate audio using Amazon Polly, store in S3 and update story with presigned URL

The `StoryCreated` event also triggers another Lambda function, which uses the OpenAI API to generate an AI image using DALL-E based on the generated story (Step 7). Once the image is generated, the image is downloaded and stored in Amazon S3. Similar to the audio file, the system generates a presigned URL for the image and saves it in DynamoDB against the story. The presigned URL is only valid for two days, after which it becomes inaccessible for download or viewing.

Lambda function to generate images, store in S3 and update story with presigned URL.

Lambda function to generate images, store in S3 and update story with presigned URL.

In the event of a failure in audio or image generation, the frontend application still loads the story, but does not display the missing image or audio at that moment. This ensures that the frontend can continue working and provide value. If you wanted more control and only trigger the user’s notification event once all parallel tasks are complete the aggregator messaging pattern can be considered.

Hosting the frontend Next.js application with AWS App Runner

Next.js is used by the frontend application to render server-side rendered (SSR) pages that can access the stories from the DynamoDB table, which are then hosted with AWS App Runner after being containerized.

Next.js application hosted with App Runner, with permissions into DynamoDB table.

Next.js application hosted with App Runner, with permissions into DynamoDB table.

AWS App Runner enables you to deploy containerized web applications and APIs securely, without needing any prior knowledge of containers or infrastructure. With App Runner, developers can concentrate on their application, while the service handles container startup, running, scaling, and load balancing. After deployment, App Runner provides a secure URL for clients to begin making HTTP requests against.

With App Runner, you have two primary options for deploying your container: source code connections or source images. Using source code connections grants App Runner permission to pull the image file directly from your source code, and with Automatic deployment configured, it can redeploy the application when changes are made. Alternatively, source images provide App Runner with the image’s location in an image registry, and this image is deployed by App Runner.

In this example application, CDK deploys the application using the DockerImageAsset construct with the App Runner construct. Once deployed, App Runner builds and uploads the frontend image to Amazon Elastic Container Registry (ECR) and deploys it. Downstream consumers can access the application using the secure URL provided by App Runner. In this example, the URL is used when the SNS notification is sent to the user when the story is ready to be viewed.

Giving the frontend container permission to DynamoDB table

To grant the Next.js application permission to obtain stories from the Stories DynamoDB table, App Runner instance roles are configured. These roles are optional and can provide the necessary permissions for the container to access AWS services required by the compute service.

If you want to learn more about AWS App Runner, you can explore the free workshop.

Design choices and assumptions

The DynamoDB Time to Live (TTL) feature is ideal for the short-lived nature of daily generated stories. DynamoDB handle the deletion of stories after two days by setting the TTL attribute on each story. Once a story is deleted, it becomes inaccessible through the generated story URLs.

Using Amazon S3 presigned URLs is a method to grant temporary access to a file in S3. This application creates presigned URLs for the audio file and generated images that last for 2 days, after which the URLs for the S3 items become invalid.

Input transforms are used between DynamoDB streams and EventBridge events to decouple the schemas and events consumed by downstream targets. Consuming the events as they are is known as the “conformist” pattern, and couples us to implementation details of DynamoDB streams with downstream EventBridge consumers. This allows the application to remain decoupled from implementation details and remain flexible.

Conclusion

The adoption of artificial intelligence (AI) technology has significantly increased in various industries. ChatGPT, a large language model that can understand and generate human-like responses in natural language, and DALL-E, an image generation system that can create realistic images based on textual descriptions, are examples of such technology. These systems have demonstrated the potential for AI to provide innovative solutions and transform the way we interact with technology.

This blog post explores ways in which you can utilize AWS serverless services with ChatGTP and DALL-E to create a story generation application fronted by a Next.js application hosted with App Runner. EventBridge Scheduler is used to trigger the story creation process then react to change data capture events with DynamoDB streams and EventBridge Pipes, and use Amazon EventBridge to fan out compute tasks to process notifications, images, and audio files.

You can find the documentation and the source code for this application in GitHub.

For more serverless learning resources, visit Serverless Land.

Managing sessions of anonymous users in WebSocket API-based applications

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/managing-sessions-of-anonymous-users-in-websocket-api-based-applications/

This post is written by Alexey Paramonov, Solutions Architect, ISV.

This blog post demonstrates how to reconnect anonymous users to WebSocket API without losing their session context. You learn how to link WebSocket API connection IDs to the logical user, what to do when the connection fails, and how to store and recover session information when the user reconnects.

WebSocket APIs are common in modern interactive applications. For example, for watching stock prices, following live chat feeds, collaborating with others in productivity tools, or playing online games. Another example described in “Implementing reactive progress tracking for AWS Step Functions” blog post uses WebSocket APIs to send progress back to the client.

The backend is aware of the client’s connection ID to find the right client to send data to. But if the connection temporarily fails, the client reconnects with a new connection ID. If the backend does not have a mechanism to associate a new connection ID with the same client, the interaction context is lost and the user must start over again. In a live chat feed that could mean skipping to the most recent message with a loss of some previous messages. And in case of AWS Step Functions progress tracking the user would lose progress updates.

Overview

The sample uses Amazon API Gateway WebSocket APIs. It uses AWS Lambda for WebSocket connection management and for mocking a teleprinter to generate a stateful stream of characters for testing purposes. There are two Amazon DynamoDB tables for storing connection IDs and user sessions, as well as an optional MediaWiki API for retrieving an article.

The following diagram outlines the architecture:

Reference architecture

  1. The browser generates a random user ID and stores it locally in the session storage. The client sends the user ID inside the Sec-WebSocket-Protocol header to WebSocket API.
  2. The default WebSocket API route OnConnect invokes OnConnect Lambda function and passes the connection ID and the user ID to it. OnConnect Lambda function determines if the user ID exists in the DynamoDB table and either creates a new item or updates the existing one.
  3. When the connection is open, the client sends the Read request to the Teleprinter Lambda function.
  4. The Teleprinter Lambda function downloads an article about WebSocket APIs from Wikipedia and stores it as a string in memory.
  5. The Teleprinter Lambda function checks if there is a previous state stored in the Sessions table in DynamoDB. If it is a new user, the Teleprinter Lambda function starts sending the article from the beginning character by character back to the client via WebSocket API. If it is a returning user, the Teleprinter Lambda function retrieves the last cursor position (the last successfully sent character) from the DynamoDB table and continues from there.
  6. The Teleprinter Lambda function sends 1 character every 100ms back to the client.
  7. The client receives the article and prints it out character by character as each character arrives.
  8. If the connection breaks, the WebSocket API calls the OnDisconnect Lambda function automatically. The function marks the connection ID for the given user ID as inactive.
  9. If the user does not return within 5 minutes, Amazon EventBridge scheduler invokes the OnDelete Lambda function, which deletes items with more than 5 minutes of inactivity from Connections and Sessions tables.

A teleprinter returns data one character at a time instead of a single response. When the Lambda function fetches an article, it feeds it character by character inside the for-loop with a delay of 100ms on every iteration. This demonstrates how the user can continue reading the article after reconnecting and not starting the feed all over again. The pattern could be useful for traffic limiting by slowing down user interactions with the backend.

Understanding sample code functionality

When the user connects to the WebSocket API, the client generates a user ID and saves it in the browser’s session storage. The user ID is a random string. When the client opens a WebSocket connection, the frontend sends the user ID inside the Sec-WebSocket-Protocol header, which is standard for WebSocket APIs. When using the JavaScript WebSocket library, there is no need to specify the header manually. To initialize a new connection, use:

const newWebsocket = new WebSocket(wsUri, userId);

wsUri is the deployed WebSocket API URL and userId is a random string generated in the browser. The userId goes to the Sec-WebSocket-Protocol header because WebSocket APIs generally offer less flexibility in choosing headers compared to RESTful APIs. Another way to pass the user ID to the backend could be a query string parameter.

The backend receives the connection request with the user ID and connection ID. The WebSocket API generates the connection ID automatically. It’s important to include the Sec-WebSocket-Protocol in the backend response, otherwise the connection closes immediately:

    return {
        'statusCode': 200,
        'headers': {
            'Sec-WebSocket-Protocol': userId
        }
    }

The Lambda function stores this information in the DynamoDB Connections table:

    ddb.put_item(
        TableName=table_name,
        Item={
            'userId': {'S': userId},
            'connectionId': {'S': connection_id},
            'domainName': {'S': domain_name},
            'stage': {'S': stage},
            'active': {'S': True}
        }
    )

The active attribute indicates if the connection is up. In the case of inactivity over a specified time limit, the OnDelete Lambda function deletes the item automatically. The Put operation comes handy here because you don’t need to query the DB if the user already exists. If it is a new user, Put creates a new item. If it is a reconnection, Put updates the connectionId and sets active to True again.

The primary key for the Connections table is userId, which helps locate users faster as they reconnect. The application relies on a global secondary index (GSI) for locating connectionId. This is necessary for marking the connection inactive when WebSocket API calls OnDisconnect function automatically.

Now the application has connection management, you can retrieve session data. The Teleprinter Lambda function receives a connection ID from WebSocket API. You can query the GSI of Connections table and find if the user exists:

    def get_user_id(connection_id):
        response = ddb.query(
            TableName=connections_table_name,
            IndexName='connectionId-index',
            KeyConditionExpression='connectionId = :c',
            ExpressionAttributeValues={
                ':c': {'S': connection_id}
            }
        )

        items = response['Items']

        if len(items) == 1:
            return items[0]['userId']['S']

        return None

Another DynamoDB table called Sessions is used to check if the user has an existing session context. If yes, the Lambda function retrieves the cursor position and resumes teletyping. If it is a brand-new user, the Lambda function starts sending characters from the beginning. The function stores a new cursor position if the connection breaks. This makes it easier to detect stale connections and store the current cursor position in the Sessions table.

    Try:
        api_client.post_to_connection(
            ConnectionId=connection_id,
            Data=bytes(ch, 'utf-8')
        )
    except api_client.exceptions.GoneException as e:
        print(f"Found stale connection, persisting state")
        store_cursor_position(user_id, pos)
        return {
            'statusCode': 410
        }

    def store_cursor_position(user_id, pos):
        ddb.put_item(
            TableName=sessions_table_name,
            Item={
                'userId': {'S': user_id},
                'cursorPosition': {'N': str(pos)}
            }
        )

After this, the Teleprinter Lambda function ends.

Serving authenticated users

The same mechanism also works for authenticated users. The main difference is it takes a given ID from a JWT token or other form of authentication and uses it instead of a randomly generated user ID. The backend relies on unambiguous user identification and may require only minor changes for handling authenticated users.

Deleting inactive users

The OnDisconnect function marks user connections as inactive and adds a timestamp to ‘lastSeen’ attribute. If the user never reconnects, the application should purge inactive items from Connections and Sessions tables. Depending on your operational requirements, there are two options.

Option 1: Using EventBridge Scheduler

The sample application uses EventBridge to schedule OnDelete function execution every 5 minutes. The following code shows how AWS SAM adds the scheduler to the OnDelete function:

  OnDeleteSchedulerFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: app.handler
      Runtime: python3.9
      CodeUri: handlers/onDelete/
      MemorySize: 128
      Environment:
        Variables:
          CONNECTIONS_TABLE_NAME: !Ref ConnectionsTable
          SESSIONS_TABLE_NAME: !Ref SessionsTable
      Policies:
      - DynamoDBCrudPolicy:
          TableName: !Ref ConnectionsTable
      - DynamoDBCrudPolicy:
          TableName: !Ref SessionsTable
      Events:
        Schedule:
          Type: ScheduleV2
          Properties:
            ScheduleExpression: rate(5 minute)

The Events section of the function definition is where AWS SAM sets up the scheduled execution. Change values in ScheduleExpression to meet your scheduling requirements.

The OnDelete function relies on the GSI to find inactive user IDs. The following code snippet shows how to query connections with more than 5 minutes of inactivity:

    five_minutes_ago = int((datetime.now() - timedelta(minutes=5)).timestamp())

    stale_connection_items = table_connections.query(
        IndexName='lastSeen-index',
        KeyConditionExpression='active = :hk and lastSeen < :rk',
        ExpressionAttributeValues={
            ':hk': 'False',
            ':rk': five_minutes_ago
        }
    )

After that, the function iterates through the list of user IDs and deletes them from the Connections and Sessions tables:

    # remove inactive connections
    with table_connections.batch_writer() as batch:
        for item in inactive_users:
            batch.delete_item(Key={'userId': item['userId']})

    # remove inactive sessions
    with table_sessions.batch_writer() as batch:
        for item in inactive_users:
            batch.delete_item(Key={'userId': item['userId']})

The sample uses batch_writer() to avoid requests for each user ID.

Option 2: Using DynamoDB TTL

DynamoDB provides a built-in mechanism for expiring items called Time to Live (TTL). This option is easier to implement. With TTL, there’s no need for EventBridge scheduler, OnDelete Lambda function and additional GSI to track time span. To set up TTL, use the ‘lastSeen’ attribute as an object creation time. TTL deletes or archives the item after a specified period of time. When using AWS SAM or AWS CloudFormation templates, add TimeToLiveSpecification to your code.

The TTL typically deletes expired items within 48 hours. If your operational requirements demand faster and more predictable timing, use option 1. For example, if your application aggregates data while the user is offline, use option 1. But if your application stores a static session data, like cursor position used in this sample, option 2 can be an easier alternative.

Deploying the sample

Prerequisites

Ensure you can manage AWS resources from your terminal.

  • AWS CLI and AWS SAM CLI installed.
  • You have an AWS account. If not, visit this page.
  • Your user has enough permissions to manage AWS resources.
  • Git is installed.
  • NPM is installed (only for local frontend deployment).

You can find the source code and README here.

The repository contains both the frontend and the backend code. To deploy the sample, follow this procedure:

  1. Open a terminal and clone the repository:
    git clone https://github.com/aws-samples/websocket-sessions-management.git
  2. Navigate to the root of the repository.
  3. Build and deploy the AWS SAM template:
    sam build && sam deploy --guided
  4. Note the value of WebSocketURL in the output. You need it later.

With the backend deployed, test it using the hosted frontend.

Testing the application

Testing the application

You can watch an animated demo here.

Notice that the app has generated a random user ID on startup. The app shows the user ID above in the header.

  1. Paste the WebSocket URL into the text field. You can find the URL in the console output after you have successfully deployed your AWS SAM template. Alternatively, you can navigate to AWS Management Console (make sure you are in the right Region), select the API you have recently deployed, go to “Stages”, select the deployed stage and copy the “WebSocket URL” value.
  2. Choose Connect. The app opens a WebSocket connection.
  3. Choose Tele-read to start receiving the Wikipedia article character by character. New characters appear in the second half of the screen as they arrive.
  4. Choose Disconnect to close WebSocket connection. Reconnect again and choose Tele-read. Your session resumes from where you stopped.
  5. Choose Reset Identity. The app closes the WebSocket connection and changes the user ID. Choose Connect and Tele-read again and your character feed starts from the beginning.

Cleaning up

To clean up the resources, in the root directory of the repository run:

sam delete

This removes all resources provisioned by the template.yml file.

Conclusion

In this post, you learn how to keep track of user sessions when using WebSockets API and not lose the session context when the user reconnects again. Apply learnings from this example to improve your user experience when using WebSocket APIs for web-frontend and mobile applications, where internet connections may be unstable.

For more serverless learning resources, visit  Serverless Land.

Building an event-driven solution for AvalonBay property leasing and search

Post Syndicated from Kausik Dey original https://aws.amazon.com/blogs/architecture/building-an-event-driven-solution-for-avalonbay-property-leasing-and-search/

In this blog post, we show you how to build an event-driven and serverless solution for property leasing and search that is scalable and resilient. This solution was created for AvalonBay Communities, Inc.—a leading residential Real Estate Investment Trusts (REITs). It enables:

  • More than 150,000 multi-parameter searches per day
  • The processing of more than 3,500 lease applications and 85,000 individual rent payments per month

Introduction

AvalonBay is an equity REIT. The company has a long track record of developing, redeveloping, acquiring, and managing apartment homes in top U.S. markets. AvalonBay builds long-term value for customers using innovative technology solutions.

The company understands that data-driven insights contribute to targeted business growth. But AvalonBay found that managing the complex interdependencies between multiple data sets—from real estate and property management systems to financial and payment systems—required a new solution.

The challenge

AvalonBay owned or held a direct or indirect ownership interest in 293 apartment communities containing 88,405 apartment homes in 12 states and Washington, D.C., as of September 30, 2022.

Of these, 18 communities were under development and one was under redevelopment. This presented a unique challenge to both internal and external users looking to search and lease apartment units based on multi-parameter selection criteria in geographically dispersed regions. For example, finding units in buildings with specific amenities, lease terms, furnishings and availability dates.

Overview of solution

AvalonBay’s fully managed leasing solution for applicants and residents is hosted by Amazon Web Services (AWS). The solution is secure, autoscaling, and multi-region, ensuring resiliency and performance with efficient resource usage.

In this event-driven solution, AvalonBay’s leasing service is hosted in multiple AWS Regions to provide low latency response to users across various geolocations. This blog post focuses on showing use case implementation in only one region—Region East— as shown in Figure 1.

AvalonBay lease processing platform

Figure 1. AvalonBay lease processing platform

Several AWS services come together in this solution to meet key company objectives. Let’s explore each one and its purpose within the architecture.

    1. Amazon Route 53: For AvalonBay’s Lease Processing solution, any non-transient service failure is unacceptable. In addition to providing a high degree of resiliency through a Multi-AZ architecture, Lease Processing also provides regional-level high availability through its multi-Region active-active architecture. Route 53 with latency-based routing allows dynamic rerouting of requests within seconds to alternate Regions.
    2. Amazon API Gateway: Route 53 latency-based routing is configured across multiple AWS Regions as Route 53 to route traffic to an API Gateway endpoint. API Gateway authorizers were added to control access to APIs using an Amazon Cognito user pool.
    3. AWS Lambda with provisioned concurrency: The Lambda services are set up for automatic scaling, secured through a private subnet, and span across Availability Zones. This provides horizontal scaling capability, self-healing capacity, and resiliency across Availability Zones. Provisioned concurrency minimizes the estimate of cold starts by generating execution environments. It also greatly reduces time spent on APIs invocations.
    4. Amazon Aurora V2 for Amazon Relational Database Service (Amazon RDS) PostgreSQL-Compatible Edition: Aurora Serverless V2 is an on-demand, autoscaling configuration for Aurora. Serverless Aurora V2 PostgreSQL-Compatible is used for the Lease Processing solution. The global database was configured with two Regions; us-east-1 as the primary cluster and us-west-2 as the secondary cluster. Automated Aurora global database endpoint management for planned and unplanned failover is configured through a Route 53 private hosted zone, Amazon EventBridge, and Lambda.
    5. Amazon RDS Proxy for Aurora: Amazon RDS Proxy allows the leasing application to pool and share database connections to improve its ability to scale. It also makes the leasing solution more resilient to database failures by automatically connecting to a standby database instance while preserving application connections.
    6. Amazon EventBridge: EventBridge supports the solution through two primary purposes:
      1. Oversees lease flow events – During the lease application process, the solution generates various events which are consumed by AvalonBay and external applications such as property management, finance portals, administration, and more. Leasing events are sent to EventBridge and various event rules are configured for multiple destinations including Lambda, Amazon Simple Notification Service (Amazon SNS), and external API endpoints.
      2. Handles global Aurora V2 failover – Aurora generates events when certain actions are taken or events occur, including any type of global database activity.
        • When a managed planned failover is initiated—either via the AWS Command Line Interface (AWS CLI), API, or console—the global database failover process starts and generates an event.
        • When an Aurora cluster is removed from a global cluster through the AWS CLI, API, or console, the Aurora cluster is promoted as a single primary cluster. Once this process is completed, it generates an event.

An EventBridge rule is created to match an event pattern any time a global database managed planned failover completes successfully in a Region. When a failover is completed, the completion event is detected, and this rule is triggered. The event rule is configured to invoke a Lambda function that is triggered on global database failover and updates the Amazon CloudFront CNAME record to the correct value.

Scalable search solution

Leasing professionals need to easily scan a huge amount of property information using AvalonBay’s Search Solution to obtain required information.

Using the Amazon OpenSearch Service, agents can generate property profiles and other asset data to identify matching units and quickly respond to end customers. OpenSearch is a fully open-source search and analytics engine that securely unlocks real-time search, monitoring, and analysis of business and operational data. It is employed for use cases such as application monitoring, log analytics, observability, and website search.

The AvalonBay Search Service solution architecture featuring OpenSearch is shown in Figure 2.

AvalonBay Search Service solution architecture

Figure 2: AvalonBay Search Service solution architecture

AvalonBay search requires search criteria including keyword and Universal Resource Identifier (URI) search, SQL-based search, and custom package search, all of which are detailed in the Amazon OpenSearch Service Developer Guide.

OpenSearch automatically detects and replaces failed OpenSearch Service nodes, reducing the overhead associated with self-managed infrastructures.

Let’s explore this architecture further by step.

  1. Amazon Kinesis event stream – The AvalonBay community requires near real-time updates to search attributes such as amenities, features, promotion, and pricing. Events created through various producers are streamed through Kinesis and inserted or updated through OpenSearch.
  2. Amazon OpenSearch – OpenSearch is used for end-to-end community search—a managed service making it easy to deploy, operate, and scale OpenSearch clusters in the AWS Cloud. As community search data is read-only, UltraWarm and cold storage are also used based on usage frequency.
  3. Amazon Simple Storage Service (S3): Various community documents, policies, and image or video files are key search elements. They must be maintained securely and reliably for years due to contractual obligations. Amazon S3 simplifies this task with high durabilitylifecycle rules, and varied controls for retention.

Conclusion

This post showed how AvalonBay has built and deployed custom leasing and search solutions on AWS serverless platforms without compromising resiliency, performance, and capacity requirements. This is a 24/7, fully managed solution with no additional equipment on-premises.

Choosing AWS for leasing and search solutions gives AvalonBay the ability to dynamically scale and meet future growth demands while introducing cost advantages. In addition, the global availability of AWS services makes it possible to deploy services across geographic locations to meet performance requirements.

Scaling an ASG using target tracking with a dynamic SQS target

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/scaling-an-asg-using-target-tracking-with-a-dynamic-sqs-target/

This blog post is written by Wassim Benhallam, Sr Cloud Application Architect AWS WWCO ProServe, and Rajesh Kesaraju, Sr. Specialist Solution Architect, EC2 Flexible Compute.

Scaling an Amazon EC2 Auto Scaling group based on Amazon Simple Queue Service (Amazon SQS) is a commonly used design pattern in decoupled applications. For example, an EC2 Auto Scaling Group can be used as a worker tier to offload the processing of audio files, images, or other files sent to the queue from an upstream tier (e.g., web tier). For latency-sensitive applications, AWS guidance describes a common pattern that allows an Auto Scaling group to scale in response to the backlog of an Amazon SQS queue while accounting for the average message processing duration (MPD) and the application’s desired latency.

This post builds on that guidance to focus on latency-sensitive applications where the MPD varies over time. Specifically, we demonstrate how to dynamically update the target value of the Auto Scaling group’s target tracking policy based on observed changes in the MPD. We also cover the utilization of Amazon EC2 Spot instances, mixed instance policies, and attribute-based instance selection in the Auto Scaling Groups as well as best practice implementation to achieve greater cost savings.

The challenge

The key challenge that this post addresses is applications that fail to honor their acceptable/target latency in situations where the MPD varies over time. Latency refers here to the time required for any queue message to be consumed and fully processed.

Consider the example of a customer using a worker tier to process image files (e.g., resizing, rescaling, or transformation) uploaded by users within a target latency of 100 seconds. The worker tier consists of an Auto Scaling group configured with a target tracking policy. To achieve the target latency mentioned previously, the customer assumes that each image can be processed in one second, and configures the target value of the scaling policy so that the average image backlog per instance is maintained at approximately 100 images.

In the first week, the customer submits 1000 images to the Amazon SQS queue for processing, each of which takes one second of processing time. Therefore, the Auto Scaling group scales to 10 instances, each of which processes 100 images in 100s, thereby honoring the target latency of 100s.

In the second week, the customer submits 1000 slightly larger images for processing. Since an image’s processing duration scales with its size, each image takes two seconds to process. As in the first week, the Auto Scaling group scales to 10 instances, but this time each instance processes 100 images in 200s, which is twice as long as was needed in the first round. As a result, the application fails to process the latter images within its acceptable latency.

Therefore, the challenge is common to any latency-sensitive application where the MPD is subject to change. Applications where the processing duration scales with input data size are particularly vulnerable to this problem. This includes image processing, document processing, computational jobs, and others.

Solution overview

Before we dive into the solution, let’s briefly review the target tracking policy’s scaling metric and its corresponding target value. A target tracking scaling policy works by adjusting the capacity to keep a scaling metric at, or close to, the specified target value. When scaling in response to an Amazon SQS backlog, it’s good practice to use a scaling metric known as the Backlog Per Instance (BPI) and a target value based on the acceptable BPI. These are computed as follows:

BPI equation, Saling metric and target value.

Given the acceptable BPI equation, a longer MPD requires us to use a smaller target value if we are to process these messages in the same acceptable latency, and vice versa. Therefore, the solution we propose here works by monitoring the average MPD over time and dynamically adjusting the target value of the Auto Scaling group’s target tracking policy (acceptable BPI) based on the observed changes in the MPD. This allows the scaling policy to adapt to variations in the average MPD over time, and thus enables the application to honor its acceptable latency.

Solution architecture

To demonstrate how the approach above can be implemented in practice, we put together an example architecture highlighting the services involved (see the following figure). We also provide an automated deployment solution for this architecture using an AWS Serverless Application Model (AWS SAM) template and some Python code (repository link). The repository also includes a README file with detailed instructions that you can follow to deploy the solution. The AWS SAM template deploys several resources, including an autoscaling group, launch template, target tracking scaling policy, an Amazon SQS queue, and a few AWS Lambda functions that serve various functions described as follows.

The Amazon SQS queue is used to accumulate messages intended for processing, while the Auto Scaling group instances are responsible for polling the queue and processing any messages received. To do this, a launch template defines a bootstrap script that allows the group’s instances to download and execute a Python code when first launched. The Python code consumes messages from the Amazon SQS queue and simulates their processing by sleeping for the MPD duration specified in the message body. After processing each message, the instance publishes the MPD as an Amazon CloudWatch metric (see the following figure).

Figure 1: Architecture diagram showing the components deployed by the AWS SAM template. These include an SQS queue, an Auto Scaling group responsible for polling and processing queue messages, a Lambda function that regularly updates the BPI CloudWatch metric, and a “Target Setter” Lambda function that regularly updates the Auto Scaling group’s target tracking scaling policy.

Figure 1: Architecture diagram showing the components deployed by the AWS SAM template.

To enable scaling, the Auto Scaling group is configured with a target tracking scaling policy that specifies BPI as the scaling metric and with an initial target value provided by the user.

The BPI CloudWatch metric is calculated and published by the “Metric-Publisher” Lambda function which is invoked every one minute using an Amazon EventBridge rate expression. To calculate BPI, the Lambda function simply takes the ratio of the number of messages visible in the Amazon SQS queue by the total number of in-service instances in the Auto Scaling group, as shown in equation (2) above.

On the other hand, the scaling policy’s target value is updated by the “Target-Setter” Lambda function, which is invoked every 30 minutes using another EventBridge rate expression. To calculate the new target value, the Lambda function simply takes the ratio of the user-defined acceptable latency value by the current average MPD queried from the corresponding CloudWatch metric, as shown in the previous equation (1).

Finally, to help you quickly test this solution, a Lambda function “Testing Lambda” is also provided and can be used to send messages to the Amazon SQS queue with a processing duration of your choice. This is specified within each message’s body. You can invoke this Lambda function with different MPDs (by modifying the corresponding environment variable) to verify how the Auto Scaling group scales in response. A CloudWatch dashboard is also deployed to enable you to track key scaling metrics through time. These include the number of messages visible in the queue, the number of in-service instances in the Auto Scaling group, the MPD, and BPI vs acceptable BPI.

Solution testing

To demonstrate the solution in action and its impact on application latency, we conducted two tests that you can reproduce by following the instructions described in the “Testing” section of the repository’s README file (repository link). In both tests, we assume a hypothetical application with a target latency of 300s. We also modified the invocation frequency of the “Target Setter” Lambda function to one minute to quickly assess the impact of target value changes. In both tests, we submit 50 messages to the Amazon SQS queue through the provided helper lambda. An MPD of 25s and 50s was used for the first and second test, respectively. The provided CloudWatch dashboard shows that the ASG scales to a total of four instances in the first test, and eight instances in the second (see the following figure). See README file for a detailed description of how various metrics evolve over time.

Comparison of Tests 1 and 2

Since Test 2 messages take twice as long to process, the Auto Scaling group launched twice as many instances to attempt to process all of the messages in the same amount of time as Test 1 (latency). The following figure shows that the total time to process all 50 messages in Test 1 was 9 mins vs 10 mins in Test 2. In contrast, if we were to use a static/fixed acceptable BPI of 12, then a total of four instances would have been operational in Test 2, thereby requiring double the time of Test 1 (~20 minutes) to process all of the messages. This demonstrates the value of using a dynamic scaling target when processing messages from Amazon SQS queues, especially in circumstances where the MPD is prone to vary with time.

Figure 2: CloudWatch dashboard showing Auto Scaling group scaling test results (Test 1 and 2). Although Test 2 messages require double the MPD of Test 1 messages, the Auto Scaling group processed Test 2 messages in the same amount of time as Test 1 by launching twice as many instances.

Figure 2: CloudWatch dashboard showing Auto Scaling group scaling test results (Test 1 and 2).

Recommended best practices for Auto Scaling groups

This section highlights a few key best practices that we recommend adopting when deploying and working with Auto Scaling groups.

Reducing cost using EC2 Spot instances

Amazon SQS helps build loosely coupled application architectures, while providing reliable asynchronous communication between the various layers/components of an application. If a worker node fails to process a message within the Amazon SQS message visibility time-out, then the message is returned to the queue and another worker node can pick up and process that message. This makes Amazon SQS-backed applications fault-tolerant by design and thus a great fit for EC2 Spot instances. EC2 spot instances are spare compute capacity in the AWS cloud that is available to you at steep discounts as compared to On-Demand prices.

Maximizing capacity using attribute-based instance selection

With the recently released attribute-based instance selection feature, you can define infrastructure requirements based on application needs such as vCPU, RAM, and processor family (e.g., x86, ARM). This removes the need to define specific instances in your Auto Scaling group configuration, and it eliminates the burden of identifying the correct instance families and sizes. In addition, newly released instance types will be automatically considered if they fit your requirements. Attribute-based instance selection lets you tap into hundreds of different EC2 instance pools, which increases the chance of getting EC2 (Spot/On-demand) instances. When using attribute-based instance selection with the capacity optimized allocation strategy, Amazon EC2 allocates instances from deeper Spot capacity pools, thereby further reducing the chance of Spot interruption.

The following sample configuration creates an Auto Scaling group with attribute-based instance selection:

AutoScalingGroupName: 'my-asg' # [REQUIRED] 
MixedInstancesPolicy:
  LaunchTemplate:
    LaunchTemplateSpecification:
      LaunchTemplateId: 'lt-0537239d9aef10a77'
    Overrides:
    - InstanceRequirements:
        VCpuCount: # [REQUIRED] 
          Min: 2
          Max: 4
        MemoryMiB: # [REQUIRED] 
          Min: 2048
  InstancesDistribution:
    SpotAllocationStrategy: 'capacity-optimized'
MinSize: 0 # [REQUIRED] 
MaxSize: 100 # [REQUIRED] 
DesiredCapacity: 4
VPCZoneIdentifier: 'subnet-e76a128a,subnet-e66a128b,subnet-e16a128c'

Conclusion

As can be seen from the test results, this approach demonstrates how an Auto Scaling group can honor a user-provided acceptable latency constraint while accomodating variations in the MPD over time. This is possible because the average MPD is monitored and regularly updated as a CloudWatch metric. In turn, this is continously used to update the target value of the group’s target tracking policy. Moreover, we have covered additional Auto Scaling group best practices suitable for this use case, including the use of Spot instances to reduce costs and attribute-based instance selection to simplify the selection of relevant instance types.

For more information on scaling options for Auto Scaling groups, visit the Amazon EC2 Auto Scaling documentation page and the SQS-based scaling guide.

Implementing architectural patterns with Amazon EventBridge Pipes

Post Syndicated from David Boyne original https://aws.amazon.com/blogs/compute/implementing-architectural-patterns-with-amazon-eventbridge-pipes/

This post is written by Dominik Richter (Solutions Architect)

Architectural patterns help you solve recurring challenges in software design. They are blueprints that have been used and tested many times. When you design distributed applications, enterprise integration patterns (EIP) help you integrate distributed components. For example, they describe how to integrate third-party services into your existing applications. But patterns are technology agnostic. They do not provide any guidance on how to implement them.

This post shows you how to use Amazon EventBridge Pipes to implement four common enterprise integration patterns (EIP) on AWS. This helps you to simplify your architectures. Pipes is a feature of Amazon EventBridge to connect your AWS resources. Using Pipes can reduce the complexity of your integrations. It can also reduce the amount of code you have to write and maintain.

Content filter pattern

The content filter pattern removes unwanted content from a message before forwarding it to a downstream system. Use cases for this pattern include reducing storage costs by removing unnecessary data or removing personally identifiable information (PII) for compliance purposes.

In the following example, the goal is to retain only non-PII data from “ORDER”-events. To achieve this, you must remove all events that aren’t “ORDER” events. In addition, you must remove any field in the “ORDER” events that contain PII.

While you can use this pattern with various sources and targets, the following architecture shows this pattern with Amazon Kinesis. EventBridge Pipes filtering discards unwanted events. EventBridge Pipes input transformers remove PII data from events that are forwarded to the second stream with longer retention.

Instead of using Pipes, you could connect the streams using an AWS Lambda function. This requires you to write and maintain code to read from and write to Kinesis. However, Pipes may be more cost effective than using a Lambda function.

Some situations require an enrichment function. For example, if your goal is to mask an attribute without removing it entirely. For example, you could replace the attribute “birthday” with an “age_group”-attribute.

In this case, if you use Pipes for integration, the Lambda function contains only your business logic. On the other hand, if you use Lambda for both integration and business logic, you do not pay for Pipes. At the same time, you add complexity to your Lambda function, which now contains integration code. This can increase its execution time and cost. Therefore, your priorities determine the best option and you should compare both approaches to make a decision.

To implement Pipes using the AWS Cloud Development Kit (AWS CDK), use the following source code. The full source code for all of the patterns that are described in this blog post can be found in the AWS samples GitHub repo.

const filterPipe = new pipes.CfnPipe(this, 'FilterPipe', {
  roleArn: pipeRole.roleArn,
  source: sourceStream.streamArn,
  target: targetStream.streamArn,
  sourceParameters: { filterCriteria: { filters: [{ pattern: '{"data" : {"event_type" : ["ORDER"] }}' }] }, kinesisStreamParameters: { startingPosition: 'LATEST' } },
  targetParameters: { inputTemplate: '{"event_type": <$.data.event_type>, "currency": <$.data.currency>, "sum": <$.data.sum>}', kinesisStreamParameters: { partitionKey: 'event_type' } },
});

To allow access to source and target, you must assign the correct permissions:

const pipeRole = new iam.Role(this, 'FilterPipeRole', { assumedBy: new iam.ServicePrincipal('pipes.amazonaws.com') });

sourceStream.grantRead(pipeRole);
targetStream.grantWrite(pipeRole);

Message translator pattern

In an event-driven architecture, event producers and consumers are independent of each other. Therefore, they may exchange events of different formats. To enable communication, the events must be translated. This is known as the message translator pattern. For example, an event may contain an address, but the consumer expects coordinates.

If a computation is required to translate messages, use the enrichment step. The following architecture diagram shows how to accomplish this enrichment via API destinations. In the example, you can call an existing geocoding service to resolve addresses to coordinates.

There may be cases where the translation is purely syntactical. For example, a field may have a different name or structure.

You can achieve these translations without enrichment by using input transformers.

Here is the source code for the pipe, including the role with the correct permissions:

const pipeRole = new iam.Role(this, 'MessageTranslatorRole', { assumedBy: new iam.ServicePrincipal('pipes.amazonaws.com'), inlinePolicies: { invokeApiDestinationPolicy } });

sourceQueue.grantConsumeMessages(pipeRole);
targetStepFunctionsWorkflow.grantStartExecution(pipeRole);

const messageTranslatorPipe = new pipes.CfnPipe(this, 'MessageTranslatorPipe', {
  roleArn: pipeRole.roleArn,
  source: sourceQueue.queueArn,
  target: targetStepFunctionsWorkflow.stateMachineArn,
  enrichment: enrichmentDestination.apiDestinationArn,
  sourceParameters: { sqsQueueParameters: { batchSize: 1 } },
});

Normalizer pattern

The normalizer pattern is similar to the message translator but there are different source components with different formats for events. The normalizer pattern routes each event type through its specific message translator so that downstream systems process messages with a consistent structure.

The example shows a system where different source systems store the name property differently. To process the messages differently based on their source, use an AWS Step Functions workflow. You can separate by event type and then have individual paths perform the unifying process. This diagram visualizes that you can call a Lambda function if needed. However, in basic cases like the preceding “name” example, you can modify the events using Amazon States Language (ASL).

In the example, you unify the events using Step Functions before putting them on your event bus. As is often the case with architectural choices, there are alternatives. Another approach is to introduce separate queues for each source system, connected by its own pipe containing only its unification actions.

This is the source code for the normalizer pattern using a Step Functions workflow as enrichment:

const pipeRole = new iam.Role(this, 'NormalizerRole', { assumedBy: new iam.ServicePrincipal('pipes.amazonaws.com') });

sourceQueue.grantConsumeMessages(pipeRole);
enrichmentWorkflow.grantStartSyncExecution(pipeRole);
normalizerTargetBus.grantPutEventsTo(pipeRole);

const normalizerPipe = new pipes.CfnPipe(this, 'NormalizerPipe', {
  roleArn: pipeRole.roleArn,
  source: sourceQueue.queueArn,
  target: normalizerTargetBus.eventBusArn,
  enrichment: enrichmentWorkflow.stateMachineArn,
  sourceParameters: { sqsQueueParameters: { batchSize: 1 } },
});

Claim check pattern

To reduce the size of the events in your event-driven application, you can temporarily remove attributes. This approach is known as the claim check pattern. You split a message into a reference (“claim check”) and the associated payload. Then, you store the payload in external storage and add only the claim check to events. When you process events, you retrieve relevant parts of the payload using the claim check. For example, you can retrieve a user’s name and birthday based on their userID.

The claim check pattern has two parts. First, when an event is received, you split it and store the payload elsewhere. Second, when the event is processed, you retrieve the relevant information. You can implement both aspects with a pipe.

In the first pipe, you use the enrichment to split the event, in the second to retrieve the payload. Below are several enrichment options, such as using an external API via API Destinations, or using Amazon DynamoDB via Lambda. Other enrichment options are Amazon API Gateway and Step Functions.

Using a pipe to split and retrieve messages has three advantages. First, you keep events concise as they move through the system. Second, you ensure that the event contains all relevant information when it is processed. Third, you encapsulate the complexity of splitting and retrieving within the pipe.

The following code implements a pipe for the claim check pattern using the CDK:

const pipeRole = new iam.Role(this, 'ClaimCheckRole', { assumedBy: new iam.ServicePrincipal('pipes.amazonaws.com') });

claimCheckLambda.grantInvoke(pipeRole);
sourceQueue.grantConsumeMessages(pipeRole);
targetWorkflow.grantStartExecution(pipeRole);

const claimCheckPipe = new pipes.CfnPipe(this, 'ClaimCheckPipe', {
  roleArn: pipeRole.roleArn,
  source: sourceQueue.queueArn,
  target: targetWorkflow.stateMachineArn,
  enrichment: claimCheckLambda.functionArn,
  sourceParameters: { sqsQueueParameters: { batchSize: 1 } },
  targetParameters: { stepFunctionStateMachineParameters: { invocationType: 'FIRE_AND_FORGET' } },
});

Conclusion

This blog post shows how you can implement four enterprise integration patterns with Amazon EventBridge Pipes. In many cases, this reduces the amount of code you have to write and maintain. It can also simplify your architectures and, in some scenarios, reduce costs.

You can find the source code for all the patterns on the AWS samples GitHub repo.

For more serverless learning resources, visit Serverless Land. To find more patterns, go directly to the Serverless Patterns Collection.

Building ad-hoc consumers for event-driven architectures

Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/building-ad-hoc-consumers-for-event-driven-architectures/

This post is written by Corneliu Croitoru (Media Streaming and Edge Architect) and Benjamin Smith (Principal Developer Advocate, Serverless)

In January 2022, the Serverless Developer Advocate team launched Serverlesspresso Extensions, a program that lets you contribute to Serverlesspresso. This is a multi-tenant event-driven application for a pop-up coffee bar that allows you to order from your phone. In 2022, Serverlesspresso processed over 20,000 orders at technology events around the world. The goal of Serverlesspresso extensions is to showcase the power and simplicity of evolving an event-driven application.

Event-driven architecture is a design pattern that allows developers to create and evolve applications by responding to events generated by various parts of the system. For modern applications, the need for flexible and scalable approaches is critical, and event-driven architecture can provide a powerful solution.

This blog post shows how to build and deploy an extension to an event-driven application. It describes the benefits and challenges of evolving event-driven applications. It also walks through a real-life example that was created in under 24 hours.

Decoupled integrations

A key benefit of event-driven architecture is its ability to decouple different parts of the system, making it easier to manage changes and evolve the application. In traditional, monolithic applications, changes to one part of the system can affect the entire application.

With event-driven architecture, you can change individual parts of the system without affecting the rest of the application. Event-driven architecture also makes it easier to integrate new functionality into an existing application by creating new event handlers to respond to existing events. This way, you can add new functionality without affecting the existing system, making it easier to test and deploy.

The following diagrams illustrate how to add and remove consumers and producers without affecting the core application.

Adding and removing event-driven extensions

Extension 2 is consuming events from the event bus and emitting events back onto the bus. It can be added to the core application without creating any dependencies. When extension 2 is removed, the core application remains unchanged.

In monolithic applications, additional features can create dependencies on the core application. Removing those features keeps those dependencies in place, making it more complex to remove them.

Adding and removing monolithic extensions

Collaboration

In a traditional monolithic application, it can be difficult to collaborate with multiple developers on a single code base. It can lead to conflicts, bugs, and other issues that must be resolved. Integrating new features and components into these applications can be challenging, especially when multiple developers are using different technologies. Deploying updates can also be complex when multiple developers are involved and different parts of the application must be updated simultaneously.

With event-based applications, these challenges are often less significant. A well-designed consumer contains well-defined permissions boundaries. Its resources should not need permission to interact with resources outside the extension definition. This means you can deploy and delete them independently of other extensions and of the core application. This makes it easier to collaborate with multiple developers across different languages, runtimes, and deployment frameworks.

Near real-time feedback

Another characteristic of event-driven architecture is the ability to provide real-time feedback to users. This is because consumers can process events as they occur, making it possible to provide immediate feedback. This can be useful in applications that handle high volumes of data or interact with multiple users, as they can provide real-time updates and ensure that the application remains responsive.

An alternative approach for near real-time feedback is to use batching. This involves grouping multiple events or data points into a batch and processing them. Choosing between batching and processing data in real-time with events depends on the amount of data being processed, the latency requirements, and the complexity of the processing logic. Batching can be more efficient for large volumes of data as it reduces the overhead of processing each event individually, while processing data with events can be better suited for real-time applications that require low latency.

The newest Serverlesspresso extension uses an event-driven approach to gain real-time insight into the application.

The average wait time extension

A new extension was created by Corneliu Croitoru that calculates the average wait for each drink at the Serverlesspresso coffee bar. This extension uses AWS Step Functions, DynamoDB, and AWS Lambda. The app displays the results in near real-time, allowing customers to see how long they may need to wait for their order. The extension uses the AWS Cloud Development Kit (CDK) for deployment.

The extension uses the existing Amazon EventBridge event bus to start a Step Functions workflow. The workflow is triggered by the order submission and order completion events and calculates the average wait time for each type of drink (for example, Caffe Latte). This information is then sent back to the Serverlesspresso event bus.

The following diagram illustrates the Step Functions workflow:

When a new order submission event is emitted, the Step Functions workflow persists the event timestamp to a DynamoDB table, a key/value data store. It uses the unique order ID as the key. When an order completion event is emitted, the workflow persists the completion timestamp to DynamoDB. The workflow then invokes a Lambda function to calculate the average duration of that specific drink by using the last 10 orders stored in the DynamoDB table.

This is the DynamoDB table structure:

The workflow sends an event to the Serverlesspresso event bus with the calculated duration and drink type. A rule on the event bus routes this event to an IoT topic, which publishes it to the front-end application via an existing open WebSocket connection. The result appears on the front end:

Alternative approaches

There are a number of alternative approaches that you could use to build a real-time “average wait” extension without using events.

One such approach might be to use DynamoDB as a cache for the event-driven data. This way it would be possible to query the database periodically to check for updates. This approach can be implemented by adding a timestamp field to the database records and querying for records that have been updated since the last time you checked.

Alternatively, you could use DynamoDB streams to capture changes as they occur instead of subscribing to new events directly. However, these approaches may face several challenges. The extension would require permission to read data from the DynamoDB table or stream. Since the DynamoDB table resource is defined in the application’s core template, this presents challenges of ownership, permissions boundaries and dependencies. It adds additional complexity to the application as the extension would not be decoupled from the core.

The challenges

The biggest challenge in building this extension is the required shift in developer mindset. Despite understanding the principles of decoupled event-driven architecture, it was not until building an event-driven architecture extension that the concept became clear.

For example, you may think it necessary to deploy the existing application to submit orders, emit events onto the application event bus, and interact with various core resources. The development team had discussions about the degree to which the extension should interact with existing application components. This was not an event-driven mindset.

Each new extension must be based entirely on events. This means it can only interact with the core application through the shared event bus by consuming and emitting events. It also means that you could write the extension in any runtime, with any infrastructure as code (IaC) framework, and that it should be possible to deploy and destroy the extension stack with no effect on the core application.

Once you understand this, the next challenge is discoverability. Finding the right events to consume may prove harder than expected. This is why documenting events as you build your application is important. The event schema, producer and consumer should be documented, and evolve with each version of the event. The Serverlesspresso Events Catalog helps to overcome this in this example.

Finally, the event player can emit realistic Serverlesspresso events onto the event bus. This replaces the need to deploy the core application stack.

Conclusion

The Serverlesspresso Extensions program shows the simplicity of developing event-driven applications. Building event-driven architectures allows for decoupled integrations, making it easier to manage changes and develop the application. It also simplifies collaboration among multiple teams as consumers of events can come and go independently without affecting the procedure or core application.

Using these principles, the average wait time extension was built and deployed within 24 hours, using a different IaC framework to the core application.

Use the Serverlesspresso extensions GitHub repository to read how to build more Serverlesspresso extensions.

For more serverless learning resources, visit Serverless Land.