Tag Archives: Amazon Route 53

Top Architecture Blog Posts of 2023

Post Syndicated from Andrea Courtright original https://aws.amazon.com/blogs/architecture/top-architecture-blog-posts-of-2023/

2023 was a rollercoaster year in tech, and we at the AWS Architecture Blog feel so fortunate to have shared in the excitement. As we move into 2024 and look ahead to all the new technologies we could see, we want to take a moment to highlight the brightest stars from 2023.

As always, thanks to our readers and to the many talented and hardworking Solutions Architects and other contributors to our blog.

I give you our 2023 cream of the crop!

#10: Build a serverless retail solution for endless aisle on AWS

In this post, Sandeep and Shashank help retailers and their customers alike in this guided approach to finding inventory that doesn’t live on shelves.

Building endless aisle architecture for order processing

Figure 1. Building endless aisle architecture for order processing

Check it out!

#9: Optimizing data with automated intelligent document processing solutions

Who else dreads wading through large amounts of data in multiple formats? Just me? I didn’t think so. Using Amazon AI/ML and content-reading services, Deependra, Anirudha, Bhajandeep, and Senaka have created a solution that is scalable and cost-effective to help you extract the data you need and store it in a format that works for you.

AI-based intelligent document processing engine

Figure 2: AI-based intelligent document processing engine

Check it out!

#8: Disaster Recovery Solutions with AWS managed services, Part 3: Multi-Site Active/Passive

Disaster recovery posts are always popular, and this post by Brent and Dhruv is no exception. Their creative approach in part 3 of this series is most helpful for customers who have business-critical workloads with higher availability requirements.

Warm standby with managed services

Figure 3. Warm standby with managed services

Check it out!

#7: Simulating Kubernetes-workload AZ failures with AWS Fault Injection Simulator

Continuing with the theme of “when bad things happen,” we have Siva, Elamaran, and Re’s post about preparing for workload failures. If resiliency is a concern (and it really should be), the secret is test, test, TEST.

Architecture flow for Microservices to simulate a realistic failure scenario

Figure 4. Architecture flow for Microservices to simulate a realistic failure scenario

Check it out!

#6: Let’s Architect! Designing event-driven architectures

Luca, Laura, Vittorio, and Zamira weren’t content with their four top-10 spots last year – they’re back with some things you definitely need to know about event-driven architectures.

Let's Architect

Figure 5. Let’s Architect artwork

Check it out!

#5: Use a reusable ETL framework in your AWS lake house architecture

As your lake house increases in size and complexity, you could find yourself facing maintenance challenges, and Ashutosh and Prantik have a solution: frameworks! The reusable ETL template with AWS Glue templates might just save you a headache or three.

Reusable ETL framework architecture

Figure 6. Reusable ETL framework architecture

Check it out!

#4: Invoking asynchronous external APIs with AWS Step Functions

It’s possible that AWS’ menagerie of services doesn’t have everything you need to run your organization. (Possible, but not likely; we have a lot of amazing services.) If you are using third-party APIs, then Jorge, Hossam, and Shirisha’s architecture can help you maintain a secure, reliable, and cost-effective relationship among all involved.

Invoking Asynchronous External APIs architecture

Figure 7. Invoking Asynchronous External APIs architecture

Check it out!

#3: Announcing updates to the AWS Well-Architected Framework

The Well-Architected Framework continues to help AWS customers evaluate their architectures against its six pillars. The team behind it is constantly striving for improvement, and Haleh’s diligence in keeping us up to date has not gone unnoticed. Thank you, Haleh!

Well-Architected logo

Figure 8. Well-Architected logo

Check it out!

#2: Let’s Architect! Designing architectures for multi-tenancy

The practically award-winning Let’s Architect! series strikes again! This time, Luca, Laura, Vittorio, and Zamira were joined by Federica to discuss multi-tenancy and why that concept is so crucial for SaaS providers.

Let's Architect

Figure 9. Let’s Architect

Check it out!

And finally…

#1: Understand resiliency patterns and trade-offs to architect efficiently in the cloud

Haresh, Lewis, and Bonnie revamped this 2022 post into a masterpiece that completely stole our readers’ hearts and is among the top posts we’ve ever made!

Resilience patterns and trade-offs

Figure 10. Resilience patterns and trade-offs

Check it out!

Bonus! Three older special mentions

These three posts were published before 2023, but we think they deserve another round of applause because you, our readers, keep coming back to them.

Thanks again to everyone for their contributions during a wild year. We hope you’re looking forward to the rest of 2024 as much as we are!

AWS Weekly Roundup—Amazon Route 53, Amazon EventBridge, Amazon SageMaker, and more – January 15, 2024

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-amazon-route53-amazon-eventbridge-amazon-sagemaker-and-more-january-15-2024/

We are in January, the start of a new year, and I imagine many of you have made a new year’s resolution to learn something new. If you want to learn something new and earn a free Amazon Web Services (AWS) Learning Badge, check out the new Events and Workflows Learning Path. This learning path will teach you everything you need to know about AWS Step Functions, Amazon EventBridge, event-driven architectures, and serverless. When you finish the learning path, you can take an assessment; if you pass, you get an AWS Learning Badge, credited by Credly, that you can share on your résumé and social media profiles.

Events and workflows learning path badge

Last Week’s Launches
Here are some launches that got my attention during the previous week.

Amazon Route 53 – Now you can enable Route 53 Resolver DNS Firewall to filter DNS traffic based on the query type contained in the question section of the DNS query format. In addition, Route 53 now supports geoproximity routing as an additional routing policy for DNS records. You can expand or shrink the geographic area from which traffic is routed to a resource by changing the record’s bias value. This is really helpful for industries that need to deliver highly responsive digital experiences.
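
If you manage DNS records programmatically, switching a record to geoproximity routing is a single change call. Here is a hedged TypeScript sketch using the AWS SDK for JavaScript v3 (Node 18+ ES module, so top-level await works); the hosted zone ID, record name, and IP address are hypothetical, and the GeoProximityLocation shape is my reading of the new API, so verify it against the current documentation.

    import {
      Route53Client,
      ChangeResourceRecordSetsCommand,
    } from "@aws-sdk/client-route-53";

    const route53 = new Route53Client({});

    // Upsert a geoproximity record anchored to us-east-1. A positive bias
    // expands the geographic area routed to this endpoint; a negative bias
    // shrinks it.
    await route53.send(new ChangeResourceRecordSetsCommand({
      HostedZoneId: "Z0123456789EXAMPLE",
      ChangeBatch: {
        Changes: [{
          Action: "UPSERT",
          ResourceRecordSet: {
            Name: "app.example.com",
            Type: "A",
            SetIdentifier: "us-east-1-endpoint",
            GeoProximityLocation: { AWSRegion: "us-east-1", Bias: 25 },
            TTL: 60,
            ResourceRecords: [{ Value: "192.0.2.10" }],
          },
        }],
      },
    }));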

Amazon CloudWatch Logs – CloudWatch Logs now supports creating account-level subscription filters. This capability allows you to forward all the log groups in an account to other services like Amazon OpenSearch Service or Amazon Kinesis Data Firehose.
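
As a rough illustration, the account-level filter is a single PutAccountPolicy call. In this TypeScript sketch (AWS SDK for JavaScript v3, Node 18+ ES module), the delivery stream and role ARNs are hypothetical, and the policy document keys reflect my understanding of the subscription filter policy shape, so double-check them against the CloudWatch Logs documentation.

    import {
      CloudWatchLogsClient,
      PutAccountPolicyCommand,
    } from "@aws-sdk/client-cloudwatch-logs";

    const logs = new CloudWatchLogsClient({});

    // Forward log events from every log group in the account to a
    // Kinesis Data Firehose delivery stream.
    await logs.send(new PutAccountPolicyCommand({
      policyName: "account-wide-firehose",
      policyType: "SUBSCRIPTION_FILTER_POLICY",
      scope: "ALL",
      policyDocument: JSON.stringify({
        DestinationArn: "arn:aws:firehose:us-east-1:123456789012:deliverystream/central-logs",
        RoleArn: "arn:aws:iam::123456789012:role/CWLtoFirehoseRole",
        FilterPattern: "", // an empty pattern matches every log event
      }),
    }));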

Amazon Elastic Container Service (Amazon ECS) – Amazon ECS now integrates with Amazon Elastic Block Store (Amazon EBS), allowing you to provision and attach EBS volumes to Amazon ECS tasks running on both AWS Fargate and Amazon Elastic Compute Cloud (Amazon EC2). Read the blog post Channy wrote, where he shows this feature in action.

Amazon EventBridge – EventBridge now supports AWS AppSync as a target of EventBridge buses. This enables you to stream real-time updates from your backend applications to your front-end clients. For example, your mobile application can receive a notification when the status of an order changes on the backend.

Amazon SageMaker – SageMaker now supports M7i, C7i, and R7i instances for machine learning (ML) inference. These instances are powered by custom 4th Generation Intel Xeon Scalable processors and deliver up to 15 percent better price performance than the previous generation.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Some other updates and news that you may have missed:

If you are a serverless enthusiast, this week, the AWS Compute Blog published the Serverless ICYMI (in case you missed it) quarterly recap for the last quarter of 2023. This post compiles the announcements made during the months of October, November, and December, with all the relevant content that was produced by AWS Developer Advocates during that time. In addition to that blog post, you can learn about ServerlessVideo, a new demo application that we launched at AWS re:Invent 2023.

ServerlessVideo

This week there were also a couple of really interesting blog posts that explain how to solve very common challenges that customers face. The first one is the blog post in the AWS Security Blog that explains how to customize access tokens in Amazon Cognito user pools. And the second one is from the AWS Database Blog, which explains how to effectively sort data with Amazon DynamoDB.

The Official AWS Podcast – Listen each week for updates on the latest AWS news and deep dives into exciting use cases. There are also official AWS podcasts in several languages. Check out the ones in French, German, Italian, and Spanish.

AWS open source newsletter – This is a newsletter curated by my colleague Ricardo to bring you the latest open source projects, posts, events, and more.

For our customers in Turkey, on January 1, 2024, AWS Turkey Pazarlama Teknoloji ve Danışmanlık Hizmetleri Limited Şirketi (AWS Turkey) replaced AWS EMEA SARL (AWS Europe) as the contracting party and service provider to customers in Türkiye. This enables AWS customers in Türkiye to transact in their local currency (Turkish Lira) and with a local bank. For more information on AWS Turkey, visit the FAQ page.

Upcoming AWS Events
The beginning of the year is the season of AWS re:Invent recaps, which are happening all around the globe during the next two months. You can check the recaps page to find the one closest to you.

You can browse all upcoming AWS-led in-person and virtual events, as well as developer-focused events such as AWS DevDay.

That’s all for this week. Check back next Monday for another Week in Review!

— Marcia

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Happy New Year! AWS Weekly Roundup – January 8, 2024

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/happy-new-year-aws-weekly-roundup-january-8-2024/

Happy New Year! Cloud technologies, machine learning, and generative AI have become more accessible, impacting nearly every aspect of our lives. Amazon CTO Dr. Werner Vogels offers four tech predictions for 2024 and beyond:

  • Generative AI becomes culturally aware
  • FemTech finally takes off
  • AI assistants redefine developer productivity
  • Education evolves to match the speed of technology

Read how these technology trends will converge to help solve some of society’s most difficult problems. Download the Werner Vogels’ Tech Predictions for 2024 and Beyond ebook or read Werner’s All Things Distributed blog.

AWS re:Invent 2023 – To hear insights from AWS and industry thought leaders, grow your skills, and get inspired, watch AWS re:Invent 2023 videos on demand for keynotes, innovation talks, breakout sessions, and AWS Hero guide playlists.

Launches from the last few weeks
Since our last week in review on December 18, 2023, I’d like to highlight some launches from year end, as well as last week:

New AWS Canada West (Calgary) Region – We are opening a second AWS Region in Canada: AWS Canada West (Calgary). At the end of 2023, AWS had 33 Regions and 105 Availability Zones (AZs) globally. We have preannounced 12 additional AZs in four future Regions in Malaysia, New Zealand, Thailand, and the AWS European Sovereign Cloud. We will share more information on these Regions in 2024, so please stay tuned.

DNS over HTTPS in Amazon Route 53 Resolver – You can use the DNS over HTTPS (DoH) protocol for both inbound and outbound Route 53 Resolver endpoints. As the name suggests, DoH supports HTTP or HTTP/2 over TLS to encrypt the data exchanged for Domain Name System (DNS) resolutions.

Automatic enrollment to Amazon RDS Extended Support – Your MySQL 5.7 and PostgreSQL 11 database instances running on Amazon Aurora and Amazon RDS will be automatically enrolled into Amazon RDS Extended Support starting on February 29, 2024. You can have more control over when you want to upgrade the major version of your database after the community end of life (EoL).

New Amazon CloudWatch Network Monitor – This is a new feature of Amazon CloudWatch that helps monitor network availability and performance between AWS and your on-premises environments. Network Monitor needs zero manual instrumentation and gives you access to real-time network visibility to proactively and quickly identify issues within the AWS network and your own hybrid environment. For more information, read Monitor hybrid connectivity with Amazon CloudWatch Network Monitor.

Amazon Aurora PostgreSQL integrations with Amazon Bedrock – You can use two methods to integrate Aurora PostgreSQL databases with Amazon Bedrock to power generative AI applications: SQL queries through the Aurora ML integration with Amazon Bedrock, and the Aurora vector store with Knowledge Bases for Amazon Bedrock for Retrieval Augmented Generation (RAG).

New WordPress setup on Amazon Lightsail – Set up your WordPress website on Amazon Lightsail with the new workflow to eliminate complexity and time spent configuring your website. The workflow allows you to complete all the necessary steps, including setting up a Secure Sockets Layer (SSL) certificate to secure your website with HTTPS.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Here are some other news items that you may find interesting in the new year:

Book recommendations for AWS customer executives – Plan for the new year and catch up on what others are doing and thinking. The AWS Enterprise Strategy team recommends the books it considers most important for AWS customer executives to read.

Best practices for scaling AWS CDK adoption with Platform Engineering – A recent evolution in DevOps is the introduction of platform engineering teams to build services, toolchains, and documentation to support workload teams. This blog post introduces strategies and best practices for accelerating CDK adoption within your organization. You can learn how to scale the lessons learned from the pilot project across your organization through platform engineering.

High performance running HPC applications on AWS Graviton instances – When running the Parallel Lattice Boltzmann Solver (Palabos) on Amazon EC2 Hpc7g instances to solve computational fluid dynamics (CFD) problems, performance increased by up to 70% and price performance was up to 3x better than on the previous generation of Graviton instances.

The new AWS open source newsletter, #181 – Check up on all the latest open source content, which this week includes AWS Amplify, Amazon Corretto, dbt, Apache Flink, Karpenter, LangChain, Pinecone, and more.

Upcoming AWS Events
Check your calendars and sign up for these AWS events in the new year:

AWS at CES 2024 (January 9-12) – AWS will be presenting some of the latest cloud services and solutions that are purpose-built for the automotive, mobility, transportation, and manufacturing industries. Join us to learn about the latest cloud capabilities across generative AI, software-defined vehicles, product engineering, sustainability, new digital customer experiences, connected mobility, autonomous driving, and much more in the Amazon Experience Area.

APJ Builders Online Series (January 18) – This online conference is designed for you to learn core AWS concepts, and step-by-step architectural best practices, including demonstrations to help you get started and accelerate your success on AWS.

You can browse all upcoming AWS-led in-person and virtual events, and developer-focused events such as AWS DevDay.

That’s all for this week. Check back next Monday for another Week in Review!

— Channy

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

DNS over HTTPS is now available in Amazon Route 53 Resolver

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/dns-over-https-is-now-available-in-amazon-route-53-resolver/

Starting today, Amazon Route 53 Resolver supports using the DNS over HTTPS (DoH) protocol for both inbound and outbound Resolver endpoints. As the name suggests, DoH supports HTTP or HTTP/2 over TLS to encrypt the data exchanged for Domain Name System (DNS) resolutions.

Using TLS encryption, DoH increases privacy and security by preventing eavesdropping and manipulation of DNS data as it is exchanged between a DoH client and the DoH-based DNS resolver.

This helps you implement a zero-trust architecture where no actor, system, network, or service operating outside or within your security perimeter is trusted and all network traffic is encrypted. Using DoH also helps follow recommendations such as those described in this memorandum of the US Office of Management and Budget (OMB).

DNS over HTTPS support in Amazon Route 53 Resolver
You can use Amazon Route 53 Resolver to resolve DNS queries in hybrid cloud environments. For example, it allows resources anywhere within your hybrid network to resolve the DNS names of AWS services. To do so, you can set up inbound and outbound Resolver endpoints:

  • Inbound Resolver endpoints allow DNS queries to your VPC from your on-premises network or another VPC. (Figure: Amazon Route 53 Resolver inbound endpoint architecture)
  • Outbound Resolver endpoints allow DNS queries from your VPC to your on-premises network or another VPC. (Figure: Amazon Route 53 Resolver outbound endpoint architecture)

After you configure the Resolver endpoints, you can set up rules that specify the name of the domains for which you want to forward DNS queries from your VPC to an on-premises DNS resolver (outbound) and from on-premises to your VPC (inbound).

Now, when you create or update an inbound or outbound Resolver endpoint, you can specify which protocols to use (see the sketch after this list):

  • DNS over port 53 (Do53), which uses either UDP or TCP to send the packets.
  • DNS over HTTPS (DoH), which uses TLS to encrypt the data.
  • Both, depending on which one is used by the DNS client.
  • For FIPS compliance, there is a specific implementation (DoH-FIPS) for inbound endpoints.
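
For reference, here is a minimal sketch of creating an inbound endpoint that accepts both Do53 and DoH, using the AWS SDK for JavaScript v3 in TypeScript (Node 18+ ES module, so top-level await works). The subnet and security group IDs are hypothetical, and the exact protocol strings are my reading of the API’s documented values, so check them against the current documentation.

    import {
      Route53ResolverClient,
      CreateResolverEndpointCommand,
    } from "@aws-sdk/client-route53resolver";

    const resolver = new Route53ResolverClient({});

    // Inbound endpoint that accepts both plain DNS (Do53) and encrypted DoH queries.
    await resolver.send(new CreateResolverEndpointCommand({
      CreatorRequestId: `doh-inbound-${Date.now()}`, // idempotency token
      Name: "hybrid-inbound",
      Direction: "INBOUND",
      SecurityGroupIds: ["sg-0123456789abcdef0"],
      IpAddresses: [
        { SubnetId: "subnet-aaaa1111" }, // one IP address per Availability Zone
        { SubnetId: "subnet-bbbb2222" },
      ],
      Protocols: ["Do53", "DoH"],
    }));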

Let’s see how this works in practice.

Using DNS over HTTPS with Amazon Route 53 Resolver
In the Route 53 console, I choose Inbound endpoints from the Resolver section of the navigation pane. There, I choose Create inbound endpoint.

I enter a name for the endpoint, select the VPC, the security group, and the endpoint type (IPv4, IPv6, or dual-stack). To allow both encrypted and unencrypted DNS resolution, I select Do53, DoH, and DoH-FIPS in the Protocols for this endpoint option.

Console screenshot.

After that, I configure the IP addresses for DNS queries. I select two Availability Zones and, for each, a subnet. For this setup, I use the option to have the IP addresses automatically selected from those available in the subnet.

After I complete the creation of the inbound endpoint, I configure the DNS server in my network to forward requests for the amazonaws.com domain (used by AWS service endpoints) to the inbound endpoint IP addresses.

Similarly, I create an outbound Resolver endpoint and select both Do53 and DoH as protocols. Then, I create forwarding rules that specify the domains for which the outbound Resolver endpoint should forward requests to the DNS servers in my network.

Now, when the DNS clients in my hybrid environment use DNS over HTTPS in their requests, DNS resolutions are encrypted. Optionally, I can enforce encryption and select only DoH in the configuration of inbound and outbound endpoints.

Things to know
DNS over HTTPS support for Amazon Route 53 Resolver is available today in all AWS Regions where Route 53 Resolver is offered, including GovCloud Regions and Regions based in China.

DNS over port 53 continues to be the default for inbound or outbound Resolver endpoints. In this way, you don’t need to update your existing automation tooling unless you want to adopt DNS over HTTPS.

There is no additional cost for using DNS over HTTPS with Resolver endpoints. For more information, see Route 53 pricing.

Start using DNS over HTTPS with Amazon Route 53 Resolver to increase privacy and security for your hybrid cloud environments.

Danilo

Zonal autoshift – Automatically shift your traffic away from Availability Zones when we detect potential issues

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/zonal-autoshift-automatically-shift-your-traffic-away-from-availability-zones-when-we-detect-potential-issues/

Today we’re launching zonal autoshift, a new capability of Amazon Route 53 Application Recovery Controller that you can enable to automatically and safely shift your workload’s traffic away from an Availability Zone when AWS identifies a potential failure affecting that Availability Zone, and to shift it back once the failure is resolved.

When deploying resilient applications, you typically deploy your resources across multiple Availability Zones in a Region. Availability Zones are distinct groups of physical data centers at a meaningful distance apart (typically miles) to make sure that they have diverse power, connectivity, network devices, and flood plains.

To help protect your applications against errors, such as a failed deployment, a configuration error, or an operator error, last year we introduced the ability to manually or programmatically trigger a zonal shift. This enables you to shift the traffic away from one Availability Zone when you observe degraded metrics in that zone. It does so by configuring your load balancer to direct all new connections to infrastructure in healthy Availability Zones only. This allows you to preserve your application’s availability for your customers while you investigate the root cause of the failure. Once fixed, you stop the zonal shift to ensure the traffic is distributed across all zones again.

Zonal shift works at the Application Load Balancer (ALB) or Network Load Balancer (NLB) level only when cross-zone load balancing is turned off, which is the default for NLB. In a nutshell, load balancers offer two levels of load balancing. The first level is configured in the DNS. Load balancers expose one or more IP addresses for each Availability Zone, offering a client-side load balancing between zones. Once the traffic hits an Availability Zone, the load balancer sends traffic to registered healthy targets, typically an Amazon Elastic Compute Cloud (Amazon EC2) instance. By default, ALBs send traffic to targets across all Availability Zones. For zonal shift to properly work, you must configure your load balancers to disable cross-zone load balancing.

When zonal shift starts, the DNS sends all traffic away from one Availability Zone, as illustrated by the following diagram.

ARC Zonal Shift

Manual zonal shift helps to protect your workload against errors originating from your side. But when there is a potential failure in an Availability Zone, it is sometimes difficult for you to identify or detect the failure. Detecting an issue in an Availability Zone using application metrics is difficult because, most of the time, you don’t track metrics per Availability Zone. Moreover, your services often call dependencies across Availability Zone boundaries, resulting in errors seen in all Availability Zones. With modern microservice architectures, these detection and recovery steps must often be performed across tens or hundreds of discrete microservices, leading to recovery times of multiple hours.

Customers asked us if we could take the burden off their shoulders to detect a potential failure in an Availability Zone. After all, we might know about potential issues through our internal monitoring tools before you do.

With this launch, you can now configure zonal autoshift to protect your workloads against potential failure in an Availability Zone. We use our own AWS internal monitoring tools and metrics to decide when to trigger a network traffic shift. The shift starts automatically; there is no API to call. When we detect that a zone has a potential failure, such as a power or network disruption, we automatically trigger an autoshift of your infrastructure’s NLB or ALB traffic, and we shift the traffic back when the failure is resolved.

Obviously, shifting traffic away from an Availability Zone is a delicate operation that must be carefully prepared. We built a series of safeguards to ensure we don’t degrade your application availability by accident.

First, we have internal controls to ensure we shift traffic away from no more than one Availability Zone at a time. Second, we practice the shift on your infrastructure for 30 minutes every week. You can define blocks of time when you don’t want the practice to happen, for example, 08:00–18:00, Monday through Friday. Third, you can define two Amazon CloudWatch alarms to act as circuit breakers during the practice run: one alarm to prevent starting the practice run at all and one alarm to monitor your application health during a practice run. When either alarm triggers during the practice run, we stop it and restore traffic to all Availability Zones. The state of the application health alarm at the end of the practice run indicates its outcome: success or failure.

According to the principle of shared responsibility, you have two responsibilities as well.

First, you must ensure there is enough capacity deployed in all Availability Zones to sustain the increase of traffic in the remaining Availability Zones after traffic has shifted. We strongly recommend having enough capacity in the remaining Availability Zones at all times and not relying on scaling mechanisms that could delay your application recovery or impact its availability. When zonal autoshift triggers, AWS Auto Scaling might take more time than usual to scale your resources. Pre-scaling your resources ensures a predictable recovery time for your most demanding applications.

Let’s imagine that to absorb regular user traffic, your application needs six EC2 instances across three Availability Zones (2×3 instances). Before configuring zonal autoshift, you should ensure you have enough capacity in the remaining Availability Zones to absorb the traffic when one Availability Zone is not available. In this example, it means three instances per Availability Zone (3×3 = 9 instances with three Availability Zones in order to keep 2×3 = 6 instances to handle the load when traffic is shifted to two Availability Zones).

In practice, when operating a service that requires high reliability, it’s normal to operate with some redundant capacity online for eventualities such as customer-driven load spikes, occasional host failures, and so on. Topping up your existing redundancy in this way ensures you can recover rapidly during an Availability Zone issue and also gives you greater robustness to other events.

Second, you must explicitly enable zonal autoshift for the resources you choose; AWS applies it only to those resources. A zonal autoshift affects the total capacity allocated to your application, so, as I just described, your application must be prepared by having enough capacity deployed in the remaining Availability Zones.

Of course, deploying this extra capacity in all Availability Zones has a cost. When we talk about resilience, there is a business tradeoff to decide between your application availability and its cost. This is another reason why we apply zonal autoshift only on the resources you select.

Let’s see how to configure zonal autoshift
To show you how to configure zonal autoshift, I deploy my now-famous TicTacToe web application using a CDK script. I open the Route 53 Application Recovery Controller page of the AWS Management Console. On the left pane, I select Zonal autoshift. Then, on the welcome page, I select Configure zonal autoshift for a resource.

Zonal autoshift - 1

I select the load balancer of my demo application. Remember that currently, only load balancers with cross-zone load balancing turned off are eligible for zonal autoshift. As the warning on the console reminds me, I also make sure my application has enough capacity to continue to operate with the loss of one Availability Zone.

Zonal autoshift - 2

I scroll down the page and configure the times and days I don’t want AWS to run the 30-minute practice. At first, and until I’m comfortable with autoshift, I block the practice 08:00–18:00, Monday through Friday. Note that hours are expressed in UTC and don’t vary with daylight saving time; a UTC time converter application can help. While it is safe to exclude business hours at the start, we recommend eventually configuring practice runs during your business hours as well, to capture issues that might not be visible when there is little or no traffic on your application. You probably most need zonal autoshift to work without impact at your peak time, but if you have never tested it, how confident are you? Ideally, you don’t want to block any time at all, but we recognize that’s not always practical.

Zonal autoshift - 3

Further down on the same page, I enter the two circuit breaker alarms. The first one prevents the practice from starting; you use this alarm to tell us this is not a good time to start a practice run, for example, when there is an ongoing issue with your application or when you’re deploying a new version to production. The second CloudWatch alarm gives the outcome of the practice run. It enables zonal autoshift to judge how your application is responding to the practice run. If the alarm stays green, we know all went well.

If either of these two alarms triggers during the practice run, zonal autoshift stops the practice and restores the traffic to all Availability Zones.

Finally, I acknowledge that a 30-minute practice run will run weekly and that it might reduce the availability of my application.

Then, I select Create.

Zonal autoshift - 4

And that’s it.

After a few days, I see the history of the practice runs on the Zonal shift history for resource tab of the console. I monitor the history of my two circuit breaker alarms to stay confident everything is correctly monitored and configured.

ARC Zonal Shift - practice run

It’s not possible to test an autoshift itself. It triggers automatically when we detect a potential issue in an Availability Zone. I asked the service team if we could shut down an Availability Zone to test the instructions I shared in this post; they politely declined my request :-).

To test your configuration, you can trigger a manual shift, which behaves identically to an autoshift.
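
For example, a manual shift is a single API call. Here is a hedged TypeScript sketch using the AWS SDK for JavaScript v3 zonal shift client (Node 18+ ES module); the load balancer ARN and zone ID are hypothetical.

    import {
      ARCZonalShiftClient,
      StartZonalShiftCommand,
    } from "@aws-sdk/client-arc-zonal-shift";

    const arc = new ARCZonalShiftClient({});

    // Shift traffic away from one Availability Zone for 30 minutes.
    // The shift expires on its own, which acts as a safety net.
    await arc.send(new StartZonalShiftCommand({
      resourceIdentifier:
        "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/net/demo/0123456789abcdef",
      awayFrom: "use1-az1", // the zone ID, not the zone name
      expiresIn: "30m",
      comment: "Testing zonal shift before enabling autoshift",
    }));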

A few more things to know
Zonal autoshift is now available at no additional cost in all AWS Regions, except for China and GovCloud.

We recommend applying the crawl, walk, run methodology. First, you get started with manual zonal shifts to acquire confidence in your application. Then, you turn on zonal autoshift configured with practice runs outside of your business hours. Finally, you modify the schedule to include practice zonal shifts during your business hours. You want to test your application response to an event when you least want it to occur.

We also recommend that you think holistically about how all parts of your application will recover when we move traffic away from one Availability Zone and then back. The list that comes to mind (although certainly not complete) is the following.

First, plan for extra capacity, as I discussed already. Second, think about possible single points of failure in each Availability Zone, such as a self-managed database running on a single EC2 instance or a microservice that lives in a single Availability Zone. I strongly recommend using managed databases, such as Amazon DynamoDB or Amazon Aurora, for applications requiring zonal shifts; these have built-in replication and failover mechanisms in place. Third, plan the switch back for when the Availability Zone becomes available again. How much time do you need to scale your resources? Do you need to rehydrate caches?

You can learn more about resilient architectures and methodologies with this great series of articles from my colleague Adrian.

Finally, remember that only load balancers with cross-zone load balancing turned off are currently eligible for zonal autoshift. To turn off cross-zone load balancing from a CDK script, you need to remove stickinessCookieDuration and add load_balancing.cross_zone.enabled=false on the target group. Here is an example with CDK and TypeScript:

    // Add the auto scaling group as a load balancing
    // target to the listener.
    const targetGroup = listener.addTargets('MyApplicationFleet', {
      port: 8080,
      // For zonal shift, stickiness and cross-zone load balancing
      // must be disabled.
      // stickinessCookieDuration: Duration.hours(1),
      targets: [asg]
    });
    // Disable cross-zone load balancing on the target group.
    targetGroup.setAttribute("load_balancing.cross_zone.enabled", "false");

Now it’s time for you to select your applications that would benefit from zonal autoshift. Start by reviewing your infrastructure capacity in each Availability Zone and then define the circuit breaker alarms. Once you are confident your monitoring is correctly configured, go and enable zonal autoshift.

— seb

How Sonar built a unified API on AWS

Post Syndicated from Patrick Madec original https://aws.amazon.com/blogs/architecture/how-sonar-built-a-unified-api-on-aws/

SonarCloud, a software-as-a-service (SaaS) product developed by Sonar, seamlessly integrates into developers’ CI/CD workflows to increase code quality and identify vulnerabilities. Over the last few months, Sonar’s cloud engineers have worked on modernizing SonarCloud to shorten the lead time to production.

Following Domain-Driven Design principles, Sonar split the application into multiple business domains, each owned by independent teams. They also built a unified API to expose these domains publicly.

This blog post will explore Sonar’s design for SonarCloud’s unified API, utilizing Elastic Load Balancing, AWS PrivateLink, and Amazon API Gateway. Then, we’ll uncover the benefits aligned with the AWS Well-Architected Framework, including enhanced security and minimal operational overhead.

This solution isn’t exclusive to Sonar; it’s a blueprint for organizations modernizing their applications towards domain-driven design or microservices with public service exposure.

Introduction

SonarCloud’s core was initially built as a monolithic application on AWS, managed by a single team. Over time, it gained widespread adoption among thousands of organizations, leading to the introduction of new features and contributions from multiple teams.

In response to this growth, Sonar recognized the need to modernize its architecture. The decision was made to transition to domain-driven design, aligning with the team’s structure. New functionalities are now developed within independent domains, managed by dedicated teams, while existing components are gradually refactored using the strangler pattern.

This transformation resulted in SonarCloud being composed of multiple domains, and securely exposing them to customers became a key challenge. To address this, Sonar’s engineers built a unified API, a solution we’ll explore in the following section.

Solution overview

Figure 1 illustrates the architecture of the unified API, the gateway through which end-users access SonarCloud services. It is built on an Application Load Balancer and Amazon API Gateway private APIs.

Unified API architecture

Figure 1. Unified API architecture

The VPC endpoint for API Gateway spans three Availability Zones (AZs), providing an Elastic Network Interface (ENI) in each private subnet. Meanwhile, the ALB is configured with an HTTPS listener, linked to a target group containing the IP addresses of the ENIs.

To streamline access, we’ve established an API Gateway custom domain at api.example.com. Within this custom domain, we’ve created an API mapping for each business domain. This setup allows for seamless routing, with paths like /domain1 leading directly to the corresponding domain1 private API of the API Gateway service.

Here is how it works (a CDK sketch of the load balancer wiring follows the steps):

  1. The user makes a request to api.example.com/domain1, which is routed to the ALB using Amazon Route 53 for DNS resolution.
  2. The ALB terminates the connection, decrypts the request and sends it to one of the VPC endpoint ENIs. At this point, the domain name and the path of the request respectively match our custom domain name, api.example.com, and our API mapping for /domain1.
  3. Based on the custom domain name and API mapping, the API Gateway service routes the request to the domain1 private API.
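
To make step 2 concrete, here is a hedged CDK (TypeScript) sketch of the listener-to-endpoint wiring. The certificate ARN and ENI IP addresses are hypothetical; in practice, the endpoint’s ENI addresses would be looked up, for example with a custom resource. This illustrates the pattern and is not Sonar’s actual code.

    import * as elbv2 from "aws-cdk-lib/aws-elasticloadbalancingv2";
    import * as targets from "aws-cdk-lib/aws-elasticloadbalancingv2-targets";

    // Inside a Stack, with an existing Application Load Balancer.
    declare const alb: elbv2.ApplicationLoadBalancer;

    const listener = alb.addListener("Https", {
      port: 443,
      certificates: [
        elbv2.ListenerCertificate.fromArn(
          "arn:aws:acm:eu-west-1:123456789012:certificate/example"
        ),
      ],
    });

    // Forward all requests to the API Gateway interface VPC endpoint ENIs,
    // one per Availability Zone.
    listener.addTargets("ApiGatewayVpcEndpoint", {
      port: 443,
      protocol: elbv2.ApplicationProtocol.HTTPS,
      targets: [
        new targets.IpTarget("10.0.1.10"),
        new targets.IpTarget("10.0.2.10"),
        new targets.IpTarget("10.0.3.10"),
      ],
    });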

In this solution, we leverage the two following functionalities of the Amazon API Gateway:

  • Private REST APIs in Amazon API Gateway can only be accessed from your virtual private cloud by using an interface VPC endpoint. This is an ENI that you create in your VPC.
  • API Gateway custom domains allow you to set up your API’s hostname. The default base URL for an API is:
    https://api-id.execute-api.region.amazonaws.com/stage

    With custom domains you can define a more intuitive URL, such as:
    https://api.example.com/domain1

    This is not supported for private REST APIs by default, so we are using a workaround documented in https://github.com/aws-samples/.

Conclusion

In this post, we described the architecture of a unified API built by Sonar to securely expose multiple domains through a single API endpoint. To conclude, let’s review how this solution is aligned with the best practices of the AWS Well-Architected Framework.

Security

The unified API approach improves the security of the application by reducing the attack surface, as opposed to having a public API per domain. AWS Web Application Firewall (AWS WAF), used on the ALB, protects the application from common web exploits. AWS Shield, enabled by default on Amazon CloudFront, provides network/transport layer protection against DDoS attacks.

Operational Excellence

The design allows each team to independently deploy application and infrastructure changes behind a dedicated private API Gateway. This results in minimal operational overhead for the platform team, which was a requirement. In addition, the architecture is based on managed services, which scale automatically as SonarCloud usage evolves.

Reliability

The solution is built using AWS services that provide high availability by default across Availability Zones (AZs) in the AWS Region. Request throttling can be configured on each private API Gateway to protect the underlying resources from being overwhelmed.

Performance

Amazon CloudFront increases the performance of the API, especially for users located far from the deployment AWS Region. The traffic flows through the AWS network backbone, which offers superior performance for accessing the ALB.

Cost

The ALB is used as the single entry-point and brings an extra cost as opposed to exposing multiple public API Gateways. This is a trade-off for enhanced security and customer experience.

Sustainability

By using serverless managed services, Sonar is able to match the provisioned infrastructure with the customer demand. This avoids overprovisioning resources and reduces the environmental impact of the solution.

Converting stateful application to stateless using AWS services

Post Syndicated from Sarat Para original https://aws.amazon.com/blogs/architecture/converting-stateful-application-to-stateless-using-aws-services/

Designing a system to be either stateful or stateless is an important choice with tradeoffs regarding its performance and scalability. In a stateful system, data from one session is carried over to the next. A stateless system doesn’t preserve data between sessions and depends on external entities such as databases or cache to manage state.

Stateful and stateless architectures are both widely adopted.

  • Stateful applications are typically simple to deploy. Stateful applications save client session data on the server, allowing for faster processing and improved performance. Stateful applications excel in predictable workloads and offer consistent user experiences.
  • Stateless architectures typically align with the demands of dynamic workload and changing business requirements. Stateless application design can increase flexibility with horizontal scaling and dynamic deployment. This flexibility helps applications handle sudden spikes in traffic, maintain resilience to failures, and optimize cost.

Figure 1 provides a conceptual comparison of stateful and stateless architectures.

Conceptual diagram for stateful vs stateless architectures

Figure 1. Conceptual diagram for stateful vs stateless architectures

For example, an eCommerce application accessible from web and mobile devices manages several aspects of the customer transaction lifecycle. This lifecycle starts with account creation, then moves to placing items in the shopping cart, and proceeds through checkout. Session and user profile data provide session persistence and cart management, which retain the cart’s contents and render the latest updated cart from any device. A stateless architecture is preferable for this application because it decouples user data and offloads the session data. This provides the flexibility to scale each component independently to meet varying workloads and optimize resource utilization.

In this blog, we outline the process and benefits of converting from a stateful to stateless architecture.

Solution overview

This section walks you through the steps for converting stateful to stateless architecture:

  1. Identifying and understanding the stateful components
  2. Decoupling user profile data
  3. Offloading session data
  4. Scaling each component dynamically
  5. Designing a stateless architecture

Step 1: Identifying and understanding the stateful components

Transforming a stateful architecture to a stateless architecture starts with reviewing the overall architecture and source code of the application, and then analyzing dataflow and dependencies.

Review the architecture and source code

It’s important to understand how your application accesses and shares data. Pay attention to components that persist state data and retain state information. Examples include user credentials, user profiles, session tokens, and data specific to sessions (such as shopping carts). Identifying how this data is handled serves as the foundation for planning the conversion to a stateless architecture.

Analyze dataflow and dependencies

Analyze and understand the components that maintain state within the architecture. This helps you assess the potential impact of transitioning to a stateless design.

You can use the following questionnaire to assess the components. Customize the questions according to your application.

  • What data is specific to a user or session?
  • How is user data stored and managed?
  • How is the session data accessed and updated?
  • Which components rely on the user and session data?
  • Are there any shared or centralized data stores?
  • How does state affect scalability and fault tolerance?
  • Can the stateful components be decoupled or made stateless?

Step 2: Decoupling user profile data

Decoupling user data involves separating and managing user data from the core application logic. Delegate responsibilities for user management and secrets, such as application programming interface (API) keys and database credentials, to a separate service that can be resilient and scale independently. For example, you can use the following services (a short Secrets Manager sketch follows the list):

  • Amazon Cognito to decouple user data from application code by using features, such as identity pools, user pools, and Amazon Cognito Sync.
  • AWS Secrets Manager to decouple user data by storing secrets in a secure, centralized location. This means that the application code doesn’t need to store secrets, which makes it more secure.
  • Amazon S3 to store large, unstructured data, such as images and documents. Your application can retrieve this data when required, eliminating the need to store it in memory.
  • Amazon DynamoDB to store information such as user profiles. Your application can query this data in near-real time.
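
As a small illustration of the Secrets Manager pattern above, this TypeScript sketch (AWS SDK for JavaScript v3, Node 18+ ES module) fetches database credentials at runtime instead of embedding them in application code. The secret name and its JSON shape are hypothetical.

    import {
      SecretsManagerClient,
      GetSecretValueCommand,
    } from "@aws-sdk/client-secrets-manager";

    const secrets = new SecretsManagerClient({});

    // Retrieve the secret at startup (or on demand) rather than storing
    // credentials in code or configuration files.
    const { SecretString } = await secrets.send(
      new GetSecretValueCommand({ SecretId: "prod/app/db-credentials" })
    );
    const { username, password } = JSON.parse(SecretString ?? "{}");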

Step 3: Offloading session data

Offloading session data refers to the practice of storing and managing session related data external to the stateful components of an application. This involves separating the state from business logic. You can offload session data to a database, cache, or external files.

Factors to consider when offloading session data include:

  • Amount of session data
  • Frequency and latency
  • Security requirements

Amazon ElastiCache, Amazon DynamoDB, Amazon Elastic File System (Amazon EFS), and Amazon MemoryDB for Redis are examples of AWS services that you can use to offload session data. The service you choose depends on your application’s requirements.
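
As one possible shape for a DynamoDB-backed session store, here is a TypeScript sketch (AWS SDK for JavaScript v3). The table name, key schema, and TTL attribute are hypothetical and assume a table with time to live enabled on the expiresAt attribute.

    import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
    import {
      DynamoDBDocumentClient,
      PutCommand,
      GetCommand,
    } from "@aws-sdk/lib-dynamodb";

    const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
    const TABLE = "Sessions"; // partition key: sessionId

    // Persist session state externally, with a TTL so DynamoDB expires
    // abandoned sessions automatically.
    export async function saveSession(sessionId: string, cart: object) {
      await ddb.send(new PutCommand({
        TableName: TABLE,
        Item: {
          sessionId,
          cart,
          expiresAt: Math.floor(Date.now() / 1000) + 3600, // one hour
        },
      }));
    }

    // Any application instance can load the session, so requests need no
    // server affinity.
    export async function loadSession(sessionId: string) {
      const { Item } = await ddb.send(new GetCommand({
        TableName: TABLE,
        Key: { sessionId },
      }));
      return Item ?? null;
    }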

Step 4: Scaling each component dynamically

Stateless architecture gives you the flexibility to scale each component independently, allowing the application to meet varying workloads and optimize resource utilization. While planning for scaling, consider using managed AWS services that scale automatically in response to demand.

Step 5: Designing a stateless architecture

After you identify which state and user data need to be persisted, and your storage solution of choice, you can begin designing the stateless architecture. This involves:

  • Understanding how the application interacts with the storage solution.
  • Planning how session creation, retrieval, and expiration logic work with the overall session management.
  • Refactoring application logic to remove references to the state information that’s stored on the server.
  • Rearchitecting the application into smaller, independent services, as described in steps 2, 3, and 4.
  • Performing thorough testing to ensure that all functionalities produce the desired results after the conversion.

The following figure is an example of a stateless architecture on AWS. This architecture separates the user interface, application logic, and data storage into distinct layers, allowing for scalability, modularity, and flexibility in designing and deploying applications. The tiers interact through well-defined interfaces and APIs, ensuring that each component focuses on its specific responsibilities.

Example of a stateless architecture

Figure 2. Example of a stateless architecture

Benefits

Benefits of adopting a stateless architecture include:

  • Scalability: Stateless components don’t maintain a local state. Typically, you can easily replicate and distribute them to handle increasing workloads. This supports horizontal scaling, making it possible to add or remove capacity based on fluctuating traffic and demand.
  • Reliability and fault tolerance: Stateless architectures are inherently resilient to failures. If a stateless component fails, it can be replaced or restarted without affecting the overall system. Because stateless applications don’t have a shared state, failures in one component don’t impact other components. This helps ensure continuity of user sessions, minimizes disruptions, and improves fault tolerance and overall system reliability.
  • Cost-effectiveness: By leveraging on-demand scaling capabilities, your application can dynamically adjust resources based on actual demand, avoiding overprovisioning of infrastructure. Stateless architectures lend themselves to serverless computing models, paying for the actual run time and resulting in cost savings.
  • Performance: Externalizing session data by using services optimized for high-speed access, such as in-memory caches, can reduce the latency compared to maintaining session data internally.
  • Flexibility and extensibility: Stateless architectures provide flexibility and agility in application development. Offloaded session data provides more flexibility to adopt different technologies and services within the architecture. Applications can easily integrate with other AWS services for enhanced functionality, such as analytics, near real-time notifications, or personalization.

Conclusion

Converting stateful applications to stateless applications requires careful planning, design, and implementation. Your choice of architecture depends on your application’s specific needs. If an application is simple to develop and debug, then a stateful architecture might be a good choice. However, if an application needs to be scalable and fault tolerant, then a stateless architecture might be a better choice. It’s important to understand the current application thoroughly before embarking on a refactoring journey.

AWS Weekly Roundup – CodeWhisperer, CodeCatalyst, RDS, Route53, and more – October 24, 2023

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-codewhisperer-codecatalyst-rds-route53-and-more-october-23-2023/

The entire AWS News Blog team is fully focused on writing posts to announce the new services and features during our annual customer conference in Las Vegas, AWS re:Invent! And while we prepare content for you to read, our services teams continue to innovate. Here is my summary of last week’s launches.

Last week’s launches
Here are some of the launches that captured my attention:

Amazon CodeCatalyst – You can now add a cron expression to trigger a CI/CD workflow, providing a way to start workflows at set times. CodeCatalyst is a unified development service that integrates a project’s collaboration tools, CI/CD pipelines, and development and deployment environments.

Amazon Route 53 – You can now route your customers’ traffic to their closest AWS Local Zones to improve application performance for latency-sensitive workloads. Learn more about geoproximity routing in the Route 53 documentation.

Amazon RDS – The root certificates we use to sign your databases’ TLS certificates will expire in 2024. You must generate new certificates for your databases before the expiration date. This blog post details the procedure step by step. The new root certificates we generated are valid for the next 40 years for RSA2048 and 100 years for RSA4096 and ECC384. It is likely this is the last time in your professional career that you are obliged to renew your database certificates for AWS.
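
Pointing an instance at one of the new certificate authorities is a single ModifyDBInstance call. Here is a hedged TypeScript sketch (AWS SDK for JavaScript v3, Node 18+ ES module); the instance identifier is hypothetical, and rds-ca-rsa2048-g1 is one of the new CA identifiers.

    import { RDSClient, ModifyDBInstanceCommand } from "@aws-sdk/client-rds";

    const rds = new RDSClient({});

    // Rotate the instance to a new root CA. With ApplyImmediately set to
    // false, the change is applied during the next maintenance window.
    await rds.send(new ModifyDBInstanceCommand({
      DBInstanceIdentifier: "my-database",
      CACertificateIdentifier: "rds-ca-rsa2048-g1",
      ApplyImmediately: false,
    }));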

Amazon MSK – Replicating Kafka clusters at scale is difficult and often involves managing the infrastructure and the replication solution by yourself. We launched Amazon MSK Replicator, a fully managed replication solution for your Kafka clusters, in the same or across multiple AWS Regions.

Amazon CodeWhisperer – We launched a preview for an upcoming capability of Amazon CodeWhisperer Professional. You can now train CodeWhisperer on your private code base. It allows you to give your organization’s developers more relevant suggestions to better assist them in their day-to-day coding against your organization’s private libraries and frameworks.

Amazon EC2 – The seventh generation of memory-optimized EC2 instances is available (R7i). These instances use 4th Generation Intel Xeon Scalable processors (Sapphire Rapids). This family of instances provides up to 192 vCPUs and 1,536 GB of memory. They are well-suited for memory-intensive applications such as in-memory databases or caches.

X in Y – We launched existing services and instance types in additional Regions.

Other AWS news
Here are some other blog posts and news items that you might like:

The Community.AWS blog has new posts to teach you how to integrate Amazon Bedrock inside your Java and Go applications, and my colleague Brooke wrote a survival guide for re:Invent first-timers.

The Official AWS Podcast – Listen each week for updates on the latest AWS news and deep dives into exciting use cases. There are also official AWS podcasts in several languages. Check out the ones in French, German, Italian, and Spanish.

Some other great sources of AWS news are also worth following.

Upcoming AWS events
Check your calendars and sign up for these AWS events:

AWS Community Days – Join a community-led conference run by AWS user group leaders in your region: Jaipur (November 4), Vadodara (November 4), and Brasil (November 4).

AWS Innovate: Every Application Edition – Join our free online conference to explore cutting-edge ways to enhance security and reliability, optimize performance on a budget, speed up application development, and revolutionize your applications with generative AI. Register for AWS Innovate Online Asia Pacific & Japan on October 26.

AWS re:Invent (November 27 – December 1) – Join us to hear the latest from AWS, learn from experts, and connect with the global cloud community. Browse the session catalog and attendee guides and check out the re:Invent highlights for generative AI.

You can browse all upcoming in-person and virtual events.

And that’s all for me today. I’ll go back to writing my re:Invent blog posts.

Check back next Monday for another Weekly Roundup!

— seb

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

How AWS protects customers from DDoS events

Post Syndicated from Tom Scholl original https://aws.amazon.com/blogs/security/how-aws-protects-customers-from-ddos-events/

At Amazon Web Services (AWS), security is our top priority. Security is deeply embedded into our culture, processes, and systems; it permeates everything we do. What does this mean for you? We believe customers can benefit from learning more about what AWS is doing to prevent and mitigate customer-impacting security events.

Since late August 2023, AWS has detected and been protecting customer applications from a new type of distributed denial of service (DDoS) event. DDoS events attempt to disrupt the availability of a targeted system, such as a website or application, reducing the performance for legitimate users. Examples of DDoS events include HTTP request floods, reflection/amplification attacks, and packet floods. The DDoS events AWS detected were a type of HTTP/2 request flood, which occurs when a high volume of illegitimate web requests overwhelms a web server’s ability to respond to legitimate client requests.

Between August 28 and August 29, 2023, proactive monitoring by AWS detected an unusual spike in HTTP/2 requests to Amazon CloudFront, peaking at over 155 million requests per second (RPS). Within minutes, AWS determined the nature of this unusual activity and found that CloudFront had automatically mitigated a new type of HTTP request flood DDoS event, now called an HTTP/2 rapid reset attack. Over those two days, AWS observed and mitigated over a dozen HTTP/2 rapid reset events, and through the month of September, continued to see this new type of HTTP/2 request flood. AWS customers who had built DDoS-resilient architectures with services like Amazon CloudFront and AWS Shield were able to protect their applications’ availability.

Figure 1. Global HTTP requests per second, September 13 – 16

Overview of HTTP/2 rapid reset attacks

HTTP/2 allows for multiple distinct logical connections to be multiplexed over a single HTTP session. This is a change from HTTP 1.x, in which each HTTP session was logically distinct. HTTP/2 rapid reset attacks consist of multiple HTTP/2 connections with requests and resets in rapid succession. For example, a series of requests for multiple streams will be transmitted followed up by a reset for each of those requests. The targeted system will parse and act upon each request, generating logs for a request that is then reset, or cancelled, by a client. The system performs work generating those logs even though it doesn’t have to send any data back to a client. A bad actor can abuse this process by issuing a massive volume of HTTP/2 requests, which can overwhelm the targeted system, such as a website or application.

Keep in mind that HTTP/2 rapid reset attacks are just a new type of HTTP request flood. To defend against these sorts of DDoS attacks, you can implement an architecture that helps you specifically detect unwanted requests as well as scale to absorb and block those malicious HTTP requests.

Building DDoS resilient architectures

As an AWS customer, you benefit from both the security built into the global cloud infrastructure of AWS as well as our commitment to continuously improve the security, efficiency, and resiliency of AWS services. For prescriptive guidance on how to improve DDoS resiliency, AWS has built tools such as the AWS Best Practices for DDoS Resiliency. It describes a DDoS-resilient reference architecture as a guide to help you protect your application’s availability. While several built-in forms of DDoS mitigation are included automatically with AWS services, your DDoS resilience can be improved by using an AWS architecture with specific services and by implementing additional best practices for each part of the network flow between users and your application.

For example, you can use AWS services that operate from edge locations, such as Amazon CloudFront, AWS Shield, Amazon Route 53, and Route 53 Application Recovery Controller, to build comprehensive availability protection against known infrastructure layer attacks. These services can improve the DDoS resilience of your application when serving any type of application traffic from edge locations distributed around the world. Your application can be on-premises or in AWS when you use these AWS services to help prevent unnecessary requests from reaching your origin servers. As a best practice, run your applications on AWS for the additional benefits of reducing the exposure of your application endpoints to DDoS attacks, protecting your application's availability, and optimizing performance for legitimate users. You can use Amazon CloudFront (and its HTTP caching capability), AWS WAF, and Shield Advanced automatic application layer protection to help prevent unnecessary requests from reaching your origin during application layer DDoS attacks.

Putting our knowledge to work for AWS customers

AWS remains vigilant, working to help prevent security issues from causing disruption to your business. We believe it’s important to share not only how our services are designed, but also how our engineers take deep, proactive ownership of every aspect of our services. As we work to defend our infrastructure and your data, we look for ways to help protect you automatically. Whenever possible, AWS Security and its systems disrupt threats where that action will be most impactful; often, this work happens largely behind the scenes. We work to mitigate threats by combining our global-scale threat intelligence and engineering expertise to help make our services more resilient against malicious activities. We’re constantly looking around corners to improve the efficiency and security of services including the protocols we use in our services, such as Amazon CloudFront, as well as AWS security tools like AWS WAF, AWS Shield, and Amazon Route 53 Resolver DNS Firewall.

In addition, our work extends security protections and improvements far beyond the bounds of AWS itself. AWS regularly works with the wider community, such as computer emergency response teams (CERT), internet service providers (ISP), domain registrars, or government agencies, so that they can help disrupt an identified threat. We also work closely with the security community, other cloud providers, content delivery networks (CDNs), and collaborating businesses around the world to isolate and take down threat actors. For example, in the first quarter of 2023, we stopped over 1.3 million botnet-driven DDoS attacks, and we traced back and worked with external parties to dismantle the sources of 230 thousand L7/HTTP DDoS attacks. The effectiveness of our mitigation strategies relies heavily on our ability to quickly capture, analyze, and act on threat intelligence. By taking these steps, AWS is going beyond just typical DDoS defense, and moving our protection beyond our borders. To learn more about this effort, please read How AWS threat intelligence deters threat actors.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Mark Ryland

Mark is the director of the Office of the CISO for AWS. He has over 30 years of experience in the technology industry, and has served in leadership roles in cybersecurity, software engineering, distributed systems, technology standardization, and public policy. Previously, he served as the Director of Solution Architecture and Professional Services for the AWS World Public Sector team.

Tom Scholl

Tom is Vice President and Distinguished Engineer at AWS.

Amazon Route 53 Resolver Now Available on AWS Outposts Rack

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/amazon-route-53-resolver-now-available-on-aws-outposts-rack/

Starting today, Amazon Route 53 Resolver is available on AWS Outposts rack, providing your on-premises services and applications with local DNS resolution directly from Outposts. Local Route 53 Resolver endpoints also enable DNS resolution between Outposts and your on-premises DNS server. Route 53 Resolver on Outposts helps to improve your on-premises applications' availability and performance.

AWS Outposts provides a hybrid cloud solution that allows you to extend your AWS infrastructure and services to your on-premises data centers. This enables you to build and operate hybrid applications that seamlessly integrate with your existing on-premises infrastructure. Your applications deployed on Outposts benefit from low-latency access to on-premises systems. You also get a consistent management experience across AWS Regions and your on-premises environments. This includes access to the same AWS management tools, APIs, and services that you use when managing AWS services in a Region. Outposts uses the same security controls and policies as AWS in the cloud, providing you with a consistent security posture across your hybrid cloud environment. This includes data encryption, identity and access management, and network security.

One of the typical use cases for Outposts is to deploy applications that require low-latency access to on-premises systems, such as factory equipment, high-frequency trading applications, or medical diagnosis systems.

DNS stands for Domain Name System, which is the system that translates human-readable domain names like “example.com” into IP addresses like “93.184.216.34” that computers use to communicate with each other on the internet. A Route 53 Resolver is a component that is responsible for resolving domain names to IP addresses.

Until today, applications and services running on an Outpost forwarded their DNS queries to the parent AWS Region the Outpost is connected to. But remember, as Amazon CTO Dr Werner Vogels says: everything fails all the time. There can be temporary site disconnections—think about fiber cuts or weather events. When the on-premises facility becomes temporarily disconnected from the internet, local DNS resolution fails, making it difficult for applications and services to discover other services, even when they are running on the same Outposts rack. For example, applications running locally on the Outpost won’t be able to discover the IP address of a local database running on the same Outpost, or a microservice won’t be able to locate other microservices running locally.

Starting today, when you opt in for local Route 53 Resolvers on Outposts, applications and services will continue to benefit from local DNS resolution to discover other services—even in a parent AWS Region connectivity loss event. Local Resolvers also help to reduce latency for DNS resolutions as query results are cached and served locally from the Outposts, eliminating unnecessary round-trips to the parent AWS Region. All the DNS resolutions for applications in Outposts VPCs using private DNS are served locally.

In addition to local Resolvers, this launch also enables local Resolver endpoints. Route 53 Resolver endpoints are not new; creating inbound or outbound Resolver endpoints in a VPC has been available since November 2018. Today, you can also create endpoints inside the VPC on Outposts. Route 53 Resolver outbound endpoints enable Route 53 Resolvers to forward DNS queries to DNS resolvers that you manage, for example, on your on-premises network. In contrast, Route 53 Resolver inbound endpoints forward the DNS queries they receive from outside the VPC to the Resolver running on Outposts. It allows sending DNS queries for services deployed on a private Outposts VPC from outside of that VPC.
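Before walking through the console, here is a rough AWS CLI equivalent. This is a sketch only: the create-outpost-resolver subcommand and its flags reflect my reading of this launch and may differ from the final CLI, and the Outpost ARN, subnet, and security group values are placeholders.

# Create the local Resolver on the Outpost (flags are assumptions; verify with aws route53resolver help)
aws route53resolver create-outpost-resolver \
  --creator-request-id "$(date +%s)" \
  --name demo-outpost-resolver \
  --preferred-instance-type m5.large \
  --instance-count 4 \
  --outpost-arn arn:aws:outposts:us-west-2:111122223333:outpost/op-0123456789abcdef0

# Create an inbound endpoint in the VPC (placeholder subnet and security group IDs)
aws route53resolver create-resolver-endpoint \
  --creator-request-id "$(date +%s)" \
  --name demo-inbound-endpoint \
  --direction INBOUND \
  --security-group-ids sg-0123456789abcdef0 \
  --ip-addresses SubnetId=subnet-0123456789abcdef0 SubnetId=subnet-0fedcba9876543210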

Let’s See It in Action
To create and test a local Resolver on Outposts, I first connect to the Outpost section of the AWS Management Console. I navigate to the Route 53 Outposts section and select Create Resolver.

Create local resolver on outpost

I select the Outpost on which I want to create the Resolver and enter a Resolver name. Then, I select the size of the instances to deploy the Resolver and the number of instances. The instance size impacts the performance of the Resolver (the number of resolutions it can process per second); the default is an m5.large instance able to handle up to 7,000 queries per second. The number of instances impacts the availability of the Resolver; the default is four instances. I select Create Resolver to create the Resolver instances.

Create local resolver - choose instance type and number

After a few minutes, I should see the Resolver status becoming ✅ Operational.

Local resolver is operational

The next step is to create the Resolver endpoint. Inbound endpoints allow you to forward external DNS queries to the local Resolver on the Outpost. Outbound endpoints allow you to forward locally initiated DNS queries to external DNS resolvers that you manage. For this demo, I choose to create an inbound endpoint.

Under the Inbound endpoints section, I select Create inbound endpoint.

Local resolver - create inbound endpoint

I enter an Endpoint name, I choose the VPC in the Region to attach this endpoint to, and I select the previously created Security group for this endpoint.

Create inbound endpoint details

I select the IP address the endpoint will consume in each subnet. I can select to Use an IP address that is selected automatically or Use an IP address that I specify.

Create inbound endpoint - select an IP address

Finally, I select the instance type to bind to the inbound endpoint. The larger the instance, the more queries per second it will handle. The service creates two endpoint instances for high availability.

When I am ready, I select Create inbound endpoint to start the creation process.

Create inbound endpoint - select the instance type

After a few minutes, the endpoint Status becomes ✅ Operational.

Create inbound endpoint status operational

The setup is now ready to test. I therefore SSH-connect to an EC2 instance running on the Outpost, and I test the time it takes to resolve an external DNS name. Local Resolvers cache queries on the Outpost itself. I therefore expect my first query to take a few milliseconds and the second one to be served immediately from the cache.

Indeed, the first query resolves in 13 ms (see the line ;; Query time: 13 msec).

➜  ~ dig amazon.com

; <<>> DiG 9.16.38-RH <<>> amazon.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35859
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;amazon.com.			IN	A

;; ANSWER SECTION:
amazon.com.		797	IN	A	52.94.236.248
amazon.com.		797	IN	A	205.251.242.103
amazon.com.		797	IN	A	54.239.28.85

;; Query time: 13 msec
;; SERVER: 10.0.0.2#53(10.0.0.2)
;; WHEN: Sun May 28 09:47:27 CEST 2023
;; MSG SIZE  rcvd: 87

And when I repeat the same query, it resolves in zero milliseconds, showing it is now served from a local cache.

➜  ~ dig amazon.com

; <<>> DiG 9.16.38-RH <<>> amazon.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 63500
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;amazon.com.			IN	A

;; ANSWER SECTION:
amazon.com.		586	IN	A	54.239.28.85
amazon.com.		586	IN	A	205.251.242.103
amazon.com.		586	IN	A	52.94.236.248

;; Query time: 0 msec
;; SERVER: 10.0.0.2#53(10.0.0.2)
;; WHEN: Sun May 28 09:50:58 CEST 2023
;; MSG SIZE  rcvd: 87

Pricing and Availability
Remember that only the Resolver and the VPC endpoints are deployed on your Outposts. You continue to manage your Route 53 zones and records from the AWS Regions. The local Resolver and its endpoints will consume some capacity on the Outposts. You will need to provide four EC2 instances from your Outposts for the Route 53 Resolver and two other instances for each Resolver endpoint.

Your existing Outposts racks must have the latest Outposts software for you to use the local Route 53 Resolver and the Resolver endpoints. You can raise a ticket with us to have your Outpost updated (the console will also remind you to do so when needed).

The local Resolvers are provided without additional cost. The endpoints are charged per elastic network interface (ENI) per hour, as is already the case today.

You can configure local Resolvers and local endpoints in all AWS Regions where Outposts racks are available, except in AWS GovCloud (US) Regions. That’s a list of 22 AWS Regions as of today.

Go and configure local Route 53 Resolvers on Outposts now!

— seb

 

P.S. We’re focused on improving our content to provide a better customer experience, and we need your feedback to do so. Please take this quick survey to share insights on your experience with the AWS Blog. Note that this survey is hosted by an external company, so the link does not lead to our website. AWS handles your information as described in the AWS Privacy Notice.

Disaster Recovery for Oracle Database on Amazon EC2 with Fast-Start Failover

Post Syndicated from Harshad Gohil original https://aws.amazon.com/blogs/architecture/disaster-recovery-for-oracle-database-on-amazon-ec2-with-fast-start-failover/

High availability is non-negotiable for organizations today to prevent business-critical application disruptions. Enterprises must prioritize database scalability and availability to avoid downtime in their databases, network, servers, or storage environments.

For organizations that want to avoid required application changes, Oracle Real Application Clusters (RAC) is an option for providing high availability and scalability to the Oracle database. While the RAC feature is not supported by Oracle databases on Amazon Elastic Compute Cloud (Amazon EC2), Oracle Active Data Guard helps achieve high availability on AWS cloud.

The Oracle Data Guard feature helps customers survive disasters and data corruption by creating, maintaining, and managing one or more synchronized standby databases. Configuring Oracle Data Guard Fast-Start Failover (FSFO) goes further, helping achieve high availability through automatic failover.

In this blog post, we provide an architectural solution to achieve database high availability when running Oracle Database on Amazon EC2 with Oracle Data Guard along with Fast-Start Failover to address Availability Zone (AZ) or Amazon EC2 instance failures. We also introduce the steps you can take to make database failover happen without manual intervention, and offer recommendations for cross-Region disaster recovery.

Solution overview

Let’s explore this solution by discussing the architecture and two alternate options for securing high availability using Oracle Data Guard, along with the advantages and limitations of each. We will then offer a walkthrough of steps to make database failover happen without manual intervention.

Oracle high availability using Oracle Data Guard with multi-AZ and multi-Region with multi-AZ setup

This architecture is recommended to maintain high availability for Oracle databases on Amazon EC2. It protects against an Amazon EC2 service outage in an entire AWS Region by providing a disaster recovery environment, and it maintains resiliency after failover thanks to the multi-AZ setup in the secondary Region.

In this architecture, Oracle Data Guard Fast Sync replication exists between the primary database in AZ 1 of Region A and the standby in AZ 2 of Region A (Fast Sync), with additional standbys in AZ 1 of Region B (ASYNC) and AZ 2 of Region B (ASYNC). There is an asynchronous cascading replication setup between the standby databases to avoid network latency issues across Regions.

Should Region A experience an Amazon EC2 service outage, the Oracle observer (client software that monitors Oracle Data Guard) initiates failover to the standby database in Region B. Applications can continue to connect to the database, resulting in high availability with limited or minimal data loss depending on the data change rate, as in Figure 1.

Figure 1. Oracle with cascading standby databases across regions

Using Oracle RedoRoutes, the default redo transport behavior of Data Guard can be controlled; the property can be set during setup as in the following example.

Oracle RedoRoutes setup example:

dgmgrl > edit database DB_1A set property RedoRoutes = '(LOCAL: (DB_1B FASTSYNC PRIORITY=1, DB_2A ASYNC PRIORITY=2, DB_2B ASYNC PRIORITY=3)) (DB_1B: (DB_2A ASYNC PRIORITY=1, DB_2B ASYNC PRIORITY=2)) (DB_2A: DB_1B ASYNC) (DB_2B: DB_1B ASYNC)'

dgmgrl > edit database DB_1B set property RedoRoutes = '(LOCAL: (DB_1A FASTSYNC PRIORITY=1, DB_2A ASYNC PRIORITY=2, DB_2B ASYNC PRIORITY=3)) (DB_1A: (DB_2A ASYNC PRIORITY=1, DB_2B ASYNC PRIORITY=2))'

dgmgrl > edit database DB_2A set property RedoRoutes = '(LOCAL: (DB_2B FASTSYNC PRIORITY=1, DB_1A ASYNC PRIORITY=2, DB_1B ASYNC PRIORITY=3)) (DB_2B: (DB_1A ASYNC PRIORITY=1, DB_1B ASYNC PRIORITY=2)) (DB_1A: DB_2B ASYNC) (DB_1B: DB_2B ASYNC)'

dgmgrl > edit database DB_2B set property RedoRoutes = '(LOCAL: (DB_2A FASTSYNC PRIORITY=1, DB_1A ASYNC PRIORITY=2, DB_1B ASYNC PRIORITY=3)) (DB_2A: (DB_1A ASYNC PRIORITY=1, DB_1B ASYNC PRIORITY=2))'

For more information on Oracle RedoRoutes setup for Oracle Cascading Standby, refer to this step-by-step configuration documentation.

Database failover with Amazon Route 53 and Oracle Data Guard

The following walkthrough defines the steps you can take to make database failover happen without manual intervention using Amazon Route 53 and Oracle Data Guard.

Prerequisites

Before getting started, review the following prerequisites for this solution:

Walkthrough

Step 1. Create Oracle Database Service

For applications to connect without manual intervention in the event of a failure, we recommend creating an Oracle database service using the Oracle-supplied package DBMS_SERVICE.

exec dbms_service.CREATE_SERVICE(SERVICE_NAME=>'DB_SERVICE_FOR_APP', NETWORK_NAME=>'DB_SERVICE_FOR_APP');

exec dbms_service.START_SERVICE('DB_SERVICE_FOR_APP');

Step 2. Network configuration

Applications can connect to the database seamlessly, without manual intervention, in the event of a failover from the primary database to the standby using the Oracle Transparent Application Failover (TAF) approach, though TAF requires updating application connection strings if a host IP changes.

The following approach using Amazon Route 53 is recommended for added flexibility and scalability. Route 53 has DNS A records that map to the database instance IPs and CNAME records that can redirect DNS queries to A records. The following depicts the DNS mapping. The CNAME, along with the database service name, can be used by the application in its network configuration.

Database_Name =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = <db_cname>)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = <db_service_name>)
    )
  )

To update the CNAME in Route 53 to map to the Primary host automatically in the event of failure, follow these steps.

Step 3. Route 53 setup

Create a script named route53update.sh and place it on the database hosts using the following code.

#!/bin/bash

export ORACLE_HOME="<<change>>"
export LD_LIBRARY_PATH=$ORACLE_HOME/lib
export PATH=$ORACLE_HOME/bin:$PATH:/usr/local/bin:/usr/bin

LOG_FILE="/tmp/switch_dns_$$.log"
DNS_DOMAIN="<<change>>"
ACTIVE_DB_CNAME="<<change>>"
HOSTED_ZONE_ID="<<change>>"
TTL="<<change>>"

# Build a change batch and upsert the CNAME record in Route 53
update_dns () {
  TMPFILE="/tmp/route53_dns_$$.log"
  cat > ${TMPFILE} << EOF
{
  "Comment":"Updating DNS of record ${1}.${DNS_DOMAIN}",
  "Changes":[
    {
      "Action":"UPSERT",
      "ResourceRecordSet":{
        "ResourceRecords":[
          {
            "Value":"$2"
          }
        ],
        "Name":"${1}.${DNS_DOMAIN}.",
        "Type":"CNAME",
        "TTL":$TTL
      }
    }
  ]
}
EOF

  /usr/local/bin/aws route53 change-resource-record-sets \
        --hosted-zone-id $HOSTED_ZONE_ID \
        --change-batch file://"$TMPFILE" >> "$LOG_FILE"
}

# Determine the DB_UNIQUE_NAME of the current primary database
prim_uniq_sid=`$ORACLE_HOME/bin/sqlplus -s / as sysdba <<EOF
set feedback off echo off lines 2000 head off
select upper(db_unique_name) from v\\$dataguard_config where DEST_ROLE='PRIMARY DATABASE';
EOF`

prim_uniq_sid=`echo $prim_uniq_sid | sed 's/^[ \t]*//;s/[ \t]*$//'`

# Extract the current primary host from its TNS entry
host_current=`$ORACLE_HOME/bin/tnsping ${prim_uniq_sid} | sed -n 's/\(.*Host\)\([^)]*\)\(.*\)/\2/pi' | sed 's/=//g' | sed 's/^[ \t]*//;s/[ \t]*$//'`

# Look up the host the CNAME currently points to
dns_current_host=`/usr/local/bin/aws route53 list-resource-record-sets --hosted-zone-id $HOSTED_ZONE_ID --query "ResourceRecordSets[?Name == '${ACTIVE_DB_CNAME}.${DNS_DOMAIN}.'].ResourceRecords" --output text`

# Update the CNAME only when it no longer points to the current primary
if [ "$host_current" != "$dns_current_host" ]; then
        update_dns ${ACTIVE_DB_CNAME} $host_current
fi
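Before wiring the script into a database job, you can run it once by hand and confirm that the CNAME record resolves to the current primary host. A quick sanity check along these lines (the hosted zone ID is a placeholder):

chmod +x route53update.sh
./route53update.sh

# Confirm the CNAME now points at the current primary host
aws route53 list-resource-record-sets \
  --hosted-zone-id <your-hosted-zone-id> \
  --query "ResourceRecordSets[?Type=='CNAME']"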

Step 4. Database job setup

Create a job in the Oracle primary database that executes the shell script introduced above in the event of a failover, using the following code.

begin
  dbms_scheduler.create_job
  (
    job_name             => 'route53update',
    job_type             => 'executable',
    number_of_arguments  => 0,
    job_action           => '/<<location of script>>/route53update.sh',
    auto_drop            => false
  );

  dbms_scheduler.enable('route53update');
end;
/

Step 5. Database trigger setup

In the event of a failure, the primary fails over and the standby starts up as the new primary. A trigger needs to be created on the primary database to execute the job on any failover to update the Route 53 CNAME, using the following code.

create or replace trigger SYS.Update_Route53_Record
AFTER STARTUP ON DATABASE
DECLARE
  db_role varchar2(16);
  db_mode varchar2(20);
BEGIN
  select database_role, open_mode into db_role, db_mode from v$database;
  if db_role = 'PRIMARY' then
    dbms_scheduler.run_job('route53update');
  END IF;
END;
/

Alternate Option 1: Single Region with multi-AZ

This option is a minimum recommended configuration to maintain high availability for Oracle databases on Amazon EC2 for customers who do not have a multi-region setup.

  • Advantage: Protects against Amazon EC2 service outage in a single AZ.
  • Limitation: Does not protect against Amazon EC2 service outages in a single Region.

In this architecture, Oracle Data Guard Fast Sync replication exists between Oracle database instances in a multi-AZ setup, with the primary database (read-write) in AZ 1 and the standby database (read-only) in AZ 2.

If the primary database is unreachable due to any failure, the observer fails over to the standby database in a different AZ. Applications can continue to connect to the database with zero data loss due to synchronous replication between AZs using the Maximum Availability/Maximum Protection mode setup in Oracle Data Guard. If the primary database is in us-east-1a and the standby in us-east-1b, the RedoRoutes property can be defined as follows.

Oracle RedoRoutes setup example:

dgmgrl > edit database DB_1A set property RedoRoutes = '(LOCAL: (DB_1B FASTSYNC))'

dgmgrl > edit database DB_1B set property RedoRoutes = '(LOCAL: (DB_1A FASTSYNC))'
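You can confirm the property took effect from the Data Guard broker command line; a quick check along these lines (output varies with your configuration):

dgmgrl > show database DB_1A RedoRoutes
dgmgrl > show database DB_1B RedoRoutes
dgmgrl > show configuration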

For more information on how disaster recovery works in the AWS Cloud, visit the Disaster recovery is different in the cloud section of the AWS Well-Architected Framework. For more on Oracle RedoRoutes setup, refer to the Oracle Redo Routing Rules documentation.

Alternate Option 2: Multi-AZ with multi-Region with single AZ

This option is recommended to maintain high availability for an Oracle database on Amazon EC2 for customers who need multi-Region availability. It protects against the rare unavailability of Amazon EC2 instances in the primary Region, in which case the disaster recovery environment in the secondary Region takes over.

  • Advantage: Protects against Amazon EC2 service outages across two AZs or in an entire AWS Region.
  • Limitation: Decreased resiliency (no in-Region high availability) after an Amazon EC2 service outage in an entire Region

In this architecture, Oracle Data Guard Fast Sync replication exists between Oracle database instances across two AZs within a single Region, with the primary database in AZ 1 of Region A and a standby database in AZ 2 of Region A. There is an asynchronous replication setup from the standby database to a cross-Region standby.

Asynchronous replication is recommended for cross-Region replication to avoid network latency issues. A cascading standby setup ensures there is no additional performance impact on the primary database from sending data to multiple standbys.

If the primary database is unreachable, failover happens between AZs in Region A. In the event of an Amazon EC2 service outage in a Region, failover occurs to Region B, resulting in high availability with minimal data loss depending on the data change rate. If the primary database is in us-east-1a and the standbys are in us-east-1b (Fast Sync) and us-east-2a (Async), the RedoRoutes property can be defined as follows.

Oracle RedoRoutes setup example:

dgmgrl > edit database DB_1A set property RedoRoutes= '(LOCAL: (DB_1B FASTSYNC PRIORITY=1, DB_2A ASYNC PRIORITY=2))(DB_1B: DB_2A ASYNC)(DB_2A: DB_1B ASYNC)'

dgmgrl > edit database DB_1B set property RedoRoutes= '(LOCAL: (DB_1A FASTSYNC PRIORITY=1, DB_2A ASYNC  PRIORITY=2)) (DB_1A: DB_2A ASYNC)'

dgmgrl > edit database DB_2A set property RedoRoutes= '(LOCAL: (DB_1A FASTSYNC PRIORITY=1, DB_1B ASYNC PRIORITY=2))'

Cleaning up

The services involved in this solution incur costs. When you’re done using this solution, clean up the following resources:

  • Amazon EC2 instances – Stop or delete (terminate) the Amazon EC2 instances that you provisioned.
  • Route 53 – Delete the hosted zone and the A and CNAME records created.

Conclusion

This blog post demonstrates how high availability and disaster recovery can be achieved for an Oracle database on an Amazon EC2 instance using Oracle Data Guard. Using the architectures in this post, you can achieve zero data loss with the Oracle Fast-Start Failover option within the same Region or cross-Region on Amazon EC2.

You can also use this architecture to replicate data from an Oracle database on Amazon EC2 to an Oracle database hosted outside of the AWS cloud. With Oracle Cascading Standby and Oracle RedoRoutes, you can remove high dependency on the Primary database to improve overall performance.

Enable transparent connectivity to Oracle Data Guard environments using Amazon Route 53 CNAME records

Post Syndicated from Sudip Acharya original https://aws.amazon.com/blogs/architecture/enable-transparent-connectivity-to-oracle-data-guard-environments-using-amazon-route-53-cname-records/

Customers choose AWS for running their Oracle database workload to help increase resiliency, performance, and scalability of the database layer. A high availability (HA) solution for the database stack is an important aspect to consider when migrating or deploying Oracle databases in AWS to help ensure that the architecture can meet the service level agreement (SLA) of the application. Customers who run their Oracle databases on Amazon Elastic Compute Cloud (Amazon EC2) commonly choose Oracle Data Guard physical standby databases to help meet the HA and disaster recovery (DR) for their Oracle database workloads.

As discussed in this Oracle documentation, role-based services with multiple listener endpoints in the connection URL or tnsnames.ora entry is the preferred way to transparently connect to the database layer that is part of a Data Guard configuration. However, some application components and driver configurations don’t support multiple hostnames in the connection URL. Those applications require a single hostname or IP for the clients to connect to the Data Guard environment.

This post talks about the concept of using an Amazon Route 53 CNAME record in a Data Guard environment on EC2 and lists the artifacts to automatically route the connection between primary and standby environments in a Data Guard configuration based on the database role.

Solution overview

To help avoid manual effort updating DNS entries or the tnsnames.ora file after a failover or switchover operation in a Data Guard environment, the solution uses an AFTER DB_ROLE_CHANGE trigger to automate the DNS failover process. This trigger runs a shell script on the database host, which in turn updates the CNAME records in Route 53 to reflect the role transition. The following diagram illustrates the solution architecture (Figure 1).

Figure 1. Solution architecture

The solution discussed in this post covers routing new database connection requests to the right database after a Data Guard switchover. However, other factors, such as application/client TTL settings and connection pools that do not invalidate handles created before the switchover, can cause the application to connect to a database with an unexpected role (for example, read-write workloads connected to the standby after a switchover) and can generate errors, such as ORA-16000: database or pluggable database open for read-only access. It is a best practice to verify the database role before using a connection handle for transactions, to confirm that the application is connected to the database with the expected role.
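One lightweight way to implement that check is to query the role on a freshly acquired connection before issuing transactions. A sketch in shell, using the orclrw TNS alias defined later in this post (the credentials are placeholders):

# Abort unless the connection is pointing at the primary (read-write) database
ROLE=$(sqlplus -s system/<password>@orclrw << 'EOF'
set heading off feedback off
select database_role from v$database;
EOF
)
if [ "$(echo $ROLE | xargs)" != "PRIMARY" ]; then
  echo "Connected to a non-primary database, aborting..."
  exit 1
fi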

The following workflow depicts the sequence of events that happens during a failover or switchover activity in a Data Guard environment to enable seamless connectivity for the application:

  1. A role transition event occurs in the Data Guard environment.
  2. The event triggers the AFTER DB_ROLE_CHANGE trigger.
  3. The trigger runs the shell script on the EC2 instance using a scheduler job.
  4. The shell script updates Route 53 to point the CNAME records to reflect the role transition.

Prerequisites

This post assumes the following prerequisites:

  • You should have an existing Data Guard configuration with one primary and one standby DB instance within a single VPC. Refer to the Oracle quick start template to deploy a Data Guard environment on Amazon EC2.
  • The steps discussed here are for self-managed Data Guard configuration on Amazon EC2 with Red Hat Linux AMI.
  • The scenario discussed in the post involves one primary and one standby database in the Data Guard configuration. For any other configurations, the scripts shown in this example require additional changes.
  • A private or public Route 53 hosted zone should be configured in the VPC where the DB environment exists.
  • The shell script uses the instance profile of the EC2 instance to run the AWS Command Line Interface (AWS CLI) commands. Make sure that the instance profile of the EC2 instances hosting the primary and standby databases has a policy attached that allows changing the record set in the hosted zone such as the following:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DBCnameFlipPolicy",
      "Effect": "Allow",
      "Action": [
        "route53:ChangeResourceRecordSets",
        "route53:ListResourceRecordSets"
      ],
      "Resource": "arn:aws:route53:::hostedzone/<<YourHostedZoneId>>"
    }
  ]
}
  • Nslookup, jq, and curl utilities must be installed on all of the DB hosts. If not installed, you can install them on RHEL Linux using the following commands:
yum install -y bind-utils
yum install -y curl
yum install -y jq

Environment details

This post assumes a Data Guard configuration with two instances within a single VPC, one primary and one standby, with the following details and naming conventions:

  • Oracle database version – 19.10 configured in maximum performance mode with Active Data Guard
  • Route 53 domain name – mydbdomain
  • Database name – orcl
  • DB_UNIQUE_NAME – orcl_a and orcl_b
  • Instance names – orcl
  • Route 53 A record for the host in AZ1 – orcl-a-db.mydbdomain
  • Route 53 A record for the host in AZ2 – orcl-b-db.mydbdomain

Route 53 configuration

Two A records are created in Route 53 to point to the IPs of the primary and standby hosts. Two CNAME records are also created in Route 53, which are automatically updated during the Data Guard switchover and failover scenarios. The CNAME record orcl-rw.mydbdomain points to the instance in the primary role that can accept read/write transactions, and orcl-ro.mydbdomain points to the instance in the standby role that accepts read-only queries.

The A records configuration is as follows:

  • DB host IP in AZ1 (10.0.0.5 in this example) – orcl-a-db.mydbdomain
  • DB host IP in AZ2 (10.0.32.5 in this example) – orcl-b-db.mydbdomain

The CNAME records configuration is as follows:

  • orcl-a-db.mydbdomain – orcl-rw.mydbdomain
  • orcl-b-db.mydbdomain – orcl-ro.mydbdomain
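For reference, the same A and CNAME records can be created from the AWS CLI. A minimal sketch using the example IPs above (the hosted zone ID is a placeholder):

cat > records.json << 'EOF'
{
  "Changes": [
    { "Action": "UPSERT", "ResourceRecordSet": { "Name": "orcl-a-db.mydbdomain.", "Type": "A",
        "TTL": 60, "ResourceRecords": [{ "Value": "10.0.0.5" }] } },
    { "Action": "UPSERT", "ResourceRecordSet": { "Name": "orcl-b-db.mydbdomain.", "Type": "A",
        "TTL": 60, "ResourceRecords": [{ "Value": "10.0.32.5" }] } },
    { "Action": "UPSERT", "ResourceRecordSet": { "Name": "orcl-rw.mydbdomain.", "Type": "CNAME",
        "TTL": 60, "ResourceRecords": [{ "Value": "orcl-a-db.mydbdomain." }] } },
    { "Action": "UPSERT", "ResourceRecordSet": { "Name": "orcl-ro.mydbdomain.", "Type": "CNAME",
        "TTL": 60, "ResourceRecords": [{ "Value": "orcl-b-db.mydbdomain." }] } }
  ]
}
EOF

aws route53 change-resource-record-sets --hosted-zone-id <your-hosted-zone-id> --change-batch file://records.json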

The following screenshot shows the Route 53 console view of the domain mydbdomain.

Figure 2. The Route 53 console view of the domain mydbdomain

TNS configuration

The following tnsnames.ora file entries show how connections can be made to primary and standby databases using the CNAME records without a dependency on the actual IP address of the EC2 instances that host primary and standby databases. The entry orcl_a always points to the instance on orcl-a-db.mydbdomain, and orcl_b always points to the instance on orcl-b-db.mydbdomain, regardless of their roles. The entries orclrw and orclro direct the connection to the databases playing primary and standby roles, respectively.

orcl_a =
  (description =
    (address = (protocol = tcp)(host = orcl-a-db.mydbdomain)(port = 1525))
    (connect_data =
      (server = dedicated)
      (service_name = orcl_a)
    )
  )

orcl_b =
  (description =
    (address = (protocol = tcp)(host = orcl-b-db.mydbdomain)(port = 1525))
    (connect_data =
      (server = dedicated)
      (service_name = orcl_b)
    )
  )

orclrw =
  (description =
    (address = (protocol = tcp)(host = orcl-rw.mydbdomain)(port = 1525))
    (connect_data =
      (server = dedicated)
      (service_name = orcl)
    )
  )

orclro =
  (description =
    (address = (protocol = tcp)(host = orcl-ro.mydbdomain)(port = 1525))
    (connect_data =
      (server = dedicated)
      (service_name = orcl)
    )
  )

To enable connectivity using orclrw and orclro TNS entries, you can use either a role-based service or a static listener registration entry in both the primary and standby listener, as shown in the following code:

(SID_DESC =
  (GLOBAL_DBNAME = orcl)
  (ORACLE_HOME = /opt/oracle/product/19c/dbhome_1)
  (SID_NAME = orcl)
)

Implement the solution

To implement an automated DNS update during an Oracle switchover or failover, we use an Oracle database trigger and a shell script. The following are the high-level steps for the entire workflow:

  1. Create an AFTER DB_ROLE_CHANGE ON DATABASE trigger on the primary database.
  2. The trigger in turn creates a DBMS_SCHEDULER job that calls the shell script cname_switch.sh.
  3. The shell script updates the Route 53 CNAME entries.

Database trigger

Use the following code for the database trigger:

CREATE OR REPLACE TRIGGER sys.cname_flip_post_role_change 
AFTER DB_ROLE_CHANGE ON DATABASE
DECLARE
  v_db_name VARCHAR2(9);
  v_db_role VARCHAR2(16);
BEGIN
  SELECT DATABASE_ROLE  INTO v_db_role FROM V$DATABASE;
  SELECT DB_UNIQUE_NAME INTO v_db_name FROM V$DATABASE;

  IF v_db_role = 'PRIMARY' THEN
    BEGIN
      dbms_scheduler.drop_job('RW_CNAME_FLIP');
    EXCEPTION
      WHEN OTHERS THEN NULL;
    END;

    dbms_scheduler.create_job(
      job_name   => 'RW_CNAME_FLIP',
      job_type   => 'EXECUTABLE',
      number_of_arguments => 1,
      job_action => '/home/oracle/admin/bin/cname_switch.sh',
      enabled    => false,
      auto_drop  => true);

    dbms_scheduler.set_job_argument_value(
      job_name          => 'RW_CNAME_FLIP',
      argument_position => 1,
      argument_value    => v_db_name);

    BEGIN
      dbms_scheduler.run_job('RW_CNAME_FLIP');
    EXCEPTION
    WHEN OTHERS THEN
      raise_application_error(-20101, 'CNAME flip failed, check script error');
    END;

  END IF;

EXCEPTION
  WHEN OTHERS THEN
    raise_application_error(-20102, 'CNAME flip failed due to error: ' || SQLERRM);
END;
/

Shell script

This script determines the current CNAME, identifies the dependent A records, and maps the CNAME to the correct A records accordingly. This shell script is provided for reference assuming the naming conventions for db_name and db_unique_name as used in the sample configuration. You should review and modify the script to meet your specific requirements and organization standards.

As per the example shown earlier, the shell script is placed in the location /home/oracle/admin/bin/cname_switch.sh.

Note: It's common to see production databases restored or cloned to lower environments. If the script is run in those environments, it can potentially change the CNAME entries unexpectedly. To mitigate this, the shell script includes the function restore_safeguard, which checks that the IP assigned to the EC2 instance actually matches the A records configured for this database in Route 53. If no match is found, the script does not perform the CNAME failover.

#!/bin/bash
#set -x

# Variables may need to be changed to suit your environment

DB_NAME=$1
DB_IN=$1
echo "Original Input : ${DB_NAME}"
DB_NAME=`echo "${DB_NAME::-2}"`  # removing last 2 characters from DB_UNIQUE_NAME
DB_NAME=`echo "${DB_NAME}" | tr '[:upper:]' '[:lower:]'`
echo "Modified Input : ${DB_NAME}"

DB_DOMAIN=<<YOUR_AWS_ROUTE53_DOMAIN_NAME>>    # Update as per your AWS Route53 domain name
ZONE_ID=<<YOUR_AWS_ROUTE53_HOSTED_ZONE_ID>>   # Update as per your AWS Route53 hosted zone ID
EC2_METADATA='http://169.254.169.254/latest/dynamic/instance-identity/document'

# CNAME- and A-record-related variables

RW_CNAME=`echo "${DB_NAME}-rw.${DB_DOMAIN}"`
RO_CNAME=`echo "${DB_NAME}-ro.${DB_DOMAIN}"`
A_CNAME=`echo "${DB_NAME}-a-db.${DB_DOMAIN}"`
B_CNAME=`echo "${DB_NAME}-b-db.${DB_DOMAIN}"`

REGION=`curl -s ${EC2_METADATA} | grep region | awk -F\" '{print $4}'`

# Logfile configuration and file initialization

TS=`date +%Y%m%d_%H%M%S`
LOG_DIR=/tmp
CHANGE_SET_FILE=`echo "${LOG_DIR}/${DB_NAME}-CnameFlip-${TS}.json"`
LOG_FILE=`echo "${LOG_DIR}/${DB_NAME}-CnameFlip-${TS}.log"`
CONF_FILE=`echo "file://${CHANGE_SET_FILE}"`

# Function to check if the current host IP matches the Route 53 configuration

IS_SAFE='Unsafe'

function restore_safeguard()
{
    AWS_TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"`
    LOCAL_IPV4=`curl -sH "X-aws-ec2-metadata-token: $AWS_TOKEN" -v http://169.254.169.254/latest/meta-data/local-ipv4`
    PUBLIC_IPV4=`curl -sH "X-aws-ec2-metadata-token: $AWS_TOKEN" -v http://169.254.169.254/latest/meta-data/public-ipv4`
    NOT_FOUND=`echo ${PUBLIC_IPV4} | grep '404 - Not Found' | wc -l`

    if [ ${NOT_FOUND} == 1 ]; then
       PUBLIC_IPV4='No Public IP Assigned'
    fi

    A_IP=$(aws route53 list-resource-record-sets --hosted-zone-id ${ZONE_ID} \
           --query 'ResourceRecordSets[?Type==`A`].{Name: Name, Value:ResourceRecords[0].Value}' | \
           jq -cr --arg DB_NAME "${DB_NAME}-a" '.[] | select( .Name | contains($DB_NAME)).Value')

    B_IP=$(aws route53 list-resource-record-sets --hosted-zone-id ${ZONE_ID} \
           --query 'ResourceRecordSets[?Type==`A`].{Name: Name, Value:ResourceRecords[0].Value}' | \
           jq -cr --arg DB_NAME "${DB_NAME}-b" '.[] | select( .Name | contains($DB_NAME)).Value')

    PREVIOUS_RW_ID=$(aws route53 list-resource-record-sets --hosted-zone-id ${ZONE_ID} \
           --query 'ResourceRecordSets[?Type==`CNAME`].{Name: Name, Value:ResourceRecords[0].Value}' | \
           jq -cr --arg DB_NAME "${DB_NAME}-rw" '.[] | select( .Name | contains($DB_NAME)).Value' | cut -d'-' -f2)

    if [ ${PREVIOUS_RW_ID} == 'a' ]; then
       RW_NODE_IP=${A_IP}
       RO_NODE_IP=${B_IP}
    else
       RW_NODE_IP=${B_IP}
       RO_NODE_IP=${A_IP}
    fi

    # Logging input values

    echo "Original Input  : ${DB_IN}"          | tee -a ${LOG_FILE}
    echo "Modified Input  : ${DB_NAME}"        | tee -a ${LOG_FILE}
    echo "Current RW ID   : ${PREVIOUS_RW_ID}" | tee -a ${LOG_FILE}
    echo "Host Private IP : ${LOCAL_IPV4}"     | tee -a ${LOG_FILE}
    echo "Host Public IP  : ${PUBLIC_IPV4}"    | tee -a ${LOG_FILE}
    echo "A Node IP       : ${A_IP}"           | tee -a ${LOG_FILE}
    echo "B Node IP       : ${B_IP}"           | tee -a ${LOG_FILE}
    echo "RW Node IP      : ${RW_NODE_IP}"     | tee -a ${LOG_FILE}
    echo "RO Node IP      : ${RO_NODE_IP}"     | tee -a ${LOG_FILE}

    if [ "${LOCAL_IPV4}" == "${RO_NODE_IP}" -o "${PUBLIC_IPV4}" == "${RO_NODE_IP}" ]; then
       IS_SAFE='Safe'
    else
       IS_SAFE='Unsafe'
    fi
}

restore_safeguard

if [ ${IS_SAFE} == 'Safe' ]; then
   echo "Safe for CNAME failover..." | tee -a ${LOG_FILE}
else
   echo "Unsafe for CNAME failover..." | tee -a ${LOG_FILE}
   echo "Aborting..."
   exit 1
fi

PRI_DB_ID=`nslookup ${RW_CNAME} | grep "canonical name" | cut -d'=' -f2 | cut -d'-' -f2`

# Logging input values
echo "Original Input     : ${DB_IN}"     | tee    ${LOG_FILE}
echo "Modified Input     : ${DB_NAME}"   | tee -a ${LOG_FILE}
echo "Current RW host ID : ${PRI_DB_ID}" | tee -a ${LOG_FILE}

echo -e "\nChange to be done : \n" | tee -a ${LOG_FILE}

if [ ${PRI_DB_ID} == 'a' ]; then
   echo "Changing ${RW_CNAME} from ${A_CNAME} to ${B_CNAME}" | tee -a ${LOG_FILE}
   echo "Changing ${RO_CNAME} from ${B_CNAME} to ${A_CNAME}" | tee -a ${LOG_FILE}
   TO_BE_RW_CNAME=${B_CNAME}
   TO_BE_RO_CNAME=${A_CNAME}
else
   echo "Changing ${RW_CNAME} from ${B_CNAME} to ${A_CNAME}" | tee -a ${LOG_FILE}
   echo "Changing ${RO_CNAME} from ${A_CNAME} to ${B_CNAME}" | tee -a ${LOG_FILE}
   TO_BE_RW_CNAME=${A_CNAME}
   TO_BE_RO_CNAME=${B_CNAME}
fi

R53_CHANGE=`echo -e "
{
  \"Comment\": \"Flip CNAMEs\",
  \"Changes\": [
    {
      \"Action\" : \"UPSERT\",
      \"ResourceRecordSet\" : {
        \"Name\" : \"${RW_CNAME}.\",
        \"Type\" : \"CNAME\",
        \"TTL\"  : 60,
        \"ResourceRecords\" : [{ \"Value\": \"${TO_BE_RW_CNAME}.\" }]
      }
    },
    {
      \"Action\" : \"UPSERT\",
      \"ResourceRecordSet\" : {
        \"Name\" : \"${RO_CNAME}.\",
        \"Type\" : \"CNAME\",
        \"TTL\"  : 60,
        \"ResourceRecords\" : [{ \"Value\": \"${TO_BE_RO_CNAME}.\" }]
      }
    }
  ]
}
"`

echo -e "\nRoute53 Change Set :\n" | tee -a ${LOG_FILE}
echo ${R53_CHANGE} | tee -a ${LOG_FILE}
echo ${R53_CHANGE} > ${CHANGE_SET_FILE}

echo -e "\nCommand to Execute : " | tee -a ${LOG_FILE}
echo -e "\naws route53 change-resource-record-sets --hosted-zone-id ${ZONE_ID} \
         --change-batch ${CONF_FILE} \n" | tee -a ${LOG_FILE}

echo -e "\nExecution Result :\n"
aws route53 change-resource-record-sets --hosted-zone-id ${ZONE_ID} \
--change-batch ${CONF_FILE} | tee -a ${LOG_FILE}

echo -e "\nAfter Change :\n "
aws route53 list-resource-record-sets --hosted-zone-id ${ZONE_ID} | tee -a ${LOG_FILE}

Test the solution

The following screenshot shows the Route 53 console view of the domain mydbdomain before the switchover. The primary database is running on orcl-a-db.mydbdomain because orcl-rw.mydbdomain points to that host.

Figure 3. Route 53 console view of the domain mydbdomain before the switchover

The following SQL displays the current role of both the primary and standby databases and the host each is currently running on.

[oracle@ip-10-0-0-5 sql]$ cat db_info.sql

ALTER SESSION SET NLS_DATE_FORMAT='YYYY-MM-DD:HH24:MI';
set lines 150 pages 200
col HOST_NAME for a30 trunc

select d.NAME, d.db_unique_name, d.DATABASE_ROLE, d.OPEN_MODE, i.INSTANCE_NAME, 
i.HOST_NAME, i.STARTUP_TIME
from v$instance i, v$database d;

[oracle@ip-10-0-0-5 sql]$ sqlplus system@orclrw

SQL> @db_info

NAME  DB_UNIQUE_NAME DATABASE_ROLE OPEN_MODE INSTANCE_NAME HOST_NAME STARTUP_TIME
------ ---------------- -------------- ---------------- ------------------------------ ----------------
ORCL orcl_a PRIMARY READ WRITE orcl ip-10-0-0-5.us-west-2.compute. 2020-05-24:01:47

[oracle@ip-10-0-0-5 sql]$ sqlplus system@orclro

SQL> @db_info

NAME DB_UNIQUE_NAME DATABASE_ROLE OPEN_MODE INSTANCE_NAME HOST_NAME STARTUP_TIME
------ ---------------- -------------------- -------------- ------------------------------- ----------------
ORCL orcl_b PHYSICAL STANDBY READ ONLY WITH APPLY orcl ip-10-0-32-5.us-west-2.compute. 2020-05-24:05:50

Let’s initiate the switchover:

[oracle@ip-10-0-0-5 sql]$ dgmgrl /
DGMGRL for Linux: Release 12.2.0.1.0 - Production on Wed May 27 06:42:51 2020

Copyright (c) 1982, 2017, Oracle and/or its affiliates.  All rights reserved.

Welcome to DGMGRL, type "help" for information.
Connected to "orcl_a"
Connected as SYSDG.
DGMGRL> show configuration;

Configuration - awsguard

  Protection Mode: MaxPerformance
  Members:
  orcl_a - Primary database
    orcl_b - Physical standby database

Fast-Start Failover: DISABLED

Configuration Status:
SUCCESS   (status updated 39 seconds ago)

DGMGRL> switchover to orcl_b;
Performing switchover NOW, please wait...
Operation requires a connection to database "orcl_b"
Connecting ...
Connected to "orcl_b"
Connected as SYSDBA.
New primary database "orcl_b" is opening...
Oracle Clusterware is restarting database "orcl_a" ...
Switchover succeeded, new primary is "orcl_b"
DGMGRL>
DGMGRL> show configuration;

Configuration - awsguard

  Protection Mode: MaxPerformance
  Members:
  orcl_b - Primary database
    orcl_a - Physical standby database

Fast-Start Failover: DISABLED

Configuration Status:
SUCCESS   (status updated 67 seconds ago)

DGMGRL>

Now that the switchover is complete, let’s connect to the database using the orclrw and orclro TNS entries using the following code:

[oracle@ip-10-0-0-5 sql]$ sqlplus system@orclrw

SQL> @db_info

NAME DB_UNIQUE_NAME  DATABASE_ROLE  OPEN_MODE     INSTANCE_NAME  HOST_NAME                      STARTUP_TIME
----- -------------- ------------- -------------- ------------------------------ ----------------
ORCL  orcl_b PRIMARY        READ WRITE    orcl          ip-10-0-32-5.us-west-2.compute 2020-05-24:05:50


[oracle@ip-10-0-0-5 sql]$ sqlplus system@orclro

SQL> @db_info

NAME  DB_UNIQUE_NAME DATABASE_ROLE     OPEN_MODE            INSTANCE_NAME  HOST_NAME            STARTUP_TIME
----- ----------------- -------------------- -------------- ------------------------------ ----------------
ORCL orcl_a PHYSICAL STANDBY  READ ONLY WITH APPLY orcl          ip-10-0-0-5.us-west-2.compute. 2020-05-27:06:43

The following screenshot shows the Route 53 console view of the domain mydbdomain after the switchover. The primary database is now running on orcl-b-db.mydbdomain because orcl-rw.mydbdomain points to that host.

Figure 4. Route 53 console view of the domain mydbdomain after the switchover

Conclusion

Application connectivity to a Data Guard environment can be challenging, especially when the application configuration doesn’t support multiple hostnames or listener endpoints. In this post, we discussed step-by-step details to enable seamless connectivity to Data Guard environments using Route 53 CNAME records, a database trigger, and a shell script. You can use these artifacts to direct the DB connections to the database with the right role seamlessly without application changes. If you are using Data Guard Observer for automated failover, another blog, Setup a high availability design for Oracle Data Guard (Fast-Start Failover) using Amazon Route 53 discusses an alternate mechanism to achieve the same result.

Mitigating DDoS with data science using AWS Shield Advanced and AWS WAF

Post Syndicated from Jasmine Maheshwari original https://aws.amazon.com/blogs/architecture/mitigating-ddos-with-data-science-using-aws-shield-advanced-and-aws-waf/

This blog post helps customers in mitigating distributed denial-of-service (DDoS) using AWS Shield Advanced, AWS WAF, and data science. We explore how to use these services along with machine learning (ML) to detect and mitigate DDoS attacks.

Bad actors conduct DDoS attacks using botnets. Through botnets, attackers look for zero-day vulnerabilities, specifically on network devices such as routers, and make these devices part of their network. Bots spread around the world await instructions from control servers.

This post examines a real-world use case. Online payment solutions company Razorpay has a business-to-business (B2B) model where merchants invoke APIs for payments. Given the complex nature of B2B payments, the traffic patterns of small-scale merchants and large-scale merchants need to be distinguished. This business model puts the company at risk for DDoS attacks, and here is how AWS helps them address this ongoing concern.

First, we will introduce an initial DDoS architecture approach, and then show how it was modified to meet specific Razorpay needs, as is possible for cross-industry clients with varying requirements.

Basic DDoS recommendations

Let’s explore some out-of-the-box DDoS business solution recommendations for different resource categories:

  • Static websites: Amazon CloudFront is a content delivery network (CDN) service that is used to host static websites. The AWS WAF rate-limiting rule can be used to limit traffic on a CDN.
  • Dynamic websites: In this scenario, a standard WAF Rate Limit rule can be used on Application Load Balancer (ALB).
  • APIs with customer context: These APIs serve requests on behalf of merchants (for example, an end user places a food order from a food-delivery application and completes a payment; the request is sent to api.razorpay.com, and Razorpay understands that the end user is trying to make a payment to the food app).

Figure 1 represents the difficulty levels and solutions for each resource category.

Figure 1. DDoS business solution resource category recommendations

DDoS response phases

We’ve discussed initial DDoS architectures, so now let’s explore the various DDoS response phases:

  • Phase 1 – Automated inventory: Create an automated inventory system capturing the list of exposed APIs and websites, ranked by exposure and risk of an outage.
  • Phase 2 – Bucketing assets into groups: Bucket APIs and websites (static or dynamic) into groups. This phase also subgroups the assets, for example into unauthenticated and authenticated. As we will discuss later in this post, most Razorpay APIs have a customer context requiring a multilayered defense mechanism.
  • Phase 3 – Testing the solution: Simulating attack traffic to test the solution under different loads, origins, and more.

AWS DDoS solutions

AWS has two major out-of-the-box solutions when protecting against DDoS attacks: AWS Shield and AWS WAF.

AWS Shield has three important features for DDoS mitigation:

  1. Alarms: Triggers an alarm when a DDoS attack is suspected, customized based on metrics such as total volume, error rates at the ALB, and response latency.
  2. Visibility: Provides the top five values of important fields in real time, such as requesting client IPs, countries, user agents, referrer headers, and URL routes, so that you can take action.
  3. On-call support: Provides on-call support from the Shield Response team (SRT) to understand attack vectors and create AWS WAF rules based on insights.

AWS WAF has two rule types:

  1. Blocking: Blocks requests matching expected variables, from individual fields to a combination of fields such as IP, URL route, body, or country.
  2. Rate limiting: Tracks the rate of requests for each originating IP address, and triggers the rule action on IPs with rates that go over the limit. A scope-down statement can also be added to the rule, to only track and rate limit requests that match the scope-down statement. (We will soon discuss how Razorpay relies heavily on rate limiting in AWS WAF and other microservice architecture solutions.)

Razorpay microservices protection architecture

Razorpay has two main layers protecting its internal microservices. Requests to the payment API api.razorpay.com go from Amazon Route 53 to an ALB. AWS WAF and AWS Shield Advanced are deployed on the ALB. Requests are then forwarded to microservices fronted by an API Gateway, as in Figure 2.

Figure 2. Razorpay internal microservices protection layers

Razorpay API Gateway

An API Gateway is a critical piece of infrastructure for microservice architectures. Razorpay is a B2B organization that performs actions on behalf of merchants. To achieve this—and understand context about the merchants and end-users alike—a custom API Gateway:

  • Works at a high scale with very low latency
  • Understands merchant context to provide authentication and authorization
  • Provides security features around malicious traffic

Razorpay uses a self-managed API gateway called Edge. The API Gateway is the first system on the ingress path to understand Razorpay domain context that prior layers cannot. Every request goes through multiple layers of logic at the API Gateway before being forwarded to respective services.

Razorpay Edge and AWS solution insights

As Edge performs multiple computations on each request, Razorpay generates insights for each request to build intelligence and prevent DDoS attacks. These events are fed into a time-series database in near real time and generate insights using machine learning (ML) models. The insights:

  1. Identify malicious patterns: Generating a confidence percentage that helps define further action, verify consumer authenticity, and serve the request. Malicious requests are blocked at the edge.
  2. Derive pattern-based rate limits: Deriving rate limits based on a larger set of data, including consumer and IP address, by looking at weekly and monthly patterns.

Once this information is available at Razorpay Edge, it can perform various operations to ensure only legitimate requests reach the service layer. Let's explore the request flow to understand each stage's contribution toward DDoS handling, as in Figure 3.
Figure 3. Razorpay Edge request flow for DDoS handling

All Razorpay requests flow through API Gateway. Here are some key recommendations:

  • Based on feedback data available, the gateway checks if the given request pattern falls in the malicious category. If the confidence percentage is high for a malicious request, block the request. A block rule is applied at the AWS WAF layer. If the confidence percentage is low, use CAPTCHA to check user authenticity.
  • Apply rate limit on pre-existing data in the request object using feedback from an ML model.
  • Perform authentication/identification and authorization.
  • Apply rate limit on data derived from request object, also using feedback from a ML model.
  • Request user CAPTCHA verification if the request is identified as malicious with low confidence.
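The low-confidence CAPTCHA step maps naturally onto the AWS WAF Captcha rule action. The following is a sketch of a rule that challenges, rather than blocks, requests matching a suspicious pattern; the URI path is an example value, and in practice the match condition would come from the ML-derived insights described above.

cat > captcha-rule.json << 'EOF'
[
  {
    "Name": "CaptchaSuspiciousPath",
    "Priority": 2,
    "Statement": {
      "ByteMatchStatement": {
        "SearchString": "/v1/payments",
        "FieldToMatch": { "UriPath": {} },
        "TextTransformations": [{ "Priority": 0, "Type": "NONE" }],
        "PositionalConstraint": "STARTS_WITH"
      }
    },
    "Action": { "Captcha": {} },
    "VisibilityConfig": {
      "SampledRequestsEnabled": true,
      "CloudWatchMetricsEnabled": true,
      "MetricName": "CaptchaSuspiciousPath"
    }
  }
]
EOF

The rule is then attached to a web ACL with aws wafv2 create-web-acl or update-web-acl, as with any other rule.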

As a best practice, publish the generated insights for each request to Apache Kafka outside of the request flow, to avoid latency impact and reduce blast radius.

Conclusion

In this post, you learned about addressing DDoS attacks with ML models.

Organizations can use the insights data generated by an API Gateway to identify request patterns, categorize the patterns into various buckets, and perform the following depending on category:

  • Send feedback to an API Gateway for immediate action
  • Update the rate limits at Edge based on historical patterns
  • Update WAF rule for blocking/reducing rate limits

As discussed, using AWS WAF and AWS Shield Advanced along with ML helps detect and mitigate DDoS attacks.

Disaster Recovery Solutions with AWS-Managed Services, Part 3: Multi-Site Active/Passive

Post Syndicated from Brent Kim original https://aws.amazon.com/blogs/architecture/disaster-recovery-solutions-with-aws-managed-services-part-3-multi-site-active-passive/

Welcome to the third post of a multi-part series that addresses disaster recovery (DR) strategies with the use of AWS-managed services to align with customer requirements of performance, cost, and compliance. In part two of this series, we introduced a DR concept that utilizes managed services through a backup and restore strategy with multiple Regions. This post introduces a multi-site active/passive approach.

The multi-site active/passive approach is best for customers who have business-critical workloads with higher availability requirements than other active/passive environments can provide. A warm-standby strategy (as in Figure 1) is more costly than other active/passive strategies, but provides good protection from downtime and data loss outside of an active/active (A/A) environment.

Warm standby

Figure 1. Warm standby

Implementing the multi-site active/passive strategy

By replicating across multiple Availability Zones in the same Region, your workloads become resilient to the failure of an entire data center. Using multiple Regions provides the most resilient option for deploying workloads, safeguarding against the risk of failure of multiple data centers.

Let’s explore an application that processes payment transactions and is modernized to utilize managed services in the AWS Cloud, as in Figure 2.

Warm standby with managed services

Figure 2. Warm standby with managed services

Let’s cover each of the components of this application, as well as how managed services behave in a multisite environment.

1. Amazon Route 53 – Active/Passive Failover: This configuration keeps primary resources available, with secondary resources on standby in case the primary environment fails. You only need to create the records and specify failover as the routing policy. When responding to queries, Amazon Route 53 includes only the healthy primary resources. If the health check associated with the primary record reports it as unhealthy, Route 53 responds to DNS queries using the secondary record.
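
As a minimal sketch of this configuration with the AWS SDK for Python, assuming a placeholder hosted zone ID, health check ID, and endpoint IP addresses:

    import boto3

    route53 = boto3.client("route53")

    route53.change_resource_record_sets(
        HostedZoneId="Z0000000000000",  # placeholder hosted zone ID
        ChangeBatch={"Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "app.example.com", "Type": "A",
                "SetIdentifier": "primary",
                "Failover": "PRIMARY",  # served while the health check passes
                "HealthCheckId": "00000000-aaaa-bbbb-cccc-000000000000",  # placeholder
                "TTL": 60,
                "ResourceRecords": [{"Value": "203.0.113.10"}],
            }},
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "app.example.com", "Type": "A",
                "SetIdentifier": "secondary",
                "Failover": "SECONDARY",  # answered only when primary is unhealthy
                "TTL": 60,
                "ResourceRecords": [{"Value": "198.51.100.20"}],
            }},
        ]},
    )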

2. Amazon EKS control plane: Amazon Elastic Kubernetes Service (Amazon EKS) control plane nodes run in an account managed by AWS. Each EKS cluster control plane is single-tenant and unique, and runs on its own set of Amazon Elastic Compute Cloud (Amazon EC2) instances. Amazon EKS is also a Regional service, so each cluster is confined to the Region where it is deployed, with each cluster being a standalone entity.

3. Amazon EKS data plane: Operating highly available and resilient applications requires a highly available and resilient data plane. It’s best practice to create worker nodes using Amazon EC2 Auto Scaling groups instead of creating individual Amazon EC2 instances and joining them to the cluster.

Figure 2 shows three nodes in the primary Region and a single node in the secondary. In case of failover, the data plane scales up to meet the workload requirements. This strategy deploys a functional stack to the secondary Region so you can test Region readiness before failover. You can use Velero with Portworx to manage snapshots of persistent volumes. These snapshots can be stored in an Amazon Simple Storage Service (Amazon S3) bucket in the primary Region, which is replicated to an Amazon S3 bucket in another Region using Amazon S3 cross-Region replication.

During an outage in the primary Region, Velero restores volumes from the latest snapshots in the standby cluster.

4. Amazon OpenSearch Service: With cross-cluster replication in Amazon OpenSearch Service, you can replicate indexes, mappings, and metadata from one OpenSearch Service domain to another. The domain follows an active-passive replication model where the follower index (where the data is replicated) pulls data from the leader index. Using cross-cluster replication helps to ensure recovery from disaster events and allows you to replicate data across geographically distant data centers to reduce latency.

Cross-cluster replication is available on domains running Elasticsearch 7.10 or OpenSearch 1.1 or later. Full documentation for cross-cluster replication is available in the OpenSearch documentation.

If you are using any versions prior to Elasticsearch 7.10 or OpenSearch 1.1, refer to part two of our blog series for guidance on using APIs for cross-Region replication.

5. Amazon RDS for PostgreSQL: Amazon Relational Database Service (Amazon RDS) for PostgreSQL offers cross-Region read replicas as a managed feature. Cross-Region read replicas enable you to implement a DR solution, scale read database workloads, and perform cross-Region migration.

Amazon RDS for PostgreSQL supports creating read replicas of a source database (DB). Amazon RDS uses the DB engine's asynchronous replication to update the read replica whenever there is a change on the source DB instance. Although read replicas operate as DB instances that allow only read-only connections, they can be used to implement a DR solution for your production DB environment. If the source DB instance fails, you can promote your read replica to a standalone source server.

Using a cross-Region read replica helps ensure that you get back up and running if you experience a Regional availability issue. For more information on PostgreSQL cross-Region read replicas, visit the Best Practices for Amazon RDS for PostgreSQL Cross-Region Read Replicas blog post.
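
Promotion itself is a single API call; a sketch with a placeholder replica identifier, run in the replica's Region:

    import boto3

    rds = boto3.client("rds", region_name="us-west-2")  # the Region hosting the replica

    # Promote the cross-Region read replica to a standalone, writable instance.
    rds.promote_read_replica(DBInstanceIdentifier="payments-db-replica")  # placeholder name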

6. Amazon ElastiCache: AWS provides a native solution called Global Datastore that enables cross-Region replication. By using the Global Datastore for Redis feature, you can work with fully managed, fast, reliable, and secure replication across AWS Regions. This feature helps create cross-Region read replica clusters for ElastiCache for Redis to enable low-latency reads and DR across AWS Regions. Each global datastore is a collection of one or more clusters that replicate to one another. When you create a global datastore in Amazon ElastiCache, ElastiCache for Redis automatically replicates your data from the primary cluster to the secondary cluster. ElastiCache then sets up and manages automatic, asynchronous replication of data between the two clusters.

7. Amazon Redshift: With Amazon Redshift, there are only two ways of deploying a true DR approach: backup and restore, or an active/active (A/A) solution. We'll use the A/A solution, as it provides a better recovery time objective (RTO) for the overall approach. The recovery point objective (RPO) depends on the configured schedule of the AWS Lambda functions. The application within the primary Region sends data to both Amazon Simple Notification Service (Amazon SNS) and Amazon S3, and the data is distributed to the Redshift clusters in both Regions through Lambda functions.

Amazon EKS uploads data to an Amazon S3 bucket and publishes a message to an Amazon SNS topic with a reference to the stored S3 object. S3 acts as an intermediate data store for messages beyond the maximum message size of Amazon SNS. Amazon SNS is configured with Amazon Simple Queue Service (Amazon SQS) endpoint subscriptions in both the primary and secondary Regions; Amazon SNS supports cross-Region delivery of notifications to Amazon SQS queues. Lambda functions deployed in the primary and secondary Regions poll the Amazon SQS queue in their respective Regions to read the message. The Lambda functions then use the Amazon SQS Extended Client Library for Java to retrieve the Amazon S3 object referenced in the message. Once the Amazon S3 object is retrieved, the Lambda functions upload the data into Amazon Redshift.
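
A simplified sketch of the publishing side of this pattern follows; the bucket name and topic ARN are placeholders, and the consuming side (not shown) mirrors the Java-based retrieval described above.

    import json
    import uuid
    import boto3

    s3 = boto3.client("s3")
    sns = boto3.client("sns")

    def publish_payload(payload: bytes) -> None:
        # Store the full payload in S3; SNS carries only a small pointer to it.
        key = f"redshift-loads/{uuid.uuid4()}.json"
        s3.put_object(Bucket="example-ingest-bucket", Key=key, Body=payload)  # placeholder bucket
        sns.publish(
            TopicArn="arn:aws:sns:us-east-1:111122223333:redshift-ingest",  # placeholder ARN
            Message=json.dumps({"s3Bucket": "example-ingest-bucket", "s3Key": key}),
        )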

For more on how to coordinate large messages across accounts and Regions with Amazon SNS and Amazon SQS, explore the Coordinating Large Messages Across Accounts and Regions with Amazon SNS and SQS blog post.

Conclusion

This active/passive approach covers how you can build a creative DR solution using a mix of native and non-native cross-Region replication methods. By using managed services, this strategy becomes simpler through automation of service updates, deployment using Infrastructure as Code (IaC), and general management of the two environments.

Related information

Want to learn more? Explore the following resources within this series and beyond!

Three ways to boost your email security and brand reputation with AWS

Post Syndicated from Michael Davie original https://aws.amazon.com/blogs/security/three-ways-to-boost-your-email-security-and-brand-reputation-with-aws/

If you own a domain that you use for email, you want to maintain the reputation and goodwill of your domain’s brand. Several industry-standard mechanisms can help prevent your domain from being used as part of a phishing attack. In this post, we’ll show you how to deploy three of these mechanisms, which visually authenticate emails sent from your domain to users and verify that emails are encrypted in transit. It can take as little as 15 minutes to deploy these mechanisms on Amazon Web Services (AWS), and the result can help to provide immediate and long-term improvements to your organization’s email security.

Phishing through email remains one of the most common ways that bad actors try to compromise computer systems. Incidents of phishing and related crimes far outnumber the incidents of other categories of internet crime, according to the most recent FBI Internet Crime Report. Phishing has consistently led to large annual financial losses in the US and globally.

Overview of BIMI, MTA-STS, and TLS reporting

An earlier post has covered how you can use Amazon Simple Email Service (Amazon SES) to send emails that align with best practices, including the IETF internet standards: Sender Policy Framework (SPF), DomainKeys Identified Mail (DKIM), and Domain-based Message Authentication, Reporting, and Conformance (DMARC). This post will show you how to build on this foundation and configure your domains to align with additional email security standards, including the following:

  • Brand Indicators for Message Identification (BIMI) – This standard allows you to associate a logo with your email domain, which some email clients will display to users in their inbox. Visit the BIMI Group’s Where is my BIMI Logo Displayed? webpage to see how logos are displayed in the user interfaces of BIMI-supporting mailbox providers; Figure 1 shows a mock-up of a typical layout that contains a logo.
  • Mail Transfer Agent Strict Transport Security (MTA-STS) – This standard helps ensure that email servers always use TLS encryption and certificate-based authentication when they send messages to your domain, to protect the confidentiality and integrity of email in transit.
  • SMTP TLS reporting – This reporting allows you to receive reports and monitor your domain’s TLS security posture, identify problems, and learn about attacks that might be occurring.

Figure 1: A mock-up of how BIMI enables branded logos to be displayed in email user interfaces

These three standards require your Domain Name System (DNS) to publish specific records, for example by using Amazon Route 53, that point to web pages that have additional information. You can host this information without having to maintain a web server by storing it in Amazon Simple Storage Service (Amazon S3) and delivering it through Amazon CloudFront, secured with a certificate provisioned from AWS Certificate Manager (ACM).

Note: This AWS solution works for DKIM, BIMI, and DMARC, regardless of what you use to serve the actual email for your domains, which services you use to send email, and where you host DNS. For clarity, this post assumes that you are using Route 53 for DNS. If you use a different DNS hosting provider, you will need to configure the DNS records manually with that provider.

Solution architecture

The architecture for this solution is depicted in Figure 2.


Figure 2: The architecture diagram showing how the solution components interact

The interaction points are as follows:

  1. The web content is stored in an S3 bucket, and CloudFront has access to this bucket through an origin access identity, a mechanism of AWS Identity and Access Management (IAM).
  2. As described in more detail in the BIMI section of this blog post, the Verified Mark Certificate is obtained from a BIMI-qualified certificate authority and stored in the S3 bucket.
  3. When an external email system receives a message claiming to be from your domain, it looks up BIMI records for your domain in DNS. As depicted in the diagram, a DNS request is sent to Route 53.
  4. To retrieve the BIMI logo image and Verified Mark Certificate, the external email system will make HTTPS requests to a URL published in the BIMI DNS record. In this solution, the URL points to the CloudFront distribution, which has a TLS certificate provisioned with ACM.

A few important warnings

Email is a complex system of interoperating technologies. It is also brittle: a typo or a missing DNS record can make the difference between whether an email is delivered or not. Pay close attention to your email server and the users of your email systems when implementing the solution in this blog post. The main indicator that something is wrong is the absence of email. Instead of seeing an error in your email server’s log, users will tell you that they’re expecting to receive an email from somewhere and it’s not arriving. Or they will tell you that they sent an email, and their recipient can’t find it.

The DNS uses a lot of caching and time-out values to improve its efficiency. That makes DNS records slow and a little unpredictable as they propagate across the internet. So keep in mind that as you monitor your systems, it can be hours or even more than a day before the DNS record changes have an effect that you can detect.

This solution uses AWS Cloud Development Kit (CDK) custom resources, which are supported by AWS Lambda functions that will be created as part of the deployment. These functions are configured to use CDK-selected runtimes, which will eventually pass out of support and require you to update them.

Prerequisites

You will need permission in an AWS account to create and configure the following resources:

  • An Amazon S3 bucket to store the files and access logs
  • A CloudFront distribution to publicly deliver the files from the S3 bucket
  • A TLS certificate in ACM
  • An origin access identity in IAM that CloudFront will use to access files in Amazon S3
  • Lambda functions, IAM roles, and IAM policies created by CDK custom resources

You might also want to enable these optional services:

  • Amazon Route 53 for setting the necessary DNS records. If your domain is hosted by another DNS provider, you will set these DNS records manually.
  • Amazon SES or an Amazon WorkMail organization with a single mailbox. You can configure either service with a subdomain (for example, [email protected]) such that the existing domain is not disrupted, or you can create new email addresses by using your existing email mailbox provider.

BIMI has some additional requirements:

  • BIMI requires an email domain to have implemented a strong DMARC policy so that recipients can be confident in the authenticity of the branded logos. Your email domain must have a DMARC policy of p=quarantine or p=reject. Additionally, the domain’s policy cannot have sp=none or pct<100.

    Note: Do not adjust the DMARC policy of your domain without careful testing, because this can disrupt mail delivery.

  • You must have your brand’s logo in Scalable Vector Graphics (SVG) format that conforms to the BIMI standard. For more information, see Creating BIMI SVG Logo Files on the BIMI Group website.
  • Purchase a Verified Mark Certificate (VMC) issued by a third-party certificate authority. This certificate attests that the logo, organization, and domain are associated with each other, based on a legal trademark registration. Many email hosting providers require this additional certificate before they will show your branded logo to their users. Others do not currently support BIMI, and others might have alternative mechanisms to determine whether to show your logo. For more information about purchasing a Verified Mark Certificate, see the BIMI Group website.

    Note: If you are not ready to purchase a VMC, you can deploy this solution and validate that BIMI is correctly configured for your domain, but your branded logo will not display to recipients at major email providers.

What gets deployed in this solution?

This solution deploys the DNS records and supporting files that are required to implement BIMI, MTA-STS, and SMTP TLS reporting for an email domain. We’ll look at the deployment in more detail in the following sections.

BIMI

BIMI is described by the Internet Engineering Task Force (IETF) as follows:

Brand Indicators for Message Identification (BIMI) permits Domain Owners to coordinate with Mail User Agents (MUAs) to display brand-specific Indicators next to properly authenticated messages. There are two aspects of BIMI coordination: a scalable mechanism for Domain Owners to publish their desired Indicators, and a mechanism for Mail Transfer Agents (MTAs) to verify the authenticity of the Indicator. This document specifies how Domain Owners communicate their desired Indicators through the BIMI Assertion Record in DNS and how that record is to be interpreted by MTAs and MUAs. MUAs and mail-receiving organizations are free to define their own policies for making use of BIMI data and for Indicator display as they see fit.

If your organization has a trademark-protected logo, you can set up BIMI to have that logo displayed to recipients in their email inboxes. This can have a positive impact on your brand and indicates to end users that your email is more trustworthy. The BIMI Group shows examples of how brand logos are displayed in user inboxes, as well as a list of known email service providers that support the display of BIMI logos.

As a domain owner, you can implement BIMI by publishing the relevant DNS records and hosting the relevant files. To have your logo displayed by most email hosting providers, you will need to purchase a Verified Mark Certificate from a BIMI-qualified certificate authority.

This solution will deploy a valid BIMI record in Route 53 (or tell you what to publish in the DNS if you’re not using Route 53) and will store your provided SVG logo and Verified Mark Certificate files in Amazon S3, to be delivered through CloudFront with a valid TLS certificate from ACM.

To support BIMI, the solution makes the following changes to your resources:

  1. A DNS record of type TXT is published at the following host:
    default._bimi.<your-domain>. The value of this record is: v=BIMI1; l=<url-of-your-logo>; a=<url-of-verified-mark-certificate>. The value of <your-domain> refers to the domain that is used in the From header of messages that your organization sends. (A sketch of publishing this record with the AWS SDK follows this list.)
  2. The logo and optional Verified Mark Certificate are hosted publicly at the HTTPS locations defined by <url-of-your-logo> and <url-of-verified-mark-certificate>, respectively.
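
If you manage the zone yourself rather than letting this solution's CDK stack create the record, publishing it could look like the following sketch with the AWS SDK for Python; the hosted zone ID and URLs are placeholders.

    import boto3

    route53 = boto3.client("route53")

    route53.change_resource_record_sets(
        HostedZoneId="Z0000000000000",  # placeholder hosted zone ID
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "default._bimi.example.com",
                "Type": "TXT",
                "TTL": 3600,
                # TXT record values must be enclosed in double quotes.
                "ResourceRecords": [{"Value":
                    '"v=BIMI1; l=https://bimi-assets.example.com/logo.svg;'
                    ' a=https://bimi-assets.example.com/vmc.pem"'}],
            },
        }]},
    )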

MTA-STS

MTA-STS is described by the IETF in RFC 8461 as follows:

SMTP (Simple Mail Transport Protocol) MTA Strict Transport Security (MTA-STS) is a mechanism enabling mail service providers to declare their ability to receive Transport Layer Security (TLS) secure SMTP connections and to specify whether sending SMTP servers should refuse to deliver to MX hosts that do not offer TLS with a trusted server certificate.

Put simply, MTA-STS helps ensure that email servers always use encryption and certificate-based authentication when sending email to your domains, so that message integrity and confidentiality are preserved while in transit across the internet. MTA-STS also helps to ensure that messages are only sent to authorized servers.

This solution will deploy a valid MTA-STS policy record in Route 53 (or tell you what value to publish in the DNS if you’re not using Route 53) and will create an MTA-STS policy document to be hosted on S3 and delivered through CloudFront with a valid TLS certificate from ACM.

To support MTA-STS, the solution makes the following changes to your resources:

  1. A DNS record of type TXT is published at the following host: _mta-sts.<your-domain>. The value of this record is: v=STSv1; id=<unique value used for cache invalidation>.
  2. The MTA-STS policy document is hosted at and obtained from the following location: https://mta-sts.<your-domain>/.well-known/mta-sts.txt.
  3. The value of <your-domain> in both cases is the domain that is used for routing inbound mail to your organization and is typically the same domain that is used in the From header of messages that your organization sends externally. Depending on the complexity of your organization, you might receive inbound mail for multiple domains, and you might choose to publish MTA-STS policies for each domain.

Is it ever bad to encrypt everything?

In the example MTA-STS policy file provided in the GitHub repository and explained later in this post, the MTA-STS policy mode is set to testing. This means that your email server is advertising its willingness to negotiate encrypted email connections, but it does not require TLS. Servers that want to send mail to you are allowed to connect and deliver mail even if there are problems in the TLS connection, as long as you’re in testing mode. You should expect reports when servers try to connect through TLS to your mail server and fail to do so.
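
For illustration, a policy file in testing mode might look like the following; the MX host name is a placeholder for your own mail exchanger.

    version: STSv1
    mode: testing
    mx: mail.example.com
    max_age: 86400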

Be fully prepared before you change the MTA-STS policy to enforce. After this policy is set to enforce, servers that follow the MTA-STS policy and that experience an enforceable TLS-related error when they try to connect to your mail server will not deliver mail to your mail server. This is a difficult situation to detect. You will simply stop receiving email from servers that comply with the policy. You might receive reports from them indicating what errors they encountered, but it is not guaranteed. Be sure that the email address you provide in SMTP TLS reporting (in the following section) is functional and monitored by people who can take action to fix issues. If you miss TLS failure reports, you probably won’t receive email. If the TLS certificate that you use on your email server expires, and your MTA-STS policy is set to enforce, this will become an urgent issue and will disrupt the flow of email until it is fixed.

SMTP TLS reporting

SMTP TLS reporting is described by the IETF in RFC 8460 as follows:

A number of protocols exist for establishing encrypted channels between SMTP Mail Transfer Agents (MTAs), including STARTTLS, DNS-Based Authentication of Named Entities (DANE) TLSA, and MTA Strict Transport Security (MTA-STS). These protocols can fail due to misconfiguration or active attack, leading to undelivered messages or delivery over unencrypted or unauthenticated channels. This document describes a reporting mechanism and format by which sending systems can share statistics and specific information about potential failures with recipient domains. Recipient domains can then use this information to both detect potential attacks and diagnose unintentional misconfigurations.

As you gain the security benefits of MTA-STS, SMTP TLS reporting will allow you to receive reports from other internet email providers. These reports contain information that is valuable when monitoring your TLS security posture, identifying problems, and learning about attacks that might be occurring.

This solution will deploy a valid SMTP TLS reporting record on Route 53 (or provide you with the value to publish in the DNS if you are not using Route 53).

To support SMTP TLS reporting, the solution makes the following changes to your resources:

  1. A DNS record of type TXT is published at the following host: _smtp._tls.<your-domain>. The value of this record is: v=TLSRPTv1; rua=mailto:<report-receiver-email-address>
  2. The value of <report-receiver-email-address> might be an address in your domain or in a third-party provider. Automated systems that process these reports must be capable of processing GZIP-compressed files and parsing JSON; a minimal parsing sketch follows this list.
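
As a minimal sketch of that processing step in Python, assuming a report has already been saved locally as report.json.gz and follows the RFC 8460 JSON schema:

    import gzip
    import json

    # TLS reports arrive as GZIP-compressed JSON documents (RFC 8460).
    with gzip.open("report.json.gz", "rt", encoding="utf-8") as f:
        report = json.load(f)

    for policy in report.get("policies", []):
        summary = policy.get("summary", {})
        print(policy["policy"]["policy-domain"],
              "successful sessions:", summary.get("total-successful-session-count", 0),
              "failed sessions:", summary.get("total-failure-session-count", 0))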

Deploy the solution with the AWS CDK

In this section, you’ll learn how to deploy the solution to create the previously described AWS resources in your account.

  1. Clone the following GitHub repository:

    git clone https://github.com/aws-samples/serverless-mail
    cd serverless-mail/email-security-records

  2. Edit CONFIG.py to reflect your desired settings, as follows:
    1. If no Verified Mark Certificate is provided, set VMC_FILENAME = None.
    2. If your DNS zone is not hosted on Route 53, or if you do not want this app to manage Route 53 DNS records, set ROUTE_53_HOSTED = False. In this case, you will need to set TLS_CERTIFICATE_ARN to the Amazon Resource Name (ARN) of a certificate hosted on ACM in us-east-1. This certificate is used by CloudFront and must support two subdomains: mta-sts and your configured BIMI_ASSET_SUBDOMAIN.
  3. Finalize the preparation, as follows:
    1. Place your BIMI logo and Verified Mark Certificate files in the assets folder.
    2. Create an MTA-STS policy file at assets/.well-known/mta-sts.txt to reflect your mail exchange (MX) servers and policy requirements. An example file is provided at assets/.well-known/mta-sts.txt.example.
  4. Deploy the solution, as follows:
    1. Open a terminal in the email-security-records folder.
    2. (Recommended) Create and activate a virtual environment by running the following commands.
      python3 -m venv .venv
      source .venv/bin/activate
    3. Install the Python requirements in your environment with the following command.
      pip install -r requirements.txt
    4. Assume a role in the target account that has the permissions outlined in the Prerequisites section of this post.

      Using AWS CDK version 2.17.0 or later, deploy the bootstrap in the target account by running the following command. To learn more, see Bootstrapping in the AWS CDK Developer Guide.
      cdk bootstrap

    5. Run the following command to synthesize the CloudFormation template. Review the output of this command to verify what will be deployed.
      cdk synth
    6. Run the following command to deploy the CloudFormation template. You will be prompted to accept the IAM changes that will be applied to your account.
      cdk deploy

      Note: If you use Route 53, these records are created and activated in your DNS zones as soon as the CDK finishes deploying. As the records propagate through the DNS, they will gradually start affecting email in the affected domains.

    7. If you’re not using Route 53 and instead are using a third-party DNS provider, create the CNAME and TXT records as indicated. In this case, your email is not affected by this solution until you create the records in DNS.

Testing and troubleshooting

After you have deployed the CDK solution, you can test it to confirm that the DNS records and web resources are published correctly.

BIMI

  1. Query the BIMI DNS TXT record for your domain by using the dig or nslookup command in your terminal.

    dig +short TXT default._bimi.<your-domain.example>

    Verify the response. For example:

    "v=BIMI1; l=https://bimi-assets.<your-domain.example>/logo.svg"

  2. In your web browser, open the URL from that response (for example, https://bimi-assets.<your-domain.example>/logo.svg) to verify that the logo is available and that the HTTPS certificate is valid.
  3. The BIMI group provides a tool to validate your BIMI configuration. This tool will also validate your VMC if you have purchased one.

MTA-STS

  1. Query the MTA-STS DNS TXT record for your domain.

    dig +short TXT _mta-sts.<your-domain.example>

    The value of this record is as follows:

    v=STSv1; id=<unique value used for cache invalidation>

  2. You can load the MTA-STS policy document using your web browser. For example, https://mta-sts.<your-domain.example>/.well-known/mta-sts.txt
  3. You can also use third-party tools to examine your MTA-STS configuration, such as MX Toolbox.

TLS reporting

  1. Query the TLS reporting DNS TXT record for your domain.

    dig +short TXT _smtp._tls.<your-domain.example>

    Verify the response. For example:

    "v=TLSRPTv1; rua=mailto:<your email address>"

  2. You can also use third-party tools to examine your TLS reporting configuration, such as Easy DMARC.

Depending on which domains you communicate with on the internet, you will begin to see TLS reports arriving at the email address that you have defined in the TLS reporting DNS record. We recommend that you closely examine the TLS reports, and use automated analytical techniques over an extended period of time before changing the default testing value of your domain’s MTA-STS policy. Not every email provider will send TLS reports, but examining the reports in aggregate will give you a good perspective for making changes to your MTA-STS policy.

Cleanup

To remove the resources created by this solution:

  1. Open a terminal in the email-security-records folder.
  2. Assume a role in the target account with permission to delete resources.
  3. Run cdk destroy.

Note: The asset and log buckets are automatically emptied and deleted by the cdk destroy command.

Conclusion

When external systems send email to or receive email from your domains, they will now query your new DNS records and look up your domain’s BIMI, MTA-STS, and TLS reporting information from your new CloudFront distribution. By adopting the email domain security mechanisms outlined in this post, you can improve the overall security posture of your email environment, as well as the perception of your brand.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Michael Davie

Michael is a Senior Industry Specialist with AWS Security Assurance. He works with our customers, their regulators, and AWS teams to help raise the bar on secure cloud adoption and usage. Michael has over 20 years of experience working in the defence, intelligence, and technology sectors in Canada and is a licensed professional engineer.

Jesse Thompson

Jesse is an Email Deliverability Manager with the Amazon Simple Email Service team. His background is in enterprise IT development and operations, with a focus on email abuse mitigation and encouragement of authenticity practices with open standard protocols. Jesse’s favorite activity outside of technology is recreational curling.

Building an event-driven solution for AvalonBay property leasing and search

Post Syndicated from Kausik Dey original https://aws.amazon.com/blogs/architecture/building-an-event-driven-solution-for-avalonbay-property-leasing-and-search/

In this blog post, we show you how to build an event-driven and serverless solution for property leasing and search that is scalable and resilient. This solution was created for AvalonBay Communities, Inc.—a leading residential Real Estate Investment Trust (REIT). It enables:

  • More than 150,000 multi-parameter searches per day
  • The processing of more than 3,500 lease applications and 85,000 individual rent payments per month

Introduction

AvalonBay is an equity REIT. The company has a long track record of developing, redeveloping, acquiring, and managing apartment homes in top U.S. markets. AvalonBay builds long-term value for customers using innovative technology solutions.

The company understands that data-driven insights contribute to targeted business growth. But AvalonBay found that managing the complex interdependencies between multiple data sets—from real estate and property management systems to financial and payment systems—required a new solution.

The challenge

AvalonBay owned or held a direct or indirect ownership interest in 293 apartment communities containing 88,405 apartment homes in 12 states and Washington, D.C., as of September 30, 2022.

Of these, 18 communities were under development and one was under redevelopment. This presented a unique challenge to both internal and external users looking to search and lease apartment units based on multi-parameter selection criteria in geographically dispersed regions—for example, finding units in buildings with specific amenities, lease terms, furnishings, and availability dates.

Overview of solution

AvalonBay’s fully managed leasing solution for applicants and residents is hosted by Amazon Web Services (AWS). The solution is secure, autoscaling, and multi-region, ensuring resiliency and performance with efficient resource usage.

In this event-driven solution, AvalonBay’s leasing service is hosted in multiple AWS Regions to provide low-latency responses to users across various geolocations. This blog post focuses on the use case implementation in only one Region—Region East—as shown in Figure 1.

AvalonBay lease processing platform

Figure 1. AvalonBay lease processing platform

Several AWS services come together in this solution to meet key company objectives. Let’s explore each one and its purpose within the architecture.

    1. Amazon Route 53: For AvalonBay’s Lease Processing solution, any non-transient service failure is unacceptable. In addition to providing a high degree of resiliency through a Multi-AZ architecture, Lease Processing also provides regional-level high availability through its multi-Region active-active architecture. Route 53 with latency-based routing allows dynamic rerouting of requests within seconds to alternate Regions.
    2. Amazon API Gateway: Route 53 latency-based routing is configured across multiple AWS Regions to route traffic to an API Gateway endpoint in each Region. API Gateway authorizers were added to control access to APIs using an Amazon Cognito user pool.
    3. AWS Lambda with provisioned concurrency: The Lambda functions are set up for automatic scaling, secured through a private subnet, and span Availability Zones. This provides horizontal scaling capability, self-healing capacity, and resiliency across Availability Zones. Provisioned concurrency minimizes cold starts by keeping execution environments initialized, which also greatly reduces API invocation times.
    4. Amazon Aurora Serverless v2, PostgreSQL-compatible edition: Aurora Serverless v2 is an on-demand, autoscaling configuration for Amazon Aurora, and the PostgreSQL-compatible edition is used for the Lease Processing solution. The global database is configured with two Regions: us-east-1 hosts the primary cluster and us-west-2 the secondary cluster. Automated Aurora global database endpoint management for planned and unplanned failover is configured through a Route 53 private hosted zone, Amazon EventBridge, and Lambda.
    5. Amazon RDS Proxy for Aurora: Amazon RDS Proxy allows the leasing application to pool and share database connections to improve its ability to scale. It also makes the leasing solution more resilient to database failures by automatically connecting to a standby database instance while preserving application connections.
    6. Amazon EventBridge: EventBridge supports the solution through two primary purposes:
      1. Oversees lease flow events – During the lease application process, the solution generates various events which are consumed by AvalonBay and external applications such as property management, finance portals, administration, and more. Leasing events are sent to EventBridge and various event rules are configured for multiple destinations including Lambda, Amazon Simple Notification Service (Amazon SNS), and external API endpoints.
      2. Handles global Aurora V2 failover – Aurora generates events when certain actions are taken or events occur, including any type of global database activity.
        • When a managed planned failover is initiated—either via the AWS Command Line Interface (AWS CLI), API, or console—the global database failover process starts and generates an event.
        • When an Aurora cluster is removed from a global cluster through the AWS CLI, API, or console, the Aurora cluster is promoted as a single primary cluster. Once this process is completed, it generates an event.

An EventBridge rule is created to match the event pattern emitted whenever a global database managed planned failover completes successfully in a Region. When the completion event is detected, the rule invokes a Lambda function that updates the Amazon CloudFront CNAME record to the correct value.
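
A simplified sketch of such a function follows; the hosted zone ID, record name, and endpoint are placeholders standing in for AvalonBay's actual configuration.

    import boto3

    route53 = boto3.client("route53")

    def handler(event, context):
        # Invoked by the EventBridge rule when a global database failover completes.
        # Repoint the application-facing record at the newly promoted Region's endpoint.
        route53.change_resource_record_sets(
            HostedZoneId="Z0000000000000",  # placeholder private hosted zone ID
            ChangeBatch={"Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "db.lease.internal",  # hypothetical application-facing record
                    "Type": "CNAME",
                    "TTL": 30,
                    "ResourceRecords": [{"Value":
                        "lease-cluster.cluster-xyz.us-west-2.rds.amazonaws.com"}],  # placeholder
                },
            }]},
        )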

Scalable search solution

Leasing professionals need to easily scan a huge amount of property information using AvalonBay’s Search Solution to obtain the details they require.

Using the Amazon OpenSearch Service, agents can generate property profiles and other asset data to identify matching units and quickly respond to end customers. OpenSearch is a fully open-source search and analytics engine that securely unlocks real-time search, monitoring, and analysis of business and operational data. It is employed for use cases such as application monitoring, log analytics, observability, and website search.

The AvalonBay Search Service solution architecture featuring OpenSearch is shown in Figure 2.

AvalonBay Search Service solution architecture

Figure 2: AvalonBay Search Service solution architecture

AvalonBay search requires search criteria including keyword and Universal Resource Identifier (URI) search, SQL-based search, and custom package search, all of which are detailed in the Amazon OpenSearch Service Developer Guide.

OpenSearch automatically detects and replaces failed OpenSearch Service nodes, reducing the overhead associated with self-managed infrastructures.

Let’s explore this architecture step by step.

  1. Amazon Kinesis event stream – The AvalonBay community requires near real-time updates to search attributes such as amenities, features, promotions, and pricing. Events created by various producers are streamed through Kinesis and inserted into, or updated in, OpenSearch (see the sketch after this list).
  2. Amazon OpenSearch – OpenSearch is used for end-to-end community search—a managed service making it easy to deploy, operate, and scale OpenSearch clusters in the AWS Cloud. As community search data is read-only, UltraWarm and cold storage are also used based on usage frequency.
  3. Amazon Simple Storage Service (Amazon S3): Various community documents, policies, and image or video files are key search elements. They must be maintained securely and reliably for years due to contractual obligations. Amazon S3 simplifies this task with high durability, lifecycle rules, and varied controls for retention.
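
As a rough sketch of the indexing step in item 1, using the open-source opensearch-py client; the domain endpoint, index name, and document shape are assumptions, and authentication and request signing are omitted for brevity.

    import json
    from opensearchpy import OpenSearch  # pip install opensearch-py

    client = OpenSearch(
        hosts=[{"host": "search-communities.example.us-east-1.es.amazonaws.com",  # placeholder
                "port": 443}],
        use_ssl=True,
    )

    def process_kinesis_record(record):
        # Each Kinesis record carries a community attribute update; upserting by a
        # stable ID keeps search results reflecting near real-time pricing and amenities.
        doc = json.loads(record["Data"])
        client.index(index="communities", id=doc["unit_id"], body=doc)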

Conclusion

This post showed how AvalonBay has built and deployed custom leasing and search solutions on AWS serverless platforms without compromising resiliency, performance, and capacity requirements. This is a 24/7, fully managed solution with no additional equipment on-premises.

Choosing AWS for leasing and search solutions gives AvalonBay the ability to dynamically scale and meet future growth demands while introducing cost advantages. In addition, the global availability of AWS services makes it possible to deploy services across geographic locations to meet performance requirements.

AWS Lambda: Resilience under-the-hood

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/compute/aws-lambda-resilience-under-the-hood/

This post is written by Adrian Hornsby (Principal System Dev Engineer) and Marcia Villalba (Principal Developer Advocate).

AWS Lambda comprises over 80 services working together to provide the serverless compute service that it offers to customers. Under the hood, many of these services are built on top of Amazon Elastic Compute Cloud (Amazon EC2) instances, provisioned within Availability Zones. However, AWS Lambda is a Regional service. This means that customers use Lambda services from the Region level and its services are designed to be resilient to impairments that the underlying Availability Zones might have.

This blog post discusses how a Regional service such as Lambda takes advantage of Availability Zones and static stability to achieve its high availability target, and shows how Lambda teams verify their service’s static stability using AWS Fault Injection Simulator (AWS FIS). It also provides a solution using AWS services and tools to achieve Lambda’s resiliency strategy, using FIS, Amazon CloudWatch, and Amazon Route 53 Application Recovery Controller (Route 53 ARC).

The role of Availability Zones

Availability Zones are physically isolated sections of an AWS Region, designed to operate but also fail independently. They are separated by a meaningful distance from each other, up to 100 kilometers (60 miles), to prevent correlated failures, but close enough to use synchronous replication with single-digit millisecond latency.

Customers and AWS services have been using Availability Zones for years to build highly available, fault-tolerant, and scalable applications. In particular, AWS Regional services such as AWS Lambda, Amazon DynamoDB, Amazon Simple Queue Service (Amazon SQS), and Amazon Simple Storage Service (Amazon S3) have achieved their high availability promises by spreading multiple independent replicas of their services across multiple Availability Zones. This approach uses the principles of independence and redundancy of Availability Zones to maximize the overall availability of the service.

Each replica is called a zonal replica. The system is designed so that any of the replicas can fail at any time. When a replica fails, it can be temporarily removed from the system until everything works as expected again. When that happens, the load is shared between the remaining zonal replicas.

Designing for failures

One lesson we learned at AWS when building services is when there is an Availability Zone impairment, it is better not to rely on control plane operations to remediate the failure. A control plane operation can, for example, be provisioning more capacity in an Availability Zone that is not affected by the impairment.

This principle is called static stability, and it describes the capability for a system to keep its original steady-state (or behavior) even when subjected to disruptive events without having to make any changes. A statically stable service should have as few dependencies as possible for its recovery process.

For a Regional service like AWS Lambda, this means that the remaining capacity in the healthy Availability Zones can absorb the traffic from a potentially impaired Availability Zone without having to scale up. This implies over-provisioning resources in all Availability Zones. Having that extra capacity pre-provisioned helps Lambda achieve its static stability. It is a tradeoff between the cost of over-provisioning resources and service availability. Since AWS Lambda promises high availability to its customers, with a monthly uptime service commitment of 99.95%, that tradeoff falls towards service availability.

How to prepare for failures

Preparing for an Availability Zone impairment is difficult because the symptoms and size of the impact can vary widely. An Availability Zone may be partially accessible or totally unreachable, and everything in between. Causes for the impairment can range from fiber cuts, power issues, overheating, hardware malfunctions, networking problems, capacity issues, and other unexpected situations. While those happen, they happen rarely. The most common categories of failures are bad deployments and bad configurations.

While some of these failures can be difficult to infer or reproduce, common symptoms include disruption of connectivity, increased latency, increased traffic due to retry storms, increased CPU and memory usage, and slow I/O.

At AWS, we learned to expect the unexpected and plan for failure. This means injecting faults in the system to reproduce some of the common symptoms of Availability Zone impairments, then observe how the system responds, and implement improvements. In addition, injecting faults in the system helps uncover potential monitoring and alarming blind spots, and gives an opportunity for teams to practice and improve their response to events with a focus on reducing time to recovery.

How Lambda tests its response to an Availability Zone impairment

Lambda’s approach to being resilient to Availability Zone impairments is to rely on static stability and automated systems. Humans are slower than machines for detecting issues and mitigating them. Therefore, Lambda must ensure that its services can detect issues within a zonal replica and remediate automatically within minutes and with no operator intervention. This auto-remediation is done by shifting customer traffic away from the affected Availability Zone to healthy ones, and it is called Availability Zone evacuation.

To do this, Lambda built a tool that detects failures and performs the Availability Zone evacuation when needed. This tool does a statistical comparison of metrics between different Availability Zones and EC2 instances in order to identify unhealthy Availability Zones. If an Availability Zone is found to have issues, the tool starts the evacuation out of the unhealthy Availability Zone automatically. This automation cuts the time to the first action from 30 minutes to less than 3 minutes.

How AWS Lambda uses AWS FIS

To verify the automation continuously works as expected, Lambda performs a wide variety of tests, which includes Availability Zone failure testing in their pre-production environment. The main objective of these tests is to verify the services are statically stable in the presence of Availability Zone impairments, and to verify that the Availability Zone evacuation can be successfully initiated. The benefit of having an automated test is that teams can repeat it regularly and don’t need to have special skills. One click is all it takes to launch the test.

For these tests, Lambda uses AWS FIS to inject faults into their large fleet of EC2 instances. They use AWS FIS with support of the AWS Systems Manager (SSM) agent and resource filters to target their fleet of EC2 instances in a particular Availability Zone. This is a versatile approach that can inject resource faults, such as CPU and memory exhaustion, and networking faults, such as packet latency, loss, or drop.

Injecting packet loss or latency is very important, since these symptoms can have a serious impact on application and network performance. Indeed, latency and loss, even in small quantities, can create inefficiencies and prevent applications from running at their peak performance. For Lambda, being able to detect increased latency or loss before it affects customers is critical.

How to recover your applications rapidly from Availability Zones failures

You can build a similar solution to rapidly recover your applications from a zonal failure. The solution must have a mechanism to evacuate an impaired Availability Zone, a monitoring system that allows you to detect when a zonal replica is impaired, and a way to test the static stability of your system. AWS provides many tools and services that can help you build this solution to achieve Lambda’s resiliency strategy.

For performing Availability Zone evacuation, you can use the new zonal shift capability from Route 53 ARC, which at the time of writing is in preview. Zonal shift lets you evacuate an Availability Zone for applications that use Elastic Load Balancing. If you find that a zonal replica is impaired or unhealthy, you can use zonal shift to evacuate the Availability Zone for a period of time while the issue gets fixed.
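
Starting a shift programmatically might look like the following sketch with the AWS SDK for Python; the load balancer ARN and Availability Zone ID are placeholders, and the API surface may evolve while the capability is in preview.

    import boto3

    arc = boto3.client("arc-zonal-shift")

    # Shift traffic away from an impaired Availability Zone for one hour.
    arc.start_zonal_shift(
        resourceIdentifier=(
            "arn:aws:elasticloadbalancing:us-east-1:111122223333:"
            "loadbalancer/app/my-alb/1234567890abcdef"),  # placeholder ALB ARN
        awayFrom="use1-az1",  # AZ ID of the impaired zone
        expiresIn="60m",      # shifts expire automatically as a safety measure
        comment="Zonal replica unhealthy; evacuating",
    )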

For performing the zonal shift, you must detect when a zonal replica is unhealthy. Your application must provide a signal of its health per Availability Zone. There are two common ways to capture this signal. First, passively, you can check your metrics, like response times, HTTP status codes, and other metrics that can help track fatal errors in your applications. Or actively, using synthetic monitoring, which allows you to create synthetic requests against your production application to provide a more complete view of the customer experience.

Amazon CloudWatch Synthetics provides canaries, which are scripts that run on a schedule and perform synthetic requests in your application endpoints and APIs. Canaries perform the same actions as customers and continuously verify the customer experience. You can create a canary for each zonal replica of your application and monitor the results independently.

With this information, if the user experience diminishes in one of the replicas, you can start an Availability Zone evacuation using zonal shift and minimize the bad experience for the user while you find and fix the sources of the failure.

To ensure that you can successfully recover from a failure, you must test the solution in advance. Without testing, it is just an assumption. To prove or disprove your assumptions about your system’s capability to handle disruptive events such as issues within an Availability Zones, you can use FIS.

With FIS, you can inject faults simultaneously in multiple resources within the same failure domain, such as Availability Zones. FIS currently integrates with several AWS services including EC2, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), Amazon Relational Database Service (Amazon RDS), AWS Networking, and CloudWatch.

Typical use cases for testing a workload’s resilience to Availability Zones impairment are, for example, terminating all compute resources and databases within a particular Availability Zone, injecting latency or packet loss, increasing resource consumption (CPU, memory, and I/O) in compute resources in a particular Availability Zone, or impacting network communication within or between Availability Zones.

For more information and a step-by-step example of how to recover rapidly from application failures in a single Availability Zone and testing it with AWS FIS, read this blog post.

Conclusion

This article discusses static stability, a mechanism that is used by AWS services such as Lambda to build resilient Regional services. It also discusses how AWS takes advantage of the same services and infrastructure as customers. It shows how Lambda uses multiple Availability Zones and services like AWS FIS to build highly available services and improve its recovery time from unexpected failures to only a few minutes without human intervention. Finally, it shows a solution that you can implement for your applications to achieve Lambda’s resilience strategy.

To learn more about AWS FIS, there are many tutorials and a workshop you can check out.

For more serverless learning resources, visit Serverless Land.

Enabling load-balancing of non-HTTP(s) traffic on AWS Wavelength

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/enabling-load-balancing-of-non-https-traffic-on-aws-wavelength/

This blog post is written by Jack Chen, Telco Solutions Architect, and Robert Belson, Developer Advocate.

AWS Wavelength embeds AWS compute and storage services within 5G networks, providing mobile edge computing infrastructure for developing, deploying, and scaling ultra-low-latency applications. AWS recently introduced support for Application Load Balancer (ALB) in AWS Wavelength zones. Although ALB addresses Layer-7 load balancing use cases, some low latency applications that get deployed in AWS Wavelength Zones rely on UDP-based protocols, such as QUIC, WebRTC, and SRT, which can’t be load-balanced by Layer-7 Load Balancers. In this post, we’ll review popular load-balancing patterns on AWS Wavelength, including a proposed architecture demonstrating how DNS-based load balancing can address customer requirements for load-balancing non-HTTP(s) traffic across multiple Amazon Elastic Compute Cloud (Amazon EC2) instances. This solution also builds a foundation for automatic scale-up and scale-down capabilities for workloads running in an AWS Wavelength Zone.

Load balancing use cases in AWS Wavelength

In the AWS Regions, customers looking to deploy highly-available edge applications often consider Amazon Elastic Load Balancing (Amazon ELB) as an approach to automatically distribute incoming application traffic across multiple targets in one or more Availability Zones (AZs). However, at the time of this publication, AWS-managed Network Load Balancer (NLB) isn’t supported in AWS Wavelength Zones and ALB is being rolled out to all AWS Wavelength Zones globally. As a result, this post will seek to document general architectural guidance for load balancing solutions on AWS Wavelength.

As one of the most prominent AWS Wavelength use cases, highly-immersive video streaming over UDP using protocols such as WebRTC at scale often require a load balancing solution to accommodate surges in traffic, either due to live events or general customer access patterns. These use cases, relying on Layer-4 traffic, can’t be load-balanced from a Layer-7 ALB. Instead, Layer-4 load balancing is needed.

To date, two infrastructure deployments involving Layer-4 load balancers are most often seen:

  • Amazon EC2-based deployments: Often the environment of choice for earlier-stage enterprises and ISVs, a fleet of EC2 instances will leverage a load balancer for high-throughput use cases, such as video streaming, data analytics, or Industrial IoT (IIoT) applications
  • Amazon EKS deployments: Customers looking to optimize performance and cost efficiency of their infrastructure can leverage containerized deployments at the edge to manage their AWS Wavelength Zone applications. In turn, external load balancers could be configured to point to exposed services via NodePort objects. Furthermore, a more popular choice might be to leverage the AWS Load Balancer Controller to provision an ALB when you create a Kubernetes Ingress.

Regardless of deployment type, the following design constraints must be considered:

  • Target registration: For load balancing solutions not managed by AWS, seamless solutions to load balancer target registration must be managed by the customer. As one potential solution, visit a recent HAProxyConf presentation, Practical Advice for Load Balancing at the Network Edge.
  • Edge Discovery: Although DNS records can be populated into Amazon Route 53 for each carrier-facing endpoint, DNS won’t deterministically route mobile clients to the most optimal mobile endpoint. When available, edge discovery services are required to most effectively route mobile clients to the lowest latency endpoint.
  • Cross-zone load balancing: Given the hub-and-spoke design of AWS Wavelength, customer-managed load balancers should proxy traffic only to that AWS Wavelength Zone.

Solution overview – Amazon EC2

In this post, we’ll present a highly available load balancing solution for a single AWS Wavelength Zone in an Amazon EC2-based deployment. In a separate post, we’ll cover the needed configurations for the AWS Load Balancer Controller in AWS Wavelength for Amazon Elastic Kubernetes Service (Amazon EKS) clusters.

The proposed solution introduces DNS-based load balancing, a technique to abstract away the complexity of intelligent load-balancing software and allow your Domain Name System (DNS) resolvers to distribute traffic (equally, or in a weighted distribution) to your set of endpoints.

Our solution leverages the weighted routing policy in Route 53 to resolve inbound DNS queries to multiple EC2 instances running within an AWS Wavelength zone. As EC2 instances for a given workload get deployed in an AWS Wavelength zone, Carrier IP addresses can be assigned to the network interfaces at launch.

Through this solution, Carrier IP addresses attached to AWS Wavelength instances are automatically added as DNS records for the customer-provided public hosted zone.

To determine how Route 53 responds to queries, given an arbitrary number of records in a public hosted zone, Route 53 offers numerous routing policies:

Simple routing policy – Use this policy when you must route traffic to a single resource in an AWS Wavelength Zone. A single record can contain multiple IP addresses, but Route 53 returns the values to the client in a random order.

Weighted routing policy – Select this policy to route traffic more deterministically, using a set of proportions that you specify. For example, for Carrier IP A and Carrier IP B to each receive 50% of the traffic, create two individual A records (one per Carrier IP), each with a weight of 50. Learn more about Route 53 routing policies in the Route 53 Developer Guide.

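To make the weighted setup concrete, here is a minimal sketch using the Python boto3 SDK; the hosted zone ID, record name, and Carrier IP addresses are all placeholder values:

import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0123456789EXAMPLE"           # placeholder public hosted zone ID
RECORD_NAME = "www.example.com"                 # placeholder domain name
CARRIER_IPS = ["203.0.113.10", "203.0.113.20"]  # placeholder Carrier IP addresses

# One weighted A record per Carrier IP; equal weights yield a 50/50 split.
changes = [
    {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": RECORD_NAME,
            "Type": "A",
            "SetIdentifier": f"wavelength-endpoint-{i}",  # must be unique per weighted record
            "Weight": 50,
            "TTL": 30,  # low TTL to limit stale answers (see the TTL discussion below)
            "ResourceRecords": [{"Value": ip}],
        },
    }
    for i, ip in enumerate(CARRIER_IPS)
]

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Comment": "Weighted records for Wavelength Carrier IPs", "Changes": changes},
)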

Reference architecture

The following diagram illustrates the load balancing component of the solution, where EC2 instances in an AWS Wavelength Zone are assigned Carrier IP addresses. A weighted DNS record for a host (e.g., www.example.com) is updated with the Carrier IP addresses.

DNS-based load balancing

When a device makes a DNS query, it receives one of the Carrier IP addresses associated with the given domain name. With a large number of devices, we expect a fair distribution of load across all EC2 instances in the resource pool. However, given how ephemeral mobile edge environments are, Carrier IPs may frequently be allocated to accommodate a workload and released shortly thereafter. This unpredictable behavior could yield stale DNS records, resulting in a “blackhole”: routes to endpoints that no longer exist.

Time-To-Live (TTL) is a DNS attribute that specifies the amount of time, in seconds, that you want DNS recursive resolvers to cache information about this record.

In our example, we set the TTL to 30 seconds to force DNS resolvers to retrieve the latest records from the authoritative nameservers and minimize stale DNS responses. However, a lower TTL has a direct impact on cost, as recursive resolvers make an increased number of calls to Route 53 to retrieve the latest records.

The core components of the solution in the AWS Wavelength Zone are the EC2 instances that host the workload, each with a Carrier IP address assigned to its network interface at launch, managed by an Amazon EC2 Auto Scaling group.

Alongside the services above in the AWS Wavelength Zone, the following services are also leveraged in the AWS Region:

  • AWS Lambda – a serverless, event-driven function that makes API calls to the Route 53 service to update DNS records.
  • Amazon EventBridge – a serverless event bus that reacts to EC2 instance lifecycle events and invokes the Lambda function to make DNS updates.
  • Amazon Route 53 – a cloud DNS service with a domain record pointing to AWS Wavelength-hosted resources.

In this post, we intentionally leave the specific load balancing software solution up to the customer. Customers can leverage various popular load balancers available on the AWS Marketplace, such as HAProxy and NGINX. To keep the focus on the auto-registration of DNS records, this solution is designed to support stateless workloads only. To support stateful workloads, sticky sessions – a process in which requests are routed to the same target in a target group – must be configured by the underlying load balancer solution; they are outside the scope of what DNS can provide natively.

Automation overview

Using the aforementioned components, we can implement the following workflow automation:

Event-driven Auto Scaling Workflow

An Amazon CloudWatch alarm can trigger an Auto Scaling group scale-out or scale-in event by adding or removing EC2 instances. EventBridge detects the EC2 instance state-change event and invokes the Lambda function. This function updates the DNS record in Route 53 by either adding (scale out) or deleting (scale in) the weighted A record associated with the EC2 instance changing state. A sketch of such a function follows.
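As an illustrative sketch only (this is not the packaged function referenced later; the environment variables and record conventions are assumptions), the Lambda handler might look like the following:

import os
import boto3

route53 = boto3.client("route53")
ec2 = boto3.client("ec2")

HOSTED_ZONE_ID = os.environ["HOSTED_ZONE_ID"]  # assumed to be supplied via environment variables
RECORD_NAME = os.environ["RECORD_NAME"]        # e.g., www.example.com (placeholder)

def handler(event, context):
    # EventBridge delivers "EC2 Instance State-change Notification" events
    # with the instance ID and new state in the event detail.
    instance_id = event["detail"]["instance-id"]
    state = event["detail"]["state"]

    # Look up the Carrier IP attached to the instance's network interface.
    # For scale-in, the Carrier IP may already be released; a production
    # version could cache it (for example, in Parameter Store) so that the
    # DELETE request matches the record that was originally created.
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    interfaces = reservations[0]["Instances"][0]["NetworkInterfaces"]
    carrier_ip = interfaces[0].get("Association", {}).get("CarrierIp")
    if not carrier_ip:
        return  # nothing to register or deregister for this instance

    action = "UPSERT" if state == "running" else "DELETE"
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={"Changes": [{
            "Action": action,
            "ResourceRecordSet": {
                "Name": RECORD_NAME,
                "Type": "A",
                "SetIdentifier": instance_id,  # one weighted record per instance
                "Weight": 50,
                "TTL": 30,
                "ResourceRecords": [{"Value": carrier_ip}],
            },
        }]},
    )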

Configuration of the auto scaling policy is out of the scope of this post. There are many scaling triggers that you can consider, based on predefined and custom metrics such as memory utilization. For demo purposes, we use manual scaling.

In addition to the core components already described, our solution also utilizes AWS Identity and Access Management (IAM) policies and CloudWatch. Both services are key components of building AWS Well-Architected solutions. We also use AWS Systems Manager Parameter Store to keep track of user input parameters. The deployment of the solution is automated via AWS CloudFormation templates. The Lambda function provided should be uploaded to an Amazon Simple Storage Service (Amazon S3) bucket.

Amazon Virtual Private Cloud (Amazon VPC), subnets, carrier gateways, and route tables are foundational building blocks for AWS-based networking infrastructure. In our deployment, we create a new VPC, one subnet in an AWS Wavelength Zone of your choice, and a carrier gateway, and we update the subnet’s route table to point the default route to the carrier gateway.

Wavelength VPC architecture.

Deployment prerequisites

The following are prerequisites to deploy the described solution in your account:

  • Access to an AWS Wavelength Zone. If your account is not allow-listed to use AWS Wavelength Zones, then opt in to AWS Wavelength Zones here.
  • Public DNS Hosted Zone hosted in Route 53. You must have access to a registered public domain to deploy this solution. The zone for this domain should be hosted in the same account where you plan to deploy AWS Wavelength workloads.
    If you don’t have a public domain, then you can register a new one. Note that there will be a service charge for the domain registration.
  • Amazon S3 bucket. For the Lambda function that updates DNS records in Route 53, store the source code as a .zip file in an Amazon S3 bucket.
  • Amazon EC2 key pair. You can use an existing key pair for the deployment. If you don’t have a key pair in the Region where you plan to deploy this solution, then create one by following these instructions.
  • 4G- or 5G-connected device. Although the infrastructure can be deployed independently of the underlying connected devices, testing connectivity requires a mobile device on one of the Wavelength partners’ networks. View the complete list of telecommunications providers and Wavelength Zone locations to learn more.

Conclusion

In this post, we demonstrated how to implement DNS-based load balancing for workloads running in an AWS Wavelength Zone. We deployed a solution that uses an EventBridge rule and a Lambda function to update DNS records hosted by Route 53. If you want to learn more about AWS Wavelength, subscribe to the AWS Compute Blog channel here.

Building highly resilient applications with on-premises interdependencies using AWS Local Zones

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/building-highly-resilient-applications-with-on-premises-interdependencies-using-aws-local-zones/

This blog post is written by Rachel Rui Liu, Senior Solutions Architect.

AWS Local Zones are a type of infrastructure deployment that places compute, storage, database, and other select AWS services close to large population and industry centers.

Following the successful launch of AWS Local Zones in 16 US cities since 2019, AWS announced in February 2022 plans to launch new AWS Local Zones in 32 metropolitan areas across 26 countries worldwide.

With Local Zones, we’ve seen use cases in two common categories.

The first category of use cases is for workloads that require extremely low latency between end-user devices and workload servers. For example, let’s consider media content creation and real-time multiplayer gaming. For these use cases, deploying the workload to a Local Zone can help achieve down to single-digit millisecond latency between end-user devices and the AWS infrastructure, which is ideal for a good end-user experience.

This post focuses on the second category of use cases, commonly seen in enterprise hybrid architectures, where customers must achieve low latency between AWS infrastructure and existing on-premises data centers. Compared to the first category, these use cases can tolerate slightly higher latency between end-user devices and the AWS infrastructure. However, because these workloads depend on on-premises systems, the lowest possible latency between AWS infrastructure and on-premises data centers is required for better application performance. Here are a few examples of these systems:

  • Financial services sector mainframe workloads hosted on premises serving regional customers.
  • Enterprise Active Directory hosted on premises serving cloud and on-premises workloads.
  • Enterprise applications hosted on premises processing a high volume of locally generated data.

For workloads deployed in AWS, the time taken for each interaction with components still hosted in the on-premises data center is increased by this latency. In turn, this delays the responses received by the end-user. The total latency accumulates and results in a suboptimal user experience.

By deploying modernized workloads in Local Zones, you can reduce latency while continuing to access systems hosted in on-premises data centers, thereby reducing the total latency for the end-user. At the same time, you can enjoy the benefits of agility, elasticity, and security offered by AWS, and can apply the same automation, compliance, and security best practices that you’ve been familiar with in the AWS Regions.

Enterprise workload resiliency with Local Zones

While designing hybrid architectures with Local Zones, resiliency is an important consideration. You want to route traffic to the nearest Local Zone for low latency. However, when disasters happen, it’s critical to fail over to the parent Region automatically.

Let’s look at the details of hybrid architecture design based on real world deployments from different angles to understand how the architecture achieves all of the design goals.

Hybrid architecture with resilient network connectivity

The following diagram shows a high-level overview of a resilient enterprise hybrid architecture with Local Zones, where you have redundant connections between the AWS Region, the Local Zone, and the corporate data center.

resillient network connectivity

Here are a few key points with this network connectivity design:

  1. Use AWS Direct Connect or Site-to-Site VPN to connect the corporate data center and AWS Region.
  2. Use Direct Connect or self-hosted VPN to connect the corporate data center and the Local Zone. This connection will provide dedicated low-latency connectivity between the Local Zone and corporate data center.
  3. Transit Gateway is a regional service. When attaching the VPC to AWS Transit Gateway, you can only add subnets provisioned in the Region. Instances on subnets in the Local Zone can still use Transit Gateway to reach resources in the Region.
  4. For subnets provisioned in the Region, the VPC route table should be configured to route the traffic to the corporate data center via Transit Gateway.
  5. For subnets provisioned in Local Zone, the VPC route table should be configured to route the traffic to the corporate data center via the self-hosted VPN instance or Direct Connect.

Hybrid architecture with resilient workload deployment

The next examples show a public-facing and a private-facing workload.

To simplify the diagram and focus on application layer architecture, the following diagrams assume that you are using Direct Connect to connect between AWS and the on-premises data center.

Example 1: Resilient public facing workload

With a public-facing workload, end-user traffic is routed to the Local Zone. If the Local Zone is unavailable, then traffic is automatically routed to the Region using an Amazon Route 53 failover policy.

public facing workload resilliency
Here are the key design considerations for this architecture:

  1. Deploy the workload in the Local Zone and put the compute layer in an Amazon EC2 Auto Scaling group, so that the application can scale up and down depending on the volume of requests.
  2. Deploy the workload in an AWS Region as well, with the compute layer in another Auto Scaling group. The Regional deployment acts as a pilot light or warm standby with minimal footprint, but it can scale out when the Local Zone is unavailable.
  3. Two Application Load Balancers (ALBs) are required: one in the Region and one in the Local Zone. Each ALB dispatches traffic to the workload cluster inside the Auto Scaling group local to it.
  4. An internet gateway is required for public-facing workloads. When using a Local Zone, there’s no extra configuration needed: define a single internet gateway and attach it to the VPC.

If you want to specify an Elastic IP address as the workload’s public endpoint, note that the Local Zone has a different address pool than the Region and that BYOIP is unsupported for Local Zones.

  5. Create a Route 53 DNS record with “Failover” as the routing policy; a sketch of creating the records follows this list.
  • For the primary record, point it to the alias of the ALB in the Local Zone. This sets the Local Zone as the preferred destination for application traffic, which minimizes latency for end-users.
  • For the secondary record, point it to the alias of the ALB in the AWS Region.
  • Enable health checks for the primary record. If the health check against the primary record fails, indicating that the workload deployed in the Local Zone has failed to respond, then Route 53 automatically points to the secondary record, which is the workload deployed in the AWS Region.
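As a minimal boto3 sketch of step 5 (the hosted zone ID, record name, health check ID, and ALB values are all placeholders; each AliasTarget needs the ALB’s DNS name and the ALB’s own hosted zone ID, available from the ELB API or console):

import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0123456789EXAMPLE"                     # placeholder public hosted zone
HEALTH_CHECK_ID = "11111111-2222-3333-4444-555555555555"  # placeholder health check for the primary

# Placeholder ALB endpoints; look up real values from the ELB API or console.
PRIMARY_ALB = {"DNSName": "lz-alb-123.us-west-2.elb.amazonaws.com", "HostedZoneId": "ZEXAMPLEALB1"}
SECONDARY_ALB = {"DNSName": "region-alb-456.us-west-2.elb.amazonaws.com", "HostedZoneId": "ZEXAMPLEALB1"}

changes = [
    {"Action": "UPSERT",
     "ResourceRecordSet": {
         "Name": "app.example.com", "Type": "A",
         "SetIdentifier": "local-zone-primary",
         "Failover": "PRIMARY",
         "HealthCheckId": HEALTH_CHECK_ID,  # lets Route 53 fail away from the Local Zone
         "AliasTarget": {**PRIMARY_ALB, "EvaluateTargetHealth": True}}},
    {"Action": "UPSERT",
     "ResourceRecordSet": {
         "Name": "app.example.com", "Type": "A",
         "SetIdentifier": "region-secondary",
         "Failover": "SECONDARY",
         "AliasTarget": {**SECONDARY_ALB, "EvaluateTargetHealth": True}}},
]

route53.change_resource_record_sets(HostedZoneId=HOSTED_ZONE_ID,
                                    ChangeBatch={"Changes": changes})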

Example 2: Resilient private workload

For a private workload that’s only accessible to internal users, a few extra considerations must be made to keep traffic inside the trusted private network.

private workload resilliency

The architecture for a resilient private-facing workload follows the same steps as the public-facing workload, with some key differences. These include:

  1. Instead of using a public hosted zone, create private hosted zones in Route 53 to respond to DNS queries for the workload.
  2. Create the primary and secondary records in Route 53 just like the public workload but referencing the private ALBs.
  3. To allow end-users on the corporate network (within offices or connected via VPN) to resolve the workload, use Route 53 Resolver with an inbound endpoint. This allows end-users located on-premises to resolve records in the private hosted zone. Route 53 Resolver is designed to integrate with an on-premises DNS server; a sketch of creating the inbound endpoint follows this list.
  4. No internet gateway is required for hosting the private workload. You might need an internet gateway in the Local Zone for other purposes: for example, to host a self-managed VPN solution to connect the Local Zone with the corporate data center.
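A hedged boto3 sketch of step 3 (the endpoint name, subnet IDs, and security group ID are placeholders; the security group must allow DNS traffic on TCP and UDP port 53 from the on-premises network):

import boto3

resolver = boto3.client("route53resolver")

response = resolver.create_resolver_endpoint(
    CreatorRequestId="private-workload-inbound-2023",  # idempotency token
    Name="on-premises-inbound",                        # placeholder name
    SecurityGroupIds=["sg-0123456789abcdef0"],         # placeholder security group
    Direction="INBOUND",
    # At least two IP addresses in different subnets are required.
    IpAddresses=[
        {"SubnetId": "subnet-0123456789abcdef0"},
        {"SubnetId": "subnet-0123456789abcdef1"},
    ],
)
print(response["ResolverEndpoint"]["Id"])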

Hosting multiple workloads

Customers who host multiple workloads in a single VPC generally must consider how to segregate those workloads. As with workloads in the AWS Region, segregation can be implemented at a subnet or VPC level.

If you want to segregate workloads at the subnet level, you can extend your existing VPC architecture by provisioning extra sets of subnets to the Local Zone.

segregate workloads at subnet level

Although not shown in the diagram, for those of you using a self-hosted VPN to connect the Local Zone with an on-premises data center, the VPN solution can be deployed in a centralized subnet.

You can continue to use security groups, network access control lists (NACLs), and VPC route tables – just as you would in the Region – to segregate the workloads.

Many of our customers segregate workloads at the VPC level. Within the Region, inter-VPC routing is generally handled by Transit Gateway. In this case, however, it may be undesirable to send traffic to the Region just to reach a subnet in another VPC that is also extended to the Local Zone.

segregate workloads at VPC level

Key considerations for this design are as follows:

  1. Direct Connect is deployed to connect the Local Zone with the corporate data center. Therefore, each VPC will have a dedicated Virtual Private Gateway provisioned to allow association with the Direct Connect Gateway.
  2. To enable inter-VPC traffic within the Local Zone, peer the two VPCs together.
  3. Create a VPC route table in VPC A. Add a route for Subnet Y where the destination is the peering link. Assign this route table to Subnet X.
  4. Create a VPC route table in VPC B. Add a route for Subnet X where the destination is the peering link. Assign this route table to Subnet Y.
  5. If necessary, add routes for on-premises networks and the transit gateway to both route tables.

This design allows traffic between subnets X and Y to stay within the Local Zone, thereby avoiding any latency from the Local Zone to the AWS Region while still permitting full connectivity to all other networks.
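Steps 2 through 4 above might look like the following boto3 sketch; the VPC, route table, and CIDR values are placeholders:

import boto3

ec2 = boto3.client("ec2")

# Step 2: peer VPC A and VPC B (same account and Region in this sketch).
peering = ec2.create_vpc_peering_connection(
    VpcId="vpc-0123456789abcdef0",       # VPC A (placeholder)
    PeerVpcId="vpc-0123456789abcdef1",   # VPC B (placeholder)
)
pcx_id = peering["VpcPeeringConnection"]["VpcPeeringConnectionId"]
ec2.accept_vpc_peering_connection(VpcPeeringConnectionId=pcx_id)

# Step 3: in VPC A's route table (assigned to Subnet X), route Subnet Y via the peering link.
ec2.create_route(RouteTableId="rtb-0123456789abcdef0",   # placeholder route table in VPC A
                 DestinationCidrBlock="10.1.2.0/24",     # placeholder CIDR for Subnet Y
                 VpcPeeringConnectionId=pcx_id)

# Step 4: in VPC B's route table (assigned to Subnet Y), route Subnet X via the peering link.
ec2.create_route(RouteTableId="rtb-0123456789abcdef1",   # placeholder route table in VPC B
                 DestinationCidrBlock="10.0.1.0/24",     # placeholder CIDR for Subnet X
                 VpcPeeringConnectionId=pcx_id)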

Conclusion

In this post, we summarized the use cases for enterprise hybrid architecture with Local Zones, and showed you:

  • Reference architectures to host workloads in Local Zones with low-latency connectivity to corporate data centers and resiliency to enable fail over to the AWS Region automatically.
  • Different design considerations for public and private facing workloads utilizing this hybrid architecture.
  • Segregation and connectivity considerations when extending this hybrid architecture to host multiple workloads.

Hopefully you will be able to follow along with these reference architectures to build and run highly resilient applications with local system interdependencies using Local Zones.

Setup a high availability design for Oracle Data Guard (Fast-Start Failover) using Amazon Route 53

Post Syndicated from Viqash Adwani original https://aws.amazon.com/blogs/architecture/setup-a-high-availability-design-for-oracle-data-guard-fast-start-failover-using-amazon-route-53/

Many customers use Oracle Database deployed on Amazon Elastic Compute Cloud (Amazon EC2) to run their Oracle E-Business Suite applications. They rely on Oracle Data Guard for high availability, with a standby database running in a different Availability Zone. Oracle Data Guard can switch a standby database to the primary role in case the production database becomes unavailable due to a planned or unplanned outage.

Oracle E-Business Suite has AutoConfig database context files that point to a Domain Name System (DNS) name, like a private IP DNS name, or to an IP address on Amazon EC2. In case of switchover/failover, the database context files in Oracle E-Business Suite need to be updated. With this solution, database context files do not need to be updated on switchover/failover. This is achieved by providing a single DNS name hosted on Amazon Route 53 that always points to the primary database, irrespective of the node it runs on.

This post demonstrates setting up a Route 53 hosted zone that points to the primary and standby databases and routes requests to the database holding the primary role. We will set up Route 53 health checks to monitor Amazon CloudWatch alarms, based on Oracle Data Guard Fast-Start Failover (FSFO) logs pushed from the Oracle database by the CloudWatch agent.

Prerequisites

Before getting started, you must have the following:

  • Oracle databases running on two separate EC2 instances, one for the primary node and one for the standby node
  • A third EC2 instance for the observer node, with either the Oracle Client Administrator software or the full Oracle Database software stack
  • Oracle Data Guard configured to maintain standby databases as transactional consistent copy of the primary database
  • Oracle Data Guard Command-Line Interface (DGMGRL) configured with observer process to facilitate FSFO
  • FSFO enabled with observer configuration

Solution overview

Figure 1 depicts the AWS services that are used to build an architecture using a single Domain Name System (DNS) name and Route 53 to route requests to the primary database for Oracle Data Guard deployed on EC2 instances.

Using a single DNS and Amazon Route 53 to route requests

Figure 1. Using a single DNS and Amazon Route 53 to route requests

We encourage you to explore related articles on these services prior to launching this architecture.

Architecture components

  • Primary node: An Oracle Data Guard configuration contains one production database (primary database) that functions in the primary role. This is the database that is accessed by most of your applications.
  • Standby node: A standby database is a transactional consistent copy of the primary database. Once created, Oracle Data Guard automatically maintains each standby database by transmitting redo data from the primary database and applying it to the standby database.
  • Observer node: A component of DGMGRL that is configured on a separate server, with the Oracle Client Interface, outside the systems running the Oracle Data Guard configuration; it monitors the availability of the primary database. We recommend placing it in a separate Availability Zone from the primary and standby databases. Should it detect that the primary database is unavailable or that a connection cannot be made, it issues a failover after waiting the number of seconds specified by the FastStartFailoverThreshold property (30 seconds by default).

Note: This solution was tested on Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 and Oracle Enterprise Linux OL7.5-x86_64. Select the required combination of operating system and database by referring to E-Business Suite Database Certifications. Active Oracle Data Guard configuration is not used in this solution; therefore, the database stays in mount state and unavailable for customer reads. However, this solution can be used with an active Oracle Data Guard configuration as well.

Infrastructure setup

This table details the lab environment and the database instance names used throughout this post.

Database role | IP address | Instance name | Database unique name | Database open mode | Database port
Primary node | 172.31.xx.xx | ORCLVIQ | orclviq | Read write | 1522
Secondary node | 172.31.xx.xx | ORCLSTD | orclstd | Mounted | 1522
FSFO observer node | 172.31.xx.xx | – | – | – | –
Route 53 hosted zone | dataguardviq.com | – | – | – | –

Solution implementation

1. IAM policy

To configure the CloudWatch agent on an EC2 instance, create an IAM role via the console, the AWS Command Line Interface (AWS CLI), or the AWS API. By default, an IAM role does not have any privileges and cannot access AWS resources. Before creating the IAM role, create an IAM policy with the permissions required to access CloudWatch Logs, specifying the CloudWatch resources you want to monitor or *, which allows access to all CloudWatch logs.

Create the IAM policy using the following JSON and give it a name, like dgpolicy. The permissions include creating log groups and log streams, putting log events, and describing log streams for all logs.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
                "logs:DescribeLogGroups",
                "logs:DescribeLogStreams",
                "logs:GetLogEvents",
                "s3:GetBucketLocation"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}

2. IAM role

Create the IAM role (for example, dgrole) and attach the policy created in Step 1.

3. Associate the IAM role with the EC2 instance running the FSFO observer node

Associate the IAM role with the EC2 instance from the AWS console:

  1. Open the Amazon EC2 console.
  2. In the navigation pane, choose Instances.
  3. Select the instance, and choose Actions, Security, Modify IAM role.
  4. Select the IAM role to attach to your instance; choose Save.

You can also use the associate-iam-instance-profile command to attach the IAM role to the instance by specifying the instance profile:

aws ec2 associate-iam-instance-profile \
    --instance-id i-1234567890lmnopq1 \
    --iam-instance-profile Name="dgrole"

4. Install CloudWatch log agent on observer node

Install the CloudWatch Logs agent on the observer node that already has the FSFO log configuration on it. After installation is complete, logs automatically flow from the instance to the log stream you created while installing the agent, as depicted in Figures 2 and 3.

$ curl https://s3.amazonaws.com/aws-cloudwatch/downloads/latest/awslogs-agent-setup.py -O
$ sudo python ./awslogs-agent-setup.py --region ap-southeast-2

Steps to install Amazon CloudWatch agent

Figure 2. Steps to install Amazon CloudWatch agent

Example of output

Figure 3. Example of output

At this point, FSFO logs are visible in CloudWatch logs. Perform a database switchover to confirm that logs are being published to CloudWatch logs.

5. Create CloudWatch metric filter

Create primary metric filter

  • Create a metric filter on “Standby database has changed to orclviq” and set the metric value to “1”.
  • Open the CloudWatch log group and search for “Standby database has changed to orclviq“. Once results are displayed, the “Create Metric Filter” button appears at the top right (as demonstrated in Figure 4).
Amazon CloudWatch log group

Figure 4. Amazon CloudWatch log group

  • “Filter name” and “Filter pattern” are filled in automatically, as we already filtered on this pattern while accessing the CloudWatch log group. Create a new namespace for the metric; specify the same “Metric namespace” for the second metric filter later. Set “Metric value” to “1”, as detailed in Figure 5.
Amazon CloudWatch metric filter

Figure 5. Amazon CloudWatch metric filter

Create secondary metric filter

  • To create the second metric filter, follow the steps for the primary metric filter, but use the filter pattern “Standby database has changed to orclstd“. The only difference is setting the “Metric value” to “0”.
  • Verify that the metric filters are working and can find the desired entries in the FSFO logs that determine which instance is the primary database (a programmatic sketch of both filters follows this list):
    • Click on one of the metric filters
    • Select log data to test, and choose an EC2 instance ID
    • Test the pattern to verify that the metric filters are working properly
    • Select “Next”
    • Save changes
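The same two filters could also be created programmatically. Here is a hedged boto3 sketch, where the log group name and metric namespace are assumptions; both filters write to one metric so that a single alarm can watch it:

import boto3

logs = boto3.client("logs")

LOG_GROUP = "dg-fsfo-observer-log"  # placeholder: the log group receiving FSFO logs
NAMESPACE = "DataGuardFSFO"         # assumed namespace, shared by both filters

# Primary filter: orclviq became the standby (orclstd is primary) -> metric value 1.
logs.put_metric_filter(
    logGroupName=LOG_GROUP,
    filterName="standby-changed-to-orclviq",
    filterPattern='"Standby database has changed to orclviq"',
    metricTransformations=[{"metricName": "PrimaryIsOrclstd",
                            "metricNamespace": NAMESPACE,
                            "metricValue": "1"}],
)

# Secondary filter: orclstd became the standby (orclviq is primary) -> metric value 0.
logs.put_metric_filter(
    logGroupName=LOG_GROUP,
    filterName="standby-changed-to-orclstd",
    filterPattern='"Standby database has changed to orclstd"',
    metricTransformations=[{"metricName": "PrimaryIsOrclstd",
                            "metricNamespace": NAMESPACE,
                            "metricValue": "0"}],
)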

6. Create a CloudWatch alarm

Create a CloudWatch alarm that will monitor CloudWatch metrics and perform actions based on the value of the metric.

  • Within the CloudWatch Alarms section (Step 1), select “Create alarm”, and then select “Metric”, as in Figure 6.
  • In Step 2, you can create a new topic and specify the email address where you would like to receive notifications regarding changes in primary or standby database.
  • In Step 3, you can specify the alarm name and optional alarm description.
Amazon CloudWatch alarm

Figure 6. Amazon CloudWatch alarm

  • To configure Conditions, as detailed in Figure 7, be sure the threshold type is set to “Static”, the alarm condition is “Greater/Equal”, and the threshold value is “1”.
Amazon CloudWatch alarm condition

Figure 7. Amazon CloudWatch alarm condition

  • Figure 8 shows a summary of the configured CloudWatch metrics and alarm (a programmatic sketch follows the figure). At this point, “orclviq” is the primary database, so the alarm state will be “Insufficient data”. Try a switchover, and the alarm state should change to “In alarm”.
  • Switch back to the original primary before proceeding further.
Final view of both metric filters combined

Figure 8. Final view of both metric filters combined
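The alarm from Figures 6 and 7 could also be created with boto3; in this sketch, the alarm name and SNS topic are placeholders, and the metric names reuse the assumptions from the metric filter sketch above:

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="dataguard-fsfo-primary-moved",  # assumed alarm name
    Namespace="DataGuardFSFO",                 # namespace from the metric filters
    MetricName="PrimaryIsOrclstd",
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=1,                                         # static threshold of 1
    ComparisonOperator="GreaterThanOrEqualToThreshold",  # "Greater/Equal" in the console
    # Placeholder SNS topic for notifications on primary/standby changes.
    AlarmActions=["arn:aws:sns:ap-southeast-2:111122223333:dg-notifications"],
    TreatMissingData="missing",  # alarm stays in INSUFFICIENT_DATA until a switchover is logged
)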

7. Create Route 53 health checks

Creating Route 53 health checks based on the CloudWatch alarm enables DNS failover. Health checks can be created directly against a public IP without the metric filters and alarms above, but it is unlikely that customers will have a public IP on database servers, and Route 53 cannot check the health of endpoints whose IP addresses are in local, private, non-routable, or multicast ranges. Therefore, the health check is created on the state of the CloudWatch alarm (Figure 9):

  • Within the Route 53 console, select “Configure health check”.
  • Select “State of CloudWatch alarm”, then select the region of your CloudWatch metrics and choose the CloudWatch alarm that was created earlier from the dropdown menu.
  • Skip creating an alarm for health check, as the alarm is already configured for CloudWatch metric filter.
Amazon Route 53 health check parameters

Figure 9. Amazon Route 53 health check parameters

  • The health check is now ready, and its status will initially be “Unknown”. It will change to “Healthy”, since orclviq is currently the primary, meaning the last matching log entry is “Standby database has changed to orclstd“.
  • At this point, try a failover and observe whether the health check status changes to “Unhealthy”. Switch back to the original primary before proceeding.
  • The health check should return to “Healthy” state.
  • Check how the alarm state maps to health check status (a programmatic sketch of this health check follows the list):
    • OK: the status is healthy
    • ALARM: the status is unhealthy
    • INSUFFICIENT: use last known status
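As an alternative to the console flow in Figure 9, here is a hedged boto3 sketch of the same health check, reusing the assumed alarm name and Region from step 6:

import boto3

route53 = boto3.client("route53")

response = route53.create_health_check(
    CallerReference="dg-fsfo-health-check-2023",  # idempotency token
    HealthCheckConfig={
        "Type": "CLOUDWATCH_METRIC",
        "AlarmIdentifier": {
            "Region": "ap-southeast-2",              # Region of the CloudWatch alarm
            "Name": "dataguard-fsfo-primary-moved",  # assumed alarm name from step 6
        },
        "InsufficientDataHealthStatus": "LastKnownStatus",  # "use last known status"
    },
)
print(response["HealthCheck"]["Id"])  # reference this ID in the failover records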

8. Create Route 53 hosted zone

The Route 53 hosted zone in this solution uses failover routing based on records pointing to the IP addresses of the primary and standby database servers. When the Route 53 health check is in an “Unhealthy” state, failover routing kicks in and the hosted zone starts routing requests to the IP address of the standby database server.

  • The Route 53 hosted zone has records for the private IPs of your primary and standby database instances. Because it is a private hosted zone, it uses the same VPC as the observer node on which we deployed the CloudWatch agent (Figure 10).
Amazon Route 53 hosted zone

Figure 10. Amazon Route 53 hosted zone

  • Once the Route 53 hosted zone is ready, the last step is to create records with the private IP addresses of the primary and secondary database servers (Figures 11 and 12). The routing policy is set to failover: if the Route 53 health check fails, the hosted zone performs a DNS failover to the standby server’s IP. A programmatic sketch of these records follows Figure 13.
Amazon Route 53 hosted zone record for primary servers

Figure 11. Amazon Route 53 hosted zone record for primary servers

Amazon Route 53 hosted zone record for secondary servers

Figure 12. Amazon Route 53 hosted zone record for secondary servers

Once the records are created, the hosted zone will have the IP addresses of the primary and secondary database servers, as demonstrated in Figure 13.

Final view of Amazon Route 53 hosted zone

Figure 13. Final view of Amazon Route 53 hosted zone
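As a boto3 sketch of the two failover records (the hosted zone ID and health check ID are placeholders; the IP addresses are the primary and standby servers used throughout this post, and the 30-second TTL is an assumption):

import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0123456789EXAMPLE"                     # placeholder private hosted zone
HEALTH_CHECK_ID = "11111111-2222-3333-4444-555555555555"  # placeholder, from step 7

changes = [
    {"Action": "UPSERT",
     "ResourceRecordSet": {
         "Name": "dataguardviq.com", "Type": "A",
         "SetIdentifier": "primary-db",
         "Failover": "PRIMARY",
         "HealthCheckId": HEALTH_CHECK_ID,  # drives the DNS failover
         "TTL": 30,
         "ResourceRecords": [{"Value": "172.31.10.89"}]}},   # primary database server
    {"Action": "UPSERT",
     "ResourceRecordSet": {
         "Name": "dataguardviq.com", "Type": "A",
         "SetIdentifier": "secondary-db",
         "Failover": "SECONDARY",
         "TTL": 30,
         "ResourceRecords": [{"Value": "172.31.36.241"}]}},  # standby database server
]

route53.change_resource_record_sets(HostedZoneId=HOSTED_ZONE_ID,
                                    ChangeBatch={"Changes": changes})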

Solution testing

To test the solution, run a telnet connectivity test against the Route 53 hosted zone that was created. It should connect to the IP address of the primary database server, 172.31.10.89:

[ec2-user@ip-172-xx-xx-xx ~]$ telnet dataguardviq.com 1522
Trying 172.31.10.89...
Connected to dataguardviq.com.
Escape character is '^]'.

Next, perform a switchover or failover of the primary to the secondary database server. This can be done with the DGMGRL command “switchover to orclstd”. The output will be:

DGMGRL> switchover to orclstd;
Performing switchover NOW, please wait...
Operation requires a connection to database "orclstd"
Connecting ...
Connected to "orclstd"
Connected as SYSDBA.
New primary database "orclstd" is opening...
Oracle Clusterware is restarting database "orclviq" ...
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to "orclviq"
Connected to "orclviq"
Switchover succeeded, new primary is "orclstd"

Now the new primary is “orclstd”. In the background, the Route 53 health check turns “Unhealthy” and triggers the DNS failover. This means that a telnet to the Route 53 hosted zone returns the IP address of the new primary (the old secondary), which is 172.31.36.241:

[ec2-user@ip-172-xx-xx-xx ~]$ telnet dataguardviq.com 1522
Trying 172.31.36.241...
Connected to dataguardviq.com.
Escape character is '^]'.

Cleanup

To clean up, remove:

  • Route 53 hosted zone
  • CloudWatch metric filter
  • CloudWatch alarm
  • CloudWatch agent from observer node
  • IAM role and policy created specifically for this solution

Conclusion

This solution demonstrated how to use a Route 53 hosted zone to route requests to the active primary node for Oracle Data Guard on Amazon EC2.