Tag Archives: Amazon Route 53

Introducing Amazon Route 53 Global Resolver for secure anycast DNS resolution (preview)

Post Syndicated from Esra Kayabali original https://aws.amazon.com/blogs/aws/introducing-amazon-route-53-global-resolver-for-secure-anycast-dns-resolution-preview/

Today, we’re announcing Amazon Route 53 Global Resolver, a new Amazon Route 53 service that provides secure and reliable DNS resolution globally for queries from anywhere (preview). You can use Global Resolver to resolve DNS queries to public domains on the internet and private domains associated with Route 53 private hosted zones. Route 53 Global Resolver offers network administrators a unified solution to resolve queries from authenticated clients and sources in on-premises data centers, branch offices, and remote locations through globally distributed anycast IP addresses. This service includes built-in security controls including DNS traffic filtering, support for encrypted queries, and centralized logging to help organizations reduce operational overhead while maintaining compliance with security requirements.

Organizations with hybrid deployments face operational complexity when managing DNS resolution across distributed environments. Resolving public internet domains and private application domains often requires maintaining split DNS infrastructure, which increases cost and administrative burden especially when replicating to multiple locations. Network administrators must configure custom forwarding solutions, deploy Route 53 Resolver endpoints for private domain resolution, and implement separate security controls across different locations. Additionally, they must configure and maintain multi-Region failover strategies for Route 53 Resolver endpoints and provide consistent security policy enforcement across all Regions while testing failover scenarios.

Route 53 Global Resolver has key capabilities that address these challenges. The service resolves both public internet domains and Route 53 private hosted zones, eliminating the need for separate split-DNS forwarding. It provides DNS resolution through multiple protocols, including DNS over UDP (Do53), DNS-over-HTTPS (DoH), and DNS-over-TLS (DoT). Each deployment provides a single set of common IPv4 and IPv6 anycast IP addresses that route queries to the nearest AWS Region, reducing latency for distributed client populations.

Route 53 Global Resolver provides integrated security features equivalent to Route 53 Resolver DNS Firewall. Administrators can configure filtering rules using AWS Managed Domain Lists that provide flexible controls with lists classified by DNS threats (malware, spam, phishing) or web content (adult sites, gambling, social networking) that might not be safe for work or create custom domain lists by importing domains from a file. Advanced threat protection detects and blocks domain generation algorithm (DGA) patterns and DNS tunneling attempts. For encrypted DNS traffic, Route 53 Global Resolver supports DoH and DoT protocols to protect queries from unauthorized access during transit.

Route 53 Global Resolver only accepts traffic from known clients that need to authenticate with the Resolver. For Do53, DoT, and DoH connections, administrators can configure IP and CIDR allowlists. For DoH and DoT connections, token-based authentication provides granular access control with customizable expiration periods and revocation capabilities. Administrators can assign tokens to specific client groups or individual devices based on organizational requirements.

Route 53 Global Resolver supports DNSSEC validation to verify the authenticity and integrity of DNS responses from public nameservers. It also includes EDNS Client Subnet support, which forwards client subnet information to enable more accurate geographic-based DNS responses from content delivery networks.

Getting started with Route 53 Global Resolver
This walkthrough shows how to configure Route 53 Global Resolver for an organization with offices on the US East and West coasts that needs to resolve both public domains and private applications hosted in Route 53 private hosted zones. To configure Route 53 Global Resolver, go to the AWS Management Console, choose Global resolvers from the navigation pane, and choose Create global resolver.

In the Resolver details section, enter a Resolver name such as corporate-dns-resolver. Add an optional description like DNS resolver for corporate offices and remote clients. In the Regions section, choose the AWS Regions where you want the resolver to operate, such as US East (N. Virginia) and US West (Oregon). The anycast architecture routes DNS queries from your clients to the nearest selected Region.

After the resolver is created, the console displays the resolver details, including the anycast IPv4 and IPv6 addresses that you will use for DNS queries. You can proceed to create a DNS view by choosing Create DNS view to configure client authentication and DNS query resolution settings.

In the Create DNS view section, enter a DNS view name such as primary-view and optionally add a Description like DNS view for corporate offices. A DNS view helps you create different logical groupings for your clients and sources, and determine the DNS resolution for those groups. This helps you maintain different DNS filtering rules and private hosted zone resolution policies for different clients in your organization.

For DNSSEC validation, choose Enable to verify the authenticity of DNS responses from public DNS servers. For Firewall rules fail open behavior, choose Disable to block DNS queries when firewall rules can’t be evaluated, which provides additional security. For EDNS client subnet, keep Enable selected to forward client location information to DNS servers, which allows content delivery networks to provide more accurate geographic responses. DNS view creation might take a few minutes to become operational.

After the DNS view is created and operational, configure DNS Firewall rules to filter network traffic by choosing Create rule. In the Create DNS Firewall rules section, enter a Rule name such as block-malware-domains and optionally add a description. For Rule configuration type, you can choose Customer managed domain lists, AWS managed domain lists provided by AWS or DNS Firewall Advanced protection.

For this walkthrough, choose AWS managed domain lists. In the Domain lists dropdown, choose one or more AWS managed lists such as Threat – Malware to block known malicious domains. You can leave Query type empty to apply the rule to all DNS query types. In this example, choose A to apply this rule only to IPv4 address queries. In the Rule action section, select Block to prevent DNS resolution for domains that match the selected lists. For Response to send for Block action, keep NODATA selected to indicate that the query was successful but no response is available, then choose Create rules.

The next step is to configure access sources to specify which IP addresses or CIDR blocks are allowed to send DNS queries to the resolver. Navigate to the Access sources tab in the DNS view and then choose Create access source.

In the Access source details section, enter a Rule name such as office-networks to identify the access source. In the CIDR block field, enter the IP address range for your offices to allow queries from that network. For Protocol, select Do53 for standard DNS queries over UDP or choose DoH or DoT if you want to require encrypted DNS connections from clients. After configuring these settings, choose Create access source to allow the specified network to send DNS queries to the resolver.

Next, navigate to the Access tokens tab in the DNS view to create token-based authentication for clients and choose Create access token. In the Access token details section, enter a Token name such as remote-clients-token. For Token expiry, select an expiration period from the dropdown based on your security requirements, such as 365 days for long-term client access, or choose a shorter duration like 30 days or 90 days for tighter access control. After configuring these settings, choose Create access token to generate the token, which clients can use to authenticate DoH and DoT connections to the resolver.

After the access token is created, navigate to the Private hosted zones tab in the DNS view to associate Route 53 private hosted zones with the DNS view so that the resolver can resolve queries for your private application domains. Choose Associate private hosted zone and in the Private hosted zones section, select a private hosted zone from the list that you want the resolver to handle. After selecting the zone, choose Associate to enable the resolver to respond to DNS queries for these private domains from your configured access sources.

With the DNS view configured, firewall rules created, access sources and tokens defined, and private hosted zones associated, the Route 53 Global Resolver setup is complete and ready to handle DNS queries from your configured clients.

After creating your Route 53 Global Resolver, you need to configure your DNS clients to send queries to the resolver’s anycast IP addresses. The configuration method depends on the access control you configured in your DNS view:

  • For IP-based access sources (CIDR blocks) – Configure your source clients to point DNS traffic to the Route 53 Global Resolver anycast IP addresses provided in the resolver details. Global Resolver will only allow access from allowlisted IPs that you have specified in your access sources. You can also associate the access sources to different DNS views to provide more granular DNS resolution views for different sets of IPs.
  • For access token–based authentication – Deploy the tokens on your clients to authenticate DoH and DoT connections with Route 53 Global Resolver. You must also configure your clients to point the DNS traffic to the Route 53 Global Resolver anycast IP addresses provided in the resolver details.

For detailed configuration instructions for your specific operating system and protocol, refer to the technical documentation.

Additional things to know
We’re renaming the existing Route 53 Resolver to Route 53 VPC Resolver. This naming change clarifies the architectural distinction between the two services. VPC Resolver operates Regionally within your VPCs to provide DNS resolution for resources in your Amazon VPC environment. VPC Resolver continues to support inbound and outbound resolver endpoints for hybrid DNS architectures within specific AWS Regions.

Route 53 Global Resolver complements Route 53 VPC Resolver by providing internet-reachable, global and private DNS resolution for on-premises and remote clients without requiring VPC deployment or private connections.

Existing VPC Resolver configurations remain unchanged and continue to function as configured. The renaming affects the service name in the AWS Management Console and documentation, but API operation names remain unchanged. If your architecture requires DNS resolution for resources within your VPCs, continue using VPC Resolver.

Join the preview
Route 53 Global Resolver reduces operational overhead by providing unified DNS resolution for public and private domains through a single managed service. The global anycast architecture improves reliability and reduces latency for distributed clients. Integrated security controls and centralized logging help organizations maintain consistent security policies across all locations while meeting compliance requirements.

To learn more about Amazon Route 53 Global Resolver, visit the Amazon Route 53 documentation.

You can start using Route 53 Global Resolver through the AWS Management Console in US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Europe (Frankfurt), Europe (Ireland), Europe (London), Asia Pacific (Mumbai), Asia Pacific (Singapore), Asia Pacific (Tokyo), and Asia Pacific (Sydney) Regions.

— Esra

Amazon Route 53 launches Accelerated recovery for managing public DNS records

Post Syndicated from Micah Walter original https://aws.amazon.com/blogs/aws/amazon-route-53-launches-accelerated-recovery-for-managing-public-dns-records/

Today, we’re announcing Amazon Route 53 Accelerated recovery for managing public DNS records, a new Domain Name Service (DNS) business continuity feature that is designed to provide a 60-minute recovery time objective (RTO) during service disruptions in the US East (N. Virginia) AWS Region. This enhancement ensures that customers can continue making DNS changes and provisioning infrastructure even during regional outages, providing greater predictability and resilience for mission-critical applications.

Customers running applications that require business continuity have told us they need additional DNS resilience capabilities to meet their business continuity requirements and regulatory compliance obligations. While AWS maintains exceptional availability across our global infrastructure, organizations in regulated industries like banking, FinTech, and SaaS want the confidence that they will be able to make DNS changes even during unexpected regional disruptions, allowing them to quickly provision standby cloud resources or redirect traffic when needed.

Accelerated recovery for managing public DNS records addresses this need by targeting DNS changes that customers can make within 60 minutes of a service disruption in the US East (N. Virginia) Region. The feature works seamlessly with your existing Route 53 setup, providing access to key Route 53 API operations during failover scenarios, including ChangeResourceRecordSets, GetChange, ListHostedZones, and ListResourceRecordSets. Customers can continue using their existing Route 53 API endpoint without modifying applications or scripts.

Let’s try it out
Configuring a Route53 hosted zone to use accelerated recovery is simple. Here I am creating a new hosted zone for a new website I’m building.

Once I have created my hosted zone, I see a new tab labeled Accelerated recovery. I can see here that accelerated recovery is disabled by default.

To enable it, I just need to click the Enable button and confirm my choice in the modal that appears as depicted in the dialog below.

Enabling accelerated recovery will take a couple minutes to complete. Once it’s enabled, I see a green Enabled status as depicted in the screenshot below.

I can disable accelerated recovery at any time from this same area of the AWS Management Console. I can also enable accelerated recovery for any existing hosted zones I have already created.

Enhanced DNS business continuity
With accelerated recovery enabled, customers gain several key capabilities during service disruptions. The feature maintains access to essential Route 53 API operations, ensuring that DNS management remains available when it’s needed most. Organizations can continue to make critical DNS changes, provision new infrastructure, and redirect traffic flows without waiting for full service restoration.

The implementation is designed for simplicity and reliability. Customers don’t need to learn new APIs or modify existing automation scripts. The same Route 53 endpoints and API calls continue to work, providing a seamless experience during both normal operations and failover scenarios.

Now available
Accelerated recovery for Amazon Route 53 public hosted zones is available now. You can enable this feature through the AWS Management Console, AWS Command Line Interface (AWS CLI), AWS Software Development Kit (AWS SDKs), or infrastructure as code tools like AWS CloudFormation and AWS Cloud Development Kit (AWS CDK). There is no additional cost for using accelerated recovery.

To learn more about accelerated recovery and get started, visit the documentation. This new capability represents our continued commitment to providing customers with the DNS resilience they need to build and operate mission-critical applications in the cloud.

AWS Weekly Roundup: Project Rainier, Amazon CloudWatch investigations, AWS MCP servers, and more (June 30, 2025)

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-project-rainier-amazon-cloudwatch-investigations-aws-mcp-servers-and-more-june-30-2025/

Every time I visit Seattle, the first thing that greets me at the airport is Mount Rainier. Did you know that the most innovative project at Amazon Web Services (AWS) is named after this mountain?

Project Rainier is a new project to create what is expected to be the world’s most powerful computer for training AI models across multiple data centers in the United Stages. Anthropic will develop the advanced versions of its Claude models with five times more computing power than its current largest training cluster.

The key technology powering Project Rainier is AWS custom-designed Trainium2 chips, which are specialized for the immense data processing required to train complex AI models. Thousands of these Trainium2 chips will be connected in a new type of Amazon EC2 UltraServer and EC2 UltraCluster architecture that allows ultra-fast communication and data sharing across the massive system.

Learn about the AWS vertical integration of Project Rainer, where it designs every component of the technology stack from chips to software, allows it to optimize the entire system for maximum efficiency and reliability.

Last week’s launches
Here are some launches that got my attention:

  • Amazon S3 access for Amazon FSx for OpenZFS – You can access and analyze your FSx for OpenZFS file data through Amazon S3 Access Points, enabling seamless integration with AWS AI/ML, and analytics services without moving your data out of the file system. You can treat your FSx for OpenZFS data as if it were stored in S3, making it accessible through the S3 API for various applications including Amazon Bedrock, Amazon SageMaker, AWS Glue, and other S3 based cloud-native applications.
  • Amazon S3 with sort and z-order compaction for Apache Iceberg tables – You can optimize query performance and reduce costs with new sort and z-order compaction. With S3 Tables, sort compaction automatically organizes data files based on defined column orders, while z-order compaction can be enabled through the maintenance API for efficient multicolumn queries.
  • Amazon CloudWatch investigations – You can accelerate your operational troubleshooting in AWS environments using the Amazon CloudWatch AI-powered investigation feature, which helps identify anomalies, surface related signals, and suggest remediation steps. This capability can be initiated through CloudWatch data widgets, multiple AWS consoles, CloudWatch alarm actions, or Amazon Q chat and enables team collaboration and integration with Slack and Microsoft Teams.
  • Amazon Bedrock Guardrails Standard tier – You can enhance your AI content safety measures using the new Standard tier. It offers improved content filtering and topic denial capabilities across up to 60 languages, better detection of variations including typos, and stronger protection against prompt attacks. This feature lets you configure safeguards to block harmful content, prevent model hallucinations, redact personally identifiable information (PII), and verify factual claims through automated reasoning checks.
  • Amazon Route 53 Resolver endpoints for private hosted zone – You can simplify DNS management across AWS and on-premises infrastructure using the new Route 53 DNS delegation feature for private hosted zone subdomains, which works with both inbound and outbound Resolver endpoints. You can delegate subdomain authority between your on-premises infrastructure and Route 53 Resolver cloud service using name server records, eliminating the need for complex conditional forwarding rules.
  • Amazon Q Developer CLI for Java transformation – You can automate and scale Java application upgrades using the new Amazon Q Developer Java transformation command line interface (CLI). This feature perform upgrades from Java versions 8, 11, 17, or 21 to versions 17 or 21 directly from the command line. This tool offers selective transformation options so you can choose specific steps from transformation plans and customize library upgrades.
  • New AWS IoT Device Management managed integrations – You can simplify Internet of Things (IoT) device management across multiple manufacturers and protocols using the new managed integrations feature, which provides a unified interface for controlling devices whether they connect directly, through hubs or third-party clouds. The feature includes pre-built cloud-to-cloud (C2C) connectors, device data model templates, and SDKs that support ZigBee, Z-Wave, and Wi-Fi protocols, while you can still create custom connectors and data models.

For a full list of AWS announcements, be sure to keep an eye on the What’s New with AWS? page.

Other AWS news
Various Model Context Protocol (MCP) servers for AWS services have been released. Here are some tutorials about MCP servers that you might find interesting:

Upcoming AWS events
Check your calendars and sign up for these upcoming AWS events:

  • AWS re:Invent – Register now to get a head start on choosing your best learning path, booking travel and accommodations, and bringing your team to learn, connect, and have fun. If you’re an early-career professional, you can apply to the All Builders Welcome Grant program, which is designed to remove financial barriers and create diverse pathways into cloud technology.
  • AWS NY Summits – You can gain insights from Swami’s keynote featuring the latest cutting-edge AWS technologies in compute, storage, and generative AI. My News Blog team is also preparing some exciting news for you. If you’re unable to attend in person, you can still participate by registering for the global live stream. Also, save the date for these upcoming Summits in July and August near your city.
  • AWS Builders Online Series – If you’re based in one of the Asia Pacific time zones, join and learn fundamental AWS concepts, architectural best practices, and hands-on demonstrations to help you build, migrate, and deploy your workloads on AWS.

You can browse all upcoming in-person and virtual events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

Channy

Amazon Bedrock baseline architecture in an AWS landing zone

Post Syndicated from Abdel-Rahman Awad original https://aws.amazon.com/blogs/architecture/amazon-bedrock-baseline-architecture-in-an-aws-landing-zone/

As organizations increasingly adopt Amazon Bedrock to build and deploy large-scale AI applications, it’s important that they understand and adopt critical network access controls to protect their data and workloads. These generative AI-enabled applications might have access to sensitive or confidential information within their knowledge bases, Retrieval Augmented Generation (RAG) data sources, or models themselves, which could pose a risk if exposed to unauthorized parties. Additionally, organizations might want to limit access to certain AI models to specific teams or services, making sure only authorized users can use the most powerful capabilities. Another important consideration is cost optimization, because organizations need to be able to monitor and control access to manage various aspects of their cloud spending.

In this post, we explore the Amazon Bedrock baseline architecture and how you can secure and control network access to your various Amazon Bedrock capabilities within AWS network services and tools. We discuss key design considerations, such as using Amazon VPC Lattice auth policies, Amazon Virtual Private Cloud (Amazon VPC) endpoints, and AWS Identity and Access Management (IAM) to restrict and monitor access to your Amazon Bedrock capabilities.

By the end of this post, you will have a better understanding of how to configure your AWS landing zone to establish secure and controlled network connectivity to Amazon Bedrock across your organization using VPC Lattice.

Solution overview

Addressing the aforementioned challenges requires a well-designed network architecture and security controls. For this, we use the standard AWS Landing Zone Accelerator networking configuration. It provides a good starting point for managing network communication across multiple accounts. On top of the AWS Landing Zone Accelerator network design, we add two shared accounts.

In this solution design, we create a centralized architecture for managing organization AI capabilities across different accounts. The architecture consists of three main parts that work together to provide secure and controlled access to AI services:

  • Service network account – This account serves as the central networking hub for the organization, managing network connectivity and access policies. Through this account, network administrators can centrally manage and control access to AI services across the organization. The account follows AWS Landing Zone Accelerator networking practices that scale with enterprise organizational needs.
  • Generative AI account – This account hosts the organization’s Amazon Bedrock capabilities and serves as the central point for AI/ML management. The organization’s AI/ML scientists and prompt engineers will centrally build and manage Amazon Bedrock capabilities. The account provides access to various large language models (LLMs) through Amazon Bedrock by using VPC interface endpoints, while also enabling centralized monitoring of cost consumption and access patterns.
  • Workload accounts (dev, test, prod) – These accounts represent different environments where teams develop and deploy applications that consume AI services. Through secure network connections established through the service network account, these workload accounts can access the AI capabilities hosted in the generative AI account. This separation enforces proper isolation between development, testing, and production workloads while maintaining secure access to AI services.
Amazon Bedrock baseline architecture in an AWS landing zone

Amazon Bedrock baseline architecture in an AWS landing zone

The following diagram illustrates the solution architecture.

The service network account has its own VPC Lattice service network—a centralized networking construct that enables service-to-service communication across your organization, which is shared with workload accounts using AWS Resource Access Manager (AWS RAM) to enable VPC Lattice Service network sharing.

Workload accounts (dev, test, prod) establish VPC associations with the shared VPC Lattice service network by creating a service network association in their VPC. When an application in these accounts makes a request, it first queries the VPC resolver for DNS resolution. The resolver routes the traffic to the VPC Lattice service network.

Access control is implemented through an VPC Lattice auth policy. The service network policies determine which accounts can access the VPC Lattice service network, and service-level policies control access to specific AI services and define what actions each account can perform.

In the central AI services account, we find the proxy layer, we create a VPC Lattice service that points to a proxy layer, which acts as a single entry point, providing workload accounts access to Amazon Bedrock. This proxy layer then connects to Amazon Bedrock through VPC endpoints. Through this setup, the AI team can configure which foundation models (FMs) are available and manage access permissions for different workload accounts. After the necessary policies and connections are in place, workload accounts can access Amazon Bedrock capabilities through the established secure pathway. This setup enables secure, cross-account access to AI services while maintaining centralized control and monitoring.

Network components

We use VPC Lattice, which is a fully managed application networking service that helps you simplify network connectivity, security, and monitoring for service-to-service communication needs. With VPC Lattice, organizations can achieve a centralized connectivity pattern to control and monitor access to the services required for building generative AI applications.

For details about VPC Lattice, refer to the Amazon VPC Lattice User Guide. The following is an overview of the constructs you can use in setting up the centralized pattern in this solution:

  • VPC Lattice service network – You can use the VPC Lattice service network to provide central connectivity and security to the central AI services account. The service network is a logical grouping mechanism that simplifies how you can enable connectivity across VPCs or accounts, and apply common security policies for application communication patterns. You can create a service network in an account and share it with other accounts within or outside AWS Organizations using AWS RAM.
  • VPC Lattice service – In a service network, you can associate a VPC Lattice service, which consists of a listener (protocol and port number), routing rules that allow for control of the application flow (for example, path, method, header-based, or weighted routing), and target group, which defines the application infrastructure. A service can have multiple listeners to meet various client capabilities. Supported protocols include HTTP, HTTPS, gRPC, and TLS. The path-based routing allows control to various high-performing FMs and other capabilities you would need to build a generative AI application.
  • Proxy layer – You use a proxy layer for the VPC Lattice service target group. The proxy layer can be built based on your organization’s preference of AWS services, such as AWS Lambda, AWS Fargate, or Amazon Elastic Kubernetes Service (Amazon EKS). The purpose of the proxy layer is to provide a single entry point to access LLMs, knowledge bases, and other capabilities that are tested and approved according to your organization’s compliance requirements.
  • VPC Lattice auth policies – For security, you use VPC Lattice auth policies. VPC Lattice auth policies are specified using the same syntax as IAM policies. You can apply an auth policy to VPC Lattice service network as well as to the VPC Lattice service.
  • Fully Qualified Domain Names –To facilitate service discovery, VPC Lattice supports custom domain names for your services and resources, and maintains a Fully Qualified Domain Name (FQDN) for each VPC Lattice service and resource you define. You can use these FQDNs in your Amazon Route 53 private hosted zone configurations, and empower business units or teams to discover and access services and resources.
  • Service network VPC – Business units or teams can access generative AI services in a service network using service network VPC associations or a service network VPC endpoint.
  • Monitoring – You can choose to enable monitoring at the VPC Lattice service network level and VPC Lattice service level. VPC Lattice generates metrics and logs for requests and responses, making it more efficient to monitor and troubleshoot applications

The preceding guidance takes a “secure by default” approach—you must be explicit about which features, models, and so on should be accessed by which business unit. The setup also enables you to implement a defense-in-depth strategy at multiple layers of the network:

  • The first level of defense is that business team needs to connect to the service network in order to get access to the generative AI service through the central AI service account.
  • The second level includes network-level security protections in the business team’s VPC for the service network, such as security groups and network access control lists (ACLs). By using these, you can allow access to specific workloads or teams in a VPC.
  • The third level is through the VPC Lattice auth policy, which you can apply at two layers: at the service network level to allow authenticated requests within the organization, and at the service level to allow access to specific models and features.

VPC Lattice auth policy

This solution makes it possible to centrally manage access to Amazon Bedrock resources across your organization. This approach uses an VPC Lattice auth policy to centrally control Amazon Bedrock resources and manage it from one location across all the organization accounts.

Typically, the auth policy on the service network is operated by the network or cloud administrator. For example, allowing only authenticated requests from specific workloads or teams in your AWS organization. In the following example, access is granted to invoke the generated AI service for authenticated requests and to principals that are part of the o-123456example organization:

{
   "Version": "2012-10-17",
   "Statement": [
      {
         "Effect": "Allow",
         "Principal": "*",
         "Action": "vpc-lattice-svcs:Invoke",
         "Resource": "*",
         "Condition": {
            "StringEquals": {
               "aws:PrincipalOrgID": [ 
                  "o-123456example"
               ]
            }
         }
      }
   ]
}

The auth policy at the service level is managed by the central AI service team to set fine-grained controls, which can be more restrictive than the coarse-grained authorization applied at the service network level. For example, the following policy restricts access to claude-3-haiku for only business-team1:

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect": "Allow",
         "Principal": {
            "AWS": [
               "arn:aws:iam::<account-number>:role/businss-team1"
            ]
         },
         "Action": "vpc-lattice-svcs:Invoke",
         "Resource": [
            "arn:aws:vpc-lattice:<aws-region>:<account-number>:service/svc-0123456789abcdef0/*"
         ],
         "Condition": {
            "StringEquals": {
               "vpc-lattice-svcs:RequestQueryString/modelid": "claude-3-haiku"            }
         }
      }
   ]
}

Monitoring and tracking

This design employs three monitoring approaches, using Amazon CloudWatch, AWS CloudTrail, and VPC Lattice access logs. This strategy provides a view of service usage, security, and performance.

CloudWatch metrics offer real-time monitoring of VPC Lattice service performance and usage. CloudWatch tracks metrics such as request counts and response times for Amazon Bedrock related endpoints, allowing for the setup of alarms for proactive management of service health and capacity. This enables monitoring of overall usage patterns of Amazon Bedrock models across different business units, facilitating capacity planning and resource allocation. CloudTrail provides detailed API-level auditing of Amazon Bedrock related actions. It logs cross-account access attempts and interactions with Amazon Bedrock services, providing a compliance and security audit trail. This tracking of who is accessing which Amazon Bedrock models, when, and from which accounts helps organizations adhere to their organizational policies.VPC Lattice access logs provide detailed insights into HTTP/HTTPS requests to Amazon Bedrock services, capturing specific usage patterns of AI models by different business teams. These logs contain client-specific information, which for example can be used to enable organizations to implement capabilities such as charge-back models. This allows for accurate attribution of AI service usage to specific teams or departments, facilitating fair cost allocation and responsible resource utilization across the organization. These services work together to enhance security, optimize performance, and provide valuable insights for managing cross-account Amazon Bedrock access.

Conclusion

In this post, we explored the importance of securing and controlling network access to Amazon Bedrock capabilities within an organization’s AWS landing zone. We discussed the key business challenges, such as the need to protect sensitive information in Amazon Bedrock knowledge bases, limit access to AI models, and optimize cloud costs by monitoring and controlling Amazon Bedrock capabilities. To address these challenges, we outlined a multi-layered network solution that uses AWS networking services, including a VPC Lattice auth policy to restrict and monitor access to Amazon Bedrock capabilities. Try out this solution for your own use case, and share your feedback in the comments.


About the authors

Designing centralized and distributed network connectivity patterns for Amazon OpenSearch Serverless

Post Syndicated from Ankush Goyal original https://aws.amazon.com/blogs/big-data/designing-centralized-and-distributed-network-connectivity-patterns-for-amazon-opensearch-serverless/

Amazon OpenSearch Serverless is a fully managed search and analytics service that automatically provisions and scales infrastructure to help you run search and analytics workloads without cluster management. With OpenSearch Serverless, you can quickly build search and analytics capabilities into your applications.

As organizations scale their use of OpenSearch Serverless, understanding network architecture and DNS management becomes increasingly important. Building upon the connectivity patterns discussed in our previous post Network connectivity patterns for Amazon OpenSearch Serverless, this post covers advanced deployment scenarios focused on centralized and distributed access patterns—specifically, how enterprises can simplify network connectivity across multiple AWS accounts and extend access to on-premises environments for their OpenSearch Serverless deployments.

We outline two key deployment patterns:

  • Pattern 1 – A centralized endpoint model where interface virtual private cloud (VPC) endpoints for OpenSearch Serverless are deployed in a shared services VPC, allowing spoke VPCs from other AWS accounts and on premises to access OpenSearch Serverless collections through these consolidated endpoints.
  • Pattern 2 – A distributed endpoint model where interface VPC endpoints are created in individual spoke VPCs, with multiple consumers (central account, on-premises networks, and other spoke accounts) accessing these endpoints through centralized DNS management. This approach provides direct connectivity within each spoke VPC while maintaining centralized DNS control and management across the organization.

Before diving into advanced deployment patterns, let’s review the DNS behavior of OpenSearch Serverless when accessed through an interface VPC endpoint (AWS PrivateLink). Understanding this foundational aspect can help clarify the connectivity patterns we explore in this post.

OpenSearch Serverless interface VPC endpoint DNS resolution

When creating an OpenSearch Serverless interface VPC endpoint, the service automatically provisions three private hosted zones: one visible private hosted zone us-east-1.aoss.amazonaws.com that handles domain resolution for the OpenSearch Serverless collection and dashboard, another visible private hosted zone us-east-1.opensearch.amazonaws.com that manages resolution for the OpenSearch UI (OpenSearch Dashboards), and one hidden internal private hosted zone that manages the final DNS resolution to private IP addresses.

Our objective in this post is to explore how the two private hosted zones for OpenSearch Serverless work together: the visible private hosted zone us-east-1.aoss.amazonaws.com for collections and dashboards, and the hidden private hosted zone for final DNS resolution to private IP addresses. We examine how these private hosted zones enable scalable DNS resolution in both centralized and distributed architectures. The following workflow diagram shows the DNS resolution flow for the us-east-1 AWS Region. The same pattern applies to other Regions, with the Region identifiers in the DNS records changing accordingly.

DNS Workflow Diagram AOSS

The workflow consists of the following steps:

  1. A user requests access to a collection URL (for example, abc.us-east-1.aoss.amazonaws.com).
  2. The DNS request is sent to the Amazon Route 53 Resolver, which checks the visible private hosted zone us-east-1.aoss.amazonaws.com and finds a CNAME record pointing to the endpoint-specific domain.
  3. The Route 53 Resolver uses the hidden internal private hosted zone to resolve this endpoint-specific domain to the VPC endpoint’s private IP address.
  4. Traffic is allowed only if it originates from the interface VPC endpoint approved by OpenSearch Serverless network policies.

Although this DNS Resolution Process provides flexible and secure private access, it becomes complex when you need connectivity from multiple VPCs, different AWS accounts, or on-premises networks. The following patterns address these challenges and outline strategies to simplify network access and DNS management for OpenSearch Serverless in such environments.

Pattern 1: Centralized interface VPC endpoint for OpenSearch Serverless

This pattern uses a centralized approach where a shared services AWS account with a shared services VPC hosts the OpenSearch Serverless interface VPC endpoint and OpenSearch Serverless collection. From there, other AWS accounts with Amazon VPCs (spoke VPCs) need to be able to access OpenSearch Serverless collections through this central endpoint. Organizations commonly implement this setup in hub-and-spoke network designs that connect their VPCs using either AWS Transit Gateway or AWS Cloud WAN. The following diagram illustrates this architecture.

AOSS Centralized Account

Challenge

When accessing from on-premises networks, both network access and DNS resolution for the OpenSearch Serverless interface VPC endpoint work successfully. However, although the endpoint is network-accessible from spoke VPCs (for example, through Transit Gateway or AWS Cloud WAN), DNS resolution from these VPCs fail.

This happens because OpenSearch Serverless creates and uses a private hosted zone us-east-1.aoss.amazonaws.com that is only associated with the VPC containing the endpoint, in this case, the Shared Services VPC. Simply sharing this private hosted zone with the spoke VPCs doesn’t solve the problem, because the wildcard CNAME record references a DNS name privatelink.c0X.sgw.iad.prod.aoss.searchservices.aws.dev. This DNS name can’t be resolved from other VPCs without additional configuration, because it belongs to a private hosted zone privatelink.c0X.sgw.iad.prod.aoss.searchservices.aws.dev that is only associated with the shared services VPC. This private hosted zone isn’t visible in your account and is controlled by AWS.

Solution: Use Amazon Route 53 Profiles for cross-VPC DNS resolution

To enable centralized DNS resolution, you can use Amazon Route 53 Profiles. With Route 53 Profiles, you can manage and apply DNS-related Amazon Route 53 configurations across multiple VPCs and AWS accounts. The following diagram illustrates the solution architecture.

AOSS Centralized Account Route53 Profile

The solution consists of the following steps:

  1. Create an OpenSearch Serverless interface VPC endpoint in the shared services VPC. This automatically creates and associates the following:
      1. Two default private hosted zones.
      2. One hidden private hosted zone with this VPC.
  2. Create a Route 53 Profile in the shared services account.
  3. Associate the interface VPC endpoint for OpenSearch Serverless with the Route 53 Profile.
    1. The Route 53 Profile automatically associates the hidden private hosted zone with the profile.
  4. Associate the private hosted zone us-east-1.aoss.amazonaws.com that was automatically created by OpenSearch Serverless with the Route 53 Profile.
  5. Share the Route 53 Profile with your other AWS accounts in your organization using AWS Resource Access Manager (AWS RAM).
  6. Associate the spoke VPCs (located in different accounts) with the Route 53 Profile.

If you have an existing Route 53 Profile in your shared services account that is already associated to spoke VPCs, you can simply associate the OpenSearch Serverless interface VPC endpoint and the private hosted zone us-east-1.aoss.amazonaws.com to this profile.

After completing these steps, the DNS resolution for the OpenSearch Serverless collection and dashboard endpoints works seamlessly from spoke VPCs associated with the Route 53 Profile. Clients in spoke VPCs can resolve and access OpenSearch Serverless collections and dashboards through the centralized VPC endpoint.

Pattern 2: Distributed interface VPC endpoint for OpenSearch Serverless

Each spoke VPC, residing in its respective AWS account, hosts its own OpenSearch Serverless collection and interface VPC endpoint. We now want to achieve the following:

  • Centralize DNS management in a shared services VPC to provide consistent resolution for OpenSearch Serverless collections deployed across multiple spoke accounts
  • Provide on-premises resources with DNS resolution capability for all OpenSearch Serverless collections across the organization through a Route 53 Resolver inbound endpoint in the shared services VPC

The following diagram illustrates this architecture.

AOSS Distributed

Challenge

Managing DNS resolution for OpenSearch Serverless collections and dashboards becomes complex in this distributed model because each interface VPC endpoint creates its own set of private hosted zones that are only associated with their respective VPCs. This creates a fragmented DNS landscape where the shared services VPC and on-premises networks need a consolidated way to resolve domains of OpenSearch Serverless collections and dashboards across multiple spoke accounts.

Solution: Use a self-managed private hosted zone in the shared services VPC for on-prem DNS resolution

To enable centralized DNS resolution for distributed endpoints, create a self-managed private hosted zone in the shared services account and associate it with the shared services VPC. Within this private hosted zone, you can create CNAME records that map each OpenSearch Serverless collection endpoint to its respective interface VPC endpoint DNS names in the spoke accounts. The following diagram illustrates this architecture.

AOSS Distributed Centralized DNS On Prem and VPC

Implementation consists of the following steps:

  1. Create a self-managed private hosted zone in the shared services account with the domain name us-east-1.aoss.amazonaws.com and associate it with the shared services VPC. For each OpenSearch Serverless collection, create a CNAME record that points to the Regional DNS name of its corresponding interface VPC endpoint.

This configuration enables both on-premises resources and resources in the shared services VPC to resolve OpenSearch Serverless endpoints that are in the spoke accounts.

After you complete these steps, each OpenSearch Serverless interface VPC endpoint remains within its original AWS account, maintaining security boundaries and account-level autonomy. On-premises systems can access OpenSearch Serverless collections and dashboards using original collection DNS names (for example, {collection-name}.us-east-1.aoss.amazonaws.com) through DNS resolution provided by the private hosted zone in the shared services VPC.

Conclusion

As organizations scale their adoption of OpenSearch Serverless, establishing secure and centralized network access becomes increasingly important. In this post, we explored two architectural patterns specifically around DNS management:

  • Centralized endpoint model – This pattern is ideal when a shared services account manages the OpenSearch Serverless interface VPC endpoints, allowing multiple spoke accounts to access OpenSearch Serverless collections and dashboards through a centralized set of network resources.
  • Distributed endpoint model with centralized DNS – This pattern is suitable for organizations that require account-level autonomy, where each AWS account manages its own OpenSearch Serverless interface VPC endpoints, while DNS resolution is centralized through a shared self-managed private hosted zone in a shared services account.

By understanding the DNS architecture of OpenSearch Serverless and using services like Route 53 Profiles and AWS RAM, organizations can build secure and robust access patterns that align with their organizational structure and needs.


About the Authors

Ankush GoyalAnkush Goyal is a Enterprise Support Lead in AWS Enterprise Support who helps customers streamline their cloud operations on AWS. He is a results-driven IT professional with over 20 years of experience.

Anvesh KogantiAnvesh Koganti is a Solutions Architect at AWS specializing in Networking. He focuses on helping customers build networking architectures for highly scalable and resilient AWS environments. Outside of work, Anvesh is passionate about consumer technology and enjoys listening to podcasts on tech and business. When disconnecting from the digital world, Anvesh spends time outdoors hiking and biking.

Salman AhmedSalman Ahmed is a Senior Technical Account Manager in AWS Enterprise Support. He specializes in guiding customers through the design, implementation, and support of AWS solutions. Combining his networking expertise with a drive to explore new technologies, he helps organizations successfully navigate their cloud journey. Outside of work, he enjoys photography, traveling, and watching his favorite sports teams.

Protect against advanced DNS threats with Amazon Route 53 Resolver DNS Firewall

Post Syndicated from Lawton Pittenger original https://aws.amazon.com/blogs/security/protect-against-advanced-dns-threats-with-amazon-route-53-resolver-dns-firewall/

Every day, millions of applications seamlessly connect users to the digital services they need through DNS queries. These queries act as an interface to the internet’s address book, translating familiar domain names like amazon.com into the IP addresses that computers use to appropriately route traffic. The DNS landscape presents unique security challenges and opportunities in Amazon Virtual Private Cloud (Amazon VPC) environments. First, DNS resolution acts as an early checkpoint that you can use to control network traffic before it even begins. Second, DNS queries in your VPC follow a distinct path through the Amazon Route 53 Resolver that operates independently from your standard internet gateway, bypassing other network security controls.

To address this, Amazon Route 53 Resolver DNS Firewall provides protection for DNS traffic, starting with traditional domain lists where you can explicitly allow or deny DNS resolution of specific domains. Also, included are AWS Managed Domain Lists, which automatically block known malicious domains identified through Amazon Threat Intelligence and our trusted security partners. While this approach works effectively to help prevent known threats, sophisticated bad actors are increasingly using techniques that traditional blocklists can’t catch.

Instead of relying solely on static lists, Amazon Route 53 Resolver DNS Firewall Advanced provides intelligent protection alongside these traditional controls. These advanced rules work like a skilled security analyst, watching for suspicious patterns in DNS queries in real time. By examining characteristics such as query length, entropy, and frequency, the service can spot potentially malicious activity even when encountering previously unknown domains. This approach enables detecting and blocking advanced threats like DNS tunneling and domain generation algorithms (DGAs)—techniques that bad actors use to establish hidden communication channels or connect malware to their control servers.

In this post, we take you on a practical journey exploring these DNS-based threats and tools to help prevent them. You’ll learn how to set up effective Route 53 Resolver DNS Firewall Advanced rules, and we provide a ready-to-deploy CloudFormation template with our recommended configurations. Finally, we demonstrate an example of real-world threat detection and show you how the service integrates with AWS Security Hub to improve visibility of alerts. By the time you finish reading this post, you’ll have a clear understanding of how to deploy Route 53 Resolver DNS Firewall rules to add an intelligent, proactive layer of security to your AWS environment.

Understanding the risks of DNS tunneling and DGAs

As mentioned earlier, the Route 53 Resolver provides a service-managed path to the internet that operates independently from your VPC’s internet gateway. While this architecture enables efficient DNS resolution, it can be exploited through techniques such as DNS tunneling. Let’s explore how these techniques work and why they present unique challenges.

DNS tunneling takes advantage of the DNS protocol’s basic function—asking questions about domain names and receiving answers from the authoritative nameserver for the domain. But instead of using DNS for its intended purpose of domain name resolution, tunneling encodes other types of data within DNS queries and responses. For example, rather than asking simply what is the IP address for example.com?, a tunneling exploit might embed data within a query like secretdata123.attacker.com, where secretdata123 contains encoded information. This can lead to DNS being used as a two-way communications command and control channel. Detecting and blocking DNS tunneling is a vital control for stopping data exfiltration and command and control (C2) communications.

DGAs represent a different challenge for DNS security. Rather than using a fixed, predictable domain name that can be quickly blocked, DGAs automatically create many possible domain names using mathematical formulas, which are then used as a destination for C2 traffic. For instance, a DGA might generate domains like xkt7py.com today and mn9qrs.com tomorrow. This makes it difficult to maintain effective blocklists, because the domains change frequently and appear random. Traditional threat intelligence feeds, which rely on identifying and blocking known malicious domains, struggle to keep pace with DGA-generated domains.

How DNS Firewall Advanced works

When examining a domain name, Route 53 Resolver DNS Firewall Advanced looks at multiple characteristics that help distinguish between legitimate and suspicious domains. For example, legitimate domain names typically use real words and follow predictable patterns that are designed to facilitate a human’s ability to recall and enter them accurately. In contrast, domains used for tunneling or generated by DGAs often contain random-looking strings of characters or unusual patterns.

Route 53 Resolver DNS Firewall Advanced builds its intelligence on extensive analysis of real-world domain usage patterns. It learns what legitimate domain names look like by studying the most resolved domains on the internet, combined with actual domain resolution patterns from across AWS. This real-world training data helps establish a baseline for normal domain name characteristics. DNS Firewall Advanced then contrasts these patterns against known techniques used in DNS tunneling and domain generation to identify suspicious activity.

The service analyzes various aspects of each domain name, including:

  • How the domain name is structured and broken into parts
  • The patterns of letters and numbers used
  • How closely the domain resembles natural language
  • The presence of common words versus random character combinations

The service analyzes queries in real time, processing each one in less than a millisecond, which maintains strong security controls without affecting your applications’ performance.

Route 53 Resolver DNS Firewall Advanced has customized protection levels that you can use to choose how aggressively you want to detect and respond to suspicious domains through confidence thresholds:

  • High confidence: This setting focuses on the most obvious threats, minimizing false positives. It’s ideal for production environments where blocking legitimate traffic could be disruptive.
  • Medium confidence: Provides balanced protection, suitable for most environments.
  • Low confidence: Offers the most detection but might require more tuning to avoid false positives. This setting is useful for high-security environments or for initial monitoring to understand traffic patterns.

You can combine these confidence levels with different actions (block or alert) to create a defense strategy that matches your security needs.

Manually create a DNS Firewall Advanced rule:

To start, we show you how to manually create a Route 53 Resolver DNS Firewall Advanced rule in the AWS Management Console. This rule will block DNS queries that it has detected to be DNS tunneling with high confidence.

To manually create a rule:

  1. In the Route 53 console, choose Rules in the navigation pane, and then choose Add rule.
    Figure 1: Rules in the Route 53 console

    Figure 1: Rules in the Route 53 console

  2. Enter a name for the rule and select DNS Firewall Advanced protections.
    Figure 2: Add a rule

    Figure 2: Add a rule

  3. Under DNS Firewall Advanced protection:
    1. Select DNS tunneling detection.
    2. For Confidence threshold, select High.
    3. Leave the Query type empty so that the rule applies to all query types.
    Figure 3: Select DNS protection options

    Figure 3: Select DNS protection options

  4. Under Action:
    1. Select Block.
    2. For the response, select OVERRIDE.
    3. For the Record value, enter dns-firewall-advanced-block.
    4. For the Record type, select CNAME.
    5. Choose Add rule.
    Figure 4: Configure actions for the rule

    Figure 4: Configure actions for the rule

We’ve created an AWS CloudFormation stack that deploys the following recommended Route 53 Resolver DNS Firewall rules in a DNS Firewall rule group. We recommend this configuration because it provides a balanced security approach—blocking high-confidence threats immediately while generating alerts for lower-confidence detections.

The inclusion of the AWS Managed Aggregate Threat List is particularly valuable because it combines domains from multiple threat categories (malware, ransomware, botnet, spyware, and DNS tunneling) into a blocklist. This consolidated list includes the domains from other AWS Managed Domain Lists, including those identified by GuardDuty threat intelligence systems, giving you broad protection against known malicious domains while the Route 53 DNS Firewall Advanced rules catch previously unseen threats.

For enterprise environments, you can scale this protection across your entire organization by using AWS Firewall Manager to automatically deploy and manage this rule group configuration consistently across the VPCs in your organization.

  • BLOCK – Aggregate Threat List (domains associated with multiple DNS threat categories including malware, ransomware, botnet, spyware, and DNS tunneling to help block multiple types of threats)
  • BLOCK – DNS Tunneling | Confidence: HIGH
  • BLOCK – DGAs | Confidence: HIGH
  • ALERT – DNS Tunneling | Confidence: LOW
  • ALERT – DGAs | Confidence: LOW

To deploy this rule group using a CloudFormation stack:

  1. Navigate to the CloudFormation console, choose Stacks from the navigation pane. Choose Create Stack in the upper right and select With new resources (standard).
    Figure 5: Create a stack

    Figure 5: Create a stack

  2. Download the CloudFormation template. Select Choose an existing template and then select Upload a template file and upload the CloudFormation stack. Choose Next.
    Figure 6: Use the CloudFormation template

    Figure 6: Use the CloudFormation template

  3. Enter a stack name and choose Next.
    Figure 7: Enter a stack name

    Figure 7: Enter a stack name

  4. Leave the default values for all options, select Next, and then choose Submit.
  5. Navigate to the Route 53 Resolver DNS Firewall by visiting the Amazon VPC console, scroll down to the DNS firewall section, and select the Rule groups tab.
  6. Select the newly created rule group.
  7. Select the Associated VPCs tab, choose Associate VPC, and then associate a VPC you want to protect and choose Associate.
    Figure 8: Associate a VPC

    Figure 8: Associate a VPC

Observability

Route 53 Resolver query logging provides detailed visibility into DNS queries made from resources associated with your VPCs, enabling you to monitor and analyze your DNS traffic for security and compliance purposes. By configuring query logging, you can capture essential information about each DNS request, including the domain name being queried, the record type, the response code, and the originating VPC and instance. Query logging is particularly valuable when used in conjunction with Route 53 Resolver DNS Firewall, because it helps you track blocked queries and fine-tune your security rules based on actual DNS traffic patterns in your environment. The following are examples of log entries generated when DNS Firewall detects and responds to suspicious activities, showing the detailed information available for security analysis and incident response.

Example log entry: DNS tunneling block

The following is an example of a DNS tunneling block.

{
    "version": "1.100000",
    "account_id": "11111111111",
    "region": "us-west-2",
    "vpc_id": "vpc-0fcc85bd45b791d5a",
    "query_timestamp": "2025-02-05T03:54:12Z",
    "query_name": "1WTE4CyL4Vf1LQDDAToimuqFBEtMXyYMsYP8zPgVyTagzSh5PvinuQcL6N8at4A.REZv3VqKU4x43DPcCKAzQk4UKoZjB3nDMukHAuKTtDckTqZ8SDDZ1iXRey6a5sD.mEDMdrzPocS9exqoBQ1xfSuKfvW.1.dnstunnel.com.",
    "query_type": "A",
    "query_class": "IN",
    "rcode": "NXDOMAIN",
    "answers": [
        {
            "Rdata": "dns-firewall-advanced-block.",
            "Type": "CNAME",
            "Class": "IN"
        }
    ],
    "srcaddr": "10.1.0.122",
    "srcport": "41859",
    "transport": "UDP",
    "srcids": {
        "instance": "i-0c738190f19db9a2c"
    },
    "firewall_rule_action": "BLOCK",
    "firewall_rule_group_id": "rslvr-frg-63efa138b43f428b",
    "firewall_protection": "DNS_TUNNELING"
}

Example log entry: DNS tunneling alert

The following is an example of a DNS tunneling alert.

{
    "version": "1.100000",
    "account_id": "11111111111",
    "region": "us-west-2",
    "vpc_id": "vpc-0fcc85bd45b791d5a",
    "query_timestamp": "2025-02-05T04:00:02Z",
    "query_name": "1WTEc8GwFH3qHY8XKjbhXuj43yGShMrhacqwJYSZkSqRQ95sagz64NUpnuj4R8R.S79aru2KRB8d9nCHEPdXWJxGT4aUjVMqtCRSq9EZXRCo8NH5cmLvmcho3hh1mbK.NqGY1X6M4qpMGX6dnTSHuCsZFbf.1.dnstunnel.com.",
    "query_type": "A",
    "query_class": "IN",
    "rcode": "NOERROR",
    "answers": [
        {
        "Rdata": "202.92.34.217",
        "Type": "A",
        "Class": "IN"
        }
    ],
    "srcaddr": "10.1.0.122",
    "srcport": "35116",
    "transport": "UDP",
    "srcids": {
        "instance": "i-0c738190f19db9a2c",
        "resolver_endpoint": "rslvr-out-e20639d3666748f58"
    },
    "firewall_rule_action": "ALERT",
    "firewall_rule_group_id": "rslvr-frg-63efa138b43f428b",
    "firewall_protection": "DNS_TUNNELING"
}

Integration with Security Hub

Security Hub provides you with a view of your security state in AWS and helps you to check your environment against security industry standards and best practices. Security Hub collects security data from across AWS accounts, AWS services, and supported third-party partner products, and helps you to analyze security trends and identify the highest priority security issues. It enables findings from both the Amazon: Route 53 Resolver DNS Firewall – AWS List and Amazon: Route 53 Resolver DNS Firewall Advanced list by default, so you’ll automatically receive these alerts without additional configuration. You only need to manually enable Amazon: Route 53 Resolver DNS Firewall – Custom List findings if you’re using custom domain lists in your rule groups. See Sending findings from Route 53 Resolver DNS Firewall to Security Hub for more information.

The following figure is an example of how Route 53 Resolver DNS Firewall Advanced findings appear in the Security Hub console, providing you with actionable security intelligence directly in your centralized dashboard.

Figure 9: DNS Firewall Advanced findings in Security Hub

Figure 9: DNS Firewall Advanced findings in Security Hub

Select a finding to view details such as Finding ID, Types, Workflow status, and so on.

Figure 10: Findings details

Figure 10: Findings details

Conclusion

Amazon Route 53 Resolver DNS Firewall Advanced represents a significant step forward in protecting organizations against sophisticated DNS-based threats. As mentioned, DNS queries sent to the Route 53 Resolver follow a unique path that bypasses traditional AWS security controls like security groups, NACLs, and even AWS Network Firewall—creating a security gap in many environments. Throughout this post, we’ve explored how DNS tunneling and DGA-based exploits take advantage of this blind spot, and how you can use Route 53 Resolver DNS Firewall Advanced to protect from these threats through real-time pattern analysis and anomaly detection. You learned how to configure the service in the AWS console and use the provided CloudFormation template with recommended rules that balance blocking high-confidence threats while alerting on potential issues. And you saw how query logging provides valuable visibility into your DNS traffic and how Security Hub integration centralizes your security findings. Implementing these capabilities helps you protect your infrastructure from sophisticated DNS-based exploits that traditional domain blocklists cannot catch, strengthening your cloud security posture while maintaining operational efficiency.

If you have feedback about this post, submit comments in the Comments section below.

Lawton Pittenger

Lawton Pittenger

Lawton is a Security Solutions Architect at AWS, based in New York City, focused on helping customers implement native AWS security services. Professionally, Lawton has worked in IT security roles, securing cloud environments. Outside of cloud security, his interests include skateboarding, snowboarding, and rock climbing.

Michael Leighty

Michael Leighty

Michael is a Senior Security Solutions Architect at AWS, based in Atlanta. He specializes in helping customers design and implement effective network security controls, drawing from extensive experience at leading network security vendors. At AWS, he works closely with service teams to drive continuous improvement in security services based on customer needs and feedback.

How UNiDAYS achieved AWS Region expansion in 3 weeks

Post Syndicated from Sanhawat Taongern original https://aws.amazon.com/blogs/architecture/how-unidays-achieved-aws-region-expansion-in-3-weeks/

UNiDAYS is a fast, free digital platform that provides exclusive student offers and benefits to over 29 million verified members worldwide. With a rapidly growing user base and an increasing number of global partnerships, UNiDAYS recognized the need to enhance its platform’s performance to deliver a seamless consumer experience in geographic regions far from its original base of operations.

In this post, we share how UNiDAYS achieved AWS Region expansion in just 3 weeks using AWS services.

Business challenges

In response to growth opportunities, UNiDAYS faced a pressing business requirement: deliver low-latency responses and provide high availability for users across diverse geographic regions. At the same time, the platform needed to guarantee global data consistency while adhering to tight deadlines—all within just a few weeks. However, the existing monolithic application, although built on Amazon Web Services (AWS), wasn’t optimized for active-active multi-Region deployments.

The challenge was further complicated by the need to extend functionality from this legacy system, which used the AWS global network for improved user experience but fell short of meeting new business requirements. Re-architecting the entire platform to support multi-Region deployments within the given timeframe wasn’t feasible.

Solution overview

UNiDAYS opted to create complementary services tailored to these new requirements, using AWS services for a multi-Region, active-active architecture. The key services used included:

This approach allowed UNiDAYS to meet its latency, availability, and consistency goals while seamlessly integrating with existing infrastructure. The following diagram is the architecture for the solution.

Global delivery and resiliency

To provide the lowest latency and multi-Region resiliency, CloudFront was used with latency-based routing configured in Route 53. This routing directs requests to the Regional Application Load Balancers with the lowest latency, automatically providing resiliency in the event of Regional issues. Security was a key consideration. AWS WAF integration with CloudFront provided application-layer protection at the edge. Additional security measures included:

  • Custom HTTP headers on origin requests, enforced using Application Load Balancer listener rule conditions
  • Prefix lists to restrict access to Application Load Balancers, making sure that traffic originated from the intended CloudFront distributions

Rapid Regional deployment

The core infrastructure is deployed through Terraform, and applications are deployed using custom tooling that wraps AWS CloudFormation. This hybrid approach enabled rapid delivery by using existing patterns without disrupting established workflows. Resources were organized into tiers: platform, global, and Regional. Platform and global resources were deployed one time, and Regional resources were rolled out to each activated Region, streamlining expansion efforts.

One technical challenge involved CloudFormation exports, which are Regional by design. To address this, we implemented a custom CloudFormation macro to enable cross-Region access to exported values, providing consistency across deployments.

Amazon ECS enabled progressive application deployments within each Region, allowing teams to focus on scaling applications rather than managing infrastructure. For cost-efficiency, we used Spot Instances. During testing, container start-up latency was observed due to cross-Region image downloads from Amazon Elastic Container Registry (Amazon ECR). This issue was resolved by enabling private image replication in Amazon ECR so that container images were available locally in each Region. This solution significantly reduced start-up times, improving application responsiveness during deployments and scaling events.

Data consistency and performance

DynamoDB global tables were instrumental in providing eventual data consistency and Regional replication. With DynamoDB handling these aspects, we could focus on application logic.

The result was a substantial reduction in latency at key locations. For example, client-experienced latency in one Region dropped from approximately 200 milliseconds to 50 milliseconds upon deployment, as shown in the following screenshot.

Outcome

Key technical hurdles

We addressed the following technical obstacles while developing the solution:

  • Cross-Region CloudFormation exports – CloudFormation exports are Regional by design. We addressed this by creating a custom CloudFormation macro to read exports across Regions.
  • Container start-up latency – Latency caused by cross-Region image downloads was mitigated by implementing Amazon ECR private image replication. This meant that container images were readily available in each Region, reducing deployment times and improving overall performance.
  • Security assurance – By using CloudFront, AWS WAF, and Application Load Balancer security features, we made sure that traffic and data remained secure.

Why AWS?

UNiDAYS chose AWS due to its comprehensive global infrastructure and robust service offerings, which allowed the platform to:

  • Seamlessly expand compute operations to Regions closer to its user base
  • Take advantage of a full stack of services for reliable, secure, and low-latency content delivery
  • Meet tight delivery deadlines without compromising on performance or security
  • Maintain flexibility where required, with the ability to use more managed services, which allowed a focus on our applications

Conclusion

By adopting a multi-Region, active-active architecture on AWS, UNiDAYS successfully met its business goals within only 3 weeks, rapidly expanding to new Regions while promoting platform resiliency. The solution improved latency by 75% in new Regions (from 200 milliseconds to 50 milliseconds), provided Regional data availability through DynamoDB global tables, and maintained 100% service uptime during resiliency tests, even in cases of Regional connectivity loss. Additionally, deployment velocity increased by over 40%, allowing faster feature releases and improved agility. This architecture not only provides a scalable and resilient platform for current operations but also establishes a strong foundation for future global expansion.

Learn more

Is your organization looking to expand into new Regions while maintaining performance and reliability?

  • Contact AWS experts to explore tailored solutions for your multi-Region strategy.
  • Use AWS Global Infrastructure to optimize your expansion.
  • Share your challenges and successes in the comments—we’d love to hear your insights!


About the Authors

Amazon API Gateway now supports dual-stack (IPv4 and IPv6) endpoints

Post Syndicated from Betty Zheng (郑予彬) original https://aws.amazon.com/blogs/aws/amazon-api-gateway-now-supports-dual-stack-ipv4-and-ipv6-endpoints/

Today, we are launching IPv6 support for Amazon API Gateway across all endpoint types, custom domains, and management APIs, in all commercial and AWS GovCloud (US) Regions. You can now configure REST, HTTP, and WebSocket APIs, and custom domains, to accept calls from IPv6 clients alongside the existing IPv4 support. You can also call API Gateway management APIs from dual-stack (IPv6 and IPv4) clients. As organizations globally confront growing IPv4 address scarcity and increasing costs, implementing IPv6 becomes critical for future-proofing network infrastructure. This dual-stack approach helps organizations maintain future network compatibility and expand global reach. To learn more about dualstack in the Amazon Web Services (AWS) environment, see the IPv6 on AWS documentation.

Creating new dual-stack resources

This post focuses on two ways to create an API or a domain name with a dualstack IP address type: AWS Management Console and AWS Cloud Development Kit (CDK).

AWS Console

When creating a new API or domain name in the console, select IPv4 only or dualstack (IPv4 and IPv6) for the IP address type.

As shown in the following image, you can select the dualstack option when creating a new REST API.
For custom domain names, you can similarly configure dualstack as shown in the next image.

If you need to revert to IPv4-only for any reason, you can modify the IP address type setting, with no need to redeploy your API for the update to take effect.

REST APIs of all endpoint types (EDGE, REGIONAL and PRIVATE) support dualstack. Private REST APIs only support dualstack configuration.

AWS CDK

With AWS CDK, start by configuring a dual-stack REST API and domain name.

const api = new apigateway.RestApi(this, "Api", {
  restApiName: "MyDualStackAPI",
  endpointConfiguration: {ipAddressType: "dualstack"}
});

const domain_name = new apigateway.DomainName(this, "DomainName", {
  regionalCertificateArn: 'arn:aws:acm:us-east-1:111122223333:certificate/a1b2c3d4-5678-90ab',
  domainName: 'dualstack.example.com',
  endpointConfiguration: {
    types: ['Regional'],
    ipAddressType: 'dualstack'
  },
  securityPolicy: 'TLS_1_2'
});

const basepathmapping = new apigateway.BasePathMapping(this, "BasePathMapping", {
  domainName: domain_name,
  restApi: api
});

IPv6 Source IP and authorization

When your API begins receiving IPv6 traffic, client source IPs will be in IPv6 format. If you use resource policies, Lambda authorizers, or AWS Identity and Access Management (IAM) policies that reference source IP addresses, make sure they’re updated to accommodate IPv6 address formats.

For example, to permit traffic from a specific IPv6 range in a resource policy.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": "execute-api:Invoke",
      "Resource": "execute-api:stage-name/*",
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": [
            "192.0.2.0/24",
            "2001:db8:1234::/48"
          ]
        }
      }
    }
  ]
}

Summary

API Gateway dual-stack support helps manage IPv4 address scarcity and costs, comply with government and industry mandates, and prepare for the future of networking. The dualstack implementation provides a smooth transition path by supporting both IPv4 and IPv6 clients simultaneously.

To get started with API Gateway dual-stack support, visit the Amazon API Gateway documentation. You can configure dualstack for new APIs or update existing APIs with minimal configuration changes.

Betty

Special thanks to Ellie Frank (elliesf), Anjali Gola (anjaligl), and Pranika Kakkar (pranika) for providing resources, answering questions, and offering valuable feedback during the writing process. This blog post was made possible through the collaborative support of the service and product management teams.


How is the News Blog doing? Take this 1 minute survey!

(This survey is hosted by an external company. AWS handles your information as described in the AWS Privacy Notice. AWS will own the data gathered via this survey and will not share the information collected with survey respondents.)

From virtual machine to Kubernetes to serverless: How dacadoo saved 78% on cloud costs and automated operations

Post Syndicated from Andreas Gehrig original https://aws.amazon.com/blogs/architecture/from-virtual-machine-to-kubernetes-to-serverless-how-dacadoo-saved-78-on-cloud-costs-and-automated-operations/

dacadoo is a global Swiss-based technology company that develops solutions for digital health engagement and health risk quantification. Their products include a software-as-a-service (SaaS)-based digital health engagement platform that uses behavioral science, AI, and gamification to help end users improve their health outcomes.

The company embarked on a journey to modernize an API to quantify health and lifestyle data plus a risk engine to calculate mortality and morbidity probabilities based on years of scientific research data.

To transform a virtual machine–based API service into a globally redundant, scalable health score and risk calculation solution dacadoo chose Amazon Web Services (AWS) technology. The service handles highly sensitive health data from a global customer base and must comply with regional regulations.

The result is a cost reduction of 78% and an infrastructure maintenance effort of less than an hour per year , allowing dacadoo to deliver and operate more AWS infrastructure without scaling its site reliability engineering (SRE) team, thanks to a high level of automation and an agile mindset.

In this post, we walk you step-by-step through dacadoo’s journey of embracing managed services, highlighting their architectural decisions as we go.

Background

The solution architecture went through a three-stage journey:

  1. Incubation – Single virtual machine on premises with disaster recovery (DR) in Switzerland
  2. Global and scalable – Multiple global Kubernetes clusters
  3. Operational excellence – Fully serverless and geo-redundant on AWS

Stage 1: Incubation with a virtual machine

After years of scientific research and development, the service was launched, running on a single on-premises virtual machine that used hypervisor technology to provide disaster recovery (DR). However, it had no high availability (HA) capability and it required manual recovery.

The application serving the API requests and the NoSQL database were both running on the same host. Software deployment and operating system maintenance were performed manually using Secure Shell (SSH)—a typical low-automation setup that also included downtime.

The following architecture diagram shows a virtual machine encompassing the monolithic application and its database.

Monolithic architecture

Challenges

A single virtual machine was quick to set up and inexpensive to operate, but it had considerable shortcomings. The health API was only available in Switzerland, infrastructure maintenance was performed manually, and software deployment was handled manually. Additionally, database backups were done using virtual machine snapshots, uptime monitoring only, and testing was conducted on the developer workstation.

Stage 2: Global and scalable with Kubernetes

At that time, dacadoo made a strategic decision to heavily invest in Kubernetes for managing containerized workloads on a global scale. As part of this technology rollout, the health score and risk service were migrated to Kubernetes.

Due to the geographically distributed customer base and low latency requirements, three Kubernetes clusters were deployed, one on each continent. The NoSQL database was hosted in proximity to the workload to reduce service latency and keep the migration effort low.

To reduce the operational maintenance, the NoSQL database was integrated as a SaaS offering, and monitoring was centralized using Datadog.

All cloud infrastructure was provisioned exclusively with Terraform, covering the Kubernetes cluster, NoSQL database , and integration with GitLab and Datadog.

dacadoo containerized the API service and used Gitlab continuous integration and continuous deployment (CI/CD) pipelines to deploy multiple environments and clusters on a global hyperscaler.

In retrospect, this was a typical replatform modernization project from virtual machine to Kubernetes, with a high level of automation and a SaaS-first approach.

The following diagram is the architecture for the container solution with managed NoSQL database.

Containers architecture

Challenges

The service faced several challenges, including increased costs from deploying three regional Kubernetes clusters across three environments, resulting in 27 cluster nodes and additional expenses from managing NoSQL database SaaS instances for each cluster. The complexity of CI/CD pipelines for multi-environment multi-cluster deployments added to the difficulty. Significant operational effort was required to keep infrastructure and Kubernetes components up to date.

Stage 3: Operational excellence with serverless

The Kubernetes-based architecture met the requirements, but some features in the dacadoo API service backlog needed to fit better with the application architecture at the time.

This was the right moment to take a holistic view of the infrastructure and software architecture and refactor the solution according to the latest AWS technologies and best practices, the next frontier for dacadoo’s engineering team.

Solution requirements

Requirements for the solution refactoring were as follows:

  • Keep the functionality of the API unmodified
  • Constrain data processing to a region of choice for compliance with local data protection laws
  • Avoid weekly patch cycles by exclusively using managed serverless services
  • Reduce costs by choosing services with a pay-as-you-go billing model
  • Delegate authentication to a dedicated service
  • Use an established web framework with an extensive ecosystem

Refactoring the apps

The API service has two components: a developer portal and the health score and risk calculations API. The database is only required for API keys, algorithm parameters, quotas, and usage statistics. Health data is processed regionally by the compute layer but not persisted, opening the door for a distributed database: Amazon DynamoDB global tables is the perfect fit for the solution. Writes are distributed to all connected Regions, whereas reads are local, providing low latency for complying with dacadoo service level agreements (SLAs).

The developer portal is a web UI with API documentation and API key management features. AWS Lambda is a great fit because it scales automatically and has a pay-per-request billing model.

The health and risk API uses algorithms implemented in the C programming language for short bursting, compute-intense simulations. These calls are wrapped by a REST API using the Python FastAPI framework. These characteristics make AWS Lambda a great fit.

Serverless architecture

HTTP requests are routed to the Lambda functions using Amazon API Gateway with AWS WAF for protection from malicious requests and attacks. Static assets are served from an Amazon Simple Storage Service (Amazon S3) bucket through API Gateway. The additional features of Amazon CloudFront aren’t required, and Amazon S3 reduces the complexity.

Amazon Route 53 provides a powerful feature known as latency-based routing, which allows it to direct DNS queries to the endpoint that offers the lowest latency for the requester.

This feature provides Regional high availability for API users without data processing location requirements. Alternatively, the user can call specific Regional endpoints to make sure requests are processed in the desired Region.

API authorization is HTTP header-based and is performed in the application with data stored in Amazon DynamoDB.

The following diagram is the architecture for a geo-redundant fully serverless solution.

Serverless architecture

With a dacadoo SRE team proficient in Python, they opted for Pulumi for its advanced features such as programming language flow control constructs, powerful configuration capabilities, and multi-cloud support.

For continuous integration, GitLab CI compiles the algorithm library, tests the FastAPI applications and packages everything. The application deployment is just an update of the AWS Lambda, a simple and reliable workflow.

Summary

The solution evolved from a managed infrastructure setup, where the customer held most of the responsibility, to an AWS managed service architecture.

Infrastructure provisioning evolved from manual, error-prone processes to powerful code-driven workflows in Pulumi. The SRE needed to enhance their software engineering skills to adopt Pulumi, transitioning from configuration-based approaches to designing and maintaining an infrastructure code base using object-oriented Python. This was part of dacadoo’s investment in the SRE team and broader modernization efforts. The serverless architecture enabled a GitOps engineering culture focused on productivity.

The transformation maximized scalability and availability while reducing costs and operational effort:

Virtual machine

  • Scalability: Low
  • Availability: Best effort
  • Infrastructure costs: Low
  • Maintenance effort: High

Kubernetes

  • Scalability: High
  • Availability: 99.95%
  • Infrastructure costs: High
  • Maintenance effort: Medium

Serverless

  • Scalability: Very high
  • Availability: 99.999% (with failover to another AWS Region)
  • Infrastructure costs: Low
  • Maintenance effort: Very low

The global redundancy elevates availability to an impressive 99.999% while keeping the costs low.

Conclusion

Migrating from a virtual machine to Kubernetes and ultimately to AWS Lambda demonstrates the progression of cloud engineering toward enhanced efficiency and scalability.

Each step in this journey reduced the complexity of managing resources while increasing flexibility and automation. Transitioning dacadoo’s API service to a fully serverless, geo-redundant architecture not only advanced the platform but also upskilled engineers, maintained a lean SRE team, and kept infrastructure costs low. Get started with your own AWS serverless solution.


About the Authors

Top Architecture Blog Posts of 2024

Post Syndicated from Andrea Courtright original https://aws.amazon.com/blogs/architecture/top-architecture-blog-posts-of-2024/

Well, it’s been another historic year! We’ve watched in awe as the use of real-world generative AI has changed the tech landscape, and while we at the Architecture Blog happily participated, we also made every effort to stay true to our channel’s original scope, and your readership this last year has proven that decision was the right one.

AI/ML carries itself in the top posts this year, but we’re also happy to see that foundational topics like resiliency and cost optimization are still of great interest to our audience.

(By the way, if you were hoping for more AI/ML content, head on over to our sister channel, the AWS Machine Learning Blog!).

Without further ado, here are our top posts from 2024!

#10 Deploy Stable Diffusion ComfyUI on AWS elastically and efficiently

This post helps you get started using ComfyUI, and was so successful that we followed it up later in the year with How to build custom nodes workflow with ComfyUI on EKS!

Architecture for deploying stable diffusion on ComfyUI

Figure 1. Architecture for deploying stable diffusion on ComfyUI

#9 Let’s Architect! Designing Well-Architected systems

In keeping with Let’s Architect! series, we have our first of three favorites for the year. This set of resources helps you apply Well-Architected standards in practice.

Let's Architect

Figure 2. Let’s Architect

#8 Let’s Architect! Learn About Machine Learning on AWS

As I said, Let’s Architect! has a winning series, and they’ve got a finger on the pulse of the tech world. This post about machine learning showcases some of the most exciting things happening at AWS.

Let's Architect

Figure 3. Let’s Architect

If you’re more interested in generative AI, you can also take a look at another post from 2024: Let’s Architect! GenAI

#7 Creating an organizational multi-Region failover strategy

Preparedness is another common theme in this year’s favorites. Michael, John, and Saurabh are well-versed in multi-Region architecture, and they’re here to share some strategies to contain failure impact.

When the application experiences an impairment using S3 resources in the primary Region, it fails over to use an S3 bucket in the secondary Region.

Figure 4. When the application experiences an impairment using S3 resources in the primary Region, it fails over to use an S3 bucket in the secondary Region.

#6 Building a three-tier architecture on a budget

Let’s talk cost optimization. This post about a three-tier architecture that relies on the AWS Free Tier is a must-read for anyone looking for tips to help them avoid unnecessary costs (and that’s everyone).

Example of a three-tier architecture on AWS

Figure 5. Example of a three-tier architecture on AWS

#5 Announcing updates to the AWS Well-Architected Framework guidance

As usual, Haleh & team are pros at making sure the Well-Architected Framework is current and relevant. Take a look at the enhanced and expanded guidance in all six pillars.

Well-Architected logo

Figure 6. Well-Architected logo

#4 Let’s Architect! Serverless developer experience in AWS

One more winning post from Luca, Federica, Vittorio, and Zamira! This collection of developer resources includes new ideas in AWS Lambda, Amazon Q Developer, and Amazon DynamoDB.

Let's Architect

Figure 7. Let’s Architect

#3 London Stock Exchange Group uses chaos engineering on AWS to improve resilience

This post from April 1 was not an April Fool’s joke! See how LSEG designed failure scenarios to test their resilience and observability.

Chaos engineering pattern for hybrid architecture (3-tier application)

Figure 8. Chaos engineering pattern for hybrid architecture (3-tier application)

#2 Achieving Frugal Architecture using the AWS Well-Architected Framework Guidance

Frugality AND Well-Architected? What a winning combo! This post, inspired by the 2023 re:Invent keynote, outlines the seven laws of Frugal Architecture.

Well-Architected logo

Figure 9. Well-Architected logo

#1 How an insurance company implements disaster recovery of 3-tier applications

And finally, our number one post of the year! Amit and Luiz showcase a customer solution with real-world applications that builds on the guidelines of other posts in this list! Well done!

The Pilot Light scenario for a 3-tier application that has application servers and a database deployed in two Regions

Figure 10. The Pilot Light scenario for a 3-tier application that has application servers and a database deployed in two Regions

Thank you!

As always, thanks to our contributors for their dedication and desire to share, and to you, our readers! We would be nothing with you. Literally.

For other top post lists, see our Top 10 and Top 5 posts from previous years.

How an insurance company implements disaster recovery of 3-tier applications

Post Syndicated from Amit Narang original https://aws.amazon.com/blogs/architecture/how-an-insurance-company-implements-disaster-recovery-of-3-tier-applications/

A good strategy for resilience will include operating with high availability and planning for business continuity. It also accounts for the incidence of natural disasters, such as earthquakes or floods and technical failures, such as power failure or network connectivity. AWS recommends a multi-AZ strategy for high availability and a multi-Region strategy for disaster recovery. In this post, we explore how one of our customers, a US-based insurance company, uses cloud-native services to implement the disaster recovery of 3-tier applications.

At this insurance company, a relevant number of critical applications are 3-tier Java or .Net applications. These applications require access to IBM DB2, Oracle, or Microsoft SQLServer databases that run on Amazon EC2 instances. The requirement was to create a disaster recovery strategy that implements a Pilot Light or Warm/Standby scenario. This design needs to keep costs at a minimum, and it needs to allow for failure detection and manual failover of resources. Furthermore, it needs to keep the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO) under 15 minutes. Finally, the solution could not use any public resources.

The solution

Amazon Route53 Application Recovery Controller (Route53 ARC) helps manage and orchestrate application failover and recovery across multiple AWS Regions or on-premises environments. It is specifically focused on managing DNS routing and traffic management during failover and recovery operation; however, some customers decide to implement their own strategies for application recovery. In this blog, we are going to focus on how one of our financial services customer implements it.

The Well-Architected framework explains that a good disaster recovery plan needs to manage configuration drift. A good practice is to use the delivery pipeline to deploy to both Regions and to regularly test the recovery pattern. There are customers that go a step further and even choose to operate in the secondary Region for a period of time.

The solution chosen by one of our leading insurance customers encompasses two distinct scenarios: a failover and a failback scenario. The failover scenario covers a list of steps to failover applications from the primary Region to the secondary Region. The failback process is the return of the operations to the primary Region.

Failover

Our customer decided to test the Pilot Light scenario. This scenario considers an application and a database deployed both in the primary and secondary Regions. As a requirement to achieve the 15-minute RPO, an application deployed in the primary Region needs to replicate data to the secondary Region. This async replication is implemented for each of the company’s database engines (DB2, SQLServer, Oracle) using native tooling. Leveraging native tooling was an existing practice and going with it would help minimize any operational impact.

It is important to notice that the detection and failover mechanisms is created in the secondary Region. This ensures these components will remain available when the primary Region becomes unavailable. Another important aspect is to establish connectivity between the two networks. This is needed to allow for the database replication.

The Pilot Light scenario for a 3-tier application that has application servers and a database deployed in two Regions

Figure 1. The Pilot Light scenario for a 3-tier application that has application servers and a database deployed in two Regions

The failover procedure uses the following steps for detection and failover:

  1. An Amazon EventBridge scheduler runs the AWS Lambda function every 60 seconds.
  2. The Lambda function tests the application endpoint and adds a custom metric to Amazon CloudWatch. If the application is unavailable, a CloudWatch Alarm will start the Lambda Function that initiates the failover.
  3. A Lambda function initiates the failover by starting a Jenkins pipeline. The pipeline will failover the application and the database to the secondary Region. The Jenkins pipeline starts with a manual approval step, ensuring that the failover process does not start automatically.
  4. Once approvers validate the necessity of the failover, they approve the workflow, and the pipeline moves to the next stage.
  5. The pipeline failovers the database, promoting the database in the secondary Region to the primary state and enables write operations.
  6. Next, start or scale out application servers that run on EC2 instances or containers. This is important to assure they will support the increased load in the secondary Region once failover is complete.
  7. At this point, database and application servers are ready to receive load. Next, the Application Load Balancer (ALB) needs to failover to the secondary Region. Route53 failover routing policy automatically failovers between Regions, but this customer wanted to manually control this step using a health check. To implement a manual failover of the ALB, the pipeline creates a file in a designated S3 bucket. A Lambda function regularly checks if this file exists in the expected location. If so, it triggers a CloudWatch Alarm and the Route53 health check will fail. At this point, Route 53 will redirect traffic to the ALB in the secondary Region, becoming the new active endpoint.

Failback

The failback scenario starts when all the required services become online in the primary Region. AWS recommends using AWS Personal Health Dashboard to check for service health. Figure 2 illustrates the failback process in detail. It shows the step-by-step flow from initiating the failback procedure to the final DNS switchover, highlighting the key components and interactions involved in each stage. This visual representation helps to clarify the complex process of returning operations to the primary Region.

Diagram of the failback process

Figure 2. Diagram of the failback process

The failback procedure is implemented in six steps:

  1. A cloud operator or Site Reliability Engineer (SRE) initiates the failback procedure by submitting a form on an HTML page. A Lambda function starts a Jenkins pipeline.
  2. The pipeline initiates the delta sync replication of the database. This ensures that data changes made in the secondary Region are replicated to the primary Region.
  3. The next stage is a manual approval to recover back to the primary Region, where the SRE verifies that the databases are in sync and all services needed are online in the primary Region.
  4. Upon approval, the pipeline starts the application servers in the primary Region.
  5. Next, the database in the primary Region is promoted for write operations. The database endpoint in the secondary Region is updated to point to the primary Region’s database.
  6. As explained in the failover section, the DNS switchover depends on a file existing in S3. Since this file was created for our failover event, the pipeline will now remove this file. The Lambda function identifies the change and updates the state of the CloudWatch Alarm, then the Route53 Healthcheck will change the state. At this point, the ALB in the primary Region becomes active and failback completes successfully.

Benefits

This customer identified the following benefits in implementing this design:

  • Customizable solution that aligns with the company’s internal processes, operating model, and technologies in use
  • Standardized pattern applicable across the organization for applications with different technologies, including databases, Windows and Linux applications running on EC2
  • Recovery Point Objective (RPO) and Recovery Time Objective (RTO) of less than 15 minutes
  • A cost optimized solution that uses cloud native services to implement the detection and failover scenarios

Conclusion

The solution for the disaster recovery of 3-tier applications demonstrates this financial services customer’s commitment to ensuring business continuity and resilience. This design showcases the company’s ability to tailor their architecture to their specific requirements. Achieving an RPO and RTO of less than 15 minutes for critical applications is a remarkable feat. It ensures minimal disruption to business operations during regional outages.

Furthermore, this solution leverages existing technologies and processes within the company, allowing for seamless integration and adoption across the organization. The ability to standardize this pattern for applications with different technologies helps simplifying the operating model.

If you’re an enterprise seeking to enhance the resilience of your critical applications, this disaster recovery solution from one of our enterprise customers serves as an inspiring example. To further explore the disaster recovery strategies and best practices on AWS, we recommend the following resources:

Implementing multi-Region failover for Amazon API Gateway

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/implementing-multi-region-failover-for-amazon-api-gateway/

This post is written by Marcos Ortiz, Principal AWS Solutions Architect and Khubyar Behramsha, Sr. AWS Solutions Architect.

In this post, you learn how organizations can evolve from a single-Region architecture API Gateway to a multi-Region one, using a reliable failover mechanism without dependencies on AWS control plane operations. An AWS Well-Architected best practice is to rely on the data plane and not the control plane during recovery. Failover controls should work with no dependencies on the primary Region. This pattern shows how to independently failover discrete services deployed behind a shared public API. Additionally, there is a walkthrough on how to deploy and test the proposed architecture, using our open-source code available on GitHub.

For many organizations, running services behind a Regional Amazon API Gateway endpoint aligned to AWS Well-Architected best practices, offers the right balance of resilience, simplicity, and affordability. However, depending on business criticality, regulatory requirements, or disaster recovery objectives, some organizations must deploy their APIs using a multi-Region architecture.

When dealing with business-critical applications, organizations often want full control over how and when to trigger a failover. A manually triggered failover allows for dependencies to be failed over in a specific order. Failover actions follow the chain of approvals needed, which helps prevent failing over to an unprepared replica or other flapping issues caused by intermittent disruptions. While the failover action or trigger has a human-in-the-loop component, the recommendation is for all subsequent actions to be automated as much as possible. This approach gives application owners control over the failover process, including the ability to trigger the failover in cases of intermittent issues.

Overview

One common approach for customers is to deploy a public Regional API with a custom domain name, providing more intuitive URLs for their users. The backend uses API mappings to connect multiple API stages to a custom domain. This approach allows service owners to deploy their services independently while sharing the same top-level API domain name. Here is a typical architecture that follows this pattern:

Regional endpoint with mapping

Regional endpoint with mapping

However, when trying to evolve this to a multi-Region architecture, organizations often struggle to fail over each service independently. If the preceding architecture is deployed in two Regions as-is, it becomes an all-or-nothing scenario, where organizations must either fail over all the services behind API Gateway or none.

Evolving to a multi-Region architecture

To enable each team to manage and failover their services independently, you can implement this new approach for a multi-Region architecture. Each service has its own subdomain, using API Gateway HTTP integrations to route the request to a given service. This allows the service APIs the flexibility to be independently failed over, or all at once, with the shared public API.

Multi-Region architecture

Multi-Region architecture

This is the request flow:

  1. Users access a specific service through the public shared API domain name using a URL suffix. For instance, to access service1, the end user would send a request to http://example.com/service1.
  2. Amazon Route 53 has the top-level domain, example.com, registered with a primary and a secondary failover record. It routes the request to the API Gateway external API endpoint in the primary Region (us-east-1).
  3. API Gateway uses an HTTP integration to forward the request to service1 at https://service1.example.com.
  4. Amazon Route 53, has the domain service1.example.com registered with a primary and a secondary failover record. It routes the request to the API Gateway service1 API Regional endpoint in the primary Region (us-east-1) when healthy and routes to the service1 API Regional endpoint in the secondary Region (us-west-2) when unhealthy.
  5. Represents the primary route for service1 configured in Amazon Route 53.
  6. Represents the secondary route for service1 configured in Amazon Route 53.

This solution requires deploying each service API in both the primary (us-east-1) and secondary (us-west-2) Regions. Both Regions use the same custom domain configuration. For the primary Region, primary DNS records for each service point to the Regional API Gateway distribution endpoint. In the secondary Region, secondary DNS records for each service point to the Regional API Gateway distribution endpoint in the secondary Region.

Route 53 records

Route 53 records

Active-passive manual failover

The example provided here enables a reliable failover mechanism that does not rely on the Amazon Route 53 control plane. It uses Amazon Route 53 Application Recovery Controller (Route 53 ARC), which provides a cluster with five Regional endpoints across five different AWS Regions. The failover process uses these endpoints, instead of manually editing Amazon Route 53 DNS records, which is a control plane operation. The routing controls in Route 53 ARC failover traffic from the primary Region to the secondary one.

Route 53 ARC routing controls

Route 53 ARC routing controls

Routing controls are on-off switches that enable you to redirect client traffic from one instance of your workload to another. Traffic re-routing is the result of setting associated DNS health checks as healthy or unhealthy.

Route 53 ARC toggles

Route 53 ARC toggles

Deploying the sample application

Pre-requisites

  1. A public domain (example.com) registered with Amazon Route 53. Follow the instructions here on how to register a domain and the instructions here to configure Amazon Route 53 as your DNS service.
  2. An AWS Certificate Manager certificate (*.example.com) for your domain name on both the primary and secondary Regions you plan to deploy the sample APIs.

Deploy the Amazon Route 53 ARC stack

Deploy the Amazon Route 53 ARC stack first, which creates a cluster and the routing controls that enable you to fail over the APIs.

Follow the detailed instructions here to deploy the Amazon Route 53 Application Recovery Controller (ARC) stack.

Deploy the Service1 API both in the primary and secondary Regions

This deploys an API Gateway Regional endpoint in each Region, which calls an AWS Lambda function to return the service name and the current AWS Region serving the request:

{"service": "service1", "region": "us-east-1"}

This is the code for the Lambda function:

import json
import os

def lambda_handler(event, context):
    return {
"statusCode": 200,
"body": json.dumps({
  "service": "service1",
  "region": os.environ['AWS_REGION']}),
}

Follow the detailed instructions here to deploy the service1 stack.

Deploy the Service2 API both in the primary and secondary Regions

This stack is similar to service1, but has a different domain name and returns service2 as the service name:

{"service": "service2", "region": "us-east-1"}

Follow the detailed instructions here to deploy the service2 stack.

Deploy the shared public API both in the primary and secondary Regions

This step configures HTTP endpoints so that when you call example.com/service1 or example.com/service2, it routes the request to the respective public DNS records you have set up for service1 and service2.

Follow the detailed instructions here to deploy the external API stack.

Failover tests

To test the deployed example, modify then run the provided test script:

  1. Update lines 3–5 in the test.sh file to reference the domain name you configured for your APIs.
  2. Provide execute permissions and run the script:
chmod +x ./test/sh
./test.sh

This script sends an HTTP request to each one of your three endpoints every 5 seconds. You can then use Amazon Route 53 ARC to fail over your services independently and see the responses served from different Regions.

Initially, all services are routing traffic to the us-east-1 Region:

Initial routing

Initial routing

With the following command, you update two routing controls for service1, setting the primary Region (us-east-1) health check state to off, and the secondary Region (us-west-2) health check state to on:

aws route53-recovery-cluster update-routing-control-states \
 --update-routing-control-state-entries \
 '[{"RoutingControlArn":"arn:aws:route53-recovery-control::111122223333:controlpanel/0123456bbbbbbb0123456bbbbbb0123456/routingcontrol/abcdefg1234567","RoutingControlState":"On"},
{"RoutingControlArn":"arn:aws:route53-recovery-control:: 111122223333:controlpanel/0123456bbbbbbb0123456bbbbbb0123456/routingcontrol/hijklmnop987654321","RoutingControlState":"Off"}]' \
 --region ap-southeast-2 \
 --endpoint-url https://abcd1234.route53-recovery-cluster.ap-southeast-2.amazonaws.com/v1

After a few seconds, the script terminal shows that service1 is now routing traffic to us-west-2, while the other services are still routing traffic to the us-east-1 Region.

Flipping service1 to backup Region

Flipping service1 to backup Region

To fail back service1 to the us-east-1 Region, run this command, now setting the service1 primary Region (us-east-1) health check state to on, and the secondary Region (us-west-2) health check state to off:

aws route53-recovery-cluster update-routing-control-states \
 --update-routing-control-state-entries \
 '[{"RoutingControlArn":"arn:aws:route53-recovery-control::111122223333:controlpanel/0123456bbbbbbb0123456bbbbbb0123456/routingcontrol/abcdefg1234567","RoutingControlState":"Off"},
{"RoutingControlArn":"arn:aws:route53-recovery-control:: 111122223333:controlpanel/0123456bbbbbbb0123456bbbbbb0123456/routingcontrol/hijklmnop987654321","RoutingControlState":"On"}]' \
 --region ap-southeast-2 \
 --endpoint-url https:// abcd1234.route53-recovery-cluster.ap-southeast-2.amazonaws.com/v1

After a few seconds, the script terminal shows that service1 is now routing traffic to the us-east-1 Region again, like the other services.

Routing recovery

Routing recovery

Cleaning up

After you are finished, follow the cleanup instructions on GitHub.

Conclusion

This solution helps put the control back in the hands of the teams managing critical workloads using API Gateway. By decoupling the frontend and backend, this solution gives organizations granular control over failover at the service level using Amazon Route 53 ARC to remove dependencies on control plane actions.

The pattern outlined also reduces the impact to consumers of the service as it allows you to use the same public API and top-level domain when moving from a single-Region to a multi-Region architecture.

For more resilience learning, visit AWS Architecture Blog – Resilience.

For more serverless learning, visit Serverless Land.

Fault-isolated, zonal deployments with AWS CodeDeploy

Post Syndicated from Michael Haken original https://aws.amazon.com/blogs/devops/fault-isolated-zonal-deployments-with-aws-codedeploy/

In this blog post you’ll learn how to use a new feature in AWS CodeDeploy to deploy your application one Availability Zone (AZ) at a time to help increase the operational resilience or your services through improved fault isolation.

Introducing change to a system can be a time of risk. Even the most advanced CI/CD systems with comprehensive testing and phased deployments can still promote a bad change into production. A common approach to reduce this risk is using fractional deployments and closely monitoring critical metrics like availability and latency to gauge a deployment’s success. If the deployment shows signs of failure, the CI/CD system initiates an

This blog post will show you how to leverage CodeDeploy zonal deployments as part of a holistic AZ independent (AZI) architecture strategy, both patterns that many AWS service teams follow. With this feature, you no longer need to distinguish between infrastructure or deployment failures in order to respond to the event. You can use the same observability tools and recovery techniques for both, which allows you to contain the scope of impact to a single AZ and mitigate the impact more quickly and with less complexity. First, let’s define what an AZI architecture is so we can understand how this feature in CodeDeploy supports it.

Availability Zone independence

Fault isolation is an architectural pattern that limits the scope of impact of failures by creating independent fault containers that don’t share fate. It also allows you to quickly recover from failures by shifting traffic or resources away from the impaired fault container. AWS provides a number of different fault isolation boundaries, but the ones most people are familiar with are AZs and Regions. When you build multi-AZ architectures, you can design your application to implement AZI that uses the fault boundary provided by AZs to keep the interaction of resources isolated to their respective AZ (to the greatest extent possible).

A three tier Availability Zone indepdendent architecture deployed across three AZa

An Availability Zone independent (AZI) architecture implemented by disabling cross-zone load balancing and using an independent database read replica per AZ. Only database writes have to cross an AZ boundary.

The result is that the impacts from an impairment in one AZ don’t cascade to resources in other AZs, making the operation of your application in one AZ independent from events in the others. You should also monitor the health of each AZ independently, for example by looking at per-AZ load balancer HTTPCode_Target_5XX_Count metrics, or by sending synthetic requests to the load balancer nodes in each AZ and recording availability and latency metrics for each. When an event occurs that impacts your availability or latency in a single AZ, you can use capabilities like Amazon Route 53 Application Recovery Controller zonal shift to shift traffic away from that AZ to quickly mitigate impact, often in single-digit minutes.

Using zonal shift to shift traffic away from an single AZ

Using zonal shift to move traffic away from the AZ experiencing a service impairment

Traditional deployment strategy challenges

During an event, SRE, engineering, or operations teams can spend a lot of time trying to figure out if the source of impact is an infrastructure problem or related to a failed deployment. Then, based on the identified cause, they may take different mitigation actions. Thus, precious time is spent investigating the source of impact and deciding on the appropriate mitigation plan.

When the cause is due to a failed deployment, traditionally rollbacks are used to mitigate the problem. But rollbacks, even when automated, take time to complete. For example, let’s say your deployment batches take 5 minutes to complete, you deploy in 10% batches, and you’re halfway through a deployment to 100 instances when the rollback is initiated. This means it’s going to take at least 25 minutes to finish the rollback (5 batches, each taking 5 minutes to re-deploy to). And it’s entirely possible during that time that instances where the new software was deployed continue to pass health checks, but result in errors being returned to your customers. In the worst case, if all instances had been deployed to, this event could last for almost an hour with customers being impacted during the entire rollback process. In some cases, deployments can’t be rolled back and have to be rolled forward, meaning a new, updated version needs to be deployed to fix the previous deployment. Writing the code for the new deployment and testing it adds to the recovery time of your system and can be error prone.

Additionally, if your unit of deployment includes multiple AZs, then your potential scope of impact from a failed deployment isn’t predictably bounded. For example, if your CodeDeploy deployment groups target Amazon Elastic Compute Cloud (Amazon EC2) instances based on tags or an Amazon EC2 Auto Scaling group that spans multiple AZs, then you could see impact across the whole Region, even if you’re using fractional deployments. There’s not a smaller fault container that helps consistently limit the scope of impact to a predictable size.

Let’s look at how we can overcome these two challenges by using zonal deployments with CodeDeploy.

Zonal deployments with AWS CodeDeploy

One of the best practices we follow at AWS, described in My CI/CD pipeline is my release captain, is performing fractional deployments aligned to intended fault isolation boundaries, like individual hosts, cells, AZs, and Regions. When we release a change, the deployment is separated into waves, which represent fault containers (like Regions) that are deployed to in parallel, and those are further separated into stages. Within a single Region, the deployment starts with a one-box environment, representing a single host, then moves on to fractional batches (like 10% at a time) inside a single AZ, waits for a period of bake time, moves on to the next AZ, and so on until we’ve completed rolling out the change.

Four stages in a deployment pipeline showing per AZ deployments with bake time.

Deployment stages aligned to intended fault isolation boundaries within a single deployment wave for one Region

By aligning each stage to an expected fault isolation boundary, we create well-defined fault containers that provide an understood and bounded scope of impact in the case that something goes wrong with a deployment. You can take advantage of this same deployment strategy in your own applications by using zonal deployments in CodeDeploy. To utilize this capability, you need to define a custom deployment configuration shown below.

The configuration options for a CodeDeploy deployment configuration using a zonal configuration

Creating a zonal deployment configuration that deploys to 10% of the EC2 instances in each AZ at a time, one AZ at a time

This configuration defines a few important properties. First, it enables the zonal configuration, which ensures deployments will be phased one AZ at a time. In this case, updates will be deployed to batches of 10% of the instances in each AZ (see the minimum number of healthy instances per Availability Zone for more details on configuring this setting). Second, it defines a monitor duration, which is the bake time where the effects of the changes are observed before moving on to the next AZ. This ensures sufficient use of the new software to discover any potential bugs or problems before moving on. The value in this example is defined as 900 seconds, or 15 minutes. You should ensure this value is longer than the time it takes for your alarms to trigger. For example, if you are using an M of N alarm for availability and/or latency, that is using 3 data points out of 5 with 1-minute intervals, you need to make sure your bake time is set to greater than 600 seconds, otherwise, you might move on to the next AZ before your alarm has a chance to mark the deployment as unsuccessful. Finally, I’ve also defined a first zone monitor duration. This overrides the “monitor duration” for the first AZ being deployed to. This is useful since the first AZ is acting as our canary or one-box environment and we may want to wait additional time to be really confident the deployment is successful before moving on to the second AZ.

If your service is deployed behind a load balancer with cross-zone load balancing disabled (which is important to achieve AZI), carefully consider your batch size. The load balancer evenly splits traffic across AZs regardless of how many healthy hosts are in each AZ. Ensure your batch size is small enough that the temporary reduction in capacity during each batch doesn’t overwhelm the remaining instances in the same AZ. You can use the CodeDeploy minimum healthy hosts per AZ option to ensure there are enough healthy hosts in the AZ during a deployment batch or Elastic Load Balancing (ELB) target group minimum healthy target count with DNS failover to shift traffic away from the AZ if too few targets are present.

Recovering from a failed zonal deployment.

When a failure occurs, the highest priority is mitigating the impact, not fixing the root cause. While an automated rollback can help achieve both for a failed deployment, using a zonal shift can improve your recovery time. Let’s take a simple dashboard like the following figure. The top graph shows your availability as perceived by customers through using the regional endpoint of your load balancer like https://load-balancer-name-and-id.elb.us-east-1.amazonaws.com. The graphs below it show the measured availability from Amazon CloudWatch Synthetics canaries that test the load balancer endpoints in each AZ using endpoints like https://us-east-1a.load-balancer-name-and-id.elb.us-east-1.amazonaws.com.

Dashboards showing a drop in availability in one AZ that also impacts the regional customer experience

Dashboard showing impact in one AZ that affects the availability of the service

We can see that something starts impacting resources in AZ1 at 10:38 causing an availability drop. As we would expect, this impact is also seen by customers, shown in the top graph, but it’s unclear what the underlying cause of the availability drop is. Using the approach described in this post, it turns out that it doesn’t matter. Within a few minutes, at 10:41 the CloudWatch composite alarm monitoring the health of AZ1 transitions to the ALARM state and invokes a Lambda function that reads the alarm’s definition to get the AZ ID and ALB ARN involved, and initiates the zonal shift. It’s important that the alarm logic only reacts when a single AZ is impacted, if there was impact in more than one AZ, we would need to treat this as a Regional issue.

The process of identifying a failed deployment in a single AZ and responding with a zonal shift.

After a failed deployment to AZ1, an automatically initiated zonal shift quickly mitigates the customer impact

Then, after a few more minutes, at 10:44, we can see availability from the customer perspective has gone back up to 100% by shifting traffic away from AZ1.

Dashboards showing the regional customer experience has recovered while the AZ is still impacted by the failed deployment

The impact of the failed deployment is mitigated by shifting traffic away from AZ1

It turns out the cause of impact in this case was a failed deployment, and we can see that our synthetic canaries still see the failure while the deployment is rolling back, but we’ve achieved our primary goal of quickly removing the impact to the customer experience. From the start of impact to mitigation, 6 minutes elapsed, which was significantly faster than waiting for the deployment to completely rollback. After the rollback is complete, at 10:58, 20 minutes after the start of the event, we can see the alarm transition back to the OK state and availability return to normal in AZ1, meaning we can end the zonal shift and return to normal operation.

Dashboards showing that after the rollback is complete, the impact to the single AZ has also dissipated

After the deployment rollback is complete, the availability in AZ1 recovers and the zonal shift can be ended

Conclusion

Performing zonal deployments helps improve the effectiveness of AZI architectures. Aligning your deployments to your intended fault isolation boundaries, in this case AZs, creates a predictable scope of impact and helps prevents cascading failures. This in turn allows you to use a common set of observability and mitigation tools for both single-AZ infrastructure events and failed deployments, which can mitigate the impact faster than automated rollbacks. Additionally, by removing the ambiguity on selecting a recovery strategy for operators, it further reduces recovery time and complexity. Learn more about zonal deployments in AWS CodeDeploy here.

Michael Haken

Michael Haken

Michael is a Senior Principal Solutions Architect on the AWS Strategic Accounts team where he helps customers innovate, differentiate their business, and transform their customer experiences. He has 15 years’ experience supporting financial services, public sector, and digital native customers. Michael has his B.A. from UVA and M.S. in Computer Science from Johns Hopkins. Outside of work you’ll find him playing with his family and dogs on his farm.

Stop the CNAME chain struggle: Simplified management with Route 53 Resolver DNS Firewall

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/stop-the-cname-chain-struggle-simplified-management-with-route-53-resolver-dns-firewall/

Starting today, you can configure your DNS Firewall to automatically trust all domains in a resolution chain (such as aCNAME, DNAME, or Alias chain).

Let’s walk through this in nontechnical terms for those unfamiliar with DNS.

Why use DNS Firewall?
DNS Firewall provides protection for outbound DNS requests from your private network in the cloud (Amazon Virtual Private Cloud (Amazon VPC)). These requests route through Amazon Route 53 Resolver for domain name resolution. Firewall administrators can configure rules to filter and regulate the outbound DNS traffic.

DNS Firewall helps to protect against multiple security risks.

Let’s imagine a malicious actor managed to install and run some code on your Amazon Elastic Compute Cloud (Amazon EC2) instances or containers running inside one of your virtual private clouds (VPCs). The malicious code is likely to initiate outgoing network connections. It might do so to connect to a command server and receive commands to execute on your machine. Or it might initiate connections to a third-party service in a coordinated distributed denial of service (DDoS) attack. It might also try to exfiltrate data it managed to collect on your network.

Fortunately, your network and security groups are correctly configured. They block all outgoing traffic except the one to well-known API endpoints used by your app. So far so good—the malicious code cannot dial back home using regular TCP or UDP connections.

But what about DNS traffic? The malicious code may send DNS requests to an authoritative DNS server they control to either send control commands or encoded data, and it can receive data back in the response. I’ve illustrated the process in the following diagram.

DNS exfiltration illustrated

To prevent these scenarios, you can use a DNS Firewall to monitor and control the domains that your applications can query. You can deny access to the domains that you know to be bad and allow all other queries to pass through. Alternately, you can deny access to all domains except those you explicitly trust.

What is the challenge with CNAME, DNAME, and Alias records?
Imagine you configured your DNS Firewall to allow DNS queries only to specific well-known domains and blocked all others. Your application communicates with alexa.amazon.com; therefore, you created a rule allowing DNS traffic to resolve that hostname.

However, the DNS system has multiple types of records. The ones of interest in this article are

  • A records that map a DNS name to an IP address,
  • CNAME records that are synonyms for other DNS names,
  • DNAME records that provide redirection from a part of the DNS name tree to another part of the DNS name tree, and
  • Alias records that provide a Route 53 specific extension to DNS functionality. Alias records let you route traffic to selected AWS resources, such as Amazon CloudFront distributions and Amazon S3 buckets

When querying alexa.amazon.com, I see it’s actually a CNAME record that points to pitangui.amazon.com, which is another CNAME record that points to tp.5fd53c725-frontier.amazon.com, which, in turn, is a CNAME to d1wg1w6p5q8555.cloudfront.net. Only the last name (d1wg1w6p5q8555.cloudfront.net) has an A record associated with an IP address 3.162.42.28. The IP address is likely to be different for you. It points to the closest Amazon CloudFront edge location, likely the one from Paris (CDG52) for me.

A similar redirection mechanism happens when resolving DNAME or Alias records.

DNS resolution for alexa.amazon.com

To allow the complete resolution of such a CNAME chain, you could be tempted to configure your DNS Firewall rule to allow all names under amazon.com (*.amazon.com), but that would fail to resolve the last CNAME that goes to cloudfront.net.

Worst, the DNS CNAME chain is controlled by the service your application connects to. The chain might change at any time, forcing you to manually maintain the list of rules and authorized domains inside your DNS Firewall rules.

Introducing DNS Firewall redirection chain authorization
Based on this explanation, you’re now equipped to understand the new capability we launch today. We added a parameter to the UpdateFirewallRule API (also available on the AWS Command Line Interface (AWS CLI) and AWS Management Console) to configure the DNS Firewall so that it follows and automatically trusts all the domains in a CNAME, DNAME, or Alias chain.

This parameter allows firewall administrators to only allow the domain your applications query. The firewall will automatically trust all intermediate domains in the chain until it reaches the A record with the IP address.

Let’s see it in action
I start with a DNS Firewall already configured with a domain list, a rule group, and a rule that ALLOW queries for the domain alexa.amazon.com. The rule group is attached to a VPC where I have an EC2 instance started.

When I connect to that EC2 instance and issue a DNS query to resolve alexa.amazon.com, it only returns the first name in the domain chain (pitangui.amazon.com) and stops there. This is expected because pitangui.amazon.com is not authorized to be resolved.

DNS query for alexa.amazon.com is blocked at first CNAME

To solve this, I update the firewall rule to trust the entire redirection chain. I use the AWS CLI to call the update-firewall-rule API with a new parameter firewall-domain-redirection-action set to TRUST_REDIRECTION_DOMAIN.

AWS CLI to update the DNS firewall rule

The following diagram illustrates the setup at this stage.

DNS Firewall rule diagram

Back to the EC2 instance, I try the DNS query again. This time, it works. It resolves the entire redirection chain, down to the IP address 🎉.

DNS resolution for the full CNAME chain

Thanks to the trusted chain redirection, network administrators now have an easy way to implement a strategy to block all domains and authorize only known domains in their DNS Firewall without having to care about CNAME, DNAME, or Alias chains.

This capability is available at no additional cost in all AWS Regions. Try it out today!

— seb

Unify DNS management using Amazon Route 53 Profiles with multiple VPCs and AWS accounts

Post Syndicated from Esra Kayabali original https://aws.amazon.com/blogs/aws/unify-dns-management-using-amazon-route-53-profiles-with-multiple-vpcs-and-aws-accounts/

If you are managing lots of accounts and Amazon Virtual Private Cloud (Amazon VPC) resources, sharing and then associating many DNS resources to each VPC can present a significant burden. You often hit limits around sharing and association, and you may have gone as far as building your own orchestration layers to propagate DNS configuration across your accounts and VPCs.

Today, I’m happy to announce Amazon Route 53 Profiles, which provide the ability to unify management of DNS across all of your organization’s accounts and VPCs. Route 53 Profiles let you define a standard DNS configuration, including Route 53 private hosted zone (PHZ) associations, Resolver forwarding rules, and Route 53 Resolver DNS Firewall rule groups, and apply that configuration to multiple VPCs in the same AWS Region. With Profiles, you have an easy way to ensure all of your VPCs have the same DNS configuration without the complexity of handling separate Route 53 resources. Managing DNS across many VPCs is now as simple as managing those same settings for a single VPC.

Profiles are natively integrated with AWS Resource Access Manager (RAM) allowing you to share your Profiles across accounts or with your AWS Organizations account. Profiles integrates seamlessly with Route 53 private hosted zones by allowing you to create and add existing private hosted zones to your Profile so that your organizations have access to these same settings when the Profile is shared across accounts. AWS CloudFormation allows you to use Profiles to set DNS settings consistently for VPCs as accounts are newly provisioned. With today’s release, you can better govern DNS settings for your multi-account environments.

How Route 53 Profiles works
To start using the Route 53 Profiles, I go to the AWS Management Console for Route 53, where I can create Profiles, add resources to them, and associate them to their VPCs. Then, I share the Profile I created across another account using AWS RAM.

In the navigation pane in the Route 53 console, I choose Profiles and then I choose Create profile to set up my Profile.

I give my Profile configuration a friendly name such as MyFirstRoute53Profile and optionally add tags.

I can configure settings for DNS Firewall rule groups, private hosted zones and Resolver rules or add existing ones within my account all within the Profile console page.

I choose VPCs to associate my VPCs to the Profile. I can add tags as well as do configurations for recursive DNSSEC validation, the failure mode for the DNS Firewalls associated to my VPCs. I can also control the order of DNS evaluation: First VPC DNS then Profile DNS, or first Profile DNS then VPC DNS.

I can associate one Profile per VPC and can associate up to 5,000 VPCs to a single Profile.

Profiles gives me the ability to manage settings for VPCs across accounts in my organization. I am able to disable reverse DNS rules for each of the VPCs the Profile is associated with rather than configuring these on a per-VPC basis. The Route 53 Resolver automatically creates rules for reverse DNS lookups for me so that different services can easily resolve hostnames from IP addresses. If I use DNS Firewall, I am able to select the failure mode for my firewall via settings, to fail open or fail closed. I am also able to specify if I wish for the VPCs associated to the Profile to have recursive DNSSEC validation enabled without having to use DNSSEC signing in Route 53 (or any other provider).

Let’s say I associate a Profile to a VPC. What happens when a query exactly matches both a resolver rule or PHZ associated directly to the VPC and a resolver rule or PHZ associated to the VPC’s Profile? Which DNS settings take precedence, the Profile’s or the local VPC’s? For example, if the VPC is associated to a PHZ for example.com and the Profile contains a PHZ for example.com, that VPC’s local DNS settings will take precedence over the Profile. When a query is made for a name for a conflicting domain name (for example, the Profile contains a PHZ for infra.example.com and the VPC is associated to a PHZ that has the name account1.infra.example.com), the most specific name wins.

Sharing Route 53 Profiles across accounts using AWS RAM
I use AWS Resource Access Manager (RAM) to share the Profile I created in the previous section with my other account.

I choose the Share profile option in the Profiles detail page or I can go to the AWS RAM console page and choose Create resource share.

I provide a name for my resource share and then I search for the ‘Route 53 Profiles’ in the Resources section. I select the Profile in Selected resources. I can choose to add tags. Then, I choose Next.

Profiles utilize RAM managed permissions, which allow me to attach different permissions to each resource type. By default, only the owner (the network admin) of the Profile will be able to modify the resources within the Profile. Recipients of the Profile (the VPC owners) will only be able to view the contents of the Profile (the ReadOnly mode). To allow a recipient of the Profile to add PHZs or other resources to it, the Profile’s owner will have to attach the necessary permissions to the resource. Recipients will not be able to edit or delete any resources added by the Profile owner to the shared resource.

I leave the default selections and choose Next to grant access to my other account.

On the next page, I choose Allow sharing with anyone, enter my other account’s ID and then choose Add. After that, I choose that account ID in the Selected principals section and choose Next.

In the Review and create page, I choose Create resource share. Resource share is successfully created.

Now, I switch to my other account that I share my Profile with and go to the RAM console. In the navigation menu, I go to the Resource shares and choose the resource name I created in the first account. I choose Accept resource share to accept the invitation.

That’s it! Now, I go to my Route 53 Profiles page and I choose the Profile shared with me.

I have access to the shared Profile’s DNS Firewall rule groups, private hosted zones, and Resolver rules. I can associate this account’s VPCs to this Profile. I am not able to edit or delete any resources. Profiles are Regional resources and cannot be shared across Regions.

Available now
You can easily get started with Route 53 Profiles using the AWS Management Console, Route 53 API, AWS Command Line Interface (AWS CLI), AWS CloudFormation, and AWS SDKs.

Route 53 Profiles will be available in all AWS Regions, except in Canada West (Calgary), the AWS GovCloud (US) Regions and the Amazon Web Services China Regions.

For more details about the pricing, visit the Route 53 pricing page.

Get started with Profiles today and please let us know your feedback either through your usual AWS Support contacts or the AWS re:Post for Amazon Route 53.

— Esra

Top Architecture Blog Posts of 2023

Post Syndicated from Andrea Courtright original https://aws.amazon.com/blogs/architecture/top-architecture-blog-posts-of-2023/

2023 was a rollercoaster year in tech, and we at the AWS Architecture Blog feel so fortunate to have shared in the excitement. As we move into 2024 and all of the new technologies we could see, we want to take a moment to highlight the brightest stars from 2023.

As always, thanks to our readers and to the many talented and hardworking Solutions Architects and other contributors to our blog.

I give you our 2023 cream of the crop!

#10: Build a serverless retail solution for endless aisle on AWS

In this post, Sandeep and Shashank help retailers and their customers alike in this guided approach to finding inventory that doesn’t live on shelves.

Building endless aisle architecture for order processing

Figure 1. Building endless aisle architecture for order processing

Check it out!

#9: Optimizing data with automated intelligent document processing solutions

Who else dreads wading through large amounts of data in multiple formats? Just me? I didn’t think so. Using Amazon AI/ML and content-reading services, Deependra, Anirudha, Bhajandeep, and Senaka have created a solution that is scalable and cost-effective to help you extract the data you need and store it in a format that works for you.

AI-based intelligent document processing engine

Figure 2: AI-based intelligent document processing engine

Check it out!

#8: Disaster Recovery Solutions with AWS managed services, Part 3: Multi-Site Active/Passive

Disaster recovery posts are always popular, and this post by Brent and Dhruv is no exception. Their creative approach in part 3 of this series is most helpful for customers who have business-critical workloads with higher availability requirements.

Warm standby with managed services

Figure 3. Warm standby with managed services

Check it out!

#7: Simulating Kubernetes-workload AZ failures with AWS Fault Injection Simulator

Continuing with the theme of “when bad things happen,” we have Siva, Elamaran, and Re’s post about preparing for workload failures. If resiliency is a concern (and it really should be), the secret is test, test, TEST.

Architecture flow for Microservices to simulate a realistic failure scenario

Figure 4. Architecture flow for Microservices to simulate a realistic failure scenario

Check it out!

#6: Let’s Architect! Designing event-driven architectures

Luca, Laura, Vittorio, and Zamira weren’t content with their four top-10 spots last year – they’re back with some things you definitely need to know about event-driven architectures.

Let's Architect

Figure 5. Let’s Architect artwork

Check it out!

#5: Use a reusable ETL framework in your AWS lake house architecture

As your lake house increases in size and complexity, you could find yourself facing maintenance challenges, and Ashutosh and Prantik have a solution: frameworks! The reusable ETL template with AWS Glue templates might just save you a headache or three.

Reusable ETL framework architecture

Figure 6. Reusable ETL framework architecture

Check it out!

#4: Invoking asynchronous external APIs with AWS Step Functions

It’s possible that AWS’ menagerie of services doesn’t have everything you need to run your organization. (Possible, but not likely; we have a lot of amazing services.) If you are using third-party APIs, then Jorge, Hossam, and Shirisha’s architecture can help you maintain a secure, reliable, and cost-effective relationship among all involved.

Invoking Asynchronous External APIs architecture

Figure 7. Invoking Asynchronous External APIs architecture

Check it out!

#3: Announcing updates to the AWS Well-Architected Framework

The Well-Architected Framework continues to help AWS customers evaluate their architectures against its six pillars. They are constantly striving for improvement, and Haleh’s diligence in keeping us up to date has not gone unnoticed. Thank you, Haleh!

Well-Architected logo

Figure 8. Well-Architected logo

Check it out!

#2: Let’s Architect! Designing architectures for multi-tenancy

The practically award-winning Let’s Architect! series strikes again! This time, Luca, Laura, Vittorio, and Zamira were joined by Federica to discuss multi-tenancy and why that concept is so crucial for SaaS providers.

Let's Architect

Figure 9. Let’s Architect

Check it out!

And finally…

#1: Understand resiliency patterns and trade-offs to architect efficiently in the cloud

Haresh, Lewis, and Bonnie revamped this 2022 post into a masterpiece that completely stole our readers’ hearts and is among the top posts we’ve ever made!

Resilience patterns and trade-offs

Figure 10. Resilience patterns and trade-offs

Check it out!

Bonus! Three older special mentions

These three posts were published before 2023, but we think they deserve another round of applause because you, our readers, keep coming back to them.

Thanks again to everyone for their contributions during a wild year. We hope you’re looking forward to the rest of 2024 as much as we are!

AWS Weekly Roundup—Amazon Route53, Amazon EventBridge, Amazon SageMaker, and more – January 15, 2024

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-amazon-route53-amazon-eventbridge-amazon-sagemaker-and-more-january-15-2024/

We are in January, the start of a new year, and I imagine many of you have made a new year resolution to learn something new. If you want to learn something new and get a free Amazon Web Services (AWS) Learning Badge, check out the new Events and Workflows Learning Path. This learning path will teach you everything you need to know about AWS Step Functions, Amazon EventBridge, event-driven architectures, and serverless, and when you finish the learning path, you can take an assessment. If you pass the assessment, you get an AWS Learning Badge, credited by Credly, that you can share in your résumé and social media profiles.

Events and workflows learning path badge

Last Week’s Launches
Here are some launches that got my attention during the previous week.

Amazon Route 53 – Now you can enable Route 53 Resolver DNS Firewall to filter DNS traffic based on the query type contained in the question section of the DNS query format. In addition, Route 53 now supports geoproximity routing as an additional routing policy for DNS records. Expand and reduce the geographic area from which traffic is routed to a resource by changing the record’s bias value. This is really helpful for industries that need to deliver highly responsive digital experiences.

Amazon CloudWatch LogsCloudWatch Logs now support creating account-level subscription filters. This capability allows you to forward all the logs groups from an account to other services like Amazon OpenSearch Service or Amazon Kinesis Data Firehose.

Amazon Elastic Container Service (Amazon ECS) Amazon ECS now integrates with Amazon Elastic Block Store (Amazon EBS), allowing you to provision and attach EBS volumes to Amazon ECS tasks running on both AWS Fargate and Amazon Elastic Cloud Compute (Amazon EC2). Read the blog post Channy wrote where he shows this feature in action.

Amazon EventBridgeEventBridge now supports AWS AppSync as a target of EventBridge buses. This enables you to stream real-time updates from your backend applications to your front-end clients. For example, you can get notifications in your mobile application from an order you did when the order status changes on the backend.

Amazon SageMakerSageMaker now supports M7i, C7i, and R7i instances for machine learning (ML) inference. These instances are powered by custom 4th generation Intel Xeon scalable processors and deliver up to 15 percent better price performance than their previous generations.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Some other updates and news that you may have missed:

If you are a serverless enthusiast, this week, the AWS Compute Blog published the Serverless ICYMI (in case you missed it) quarterly recap for the last quarter of 2023. This post compiles the announcements made during the months of October, November, and December, with all the relevant content that was produced by AWS Developer Advocates during that time. In addition to that blog post, you can learn about ServerlessVideo, a new demo application that we launched at AWS re:Invent 2023.

ServerlessVideo

This week there were also a couple of really interesting blog posts that explain how to solve very common challenges that customers face. The first one is the blog post in the AWS Security Blog that explains how to customize access tokens in Amazon Cognito user pools. And the second one is from the AWS Database Blog, which explains how to effectively sort data with Amazon DynamoDB.

The Official AWS Podcast – Listen each week for updates on the latest AWS news and deep dives into exciting use cases. There are also official AWS podcasts in several languages. Check out the ones in FrenchGermanItalian, and Spanish.

AWS open source newsletter – This is a newsletter curated by my colleague Ricardo to bring you the latest open source projects, posts, events, and more.

For our customers in Turkey, on January 1, 2024, AWS Turkey Pazarlama Teknoloji ve Danışmanlık Hizmetleri Limited Şirketi (AWS Turkey) replaced AWS EMEA SARL (AWS Europe) as the contracting party and service provider to customers in Türkiye. This enables AWS customers in Türkiye to transact in their local currency (Turkish Lira) and with a local bank. For more information on AWS Turkey, visit the FAQ page.

Upcoming AWS Events
The beginning of the year is the season of AWS re:Invent recaps, which are happening all around the globe during the next two months. You can check the recaps page to find the one closest to you.

You can browse all upcoming AWS led in-person and virtual events, as well as developer-focused events such as AWS DevDay.

That’s all for this week. Check back next Monday for another Week in Review!

— Marcia

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Happy New Year! AWS Weekly Roundup – January 8, 2024

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/happy-new-year-aws-weekly-roundup-january-8-2024/

Happy New Year! Cloud technologies, machine learning, and generative AI have become more accessible, impacting nearly every aspect of our lives. Amazon CTO Dr. Werner Vogels offers four tech predictions for 2024 and beyond:

  • Generative AI becomes culturally aware
  • FemTech finally takes off
  • AI assistants redefine developer productivity
  • Education evolves to match the speed of technology

Read how these technology trends will converge to help solve some of society’s most difficult problems. Download the Werner Vogels’ Tech Predictions for 2024 and Beyond ebook or read Werner’s All Things Distributed blog.

AWS re:Invent 2023To hear insights from AWS and industry thought leaders, grow your skills, and get inspired, watch AWS re:Invent 2023 videos on demand for keynotes, innovation talks, breakout sessions, and AWS Hero guide playlists.

Launches from the last few weeks
Since our last week in review on December 18, 2023, I’d like to highlight some launches from year end, as well as last week:

New AWS Canada West (Calgary) Region – We are opening a new and second Region and in Canada, AWS Canada West (Calgary). At the end of 2023, AWS had 33 AWS Regions and 105 Availability Zones (AZs) globally. We preannounced 12 additional AZs in four future Regions in Malaysia, New Zealand, Thailand, and the AWS European Sovereign Cloud. We will share more information on these Regions in 2024. Please stay tuned.

DNS over HTTPS in Amazon Route 53 Resolver – You can use the DNS over HTTPS (DoH) protocol for both inbound and outbound Route 53 Resolver endpoints. As the name suggests, DoH supports HTTP or HTTP/2 over TLS to encrypt the data exchanged for Domain Name System (DNS) resolutions.

Automatic enrollment to Amazon RDS Extended Support – Your MySQL 5.7 and PostgreSQL 11 database instances running on Amazon Aurora and Amazon RDS will be automatically enrolled into Amazon RDS Extended Support starting on February 29, 2024. You can have more control over when you want to upgrade the major version of your database after the community end of life (EoL).

New Amazon CloudWatch Network Monitor – This is a new feature of Amazon CloudWatch that helps monitor network availability and performance between AWS and your on-premises environments. Network Monitor needs zero manual instrumentation and gives you access to real-time network visibility to proactively and quickly identify issues within the AWS network and your own hybrid environment. For more information, read Monitor hybrid connectivity with Amazon CloudWatch Network Monitor.

Amazon Aurora PostgreSQL integrations with Amazon Bedrock – You can use two methods to integrate Aurora PostgreSQL databases with Amazon Bedrock to power generative AI applications. You can use the SQL query with Aurora ML integration with Amazon Bedrock and Aurora vector store with Knowledge Bases for Amazon Bedrock for Retrieval Augmented Generation (RAG).

New WordPress setup on Amazon Lightsail – Set up your WordPress website on Amazon Lightsail with the new workflow to eliminate complexity and time spent configuring your website. The workflow allows you to complete all the necessary steps, including setting up a Secure Sockets Layer (SSL) certificate to secure your website with HTTPS.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Here are some other news items that you may find interesting in the new year:

Book recommendations for AWS customer executives – Plan for the new year and catch up on what others are doing and thinking. AWS Enterprise Strategy team recommends what books are most important for our AWS customer executives to read.

Best practices for scaling AWS CDK adoption with Platform Engineering – A recent evolution in DevOps is the introduction of platform engineering teams to build services, toolchains, and documentation to support workload teams. This blog post introduces strategies and best practices for accelerating CDK adoption within your organization. You can learn how to scale the lessons learned from the pilot project across your organization through platform engineering.

High performance running HPC applications on AWS Graviton instances – When running the Parallel Lattice Boltzmann Solver (Palabos) on Amazon EC2 Hpc7g instances to solve computational fluid dynamics (CFD) problems, performance increased by up to 70% and price performance was up to 3x better than on the previous generation of Graviton instances.

The new AWS open source newsletter, #181 – Check up on all the latest open source content, which this week includes AWS Amplify, Amazon Corretto, dbt, Apache Flink, Karpenter, LangChain, Pinecone, and more.

Upcoming AWS Events
Check your calendars and sign up for these AWS events in the new year:

AWS at CES 2024 (January 9-12) – AWS will be representing some of the latest cloud services and solutions that are purpose built for the automotive, mobility, transportation, and manufacturing industries. Join us to learn about the latest cloud capabilities across generative AI, software define vehicles, product engineering, sustainability, new digital customer experiences, connected mobility, autonomous driving, and so much more in Amazon Experience Area.

APJ Builders Online Series (January 18) – This online conference is designed for you to learn core AWS concepts, and step-by-step architectural best practices, including demonstrations to help you get started and accelerate your success on AWS.

You can browse all upcoming AWS-led in-person and virtual events, and developer-focused events such as AWS DevDay.

That’s all for this week. Check back next Monday for another Week in Review!

— Channy

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

DNS over HTTPS is now available in Amazon Route 53 Resolver

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/dns-over-https-is-now-available-in-amazon-route-53-resolver/

Starting today, Amazon Route 53 Resolver supports using the DNS over HTTPS (DoH) protocol for both inbound and outbound Resolver endpoints. As the name suggests, DoH supports HTTP or HTTP/2 over TLS to encrypt the data exchanged for Domain Name System (DNS) resolutions.

Using TLS encryption, DoH increases privacy and security by preventing eavesdropping and manipulation of DNS data as it is exchanged between a DoH client and the DoH-based DNS resolver.

This helps you implement a zero-trust architecture where no actor, system, network, or service operating outside or within your security perimeter is trusted and all network traffic is encrypted. Using DoH also helps follow recommendations such as those described in this memorandum of the US Office of Management and Budget (OMB).

DNS over HTTPS support in Amazon Route 53 Resolver
You can use Amazon Route 53 Resolver to resolve DNS queries in hybrid cloud environments. For example, it allows AWS services access for DNS requests from anywhere within your hybrid network. To do so, you can set up inbound and outbound Resolver endpoints:

  • Inbound Resolver endpoints allow DNS queries to your VPC from your on-premises network or another VPC.Amazon Route 53 Resolver inbound endpoint architecture.
  • Outbound Resolver endpoints allow DNS queries from your VPC to your on-premises network or another VPC.Amazon Route 53 Resolver outbound endpoint architecture.

After you configure the Resolver endpoints, you can set up rules that specify the name of the domains for which you want to forward DNS queries from your VPC to an on-premises DNS resolver (outbound) and from on-premises to your VPC (inbound).

Now, when you create or update an inbound or outbound Resolver endpoint, you can specify which protocols to use:

  • DNS over port 53 (Do53), which is using either UDP or TCP to send the packets.
  • DNS over HTTPS (DoH), which is using TLS to encrypt the data.
  • Both, depending on which one is used by the DNS client.
  • For FIPS compliance, there is a specific implementation (DoH-FIPS) for inbound endpoints.

Let’s see how this works in practice.

Using DNS over HTTPS with Amazon Route 53 Resolver
In the Route 53 console, I choose Inbound endpoints from the Resolver section of the navigation pane. There, I choose Create inbound endpoint.

I enter a name for the endpoint, select the VPC, the security group, and the endpoint type (IPv4, IPv6, or dual-stack). To allow using both encrypted and unencrypted DNS resolutions, I select Do53, DoH, and DoH-FIPS in the Protocols for this endpoint option.

Console screenshot.

After that, I configure the IP addresses for DNS queries. I select two Availability Zones and, for each, a subnet. For this setup, I use the option to have the IP addresses automatically selected from those available in the subnet.

After I complete the creation of the inbound endpoint, I configure the DNS server in my network to forward requests for the amazonaws.com domain (used by AWS service endpoints) to the inbound endpoint IP addresses.

Similarly, I create an outbound Resolver endpoint and and select both Do53 and DoH as protocols. Then, I create forwarding rules that tell for which domains the outbound Resolver endpoint should forward requests to the DNS servers in my network.

Now, when the DNS clients in my hybrid environment use DNS over HTTPS in their requests, DNS resolutions are encrypted. Optionally, I can enforce encryption and select only DoH in the configuration of inbound and outbound endpoints.

Things to know
DNS over HTTPS support for Amazon Route 53 Resolver is available today in all AWS Regions where Route 53 Resolver is offered, including GovCloud Regions and Regions based in China.

DNS over port 53 continues to be the default for inbound or outbound Resolver endpoints. In this way, you don’t need to update your existing automation tooling unless you want to adopt DNS over HTTPS.

There is no additional cost for using DNS over HTTPS with Resolver endpoints. For more information, see Route 53 pricing.

Start using DNS over HTTPS with Amazon Route 53 Resolver to increase privacy and security for your hybrid cloud environments.

Danilo

Zonal autoshift – Automatically shift your traffic away from Availability Zones when we detect potential issues

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/zonal-autoshift-automatically-shift-your-traffic-away-from-availability-zones-when-we-detect-potential-issues/

Today we’re launching zonal autoshift, a new capability of Amazon Route 53 Application Recovery Controller that you can enable to automatically and safely shift your workload’s traffic away from an Availability Zone when AWS identifies a potential failure affecting that Availability Zone and shift it back once the failure is resolved.

When deploying resilient applications, you typically deploy your resources across multiple Availability Zones in a Region. Availability Zones are distinct groups of physical data centers at a meaningful distance apart (typically miles) to make sure that they have diverse power, connectivity, network devices, and flood plains.

To help you protect against an application’s errors, like a failed deployment, an error of configuration, or an operator error, we introduced last year the ability to manually or programmatically trigger a zonal shift. This enables you to shift the traffic away from one Availability Zone when you observe degraded metrics in that zone. It does so by configuring your load balancer to direct all new connections to infrastructure in healthy Availability Zones only. This allows you to preserve your application’s availability for your customers while you investigate the root cause of the failure. Once fixed, you stop the zonal shift to ensure the traffic is distributed across all zones again.

Zonal shift works at the Application Load Balancer (ALB) or Network Load Balancer (NLB) level only when cross-zone load balancing is turned off, which is the default for NLB. In a nutshell, load balancers offer two levels of load balancing. The first level is configured in the DNS. Load balancers expose one or more IP addresses for each Availability Zone, offering a client-side load balancing between zones. Once the traffic hits an Availability Zone, the load balancer sends traffic to registered healthy targets, typically an Amazon Elastic Compute Cloud (Amazon EC2) instance. By default, ALBs send traffic to targets across all Availability Zones. For zonal shift to properly work, you must configure your load balancers to disable cross-zone load balancing.

When zonal shift starts, the DNS sends all traffic away from one Availability Zone, as illustrated by the following diagram.

ARC Zonal Shift

Manual zonal shift helps to protect your workload against errors originating from your side. But when there is a potential failure in an Availability Zone, it is sometimes difficult for you to identify or detect the failure. Detecting an issue in an Availability Zone using application metrics is difficult because, most of the time, you don’t track metrics per Availability Zone. Moreover, your services often call dependencies across Availability Zone boundaries, resulting in errors seen in all Availability Zones. With modern microservice architectures, these detection and recovery steps must often be performed across tens or hundreds of discrete microservices, leading to recovery times of multiple hours.

Customers asked us if we could take the burden off their shoulders to detect a potential failure in an Availability Zone. After all, we might know about potential issues through our internal monitoring tools before you do.

With this launch, you can now configure zonal autoshift to protect your workloads against potential failure in an Availability Zone. We use our own AWS internal monitoring tools and metrics to decide when to trigger a network traffic shift. The shift starts automatically; there is no API to call. When we detect that a zone has a potential failure, such as a power or network disruption, we automatically trigger an autoshift of your infrastructure’s NLB or ALB traffic, and we shift the traffic back when the failure is resolved.

Obviously, shifting traffic away from an Availability Zone is a delicate operation that must be carefully prepared. We built a series of safeguards to ensure we don’t degrade your application availability by accident.

First, we have internal controls to ensure we shift traffic away from no more than one Availability Zone at a time. Second, we practice the shift on your infrastructure for 30 minutes every week. You can define blocks of time when you don’t want the practice to happen, for example, 08:00–18:00, Monday through Friday. Third, you can define two Amazon CloudWatch alarms to act as a circuit breaker during the practice run: one alarm to prevent starting the practice run at all and one alarm to monitor your application health during a practice run. When either alarm triggers during the practice run, we stop it and restore traffic to all Availability Zones. The state of application health alarm at the end of the practice run indicates its outcome: success or failure.

According to the principle of shared responsibility, you have two responsibilities as well.

First you must ensure there is enough capacity deployed in all Availability Zones to sustain the increase of traffic in remaining Availability Zones after traffic has shifted. We strongly recommend having enough capacity in remaining Availability Zones at all times and not relying on scaling mechanisms that could delay your application recovery or impact its availability. When zonal autoshift triggers, AWS Auto Scaling might take more time than usual to scale your resources. Pre-scaling your resource ensures a predictable recovery time for your most demanding applications.

Let’s imagine that to absorb regular user traffic, your application needs six EC2 instances across three Availability Zones (2×3 instances). Before configuring zonal autoshift, you should ensure you have enough capacity in the remaining Availability Zones to absorb the traffic when one Availability Zone is not available. In this example, it means three instances per Availability Zone (3×3 = 9 instances with three Availability Zones in order to keep 2×3 = 6 instances to handle the load when traffic is shifted to two Availability Zones).

In practice, when operating a service that requires high reliability, it’s normal to operate with some redundant capacity online for eventualities such as customer-driven load spikes, occasional host failures, etc. Topping up your existing redundancy in this way both ensures you can recover rapidly during an Availability Zone issue but can also give you greater robustness to other events.

Second, you must explicitly enable zonal autoshift for the resources you choose. AWS applies zonal autoshift only on the resources you chose. Applying a zonal autoshift will affect the total capacity allocated to your application. As I just described, your application must be prepared for that by having enough capacity deployed in the remaining Availability Zones.

Of course, deploying this extra capacity in all Availability Zones has a cost. When we talk about resilience, there is a business tradeoff to decide between your application availability and its cost. This is another reason why we apply zonal autoshift only on the resources you select.

Let’s see how to configure zonal autoshift
To show you how to configure zonal autoshift, I deploy my now-famous TicTacToe web application using a CDK script. I open the Route 53 Application Recovery Controller page of the AWS Management Console. On the left pane, I select Zonal autoshift. Then, on the welcome page, I select Configure zonal autoshift for a resource.

Zonal autoshift - 1

I select the load balancer of my demo application. Remember that currently, only load balancers with cross-zone load balancing turned off are eligible for zonal autoshift. As the warning on the console reminds me, I also make sure my application has enough capacity to continue to operate with the loss of one Availability Zone.

Zonal autoshift - 2

I scroll down the page and configure the times and days I don’t want AWS to run the 30-minute practice. At first, and until I’m comfortable with autoshift, I block the practice 08:00–18:00, Monday through Friday. Pay attention that hours are expressed in UTC, and they don’t vary with daylight saving time. You may use a UTC time converter application for help. While it is safe for you to exclude business hours at the start, we recommend configuring the practice run also during your business hours to ensure capturing issues that might not be visible when there is low or no traffic on your application. You probably most need zonal autoshift to work without impact at your peak time, but if you have never tested it, how confident are you? Ideally, you don’t want to block any time at all, but we recognize that’s not always practical.

Zonal autoshift - 3

Further down on the same page, I enter the two circuit breaker alarms. The first one prevents the practice from starting. You use this alarm to tell us this is not a good time to start a practice run. For example, when there is an issue ongoing with your application or when you’re deploying a new version of your application to production. The second CloudWatch alarm gives the outcome of the practice run. It enables zonal autoshift to judge how your application is responding to the practice run. If the alarm stays green, we know all went well.

If either of these two alarms triggers during the practice run, zonal autoshift stops the practice and restores the traffic to all Availability Zones.

Finally, I acknowledge that a 30-minute practice run will run weekly and that it might reduce the availability of my application.

Then, I select Create.

Zonal autoshift - 4And that’s it.

After a few days, I see the history of the practice runs on the Zonal shift history for resource tab of the console. I monitor the history of my two circuit breaker alarms to stay confident everything is correctly monitored and configured.

ARC Zonal Shift - practice run

It’s not possible to test an autoshift itself. It triggers automatically when we detect a potential issue in an Availability Zone. I asked the service team if we could shut down an Availability Zone to test the instructions I shared in this post; they politely declined my request :-).

To test your configuration, you can trigger a manual shift, which behaves identically to an autoshift.

A few more things to know
Zonal autoshift is now available at no additional cost in all AWS Regions, except for China and GovCloud.

We recommend applying the crawl, walk, run methodology. First, you get started with manual zonal shifts to acquire confidence in your application. Then, you turn on zonal autoshift configured with practice runs outside of your business hours. Finally, you modify the schedule to include practice zonal shifts during your business hours. You want to test your application response to an event when you least want it to occur.

We also recommend that you think holistically about how all parts of your application will recover when we move traffic away from one Availability Zone and then back. The list that comes to mind (although certainly not complete) is the following.

First, plan for extra capacity as I discussed already. Second, think about possible single points of failure in each Availability Zone, such as a self-managed database running on a single EC2 instance or a microservice that leaves in a single Availability Zone, and so on. I strongly recommend using managed databases, such as Amazon DynamoDB or Amazon Aurora for applications requiring zonal shifts. These have built-in replication and fail-over mechanisms in place. Third, plan the switch back when the Availability Zone will be available again. How much time do you need to scale your resources? Do you need to rehydrate caches?

You can learn more about resilient architectures and methodologies with this great series of articles from my colleague Adrian.

Finally, remember that only load balancers with cross-zone load balancing turned off are currently eligible for zonal autoshift. To turn off cross-zone load balancing from a CDK script, you need to remove stickinessCookieDuration and add load_balancing.cross_zone.enabled=false on the target group. Here is an example with CDK and Typescript:

    // Add the auto scaling group as a load balancing
    // target to the listener.
    const targetGroup = listener.addTargets('MyApplicationFleet', {
      port: 8080,
      // for zonal shift, stickiness & cross-zones load balancing must be disabled
      // stickinessCookieDuration: Duration.hours(1),
      targets: [asg]
    });    
    // disable cross zone load balancing
    targetGroup.setAttribute("load_balancing.cross_zone.enabled", "false");

Now it’s time for you to select your applications that would benefit from zonal autoshift. Start by reviewing your infrastructure capacity in each Availability Zone and then define the circuit breaker alarms. Once you are confident your monitoring is correctly configured, go and enable zonal autoshift.

— seb