Tag Archives: AWS Direct Connect

Designing a hybrid AI/ML data access strategy with Amazon SageMaker

2023-07-10 Franklin Aguinaldo

Post Syndicated from Franklin Aguinaldo original https://aws.amazon.com/blogs/architecture/designing-a-hybrid-ai-ml-data-access-strategy-with-amazon-sagemaker/

Over time, many enterprises have built an on-premises cluster of servers, accumulating data, and then procuring more servers and storage. They often begin their ML journey by experimenting locally on their laptops. Investment in artificial intelligence (AI) is at a different stage in every business organization. Some remain completely on-premises, others are hybrid (both on-premises and cloud), and the remaining have moved completely into the cloud for their AI and machine learning (ML) workloads.

These enterprises are also researching or have started using the cloud to augment their on-premises systems for several reasons. As technology improves, both the size and quantity of data increase over time. The amount of data captured and the number of datapoints continue to expand, which makes the data challenging to manage on-premises. Many enterprises are distributed, with offices in different geographic regions, continents, and time zones. While it is possible to increase the on-premises footprint and network pipes, there are still hidden costs to consider for maintenance and upkeep. These organizations are looking to the cloud to shift some of that effort and enable them to burst and use the rich AI and ML features on the cloud.

Defining a hybrid data access strategy

Moving ML workloads into the cloud calls for a robust hybrid data strategy describing how and when you will connect your on-premises data stores to the cloud. For most, it makes sense to make the cloud the source of truth, while still permitting your teams to use and curate datasets on-premises. Defining the cloud as the source of truth for your datasets means the primary copy will be in the cloud and any dataset generated will be stored in the same location in the cloud. This ensures that requests for data are served from the primary copy and any derived copies.

A hybrid data access strategy should address the following:

Understand your current and future storage footprint for ML on-premises. Create a map of your ML workloads, along with performance and access requirements for testing and training.
Define connectivity across on-premises locations and the cloud. This includes east-west and north-south traffic to support interconnectivity between sites, required bandwidth, and throughput for the data movement workload requirements.
Define your single source of truth (SSOT)[1] and where the ML datasets will primarily live. Consider how dated, new, hot, and cold data will be stored.
Define your storage performance requirements, mapping them to the appropriate cloud storage services. This will give you the ability to take advantage of cloud-native ML with Amazon SageMaker.

Hybrid data access strategy architecture

To help address these challenges, we worked on outlining an end-to-end system architecture in Figure 1 that defines: 1) connectivity between on-premises data centers and AWS Regions; 2) mappings for on-premises data to the cloud; and 3) aligning Amazon SageMaker to appropriate storage, based on ML requirements.

Figure 1. AI/ML hybrid data access strategy reference architecture

Let’s explore this architecture step by step.

On-premises connectivity to the AWS Cloud runs through AWS Direct Connect for high transfer speeds.
AWS DataSync is used for migrating large datasets into Amazon Simple Storage Service (Amazon S3). AWS DataSync agent is installed on-premises.
On-premises network file system (NFS) or server message block (SMB) data is bridged to the cloud through Amazon S3 File Gateway, using either a virtual machine (VM) or hardware appliance.
AWS Storage Gateway uploads data into Amazon S3 and caches it on-premises.
Amazon S3 is the source of truth for ML assets stored on the cloud.
Download S3 data for experimentation to Amazon SageMaker Studio.
Amazon SageMaker notebook instances can access data through S3, Amazon FSx for Lustre, and Amazon Elastic File System (Amazon EFS). Use Amazon File Cache for high-speed caching for access to on-premises data, and Amazon FSx for NetApp ONTAP for cloud bursting.
SageMaker training jobs can use data in Amazon S3, EFS, and FSx for Lustre. S3 data is accessed via File, Fast File, or Pipe mode, and pre-loaded or lazy-loaded when using FSx for Lustre as training job input. Any existing data on EFS can also be made available to training jobs.
Leverage Amazon S3 Glacier for archiving data and reducing storage costs.

ML workloads using Amazon SageMaker

Let’s go deeper into how SageMaker can help you with your ML workloads.

To start mapping ML workloads to the cloud, consider which AWS storage services work with Amazon SageMaker. Amazon S3 typically serves as the central storage location for both structured and unstructured data that is used for ML. This includes raw data coming from upstream applications, and also curated datasets that are organized and stored as part of a Feature Store.

In the initial phases of development, a SageMaker Studio user will leverage S3 APIs to download data from S3 to their private home directory. This home directory is backed by a SageMaker-managed EFS file system. Studio users then point their notebook code (also stored in the home directory) to the local dataset and begin their development tasks.

To scale up and automate model training, SageMaker users can launch training jobs that run outside of the SageMaker Studio notebook environment. There are several options for making data available to a SageMaker training job.

Amazon S3. Users can specify the S3 location of the training dataset. When using S3 as a data source, there are three input modes to choose from:
- File mode. This is the default input mode, where SageMaker copies the data from S3 to the training instance storage. This storage is either a SageMaker-provisioned Amazon Elastic Block Store (Amazon EBS) volume or an NVMe SSD that is included with specific instance types. Training only starts after the dataset has been downloaded to the storage, and there must be enough storage space to fit the entire dataset.
- Fast file mode. Fast file mode exposes S3 objects as a POSIX file system on the training instance. Dataset files are streamed from S3 on demand, as the training script reads them. This means that training can start sooner and require less disk space. Fast file mode also does not require changes to the training code.
- Pipe mode. Pipe input also streams data in S3 as the training script reads it, but requires code changes. Pipe input mode is largely replaced by the newer and easier-to-use Fast File mode.
FSx for Lustre. Users can specify an FSx for Lustre file system, which SageMaker will mount to the training instance before running the training code. When the FSx for Lustre file system is linked to an S3 bucket, the data can be lazily loaded from S3 during the first training job. Subsequent training jobs on the same dataset can then access it with low latency. Users can also choose to pre-load the file system with S3 data using hsm_restore commands.
Amazon EFS. Users can specify an EFS file system that already contains their training data. SageMaker will mount the file system on the training instance and run the training code.
Find out how to Choose the best data source for your SageMaker training job.
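The data source options above correspond to the channel definitions that a training job request carries. The following is a minimal sketch, assuming the channel shape of the `InputDataConfig` parameter in the SageMaker `CreateTrainingJob` API. The helper names, bucket, and file system ID are hypothetical placeholders, and the code only builds dictionaries; it does not call AWS.

```python
from typing import Any, Dict

# Input modes described above for S3-backed channels.
VALID_S3_INPUT_MODES = {"File", "FastFile", "Pipe"}
# File system types a channel can reference.
VALID_FILE_SYSTEM_TYPES = {"EFS", "FSxLustre"}


def s3_channel(name: str, s3_uri: str, input_mode: str = "File") -> Dict[str, Any]:
    """Build a training channel that reads from an S3 prefix (hypothetical helper)."""
    if input_mode not in VALID_S3_INPUT_MODES:
        raise ValueError(f"unsupported input mode: {input_mode}")
    return {
        "ChannelName": name,
        "InputMode": input_mode,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


def file_system_channel(
    name: str, file_system_id: str, file_system_type: str, directory_path: str
) -> Dict[str, Any]:
    """Build a training channel backed by EFS or FSx for Lustre (hypothetical helper)."""
    if file_system_type not in VALID_FILE_SYSTEM_TYPES:
        raise ValueError(f"unsupported file system type: {file_system_type}")
    return {
        "ChannelName": name,
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": file_system_id,
                "FileSystemType": file_system_type,
                "FileSystemAccessMode": "ro",  # read-only access to the dataset
                "DirectoryPath": directory_path,
            }
        },
    }


# Placeholder values for illustration only.
fast_file = s3_channel("train", "s3://example-bucket/train/", "FastFile")
lustre = file_system_channel("train", "fs-0123456789abcdef0", "FSxLustre", "/fsx/train")
```

Switching between File and Fast File mode changes only the `InputMode` value, which matches the point above that Fast File mode needs no training code changes. A file system channel additionally expects the training job to run inside a VPC that can reach the file system.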

Conclusion

With this reference architecture, you can develop and deliver ML workloads that run either on-premises or in the cloud. Your enterprise can continue using its on-premises storage and compute for particular ML workloads, while also taking advantage of the cloud, using Amazon SageMaker. The scale available on the cloud allows your enterprise to conduct experiments without worrying about capacity. Start defining your hybrid data strategy on AWS today!

[1] The practice of aggregating data from many sources to a single source or location.

AWS Week in Review – Amazon Security Lake Now GA, New Actions on AWS Fault Injection Simulator, and More – June 5, 2023

2023-06-05 Veliswa Boya

Post Syndicated from Veliswa Boya original https://aws.amazon.com/blogs/aws/aws-week-in-review-amazon-security-lake-now-ga-new-actions-on-aws-fault-injection-simulator-and-more-june-5-2023/

Last Wednesday, I traveled to Cape Town to speak at the .Net Developer User Group. My colleague Francois Bouteruche also gave a talk but joined virtually. I enjoyed my time there—what an amazing community! Join the group in order to learn about upcoming events.

Now onto the AWS updates from last week. There was a lot of news related to AWS, and I have compiled a few announcements you need to know. Let’s get started!

Last Week’s Launches
Here are a few launches from last week that you might have missed:

Amazon Security Lake is now Generally Available – This service automatically centralizes security data from AWS environments, SaaS providers, on-premises environments, and cloud sources into a purpose-built data lake stored in your account, making it easier to analyze security data, gain a more comprehensive understanding of security across your entire organization, and improve the protection of your workloads, applications, and data. Read more in Channy’s post announcing the preview of Security Lake.

New AWS Direct Connect Location in Santiago, Chile – The AWS Direct Connect service lets you create a dedicated network connection to AWS. With this service, you can build hybrid networks by linking your AWS and on-premises networks to build applications that span environments without compromising performance. Last week we announced the opening of a new AWS Direct Connect location in Santiago, Chile. This new Santiago location offers dedicated 1 Gbps and 10 Gbps connections, with MACsec encryption available for 10 Gbps. For more information on over 115 Direct Connect locations worldwide, visit the locations section of the Direct Connect product detail pages.

New actions on AWS Fault Injection Simulator for Amazon EKS and Amazon ECS – Had it not been for Adrian Hornsby’s LinkedIn post I would have missed this announcement. We announced the expanded support of AWS Fault Injection Simulator (FIS) for Amazon Elastic Kubernetes Service (EKS) and Amazon Elastic Container Service (ECS). This expanded support adds additional AWS FIS actions for Amazon EKS and Amazon ECS. Learn more about Amazon ECS task actions here, and Amazon EKS pod actions here.

Other AWS News
A few more news items and blog posts you might have missed:

Autodesk Uses SageMaker to Improve Observability – One of our customers, Autodesk, used AWS services including Amazon SageMaker, Amazon Kinesis, and Amazon API Gateway to build a platform that enables development and deployment of near-real-time personalization experiments by modeling and responding to user behavior data. All this delivered a dynamic, personalized experience for Autodesk’s customers. Read more about the story at AWS Customer Stories.

AWS DMS Serverless – We announced AWS DMS Serverless which lets you automatically provision and scale capacity for migration and data replication. Donnie wrote about this announcement here.

For AWS open-source news and updates, check out the latest newsletter curated by my colleague Ricardo Sueiras to bring you the most recent updates on open-source projects, posts, events, and more.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Upcoming AWS Events
We have the following upcoming events. These give you the opportunity to meet with other tech enthusiasts and learn:

AWS Silicon Innovation Day (June 21) – A one-day virtual event that will allow you to understand AWS Silicon and how you can use AWS’s unique silicon offerings to innovate. Learn more and register here.

AWS Global Summits – Sign up for the AWS Summit closest to where you live: London (June 7), Washington, DC (June 7–8), Toronto (June 14).

AWS Community Days – Join these community-led conferences where event logistics and content are planned, sourced, and delivered by community leaders: Chicago, Illinois (June 15), and Chile (July 1).

And with that, I end my very first Week in Review post, and this was such fun to write. Come back next Monday for another Week in Review!

– Veliswa x

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Building highly resilient applications with on-premises interdependencies using AWS Local Zones

2022-10-27 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/building-highly-resilient-applications-with-on-premises-interdependencies-using-aws-local-zones/

This blog post is written by Rachel Rui Liu, Senior Solutions Architect.

AWS Local Zones are a type of infrastructure deployment that places compute, storage, database, and other select AWS services close to large population and industry centers.

Following the successful launch of AWS Local Zones in 16 US cities since 2019, in February 2022 AWS announced plans to launch new AWS Local Zones in 32 metropolitan areas in 26 countries worldwide.

With Local Zones, we’ve seen use cases in two common categories.

The first category of use cases is for workloads that require extremely low latency between end-user devices and workload servers. For example, let’s consider media content creation and real-time multiplayer gaming. For these use cases, deploying the workload to a Local Zone can help achieve down to single-digit milliseconds latency between end-user devices and the AWS infrastructure, which is ideal for a good end-user experience.

This post will focus on addressing the second category of use cases, which is commonly seen in an enterprise hybrid architecture, where customers must achieve low latency between AWS infrastructure and existing on-premises data centers. Compared to the first category of use cases, these use cases can tolerate slightly higher latency between the end-user devices and the AWS infrastructure. However, these workloads have dependencies to these on-premises systems, so the lowest possible latency between AWS infrastructure and on-premises data centers is required for better application performance. Here are a few examples of these systems:

Financial services sector mainframe workloads hosted on premises serving regional customers.
Enterprise Active Directory hosted on premises serving cloud and on-premises workloads.
Enterprise applications hosted on premises processing a high volume of locally generated data.

For workloads deployed in AWS, the time taken for each interaction with components still hosted in the on-premises data center is increased by the latency. In turn, this delays responses received by the end-user. The total latency accumulates and results in suboptimal user experiences.

By deploying modernized workloads in Local Zones, you can reduce latency while continuing to access systems hosted in on-premises data centers, thereby reducing the total latency for the end-user. At the same time, you can enjoy the benefits of agility, elasticity, and security offered by AWS, and can apply the same automation, compliance, and security best practices that you’ve been familiar with in the AWS Regions.

Enterprise workload resiliency with Local Zones

While designing hybrid architectures with Local Zones, resiliency is an important consideration. You want to route traffic to the nearest Local Zone for low latency. However, when disasters happen, it’s critical to fail over to the parent Region automatically.

Let’s look at the details of hybrid architecture design based on real world deployments from different angles to understand how the architecture achieves all of the design goals.

Hybrid architecture with resilient network connectivity

The following diagram shows a high-level overview of a resilient enterprise hybrid architecture with Local Zones, where you have redundant connections between the AWS Region, the Local Zone, and the corporate data center.

Here are a few key points with this network connectivity design:

Use AWS Direct Connect or Site-to-Site VPN to connect the corporate data center and AWS Region.
Use Direct Connect or self-hosted VPN to connect the corporate data center and the Local Zone. This connection will provide dedicated low-latency connectivity between the Local Zone and corporate data center.
Transit Gateway is a regional service. When attaching the VPC to AWS Transit Gateway, you can only add subnets provisioned in the Region. Instances on subnets in the Local Zone can still use Transit Gateway to reach resources in the Region.
For subnets provisioned in the Region, the VPC route table should be configured to route the traffic to the corporate data center via Transit Gateway.
For subnets provisioned in Local Zone, the VPC route table should be configured to route the traffic to the corporate data center via the self-hosted VPN instance or Direct Connect.

Hybrid architecture with resilient workload deployment

The next examples show a public and a private facing workload.

To simplify the diagram and focus on application layer architecture, the following diagrams assume that you are using Direct Connect to connect between AWS and the on-premises data center.

Example 1: Resilient public facing workload

With a public facing workload, end-user traffic will be routed to the Local Zone. If the Local Zone is unavailable, then the traffic will be routed to the Region automatically using an Amazon Route 53 failover policy.

Here are the key design considerations for this architecture:

Deploy the workload in the Local Zone and put the compute layer in an Auto Scaling group, so that the application can scale up and down depending on the volume of requests.
Deploy the workload in both the Local Zone and an AWS Region, and put the compute layer into an Auto Scaling group. The regional deployment will act as a pilot light or warm standby with minimal footprint, but it can scale out when the Local Zone is unavailable.
Two Application Load Balancers (ALBs) are required: one in the Region and one in the Local Zone. Each ALB will dispatch traffic to the workload cluster inside the Auto Scaling group local to it.
An internet gateway is required for public facing workloads. When using a Local Zone, there’s no extra configuration needed: define a single internet gateway and attach it to the VPC.

If you want to specify an Elastic IP address to be the workload’s public endpoint, the Local Zone will have a different address pool than the Region. Note that BYOIP is unsupported for Local Zones.

Create a Route 53 DNS record with “Failover” as the routing policy.

For the primary record, point it to the alias of the ALB in the Local Zone. This will set Local Zone as the preferred destination for the application traffic which minimizes latency for end-users.
For the secondary record, point it to the alias of the ALB in the AWS Region.
Enable health check for the primary record. If health check against the primary record fails, which indicates that the workload deployed in the Local Zone has failed to respond, then Route 53 will automatically point to the secondary record, which is the workload deployed in the AWS Region.
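The primary and secondary records above can be sketched as a change batch in the shape accepted by the Route 53 `ChangeResourceRecordSets` API. This is a minimal illustration and not the post's own tooling: the helper names, domain name, ALB DNS names, and hosted zone IDs are hypothetical placeholders, and the code only builds the request body without calling AWS.

```python
from typing import Any, Dict, Optional


def failover_alias_record(
    record_name: str,
    role: str,
    alb_dns_name: str,
    alb_hosted_zone_id: str,
    health_check_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Build one UPSERT change for a failover alias record (hypothetical helper)."""
    if role not in {"PRIMARY", "SECONDARY"}:
        raise ValueError(f"role must be PRIMARY or SECONDARY, got {role}")
    record: Dict[str, Any] = {
        "Name": record_name,
        "Type": "A",
        "SetIdentifier": f"{record_name}-{role.lower()}",
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alb_hosted_zone_id,
            "DNSName": alb_dns_name,
            # Let Route 53 take the health of the ALB targets into account.
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id is not None:
        # Optional explicit health check, for example on the primary record.
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(
    record_name: str,
    local_zone_alb: Dict[str, str],
    region_alb: Dict[str, str],
    primary_health_check_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Primary points at the Local Zone ALB, secondary at the Region ALB."""
    return {
        "Comment": "Failover from Local Zone to parent Region",
        "Changes": [
            failover_alias_record(
                record_name, "PRIMARY",
                local_zone_alb["dns_name"], local_zone_alb["hosted_zone_id"],
                primary_health_check_id,
            ),
            failover_alias_record(
                record_name, "SECONDARY",
                region_alb["dns_name"], region_alb["hosted_zone_id"],
            ),
        ],
    }


# Placeholder values for illustration only.
batch = failover_change_batch(
    "app.example.com.",
    {"dns_name": "lz-alb.example.com.", "hosted_zone_id": "ZEXAMPLELZ"},
    {"dns_name": "region-alb.example.com.", "hosted_zone_id": "ZEXAMPLEREGION"},
    primary_health_check_id="hc-example-id",
)
```

The same structure would apply to the private workload in Example 2, with the records created in a private hosted zone and pointing at internal ALBs.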

Example 2: Resilient private workload

For a private workload that’s only accessible by internal users, a few extra considerations must be made to keep the traffic inside of the trusted private network.

The architecture for a resilient private-facing workload follows the same steps as the public-facing workload, but with some key differences. These include:

Instead of using a public hosted zone, create private hosted zones in Route 53 to respond to DNS queries for the workload.
Create the primary and secondary records in Route 53 just like the public workload but referencing the private ALBs.
To allow end-users on the corporate network (within offices or connected via VPN) to resolve the workload, use the Route 53 Resolver with an inbound endpoint. This allows end-users located on-premises to resolve the records in the private hosted zone. Route 53 Resolver is designed to be integrated with an on-premises DNS server.
No internet gateway is required for hosting the private workload. You might need an internet gateway in the Local Zone for other purposes: for example, to host a self-managed VPN solution to connect the Local Zone with the corporate data center.

Hosting multiple workloads

Customers who host multiple workloads in a single VPC generally must consider how to segregate those workloads. As with workloads in the AWS Region, segregation can be implemented at a subnet or VPC level.

If you want to segregate workloads at the subnet level, you can extend your existing VPC architecture by provisioning extra sets of subnets to the Local Zone.

Although not shown in the diagram, for those of you using a self-hosted VPN to connect the Local Zone with an on-premises data center, the VPN solution can be deployed in a centralized subnet.

You can continue to use security groups, network access control lists (NACLs), and VPC route tables to segregate the workloads, just as you would in the Region.

If you want to segregate workloads at the VPC level, like many of our customers do, within the Region, inter-VPC routing is generally handled by Transit Gateway. However, in this case, it may be undesirable to send traffic to the Region to reach a subnet in another VPC that is also extended to the Local Zone.

Key considerations for this design are as follows:

Direct Connect is deployed to connect the Local Zone with the corporate data center. Therefore, each VPC will have a dedicated Virtual Private Gateway provisioned to allow association with the Direct Connect Gateway.
To enable inter-VPC traffic within the Local Zone, peer the two VPCs together.
Create a VPC route table in VPC A. Add a route for Subnet Y where the destination is the peering link. Assign this route table to Subnet X.
Create a VPC route table in VPC B. Add a route for Subnet X where the destination is the peering link. Assign this route table to Subnet Y.
If necessary, add routes for on-premises networks and the transit gateway to both route tables.

This design allows traffic between subnets X and Y to stay within the Local Zone, thereby avoiding any latency from the Local Zone to the AWS Region while still permitting full connectivity to all other networks.
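To see why the peering routes keep traffic local, the route tables above can be modeled with Python's standard `ipaddress` module. This is a rough sketch under assumed addressing: the CIDR ranges and target labels are hypothetical, and the lookup only mimics the most-specific-route (longest prefix) selection that VPC route tables apply.

```python
import ipaddress
from typing import Dict


def next_hop(route_table: Dict[str, str], destination_ip: str) -> str:
    """Return the target of the most specific route that matches the destination."""
    ip = ipaddress.ip_address(destination_ip)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in route_table.items()
        if ip in ipaddress.ip_network(cidr)
    ]
    if not matches:
        raise LookupError(f"no route to {destination_ip}")
    # The longest prefix wins, as in a VPC route table.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


# Hypothetical addressing for illustration only.
SUBNET_X = "10.0.1.0/24"  # VPC A (10.0.0.0/16), extended to the Local Zone
SUBNET_Y = "10.1.1.0/24"  # VPC B (10.1.0.0/16), extended to the Local Zone

# Route table assigned to Subnet X in VPC A.
route_table_x = {
    "10.0.0.0/16": "local",
    SUBNET_Y: "vpc-peering",
    "10.0.0.0/8": "transit-gateway",
    "192.168.0.0/16": "virtual-private-gateway",  # on-premises network
}

# Route table assigned to Subnet Y in VPC B.
route_table_y = {
    "10.1.0.0/16": "local",
    SUBNET_X: "vpc-peering",
    "10.0.0.0/8": "transit-gateway",
    "192.168.0.0/16": "virtual-private-gateway",
}

x_to_y = next_hop(route_table_x, "10.1.1.25")
y_to_x = next_hop(route_table_y, "10.0.1.7")
x_to_other_vpc = next_hop(route_table_x, "10.2.0.5")
x_to_on_prem = next_hop(route_table_x, "192.168.10.4")
```

In this model, traffic between Subnet X and Subnet Y resolves to the peering link because the /24 route is more specific than the broader transit gateway route, while other VPCs and the on-premises range still resolve to the transit gateway and the virtual private gateway.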

Conclusion

In this post, we summarized the use cases for enterprise hybrid architecture with Local Zones, and showed you:

Reference architectures to host workloads in Local Zones with low-latency connectivity to corporate data centers and resiliency to enable fail over to the AWS Region automatically.
Different design considerations for public and private facing workloads utilizing this hybrid architecture.
Segregation and connectivity considerations when extending this hybrid architecture to host multiple workloads.

Hopefully you will be able to follow along with these reference architectures to build and run highly resilient applications with local system interdependencies using Local Zones.

A multi-dimensional approach helps you proactively prepare for failures, Part 2: Infrastructure layer

2022-08-26 Piyali Kamra

Post Syndicated from Piyali Kamra original https://aws.amazon.com/blogs/architecture/a-multi-dimensional-approach-helps-you-proactively-prepare-for-failures-part-2-infrastructure-layer/

The resiliency of a distributed application is the cumulative resiliency of its applications, infrastructure, and operational processes. Part 1 of this series explored application layer resiliency. In Part 2, we discuss how using Amazon Web Services (AWS) managed services, redundancy, high availability, and infrastructure failover patterns based on recovery time and point objectives (RTO and RPO, respectively) can help in building more resilient infrastructures.

Pattern 1: Recognize high impact/likelihood infrastructure failures

To ensure cloud infrastructure resilience, we need to understand the likelihood and impact of various infrastructure failures, so we can mitigate them. Figure 1 illustrates that most of the failures with high likelihood happen because of operator error or poor deployments.

Automated testing, automated deployments, and solid design patterns can mitigate these failures. There could be datacenter failures—like whole rack failures—but deploying applications using auto scaling and multi-availability zone (multi-AZ) deployment, plus resilient AWS cloud native services, can mitigate the impact.

Figure 1. Likelihood and impact of failure events

As demonstrated in Figure 1, infrastructure resiliency is a combination of high availability (HA) and disaster recovery (DR). HA involves increasing the availability of the system by implementing redundancy among the application components and removing single points of failure.

Application layer decisions, like creating stateless applications, make it simpler to implement HA at the infrastructure layer by allowing it to scale using Auto Scaling groups and distributing the redundant applications across multiple AZs.

Pattern 2: Understanding and controlling infrastructure failures

Building a resilient infrastructure requires understanding which infrastructure failures are under control and which ones are not, as demonstrated in Figure 2.

These insights allow us to automate the detection of failures, control them, and employ pro-active patterns, such as static stability, to mitigate the need to scale up the infrastructure by over-provisioning it in advance.

Figure 2. Proactively designing systems in the event of failure
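As a worked example of static stability, the sketch below computes how many instances each AZ would need so that the remaining AZs can carry the full load without scaling up during a failure. The function name and the numbers are illustrative assumptions, and the load is assumed to spread evenly across AZs.

```python
import math


def instances_per_az(
    required_instances: int, az_count: int, az_failures_tolerated: int = 1
) -> int:
    """Instances to pre-provision in each AZ so the surviving AZs cover the full load."""
    surviving_azs = az_count - az_failures_tolerated
    if surviving_azs < 1:
        raise ValueError("at least one AZ must survive the tolerated failures")
    return math.ceil(required_instances / surviving_azs)


# Illustrative numbers: 12 instances carry peak load, spread over 3 AZs.
per_az = instances_per_az(required_instances=12, az_count=3)
total = per_az * 3
print(per_az, total)  # prints "6 18"
```

With these numbers, each AZ runs 6 instances for a total of 18, which is 50% more than the 12 needed at peak. That extra capacity is the cost of not depending on a scale-up action at the moment an AZ fails.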

The infrastructure decisions under our control that can increase the resiliency of our system include:

AWS services have control and data planes designed for minimum blast radius. Data planes typically have higher availability design goals than control planes and are usually less complex. When implementing recovery or mitigation responses to events that can affect resiliency, using control plane operations can lower the overall resiliency of your architectures. For example, Amazon Route 53 (Route 53) has a data plane designed for a 100% availability SLA. A good fail-over mechanism should rely on the data plane and not the control plane, as explained in Creating Disaster Recovery Mechanisms Using Amazon Route 53.
Understanding the networking design and routes implemented in a virtual private cloud (VPC) is critical when testing the flow of traffic in our application. Understanding the flow of traffic helps us design better applications and see how one component failure can affect overall ingress/egress traffic. To achieve better network resiliency, it’s important to implement a good subnet strategy and manage our IP addresses to avoid fail-over issues and asymmetric routing in hybrid architectures. Use IP address management tools for established subnet strategies and routing decisions.
When designing VPCs and AZs, understanding the service limits and deploying independent routing tables and components in each zone increase availability. For example, highly available NAT gateways are preferred over NAT instances, as noted in the comparison provided in the Amazon VPC documentation.

Pattern 3: Considering different ways of increasing HA at the infrastructure layer

As already detailed, infrastructure resiliency = HA + DR.

Different ways by which system availability can be increased include:

Building for redundancy: Redundancy is the duplication of application components to increase the overall availability of the distributed system. After following application layer best practices, we can build auto healing mechanisms at the infrastructure layer.

We can take advantage of auto scaling features and use Amazon CloudWatch metrics and alarms to set up auto scaling triggers and deploy redundant copies of our applications across multiple AZs. This protects workloads from AZ failures, as shown in Figure 3.

Figure 3. Redundancy increases availability

Auto scale your infrastructure: When there are AZ failures, infrastructure auto scaling maintains the desired number of redundant components, which helps maintain the base level application throughput. This way, the system stays highly available while costs remain managed. Auto scaling uses metrics to scale in and out, appropriately, as shown in Figure 4.

Figure 4. How auto scaling improves availability

Implement resilient network connectivity patterns: While building highly resilient distributed systems, network access to AWS infrastructure also needs to be highly resilient. While deploying hybrid applications, the capacity needed for hybrid applications to communicate with their cloud native application counterparts is an important consideration in designing the network access using AWS Direct Connect or VPNs.

Testing failover and fallback scenarios helps validate that network paths operate as expected and routes fail over as expected to meet RTO objectives. As the number of connection points between the data center and AWS VPCs increases, a hub and spoke configuration provided by the Direct Connect gateway and transit gateways simplifies network topology, testing, and fail over. For more information, visit the AWS Direct Connect Resiliency Recommendations.

Whenever possible, use the AWS networking backbone to increase security, resiliency, and lower cost. AWS PrivateLink provides secure access to AWS services and exposes the application’s functionalities and APIs to other business units or partner accounts hosted on AWS.
Security appliances need to be set up in HA configuration, so that even if one AZ is unavailable, security inspection can be taken over by the redundant appliances in the other AZs.
Think ahead about DNS resolution: DNS is a critical infrastructure component; hybrid DNS resolution should be designed carefully with Route 53 HA inbound and outbound resolver endpoints instead of using self-managed proxies.

Implement a good strategy to share DNS resolver rules across AWS accounts and VPCs with Resource Access Manager. Network failover tests are an important part of Disaster Recovery and Business Continuity Plans. To learn more, visit Set up integrated DNS resolution for hybrid networks in Amazon Route 53.

Use managed services: The same concept of redundancy in application components affecting availability applies to AWS infrastructure components. AWS services, like AWS Lambda, Amazon Simple Queue Service, Elastic Load Balancing (ELB), and Amazon Simple Storage Service, use multiple AZs under the hood for resiliency.

Additionally, ELB uses health checks to make sure that requests will route to another component if the underlying traffic application component fails. This improves the distributed system’s availability, as it is the cumulative availability of all different layers in our system. Figure 5 details advantages of some AWS managed services.
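The idea of cumulative availability can be made concrete with a short calculation. This is a simplified sketch that assumes component failures are independent, which real systems only approximate. The function names and the availability figures are illustrative and are not published SLAs.

```python
from math import prod
from typing import Iterable


def serial_availability(layer_availabilities: Iterable[float]) -> float:
    """Availability of layers that must all work: the product of each layer's availability."""
    return prod(layer_availabilities)


def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability of redundant copies where any single working copy is enough."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - component_availability) ** copies


# Illustrative figures only.
single_instance = 0.99
across_two_azs = redundant_availability(single_instance, copies=2)

# Load balancer, compute tier, and data tier all have to be available.
system = serial_availability([0.9999, across_two_azs, 0.999])
```

Under these assumptions, duplicating a 99% available component across two AZs raises that tier to roughly 99.99%, while the system as a whole is still bounded by its least available layer.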

Figure 5. AWS managed services help in building resilient infrastructures (click the image to enlarge)

Pattern 4: Use RTO and RPO requirements to determine the correct failover strategy for your application

Capture RTO and RPO requirements early on to determine solid failover strategies (Figure 6). Disaster recovery strategies within AWS range from low cost and complexity (like backup and restore), to more complex strategies when lower values of RTO and RPO are required.

In pilot light and warm standby, only the primary region receives traffic. In pilot light, only critical infrastructure components run in the backup region. Automation is used to check for failures in the primary region using health checks and other metrics.

When health checks fail, use a combination of auto scaling groups, automation, and Infrastructure as Code (IaC) for quick deployment of other infrastructure components.

Note: This strategy depends on control plane availability in the secondary region for deploying the resources; keep this point in mind if you don’t have compute pre-provisioned in the secondary region. Carefully consider the business requirements and a distributed system’s application-level characteristics before deciding on a failover strategy. To understand all the factors and complexities involved in each of these disaster recovery strategies, refer to disaster recovery options in the cloud.

Figure 6. Relationship between RTO, RPO, cost, data loss, and length of service interruption

Conclusion

In Part 2 of this series, we discovered that infrastructure resiliency is a combination of HA and DR. It is important to consider likelihood and impact of different failure events on availability requirements. Building in application layer resiliency patterns (Part 1 of this series), along with early discovery of the RTO/RPO requirements, as well as operational and process resiliency of an organization helps in choosing the right managed services and putting in place the appropriate failover strategies for distributed systems.

It’s important to differentiate between normal and abnormal load threshold for applications in order to put automation, alerts, and alarms in place. This allows us to auto scale our infrastructure for normal expected load, plus implement corrective action and automation to root out issues in case of abnormal load. Use IaC for quick failover and test failover processes.

Stay tuned for Part 3, in which we discuss operational resiliency!

Data warehouse and business intelligence technology consolidation using AWS

2022-07-06 Bappaditya Datta

Post Syndicated from Bappaditya Datta original https://aws.amazon.com/blogs/architecture/data-warehouse-and-business-intelligence-technology-consolidation-using-aws/

Organizations have been using data warehouse and business intelligence (DWBI) workloads to support business decision making for many years. These workloads are brought to the Amazon Web Services (AWS) platform to take advantage of the AWS Cloud. However, these workloads are built using tools and technologies from multiple vendors, and the customer bears the resulting administrative overhead.

This post provides architectural guidance for consolidating multiple DWBI technologies onto AWS managed services to help reduce administrative overhead, simplify operations, and improve business efficiency. Two scenarios are explored:

Upstream transactional databases are already on AWS
Upstream transactional databases are in an on-premises datacenter

Challenges faced by an organization

Organizations are engaged in managing multiple DWBI technologies due to acquisitions, mergers, and the lift-and-shift of workloads. These workloads use extract, transform, and load (ETL) tools to read relational data from upstream transactional databases, process it, and store it in a data warehouse. Thereafter, these workloads use business intelligence tools to generate valuable insight and present it to users in form of reports and dashboards.

These DWBI technologies are generally installed and maintained on their own servers. As Figure 1 demonstrates, this not only increases the administrative overhead for the organization but also creates challenges in maintaining the team’s overall knowledge.

Figure 1. DWBI workload with multiple tools

Therefore, organizations are looking to consolidate technology usage and continue supporting important business functions.

Scenario 1

The three major functions of a DWBI workstream are:

ETL data using a tool
Store/manage the data in a data warehouse
Generate information from the data using business intelligence

Each of these functions can be performed efficiently using an AWS service. For example, AWS Glue can be used for ETL, Amazon Redshift for data warehouse, and Amazon QuickSight for business intelligence.

By using these AWS services, organizations will be able to consolidate their DWBI technology usage. Organizations will also be able to adopt these services quickly, because their engineering teams can apply their existing DWBI knowledge to them. For example, SQL knowledge can be used in AWS Glue jobs with Spark SQL, in Amazon Redshift queries, and in Amazon QuickSight dashboards.

Figure 2 demonstrates the redesigned architecture of Figure 1 using AWS services. In this architecture, ETL functions are consolidated in AWS Glue. An AWS Glue crawler is used to auto-catalogue the source and target table metadata; then, AWS Glue ETL jobs use these catalogues to read data from the source and write to the target (data warehouse). AWS Glue jobs also apply necessary transformations (such as join, filter, and aggregate) to the data before writing. Additionally, an AWS Glue trigger is used to schedule the job executions. Alternatively, Amazon Managed Workflows for Apache Airflow can be used to schedule jobs.

Figure 2. Consolidated workload with source on AWS
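The join, filter, and aggregate transformations mentioned above are ordinary SQL operations. The sketch below runs such a query against an in-memory SQLite database purely so that it is self-contained and runnable; it is not an AWS Glue job, and the table names, columns, and rows are hypothetical. A similar statement could be expressed in Spark SQL inside a Glue job.

```python
import sqlite3

# In-memory stand-in for the catalogued source tables (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL, status TEXT);
    INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC');
    INSERT INTO orders VALUES
        (10, 1, 100.0, 'shipped'),
        (11, 1, 50.0, 'cancelled'),
        (12, 2, 75.0, 'shipped'),
        (13, 2, 25.0, 'shipped');
""")

# Join orders to customers, keep shipped orders only, aggregate per region.
TRANSFORM_SQL = """
    SELECT c.region, SUM(o.amount) AS total_amount
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    WHERE o.status = 'shipped'
    GROUP BY c.region
    ORDER BY c.region
"""

rows = conn.execute(TRANSFORM_SQL).fetchall()
print(rows)
```

The result set is what an ETL job would write to the target data warehouse table.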

Similarly, the data warehousing function is consolidated with Amazon Redshift. Amazon Redshift is used to store and organize enriched data and to enforce appropriate data access control for both workloads and users.

Lastly, business intelligence functions are consolidated using Amazon QuickSight. It is used to create dashboards that source data from Amazon Redshift and apply complex business logic to produce the charts and graphs needed for business insights. It is also used to implement necessary access restrictions to dashboards and data.

Scenario 2

In situations where the source databases are in an on-premises datacenter, the overall solution is similar to Scenario 1, with an additional step to continually move the data from the on-premises database to an Amazon Simple Storage Service (Amazon S3) bucket. The data movement can be efficiently handled by AWS Database Migration Service (AWS DMS).

To make the source database accessible to AWS DMS, a connection needs to be established between the AWS Cloud and the on-premises network. Based on performance and throughput needs, the organization can choose either AWS Direct Connect or AWS Site-to-Site VPN to securely move the data. For the purpose of this discussion, we are considering AWS Direct Connect.

In Figure 3, an AWS DMS task performs a full load followed by change data capture to continuously move the data to an S3 bucket. In this scenario, AWS Glue is used to catalogue and read the data from the S3 bucket. The remaining portion of the dataflow is the same as in Scenario 1.

Figure 3. Consolidated workload with source at datacenter
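A full load followed by change data capture means a downstream reader sees an initial snapshot plus an ordered stream of inserts, updates, and deletes. The sketch below shows how such changes could be replayed on top of the snapshot to reconstruct the current state of a table. The record shape (an `Op` field holding `I`, `U`, or `D`, and a `row` dictionary) is a simplified assumption for illustration, not the exact file layout that AWS DMS writes to Amazon S3.

```python
def apply_changes(snapshot, changes, key="id"):
    """Replay ordered change records on top of a full-load snapshot.

    `snapshot` is a list of row dictionaries from the full load.
    `changes` is an ordered list of {"Op": "I" | "U" | "D", "row": {...}}.
    """
    table = {row[key]: dict(row) for row in snapshot}
    for change in changes:
        op, row = change["Op"], change["row"]
        if op in ("I", "U"):
            # Inserts and updates both leave the latest version of the row.
            table[row[key]] = dict(row)
        elif op == "D":
            table.pop(row[key], None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return sorted(table.values(), key=lambda r: r[key])


full_load = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
cdc = [
    {"Op": "U", "row": {"id": 1, "name": "a2"}},
    {"Op": "D", "row": {"id": 2, "name": "b"}},
    {"Op": "I", "row": {"id": 3, "name": "c"}},
]
current = apply_changes(full_load, cdc)
print(current)
```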

Scaling

Both of the updated architectures provide necessary scaling:

The auto scaling feature can be used to scale AWS Glue ETL job resources up or down
The concurrency scaling feature can be used to support virtually unlimited concurrent users and queries in Amazon Redshift
Amazon QuickSight resources (web server, Amazon QuickSight engine, and SPICE) are auto scaled by design

Security, monitoring, and auditing

Also, the updated architectures provide necessary security by using access control, data encryption at-rest and in transit, monitoring, and auditing.

AWS Key Management Service can be used to generate keys necessary for data encryption at rest.
AWS CloudTrail can be used for tracking user activity and API usage for auditing and troubleshooting.
Amazon CloudWatch can be used to monitor the Amazon Redshift service and the logs generated by AWS Glue jobs.
Amazon Simple Notification Service can be used for sending notifications from the AWS Cloud, such as the execution status of AWS Glue jobs or Amazon QuickSight SPICE data failures.
AWS Identity and Access Management is used for user and group access in an organization’s AWS account.

Additionally, both Amazon Redshift and Amazon QuickSight provide their own authentication and access controls. Therefore, a user can be a local user or a federated one. With the help of these authentication options, an organization will be able to control access to data in Amazon Redshift and access to dashboards in Amazon QuickSight.

Conclusion

In this blog post, we discussed how AWS Glue, Amazon Redshift, and Amazon QuickSight can be used to consolidate DWBI technologies. We also have discussed how an architecture can help an organization build a scalable, secure workload with auto scaling, access control, log monitoring and activity auditing.

Ready to get started?

Learn how to author jobs in AWS Glue
Authorize connections from Amazon QuickSight to Amazon Redshift clusters
Discover a typical Amazon Redshift data processing flow
Get hands-on with the Amazon Redshift Analytics Workshop

Identification of replication bottlenecks when using AWS Application Migration Service

2022-06-10 Tobias Reekers

Post Syndicated from Tobias Reekers original https://aws.amazon.com/blogs/architecture/identification-of-replication-bottlenecks-when-using-aws-application-migration-service/

Enterprises frequently begin their journey by re-hosting (lift-and-shift) their on-premises workloads into AWS and running Amazon Elastic Compute Cloud (Amazon EC2) instances. A simpler way to re-host is by using AWS Application Migration Service (Application Migration Service), a cloud-native migration service.

To streamline and expedite migrations, automate reusable migration patterns that work for a wide range of applications. Application Migration Service is the recommended migration service to lift-and-shift your applications to AWS.

In this blog post, we explore key variables that contribute to server replication speed when using Application Migration Service. We will also look at tests you can run to identify these bottlenecks and, where appropriate, include remediation steps.

Overview of migration using Application Migration Service

Figure 1 depicts the end-to-end data replication flow from source servers to a target machine hosted on AWS. The diagram is designed to help visualize potential bottlenecks within the data flow, which are denoted by a black diamond.

Figure 1. Data flow when using AWS Application Migration Service (black diamonds denote potential points of contention)

Baseline testing

To determine a baseline replication speed, we recommend performing a control test between your target AWS Region and the nearest Region to your source workloads. For example, if your source workloads are in a data center in Rome and your target Region is Paris, run a test between eu-south-1 (Milan) and eu-west-3 (Paris). This will give a theoretical upper bandwidth limit, as replication will occur over the AWS backbone. If the target Region is already the closest Region to your source workloads, run the test from within the same Region.

Network connectivity

There are several ways to establish connectivity between your on-premises location and AWS Region:

1. Public internet
2. VPN
3. AWS Direct Connect

This section pertains to options 1 and 2. If facing replication speed issues, the first place to look is at network bandwidth. From a source machine within your internal network, run a speed test to calculate your bandwidth out to the internet; common test providers include Cloudflare, Ookla, and Google. This is your bandwidth to the internet, not to AWS.

Next, to confirm the data flow from within your data center, run tracert (Windows) or traceroute (Linux). Identify any network hops that are unusual or potentially throttling bandwidth (due to hardware limitations or configuration).

To measure the maximum bandwidth between your data center and the AWS subnet that is being used for data replication, while accounting for Secure Sockets Layer (SSL) encapsulation, use the CloudEndure SSL bandwidth tool (refer to Figure 1).
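Once you have a measured bandwidth figure, a back-of-the-envelope calculation shows whether the network alone can explain a slow initial sync. The following sketch converts a data size and a link speed into an estimated transfer time. The function name and the default `efficiency` factor of 0.8 (an allowance for protocol and encryption overhead) are assumptions for illustration, not values published for Application Migration Service.

```python
def estimated_sync_hours(data_gib, bandwidth_mbps, efficiency=0.8):
    """Estimate hours needed to copy `data_gib` GiB over a `bandwidth_mbps` link.

    `efficiency` is an assumed fraction of the link that carries payload data.
    """
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth_mbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    data_bits = data_gib * 1024**3 * 8
    usable_bits_per_second = bandwidth_mbps * 1_000_000 * efficiency
    return data_bits / usable_bits_per_second / 3600


# Hypothetical example: 1,000 GiB of source disks over a 500 Mbps link.
hours = estimated_sync_hours(1000, 500)
print(f"{hours:.1f} hours")
```

If the observed replication takes far longer than this estimate, the bottleneck is more likely to be one of the other areas covered in this post.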

Source storage I/O

The next area to look for replication bottlenecks is source storage. The underlying storage for servers can be a point of contention. If the storage is maxing-out its read speeds, this will impact the data-replication rate. If your storage I/O is heavily utilized, it can impact block replication by Application Migration Service. In order to measure storage speeds, you can use the following tools:

Windows: WinSat (or other third-party tooling, like AS SSD Benchmark)
Linux: hdparm

We suggest reducing read/write operations on your source storage when starting your migration using Application Migration Service.
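If the tools above are not available, a very rough read-throughput check can be scripted. The sketch below reads a file sequentially and reports MiB/s. It is only an approximation: the function name is hypothetical, the operating system's page cache can inflate the result for recently accessed files, and it does not measure a raw block device the way hdparm does.

```python
import time


def sequential_read_mib_per_s(path, block_size=1024 * 1024):
    """Read `path` from start to finish and return the observed throughput in MiB/s."""
    total_bytes = 0
    start = time.perf_counter()
    # buffering=0 avoids Python-level buffering; OS caching still applies.
    with open(path, "rb", buffering=0) as handle:
        while True:
            chunk = handle.read(block_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    # Guard against a zero interval on very small files.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total_bytes / (1024 * 1024) / elapsed
```

Calling `sequential_read_mib_per_s` on a large file that lives on the source volume, ideally one that has not been read recently, gives a first indication of whether storage reads are likely to limit replication.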

Application Migration Service EC2 replication instance size

The size of the EC2 replication server instance can also have an impact on the replication speed. Although it is recommended to keep the default instance size (t3.small), it can be increased if there are business requirements, like to speed up the initial data sync. Note: using a larger instance can lead to increased compute costs.

Common replication instance changes include:

Servers with <26 disks: change the instance type to m5.large. Increase the instance type to m5.xlarge or higher, as needed.
Servers with >26 disks (or servers in AWS Regions that do not support m5 instance types): change the instance type to m4.large. Increase to m4.xlarge or higher, as needed.

Note: Changing the replication server instance type will not affect data replication. Data replication will automatically pick up where it left off, using the new instance type you selected.

Application Migration Service Elastic Block Store replication volume

You can customize the Amazon Elastic Block Store (Amazon EBS) volume type used by each disk within each source server in that source server’s settings (change staging disk type).

By default, disks <500GiB use Magnetic HDD volumes. AWS best practice suggests not changing the default Amazon EBS volume type unless there is a business need to do so. However, because we aim to speed up replication, we actively change the default EBS volume type.

There are two options to choose from:

The lower cost, Throughput Optimized HDD (st1) option utilizes slower, less expensive disks.

- Consider this option if you:
  - Want to keep costs low
  - Have large disks that do not change frequently
  - Are not concerned with how long the initial sync process will take

The faster, General Purpose SSD (gp2) option utilizes faster, but more expensive disks.

- Consider this option if you:
  - Have a source server whose disks have a high write rate, or need faster performance in general
  - Want to speed up the initial sync process
  - Are willing to pay more for speed

Source server CPU

The Application Migration Service agent that is installed on the source machine for data replication uses a single core in most cases (agent threads can be scheduled to multiple cores). If core utilization reaches a maximum, this can be a limitation for replication speed. In order to check the core utilization:

Windows: Launch the Task Manager application within Windows, and click on the “CPU” tab. Right click on the CPU graph (this is currently showing an average of cores) > select “Change graph to” > “Logical processors”. This will show individual cores and their current utilization (Figure 2).

Figure 2. Logical processor CPU utilization

Linux: Install htop and run from the terminal. The htop command will display the Application Migration Service/CE process and indicate the CPU and memory utilization percentage (this is of the entire machine). You can check the CPU bars to determine if a CPU is being maxed-out (Figure 3).

Figure 3. AWS Application Migration Service/CE process to assess CPU utilization
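On Linux, the same per-core check can be scripted from two samples of `/proc/stat` taken a short interval apart. The sketch below computes the busy percentage of each core between the two samples. It assumes the modern `/proc/stat` layout (user, nice, system, idle, iowait, irq, softirq, steal); the function names are hypothetical, and the sample text in the example is made up.

```python
def parse_proc_stat(text):
    """Map 'cpuN' -> (idle_ticks, total_ticks) from the contents of /proc/stat."""
    cores = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-core lines start with "cpu0", "cpu1", ...; skip the aggregate "cpu" line.
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            ticks = [int(value) for value in fields[1:9]]  # user .. steal
            idle = ticks[3] + ticks[4]  # idle + iowait
            cores[fields[0]] = (idle, sum(ticks))
    return cores


def core_utilization(before, after):
    """Return the busy percentage per core between two /proc/stat samples."""
    first, second = parse_proc_stat(before), parse_proc_stat(after)
    usage = {}
    for core, (idle_after, total_after) in second.items():
        idle_before, total_before = first[core]
        total_delta = total_after - total_before
        idle_delta = idle_after - idle_before
        usage[core] = 0.0 if total_delta == 0 else 100.0 * (1 - idle_delta / total_delta)
    return usage


# Made-up samples: cpu0 was fully busy between the samples, cpu1 fully idle.
sample_before = "cpu 200 0 100 1700 0 0 0 0 0 0\ncpu0 100 0 50 850 0 0 0 0 0 0\ncpu1 100 0 50 850 0 0 0 0 0 0\nintr 12345\n"
sample_after = "cpu 300 0 100 1800 0 0 0 0 0 0\ncpu0 200 0 50 850 0 0 0 0 0 0\ncpu1 100 0 50 950 0 0 0 0 0 0\nintr 12400\n"
usage = core_utilization(sample_before, sample_after)
print(usage)
```

A single core that stays near 100% while the others are idle matches the single-core limitation described above.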

Conclusion

In this post, we explored several key variables that contribute to server replication speed when using Application Migration Service. We encourage you to explore these key areas during your migration to determine if your replication speed can be optimized.

Running hybrid Active Directory service with AWS Managed Microsoft Active Directory

2022-05-11 Lewis Tang

Post Syndicated from Lewis Tang original https://aws.amazon.com/blogs/architecture/running-hybrid-active-directory-service-with-aws-managed-microsoft-active-directory/

Enterprise customers often need to architect a hybrid Active Directory solution to support running applications in both their existing on-premises corporate data centers and the AWS Cloud. There are many reasons for this, such as maintaining integration with on-premises legacy applications, keeping control of infrastructure resources, and meeting specific industry compliance requirements.

To extend on-premises Active Directory environments to AWS, some customers choose to deploy Active Directory service on self-managed Amazon Elastic Compute Cloud (EC2) instances after setting up connectivity for both environments. This setup works fine, but it also presents management and operations challenges when it comes to EC2 instance operation management, Windows operating system, and Active Directory service patching and backup. This is where AWS Directory Service for Microsoft Active Directory (AWS Managed Microsoft AD) helps.

Benefits of using AWS Managed Microsoft AD

With AWS Managed Microsoft AD, you can launch an AWS-managed directory in the cloud, leveraging the scalability and high availability of an enterprise directory service while adding seamless integration into other AWS services.

In addition, you can still access AWS Managed Microsoft AD using existing administrative tools and techniques, such as delegating administrative permissions to select groups in your organization. The full list of permissions that can be delegated is described in the AWS Directory Service Administration Guide.

Active Directory service design consideration with a single AWS account

Single region

A single AWS account is where the journey begins: a simple use case might be when you need to deploy a new solution in the cloud from scratch (Figure 1).

Figure 1. A single AWS account and single-region model

In a single AWS account and single-region model, the on-premises Active Directory has the “company.com” domain configured in the on-premises data center. AWS Managed Microsoft AD is set up across two Availability Zones in the AWS Region for high availability. It has a single domain, “na.company.com”, configured. The on-premises Active Directory is configured to trust the AWS Managed Microsoft AD, with network connectivity via AWS Direct Connect or VPN. Applications that are Active-Directory–aware and run on EC2 instances have joined the na.company.com domain, as have the selected AWS managed services (for example, Amazon Relational Database Service for SQL Server).

Multi-region

As your cloud footprint expands to more AWS regions, you also have two options for expanding AWS Managed Microsoft AD, depending on which edition of AWS Managed Microsoft AD is used (Figure 2):

With AWS Managed Microsoft AD Enterprise Edition, you can turn on the multi-region replication feature to automatically configure inter-regional networking connectivity, deploy domain controllers, and replicate all the Active Directory data across multiple regions. This ensures that Active-Directory–aware workloads residing in those regions can connect to and use AWS Managed Microsoft AD with low latency and high performance.
With AWS Managed Microsoft AD Standard Edition, you will need to add a domain by creating independent AWS Managed Microsoft AD directories per region. In Figure 2, the “eu.company.com” domain is added, and AWS Transit Gateway routes traffic among Active-Directory–aware applications within two AWS regions. The on-premises Active Directory is configured to trust the AWS Managed Microsoft AD, with connectivity over either Direct Connect or VPN.

Figure 2. A single AWS account and multi-region model

Active Directory Service Design consideration with multiple AWS accounts

Large organizations use multiple AWS accounts for administrative delegation and billing purposes. This is commonly implemented through the AWS Control Tower service or an AWS Control Tower landing zone.

Single region

You can share a single AWS Managed Microsoft AD with multiple AWS accounts within one AWS region. This capability makes it simpler and more cost-effective to manage Active-Directory–aware workloads from a single directory across accounts and Amazon Virtual Private Clouds (VPCs). This option also allows you to seamlessly join your EC2 instances for Windows to AWS Managed Microsoft AD.

As a best practice, place AWS Managed Microsoft AD in a separate AWS account with limited administrator access, and share the service with other AWS accounts. After you share the service and configure routing, Active-Directory–aware applications, such as Microsoft SharePoint, can seamlessly join Active Directory Domain Services while you maintain control of all administrative tasks. Find more details on sharing AWS Managed Microsoft AD in the Share your AWS Managed AD directory tutorial.

Multi-region

With the multiple AWS accounts and multi-region model, we recommend using AWS Managed Microsoft AD Enterprise Edition. In Figure 3, AWS Managed Microsoft AD Enterprise Edition supports automating multi-region replication in all AWS regions where AWS Managed Microsoft AD is available. With AWS Managed Microsoft AD multi-region replication, Active-Directory–aware applications use the local directory for high performance while the directory remains multi-region for high resiliency.

Figure 3. Multiple AWS accounts and multi-region model

Domain Name System resolution design

To enable Active-Directory–aware applications to communicate between your on-premises data centers and the AWS Cloud, a reliable solution for Domain Name System (DNS) resolution is needed. You can set the Amazon VPC Dynamic Host Configuration Protocol (DHCP) option set to point to either AWS Managed Microsoft AD or the on-premises Active Directory; then, assign it to each VPC in which the required Active-Directory–aware applications reside. The full list of options for working with DHCP option sets is described in the Amazon Virtual Private Cloud User Guide.

The benefit of configuring DHCP option sets is that any EC2 instance in that VPC can resolve domain names by pointing to the specified domain and DNS servers. This removes the need for manual configuration of DNS on EC2 instances. However, because DHCP option sets cannot be shared across AWS accounts, a DHCP option set must also be created in each additional account.

Figure 4. DHCP option sets

An alternative option is creating an Amazon Route 53 Resolver. This allows customers to leverage Amazon-provided DNS and Route 53 Resolver endpoints to forward a DNS query to the on-premises Active Directory or AWS Managed Microsoft AD. This is ideal for multi-account setups and customers desiring hub/spoke DNS management.

This alternative replaces the need to create and manage EC2 instances running as DNS forwarders with a managed and scalable solution, as Route 53 Resolver forwarding rules can be shared with other AWS accounts. Figure 5 demonstrates a Route 53 Resolver forwarding a DNS query to the on-premises Active Directory.

Figure 5. Route 53 Resolver
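Conceptually, forwarding rules map a domain name to a set of DNS server addresses, and a query is handled by the most specific rule that matches it. The sketch below models that selection logic in plain Python. It is an illustration of the idea rather than the Route 53 Resolver API, and the domain names and IP addresses are hypothetical.

```python
def select_forwarding_targets(query_name, rules):
    """Return the targets of the most specific rule whose domain matches `query_name`.

    `rules` maps a domain name to a list of DNS server IP addresses.
    Returns None when no rule matches (the query is not forwarded).
    """
    name = query_name.rstrip(".").lower()
    best_domain, best_targets = None, None
    for domain, targets in rules.items():
        suffix = domain.rstrip(".").lower()
        if name == suffix or name.endswith("." + suffix):
            # Among matching suffixes, the longest one is the most specific.
            if best_domain is None or len(suffix) > len(best_domain):
                best_domain, best_targets = suffix, targets
    return best_targets


# Hypothetical rules: on-premises domain controllers and AWS Managed Microsoft AD.
rules = {
    "company.com": ["10.0.0.10", "10.0.0.11"],
    "na.company.com": ["10.1.0.10", "10.1.1.10"],
}
print(select_forwarding_targets("db.na.company.com", rules))
print(select_forwarding_targets("mail.company.com", rules))
print(select_forwarding_targets("example.org", rules))
```

Because the rules are plain data, the same rule set can be reused by every account that needs it, which mirrors the benefit of sharing forwarding rules across AWS accounts.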

Conclusion

In this post, we described the benefits of using AWS Managed Microsoft AD to integrate with on-premises Active Directory. We also discussed a range of design considerations to explore when architecting hybrid Active Directory service with AWS Managed Microsoft AD. Different design scenarios were reviewed, from a single AWS account and region, to multiple AWS accounts and multi-regions. We have also discussed choosing between the Amazon VPC DHCP option sets and Route 53 Resolver for DNS resolution.

Seamlessly migrate on-premises legacy workloads using a strangler pattern

2022-04-14 Arnab Ghosh

Post Syndicated from Arnab Ghosh original https://aws.amazon.com/blogs/architecture/seamlessly-migrate-on-premises-legacy-workloads-using-a-strangler-pattern/

Replacing a complex workload can be a huge job. Sometimes you need to gradually migrate complex workloads but still keep parts of the on-premises system to handle features that haven’t been migrated yet. Gradually replacing specific functions with new applications and services is known as a “strangler pattern.”

When you use a strangler pattern, monolithic workloads are broken down and individual services are scheduled for rehosting, replatforming, and even retirement. As you do this, there is value in having a uniform point of access for the various services, as well as a uniform level of security and a way to manage workloads in the cloud and on-premises.

This blog post covers how to implement a strangler architecture pattern for on-premises legacy workloads to create uniform access and security across your workloads. We walk you through how to implement this pattern, which uses an API facade to ensure your customers continue to see and use the same interface while you “strangle” the monolith by incrementally creating and deploying new microservices in the cloud.

Solution overview

Figure 1. API facade with connectivity to an on-premises monolith

This solution uses Amazon API Gateway to create an API facade for your on-premises monolith application. As you deploy new microservices on AWS, you can create new API resources/methods under the same API Gateway endpoint (to learn more about creating REST APIs, see Creating a REST API in Amazon API Gateway).

AWS Direct Connect, along with API Gateway private integrations that use virtual private cloud (VPC) links, provides secure network connectivity to your on-premises services.

The following sections provide more detail on these services and their functions.

On-premises Connectivity

Direct Connect provides a dedicated connection between the on-premises services and AWS. This allows you to implement a hybrid workload by securely connecting the API Gateway and the application deployed on your on-premises environment.

You can use an AWS Site-to-Site VPN to connect to on-premises environments, but Direct Connect is preferred for its reduced latency and dedicated bandwidth.

API facade

API Gateway creates the facade for customer APIs/services (the monolith and the new microservices) deployed in the on-premises environment as well as the ones migrated to AWS.

API Gateway uses private integrations to securely connect to on-premises services and resources launched into Amazon Virtual Private Cloud (Amazon VPC) like re-hosted microservices running on Amazon Elastic Compute Cloud (Amazon EC2) or modernized applications running on container services like Amazon Elastic Container Service (Amazon ECS).

The Network Load Balancer is part of the private integration for API Gateway. It acts as a high throughput, high availability resource that fronts the API backends deployed either in the on-premises environment or Amazon VPC. Network Load Balancers support different target types. Use the IP target type to target on-premises servers hosting legacy workloads and use the instance and Application Load Balancer target types for applications hosted within AWS environments.

Security

Use AWS WAF (a web application firewall) for API Gateway REST endpoints. It provides the ability to monitor and block HTTP and HTTPS traffic according to stateless and stateful rule groups.

Amazon GuardDuty provides threat detection across your microservices.

(Optional) Enable AWS Shield Advanced for Amazon CloudFront distributions that are configured for regional API Gateway endpoints. This provides added distributed denial of service (DDoS) protection beyond AWS Shield Standard, which is automatically included.

Logging and monitoring

AWS X-Ray and Amazon CloudWatch give you visibility into your requests and assorted service metrics.

AWS CloudTrail allows you to track interactions with your infrastructure through the AWS control plane APIs.

Strangler process

The strangler pattern allows you to smoothly migrate resources from on-premises environments by placing a cloud-based API facade in front of them. The next sections show an example scenario of what a strangler pattern-based migration process could look like for a given workload.

Putting a facade in front of the monolith

First, we add our API Gateway facade in front of our on-premises monolith. The API Gateway acts as a facade to the customer APIs/services (the monolith and the new microservices) deployed in the on-premises environment as well as the ones migrated to AWS. This means that as the on-premises monolith application is strangled and new microservices are created, the new services are added to the API Gateway so that they can be consumed along with the monolith services, as shown in Figure 2.

Figure 2. API facade with connectivity to an on-premises monolith
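The routing behavior of the facade can be sketched in a few lines: every request goes to the monolith by default, and each migrated path prefix is redirected to its new backend without the caller noticing. This is a conceptual model, not API Gateway configuration; the class name, paths, and backend URLs are hypothetical.

```python
MONOLITH = "https://monolith.onprem.example.internal"  # hypothetical backend URL


class ApiFacade:
    """Routes each request path to the monolith unless its prefix has been migrated."""

    def __init__(self, default_backend):
        self.default_backend = default_backend
        self.routes = {}

    def migrate(self, path_prefix, backend):
        """Register a new backend for everything under `path_prefix`."""
        self.routes[path_prefix.rstrip("/")] = backend

    def resolve(self, path):
        """Return the backend for `path`, preferring the longest matching prefix."""
        matches = [p for p in self.routes if path == p or path.startswith(p + "/")]
        if not matches:
            return self.default_backend
        return self.routes[max(matches, key=len)]


facade = ApiFacade(MONOLITH)
before = facade.resolve("/orders/42")  # still served by the monolith

# Microservice A takes over the /orders resource; callers keep the same path.
facade.migrate("/orders", "https://orders.aws.example.internal")
after = facade.resolve("/orders/42")
untouched = facade.resolve("/billing/7")  # not migrated yet
print(before, after, untouched)
```

When the last route has been migrated, the default backend is no longer reached and the monolith can be retired.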

Breaking up the monolith behind the facade

Next, let’s break up our monolith into component microservices, as shown in Figure 3. This allows us more flexibility in deciding how best to migrate individual services. With the strangler pattern, we can incrementally update sections of code and functionality of the monolith (extract as a microservice with minimum dependency to the monolith application) without needing to completely refactor the entire application. Eventually, all the monolith’s services and components will be migrated, and the legacy system can be retired. Monoliths can be decomposed by business capability, subdomain, transactions, or based on the teams that maintain them.

Figure 3. Microservices A and B being decomposed from a legacy monolith; component C, scheduled for retirement, is not broken out into a microservice

Migrating microservices into the cloud

With our monolith broken up into its component microservices, we can begin moving the microservices into the cloud.

In our example, we rehost microservice A and refactor microservice B.

Rehosting a microservice: Here, we take microservice A and rehost it from on-premises virtual machines onto EC2 instances in AWS. We have deployed the microservice across multiple Availability Zones with an Amazon EC2 Auto Scaling group. As you can see in Figure 4, even after deployment to AWS, microservice A continues to have a limited dependency on the monolith application. This dependency will eventually be removed as the strangling process is completed and the monolith is completely decomposed.

Figure 4. Microservice A being rehosted onto EC2 instances within an Amazon EC2 Auto Scaling Group

Refactoring a microservice: With functionality broken out across microservices, we can opt to refactor certain services using containerization and orchestration platforms like Amazon ECS. Here, we take microservice B and containerize it using Docker and then use Amazon ECS to deploy it.

Figure 5. Microservice B after being refactored, containerized, and moved onto Amazon ECS

Retire the monolith

Finally, when ready (application users have all been migrated to the new microservice endpoints), you can retire the legacy monolith application. Figure 6 shows the end state where the monolith application is retired along with hybrid connectivity. The API facade now serves the new migrated microservices. At this point, you can decide to retire application components.

Figure 6. Microservices A and B after the legacy monolith is retired and on-premises connectivity has ceased

Conclusion

In this blog post, we showed you how to use a strangler pattern to smoothly transition on-premises workloads through a hybrid migration process with a uniform entry point in AWS. We walked you through the process of strangling a legacy monolith by decomposing it into microservices and bringing microservices into the cloud one by one with migration approaches that best fit each service.

Ready to get started? Learn how to implement private integration for API Gateway. See how to further integrate mediation layers to support legacy XML and other non-JSON-based API responses. Get hands-on with the Break a Monolith Application into Microservices project.

Building Resilient and High Performing Cloud-based Applications in Hawaii

2022-01-21 Marie Yap

Post Syndicated from Marie Yap original https://aws.amazon.com/blogs/architecture/building-resilient-and-high-performing-cloud-based-applications-in-hawaii/

Hawaii is building a digital economy for a sustainable future. Many local businesses are already embarking on their journey to the cloud to meet their customers’ growing demand for digital services. To access Amazon Web Services (AWS) on the US mainland, customers’ data must traverse through submarine fiber-optic cable networks approximately 2,800 miles across the Pacific Ocean. As a result, organizations have two primary concerns:

Resiliency concerns about multiple outage events that could arise from breaks in the submarine cables.
Latency concerns for mission-critical applications driven by physical distance.

These problems can be solved by architecting the workloads for reliability, secure connectivity, and high performance.

Designing network connectivity that is reliable, secure, and highly performant

A typical workload in AWS can be broken down into three layers – Network, Infrastructure, and Application. For each layer, we can design for resiliency and latency concerns. Starting at the network layer, there are two recommended options for connecting the on-premises network within the island to AWS.

Use of AWS Direct Connect over a physical connection. AWS Direct Connect is a dedicated network connection that connects your on-premises environment to AWS Regions. In this case, the connection is traversing the fiber-optic cable across the Pacific Ocean to the mainland’s meet-me-point facilities. It can be provisioned from 50 Mbps up to 100 Gbps. This provides you with a presence in an AWS Direct Connect location, a third-party colocation facility, or an Internet Service Provider (ISP) that provides last-mile connectivity to AWS. In addition, the Direct Connect location establishes dedicated connectivity to Amazon Virtual Private Clouds (VPC). This improves application performance and addresses latency concerns by connecting directly to AWS and bypassing the public internet.
Use of AWS VPN over an internet connection. As a secondary option to Direct Connect, AWS Site-to-Site VPN provides connectivity into AWS over the public internet using VPN encryption technologies. The Site-to-Site VPN connects on-premises sites to AWS resources in an Amazon VPC. As a result, you can securely connect your on-premises network to AWS using an internet connection.

We recommend choosing the us-west-2 AWS Region in Oregon to build highly performant connectivity closest to Hawaii. The us-west-2 Region generally provides more AWS services at a lower cost versus us-west-1. In addition, there are various options for AWS Direct Connect locations in the US West Region. Many of these locations support up to 100 Gbps and support MACsec, which is an IEEE standard for security encryption in wired Ethernet LANs. Typically, customers will use multiple 10-Gbps connections for higher throughput and redundancy.

Subsea Cable	Hawaii Cable Landing Station	Mainland Cable Landing Station	Nearest Direct Connect Location
Southern Cross Cable Network (SCCN)	Kahe Point (Oahu)	Morro Bay, CA	CoreSite, Equinix
Southern Cross Cable Network (SCCN)	Kahe Point (Oahu)	Hillsboro, OR	Equinix, EdgeConnex, Pittock Block, CoreSite, T5, TierPoint
Hawaiki	Kapolei (Oahu)	Hillsboro, OR	Equinix, EdgeConnex, Pittock Block, CoreSite, T5, TierPoint
Asia-America Gateway (AAG)	Keawaula (Oahu)	San Luis Obispo, CA	CoreSite, Equinix
Japan-US Cable Network (JUS)	Makaha (Oahu)	Morro Bay, CA	CoreSite, Equinix
SEA-US	Makaha (Oahu)	Hermosa Beach, CA	CoreSite, Equinix, T5

Table 1. Subsea fiber-optic cables connecting Hawaii to the US mainland

(Source: Submarine Cable Map from TeleGeography)

Six cables connect Hawaii to the mainland US, which gives you several options for building resilient connectivity: Hawaiki, SEA-US, Asia-America Gateway (AAG), Japan-US (JUS), and two Southern Cross (SCCN) cables. These cables land at various locations on the US West Coast. If you require high resiliency, we recommend a minimum of two physically redundant Direct Connect connections into AWS. For maximum resiliency, we recommend four Direct Connect connections that span two Direct Connect locations. If you build your architecture following these recommendations, AWS offers a published service level agreement (SLA).

Figure 1. Redundant direct connection from Hawaii to the US mainland

Most customers select an ISP to provide connectivity across the Pacific Ocean to an AWS Direct Connect location. The Direct Connect locations are third-party colocation providers who act as meet-me points for AWS customers and the AWS Regions. For example, our local AWS Partner DRFortress connects multiple ISPs in a data center in Hawaii to the AWS US West Region. We recommend having at least two ISPs for resilient applications, each providing connectivity across a separate subsea cable from Hawaii to the mainland. If one cable should fail for any reason, connectivity to AWS would still be available. The red links in Figure 2 are the ISP-provided connectivity that spans the Pacific Ocean. This is a minimum starting point for business-critical applications and should be designed with additional Direct Connect links for greater resiliency.
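To make the "separate subsea cable" recommendation concrete, the following minimal Python sketch enumerates pairs of paths from Table 1 that share neither a cable system nor a landing station on either end. The diversity rule, the `CablePath` type, and the `diverse_pairs` helper are assumptions made for illustration; they are not an AWS or ISP tool, and real path diversity should be confirmed with your providers.

```python
from itertools import combinations
from typing import NamedTuple


class CablePath(NamedTuple):
    cable: str
    hawaii_landing: str
    mainland_landing: str


# Rows copied from Table 1 (cable, Hawaii landing station, mainland landing station).
TABLE_1 = [
    CablePath("SCCN", "Kahe Point", "Morro Bay, CA"),
    CablePath("SCCN", "Kahe Point", "Hillsboro, OR"),
    CablePath("Hawaiki", "Kapolei", "Hillsboro, OR"),
    CablePath("AAG", "Keawaula", "San Luis Obispo, CA"),
    CablePath("JUS", "Makaha", "Morro Bay, CA"),
    CablePath("SEA-US", "Makaha", "Hermosa Beach, CA"),
]


def diverse_pairs(paths):
    """Return pairs of paths that share no cable system and no landing station.

    This rule is an illustrative assumption, not a published AWS criterion.
    """
    return [
        (a, b)
        for a, b in combinations(paths, 2)
        if a.cable != b.cable
        and a.hawaii_landing != b.hawaii_landing
        and a.mainland_landing != b.mainland_landing
    ]


for a, b in diverse_pairs(TABLE_1):
    print(f"{a.cable} via {a.mainland_landing}  +  {b.cable} via {b.mainland_landing}")
```

A pair such as JUS and SEA-US is excluded because both cables land at Makaha, so a single event at that landing station could affect both paths.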

Architecting for high performance and resiliency

Moving from the network to the infrastructure and application layers, organizations have the option of building their application entirely in the cloud or in combination with an on-premises environment. An example of an application built entirely in the cloud is the LumiSight platform, built on AWS by local AWS Partner DataHouse. LumiSight has helped dozens of organizations reopen quickly, securely, and safely during the pandemic.

Other customers need a hybrid cloud architecture. These organizations require that their data processing and analysis stay close to locally hosted applications and other components within the island’s data center. With this proximity, they can deliver near real-time responses to their end users. AWS Outposts Family extends the capabilities of an AWS Region to the island. This enables local businesses to build and run low-latency applications on-premises on fully managed AWS infrastructure. You can now deploy compute, storage, containers, data analytics clusters, and relational and cache databases on high-performance, redundant, and secure infrastructure maintained by AWS. Outposts can be shipped to Hawaii, connecting to the us-west-1 or us-west-2 Regions.

Another option for improving application performance is to give users an efficient virtual desktop for accessing their applications from anywhere. Amazon WorkSpaces provides a secure, managed cloud-based virtual desktop experience. Many workers who bring their own device (BYOD) or work remotely use WorkSpaces to access their corporate applications securely. WorkSpaces uses streaming protocols that provide a secure and responsive desktop experience to end users located in remote regions, like Hawaii. WorkSpaces can quickly provide a virtual desktop without requiring you to manage the infrastructure, OS versions, and patches. You can test your connection to WorkSpaces from Hawaii, or anywhere else in the world, at the Connection Health Check page.

Architecting for resiliency in the infrastructure and application stack is vital for Business Continuity and Disaster Recovery (BCDR) plans. Organizations in Hawaii that already use VMware can create a recovery site using VMware Cloud on AWS as their solution for disaster recovery. VMware Cloud on AWS is a fully managed VMware software-defined data center (SDDC) running on AWS, which provides access to native AWS services. Organizations can pair their on-premises vCenter and virtual machines with the fully managed vCenter and virtual machines residing in the cloud. The active Site Recovery Manager automates failover and failback of applications between the on-premises environment and the cloud DR site. Additionally, organizations can define their SDDC in the us-west-2 Region using AWS Direct Connect to minimize the latency of replicating data to and from the data centers in the islands.

Conclusion

Organizations in Hawaii can build resilient and highly performant cloud-based workloads with the help of AWS services in each layer of their workloads. Starting with the network layer, you can establish reliable and lower latency connectivity through redundant AWS Direct Connect connections. Next, for low-latency hybrid applications, we extend infrastructure capabilities locally through AWS Outposts. We also improve the user experience in accessing cloud-based applications by providing Amazon WorkSpaces as the virtual desktop. Finally, we build resilient infrastructure and applications using a familiar solution called VMware Cloud on AWS.

To start learning the fundamentals and building on AWS, visit the Getting Started Resource Center.

Integrate Okta to Extend Active Directory Infrastructure into AWS

2021-12-03 Pavankumar Kasani

Post Syndicated from Pavankumar Kasani original https://aws.amazon.com/blogs/architecture/integrate-okta-to-extend-active-directory-infrastructure-into-aws/

Are you ready to extend your on-premises Active Directory to Amazon Web Services (AWS) to remove undifferentiated heavy lifting? Would you like to maintain a highly available Directory Service for your applications? Companies who have already set up integration with Okta Identity Cloud for external or internal applications require Active Directory objects to be synced to Okta for authentication. To achieve centralized access for on-premises and cloud applications, you can extend your on-premises Active Directory to AWS Managed Microsoft Active Directory (AD) using a trust relationship.

This blog shows an architecture pattern that you can use to synchronize your on-premises AD and AWS Managed AD objects. You can use Okta Identity Cloud using an Okta AD agent for syncing users and groups. The Okta AD agent can be installed and configured on a domain-joined on-premises server or an Amazon EC2 instance on AWS (see Figure 1).

AWS Directory Service lets you run Microsoft Active Directory (AD) as a managed service, and is powered by Windows Server 2012 R2. When you select and launch this directory type, it is created as a highly available pair of domain controllers connected to your Amazon Virtual Private Cloud (VPC). The domain controllers run in different Availability Zones in an AWS Region of your choice.

Okta is an enterprise-grade identity management service, which is compatible with many on-premises and cloud applications. The Okta AD agent enables you to integrate Okta with your on-premises AD. This way you can integrate your SaaS applications and your AD instances with Okta. You can simplify and centralize user management and share user credentials with other integrated cloud and on-premises applications.

Figure 1. Active Directory objects synchronization to Okta identity cloud

Network connectivity between corporate data center and AWS Regions

Before getting started with configuring a trust relationship with on-premises AD and AWS managed AD, be sure you’ve read and understand the prerequisites for setting up trust. For example, it is highly recommended to have a VPN or AWS Direct Connect circuit in place between your VPC and your on-premises environment. To create a resilient VPN connection, refer to the Site-to-Site VPN User Guide.

AWS Site-to-Site VPN is a fully managed service that uses IP security (IPsec) tunnels to create a secure connection between your data center or branch office, and your AWS resources. When using Site-to-Site VPN, you can connect to Amazon VPC and also AWS Transit Gateway. Two tunnels per connection are used for increased redundancy. You can also create a dedicated or a hosted connection using AWS Direct Connect.

Trust relationship between on-premises AD and AWS Managed AD

A trust relationship is a link between two different domains. For example, a one-way trust scenario allows the user accounts from the trusted domain to access resources in the trusting domain. In a two-way trust scenario, user accounts and resources can be passed between the two domains bidirectionally. A two-way trust relationship is commonly set up between on-premises AD and AWS Managed AD to extend authentication. This is used for any directory-aware workloads in the AWS Cloud, providing users and groups access to resources in either domain using single sign-on (SSO).
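The direction of a trust is easy to mix up, so here is a minimal Python sketch of the rule described above: accounts in the trusted domain can access resources in the trusting domain, and a two-way trust behaves like two one-way trusts. The `Trust` type, the helper functions, and the domain names are hypothetical and are not part of any Active Directory or AWS API.

```python
from typing import NamedTuple


class Trust(NamedTuple):
    trusting: str  # domain that hosts the resources
    trusted: str   # domain whose user accounts are accepted


def can_access(user_domain, resource_domain, trusts):
    """True if accounts in user_domain may reach resources in resource_domain."""
    if user_domain == resource_domain:
        return True
    return Trust(trusting=resource_domain, trusted=user_domain) in trusts


def two_way(domain_a, domain_b):
    """Model a two-way trust as a pair of one-way trusts."""
    return {Trust(domain_a, domain_b), Trust(domain_b, domain_a)}


# Hypothetical domain names, for illustration only.
ON_PREM = "corp.example.com"
AWS_AD = "aws.example.com"

one_way = {Trust(trusting=AWS_AD, trusted=ON_PREM)}
print(can_access(ON_PREM, AWS_AD, one_way))  # on-premises users reach AWS-side resources
print(can_access(AWS_AD, ON_PREM, one_way))  # the reverse direction is not covered
print(can_access(AWS_AD, ON_PREM, two_way(ON_PREM, AWS_AD)))
```

This models only the direction of access. It says nothing about the permissions that still have to be granted on individual resources.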

AWS Managed Microsoft Active Directory (AD) supports external and forest trust relationships with your existing on-premises domain in all three trust relationship directions:

One-way incoming
One-way outgoing
Two-way bidirectional

To create a trust relationship, follow these steps:

Prepare your on-premises domain for the trust relationship. This includes preparing your firewall configuration, enabling Kerberos pre-authentication, and configuring conditional forwarders.
Prepare your AWS Managed Microsoft AD for the trust relationship. This includes configuring your VPC subnets and security groups, and enabling Kerberos pre-authentication.
Create the trust relationship between your on-premises AD and your AWS Managed Microsoft Active Directory (AD).

Install and configure Okta agent

Download and install Okta AD agent on your Amazon EC2 instance, which should be domain-joined with AWS Managed AD. One Okta AD agent can associate with multiple domains. Once the trust has been set up between on-premises AD and AWS Managed AD, you can associate multiple domains under the same Okta AD agent on Amazon EC2, instead of hosting and managing separate Okta AD agent servers in your own data center and AWS.

For a highly available architecture, a redundant Okta AD agent running in your corporate data center is recommended. This will help you avoid the impact of network connectivity failure between data centers and AWS Regions. Okta recommends installing multiple Okta AD agents on each domain server to achieve high availability and failover protection.

Read Okta AD integration step-by-step setup for installing and configuring Okta agent.

Validate AD objects

Once the Okta agent is installed and configured on the Amazon EC2 instance, log in to the Okta admin console. Under the provisioning to Okta tab, do a full import of users from AWS Managed AD (see Figure 2 and Figure 3). Subsequent object synchronization is done through scheduled imports, with a minimum interval of one hour. After the import is done, if there are any user account overlaps between AWS Managed AD and Okta, manually assign the AD users to Okta users. You can also create matching rules to automatically map the users from AD to Okta. Read Import AD users to Okta.

Figure 2. Import users under Okta admin console

Figure 3. Import users results under Okta admin console

Matching rules are used in the import of users from all apps and directories that provide importing. If there is an existing Okta account, AD allows you to import and confirm users automatically (see Figure 4).

Figure 4. User creation and matching under Okta admin console

You can import groups from any forest or domain connected to Okta. The Okta AD Agent detects all groups in the domain or the organizational units (OUs) that you select. If you register an Okta AD Agent for more than one domain and you have the root OU selected for all domains, all groups will be imported. Read Import AD Groups to Okta to synchronize groups from AD to Okta.

Synchronize passwords to Okta

When you sign in to Okta using your organization’s AD credentials, the authentication process is delegated to the connected on-premises AD. Okta does not see or store the credentials.

In some cases, the credentials must be synchronized from a directory across Okta to an application. If a user changes the password stored in Active Directory and then tries to access applications using the same single sign-on session, they will receive a password error message. This is because the new password has not been synchronized to the application, so a new sign-in process is required for password validation.

To avoid a disruptive user experience, use the Okta AD Password Sync Agent to synchronize passwords from AD to Okta and to integrated apps. The Okta AD Password Sync Agent will track password changes in AD and then synchronize to Okta.

For more details on the password synchronization and password reset workflow, you can read step-by-step instructions on Synchronize passwords from Active Directory to Okta.

Summary

In this blog post, we discussed a way to synchronize users and credentials from on-premises Active Directory and AWS Managed AD to Okta Identity Cloud. With synchronization, you can centralize access to cloud and on-premises applications and provide seamless access to AD-aware services within AWS.

Customers can also migrate on-premises AD to AWS using Active Directory Migration Tool (ADMT) along with the Password Export Server (PES) service.

New – Site-to-Site Connectivity with AWS Direct Connect SiteLink

2021-12-02 Sébastien Stormacq

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/new-site-to-site-connectivity-with-aws-direct-connect-sitelink/

We are launching AWS Direct Connect SiteLink, a new capability of AWS Direct Connect that lets you create connections between your on-premises networks through the AWS global network backbone.

Until today, when you needed direct connectivity between your data centers or branch offices, you had to rely on public internet or expensive and hard-to-deploy fixed networks. These are geographically constrained and can be tied to long-term contracts. This rigidity becomes a pain point as you expand your businesses globally. In turn, you’re required to create custom workarounds to interconnect networks from different providers, which increases your operating costs.

Starting today, you may connect your sites through Direct Connect locations, without sending your traffic through an AWS Region. We have 108 Direct Connect locations available in 32 countries as I am writing this post, located across Africa, Americas, Asia-Pacific, Europe, and the Middle East. Traffic flows from one Direct Connect location to another following the shortest possible path. You no longer need to connect through the closest AWS Region and manage and configure an AWS Transit Gateway for site-to-site network connectivity.

You can take advantage of Direct Connect’s reliability and global footprint to build a network that grows with your business, with no long-term contracts, flexible pay-as-you-go pricing, and a wide range of port-speeds, from 50 Mbps to 100 Gbps. SiteLink also integrates with other AWS services, letting you reach your VPCs, other AWS services, and your on-premises networks from your Direct Connect connections.

When talking about network topology, a small diagram is always more descriptive than long phrases.

The following diagram shows the way that you use Direct Connect today. Direct Connect is currently optimized to let you reach your AWS Resources running in any Region as quickly as possible. Sending data from one Direct Connect location to another is not possible.

Once you connect your locations (NY1, AM3, Paris, and TY2 in the diagram) to a Direct Connect gateway, those connections can reach any AWS Region (except the two AWS China Regions). No peering between Regions is necessary, because Direct Connect gateways are global resources.

The following diagram shows how you connect multiple sites using SiteLink. The data flows between Direct Connect locations without going through an AWS Region.

How to Get Started?
Configuring these connections is very similar to what you do today. The first step is to connect my network to Direct Connect locations. After that, SiteLink can be enabled or disabled in minutes.

Using the AWS Management Console, I navigate to the Direct Connect section, and I select Create virtual interface to create a virtual interface. Under the Additional Settings section, I make sure the SiteLink switch is turned on. I repeat this for the virtual interface at each additional site that I want to connect.

I have access to similar monitoring dashboards and metrics published to CloudWatch. I select my virtual interface, and then navigate to the Monitoring tab (hopefully your ViF will have more data available than mine that was created just for this post).

Availability and Pricing
You can connect your on-premises networks or branch offices to any of our Direct Connect locations available today, except in China.

Pricing is pay-as-you-go, with no commitment or recurring fees. In addition to existing Direct Connect charges, your monthly bill will include a price-per-hour for SiteLink virtual interfaces, as well as the cost of SiteLink data transfer. Check the pricing page to get the details.

Go ahead and start connecting your on-premises locations together with Direct Connect SiteLink!

— seb

Disaster Recovery (DR) for a Third-party Interactive Voice Response on AWS

2021-09-16 Priyanka Kulkarni

Post Syndicated from Priyanka Kulkarni original https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-for-a-third-party-interactive-voice-response-on-aws/

Voice calling systems are prevalent and necessary to many businesses today. They are usually designed to provide a 24×7 helpline support across multiple domains and use cases. Reliability and availability of such systems are important for a good customer experience. The thoughtful design of a cost-optimized solution will allow your business to sustain the system into the future.

We address a scenario in which you are mandated to host the workload on a corporate data center (DC), and configure the backup site on Amazon Web Services (AWS). Since the primary objective of a backup site is disaster recovery (DR) management, this site is often referred to as a DR site.

Disaster Recovery on AWS

A DR strategy defines the recovery objectives for downtime and data loss. The workload has a recovery time objective (RTO) and a recovery point objective (RPO). RTO is the maximum acceptable delay between the interruption of service and the restoration of service. RPO is the maximum acceptable amount of time since the last data recovery point. AWS defines four DR strategies in increasing order of complexity, and decreasing order of RTO and RPO. These are backup and restore, two active/passive strategies (pilot light and warm standby), and active/active.

Figure 1. Disaster recovery (DR) options

In our use case, the DR site on AWS must serve the user traffic with RPO and RTO in seconds. Warm standby is the optimal choice in this case. It is a scaled-down version of a fully functional environment, and is always running in the cloud.
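As a minimal sketch of how the RTO and RPO definitions above translate into a check, the following Python computes the recovery point and recovery time actually achieved in an incident and compares them with objectives expressed in seconds. The function names, the timeline, and the objective values are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recovery_point: datetime, outage_start: datetime) -> timedelta:
    """Data-loss window: time between the last recovery point and the interruption."""
    return outage_start - last_recovery_point


def achieved_rto(outage_start: datetime, service_restored: datetime) -> timedelta:
    """Downtime: delay between the interruption and the restoration of service."""
    return service_restored - outage_start


def meets_objectives(rpo, rto, rpo_objective, rto_objective) -> bool:
    """True when both measured values are within their objectives."""
    return rpo <= rpo_objective and rto <= rto_objective


# Hypothetical incident timeline and objectives, for illustration only.
last_sync = datetime(2021, 9, 16, 10, 0, 0)
outage = datetime(2021, 9, 16, 10, 0, 5)
restored = datetime(2021, 9, 16, 10, 0, 35)

rpo = achieved_rpo(last_sync, outage)
rto = achieved_rto(outage, restored)
within = meets_objectives(rpo, rto, timedelta(seconds=10), timedelta(seconds=60))
print(f"RPO achieved: {rpo}, RTO achieved: {rto}, within objectives: {within}")
```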

Amazon Connect is an omnichannel cloud contact center that helps you provide great customer service at a lower cost. But in some situations, Amazon Connect may not be available. In other cases, the customer may want to use their home developed or third-party contact center application. Our solution is designed to help in both these scenarios.

This architecture helps customers who face the cost overhead of maintaining redundant Session Initiation Protocol (SIP) trunks for the DC and DR sites. It allows you to optimize your spend and still retain a reliable workflow.

SIP trunk communication on AWS

Let’s see how the SIP trunk termination on the AWS network handles the failover scenario of a third-party IVR application installed on Amazon EC2 at the DR site.

There will be two connections made from the AWS Direct Connect location (DX). The first will be for a point-to-point connectivity between the corporate DC and the AWS DR site. The second connection will be originating from the multiplexer (MUX) of the telecom provider who is providing you the SIP trunk.

The telecom provider will lay the SIP trunk from its MUX to the customer router at the DX location. At this point, the mode of communication becomes IP-based. The telecom provider will send the call to the IP address attached to the Network Load Balancer (NLB) in Amazon Virtual Private Cloud (VPC).

Figure 2. Communication circuitry at telecom side

AWS Network Load Balancers can now distribute traffic to AWS resources using their IP addresses and instance IDs as targets. You can also distribute the traffic with on-premises resources over AWS Direct Connect. Load balancing across AWS and on-premises resources using the same load balancer streamlines migrate-to-cloud, burst-to-cloud, or failover-to-cloud.

In the backup site, the NLB will point to the Session Border Controller (SBC). This is a special-purpose device that protects and regulates IP communications flows. You can bring your own SBC, or you can use an SBC offered in the AWS Marketplace.

Best practices for high availability of IVR solution on AWS

Configure the multiple Availability Zone (Multi-AZ) SBC setup
Make sure that the telecom provider for the SIP trunk is different from the internet service provider (ISP) that provides last-mile connectivity for the DC from Direct Connect
Consider redundancy for Direct Connect by using a Site-to-Site VPN tunnel

Figure 3. Solution architecture of DR on AWS for a third-party IVR solution

Communication flow for an IVR solution deployed on a corporate DC and its DR on AWS

Calls are received on the telecom provider’s SIP line, which terminates at the AWS Direct Connect location.
At the DX location, you will configure a route in the AWS router to send the traffic to the IP address of the NLB. The NLB should be configured to perform health checks on the virtual machine in your on-premises DC. Based on these health checks, the NLB will do the routing and the failover.
In a live scenario with successful health checks at the DC, the NLB will forward the call to the IP of the on-premises virtual machine. This is where the IVR application will be installed.
The communication between the NLB in Amazon VPC and the virtual machine in the DC will happen over Direct Connect.
In a DR scenario, the NLB will fail over the communication to the SBCs in Amazon VPC.
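The routing decision in the steps above can be modeled in a few lines of Python. This is a simplified sketch of the intended behavior (prefer the on-premises IVR while its health checks pass, otherwise send the call to an SBC at the DR site), not the Network Load Balancer's internal algorithm or its API. The `Target` type, the site labels, and the target names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Target:
    name: str
    site: str      # "DC" for the corporate data center, "DR" for the AWS site
    healthy: bool  # result of the most recent health check


def route_call(targets):
    """Prefer a healthy DC target; otherwise fail over to a healthy DR target."""
    for site in ("DC", "DR"):
        for target in targets:
            if target.site == site and target.healthy:
                return target
    raise RuntimeError("no healthy target available for the call")


# Hypothetical targets, for illustration only.
targets = [
    Target("ivr-vm", "DC", healthy=True),
    Target("sbc-az1", "DR", healthy=True),
    Target("sbc-az2", "DR", healthy=True),
]

print(route_call(targets).name)  # live scenario: the on-premises IVR takes the call
targets[0].healthy = False       # health checks at the DC start failing
print(route_call(targets).name)  # DR scenario: the call goes to an SBC in the VPC
```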

Conclusion

This solution is useful when a third-party IVR system is deployed in a corporate data center, and the passive DR site is hosted on AWS. Cost optimization on telecom components is an important aspect of this design. AWS Direct Connect provides dedicated connectivity to the AWS environment, from 50 Mbps up to 10 Gbps. This gives you managed and controlled latency. It also provides provisioned bandwidth, so your workload can connect to AWS resources in a reliable, scalable, and cost-effective way.

The solution in this blog explains the end-to-end flow of communication, from the user to the IVR agents. It also provides insights into managing failover and failback between the DC and the DR site.

Overview of Data Transfer Costs for Common Architectures

2021-07-01 Birender Pal

Post Syndicated from Birender Pal original https://aws.amazon.com/blogs/architecture/overview-of-data-transfer-costs-for-common-architectures/

Data transfer charges are often overlooked while architecting a solution in AWS. Considering data transfer charges while making architectural decisions can help save costs. This blog post will help identify potential data transfer charges you may encounter while operating your workload on AWS. Service charges are out of scope for this blog, but should be carefully considered when designing any architecture.

Data transfer between AWS and internet

There is no charge for inbound data transfer across all services in all Regions. Data transfer from AWS to the internet is charged per service, with rates specific to the originating Region. Refer to the pricing pages for each service—for example, the pricing page for Amazon Elastic Compute Cloud (Amazon EC2)—for more details.

Data transfer within AWS

Data transfer within AWS could be from your workload to other AWS services, or it could be between different components of your workload.

Data transfer between your workload and other AWS services

When your workload accesses AWS services, you may incur data transfer charges.

Accessing services within the same AWS Region

If the internet gateway is used to access the public endpoint of the AWS services in the same Region (Figure 1 – Pattern 1), there are no data transfer charges. If a NAT gateway is used to access the same services (Figure 1 – Pattern 2), there is a data processing charge (per gigabyte (GB)) for data that passes through the gateway.

Figure 1. Accessing AWS services in same Region

Accessing services across AWS Regions

If your workload accesses services in different Regions (Figure 2), there is a charge for data transfer across Regions. The charge depends on the source and destination Region (as described on the Amazon EC2 Data Transfer pricing page).

Figure 2. Accessing AWS services in different Region

Data transfer within different components of your workload

Charges may apply if there is data transfer between different components of your workload. These charges vary depending on where the components are deployed.

Workload components in same AWS Region

Data transfer within the same Availability Zone is free. One way to achieve high availability for a workload is to deploy in multiple Availability Zones.

Consider a workload with two application servers running on Amazon EC2 and a database running on Amazon Relational Database Service (Amazon RDS) for MySQL (Figure 3). For high availability, each application server is deployed into a separate Availability Zone. Here, data transfer charges apply for cross-Availability Zone communication between the EC2 instances. Data transfer charges also apply between Amazon EC2 and Amazon RDS. Consult the Amazon RDS for MySQL pricing guide for more information.

Figure 3. Workload components across Availability Zones

To minimize impact of a database instance failure, enable a multi-Availability Zone configuration within Amazon RDS to deploy a standby instance in a different Availability Zone. Replication between the primary and standby instances does not incur additional data transfer charges. However, data transfer charges will apply from any consumers outside the current primary instance Availability Zone. Refer to the Amazon RDS pricing page for more detail.

A common pattern is to deploy workloads across multiple VPCs in your AWS network. Two approaches to enabling VPC-to-VPC communication are VPC peering connections and AWS Transit Gateway. Data transfer over a VPC peering connection that stays within an Availability Zone is free. Data transfer over a VPC peering connection that crosses Availability Zones will incur a data transfer charge for ingress/egress traffic (Figure 4).

Figure 4. VPC peering connection

Transit Gateway can interconnect hundreds or thousands of VPCs (Figure 5). Cost elements for Transit Gateway include an hourly charge for each attached VPC, AWS Direct Connect, or AWS Site-to-Site VPN. Data processing charges apply for each GB sent from a VPC, Direct Connect, or VPN to Transit Gateway.

Figure 5. VPC peering using Transit Gateway in same Region
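The Transit Gateway cost elements described above (an hourly charge per attachment plus a data processing charge per GB sent to the gateway) can be sketched as a small estimator. The function name and both rates below are placeholders chosen for illustration, not actual AWS prices; substitute the rates for your Region from the pricing page.

```python
from decimal import Decimal


def transit_gateway_cost(attachments, hours, gb_sent, hourly_rate, per_gb_rate):
    """Hourly charge per attachment plus a data processing charge per GB sent."""
    attachment_charge = attachments * hours * hourly_rate
    processing_charge = gb_sent * per_gb_rate
    return attachment_charge + processing_charge


# Placeholder rates, NOT actual AWS prices.
HOURLY_RATE = Decimal("0.05")
PER_GB_RATE = Decimal("0.02")

cost = transit_gateway_cost(
    attachments=3, hours=730, gb_sent=1000,
    hourly_rate=HOURLY_RATE, per_gb_rate=PER_GB_RATE,
)
print(f"Estimated monthly Transit Gateway cost: ${cost:.2f}")
```

`Decimal` is used instead of floating point so that currency arithmetic does not pick up rounding artifacts.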

Workload components in different AWS Regions

If workload components communicate across multiple Regions using VPC peering connections or Transit Gateway, additional data transfer charges apply. If the VPCs are peered across Regions, standard inter-Region data transfer charges will apply (Figure 6).

Figure 6. VPC peering across Regions

For peered Transit Gateways, you will incur data transfer charges on only one side of the peer. Data transfer charges do not apply for data sent from a peering attachment to a Transit Gateway. The data transfer for this cross-Region peering connection is in addition to the data transfer charges for the other attachments (Figure 7).

Figure 7. Transit Gateway peering across Regions

Data transfer between AWS and on-premises data centers

Data transfer will occur when your workload needs to access resources in your on-premises data center. There are two common options to help achieve this connectivity: Site-to-Site VPN and Direct Connect.

Data transfer over AWS Site-to-Site VPN

One option to connect workloads to an on-premises network is to use one or more Site-to-Site VPN connections (Figure 8 – Pattern 1). Charges for this pattern include an hourly charge for the connection and a charge for data transferred from AWS. Refer to Site-to-Site VPN pricing for more details. Another option to connect multiple VPCs to an on-premises network is to use a Site-to-Site VPN connection to a Transit Gateway (Figure 8 – Pattern 2). The Site-to-Site VPN will be considered another attachment on the Transit Gateway. Standard Transit Gateway pricing applies.

Figure 8. Site-to-Site VPN patterns

Data transfer over AWS Direct Connect

Direct Connect can be used to connect workloads in AWS to on-premises networks. Direct Connect incurs a fee for each hour the connection port is used and data transfer charges for data flowing out of AWS. Data transfer into AWS is $0.00 per GB in all locations. The data transfer charges depend on the source Region and the Direct Connect provider location. Direct Connect can also connect to the Transit Gateway if multiple VPCs need to be connected (Figure 9). Direct Connect is considered another attachment on the Transit Gateway and standard Transit Gateway pricing applies. Refer to the Direct Connect pricing page for more details.
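To make the Direct Connect cost elements concrete, here is a hedged sketch that breaks a monthly estimate into port hours, outbound transfer, and inbound transfer. The function name and the rates in the usage example are hypothetical. The only figure taken from the text is that data transfer into AWS is $0.00 per GB.

```python
INBOUND_RATE_PER_GB = 0.0  # data transfer into AWS is $0.00 per GB


def direct_connect_monthly_cost(
    port_hours: float,
    gb_out: float,
    gb_in: float,
    port_hourly_rate: float,
    rate_per_gb_out: float,
) -> dict:
    """Return a per-element breakdown of an estimated Direct Connect charge."""
    breakdown = {
        # Fee for each hour the connection port is used.
        "port_hours": port_hours * port_hourly_rate,
        # Charge for data flowing out of AWS; the rate depends on the source
        # Region and the Direct Connect location, so it is passed in.
        "outbound_transfer": gb_out * rate_per_gb_out,
        # Inbound transfer is tracked only to show that it adds nothing.
        "inbound_transfer": gb_in * INBOUND_RATE_PER_GB,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# Placeholder rates for illustration only; they are not real AWS prices.
for element, cost in direct_connect_monthly_cost(
    port_hours=730,
    gb_out=500,
    gb_in=2000,
    port_hourly_rate=0.30,
    rate_per_gb_out=0.02,
).items():
    print(f"{element}: ${cost:.2f}")
```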

Figure 9. Direct Connect patterns

A Direct Connect gateway can be used to share a Direct Connect connection across multiple Regions. When using a Direct Connect gateway, there will be outbound data charges based on the source Region and Direct Connect location (Figure 10).

Figure 10. Direct Connect gateway

General tips

Data transfer charges apply based on the source, destination, and amount of traffic. Here are some general tips for when you start planning your architecture:

Avoid routing traffic over the internet when connecting to AWS services from within AWS by using VPC endpoints:
- VPC gateway endpoints allow communication to Amazon S3 and Amazon DynamoDB without incurring data transfer charges.
- VPC interface endpoints are available for some AWS services. This type of endpoint incurs hourly service charges and data transfer charges.
Use Direct Connect instead of the internet for sending data to on-premises networks.
Traffic that crosses an Availability Zone boundary typically incurs a data transfer charge. Use resources from the local Availability Zone whenever possible.
Traffic that crosses a Regional boundary will typically incur a data transfer charge. Avoid cross-Region data transfer unless your business case requires it.
Use the AWS Free Tier. Under certain circumstances, you may be able to test your workload free of charge.
Use the AWS Pricing Calculator to help estimate the data transfer costs for your solution.
Use a dashboard to better visualize data transfer charges – this workshop will show how.
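As a planning aid for the Availability Zone and Region tips above, the following sketch labels each traffic flow by the boundary it crosses and totals the volume per label. The class and function names are hypothetical, and the sketch only classifies traffic. It does not look up any prices.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Where a workload component runs."""
    region: str
    # AZ IDs (for example "use1-az1") refer to the same zone in every
    # account, whereas AZ names can map differently per account.
    az_id: str


def boundary_crossed(src: Placement, dst: Placement) -> str:
    """Label a flow by the widest boundary it crosses."""
    if src.region != dst.region:
        return "cross-region"
    if src.az_id != dst.az_id:
        return "cross-az"
    return "same-az"


def gb_by_boundary(flows) -> dict:
    """Sum the GB transferred for each boundary label.

    `flows` is an iterable of (source, destination, gigabytes) tuples.
    """
    totals = defaultdict(float)
    for src, dst, gb in flows:
        totals[boundary_crossed(src, dst)] += gb
    return dict(totals)


web = Placement("us-east-1", "use1-az1")
db = Placement("us-east-1", "use1-az2")
cache = Placement("us-east-1", "use1-az1")
replica = Placement("us-west-2", "usw2-az1")

print(gb_by_boundary([(web, cache, 40.0), (web, db, 120.0), (db, replica, 15.0)]))
```

Flows that land in the cross-az or cross-region buckets are the ones to review first when looking for data transfer savings.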

Conclusion

AWS provides the ability to deploy across multiple Availability Zones and Regions. With a few clicks, you can create a distributed workload. As you increase your footprint across AWS, it helps to understand various data transfer charges that may apply. This blog post provided information to help you make an informed decision and explore different architectural patterns to save on data transfer costs.

Using Route 53 Private Hosted Zones for Cross-account Multi-region Architectures

2021-01-20 Anandprasanna Gaitonde

Post Syndicated from Anandprasanna Gaitonde original https://aws.amazon.com/blogs/architecture/using-route-53-private-hosted-zones-for-cross-account-multi-region-architectures/

This post was co-written by Anandprasanna Gaitonde, AWS Solutions Architect and John Bickle, Senior Technical Account Manager, AWS Enterprise Support

Introduction

Many AWS customers have internal business applications spread over multiple AWS accounts and on-premises to support different business units. In such environments, you may find a consistent view of DNS records and domain names between on-premises and different AWS accounts useful. Route 53 Private Hosted Zones (PHZs) and Resolver endpoints on AWS create an architecture best practice for centralized DNS in a hybrid cloud environment. Your business units keep the flexibility and autonomy to manage the hosted zones for their applications and to support multi-region application environments for disaster recovery (DR) purposes.

This blog presents an architecture that provides a unified view of the DNS while allowing different AWS accounts to manage subdomains. It utilizes PHZs with overlapping namespaces and cross-account multi-region VPC association for PHZs to create an efficient, scalable, and highly available architecture for DNS.

Architecture Overview

You can set up a multi-account environment using services such as AWS Control Tower to host applications and workloads from different business units in separate AWS accounts. However, these applications have to conform to a naming scheme based on organization policies and simpler management of DNS hierarchy. As a best practice, the integration with on-premises DNS is done by configuring Amazon Route 53 Resolver endpoints in a shared networking account. Following is an example of this architecture.

Figure 1 – Architecture Diagram

The customer in this example has on-premises applications under the customer.local domain. Applications hosted in AWS use subdomain delegation to aws.customer.local. The example here shows three applications that belong to three different teams, and those environments are located in their separate AWS accounts to allow for autonomy and flexibility. This architecture pattern follows the option of the “Multi-Account Decentralized” model as described in the whitepaper Hybrid Cloud DNS options for Amazon VPC.

This architecture involves three key components:

1. PHZ configuration: PHZ for the subdomain aws.customer.local is created in the shared Networking account. This is to support centralized management of PHZ for ancillary applications where teams don’t want individual control (Item 1a in Figure). However, for the key business applications, each of the teams or business units creates its own PHZ. For example, app1.aws.customer.local – Application1 in Account A, app2.aws.customer.local – Application2 in Account B, app3.aws.customer.local – Application3 in Account C (Items 1b in Figure). Application1 is a critical business application and has stringent DR requirements. A DR environment of this application is also created in us-west-2.

For a consistent view of DNS and efficient DNS query routing between the AWS accounts and on-premises, the best practice is to associate all the PHZs with the Networking account. PHZs created in Accounts A, B, and C are associated with the VPC in the Networking account by using cross-account association of Private Hosted Zones with VPCs. This creates overlapping domains from multiple PHZs for the VPCs of the Networking account. These domains also overlap with the parent subdomain PHZ (aws.customer.local) in the Networking account. Where two or more PHZs have overlapping namespaces, Route 53 Resolver routes traffic based on the most specific match, as described in the Developer Guide.

2. Route 53 Resolver endpoints for on-premises integration (Item 2 in Figure): The networking account is used to set up the integration with on-premises DNS using Route 53 Resolver endpoints as shown in Resolving DNS queries between VPC and your network. Inbound and Outbound Route 53 Resolver endpoints are created in the VPC in us-east-1 to serve as the integration between on-premises DNS and AWS. The DNS traffic between on-premises and AWS requires an AWS Site-to-Site VPN connection or AWS Direct Connect connection to carry DNS and application traffic. For each Resolver endpoint, two or more IP addresses can be specified to map to different Availability Zones (AZs). This helps create a highly available architecture.

3. Route 53 Resolver rules (Item 3 in Figure): Forwarding rules are created only in the networking account to route DNS queries for on-premises domains (customer.local) to the on-premises DNS server. AWS Resource Access Manager (RAM) is used to share the rules with accounts A, B and C as mentioned in the section “Sharing forwarding rules with other AWS accounts and using shared rules” in the documentation. Account owners can now associate these shared rules with their VPCs the same way that they associate rules created in their own AWS accounts. If you share the rule with another AWS account, you also indirectly share the outbound endpoint that you specify in the rule as described in the section “Considerations when creating inbound and outbound endpoints” in the documentation. This means that you can use one outbound endpoint in a region to forward DNS queries to your on-premises network from multiple VPCs, even if the VPCs were created in different AWS accounts. Resolver starts to forward DNS queries for the domain name that’s specified in the rule to the outbound endpoint, which forwards them to the on-premises DNS servers. The rules are created in both regions in this architecture.
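The most-specific-match behavior for overlapping PHZs described in the first component can be illustrated with a small sketch. This is a simplified model written for this example, not Route 53 Resolver code: the function names are hypothetical, and it only compares domain labels from right to left and picks the longest matching zone.

```python
from typing import Optional, Sequence


def _labels(name: str) -> list:
    """Split a DNS name into lowercase labels, ignoring a trailing dot."""
    return name.lower().rstrip(".").split(".")


def most_specific_zone(query_name: str, zones: Sequence[str]) -> Optional[str]:
    """Return the zone whose name is the longest label-wise suffix of the query."""
    query = _labels(query_name)
    best = None
    best_len = 0
    for zone in zones:
        zone_labels = _labels(zone)
        n = len(zone_labels)
        # Compare whole labels so "xapp1.aws..." does not match "app1.aws...".
        if n <= len(query) and query[-n:] == zone_labels and n > best_len:
            best, best_len = zone, n
    return best


# PHZs associated with the VPC in the Networking account in this example.
phzs = [
    "aws.customer.local",
    "app1.aws.customer.local",
    "app2.aws.customer.local",
    "app3.aws.customer.local",
]

for name in (
    "db.app1.aws.customer.local",
    "tools.aws.customer.local",
    "fileserver.customer.local",
):
    print(name, "->", most_specific_zone(name, phzs))
```

In this model, a name under app1.aws.customer.local is answered by the Account A zone, a name that only matches aws.customer.local falls to the parent zone, and a name that matches no PHZ is left for the forwarding rule that covers customer.local.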

This architecture provides the following benefits:

Resilient and scalable
Uses the VPC+2 endpoint, local caching and Availability Zone (AZ) isolation
Minimal forwarding hops
Lower cost: optimal use of Resolver endpoints and forwarding rules

In order to handle the DR, here are some other considerations:

For app1.aws.customer.local, the same PHZ is associated with VPC in us-west-2 region. While VPCs are regional, the PHZ is a global construct. The same PHZ is accessible from VPCs in different regions.
Failover routing policy is set up in the PHZ and failover records are created. However, Route 53 health checkers (being outside of the VPC) require a public IP for your applications. As these business applications are internal to the organization, a metric-based health check with Amazon CloudWatch can be configured as mentioned in Configuring failover in a private hosted zone.
Resolver endpoints are created in a VPC in another region (us-west-2) in the networking account. This allows on-premises servers to fail over to these secondary Resolver inbound endpoints in case the primary region goes down.
A second set of forwarding rules is created in the networking account, which uses the outbound endpoint in us-west-2. These are shared with Account A and then associated with VPC in us-west-2.
In addition, to have DR across multiple on-premises locations, the on-premises servers should have a secondary backup DNS on-premises as well (not shown in the diagram).
This ensures a simple DNS architecture for the DR setup, and seamless failover for applications in case of a region failure.
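The failover records mentioned above can be pictured with the following sketch of the selection logic. It is a simplified model under stated assumptions, not Route 53 behavior reproduced exactly: the record targets are hypothetical, the `healthy` flag stands in for the state of the CloudWatch metric-based health check, and returning the primary when neither record is healthy is a choice made for this sketch.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FailoverRecord:
    role: str      # "PRIMARY" or "SECONDARY"
    target: str    # hypothetical endpoint name for the application
    healthy: bool  # stand-in for the CloudWatch-based health check state


def resolve_failover(records: Iterable[FailoverRecord]) -> str:
    """Pick the target to answer with for a primary/secondary record pair."""
    by_role = {record.role: record for record in records}
    primary = by_role["PRIMARY"]
    secondary = by_role["SECONDARY"]
    if primary.healthy:
        return primary.target
    if secondary.healthy:
        return secondary.target
    # Assumption of this sketch: fall back to the primary if both are unhealthy.
    return primary.target


records = [
    FailoverRecord("PRIMARY", "app1.us-east-1.example.internal", healthy=False),
    FailoverRecord("SECONDARY", "app1.us-west-2.example.internal", healthy=True),
]
print(resolve_failover(records))
```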

Considerations

If Application 1 needs to communicate to Application 2, then the PHZ from Account A must be shared with Account B. DNS queries can then be routed efficiently for those VPCs in different accounts.
Create additional IP addresses in a single AZ/subnet for the resolver endpoints, to handle large volumes of DNS traffic.
Look at Considerations while using Private Hosted Zones before implementing such architectures in your AWS environment.

Summary

Hybrid cloud environments can utilize the features of Route 53 Private Hosted Zones such as overlapping namespaces and the ability to perform cross-account and multi-region VPC association. This creates a unified DNS view for your application environments. The architecture allows for scalability and high availability for business applications.

Field Notes: Setting Up Disaster Recovery in a Different Seismic Zone Using AWS Outposts

2020-11-24 Vijay Menon

Post Syndicated from Vijay Menon original https://aws.amazon.com/blogs/architecture/field-notes-setting-up-disaster-recovery-in-a-different-seismic-zone-using-aws-outposts/

Recovering your mission-critical workloads from outages is essential for business continuity and providing services to customers with little or no interruption. That’s why many customers replicate their mission-critical workloads in multiple places using a Disaster Recovery (DR) strategy suited for their needs.

With AWS, a customer can achieve this by deploying a multi-Availability Zone high-availability setup or a multi-Region setup that replicates critical components of an application to another Region. Depending on the RPO and RTO of the mission-critical workload, the requirement for disaster recovery ranges from simple backup and restore to a multi-site, active-active setup. In this blog post, I explain how AWS Outposts can be used for DR on AWS.

In many geographies, it is possible to set up disaster recovery for a workload running in one AWS Region in another AWS Region in the same country (for example, in the US between us-east-1 and us-west-2). For countries where there is only one AWS Region, it’s possible to set up disaster recovery in another country where an AWS Region is present. This method can be designed for the continuity, resumption, and recovery of critical business processes at an agreed level, and it limits the impact on people, processes, and infrastructure (including IT). It also helps minimize the operational, financial, legal, reputational, and other material consequences arising from such events.

However, for mission-critical workloads handling critical user data (PII, PHI or financial data), countries like India and Canada have regulations that mandate a disaster recovery setup at a “safe distance” within the same country. This ensures compliance with any data sovereignty or data localization requirements mandated by the regulators. “Safe distance” means the distance between the DR site and the primary site is such that the business can continue to operate in the event of any natural disaster or industrial events affecting the primary site. Depending on the geography, this safe distance could be 50 km or more. These regulations limit the options customers have to use another AWS Region in another country as a disaster recovery site of their primary workload running on AWS.

In this blog post, I describe an architecture using AWS Outposts which helps set up disaster recovery on AWS within the same country at a distance that can meet the requirements set by regulators. This architecture also helps customers to comply with various data sovereignty regulations in a given country. Another advantage of this architecture is the homogeneity of the primary and disaster recovery site. Your existing IT teams can set up and operate the disaster recovery site using familiar AWS tools and technology in a homogeneous environment.

Prerequisites

Readers of this blog post should be familiar with basic networking concepts like WAN connectivity, BGP and the following AWS services:

Amazon EC2
Amazon VPC
AWS Outposts
AWS Transit Gateway
AWS Managed VPN
AWS Direct Connect
AWS Marketplace Amazon Machine Images (AMI) for Network Infrastructure

Architecture Overview

I explain the architecture using an example customer scenario in India, where a customer is using AWS Mumbai Region for their mission-critical workload. This workload needs a DR setup to comply with local regulation and the DR setup needs to be in a different seismic zone than the one for Mumbai. Also, because of the nature of the regulated business, the user/sensitive data needs to be stored within India.

Following is the architecture diagram showing the logical setup.

This solution is similar to a typical AWS Outposts use case where a customer orders the Outposts to be installed in their own Data Centre (DC) or a CoLocation site (Colo). It will follow the shared responsibility model described in AWS Outposts documentation.

The only difference is that the AWS Outposts parent Region will be the closest Region other than AWS Mumbai, in this case Singapore. Customers will then provision an AWS Direct Connect public VIF locally for a Service Link to the Singapore Region. This ensures that the control plane stays available via the AWS Singapore Region even if there is an outage in AWS Mumbai Region affecting control plane availability. You can then launch and manage AWS Outposts supported resources in the AWS Outposts rack.

For data plane traffic, which should not go out of the country, the following options are available:

Provision a self-managed Virtual Private Network (VPN) between an EC2 instance running a router AMI in a subnet of AWS Outposts and AWS Transit Gateway (TGW) in the primary Region.
Provision a self-managed Virtual Private Network (VPN) between an EC2 instance running a router AMI in a subnet of AWS Outposts and Virtual Private Gateway (VGW) in the primary Region.

Note: The Primary Region in this example is AWS Mumbai Region. This VPN will be provisioned via Local Gateway and DX public VIF. This ensures that data plane traffic will not traverse any network out of the country (India) to comply with data localization mandated by the regulators.

Architecture Walkthrough

1. Make sure your data center (DC) or the colocation facility (Colo) of your choice meets the requirements for AWS Outposts.
2. Create an Outpost and order Outpost capacity as described in the documentation. Make sure that you do this step while logged into the AWS Outposts console of the AWS Singapore Region.
3. Provision connectivity between AWS Outposts and the network of your DC/Colo as mentioned in the AWS Outposts documentation. This includes setting up VLANs for service links and Local Gateway (LGW).
4. Provision an AWS Direct Connect connection and public VIF between your DC/Colo and the primary Region via the closest AWS Direct Connect location.
- For the WAN connectivity between your DC/Colo and the AWS Direct Connect location, you can choose a telco provider of your choice or work with one of the AWS Direct Connect partners.
- This public VIF will be used to attach AWS Outposts to its parent Region in Singapore over the AWS Outposts service link. It will also be used to establish an IPsec GRE tunnel between the AWS Outposts subnet and a TGW or VGW for data plane traffic (explained in subsequent steps).
- Alternatively, you can provision separate Direct Connect connections and public VIFs for Service Link and data plane traffic for better segregation between the two. You will have to provision sufficient bandwidth on the Direct Connect connection for the Service Link traffic as well as the data plane traffic (like data replication between the primary Region and AWS Outposts).
- For an optimal experience and resiliency, AWS recommends that you use dual 1Gbps connections to the AWS Region. This connectivity can also be achieved over Internet transit; however, I recommend using AWS Direct Connect because it provides private connectivity between AWS and your DC/Colo environment, which in many cases can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections.
5. Create a subnet in AWS Outposts and launch an EC2 instance running a router AMI of your choice from AWS Marketplace in this subnet. This EC2 instance is used to establish the IPsec GRE tunnel to the TGW or VGW in the primary Region.
- Choose an EC2 instance which can support your bandwidth requirement as per the AMI provider and disable source/destination check for this EC2 instance.
- Assign an Elastic IP address to this EC2 instance from the customer-owned pool provisioned for your AWS Outposts.
6. Add rules to the security group of this EC2 instance to allow ISAKMP (UDP 500), NAT Traversal (UDP 4500), and ESP (IP Protocol 50) from VGW or TGW endpoint public IP addresses.
7. NAT (Network Address Translation) the EIP assigned in step 5 to a public IP address at your edge router connecting to AWS Direct Connect or internet transit. This public IP will be used as the customer gateway to establish the IPsec GRE tunnel to the primary Region.
8. Create a customer gateway using the public IP address used to NAT the EC2 instance in step 7. Follow a process similar to the one found at Create a Customer Gateway.
9. Create a VPN attachment for the transit gateway using the customer gateway created in step 8. This VPN must be a dynamic route-based VPN. For steps, review Transit Gateway VPN Attachments. If you are connecting the customer gateway to a VPC using a VGW in the primary Region, then follow the steps mentioned at How do I create a secure connection between my office network and Amazon Virtual Private Cloud?.
10. Configure the customer gateway (EC2 instance running a router AMI in the AWS Outposts subnet) side for VPN connectivity. You can base this on the configuration suggested by AWS during the creation of the VPN in step 9. This suggested sample configuration can be downloaded from the AWS console post VPN setup as discussed in this document.
11. Modify the route table of the AWS Outposts subnets to point to the EC2 instance launched in step 5 as the target for any destination in your VPCs in the primary Region, which is AWS Mumbai in this example.
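The security group rules for the router instance (ISAKMP, NAT Traversal, and ESP) can be expressed as data before you apply them. The sketch below builds them in the dictionary shape of the `IpPermissions` parameter used by the EC2 AuthorizeSecurityGroupIngress API. The function name is hypothetical, and the peer addresses in the usage example are placeholders from a documentation address range, not real VGW or TGW endpoint addresses.

```python
import ipaddress
from typing import Iterable, List


def ipsec_ingress_rules(peer_ips: Iterable[str]) -> List[dict]:
    """Build ingress rules that allow IPsec traffic from the given peers only."""
    # Validate each address and restrict the rule to that single host (/32).
    cidrs = [f"{ipaddress.IPv4Address(ip)}/32" for ip in peer_ips]

    def ranges() -> List[dict]:
        # Return a fresh list so the rules do not share mutable state.
        return [{"CidrIp": cidr} for cidr in cidrs]

    return [
        # ISAKMP (IKE) on UDP 500.
        {"IpProtocol": "udp", "FromPort": 500, "ToPort": 500, "IpRanges": ranges()},
        # NAT Traversal on UDP 4500.
        {"IpProtocol": "udp", "FromPort": 4500, "ToPort": 4500, "IpRanges": ranges()},
        # ESP is IP protocol number 50 and has no ports.
        {"IpProtocol": "50", "IpRanges": ranges()},
    ]


# Placeholder tunnel endpoint addresses for illustration only.
for rule in ipsec_ingress_rules(["203.0.113.10", "203.0.113.20"]):
    print(rule)
```

Keeping the rules as plain data makes them easy to review against the VPN tunnel endpoint addresses before passing them to whichever tool you use to manage the security group.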

At this point, you will have end-to-end connectivity between VPCs in the primary Region and resources in AWS Outposts. This connectivity can now be used to replicate data from your primary site to AWS Outposts for DR purposes. This keeps the setup compliant with any internal or external data localization requirements.

Conclusion

In this blog post, I described an architecture using AWS Outposts for Disaster Recovery on AWS in countries without a second AWS Region. To set up disaster recovery, your existing IT teams can set up and operate the disaster recovery site using the familiar AWS tools and technology in a homogeneous environment. To learn more about AWS Outposts, refer to the documentation and FAQ.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

New Whitepaper: Selecting & Designing Your Hybrid Connectivity Model

2020-10-08 Santiago Freitas

Post Syndicated from Santiago Freitas original https://aws.amazon.com/blogs/architecture/new-whitepaper-selecting-designing-your-hybrid-connectivity-model/

Introduction

Many organizations need to connect their on-premises data centers, remote sites, and the cloud. A hybrid network connects these different environments.

A modern organization uses an extensive array of IT resources. In the past, it was common to host these resources in an on-premises data center or a colocation facility. With the increased adoption of cloud computing, IT resources are delivered and consumed from cloud service providers over a network connection. In some cases, organizations have opted to migrate all existing IT resources to the cloud. In other cases, organizations maintain IT resources both on premises and in the cloud. In both cases, a common network is required to connect on-premises and cloud resources. Coexistence of on-premises and cloud resources is called “hybrid cloud” and the common network connecting them is referred to as a “hybrid network.” Even if your organization keeps all of its IT resources in the cloud, it may still require hybrid connectivity to remote sites.

There are several connectivity models to choose from. Although having options adds flexibility, selecting the best option requires analysis of the business and technical requirements and the elimination of options that are not suitable. Requirements can be grouped together across considerations, such as: security, time to deploy, performance, reliability, communication model, scalability, and more. Once requirements are carefully collected, analyzed, and considered, network and cloud architects identify applicable AWS hybrid network building blocks and solutions. To identify and select the optimal model(s), architects must understand advantages and disadvantages of each model. There are also technical limitations that might cause an otherwise good model to be excluded.

Figure 1 – Considerations covered in the whitepaper.

A new whitepaper on Hybrid Connectivity describes AWS building blocks and the key things to consider when deciding which hybrid connectivity model is right for you. To help you determine the best solution for your business and technical requirements, we provide decision trees to guide you through the logical selection process as well as a customer use case to show how to apply the considerations and decision trees in practice.

Decision tree applied to Example Corp. Automotive use case

Figure 2: Example Corp. Automotive connection type decision tree

Contributors

Contributors to this new whitepaper on Hybrid Connectivity are: Marwan Al Shawi, AWS Solutions Architect; Santiago Freitas, AWS Head of Technology; Evgeny Vaganov, AWS Specialist Solutions Architect – Networking; and Tom Adamski, AWS Specialist Solutions Architect – Networking. Special thanks to Stephen Bird, AWS Senior Program Manager – Content.