All posts by Yogi Barot

Extend SQL Server DR using log shipping for SQL Server FCI with Amazon FSx for Windows configuration

Post Syndicated from Yogi Barot original https://aws.amazon.com/blogs/architecture/extend-sql-server-dr-using-log-shipping-for-sql-server-fci-with-amazon-fsx-for-windows-configuration/

This week for Women’s History Month, we’re continuing to feature female authors. We’re showcasing women in the tech industry who are building, creating, and, above all, inspiring, empowering, and encouraging everyone—especially women and girls—in tech.


Companies choosing to rehost their on-premises SQL Server workloads to AWS can face challenges with setting up their disaster recovery (DR) strategy. Solutions such as Always On can be a more expensive, complex configuration across Regions. It can cause latency issues when synchronously replicating data cross-Region. Snapshots have additional overhead and may breach their stringent recovery point objective/recovery time objective (RPO/RTO) requirements.

A log shipping solution can take advantage of cross-Region replication of data using Amazon FSx for Windows File Server. It has less maintenance overhead, doesn’t introduce latency, and meets RPO/RTO requirements. A multi-Region architecture for Microsoft SQL Server is often adopted for SQL Server deployments for business continuity (disaster recovery) and improved latency (for a geographically distributed customer base).

This blog post explores SQL Server DR architecture using SQL Server failover cluster with Amazon FSx for Windows File Server for the primary site and secondary DR site. We describe how to set up a multi-Region DR using log shipping. We’ll explain the architecture patterns so you can follow along and effectively design a highly available SQL Server deployment that spans two or more AWS Regions.

Here are some advantages of using log shipping versus Always On distributed availability group DR setup.

  • Log shipping works with SQL Server Standard edition
  • It lowers total cost of ownership (TCO) as you only need one SQL Server Standard edition license at the primary/DR site
  • It’s straightforward to configure
  • There’s no need for clustering setup at the OS level
  • It supports all SQL Server versions. You don’t need the SQL Server version to be the same for source and destination instances.

Log shipping DR solution for SQL Server FCI with Amazon FSx

The architecture diagram in Figure 1 depicts SQL Server failover cluster instance (FCI) using Amazon FSx as storage (multiple Availability Zones) in Region 1. It uses a standalone or a similar setup on Region 2. It uses a log shipping feature for replication and DR. This will also serve as the reference architecture for our solution.

Log shipping DR solution for SQL Server FCI with Amazon FSx

Figure 1. Log shipping DR solution for SQL Server FCI with Amazon FSx

Figure 1 shows an SQL cluster in Region 1 and standalone SQL cluster in Region 2. The primary cluster in Region 1 is initially configured with SQL Server failover cluster instance (FCI) using Amazon FSx for its shared storage. Region 2 can have a standalone Amazon EC2 server with SQL Server and Amazon Elastic Block Store (EBS) as storage. Or it can have an identical configuration to Region 1, but with different hostnames, and an SQL network name (SQLFCI02) to avoid possible collisions.

You can build the VPC peering or AWS Transit Gateway to have seamless connectivity between the two Regions for the opened ports (SQL Server, SMB for file share, and others.)

With Amazon FSx, you get a fully managed and shared file storage solution, that automatically replicates the underlying storage synchronously across multiple Availability Zones. Amazon FSx provides high availability with automatic failure detection and automatic failover if there are any hardware or storage issues. The service fully supports continuously available shares, a feature that permits SQL Server uninterrupted access to shared file data.

There is an asynchronous replication setup from Region 1 to Region 2 using the log shipping feature. In this type of configuration, Microsoft SQL Server log shipping replicates databases using transaction logs. This ensures that a physically replicated warm standby database is an exact binary replica of the primary database. This is referred to as physical replication.

Log shipping can be configured with two available modes. These are related to the state of the secondary log-shipped SQL Server database.

  • Standby mode. The database is available for querying. Users cannot access the database while restore is going on. But once restore is completed, users can access it in read-only mode.
  • Restore mode. The database is not accessible for users.

In this solution, you configure a warm standby SQL Server database on an EC2 instance designated in SQL FCI using Amazon FSx as shared storage. You can send transaction log backups asynchronously between your primary Region database and the warm standby server in the other Region. The transaction log backups are then applied to the warm standby database sequentially. When all the logs have been applied, you can perform a manual failover and point the application to the secondary Region. We recommend running the primary and secondary database instances in separate Availability Zones, and configuring a monitor instance to track all the details of log shipping.

Prerequisites

Walkthrough steps to set up DR for SQL Server FCI

Following are the steps required to configure SQL Server DR using SQL Server failover cluster. Amazon FSx for Windows File Server is used for the primary site and secondary DR site. We also demonstrate how to set up a multi-Region log shipping.

Assumed variables

Region_01:
WSFC Cluster Name: SQLCluster1
FCI Virtual Network Name: SQLFCI01
Region_02:
Amazon EC2 Name: EC2SQL2

Make sure to configure network connectivity between your clusters. In this solution, we are using two VPCs in two separate Regions.

    • VPC peering is configured to enable network traffic on both VPCs.
    • The domain controller (AWS Managed Microsoft AD) on both VPCs are configured with conditional forwarding. This enables DNS resolution between the two VPCs.
  1. Configure SQL FCI setup using Amazon FSx as shared storage on Region_01.
  2. Configure SQL standalone instance on Region_02 with EBS volume as storage.
  3. Create an Amazon FSx in the primary Region with AWS managed Active Directory, or on-premises Active Directory connected with trust relation or AD Connector.
  4. Create a SQL Server service account with proper permissions to be able to set up transaction log settings.
  5. Configure VPC peering between the primary and DR/secondary Region.
  6. Join the domain to the Active Directory network for both primary and secondary servers in primary Region.
  7. Mount Amazon FSx on primary and secondary server and allow shared permissions, so SQL Server is able to access the folder. Use Amazon FSx for storing transaction log backups and EBS for storing transaction logs on the secondary Region.
  8. Set up log shipping from the primary server SQL Server FCI01 to the secondary SQL Server EC2SQL2 with the standby option enabled. This way the databases can be in read on the secondary SQL Server.
  9. In case of disaster, follow the FAILOVER and FAILBACK steps in the next sections. Learn more by reading Change Roles Between Primary and Secondary Log Shipping Servers.

Failover steps

In case of disaster at primary Region node SQLFCI01, log shipping acts as DR solution. Following, we show the steps to bring the databases online on EC2SQL02. Once SQLFCI01 is back, Use the following steps if DR drill checks to failover. In a real disaster, follow the process from Step 3 onwards.

1. Stop all activities on SQLFCI01 databases involved in log shipping jobs on SQLFCI01 and EC2SQL02. Confirm if any process is running by using the following query:

Use master
Go
select * from sysprocesses where dbid = DB_ID('DatabaseName')

2. Take full backup on SQLFCI01 as rollback option.

BACKUP DATABASE [DatabaseName]
TO  DISK = N'Provide Drive details'
WITH COMPRESSION
GO

3. Take last tail transaction log backup if we have access to SQL Server. Otherwise, check the last available transaction log stored in EC2SQL02 and restore it with RECOVERY to bring the databases online on EC2SQL02.

RESTORE LOG [DatabaseName] FROM  DISK = N'Provide path of last tlog'
WITH  FILE = 1,  RECOVERY,  NOUNLOAD,  STATS = 10
GO

4. Redirect the application connections to EC2SQL02.

Failback methods

1. Native backup/restore or rollback strategy

  • Take full backup from EC2SQL02 and copy to the SQLFCI01.
  • RESTORE the full backup on SQLFCI01.
  • Reconfigure log shipping between SQLFCI01 and EC2SQL02.

2.    Reverse log shipping

In case of DR drills or business continuity and disaster recovery (BCDR) activities, we can set up reverse log shipping to reduce the time taken to failover. It doesn’t require reinitializing the database with a full backup if performed carefully. It is crucial to preserve the log sequence number (LSN) chain. Perform the final log backup using the NORECOVERY option. Backing up the log with this option puts the database in a state where log backups can be restored. It ensures that the database’s LSN chain doesn’t deviate. This procedure helps reduce downtime to bring back SQLFCI01.

  • STOP all activities on SQLFCI01 databases involved in log shipping jobs on SQLFCI01 and EC2SQL02.
  • TAKE Tlog backup of SQLFCI01 with NORECOVERY option.

BACKUP LOG [DatabaseName]
TO DISK = 'BackupFilePathname'
WITH NORECOVERY;

  • RESTORE transaction log backup on EC2SQL02 with NORECOVERY.
  • Reconfigure log shipping and reenable the jobs back.
  • Reconfigure the application connections to SQLFCI01.

Conclusion

A multi-Region strategy for your mission-critical SQL Server deployments is key for business continuity and disaster recovery. This blog post shows how to achieve that using log shipping for SQL Server FCI deployment. Setting up DR using log shipping can help you save costs and meet your business requirements.

To learn more, check out Simplify your Microsoft SQL Server high availability deployments using Amazon FSx for Windows File Server.

More posts for Women’s History Month!

Other ways to participate

Field Notes: Building a Multi-Region Architecture for SQL Server using FCI and Distributed Availability Groups

Post Syndicated from Yogi Barot original https://aws.amazon.com/blogs/architecture/field-notes-building-a-multi-region-architecture-for-sql-server-using-fci-and-distributed-availability-groups/

A multiple-Region architecture for Microsoft SQL Server is often a topic of interest that comes up when working with our customers. The main reasons customers adopt a multiple-Region architecture approach for SQL Server deployments are:

  • Business continuity and disaster recovery (DR)
  • Geographically distributed customer base, and improved latency for end users

We will explain the architecture patterns that you can follow to effectively design a highly available SQL Server deployment, which spans two or more AWS Regions. You will also learn how to use the multiple-Region approach to scale out the read workloads, and improve the latency for your globally distributed end users.

This blog post explores SQL Server DR architecture using SQL Server Failover Cluster with Amazon FSx for Windows File Server, for primary site and secondary DR site, and describes how to set up a multiple-Region Always On distributed availability group.

Architecture overview

The architecture diagram in Figure 1 depicts two SQL Server clusters (multiple Availability Zones) in two separate Regions, and uses distributed availability group for replication and DR. This will also serve as the reference architecture for this solution.

Figure 1. Two SQL Server clusters (multiple Availability Zones) in two separate Regions

Figure 1. Two SQL Server clusters (multiple Availability Zones) in two separate Regions

In Figure 1, there are two separate clusters in different Regions. The primary cluster in Region_01 is initially configured with SQL Server Failover Cluster Instance (FCI) using Amazon FSx for its shared storage. Always On is enabled on both nodes, and is configured to use FCI SQL Network Name (SQLFCI01) as the single replica for local Availability Group (AG01). Region_02 has an identical configuration to Region_01, but with different hostnames, listeners, and SQL Network Name to avoid possible collisions.

Highlighted in Figure 1, the Always On distributed availability group is then configured to use both listener endpoints (AG01 and AG02). Depending on what type of authentication infrastructure you have, you can either use certificates (no domain and trust dependency), or just AWS Directory Service for Microsoft Active Directory authentication to build the local mirroring endpoint that will be used by the distributed availability group.

With Amazon FSx, you get a fully managed shared file storage solution, that automatically replicates the underlying storage synchronously across multiple Availability Zones. Amazon FSx provides high availability with automatic failure detection, and automatic failover if there are any hardware or storage issues. The service fully supports continuously available shares, a feature that allows SQL Server uninterrupted access to shared file data.

There is an asynchronous replication setup using a distributed availability group from Region A to Region B. In this type of configuration, because there is only one availability group replica, it also serves as the forwarder for the local FCI cluster. The concept of a forwarder is new, and it’s one of the core functionalities for the distributed availability group. Because Windows Failover Cluster1 and Windows Failover Cluster2 are standalone and independent clusters, don’t need to open a large set of ports, thus minimizing security risk.

In this solution, because FCI is our primary high availability solution, users and applications should then connect through FCI SQL Server Network Name with the latest supported drivers and key parameters (such as, MultiSubNetFailover=True – if supported) to facilitate the failover and make sure that the applications seamlessly connect to the new replica without any errors or timeouts.

Prerequisites

Walkthrough

Following are the steps required to configure SQL Server DR using SQL Server Failover Cluster with Amazon FSx for Windows File Server for primary site and secondary DR site. We also show how to set up a multiple-Region Always On distributed availability group.

Assumed Variables

Region_01:

WSFC Cluster Name: SQLCluster1
FCI Virtual Network Name: SQLFCI01
Local Availability Group: SQLAG01

Region_02:

WSFC Cluster Name: SQLCluster2
FCI Virtual Network Name: SQLFCI02
Local Availability Group: SQLAG02

  • Make sure to configure network connectivity between your clusters. In this solution, we are using two VPCs in two separate Regions.
    • VPC peering is configured to enable network traffic on both VPCs.
    • The domain controller (AWS Managed Microsoft AD) on both VPCs are configured with forest trust and conditional forwarding (this enables DNS resolution between the two VPCs).
  • Create a local availability group, using FCI SQL Network Name as the replica. Because we will be setting up a domain-independent distributed availability group between the two clusters, we will be setting up certificates to authenticate between the two separate clusters.
  1. Create master key and endpoint for SQLCluster1

use master
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<password>'
GO
CREATE CERTIFICATE [SQLAG01-Cert]
with SUBJECT = 'SQLAG01 Endpoint Cert'
GO
 
BACKUP CERTIFICATE [SQLAG01-Cert]
TO FILE = N'\\<FileShare>\SQLAG01-Cert.crt'
GO
 
CREATE ENDPOINT [SQLAG01-Endpoint]
STATE = STARTED
AS TCP
(
LISTENER_PORT = 5022
)
FOR DATABASE_MIRRORING
(
AUTHENTICATION = CERTIFICATE [SQLAG01-Cert],
ROLE = ALL,
ENCRYPTION = REQUIRED ALGORITHM AES
)
GO
  1. Create master key and endpoint for SQLCluster2

use master
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<password>'
GO
CREATE CERTIFICATE [SQLAG02-Cert]
with SUBJECT = 'SQLAG02 Endpoint Cert'
GO
 
BACKUP CERTIFICATE [SQLAG02-Cert]
TO FILE = N'\\<Fileshare>\SQLAG02-Cert.crt'
GO
 
CREATE ENDPOINT [SQLAG02-Endpoint]
STATE = STARTED
AS TCP
(
LISTENER_PORT = 5022
)
FOR DATABASE_MIRRORING
(
AUTHENTICATION = CERTIFICATE [SQLAG02-Cert],
ROLE = ALL,
ENCRYPTION = REQUIRED ALGORITHM AES
)
GO
    • Make sure to place all exported certificates in a location that you can easily access from each FCI instance.
    • Create a SQL Server login and user in the master database on each FCI instance.
  1. Create database login in SQLCluster1

use master
CREATE LOGIN [SQLAG02_DAG] WITH PASSWORD = '<password>'
GO
CREATE USER [SQLAG02_DAG] FOR LOGIN [SQLAG02_DAG] 
GO
CREATE CERTIFICATE [SQLAG02-Cert]
AUTHORIZATION [SQLAG02_DAG]
FROM FILE = N'\\<Fileshare>\SQLAG02-Cert.crt'
GO
  1. Create database login in SQLCluster2

use master
CREATE LOGIN [SQLAG01_DAG] WITH PASSWORD = '<password>'
GO
CREATE USER [SQLAG01_DAG] FOR LOGIN [SQLAG01_DAG] 
GO
CREATE CERTIFICATE [SQLAG01-Cert]
AUTHORIZATION [SQLAG01_DAG]
FROM FILE = N'\\<Fileshare>\SQLAG01-Cert.crt'
GO
    • Now grant the newly created user endpoint access to the local mirroring endpoint in each FCI instance.
  1. Grant permission on endpoint – SQLCluster1

GRANT CONNECT ON ENDPOINT::[SQLAG01-Endpoint] TO [SQLAG02_DAG]
GO
  1. Grant permission on endpoint – SQLCluster2

GRANT CONNECT ON ENDPOINT::[SQLAG02-Endpoint] TO [SQLAG01_DAG]
GO
  1. Create distributed Always On availability group on SQLCluster1

Next, create the distributed availability group on the primary cluster.

CREATE AVAILABILITY GROUP [SQLFCIDAG]  
   WITH (DISTRIBUTED)   
   AVAILABILITY GROUP ON  
    'SQLAG01' WITH    
        (   
            LISTENER_URL = 'tcp://SQLFCI01.DEMOSQL.COM:5022',    
            AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,   
            FAILOVER_MODE = MANUAL,   
            SEEDING_MODE = AUTOMATIC
        ),   
    'SQLAG02' WITH    
        (   
            LISTENER_URL = 'tcp://SQLFCI02.SQLDEMO.COM:5022',   
            AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,   
            FAILOVER_MODE = MANUAL,   
            SEEDING_MODE = AUTOMATIC
      );   
    • Note that we are using the SQL Network Name of the FCI cluster as our listener URL.
    • Now, join our secondary WSFC FCI cluster to the distributed availability group.
  1. Join secondary cluster on SQLCluster2 to distributed availability group

ALTER AVAILABILITY GROUP [SQLFCIDAG]   
   JOIN   
   AVAILABILITY GROUP ON  
      'SQLAG01' WITH    
      (   
         LISTENER_URL = 'tcp://SQLFCI01.DEMOSQL.COM:5022',    
         AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,   
         FAILOVER_MODE = MANUAL,   
         SEEDING_MODE = AUTOMATIC   
      ),   
      'SQLAG02' WITH    
      (   
         LISTENER_URL = 'tcp://SQLFCI02.SQLDEMO.COM:5022',   
         AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,   
         FAILOVER_MODE = MANUAL,   
         SEEDING_MODE = AUTOMATIC   
      );    
GO  
    • After you run the join script, you should be able to see the database from the primary FCI cluster’s local availability group populate the secondary FCI cluster.
    • To do a distributed availability group failover, it is best practice to synchronize both clusters first.
  1. Synchronize primary cluster

ALTER AVAILABILITY GROUP [SQLFCIDAG] 
 MODIFY AVAILABILITY GROUP ON 
 'SQLAG01' 
WITH 
( 
AVAILABILITY_MODE = SYNCHRONOUS_COMMIT 
), 
'SQLAG02'
WITH
( 
AVAILABILITY_MODE = SYNCHRONOUS_COMMIT 
);
    • You can verify synchronization lag and verify state displays as “SYNCHRONIZED”:
SELECT ag.name
       , drs.database_id
       , db_name(drs.database_id) as database_name
       , drs.group_id
       , drs.replica_id
       , drs.synchronization_state_desc
       , drs.last_hardened_lsn  
FROM sys.dm_hadr_database_replica_states drs 
INNER JOIN sys.availability_groups ag on drs.group_id = ag.group_id;
  1. Perform failover at primary cluster

    After everything is ready, perform failover by first changing the DAG role on the global primary.

ALTER AVAILABILITY GROUP [SQLFCIDAG] SET (ROLE = SECONDARY);
  1. Perform failover at secondary cluster

After which, initiate the actual failover by running this script on the secondary cluster.

ALTER AVAILABILITY GROUP [SQLFCIDAG] FORCE_FAILOVER_ALLOW_DATA_LOSS;
  1. Change sync mode on primary and secondary clusters

    Then make sure to change Sync mode on both clusters back to Asynchronous:

 ALTER AVAILABILITY GROUP [SQLFCIDAG] 
 MODIFY AVAILABILITY GROUP ON 
'SQLAG01' 
WITH 
( 
AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT 
), 
'SQLAG02'
WITH
( 
AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT 
);

Conclusion

A multiple-Region strategy for your mission critical SQL Server deployments is key for business continuity and disaster recovery. This blog post focused on how to achieve that optimally by using distributed availability groups. You also learned about other benefits such as read scale outs by using distributed availability groups.

To learn more, check out Simplify your Microsoft SQL Server high availability deployments using Amazon FSx for Windows File Server.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.