How UnitedHealth Group Improved Disaster Recovery for Machine-to-Machine Authentication

Post Syndicated from Vinodh Kumar Rathnasabapathy original https://aws.amazon.com/blogs/architecture/how-unitedhealth-group-improved-disaster-recovery-for-machine-to-machine-authentication/

This blog post was co-authored by Vinodh Kumar Rathnasabapathy, Senior Manager of Software Engineering, UnitedHealth Group.

Engineers who use Amazon Cognito for machine-to-machine authentication select a primary Region where they deploy their application infrastructure and the Amazon Cognito authorization endpoint. Amazon Cognito is a highly available service in single Region deployments with a published service-level agreement (SLA) target of 99.9%. The UnitedHealth Group (UHG) team needed a solution that would enable them to build and deploy their applications in multiple Regions to achieve higher availability targets. A multi-Region application architecture would also allow UHG engineers to failover to a secondary Region in the event that their application experienced issues in the primary Region.

At UHG, Federated Data Services (FDS) is a business-critical customer-facing application, which requires 99.95% availability and disaster recovery features. The FDS engineering team needed their Amazon Cognito infrastructure to be highly available in case of any service events in AWS Regions, along with having greater flexibility of switching between Regions.

The FDS engineering team worked on a custom solution using existing AWS services to fulfill their availability and recovery requirements. This solution not only serves the purpose of their current business needs but also provides recovery in case of any future disaster.

Overview of the solution

In this solution, we select two AWS Regions which will include the primary Region and the failover Region. Amazon Cognito app clients (including client ID and client secret pair) are created in both Regions and stored in an Amazon DynamoDB table. Client applications are given the client ID and client secret of the primary Region. Optionally, an application-generated ID and secret can be provided to the client to conceal the actual Amazon Cognito client ID and client secret. The process is as follows:

The client application (machine) initiates an authentication request by sending the Amazon Cognito app client ID and client secret to an Amazon Route 53 domain record.
Route 53 routes authentication requests to the in-Region Amazon API Gateway utilizing a Simple routing policy. From there, API Gateway shuts down the TLS connection using AWS Certificate Manager (ACM), and serves as a proxy for the authentication request to AWS Lambda.
AWS Lambda verifies the client ID and client secret, and uses them to look up the in-Region client ID and client secret.
Lambda uses preconfigured environment variables to request the appropriate Region from DynamoDB.
AWS Lambda then passes the returned app client credentials to the in-Region Amazon Cognito deployment. Amazon Cognito verifies the client ID and client secret, and returns an access token to the Lambda function.
The client application (machine) can now use this token to access downstream applications.
The client authentication process in the secondary (failover) Region is the same, with one exception. In the secondary Region, the Lambda environment variables retrieve app client credentials from the DynamoDB database for the secondary Region Amazon Cognito instance.

To initiate a failover between Regions, the Route 53 domain record needs to be pointed to the secondary Region API Gateway Regional endpoint. The downstream application’s Amazon Cognito configuration files must also be updated to point to the secondary Region Amazon Cognito instance. Alerts can be enabled using Amazon CloudWatch alarms to notify system operators of issues that may warrant a failover (a manual process to help system operators decide when to failover). The entire failover process takes just a couple of seconds for DNS to switch over and for the application to start accepting tokens from the secondary Region. This failover process could be automated based on generated alerts.

This architecture is suitable for a hot standby, active-passive type of application deployment. It is important to note that independent Amazon Cognito environments are being used in each Region, so you will need to set up your application to failover to the secondary Region for authentication. For example, your backend should be able to accept and validate access tokens from both primary and secondary Amazon Cognito user pools. To learn more about disaster recovery options in AWS, visit Disaster recovery options in the cloud.

Architecture overview

Figure 1 shows you how to build a multi-Region machine-to-machine architecture using Amazon Cognito, which uses DynamoDB global tables to perform the data replication. A Lambda function is utilized to retrieve the credentials for the active Region that the application is operating in, and a Regional Amazon Cognito endpoint returns the required token.

Figure 1. Multi-Region Amazon Cognito machine-to-machine architecture

Process flow

The Route53 domain record for the authentication proxy service is given to the client application and pointed at the API Gateway Regional endpoint. The client passes primary Region app client credentials to the API Gateway.
API Gateway passes Lambda the client ID and client secret pair.
Lambda does a lookup in DynamoDB to verify the client ID and client secret. After the identity is confirmed, Lambda uses Region-based environment variables to identify if the client should be using the primary Region or secondary Region for authenticating to Amazon Cognito. Lambda retrieves the Region-based client ID and client secret from DynamoDB.
Lambda passes the Region-based app ID and secret to Amazon Cognito, which verifies the client ID and client secret, and returns an access token to the Lambda function.
Lambda passes the access token from the Regional Amazon Cognito environment back to the client to be used against Region-based backend applications.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Note: Ensure that you are following your organization’s security best practices while deploying this architecture.

Implementation

This implementation will focus on the Lambda logic that is used to retrieve user credentials based on Lambda environment variables. We will share snippets of the Lambda function code so you can create the logic necessary to enable multi-Region application architecture using Amazon Cognito app clients. In addition to the Lambda function, you will also need to create and configure the following resources using security implementation designated by your organization:

DynamoDB table with fields primaryClientID, primaryClientSecret, secondaryClientID, and secondaryClientSecret.

1. Because this table is used to store secrets, make sure encryption is enabled and you follow Security Best Practices for Amazon DynamoDB and your organization’s security best practices.
2. Enable DynamoDB global tables.

API Gateway Regional endpoint with TLS encryption using ACM.
Route53 domain record routing traffic to Regional API Gateway.
Downstream application configuration pointing at the Regional Amazon Cognito endpoint.
IAM role that grants access from Lambda to DynamoDB.

Now let’s configure our Lambda function using Node.js language. Within the Lambda console create a new Lambda function that you will author from scratch. Select the Node.js runtime and change the runtime Lambda execution role to the IAM role that you have created for Lambda. Next, we will walk through the Lambda function code and configuration.

Attach your new IAM role as a Lambda runtime role that will grant your Lambda function access to the DynamoDB table.
Within the Lambda configuration environment variables, create several key-value variables in the Lambda function. The following are the environmental variables we will add.

1. key: OAUTH_HOST_PRIMARY, value: https://${cognito-primary-region-domain-name}/oauth2/token
2. key: OAUTH_HOST_SECONDARY, value: https://${cognito-secondary-region-domain-name}/oauth2/token

Import the Node.js libraries using the following code.

"dependencies": {
    "aws-sdk": "^2.723.0",
    "aws-serverless-express": "^3.3.8",
    "axios": "^0.21.1",
    "base-64": "^1.0.0",
    "cors": "^2.8.5",
    "dotenv": "^8.2.0",
    "express": "^4.17.1",
    "jsonwebtoken": "^8.5.1",
    "jwt-decode": "^2.2.0",
    }

Parse data from the incoming client application authentication request.

router.post("/", async (request, response) => {
  const client_id = request.body["client_id"];
  const client_secret = request.body["client_secret"]
      
  const client = await dynamoDB.getClientCredentialById(client_id, client_secret);
    
  if (client === undefined) {
    return response.status(400).json({
       message:
          "Client not configured for in Cognito. Please check with support team",
        });
      }
    
   const client_credentials = getClientCredentials(client);
   let access_token = await authService.getJwtToken(client_credentials);
          
   response.send(access_token);
});

Reference the environment variables to determine the Region that the Lambda function is operating in and set the Region as primary or secondary.

getClientCredentials = client => {
   return region === "us-east-1" ? { clientId: client.primaryClientId, clientSecret: client.primaryClientSecret, oAuthHost: config.OAUTH_HOST }
   : region === "us-east-2" ? { clientId: client.secondaryClientId, clientSecret: client.secondaryClientSecret, oAuthHost: config.OAUTH_HOST_EAST_2 }
   : {};
}

Verify the client ID and client secret against the DynamoDB table and get the Region-based client ID and client secret.

const getClientCredentialById = async (client_id, client_secret) => {

  let params = {
    TableName: clientCredentialTable,
    Key: {
      primaryClientId: client_id,
      primaryClientSecret: client_secret,
    },
  };

  const clientCredential = await ddb.get(params).promise();
  return clientCredential.Item;
};

Pass the Regional client ID and client secret to Amazon Cognito. You will receive an access token from Amazon Cognito.

   const base64ClientCredentials = base64.encode(
     client_credentials.clientId.concat(":").concat(client_credentials.clientSecret)
   );
   const headers = {
     "content-type": "application/x-www-form-urlencoded",
     authorization: "Basic " + base64ClientCredentials,
    };
    const data = "grant_type=client_credentials";
    
    // Post request to Cognito OAuth URL
    const token = await Axios.post(client_credentials.oAuthHost, data, { headers });
    return token.data;
 };

Pass the access token from the Regional Amazon Cognito environment back to the client to be used against Region-based backend applications.

let access_token = await authService.getJwtToken(client_credentials);
 
  response.send(access_token);

New app client creation

You need to implement this Lambda function in both the primary and secondary Regions. Modify the environmental variables in the secondary Region with the secondary Region’s information.

To enroll a new client in this multi-Region architecture using Amazon Cognito we will go through the process as shown in the following illustration.

Figure 2. Creating a new app client

You need to create a new:

App ID and secret in Amazon Cognito in the primary Region.
App ID and secret in Amazon Cognito in the secondary Region.
Item in DynamoDB table consisting of the Amazon Cognito credentials created: primaryClientID, primaryClientSecret, secondaryClientID, and secondaryClientSecret.

Failover process

In this blog post we are creating a hot standby, active-passive type of application deployment. You will need to ensure your application is configured to be able to use either the primary Region Amazon Cognito or the secondary Region Amazon Cognito environment. The process to failover between Regions consists of the following:

Start application backend in the secondary Region.
Reconfigure the application backend Amazon Cognito identity provider YAML file to point at the secondary Region Amazon Cognito identity provider.
Modify the Route 53 domain record to point client applications at the secondary Region API Gateway Regional endpoint.

Cleaning up

In order to avoid unnecessary charges, please be sure to clean up any resources that were built as part of this architecture that are no longer in use.

Conclusion

In this blog post, we presented a solution that allows you to failover Amazon Cognito app clients from one AWS Region to another Region. The benefits of this architecture will allow you to have greater flexibility for running your Amazon Cognito app clients in the Region that is best suited for your use case. With this solution you now have the capability to failover Amazon Cognito app clients to a different AWS Region in the event of application or system errors.

Several variants of this solution can be implemented using the provided Lambda failover logic. You can store App ID and secret in AWS Secrets Manager. To learn more, see How to replicate secrets in AWS Secrets Manager to multiple Regions. You can also automate the failover process between primary and secondary Regions. To accomplish this, you will need to evaluate events which should cause a failover in your environment. Later, build the appropriate automation to failover your downstream application to the secondary Region Amazon Cognito deployment.

Amazon Cognito can be used for machine authentication as per the limits posted in Quotas in Amazon Cognito. Review the limits of number of app clients per user pool and the other applicable rate limits (for example, client credentials rate limits) and verify that these limits meet the needs of your application.

Optum’s Story

UnitedHealth Group (UHG) is the health and well-being company responsible for over 150 million lives globally. Optum, a part of UnitedHealth Group, is a health services business serving the healthcare marketplace, including payers, care providers, employers, governments, life sciences companies and consumers, through its OptumHealth, OptumInsight, OptumRx, and OptumServe businesses.

Federated Data Services (FDS) is the power behind the scenes that enables interoperability and secure transmission of personally identifiable information. It protects health information between lines of businesses, technology systems, applications, members and providers. With FDS, data is centralized, making it easier to share and retrieve by other systems. This assures businesses get a flexible and scalable solution that adapts to changes in technology and business needs.

Noise