Tag Archives: Replication

Quicksilver v2: evolution of a globally distributed key-value store (Part 1)

Post Syndicated from Anton Dort-Golts original https://blog.cloudflare.com/quicksilver-v2-evolution-of-a-globally-distributed-key-value-store-part-1/

Quicksilver is a key-value store developed internally by Cloudflare to enable fast global replication and low-latency access on a planet scale. It was initially designed to be a global distribution system for configurations, but over time it gained popularity and became the foundational storage system for many products in Cloudflare.

A previous post described how we moved Quicksilver to production and started replicating on all machines across our global network. That is what we called Quicksilver v1: each server has a full copy of the data and updates it through asynchronous replication. The design served us well for some time. However, as our business grew with an ever-expanding data center footprint and a growing dataset, it became more and more expensive to store everything everywhere.

We realized that storing the full dataset on every server is inefficient. Due to the uniform design, data accessed in one region or data center is replicated globally, even if it’s never accessed elsewhere. This leads to wasted disk space. We decided to introduce a more efficient system with two new server roles: replica, which stores the full dataset and proxy, which acts as a persistent cache, evicting unused key-value pairs to free up some disk space. We call this design Quicksilver v1.5 – an interim step towards a more sophisticated and scalable system.

To understand how those two roles helped us reduce disk space usage, we first need to share some background on our setup and introduce some terminology. Cloudflare is architected in a way where we have a few hyperscale core data centers that form our control plane, and many smaller data centers distributed across the globe where resources are more constrained. Quicksilver has dozens of servers in the core data centers with terabytes of storage called root nodes. In the smaller data centers, though, things are different. A typical data center has two types of nodes: intermediate nodes and leaf nodes. Intermediate servers replicate data either from the other intermediate nodes or directly from the root nodes. Leaf nodes serve end user traffic, and receive updates from intermediate servers, effectively being leaves of a replication tree. Disk capacity varies significantly between node types. While root nodes aren’t facing an imminent disk space bottleneck, it’s a definite concern for leaf nodes.

Every server – whether it’s a root, intermediate, or leaf – hosts 10 Quicksilver instances. These are independent databases, each used by specific Cloudflare services or products such as the DNS, CDN  or WAF


Figure 1. Global Quicksilver

Let’s consider the role distribution. Instead of hosting ten full datasets on every machine within a data center, what if we deploy only a few replicas in each?  The remaining servers would be proxies, maintaining a persistent cache of hot keys and querying replicas for any cache misses.


Figure 2. Role allocation for different Quicksilver instances

Data centers across our network are very different in size, ranging from hundreds of servers to a single rack with just a few servers. To ensure every data center has at least one replica, the simplest initial step is an even split: on each server, place five replicas of some instances and five proxies for others. The change immediately frees up disk space, as the cached hot dataset on a proxy should be smaller than a full replica. While it doesn’t remove the bottleneck entirely, it could, in theory, lead to an up to 50% reduction in disk space usage. More importantly, it lays the foundation for a new distributed design of Quicksilver, where queries can be served by multiple machines in a data center, paving the way for further horizontal scaling. Additionally, an iterative approach helps to battle-proof the code changes earlier.

Can it even work?

Before committing to building Quicksilver v1.5, we wanted to be sure that the proxy/replica design would actually work for our workload. If proxies needed to cache the entire dataset for good performance, then it would be a dead end, offering no potential disk space benefits. To assess this, we built a data pipeline which pushes accessed keys from all across our network to ClickHouse. This allowed us to estimate typical sizes of working sets. Our analysis revealed that:

  • in large data centers approximately, 20% of the keyspace was in use

  • in small data centers this number dropped to just about 1%

These findings gave us confidence that the caching approach should work, though it wouldn’t be without its challenges.

Persistent caching

When talking about caches, the first thing that comes to mind is an in-memory cache. However, this cannot work for Quicksilver for two main reasons: memory usage and the “cold cache” problem.

Indeed, with billions of stored keys, even a fraction of them would lead to an unmanageable increase in memory usage. System restarts should not affect performance, which means that cache data must be preserved somewhere anyway. So we decided to make the cache persistent and store it in the same way as full datasets: in our embedded RocksDB. Thus, cached keys normally sit on disk and can be retrieved on-demand with low memory footprint.

When a key cannot be found in the proxy’s cache, we request it from a replica using our internal distributed key-value protocol, and put it into a local cache after processing.

Evictions are based on RocksDB compaction filters. Compaction filters allow defining custom logic executed in background RocksDB threads responsible for compacting files on disk. Each key-value pair is processed with a filter on a regular basis, evicting least recently used data from the disk when available disk space drops below a certain threshold called a soft limit. To track keys accessed on disk, we have an LRU-like in-memory data structure, which is passed to the compaction filter to set last access date in key metadata and inform potential evictions.

However, with some specific workloads there is still a chance that evictions will not keep up with disk space growth, and for this scenario we have a hard limit: when available disk space drops below a critical threshold, we temporarily stop adding new keys to the cache. This hurts performance, but it acts as a safeguard, ensuring our proxies remain stable and don’t overflow under a massive surge of requests.

Consistency and asynchronous replication

Quicksilver has, from the start, provided sequential consistency to clients: if key A was written before B, it’s not possible to read B and not A. We are committed to maintaining this guarantee in the new design. We have experienced Hyrum’s Law first hand, with Quicksilver being so widely adopted across the company that every property we introduced in earlier versions is now relied upon by other teams. This means that changing behaviour would inevitably break existing functionality and introduce bugs.

However, there is one thing standing in our way: asynchronous replication. Quicksilver replication is asynchronous mainly because machines in different parts of the world replicate at different speeds, and we don’t want a single server to slow down the entire tree. But it turns out in a proxy-replica design, independent replication progress can result in non-monotonic reads!

Consider the following scenario: a client sequentially writes keys A, B, C, .. K one after another to the Quicksilver root node. These keys are asynchronously replicated through data centers across our network with varying latency. Imagine we have a proxy on index 5, which has observed keys from A to E, and two replicas: 

  • replica_1 is at index 2 (slightly behind the proxy), having only received A and B

  • replica_2 at index 9, which is slightly ahead due to a faster replication path and has received all keys from A to I


Figure 3. Asynchronous replication in QSv1.5

Now, a client performs two successive requests on a proxy, each time reading the keys E, F, G, H and I. For simplicity, we assume these keys are not cacheable (for example, due to low disk space). The proxy’s first remote request is routed to replica_2, which already has all keys and responds back with values. To prevent hot spots in a data center, we load balance requests from proxies, and the next one lands on replica_1, which hasn’t received any of the requested keys yet, and responds with a “not found” error.

So, which result is correct?

The correct behavior here is that of Quicksilver v1, which we aim to preserve. If the server on replication index 5 were a replica instead of a proxy, it would have seen updates for keys A through E inclusive, resulting in E being the only key in both replies, while all other keys cannot be found yet. Which means responses from both replica_1 and replica_2 are wrong!

Therefore, to maintain previous guarantees and API backwards compatibility, Quicksilver v1.5 must address two crucial consistency problems: cases where the replica is ahead of the proxy, and conversely, where it lags behind. For now let’s focus on the case where a proxy lags behind a replica.

Multiversion concurrency control

In our example, replica_2 responds to a request from a proxy “from the past”. We cannot use any locks for synchronizing two servers, as it would introduce undesirable delays to the replication tree, defeating the purpose of asynchronous replication. The only option is for replicas to maintain a history of recent updates. This naturally leads us to implementing multiversion concurrency control (MVCC), a popular database mechanism for tracking changes in a non-blocking fashion, where for any key we can keep multiple versions of its values for different points in time.

With MVCC, we no longer overwrite the latest value of a key in the default column family for every update. Instead, we introduced a new MVCC column family in RocksDB, where all updates are stored with a corresponding replication index. Lookup for a key at some index in the past goes as follows:

  1. First we search in the default column family. If a key is found and the write timestamp is not greater than the index of a requesting proxy, we can use it straight away.

  2. Otherwise, we begin scanning the MVCC column family, where keys have unique suffixes based on latest timestamps for which they are still valid.

In the example above, replica_2 has MVCC enabled and has keys A@1 .. K@11 in its default column family. The MVCC is initially empty, because no keys have been overwritten yet. When it receives a request for, say, key H with target index 5, it first makes a lookup in a default column family and finds the given key, but its timestamp is 8, which means this version should not be visible to the proxy yet. It then scans the MVCC, finds no matching previous versions and responds with “not found” to the proxy. Should key H be updated twice at indexes 4 and 8, we would have placed the initial version into MVCC before overwriting it in the default column family, and the proxy would receive the first version in response.

If a key E is requested at index 5, replica_2 can find it quickly in the default column family and return it back to the proxy. There is no need to read from MVCC, as the timestamp of the latest version (5) satisfies the request.

Another corner case to consider is deletions. When a key is deleted and then re-written, we need to explicitly mark the period of removal in MVCC. For that we’ve implemented tombstones – a special value format for absent keys.

Finally, we need to make sure that key history is not growing uncontrollably, using up all of the disk space available. Luckily we don’t actually need to record history for a long period of time, it just needs to cover the maximum replication index difference between any two machines. And in practice, a two-hour interval turned out to be way more than enough, while adding only about 500 MB of extra disk space usage. All records in the MVCC column family older than two hours are garbage collected, and for that again we use custom RocksDB compaction filters.

Sliding window

Now we know how to deal with proxies lagging behind replicas. But what about the opposite case, when a proxy is ahead of replicas?

The simplest solution is for replicas to not serve requests with a target index higher than its own. After all, it cannot know about keys from the future, whether they will be added, updated, or removed. In fact, our first implementation just returned an error when the proxy was ahead, as we expected it to happen quite infrequently. But after rolling out gradually to a few data centers, our metrics made it clear that the approach was not going to work.

This led us to analyze which keys are affected by this kind of replication asymmetry. It’s definitely not keys added or updated a long time ago, because replicas would already have the changes replicated. The only problematic keys are those updated very recently, which the proxy already knows about, but the replica does not.

With this insight, we realized that the issue should be solved on the proxies rather than on the replica side. By preserving all recent updates locally, the proxy can avoid querying the replica. This became known as the sliding window approach.

The sliding window retains all recent updates written in a short, rolling timeframe. Unlike cached keys, items in the window cannot be evicted until they move outside of the window. Internally, the sliding window is defined by lower and upper boundary pointers. These are kept in memory, and can easily be restored after a reload from the current database index and the pre-configured window size.


Figure 4. The sliding window shifts when replication updates arrive

When a new update event arrives from the replication layer we add it to the sliding window by moving both the upper and lower boundary one position higher. Thereby, we maintain the fixed size of the window. Keys written before the lower bound can be evicted by the compaction filter, which is aware of current sliding window boundaries.

Negative lookups

Another problem arising with our distributed replica-proxy design is negative lookups – requests for keys which don’t exist in the database. Interestingly, in our workloads we see about ten times more negative lookups than positive ones!

But why is it a problem? Unfortunately, each negative lookup will be a cache miss on a proxy, requiring a request to a replica. Given the volume of requests and proportion of such lookups, it would be a disaster for performance, with overloaded replicas, overused data center networks, and massive latency degradation. We needed a fast and efficient approach to identifying non-existing keys directly at the proxy level.

In v1, negative lookups are the quickest type of requests. We rely on a special probabilistic data structure – Bloom filters – used in RocksDB to determine if the requested key might belong to a certain data file containing a range of sorted keys (called Sorted Sequence Table or SST) or definitely not. 99% of the time, negative lookups are served using only this in-memory data structure, avoiding the need for disk I/O.

One approach we considered for proxies was to cache negative lookups. Two problems immediately arise:

  • How big is the keyspace of negative lookups? In theory, it’s infinite, but the real size was unclear. We can store it in our cache only if it is small enough.

  • Cached negative lookups would no longer be served by the fast Bloom filters. We have row and block caches in RocksDB, but the hit rate is nowhere near the filters for SSTs, which means negative lookups would end up going to disk more often.

These turned out to be dealbreakers: not only was the negative keyspace vast, greatly exceeding the actual keyspace (by a thousand times for some instances!), but clients also need lookups to be really fast, ideally served from memory.

In pursuit of probabilistic data structures which could give us a dynamic compact representation of a full keyspace on proxies, we spent some time exploring Cuckoo filters. Unfortunately, with 5 billion keys it takes about 18 GB to have a false positive rate similar to Bloom filters (which only require 6 GB). And this is not only about wasted disk space — to be fast we have to keep it all in memory too!

Clearly some other solution was needed.

Finally, we decided to implement key and value separation, storing all keys on proxies, but persisting values only for cached keys. Evicting a key from the cache actually results in the removal of its value.

But wait, don’t the keys, even stripped of values, take a lot of space? Well, yes and no.

The total size of pure keys in Quicksilver is approximately 11 times smaller than the full dataset. Of course, it’s larger than any representation by probabilistic data structure, but there are some very desirable properties to such a solution. Firstly, we continue to enjoy fast Bloom filter lookups in RocksDB. Another benefit is that it unlocks some cool optimizations for range queries in a distributed context.

We may revisit it one day, but so far it has worked great for us.

Discovery mechanism

Having solved all of the above challenges, one bit remained to be sorted out to make distributed query execution work: how can proxies discover replicas?

Within the local data center it is fairly easy. Each one runs its own consul cluster, where machines are registered as services. Consul is well integrated with our internal DNS resolvers, and with a single DNS request, we can get the names of all replicas running in a data center, which proxies can directly connect to.

However, data centers vary in size, servers are constantly added and removed, and having only local discovery would not be enough for the system to work reliably. Proxies also need to find replicas in other nearby data centers.

We had previously encountered a similar problem with our replication layer. Initially, the replication topology was statically defined in a configuration and distributed to all servers, such that they know from which sources they should replicate. While simple, this approach was quite fragile and tedious to operate. It led to a rigid replication tree with suboptimal overall performance, unable to adapt to network changes.

Our solution to this problem was the Network Oracle – a special overlay network based on a gossip protocol and consisting of intermediate nodes in our data centers. Each member of this overlay constantly exchanges status and metainformation with other nodes, which helps us see active members in near-real time. Each member runs network probes measuring round-trip time to its peers, making it easy to find closest (in terms of RTT) active intermediate nodes to form a low-latency replication tree. Introducing the Network Oracle was a major improvement: we no longer needed to reconfigure the topology, watch intermediate nodes or entire data centers go down, or investigate frequent replication issues. Replication is now a completely self-organized and self-healing dynamic system.

Naturally, we decided to reuse the Network Oracle for our discovery mechanism. It consists of two subproblems: data center discovery and specific service lookup. We use the Network Oracle to find the closest data centers. Adding all machines running Quicksilver to the same overlay would be inefficient because of significant increase of network traffic and message delivery times. Instead, we use intermediate nodes as sources of network proximity information for the leaf nodes. Knowing which data centers are nearby, we can directly send DNS queries there to resolve specific services – Quicksilver replicas in this case.

Proxies maintain a pool of connections to active replicas and distribute requests among them to smooth out the load and avoid hotspots in a data center. Proxies also have a health-tracking mechanism, monitoring the state of connections and errors coming from replicas, and temporarily deprioritizing or isolating potentially faulty ones.


Figure 5. Internal replica request errors

To demonstrate its efficiency, we graphed errors coming from replica requests, which showed that such errors almost disappeared after introducing the new discovery system.

Results

Our objective with Quicksilver v1.5 was simple: gain some disk space without losing request latency, because clients rely heavily on us being fast. While the replica-proxy design delivered significant space savings, what about latencies?

Proxy


Replica


Figure 6. Proxy-replica latency comparison

Above, we have the 99.9% percentile of request latency on both a replica and proxy during a 24-hour window. One can hardly find a difference between the two. Surprisingly, proxies can even be slightly faster than replicas sometimes, likely because of smaller datasets on disk!

Quicksilver v1.5 is released but our journey to a highly scalable and efficient solution is not over. In the next post we’ll share what challenges we faced with the following iteration. Stay tuned!

Thank you

This project was a big team effort, so we’d like to thank everyone on the Quicksilver team – it would not have come true without you all.

  • Aleksandr Matveev

  • Aleksei Surikov

  • Alex Dzyoba

  • Alexandra (Modi) Stana-Palade

  • Francois Stiennon

  • Geoffrey Plouviez

  • Ilya Polyakovskiy

  • Manzur Mukhitdinov

  • Volodymyr Dorokhov

How to set up ongoing replication from your third-party secrets manager to AWS Secrets Manager

Post Syndicated from Laurens Brinker original https://aws.amazon.com/blogs/security/how-to-set-up-ongoing-replication-from-your-third-party-secrets-manager-to-aws-secrets-manager/

Secrets managers are a great tool to securely store your secrets and provide access to secret material to a set of individuals, applications, or systems that you trust. Across your environments, you might have multiple secrets managers hosted on different providers, which can increase the complexity of maintaining a consistent operating model for your secrets. In these situations, centralizing your secrets in a single source of truth, and replicating subsets of secrets across your other secrets managers, can simplify your operating model.

This blog post explains how you can use your third-party secrets manager as the source of truth for your secrets, while replicating a subset of these secrets to AWS Secrets Manager. By doing this, you will be able to use secrets that originate and are managed from your third-party secrets manager in Amazon Web Services (AWS) applications or in AWS services that use Secrets Manager secrets.

I’ll demonstrate this approach in this post by setting up a sample open-source HashiCorp Vault to create and maintain secrets and create a replication mechanism that enables you to use these secrets in AWS by using AWS Secrets Manager. Although this post uses HashiCorp Vault as an example, you can also modify the replication mechanism to use secrets managers from other providers.

Important: This blog post is intended to provide guidance that you can use when planning and implementing a secrets replication mechanism. The examples in this post are not intended to be run directly in production, and you will need to take security hardening requirements into consideration before deploying this solution. As an example, HashiCorp provides tutorials on hardening production vaults.

You can use these links to navigate through this post:

Why and when to consider replicating secrets
Two approaches to secrets replication
Replicate secrets to AWS Secrets Manager with the pull model
Solution overview
Set up the solution
Step 1: Deploy the solution by using the AWS CDK toolkit
Step 2: Initialize the HashiCorp Vault
Step 3: Update the Vault connection secret
Step 4: (Optional) Set up email notifications for replication failures
Test your secret replication
Update a secret
Secret replication logic
Use your secret
Manage permissions
Options for customizing the sample solution

Why and when to consider replicating secrets

The primary use case for this post is for customers who are running applications on AWS and are currently using a third-party secrets manager to manage their secrets, hosted on-premises, in the AWS Cloud, or with a third-party provider. These customers typically have existing secrets vending processes, deployment pipelines, and procedures and processes around the management of these secrets. Customers with such a setup might want to keep their existing third-party secrets manager and have a set of secrets that are accessible to workloads running outside of AWS, as well as workloads running within AWS, by using AWS Secrets Manager.

Another use case is for customers who are in the process of migrating workloads to the AWS Cloud and want to maintain a (temporary) hybrid form of secrets management. By replicating secrets from an existing third-party secrets manager, customers can migrate their secrets to the AWS Cloud one-by-one, test that they work, integrate the secrets with the intended applications and systems, and once the migration is complete, remove the third-party secrets manager.

Additionally, some AWS services, such as Amazon Relational Database Service (Amazon RDS) Proxy, AWS Direct Connect MACsec, and AD Connector seamless join (Linux), only support secrets from AWS Secrets Manager. Customers can use secret replication if they have a third-party secrets manager and want to be able to use third-party secrets in services that require integration with AWS Secrets Manager. That way, customers don’t have to manage secrets in two places.

Two approaches to secrets replication

In this post, I’ll discuss two main models to replicate secrets from an external third-party secrets manager to AWS Secrets Manager: a pull model and a push model.

Pull model
In a pull model, you can use AWS services such as Amazon EventBridge and AWS Lambda to periodically call your external secrets manager to fetch secrets and updates to those secrets. The main benefit of this model is that it doesn’t require any major configuration to your third-party secrets manager. The AWS resources and mechanism used for pulling secrets must have appropriate permissions and network access to those secrets. However, there could be a delay between the time a secret is created and updated and when it’s picked up for replication, depending on the time interval configured between pulls from AWS to the external secrets manager.

Push model
In this model, rather than periodically polling for updates, the external secrets manager pushes updates to AWS Secrets Manager as soon as a secret is added or changed. The main benefit of this is that there is minimal delay between secret creation, or secret updating, and when that data is available in AWS Secrets Manager. The push model also minimizes the network traffic required for replication since it’s a unidirectional flow. However, this model adds a layer of complexity to the replication, because it requires additional configuration in the third-party secrets manager. More specifically, the push model is dependent on the third-party secrets manager’s ability to run event-based push integrations with AWS resources. This will require a custom integration to be developed and managed on the third-party secrets manager’s side.

This blog post focuses on the pull model to provide an example integration that requires no additional configuration on the third-party secrets manager.

Replicate secrets to AWS Secrets Manager with the pull model

In this section, I’ll walk through an example of how to use the pull model to replicate your secrets from an external secrets manager to AWS Secrets Manager.

Solution overview

Figure 1: Secret replication architecture diagram

Figure 1: Secret replication architecture diagram

The architecture shown in Figure 1 consists of the following main steps, numbered in the diagram:

  1. A Cron expression in Amazon EventBridge invokes an AWS Lambda function every 30 minutes.
  2. To connect to the third-party secrets manager, the Lambda function, written in NodeJS, fetches a set of user-defined API keys belonging to the secrets manager from AWS Secrets Manager. These API keys have been scoped down to give read-only access to secrets that should be replicated, to adhere to the principle of least privilege. There is more information on this in Step 3: Update the Vault connection secret.
  3. The third step has two variants depending on where your third-party secrets manager is hosted:
    1. The Lambda function is configured to fetch secrets from a third-party secrets manager that is hosted outside AWS. This requires sufficient networking and routing to allow communication from the Lambda function.

      Note: Depending on the location of your third-party secrets manager, you might have to consider different networking topologies. For example, you might need to set up hybrid connectivity between your external environment and the AWS Cloud by using AWS Site-to-Site VPN or AWS Direct Connect, or both.

    2. The Lambda function is configured to fetch secrets from a third-party secrets manager running on Amazon Elastic Compute Cloud (Amazon EC2).

    Important: To simplify the deployment of this example integration, I’ll use a secrets manager hosted on a publicly available Amazon EC2 instance within the same VPC as the Lambda function (3b). This minimizes the additional networking components required to interact with the secrets manager. More specifically, the EC2 instance runs an open-source HashiCorp Vault. In the rest of this post, I’ll refer to the HashiCorp Vault’s API keys as Vault tokens.

  4. The Lambda function compares the version of the secret that it just fetched from the third-party secrets manager against the version of the secret that it has in AWS Secrets Manager (by tag). The function will create a new secret in AWS Secrets Manager if the secret does not exist yet, and will update it if there is a new version. The Lambda function will only consider secrets from the third-party secrets manager for replication if they match a specified prefix. For example, hybrid-aws-secrets/.
  5. In case there is an error synchronizing the secret, an email notification is sent to the email addresses which are subscribed to the Amazon Simple Notification Service (Amazon SNS) Topic deployed. This sample application uses email notifications with Amazon SNS as an example, but you could also integrate with services like ServiceNow, Jira, Slack, or PagerDuty. Learn more about how to use webhooks to publish Amazon SNS messages to external services.

Set up the solution

In this section, I walk through deploying the pull model solution displayed in Figure 1 using the following steps:
Step 1: Deploy the solution by using the AWS CDK toolkit
Step 2: Initialize the HashiCorp Vault
Step 3: Update the Vault connection secret
Step 4: (Optional) Set up email notifications for replication failures

Step 1: Deploy the solution by using the AWS CDK toolkit

For this blog post, I’ve created an AWS Cloud Development Kit (AWS CDK) script, which can be found in this AWS GitHub repository. Using the AWS CDK, I’ve defined the infrastructure depicted in Figure 1 as Infrastructure as Code (IaC), written in TypeScript, ready for you to deploy and try out. The AWS CDK is an open-source software development framework that allows you to write your cloud application infrastructure as code using common programming languages such as TypeScript, Python, Java, Go, and so on.

Prerequisites:

To deploy the solution, the following should be in place on your system:

  1. Git
  2. Node (version 16 or higher)
  3. jq
  4. AWS CDK Toolkit. Install using npm (included in Node setup) by running npm install -g aws-cdk in a local terminal.
  5. An AWS access key ID and secret access key configured as this setup will interact with your AWS account. See Configuration basics in the AWS Command Line Interface User Guide for more details.
  6. Docker installed and running on your machine

To deploy the solution

  1. Clone the CDK script for secret replication.
    git clone https://github.com/aws-samples/aws-secrets-manager-hybrid-secret-replication-from-hashicorp-vault.git SecretReplication
  2. Use the cloned project as the working directory.
    cd SecretReplication
  3. Install the required dependencies to deploy the application.
    npm install
  4. Adjust any configuration values for your setup in the cdk.json file. For example, you can adjust the secretsPrefix value to change which prefix is used by the Lambda function to determine the subset of secrets that should be replicated from the third-party secrets manager.
  5. Bootstrap your AWS environments with some resources that are required to deploy the solution. With correctly configured AWS credentials, run the following command.
    cdk bootstrap

    The core resources created by bootstrapping are an Amazon Elastic Container Registry (Amazon ECR) repository for the AWS Lambda Docker image, an Amazon Simple Storage Service (Amazon S3) bucket for static assets, and AWS Identity and Access Management (IAM) roles with corresponding IAM policies. You can find a full list of the resources by going to the CDKToolkit stack in AWS CloudFormation after the command has finished.

  6. Deploy the infrastructure.
    cdk deploy

    This command deploys the infrastructure shown in Figure 1 for you by using AWS CloudFormation. For a full list of resources, you can view the SecretsManagerReplicationStack in AWS CloudFormation after the deployment has completed.

Note: If your local environment does not have a terminal that allows you to run these commands, consider using AWS Cloud9 or AWS CloudShell.

After the deployment has finished, you should see an output in your terminal that looks like the one shown in Figure 2. If successful, the output provides the IP address of the sample HashiCorp Vault and its web interface.

Figure 2: AWS CDK deployment output

Figure 2: AWS CDK deployment output

Step 2: Initialize the HashiCorp Vault

As part of the output of the deployment script, you will be given a URL to access the user interface of the open-source HashiCorp Vault. To simplify accessibility, the URL points to a publicly available Amazon EC2 instance running the HashiCorp Vault user interface as shown in step 3b in Figure 1.

Let’s look at the HashiCorp Vault that was just created. Go to the URL in your browser, and you should see the Raft Storage initialize page, as shown in Figure 3.

Figure 3: HashiCorp Vault Raft Storage initialize page

Figure 3: HashiCorp Vault Raft Storage initialize page

The vault requires an initial configuration to set up storage and get the initial set of root keys. You can go through the steps manually in the HashiCorp Vault’s user interface, but I recommend that you use the initialise_vault.sh script that is included as part of the SecretsManagerReplication project instead.

Using the HashiCorp Vault API, the initialization script will automatically do the following:

  1. Initialize the Raft storage to allow the Vault to store secrets locally on the instance.
  2. Create an initial set of unseal keys for the Vault. Importantly, for demo purposes, the script uses a single key share. For production environments, it’s recommended to use multiple key shares so that multiple shares are needed to reconstruct the root key, in case of an emergency.
  3. Store the unseal keys in init/vault_init_output.json in your project.
  4. Unseals the HashiCorp Vault by using the unseal keys generated earlier.
  5. Enables two key-value secrets engines:
    1. An engine named after the prefix that you’re using for replication, defined in the cdk.json file. In this example, this is hybrid-aws-secrets. We’re going to use the secrets in this engine for replication to AWS Secrets Manager.
    2. An engine called super-secret-engine, which you’re going to use to show that your replication mechanism does not have access to secrets outside the engine used for replication.
  6. Creates three example secrets, two in hybrid-aws-secrets, and one in super-secret-engine.
  7. Creates a read-only policy, which you can see in the init/replication-policy-payload.json file after the script has finished running, that allows read-only access to only the secrets that should be replicated.
  8. Creates a new vault token that has the read-only policy attached so that it can be used by the AWS Lambda function later on to fetch secrets for replication.

To run the initialization script, go back to your terminal, and run the following command.
./initialise_vault.sh

The script will then ask you for the IP address of your HashiCorp Vault. Provide the IP address (excluding the port) and choose Enter. Input y so that the script creates a couple of sample secrets.

If everything is successful, you should see an output that includes tokens to access your HashiCorp Vault, similar to that shown in Figure 4.

Figure 4: Initialize HashiCorp Vault bash script output

Figure 4: Initialize HashiCorp Vault bash script output

The setup script has outputted two tokens: one root token that you will use for administrator tasks, and a read-only token that will be used to read secret information for replication. Make sure that you can access these tokens while you’re following the rest of the steps in this post.

Note: The root token is only used for demonstration purposes in this post. In your production environments, you should not use root tokens for regular administrator actions. Instead, you should use scoped down roles depending on your organizational needs. In this case, the root token is used to highlight that there are secrets under super-secret-engine/ which are not meant for replication. These secrets cannot be seen, or accessed, by the read-only token.

Go back to your browser and refresh your HashiCorp Vault UI. You should now see the Sign in to Vault page. Sign in using the Token method, and use the root token. If you don’t have the root token in your terminal anymore, you can find it in the init/vault_init_output.json file.

After you sign in, you should see the overview page with three secrets engines enabled for you, as shown in Figure 5.

Figure 5: HashiCorp Vault secrets engines overview

Figure 5: HashiCorp Vault secrets engines overview

If you explore hybrid-aws-secrets and super-secret-engine, you can see the secrets that were automatically created by the initialization script. For example, first-secret-for-replication, which contains a sample key-value secret with the key secrets and value manager.

If you navigate to Policies in the top navigation bar, you can also see the aws-replication-read-only policy, as shown in Figure 6. This policy provides read-only access to only the hybrid-aws-secrets path.

Figure 6: Read-only HashiCorp Vault token policy

Figure 6: Read-only HashiCorp Vault token policy

The read-only policy is attached to the read-only token that we’re going to use in the secret replication Lambda function. This policy is important because it scopes down the access that the Lambda function obtains by using the token to a specific prefix meant for replication. For secret replication we only need to perform read operations. This policy ensures that we can read, but cannot add, alter, or delete any secrets in HashiCorp Vault using the token.

You can verify the read-only token permissions by signing into the HashiCorp Vault user interface using the read-only token rather than the root token. Now, you should only see hybrid-aws-secrets. You no longer have access to super-secret-engine, which you saw in Figure 5. If you try to create or update a secret, you will get a permission denied error.

Great! Your HashiCorp Vault is now ready to have its secrets replicated from hybrid-aws-secrets to AWS Secrets Manager. The next section describes a final configuration that you need to do to allow access to the secrets in HashiCorp Vault by the replication mechanism in AWS.

Step 3: Update the Vault connection secret

To allow secret replication, you must give the AWS Lambda function access to the HashiCorp Vault read-only token that was created by the initialization script. To do that, you need to update the vault-connection-secret that was initialized in AWS Secrets Manager as part of your AWS CDK deployment.

For demonstration purposes, I’ll show you how to do that by using the AWS Management Console, but you can also do it programmatically by using the AWS Command Line Interface (AWS CLI) or AWS SDK with the update-secret command.

To update the Vault connection secret (console)

  1. In the AWS Management Console, go to AWS Secrets Manager > Secrets > hybrid-aws-secrets/vault-connection-secret.
  2. Under Secret Value, choose Retrieve Secret Value, and then choose Edit.
  3. Update the vaultToken value to contain the read-only token that was generated by the initialization script.
Figure 7: AWS Secrets Manager - Vault connection secret page

Figure 7: AWS Secrets Manager – Vault connection secret page

Step 4: (Optional) Set up email notifications for replication failures

As highlighted in Figure 1, the Lambda function will send an email by using Amazon SNS to a designated email address whenever one or more secrets fails to be replicated. You will need to configure the solution to use the correct email address. To do this, go to the cdk.json file at the root of the SecretReplication folder and adjust the notificationEmail parameter to an email address that you own. Once done, deploy the changes using the cdk deploy command. Within a few minutes, you’ll get an email requesting you to confirm the subscription. Going forward, you will receive an email notification if one or more secrets fails to replicate.

Test your secret replication

You can either wait up to 30 minutes for the Lambda function to be invoked automatically to replicate the secrets, or you can manually invoke the function.

To test your secret replication

  1. Open the AWS Lambda console and find the Secret Replication function (the name starts with SecretsManagerReplication-SecretReplication).
  2. Navigate to the Test tab.
  3. For the text event action, select Create new event, create an event using the default parameters, and then choose the Test button on the right-hand side, as shown in Figure 8.
Figure 8: AWS Lambda - Test page to manually invoke the function

Figure 8: AWS Lambda – Test page to manually invoke the function

This will run the function. You should see a success message, as shown in Figure 9. If this is the first time the Lambda function has been invoked, you will see in the results that two secrets have been created.

Figure 9: AWS Lambda function output

Figure 9: AWS Lambda function output

You can find the corresponding logs for the Lambda function invocation in a Log group in AWS CloudWatch matching the name /aws/lambda/SecretsManagerReplication-SecretReplicationLambdaF-XXXX.

To verify that the secrets were added, navigate to AWS Secrets Manager in the console, and in addition to the vault-connection-secret that you edited before, you should now also see the two new secrets with the same hybrid-aws-secrets prefix, as shown in Figure 10.

Figure 10: AWS Secrets Manager overview - New replicated secrets

Figure 10: AWS Secrets Manager overview – New replicated secrets

For example, if you look at first-secret-for-replication, you can see the first version of the secret, with the secret key secrets and secret value manager, as shown in Figure 11.

Figure 11: AWS Secrets Manager – New secret overview showing values and version number

Figure 11: AWS Secrets Manager – New secret overview showing values and version number

Success! You now have access to the secret values that originate from HashiCorp Vault in AWS Secrets Manager. Also, notice how there is a version tag attached to the secret. This is something that is necessary to update the secret, which you will learn more about in the next two sections.

Update a secret

It’s a recommended security practice to rotate secrets frequently. The Lambda function in this solution not only replicates secrets when they are created — it also periodically checks if existing secrets in AWS Secrets Manager should be updated when the third-party secrets manager (HashiCorp Vault in this case) has a new version of the secret. To validate that this works, you can manually update a secret in your HashiCorp Vault and observe its replication in AWS Secrets Manager in the same way as described in the previous section. You will notice that the version tag of your secret gets updated automatically when there is a new secret replication from the third-party secrets manager to AWS Secrets Manager.

Secret replication logic

This section will explain in more detail the logic behind the secret replication. Consider the following sequence diagram, which explains the overall logic implemented in the Lambda function.

Figure 12: State diagram for secret replication logic

Figure 12: State diagram for secret replication logic

This diagram highlights that the Lambda function will first fetch a list of secret names from the HashiCorp Vault. Then, the function will get a list of secrets from AWS Secrets Manager, matching the prefix that was configured for replication. AWS Secrets Manager will return a list of the secrets that match this prefix and will also return their metadata and tags. Note that the function has not fetched any secret material yet.

Next, the function will loop through each secret name that HashiCorp Vault gave and will check if the secret exists in AWS Secrets Manager:

  • If there is no secret that matches that name, the function will fetch the secret material from HashiCorp Vault, including the version number, and create a new secret in AWS Secrets Manager. It will also add a version tag to the secret to match the version.
  • If there is a secret matching that name in AWS Secrets Manager already, the Lambda function will first fetch the metadata from that secret in HashiCorp Vault. This is required to get the version number of the secret, because the version number was not exposed when the function got the list of secrets from HashiCorp Vault initially. If the secret version from HashiCorp Vault does not match the version value of the secret in AWS Secrets Manager (for example, the version in HashiCorp vault is 2, and the version in AWS Secrets manager is 1), an update is required to get the values synchronized again. Only now will the Lambda function fetch the actual secret material from HashiCorp Vault and update the secret in AWS Secrets Manager, including the version number in the tag.

The Lambda function fetches metadata about the secrets, rather than just fetching the secret material from HashiCorp Vault straight away. Typically, secrets don’t update very often. If this Lambda function is called every 30 minutes, then it should not have to add or update any secrets in the majority of invocations. By using metadata to determine whether you need the secret material to create or update secrets, you minimize the number of times secret material is fetched both from HashiCorp Vault and AWS Secrets Manager.

Note: The AWS Lambda function has permissions to pull certain secrets from HashiCorp Vault. It is important to thoroughly review the Lambda code and any subsequent changes to it to prevent leakage of secrets. For example, you should ensure that the Lambda function does not get updated with code that unintentionally logs secret material outside the Lambda function.

Use your secret

Now that you have created and replicated your secrets, you can use them in your AWS applications or AWS services that are integrated with Secrets Manager. For example, you can use the secrets when you set up connectivity for a proxy in Amazon RDS, as follows.

To use a secret when creating a proxy in Amazon RDS

  1. Go to the Amazon RDS service in the console.
  2. In the left navigation pane, choose Proxies, and then choose Create Proxy.
  3. On the Connectivity tab, you can now select first-secret-for-replication or second-secret-for-replication, which were created by the Lambda function after replicating them from the HashiCorp Vault.
Figure 13: Amazon RDS Proxy - Example of using replicated AWS Secrets Manager secrets

Figure 13: Amazon RDS Proxy – Example of using replicated AWS Secrets Manager secrets

It is important to remember that the consumers of the replicated secrets in AWS Secrets Manager will require scoped-down IAM permissions to use the secrets and AWS Key Management Service (AWS KMS) keys that were used to encrypt the secrets. For example, see Step 3: Create IAM role and policy on the Set up shared database connections with Amazon RDS Proxy page.

Manage permissions

Due to the sensitive nature of the secrets, it is important that you scope down the permissions to the least amount required to prevent inadvertent access to your secrets. The setup adopts a least-privilege permission strategy, where only the necessary actions are explicitly allowed on the resources that are required for replication. However, the permissions should be reviewed in accordance to your security standards.

In the architecture of this solution, there are two main places where you control access to the management of your secrets in Secrets Manager.

Lambda execution IAM role: The IAM role assumed by the Lambda function during execution contains the appropriate permissions for secret replication. There is an additional safety measure, which explicitly denies any action to a resource that is not required for the replication. For example, the Lambda function only has permission to publish to the Amazon SNS topic that is created for the failed replications, and will explicitly deny a publish action to any other topic. Even if someone accidentally adds an allow to the policy for a different topic, the explicit deny will still block this action.

AWS KMS key policy: When other services need to access the replicated secret in AWS Secrets Manager, they need permission to use the hybrid-aws-secrets-encryption-key AWS KMS key. You need to allow the principal these permissions through the AWS KMS key policy. Additionally, you can manage permissions to the AWS KMS key for the principal through an identity policy. For example, this is required when accessing AWS KMS keys across AWS accounts. See Permissions for AWS services in key policies and Specifying KMS keys in IAM policy statements in the AWS KMS Developer Guide.

Options for customizing the sample solution

The solution that was covered in this post provides an example for replication of secrets from HashiCorp Vault to AWS Secrets Manager using the pull model. This section contains additional customization options that you can consider when setting up the solution, or your own variation of it.

  1. Depending on the solution that you’re using, you might have access to different metadata attached to the secrets, which you can use to determine if a secret should be updated. For example, if you have access to data that represents a last_updated_datetime property, you could use this to infer whether or not a secret ought to be updated.
  2. It is a recommended practice to not use long-lived tokens wherever possible. In this sample, I used a static vault token to give the Lambda function access to the HashiCorp Vault. Depending on the solution that you’re using, you might be able to implement better authentication and authorization mechanisms. For example, HashiCorp Vault allows you to use IAM auth by using AWS IAM, rather than a static token.
  3. This post addressed the creation of secrets and updating of secrets, but for your production setup, you should also consider deletion of secrets. Depending on your requirements, you can choose to implement a strategy that works best for you to handle secrets in AWS Secrets Manager once the original secret in HashiCorp Vault has been deleted. In the pull model, you could consider removing a secret in AWS Secrets Manager if the corresponding secret in your external secrets manager is no longer present.
  4. In the sample setup, the same AWS KMS key is used to encrypt both the environment variables of the Lambda function, and the secrets in AWS Secrets Manager. You could choose to add an additional AWS KMS key (which would incur additional cost), to have two separate keys for these tasks. This would allow you to apply more granular permissions for the two keys in the corresponding KMS key policies or IAM identity policies that use the keys.

Conclusion

In this blog post, you’ve seen how you can approach replicating your secrets from an external secrets manager to AWS Secrets Manager. This post focused on a pull model, where the solution periodically fetched secrets from an external HashiCorp Vault and automatically created or updated the corresponding secret in AWS Secrets Manager. By using this model, you can now use your external secrets in your AWS Cloud applications or services that have an integration with AWS Secrets Manager.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Secrets Manager re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Laurens Brinker

Laurens Brinker

Laurens is a Software Development Engineer working for AWS Security and is based in London. Previously, Laurens worked as a Security Solutions Architect at AWS, where he helped customers running their workloads securely in the AWS Cloud. Outside of work, Laurens enjoys cycling, a casual game of chess, and building open source projects.

How to replicate secrets in AWS Secrets Manager to multiple Regions

Post Syndicated from Fatima Ahmed original https://aws.amazon.com/blogs/security/how-to-replicate-secrets-aws-secrets-manager-multiple-regions/

On March 3, 2021, we launched a new feature for AWS Secrets Manager that makes it possible for you to replicate secrets across multiple AWS Regions. You can give your multi-Region applications access to replicated secrets in the required Regions and rely on Secrets Manager to keep the replicas in sync with the primary secret. In scenarios such as disaster recovery, you can read replicated secrets from your recovery Regions, even if your primary Region is unavailable. In this blog post, I show you how to automatically replicate a secret and access it from the recovery Region to support a disaster recovery plan.

With Secrets Manager, you can store, retrieve, manage, and rotate your secrets, including database credentials, API keys, and other secrets. When you create a secret using Secrets Manager, it’s created and managed in a Region of your choosing. Although scoping secrets to a Region is a security best practice, there are scenarios such as disaster recovery and cross-Regional redundancy that require replication of secrets across Regions. Secrets Manager now makes it possible for you to easily replicate your secrets to one or more Regions to support these scenarios.

With this new feature, you can create Regional read replicas for your secrets. When you create a new secret or edit an existing secret, you can specify the Regions where your secrets need to be replicated. Secrets Manager will securely create the read replicas for each secret and its associated metadata, eliminating the need to maintain a complex solution for this functionality. Any update made to the primary secret, such as a secret value updated through automatic rotation, will be automatically propagated by Secrets Manager to the replica secrets, making it easier to manage the life cycle of multi-Region secrets.

Note: Each replica secret is billed as a separate secret. For more details on pricing, see the AWS Secrets Manager pricing page.

Architecture overview

Suppose that your organization has a requirement to set up a disaster recovery plan. In this example, us-east-1 is the designated primary Region, where you have an application running on a simple AWS Lambda function (for the example in this blog post, I’m using Python 3). You also have an Amazon Relational Database Service (Amazon RDS) – MySQL DB instance running in the us-east-1 Region, and you’re using Secrets Manager to store the database credentials as a secret. Your application retrieves the secret from Secrets Manager to access the database. As part of the disaster recovery strategy, you set up us-west-2 as the designated recovery Region, where you’ve replicated your application, the DB instance, and the database secret.

To elaborate, the solution architecture consists of:

  • A primary Region for creating the secret, in this case us-east-1 (N. Virginia).
  • A replica Region for replicating the secret, in this case us-west-2 (Oregon).
  • An Amazon RDS – MySQL DB instance that is running in the primary Region and configured for replication to the replica Region. To set up read replicas or cross-Region replicas for Amazon RDS, see Working with read replicas.
  • A secret created in Secrets Manager and configured for replication for the replica Region.
  • AWS Lambda functions (running on Python 3) deployed in the primary and replica Regions acting as clients to the MySQL DBs.

This architecture is illustrated in Figure 1.

Figure 1: Architecture overview for a multi-Region secret replication with the primary Region active

Figure 1: Architecture overview for a multi-Region secret replication with the primary Region active

In the primary region us-east-1, the Lambda function uses the credentials stored in the secret to access the database, as indicated by the following steps in Figure 1:

  1. The Lambda function sends a request to Secrets Manager to retrieve the secret value by using the GetSecretValue API call. Secrets Manager retrieves the secret value for the Lambda function.
  2. The Lambda function uses the secret value to connect to the database in order to read/write data.

The replicated secret in us-west-2 points to the primary DB instance in us-east-1. This is because when Secrets Manager replicates the secret, it replicates the secret value and all the associated metadata, such as the database endpoint. The database endpoint details are stored within the secret because Secrets Manager uses this information to connect to the database and rotate the secret if it is configured for automatic rotation. The Lambda function can also use the database endpoint details in the secret to connect to the database.

To simplify database failover during disaster recovery, as I’ll cover later in the post, you can configure an Amazon Route 53 CNAME record for the database endpoint in the primary Region. The database host associated with the secret is configured with the database CNAME record. When the primary Region is operating normally, the CNAME record points to the database endpoint in the primary Region. The requests to the database CNAME are routed to the DB instance in the primary Region, as shown in Figure 1.

During disaster recovery, you can failover to the replica Region, us-west-2, to make it possible for your application running in this Region to access the Amazon RDS read replica in us-west-2 by using the secret stored in the same Region. As part of your failover script, the database CNAME record should also be updated to point to the database endpoint in us-west-2. Because the database CNAME is used to point to the database endpoint within the secret, your application in us-west-2 can now use the replicated secret to access the database read replica in this Region. Figure 2 illustrates this disaster recovery scenario.

Figure 2: Architecture overview for a multi-Region secret replication with the replica Region active

Figure 2: Architecture overview for a multi-Region secret replication with the replica Region active

Prerequisites

The procedure described in this blog post requires that you complete the following steps before starting the procedure:

  1. Configure an Amazon RDS DB instance in the primary Region, with replication configured in the replica Region.
  2. Configure a Route 53 CNAME record for the database endpoint in the primary Region.
  3. Configure the Lambda function to connect with the Amazon RDS database and Secrets Manager by following the procedure in this blog post.
  4. Sign in to the AWS Management Console using a role that has SecretsManagerReadWrite permissions in the primary and replica Regions.

Enable replication for secrets stored in Secrets Manager

In this section, I walk you through the process of enabling replication in Secrets Manager for:

  1. A new secret that is created for your Amazon RDS database credentials
  2. An existing secret that is not configured for replication

For the first scenario, I show you the steps to create a secret in Secrets Manager in the primary Region (us-east-1) and enable replication for the replica Region (us-west-2).

To create a secret with replication enabled

  1. In the AWS Management Console, navigate to the Secrets Manager console in the primary Region (N. Virginia).
  2. Choose Store a new secret.
  3. On the Store a new secret screen, enter the Amazon RDS database credentials that will be used to connect with the Amazon RDS DB instance. Select the encryption key and the Amazon RDS DB instance, and then choose Next.
  4. Enter the secret name of your choice, and then enter a description. You can also optionally add tags and resource permissions to the secret.
  5. Under Replicate Secret – optional, choose Replicate secret to other regions.

    Figure 3: Replicate a secret to other Regions

    Figure 3: Replicate a secret to other Regions

  6. For AWS Region, choose the replica Region, US West (Oregon) us-west-2. For Encryption Key, choose Default to store your secret in the replica Region. Then choose Next.

    Figure 4: Configure secret replication

    Figure 4: Configure secret replication

  7. In the Configure Rotation section, you can choose whether to enable rotation. For this example, I chose not to enable rotation, so I selected Disable automatic rotation. However, if you want to enable rotation, you can do so by following the steps in Enabling rotation for an Amazon RDS database secret in the Secrets Manager User Guide. When you enable rotation in the primary Region, any changes to the secret from the rotation process are also replicated to the replica Region. After you’ve configured the rotation settings, choose Next.
  8. On the Review screen, you can see the summary of the secret configuration, including the secret replication configuration.

    Figure 5: Review the secret before storing

    Figure 5: Review the secret before storing

  9. At the bottom of the screen, choose Store.
  10. At the top of the next screen, you’ll see two banners that provide status on:
    • The creation of the secret in the primary Region
    • The replication of the secret in the Secondary Region

    After the creation and replication of the secret is successful, the banners will provide you with confirmation details.

At this point, you’ve created a secret in the primary Region (us-east-1) and enabled replication in a replica Region (us-west-2). You can now use this secret in the replica Region as well as the primary Region.

Now suppose that you have a secret created in the primary Region (us-east-1) that hasn’t been configured for replication. You can also configure replication for this existing secret by using the following procedure.

To enable multi-Region replication for existing secrets

  1. In the Secrets Manager console, choose the secret name. At the top of the screen, choose Replicate secret to other regions.
    Figure 6: Enable replication for existing secrets

    Figure 6: Enable replication for existing secrets

    This opens a pop-up screen where you can configure the replica Region and the encryption key for encrypting the secret in the replica Region.

  2. Choose the AWS Region and encryption key for the replica Region, and then choose Complete adding region(s).
    Figure 7: Configure replication for existing secrets

    Figure 7: Configure replication for existing secrets

    This starts the process of replicating the secret from the primary Region to the replica Region.

  3. Scroll down to the Replicate Secret section. You can see that the replication to the us-west-2 Region is in progress.
    Figure 8: Review progress for secret replication

    Figure 8: Review progress for secret replication

    After the replication is successful, you can look under Replication status to review the replication details that you’ve configured for your secret. You can also choose to replicate your secret to more Regions by choosing Add more regions.

    Figure 9: Successful secret replication to a replica Region

    Figure 9: Successful secret replication to a replica Region

Update the secret with the CNAME record

Next, you can update the host value in your secret to the CNAME record of the DB instance endpoint. This will make it possible for you to use the secret in the replica Region without making changes to the replica secret. In the event of a failover to the replica Region, you can simply update the CNAME record to the DB instance endpoint in the replica Region as a part of your failover script

To update the secret with the CNAME record

  1. Navigate to the Secrets Manager console, and choose the secret that you have set up for replication
  2. In the Secret value section, choose Retrieve secret value, and then choose Edit.
  3. Update the secret value for the host with the CNAME record, and then choose Save.

    Figure 10: Edit the secret value

    Figure 10: Edit the secret value

  4. After you choose Save, you’ll see a banner at the top of the screen with a message that indicates that the secret was successfully edited.Because the secret is set up for replication, you can also review the status of the synchronization of your secret to the replica Region after you updated the secret. To do so, scroll down to the Replicate Secret section and look under Region Replication Status.

     

    Figure 11: Successful secret replication for a modified secret

    Figure 11: Successful secret replication for a modified secret

Access replicated secrets from the replica Region

Now that you’ve configured the secret for replication in the primary Region, you can access the secret from the replica Region. Here I demonstrate how to access a replicated secret from a simple Lambda function that is deployed in the replica Region (us-west-2).

To access the secret from the replica Region

  1. From the AWS Management Console, navigate to the Secrets Manager console in the replica Region (Oregon) and view the secret that you configured for replication in the primary Region (N. Virginia).

    Figure 12: View secrets that are configured for replication in the replica Region

    Figure 12: View secrets that are configured for replication in the replica Region

  2. Choose the secret name and review the details that were replicated from the primary Region. A secret that is configured for replication will display a banner at the top of the screen stating the replication details.

    Figure 13: The replication status banner

    Figure 13: The replication status banner

  3. Under Secret Details, you can see the secret’s ARN. You can use the secret’s ARN to retrieve the secret value from the Lambda function or application that is deployed in your replica Region (Oregon). Make a note of the ARN.

    Figure 14: View secret details

    Figure 14: View secret details

During a disaster recovery scenario when the primary Region isn’t available, you can update the CNAME record to point to the DB instance endpoint in us-west-2 as part of your failover script. For this example, my application that is deployed in the replica Region is configured to use the replicated secret’s ARN.

Let’s suppose your sample Lambda function defines the secret name and the Region in the environment variables. The REGION_NAME environment variable contains the name of the replica Region; in this example, us-west-2. The SECRET_NAME environment variable is the ARN of your replicated secret in the replica Region, which you noted earlier.

Figure 15: Environment variables for the Lambda function

Figure 15: Environment variables for the Lambda function

In the replica Region, you can now refer to the secret’s ARN and Region in your Lambda function code to retrieve the secret value for connecting to the database. The following sample Lambda function code snippet uses the secret_name and region_name variables to retrieve the secret’s ARN and the replica Region values stored in the environment variables.

secret_name = os.environ['SECRET_NAME']
region_name = os.environ['REGION_NAME']

def openConnection():
    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )
    try:
        get_secret_value_response = 
client.get_secret_value(
            SecretId=secret_name
        )
    except ClientError as e:
        if e.response['Error']['Code'] == 
'DecryptionFailureException':

Alternately, you can simply use the Python 3 sample code for the replicated secret to retrieve the secret value from the Lambda function in the replica Region. You can review the provided sample codes by navigating to the secret details in the console, as shown in Figure 16.

Figure 16: Python 3 sample code for the replicated secret

Figure 16: Python 3 sample code for the replicated secret

Summary

When you plan for disaster recovery, you can configure replication of your secrets in Secrets Manager to provide redundancy for your secrets. This feature reduces the overhead of deploying and maintaining additional configuration for secret replication and retrieval across AWS Regions. In this post, I showed you how to create a secret and configure it for multi-Region replication. I also demonstrated how you can configure replication for existing secrets across multiple Regions.

I showed you how to use secrets from the replica Region and configure a sample Lambda function to retrieve a secret value. When replication is configured for secrets, you can use this technique to retrieve the secrets in the replica Region in a similar way as you would in the primary Region.

You can start using this feature through the AWS Secrets Manager console, AWS Command Line Interface (AWS CLI), AWS SDK, or AWS CloudFormation. To learn more about this feature, see the AWS Secrets Manager documentation. If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on the AWS Secrets Manager forum.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Fatima Ahmed

Fatima is a Security Global Practice Lead at AWS. She is passionate about cybersecurity and helping customers build secure solutions in the AWS Cloud. When she is not working, she enjoys time with her cat or solving cryptic puzzles.