Post Syndicated from Howie Li original https://aws.amazon.com/blogs/security/amazon-cognito-unlocks-advanced-capabilities-with-next-generation-infrastructure/
Amazon Cognito recently introduced high-throughput performance for demanding workloads, customer-managed keys for full control over data encryption at rest, and multi- Region replication for business continuity improvement. These capabilities were made possible through a next-generation storage infrastructure designed for extensibility and scale. To deliver this, we migrated hundreds of millions of user profiles, and you probably didn’t even notice. In this post, we walk through what’s new, the architecture behind it, and how we got here with a zero-downtime migration that kept your applications running.
New capabilities now available on Cognito
The migration to the new infrastructure wasn’t just about maintaining existing functionality—it created the foundation for delivering capabilities that solve customer challenges while positioning Amazon Cognito for continuous improvements.
High-throughput performance: The new architecture supports the higher request volumes and scale requirements of modern applications while maintaining the low latency performance that your applications depend on—able to support tens of millions of users per user pool and thousands of transactions per second (TPS).
Customer-managed keys: Customers can now use their own encryption keys stored in AWS Key Management Service (AWS KMS) for encrypting data at rest. This provides enhanced security control and capabilities, giving customers full ownership over their encryption key lifecycle.
Multi-Region replication: Customers can now synchronize their entire user pool data, including user passwords, attributes, and configurations to another user pool in another Region of their choice. This means that customers can implement business continuity strategies and maintain authentication availability in case of a Regional failover, helping their applications remain accessible to users even during unexpected disruptions.
An architecture for innovation
The new architecture uses a purpose-built storage layer designed for extensibility and scale of identity operations. We anchored the new architecture around a set of design tenets:
- Identity-first design: The storage layer understands user identities. There’s no client-specific business logic and no generalizations beyond identity management; keeping the system focused, portable, and optimized.
- Avoid one-way doors: Deliver value incrementally while keeping architectural choices reversible, so we can evolve as new needs arise.
- Backward compatible: Changes to the underlying infrastructure should never break customers’ applications.
These tenets shaped every architectural decision. The architecture separates into independently deployable domains. Previously, while using Amazon Cloud Directory, the service architecture relied on a single data store to persist all customer information. This provided straightforward data traversal mechanisms but required multi-service coordination to adjust database schema when new features were required. The new architecture uses different datasets, allowing them to evolve independently for faster feature iterations.
Migration with zero-downtime
Migrating users requires extreme precautions and a strategy designed to maintain zero downtime and ensure data integrity at every step. Our approach prioritizes both immediate stability and long-term flexibility through the following measures:
- Shadow mode validation: We ran customer API requests through both old and new infrastructures simultaneously, comparing response structures, status codes, and behavioral characteristics. The validation was designed so that sensitive information was never exposed in plaintext during comparison. We accounted for known variances—for example, timestamps could differ slightly between systems—so that only meaningful discrepancies surfaced as actionable alerts.
- Data backfill: Before switching a user pool to the new infrastructure, we performed a bulk backfill of all existing user records from the legacy system into the new storage. The backfill ran alongside live traffic with dual-write capturing any changes made during the backfill window, ensuring no data loss or stale data. Shadow mode served as the validation layer for the backfill; as we addressed more edge cases in data syncing, shadow mode match rates increased, confirming data completeness before we proceeded to the switchover.
- Dual-write architecture: We implemented a system where all identity operations were simultaneously written to both legacy and new services, with comprehensive validation to ensure consistency. Even when a dual-write to the new infrastructure failed, the operation still succeeded in the legacy system, preserving all customer-initiated requests. This means any dual-write failure was contained as an internal consistency issue and not customer-impacting.
- Anti–entropy validation: We implemented a data validation and correction system that continuously compared records across old and new infrastructures, detecting and resolving any data divergence. Anti-entropy scans compared user attributes, credential hashes, group memberships, and configurations, among other records. When true discrepancies were found, the system automatically reconciled them using the legacy system as the source of truth. This layer was able to catch edge cases that shadow mode and dual writes alone could not cover.
- Incremental rollout with rollback capability: We established controlled deployment phases with immediate rollback capabilities. After switching a user pool to the new infrastructure, we continued replicating all writes back to the legacy system, ensuring we can revert any user pool to the legacy infrastructure at any point without data loss. If a rollback was needed during migration, an orchestrator replayed entries in timestamp order, syncing user profiles back to the legacy system.
Lessons learned for infrastructure modernization
This modernization taught us valuable principles that apply to any large-scale infrastructure project, therefore we choose to share these learnings to help you perform similar migrations.
- Customer access patterns drive architecture decisions: Analyzing actual customer access patterns revealed that identity workloads follow predictable patterns, which meant we could adopt a synchronous dual-write approach that balanced completeness with operational simplicity. This principle applies to any domain-specific migration: understand your workload’s actual access patterns before reaching for general-purpose solutions.
- Behavioral preservation requires techniques beyond traditional testing: Ensuring equivalent functionality across old and new systems was straightforward. Preserving identical API behavior was not. Functional tests validate intended behaviors, but we identified scenarios where customers had built applications around specific API behaviors such that a change could have silently broken their applications. For example, concurrent writes to the same user could resolve to different final states between old and new systems where writes all succeed but outcome diverges slightly. Similarly, customers who write an attribute and immediately read it are affected by the consistency window. Subtle timing differences in when updates become visible could cause stale reads. These aren’t functional failures, but behavior under real traffic patterns can vary. Shadow mode verification surfaced edge cases that automated tests alone would have missed. Invest in these techniques early.
- Gradual validation builds confidence that testing alone cannot: Layer multiple independent validation techniques, such as shadow mode, dual writes, and anti-entropy scans—each covering a different access pattern. No single approach will catch everything, and the gaps between them are where production issues hide. Incremental rollout with immediate rollback capability lets you validate each step while maintaining the ability to revert quickly.
- Key principles for your own modernization projects: Invest in purpose-built solutions, design for extensibility, and implement gradual validation. Or use managed services so your infrastructure improves without effort on your part while your applications keep running; helping you focus on your business needs.
Conclusion
In this post, we shared the high-level approach and learnings from the Amazon Cognito infrastructure modernization that create a foundation for modern identity management capabilities. The new Cognito infrastructure is live, delivering capabilities such as customer-managed keys and multi-Region replication. As the migration continues, all Cognito customers will gain access to these capabilities on the same service they rely on today, with no action required.
Ready to modernize your authentication infrastructure? Visit Amazon Cognito to learn more.
If you have feedback about this post, submit comments in the Comments section below.