All posts by Haresh Nandwani

Understand resiliency patterns and trade-offs to architect efficiently in the cloud

Post Syndicated from Haresh Nandwani original https://aws.amazon.com/blogs/architecture/understand-resiliency-patterns-and-trade-offs-to-architect-efficiently-in-the-cloud/

This post was originally published in June 2022 and is now updated with more information on efficiently architecting resilient patterns in the cloud.


Customers architecting workloads for resilience in the cloud often need to evaluate multiple factors before they can decide on the most optimal architecture for their workloads.

Example Corp has multiple applications with varying criticality, and each of their applications have different needs in terms of resiliency, complexity, and cost. They have many choices to architect their workloads for resiliency and cost, but which option suits their needs best? What should they consider when choosing the patterns most appropriate for the needs of their applications?

To help answer these questions, we’ll discuss the five resilience patterns in Figure 1 and the trade-offs to consider when implementing them: 1) design complexity, 2) cost to implement, 3) operational effort, 4) effort to secure, and 5) environmental impact. This will help you achieve varying levels of resiliency and make decisions about the most appropriate architecture for your needs. Our intent is to provide a high-level approach to structure conversations on trade-offs associated with each of these patterns. For a deeper dive on each pattern, please navigate to the Further reading section at the end of this post.

Note: These patterns are not mutually exclusive; you may decide to implement a combination of one or more patterns.


Figure 1. Resilience patterns and trade-offs

What is resiliency? Why does it matter?

The AWS Well-Architected Framework defines resilience as having “the capability to recover when stressed by load (more requests for service), attacks (either accidental through a bug, or deliberate through intention), and failure of any component in the workload’s components.”

To meet your business’ resilience requirements, consider the following core factors as you design your workloads:

  • Design complexity – An increase in system complexity typically increases the emergent behaviors of that system. Each individual workload component has to be resilient, and you’ll need to eliminate single points of failure across people, process, and technology elements. Consider your resilience requirements and decide whether increasing system complexity is an effective approach, or whether keeping the system simple and relying on a disaster recovery (DR) plan is more appropriate.
  • Cost to implement – Costs often increase significantly when you implement higher resilience because there are new software and infrastructure components to operate. Weigh these costs against the potential cost of future losses.
  • Operational effort – Deploying and supporting highly resilient systems requires complex operational processes and advanced technical skills. For example, customers might need to improve their operational processes using the Operational Readiness Review (ORR) approach. Before you decide to implement higher resilience, evaluate your operational competency to confirm you have the required level of process maturity and skillsets.
  • Effort to secure – Security complexity is less directly correlated with resilience. However, there are generally more components to secure for highly resilient systems. Using security best practices for cloud deployments can achieve security objectives without adding significant complexity even with a higher deployment footprint.
  • Environmental impact – An increased deployment footprint for resilient systems may increase your consumption of cloud resources. However, you can use trade-offs, like approximate computing and deliberately implementing slower response times to reduce resource consumption. The AWS Well-Architected Sustainability Pillar describes these patterns and provides guidance on sustainability best practices.

Pattern 1 (P1): Multi-AZ

P1 is a cloud-based architecture pattern (Figure 2) that introduces Availability Zones (AZs) into your architecture to increase your system’s resilience. The P1 pattern uses a Multi-AZ architecture where applications operate in multiple AZs within a single AWS Region. This allows your application to withstand AZ-level impacts.

As shown in Figure 2, Example Corp deploys their internal employee applications using the P1 pattern. These applications are low business impact and therefore have lower requirements for resiliency.

Example Corp deploys each low-business-impact application as a single Amazon Elastic Compute Cloud (Amazon EC2) instance managed by an Amazon EC2 Auto Scaling group, which uses health checks to automatically detect faults. If an AZ fails, the Auto Scaling group launches a replacement instance in an unaffected AZ; a minimal configuration sketch follows Figure 2.


Figure 2. Multi-AZ deployment pattern (P1)
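As a concrete illustration of P1, the following is a minimal boto3 sketch (not Example Corp's actual configuration) of a single-instance Auto Scaling group spread across subnets in two AZs with EC2 health checks, so a failed instance or AZ triggers a replacement launch in a healthy AZ. The launch template ID, subnet IDs, and names are placeholders.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="internal-app-asg",
    LaunchTemplate={
        "LaunchTemplateId": "lt-0123456789abcdef0",  # placeholder launch template
        "Version": "$Latest",
    },
    MinSize=1,
    MaxSize=1,
    DesiredCapacity=1,
    # Subnets in two different AZs; the group relaunches the single instance
    # in whichever AZ is healthy if the current one fails.
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",
    HealthCheckType="EC2",
    HealthCheckGracePeriod=300,
)
```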

Trade-offs

P1 scores low across the trade-off categories, such as design complexity, cost to implement, and operational effort, and it mitigates a disruption to the AZ hosting the application, but this comes at the expense of application recovery time. If an AZ goes down, end users lose access to the application while the resources are re-provisioned in a new AZ. This switch between a normal mode of operation and a separate recovery mode is known as bimodal behavior.

Pattern 2 (P2): Multi-AZ with static stability

P2 uses multiple instances across multiple AZs within a Region to increase resilience. The pattern uses static stability to prevent bimodal behavior. Statically stable systems remain stable and operate in one mode, irrespective of changes to their operating environment. A key benefit of a statically stable system on AWS is that it reduces the complexity of recovery during a disruption, thanks to pre-provisioned resource capacity. Any resources needed to maintain operations during a disruption, such as the loss of resources in an AZ, already exist, and AWS service control planes do not need to be available for recovery to succeed. To learn more about static stability, data planes, and control planes, read the Amazon Builders’ Library article Static stability using Availability Zones.

As shown in Figure 3, Example Corp has a customer-facing website with a lower tolerance for downtime; any time the website is down, it could result in lost revenue. Because of this, the website requires two EC2 instances provisioned across two AZs. Using health checks, when an AZ becomes impaired, the website continues to operate because the Elastic Load Balancer diverts traffic away from the impacted AZ (a health check configuration sketch follows Figure 3). For more on using health checks, see the Implementing health checks article in the Amazon Builders’ Library.


Figure 3. Multi-AZ with static stability pattern (P2)
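The health checks that let the load balancer divert traffic away from an impaired AZ are configured on the target group. Below is a hedged boto3 sketch of that side of P2; the VPC ID, health check path, thresholds, and names are illustrative assumptions rather than values from the post.

```python
import boto3

elbv2 = boto3.client("elbv2", region_name="eu-west-1")
autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

# Target group health checks detect instances (or an AZ) that stop responding.
target_group = elbv2.create_target_group(
    Name="website-tg",
    Protocol="HTTP",
    Port=80,
    VpcId="vpc-0123456789abcdef0",   # placeholder VPC
    TargetType="instance",
    HealthCheckProtocol="HTTP",
    HealthCheckPath="/healthz",      # hypothetical health endpoint
    HealthCheckIntervalSeconds=10,
    HealthyThresholdCount=2,
    UnhealthyThresholdCount=2,
)
tg_arn = target_group["TargetGroups"][0]["TargetGroupArn"]

# Attach the target group and switch the Auto Scaling group to ELB health
# checks, so instances failing application-level checks are also replaced.
autoscaling.attach_load_balancer_target_groups(
    AutoScalingGroupName="website-asg",
    TargetGroupARNs=[tg_arn],
)
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="website-asg",
    HealthCheckType="ELB",
    HealthCheckGracePeriod=60,
)
```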

Trade-offs

P2 mitigates an AZ disruption without downtime to application clients but must be weighed against cost concerns. P1 is less expensive from an infrastructure cost perspective, as it provisions less compute capacity and relies on launching new instances in case of a failure. However, P1’s bimodal behavior can affect your customers during large-scale events.

Implementing P2 requires your application to support distributed operation across multiple instances. If your application can support this pattern, you can deploy your workload to all available AZs (usually three or more) across the Region. This reduces over-provisioning costs because each AZ only needs enough spare capacity to absorb the loss of one other AZ: roughly 150% of baseline capacity across three AZs compared with 200% across two AZs (as in the earlier example). The arithmetic is sketched below.
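This is a minimal sketch of that over-provisioning arithmetic, assuming you size the fleet so the surviving AZs still carry 100% of baseline after losing one AZ:

```python
def statically_stable_capacity(baseline: float, az_count: int) -> float:
    """Total capacity to provision so that losing one AZ still leaves the
    full baseline available. Each AZ carries baseline / (az_count - 1), so
    the total provisioned capacity is baseline * az_count / (az_count - 1).
    """
    return baseline * az_count / (az_count - 1)


print(statically_stable_capacity(100, 2))  # 200.0 -> 200% across two AZs
print(statically_stable_capacity(100, 3))  # 150.0 -> 150% across three AZs
```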

Pattern 3 (P3): Application portfolio distribution

P3 uses a Multi-Region pattern to increase functional resilience, as demonstrated in Figure 4. It distributes different critical applications in multiple Regions.

Example Corp provides banking services, like credit balance checks, to consumers on multiple digital channels. These services are available to consumers via a mobile application, contact center, and web-based applications. Each digital channel is deployed to a separate Region, which mitigates against a regional service disruption.

For example, a Region with the customers’ mobile application may have a disruption that causes the mobile app to be unavailable, but customers can still access banking services via online banking deployed in an alternate Region. Regional service disruptions are rare, but implementing a pattern like this ensures your users retain access to business-critical services during disruptions.


Figure 4. Application portfolio distribution pattern (P3)

Trade-offs

P3 mitigates the possibility of a regional service disruption impacting a multitude of systems at the same time. Operating an application portfolio that spans multiple Regions requires significant operational planning and management. Isolated functional elements may depend on common downstream systems and data sources that are deployed in a single Region. Therefore, Region-wide events may still cause disruption, but the impact surface area should be reduced.

Pattern 4 (P4): Multi-AZ deployment (multi-Region DR)

Example Corp operates several business-critical services that have a very low tolerance for disruption, such as the ability for consumers to make bank payments. Example Corp reviewed the four common patterns for DR (as defined in Disaster Recovery of Workloads on AWS: Recovery in the Cloud) and decided to use the following sub-patterns for their multi-Region applications:

  • Pilot Light – This pattern works for applications that require an RTO/RPO of tens of minutes. Data is actively replicated and application infrastructure is pre-provisioned in the DR Region. Cost optimization is a key driver here, as the application infrastructure is kept switched off and only switched on during a restore event.
  • Warm Standby – This pattern improves restore times significantly compared with pilot light by keeping your applications running in the DR Region, but at reduced capacity. Application infrastructure is scaled up during a DR event, which can typically be automated with minimal manual effort (a scale-up sketch follows this list). Implemented correctly, this pattern can achieve an RTO/RPO of minutes.
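As an illustration of the warm standby scale-up step, the hedged sketch below raises a DR Region Auto Scaling group from its reduced standby capacity to full production capacity during failover. The group name, Region, and capacities are hypothetical, and a real DR runbook would also cover data promotion and traffic failover.

```python
import boto3

# DR Region client; eu-west-2 is a placeholder for the standby Region
dr_autoscaling = boto3.client("autoscaling", region_name="eu-west-2")

dr_autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="payments-dr-asg",  # hypothetical standby group
    MinSize=4,
    MaxSize=8,
    DesiredCapacity=4,  # scale from warm-standby (e.g. 1 instance) to full capacity
)
```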

Trade-offs

P4 mitigates a disruption to a regional service while keeping mitigation costs lower than running full capacity in a second Region. Regional DR patterns increase deployment complexity, as infrastructure changes need to be synchronized across Regions. Testing resilience is also significantly more complex and includes simulating regional disruptions. Using Infrastructure as Code to automate deployments can help alleviate these issues.

Pattern 5 (P5): Multi-Region active-active

Example Corp’s core banking and Customer Relationship Management applications have zero tolerance for disruption. They use the P5 pattern for these applications because it delivers a near-real-time RTO and near-zero data loss (RPO). They run the workloads simultaneously in multiple Regions, allowing them to serve traffic from all Regions at the same time (one common routing approach is sketched after Figure 5). This pattern not only mitigates regional disruptions but also addresses their zero-tolerance requirements (Figure 5).


Figure 5. Multi-Region active-active pattern (P5)
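The post does not prescribe how traffic is spread across Regions; one common approach is latency-based DNS routing with health evaluation, sketched below with boto3. The hosted zone ID, domain, and load balancer details are hypothetical placeholders, and you would create one such record per active Region.

```python
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="ZHYPOTHETICAL123",  # placeholder hosted zone
    ChangeBatch={
        "Comment": "Latency-based record for the eu-west-1 stack",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "banking.example.com",
                    "Type": "A",
                    "SetIdentifier": "eu-west-1",  # one record per active Region
                    "Region": "eu-west-1",
                    "AliasTarget": {
                        "HostedZoneId": "ZLBHOSTEDZONEID",  # placeholder: the load balancer's zone ID
                        "DNSName": "app-eu-west-1.elb.example.amazonaws.com",
                        "EvaluateTargetHealth": True,  # stop routing to an unhealthy Region
                    },
                },
            }
        ],
    },
)
```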

Trade-offs

P5 mitigates the disruption of a regional service, but the near-zero RTO comes at the cost of additional spend and complexity. Multi-Region active-active deployments are generally complex, as they include multiple applications that collaborate to deliver the required business services. If you implement this pattern, consider that data is typically replicated asynchronously across Regions, and evaluate the impact that has on data consistency.

Operating this pattern requires a very high level of process maturity, so we recommend customers gradually build towards this pattern by starting with the deployment patterns described earlier.

Conclusion

In this blog post, we introduced five resilience patterns and trade-offs to consider when implementing them. In an effort to help you find the most efficient architecture for your use case, we demonstrated how Example Corp evaluated these options and how they applied them to their business needs.

Further reading

Looking for more architecture content?

AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

Monitoring and alerting break-glass access in an AWS Organization

Post Syndicated from Haresh Nandwani original https://aws.amazon.com/blogs/architecture/monitoring-and-alerting-break-glass-access-in-an-aws-organization/

Organizations building enterprise-scale systems require the setup of a secure and governed landing zone to deploy and operate their systems. A landing zone is a starting point from which your organization can quickly launch and deploy workloads and applications with confidence in your security and infrastructure environment as described in What is a landing zone?. Nationwide Building Society (Nationwide) is the world’s largest building society. It is owned by its 16 million members and exists to serve their needs. The Society is one of the UK’s largest providers for mortgages, savings and current accounts, as well as being a major provider of ISAs, credit cards, personal loans, insurance, and investments.

For one of its business initiatives, Nationwide utilizes AWS Control Tower to build and operate their landing zone, which provides a well-established pattern to set up and govern a secure, multi-account AWS environment. Nationwide operates in a highly regulated industry, and our governance assurance requires adequate control of any privileged access to production line-of-business data or to resources that have access to it. For this specific business initiative, we chose to deploy our landing zone using AWS Organizations to benefit from ongoing account management and governance aligned with AWS implementation best practices. We also utilized AWS Single Sign-On (AWS SSO) to create our workforce identities in AWS once and manage access centrally across our AWS Organization. In this blog, we describe the integrations required across AWS Control Tower and AWS SSO to implement a break-glass mechanism that makes access reporting publishable to system operators as well as to internal audit systems and processes. We will outline how we used AWS SSO for our setup, the three architecture options we considered, and why we went with the chosen solution.

Sourcing AWS SSO access data for near real-time monitoring

In our setup, we have multiple AWS accounts and multiple trails in each of these accounts. Users regularly navigate across multiple accounts as they operate our infrastructure, and their journeys are recorded across these multiple trails. Typically, AWS CloudTrail would be our chosen resource to clearly and unambiguously identify account or data access. The key challenge in this scenario was to design an efficient and cost-effective solution to scan these trails to help identify and report on break-glass user access to accounts and production data. To address this challenge, we developed the following two architecture design options.

Option 1: A decentralized approach that uses AWS CloudFormation StackSets, Amazon EventBridge and AWS Lambda

Our solution entailed a decentralized approach: we deployed a CloudFormation StackSet to create, update, or delete stacks across multiple accounts and AWS Regions with a single operation. The StackSet created Amazon EventBridge rules with AWS Lambda functions as targets. These functions post to an EventBridge event bus in our audit account. The audit account runs a set of Lambda functions, triggered from EventBridge for specific events, that format the event message and post it to Slack, our centralized communication platform for this implementation. Figure 1 depicts the overall architecture for this option.


Figure 1. De-centralized logging using Amazon EventBridge and AWS Lambda

Option 2: Use an organization trail in the Organization Management account

This option uses the centralized organization trail in the Organization Management account to source audit data. Details of how to create an organization trail can be found in the AWS CloudTrail User Guide. CloudTrail was configured to send log events to CloudWatch Logs, and these events are then sent to Slack via Lambda functions using webhooks. We used a public Terraform module in this GitHub repository to build this Lambda-to-Slack integration. Figure 2 depicts the overall architecture for this option.


Figure 2. Centralized logging pattern using Amazon CloudWatch
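Both architecture options end with a Lambda function formatting the event and posting it to Slack. The following is a minimal sketch of that final hop using a standard Slack incoming webhook; the webhook URL, message format, and the CloudTrail fields chosen for the alert are assumptions rather than the exact implementation.

```python
import json
import os
import urllib.request

SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]  # placeholder incoming webhook URL


def post_to_slack(record: dict) -> None:
    """Format a CloudTrail record and post a short break-glass alert to Slack."""
    message = (
        ":rotating_light: Break-glass access detected\n"
        f"*Event*: {record.get('eventName')}\n"
        f"*Account*: {record.get('recipientAccountId')}\n"
        f"*Principal*: {record.get('userIdentity', {}).get('principalId')}"
    )
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps({"text": message}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)
```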

This was our preferred option and is the one we finally implemented.

We also evaluated a third option, which was to use the centralized logging and auditing feature enabled by Control Tower. Users authenticate and federate to target accounts from a central location, so it seemed possible to source this information from the centralized logs. However, these log events arrive as .gz-compressed JSON objects, which meant having to expand these archives repeatedly for inspection. We therefore decided against this option.

A centralized, economical, extensible solution for alerting on SSO break-glass access

Our requirement was to identify break-glass access across any of the access mechanisms supported by AWS, including CLI and User Portal access. To ensure we have comprehensive coverage across all access mechanisms, we identified all the events initiated for each access mechanism:

  1. User Portal/AWS Console access events
    • Authenticate
    • ListApplications
    • ListApplicationProfiles
    • Federate – this event contains the role that the user is federating into
  2. CLI access events
    • CreateToken
    • ListAccounts
    • ListAccountRoles
    • GetRoleCredentials – this event contains the role that the user is federating into

EventBridge can initiate actions after events only when the event is performing changes (that is, when the “readOnly” attribute on the event record body equals “false”).

The AWS support team was aware of this attribute and recommended that we change the data flow we were using to one able to initiate actions after any kind of event, regardless of the value of its readOnly attribute. The solution in our case was to send the CloudTrail logs to CloudWatch Logs, which then initiates the Lambda function through a subscription filter that detects the desired event names in the log content.

The filter used is as follows:

{($.eventSource = sso.amazonaws.com) && ($.eventName = Federate||$.eventName = GetRoleCredentials)}

Due to the size of the resulting CloudWatch Logs queries, we had to remove the subscription filters and parse the content of the log lines inside the Lambda function instead. To determine which accounts would initiate the notifications, we passed the list of accounts and roles to the function as an environment variable at runtime.
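The sketch below illustrates this final shape of the function (not the exact implementation): it decodes the compressed CloudWatch Logs payload, filters for the SSO event names of interest, and compares the target account and role against an allow list passed in as an environment variable. The BREAK_GLASS_TARGETS format and the field extraction are assumptions; the exact request parameters differ between Federate and GetRoleCredentials events. post_to_slack is the helper sketched earlier.

```python
import base64
import gzip
import json
import os

WATCHED_EVENTS = {"Federate", "GetRoleCredentials"}
# Hypothetical format, e.g. '{"111122223333": ["BreakGlassAdmin"]}'
BREAK_GLASS_TARGETS = json.loads(os.environ.get("BREAK_GLASS_TARGETS", "{}"))


def handler(event, context):
    # CloudWatch Logs subscriptions deliver a base64-encoded, gzipped payload
    payload = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))
    for log_event in payload["logEvents"]:
        record = json.loads(log_event["message"])
        if record.get("eventSource") != "sso.amazonaws.com":
            continue
        if record.get("eventName") not in WATCHED_EVENTS:
            continue
        # Field names are illustrative; inspect real events to confirm them
        params = record.get("requestParameters") or {}
        account = params.get("accountId")
        role = params.get("roleName")
        if role in BREAK_GLASS_TARGETS.get(account, []):
            post_to_slack(record)  # notification helper from the earlier sketch
```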

Considerations with cross-account SSO access

With direct federation, users get an access token. This is most visible on the AWS SSO user portal (the “chiclet” page) as “Command line or programmatic access”. SSO tokens have a limited lifetime (we use the default of 1 hour). A user does not have to get a new token to access a target resource until the one they are using has expired, which means a user may repeatedly access a target account with the same token during its lifetime. Although the token is made available on the chiclet page, the GetRoleCredentials event does not occur until the token is used to authenticate an API call to the target AWS account.

Conclusion

In this blog, we discussed how AWS Control Tower and AWS Single Sign-On enabled Nationwide to build and govern a secure, multi-account AWS environment for one of their business initiatives and to centralize access management across the implementation. The integration was important for us to accurately and comprehensively identify and audit break-glass access. As a result, we were able to satisfy our security and compliance audit requirements for privileged access to our AWS accounts.

How Experian uses Amazon SageMaker to Deliver Affordability Verification 

Post Syndicated from Haresh Nandwani original https://aws.amazon.com/blogs/architecture/how-experian-uses-amazon-sagemaker-to-deliver-affordability-verification/

Financial service (FS) providers must identify patterns and signals in a customer’s financial behavior to provide deeper, up-to-the-minute insight into their affordability and credit risk. FS providers use these insights to improve decision making and customer management capabilities. Machine learning (ML) models and algorithms play a significant role in automating, categorising, and deriving insights from bank transaction data.

Experian publishes Categorisation-as-a-Service (CaaS) ML models that automate analysis of bank and credit card transactions, to be deployed in Amazon SageMaker. Driven by a suite of Experian proprietary algorithms, these models categorise a customer’s bank or credit card transactions into one of over 180 different income and expenditure categories. The service turns these categorised transactions into a set of summarised insights that can help a business better understand their customer and make more informed decisions. These insights provide a detailed picture of a customer’s financial circumstances and resilience by looking at verified income, expenditure, and credit behavior.

This blog demonstrates how financial service providers can introduce affordability verification and categorisation into their digital journeys by deploying Experian CaaS ML models on SageMaker. You don’t need significant ML knowledge to start using Amazon SageMaker and Experian CaaS.

Affordability verification and data categorisation in digital journeys

Product onboarding journeys are increasingly digital, and most financial service providers expect these journeys to start and finish online. An example journey would be consumers applying for credit with their existing FS provider. These journeys typically involve FS providers performing affordability verification to ensure consumers are offered products they can afford. FS providers can now use Experian CaaS ML models, available via AWS Marketplace, to generate real-time financial insights and affordability verification for their customers.

Figure 1 depicts a typical digital journey for consumers applying for credit.


Figure 1. Customer journey for consumers applying for credit

  1. Data categorisation for transactional data. Existing transactional data for current consumers is typically sourced from on-premises data sources into a data lake in the cloud. It is then prepared and transformed for processing and analytics. This analysis is done based on the FS provider’s existing consent in compliance with relevant data protection laws. Additional transaction information for other accounts not held by the lender can be sourced from Open Banking and categorised separately.
  2. Store categorised transactions. Background processes run a SageMaker batch transform job using the Experian CaaS Data Categorisation model to categorise this transactional data.
  3. Consumer applies for credit. Consumers use the FS providers’ existing front-end web, mobile, or any other digital channel to apply for credit.
  4. FS provider retrieves up-to-date insights. Insights are generated in real time using the Experian CaaS insights model, deployed as endpoints in SageMaker, and returned to the consumer-facing digital channel (a minimal invocation sketch follows this list).
  5. FS provider makes credit decision. The channel app consolidates these insights to decide on product eligibility and drive customer journeys.
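In step 4, retrieving insights typically reduces to a single SageMaker runtime call from the provider's backend (for example, behind Amazon API Gateway and AWS Lambda, as described in the next section). The sketch below is illustrative only: the endpoint name, content type, and payload shape are placeholders, not the actual Experian CaaS interface.

```python
import boto3

runtime = boto3.client("sagemaker-runtime", region_name="eu-west-2")

# Hypothetical payload: the real CaaS models define their own input schema
response = runtime.invoke_endpoint(
    EndpointName="caas-insights-endpoint",
    ContentType="text/csv",
    Body=b"2024-01-15,COFFEE SHOP LTD,-3.50\n2024-01-15,EMPLOYER PAYROLL,2100.00\n",
)
insights = response["Body"].read().decode("utf-8")
print(insights)
```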

Deploying and publishing Experian CaaS ML models to Amazon SageMaker

Figure 2 demonstrates the technical solution for the customer journey described in the preceding section.


Figure 2. Credit application – technical solution using Amazon SageMaker and Experian CaaS ML models

  1. Financial service providers can use AWS Database Migration Service (AWS DMS) to replicate transactional data from their on-premises systems, such as their core banking systems, to Amazon S3. Customers can source this transactional data into a highly available and scalable data lake solution on AWS. Refer to the AWS DMS documentation for technical details on supported database sources.
  2. FS providers can use AWS Glue, a serverless data integration service, to cleanse, prepare, and transform the transactional data into formats supported by the Experian CaaS ML models.
  3. FS providers can subscribe and download CaaS ML models built for SageMaker from the AWS Marketplace.
  4. These models can be deployed to SageMaker hosting services as a SageMaker endpoint for real-time inference. Endpoints are fully managed by AWS, and can be set up to scale on demand and deployed in a Multi-AZ model for resilience. FS providers can use Amazon API Gateway and AWS Lambda to make these endpoints available to their consumer-facing applications.
  5. SageMaker also supports a batch transform mode for ML models, which in this scenario is used to pre-categorise transactional data. This mode is also useful for use cases that require recurring, scheduled analysis, such as a regular anti-fraud assessment (a deployment sketch covering both modes follows this list).
  6. Consumers request a financial product, such as a credit card, on an FS provider’s digital channels.
  7. These requests invoke SageMaker endpoints, which use Experian CaaS models to derive real-time insights.
  8. These insights are used to further drive the customer’s product journey. CaaS models are pre-trained and can return insights within the latency requirements of most real-time digital journeys.
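To make steps 3 through 5 concrete, here is a hedged sketch using the SageMaker Python SDK to deploy a subscribed AWS Marketplace model package as a real-time endpoint and to run a batch transform job. The model package ARN, IAM role, bucket names, instance types, and endpoint name are placeholders, not the actual Experian CaaS details.

```python
import sagemaker
from sagemaker import ModelPackage

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder role

# Model package from an AWS Marketplace subscription (placeholder ARN)
model = ModelPackage(
    role=role,
    model_package_arn=(
        "arn:aws:sagemaker:eu-west-2:999999999999:model-package/example-caas-model"
    ),
    sagemaker_session=session,
)

# Step 4: real-time endpoint; marketplace models run with network isolation
predictor = model.deploy(
    initial_instance_count=2,      # spread across AZs for resilience
    instance_type="ml.m5.xlarge",
    endpoint_name="caas-insights-endpoint",
)

# Step 5: batch transform to pre-categorise historical transaction data
transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-bucket/categorised/",
)
transformer.transform(
    data="s3://example-bucket/transactions/",
    content_type="text/csv",
)
```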

Security and compliance using CaaS

AWS Marketplace models are scanned by AWS for common vulnerabilities and exposures (CVE). CVE is a list of publicly known information about security vulnerabilities and exposures. For details on the infrastructure security applied by SageMaker, see Infrastructure Security in Amazon SageMaker.

Data security is a key concern for FS providers and sharing of data externally is challenging from a security and compliance perspective. The CaaS deployment model described here helps address these challenges as data owned by the FS provider remains within their control domain and AWS account. There is no requirement for this data to be shared with Experian. This means the customer’s personal financial information is retained by the FS provider. FS providers cannot access the model code as it is running in a locked SageMaker environment.

AWS Marketplace models such as the Experian CaaS ML models are deployed in a network isolation mode. This ensures that the models cannot make any outbound network calls, even to other AWS services such as Amazon S3. SageMaker still performs download and upload operations against Amazon S3 in isolation from the model.

Implementing upgrades to CaaS ML models

ML model upgrades can be performed in place in Amazon SageMaker as vendors release newer versions of their models in AWS Marketplace. Endpoints can be set up in a blue/green deployment pattern to ensure that upgrades do not impact consumers and can be safely rolled back with no business interruption.
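A minimal boto3 sketch of such an upgrade is shown below: create an endpoint configuration that points at a model built from the new model package version, then update the existing endpoint. SageMaker provisions the new variant before shifting traffic, and re-applying the previous configuration rolls the change back. All names are hypothetical placeholders.

```python
import boto3

sm = boto3.client("sagemaker", region_name="eu-west-2")

# New endpoint configuration referencing the upgraded model version
sm.create_endpoint_config(
    EndpointConfigName="caas-insights-config-v2",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "caas-insights-model-v2",  # created from the new model package
            "InitialInstanceCount": 2,
            "InstanceType": "ml.m5.xlarge",
        }
    ],
)

# Switch the live endpoint to the new configuration without downtime
sm.update_endpoint(
    EndpointName="caas-insights-endpoint",
    EndpointConfigName="caas-insights-config-v2",
)
```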

Conclusion

Automated categorisation of bank transaction data is now being used by FS providers as they start to realise the benefits it can bring to their business. This is being driven in part by the advent of Open Banking. Many FS providers have increased confidence in the accuracy and performance of automated categorisation engines. Suppliers such as Experian are providing transparency around their methodologies used to categorise data, which is also encouraging adoption.

In this blog, we covered how FS providers can introduce automated categorisation of data and affordability identification capabilities into their digital journeys. This can be done quickly and without significant in-house ML skills, using Amazon SageMaker and Experian CaaS ML models. SageMaker endpoints and batch transform capabilities enable the deployment of a highly scalable, secure, and extensible ML infrastructure with minimal development and operational effort.

Experian’s CaaS is available for use via the AWS Marketplace.