Monitoring and alerting break-glass access in an AWS Organization

Post Syndicated from Haresh Nandwani original https://aws.amazon.com/blogs/architecture/monitoring-and-alerting-break-glass-access-in-an-aws-organization/

Organizations building enterprise-scale systems require the setup of a secure and governed landing zone to deploy and operate their systems. A landing zone is a starting point from which your organization can quickly launch and deploy workloads and applications with confidence in your security and infrastructure environment as described in What is a landing zone?. Nationwide Building Society (Nationwide) is the world’s largest building society. It is owned by its 16 million members and exists to serve their needs. The Society is one of the UK’s largest providers for mortgages, savings and current accounts, as well as being a major provider of ISAs, credit cards, personal loans, insurance, and investments.

For one of its business initiatives, Nationwide utilizes AWS Control Tower to build and operate their landing zone which provides a well-established pattern to set up and govern a secure, multi-account AWS environment. Nationwide operates in a highly regulated industry and our governance assurance requires adequate control of any privileged access to production line-of-business data or to resources which have access to them. We chose for this specific business initiative to deploy our landing zone using AWS Organizations, to benefit from ongoing account management and governance as aligned with AWS implementation best practices. We also utilized AWS Single Sign-On (AWS SSO) to create our workforce identities in AWS once and manage access centrally across our AWS Organization. In this blog, we describe the integrations required across AWS Control Tower and AWS SSO to implement a break-glass mechanism that makes access reporting publishable to system operators as well as to internal audit systems and processes. We will outline how we used AWS SSO for our setup as well as the three architecture options we considered, and why we went with the chosen solution.

Sourcing AWS SSO access data for near real-time monitoring

In our setup, we have multiple AWS Accounts and multiple trails on each of these accounts. Users will regularly navigate across multiple accounts as they operate our infrastructure, and their journeys are marked across these multiple trails. Typically, AWS CloudTrail would be our chosen resource to clearly and unambiguously identify account or data access.  The key challenge in this scenario was to design an efficient and cost-effective solution to scan these trails to help identify and report on break-glass user access to account and production data. To address this challenge, we developed the following two architecture design options.

Option 1: A decentralized approach that uses AWS CloudFormation StackSets, Amazon EventBridge and AWS Lambda

Our solution entailed a decentralized approach by deploying a CloudFormation StackSet to create, update, or delete stacks across multiple accounts and AWS Regions with a single operation. The Stackset created Amazon EventBridge rules and target AWS Lambda functions. These functions post to EventBridge in our audit account. Our audit account has a set of Lambda functions running off EventBridge to initiate specific events, format the event message and post to Slack, our centralized communication platform for this implementation. Figure 1 depicts the overall architecture for this option.

De-centralized logging using Amazon EventBridge and AWS Lambda

Figure 1. De-centralized logging using Amazon EventBridge and AWS Lambda

Option 2: Use an organization trail in the Organization Management account

This option uses the centralized organization trail in the Organization Management account to source audit data. Details of how to create an organization trail can be found in the AWS CloudTrail User Guide. CloudTrail was configured to send log events to CloudWatch Logs. These events are then sent via Lambda functions to Slack using webhooks. We used a public terraform module in this GitHub repository to build this Lambda Slack integration. Figure 2 depicts the overall architecture for this option.

Centralized logging pattern using Amazon CloudWatch

Figure 2. Centralized logging pattern using Amazon CloudWatch

This was our preferred option and is the one we finally implemented.

We also evaluated a third option which was to use centralized logging and auditing feature enabled by Control Tower. Users authenticate and federate to target accounts from a central location so it seemed possible to source this info from the centralized logs. These log events arrive as .gz compressed json objects, which meant having to expand these archives repeatedly for inspection. We therefore decided against this option.

A centralized, economic, extensible solution to alert of SSO break-glass

Our requirement was to identify break-glass access across any of the access mechanisms supported by AWS, including CLI and User Portal access. To ensure we have comprehensive coverage across all access mechanisms, we identified all the events initiated for each access mechanism:

  1. User Portal/AWS Console access events
    • Authenticate
    • ListApplications
    • ListApplicationProfiles
    • Federate – this event contains the role that the user is federating into
  2. CLI access events
    • CreateToken
    • ListAccounts
    • ListAccountRoles
    • GetRoleCredentials – this event contains the role that the user is federating into

EventBridge is able to initiate actions after events only when the event is trying to perform changes (when the “readOnly” attribute on the event record body equals “false”).

The AWS support team was aware of this attribute and recommended that we, change the data flow we were using to one able to initiate actions after any kind of event, regardless of the value on its readOnly attribute. The solution in our case was to send the CloudTrail logs to CloudWatch Logs. This then and initiates the Lambda function through a filter subscription that detects the desired event names on the log content.

The filter used is as follows:

{($.eventSource = sso.amazonaws.com) && ($.eventName = Federate||$.eventName = GetRoleCredentials)}

Due to the query size in the CloudWatch Log queries we had to remove the subscription filters and do the parsing of the content of the log lines inside the lambda function. In order to determine what accounts would initiate the notifications, we sent the list of accounts and roles to it as an environment variable at runtime.

Considerations with cross-account SSO access

With direct federation users get an access token. This is most obvious in AWS single sign on at the chiclet page as “Command line or programmatic access”. SSO tokens have a limited lifetime (we use the default 1-hour). A user does not have to get a new token to access a target resource until the one they are using is expired. This means that a user may repeatedly access a target account using the same token during its lifetime. Although the token is made available at the chiclet page, the GetRoleCredentials event does not occur until it is used to authenticate an API call to the target AWS account.

Conclusion

In this blog, we discussed how AWS Control Tower and AWS Single Sign-on enabled Nationwide to build and govern a secure, multi-account AWS environment for one of their business initiatives and centralize access management across our implementation. The integration was important for us to accurately and comprehensively identify and audit break-glass access for our implementation. As a result, we were able to satisfy our security and compliance audit requirements for privileged access to our AWS accounts.