How ERGO implemented an event-driven security remediation architecture on AWS

Post Syndicated from Adam Sikora original https://aws.amazon.com/blogs/architecture/how-ergo-implemented-an-event-driven-security-remediation-architecture-on-aws/

ERGO is one of the major insurance groups in Germany and Europe. Within the ERGO Group, ERGO Technology & Services S.A. (ET&S), a part of ET&SM holding, has competencies in digital transformation, know-how in creating and implementing complex IT systems with focus on the quality of solutions and a portfolio aligned with the entire value chain of the insurance market.

Business Challenge and Solution

ERGO has a multi-account AWS environment where each project team subscribes to a set of AWS accounts that conforms to workload requirements and security best practices. As ERGO began its cloud journey, CIS Foundations Benchmark Standard was used as the key indicator for measuring compliance. The report showed significant room for security posture improvements. ERGO was looking for a solution that could enable the management of security events at scale. At the same time, they needed to centralize the event response and remediation in near-real time. The goal was to improve the CIS compliance metric and overall security posture.

Architecture

ERGO uses AWS Organizations to centrally govern the multi-account AWS environment. Integration of AWS Security Hub with AWS Organizations enables ERGO to designate ERGO’s Security Account as the Security Hub administrator/primary account. Other organization accounts are automatically registered as Security Hub member accounts to send events to the Security Account.

An important aspect of the workflow is to maintain segregation of duties and separation of environments. ERGO uses two separate AWS accounts to implement automatic finding remediation:

  • Security Account – this is the primary account with Security Hub where security alerts (findings) from all the AWS accounts of the project are gathered.
  • Service Account – this is the account that can take action on target project (member) AWS accounts. ERGO uses AWS Lambda functions to run remediation actions through AWS Identity and Access Management (IAM) permissions, VPC resources actions, and more.

Within the Security Account, AWS Security Hub serves as the event aggregation solution that gathers multi-account findings from AWS services such as Amazon GuardDuty. ERGO was able to centralize the security findings. But they still needed to develop a solution that routed the filtered, actionable events to the Service Account. The solution had to automate the response to these events based on ERGO’s security policy. ERGO built this solution with the help of Amazon CloudWatch, AWS Step Functions, and AWS Lambda.

ERGO used the integration of AWS Security Hub with Amazon CloudWatch to send all the security events to CloudWatch. The filtering logic of events was managed at two levels. At the first level, ERGO used CloudWatch Events rules that match event patterns to refine the types of events ERGO wanted to focus on.

The second level of filtering logic was more nuanced and related to the remediation action ERGO wanted to take on a detected event. ERGO chose AWS Step Functions to build a workflow that enabled them to further filter the events, in addition to matching them to the suitable remediation action.

Choosing AWS Step Functions enabled ERGO to orchestrate multiple steps. They could also respond to errors in the overall workflow. For example, one of the issues that ERGO encountered was the sporadic failure of the Archival Lambda function. This was due to the Security Hub API Rate Throttling.

ERGO evaluated several workarounds to deal with this situation. They considered using the automatic retries capability of the AWS SDK to make the API call in the Archival function. However, the built-in mechanism was not sufficient in this case. Another option for dealing with rate limit was to throttle the Archival Lambda functions by applying a low reserved concurrency. Another possibility was to batch the events to be SUPPRESSED and process them as one batch at a time. The benefit was in making a single API call at a time, over several parameters.

After much consideration, ERGO decided to use the “retry on error” mechanism of the Step Function to circumvent this problem. This allowed ERGO to manage the error handling directly in the workflow logic. It wasn’t necessary to change the remediation and archival logic of the Lambda functions. This was a huge advantage. Writing and maintaining error handling logic in each one of the Lambda functions would have been time-intensive and complicated.

Additionally, the remediation actions had to be configured and run from the Service Account. That means the Step Function in the Security Account had to trigger a cross-account resource. ERGO had to find a way to integrate the Remediation Lambda in the Service Account with the state machine of the Security Account. ERGO achieved this integration using a Proxy Lambda in the Security Account.

The Proxy Lambda resides in the Security Account and is initiated by the Step Function. It takes as its argument, the function name and function version to start the Remediation function in the service account.

The Remediation functions in the Service Account have permission to take action on Project accounts. As the next step, the Remediation function is invoked on the impacted accounts. This is filtered by the Step Function, which passes the Account ID to Proxy Lambda, which in turn passes this argument to Remediation Lambda. The Remediation function runs the actions on the Project accounts and returns the output to the Proxy Lambda. This is then passed back to the Step Function.

The role that Lambda assumes using the AssumeRole mechanism, is an Organization Level role. It is deployed on every account and has proper permission to perform the remediation.

ERGO Architecture

Figure 1. Technical Solution implementation

  1. Security Hub service in ERGO Project accounts sends security findings to Administrative Account.
  2. Findings are aggregated and sent to CloudWatch Events for filtering.
  3. CloudWatch rules invoke Step Functions as the target. Step Functions process security events based on the event type and treatment required as per CIS Standards.
  4. For events that need to be suppressed without any dependency on the Project Accounts, the Step Function invokes a Lambda function to archive the findings.
  5. For events that need to be executed on the Project accounts, a Step Function invokes a Proxy Lambda with required parameters.
  6. Proxy Lambda in turn, invokes a cross-account Remediation function in Service Account. This has the permissions to run actions in Project accounts.
  7. Based on the event type, corresponding remediation action is run on the impacted Project Account.
  8. Remediation function passes the execution result back to Proxy Lambda to complete the Security event workflow.

Failed remediations are manually resolved in exceptional conditions.

Summary

By implementing this event-driven solution, ERGO was able to increase and maintain automated compliance with CIS AWS Foundation Benchmark Standard to about 95%. The remaining findings were evaluated on case basis, per specific Project requirements. This measurable improvement in ERGO compliance posture was achieved with an end-to-end serverless workflow. This offloaded any on-going platform maintenance efforts from the ERGO cloud security team. Working closely with our AWS account and service teams, ERGO will continue to evaluate and make improvements to our architecture.