Tag Archives: Provisioning and orchestration

Terraform CI/CD and testing on AWS with the new Terraform Test Framework

2024-04-03 Kevon Mayers

Post Syndicated from Kevon Mayers original https://aws.amazon.com/blogs/devops/terraform-ci-cd-and-testing-on-aws-with-the-new-terraform-test-framework/

Graphic created by Kevon Mayers

Introduction

Organizations often use Terraform Modules to orchestrate complex resource provisioning and provide a simple interface for developers to enter the required parameters to deploy the desired infrastructure. Modules enable code reuse and provide a method for organizations to standardize deployment of common workloads such as a three-tier web application, a cloud networking environment, or a data analytics pipeline. When building Terraform modules, it is common for the module author to start with manual testing. Manual testing is performed using commands such as terraform validate for syntax validation, terraform plan to preview the execution plan, and terraform apply followed by manual inspection of resource configuration in the AWS Management Console. Manual testing is prone to human error, not scalable, and can result in unintended issues. Because modules are used by multiple teams in the organization, it is important to ensure that any changes to the modules are extensively tested before the release. In this blog post, we will show you how to validate Terraform modules and how to automate the process using a Continuous Integration/Continuous Deployment (CI/CD) pipeline.

Terraform Test

Terraform test is a new testing framework for module authors to perform unit and integration tests for Terraform modules. Terraform test can create infrastructure as declared in the module, run validation against the infrastructure, and destroy the test resources regardless if the test passes or fails. Terraform test will also provide warnings if there are any resources that cannot be destroyed. Terraform test uses the same HashiCorp Configuration Language (HCL) syntax used to write Terraform modules. This reduces the burden for modules authors to learn other tools or programming languages. Module authors run the tests using the command terraform test which is available on Terraform CLI version 1.6 or higher.

Module authors create test files with the extension *.tftest.hcl. These test files are placed in the root of the Terraform module or in a dedicated tests directory. The following elements are typically present in a Terraform tests file:

Provider block: optional, used to override the provider configuration, such as selecting AWS region where the tests run.
Variables block: the input variables passed into the module during the test, used to supply non-default values or to override default values for variables.
Run block: used to run a specific test scenario. There can be multiple run blocks per test file, Terraform executes run blocks in order. In each run block you specify the command Terraform (plan or apply), and the test assertions. Module authors can specify the conditions such as: length(var.items) != 0. A full list of condition expressions can be found in the HashiCorp documentation.

Terraform tests are performed in sequential order and at the end of the Terraform test execution, any failed assertions are displayed.

Basic test to validate resource creation

Now that we understand the basic anatomy of a Terraform tests file, let’s create basic tests to validate the functionality of the following Terraform configuration. This Terraform configuration will create an AWS CodeCommit repository with prefix name repo-.

# main.tf

variable "repository_name" {
  type = string
}
resource "aws_codecommit_repository" "test" {
  repository_name = format("repo-%s", var.repository_name)
  description     = "Test repository."
}

Now we create a Terraform test file in the tests directory. See the following directory structure as an example:

├── main.tf 
└── tests 
└── basic.tftest.hcl

For this first test, we will not perform any assertion except for validating that Terraform execution plan runs successfully. In the tests file, we create a variable block to set the value for the variable repository_name. We also added the run block with command = plan to instruct Terraform test to run Terraform plan. The completed test should look like the following:

# basic.tftest.hcl

variables {
  repository_name = "MyRepo"
}

run "test_resource_creation" {
  command = plan
}

Now we will run this test locally. First ensure that you are authenticated into an AWS account, and run the terraform init command in the root directory of the Terraform module. After the provider is initialized, start the test using the terraform test command.

❯ terraform test
tests/basic.tftest.hcl... in progress
run "test_resource_creation"... pass
tests/basic.tftest.hcl... tearing down
tests/basic.tftest.hcl... pass

Our first test is complete, we have validated that the Terraform configuration is valid and the resource can be provisioned successfully. Next, let’s learn how to perform inspection of the resource state.

Create resource and validate resource name

Re-using the previous test file, we add the assertion block to checks if the CodeCommit repository name starts with a string repo- and provide error message if the condition fails. For the assertion, we use the startswith function. See the following example:

# basic.tftest.hcl

variables {
  repository_name = "MyRepo"
}

run "test_resource_creation" {
  command = plan

  assert {
    condition = startswith(aws_codecommit_repository.test.repository_name, "repo-")
    error_message = "CodeCommit repository name ${var.repository_name} did not start with the expected value of ‘repo-****’."
  }
}

Now, let’s assume that another module author made changes to the module by modifying the prefix from repo- to my-repo-. Here is the modified Terraform module.

# main.tf

variable "repository_name" {
  type = string
}
resource "aws_codecommit_repository" "test" {
  repository_name = format("my-repo-%s", var.repository_name)
  description = "Test repository."
}

We can catch this mistake by running the the terraform test command again.

❯ terraform test
tests/basic.tftest.hcl... in progress
run "test_resource_creation"... fail
╷
│ Error: Test assertion failed
│
│ on tests/basic.tftest.hcl line 9, in run "test_resource_creation":
│ 9: condition = startswith(aws_codecommit_repository.test.repository_name, "repo-")
│ ├────────────────
│ │ aws_codecommit_repository.test.repository_name is "my-repo-MyRepo"
│
│ CodeCommit repository name MyRepo did not start with the expected value 'repo-***'.
╵
tests/basic.tftest.hcl... tearing down
tests/basic.tftest.hcl... fail

Failure! 0 passed, 1 failed.

We have successfully created a unit test using assertions that validates the resource name matches the expected value. For more examples of using assertions see the Terraform Tests Docs. Before we proceed to the next section, don’t forget to fix the repository name in the module (revert the name back to repo- instead of my-repo-) and re-run your Terraform test.

Testing variable input validation

When developing Terraform modules, it is common to use variable validation as a contract test to validate any dependencies / restrictions. For example, AWS CodeCommit limits the repository name to 100 characters. A module author can use the length function to check the length of the input variable value. We are going to use Terraform test to ensure that the variable validation works effectively. First, we modify the module to use variable validation.

# main.tf

variable "repository_name" {
  type = string
  validation {
    condition = length(var.repository_name) <= 100
    error_message = "The repository name must be less than or equal to 100 characters."
  }
}

resource "aws_codecommit_repository" "test" {
  repository_name = format("repo-%s", var.repository_name)
  description = "Test repository."
}

By default, when variable validation fails during the execution of Terraform test, the Terraform test also fails. To simulate this, create a new test file and insert the repository_name variable with a value longer than 100 characters.

# var_validation.tftest.hcl

variables {
  repository_name = “this_is_a_repository_name_longer_than_100_characters_7rfD86rGwuqhF3TH9d3Y99r7vq6JZBZJkhw5h4eGEawBntZmvy”
}

run “test_invalid_var” {
  command = plan
}

Notice on this new test file, we also set the command to Terraform plan, why is that? Because variable validation runs prior to Terraform apply, thus we can save time and cost by skipping the entire resource provisioning. If we run this Terraform test, it will fail as expected.

❯ terraform test
tests/basic.tftest.hcl… in progress
run “test_resource_creation”… pass
tests/basic.tftest.hcl… tearing down
tests/basic.tftest.hcl… pass
tests/var_validation.tftest.hcl… in progress
run “test_invalid_var”… fail
╷
│ Error: Invalid value for variable
│
│ on main.tf line 1:
│ 1: variable “repository_name” {
│ ├────────────────
│ │ var.repository_name is “this_is_a_repository_name_longer_than_100_characters_7rfD86rGwuqhF3TH9d3Y99r7vq6JZBZJkhw5h4eGEawBntZmvy”
│
│ The repository name must be less than or equal to 100 characters.
│
│ This was checked by the validation rule at main.tf:3,3-13.
╵
tests/var_validation.tftest.hcl… tearing down
tests/var_validation.tftest.hcl… fail

Failure! 1 passed, 1 failed.

For other module authors who might iterate on the module, we need to ensure that the validation condition is correct and will catch any problems with input values. In other words, we expect the validation condition to fail with the wrong input. This is especially important when we want to incorporate the contract test in a CI/CD pipeline. To prevent our test from failing due introducing an intentional error in the test, we can use the expect_failures attribute. Here is the modified test file:

# var_validation.tftest.hcl

variables {
  repository_name = “this_is_a_repository_name_longer_than_100_characters_7rfD86rGwuqhF3TH9d3Y99r7vq6JZBZJkhw5h4eGEawBntZmvy”
}

run “test_invalid_var” {
  command = plan

  expect_failures = [
    var.repository_name
  ]
}

Now if we run the Terraform test, we will get a successful result.

❯ terraform test
tests/basic.tftest.hcl… in progress
run “test_resource_creation”… pass
tests/basic.tftest.hcl… tearing down
tests/basic.tftest.hcl… pass
tests/var_validation.tftest.hcl… in progress
run “test_invalid_var”… pass
tests/var_validation.tftest.hcl… tearing down
tests/var_validation.tftest.hcl… pass

Success! 2 passed, 0 failed.

As you can see, the expect_failures attribute is used to test negative paths (the inputs that would cause failures when passed into a module). Assertions tend to focus on positive paths (the ideal inputs). For an additional example of a test that validates functionality of a completed module with multiple interconnected resources, see this example in the Terraform CI/CD and Testing on AWS Workshop.

Orchestrating supporting resources

In practice, end-users utilize Terraform modules in conjunction with other supporting resources. For example, a CodeCommit repository is usually encrypted using an AWS Key Management Service (KMS) key. The KMS key is provided by end-users to the module using a variable called kms_key_id. To simulate this test, we need to orchestrate the creation of the KMS key outside of the module. In this section we will learn how to do that. First, update the Terraform module to add the optional variable for the KMS key.

# main.tf

variable "repository_name" {
  type = string
  validation {
    condition = length(var.repository_name) <= 100
    error_message = "The repository name must be less than or equal to 100 characters."
  }
}

variable "kms_key_id" {
  type = string
  default = ""
}

resource "aws_codecommit_repository" "test" {
  repository_name = format("repo-%s", var.repository_name)
  description = "Test repository."
  kms_key_id = var.kms_key_id != "" ? var.kms_key_id : null
}

In a Terraform test, you can instruct the run block to execute another helper module. The helper module is used by the test to create the supporting resources. We will create a sub-directory called setup under the tests directory with a single kms.tf file. We also create a new test file for KMS scenario. See the updated directory structure:

├── main.tf
└── tests
├── setup
│ └── kms.tf
├── basic.tftest.hcl
├── var_validation.tftest.hcl
└── with_kms.tftest.hcl

The kms.tf file is a helper module to create a KMS key and provide its ARN as the output value.

# kms.tf

resource "aws_kms_key" "test" {
  description = "test KMS key for CodeCommit repo"
  deletion_window_in_days = 7
}

output "kms_key_id" {
  value = aws_kms_key.test.arn
}

The new test will use two separate run blocks. The first run block (setup) executes the helper module to generate a KMS key. This is done by assigning the command apply which will run terraform apply to generate the KMS key. The second run block (codecommit_with_kms) will then use the KMS key ARN output of the first run as the input variable passed to the main module.

# with_kms.tftest.hcl

run "setup" {
  command = apply
  module {
    source = "./tests/setup"
  }
}

run "codecommit_with_kms" {
  command = apply

  variables {
    repository_name = "MyRepo"
    kms_key_id = run.setup.kms_key_id
  }

  assert {
    condition = aws_codecommit_repository.test.kms_key_id != null
    error_message = "KMS key ID attribute value is null"
  }
}

Go ahead and run the Terraform init, followed by Terraform test. You should get the successful result like below.

❯ terraform test
tests/basic.tftest.hcl... in progress
run "test_resource_creation"... pass
tests/basic.tftest.hcl... tearing down
tests/basic.tftest.hcl... pass
tests/var_validation.tftest.hcl... in progress
run "test_invalid_var"... pass
tests/var_validation.tftest.hcl... tearing down
tests/var_validation.tftest.hcl... pass
tests/with_kms.tftest.hcl... in progress
run "create_kms_key"... pass
run "codecommit_with_kms"... pass
tests/with_kms.tftest.hcl... tearing down
tests/with_kms.tftest.hcl... pass

Success! 4 passed, 0 failed.

We have learned how to run Terraform test and develop various test scenarios. In the next section we will see how to incorporate all the tests into a CI/CD pipeline.

Terraform Tests in CI/CD Pipelines

Now that we have seen how Terraform Test works locally, let’s see how the Terraform test can be leveraged to create a Terraform module validation pipeline on AWS. The following AWS services are used:

AWS CodeCommit – a secure, highly scalable, fully managed source control service that hosts private Git repositories.
AWS CodeBuild – a fully managed continuous integration service that compiles source code, runs tests, and produces ready-to-deploy software packages.
AWS CodePipeline – a fully managed continuous delivery service that helps you automate your release pipelines for fast and reliable application and infrastructure updates.
Amazon Simple Storage Service (Amazon S3) – an object storage service offering industry-leading scalability, data availability, security, and performance.

Terraform module validation pipeline

In the above architecture for a Terraform module validation pipeline, the following takes place:

A developer pushes Terraform module configuration files to a git repository (AWS CodeCommit).
AWS CodePipeline begins running the pipeline. The pipeline clones the git repo and stores the artifacts to an Amazon S3 bucket.
An AWS CodeBuild project configures a compute/build environment with Checkov installed from an image fetched from Docker Hub. CodePipeline passes the artifacts (Terraform module) and CodeBuild executes Checkov to run static analysis of the Terraform configuration files.
Another CodeBuild project configured with Terraform from an image fetched from Docker Hub. CodePipeline passes the artifacts (repo contents) and CodeBuild runs Terraform command to execute the tests.

CodeBuild uses a buildspec file to declare the build commands and relevant settings. Here is an example of the buildspec files for both CodeBuild Projects:

# Checkov
version: 0.1
phases:
  pre_build:
    commands:
      - echo pre_build starting

  build:
    commands:
      - echo build starting
      - echo starting checkov
      - ls
      - checkov -d .
      - echo saving checkov output
      - checkov -s -d ./ > checkov.result.txt

In the above buildspec, Checkov is run against the root directory of the cloned CodeCommit repository. This directory contains the configuration files for the Terraform module. Checkov also saves the output to a file named checkov.result.txt for further review or handling if needed. If Checkov fails, the pipeline will fail.

# Terraform Test
version: 0.1
phases:
  pre_build:
    commands:
      - terraform init
      - terraform validate

  build:
    commands:
      - terraform test

In the above buildspec, the terraform init and terraform validate commands are used to initialize Terraform, then check if the configuration is valid. Finally, the terraform test command is used to run the configured tests. If any of the Terraform tests fails, the pipeline will fail.

For a full example of the CI/CD pipeline configuration, please refer to the Terraform CI/CD and Testing on AWS workshop. The module validation pipeline mentioned above is meant as a starting point. In a production environment, you might want to customize it further by adding Checkov allow-list rules, linting, checks for Terraform docs, or pre-requisites such as building the code used in AWS Lambda.

Choosing various testing strategies

At this point you may be wondering when you should use Terraform tests or other tools such as Preconditions and Postconditions, Check blocks or policy as code. The answer depends on your test type and use-cases. Terraform test is suitable for unit tests, such as validating resources are created according to the naming specification. Variable validations and Pre/Post conditions are useful for contract tests of Terraform modules, for example by providing error warning when input variables value do not meet the specification. As shown in the previous section, you can also use Terraform test to ensure your contract tests are running properly. Terraform test is also suitable for integration tests where you need to create supporting resources to properly test the module functionality. Lastly, Check blocks are suitable for end to end tests where you want to validate the infrastructure state after all resources are generated, for example to test if a website is running after an S3 bucket configured for static web hosting is created.

When developing Terraform modules, you can run Terraform test in command = plan mode for unit and contract tests. This allows the unit and contract tests to run quicker and cheaper since there are no resources created. You should also consider the time and cost to execute Terraform test for complex / large Terraform configurations, especially if you have multiple test scenarios. Terraform test maintains one or many state files within the memory for each test file. Consider how to re-use the module’s state when appropriate. Terraform test also provides test mocking, which allows you to test your module without creating the real infrastructure.

Conclusion

In this post, you learned how to use Terraform test and develop various test scenarios. You also learned how to incorporate Terraform test in a CI/CD pipeline. Lastly, we also discussed various testing strategies for Terraform configurations and modules. For more information about Terraform test, we recommend the Terraform test documentation and tutorial. To get hands on practice building a Terraform module validation pipeline and Terraform deployment pipeline, check out the Terraform CI/CD and Testing on AWS Workshop.

Authors

Deploy CloudFormation Hooks to an Organization with service-managed StackSets

2024-01-18 Kirankumar Chandrashekar

Post Syndicated from Kirankumar Chandrashekar original https://aws.amazon.com/blogs/devops/deploy-cloudformation-hooks-to-an-organization-with-service-managed-stacksets/

This post demonstrates using AWS CloudFormation StackSets to deploy CloudFormation Hooks from a centralized delegated administrator account to all accounts within an Organization Unit(OU). It provides step-by-step guidance to deploy controls at scale to your AWS Organization as Hooks using StackSets. By following this post, you will learn how to deploy a hook to hundreds of AWS accounts in minutes.

AWS CloudFormation StackSets help deploy CloudFormation stacks to multiple accounts and regions with a single operation. Using service-managed permissions, StackSets automatically generate the IAM roles required to deploy stack instances, eliminating the need for manual creation in each target account prior to deployment. StackSets provide auto-deploy capabilities to deploy stacks to new accounts as they’re added to an Organizational Unit (OU) in AWS Organization. With StackSets, you can deploy AWS well-architected multi-account solutions organization-wide in a single click and target stacks to selected accounts in OUs. You can also leverage StackSets to auto deploy foundational stacks like networking, policies, security, monitoring, disaster recovery, billing, and analytics to new accounts. This ensures consistent security and governance reflecting AWS best practices.

AWS CloudFormation Hooks allow customers to invoke custom logic to validate resource configurations before a CloudFormation stack create/update/delete operation. This helps enforce infrastructure-as-code policies by preventing non-compliant resources. Hooks enable policy-as-code to support consistency and compliance at scale. Without hooks, controlling CloudFormation stack operations centrally across accounts is more challenging because governance checks and enforcement have to be implemented through disjointed workarounds across disparate services after the resources are deployed. Other options like Config rules evaluate resource configurations on a timed basis rather than on stack operations. And SCPs manage account permissions but don’t include custom logic tailored to granular resource configurations. In contrast, CloudFormation hooks allows customer-defined automation to validate each resource as new stacks are deployed or existing ones updated. This enables stronger compliance guarantees and rapid feedback compared to asynchronous or indirect policy enforcement via other mechanisms.

Follow the later sections of this post that provide a step-by-step implementation for deploying hooks across accounts in an organization unit (OU) with a StackSet including:

Configure service-managed permissions to automatically create IAM roles
Create the StackSet in the delegated administrator account
Target the OU to distribute hook stacks to member accounts

This shows how to easily enable a policy-as-code framework organization-wide.

I will show you how to register a custom CloudFormation hook as a private extension, restricting permissions and usage to internal administrators and automation. Registering the hook as a private extension limits discoverability and access. Only approved accounts and roles within the organization can invoke the hook, following security best practices of least privilege.

StackSets Architecture

As depicted in the following AWS StackSets architecture diagram, a dedicated Delegated Administrator Account handles creation, configuration, and management of the StackSet that defines the template for standardized provisioning. In addition, these centrally managed StackSets are deploying a private CloudFormation hook into all member accounts that belong to the given Organization Unit. Registering this as a private CloudFormation hook enables administrative control over the deployment lifecycle events it can respond to. Private hooks prevent public usage, ensuring the hook can only be invoked by approved accounts, roles, or resources inside your organization.

Architecture for deploying CloudFormation Hooks to accounts in an Organization

Diagram 1: StackSets Delegated Administration and Member Account Diagram

In the above architecture, Member accounts join the StackSet through their inclusion in a central Organization Unit. By joining, these accounts receive deployed instances of the StackSet template which provisions resources consistently across accounts, including the controlled private hook for administrative visibility and control.

The delegation of StackSet administration responsibilities to the Delegated Admin Account follows security best practices. Rather than having the sensitive central Management Account handle deployment logistics, delegation isolates these controls to an admin account with purpose-built permissions. The Management Account representing the overall AWS Organization focuses more on high-level compliance governance and organizational oversight. The Delegated Admin Account translates broader guardrails and policies into specific infrastructure automation leveraging StackSets capabilities. This separation of duties ensures administrative privileges are restricted through delegation while also enabling an organization-wide StackSet solution deployment at scale.

Centralized StackSets facilitate account governance through code-based infrastructure management rather than manual account-by-account changes. In summary, the combination of account delegation roles, StackSet administration, and joining through Organization Units creates an architecture to allow governed, infrastructure-as-code deployments across any number of accounts in an AWS Organization.

Sample Hook Development and Deployment

In the section, we will develop a hook on a workstation using the AWS CloudFormation CLI, package it, and upload it to the Hook Package S3 Bucket. Then we will deploy a CloudFormation stack that in turn deploys a hook across member accounts within an Organization Unit (OU) using StackSets.

The sample hook used in this blog post enforces that server-side encryption must be enabled for any S3 buckets and SQS queues created or updated on a CloudFormation stack. This policy requires that all S3 buckets and SQS queues be configured with server-side encryption when provisioned, ensuring security is built into our infrastructure by default. By enforcing encryption at the CloudFormation level, we prevent data from being stored unencrypted and minimize risk of exposure. Rather than manually enabling encryption post-resource creation, our developers simply enable it as a basic CloudFormation parameter. Adding this check directly into provisioning stacks leads to a stronger security posture across environments and applications. This example hook demonstrates functionality for mandating security best practices on infrastructure-as-code deployments.

Prerequisites

On the AWS Organization:

An AWS Organization setup with a management account or a delegated administrator for stack set deployments. For more information, see CloudFormation StackSets delegated administration for setting a delegated administrator.
An AWS Organization setup with Organization Units with AWS accounts. For more information, refer Managing organizational units.
Enable trusted access to CloudFormation StackSets with AWS Organizations. This is required for the service-managed permissions to work. Follow the steps in the AWS CloudFormation User Guide to enable trusted access with AWS Organizations.

On the workstation where the hooks will be developed:

AWS CloudFormation CLI installed
Any Code editor or a Cloud9 Instance
AWS CLI installedand configured with your AWS credentials that has access to the delegated administrator account. This post uses the us-west-2 Region.
```
aws configure
Default region name [us-east-1]: us-west-2
```

In the Delegated Administrator account:

Create a hooks package S3 bucket within the delegated administrator account. Upload the hooks package and CloudFormation templates that StackSets will deploy. Ensure the S3 bucket policy allows access from the AWS accounts within the OU. This access lets AWS CloudFormation access the hooks package objects and CloudFormation template objects in the S3 bucket from the member accounts during stack deployment.

Follow these steps to deploy a CloudFormation template that sets up the S3 bucket and permissions:

Click here to download the admin-cfn-hook-deployment-s3-bucket.yaml template file in to your local workstation.
Note: Make sure you model the S3 bucket and IAM policies as least privilege as possible. For the above S3 Bucket policy, you can add a list of IAM Role ARNs created by the StackSets service managed permissions instead of AWS: “*”, which allows S3 bucket access to all the IAM entities from the accounts in the OU. The ARN of this role will be “arn:aws:iam:::role/stacksets-exec-” in every member account within the OU. For more information about equipping least privilege access to IAM policies and S3 Bucket Policies, refer IAM Policies and Bucket Policies and ACLs! Oh, My! (Controlling Access to S3 Resources) blog post.
Execute the following command to deploy the template admin-cfn-hook-deployment-s3-bucket.yaml using AWS CLI. For more information see Creating a stack using the AWS Command Line Interface. If using AWS CloudFormation console, see Creating a stack on the AWS CloudFormation console.
To get the OU Id, see Viewing the details of an OU. OU Id starts with “ou-“. To get the Organization Id, see Viewing details about your organization. Organization Id starts with “o-”
```
aws cloudformation create-stack \
--stack-name hooks-asset-stack \
--template-body file://admin-cfn-deployment-s3-bucket.yaml \
--parameters ParameterKey=OrgId,ParameterValue="&lt;Org_id&gt;" \
ParameterKey=OUId,ParameterValue="&lt;OU_id&gt;"
```
After deploying the stack, note down the AWS S3 bucket name from the CloudFormation Outputs.

Hook Development

In this section, you will develop a sample CloudFormation hook package that will enforce encryption for S3 Buckets and SQS queues within the preCreate and preDelete hook. Follow the steps in the walkthrough to develop a sample hook and generate a zip package for deploying and enabling them in all the accounts within an OU. While following the walkthrough, within the Registering hooks section, make sure that you stop right after executing the cfn submit --dry-run command. The --dry-run option will make sure that your hook is built and packaged your without registering it with CloudFormation on your account. While initiating a Hook project if you created a new directory with the name mycompany-testing-mytesthook, the hook package will be generated as a zip file with the name mycompany-testing-mytesthook.zip at the root your hooks project.

Upload mycompany-testing-mytesthook.zip file to the hooks package S3 bucket within the Delegated Administrator account. The packaged zip file can then be distributed to enable the encryption hooks across all accounts in the target OU.

Note: If you are using your own hooks project and not doing the tutorial, irrespective of it, you should make sure that you are executing the cfn submit command with the --dry-run option. This ensures you have a hooks package that can be distributed and reused across multiple accounts.

Hook Deployment using CloudFormation Stack Sets

In this section, deploy the sample hook developed previously across all accounts within an OU. Use a centralized CloudFormation stack deployed from the delegated administrator account via StackSets.

Deploying hooks via CloudFormation requires these key resources:

AWS::CloudFormation::HookVersion: Publishes a new hook version to the CloudFormation registry
AWS::CloudFormation::HookDefaultVersion: Specifies the default hook version for the AWS account and region
AWS::CloudFormation::HookTypeConfig: Defines the hook configuration
AWS::IAM::Role #1: Task execution role that grants the hook permissions
AWS::IAM::Role #2: (Optional) role for CloudWatch logging that CloudFormation will assume to send log entries during hook execution
AWS::Logs::LogGroup: (Optional) Enables CloudWatch error logging for hook executions

Follow these steps to deploy CloudFormation Hooks to accounts within the OU using StackSets:

Click here to download the hooks-template.yaml template file into your local workstation and upload it into the Hooks package S3 bucket in the Delegated Administrator account.
Deploy the hooks CloudFormation template hooks-template.yaml to all accounts within an OU using StackSets. Leverage service-managed permissions for automatic IAM role creation across the OU.
To deploy the hooks template hooks-template.yaml across OU using StackSets, click here to download the CloudFormation StackSets template hooks-stack-sets-template.yaml locally, and upload it to the hooks package S3 bucket in the delegated administrator account. This StackSets template contains an AWS::CloudFormation::StackSet resource that will deploy the necessary hooks resources from hooks-template.yaml to all accounts in the target OU. Using SERVICE_MANAGED permissions model automatically handle provisioning the required IAM execution roles per account within the OU.
Execute the following command to deploy the template hooks-stack-sets-template.yaml using AWS CLI. For more information see Creating a stack using the AWS Command Line Interface. If using AWS CloudFormation console, see Creating a stack on the AWS CloudFormation console.To get the S3 Https URL for the hooks template, hooks package and StackSets template, login to the AWS S3 service on the AWS console, select the respective object and click on Copy URL button as shown in the following screenshot:
Diagram 2: S3 Https URL

To get the OU Id, see Viewing the details of an OU. OU Id starts with “ou-“.
Make sure to replace the <S3BucketName> and then <OU_Id> accordingly in the following command:
```
aws cloudformation create-stack --stack-name hooks-stack-set-stack \
--template-url https://<S3BucketName>.s3.us-west-2.amazonaws.com/hooks-stack-sets-template.yaml \
--parameters ParameterKey=OuId,ParameterValue="<OU_Id>" \
ParameterKey=HookTypeName,ParameterValue="MyCompany::Testing::MyTestHook" \
ParameterKey=s3TemplateURL,ParameterValue="https://<S3BucketName>.s3.us-west-2.amazonaws.com/hooks-template.yaml" \
ParameterKey=SchemaHandlerPackageS3URL,ParameterValue="https://<S3BucketName>.s3.us-west-2.amazonaws.com/mycompany-testing-mytesthook.zip"
```
Check the progress of the stack deployment using the aws cloudformation describe-stack command. Move to the next section when the stack status is CREATE_COMPLETE.
```
aws cloudformation describe-stacks --stack-name hooks-stack-set-stack
```
If you navigate to the AWS CloudFormation Service’s StackSets section in the console, you can view the stack instances deployed to the accounts within the OU. Alternatively, you can execute the AWS CloudFormation list-stack-instances CLI command below to list the deployed stack instances:
```
aws cloudformation list-stack-instances --stack-set-name MyTestHookStackSet
```

Testing the deployed hook

Deploy the following sample templates into any AWS account that is within the OU where the hooks was deployed and activated. Follow the steps in the Creating a stack on the AWS CloudFormation console. If using AWS CloudFormation CLI, follow the steps in the Creating a stack using the AWS Command Line Interface.

Provision a non-compliant stack without server-side encryption using the following template:
```
AWSTemplateFormatVersion: 2010-09-09
Description: |
  This CloudFormation template provisions an S3 Bucket
Resources:
  S3Bucket:
    Type: 'AWS::S3::Bucket'
    Properties: {}
```
The stack deployment will not succeed and will give the following error message

The following hook(s) failed: [MyCompany::Testing::MyTestHook] and the hook status reason as shown in the following screenshot:

Diagram 3: S3 Bucket creation failure with hooks execution

Provision a stack using the following template that has server-side encryption for the S3 Bucket.

AWSTemplateFormatVersion: 2010-09-09
Description: |
  This CloudFormation template provisions an encrypted S3 Bucket. **WARNING** This template creates an Amazon S3 bucket and a KMS key that you will be charged for. You will be billed for the AWS resources used if you create a stack from this template.
Resources:
  EncryptedS3Bucket:
    Type: "AWS::S3::Bucket"
    Properties:
      BucketName: !Sub "encryptedbucket-${AWS::Region}-${AWS::AccountId}"
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: "aws:kms"
              KMSMasterKeyID: !Ref EncryptionKey
            BucketKeyEnabled: true
  EncryptionKey:
    Type: "AWS::KMS::Key"
    DeletionPolicy: Retain
    UpdateReplacePolicy: Retain
    Properties:
      Description: KMS key used to encrypt the resource type artifacts
      EnableKeyRotation: true
      KeyPolicy:
        Version: 2012-10-17
        Statement:
          - Sid: Enable full access for owning account
            Effect: Allow
            Principal:
              AWS: !Ref "AWS::AccountId"
            Action: "kms:*"
            Resource: "*"
Outputs:
  EncryptedBucketName:
    Value: !Ref EncryptedS3Bucket

The deployment will succeed as it will pass the hook validation with the following hook status reason as shown in the following screenshot:

stack deployment pass due to hooks execution Diagram 4: S3 Bucket creation success with hooks execution

Updating the hooks package

To update the hooks package, follow the same steps described in the Hooks Development section to change the hook code accordingly. Then, execute the cfn submit --dry-run command to build and generate the hooks package file with the registering the type with the CloudFormation registry. Make sure to rename the zip file with a unique name compared to what was previously used. Otherwise, while updating the CloudFormation StackSets stack, it will not see any changes in the template and thus not deploy updates. The best practice is to use a CI/CD pipeline to manage the hook package. Typically, it is good to assign unique version numbers to the hooks packages so that CloudFormation stacks with the new changes get deployed.

Cleanup

Navigate to the AWS CloudFormation console on the Delegated Administrator account, and note down the Hooks package S3 bucket name and empty its contents. Refer to Emptying the Bucket for more information.

Delete the CloudFormation stacks in the following order:

Test stack that failed
Test stack that passed
StackSets CloudFormation stack. This has a DeletionPolicy set to Retain, update the stack by removing the DeletionPolicy and then initiate a stack deletion via CloudFormation or physically delete the StackSet instances and StackSets from the Console or CLI by following: 1. Delete stack instances from your stack set 2. Delete a stack set
Hooks asset CloudFormation stack

Refer to the following documentation to delete CloudFormation Stacks: Deleting a stack on the AWS CloudFormation console or Deleting a stack using AWS CLI.

Conclusion

Throughout this blog post, you have explored how AWS StackSets enable the scalable and centralized deployment of CloudFormation hooks across all accounts within an Organization Unit. By implementing hooks as reusable code templates, StackSets provide consistency benefits and slash the administrative labor associated with fragmented and manual installs. As organizations aim to fortify governance, compliance, and security through hooks, StackSets offer a turnkey mechanism to efficiently reach hundreds of accounts. By leveraging the described architecture of delegated StackSet administration and member account joining, organizations can implement a single hook across hundreds of accounts rather than manually enabling hooks per account. Centralizing your hook code-base within StackSets templates facilitates uniform adoption while also simplifying maintenance. Administrators can update hooks in one location instead of attempting fragmented, account-by-account changes. By enclosing new hooks within reusable StackSets templates, administrators benefit from infrastructure-as-code descriptiveness and version control instead of one-off scripts. Once configured, StackSets provide automated hook propagation without overhead. The delegated administrator merely needs to include target accounts through their Organization Unit alignment rather than handling individual permissions. New accounts added to the OU automatically receive hook deployments through the StackSet orchestration engine.

About the Author

Kirankumar Chandrashekar is a Sr. Solutions Architect for Strategic Accounts at AWS. He focuses on leading customers in architecting DevOps, modernization using serverless, containers and container orchestration technologies like Docker, ECS, EKS to name a few. Kirankumar is passionate about DevOps, Infrastructure as Code, modernization and solving complex customer issues. He enjoys music, as well as cooking and traveling.

Using AWS CloudFormation and AWS Cloud Development Kit to provision multicloud resources

2023-09-08 Aaron Sempf

Post Syndicated from Aaron Sempf original https://aws.amazon.com/blogs/devops/using-aws-cloudformation-and-aws-cloud-development-kit-to-provision-multicloud-resources/

Customers often need to architect solutions to support components across multiple cloud service providers, a need which may arise if they have acquired a company running on another cloud, or for functional purposes where specific services provide a differentiated capability. In this post, we will show you how to use the AWS Cloud Development Kit (AWS CDK) to create a single pane of glass for managing your multicloud resources.

AWS CDK is an open source framework that builds on the underlying functionality provided by AWS CloudFormation. It allows developers to define cloud resources using common programming languages and an abstraction model based on reusable components called constructs. There is a misconception that CloudFormation and CDK can only be used to provision resources on AWS, but this is not the case. The CloudFormation registry, with support for third party resource types, along with custom resource providers, allow for any resource that can be configured via an API to be created and managed, regardless of where it is located.

Multicloud solution design paradigm

Multicloud solutions are often designed with services grouped and separated by cloud, creating a segregation of resource and functions within the design. This approach leads to a duplication of layers of the solution, most commonly a duplication of resources and the deployment processes for each environment. This duplication increases cost, and leads to a complexity of management increasing the potential break points within the solution or practice.

Along with simplifying resource deployments, and the ever-increasing complexity of customer needs, so too has the need increased for the capability of IaC solutions to deploy resources across hybrid or multicloud environments. Through meeting this need, a proliferation of supported tools, frameworks, languages, and practices has created “choice overload”. At worst, this scares the non-cloud-savvy away from adopting an IaC solution benefiting their cloud journey, and at best confuses the very reason for adopting an IaC practice.

A single pane of glass

Systems Thinking is a holistic approach that focuses on the way a system’s constituent parts interrelate and how systems work as a whole especially over time and within the context of larger systems. Systems thinking is commonly accepted as the backbone of a successful systems engineering approach. Designing solutions taking a full systems view, based on the component’s function and interrelation within the system across environments, more closely aligns with the ability to handle the deployment of each cloud-specific resource, from a single control plane.

While AWS provides a list of services that can be used to help design, manage and operate hybrid and multicloud solutions, with AWS as the primary cloud you can go beyond just using services to support multicloud. CloudFormation registry resource types model and provision resources using custom logic, as a component of stacks in CloudFormation. Public extensions are not only provided by AWS, but third-party extensions are made available for general use by publishers other than AWS, meaning customers can create their own extensions and publish them for anyone to use.

The AWS CDK, which has a 1:1 mapping of all AWS CloudFormation resources, as well as a library of abstracted constructs, supports the ability to import custom AWS CloudFormation extensions, enabling customers and partners to create custom AWS CDK constructs for their extensions. The chosen programming language can be used to inherit and abstract the custom resource into reusable AWS CDK constructs, allowing developers to create solutions that contain native AWS extensions along with secondary hybrid or alternate cloud resources.

Providing the ability to integrate mixed resources in the same stack more closely aligns with the functional design and often diagrammatic depiction of the solution. In essence, we are creating a single IaC pane of glass over the entire solution, deployed through a single control plane. This lowers the complexity and the cost of maintaining separate modules and deployment pipelines across multiple cloud providers.

A common use case for a multicloud: disaster recovery

One of the most common use cases of the requirement for using components across different cloud providers is the need to maintain data sovereignty while designing disaster recovery (DR) into a solution.

Data sovereignty is the idea that data is subject to the laws of where it is physically located, and in some countries extends to regulations that if data is collected from citizens of a geographical area, then the data must reside in servers located in jurisdictions of that geographical area or in countries with a similar scope and rigor in their protection laws.

This requires organizations to remain in compliance with their host country, and in cases such as state government agencies, a stricter scope of within state boundaries, data sovereignty regulations. Unfortunately, not all countries, and especially not all states, have multiple AWS regions to select from when designing where their primary and recovery data backups will reside. Therefore, the DR solution needs to take advantage of multiple cloud providers in the same geography, and as such a solution must be designed to backup or replicate data across providers.

The multicloud solution

A multicloud solution to the proposed use case would be the backup of data from an AWS resource such as an Amazon S3 bucket to another cloud provider within the same geography, such as an Azure Blob Storage container, using AWS event driven behaviour to trigger the copying of data from the primary AWS resource to the secondary Azure backup resource.

Following the IaC single pane of glass approach, the Azure Blob Storage container is created as a resource type in the CloudFormation Registry, and imported into the AWS CDK to be used as a construct in the solution. However, before the extension resource type can be used effectively in the CDK as a reusable construct and added to your private library, you will first need to go through the import into CDK process for creating Constructs.

There are three different levels of constructs, beginning with low-level constructs, which are called CFN Resources (or L1, short for “layer 1”). These constructs directly represent all resources available in AWS CloudFormation. They are named CfnXyz, where Xyz is name of the resource.

Layer 1 Construct

In this example, an L1 construct named CfnAzureBlobStorage represents an Azure::BlobStorage AWS CloudFormation extension. Here you also explicitly configure the ref property, in order for higher level constructs to access the Output value which will be the Azure blob container url being provisioned.

import { CfnResource } from "aws-cdk-lib";
import { Secret, ISecret } from "aws-cdk-lib/aws-secretsmanager";
import { Construct } from "constructs";

export interface CfnAzureBlobStorageProps {
  subscriptionId: string;
  clientId: string;
  tenantId: string;
  clientSecretName: string;
}

// L1 Construct
export class CfnAzureBlobStorage extends Construct {
  // Allows accessing the ref property
  public readonly ref: string;

  constructor(scope: Construct, id: string, props: CfnAzureBlobStorageProps) {
    super(scope, id);

    const secret = this.getSecret("AzureClientSecret", props.clientSecretName);
    
    const azureBlobStorage = new CfnResource(
      this,
      "ExtensionAzureBlobStorage",
      {
        type: "Azure::BlobStorage",
        properties: {
          AzureSubscriptionId: props.subscriptionId,
          AzureClientId: props.clientId,
          AzureTenantId: props.tenantId,
          AzureClientSecret: secret.secretValue.unsafeUnwrap()
        },
      }
    );

    this.ref = azureBlobStorage.ref;
  }

  private getSecret(id: string, secretName: string) : ISecret {  
    return Secret.fromSecretNameV2(this, secretName.concat("Value"), secretName);
  }
}

As with every CDK Construct, the constructor arguments are scope, id and props. scope and id are propagated to the cdk.Construct base class. The props argument is of type CfnAzureBlobStorageProps which includes four properties all of type string. This is how the Azure credentials are propagated down from upstream constructs.

Layer 2 Construct

The next level of constructs, L2, also represent AWS resources, but with a higher-level, intent-based API. They provide similar functionality, but incorporate the defaults, boilerplate, and glue logic you’d be writing yourself with a CFN Resource construct. They also provide convenience methods that make it simpler to work with the resource.

In this example, an L2 construct is created to abstract the CfnAzureBlobStorage L1 construct and provides additional properties and methods.

import { Construct } from "constructs";
import { CfnAzureBlobStorage } from "./cfn-azure-blob-storage";

// L2 Construct
export class AzureBlobStorage extends Construct {
  public readonly blobContainerUrl: string;

  constructor(
    scope: Construct,
    id: string,
    subscriptionId: string,
    clientId: string,
    tenantId: string,
    clientSecretName: string
  ) {
    super(scope, id);

    const azureBlobStorage = new CfnAzureBlobStorage(
      this,
      "CfnAzureBlobStorage",
      {
        subscriptionId: subscriptionId,
        clientId: clientId,
        tenantId: tenantId,
        clientSecretName: clientSecretName,
      }
    );

    this.blobContainerUrl = azureBlobStorage.ref;
  }
}

The custom L2 construct class is declared as AzureBlobStorage, this time without the Cfn prefix to represent an L2 construct. This time the constructor arguments include the Azure credentials and client secret, and the ref from the L1 construct us output to the public variable AzureBlobContainerUrl.

As an L2 construct, the AzureBlobStorage construct could be used in CDK Apps along with AWS Resource Constructs in the same Stack, to be provisioned through AWS CloudFormation creating the IaC single pane of glass for a multicloud solution.

Layer 3 Construct

The true value of the CDK construct programming model is in the ability to extend L2 constructs, which represent a single resource, into a composition of multiple constructs that provide a solution for a common task. These are Layer 3, L3, Constructs also known as patterns.

In this example, the L3 construct represents the solution architecture to backup objects uploaded to an Amazon S3 bucket into an Azure Blob Storage container in real-time, using AWS Lambda to process event notifications from Amazon S3.

import { RemovalPolicy, Duration, CfnOutput } from "aws-cdk-lib";
import { Bucket, BlockPublicAccess, EventType } from "aws-cdk-lib/aws-s3";
import { DockerImageFunction, DockerImageCode } from "aws-cdk-lib/aws-lambda";
import { PolicyStatement, Effect } from "aws-cdk-lib/aws-iam";
import { LambdaDestination } from "aws-cdk-lib/aws-s3-notifications";
import { IStringParameter, StringParameter } from "aws-cdk-lib/aws-ssm";
import { Secret, ISecret } from "aws-cdk-lib/aws-secretsmanager";
import { Construct } from "constructs";
import { AzureBlobStorage } from "./azure-blob-storage";

// L3 Construct
export class S3ToAzureBackupService extends Construct {
  constructor(
    scope: Construct,
    id: string,
    azureSubscriptionIdParamName: string,
    azureClientIdParamName: string,
    azureTenantIdParamName: string,
    azureClientSecretName: string
  ) {
    super(scope, id);

    // Retrieve existing SSM Parameters
    const azureSubscriptionIdParameter = this.getSSMParameter("AzureSubscriptionIdParam", azureSubscriptionIdParamName);
    const azureClientIdParameter = this.getSSMParameter("AzureClientIdParam", azureClientIdParamName);
    const azureTenantIdParameter = this.getSSMParameter("AzureTenantIdParam", azureTenantIdParamName);    
    
    // Retrieve existing Azure Client Secret
    const azureClientSecret = this.getSecret("AzureClientSecret", azureClientSecretName);

    // Create an S3 bucket
    const sourceBucket = new Bucket(this, "SourceBucketForAzureBlob", {
      removalPolicy: RemovalPolicy.RETAIN,
      blockPublicAccess: BlockPublicAccess.BLOCK_ALL,
    });

    // Create a corresponding Azure Blob Storage account and a Blob Container
    const azurebBlobStorage = new AzureBlobStorage(
      this,
      "MyCustomAzureBlobStorage",
      azureSubscriptionIdParameter.stringValue,
      azureClientIdParameter.stringValue,
      azureTenantIdParameter.stringValue,
      azureClientSecretName
    );

    // Create a lambda function that will receive notifications from S3 bucket
    // and copy the new uploaded object to Azure Blob Storage
    const copyObjectToAzureLambda = new DockerImageFunction(
      this,
      "CopyObjectsToAzureLambda",
      {
        timeout: Duration.seconds(60),
        code: DockerImageCode.fromImageAsset("copy_s3_fn_code", {
          buildArgs: {
            "--platform": "linux/amd64"
          }
        }),
      },
    );

    // Add an IAM policy statement to allow the Lambda function to access the
    // S3 bucket
    sourceBucket.grantRead(copyObjectToAzureLambda);

    // Add an IAM policy statement to allow the Lambda function to get the contents
    // of an S3 object
    copyObjectToAzureLambda.addToRolePolicy(
      new PolicyStatement({
        effect: Effect.ALLOW,
        actions: ["s3:GetObject"],
        resources: [`arn:aws:s3:::${sourceBucket.bucketName}/*`],
      })
    );

    // Set up an S3 bucket notification to trigger the Lambda function
    // when an object is uploaded
    sourceBucket.addEventNotification(
      EventType.OBJECT_CREATED,
      new LambdaDestination(copyObjectToAzureLambda)
    );

    // Grant the Lambda function read access to existing SSM Parameters
    azureSubscriptionIdParameter.grantRead(copyObjectToAzureLambda);
    azureClientIdParameter.grantRead(copyObjectToAzureLambda);
    azureTenantIdParameter.grantRead(copyObjectToAzureLambda);

    // Put the Azure Blob Container Url into SSM Parameter Store
    this.createStringSSMParameter(
      "AzureBlobContainerUrl",
      "Azure blob container URL",
      "/s3toazurebackupservice/azureblobcontainerurl",
      azurebBlobStorage.blobContainerUrl,
      copyObjectToAzureLambda
    );      

    // Grant the Lambda function read access to the secret
    azureClientSecret.grantRead(copyObjectToAzureLambda);

    // Output S3 bucket arn
    new CfnOutput(this, "sourceBucketArn", {
      value: sourceBucket.bucketArn,
      exportName: "sourceBucketArn",
    });

    // Output the Blob Conatiner Url
    new CfnOutput(this, "azureBlobContainerUrl", {
      value: azurebBlobStorage.blobContainerUrl,
      exportName: "azureBlobContainerUrl",
    });
  }

}

The custom L3 construct can be used in larger IaC solutions by calling the class called S3ToAzureBackupService and providing the Azure credentials and client secret as properties to the constructor.

import * as cdk from "aws-cdk-lib";
import { Construct } from "constructs";
import { S3ToAzureBackupService } from "./s3-to-azure-backup-service";

export class MultiCloudBackupCdkStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const s3ToAzureBackupService = new S3ToAzureBackupService(
      this,
      "MyMultiCloudBackupService",
      "/s3toazurebackupservice/azuresubscriptionid",
      "/s3toazurebackupservice/azureclientid",
      "/s3toazurebackupservice/azuretenantid",
      "s3toazurebackupservice/azureclientsecret"
    );
  }
}

Solution Diagram

Diagram 1: IaC Single Control Plane, demonstrates the concept of the Azure Blob Storage extension being imported from the AWS CloudFormation Registry into AWS CDK as an L1 CfnResource, wrapped into an L2 Construct and used in an L3 pattern alongside AWS resources to perform the specific task of backing up from and Amazon s3 Bucket into an Azure Blob Storage Container.

Diagram 1: IaC Single Control Plan

The CDK application is then synthesized into one or more AWS CloudFormation Templates, which result in the CloudFormation service deploying AWS resource configurations to AWS and Azure resource configurations to Azure.

This solution demonstrates not only how to consolidate the management of secondary cloud resources into a unified infrastructure stack in AWS, but also the improved productivity by eliminating the complexity and cost of operating multiple deployment mechanisms into multiple public cloud environments.

The following video demonstrates an example in real-time of the end-state solution:

Next Steps

While this was just a straightforward example, with the same approach you can use your imagination to come up with even more and complex scenarios where AWS CDK can be used as a single pane of glass for IaC to manage multicloud and hybrid solutions.

To get started with the solution discussed in this post, this workshop will provide you with the instructions you need to understand the steps required to create the S3ToAzureBackupService.

Once you have learned how to create AWS CloudFormation extensions and develop them into AWS CDK Constructs, you will learn how, with just a few lines of code, you can develop reusable multicloud unified IaC solutions that deploy through a single AWS control plane.

Conclusion

By adopting AWS CloudFormation extensions and AWS CDK, deployed through a single AWS control plane, the cost and complexity of maintaining deployment pipelines across multiple cloud providers is reduced to a single holistic solution-focused pipeline. The techniques demonstrated in this post and the related workshop provide a capability to simplify the design of complex systems, improve the management of integration, and more closely align the IaC and deployment management practices with the design.

About the authors:

How organizations are modernizing for cloud operations

2023-06-07 Adam Keller

Post Syndicated from Adam Keller original https://aws.amazon.com/blogs/devops/how-organizations-are-modernizing-for-cloud-operations/

Over the past decade, we’ve seen a rapid evolution in how IT operations teams and application developers work together. In the early days, there was a clear division of responsibilities between the two teams, with one team focused on providing and maintaining the servers and various components (i.e., storage, DNS, networking, etc.) for the application to run, while the other primarily focused on developing the application’s features, fixing bugs, and packaging up their artifacts for the operations team to deploy. Ultimately, this division led to a siloed approach which presented glaring challenges. These siloes hindered communication between the teams, which would often result in developers being ready to ship code and passing it over to the operations teams with little to no collaboration prior. In turn, operations teams were often left scrambling trying to deliver on the requirements at the last minute. This would lead to bottlenecks in software delivery, delaying features and bug fixes from being shipped. Aside from software delivery, operations teams were primarily responsible for handling on-call duties, which encompassed addressing issues arising from both applications and infrastructure. Consequently, when incidents occurred, the operations teams were the ones receiving alerts, irrespective of the source of the problem. This raised the question: what motivates the software developers to create resilient and dependable software? Terms such as “throw it over the wall” and “it works on my laptop” were coined because of this and are still commonly referenced in discussion today.

The DevOps movement emerged in response to these challenges, aiming to build a bridge between developers and operations teams. DevOps focuses on collaboration between the two teams through communication and integration by fostering a culture of shared responsibility. This approach promotes the use of automation of infrastructure and application code leveraging continuous integration (CI) and continuous delivery (CD), microservices architectures, and visibility through monitoring, logging and tracing. The end result of operating in a DevOps model provides quicker and more reliable release cycles. While the ideology is well intentioned, implementing a DevOps practice is not easy as organizations struggle to adapt and adhere to the cultural expectations. In addition, teams can struggle to find the right balance between speed and stability, which often times results in reverting back to old behaviors due to fear of downtime and instability of their environments. While DevOps is very focused on culture through collaboration and automation, not all developers want to be involved in operations and vice versa. This poses the question: how do organizations centralize a frictionless developer experience, with guardrails and best practices baked in, while providing a golden path for developers to self serve? This is where platform engineering comes in.

Platform engineering has emerged as a critical discipline for organizations, which is driving the next evolution of infrastructure and operations, while simultaneously empowering developers to create and deliver robust, scalable applications. It aims to improve developer experience by providing self service mechanisms that provide some level of abstraction for provisioning resources, with good practices baked in. This builds on top of DevOps practices by enabling the developer to have full control of their resources through self service, without having to throw it over the wall. There are various ways that platform engineering teams implement these self service interfaces, from leveraging a GitOps focused strategy to building Internal Developer Platforms with a UI and/or API. With the increasing demand for faster and more agile development, many organizations are adopting this model to streamline their operations, gain visibility, reduce costs, and lower the friction of onboarding new applications.

In this blog post, we will explore the common operational models used within organizations today, where platform engineering fits within these models, the common patterns used to build and develop these self-service platforms, and what lies ahead for this emerging field.

Operational Models

It’s important for us to start by understanding how we see technology teams operate today and the various ways they support development teams from instantiating infrastructure to defining pipelines and deploying application code. In the below diagram we highlight the four common operational models and will discuss each to understand the benefits and challenges they bring. This is also critical in understanding where platform teams fit, and where they don’t.

This image shows a sliding scale of the various provisioning models. For each model it shows the interaction between developers and the platform team.

Centralized Provisioning

In a centralized provisioning model, the responsibility for architecting, deploying, and managing infrastructure falls primarily on a centralized team. Organizations assign enforcement of controls into specific roles with narrow scope, including release management, process management, and segmentation of siloed teams (networking, compute, pipelines, etc). The request model generally requires a ticket or request to be sent to the central or dedicated siloed team, ticket enters a backlog, and the developers wait until resources can be provisioned on their behalf. In an ideal world, the central teams can quickly provision the resources and pipelines to get the developers up and running; but, in reality these teams are busy with work and have to prioritize accordingly which often times leaves development teams waiting or having to predict what they need well in advance.

While this model provides central control over resource provisioning, it introduces bottlenecks into the delivery process and generally results in slower deployment cycles and feedback loops. This model becomes especially challenging when supporting a large number of development teams with varying requirements and use cases. Ultimately this model can lead to frustration and friction between teams and hence why organizations after some time look to move away from operating in this model. This leads us to segue into the next model, which is the Platform-enabled Golden Path.

Platform-enabled Golden Path

The platform-enabled golden path model is an approach that allows for developer to have some form of customization while still maintaining consistency by following a set of standards. In this model, platform engineers clearly lay out “preferred” standards with sane defaults, guardrails, and good practices based on common architectures that development teams can use as-is. Sophisticated platform teams may implement their own customizations on top of this framework in the following ways:

Leveraging a managed IaC orchestration service (such as AWS Proton, AWS Service Catalog, or Amazon CodeCatalyst)
Building out a custom Internal Developer Platform (Backstage, or built in house)
Implementing a GitOps model and tool (Flux, ArgoCD, etc)
Managing templates in a central repository for reuse with documentation
Building custom libraries or modules (AWS Cloud Development Kit (CDK) Constructs/Libraries, Terraform Modules, etc)

The platform engineering team is responsible for creating and updating the templates, with maintenance responsibilities typically being shared. This approach strikes a balance between consistency and flexibility, allowing for some customization while still maintaining standards. However, it can be challenging to maintain visibility across the organization, as development teams have more freedom to customize their infrastructure. This becomes especially challenging when platform teams want a change to propagate across resources deployed by the various development teams building on top of these patterns.

Embedded DevOps

Embedded DevOps is a model in which DevOps engineers are directly aligned with development teams to define, provision, and maintain their infrastructure. There are a couple of common patterns around how organizations use this model.

Floating model: A central DevOps team can leverage a floating model where a DevOps engineer will be directly embedded onto a development team early in the development process to help build out the required pipelines and infrastructure resources, and jump to another team once everything is up and running.
Permanent embedded model: Alternatively, a development team can have a permanent DevOps engineer on the team to help support early iterations as well as maintenance as the application evolves. The DevOps engineer is ideally there from the beginning of the project and continues to support and improve the infrastructure and automation based on feature requests and bug fixes.

A central platform and/or architecture team may define the acceptable configurations and resources, while DevOps engineers decide how to best use them to meet the needs of their development team. Individual teams are responsible for maintenance and updating of the templates and pipelines. This model offers greater agility and flexibility, but also requires the funding to hire DevOps engineers per development team, which can become costly as development teams scale. It’s important that when operating in this model to maintain collaboration between members of the DevOps team to ensure that best practices can be shared.

Decentralized DevOps

Lastly, the decentralized DevOps model gives development teams full end-to-end ownership and responsibility for defining and managing their infrastructure and pipelines. A central team may be focused on building out guardrails and boundaries to ensure that they limit the blast radius within the boundaries. They can also create a process to ensure that infrastructure deployed meets company standards, while ensuring development teams are free to make design decisions and remain autonomous. This approach offers the greatest agility and flexibility, but also the highest risk of inconsistency, errors, and security vulnerabilities. Additionally, this model requires a cultural shift in the organization because the development teams now own the entire stack, which results in more responsibility. This model can be a deterrent to developers, especially if they are unfamiliar with building resources in the cloud and/or don’t want to do it.

Overall, each model has its strengths and weaknesses, and the purpose of this blog is to educate on the patterns that are emerging. Ultimately the right approach depends on the organization’s specific needs and goals as well as their willingness to shift culturally. Of the above patterns, the two that are emerging as the most common are Platform-enabled Golden Path and Decentralized DevOps. Furthermore, we’re seeing that more often than not platform teams are finding themselves going back and forth between the two patterns within the same organization. This is in part due to technology making infrastructure creation in the cloud more accessible through abstraction and automation (think of tools like the AWS Cloud Development Kit (CDK), AWS Serverless Application Model (SAM) CLI, AWS Copilot, Serverless framework, etc). Let’s now look at the technology patterns that are emerging to support these use cases.

Emerging patterns

Of the trends that are on the rise, Internal Developer Platforms and GitOps practices are becoming increasingly popular in the industry due to their ability to streamline the software development process and improve collaboration between development and platform teams. Internal Developer Platforms provide a centralized platform for developers to access resources and tools needed to build, test, deploy, and monitor applications and associated infrastructure resources. By providing a self-service interface with pre-approved patterns (via UI, API, or Git), internal developer platforms empower development teams to work independently and collaborate with one another more effectively. This reduces the burden on IT and operations teams while also increasing the agility and speed of development as developers aren’t required to wait in line to get resources provisioned. The paradigm shifts with Internal Developer Platforms because the platform teams are focused on building the blueprints and defining the standards for backend resources that development teams centrally consume via the provided interfaces. The platform team should view the internal developer platform as a product and look at developers as their customer.

While internal developer platforms provide a lot of value and abstraction through a UI and API’s, some organizations prefer to use Git as the center of deployment orchestration, and this is where leveraging GitOps can help. GitOps is a methodology that leverages Git as the source of orchestrating and managing the deployment of infrastructure and applications. With GitOps, infrastructure is defined declaratively as code, and changes are tracked in Git, allowing for a more standardized and automated deployment process. Using git for deployment orchestration is not new, but there are some concepts with GitOps that take Git orchestration to a new level.

Let’s look at the principles of GitOps, as defined by OpenGitOps:

Declarative
- A system managed by GitOps must have its desired state expressed declaratively.
Versioned and Immutable
- Desired state is stored in a way that enforces immutability, versioning and retains a complete version history.
Pulled Automatically
- Software agents automatically pull the desired state declarations from the source.
Continuously Reconciled
- Software agents continuously observe actual system state and attempt to apply the desired state.

GitOps helps to reduce the risk of errors and improve consistency across the organization as all change is tracked centrally. Additionally this provides developers with a familiar interface in git as well as the ability to store the desired state of their infrastructure and applications in one place. Lastly, GitOps is focused on ensuring that the desired state in git is always maintained, and if drift occurs, an external process will reconcile the state of the resources. GitOps was born in the Kubernetes ecosystem using tools like Flux and ArgoCD.

The final emerging trend to discuss is particularly relevant to teams functioning within a decentralized DevOps model, possessing end-to-end responsibility for the stack, encompassing infrastructure and application delivery. The amount of cognitive load required to connect the underlying cloud resources together while also being an expert in building out business logic for the application is extremely high, and hence why teams look to harness the power of abstraction and automation for infrastructure provisioning. While this may appear analogous to previously mentioned practices, the key distinction lies in the utilization of tools specifically designed to enhance the developer experience. By abstracting various components (such as networking, identity, and stitching everything together), these tools eliminate the necessity for interaction with centralized teams, empowering developers to operate autonomously and assume complete ownership of the infrastructure. This trend is exemplified by the adoption of innovative tools such as AWS App Composer, AWS CodeCatalyst, SAM CLI, AWS Copilot CLI, and the AWS Cloud Development Kit (CDK).

Looking ahead

If there is one thing that we can ascertain it’s that the journey to successful developer enablement is ongoing, and it’s clear that finding that balance of speed, security, and flexibility can be difficult to achieve. Throughout all of these evolutionary trends in technology, Git has remained as the nucleus of infrastructure and application deployment automation. This is not new; however, the processes being built around Git such as GitOps are. The industry continues to gravitate towards this model, and at AWS we are looking at ways to enable builders to leverage git as the source of truth with simple integrations. For example, AWS Proton has built integrations with git for central template storage with a feature called template sync and recently released a feature called service sync, which allows developers to configure and deploy their Proton services using Git. These features empower the platform team and developers to seamlessly store their templates and desired infrastructure resource states within Git, requiring no additional effort beyond the initial setup.

We also see that interest in building internal developer platforms is on a sharp incline, and it’s still in the early days. With tools like AWS Proton, AWS Service Catalog, Backstage, and other SaaS providers, platform teams are able to define patterns centrally for developers to self serve patterns via a library or “shopping cart”. As mentioned earlier, it’s vital that the teams building out the internal developer platforms think of ways to enable the developer to deploy supplemental resources that aren’t defined in the central templates. While the developer platform can solve the majority of the use cases, it’s nearly impossible to solve them all. If you can’t enable developers to deploy resources on top of their platform deployed services, you’ll find that you’re back to the original problem statement outlined in the beginning of this blog which can ultimately result in a failed implementation. AWS Proton solves this through a feature we call components, which enables developers to bring their own IaC templates to deploy on top of their services deployed through Proton.

The rising popularity of the aforementioned patterns reveals an unmet need for developers who seek to tailor their cloud resources according to the specific requirements of their applications and the demands of platform/central teams that require governance. This is particularly prevalent in serverless workloads, where developers often integrate their application and infrastructure code, utilizing services such as AWS Step Functions to transfer varying degrees of logic from the application layer to the managed service itself. Centralizing these resources becomes increasingly challenging due to their dynamic nature, which adapts to the evolving requirements of business logic. Consequently, it is nearly impossible to consolidate these patterns into a universally applicable blueprint for reuse across diverse business scenarios.

As the distinction between cloud resources and application code becomes increasingly blurred, developers are compelled to employ tools that streamline the underlying logic, enabling them to achieve their desired outcomes swiftly and securely. In this context, it is crucial for platform teams to identify and incorporate these tools, ensuring that organizational safeguards and expectations are upheld. By doing so, they can effectively bridge the gap between developers’ preferences and the essential governance required by the platform or central team.

Wrapping up

We’ve explored the various operating models and emerging trends designed to facilitate these models. Platform Engineering represents the ongoing evolution of DevOps, aiming to enhance the developer experience for rapid and secure deployments. It is crucial to recognize that developers possess varying skill sets and preferences, even within the same organization. As previously discussed, some developers prefer complete ownership of the entire stack, while others concentrate solely on writing code without concerning themselves with infrastructure. Consequently, the platform engineering practice must continuously adapt to accommodate these patterns in a manner that fosters enablement rather than posing as obstacles. To achieve this, the platform must be treated as a product, with developers as its customers, ensuring that their needs and preferences are prioritized and addressed effectively.

To determine where your organization fits within the discussed operational models, we encourage you to initiate a self-assessment and have internal discussions. Evaluate your current infrastructure provisioning, deployment processes, and development team support. Consider the benefits and challenges of each model and how they align with your organization’s specific needs, goals, and cultural willingness to shift.

To facilitate this process, gather key stakeholders from various teams, including leadership, platform engineering, development, and DevOps, for a collaborative workshop. During this workshop, review the four operational models (Centralized Provisioning, Platform-enabled Golden Path, Embedded DevOps, and Decentralized DevOps) and discuss the following:

How closely does each model align with your current organizational structure and processes?
What are the potential benefits and challenges of adopting or transitioning to each model within your organization?
What challenges are you currently facing with the model that you operate under?
How can technology be leveraged to optimize infrastructure creation and deployment automation?

By conducting this self-assessment and engaging in open dialogue, your organization can identify the most suitable operational model and develop a strategic plan to optimize collaboration, efficiency, and agility within your technology teams. If a more guided approach is preferred, reach out to our solutions architects and/or AWS partners to assist.

Introduction

Terraform Test

Basic test to validate resource creation

Create resource and validate resource name

Testing variable input validation

Orchestrating supporting resources

Terraform Tests in CI/CD Pipelines

Choosing various testing strategies

Conclusion

Authors

StackSets Architecture

Sample Hook Development and Deployment

Prerequisites

Hook Development

Hook Deployment using CloudFormation Stack Sets

Testing the deployed hook

Updating the hooks package

Cleanup

Conclusion

About the Author

Multicloud solution design paradigm

A single pane of glass

A common use case for a multicloud: disaster recovery

The multicloud solution

Layer 1 Construct

Layer 2 Construct

Layer 3 Construct

Solution Diagram

Next Steps

Conclusion

Operational Models

Centralized Provisioning

Platform-enabled Golden Path

Embedded DevOps

Decentralized DevOps

Emerging patterns

Looking ahead

Wrapping up

The collective thoughts of the interwebz