Tag Archives: Terraform

Accelerate Serverless Streamlit App Deployment with Terraform

2024-10-09 Kevon Mayers

Post Syndicated from Kevon Mayers original https://aws.amazon.com/blogs/devops/accelerate-serverless-streamlit-app-deployment-with-terraform/

Graphic created by Kevon Mayers.

Introduction

As customers increasingly seek to harness the power of generative AI (GenAI) and machine learning to deliver cutting-edge applications, the need for a flexible, intuitive, and scalable development platform has never been greater. In this landscape, Streamlit has emerged as a standout tool, making it easy for developers to prototype, build, and deploy GenAI-powered apps with minimal friction. It is an open-source Python framework designed to simplify the development of custom web applications for data science, machine learning, and GenAI projects. With Streamlit, developers can quickly transform Python scripts into interactive dashboards, LLM-powered chatbots, and web apps, using just a few lines of code. Its unique combination of simplicity, interactivity, and speed is the perfect complement to the rapid advancements in AI.

When deploying Streamlit applications, customers often face the challenge of ensuring their applications are highly available and can scale to meet variable demand. To achieve these goals, customers are looking at serverless approaches to deploying their Streamlit apps. With a serverless application, you pay only for the resources you use and do not have to worry about managing servers or capacity planning.

In this post, we will walk you through deploying containerized, serverless Streamlit applications automatically via HashiCorp Terraform, an Infrastructure as Code (IaC) tool that enables users to define and provision infrastructure across cloud platforms.

Solution Overview

For this solution, we have the Streamlit app running on an Amazon Elastic Container Service (ECS) cluster across multiple availability zones (AZs), using AWS Fargate to manage the compute. Fargate is a serverless, pay-as-you-go compute engine that lets you focus on building apps without managing servers. Using Fargate helps reduce the undifferentiated heavy lifting that can come with building and maintaining web applications. It is also often desirable to use a Content Delivery Network (CDN) to ensure low latency for users globally by caching the content at edge locations closer to where the users are geographically located.

Let’s zoom in on the two architectures – the Streamlit App hosting architecture, and the Streamlit App deployment pipeline.

Streamlit app hosting

In the above architecture, the following flow applies:

Users access the Streamlit App using the public DNS endpoint for an Amazon CloudFront distribution.
Using an Internet Gateway (IGW), user requests are routed to a public-facing Application Load Balancer (ALB).
This ALB has target groups which map to ECS task nodes that are part of an ECS cluster running in two AZs (us-east-1a and us-east-1b in this example).
Fargate will automatically scale the underlying compute nodes in the ECS cluster based on the demand.

Streamlit app deployment pipeline

In the above architecture, the following flow applies:

User develops a local Streamlit App and defines the path of these assets in the module configuration, then runs terraform apply to generate a local .zip file of the Streamlit App directory and upload it to an Amazon S3 bucket (Streamlit Assets) with versioning enabled. The bucket is configured to trigger the Streamlit CI/CD pipeline.
AWS CodePipeline (Streamlit CI/CD pipeline) begins running. The pipeline copies the .zip file from the Streamlit Assets S3 Bucket, stores the contents in a connected CodePipeline Artifacts S3 bucket, and passes the asset to the AWS CodeBuild project that is also part of the pipeline.
CodeBuild (Streamlit CodeBuild Project) configures a compute/build environment and fetches a Python Docker Image from a public Amazon ECR repository. CodeBuild uses Docker to build a new Streamlit App image based on what is defined in the Dockerfile within the .zip file. It tags the image with latest, an app_version (user-defined in Terraform), and the S3 Version ID of the .zip file, then pushes the image to a private ECR repository.
ECS has a task definition that references the image in ECR based on the S3 Version ID tag which will always be a unique value, as it is generated whenever a new version of the file is created. This also serves as data lineage so versions of the Streamlit App .zip files in S3 can be linked to versions of the image stored in ECR. Once a new image is pushed to ECR (with a unique image tag), the task definition is updated and the ECS service begins a new deployment using the new version of the Streamlit App.
When a new image is pushed to ECR, the Terraform Module is configured to use the local-exec provisioner to run an AWS CLI command that creates a CloudFront invalidation. This enables users of the Streamlit app to use the new version without waiting for the time-to-live (TTL) of the cached file to expire on the edge locations (default is 24 hours).
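
The tagging scheme in the flow above can be sketched in Go as follows. This is only an illustration of how the three tags relate to one build, not the module's actual build logic; the function name imageTags, the repository URI, and the version ID are hypothetical placeholders.

```go
package main

import "fmt"

// imageTags returns the three image references described in the pipeline flow:
// "latest", the user-defined app version, and the S3 version ID of the .zip file.
// repoURI is a hypothetical ECR repository URI used only for illustration.
func imageTags(repoURI, appVersion, s3VersionID string) []string {
	return []string{
		fmt.Sprintf("%s:latest", repoURI),
		fmt.Sprintf("%s:%s", repoURI, appVersion),
		fmt.Sprintf("%s:%s", repoURI, s3VersionID),
	}
}

func main() {
	tags := imageTags(
		"123456789012.dkr.ecr.us-east-1.amazonaws.com/streamlit-app", // placeholder repository
		"v1.1.0",
		"exampleVersionId", // placeholder S3 version ID
	)
	for _, t := range tags {
		fmt.Println(t)
	}
}
```

Because only the S3 version ID changes on every upload, it is the tag that lets a task definition point at exactly one build.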

Both of these pipelines are built and packaged into a Terraform module that can be reused efficiently with only a few lines of code.

Prerequisites

This solution requires the following prerequisites:

An AWS account. If you don’t have an account, you can sign up for one.
Terraform v1.0.0 or newer installed.
Python v3.8 or newer installed.
A Streamlit app. If you don’t have a Streamlit project already, you can download this app directory as a sample Streamlit app for this post and save it to a local folder.

Your folder structure will look something like this:

terraform_streamlit_folder
├── README.md
└── app                 # Streamlit app directory
    ├── home.py         # Streamlit app entry point
    ├── Dockerfile      # Dockerfile
    └── pages/          # Streamlit pages

Create and initialize a Terraform project

In the same folder where you saved your Streamlit app (terraform_streamlit_folder in the above example), you will create and initialize a new Terraform project.

In your preferred terminal, create a new file named main.tf by running the following command on Unix/Linux machines, or an equivalent command on Windows machines:
```
touch main.tf
```
Open up the main.tf file and add the following code to it:
```
module "serverless-streamlit-app" {
  source          = "aws-ia/serverless-streamlit-app/aws"
  app_name        = "streamlit-app"
  app_version     = "v1.1.0" 
  path_to_app_dir = "./app" # Replace with path to your app
}
```
This code utilizes a module block with a source pointing to the Terraform module, and the appropriate input variables passed in. When Terraform encounters a module block, it loads and processes that module’s configuration files using the source. The Serverless Streamlit App Terraform module has many optional input variables. If you have existing resources, such as an existing VPC, subnets, and security groups that you’d like to reuse instead of deploying new ones, you can use the module’s input variables to reference your existing resources. However, in this post, we’re deploying all of the resources in the above architecture from scratch. Here, we simply define the source that references the module hosted in the Terraform Registry, provide an app_name that will be used as a prefix for naming your resources, the app_version that is used for tracking changes to your app, and the path_to_app_dir which is the path to the local directory where the assets for your Streamlit app are stored.
Save the file.
To initialize the Terraform working directory, run the following command in your terminal:
```
terraform init
```
The output will contain a success message like the following:
```
"Terraform has been successfully initialized"
```

Output the CloudFront URL

To easily access the CloudFront URL of the deployed Streamlit application, you can add the URL as a Terraform output.

In your terminal, create a new file named outputs.tf by running the following command on Unix/Linux machines, or an equivalent command on Windows machines:
```
touch outputs.tf
```

Open up the outputs.tf file and add the following code to it:

output "streamlit_cloudfront_distribution_url" {
  value = module.serverless-streamlit-app.streamlit_cloudfront_distribution_url
}

Save the file.
Now, your folder structure will look like:

terraform_streamlit_folder
├── README.md
├── app                 # Streamlit app directory
│   ├── home.py         # Streamlit app entry point
│   ├── Dockerfile      # Dockerfile
│   └── pages/          # Streamlit pages
│     
├── main.tf             # Terraform Code (where you call the module) 
└── outputs.tf          # Outputs definition

Deploy the solution

Now you can use Terraform to deploy the resources defined in your main.tf file.

In your terminal, run the following command to deploy the infrastructure. This includes the hosting for your Streamlit application using ECS and CloudFront, as well as the pipeline that is used to push updates.
```
terraform apply
```
When the apply command finishes running, you’ll see the Terraform outputs displayed in the terminal.
Navigate to the streamlit_cloudfront_distribution_url to see your Streamlit application that is hosted on AWS.
When you make changes to your Streamlit codebase, you can go ahead and re-run terraform apply to push your new changes to your cloud environment.

When updating the Streamlit codebase, the CodePipeline and CodeBuild processes kick off to automatically update your new changes, which get reflected on your Streamlit application. CodePipeline automates the entire software release process, managing stages like source retrieval, building, testing, and deployment. It integrates with AWS services and third-party tools (such as GitHub and Jenkins) to enhance automation, speed, and security. CodeBuild focuses on automating code compilation, testing, and packaging, supporting multiple languages and custom Docker environments, while integrating with CodePipeline for scalable, secure builds. With this CI/CD pipeline, when you make changes to your code, all you need to run is terraform apply to update your cloud environment. For an example buildspec, see the example in the repo.

You can find full examples of deploying the infrastructure with and without existing resources in the GitHub repository.

Clean up

When you no longer need the resources deployed in this post, you can clean them up with the Terraform destroy command. Simply run terraform destroy. This will remove all of the resources you have deployed in this post with Terraform.

Conclusion

Building serverless Streamlit applications with Terraform on AWS offers a powerful combination of scalability, efficiency, and automation. As you continue to build and refine your Streamlit applications, Terraform’s flexibility ensures that your infrastructure can evolve seamlessly, supporting rapid innovation and agile development. With Streamlit and Terraform, you have the tools to create dynamic, serverless applications that scale effortlessly and operate reliably in the cloud.

Automatically generating Cloudflare’s Terraform provider

2024-09-24 Jacob Bednarz

Post Syndicated from Jacob Bednarz original https://blog.cloudflare.com/automatically-generating-cloudflares-terraform-provider

In November 2022, we announced the transition to OpenAPI Schemas for the Cloudflare API. Back then, we had an audacious goal to make the OpenAPI schemas the source of truth for our SDK ecosystem and reference documentation. During 2024’s Developer Week, we backed this up by announcing that our SDK libraries are now automatically generated from these OpenAPI schemas. Today, we’re excited to announce the latest pieces of the ecosystem to now be automatically generated — the Terraform provider and API reference documentation.

This means that the moment a new feature or attribute is added to our products and the team documents it, you’ll be able to see how it’s meant to be used across our SDK ecosystem and make use of it immediately. No more delays. No more lacking coverage of API endpoints.

You can find the new documentation site at https://developers.cloudflare.com/api-next/, and you can try the preview release candidate of the Terraform provider by installing 5.0.0-alpha1.

Why Terraform?

For anyone who is unfamiliar with Terraform, it is a tool for managing your infrastructure as code, much like you would with your application code. Many of our customers (big and small) rely on Terraform to orchestrate their infrastructure in a technology-agnostic way. Under the hood, it is essentially an HTTP client with lifecycle management built in, which means it makes use of our publicly documented APIs in a way that understands how to create, read, update and delete for the life of the resource.

Keeping Terraform updated — the old way

Historically, Cloudflare has manually maintained a Terraform provider, but since the provider internals require their own unique way of doing things, responsibility for maintenance and support has landed on the shoulders of a handful of individuals. The service teams always had difficulties keeping up with the number of changes, due to the amount of cognitive overhead required to ship a single change in the provider. In order for a team to get a change to the provider, it took a minimum of 3 pull requests (4 if you were adding support to cf-terraforming).

Even with the 4 pull requests completed, it didn’t offer guarantees on coverage of all available attributes, which meant small yet important details could be forgotten and not exposed to customers, causing frustration when trying to configure a resource.

To address this, our Terraform provider needed to be relying on the same OpenAPI schemas that the rest of our SDK ecosystem was already benefiting from.

Updating Terraform automatically

The thing that differentiates Terraform from our SDKs is that it manages the lifecycle of resources. With that comes a new range of problems related to known values and managing differences in the request and response payloads. Let’s compare the two different approaches of creating a new DNS record and fetching it back.

With our Go SDK:

// Create the new record
record, _ := client.DNS.Records.New(context.TODO(), dns.RecordNewParams{
	ZoneID: cloudflare.F("023e105f4ecef8ad9ca31a8372d0c353"),
	Record: dns.RecordParam{
		Name:    cloudflare.String("@"),
		Type:    cloudflare.String("CNAME"),
		Content: cloudflare.String("example.com"),
	},
})


// Wasteful fetch, but shows the point
client.DNS.Records.Get(
	context.Background(),
	record.ID,
	dns.RecordGetParams{
		ZoneID: cloudflare.String("023e105f4ecef8ad9ca31a8372d0c353"),
	},
)

And with Terraform:

resource "cloudflare_dns_record" "example" {
  zone_id = "023e105f4ecef8ad9ca31a8372d0c353"
  name    = "@"
  content = "example.com"
  type    = "CNAME"
}

On the surface, it looks like the Terraform approach is simpler, and you would be correct. The complexity of knowing how to create a new resource and maintain changes is handled for you. However, the problem is that for Terraform to offer this abstraction and data guarantee, all values must be known at apply time. That means that even if you’re not using the proxied value, Terraform needs to know what the value needs to be in order to save it in the state file and manage that attribute going forward. The error below is what Terraform operators commonly see from providers when the value isn’t known at apply time.

Error: Provider produced inconsistent result after apply

When applying changes to example_thing.foo, provider "provider[\"registry.terraform.io/example/example\"]"
produced an unexpected new value: .foo: was null, but now cty.StringVal("").

Whereas when using the SDKs, if you don’t need a field, you just omit it and never need to worry about maintaining known values.
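
The rule behind that error can be sketched in a few lines of Go. This is a deliberately simplified, hypothetical stand-in for the check Terraform performs, not its real implementation: a nil pointer stands in for a null value, and checkConsistent is a name invented for this example.

```go
package main

import "fmt"

// checkConsistent compares a planned attribute value with the value a provider
// returned after apply. A nil pointer stands in for a null (unset) value.
// Simplified illustration only; Terraform's own check covers more cases.
func checkConsistent(attr string, planned, actual *string) error {
	switch {
	case planned == nil && actual == nil:
		return nil
	case planned == nil:
		return fmt.Errorf(".%s: was null, but now %q", attr, *actual)
	case actual == nil:
		return fmt.Errorf(".%s: was %q, but now null", attr, *planned)
	case *planned != *actual:
		return fmt.Errorf(".%s: was %q, but now %q", attr, *planned, *actual)
	}
	return nil
}

func main() {
	empty := ""
	// The provider turned an unset value into an empty string: inconsistent.
	fmt.Println(checkConsistent("foo", nil, &empty))
	// Both sides agree that the value is unset: consistent.
	fmt.Println(checkConsistent("foo", nil, nil))
}
```

The first call is the situation in the error message above: the plan said null, and the provider came back with an empty string.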

Tackling this for our OpenAPI schemas was no small feat. Since introducing Terraform generation support, the quality of our schemas has improved by an order of magnitude. Now we are explicitly calling out all default values that are present, variable response properties based on the request payload, and any server-side computed attributes. All of this means a better experience for anyone that interacts with our APIs.

Making the jump from terraform-plugin-sdk to terraform-plugin-framework

To build a Terraform provider and expose resources or data sources to operators, you need two main things: a provider server and a provider.

The provider server takes care of exposing a gRPC server that Terraform core (via the CLI) uses to communicate when managing resources or reading data sources from the operator provided configuration.

The provider is responsible for wrapping the resources and data sources, communicating with the remote services, and managing the state file. To do this, you either rely on the terraform-plugin-sdk (commonly referred to as SDKv2) or terraform-plugin-framework, which includes all the interfaces and methods provided by Terraform in order to manage the internals correctly. The decision as to which plugin you use depends on the age of your provider. SDKv2 has been around longer and is what most Terraform providers use, but due to the age and complexity, it has many core unresolved issues that must remain in order to facilitate backwards compatibility for those who rely on it. terraform-plugin-framework is the new version that, while lacking the breadth of features SDKv2 has, provides a more Go-like approach to building providers and addresses many of the underlying bugs in SDKv2.

(For a deeper comparison between SDKv2 and the framework, you can check out a conversation between myself and John Bristowe from Octopus Deploy.)

The majority of the Cloudflare Terraform provider is built using SDKv2, but at the beginning of 2023, we took the plunge to multiplex and offer both in our provider. To understand why this was needed, we have to understand a little about SDKv2. The way SDKv2 is structured isn’t really conducive to representing null or “unset” values consistently and reliably. You can use the experimental ResourceData.GetRawConfig to check whether the value is set, null, or unknown in the config, but writing it back as null isn’t really supported.

This caveat first popped up for us when the Edge Rules Engine (Rulesets) started onboarding new services and those services needed to support API responses that contained booleans in an unset (or missing), true, or false state each with their own reasoning and purpose. While this isn’t a conventional API design at Cloudflare, it is a valid way to do things that we should be able to work with. However, as mentioned above, the SDKv2 provider couldn’t. This is because when a value isn’t present in the response or read into state, it gets a Go-compatible zero value for the default. This showed up as the inability to unset values after they had been written to state as false values (and vice versa).
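
The zero-value problem is easy to reproduce with plain Go and encoding/json, independent of any Terraform library. The sketch below uses hypothetical types (plainRule, pointerRule) and a made-up enabled field: a plain bool cannot tell a missing field from false, while a *bool keeps all three states.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// plainRule uses a plain bool, so a missing field decodes to the zero value false.
type plainRule struct {
	Enabled bool `json:"enabled"`
}

// pointerRule uses *bool, so a missing field stays nil (unset).
type pointerRule struct {
	Enabled *bool `json:"enabled"`
}

// decodePlain returns the bool decoded from payload.
func decodePlain(payload string) (bool, error) {
	var r plainRule
	err := json.Unmarshal([]byte(payload), &r)
	return r.Enabled, err
}

// decodePointer returns the *bool decoded from payload.
func decodePointer(payload string) (*bool, error) {
	var r pointerRule
	err := json.Unmarshal([]byte(payload), &r)
	return r.Enabled, err
}

// describe reports which of the three states a *bool is in.
func describe(b *bool) string {
	if b == nil {
		return "unset"
	}
	if *b {
		return "true"
	}
	return "false"
}

func main() {
	for _, payload := range []string{`{}`, `{"enabled":false}`, `{"enabled":true}`} {
		// Errors are ignored here because the payloads are fixed, valid JSON.
		p, _ := decodePlain(payload)
		q, _ := decodePointer(payload)
		fmt.Printf("%-18s plain=%v pointer=%s\n", payload, p, describe(q))
	}
}
```

With the plain bool, the first two payloads are indistinguishable after decoding, which mirrors the "cannot unset a value once it has been written as false" symptom described above.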

The only solution we have here to reliably use the three states of those boolean values is to migrate to the terraform-plugin-framework, which has the correct implementation of writing back unset values.

Once we started adding more functionality using terraform-plugin-framework in the old provider, it was clear that it was a better developer experience, so we added a ratchet to prevent SDKv2 usage going forward to get ahead of anyone unknowingly setting themselves up to hit this issue.

When we decided that we would be automatically generating the Terraform provider, it was only fitting that we also brought all the resources over to be based on the terraform-plugin-framework and leave the issues from SDKv2 behind for good. This did complicate the migration as with the improved internals came changes to major components like the schema and CRUD operations that we needed to familiarize ourselves with. However, it has been a worthwhile investment because by doing so, we’ve future-proofed the foundations of the provider and are now making fewer compromises on a great Terraform experience due to buggy, legacy internals.

Iteratively finding bugs

One of the common struggles with code generation pipelines is that unless you have existing tools that implement your new thing, it’s hard to know if it works or is reasonable to use. Sure, you can also generate your tests to exercise the new thing, but if there is a bug in the pipeline, you are very likely to not see it as a bug as you will be generating test assertions that show the bug is expected behavior.

One of the essential feedback loops we have had is the existing acceptance test suite. All resources within the existing provider had a mix of regression and functionality tests. Best of all, as the test suite is creating and managing real resources, it was very easy to know whether the outcome was a working implementation or not by looking at the HTTP traffic to see whether the API calls were accepted by the remote endpoints. Getting the test suite ported over was only a matter of copying over all the existing tests and checking for any type assertion differences (such as list to single nested list) before kicking off a test run to determine whether the resource was working correctly.

While the centralized schema pipeline was a huge quality of life improvement for having schema fixes propagate to the whole ecosystem almost instantly, it couldn’t help us solve the largest hurdle, which was surfacing bugs that hide other bugs. This was time-consuming because when fixing a problem in Terraform, you have three places where you can hit an error:

1. Before any API calls are made, Terraform implements logical schema validation and when it encounters validation errors, it will immediately halt.
2. If any API call fails, it will stop at the CRUD operation and return the diagnostics, immediately halting.
3. After the CRUD operation has run, Terraform then has checks in place to ensure all values are known.

That means that if we hit a bug at step 1 and fixed it, there was no guarantee or way to tell that we didn’t have two more waiting for us. Nor was there any guarantee that, after we found a bug in step 2 and shipped a fix, the next round of testing wouldn’t surface a new bug in step 1.

There is no silver bullet here. Our workaround was to notice patterns of problems in the schema behaviors and apply CI lint rules to the OpenAPI schemas before they got into the code generation pipeline. Taking this approach incrementally cut down the number of bugs in steps 1 and 2 until we were largely only dealing with the type in step 3.
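
The "bugs hiding other bugs" effect comes from each of the three places halting the run. A minimal Go sketch, assuming a made-up stage type and stage names that are only labels for this example, shows how fixing one bug per round reveals the next:

```go
package main

import (
	"errors"
	"fmt"
)

// stage is one of the places where an apply can halt.
type stage struct {
	name  string
	check func() error
}

// buildStages returns the three stages in order. Each flag simulates a latent
// bug in that stage; the names are labels for this sketch only.
func buildStages(schemaBug, apiBug, unknownBug bool) []stage {
	fail := func(bug bool, msg string) func() error {
		return func() error {
			if bug {
				return errors.New(msg)
			}
			return nil
		}
	}
	return []stage{
		{"schema validation", fail(schemaBug, "attribute failed validation")},
		{"API call", fail(apiBug, "remote endpoint rejected the request")},
		{"known-value check", fail(unknownBug, "value still unknown after apply")},
	}
}

// runApply stops at the first failing stage, so bugs in later stages stay
// hidden until every earlier stage passes.
func runApply(stages []stage) (string, error) {
	for _, s := range stages {
		if err := s.check(); err != nil {
			return s.name, err
		}
	}
	return "", nil
}

func main() {
	// Fix one bug per round and watch the next one surface.
	rounds := [][3]bool{
		{true, true, true},
		{false, true, true},
		{false, false, true},
		{false, false, false},
	}
	for i, r := range rounds {
		name, err := runApply(buildStages(r[0], r[1], r[2]))
		if err != nil {
			fmt.Printf("round %d: halted at %s: %v\n", i+1, name, err)
			continue
		}
		fmt.Printf("round %d: apply succeeded\n", i+1)
	}
}
```

Each round reports exactly one failure, which is why a single test run could never say how many bugs were left.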

A more reusable approach to model and struct conversion

Within Terraform provider CRUD operations, it is fairly common to see boilerplate like the following:

var plan ThingModel
diags := req.Plan.Get(ctx, &plan)
resp.Diagnostics.Append(diags...)
if resp.Diagnostics.HasError() {
	return
}

out, err := r.client.UpdateThingModel(ctx, client.ThingModelRequest{
	AttrA: plan.AttrA.ValueString(),
	AttrB: plan.AttrB.ValueString(),
	AttrC: plan.AttrC.ValueString(),
})
if err != nil {
	resp.Diagnostics.AddError(
		"Error updating project Thing",
		"Could not update Thing, unexpected error: "+err.Error(),
	)
	return
}

result := convertResponseToThingModel(out)
tflog.Info(ctx, "updated thing", map[string]interface{}{
	"attr_a": result.AttrA.ValueString(),
	"attr_b": result.AttrB.ValueString(),
	"attr_c": result.AttrC.ValueString(),
})

diags = resp.State.Set(ctx, result)
resp.Diagnostics.Append(diags...)
if resp.Diagnostics.HasError() {
	return
}

At a high level:

We fetch the proposed updates (known as a plan) using req.Plan.Get()
Perform the update API call with the new values
Manipulate the data from a Go type into a Terraform model (convertResponseToThingModel)
Set the state by calling resp.State.Set()

Initially, this doesn’t seem too problematic. However, the third step where we manipulate the Go type into the Terraform model quickly becomes cumbersome, error-prone, and complex because all of your resources need to do this in order to swap between the type and associated Terraform models.

To avoid generating more complex code than needed, one of the improvements featured in our provider is that all CRUD methods use unified apijson.Marshal, apijson.Unmarshal, and apijson.UnmarshalComputed methods that solve this problem by centralizing the conversion and handling logic based on the struct tags.

var data *ThingModel

resp.Diagnostics.Append(req.Plan.Get(ctx, &data)...)
if resp.Diagnostics.HasError() {
	return
}

dataBytes, err := apijson.Marshal(data)
if err != nil {
	resp.Diagnostics.AddError("failed to serialize http request", err.Error())
	return
}
res := new(http.Response)
env := ThingResultEnvelope{*data}
_, err = r.client.Thing.Update(
	// ...
)
if err != nil {
	resp.Diagnostics.AddError("failed to make http request", err.Error())
	return
}

bytes, _ := io.ReadAll(res.Body)
err = apijson.UnmarshalComputed(bytes, &env)
if err != nil {
	resp.Diagnostics.AddError("failed to deserialize http request", err.Error())
	return
}
data = &env.Result

resp.Diagnostics.Append(resp.State.Set(ctx, &data)...)

Instead of needing to generate hundreds of instances of type-to-model converter methods, we can instead decorate the Terraform model with the correct tags and handle marshaling and unmarshaling of the data consistently. It’s a minor change to the code that in the long run makes the generation more reusable and readable. As an added benefit, this approach is great for bug fixing as once you identify a bug with a particular type of field, fixing that in the unified interface fixes it for other occurrences you may not yet have found.
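
To make the idea of tag-driven conversion concrete, here is a small Go sketch that uses only the standard reflect package. It is not the provider's apijson implementation: the api tag name, the toPayload helper, and the fields on this ThingModel are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"reflect"
)

// ThingModel is a hypothetical model. The "api" tag name is an assumption
// made for this sketch, not the tag used by the real provider.
type ThingModel struct {
	AttrA string `api:"attr_a"`
	AttrB string `api:"attr_b"`
	Note  string // no tag: skipped by toPayload
}

// toPayload converts any struct (or pointer to a struct) into a map keyed by
// its "api" tags, so no per-type converter function is needed.
func toPayload(v interface{}) (map[string]interface{}, error) {
	rv := reflect.ValueOf(v)
	if rv.Kind() == reflect.Ptr {
		rv = rv.Elem()
	}
	if rv.Kind() != reflect.Struct {
		return nil, fmt.Errorf("toPayload: expected struct, got %s", rv.Kind())
	}
	out := make(map[string]interface{})
	rt := rv.Type()
	for i := 0; i < rt.NumField(); i++ {
		field := rt.Field(i)
		name, ok := field.Tag.Lookup("api")
		if !ok || !field.IsExported() {
			continue // untagged or unexported fields are not part of the payload
		}
		out[name] = rv.Field(i).Interface()
	}
	return out, nil
}

func main() {
	payload, err := toPayload(&ThingModel{AttrA: "a", AttrB: "b", Note: "local only"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(payload)
}
```

Because the conversion lives in one function, a fix for how a given kind of field is handled applies to every model that uses the tag.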

But wait, there’s more (docs)!

To top off our OpenAPI schema usage, we’re tightening the SDK integration with our new API documentation site. It’s using the same pipeline we’ve invested in for the last two years while addressing some of the common usage issues.

SDK aware

If you’ve used our API documentation site, you know we give you examples of interacting with the API using command line tools like curl. This is a great starting point, but if you’re using one of the SDK libraries, you need to do the mental gymnastics to convert it to the method or type definition you want to use. Now that we’re using the same pipeline to generate the SDKs and the documentation, we’re solving that by providing examples in all the libraries you could use — not just curl.

Example using cURL to fetch all zones.

Example using the TypeScript library to fetch all zones.

Example using the Python library to fetch all zones.

Example using the Go library to fetch all zones.

With this improvement, we also remember the language selection, so if you’ve chosen to view the documentation using our TypeScript library and keep clicking around, we keep showing you examples using TypeScript until you switch to another language.

Best of all, when we introduce new attributes to existing endpoints or add SDK languages, this documentation site is automatically kept in sync with the pipeline. It is no longer a huge effort to keep it all up to date.

Faster and more efficient rendering

A problem we’ve always struggled with is the sheer number of API endpoints and how to represent them. As of this post, we have 1,330 endpoints, and for each of those endpoints, we have a request payload, a response payload, and multiple types associated with it. When it comes to rendering this much information, the solutions we’ve used in the past have had to make tradeoffs in order to make parts of the representation work.

This next iteration of the API documentation site addresses this in a couple of ways:

It’s implemented as a modern React application that pairs an interactive client-side experience with static pre-rendered content, resulting in a quick initial load and fast navigation. (Yes, it even works without JavaScript enabled!)
It fetches the underlying data incrementally as you navigate.

By solving this foundational issue, we’ve unlocked other planned improvements to the documentation site and SDK ecosystem to improve the user experience without making tradeoffs like we’ve needed to in the past.

Permissions

One of the most requested features to be re-implemented into the documentation site has been minimum required permissions for API endpoints. One of the previous iterations of the documentation site had this available. However, unknown to most who used it, the values were manually maintained and were regularly incorrect, causing support tickets to be raised and frustration for users.

Inside Cloudflare’s identity and access management system, answering the question “what do I need to access this endpoint” isn’t simple. In the normal flow of a request to the control plane, two different systems each provide part of the answer, and those parts must be combined to give you the full answer. As we couldn’t initially automate this as part of the OpenAPI pipeline, we opted to leave it out instead of having it be incorrect with no way of verifying it.

Fast-forward to today, and we’re excited to say endpoint permissions are back! We built some new tooling that abstracts answering this question in a way that we can integrate into our code generation pipeline and have all endpoints automatically get this information. Much like the rest of the code generation platform, it is focused on having service teams own and maintain high quality schemas that can be reused with value adds introduced without any work on their behalf.

Stop waiting for updates

With these announcements, we’re putting an end to waiting for updates to land in the SDK ecosystem. These new improvements allow us to streamline the availability of new attributes and endpoints the moment teams document them. So what are you waiting for? Check out the Terraform provider and API documentation site today.

Quickly adopt new AWS features with the Terraform AWS Cloud Control provider

2024-05-30 Welly Siauw

Post Syndicated from Welly Siauw original https://aws.amazon.com/blogs/devops/quickly-adopt-new-aws-features-with-the-terraform-aws-cloud-control-provider/

Introduction

Today, we are pleased to announce the general availability of the Terraform AWS Cloud Control (AWS CC) Provider, enabling our customers to take advantage of AWS innovations faster. AWS has been continually expanding its services to support virtually any cloud workload, supporting over 200 fully featured services and delighting customers through its rapid pace of innovation with over 3,400 significant new features in 2023. Our customers use Infrastructure as Code (IaC) tools such as HashiCorp Terraform, among others, as a best practice to provision and manage these AWS features and services as part of their cloud infrastructure at scale. With the Terraform AWS CC Provider launch, AWS customers using Terraform as their IaC tool can now benefit from faster time-to-market by building cloud infrastructure with the latest AWS innovations that are typically available on the Terraform AWS CC Provider on the day of launch. For example, AWS customer Meta’s Oculus Studios was able to quickly leverage Amazon GameLift to support their game development. “AWS and Hashicorp have been great partners in helping Oculus Studios standardize how we deploy our GameLift infrastructure using industry best practices,” said Mick Afaneh, Meta’s Oculus Studios Central Technology.

The Terraform AWS CC Provider leverages AWS Cloud Control API to automatically generate support for hundreds of AWS resource types, such as Amazon EC2 instances and Amazon S3 buckets. Since the AWS CC provider is automatically generated, new features and services on AWS can be supported as soon as they are available on AWS Cloud Control API, addressing any coverage gaps in the existing Terraform AWS standard provider. This automated process allows the AWS CC provider to deliver new resources faster because it does not have to wait for the community to author schema and resource implementations for each new service. Today, the AWS CC provider supports 950+ AWS resources and data sources, with more support being added as AWS service teams continue to adopt the Cloud Control API standard.

As a Terraform practitioner, you should find that using the AWS CC Provider fits naturally into your existing workflow. You can use the configuration blocks shown below, specifying your preferred region.

terraform {
  required_providers {
    awscc = {
      source  = "hashicorp/awscc"
      version = "~> 1.0"
    }
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "awscc" {
  region = "us-east-1"
}

provider "aws" {
  region = "us-east-1"
}

During Terraform plan or apply, the AWS CC Terraform provider interacts with AWS Cloud Control API to provision the resources by calling its consistent Create, Read, Update, Delete, or List (CRUD-L) APIs.
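As a minimal sketch of what a resource managed through those CRUD-L calls can look like, the following declares a CloudWatch Logs log group with the AWS CC provider. The resource choice, the log group name, and the tag values are illustrative assumptions, not part of the original announcement; attribute names follow the CloudFormation schema converted to snake case.

```hcl
# Illustrative only: any Cloud Control-backed resource type follows the same pattern.
resource "awscc_logs_log_group" "example" {
  log_group_name    = "example-awscc-log-group" # placeholder name
  retention_in_days = 7

  # The AWS CC provider models tags as a list of key/value objects,
  # mirroring the CloudFormation resource schema.
  tags = [
    {
      key   = "Environment"
      value = "demo"
    }
  ]
}
```

On terraform apply, the provider would issue a Cloud Control API create request for the AWS::Logs::LogGroup type rather than calling the CloudWatch Logs service API directly.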

AWS Cloud Control API

AWS service teams own, publish, and maintain resources on the AWS CloudFormation Registry using a standardized resource model. This resource model uses uniform JSON schemas and provisioning logic that codifies the expected behavior and error handling associated with CRUD-L operations. This resource model enables AWS service teams to expose their service features in an easily discoverable, intuitive, and uniform format with standardized behavior. Launched in September 2021, AWS Cloud Control API exposes these resources through a set of five consistent CRUD-L operations without any additional work from service teams. Using Cloud Control API, developers can manage the lifecycle of hundreds of AWS and third-party resources with a consistent resource-oriented API instead of using distinct service-specific APIs. Furthermore, Cloud Control API is up to date with the latest AWS resources as soon as they are available on the CloudFormation Registry, typically on the day of launch. You can read more about the launch day requirement for Cloud Control API in this blog post. This enables AWS Partners such as HashiCorp to take advantage of consistent CRUD-L API operations and integrate Terraform with Cloud Control API just once, and then automatically access new AWS resources without additional integration work.

History and Evolution of the Terraform AWS CC Provider

The general availability of the Terraform AWS CC Provider is the culmination of 4+ years of collaboration between AWS and HashiCorp. Our teams partnered across the Product, Engineering, Partner, and Customer Support functions in influencing, shaping, and defining the customer experience leading up to the technical preview announcement of the AWS CC provider in September 2021. At technical preview, the provider supported more than 300 resources. Since then, we have added 600+ resources to the provider, bringing the total to 950+ supported resources at general availability.

Beyond increasing resource coverage, we gathered additional signals from customer feedback during the technical preview and have rolled out several improvements since September 2021. Customers care deeply about the user experience of the providers available on the Terraform registry. Customers sought practical examples, in the form of sample HCL configurations for each resource, that they could test immediately in order to start using the provider with confidence. This prompted us to enrich the AWS CC provider with hundreds of practical examples for popular AWS CC provider resources in the Terraform registry. This was made possible by the contributions of hundreds of Amazonians who became early adopters of the AWS CC provider. We also published a how-to guide for anyone interested in contributing to AWS CC provider examples. Furthermore, customers wanted to minimize context switching between Terraform and AWS service documentation to learn what each attribute of a resource signifies and what type of values it needs as part of the configuration. This prompted us to prioritize augmenting the provider with rich resource attribute descriptions taken from AWS documentation. The documentation provides detailed information on how to use the attributes, enumerations of the accepted attribute values, and other relevant information for dozens of commonly used AWS resources.

We also worked with HashiCorp on various bug fixes and feature enhancements for the AWS CC provider, as well as the upstream Cloud Control API dependencies. We improved handling for resources with complex nested attribute schemas, implemented various bug fixes to resolve unintended resource replacement, and refined provider behavior under various conditions to support the idempotency expected by Terraform practitioners. While this is not an exhaustive list of improvements, we continue to listen to customer feedback and iterate on improving the experience. We encourage you to try out the provider and share feedback on the AWS CC provider’s GitHub page.

Using the AWS CC Provider

Let’s take an example of a recently introduced service, Amazon Q Business, a fully managed, generative AI-powered assistant that you can configure to answer questions, provide summaries, generate content, and complete tasks based on your enterprise data. Amazon Q Business resources were available in AWS CC provider shortly after the April 30th 2024 launch announcement. In the following example, we’ll create a demo Amazon Q Business application and deploy the web experience.

data "aws_caller_identity" "current" {}

data "aws_ssoadmin_instances" "example" {}

resource "awscc_qbusiness_application" "example" {
  description                  = "Example QBusiness Application"
  display_name                 = "Demo_QBusiness_App"
  attachments_configuration    = {
    attachments_control_mode = "ENABLED"
  }
  identity_center_instance_arn = data.aws_ssoadmin_instances.example.arns[0]
}

resource "awscc_qbusiness_web_experience" "example" {
  application_id              = awscc_qbusiness_application.example.id
  role_arn                    = awscc_iam_role.example.arn
  subtitle                    = "Drop a file and ask questions"
  title                       = "Demo Amazon Q Business"
  welcome_message             = "Welcome, please enter your questions"
}

resource "awscc_iam_role" "example" {
  role_name   = "Amazon-QBusiness-WebExperience-Role"
  description = "Grants permissions to AWS Services and Resources used or managed by Amazon Q Business"
  assume_role_policy_document = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "QBusinessTrustPolicy"
        Effect = "Allow"
        Principal = {
          Service = "application.qbusiness.amazonaws.com"
        }
        Action = [
          "sts:AssumeRole",
          "sts:SetContext"
        ]
        Condition = {
          StringEquals = {
            "aws:SourceAccount" = data.aws_caller_identity.current.account_id
          }
          ArnEquals = {
            "aws:SourceArn" = awscc_qbusiness_application.example.application_arn
          }
        }
      }
    ]
  })
  policies = [{
    policy_name = "qbusiness_policy"
    policy_document = jsonencode({
      Version = "2012-10-17"
      Statement = [
        {
          Sid = "QBusinessConversationPermission"
          Effect = "Allow"
          Action = [
            "qbusiness:Chat",
            "qbusiness:ChatSync",
            "qbusiness:ListMessages",
            "qbusiness:ListConversations",
            "qbusiness:DeleteConversation",
            "qbusiness:PutFeedback",
            "qbusiness:GetWebExperience",
            "qbusiness:GetApplication",
            "qbusiness:ListPlugins",
            "qbusiness:GetChatControlsConfiguration"
          ]
          Resource = awscc_qbusiness_application.example.application_arn
        }
      ]
    })
  }]
}

As you see in this example, you can use both the AWS and AWS CC providers in the same configuration file. This allows you to easily incorporate new resources available in the AWS CC provider into your existing configuration with minimal changes. The AWS CC provider also accepts the same authentication method and provider-level features available in the AWS provider. This means you don’t have to add additional configuration in your CI/CD pipeline to start using the AWS CC provider. In addition, you can also add custom agent information inside the provider block as described in this documentation.
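The following is a hedged sketch of an AWS CC provider block that reuses a named profile and adds custom user agent information. The profile name, product name, and version are hypothetical placeholders; consult the provider documentation for the authoritative schema of `user_agent`.

```hcl
provider "awscc" {
  region  = "us-east-1"
  profile = "my-profile" # hypothetical named profile, shared with the AWS provider

  # Custom user agent information appended to the provider's API requests.
  user_agent = [
    {
      product_name    = "example-pipeline" # hypothetical product name
      product_version = "0.0.1"
      comment         = "custom user agent for a demo pipeline"
    }
  ]
}
```

If credentials are supplied through environment variables in your CI/CD pipeline, the `profile` argument can be omitted and both providers pick up the same credentials.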

Things to know

The AWS CC provider is unique due to how it was developed and its dependencies with Cloud Control API and AWS resource model in the CloudFormation registry. As such, there are things that you should know before you start using the AWS CC provider.

The AWS CC provider is generated from the latest CloudFormation schemas and is released weekly, with each release containing all new AWS services and enhancements added to Cloud Control API.
Certain resources available in the CloudFormation schema are not compatible with the AWS CC provider due to nuances in the schema implementation. You can find them on the GitHub issue list here. We are actively working to add these resources to the AWS CC provider.
The AWS CC provider requires Terraform CLI version 1.0.7 or higher.
Every AWS CC provider resource includes a top-level attribute `id` that acts as the resource identifier. If the CloudFormation resource schema also has a similarly named top-level attribute `id`, then that property is mapped to a new attribute named `<type>_id`, for example `web_experience_id` for the `awscc_qbusiness_web_experience` resource.
If a resource attribute is not defined in the Terraform configuration, the AWS CC provider will honor the default values specified in the CloudFormation resource schema. If the resource schema does not include a default value, the AWS CC provider will use the attribute value stored in the Terraform state (taken from Cloud Control API GetResponse after the resource was created).
As a consequence of the default value behavior described above, when an attribute value is removed from the Terraform configuration (e.g. by commenting out the attribute), the AWS CC provider will use the previous attribute value stored in the Terraform state. As such, no drift will be detected on the resource configuration when you run Terraform plan / apply.
The AWS CC provider data sources are either plural or singular with filters based on `id` attribute. Currently there is no native support for metadata sources such as `aws_region` or `aws_caller_identity`. You can continue to leverage the AWS provider data sources to complement your Terraform configuration.
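As a small illustration of two of the points above, the sketch below reads the mapped `web_experience_id` attribute and complements the configuration with a metadata data source from the AWS provider. It assumes it is appended to the Amazon Q Business configuration shown earlier, which already declares `aws_caller_identity.current`; the output names are hypothetical.

```hcl
# Metadata data sources still come from the AWS provider.
data "aws_region" "current" {}

# `id` is the provider-level identifier; the schema's own identifier
# for the web experience is exposed as `web_experience_id`.
output "web_experience_id" {
  value = awscc_qbusiness_web_experience.example.web_experience_id
}

# Combines values from two AWS provider data sources.
# The `name` attribute applies to the AWS provider 5.x series pinned earlier.
output "deployment_context" {
  value = "${data.aws_caller_identity.current.account_id}/${data.aws_region.current.name}"
}
```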

If you want to dive deeper into AWS CC provider resource behavior, we encourage you to check the documentation here.

Conclusion

The AWS CC provider is now generally available and will be the fastest way for customers to access newly launched AWS features and services using Terraform. We will continue to add support for more resources, add examples, and enrich the schema descriptions. You can start using the AWS CC provider alongside your existing AWS standard provider. To learn more about the AWS CC provider, please check the HashiCorp announcement blog post. You can also follow the workshop on how to get started with the AWS CC provider. If you are interested in contributing practical examples for AWS CC provider resources, check out the how-to guide. If you have more questions or run into any issues with the new provider, don’t hesitate to submit an issue in the AWS CC provider GitHub repository.

Automate Terraform Deployments with Amazon CodeCatalyst and Terraform Community action

2024-05-02 Vineeth Nair

Post Syndicated from Vineeth Nair original https://aws.amazon.com/blogs/devops/automate-terraform-deployments-with-amazon-codecatalyst-and-terraform-community-action/

Amazon CodeCatalyst integrates continuous integration and deployment (CI/CD) by bringing key development tools together on one platform. With the entire application lifecycle managed in one tool, CodeCatalyst empowers rapid, dependable software delivery. CodeCatalyst offers a range of actions. An action is the main building block of a workflow and defines a logical unit of work to perform during a workflow run. Typically, a workflow includes multiple actions that run sequentially or in parallel depending on how you’ve configured them.

Introduction

Infrastructure as code (IaC) has become a best practice for managing IT infrastructure. IaC uses code to provision and manage your infrastructure in a consistent, programmatic way. Terraform by HashiCorp is one of the most common tools for IaC.

With Terraform, you define the desired end state of your infrastructure resources in declarative configuration files. Terraform determines the necessary steps to reach the desired state and provisions the infrastructure automatically. This removes the need for manual processes while enabling version control, collaboration, and reproducibility across your infrastructure.

In this blog post, we will demonstrate using the “Terraform Community Edition” action in CodeCatalyst to create resources in an AWS account.

Amazon CodeCatalyst workflow overview
Figure 1: Amazon CodeCatalyst Action

Prerequisites

To follow along with the post, you will need the following items:

An AWS Builder ID for signing in to CodeCatalyst
A CodeCatalyst space
The Space administrator role assigned to you in your CodeCatalyst space
An AWS account associated with your space, along with an associated IAM role
A CodeCatalyst project with a source repository
A CodeCatalyst environment configured with a connection to your target AWS account
An Amazon S3 bucket to store the Terraform remote state file
An Amazon DynamoDB table to manage locking of the state file during Terraform operations
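If you do not yet have the last two prerequisites, the following sketch shows one way to create them with Terraform in a separate, one-time bootstrap configuration. The bucket and table names are placeholders chosen to match the workflow example later in this post; S3 bucket names must be globally unique, so substitute your own.

```hcl
# Bootstrap configuration, applied once and kept separate from the walkthrough code.
resource "aws_s3_bucket" "tfstate" {
  bucket = "tfstate-bucket" # placeholder; must be globally unique
}

# Versioning lets you recover earlier revisions of the state file.
resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_dynamodb_table" "tfstate_lock" {
  name         = "tfstate-table"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # the S3 backend expects a string key with this exact name

  attribute {
    name = "LockID"
    type = "S"
  }
}
```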

Walkthrough

In this walkthrough we create an Amazon S3 bucket using the Terraform Community Edition action in Amazon CodeCatalyst. The action will execute the Terraform commands needed to apply your configuration. You configure the action with a specified Terraform version. When the action runs it uses that Terraform version to deploy your Terraform templates, provisioning the defined infrastructure. This action will run terraform init to initialize the working directory, terraform plan to preview changes, and terraform apply to create the Amazon S3 bucket based on the Terraform configuration in a target AWS Account. At the end of the post your workflow will look like the following:

Figure 2: Amazon CodeCatalyst Workflow with Terraform Community Action

Create the base workflow

To begin, we create a workflow that will execute our Terraform code. In the CodeCatalyst project, click on CI/CD in the left pane and select Workflows. In the Workflows pane, click on Create Workflow.

Figure 3: Creating Amazon CodeCatalyst Workflow

We use an existing repository, my-sample-terraform-repository, as the source repository.

Figure 4: Creating Workflow from source repository

Once the source repository is selected, select Branch as main and click Create. You will have an empty workflow. You can edit the workflow from within the CodeCatalyst console. Click on the Commit button to create an initial commit:

Figure 5: Initial Workflow commit

On the Commit Workflow dialogue, add a commit message, and click on Commit. Ignore any validation errors at this stage:

Figure 6: Completing Initial Commit for Workflow

Connect to CodeCatalyst Dev Environment

For this post, we will use an AWS Cloud9 Dev Environment to edit our workflow. Your first step is to connect to the dev environment. Select Code → Dev Environments.

Figure 7: Navigate to CodeCatalyst Dev Environments

If you do not already have a Dev Environment you can create an instance by selecting the Create Dev Environment dropdown and selecting AWS Cloud9 (in browser). Leave the options as default and click on Create to provision a new Dev Environment.

Figure 8: Create CodeCatalyst Dev Environment

Once the Dev Environment has been provisioned, you are redirected to a Cloud9 instance in the browser. The Dev Environment automatically clones the existing repository for the Terraform project code. We first create a main.tf file in the root of the repository with the Terraform code for creating an Amazon S3 bucket. To do this, we right-click on the repository folder in the tree-pane view on the left side of the Cloud9 console window and select New File.

Figure 9: Creating a new file in Cloud9

We are presented with a new file, which we name main.tf. This file will store the Terraform code. We then edit main.tf by right-clicking on the file and selecting Open. We insert the code below into main.tf. The code has a Terraform resource block that creates an Amazon S3 bucket. The configuration also uses Terraform AWS data sources to obtain the AWS Region and AWS account ID, which form part of the bucket name. Finally, we use a backend block to configure Terraform to store its state data in an Amazon S3 bucket. To save our changes we select File -> Save.

Figure 10: Adding Terraform Code
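Because the configuration in Figure 10 is shown only as an image, the following is a minimal sketch of what such a main.tf might look like, not the exact file from the figure. The resource label `sample` is a hypothetical name, and the empty backend block assumes that the workflow action supplies the state bucket, key, and lock table at init time; if it does not in your setup, set `bucket`, `key`, `region`, and `dynamodb_table` explicitly.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  # Partial backend configuration (assumption): remaining settings
  # are expected to be provided when terraform init runs.
  backend "s3" {}
}

# Data sources used to build a bucket name that is unique per account and Region.
data "aws_caller_identity" "current" {}
data "aws_region" "current" {}

resource "aws_s3_bucket" "sample" {
  bucket = "${data.aws_caller_identity.current.account_id}-${data.aws_region.current.name}-terraform-sample-bucket"
}
```

The bucket name pattern matches the one referenced in the cleanup section of this post.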

Now let’s start creating the Terraform workflow using the Amazon CodeCatalyst Terraform Community action. Within your repository, go to the .codecatalyst/workflows directory and open the <workflowname.yaml> file.

Figure 11: Creating CodeCatalyst Workflow

The below code snippet is an example workflow definition with terraform plan and terraform apply. We will enter this into our workflow file, with the relevant configuration settings for our environment.

The workflow does the following:

When a change is pushed to the main branch, a new workflow execution is triggered. This workflow carries out a Terraform plan and a subsequent apply operation.

Name: terraform-action-workflow
Compute:
  Type: EC2
  Fleet: Linux.x86-64.Large
SchemaVersion: "1.0"
Triggers:
  - Type: Push
    Branches:
      -  main
Actions: 
  PlanTerraform:
    Identifier: codecatalyst-labs/provision-with-terraform-community@v1
    Environment:
      Name: dev 
      Connections:
        - Name: codecatalyst
          Role: CodeCatalystWorkflowDevelopmentRole # The IAM role to be used
    Inputs:
      Sources:
        - WorkflowSource
    Outputs:
      Artifacts:
        - Name: tfplan # generates a tfplan output artifact
          Files:
            - tfplan.out
    Configuration:
      AWSRegion: eu-west-2
      StateBucket: tfstate-bucket # The Terraform state S3 Bucket
      StateKey: terraform.tfstate # The Terraform state file
      StateKeyPrefix: states/ # The path to the state file (optional)
      StateTable: tfstate-table # The DynamoDB table used for state locking
      TerraformVersion: '1.5.1' # The Terraform version to be used
      TerraformOperationMode: plan # The Terraform operation- can be plan or apply
  ApplyTerraform:
    Identifier: codecatalyst-labs/provision-with-terraform-community@v1
    DependsOn:
      - PlanTerraform
    Environment:
      Name: dev 
      Connections:
        - Name: codecatalyst
          Role: CodeCatalystWorkflowDevelopmentRole
    Inputs:
      Sources:
        - WorkflowSource
      Artifacts:
        - tfplan
    Configuration:
      AWSRegion: eu-west-2
      StateBucket: tfstate-bucket
      StateKey: terraform.tfstate
      StateKeyPrefix: states/
      StateTable: tfstate-table
      TerraformVersion: '1.5.1'
      TerraformOperationMode: apply

Key configuration parameters are:
- Environment.Name: The name of our CodeCatalyst Environment
- Environment.Connections.Name: The name of the CodeCatalyst connection
- Environment.Connections.Role: The IAM role used for the workflow
- AWSRegion: The AWS region that hosts the Terraform state bucket
- Identifier: The action identifier, codecatalyst-labs/provision-with-terraform-community@v1
- StateBucket: The Terraform state bucket
- StateKey: The Terraform statefile e.g. terraform.tfstate
- StateKeyPrefix: The folder location of the State file (optional)
- StateTable: The DynamoDB State table
- TerraformVersion: The version of Terraform to be installed
- TerraformOperationMode: The operation mode for Terraform – this can be either ‘plan’ or ‘apply’

The workflow now contains CodeCatalyst action for Terraform Plan and Terraform Apply.

To save our changes we select File -> Save. We can then commit them to our git repository by typing the following at the terminal:

git add . && git commit -m 'adding terraform workflow and main.tf' && git push

The above command adds the workflow file and Terraform code to be tracked by git. It then commits the code and pushes the changes to the CodeCatalyst git repository. As we have a branch trigger defined for main, this will trigger a run of the workflow. We can monitor the status of the workflow in the CodeCatalyst console by selecting CI/CD -> Workflows. Locate your workflow and click on Runs to view the status. You will be able to observe that the workflow has completed successfully and the Amazon S3 bucket has been created.

Figure 12: CodeCatalyst Workflow Status

Cleaning up

If you have been following along with this workflow, you should delete the resources that you have deployed to avoid further charges. The walkthrough creates an Amazon S3 bucket named <your-aws-account-id>-<your-aws-region>-terraform-sample-bucket in your AWS account. In the AWS Console > S3, locate the bucket that was created, select it, and click Delete to remove the bucket.

Conclusion

In this post, we explained how you can easily get started deploying IaC to your AWS accounts with Amazon CodeCatalyst. We outlined how the Terraform Community Edition action can streamline the process of planning and applying Terraform configurations and how to create a workflow that can leverage this action. Get started with Amazon CodeCatalyst today.

Terraform CI/CD and testing on AWS with the new Terraform Test Framework

2024-04-03 Kevon Mayers

Post Syndicated from Kevon Mayers original https://aws.amazon.com/blogs/devops/terraform-ci-cd-and-testing-on-aws-with-the-new-terraform-test-framework/

Graphic created by Kevon Mayers

Introduction

Organizations often use Terraform Modules to orchestrate complex resource provisioning and provide a simple interface for developers to enter the required parameters to deploy the desired infrastructure. Modules enable code reuse and provide a method for organizations to standardize deployment of common workloads such as a three-tier web application, a cloud networking environment, or a data analytics pipeline. When building Terraform modules, it is common for the module author to start with manual testing. Manual testing is performed using commands such as terraform validate for syntax validation, terraform plan to preview the execution plan, and terraform apply followed by manual inspection of resource configuration in the AWS Management Console. Manual testing is prone to human error, not scalable, and can result in unintended issues. Because modules are used by multiple teams in the organization, it is important to ensure that any changes to the modules are extensively tested before the release. In this blog post, we will show you how to validate Terraform modules and how to automate the process using a Continuous Integration/Continuous Deployment (CI/CD) pipeline.

Terraform Test

Terraform test is a new testing framework for module authors to perform unit and integration tests for Terraform modules. Terraform test can create infrastructure as declared in the module, run validation against the infrastructure, and destroy the test resources regardless of whether the test passes or fails. Terraform test will also provide warnings if there are any resources that cannot be destroyed. Terraform test uses the same HashiCorp Configuration Language (HCL) syntax used to write Terraform modules. This reduces the burden for module authors to learn other tools or programming languages. Module authors run the tests using the command terraform test, which is available in Terraform CLI version 1.6 or higher.

Module authors create test files with the extension *.tftest.hcl. These test files are placed in the root of the Terraform module or in a dedicated tests directory. The following elements are typically present in a Terraform tests file:

Provider block: optional, used to override the provider configuration, such as selecting AWS region where the tests run.
Variables block: the input variables passed into the module during the test, used to supply non-default values or to override default values for variables.
Run block: used to run a specific test scenario. There can be multiple run blocks per test file, and Terraform executes run blocks in order. In each run block you specify the Terraform command (plan or apply) and the test assertions. Module authors can specify conditions such as: length(var.items) != 0. A full list of condition expressions can be found in the HashiCorp documentation.

Terraform tests are performed in sequential order and at the end of the Terraform test execution, any failed assertions are displayed.
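The three elements above can be sketched together in a single test file. This is an illustrative skeleton only: the file name, the run block name, and the `items` variable are hypothetical, and the module under test is assumed to declare a list variable named `items`.

```hcl
# tests/anatomy.tftest.hcl (hypothetical file name)

# Provider block: optional, here used to pin the Region where the tests run.
provider "aws" {
  region = "us-east-1"
}

# Variables block: input values passed to the module for every run block.
variables {
  items = ["a", "b"]
}

# Run block: one test scenario with its command and assertions.
run "items_not_empty" {
  command = plan

  assert {
    condition     = length(var.items) != 0
    error_message = "Expected at least one item."
  }
}
```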

Basic test to validate resource creation

Now that we understand the basic anatomy of a Terraform tests file, let’s create basic tests to validate the functionality of the following Terraform configuration. This Terraform configuration will create an AWS CodeCommit repository whose name is prefixed with repo-.

# main.tf

variable "repository_name" {
  type = string
}
resource "aws_codecommit_repository" "test" {
  repository_name = format("repo-%s", var.repository_name)
  description     = "Test repository."
}

Now we create a Terraform test file in the tests directory. See the following directory structure as an example:

├── main.tf
└── tests
    └── basic.tftest.hcl

For this first test, we will not perform any assertion except for validating that the Terraform execution plan runs successfully. In the tests file, we create a variables block to set the value for the variable repository_name. We also add a run block with command = plan to instruct Terraform test to run Terraform plan. The completed test should look like the following:

# basic.tftest.hcl

variables {
  repository_name = "MyRepo"
}

run "test_resource_creation" {
  command = plan
}

Now we will run this test locally. First ensure that you are authenticated into an AWS account, and run the terraform init command in the root directory of the Terraform module. After the provider is initialized, start the test using the terraform test command.

❯ terraform test
tests/basic.tftest.hcl... in progress
run "test_resource_creation"... pass
tests/basic.tftest.hcl... tearing down
tests/basic.tftest.hcl... pass

Our first test is complete. We have validated that the Terraform configuration is valid and that Terraform can produce an execution plan for the resource. Next, let’s learn how to inspect the resource state.

Create resource and validate resource name

Re-using the previous test file, we add an assert block to check whether the CodeCommit repository name starts with the string repo- and to provide an error message if the condition fails. For the assertion, we use the startswith function. See the following example:

# basic.tftest.hcl

variables {
  repository_name = "MyRepo"
}

run "test_resource_creation" {
  command = plan

  assert {
    condition = startswith(aws_codecommit_repository.test.repository_name, "repo-")
    error_message = "CodeCommit repository name ${var.repository_name} did not start with the expected value 'repo-***'."
  }
}

Now, let’s assume that another module author made changes to the module by modifying the prefix from repo- to my-repo-. Here is the modified Terraform module.

# main.tf

variable "repository_name" {
  type = string
}
resource "aws_codecommit_repository" "test" {
  repository_name = format("my-repo-%s", var.repository_name)
  description = "Test repository."
}

We can catch this mistake by running the terraform test command again.

❯ terraform test
tests/basic.tftest.hcl... in progress
run "test_resource_creation"... fail
╷
│ Error: Test assertion failed
│
│ on tests/basic.tftest.hcl line 9, in run "test_resource_creation":
│ 9: condition = startswith(aws_codecommit_repository.test.repository_name, "repo-")
│ ├────────────────
│ │ aws_codecommit_repository.test.repository_name is "my-repo-MyRepo"
│
│ CodeCommit repository name MyRepo did not start with the expected value 'repo-***'.
╵
tests/basic.tftest.hcl... tearing down
tests/basic.tftest.hcl... fail

Failure! 0 passed, 1 failed.

We have successfully created a unit test using assertions that validates the resource name matches the expected value. For more examples of using assertions see the Terraform Tests Docs. Before we proceed to the next section, don’t forget to fix the repository name in the module (revert the name back to repo- instead of my-repo-) and re-run your Terraform test.
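As an optional extension, a run block can override the file-level variables for a single scenario. The sketch below, with a hypothetical run name and repository name, could be appended to basic.tftest.hcl to check the exact name produced for a different input. The outputs shown in this post assume it has not been added.

```hcl
run "test_custom_name" {
  command = plan

  # Run-level variables take precedence over the file-level
  # variables block for this run only.
  variables {
    repository_name = "AnotherRepo"
  }

  assert {
    condition     = aws_codecommit_repository.test.repository_name == "repo-AnotherRepo"
    error_message = "Expected 'repo-AnotherRepo', got '${aws_codecommit_repository.test.repository_name}'."
  }
}
```

Because the repository name is derived entirely from configuration, it is known at plan time, so the assertion can be evaluated without creating any resources.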

Testing variable input validation

When developing Terraform modules, it is common to use variable validation as a contract test to validate any dependencies / restrictions. For example, AWS CodeCommit limits the repository name to 100 characters. A module author can use the length function to check the length of the input variable value. We are going to use Terraform test to ensure that the variable validation works effectively. First, we modify the module to use variable validation.

# main.tf

variable "repository_name" {
  type = string
  validation {
    condition = length(var.repository_name) <= 100
    error_message = "The repository name must be less than or equal to 100 characters."
  }
}

resource "aws_codecommit_repository" "test" {
  repository_name = format("repo-%s", var.repository_name)
  description = "Test repository."
}

By default, when variable validation fails during the execution of Terraform test, the Terraform test also fails. To simulate this, create a new test file and insert the repository_name variable with a value longer than 100 characters.

# var_validation.tftest.hcl

variables {
  repository_name = "this_is_a_repository_name_longer_than_100_characters_7rfD86rGwuqhF3TH9d3Y99r7vq6JZBZJkhw5h4eGEawBntZmvy"
}

run "test_invalid_var" {
  command = plan
}

Notice that in this new test file we also set the command to plan. Variable validation runs before Terraform apply, so we can save time and cost by skipping resource provisioning entirely. If we run this Terraform test, it will fail as expected.

❯ terraform test
tests/basic.tftest.hcl... in progress
run "test_resource_creation"... pass
tests/basic.tftest.hcl... tearing down
tests/basic.tftest.hcl... pass
tests/var_validation.tftest.hcl... in progress
run "test_invalid_var"... fail
╷
│ Error: Invalid value for variable
│
│ on main.tf line 1:
│ 1: variable "repository_name" {
│ ├────────────────
│ │ var.repository_name is "this_is_a_repository_name_longer_than_100_characters_7rfD86rGwuqhF3TH9d3Y99r7vq6JZBZJkhw5h4eGEawBntZmvy"
│
│ The repository name must be less than or equal to 100 characters.
│
│ This was checked by the validation rule at main.tf:3,3-13.
╵
tests/var_validation.tftest.hcl... tearing down
tests/var_validation.tftest.hcl... fail

Failure! 1 passed, 1 failed.

For other module authors who might iterate on the module, we need to ensure that the validation condition is correct and will catch any problems with input values. In other words, we expect the validation condition to fail with the wrong input. This is especially important when we want to incorporate the contract test in a CI/CD pipeline. To prevent our test from failing because of the intentional error we introduced, we can use the expect_failures attribute. Here is the modified test file:

# var_validation.tftest.hcl

variables {
  repository_name = "this_is_a_repository_name_longer_than_100_characters_7rfD86rGwuqhF3TH9d3Y99r7vq6JZBZJkhw5h4eGEawBntZmvy"
}

run "test_invalid_var" {
  command = plan

  expect_failures = [
    var.repository_name
  ]
}

Now if we run the Terraform test, we will get a successful result.

❯ terraform test
tests/basic.tftest.hcl... in progress
run "test_resource_creation"... pass
tests/basic.tftest.hcl... tearing down
tests/basic.tftest.hcl... pass
tests/var_validation.tftest.hcl... in progress
run "test_invalid_var"... pass
tests/var_validation.tftest.hcl... tearing down
tests/var_validation.tftest.hcl... pass

Success! 2 passed, 0 failed.

As you can see, the expect_failures attribute is used to test negative paths (the inputs that would cause failures when passed into a module). Assertions tend to focus on positive paths (the ideal inputs). For an additional example of a test that validates functionality of a completed module with multiple interconnected resources, see this example in the Terraform CI/CD and Testing on AWS Workshop.
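To make the contrast concrete, the sketch below places a positive-path run and a negative-path run side by side in one test file. The file name and run names are illustrative assumptions; the repo- prefix and the 100-character limit are taken from the module's main.tf in this post.

```hcl
# tests/paths.tftest.hcl (hypothetical file name)

# Positive path: a valid name should plan cleanly and receive the "repo-" prefix.
run "valid_name_is_prefixed" {
  command = plan

  variables {
    repository_name = "MyRepo"
  }

  assert {
    condition     = aws_codecommit_repository.test.repository_name == "repo-MyRepo"
    error_message = "Repository name is missing the expected repo- prefix"
  }
}

# Negative path: a name longer than 100 characters should trip the variable validation.
run "long_name_is_rejected" {
  command = plan

  variables {
    repository_name = "this_is_a_repository_name_longer_than_100_characters_7rfD86rGwuqhF3TH9d3Y99r7vq6JZBZJkhw5h4eGEawBntZmvy"
  }

  expect_failures = [
    var.repository_name
  ]
}
```

Both runs use command = plan, so neither one provisions a repository.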

Orchestrating supporting resources

In practice, end-users utilize Terraform modules in conjunction with other supporting resources. For example, a CodeCommit repository is usually encrypted using an AWS Key Management Service (KMS) key. The KMS key is provided by end-users to the module using a variable called kms_key_id. To simulate this test, we need to orchestrate the creation of the KMS key outside of the module. In this section we will learn how to do that. First, update the Terraform module to add the optional variable for the KMS key.

# main.tf

variable "repository_name" {
  type = string
  validation {
    condition = length(var.repository_name) <= 100
    error_message = "The repository name must be less than or equal to 100 characters."
  }
}

variable "kms_key_id" {
  type = string
  default = ""
}

resource "aws_codecommit_repository" "test" {
  repository_name = format("repo-%s", var.repository_name)
  description = "Test repository."
  kms_key_id = var.kms_key_id != "" ? var.kms_key_id : null
}

In a Terraform test, you can instruct the run block to execute another helper module. The helper module is used by the test to create the supporting resources. We will create a sub-directory called setup under the tests directory with a single kms.tf file. We also create a new test file for KMS scenario. See the updated directory structure:

├── main.tf
└── tests
    ├── setup
    │   └── kms.tf
    ├── basic.tftest.hcl
    ├── var_validation.tftest.hcl
    └── with_kms.tftest.hcl

The kms.tf file is a helper module to create a KMS key and provide its ARN as the output value.

# kms.tf

resource "aws_kms_key" "test" {
  description = "test KMS key for CodeCommit repo"
  deletion_window_in_days = 7
}

output "kms_key_id" {
  value = aws_kms_key.test.arn
}

The new test will use two separate run blocks. The first run block (setup) executes the helper module to generate a KMS key. This is done by assigning the command apply which will run terraform apply to generate the KMS key. The second run block (codecommit_with_kms) will then use the KMS key ARN output of the first run as the input variable passed to the main module.

# with_kms.tftest.hcl

run "setup" {
  command = apply
  module {
    source = "./tests/setup"
  }
}

run "codecommit_with_kms" {
  command = apply

  variables {
    repository_name = "MyRepo"
    kms_key_id = run.setup.kms_key_id
  }

  assert {
    condition = aws_codecommit_repository.test.kms_key_id != null
    error_message = "KMS key ID attribute value is null"
  }
}

Go ahead and run terraform init, followed by terraform test. You should get a successful result like the one below.

❯ terraform test
tests/basic.tftest.hcl... in progress
run "test_resource_creation"... pass
tests/basic.tftest.hcl... tearing down
tests/basic.tftest.hcl... pass
tests/var_validation.tftest.hcl... in progress
run "test_invalid_var"... pass
tests/var_validation.tftest.hcl... tearing down
tests/var_validation.tftest.hcl... pass
tests/with_kms.tftest.hcl... in progress
run "setup"... pass
run "codecommit_with_kms"... pass
tests/with_kms.tftest.hcl... tearing down
tests/with_kms.tftest.hcl... pass

Success! 4 passed, 0 failed.

We have learned how to run Terraform test and develop various test scenarios. In the next section we will see how to incorporate all the tests into a CI/CD pipeline.

Terraform Tests in CI/CD Pipelines

Now that we have seen how Terraform Test works locally, let’s see how the Terraform test can be leveraged to create a Terraform module validation pipeline on AWS. The following AWS services are used:

AWS CodeCommit – a secure, highly scalable, fully managed source control service that hosts private Git repositories.
AWS CodeBuild – a fully managed continuous integration service that compiles source code, runs tests, and produces ready-to-deploy software packages.
AWS CodePipeline – a fully managed continuous delivery service that helps you automate your release pipelines for fast and reliable application and infrastructure updates.
Amazon Simple Storage Service (Amazon S3) – an object storage service offering industry-leading scalability, data availability, security, and performance.

Terraform module validation pipeline

In the above architecture for a Terraform module validation pipeline, the following takes place:

A developer pushes Terraform module configuration files to a git repository (AWS CodeCommit).
AWS CodePipeline begins running the pipeline. The pipeline clones the git repo and stores the artifacts to an Amazon S3 bucket.
An AWS CodeBuild project configures a compute/build environment with Checkov installed from an image fetched from Docker Hub. CodePipeline passes the artifacts (Terraform module) and CodeBuild executes Checkov to run static analysis of the Terraform configuration files.
Another CodeBuild project is configured with Terraform from an image fetched from Docker Hub. CodePipeline passes the artifacts (repo contents) and CodeBuild runs the terraform test command to execute the tests.

CodeBuild uses a buildspec file to declare the build commands and relevant settings. Here is an example of the buildspec files for both CodeBuild Projects:

# Checkov
version: 0.1
phases:
  pre_build:
    commands:
      - echo pre_build starting

  build:
    commands:
      - echo build starting
      - echo starting checkov
      - ls
      - checkov -d .
      - echo saving checkov output
      - checkov -s -d ./ > checkov.result.txt

In the above buildspec, Checkov is run against the root directory of the cloned CodeCommit repository. This directory contains the configuration files for the Terraform module. Checkov also saves the output to a file named checkov.result.txt for further review or handling if needed. If Checkov fails, the pipeline will fail.

# Terraform Test
version: 0.1
phases:
  pre_build:
    commands:
      - terraform init
      - terraform validate

  build:
    commands:
      - terraform test

In the above buildspec, the terraform init and terraform validate commands are used to initialize Terraform, then check if the configuration is valid. Finally, the terraform test command is used to run the configured tests. If any of the Terraform tests fails, the pipeline will fail.

For a full example of the CI/CD pipeline configuration, please refer to the Terraform CI/CD and Testing on AWS workshop. The module validation pipeline mentioned above is meant as a starting point. In a production environment, you might want to customize it further by adding Checkov allow-list rules, linting, checks for Terraform docs, or pre-requisites such as building the code used in AWS Lambda.
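As a rough illustration of how the Terraform test stage could itself be declared in Terraform, here is a minimal sketch of a CodeBuild project wired to CodePipeline. The variable names, project name, and buildspec path are assumptions made for this example and are not taken from the workshop; whether a particular container image works as a CodeBuild environment should be verified separately.

```hcl
# Sketch only: every name below is a placeholder chosen for illustration.
variable "codebuild_role_arn" {
  type        = string
  description = "ARN of an existing IAM role that CodeBuild can assume"
}

variable "terraform_image" {
  type        = string
  description = "Container image that has the Terraform CLI installed"
}

resource "aws_codebuild_project" "terraform_test" {
  name         = "module-validation-terraform-test"
  service_role = var.codebuild_role_arn

  # Input and output artifacts are handed over by CodePipeline.
  artifacts {
    type = "CODEPIPELINE"
  }

  environment {
    compute_type = "BUILD_GENERAL1_SMALL"
    image        = var.terraform_image
    type         = "LINUX_CONTAINER"
  }

  source {
    type      = "CODEPIPELINE"
    buildspec = "buildspec-terraform-test.yml" # hypothetical path inside the repository
  }
}
```

A second project for the Checkov stage would look the same apart from its name, image, and buildspec.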

Choosing various testing strategies

At this point you may be wondering when you should use Terraform tests or other tools such as Preconditions and Postconditions, Check blocks, or policy as code. The answer depends on your test type and use cases. Terraform test is suitable for unit tests, such as validating that resources are created according to the naming specification. Variable validations and Pre/Post conditions are useful for contract tests of Terraform modules, for example by returning an error when input variable values do not meet the specification. As shown in the previous section, you can also use Terraform test to ensure your contract tests are running properly. Terraform test is also suitable for integration tests where you need to create supporting resources to properly test the module functionality. Lastly, Check blocks are suitable for end-to-end tests where you want to validate the infrastructure state after all resources are generated, for example to test if a website is running after an S3 bucket configured for static web hosting is created.
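The website example can be sketched as a Check block with a scoped http data source. This is a minimal sketch that assumes the hashicorp/http provider is available and that the endpoint is passed in through a hypothetical website_url variable; in a real module you would more likely reference the bucket's website endpoint directly.

```hcl
# Assumption: the caller supplies the URL of the static website to probe.
variable "website_url" {
  type        = string
  description = "Full URL of the static website endpoint, including the scheme"
}

check "website_is_serving" {
  # Scoped data source: it is only evaluated as part of this check.
  data "http" "home" {
    url = var.website_url
  }

  assert {
    condition     = data.http.home.status_code == 200
    error_message = "${data.http.home.url} returned an unhealthy status code"
  }
}
```

Unlike a failed precondition or postcondition, a failed check is reported as a warning and does not block the plan or apply.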

When developing Terraform modules, you can run Terraform test in command = plan mode for unit and contract tests. This allows the unit and contract tests to run faster and at lower cost since no resources are created. You should also consider the time and cost of executing Terraform test for complex or large Terraform configurations, especially if you have multiple test scenarios. Terraform test maintains one or more state files in memory for each test file. Consider how to re-use the module’s state when appropriate. Terraform test also provides test mocking, which allows you to test your module without creating the real infrastructure.
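As a minimal sketch of test mocking, the file below swaps the real AWS provider for a mock, so even an apply-mode run creates nothing in your account. It assumes a Terraform version that supports mock_provider (1.7 or later); the file name and run name are illustrative.

```hcl
# tests/mocked.tftest.hcl (hypothetical file name)

# Replace the real AWS provider with a mock for every run in this file.
mock_provider "aws" {}

run "naming_with_mocks" {
  # With the provider mocked, apply does not call any AWS API.
  command = apply

  variables {
    repository_name = "MyRepo"
  }

  assert {
    condition     = aws_codecommit_repository.test.repository_name == "repo-MyRepo"
    error_message = "Repository name is missing the expected repo- prefix"
  }
}
```

Computed attributes returned by a mocked provider are generated placeholder values, so assertions are best kept to attributes that your own configuration sets.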

Conclusion

In this post, you learned how to use Terraform test and develop various test scenarios. You also learned how to incorporate Terraform test in a CI/CD pipeline. Lastly, we also discussed various testing strategies for Terraform configurations and modules. For more information about Terraform test, we recommend the Terraform test documentation and tutorial. To get hands on practice building a Terraform module validation pipeline and Terraform deployment pipeline, check out the Terraform CI/CD and Testing on AWS Workshop.

Best practices for managing Terraform State files in AWS CI/CD Pipeline

2024-02-19 Arun Kumar Selvaraj

Post Syndicated from Arun Kumar Selvaraj original https://aws.amazon.com/blogs/devops/best-practices-for-managing-terraform-state-files-in-aws-ci-cd-pipeline/

Introduction

Today customers want to reduce manual operations for deploying and maintaining their infrastructure. The recommended method to deploy and manage infrastructure on AWS is to follow the Infrastructure-as-Code (IaC) model using tools like AWS CloudFormation, AWS Cloud Development Kit (AWS CDK) or Terraform.

One of the critical components in terraform is managing the state file, which keeps track of your configuration and resources. When you run terraform in an AWS CI/CD pipeline, the state file has to be stored in a secure, common path that the pipeline can access. You also need a mechanism to lock it when multiple developers in the team want to access it at the same time.

In this blog post, we will explain how to manage terraform state files in AWS, best practices on configuring them in AWS and an example of how you can manage it efficiently in your Continuous Integration pipeline in AWS when used with AWS Developer Tools such as AWS CodeCommit and AWS CodeBuild. This blog post assumes you have a basic knowledge of terraform, AWS Developer Tools and AWS CI/CD pipeline. Let’s dive in!

Challenges with handling state files

By default, the state file is stored locally where terraform runs, which is not a problem if you are a single developer working on the deployment. Otherwise, storing state files locally is not ideal, as you may run into the following problems:

When working in teams or collaborative environments, multiple people need access to the state file
Data in the state file is stored in plain text which may contain secrets or sensitive information
Local files can get lost, corrupted, or deleted

Best practices for handling state files

The recommended practice for managing state files is to use terraform’s built-in support for remote backends. These are:

Remote backend on Amazon Simple Storage Service (Amazon S3): You can configure terraform to store state files in an Amazon S3 bucket which provides a durable and scalable storage solution. Storing on Amazon S3 also enables collaboration that allows you to share state file with others.

Remote backend on Amazon S3 with Amazon DynamoDB: In addition to using an Amazon S3 bucket for managing the files, you can use an Amazon DynamoDB table to lock the state file. This will allow only one person to modify a particular state file at any given time. It will help to avoid conflicts and enable safe concurrent access to the state file.

There are other options available as well such as remote backend on terraform cloud and third party backends. Ultimately, the best method for managing terraform state files on AWS will depend on your specific requirements.

When deploying terraform on AWS, the preferred choice of managing state is using Amazon S3 with Amazon DynamoDB.

AWS configurations for managing state files

Create an Amazon S3 bucket using terraform. Implement security measures for the Amazon S3 bucket by creating an AWS Identity and Access Management (AWS IAM) policy or Amazon S3 Bucket Policy. This lets you restrict access, configure object versioning for data protection and recovery, and enable AES-256 server-side encryption with SSE-KMS for encryption control.

Next create an Amazon DynamoDB table using terraform with Primary key set to LockID. You can also set any additional configuration options such as read/write capacity units. Once the table is created, you will configure the terraform backend to use it for state locking by specifying the table name in the terraform block of your configuration.
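The two steps above might look roughly like the following sketch. The resource labels, the on-demand billing mode, and the choice of aws:kms as the encryption algorithm are assumptions made for illustration; the bucket and table names reuse the defaults that appear later in this post, and the sample repository's own main.tf may differ.

```hcl
# Sketch only: resource labels (tf_state, tf_lock) are placeholders.
resource "aws_s3_bucket" "tf_state" {
  bucket = "tfbackend-bucket"
}

# Keep earlier versions of the state file for recovery.
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Encrypt state objects at rest with SSE-KMS.
resource "aws_s3_bucket_server_side_encryption_configuration" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# Lock table: the S3 backend expects a string partition key named LockID.
resource "aws_dynamodb_table" "tf_lock" {
  name         = "tfstate-lock"
  billing_mode = "PAY_PER_REQUEST" # assumption: on-demand instead of provisioned capacity units
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```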

For a single AWS account with multiple environments and projects, you can use a single Amazon S3 bucket. If you have multiple applications in multiple environments across multiple AWS accounts, you can create one Amazon S3 bucket for each account. In that Amazon S3 bucket, you can create appropriate folders for each environment, storing project state files with specific prefixes.

Now that you know how to handle terraform state files on AWS, let’s look at an example of how you can configure them in a Continuous Integration pipeline in AWS.

Architecture

Figure 1: Example architecture on how to use terraform in an AWS CI pipeline

This diagram outlines the workflow implemented in this blog:

The AWS CodeCommit repository contains the application code
The AWS CodeBuild job contains the buildspec files and references the source code in AWS CodeCommit
The AWS Lambda function contains the application code created after running terraform apply
Amazon S3 contains the state file created after running terraform apply. Amazon DynamoDB locks the state file present in Amazon S3

Implementation

Pre-requisites

Before you begin, you must complete the following prerequisites:

Install the latest version of AWS Command Line Interface (AWS CLI)
Install the latest version of terraform
Install the latest version of Git and set up git-remote-codecommit
Use an existing AWS account or create a new one
Use AWS IAM role with role profile, role permissions, role trust relationship and user permissions to access your AWS account via local terminal

Setting up the environment

You need an AWS access key ID and secret access key to configure AWS CLI. To learn more about configuring the AWS CLI, follow these instructions.
Clone the repo for complete example: git clone https://github.com/aws-samples/manage-terraform-statefiles-in-aws-pipeline
After cloning, you will see the following folder structure:

Figure 2: AWS CodeCommit repository structure

Let’s break down the terraform code into 2 parts – one for preparing the infrastructure and another for preparing the application.

Preparing the Infrastructure

The main.tf file is the core component that does the following:
- It creates an Amazon S3 bucket to store the state file. We configure bucket ACL, bucket versioning and encryption so that the state file is secure.
- It creates an Amazon DynamoDB table which will be used to lock the state file.
- It creates two AWS CodeBuild projects, one for ‘terraform plan’ and another for ‘terraform apply’.

Note – It also has the code block (commented out by default) to create AWS Lambda which you will use at a later stage.

AWS CodeBuild projects should be able to access Amazon S3, Amazon DynamoDB, AWS CodeCommit and AWS Lambda. So, the AWS IAM role with the appropriate permissions required to access these resources is created via the iam.tf file.

Next you will find two buildspec files named buildspec-plan.yaml and buildspec-apply.yaml that will execute terraform commands – terraform plan and terraform apply respectively.

Modify AWS region in the provider.tf file.

Update Amazon S3 bucket name, Amazon DynamoDB table name, AWS CodeBuild compute types, AWS Lambda role and policy names to required values using variable.tf file. You can also use this file to easily customize parameters for different environments.

With this, the infrastructure setup is complete.

You can use your local terminal to execute the commands below, in the order shown, to deploy the above-mentioned resources in your AWS account.

terraform init
terraform validate
terraform plan
terraform apply

Once the apply is successful and all the above resources have been successfully deployed in your AWS account, proceed with deploying your application.

Preparing the Application

In the cloned repository, use the backend.tf file to create your own Amazon S3 backend to store the state file. By default, it has the values below. You can override them with your required values.

bucket = "tfbackend-bucket" 
key    = "terraform.tfstate" 
region = "eu-central-1"
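For context, these values sit inside the backend block of the terraform configuration. The sketch below shows one plausible shape for a complete backend.tf; the dynamodb_table value reuses the tfstate-lock table name from the verification step in this post, and encrypt = true is an assumption, so check the repository's actual file before relying on it.

```hcl
# backend.tf (sketch: the file in the sample repository may differ)
terraform {
  backend "s3" {
    bucket         = "tfbackend-bucket"
    key            = "terraform.tfstate"
    region         = "eu-central-1"
    dynamodb_table = "tfstate-lock" # DynamoDB table used for state locking
    encrypt        = true           # assumption: request server-side encryption for the state object
  }
}
```

Backend blocks cannot reference variables, so per-environment values are usually supplied with terraform init -backend-config=... rather than interpolation.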

The repository has sample python code stored in main.py that returns a simple message when invoked.

In the main.tf file, you can find the below block of code to create and deploy the Lambda function that uses the main.py code (uncomment these code blocks).

data "archive_file" "lambda_archive_file" {
    ……
}

resource "aws_lambda_function" "lambda" {
    ……
}

Now you can deploy the application using AWS CodeBuild instead of running terraform commands locally, which is the main advantage of using AWS CodeBuild.

Run the two AWS CodeBuild projects to execute terraform plan and terraform apply again.

Once successful, you can verify your deployment by testing the code in AWS Lambda. To test a lambda function (console):

- Open AWS Lambda console and select your function “tf-codebuild”
- In the navigation pane, in Code section, click Test to create a test event
- Provide your required name, for example “test-lambda”
- Accept default values and click Save
- Click Test again to trigger your test event “test-lambda”

It should return the sample message you provided in your main.py file. In the default case, it will display “Hello from AWS Lambda !” message as shown below.

Figure 3: Sample AWS Lambda function response

To verify your state file, go to Amazon S3 console and select the backend bucket created (tfbackend-bucket). It will contain your state file.

Figure 4: Amazon S3 bucket with terraform state file

Open Amazon DynamoDB console and check your table tfstate-lock and it will have an entry with LockID.

Figure 5: Amazon DynamoDB table with LockID

Thus, you have securely stored and locked your terraform state file using terraform backend in a Continuous Integration pipeline.

Cleanup

To delete all the resources created as part of the repository, run the below command from your terminal.

terraform destroy

Conclusion

In this blog post, we explored the fundamentals of terraform state files, discussed best practices for their secure storage within AWS environments, and covered mechanisms for locking these files to prevent conflicting concurrent changes by team members. Finally, we showed you an example of how you can manage them efficiently in a Continuous Integration pipeline in AWS.

You can apply the same methodology to manage state files in a Continuous Delivery pipeline in AWS. For more information, see CI/CD pipeline on AWS, Terraform backends types, Purpose of terraform state.

Best Practices for Writing Step Functions Terraform Projects

2023-09-18 Patrick Guha

Post Syndicated from Patrick Guha original https://aws.amazon.com/blogs/devops/best-practices-for-writing-step-functions-terraform-projects/

Terraform by HashiCorp is one of the most popular infrastructure-as-code (IaC) platforms. AWS Step Functions is a visual workflow service that helps developers use AWS services to build distributed applications, automate processes, orchestrate microservices, and create data and machine learning (ML) pipelines. In this blog, we showcase best practices for users leveraging Terraform to deploy workflows, also known as Step Functions state machines. We will create a state machine using Workflow Studio for AWS Step Functions, deploy the state machine with Terraform, and introduce best operating practices on topics such as project structure, modules, parameter substitution, and remote state.

We recommend that you have a working understanding of both Terraform and Step Functions before going through this blog. If you are brand new to Step Functions and/or Terraform, please visit the Introduction to Terraform on AWS Workshop and the Terraform option in the Managing State Machines with Infrastructure as Code section of The AWS Step Functions Workshop to learn more.

Step Functions and Terraform Project Structure

One of the most important parts of any software project is its structure. It must be clear and well-organized for yourself or any member of your team to pick up and start coding efficiently. A Step Functions project using Terraform can potentially have many moving parts and components, so it is especially important to modularize and label wherever possible. Let’s take a look at a project structure that will allow for modularization, re-usability, and extensibility:

mkdir sfn-tf-example
cd sfn-tf-example
mkdir -p -- statemachine modules functions/first-function/src
touch main.tf outputs.tf variables.tf .gitignore functions/first-function/src/lambda.py
tree

Before moving forward, let’s analyze the directory, subdirectories, and files created above:

/statemachine will hold our Amazon States Language (ASL) JSON code describing the Step Functions state machine definition. This is where the orchestration logic will reside, so it is prudent to keep it separated from the infrastructure code. If you are deploying multiple state machines in your project, each definition will have its own JSON file. If you prefer, you can specify separate folders for each state machine to further modularize and isolate the logic.
/functions subdirectory includes the actual code for AWS Lambda functions used in our state machine. Keeping this code here will be much easier to read than writing it inline in our main.tf file.
The last subdirectory we have is /modules. Terraform modules are higher-level abstractions that describe new concepts in your architecture. However, do not fall into the trap of making a custom module for everything. Doing so will make your code harder to maintain, and AWS provider resources will often suffice. There are also very popular modules that you can use from the Terraform Registry, such as Terraform AWS modules. Whenever possible, re-use modules to avoid code duplication in your project.
The remaining files in the root of the project are common to all Terraform projects. There are going to be hidden files created by your Terraform project after running terraform init, so we will include a .gitignore. What you include in .gitignore is largely dependent on your codebase and what your tools silently create in the background. In a later section, we will explicitly call out *.tfstate files in our .gitignore, and go over best practices for managing Terraform state securely and remotely.

Initial Code and Project Setup

We are going to create a simple Step Functions state machine that will only execute a single Lambda function. However, we will need to create the Lambda function that the state machine will reference. We first need to create our Lambda function code and save it in the directory structure and file mentioned above: functions/first-function/src/lambda.py.

import boto3


def lambda_handler(event, context):
    # Minimal function for demo purposes
    return True

In Terraform, the main configuration file is named main.tf by convention. The Terraform CLI loads every .tf file in the working directory, so you can break your template down into multiple .tf files, with main.tf as the conventional entry point. In this file, we will define the required providers and their minimum version, along with the resource definition of our template. In the example below, we define the minimum resources needed for a simple state machine that only executes a Lambda function. We define the two AWS Identity and Access Management (IAM) roles that our Lambda function and state machine will use, respectively. We define a data resource that zips the Lambda function code, which is then used in the Lambda function definition. Also notice that we use the aws_iam_policy_document data source throughout. Using the official IAM policy document means both your integrated development environment (IDE) and Terraform can see if your policy is malformed before running terraform apply. Finally, we define an Amazon CloudWatch Log group that will be used by the Lambda function to store its execution logs.

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~>4.0"
    }
  }
}

provider "aws" {}

provider "random" {}

data "aws_caller_identity" "current_account" {}

data "aws_region" "current_region" {}

resource "random_string" "random" {
  length  = 4
  special = false
}

data "aws_iam_policy_document" "lambda_assume_role_policy" {
  statement {
    effect = "Allow"

    principals {
      type        = "Service"
      identifiers = ["lambda.amazonaws.com"]
    }

    actions = [
      "sts:AssumeRole",
    ]
  }
}

resource "aws_iam_role" "function_role" {
  assume_role_policy  = data.aws_iam_policy_document.lambda_assume_role_policy.json
  managed_policy_arns = ["arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"]
}

# Create the function
data "archive_file" "lambda" {
  type        = "zip"
  source_file = "functions/first-function/src/lambda.py"
  output_path = "functions/first-function/src/lambda.zip"
}

resource "aws_kms_key" "log_group_key" {}

resource "aws_kms_key_policy" "log_group_key_policy" {
  key_id = aws_kms_key.log_group_key.id
  policy = jsonencode({
    Id = "log_group_key_policy"
    Statement = [
      {
        Action = "kms:*"
        Effect = "Allow"
        Principal = {
          AWS = "arn:aws:iam::${data.aws_caller_identity.current_account.account_id}:root"
        }

        Resource = "*"
        Sid      = "Enable IAM User Permissions"
      },
      {
        Effect = "Allow",
        Principal = {
          Service : "logs.${data.aws_region.current_region.name}.amazonaws.com"
        },
        Action = [
          "kms:Encrypt*",
          "kms:Decrypt*",
          "kms:ReEncrypt*",
          "kms:GenerateDataKey*",
          "kms:Describe*"
        ],
        Resource = "*"
      }
    ]
    Version = "2012-10-17"
  })
}

resource "aws_lambda_function" "test_lambda" {
  function_name    = "HelloFunction-${random_string.random.id}"
  role             = aws_iam_role.function_role.arn
  handler          = "lambda.lambda_handler"
  runtime          = "python3.9"
  filename         = "functions/first-function/src/lambda.zip"
  source_code_hash = data.archive_file.lambda.output_base64sha256
}

# Explicitly create the function’s log group to set retention and allow auto-cleanup
resource "aws_cloudwatch_log_group" "lambda_function_log" {
  retention_in_days = 1
  name              = "/aws/lambda/${aws_lambda_function.test_lambda.function_name}"
  kms_key_id        = aws_kms_key.log_group_key.arn
}

# Create an IAM role for the Step Functions state machine
data "aws_iam_policy_document" "state_machine_assume_role_policy" {
  statement {
    effect = "Allow"

    principals {
      type        = "Service"
      identifiers = ["states.amazonaws.com"]
    }

    actions = [
      "sts:AssumeRole",
    ]
  }
}

resource "aws_iam_role" "StateMachineRole" {
  name               = "StepFunctions-Terraform-Role-${random_string.random.id}"
  assume_role_policy = data.aws_iam_policy_document.state_machine_assume_role_policy.json
}

data "aws_iam_policy_document" "state_machine_role_policy" {
  statement {
    effect = "Allow"

    actions = [
      "logs:CreateLogStream",
      "logs:PutLogEvents",
      "logs:DescribeLogGroups"
    ]

    resources = ["${aws_cloudwatch_log_group.MySFNLogGroup.arn}:*"]
  }

  statement {
    effect = "Allow"
    actions = [
      "cloudwatch:PutMetricData",
      "logs:CreateLogDelivery",
      "logs:GetLogDelivery",
      "logs:UpdateLogDelivery",
      "logs:DeleteLogDelivery",
      "logs:ListLogDeliveries",
      "logs:PutResourcePolicy",
      "logs:DescribeResourcePolicies",
    ]
    resources = ["*"]
  }

  statement {
    effect = "Allow"

    actions = [
      "lambda:InvokeFunction"
    ]

    resources = ["${aws_lambda_function.test_lambda.arn}"]
  }

}

# Create an IAM policy for the Step Functions state machine
resource "aws_iam_role_policy" "StateMachinePolicy" {
  role   = aws_iam_role.StateMachineRole.id
  policy = data.aws_iam_policy_document.state_machine_role_policy.json
}

# Create a Log group for the state machine
resource "aws_cloudwatch_log_group" "MySFNLogGroup" {
  name_prefix       = "/aws/vendedlogs/states/MyStateMachine-"
  retention_in_days = 1
  kms_key_id        = aws_kms_key.log_group_key.arn
}
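The configuration above also relies on the random provider (random_string) and the archive provider (archive_file), which Terraform resolves implicitly from the hashicorp namespace. If you prefer to pin them explicitly, a sketch of an extended terraform block follows. It would replace the terraform block at the top of the file rather than be added alongside it, and the version constraints for random and archive are assumptions, not values from the original post.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~>4.0"
    }
    # Assumed constraints: adjust to the versions you have validated.
    random = {
      source  = "hashicorp/random"
      version = "~> 3.0"
    }
    archive = {
      source  = "hashicorp/archive"
      version = "~> 2.0"
    }
  }
}
```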

Workflow Studio and Terraform Integration

It is important to understand the recommended steps given the different tools we have available for creating Step Functions state machines. You should use a combination of Workflow Studio and local development with Terraform. This workflow assumes you will define all resources for your application within the same Terraform project, and that you will be leveraging Terraform for managing your AWS resources.

Figure 1 – Workflow for creating Step Functions state machine via Terraform

You will write the Terraform definition for any resources you intend to call with your state machine, such as Lambda functions, Amazon Simple Storage Service (Amazon S3) buckets, or Amazon DynamoDB tables, and deploy them using the terraform apply command. Doing this prior to using Workflow Studio will be useful in designing the first version of the state machine. You can define additional resources after importing the state machine into your local Terraform project.
You can use Workflow Studio to visually design the first version of the state machine. Given that you should have created the necessary resources already, you can drag and drop all of the actions and states, link them, and see how they look. Finally, you can execute the state machine for testing purposes.
Once your initial design is ready, you will export the ASL file and save it in your Terraform project. You can use the Terraform resource type aws_sfn_state_machine and reference the saved ASL file in the definition field.
You will then need to parametrize the ASL file given that Terraform will dynamically name the resources, and the Amazon Resource Name (ARN) may eventually change. You do not want to hardcode an ARN in your ASL file, as this will make updating and refactoring your code more difficult.
Finally, you deploy the state machine via Terraform by running terraform apply.

Simple changes should be made directly in the parametrized ASL file in your Terraform project instead of going back to Workflow Studio. Having the ASL file versioned as part of your project ensures that no manual changes break the state machine. Even if there is a breaking change, you can easily roll back to a previous version. One caveat to this is if you are making major changes to the state machine. In this case, taking advantage of Workflow Studio in the console is preferable.

However, you will most likely want to continue seeing a visual representation of the state machine while developing locally. The good news is that you have another option directly integrated into Visual Studio Code (VS Code) that visually renders the state machine, similar to Workflow Studio. This functionality is part of the AWS Toolkit for VS Code. You can learn more about the state machine integration with the AWS Toolkit for VS Code here. Below is an example of a parametrized ASL file and its rendered visualization in VS Code.

Figure 2 – Step Functions state machine displayed visually in VS Code

Parameter Substitution

In the Terraform template, when you define the Step Functions state machine, you can either include the definition in the template or in an external file. Leaving the definition in the template can cause the template to be less readable and difficult to manage. As a best practice, it is recommended to keep the definition of the state machine in a separate file. This raises the question of how to pass parameters to the state machine. In order to do this, you can use the templatefile function of Terraform. The templatefile function reads a file and renders its content with the supplied set of variables. As shown in the code snippet below, we will use the templatefile function to render the state machine definition file with the Lambda function ARN and any other parameters to pass to the state machine.

resource "aws_sfn_state_machine" "sfn_state_machine" {
  name     = "MyStateMachine-${random_string.random.id}"
  role_arn = aws_iam_role.StateMachineRole.arn
  definition = templatefile("${path.module}/statemachine/statemachine.asl.json", {
    ProcessingLambda = aws_lambda_function.test_lambda.arn
  })
  logging_configuration {
    log_destination        = "${aws_cloudwatch_log_group.MySFNLogGroup.arn}:*"
    include_execution_data = true
    level                  = "ALL"
  }
}

Inside the state machine definition, you have to specify a string template using the interpolation sequences delimited with ${ … }. Similar to the code snippet below, you will define the state machine with the variable name that will be passed by the templatefile function.

"Lambda Invoke": {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {
        "Payload.$": "$",
        "FunctionName": "${ProcessingLambda}"
    },
    "End": true
}

After the templatefile function runs, it will replace the variable ${ProcessingLambda} with the actual Lambda function ARN generated when the template is deployed.
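To make the substitution step concrete, here is a minimal Python sketch that mimics what happens to the ${ProcessingLambda} placeholder. It is not Terraform's implementation: the render helper, the wrapping StartAt/States structure, and the example ARN are all assumptions made for illustration. It also shows why the "$" in "Payload.$": "$" is left alone, since only the ${name} form is treated as a placeholder.

```python
import json
import re

# Hypothetical template text that wraps the ASL fragment shown above.
ASL_TEMPLATE = """
{
  "StartAt": "Lambda Invoke",
  "States": {
    "Lambda Invoke": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "Payload.$": "$",
        "FunctionName": "${ProcessingLambda}"
      },
      "End": true
    }
  }
}
"""


def render(template, variables):
    """Replace each ${name} placeholder with its value; fail on unknown names."""

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"no value supplied for ${{{name}}}")
        return variables[name]

    # Only the ${name} form is a placeholder; a bare "$" is left untouched.
    return re.sub(r"\$\{(\w+)\}", substitute, template)


# Placeholder ARN for illustration only (not a real function).
EXAMPLE_ARN = "arn:aws:lambda:us-east-1:123456789012:function:example"

rendered = render(ASL_TEMPLATE, {"ProcessingLambda": EXAMPLE_ARN})
definition = json.loads(rendered)  # the rendered text must still be valid JSON
function_name = definition["States"]["Lambda Invoke"]["Parameters"]["FunctionName"]
print(function_name)
```

Parsing the rendered text as JSON is a cheap sanity check that a substituted value did not break the definition.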

Remote Terraform State Management

Every time you run Terraform, it stores information about the managed infrastructure and configuration in a state file. By default, Terraform creates the state file called terraform.tfstate in the local directory. As mentioned earlier, you will want to include any .tfstate files in your .gitignore file. This will ensure you do not commit them to source control, which could potentially expose secrets and would most likely lead to errors in state. If you accidentally delete this local file, Terraform cannot track the infrastructure that was previously created. In that case, if you run terraform apply on an updated configuration, Terraform will try to create every resource from scratch, which will conflict with the resources that already exist. It is recommended that you store the Terraform state remotely in secure storage to enable versioning, encryption, and sharing. Terraform supports storing state in S3 buckets by using the backend configuration block. In order to configure Terraform to write the state file to an S3 bucket, you need to specify the bucket name, the region, and the key name.

It is also recommended that you enable versioning in the S3 bucket and MFA delete to protect the state file from accidental deletion. In addition, you need to make sure that Terraform has the right IAM permissions on the target S3 bucket. In case you have multiple developers working with the same infrastructure simultaneously, Terraform can also use state locking to prevent concurrent runs against the same state. You can use a DynamoDB table to control locking. The DynamoDB table you use must have a partition key named LockID with type String, and Terraform must have the right IAM permissions on the table.

terraform {
    backend "s3" {
        bucket         = "mybucket"
        key            = "path/to/state/file"
        region         = "us-east-1"
        encrypt        = true
        dynamodb_table = "Table-Name"
        # To only allow HTTPS connections, attach a bucket policy to the bucket itself;
        # the s3 backend block does not accept attach_deny_insecure_transport_policy.
    }
}

With this remote state configuration, you will maintain the state securely stored in S3. With every change you apply to your infrastructure, Terraform will automatically pull the latest state from the S3 bucket, lock it using the DynamoDB table, apply the changes, push the latest state again to the S3 bucket and then release the lock.
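The lock, change, and release cycle described above can be modeled with a short Python sketch. This is a toy model under stated assumptions, not Terraform's implementation: InMemoryLockTable, LockHeldError, and apply_with_lock are hypothetical names, the table is a plain dictionary standing in for a table keyed by LockID, and the exact ordering of the pull and lock steps is simplified.

```python
class LockHeldError(Exception):
    """Raised when another run already holds the lock for this state."""


class InMemoryLockTable:
    """Toy stand-in for a lock table whose partition key is LockID."""

    def __init__(self):
        self._items = {}

    def acquire(self, lock_id, owner):
        # Conditional put: succeed only if no item with this LockID exists.
        if lock_id in self._items:
            raise LockHeldError(f"{lock_id} is locked by {self._items[lock_id]}")
        self._items[lock_id] = owner

    def release(self, lock_id, owner):
        # Only the owner that took the lock may remove it.
        if self._items.get(lock_id) == owner:
            del self._items[lock_id]


def apply_with_lock(table, remote_state, lock_id, owner, change):
    """Lock, pull the latest state, apply a change, push it back, release."""
    table.acquire(lock_id, owner)
    try:
        working_copy = dict(remote_state)   # pull the latest state
        working_copy.update(change)         # apply the change
        remote_state.clear()
        remote_state.update(working_copy)   # push the new state
    finally:
        table.release(lock_id, owner)       # always release the lock
    return remote_state


table = InMemoryLockTable()
state = {"serial": 1}
apply_with_lock(table, state, "mybucket/path/to/state/file", "alice", {"serial": 2})
print(state)
```

The try/finally mirrors the point of the workflow: a failed apply should not leave the state locked for everyone else.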

Cleanup

If you were following along and deployed resources such as the Lambda function, the Step Functions state machine, the S3 bucket for backend state storage, or any of the other associated resources by running terraform apply, to avoid incurring charges on your AWS account, please run terraform destroy to tear these resources down and clean up your environment.

Conclusion

In conclusion, this blog provides a comprehensive guide to leveraging Terraform for deploying AWS Step Functions state machines. We discussed the importance of a well-structured project, initial code setup, integration between Workflow Studio and Terraform, parameter substitution, and remote state management. By following these best practices, developers can create and manage their state machines more effectively while maintaining clean, modular, and reusable code. Embracing infrastructure-as-code and using the right tools, such as Workflow Studio, VS Code, and Terraform, will enable you to build scalable and maintainable distributed applications, automate processes, orchestrate microservices, and create data and ML pipelines with AWS Step Functions.

If you would like to learn more about using Step Functions with Terraform, please check out the following patterns and workflows on Serverless Land and view the Step Functions Developer Guide.

How Cloudflare uses Terraform to manage Cloudflare

2022-11-17 Michael Wolf

Post Syndicated from Michael Wolf original https://blog.cloudflare.com/terraforming-cloudflare-at-cloudflare/

Configuration management is far from a solved problem. As organizations scale beyond a handful of administrators, having a secure, auditable, and self-service way of updating system settings becomes invaluable. Managing a Cloudflare account is no different. With dozens of products and hundreds of API endpoints, keeping track of current configuration and making bulk updates across multiple zones can be a challenge. While the Cloudflare Dashboard is great for analytics and feature exploration, any changes that could potentially impact users really should get a code review before being applied!

This is where Cloudflare’s Terraform provider can come in handy. Built as a layer on top of the cloudflare-go library, the provider allows users to interface with the Cloudflare API using stateful Terraform resource declarations. Not only do we actively support this provider for customers, we make extensive use of it internally! In this post, we hope to provide some best practices we’ve learned about managing complex Cloudflare configurations in Terraform.

Why Terraform

Unsurprisingly, we find Cloudflare’s products to be pretty useful for securing and enhancing the performance of services we deploy internally. We use DNS, WAF, Zero Trust, Email Security, Workers, and all manner of experimental new features throughout the company. This dog-fooding allows us to battle-harden the services we provide to users and feed our desired features back to the product teams all while running the backend of Cloudflare. But, as Cloudflare grew, so did the complexity and importance of our configuration.

When we were a much smaller company, we only had a handful of accounts with designated administrators making changes on behalf of their colleagues. However, over time this handful of accounts grew into hundreds with each managed by separate teams. Independent accounts are useful in that they allow service-owners to make modifications that can’t impact others, but it comes with overhead.

We faced the challenge of ensuring consistent security policies, up-to-date account memberships, and change visibility. While our accounts were still administered by kind human stewards, we had numerous instances of account members not being removed after they transferred to a different team. While this never became a security incident, it demonstrated the shortcomings of manually provisioning account memberships. In the case of a production service migration, the administrator executing the change would often hop on a video call and ask for others to triple-check an IP address, ruleset, or access policy update. It was an era of looking through the audit logs to see what broke a service.

We wanted to make it easier for developers and users to make the changes they wanted without having to reach out to an administrator. Defining our configuration in code using Terraform has allowed us to keep tabs on the complexity of configuration while improving visibility and change management practices. By dogfooding the Cloudflare Terraform provider, we’ve been able to ensure:

Modifications to accounts are peer reviewed by the team that owns an account.
Each change is tied to a user, commit, and a ticket explaining the rationale for the change.
API Tokens are tied to service accounts rather than individual human users, meaning they survive team changes and offboarding.
Account configuration can be audited by anyone at the company for current state, accuracy, and security without needing to add everyone as a member of every account.
Large changes, such as enforcing hard keys, can be done rapidly, even in a single pull request.
Configuration can be easily copied and reused across accounts to promote best practices and speed up development.
We can use and iterate on our awesome provider and provide a better experience to other users (shoutout in particular to Jacob!).

Terraform in CI/CD

Terraform has a fairly mature open source ecosystem, built from years of running-in-production experience. Thus, there are a number of ways to make interacting with the system feel as comfortable to developers as git. One of these tools is Atlantis.

Atlantis acts as continuous integration/continuous delivery (CI/CD) for Terraform, fitting neatly into version control workflows and giving visibility into the changes being deployed in each code change. We use Atlantis to display Terraform plans (effectively a diff in configuration) within pull requests and apply the changes after the pull request has been approved. Having all the output from the Terraform provider in the comments of a pull request means there’s no need to fiddle with the state locally or worry about where a state lock is coming from. Using Terraform CI/CD like this makes configuration management approachable to developers and non-technical folks alike.

In this example pull request, I’m adding a user to the cloudflare-cool-account (see the code in the next section). Once the PR is opened, Bitbucket posts a webhook to Atlantis, telling it to run a `terraform plan` using this branch. The resulting comment is placed in the pull request. Notice that this pull request can’t be applied or merged yet as it doesn’t have an approval! Once the pull request is approved, I would comment “atlantis apply”, wait for Atlantis to post a comment containing the output of the command, and merge the pull request if that output looks correct.

Our Terraforming Cloudflare architecture consists of a monorepo with one directory (and tfstate) for each internally-owned Cloudflare account. This keeps all of our Cloudflare configuration centralized for easier oversight while remaining neatly organized.

It will be possible in a future (as of this writing) release to manage multiple Cloudflare accounts in the same tfstate, but we’ve found that accounts in our use generally map fairly neatly onto teams. Teams can be configured as CODEOWNERS for a given directory and be tagged on any pull requests to that account. With teams owning separate accounts and each account having a separate tfstate, it’s rare for pull requests to get stuck waiting for a lock on the tfstate. Team-account-sized states remain relatively small, meaning that they also build quickly. Later on, we’ll share some of the other optimizations we’ve made to keep the repo user-friendly.

Each of our Terraform states, given that they include secrets (including the API key!), is stored encrypted in an internal datastore. When a pull request is opened, Atlantis reaches out to a piece of middleware (that we may open source once it’s cleaned up a bit) that retrieves and decrypts the state for processing. Once the pull request is applied, the state is encrypted and put away again.

We execute a daily Terraform apply across all tfstates to capture any unintended config drift and rotate certificates when they approach expiration. This prevents unrelated changes from popping up in pull request diffs and causing confusion. While we could run more frequent state applies to ensure Terraform remains firmly up to date, once-a-day rectification strikes a balance between code enforcement and avoiding state locks while users are running Terraform plans in pull requests.

One of the problems that we encountered during our transition to Terraform is that folks were in the habit of making updates to configuration in the Dashboard and were still able to edit settings there. Thus, we didn’t always have a single source of truth for our configuration in code. It also meant the change would get mysteriously (to them) reverted the next day! So that’s why I’m excited to share a new Zero Trust Dashboard toggle that we’ve been turning on for our accounts internally: API/Terraform read-only mode.

With this button, we’re able to politely prevent manual changes to your Cloudflare account’s Zero Trust configuration without removing permissions from the set of users who can fix settings manually in a break-glass emergency scenario. Check out how you can enable this setting in your Zero Trust organization.

Slick Snippets and Terraforming Recommendations

As our Terraform repository has matured, we’ve refined how we define Cloudflare resources in code. By finding a sweet spot between code reuse and readability, we’ve been able to minimize operational overhead and generally let users get their work done. Here’s a couple of useful snippets that have been particularly valuable to us.

Account Membership

This allows for defining a fairly straightforward mapping of user emails to account privileges without code duplication or complex modules. We pull the list of human-friendly names of account roles from the API to show user permission assignments at a glance. Note: status is a new argument that allows for accounts to be added without sending an email to the user; perfect for when an organization is using SSO. (Thanks patrobinson for the feature request and mblackman for the PR!)

variables.tf
—-
data "cloudflare_account_roles" "my_account" {
  account_id = var.account_id
}

locals {
  roles = {
    for role in data.cloudflare_account_roles.my_account.roles :
    role.name => role
  }
}

members.tf
—-
locals {
  users = {
    emerson = {
      roles = [
        local.roles["Administrator"].id
      ]
    }
    lucian = {
      roles = [
        local.roles["Super Administrator - All Privileges"].id
      ]
    }
    walruto = {
      roles = [
        local.roles["Audit Logs Viewer"].id,
        local.roles["Cloudflare Access"].id,
        local.roles["DNS"].id
      ]
    }
  }
}

resource "cloudflare_account_member" "account_member" {
  for_each      = local.users
  account_id    = var.account_id
  email_address = "${each.key}@cloudflare.com"
  role_ids      = each.value.roles
  status        = "accepted"
}
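For readers less familiar with HCL for expressions and for_each, the following Python sketch shows roughly the data shape the snippet above builds. The role records and their IDs are made up for illustration; in the real configuration they come from the cloudflare_account_roles data source.

```python
# Hypothetical role records shaped like the data source output (IDs are made up).
roles_from_api = [
    {"id": "role-id-admin", "name": "Administrator"},
    {"id": "role-id-audit", "name": "Audit Logs Viewer"},
    {"id": "role-id-dns", "name": "DNS"},
]

# Equivalent of: { for role in ... : role.name => role }
roles = {role["name"]: role for role in roles_from_api}

# Equivalent of the users local: usernames mapped to lists of role IDs.
users = {
    "emerson": {"roles": [roles["Administrator"]["id"]]},
    "walruto": {"roles": [roles["Audit Logs Viewer"]["id"], roles["DNS"]["id"]]},
}

# Equivalent of for_each over local.users: one member per key.
members = {
    name: {
        "email_address": f"{name}@cloudflare.com",
        "role_ids": user["roles"],
        "status": "accepted",
    }
    for name, user in users.items()
}

for name, member in members.items():
    print(name, member["email_address"], member["role_ids"])
```

Keying roles by their human-friendly name is what lets the user map read as a permission list at a glance, while the IDs stay out of sight.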

Defining Auto-Refreshing Access Service Tokens

The GitHub issue and provider change that enabled automatic Access service token refreshes actually came from a need inside Cloudflare. Here’s how we ended up implementing it. We begin by defining a set of services that need to connect to our hostnames that are protected by Access. Each of these tokens is created and stored in a secret key value store. Next, we reference those access tokens by ID in the target Access policies. Once this has run, the service owner or the service itself can retrieve the credentials from the data store. (Note: we’re using Vault here, but any storage provider could be used in its place).

tokens.tf
—
locals {
  service_tokens = toset([
    "customer-service",   # TICKET-120
    "full-service",       # TICKET-128
    "quality-of-service", # TICKET-420
    "room-service"        # TICKET-927
  ])
}

resource "cloudflare_access_service_token" "token" {
  for_each             = local.service_tokens
  account_id           = var.account_id
  name                 = each.key
  min_days_for_renewal = 30
}

resource "vault_generic_secret" "access_service_token" {
  for_each     = local.service_tokens
  path         = "kv/secrets/${each.key}/access_service_token"
  disable_read = true

  data_json = jsonencode({
    client_id     = cloudflare_access_service_token.token[each.key].client_id,
    client_secret = cloudflare_access_service_token.token[each.key].client_secret
  })
}

super_cool_hostname.tf
—
resource "cloudflare_access_application" "super_cool_hostname" {
  account_id = var.account_id
  name       = "Super Cool Hostname"
  domain     = "supercool.hostname.tld"
}

resource "cloudflare_access_policy" "super_cool_hostname_service_access" {
  application_id = cloudflare_access_application.super_cool_hostname.id
  zone_id        = data.cloudflare_zone.hostname_tld.id
  name           = "TICKET-927 Allow Room Service"
  decision       = "non_identity"
  precedence     = 1
  include {
    service_token = [cloudflare_access_service_token.token["room-service"].id]
  }
}

mTLS (Authenticated Origin Pulls) certificate creation and rotation

To further defense-in-depth objectives, we’ve been rolling out mTLS throughout our internal systems. One of the places where we can take advantage of our Terraform provider is in defining AOP (Authenticated Origin Pulls) certificates to lock down the Cloudflare-edge-to-origin connection. Anyone who has managed certificates of any kind can speak to the headaches they can cause. Having certificate configurations in Terraform takes out the manual work of rotation and expiration.

In this example we’re defining hostname-level AOP as opposed to zone-level AOP. We start by cutting a certificate for each hostname. Once again we’re using Vault for certificate creation, but other backends could be used just as well. This certificate is created with a (not-shown) 30 day expiration, but set to renew automatically. This means once the time-to-expiration is equal to min_seconds_remaining, the resource will be automatically tainted and replaced on the next Terraform run. We like to give this automation plenty of room before expiration to take into account holiday seasons and avoid sending alerts to humans when the alerts hit seven days to expiration. For the rest of this snippet, the certificate is uploaded to Cloudflare and the ID from that upload is then placed in the AOP configuration for the given hostname. The create_before_destroy meta-argument ensures that the replacement certificate is uploaded successfully before we remove the certificate that’s currently in place.

locals {
  hostnames = toset([
    "supercool.hostname.tld",
    "thatsafinelooking.hostname.tld"
  ])
}

resource "vault_pki_secret_backend_cert" "vault_cert" {
  for_each              = local.hostnames
  backend               = "pki-aop"
  name                  = "default"
  auto_renew            = true
  common_name           = "${each.key}.aop.pki.vault.cfdata.org"
  min_seconds_remaining = 864000 // renew when there are 10 days left before expiration
}

resource "cloudflare_authenticated_origin_pulls_certificate" "aop_cert" {
  for_each = local.hostnames
  zone_id  = data.cloudflare_zone.hostname_tld.id
  type     = "per-hostname"

  certificate = vault_pki_secret_backend_cert.vault_cert[each.key].certificate
  private_key = vault_pki_secret_backend_cert.vault_cert[each.key].private_key

  lifecycle {
    create_before_destroy = true
  }
}

resource "cloudflare_authenticated_origin_pulls" "aop_config" {
  for_each                               = local.hostnames
  zone_id                                = data.cloudflare_zone.hostname_tld.id
  authenticated_origin_pulls_certificate = cloudflare_authenticated_origin_pulls_certificate.aop_cert[each.key].id
  hostname                               = each.key
  enabled                                = true
}
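The renewal timing is easy to check by hand. Here is a small Python sketch of the arithmetic, assuming the 30-day certificate lifetime mentioned above; the renewal_due helper and the issue date are hypothetical and only illustrate the comparison against min_seconds_remaining.

```python
from datetime import datetime, timedelta, timezone

MIN_SECONDS_REMAINING = 864000          # value from the snippet above
CERT_LIFETIME = timedelta(days=30)      # assumed from the 30-day expiration


def renewal_due(issued_at, now):
    """True once the time left before expiration is at or below the threshold."""
    expires_at = issued_at + CERT_LIFETIME
    return (expires_at - now) <= timedelta(seconds=MIN_SECONDS_REMAINING)


# Arbitrary issue date chosen for the example.
issued = datetime(2022, 11, 1, tzinfo=timezone.utc)

threshold_days = timedelta(seconds=MIN_SECONDS_REMAINING).days
print(threshold_days)                                   # 10
print(renewal_due(issued, issued + timedelta(days=19)))  # False
print(renewal_due(issued, issued + timedelta(days=20)))  # True
```

With these numbers, a certificate becomes eligible for replacement on day 20 of its 30-day life, which leaves a three-day buffer before a seven-day expiration alert would fire.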

Terraform recommendations

The comfortable automation that we’ve achieved thus far did not come without some hair-pulling. Below are a few of the learnings that have allowed us to maintain the repository as a side project run by two engineers (shoutout David).

Store your state somewhere safe

It feels worth repeating that the tfstate contains secrets including any API keys you’re using with providers and the default location of the tfstate is in the current working directory. It’s very easy to accidentally commit this to source control. By defining a backend, the state can be stored with a cloud storage provider, in a secure location on a filesystem, in a database, or even Cloudflare Workers! Wherever the state is stored, make sure it is encrypted.

Choose simplicity, avoid modules

Modules are intended to reduce code repetition for well-defined chunks of systems such as “I want three clusters of whizz-bangs in locations A, C, and F.” If cloud-computing was like Factorio, this would be amazing. However, financial, technical, and physical constraints mean subtle differences in systems develop over time such as “I want fewer whizz-bangs in C and the whizz-bangs in F should get a different network topology.” In Terraform, implementation logic of these requirements is moved to the module code. HCL is absolutely not the place to write decipherable conditionals. While module versioning prevents having to make every change backwards-compatible, keeping module usage up-to-date becomes another chore for repository maintainers.

An understandable codebase is a user-friendly codebase. It’s rare that a deeply cryptic error will return from a misconfigured resource definition. Conversely, modules, especially custom ones, can lead users on a head-scratching adventure. This kind of system can’t scale with confused users.

A few well-designed for_each loops (we’re obviously fans) can achieve similar objectives as modules without the complexity. It’s fine to use plain old resources too! Especially when there are more than a handful of varying arguments, it’s more valuable for the configuration to be clear than to be eloquent. For example: it makes sense for an account_member resource to be in a for_each loop, but probably not for a page_rule.

Keep tfstates small

Maintaining quick pull-request-to-plan turnaround keeps Terraform from feeling like a burden on users’ time. Furthermore, if a plan is taking 30 minutes to run, a rollback in the case of an issue would also take 30 minutes! This post describes our single-account-to-tfstate model.

However, after noticing slow-downs coming from the large number of AOP certificate configurations in a big zone, we moved that code to a separate tfstate. We were able to make this change because AOP configuration is fairly self-contained. To ensure there would be no fighting between the states, we kept the API token permissions for each tfstate mutually exclusive of each other. Our Atlantis Terraform plans typically finish in under five minutes. If it feels impossible to keep the plan time for a tfstate down to a reasonable duration, it may be worth considering a different tool for that bit of configuration management.

Know when to use a different tool

Terraform isn’t a panacea. We generally don’t use Terraform to manage DNS records, for example. We use OctoDNS which integrates more neatly into our infrastructure automation. DNS records can quickly add up to long state-rendering times and are often dynamically generated from systems that Terraform doesn’t know about. To avoid conflicts, there should only ever be one system publishing changes to DNS records.

We also haven’t figured out a maintainable way of managing Workers scripts in Terraform. When a .js script in the Terraform directory changes, Terraform isn’t aware of it. This means a change needs to occur somewhere else in a .tf file before the plan diff is generated. It likely isn’t an unsolvable issue, but doesn’t seem particularly worth cramming into Terraform when there are better options for Worker management like Wrangler.

Looking forward

We’re continuing to invest in the Cloudflare Terraforming experience both for our own use and for the benefit of our users. With the provider, we hope to offer a comfortable and scalable method of interacting with Cloudflare products. Hopefully this post has presented some useful suggestions to anyone interested in adopting Cloudflare-configuration-as-code. Don’t hesitate to reach out on the GitHub project for troubleshooting, bug reports, or feature requests. For more in depth documentation on using Terraform to manage your Cloudflare account, read on here. And if you don’t have a Cloudflare account already, click here to get started.

Deploying IBM Cloud Pak for Data on Red Hat OpenShift Service on AWS

2022-09-14 Eduardo Monich Fronza

Post Syndicated from Eduardo Monich Fronza original https://aws.amazon.com/blogs/architecture/deploying-ibm-cloud-pak-for-data-on-red-hat-openshift-service-on-aws/

Amazon Web Services (AWS) customers who are looking for a more intuitive way to deploy and use IBM Cloud Pak for Data (CP4D) on the AWS Cloud can now use the Red Hat OpenShift Service on AWS (ROSA).

ROSA is a fully managed service, jointly supported by AWS and Red Hat. It is managed by Red Hat Site Reliability Engineers and provides a pay-as-you-go pricing model, as well as a unified billing experience on AWS.

With this, customers do not manage the lifecycle of Red Hat OpenShift Container Platform clusters. Instead, they are free to focus on developing new solutions and innovating faster, using IBM’s integrated data and artificial intelligence platform on AWS, to differentiate their business and meet their ever-changing enterprise needs.

CP4D can also be deployed from the AWS Marketplace with self-managed OpenShift clusters. This is ideal for customers with requirements such as Red Hat OpenShift Data Foundation software-defined storage, or who prefer to manage their own OpenShift clusters.

In this post, we discuss how to deploy CP4D on ROSA using IBM-provided Terraform automation.

Cloud Pak for Data architecture

Here, we install CP4D in a highly available ROSA cluster across three availability zones (AZs), with three master nodes, three infrastructure nodes, and three worker nodes.

Review the AWS Regions and Availability Zones documentation and the regions where ROSA is available to choose the best region for your deployment.

This is a public ROSA cluster, accessible from the internet via port 443. When deploying CP4D in your AWS account, consider using a private cluster (Figure 1).

Figure 1. IBM Cloud Pak for Data on ROSA

We are using Amazon Elastic Block Store (Amazon EBS) and Amazon Elastic File System (Amazon EFS) for the cluster’s persistent storage. Review the IBM documentation for information about supported storage options.

Review the AWS prerequisites for ROSA, and follow the Security best practices in IAM documentation to protect your AWS account before deploying CP4D.

Cost

The costs associated with using AWS services when deploying CP4D in your AWS account can be estimated on the pricing pages for the services used.

Prerequisites

This blog assumes familiarity with: CP4D, Terraform, Amazon Elastic Compute Cloud (Amazon EC2), Amazon EBS, Amazon EFS, Amazon Virtual Private Cloud, and AWS Identity and Access Management (IAM).

You will need the following before getting started:

Access to an AWS account, with permissions to create the resources described in the installation steps section.
An AWS IAM user, with the permissions described in the AWS prerequisites for ROSA documentation.
Sufficient AWS service quotas to deploy ROSA. You can request service-quota increases from the AWS console.
An IBM entitlement API key: either a 60-day trial or an existing entitlement.
A bastion host to run the CP4D installer, with the following packages:
- AWS Command Line Interface (aws cli)
- OpenShift command-line interface (oc)
- Kubernetes command-line tool (kubectl)
- Terraform
- Git
- Podman
- Python 3.8
- httpd-tools, jq, wget, vim, unzip

Installation steps

Complete the following steps to deploy CP4D on ROSA:

First, enable ROSA on the AWS account. From the AWS ROSA console, click on Enable ROSA, as in Figure 2.

Figure 2. Enabling ROSA on your AWS account
Click on Get started. You are redirected to the Red Hat website, where you can register and obtain a Red Hat ROSA token.

Navigate to the AWS IAM console. Create an IAM policy named cp4d-installer-policy and add the following permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "autoscaling:*",
                "cloudformation:*",
                "cloudwatch:*",
                "ec2:*",
                "elasticfilesystem:*",
                "elasticloadbalancing:*",
                "events:*",
                "iam:*",
                "kms:*",
                "logs:*",
                "route53:*",
                "s3:*",
                "servicequotas:GetRequestedServiceQuotaChange",
                "servicequotas:GetServiceQuota",
                "servicequotas:ListServices",
                "servicequotas:ListServiceQuotas",
                "servicequotas:RequestServiceQuotaIncrease",
                "sts:*",
                "support:*",
                "tag:*"
            ],
            "Resource": "*"
        }
    ]
}

Next, let’s create an IAM user from the AWS IAM console, which will be used for the CP4D installation:
a. Specify a name, like ibm-cp4d-bastion.
b. Set the credential type to Access key – Programmatic access.
c. Attach the IAM policy created in Step 3.
d. Download the .csv credentials file.
From the Amazon EC2 console, create a new EC2 key pair and download the private key.
Launch an Amazon EC2 instance from which the CP4D installer is launched:
a. Specify a name, like ibm-cp4d-bastion.
b. Select an instance type, such as t3.medium.
c. Select the EC2 key pair created in the previous step.
d. Select the Red Hat Enterprise Linux 8 (HVM), SSD Volume Type for 64-bit (x86) Amazon Machine Image.
e. Create a security group with an inbound rule that allows SSH connections. Restrict access to your own IP address or an IP range from your organization.
f. Leave all other values as default.
Connect to the EC2 instance via SSH using its public IP address. The remaining installation steps will be initiated from it.

Install the required packages:

$ sudo yum update -y
$ sudo yum install git unzip vim wget httpd-tools python38 -y

$ sudo ln -s /usr/bin/python3 /usr/bin/python
$ sudo ln -s /usr/bin/pip3 /usr/bin/pip
$ sudo pip install pyyaml

$ curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
$ unzip awscliv2.zip
$ sudo ./aws/install

$ wget "https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux64"
$ chmod +x jq-linux64
$ sudo mv jq-linux64 /usr/local/bin/jq

$ wget "https://mirror.openshift.com/pub/openshift-v4/clients/ocp/4.10.15/openshift-client-linux-4.10.15.tar.gz"
$ tar -xvf openshift-client-linux-4.10.15.tar.gz
$ chmod u+x oc kubectl
$ sudo mv oc /usr/local/bin
$ sudo mv kubectl /usr/local/bin

$ sudo yum install -y yum-utils
$ sudo yum-config-manager --add-repo https://rpm.releases.hashicorp.com/RHEL/hashicorp.repo
$ sudo yum -y install terraform

$ sudo subscription-manager repos --enable=rhel-7-server-extras-rpms
$ sudo yum install -y podman

Configure the AWS CLI with the IAM user credentials from Step 4 and the desired AWS region to install CP4D:

$ aws configure

AWS Access Key ID [None]: AK****************7Q
AWS Secret Access Key [None]: vb************************************Fb
Default region name [None]: eu-west-1
Default output format [None]: json

Clone the following IBM GitHub repository, and then change to its Terraform directory:
https://github.com/IBM/cp4d-deployment.git
```
$ cd ~/cp4d-deployment/managed-openshift/aws/terraform/
```

For the purpose of this post, we enabled Watson Machine Learning, Watson Studio, and Db2 OLTP services on CP4D. Use the example in this step to create a Terraform variables file for CP4D installation. Enable CP4D services required for your use case:

region			= "eu-west-1"
tenancy			= "default"
access_key_id 		= "your_AWS_Access_key_id"
secret_access_key 	= "your_AWS_Secret_access_key"

new_or_existing_vpc_subnet	= "new"
az				= "multi_zone"
availability_zone1		= "eu-west-1a"
availability_zone2 		= "eu-west-1b"
availability_zone3 		= "eu-west-1c"

vpc_cidr 		= "10.0.0.0/16"
public_subnet_cidr1 	= "10.0.0.0/20"
public_subnet_cidr2 	= "10.0.16.0/20"
public_subnet_cidr3 	= "10.0.32.0/20"
private_subnet_cidr1 	= "10.0.128.0/20"
private_subnet_cidr2 	= "10.0.144.0/20"
private_subnet_cidr3 	= "10.0.160.0/20"

openshift_version 		= "4.10.15"
cluster_name 			= "your_ROSA_cluster_name"
rosa_token 			= "your_ROSA_token"
worker_machine_type 		= "m5.4xlarge"
worker_machine_count 		= 3
private_cluster 			= false
cluster_network_cidr 		= "10.128.0.0/14"
cluster_network_host_prefix 	= 23
service_network_cidr 		= "172.30.0.0/16"
storage_option 			= "efs-ebs" 
ocs 				= { "enable" : "false", "ocs_instance_type" : "m5.4xlarge" } 
efs 				= { "enable" : "true" }

accept_cpd_license 		= "accept"
cpd_external_registry 		= "cp.icr.io"
cpd_external_username 	= "cp"
cpd_api_key 			= "your_IBM_API_Key"
cpd_version 			= "4.5.0"
cpd_namespace 		= "zen"
cpd_platform 			= "yes"

watson_knowledge_catalog 	= "no"
data_virtualization 		= "no"
analytics_engine 		= "no"
watson_studio 			= "yes"
watson_machine_learning 	= "yes"
watson_ai_openscale 		= "no"
spss_modeler 			= "no"
cognos_dashboard_embedded 	= "no"
datastage 			= "no"
db2_warehouse 		= "no"
db2_oltp 			= "yes"
cognos_analytics 		= "no"
master_data_management 	= "no"
decision_optimization 		= "no"
bigsql 				= "no"
planning_analytics 		= "no"
db2_aaservice 			= "no"
watson_assistant 		= "no"
watson_discovery 		= "no"
openpages 			= "no"
data_management_console 	= "no"
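
Before running Terraform, it can help to sanity-check the network values in the variables file. The following is a minimal sketch, using only Python's standard ipaddress module, that checks that each subnet CIDR from the example above falls inside the VPC CIDR and that no two subnets overlap. The function name check_subnets is hypothetical and is not part of the IBM repository.

```python
import ipaddress
from itertools import combinations


def check_subnets(vpc_cidr, subnet_cidrs):
    """Return a list of problems; an empty list means the layout is consistent."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = {name: ipaddress.ip_network(cidr) for name, cidr in subnet_cidrs.items()}
    problems = []
    # Every subnet must sit inside the VPC range.
    for name, net in subnets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} ({net}) is outside the VPC range {vpc}")
    # No two subnets may share addresses.
    for (name_a, net_a), (name_b, net_b) in combinations(subnets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{name_a} ({net_a}) overlaps {name_b} ({net_b})")
    return problems


# Values copied from the example variables file above.
example_subnets = {
    "public_subnet_cidr1": "10.0.0.0/20",
    "public_subnet_cidr2": "10.0.16.0/20",
    "public_subnet_cidr3": "10.0.32.0/20",
    "private_subnet_cidr1": "10.0.128.0/20",
    "private_subnet_cidr2": "10.0.144.0/20",
    "private_subnet_cidr3": "10.0.160.0/20",
}
print(check_subnets("10.0.0.0/16", example_subnets))
```

An empty list means the example layout is consistent. If you change the CIDR ranges for your own environment, rerun the check before launching the installation.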

Save your file, and launch the commands below to install CP4D and track progress:

$ terraform init -input=false
$ terraform apply --var-file=cp4d-rosa-3az-new-vpc.tfvars \
   -input=false | tee terraform.log

The installation runs for 4 or more hours. Once installation is complete, the output includes (as in Figure 3):
a. Commands to get the CP4D URL and the admin user password
b. CP4D admin user
c. Login command for the ROSA cluster

Figure 3. CP4D installation output

Validation steps

Let’s verify the installation!

$ oc login https://api.cp4dblog.17e7.p1.openshiftapps.com:6443 --username cluster-admin --password *****-*****-*****-*****

Initiate the following command to get the cluster’s console URL (Figure 4):
```
$ oc whoami --show-console
```
Figure 4. ROSA console URL
Run the commands in this step to retrieve the CP4D URL and admin user password (Figure 5).
```
$ oc extract secret/admin-user-details \
  --keys=initial_admin_password --to=- -n zen
$ oc get routes -n zen
```
Figure 5. Retrieve the CP4D admin user password and URL

Initiate the following commands to list the CP4D workloads running in your ROSA cluster (Figure 6):

$ oc get pods -n zen
$ oc get deployments -n zen
$ oc get svc -n zen 
$ oc get pods -n ibm-common-services 
$ oc get deployments -n ibm-common-services
$ oc get svc -n ibm-common-services
$ oc get subs -n ibm-common-services

Figure 6. Checking the CP4D pods running on ROSA
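
Scanning the pod list by eye gets tedious in a namespace with many pods. As a rough sketch, the helper below takes the parsed JSON produced by oc get pods -n zen -o json and lists the pods that are neither completed nor running with all containers ready. The helper name and the sample pod names are made up for illustration; only the field names (status.phase, status.containerStatuses[].ready) come from the Kubernetes Pod API.

```python
def pods_needing_attention(pod_list):
    """Return the names of pods that are not Succeeded and not fully ready."""
    flagged = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Succeeded":
            continue  # completed job pods are expected after an installation
        containers = status.get("containerStatuses", [])
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if phase != "Running" or not all_ready:
            flagged.append(pod["metadata"]["name"])
    return flagged


# Illustrative input only; the pod names are hypothetical.
sample = {
    "items": [
        {"metadata": {"name": "example-core"},
         "status": {"phase": "Running", "containerStatuses": [{"ready": True}]}},
        {"metadata": {"name": "example-setup-job"},
         "status": {"phase": "Succeeded"}},
        {"metadata": {"name": "example-db-pending"},
         "status": {"phase": "Pending"}},
    ]
}
print(pods_needing_attention(sample))
```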

Log in to your CP4D web console using its URL and your admin password.
Expand the navigation menu. Navigate to Services > Services catalog for the available services (Figure 7).

Figure 7. Navigating to the CP4D services catalog
Notice that the services set as “enabled” correspond with your Terraform definitions (Figure 8).

Figure 8. Services enabled in your CP4D catalog

Congratulations! You have successfully deployed IBM CP4D on Red Hat OpenShift on AWS.

Post installation

Refer to the IBM documentation on setting up services, if you need to enable additional services on CP4D.

When installing CP4D in production environments, please review the IBM documentation on securing your environment. Also, the Red Hat documentation on setting up identity providers for ROSA is informative. You can also consider enabling auto scaling for your cluster.

Cleanup

Connect to your bastion host, and run the following steps to delete the CP4D installation, including ROSA. This step avoids incurring future charges on your AWS account.

$ cd ~/cp4d-deployment/managed-openshift/aws/terraform/
$ terraform destroy -var-file="cp4d-rosa-3az-new-vpc.tfvars"

If you’ve experienced any failures during the CP4D installation, run these next steps:

$ cd ~/cp4d-deployment/managed-openshift/aws/terraform
$ sudo cp installer-files/rosa /usr/local/bin
$ sudo chmod 755 /usr/local/bin/rosa
$ Cluster_Name=`rosa list clusters -o yaml | grep -w "name:" | cut -d ':' -f2 | xargs`
$ rosa remove cluster --cluster=${Cluster_Name}
$ rosa logs uninstall -c ${Cluster_Name} --watch
$ rosa init --delete-stack
$ terraform destroy -var-file="cp4d-rosa-3az-new-vpc.tfvars"

Conclusion

In summary, we explored how customers can take advantage of a fully managed OpenShift service on AWS to run IBM CP4D. With this implementation, customers can focus on what is important to them, their workloads, and their customers, and less on the day-to-day operations of managing OpenShift to run CP4D.

Check out the IBM Cloud Pak for Data Simplifies and Automates How You Turn Data into Insights blog to learn how to use CP4D on AWS to unlock the value of your data.

Use AWS Network Firewall to filter outbound HTTPS traffic from applications hosted on Amazon EKS and collect hostnames provided by SNI

2022-09-12 Kirankumar Chandrashekar

Post Syndicated from Kirankumar Chandrashekar original https://aws.amazon.com/blogs/security/use-aws-network-firewall-to-filter-outbound-https-traffic-from-applications-hosted-on-amazon-eks/

This blog post shows how to set up an Amazon Elastic Kubernetes Service (Amazon EKS) cluster such that the applications hosted on the cluster can have their outbound internet access restricted to a set of hostnames provided by the Server Name Indication (SNI) in the allow list in the AWS Network Firewall rules. For encrypted web traffic, SNI can be used for blocking access to specific sites in the network firewall. SNI is an extension to TLS that remains unencrypted in the traffic flow and indicates the destination hostname a client is attempting to access over HTTPS.
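
To see why a firewall can read the hostname without decrypting anything, the following minimal sketch builds a TLS ClientHello in memory with Python's standard ssl module and checks whether the hostname appears in the raw bytes. No network connection is made. The helper name client_hello_bytes is ours, not part of any AWS tooling.

```python
import ssl


def client_hello_bytes(hostname):
    """Build a TLS ClientHello for `hostname` in memory and return its raw bytes."""
    context = ssl.create_default_context()
    incoming = ssl.MemoryBIO()  # bytes from the (absent) server
    outgoing = ssl.MemoryBIO()  # bytes the client would put on the wire
    tls = context.wrap_bio(incoming, outgoing, server_hostname=hostname)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        # Expected: the client has written its ClientHello and now waits for a reply.
        pass
    return outgoing.read()


hello = client_hello_bytes("aws.amazon.com")
# The SNI extension carries the hostname as plain ASCII inside the ClientHello.
print(b"aws.amazon.com" in hello)
```

Because the hostname travels in clear text in this first handshake message, a network device on the path can match it against an allow list before any encrypted application data flows.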

This post also shows you how to use Network Firewall to collect hostnames of the specific sites that are being accessed by your application. Securing outbound traffic to specific hostnames is called egress filtering. In computer networking, egress filtering is the practice of monitoring and potentially restricting the flow of information outbound from one network to another. Securing outbound traffic is usually done by means of a firewall that blocks packets that fail to meet certain security requirements. One such firewall is AWS Network Firewall, a managed service that you can use to deploy essential network protections for all of your VPCs that you create with Amazon Virtual Private Cloud (Amazon VPC).

Example scenario

You have the option to scan your application traffic by the identifier of the requested SSL certificate, which makes you independent from the relationship of the IP address to the certificate. The certificate could be served from any IP address. Traditional stateful packet filters are not able to follow the changing IP address of the endpoints. Therefore, the host name information that you get from the SNI becomes important in making security decisions. Amazon EKS has gained popularity for running containerized workloads in the AWS Cloud, and you can restrict outbound traffic to only the known hostnames provided by SNI. This post will walk you through the process of setting up the EKS cluster in two different subnets so that your software can use the additional traffic routing in the VPC and traffic filtering through Network Firewall.

Solution architecture

The architecture illustrated in Figure 1 shows a VPC with three subnets in Availability Zone A, and three subnets in Availability Zone B. There are two public subnets where Network Firewall endpoints are deployed, two private subnets where the worker nodes for the EKS cluster are deployed, and two protected subnets where NAT gateways are deployed.

Figure 1: Outbound internet access through Network Firewall from Amazon EKS worker nodes

The workflow in the architecture for outbound access to a third-party service is as follows:

The outbound request originates from the application running in the private subnet (for example, to https://aws.amazon.com) and is passed to the NAT gateway in the protected subnet.
The HTTPS traffic received in the protected subnet is routed to the AWS Network Firewall endpoint in the public subnet.
The network firewall computes the rules, and either accepts or declines the request to pass to the internet gateway.
If the request is passed, the application-requested URL (provided by SNI in the non-encrypted HTTPS header) is allowed in the network firewall, and successfully reaches the third-party server for access.

The VPC settings for this blog post follow the recommendation for using public and private subnets described in Creating a VPC for your Amazon EKS cluster in the Amazon EKS User Guide, but with additional subnets called protected subnets. Instead of placing the NAT gateway in a public subnet, it will be placed in the protected subnet, and the Network Firewall endpoints in the public subnet will filter the egress traffic that flows through the NAT gateway. This design pattern adds further checks and could be a recommendation for your VPC setup.

As suggested in Creating a VPC for your Amazon EKS cluster, using the Public and private subnets option allows you to deploy your worker nodes to private subnets, and allows Kubernetes to deploy load balancers to the public subnets. This arrangement can load-balance traffic to pods that are running on nodes in the private subnets. As shown in Figure 1, the solution uses an additional subnet named the protected subnet, apart from the public and private subnets. The protected subnet is a VPC subnet deployed between the public subnet and private subnet. The outbound internet traffic that is routed through the protected subnet is rerouted to the Network Firewall endpoint hosted within the public subnet. You can use the same strategy mentioned in Creating a VPC for your Amazon EKS cluster to place different AWS resources within private subnets and public subnets. The main difference in this solution is that you place the NAT gateway in a separate protected subnet, between the private and public subnets, and place Network Firewall endpoints in the public subnets to filter traffic in the network firewall. The NAT gateway’s IP address is still preserved, and could still be used for adding to the allow list of third-party entities that need connectivity for the applications running on the EKS worker nodes.

To see a practical example of how the outbound traffic is filtered based on the hostnames provided by SNI, follow the steps in the Deploy a sample to test the network firewall section that follows. You will deploy an AWS CloudFormation template that deploys the solution architecture, consisting of the VPC components, EKS cluster components, and the Network Firewall components. When that’s complete, you can deploy a sample app running on Amazon EKS to test egress traffic filtering through AWS Network Firewall.

Deploy a sample to test the network firewall

Follow the steps in this section to perform a sample app deployment to test the use case of securing outbound traffic through AWS Network Firewall.

Prerequisites

The prerequisite actions required for the sample deployment are as follows:

Make sure you have the AWS CLI installed, and configure access to your AWS account.
Install and set up the eksctl tool to create an Amazon EKS cluster.
Copy the necessary CloudFormation templates and the sample eksctl config files from the blog’s Amazon S3 bucket to your local file system. You can do this by using the following AWS CLI S3 cp command.
aws s3 cp s3://awsiammedia/public/sample/803-network-firewall-to-filter-outbound-traffic/config.yaml .
aws s3 cp s3://awsiammedia/public/sample/803-network-firewall-to-filter-outbound-traffic/lambda_function.py .
aws s3 cp s3://awsiammedia/public/sample/803-network-firewall-to-filter-outbound-traffic/network-firewall-eks-collect-all.yaml .
aws s3 cp s3://awsiammedia/public/sample/803-network-firewall-to-filter-outbound-traffic/network-firewall-eks.yaml .

Important: This command will download the S3 bucket contents to the current directory on your terminal, so the “.” (dot) in the command is very important.
Once this is complete, you should be able to see the list of files shown in Figure 2. (The list includes config.yaml, lambda_function.py, network-firewall-eks-collect-all.yaml, and network-firewall-eks.yaml.)

Figure 2: Files downloaded from the S3 bucket

Deploy the VPC architecture with AWS Network Firewall

In this procedure, you’ll deploy the VPC architecture by using a CloudFormation template.

To deploy the VPC architecture (AWS CLI)

Deploy the CloudFormation template network-firewall-eks.yaml, which you previously downloaded to your local file system from the Amazon S3 bucket.
You can do this through the AWS CLI by using the create-stack command, as follows.

aws cloudformation create-stack --stack-name AWS-Network-Firewall-Multi-AZ \
  --template-body file://network-firewall-eks.yaml \
  --parameters ParameterKey=NetworkFirewallAllowedWebsites,ParameterValue=".amazonaws.com\,.docker.io\,.docker.com" \
  --capabilities CAPABILITY_NAMED_IAM

Note: The initially allowed hostnames for egress filtering are passed to the network firewall by using the parameter key NetworkFirewallAllowedWebsites in the CloudFormation stack. In this example, the allowed hostnames are .amazonaws.com, .docker.io, and .docker.com.
Make a note of the subnet IDs from the stack outputs of the CloudFormation stack after the status changes to CREATE_COMPLETE.

aws cloudformation describe-stacks \
  --stack-name AWS-Network-Firewall-Multi-AZ

Note: For simplicity, the CloudFormation stack name is AWS-Network-Firewall-Multi-AZ, but you can change this name according to your needs and follow the same naming throughout this post.
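
The describe-stacks output is verbose, so it can be convenient to reduce it to a simple key-to-value map before copying the subnet IDs. The sketch below assumes the standard response shape (Stacks, Outputs, OutputKey, OutputValue). The output key names and subnet IDs in the sample are placeholders, not the names the template actually emits, and the helper name is hypothetical.

```python
def outputs_by_key(describe_stacks_response):
    """Map OutputKey -> OutputValue for the first stack in a describe-stacks response."""
    stack = describe_stacks_response["Stacks"][0]
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}


# Illustrative response fragment; keys and IDs below are placeholders.
sample_response = {
    "Stacks": [{
        "StackName": "AWS-Network-Firewall-Multi-AZ",
        "StackStatus": "CREATE_COMPLETE",
        "Outputs": [
            {"OutputKey": "ExamplePrivateSubnetA", "OutputValue": "subnet-0123456789abcdef0"},
            {"OutputKey": "ExampleProtectedSubnetA", "OutputValue": "subnet-0fedcba9876543210"},
        ],
    }]
}

for key, value in outputs_by_key(sample_response).items():
    print(f"{key} = {value}")
```

The same function works on the JSON printed by the AWS CLI command above (loaded with json.load) or on the dictionary returned by the boto3 CloudFormation client's describe_stacks call.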

To deploy the VPC architecture (console)

In your account, launch the AWS CloudFormation template by choosing the following Launch Stack button. It will take approximately 10 minutes for the CloudFormation stack to complete.

Note: The stack will launch in the N. Virginia (us-east-1) Region. To deploy this solution into other AWS Regions, download the solution’s CloudFormation template, modify it, and deploy it to the selected Region.

Deploy and set up access to the EKS cluster

In this step, you’ll use the eksctl CLI tool to create an EKS cluster.

To deploy an EKS cluster by using the eksctl tool

There are two methods for creating an EKS cluster. Method A uses the eksctl create cluster command without a configuration (config) file. Method B uses a config file.

Note: Before you start, make sure you have the VPC subnet details available from the previous procedure.

Method A: No config file

You can create an EKS cluster without a config file by using the eksctl create cluster command.

From the CLI, enter the following commands.
eksctl create cluster \
  --vpc-private-subnets=<private-subnet-A>,<private-subnet-B> \
  --vpc-public-subnets=<public-subnet-A>,<public-subnet-B>
Make sure that the subnets passed to the --vpc-public-subnets parameter are protected subnets taken from the VPC architecture CloudFormation stack output. You can verify the subnet IDs by looking at step 2 in the To deploy the VPC architecture section.

Method B: With config file

Another way to create an EKS cluster is to use a config file (named cluster.yaml in this example), which supports more options.

Create a file named cluster.yaml by adding the following contents to it.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: filter-egress-traffic-test
  region: us-east-1
  version: "1.19"
availabilityZones: ["us-east-1a", "us-east-1b"]
vpc:
  id: <vpc-id>
  subnets:
    public:
      us-east-1a: { id: <public-subnet-A> }
      us-east-1b: { id: <public-subnet-B> }
    private:
      us-east-1a: { id: <private-subnet-A> }
      us-east-1b: { id: <private-subnet-B> }

managedNodeGroups:
- name: nodegroup
  desiredCapacity: 3
  ssh:
    allow: true
    publicKeyName: main
  iam:
    attachPolicyARNs:
    - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
    - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
    - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
    - arn:aws:iam::aws:policy/AmazonEKSServicePolicy
    - arn:aws:iam::aws:policy/AmazonEKSClusterPolicy
    - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
  preBootstrapCommands:
    - yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm
    - sudo systemctl enable amazon-ssm-agent
    - sudo systemctl start amazon-ssm-agent

Run the following command to create an EKS cluster using the eksctl tool and the cluster.yaml config file.
eksctl create cluster -f cluster.yaml

To set up access to the EKS cluster

Before you deploy a sample Kubernetes Pod, make sure you have the kubeconfig file set up for the EKS cluster that you created in step 2 of To deploy an EKS cluster by using the eksctl tool. For more information, see Create a kubeconfig for Amazon EKS. You can use eksctl to do this, as follows.

eksctl utils write-kubeconfig --cluster filter-egress-traffic-test
To set the kubectl context to the EKS cluster you just created, first list the available contexts by using the following command.

kubectl config get-contexts

Figure 3 shows an example of the output from this command.

Figure 3: kubectl config get-contexts command output
Copy the context name from the command output and set the context by using the following command.

kubectl config use-context <NAME-OF-CONTEXT>

To deploy a sample Pod on the EKS cluster

Next, deploy a sample Kubernetes Pod in the EKS cluster.

kubectl run -i --tty amazon-linux --image=public.ecr.aws/amazonlinux/amazonlinux:latest sh

If you already have a Pod, you can use the following command to get a shell to a running container.

kubectl attach amazon-linux -c amazon-linux -i -t
Now you can test access to a non-allowed website in the AWS Network Firewall stateful rules, using these steps.
1. First, make sure the cURL tool is available on the sample Pod you created previously. cURL is a command-line tool for getting or sending data, including files, using URL syntax. Because cURL uses the libcurl library, it supports every protocol libcurl supports. The sample Pod runs an Amazon Linux image, whose package manager is yum (apk is the Alpine Linux package manager). On the Pod where you have obtained a shell to a running container, run the following command to install cURL if it is not already present.
  yum install -y curl
2. Access a website using cURL.
  curl -I https://aws.amazon.com
  
  This gives a timeout error similar to the following.
  
  curl -I https://aws.amazon.com
  curl: (28) Operation timed out after 300476 milliseconds with 0 out of 0 bytes received
3. Navigate to the Amazon CloudWatch console and check the alert logs for Network Firewall. You will see a log entry like the following sample, indicating that the access to https://aws.amazon.com was blocked.
```
{
    "firewall_name": "AWS-Network-Firewall-Multi-AZ-firewall",
    "availability_zone": "us-east-1a",
    "event_timestamp": "1623651293",
    "event": {
        "timestamp": "2021-06-14T06:14:53.483069+0000",
        "flow_id": 649458981081302,
        "event_type": "alert",
        "src_ip": "xxx.xxx.xxx.xxx",
        "src_port": xxxxx,
        "dest_ip": "xxx.xxx.xxx.xxx",
        "dest_port": 443,
        "proto": "TCP",
        "alert": {
            "action": "blocked",
            "signature_id": 4,
            "rev": 1,
            "signature": "not matching any TLS allowlisted FQDNs",
            "category": "",
            "severity": 1
        },
        "tls": {
            "sni": "aws.amazon.com",
            "version": "UNDETERMINED",
            "ja3": {},
            "ja3s": {}
        },
        "app_proto": "tls"
    }
}
```
  The error shown here occurred because the hostname aws.amazon.com was not added to the Network Firewall stateful rules allow list.
  
  When you deployed the network firewall in step 1 of the To deploy the VPC architecture procedure, the values provided for the CloudFormation parameter NetworkFirewallAllowedWebsites were just .amazonaws.com, .docker.io, .docker.com and not aws.amazon.com.
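
The allow-list decision can be illustrated with a small sketch. It assumes the matching semantics that the Network Firewall documentation describes for domain lists: an entry with a leading dot, such as .example.com, matches example.com and all of its subdomains, and an entry without a leading dot matches only that exact name. The function name sni_allowed is hypothetical, and this is a simplified model rather than the firewall's implementation.

```python
def sni_allowed(hostname, allow_list):
    """Return True if `hostname` matches an entry in a domain allow list."""
    hostname = hostname.lower().rstrip(".")
    for entry in allow_list:
        entry = entry.lower()
        if entry.startswith("."):
            # ".example.com" covers example.com itself and any subdomain of it.
            if hostname == entry[1:] or hostname.endswith(entry):
                return True
        elif hostname == entry:
            return True
    return False


# The initial allow list from the CloudFormation parameter.
allow_list = [".amazonaws.com", ".docker.io", ".docker.com"]
print(sni_allowed("s3.us-east-1.amazonaws.com", allow_list))
print(sni_allowed("aws.amazon.com", allow_list))
```

Note that aws.amazon.com is a subdomain of amazon.com, not of amazonaws.com, which is why the first request in this test is blocked.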

Update the Network Firewall stateful rules

In this procedure, you’ll update the Network Firewall stateful rules to allow the aws.amazon.com domain name.

To update the Network Firewall stateful rules (console)

In the AWS CloudFormation console, locate the stack you used to create the network firewall earlier in the To deploy the VPC architecture procedure.
Select the stack you want to update, and choose Update. In the Parameters section, update the stack by adding the hostname aws.amazon.com to the parameter NetworkFirewallAllowedWebsites as a comma-separated value. See Updating stacks directly in the AWS CloudFormation User Guide for more information on stack updates.

Re-test from the sample pod

In this step, you’ll test the outbound access once again from the sample Pod you created earlier in the To deploy a sample Pod on the EKS cluster procedure.

To test the outbound access to the aws.amazon.com hostname

Get a shell to a running container in the sample Pod that you deployed earlier, by using the following command.
kubectl attach amazon-linux -c amazon-linux -i -t
On the terminal where you got a shell to a running container in the sample Pod, run the following cURL command.
curl -I https://aws.amazon.com
The response should be a success HTTP 200 OK message similar to this one.
curl -Ik https://aws.amazon.com
HTTP/2 200
content-type: text/html;charset=UTF-8
server: Server

If the VPC subnets are organized according to the architecture suggested in this solution, outbound traffic from the EKS cluster can be sent to the network firewall and then filtered based on hostnames provided by SNI.

Collecting hostnames provided by the SNI

In this step, you’ll see how to configure the network firewall to collect all the hostnames provided by SNI that are accessed by an already running application—without blocking any access—by making use of CloudWatch and alert logs.

To configure the network firewall (console)

In the AWS CloudFormation console, locate the stack that created the network firewall earlier in the To deploy the VPC architecture procedure.
Select the stack to update, and then choose Update.
Choose Replace current template and upload the template network-firewall-eks-collect-all.yaml. (This template should be available from the files that you downloaded earlier from the S3 bucket in the Prerequisites section.) Choose Next. See Updating stacks directly for more information.

To configure the network firewall (AWS CLI)

Update the CloudFormation stack by using the network-firewall-eks-collect-all.yaml template file that you previously downloaded from the S3 bucket in the Prerequisites section, using the update-stack command as follows.
aws cloudformation update-stack --stack-name AWS-Network-Firewall-Multi-AZ \
  --template-body file://network-firewall-eks-collect-all.yaml \
  --capabilities CAPABILITY_NAMED_IAM

To check the rules in the AWS Management Console

In the AWS Management Console, navigate to the Amazon VPC console and locate the AWS Network Firewall tab.
Select the network firewall that you created earlier, and then select the stateful rule with the name log-all-tls.
The rule group should appear as shown in Figure 4, indicating that the logs are captured and sent to the Alert logs.

Figure 4: Network Firewall rule groups

To test based on stateful rule

On the terminal, get the shell for the running container in the Pod you created earlier. If this Pod is not available, follow the instructions in the To deploy a sample Pod on the EKS cluster procedure to create a new sample Pod.
Run the cURL command to aws.amazon.com. It should return HTTP 200 OK, as follows.
curl -Ik https://aws.amazon.com/
HTTP/2 200
content-type: text/html;charset=UTF-8
server: Server
date: ------ ---------- --------------

Navigate to the Amazon CloudWatch Logs console and look up the Alert logs log group with the name /AWS-Network-Firewall-Multi-AZ/anfw/alert.

You can see the hostnames provided by SNI within the TLS protocol passing through the network firewall. The CloudWatch Alert logs for allowed hostnames in the SNI look like the following example.

{
    "firewall_name": "AWS-Network-Firewall-Multi-AZ-firewall",
    "availability_zone": "us-east-1b",
    "event_timestamp": "1627283521",
    "event": {
        "timestamp": "2021-07-26T07:12:01.304222+0000",
        "flow_id": 1977082435410607,
        "event_type": "alert",
        "src_ip": "xxx.xxx.xxx.xxx",
        "src_port": xxxxx,
        "dest_ip": "xxx.xxx.xxx.xxx",
        "dest_port": 443,
        "proto": "TCP",
        "alert": {
            "action": "allowed",
            "signature_id": 2,
            "rev": 0,
            "signature": "",
            "category": "",
            "severity": 3
        },
        "tls": {
            "subject": "CN=aws.amazon.com",
            "issuerdn": "C=US, O=Amazon, OU=Server CA 1B, CN=Amazon",
            "serial": "08:13:34:34:48:07:64:27:4D:BC:CB:14:4D:AF:F2:11",
            "fingerprint": "f7:53:97:5e:76:1e:fb:f6:70:72:02:95:d5:9f:2f:05:52:79:5d:ae",
            "sni": "aws.amazon.com",
            "version": "TLS 1.2",
            "notbefore": "2020-09-30T00:00:00",
            "notafter": "2021-09-23T12:00:00",
            "ja3": {},
            "ja3s": {}
        },
        "app_proto": "tls"
    }
}
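
When you review many of these records, only a few fields matter for building an allow list. The following is a minimal sketch that reduces one alert record to the firewall name, the action, and the SNI hostname. The sample record is a trimmed-down version of the log entry above (the redacted fields are omitted so that it is valid JSON), and the function name summarize_alert is hypothetical.

```python
import json


def summarize_alert(record_json):
    """Extract the firewall name, alert action, and SNI hostname from one alert record."""
    record = json.loads(record_json)
    event = record["event"]
    return {
        "firewall": record["firewall_name"],
        "action": event.get("alert", {}).get("action"),
        "sni": event.get("tls", {}).get("sni"),  # None for non-TLS traffic
    }


# Trimmed sample based on the alert log shown above.
sample = json.dumps({
    "firewall_name": "AWS-Network-Firewall-Multi-AZ-firewall",
    "event": {
        "event_type": "alert",
        "alert": {"action": "allowed", "signature_id": 2},
        "tls": {"sni": "aws.amazon.com", "version": "TLS 1.2"},
        "app_proto": "tls",
    },
})
print(summarize_alert(sample))
```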

Optionally, you can also create an AWS Lambda function to collect the hostnames that are passed through the network firewall.

To create a Lambda function to collect hostnames provided by SNI (optional)

Create subscriptions for one or more log streams to invoke a function when logs are created or match an optional pattern.
For more information, see Using Lambda with CloudWatch Logs. Figure 5 is an example architecture in which a Lambda function extracts the hostnames provided by SNI and sends them to an Amazon Simple Notification Service (Amazon SNS) topic, which alerts the topic’s subscribers.

Figure 5: Architecture to collect and capture hostnames by using Network Firewall

Sample Lambda code

The sample Lambda code from Figure 5 is shown following, and is written in Python 3. The sample collects the hostnames that are provided by SNI and captured in Network Firewall. Network Firewall logs the hostnames provided by SNI in the CloudWatch Alert logs. Then, by creating a CloudWatch logs subscription filter, you can send logs to the Lambda function for further processing, for example to invoke SNS notifications.

import base64
import gzip
import json
import traceback

import boto3

sns_client = boto3.client('sns')


def lambda_handler(event, context):
    try:
        # CloudWatch Logs delivers subscription data base64-encoded and gzip-compressed.
        decoded_event = json.loads(
            gzip.decompress(base64.b64decode(event['awslogs']['data'])))
        # Each log event message is one Network Firewall alert record in JSON.
        # As in the original sample, only the first log event is processed.
        filter_match = json.loads(decoded_event['logEvents'][0]['message'])
        # print(filter_match)  # uncomment this for debugging
        # Prefer the HTTP hostname when present; otherwise use the TLS SNI.
        if 'http' in filter_match['event']:
            hostname = filter_match['event']['http']['hostname']
        elif 'tls' in filter_match['event']:
            hostname = filter_match['event']['tls']['sni']
        else:
            print('No HTTP or TLS hostname found in the log event.')
            return
        result = ('Trying to reach ' + hostname + ' via Network Firewall '
                  + filter_match['firewall_name'])
        # print(result)  # uncomment this for debugging
        message = {'HostName': result}
        sns_client.publish(
            TargetArn='<SNS-topic-ARN>',  # Replace with the SNS topic ARN
            Message=json.dumps({'default': json.dumps(message),
                                'sms': json.dumps(message),
                                'email': json.dumps(message)}),
            Subject='Trying to reach the hostname through the Network Firewall',
            MessageStructure='json')
    except Exception:
        # Log the failure and the stack trace to CloudWatch Logs.
        print('Function failed due to exception.')
        traceback.print_exc()
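
To exercise the decoding step locally, you can build a test event in the same envelope that CloudWatch Logs uses for subscription filters: a JSON payload that is gzip-compressed and then base64-encoded under awslogs.data. The sketch below is one way to do that with the standard library only. The helper names, the log stream name, and the sample alert are assumptions for illustration.

```python
import base64
import gzip
import json


def make_awslogs_event(log_group, log_stream, messages):
    """Wrap log messages in the envelope CloudWatch Logs sends to a subscribed Lambda function."""
    payload = {
        "messageType": "DATA_MESSAGE",
        "logGroup": log_group,
        "logStream": log_stream,
        "logEvents": [
            {"id": str(i), "timestamp": 0, "message": m}
            for i, m in enumerate(messages)
        ],
    }
    data = base64.b64encode(gzip.compress(json.dumps(payload).encode("utf-8")))
    return {"awslogs": {"data": data.decode("ascii")}}


def decode_awslogs_event(event):
    """Reverse the envelope: base64-decode, gunzip, then parse the JSON payload."""
    return json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))


# Hypothetical alert record carrying only the fields the handler reads.
alert = json.dumps({
    "firewall_name": "AWS-Network-Firewall-Multi-AZ-firewall",
    "event": {"tls": {"sni": "aws.amazon.com"}},
})
test_event = make_awslogs_event(
    "/AWS-Network-Firewall-Multi-AZ/anfw/alert", "example-stream", [alert])
decoded = decode_awslogs_event(test_event)
print(json.loads(decoded["logEvents"][0]["message"])["event"]["tls"]["sni"])
```

An event built this way can be passed to lambda_handler in a local test, provided that the SNS client is stubbed or pointed at a real topic.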

Clean up

In this step, you’ll clean up the infrastructure that was created as part of this solution.

To delete the Kubernetes workloads

On the terminal, using the kubectl CLI tool, run the following command to delete the sample Pod that you created earlier.
kubectl delete pods amazon-linux

Note: Clean up all the Kubernetes workloads running on the EKS cluster. For example, if the Kubernetes service of type LoadBalancer is deployed, and if the EKS cluster where it exists is deleted, the LoadBalancer will not be deleted. The best practice is to clean up all the deployed workloads.
On the terminal, using the eksctl CLI tool, delete the created EKS cluster by using the following command.
eksctl delete cluster --name filter-egress-traffic-test

To delete the CloudFormation stack and AWS Network Firewall

Navigate to the AWS CloudFormation console and choose the stack with the name AWS-Network-Firewall-Multi-AZ.
Choose Delete, and then at the prompt choose Delete Stack. For more information, see Deleting a stack on the AWS CloudFormation console.

Conclusion

By following the VPC architecture explained in this blog post, you can protect the applications running on an Amazon EKS cluster by filtering the outbound traffic based on the approved hostnames that are provided by SNI in the Network Firewall Allow list.

Additionally, with a simple Lambda function, CloudWatch Logs, and an SNS topic, you can get readable hostnames provided by the SNI. Using these hostnames, you can learn about the traffic pattern for the applications that are running within the EKS cluster, and later create a strict list to allow only the required outbound traffic. To learn more about Network Firewall stateful rules, see Working with stateful rule groups in AWS Network Firewall in the AWS Network Firewall Developer Guide.

Multi-Region Terraform Deployments with AWS CodePipeline using Terraform Built CI/CD

2022-06-24 Lerna Ekmekcioglu

Post Syndicated from Lerna Ekmekcioglu original https://aws.amazon.com/blogs/devops/multi-region-terraform-deployments-with-aws-codepipeline-using-terraform-built-ci-cd/

As of February 2022, the AWS Cloud spans 84 Availability Zones within 26 geographic Regions, with announced plans for more Availability Zones and Regions. Customers can leverage this global infrastructure to expand their presence to their primary target users, satisfy data residency requirements, and implement a disaster recovery strategy to ensure business continuity. Although a multi-Region architecture would address these requirements, deploying and configuring consistent infrastructure stacks across multiple Regions can be challenging, as AWS Regions are designed to be autonomous in nature. Multi-Region deployments with Terraform and AWS CodePipeline can help customers with these challenges.

In this post, we’ll demonstrate best practices for multi-Region deployments using HashiCorp Terraform as infrastructure as code (IaC), and AWS CodeBuild and AWS CodePipeline as continuous integration and continuous delivery (CI/CD), for consistency and repeatability of deployments into multiple AWS Regions and AWS accounts. We’ll dive deep on the IaC deployment pipeline architecture and the best practices for structuring the Terraform project and configuration for multi-Region deployment to multiple AWS target accounts.

You can find the sample code for this solution here

Solutions Overview

Architecture

The following architecture diagram illustrates the main components of the multi-Region Terraform deployment pipeline with all of the resources built using IaC.

A DevOps engineer initially works against the infrastructure repo in a short-lived branch. Once changes in the short-lived branch are ready, the DevOps engineer gets them reviewed and merged into the main branch. Then, the DevOps engineer git tags the repo. For any future changes in the infra repo, the DevOps engineer repeats this same process.

Git tags named “dev_us-east-1/research/1.0”, “dev_eu-central-1/research/1.0”, “dev_ap-southeast-1/research/1.0”, “dev_us-east-1/risk/1.0”, “dev_eu-central-1/risk/1.0”, “dev_ap-southeast-1/risk/1.0” corresponding to the version 1.0 of the code to release from the main branch using git tagging. Short-lived branch in between each version of the code, followed by git tags corresponding to each subsequent version of the code such as version 1.1 and version 2.0.

Fig 1. Tagging to release from the main branch.

The deployment is triggered when the DevOps engineer git tags the repo, which contains the Terraform code to be deployed. This action starts the deployment pipeline execution.
Tagging with ‘dev_us-east-1/research/1.0’ triggers a pipeline to deploy the research dev account to us-east-1. In our example, the git tag ‘dev_us-east-1/research/1.0’ contains the target environment (i.e., dev), AWS Region (i.e., us-east-1), team (i.e., research), and a version number (i.e., 1.0) that maps to an annotated tag on a commit ID. The target workload account aliases (e.g., research dev, risk qa) are mapped to AWS account numbers in the environment configuration files of the infra repo in AWS CodeCommit.

The central tooling account contains the CodeCommit Terraform infra repo, where DevOps engineer has git access, along with the pipeline trigger, the CodePipeline dev pipeline consisting of the S3 bucket with Terraform infra repo and git tag, CodeBuild terraform tflint scan, checkov scan, plan and apply. Terraform apply points using the cross account role to VPC containing an Application Load Balancer (ALB) in eu-central-1 in the dev target workload account. A qa pipeline, a staging pipeline, a prod pipeline are included along with a qa target workload account, a staging target workload account, a prod target workload account. EventBridge, Key Management Service, CloudTrail, CloudWatch in us-east-1 Region are in the central tooling account along with Identity Access Management service. In addition, the dev target workload account contains us-east-1 and ap-southeast-1 VPC’s each with an ALB as well as Identity Access Management.

Fig 2. Multi-Region AWS deployment with IaC and CI/CD pipelines.

To capture the exact git tag that starts a pipeline, we use an Amazon EventBridge rule. The rule is triggered when the tag is created with an environment prefix for deploying to a respective environment (i.e., dev). The rule kicks off an AWS CodeBuild project that takes the git tag from the AWS CodeCommit event and stores it with a full clone of the repo into a versioned Amazon Simple Storage Service (Amazon S3) bucket for the corresponding environment.
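
As a rough illustration of the rule described above, the following Python sketch shows what an event pattern for dev tags might look like, together with a tiny local matcher for trying it out. The field names (referenceCreated, referenceType, referenceName) follow the CodeCommit repository state change event format as we understand it, and the "dev_" prefix comes from the tag convention in this post. The matcher is a simplified stand-in, not EventBridge itself, so treat the whole block as an assumption to verify against your own events.

```python
# Hypothetical event pattern: match creation of a git tag whose name starts with "dev_".
# Field names are assumed from the CodeCommit "Repository State Change" event format.
DEV_TAG_PATTERN = {
    "source": ["aws.codecommit"],
    "detail-type": ["CodeCommit Repository State Change"],
    "detail": {
        "event": ["referenceCreated"],
        "referenceType": ["tag"],
        "referenceName": [{"prefix": "dev_"}],
    },
}

# Hypothetical, trimmed-down event used only for local experimentation.
SAMPLE_EVENT = {
    "source": "aws.codecommit",
    "detail-type": "CodeCommit Repository State Change",
    "detail": {
        "event": "referenceCreated",
        "referenceType": "tag",
        "referenceName": "dev_us-east-1/research/1.0",
    },
}


def matches(pattern, event):
    """Simplified local matcher: supports exact values and {"prefix": ...} only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
            continue
        found = False
        for candidate in expected:
            if isinstance(candidate, dict) and "prefix" in candidate:
                found = found or (
                    isinstance(actual, str) and actual.startswith(candidate["prefix"])
                )
            else:
                found = found or actual == candidate
        if not found:
            return False
    return True
```

With one such pattern per environment prefix (dev_, qa_, and so on), each tag would start only the pipeline for its own environment.
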
We have a continuous delivery pipeline defined in AWS CodePipeline. To make sure that the pipelines for each environment run independently of each other, we use a separate pipeline per environment. Each pipeline consists of three stages in addition to the Source stage:

1. IaC linting stage – A stage for linting Terraform code. For illustration purposes, we’ll use the open source tool tflint.
2. IaC security scanning stage – A stage for static security scanning of Terraform code. There are many tooling choices when it comes to the security scanning of Terraform code. Checkov, TFSec, and Terrascan are the commonly used tools. For illustration purposes, we’ll use the open source tool Checkov.
3. IaC build stage – A stage for Terraform build. This includes an action for the Terraform execution plan followed by an action to apply the plan to deploy the stack to a specific Region in the target workload account.

Once the Terraform apply is triggered, it deploys the infrastructure components in the target workload account to the AWS Region based on the git tag. In turn, you have the flexibility to point the deployment to any AWS Region or account configured in the repo.
The sample infrastructure in the target workload account consists of an AWS Identity and Access Management (IAM) role, an external facing Application Load Balancer (ALB), as well as all of the required resources down to the Amazon Virtual Private Cloud (Amazon VPC). Upon successful deployment, browsing to the external facing ALB DNS Name URL displays a very simple message including the location of the Region.

Architectural considerations

Multi-account strategy

Leveraging a well-architected multi-account strategy, we have a separate central tooling account for housing the code repository and infrastructure pipeline, and a separate target workload account to house our sample workload infrastructure. The clean account separation lets us easily control the IAM permissions for granular access and apply different guardrails and security controls. Ultimately, this enforces the separation of concerns and minimizes the blast radius.

A dev pipeline, a qa pipeline, a staging pipeline, and a prod pipeline in the central tooling account, each targeting the workload account for the respective environment pointing to the Regional resources containing a VPC and an ALB.

Fig 3. A separate pipeline per environment.

The sample architecture shown above contained a pipeline per environment (DEV, QA, STAGING, PROD) in the tooling account deploying to the target workload account for the respective environment. At scale, you can consider having multiple infrastructure deployment pipelines for multiple business units in the central tooling account, thereby targeting workload accounts per environment and business unit. If your organization has a complex business unit structure and is bound to have different levels of compliance and security controls, then the central tooling account can be further divided into the central tooling accounts per business unit.

Pipeline considerations

The infrastructure deployment pipeline is hosted in a central tooling account and targets workload accounts. The pipeline is the authoritative source managing the full lifecycle of resources. The goal is to decrease the risk of ad hoc changes (e.g., manual changes made directly via the console) that can’t be easily reproduced at a future date. The pipeline and the build step each run as their own IAM role that adheres to the principle of least privilege. The pipeline is configured with a stage to lint the Terraform code, as well as a static security scan of the Terraform resources following the principle of shifting security left in the SDLC.

As a further improvement for resiliency and applying the cell architecture principle to the CI/CD deployment, we can consider having multi-Region deployment of the AWS CodePipeline pipeline and AWS CodeBuild build resources, in addition to a clone of the AWS CodeCommit repository. We can use the approach detailed in this post to sync the repo across multiple regions. This means that both the workload architecture and the deployment infrastructure are multi-Region. However, it’s important to note that the business continuity requirements of the infrastructure deployment pipeline are most likely different than the requirements of the workloads themselves.

A dev pipeline in us-east-1, a dev pipeline in eu-central-1, a dev pipeline in ap-southeast-1, all in the central tooling account, each pointing respectively to the regional resources containing a VPC and an ALB for the respective Region in the dev target workload account.

Fig 4. Multi-Region CI/CD dev pipelines targeting the dev workload account resources in the respective Region.

Deeper dive into Terraform code

Backend configuration and state

As a prerequisite, we created Amazon S3 buckets to store the Terraform state files and Amazon DynamoDB tables for the state file locks. The latter is a best practice to prevent concurrent operations on the same state file. For naming the buckets and tables, our code expects the use of the same prefix (i.e., <tf_backend_config_prefix>-<env> for buckets and <tf_backend_config_prefix>-lock-<env> for tables). The value of this prefix must be passed in as an input param (i.e., “tf_backend_config_prefix”). Then, it’s fed into AWS CodeBuild actions for Terraform as an environment variable. Separation of remote state management resources (Amazon S3 bucket and Amazon DynamoDB table) across environments makes sure that we’re minimizing the blast radius.


-backend-config="bucket=${TF_BACKEND_CONFIG_PREFIX}-${ENV}" 
-backend-config="dynamodb_table=${TF_BACKEND_CONFIG_PREFIX}-lock-${ENV}"

A dev Terraform state files bucket named <prefix>-dev, a dev Terraform state locks DynamoDB table named <prefix>-lock-dev, a qa Terraform state files bucket named <prefix>-qa, a qa Terraform state locks DynamoDB table named <prefix>-lock-qa, a staging Terraform state files bucket named <prefix>-staging, a staging Terraform state locks DynamoDB table named <prefix>-lock-staging, a prod Terraform state files bucket named <prefix>-prod, a prod Terraform state locks DynamoDB table named <prefix>-lock-prod, in us-east-1 in the central tooling account.

Fig 5. Terraform state file buckets and state lock tables per environment in the central tooling account.

The git tag that kicks off the pipeline is named with the following convention of “<env>_<region>/<team>/<version>” for regional deployments and “<env>_global/<team>/<version>” for global resource deployments. The stage following the source stage in our pipeline, tflint stage, is where we parse the git tag. From the tag, we derive the values of environment, deployment scope (i.e., Region or global), and team to determine the Terraform state Amazon S3 object key uniquely identifying the Terraform state file for the deployment. The values of environment, deployment scope, and team are passed as environment variables to the subsequent AWS CodeBuild Terraform plan and apply actions.

-backend-config="key=${TEAM}/${ENV}-${TARGET_DEPLOYMENT_SCOPE}/terraform.tfstate"
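
The tag parsing described above can be sketched in a few lines of Python. This is a minimal illustration of the naming convention only. The names ReleaseTag, parse_release_tag, and state_key are hypothetical and do not come from the sample repository, which performs this step inside its buildspec.

```python
from typing import NamedTuple


class ReleaseTag(NamedTuple):
    """Parts of a tag named "<env>_<scope>/<team>/<version>" (hypothetical helper type)."""
    env: str
    scope: str    # an AWS Region name such as "us-east-1", or "global"
    team: str
    version: str


def parse_release_tag(tag: str) -> ReleaseTag:
    """Split a git tag such as "dev_us-east-1/research/1.0" into its parts."""
    try:
        target, team, version = tag.split("/")
        env, scope = target.split("_", 1)
    except ValueError:
        raise ValueError(f"unexpected tag format: {tag!r}") from None
    if not all((env, scope, team, version)):
        raise ValueError(f"unexpected tag format: {tag!r}")
    return ReleaseTag(env, scope, team, version)


def state_key(tag: ReleaseTag) -> str:
    """Build the S3 object key that identifies the Terraform state file."""
    return f"{tag.team}/{tag.env}-{tag.scope}/terraform.tfstate"


print(state_key(parse_release_tag("dev_us-east-1/research/1.0")))
# -> research/dev-us-east-1/terraform.tfstate
```

Rejecting malformed tags early keeps a mistyped tag from silently pointing a deployment at the wrong state file.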

We set the Region to the value of AWS_REGION env variable that is made available by AWS CodeBuild, and it’s the Region in which our build is running.

-backend-config="region=$AWS_REGION"

The following is how the Terraform backend config initialization looks in our AWS CodeBuild buildspec files for Terraform actions, such as tflint, plan, and apply.

terraform init \
  -backend-config="key=${TEAM}/${ENV}-${TARGET_DEPLOYMENT_SCOPE}/terraform.tfstate" \
  -backend-config="region=$AWS_REGION" \
  -backend-config="bucket=${TF_BACKEND_CONFIG_PREFIX}-${ENV}" \
  -backend-config="dynamodb_table=${TF_BACKEND_CONFIG_PREFIX}-lock-${ENV}" \
  -backend-config="encrypt=true"
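
To make the naming convention easier to check outside a build, here is a small Python sketch that assembles the same init arguments from plain values. The function name terraform_init_args and the example prefix "acme" are hypothetical. The flag names and value formats mirror the command above.

```python
import shlex
from typing import List


def terraform_init_args(prefix: str, env: str, key: str, region: str) -> List[str]:
    """Assemble `terraform init` arguments for the per-environment backend.

    prefix : value of tf_backend_config_prefix (hypothetical example: "acme")
    env    : target environment, e.g. "dev"
    key    : S3 object key of the state file
    region : Region in which the build runs and the state resources live
    """
    settings = {
        "key": key,
        "region": region,
        "bucket": f"{prefix}-{env}",               # <prefix>-<env>
        "dynamodb_table": f"{prefix}-lock-{env}",  # <prefix>-lock-<env>
        "encrypt": "true",
    }
    flags = [f"-backend-config={name}={value}" for name, value in settings.items()]
    return ["terraform", "init"] + flags


args = terraform_init_args(
    "acme", "dev", "research/dev-us-east-1/terraform.tfstate", "us-east-1"
)
# shlex.join (Python 3.8+) renders the argument list as a shell-safe command line.
print(shlex.join(args))
```

Keeping the arguments as a list avoids shell quoting issues if the command is later passed to a process launcher instead of being printed.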

Using this approach, the Terraform states for each combination of account and Region are kept in their own distinct state file. This means that if there is an issue with one Terraform state file, then the rest of the state files aren’t impacted.

In the central tooling account us-east-1 Region, Terraform state files named “research/dev-us-east-1/terraform.tfstate”, “risk/dev-ap-southeast-1/terraform.tfstate”, “research/dev-eu-central-1/terraform.tfstate”, “research/dev-global/terraform.tfstate” are in S3 bucket named <prefix>-dev along with DynamoDB table for Terraform state locks named <prefix>-lock-dev. The Terraform state files named “research/qa-us-east-1/terraform.tfstate”, “risk/qa-ap-southeast-1/terraform.tfstate”, “research/qa-eu-central-1/terraform.tfstate” are in S3 bucket named <prefix>-qa along with DynamoDB table for Terraform state locks named <prefix>-lock-qa. Similarly for staging and prod.

Fig 6. Terraform state files per account and Region for each environment in the central tooling account

Following the example, a git tag of the form “dev_us-east-1/research/1.0” that kicks off the dev pipeline works against the research team’s dev account’s state file containing us-east-1 Regional resources (i.e., Amazon S3 object key “research/dev-us-east-1/terraform.tfstate” in the S3 bucket <tf_backend_config_prefix>-dev), and a git tag of the form “dev_ap-southeast-1/risk/1.0” that kicks off the dev pipeline works against the risk team’s dev account’s Terraform state file containing ap-southeast-1 Regional resources (i.e., Amazon S3 object key “risk/dev-ap-southeast-1/terraform.tfstate”). For global resources, we use a git tag of the form “dev_global/research/1.0” that kicks off a dev pipeline and works against the research team’s dev account’s global resources, as they are at account level (i.e., “research/dev-global/terraform.tfstate”).

Git tag “dev_us-east-1/research/1.0” pointing to the Terraform state file named “research/dev-us-east-1/terraform.tfstate”, git tag “dev_ap-southeast-1/risk/1.0” pointing to “risk/dev-ap-southeast-1/terraform.tfstate”, git tag “dev_eu-central-1/research/1.0” pointing to “research/dev-eu-central-1/terraform.tfstate”, git tag “dev_global/research/1.0” pointing to “research/dev-global/terraform.tfstate”, in dev Terraform state files S3 bucket named <prefix>-dev along with <prefix>-lock-dev DynamoDB dev Terraform state locks table.

Fig 7. Git tags and the respective Terraform state files.

This backend configuration makes sure that the state file for one account and Region is independent of the state file for the same account but different Region. Adding or expanding the workload to additional Regions would have no impact on the state files of existing Regions.

If we look at the further improvement where we make our deployment infrastructure also multi-Region, then we can consider each Region’s CI/CD deployment to be the authoritative source for its local Region’s deployments and Terraform state files. In this case, tagging against the repo triggers a pipeline within the local CI/CD Region to deploy resources in the Region. The Terraform state files in the local Region are used for keeping track of state for the account’s deployment within the Region. This further decreases cross-regional dependencies.

A dev pipeline in the central tooling account in us-east-1, pointing to the VPC containing ALB in us-east-1 in dev target workload account, along with a dev Terraform state files S3 bucket named <prefix>-use1-dev containing us-east-1 Regional resources “research/dev/terraform.tfstate” and “risk/dev/terraform.tfstate” Terraform state files along with DynamoDB dev Terraform state locks table named <prefix>-use1-lock-dev. A dev pipeline in the central tooling account in eu-central-1, pointing to the VPC containing ALB in eu-central-1 in dev target workload account, along with a dev Terraform state files S3 bucket named <prefix>-euc1-dev containing eu-central-1 Regional resources “research/dev/terraform.tfstate” and “risk/dev/terraform.tfstate” Terraform state files along with DynamoDB dev Terraform state locks table named <prefix>-euc1-lock-dev. A dev pipeline in the central tooling account in ap-southeast-1, pointing to the VPC containing ALB in ap-southeast-1 in dev target workload account, along with a dev Terraform state files S3 bucket named <prefix>-apse1-dev containing ap-southeast-1 Regional resources “research/dev/terraform.tfstate” and “risk/dev/terraform.tfstate” Terraform state files along with DynamoDB dev Terraform state locks table named <prefix>-apse1-lock-dev.

Fig 8. Multi-Region CI/CD with Terraform state resources stored in the same Region as the workload account resources for the respective Region

Provider

For deployments, we use the default Terraform AWS provider. The provider is parametrized with the value of the region passed in as an input parameter.

provider "aws" {
  region = var.region
   ...
}

Once the provider knows which Region to target, we can refer to the current AWS Region in the rest of the code.

# The value of the current AWS region is the name of the AWS region configured on the provider
# https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/region
data "aws_region" "current" {} 

locals {
    region = data.aws_region.current.name # then use local.region where region is needed
}

The provider is configured to assume a cross-account IAM role defined in the workload account. The account ID is passed in as an input parameter.

provider "aws" {
  region = var.region
  assume_role {
    role_arn     = "arn:aws:iam::${var.account}:role/InfraBuildRole"
    session_name = "INFRA_BUILD"
  }
}

This InfraBuildRole IAM role could be created as part of the account creation process. The AWS Control Tower Terraform Account Factory could be used to automate this.

Code

Minimize cross-regional dependencies

We keep the Regional resources and the global resources (e.g., IAM role or policy) in distinct namespaces following the cell architecture principle. We treat each Region as one cell, with the goal of decreasing cross-regional dependencies. Regional resources are created once in each Region. On the other hand, global resources are created once globally and may have cross-regional dependencies (e.g., DynamoDB global table with a replica table in multiple Regions). There’s no “global” Terraform AWS provider since the AWS provider requires a Region. This means that we pick a specific Region from which to deploy our global resources (i.e., global_resource_deploy_from_region input param). By creating a distinct Terraform namespace for Regional resources (e.g., module.regional) and a distinct namespace for global resources (e.g., module.global), we can target a deployment for each using pipelines scoped to the respective namespace (e.g., module.global or module.regional).
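
One way to picture the scoping is a helper that maps the deployment scope from the git tag to a Terraform module address and a provider Region. The sketch below is an assumption about how a pipeline could do this, not the sample repository's actual implementation. The function name terraform_plan_args is hypothetical, while module.global, module.regional, the region variable, and the idea of a fixed Region for global resources come from the text above. Terraform's -target option is a real flag, but its documentation positions it for exceptional use, so treat this purely as an illustration.

```python
from typing import List


def terraform_plan_args(scope: str, global_deploy_region: str) -> List[str]:
    """Choose the module namespace and provider Region for a deployment scope.

    scope                : "global" or an AWS Region name taken from the git tag
    global_deploy_region : Region from which global resources are deployed
                           (the global_resource_deploy_from_region input)
    """
    if scope == "global":
        module, region = "module.global", global_deploy_region
    else:
        module, region = "module.regional", scope
    # "-var" followed by "name=value" is the documented way to pass a single variable.
    return ["terraform", "plan", f"-target={module}", "-var", f"region={region}"]


print(terraform_plan_args("eu-central-1", "us-east-1"))
print(terraform_plan_args("global", "us-east-1"))
```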

Deploying Regional resources: A dev pipeline in the central tooling account triggered via git tag “dev_eu-central-1/research/1.0” pointing to the eu-central-1 VPC containing ALB in the research dev target workload account corresponding to the module.regional Terraform namespace. Deploying global resources: a dev pipeline in the central tooling account triggered via git tag “dev_global/research/1.0” pointing to the IAM resource corresponding to the module.global Terraform namespace.

Fig 9. Deploying regional and global resources scoped to the Terraform namespace

As global resources have a scope of the whole account regardless of Region while Regional resources are scoped for the respective Region in the account, one point of consideration and a trade-off with having to pick a Region to deploy global resources is that this introduces a dependency on that region for the deployment of the global resources. In addition, in the case of a misconfiguration of a global resource, there may be an impact to each Region in which we deployed our workloads. Let’s consider a scenario where an IAM role has access to an S3 bucket. If the IAM role is misconfigured as a result of one of the deployments, then this may impact access to the S3 bucket in each Region.

There are alternate approaches, such as creating an IAM role per Region (myrole-use1 with access to the S3 bucket in us-east-1, myrole-apse1 with access to the S3 bucket in ap-southeast-1, etc.). This would make sure that if the respective IAM role is misconfigured, then the impact is scoped to the Region. Another approach is versioning our global resources (e.g., myrole-v1, myrole-v2) with the ability to move to a new version and roll back to a previous version if needed. Each of these approaches has different drawbacks, such as the duplication of global resources that may make auditing more cumbersome with the tradeoff of minimizing cross Regional dependencies.
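
The per-Region naming used in these examples (use1, apse1, euc1) can be derived mechanically from the Region name. The following sketch shows one possible derivation. The helper names are hypothetical, and the abbreviation table is our assumption, chosen so that it reproduces the three short codes that appear in this post. Region names that don't follow the three-part pattern are rejected instead of guessed.

```python
# Assumed abbreviations for the direction part of a Region name.
_DIRECTIONS = {
    "north": "n", "south": "s", "east": "e", "west": "w", "central": "c",
    "northeast": "ne", "northwest": "nw", "southeast": "se", "southwest": "sw",
}


def region_short_code(region: str) -> str:
    """Abbreviate a Region name, e.g. "us-east-1" -> "use1"."""
    parts = region.split("-")
    if len(parts) != 3 or parts[1] not in _DIRECTIONS:
        raise ValueError(f"unsupported Region name: {region!r}")
    area, direction, number = parts
    return f"{area}{_DIRECTIONS[direction]}{number}"


def regional_role_name(base: str, region: str) -> str:
    """Name for a per-Region copy of a role, e.g. "myrole-use1"."""
    return f"{base}-{region_short_code(region)}"


print(regional_role_name("myrole", "us-east-1"))       # -> myrole-use1
print(regional_role_name("myrole", "ap-southeast-1"))  # -> myrole-apse1
```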

We recommend looking at the pros and cons of each approach and selecting the approach that best suits the requirements for your workloads regarding the flexibility to deploy to multiple Regions.

Consistency

We keep one copy of the infrastructure code and deploy the resources targeted for each Region using this same copy. Our code is built using versioned module composition as the “lego blocks”. This follows the DRY (Don’t Repeat Yourself) principle and decreases the risk of code drift per Region. We may deploy to any Region independently, including any Regions added at a future date with zero code changes and minimal additional configuration for that Region. We can see three advantages with this approach.

The total deployment time per Region remains the same regardless of the addition of Regions. This helps for restrictions, such as tight release windows due to business requirements.
If there’s an issue with one of the regional deployments, then the remaining Regions and their deployment pipelines aren’t affected.
It allows staggered deployments, or not deploying to every Region in non-critical environments (e.g., dev), to minimize costs and remain in line with the Well-Architected Sustainability pillar.

Conclusion

In this post, we demonstrated a multi-account, multi-Region deployment approach, along with sample code, with a focus on architecture using the IaC tool Terraform and the CI/CD services AWS CodeBuild and AWS CodePipeline to help customers in their journey through multi-Region deployments.

Thanks to Welly Siauw, Kenneth Jackson, Andy Taylor, Rodney Bozo, Craig Edwards and Curtis Rissi for their contributions reviewing this post and its artifacts.


AWS Config RDK: Deploying the Custom Rules using the Terraform

2021-08-21 Madhu Sarma

Post Syndicated from Madhu Sarma original https://aws.amazon.com/blogs/devops/aws-config-rdk-deploying-the-custom-rules-using-the-terraform/

To help customers who use Terraform for multi-cloud infrastructure deployment, we have introduced a new feature in the AWS Config Rule Development Kit (RDK) that allows you to export custom AWS Config rules to Terraform files so that you can deploy the RDK rules with Terraform.

This blog post is a complement to the previous post – How to develop custom AWS Config rules using the Rule Development Kit. Here I will show you how to prototype, develop, and deploy custom AWS Config rules. The steps for prototyping and developing the custom AWS Config rules remain identical, while a variation exists in the deployment step, which I’ll walk you through in detail. I would encourage you to review the previous blog post, so that you can follow along here.

In this post, you will learn how to export a custom AWS Config rule to Terraform files and deploy it to AWS using Terraform.

Background

Previously, RDK didn’t support Terraform for rule deployment, which affected customers who use Terraform (“Infrastructure as Code”) to provision AWS infrastructure. Therefore, we have provided one more option: deploying the rules by using Terraform.

Getting Started

The first step is making sure that you installed the latest RDK version. After you have defined an AWS Config rule and prototyped using the AWS Config RDK as described in the previous blog post, follow the steps below to deploy the various AWS Config components across the compliance and satellite accounts.

Prerequisites

Validate that your installed RDK supports “export” by running the command “rdk export -h”; you should see the output below. If the installed RDK doesn’t support the export feature, then update it by using the command “pip install --upgrade rdk”.

(venv) 8c85902e4110:7RDK test$ rdk export -h 
 
usage: rdk export [-h] [-s RULESETS] [--all] [--lambda-layers LAMBDA_LAYERS]  
                  [--lambda-subnets LAMBDA_SUBNETS]  
                  [--lambda-security-groups LAMBDA_SECURITY_GROUPS]  
                  [--lambda-role-arn LAMBDA_ROLE_ARN]  
                  [--rdklib-layer-arn RDKLIB_LAYER_ARN] -v {0.11,0.12} -f  
                  {terraform}  
                  [<rulename> [<rulename> ...]]  
  
Used to export the Config Rule to terraform file.  
  
positional arguments:  
  <rulename>            Rule name(s) to export to a file.  
  
optional arguments:  
  -h, --help            show this help message and exit  
  -s RULESETS, --rulesets RULESETS  
                        comma-delimited list of RuleSet names  
  --all, -a             All rules in the working directory will be deployed.  
  --lambda-layers LAMBDA_LAYERS  
                        [optional] Comma-separated list of Lambda Layer ARNs  
                        to deploy with your Lambda function(s).  
  --lambda-subnets LAMBDA_SUBNETS  
                        [optional] Comma-separated list of Subnets to deploy  
                        your Lambda function(s).  
  --lambda-security-groups LAMBDA_SECURITY_GROUPS  
                        [optional] Comma-separated list of Security Groups to  
                        deploy with your Lambda function(s).  
  --lambda-role-arn LAMBDA_ROLE_ARN  
                        [optional] Assign existing iam role to lambda  
                        functions. If omitted, new lambda role will be  
                        created.  
  --rdklib-layer-arn RDKLIB_LAYER_ARN  
                        [optional] Lambda Layer ARN that contains the desired  
                        rdklib. Note that Lambda Layers are region-specific.  
  -v {0.11,0.12}, --version {0.11,0.12}  
                        Terraform version  
  -f {terraform}, --format {terraform}  
                        Export Format

Create your rule

Create your rule by using the command below which creates the MY_FIRST_RULE rule.

7RDK test$ rdk create MY_FIRST_RULE  --runtime python3.6 --resource-types AWS::EC2::SecurityGroup  
Running create!  
Local Rule files created.

This creates the three files below. Edit the “MY_FIRST_RULE.py” as per your business requirement, as described in the “Edit” section of this blog.

7RDK test$ cd MY_FIRST_RULE/ 
(venv) 8c85902e4110:MY_FIRST_RULE test$ ls 
MY_FIRST_RULE.py        MY_FIRST_RULE_test.py   parameters.json

Export your rule to Terraform

Use the command below to export your rule to Terraform files. The export supports two versions of Terraform (0.11 and 0.12); use the “-v” argument to specify the version.

test$ cd ..  
7RDK test$ rdk export MY_FIRST_RULE -f terraform -v 0.12  
Running export  
Found Custom Rule.  
Zipping MY_FIRST_RULE  
Zipping complete.  
terraform version: 0.12  
Export completed.This will generate three .tf files.  
7RDK test$

This creates the following four files.

<< rule-name >>_rule.tf: This script uploads the rule to the Amazon S3 bucket, deploys the Lambda function, and creates the AWS Config rule and the required IAM roles/policies.
<< rule-name >>_variables.tf: Terraform variable definitions.
<< rule-name >>.tfvars.json: Terraform variable values.
<< rule-name >>.zip: Zipped rule code.

7RDK test$ cd MY_FIRST_RULE/  
(venv) 8c85902e4110:MY_FIRST_RULE test$ ls -1  
MY_FIRST_RULE.py  
MY_FIRST_RULE.zip  
MY_FIRST_RULE_test.py  
my_first_rule.tfvars.json  
my_first_rule_rule.tf  
my_first_rule_variables.tf  
parameters.json
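
Before wiring the exported files into a pipeline, it can help to confirm that the export produced everything it should. The sketch below derives the expected file names from the rule name, based only on the directory listing above (lowercase names for the Terraform files, original case for the zip). The helper names are hypothetical and are not part of RDK.

```python
from pathlib import Path
from typing import List


def exported_file_names(rule_name: str) -> List[str]:
    """File names that the export step is expected to add for a rule."""
    lower = rule_name.lower()
    return [
        f"{lower}_rule.tf",
        f"{lower}_variables.tf",
        f"{lower}.tfvars.json",
        f"{rule_name}.zip",
    ]


def missing_exports(rule_dir: str, rule_name: str) -> List[str]:
    """Return the expected export files that are absent from rule_dir."""
    base = Path(rule_dir)
    return [name for name in exported_file_names(rule_name) if not (base / name).exists()]


print(exported_file_names("MY_FIRST_RULE"))
# -> ['my_first_rule_rule.tf', 'my_first_rule_variables.tf',
#     'my_first_rule.tfvars.json', 'MY_FIRST_RULE.zip']
```

An empty list from missing_exports("MY_FIRST_RULE", "MY_FIRST_RULE") would indicate that the rule directory is ready for terraform init.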

Deploy your rule using Terraform

Initialize Terraform by running “terraform init”, which downloads the AWS provider plugin.

MY_FIRST_RULE test$ terraform init  
  
Initializing the backend...  
  
Initializing provider plugins...  
- Checking for available provider plugins...  
- Downloading plugin for provider "aws" (hashicorp/aws) 2.70.0...  
  
The following providers do not have any version constraints in configuration,  
so the latest version was installed.  
  
To prevent automatic upgrades to new major versions that may contain breaking  
changes, it is recommended to add version = "..." constraints to the  
corresponding provider blocks in configuration, with the constraint strings  
suggested below.  
  
* provider.aws: version = "~> 2.70"  
  
Terraform has been successfully initialized!

To deploy the Config rules, your role must have the required permissions, and the role ARN should be specified in my_first_rule.tfvars.json.

Applying the Terraform configuration requires two arguments:

var-file: Terraform script variable file name, created while exporting the rule using RDK.
source_bucket: Your Amazon S3 bucket name, to upload the config rule lambda code.

Make sure that the AWS provider is configured for your Terraform environment as mentioned in the docs.

MY_FIRST_RULE test$ terraform apply -var-file=my_first_rule.tfvars.json -var source_bucket=config-bucket-xxxxx  
  
aws_iam_policy.awsconfig_policy[0]: Creating...  
aws_iam_role.awsconfig[0]: Creating...  
aws_s3_bucket_object.rule_code: Creating...  
aws_iam_role.awsconfig[0]: Creation complete after 3s [id=my_first_rule-awsconfig-role]  
aws_iam_role_policy_attachment.readonly-role-policy-attach[0]: Creating...  
aws_iam_policy.awsconfig_policy[0]: Creation complete after 4s [id=arn:aws:iam::xxxxxxxxxxxx:policy/my_first_rule-awsconfig-policy]  
aws_iam_role_policy_attachment.awsconfig_policy_attach[0]: Creating...  
aws_s3_bucket_object.rule_code: Creation complete after 5s [id=MY_FIRST_RULE.zip]  
aws_lambda_function.rdk_rule: Creating...  
aws_iam_role_policy_attachment.readonly-role-policy-attach[0]: Creation complete after 2s [id=my_first_rule-awsconfig-role-20200726023315892200000001]  
aws_iam_role_policy_attachment.awsconfig_policy_attach[0]: Creation complete after 3s [id=my_first_rule-awsconfig-role-20200726023317242000000002]  
aws_lambda_function.rdk_rule: Still creating... [10s elapsed]  
aws_lambda_function.rdk_rule: Creation complete after 18s [id=RDK-Rule-Function-MY_FIRST_RULE]  
aws_lambda_permission.lambda_invoke: Creating...  
aws_config_config_rule.event_triggered[0]: Creating...  
aws_lambda_permission.lambda_invoke: Creation complete after 2s [id=AllowExecutionFromConfig]  
aws_config_config_rule.event_triggered[0]: Creation complete after 4s [id=MY_FIRST_RULE]  
  
Apply complete! Resources: 8 added, 0 changed, 0 destroyed.

Clean up

Enter the following command to remove all the resources.

MY_FIRST_RULE test$ terraform destroy

Conclusion

With this new feature, you can export AWS Config rules developed with RDK to Terraform files, and integrate these files into your Terraform CI/CD pipeline to provision the Config rules in AWS without using the RDK.

Secure and analyse your Terraform code using AWS CodeCommit, AWS CodePipeline, AWS CodeBuild and tfsec

2021-07-30 César Prieto Ballester

Post Syndicated from César Prieto Ballester original https://aws.amazon.com/blogs/devops/secure-and-analyse-your-terraform-code-using-aws-codecommit-aws-codepipeline-aws-codebuild-and-tfsec/

Introduction

More and more customers are using Infrastructure-as-Code (IaC) to design and implement their infrastructure on AWS. This is why it is essential to have pipelines with Continuous Integration/Continuous Deployment (CI/CD) for infrastructure deployment. HashiCorp Terraform is one of the popular IaC tools for customers on AWS.

In this blog, I will guide you through building a CI/CD pipeline on AWS to analyze and identify possible configuration issues in your Terraform code templates. This will help mitigate security risks within our infrastructure deployment pipelines as part of our CI/CD. To do this, we utilize AWS tools and the open-source tfsec tool, a static analysis security scanner for your Terraform code that includes more than 90 preconfigured checks and lets you add custom checks.

Solutions Overview

The architecture goes through a CI/CD pipeline created on AWS using AWS CodeCommit, AWS CodePipeline, AWS CodeBuild, and Amazon ECR.

Our demo has two separate pipelines:

CI/CD Pipeline to build and push our custom Docker image to Amazon ECR
CI/CD Pipeline where our tfsec analysis is executed and Terraform provisions infrastructure

The tfsec analysis and the Terraform commands are driven by a buildspec file defined within an AWS CodeBuild action. This action counts the potential security risks in our Terraform templates, and the count is displayed in our manual approval step for verification.
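As a rough illustration of the counting step, the following Python sketch tallies findings per severity from a tfsec-style JSON report. It assumes the report has a top-level results list whose entries carry a severity field, as produced by running tfsec with its JSON output format; the sample data and rule IDs are hypothetical, and this is not the code used in the pipeline's buildspec.

```python
import json
from collections import Counter


def count_findings(report: dict) -> Counter:
    """Count findings per severity in a tfsec-style JSON report."""
    # Assumed layout: {"results": [{"severity": "HIGH", ...}, ...]}.
    # "results" may be missing or null when nothing was found.
    results = report.get("results") or []
    return Counter(item.get("severity", "UNKNOWN") for item in results)


# Hypothetical sample standing in for a real tfsec JSON report.
sample = json.loads("""
{"results": [
  {"rule_id": "EXAMPLE-001", "severity": "HIGH"},
  {"rule_id": "EXAMPLE-002", "severity": "LOW"},
  {"rule_id": "EXAMPLE-003", "severity": "HIGH"}
]}
""")

summary = count_findings(sample)
total = sum(summary.values())
print(f"{total} potential issues: {dict(summary)}")
```

A number like this is what a reviewer would look at in the manual approval step before letting the pipeline continue.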

Architecture diagram

Provisioning the infrastructure

We have created an AWS Cloud Development Kit (AWS CDK) app, written in Python and hosted in a Git repository. With it, you can deploy the two main pipelines that manage this scenario. For a list of the deployment prerequisites, see the README.md file.

Clone the repo to your local machine. Then, bootstrap and deploy the CDK stack:

git clone https://github.com/aws-samples/aws-cdk-tfsec
cd aws-cdk-tfsec
pip install -r requirements.txt
cdk bootstrap aws://account_id/eu-west-1
cdk deploy --all

The infrastructure creation takes around 5-10 minutes because of the AWS CodePipeline pipelines and the referenced repositories that are created. Once the CDK has deployed the infrastructure, clone the two new AWS CodeCommit repos that have been created and push the example code: first the one for the custom Docker image, and then the one for your Terraform code, like this:

git clone https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/awsome-terraform-example-container
cd awsome-terraform-example-container
git checkout -b main
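# Assumption: repos/docker_image is relative to the aws-cdk-tfsec checkout; adjust the path to where you cloned it
cp repos/docker_image/* .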
git add .
git commit -am "First commit"
git push origin main

Once the Docker image is built and pushed to Amazon ECR, proceed with the Terraform repo. Check the pipeline process on the AWS CodePipeline console.

Screenshot of CI/CD Pipeline to build Docker Image

git clone https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/awsome-terraform-example
cd awsome-terraform-example
git checkout -b main
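# Assumption: repos/terraform_code is relative to the aws-cdk-tfsec checkout; adjust the path to where you cloned it
cp -aR repos/terraform_code/* .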
git add .
git commit -am "First commit"
git push origin main

The Terraform provisioning pipeline in AWS CodePipeline looks like this:

Screenshot of CodePipeline to run security and orchestrate IaC

The pipeline has three main stages:

Source – AWS CodeCommit stores the Terraform infrastructure repository, and every push to the main branch triggers the AWS CodePipeline pipeline.
tfsec analysis – AWS CodeBuild looks for a buildspec to execute the tfsec actions configured on the same buildspec.

Screenshot showing tfsec analysis

The output shows the potential security issues detected by tfsec for our Terraform code. Each finding links to the corresponding security check already defined in tfsec. See the list of security checks defined by tfsec here. After tfsec runs, a manual approval action lets us decide whether to continue with the next steps or to reject and stop the AWS CodePipeline execution.

The URL for review links to our tfsec output console.

Screenshot of tfsec output

Terraform plan and Terraform apply – This stage creates the Terraform plan and then applies it to our infrastructure. After the Terraform plan command and before the Terraform apply, a manual approval action lets us decide whether to apply the changes.

After going through all of the stages, our Terraform infrastructure should be created.

Clean up

After completing your demo, feel free to delete your stack using the CDK CLI:

cdk destroy --all

Conclusion

At AWS, security is our top priority. This post demonstrates how to build a CI/CD pipeline by using AWS Services to automate and secure your infrastructure as code via Terraform and tfsec.

Learn more about tfsec through the official documentation: https://tfsec.dev/

About the authors

César Prieto Ballester is a DevOps Consultant at Amazon Web Services. He enjoys automating everything and building infrastructure using code. Apart from work, he plays electric guitar and loves riding his mountain bike.

Bruno Bardelli is a Senior DevOps Consultant at Amazon Web Services. He loves to build applications and in his free time plays video games, practices aikido, and goes on walks with his dog.

Continuous Compliance Workflow for Infrastructure as Code: Part 2

2021-06-26 DAMODAR SHENVI WAGLE

Post Syndicated from DAMODAR SHENVI WAGLE original https://aws.amazon.com/blogs/devops/continuous-compliance-workflow-for-infrastructure-as-code-part-2/

In the first post of this series, we introduced a continuous compliance workflow in which an enterprise security and compliance team can release guardrails in a continuous integration, continuous deployment (CI/CD) fashion in your organization.

In this post, we focus on the technical implementation of the continuous compliance workflow. We demonstrate how to use AWS Developer Tools to create a CI/CD pipeline that releases guardrails for Terraform application workloads.

We use the Terraform-Compliance framework to define the guardrails. Terraform-Compliance is a lightweight, security and compliance-focused test framework for Terraform to enable the negative testing capability for your infrastructure as code (IaC).

With this compliance framework, we can ensure that the implemented Terraform code follows security standards and your own custom standards. Currently, HashiCorp provides Sentinel (a policy-as-code framework) for enterprise products. AWS has CloudFormation Guard, an open-source policy-as-code evaluation tool for AWS CloudFormation templates. Terraform-Compliance allows us to build similar functionality for Terraform, and is open source.

This post is from the perspective of a security and compliance engineer, and assumes that the engineer is familiar with the practices of IaC, CI/CD, behavior-driven development (BDD), and negative testing.

Solution overview

You start by building the following resources in the workload (application development team) account:

An AWS CodeCommit repository for the Terraform workload
A CI/CD pipeline built using AWS CodePipeline to deploy the workload
A cross-account AWS Identity and Access Management (IAM) role that gives the security and compliance account the permissions to pull the Terraform workload from the workload account repository for testing their guardrails in observation mode

Next, we build the resources in the security and compliance account:

A CodeCommit repository to hold the security and compliance standards (guardrails)
A CI/CD pipeline built using CodePipeline to release new guardrails
A cross-account role that gives the workload account the permissions to pull the activated guardrails from the main branch of the security and compliance account repository.

The following diagram shows our solution architecture.

solution architecture diagram

The architecture has two workflows: security and compliance (Steps 1–4) and application delivery (Steps 5–7).

1. When a new security and compliance guardrail is introduced into the develop branch of the compliance repository, it triggers the security and compliance pipeline.
2. The pipeline pulls the Terraform workload.
3. The pipeline tests this compliance check guardrail against the Terraform workload in the workload account repository.
4. If the workload is compliant, the guardrail is automatically merged into the main branch. This activates the guardrail by making it available for all Terraform application workload pipelines to consume. By doing this, we make sure that we don’t break the Terraform application deployment pipeline by introducing new guardrails. If the compliance check fails, the automatic merge to the main branch is stopped. This gives the security and compliance team visibility into the resources in the application workload that are noncompliant, so they can reach out to the application delivery team and suggest appropriate remediation before the new standards are activated. The security and compliance team also has the option to force merge the guardrail into the main branch if it’s deemed critical and they need to activate it immediately.
5. The Terraform deployment pipeline in the workload account always pulls the latest security and compliance checks from the main branch of the compliance repository.
6. Checks are run against the Terraform workload to ensure that it meets the organization’s security and compliance standards.
7. Only secure and compliant workloads are deployed by the pipeline. If the workload is noncompliant, the security and compliance checks fail and break the pipeline, forcing the application delivery team to remediate the issue and check the code in again.
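The decision logic in the two workflows can be summarized in a few lines of Python. This is only a sketch of the rules described above; the function and enum names are hypothetical and do not come from the solution's code.

```python
from enum import Enum


class Outcome(Enum):
    MERGED = "guardrail merged into main"
    FORCE_MERGED = "guardrail force-merged into main"
    BLOCKED = "guardrail stays on develop"


def release_guardrail(workload_compliant: bool, force_merge: bool = False) -> Outcome:
    """Security and compliance workflow: decide what happens to a new guardrail."""
    if workload_compliant:
        # The new guardrail would not break the workload pipeline, so activate it.
        return Outcome.MERGED
    if force_merge:
        # The team deems the guardrail critical and activates it anyway.
        return Outcome.FORCE_MERGED
    return Outcome.BLOCKED


def can_deploy(workload_passes_active_guardrails: bool) -> bool:
    """Application delivery workflow: only compliant workloads are deployed."""
    return workload_passes_active_guardrails


print(release_guardrail(workload_compliant=False).value)
print(release_guardrail(workload_compliant=False, force_merge=True).value)
```

The point of the split is that a failing check in the first workflow only blocks the guardrail, while a failing check in the second workflow blocks the deployment.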

Prerequisites

Before proceeding any further, you need to identify and designate two AWS accounts required for the solution to work:

Security and Compliance – In which you create a CodeCommit repository to hold compliance standards that are written based on the Terraform-Compliance framework. You also create a CI/CD pipeline to release new compliance guardrails.
Workload – In which the Terraform workload resides. The pipeline to deploy the Terraform workload enforces the compliance guardrails prior to the deployment.

You also need to create two AWS account profiles in ~/.aws/credentials, one for the security and compliance account and one for the workload account, if you don’t already have them. These profiles need sufficient permissions to run an AWS Cloud Development Kit (AWS CDK) stack. They should be your private profiles and be used only for this walkthrough, so it should be fine to give them admin privileges. Don’t share the profile details, especially if a profile has admin privileges. We recommend removing the profiles when you’re finished with this walkthrough. For more information about creating an AWS account profile, see Configuring the AWS CLI.
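Before running the deployment scripts, it can help to confirm that both profiles exist. The sketch below parses credentials-file text with Python's configparser and reports which required profiles are missing. The profile names and key values are placeholders chosen for illustration.

```python
import configparser


def missing_profiles(credentials_text: str, required: list[str]) -> list[str]:
    """Return the required profile names that have no section in the file text."""
    parser = configparser.ConfigParser()
    parser.read_string(credentials_text)
    return [name for name in required if not parser.has_section(name)]


# Placeholder content in the format of ~/.aws/credentials.
sample_credentials = """
[compliance-profile]
aws_access_key_id = EXAMPLE_KEY_ID
aws_secret_access_key = EXAMPLE_SECRET
"""

missing = missing_profiles(sample_credentials, ["compliance-profile", "workload-profile"])
print("Missing profiles:", missing)
```

To check the real file, pass the result of `(Path.home() / ".aws" / "credentials").read_text()` instead of the sample string.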

In addition, you need to generate a cucumber-sandwich.jar file by following the steps in the cucumber-sandwich GitHub repo. The JAR file is needed to generate pretty HTML compliance reports. The security and compliance team can use these reports to make sure that the standards are met.

To implement our solution, we complete the following high-level steps:

Create the security and compliance account stack.
Create the workload account stack.
Test the compliance workflow.

Create the security and compliance account stack

We create the following resources in the security and compliance account:

A CodeCommit repo to hold the security and compliance guardrails
A CI/CD pipeline to roll out the Terraform compliance guardrails
An IAM role that trusts the application workload account and allows it to pull compliance guardrails from its CodeCommit repo

In this section, we set up the properties for the pipeline and cross-account role stacks, and run the deployment scripts.

Set up properties for the pipeline stack

Clone the GitHub repo aws-continuous-compliance-for-terraform and navigate to the folder security-and-compliance-account/stacks. This contains the folder pipeline_stack/, which holds the code and properties for creating the pipeline stack.

The folder has a JSON file cdk-stack-param.json, which has the parameter TERRAFORM_APPLICATION_WORKLOADS, which represents the list of application workloads that the security and compliance pipeline pulls and runs tests against to make sure that the workloads are compliant. In the workload list, you have the following parameters:

GIT_REPO_URL – The HTTPS URL of the CodeCommit repository in the workload account against which the security and compliance check pipeline runs compliance guardrails.
CROSS_ACCOUNT_ROLE_ARN – The ARN for the cross-account role we create in the next section. This role gives the security and compliance account permissions to pull Terraform code from the workload account.

For CROSS_ACCOUNT_ROLE_ARN, replace <workload-account-id> with the account ID for your designated AWS workload account. For GIT_REPO_URL, replace <region> with the AWS Region where the repository resides.
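The placeholder substitution can also be scripted. The following Python sketch fills the two placeholders and refuses to continue if any placeholder is left over. The JSON layout, repository name, and role name below are hypothetical stand-ins; check the actual cdk-stack-param.json in the repo for the real structure. The account ID and Region are example values.

```python
import json
import re

# Hypothetical stand-in for cdk-stack-param.json; only the parameter names
# (TERRAFORM_APPLICATION_WORKLOADS, GIT_REPO_URL, CROSS_ACCOUNT_ROLE_ARN)
# come from the post.
TEMPLATE = """
{
  "TERRAFORM_APPLICATION_WORKLOADS": [
    {
      "GIT_REPO_URL": "https://git-codecommit.<region>.amazonaws.com/v1/repos/example-workload-repo",
      "CROSS_ACCOUNT_ROLE_ARN": "arn:aws:iam::<workload-account-id>:role/example-cross-account-role"
    }
  ]
}
"""


def fill_placeholders(text: str, values: dict[str, str]) -> dict:
    """Replace <name> placeholders and parse the result as JSON."""
    for name, value in values.items():
        text = text.replace(f"<{name}>", value)
    leftover = re.findall(r"<[a-z-]+>", text)
    if leftover:
        raise ValueError(f"unreplaced placeholders: {leftover}")
    return json.loads(text)


params = fill_placeholders(
    TEMPLATE,
    {"region": "eu-west-1", "workload-account-id": "111122223333"},
)
workload = params["TERRAFORM_APPLICATION_WORKLOADS"][0]
print(workload["GIT_REPO_URL"])
print(workload["CROSS_ACCOUNT_ROLE_ARN"])
```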

security and compliance pipeline stack parameters

Set up properties for the cross-account role stack

In the cloned GitHub repo aws-continuous-compliance-for-terraform from the previous step, navigate to the folder security-and-compliance-account/stacks. This contains the folder cross_account_role_stack/, which holds the code and properties for creating the cross-account role.

The folder has a JSON file cdk-stack-param.json, which has the parameter TERRAFORM_APPLICATION_WORKLOAD_ACCOUNTS, which represents the list of Terraform workload accounts that intend to integrate with the security and compliance account for running compliance checks. All these accounts are trusted by the security and compliance account and given permissions to pull compliance guardrails. Replace <workload-account-id> with the account ID for your designated AWS workload account.

security and compliance cross account role stack parameters

Run the deployment script

Run deploy.sh by passing the name of the AWS security and compliance account profile you created earlier. The script uses the AWS CDK CLI to bootstrap and deploy the two stacks we discussed. See the following code:

cd aws-continuous-compliance-for-terraform/security-and-compliance-account/
./deploy.sh "<AWS-COMPLIANCE-ACCOUNT-PROFILE-NAME>"

You should now see three stacks in the security and compliance account:

CDKToolkit – AWS CDK creates the CDKToolkit stack when we bootstrap the AWS CDK app. This creates an Amazon Simple Storage Service (Amazon S3) bucket needed to hold deployment assets such as an AWS CloudFormation template and AWS Lambda code package.
cf-CrossAccountRoles – This stack creates the cross-account IAM role.
cf-SecurityAndCompliancePipeline – This stack creates the pipeline. On the Outputs tab of the stack, you can find the CodeCommit source repo URL from the key OutSourceRepoHttpUrl. Record the URL to use later.
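If you prefer to read the output value programmatically rather than from the console, the sketch below extracts an output from a response shaped like the CloudFormation DescribeStacks result (a Stacks list whose entries have Outputs with OutputKey and OutputValue). The sample response and repository URL are made up for illustration.

```python
def get_stack_output(response: dict, key: str) -> str:
    """Return the value of one stack output from a DescribeStacks-style response."""
    outputs = response["Stacks"][0].get("Outputs", [])
    for output in outputs:
        if output["OutputKey"] == key:
            return output["OutputValue"]
    raise KeyError(f"stack output {key!r} not found")


# Made-up response standing in for a real DescribeStacks call.
sample_response = {
    "Stacks": [
        {
            "StackName": "cf-SecurityAndCompliancePipeline",
            "Outputs": [
                {
                    "OutputKey": "OutSourceRepoHttpUrl",
                    "OutputValue": "https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/example-compliance-repo",
                },
            ],
        }
    ]
}

repo_url = get_stack_output(sample_response, "OutSourceRepoHttpUrl")
print(repo_url)
```

With boto3 installed, a real response could come from `boto3.Session(profile_name=...).client("cloudformation").describe_stacks(StackName="cf-SecurityAndCompliancePipeline")`, using the profile you created for the account.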

security and compliance stack

Create a workload account stack

We create the following resources in the workload account:

A CodeCommit repo to hold the Terraform workload to be deployed
A CI/CD pipeline to deploy the Terraform workload
An IAM role that trusts the security and compliance account and allows it to pull Terraform code from its CodeCommit repo for testing

We follow similar steps as in the previous section to set up the properties for the pipeline stack and cross-account role stack, and then run the deployment script.

Set up properties for the pipeline stack

In the already cloned repo, navigate to the folder workload-account/stacks. This contains the folder pipeline_stack/, which holds the code and properties for creating the pipeline stack.

The folder has a JSON file cdk-stack-param.json, which has the parameter COMPLIANCE_CODE, which provides details on where to pull the compliance guardrails from. The pipeline pulls and runs compliance checks prior to deployment, to make sure that application workload is compliant. You have the following parameters:

GIT_REPO_URL – The HTTPS URL of the CodeCommit repository in the security and compliance account, which contains compliance guardrails that the pipeline in the workload account pulls to carry out compliance checks.
CROSS_ACCOUNT_ROLE_ARN – The ARN for the cross-account role we created in the previous step in the security and compliance account. This role gives the workload account permissions to pull the Terraform compliance code from its respective security and compliance account.

For CROSS_ACCOUNT_ROLE_ARN, replace <compliance-account-id> with the account ID for your designated AWS security and compliance account. For GIT_REPO_URL, replace <region> with the AWS Region where the repository resides.

workload pipeline stack config

Set up the properties for cross-account role stack

In the already cloned repo, navigate to the folder workload-account/stacks. This contains the folder cross_account_role_stack/, which holds the code and properties for creating the cross-account role stack.

The folder has a JSON file cdk-stack-param.json, which has the parameter COMPLIANCE_ACCOUNT, which represents the security and compliance account that intends to integrate with the workload account for running compliance checks. This account is trusted by the workload account and given permissions to pull the Terraform workload code. Replace <compliance-account-id> with the account ID for your designated AWS security and compliance account.

workload cross account role stack config

Run the deployment script

Run deploy.sh by passing the name of the AWS workload account profile you created earlier. The script uses the AWS CDK CLI to bootstrap and deploy the two stacks we discussed. See the following code:

cd aws-continuous-compliance-for-terraform/workload-account/
./deploy.sh "<AWS-WORKLOAD-ACCOUNT-PROFILE-NAME>"

You should now see three stacks in the workload account:

CDKToolkit – AWS CDK creates the CDKToolkit stack when we bootstrap the AWS CDK app. This creates an S3 bucket needed to hold deployment assets such as a CloudFormation template and Lambda code package.
cf-CrossAccountRoles – This stack creates the cross-account IAM role.
cf-TerraformWorkloadPipeline – This stack creates the pipeline. On the Outputs tab of the stack, you can find the CodeCommit source repo URL from the key OutSourceRepoHttpUrl. Record the URL to use later.

workload pipeline stack

Test the compliance workflow

In this section, we walk through the following steps to test our workflow:

Push the application workload code into its repo.
Push the security and compliance code into its repo and run its pipeline to release the compliance guardrails.
Run the application workload pipeline to exercise the compliance guardrails.
Review the generated reports.

Push the application workload code into its repo

Clone the empty CodeCommit repo from the workload account. You can find the URL from the variable OutSourceRepoHttpUrl on the Outputs tab of the cf-TerraformWorkloadPipeline stack we deployed in the previous section.

Create a new branch main and copy the workload code into it.
Copy the cucumber-sandwich.jar file you generated in the prerequisites section into a new folder /lib.
Create a directory called reports with an empty file dummy. The reports directory is where the Terraform-Compliance framework creates compliance reports.
Push the code to the remote origin.

See the following sample script:

git checkout -b main
# Copy the code from git repo location
# Create reports directory and a dummy file.
mkdir reports
touch reports/dummy
git add .
git commit -m "Initial commit"
git push origin main

The folder structure of the workload code repo should match the structure shown in the following screenshot.

workload code folder structure
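Because a missing JAR file or reports directory only shows up later as a pipeline failure, a quick local check before pushing can save a round trip. The following Python sketch verifies the two paths named in the steps above, lib/cucumber-sandwich.jar and reports/dummy; the rest of the folder structure is not checked, and the temporary directory only stands in for your local clone.

```python
import tempfile
from pathlib import Path

# The two paths the walkthrough asks you to add to the repo.
REQUIRED = ("lib/cucumber-sandwich.jar", "reports/dummy")


def missing_paths(repo_root: Path) -> list[str]:
    """Return the required files that do not exist under repo_root."""
    return [rel for rel in REQUIRED if not (repo_root / rel).is_file()]


# Demonstration on a temporary directory that only has reports/dummy.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "reports").mkdir()
    (root / "reports" / "dummy").touch()
    result = missing_paths(root)

print("Missing before push:", result)
```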

The first commit triggers the pipeline-workload-main pipeline, which fails in the stage RunComplianceCheck due to the security and compliance repo not being present (which we add in the next section).

Push the security and compliance code into its repo and run its pipeline

Clone the empty CodeCommit repo from the security and compliance account. You can find the URL from the variable OutSourceRepoHttpUrl on the Outputs tab of the cf-SecurityAndCompliancePipeline stack we deployed in the previous section.

Create a new local branch main and check in the empty branch into the remote origin so that the main branch is created in the remote origin. Skipping this step leads to failure in the code merge step of the pipeline due to the absence of the main branch.
Create a new branch develop and copy the security and compliance code into it. This is required because the security and compliance pipeline is configured to be triggered from the develop branch for the purposes of this post.
Copy the cucumber-sandwich.jar file you generated in the prerequisites section into a new folder /lib.

See the following sample script:

cd security-and-compliance-code
git checkout -b main
git add .
git commit --allow-empty -m "initial commit"
git push origin main
git checkout -b develop main
# Here copy the code from git repo location
# You also copy cucumber-sandwich.jar into a new folder /lib
git add .
git commit -m "Initial commit"
git push origin develop

The folder structure of the security and compliance code repo should match the structure shown in the following screenshot.

security and compliance code folder structure

The code push to the develop branch of the security-and-compliance-code repo triggers the security and compliance pipeline. The pipeline pulls the code from the workload account repo, then runs the compliance guardrails against the Terraform workload to make sure that the workload is compliant. If the workload is compliant, the pipeline merges the compliance guardrails into the main branch. If the workload fails the compliance test, the pipeline fails. The following screenshot shows a sample run of the pipeline.

security and compliance pipeline

Run the application workload pipeline to exercise the compliance guardrails

After we set up the security and compliance repo and the pipeline runs successfully, the workload pipeline is ready to proceed (see the following screenshot of its progress).

workload pipeline

The application delivery teams are now subject to the security and compliance guardrails (RunComplianceCheck stage), and their pipeline breaks if any resource is noncompliant.

Review the generated reports

CodeBuild supports viewing reports generated in cucumber JSON format. In our workflow, we generate reports in cucumber JSON and BDD XML formats, and we use this capability of CodeBuild to generate and view HTML reports. Our implementation also generates a report directly in HTML using the cucumber-sandwich library.

The following screenshot is a snippet of the script compliance-check.sh, which implements report generation.

compliance check script

The bug noted in the screenshot is in the radish-bdd library that Terraform-Compliance uses for the cucumber JSON format report generation. For more information, you can review the defect logged against radish-bdd for this issue.

After the script generates the reports, CodeBuild needs to be configured to access them to generate HTML reports. The following screenshot shows a snippet from buildspec-compliance-check.yml, which shows how the reports section is set up for report generation:

buildspec compliance check

For more details on how to set up the buildspec file for CodeBuild to generate reports, see Create a test report.

CodeBuild displays the compliance run reports as shown in the following screenshot.

code build cucumber report

We can also view a trending graph for multiple runs.

code build cucumber report

The other report generated by the workflow is the pretty HTML report generated by the cucumber-sandwich library.

code build cucumber report

The reports are available for download from the S3 bucket <OutPipelineBucketName>/pipeline-security-an/report_App/<zip file>.

The cucumber-sandwich generated report marks scenarios with skipped tests as failed scenarios. This is the only noticeable difference between the CodeBuild generated HTML and cucumber-sandwich generated HTML reports.
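To make that difference concrete, the sketch below summarizes a cucumber JSON report (a list of features, each with elements for scenarios, each with steps carrying a result status) under both policies. The flag name and the rule that a single skipped step makes a scenario count as skipped in the lenient view are assumptions for illustration, not a description of how either tool is implemented.

```python
def scenario_status(scenario: dict, skipped_is_failure: bool) -> str:
    """Classify one scenario from the statuses of its steps."""
    statuses = {
        step.get("result", {}).get("status")
        for step in scenario.get("steps", [])
    }
    if "failed" in statuses:
        return "failed"
    if "skipped" in statuses:
        # Strict policy: a scenario with skipped steps counts as failed.
        return "failed" if skipped_is_failure else "skipped"
    return "passed"


def summarize(report: list, skipped_is_failure: bool) -> dict:
    """Count scenarios per status across all features in the report."""
    counts = {"passed": 0, "failed": 0, "skipped": 0}
    for feature in report:
        for scenario in feature.get("elements", []):
            counts[scenario_status(scenario, skipped_is_failure)] += 1
    return counts


# Hypothetical report with one passing, one skipped, and one failing scenario.
sample_report = [
    {
        "name": "example feature",
        "elements": [
            {"steps": [{"result": {"status": "passed"}}]},
            {"steps": [{"result": {"status": "passed"}},
                       {"result": {"status": "skipped"}}]},
            {"steps": [{"result": {"status": "failed"}}]},
        ],
    }
]

lenient = summarize(sample_report, skipped_is_failure=False)
strict = summarize(sample_report, skipped_is_failure=True)
print("lenient:", lenient)
print("strict: ", strict)
```

The same report therefore shows one more failed scenario under the strict policy, which is the kind of discrepancy you may notice when comparing the two HTML reports.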

Clean up

To remove all the resources from the workload account, complete the following steps in order:

Go to the folder where you cloned the workload code and edit buildspec-workload-deploy.yml:
- Comment line 44 (- ./workload-deploy.sh).
- Uncomment line 45 (- ./workload-deploy.sh --destroy).
- Commit and push the code change to the remote repo. The workload pipeline is triggered, which cleans up the workload.
Delete the CloudFormation stack cf-CrossAccountRoles. This step removes the cross-account role from the workload account, which gives permission to the security and compliance account to pull the Terraform workload.
Go to the CloudFormation stack cf-TerraformWorkloadPipeline and note the OutPipelineBucketName and OutStateFileBucketName on the Outputs tab. Empty the two buckets and then delete the stack. This removes pipeline resources from the workload account.
Go to the CDKToolkit stack and note the BucketName on the Outputs tab. Empty that bucket and then delete the stack.

To remove all the resources from the security and compliance account, complete the following steps in order:

Delete the CloudFormation stack cf-CrossAccountRoles. This step removes the cross-account role from the security and compliance account, which gives permission to the workload account to pull the compliance code.
Go to the CloudFormation stack cf-SecurityAndCompliancePipeline and note the OutPipelineBucketName on the Outputs tab. Empty that bucket and then delete the stack. This removes pipeline resources from the security and compliance account.
Go to the CDKToolkit stack and note the BucketName on the Outputs tab. Empty that bucket and then delete the stack.

Security considerations

Cross-account IAM roles are very powerful and need to be handled carefully. For this post, we strictly limited the cross-account IAM role to specific CodeCommit permissions, so the role can perform only those CodeCommit actions and nothing else in the account.
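As an illustration of what such a narrowly scoped role can look like, the following Python sketch builds a permissions policy that allows only codecommit:GitPull on a single repository, plus a trust policy for one account. The exact permissions used by the solution's stacks are not shown in this post, so treat the action list, repository name, and account IDs here as assumptions.

```python
import json


def pull_only_policy(repo_arn: str) -> dict:
    """Permissions policy: allow pulling from one CodeCommit repo and nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["codecommit:GitPull"],
                "Resource": repo_arn,
            }
        ],
    }


def trust_policy(trusted_account_id: str) -> dict:
    """Trust policy: let principals in one other account assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Example values only: a made-up repository in a placeholder workload account,
# trusted by a placeholder security and compliance account.
permissions = pull_only_policy(
    "arn:aws:codecommit:eu-west-1:111122223333:example-workload-repo"
)
trust = trust_policy("444455556666")
print(json.dumps(permissions, indent=2))
print(json.dumps(trust, indent=2))
```

Scoping the Resource to a single repository ARN, rather than a wildcard, keeps the blast radius small if the trusted account is ever compromised.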

Conclusion

In this post in our two-part series, we implemented a continuous compliance workflow using CodePipeline and the open-source Terraform-Compliance framework. The Terraform-Compliance framework allows you to build guardrails for securing Terraform applications deployed on AWS.

We also showed how you can use AWS developer tools to seamlessly integrate security and compliance guardrails into an application release cycle and catch noncompliant AWS resources before they are deployed into AWS.

Try implementing the solution in your enterprise as shown in this post, and leave your thoughts and questions in the comments.

About the authors

sumit mishra

Sumit Mishra is a Senior DevOps Architect at AWS Professional Services. His areas of expertise include IaC, security in pipelines, CI/CD, and automation.

Damodar Shenvi Wagle

Damodar Shenvi Wagle is a Cloud Application Architect at AWS Professional Services. His areas of expertise include architecting serverless solutions, CI/CD and automation.