All posts by Rostislav Markov

Empower your teams with modern architecture governance

2025-04-21 Rostislav Markov

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/empower-your-teams-with-modern-architecture-governance/

Agile product teams thrive on autonomy and rapid iteration, especially in the cloud where they can quickly deploy and test systems. However, traditional architecture governance often stands in their way, because many enterprises still impose centralized, one-off architecture signoffs early in the process. Historically, these signoffs verified design compliance with corporate standards in a slower, on-premises world. In cloud environments, such signoffs quickly become obsolete—along with their associated architectural documents—and discourage teams from considering new insights.

Modern cloud architectures demand a new governance approach. In this post, we show how collaborative architecture oversight can transform team performance through automation, self-service platforms, and distributed decision-making. We explore how key stakeholders (developers, architects, security specialists, and shared services teams) can participate in architectural decisions through asynchronous approval workflows, while making sure non-negotiable controls such as encryption at rest and in transit are consistently enforced through automation and policy as code. This approach empowers teams to experiment and adapt quickly while maintaining robust enterprise standards.

The promise of traditional architecture signoffs

Traditional architecture governance centers around formal reviews where teams submit detailed design documents to a central architecture board. These artifacts often include comprehensive diagrams, technology selections, security plans, and integration specifications. Architects and a variety of stakeholders such as security specialists, compliance officers, quality assurance, and operations teams review these documents in scheduled meetings before issuing a signoff. These approvals represent point-in-time validations against enterprise standards, assuming minimal deviation during implementation.

Why traditional signoffs fall short

Signoffs can create challenges in modern cloud architectures:

Substituting for continuous compliance when automated verification is missing, creating false assurance through one-off reviews
Creating a restrictive “check-the-box” mentality where meeting minimum documented requirements becomes the goal instead of exploring the best solutions
Removing decision authority from implementation teams who often have the most contextual knowledge
Delaying implementation feedback loops and reducing organizational agility

Consider an agile team responsible for a strategic cloud application that’s moved beyond minimal viable product and is now scaling to support growing business demands. The system architecture must evolve to handle increasing data volumes, performance requirements, and unanticipated integrations. However, corporate stakeholders insist on rigid adherence to the originally approved architecture documents. Although governance is essential for production systems, this inflexible approach with an early architecture signoff prevents the team from implementing architectural improvements as they go. What appears as meaningful control to stakeholders becomes a stifling constraint for builders, ultimately compromising the system stability the governance process aims to protect.

Modern cloud architecture support

A modern architecture function operates around evolving capabilities across three core areas: preapproved blueprints, distributed governance, and automated insights—with traditional signoffs reserved as an exception path for unique use cases.

Preapproved blueprints

Preapproved blueprints like reference architectures and code templates enable your teams to move faster while maintaining corporate standards. This approach supports a use-case-focused assessment model, where architects can concentrate on evaluating specific workflows or threats relevant to a unique use case—rather than having to understand the entire system or threat model from scratch. In this way, the architecture function shifts to managing by exception and refocus reviews on deviations from the standard. Blueprints should have gravity, guiding teams towards standardized patterns while preventing fragmentation through too many tools, databases, middleware options, or SDKs. Consider the following:

Pattern-based reference architectures – These set clear principles for security and resilience without micromanaging. These standards align teams while allowing innovation within a reliable framework. The cloud-driven enterprise transformation at BMW Group exemplifies how moving from signoffs to enablement through pattern-based architectures can be successful.
Self-service platforms – These provide standardized resources that empower teams to build independently. A self-service platform with preapproved templates for deployment toolchains and infrastructure code enables confident and rapid development. Most companies host these on internal developer platforms like Backstage or AWS Service Catalog. This also allows controlled changes to the blueprints and to track their adoption.
Blueprint lifecycle – Blueprints require their own approval process. Although this creates significant efficiency by reducing individual system reviews, it introduces the challenge of managing existing deployments when patterns are updated. Include versioning and migration strategies when introducing new blueprints.

Distributed governance

Distributed governance treats architectural decision-making as a continuous, collaborative process with clear accountability, delegating decision-making and empowering your builders within established blueprints. Consider the following:

Architecture decision records (ADRs) – These replace formal, one-off signoffs by documenting decisions for each build iteration. This approach promotes transparency and maintains team agility without compromising accountability for decisions and approvals with key stakeholders. It also allows teams to defer decisions until they are most relevant. For practical implementation guidance, see Using architectural decision records to streamline decision-making for a software development project. To learn about how to write concise ADRs and avoid duplication, consult the ADR templates GitHub repository and When Should I Write an Architecture Decision Record.
Community-driven consultations – Architecture departments can foster self-organization by creating architecture communities of practice for peer knowledge-sharing. These communities enable collaboration on best practices, challenges, and standards, cultivating a culture of distributed decision-making, without eliminating the lines of responsibility and accountability for final decisions. This approach works because deep architectural expertise often resides with builders who have hands-on experience with specific use cases and technologies. The role of the architecture function shifts towards providing the necessary infrastructure and identifying thought leaders in the organization.

Automated insights

Automated insights enable compliance with corporate standards through real-time monitoring and adaptation:

Continuous monitoring – Continuous workload discovery detects architecture based on log data such as AWS Config, AWS CloudTrail, VPC Flow Logs, and Amazon GuardDuty. Gathering insights from application environments allows the architecture function to automatically create architecture diagrams, embed compliance policies as code such as AWS Config rules, and provide real-time security checks like the Workload Discovery on AWS solution, which can automatically generate architecture diagrams and cost reports for AWS accounts. Consult the AWS Partner Solutions Finder to explore partner-provided solutions for application discovery and monitoring.
AI-driven governance – AI tools can analyze decisions, identify architectural and code inefficiencies, detect anomalies, and suggest optimized configurations. This supports informed decision-making, in particular when complimented with thorough human verification and oversight. Amazon Bedrock Agents can find similar existing ADRs, analyze architecture diagrams, and generate infrastructure code. For instance, Japan’s Digital Agency uses an AI assistant to streamline migration reviews for hundreds of systems.

Comparison

The modern view improves the overall value-add and support model of your architecture function. The following table compares the traditional and modern views.

Aspect	Traditional View	Modern View
Purpose	Centralized signoff to enforce control and reduce risk	Empower teams with preapproved standards to prevent sprawl and manage their distribution such as through Backstage
Architecture approach	Fixed, one-time design	Evolving, treated as a parametrized, reusable code product refined through feedback
Team empowerment	Limited, decisions approved by centralized authority	High, teams make decisions within clear standards
Team speed and agility	Slower, due to dependency on signoff	Faster, continuous iteration without waiting for approvals
Risk management	Early signoff to lock in decisions and reduce uncertainty	Risk managed through continuous control validation with automated evidence collection, providing stronger assurance for second and third lines of defense than point-in-time assessments
Compliance	Manual checks by experts	Automated through policy as code and AI tools
Transparency	Limited, focused on approval documentation	High, lightweight decision records for technical stakeholders and visualizations or dashboards for non-technical oversight functions
Collaboration	Centralized control, limited cross-team collaboration	Peer-led communities (collective governance) such as Security Guardians
Innovation	Restricted, focus on following signed-off designs	Encouraged, teams explore within a standards-based framework

Despite the benefits, many organizations struggle to let go of signoffs for a number of reasons, including:

Cultural resistance – In risk-averse cultures where failing fast is not accepted, leaders hesitate to let go of centralized control mechanisms.
Compliance concerns – In regulated industries, centralized approvals serve as control gates. The modern view replaces point-in-time trust with continuous compliance mechanisms—automated guardrails, real-time monitoring, and evidence collection—enabling even highly regulated environments to achieve compliance with small, autonomous teams operating within clear boundaries (“two-pizza team”).
Lack of infrastructure – Some organizations lack self-service platforms, automated compliance, or observability, so they fall back on signoff to manage risk.
Governance concerns – Traditional teams often view distributed decision-making as no governance rather than transformed governance.

The modern view offers significant benefits, though with governance considerations:

Speed and flexibility – Teams move faster without waiting for approvals, deploying AWS resources iteratively.
Empowerment and ownership – Builders using standards and ADRs feel accountable and actively shape architecture.
Innovation and experimentation – Self-service tools and AI guidance foster experimentation without delays.

Conclusion

You can empower your builders by rethinking your architecture signoff. In the modern view discussed in this post, architecture governance aligns with the pace and flexibility of the cloud, allowing teams to innovate within a shared framework. This approach values standards and autonomy over control, and transforms your architecture function into a strategic partner in a fast-evolving landscape.

To learn how to establish and maintain cloud-centered principles and patterns, refer to the platform architecture chapter of the AWS Cloud Adoption Framework and the AWS Culture of Security resources.

Related resources

About the author

Balance deployment speed and stability with DORA metrics

2024-07-31 Rostislav Markov

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/devops/balance-deployment-speed-and-stability-with-dora-metrics/

Development teams adopt DevOps practices to increase the speed and quality of their software delivery. The DevOps Research and Assessment (DORA) metrics provide a popular method to measure progress towards that outcome. Using four key metrics, senior leaders can assess the current state of team maturity and address areas of optimization.

This blog post shows you how to make use of DORA metrics for your Amazon Web Services (AWS) environments. We share a sample solution which allows you to bootstrap automatic metric collection in your AWS accounts.

Benefits of collecting DORA metrics

DORA metrics offer insights into your development teams’ performance and capacity by measuring qualitative aspects of deployment speed and stability. They also indicate the teams’ ability to adapt by measuring the average time to recover from failure. This helps product owners in defining work priorities, establishing transparency on team maturity, and developing a realistic workload schedule. The metrics are appropriate for communication with senior leadership. They help commit leadership support to resolve systemic issues inhibiting team satisfaction and user experience.

Use case

This solution is applicable to the following use case:

Development teams have a multi-account AWS setup including a tooling account where the CI/CD tools are hosted, and an operations account for log aggregation and visualization.
Developers use GitHub code repositories and AWS CodePipeline to promote code changes across application environment accounts.
Tooling, operations, and application environment accounts are member accounts in AWS Control Tower or workload accounts in the Landing Zone Accelerator on AWS solution.
Service impairment resulting from system change is logged as OpsItem in AWS Systems Manager OpsCenter.

Overview of solution

The four key DORA metrics

The ‘four keys’ measure team performance and ability to react to problems:

Deployment Frequency measures the frequency of successful change releases in your production environment.
Lead Time For Changes measures the average time for committed code to reach production.
Change Failure Rate measures how often changes in production lead to service incidents/failures, and is complementary to Mean Time Between Failure.
Mean Time To Recovery measures the average time from service interruption to full recovery.

The first two metrics focus on deployment speed, while the other two indicate deployment stability (Figure 1). We recommend organizations to set their own goals (that is, DORA metric targets) based on service criticality and customer needs. For a discussion of prior DORA benchmark data and what it reveals about the performance of development teams, consult How DORA Metrics Can Measure and Improve Performance.

Figure 1. Overview of DORA metrics

Consult the GitHub code repository Balance deployment speed and stability with DORA metrics for a detailed description of the metric calculation logic. Any modifications to this logic should be made carefully.

For example, the Change Failure Rate focuses on changes that impair the production system. Limiting the calculation to tags (such as hotfixes) on pull requests would exclude issues related to the build process. It’s important to match system change records that lead to actual impairments in production. Limiting the calculation to the number of failed deployments from the deployment pipeline only considers deployments that didn’t reach production. We use AWS Systems Manager OpsCenter as the system of records for change-related outages, rather than relying solely on data from CI/CD tools.

Similarly, Mean Time To Recovery measures the duration from a service impairment in production to a successful pipeline run. We encourage teams to track both pipeline status and recovery time, as frequent pipeline failure can indicate insufficient local testing and potential pipeline engineering issues.

Gathering DORA events

Our metric calculation process runs in four steps:

In the tooling account, we send events from CodePipeline to the default event bus of Amazon EventBridge.
Events are forwarded to custom event buses which process them according to the defined metrics and any filters we may have set up.
The custom event buses call AWS Lambda functions which forward metric data to Amazon CloudWatch. CloudWatch gives us an aggregated view of each of the metrics. From Amazon CloudWatch, you can send the metrics to another designated dashboard like Amazon Managed Grafana.
As part of the data collection, the Lambda function will also query GitHub for the relevant commit to calculate the lead time for changes metric. It will query AWS Systems Manager for OpsItem data for change failure rate and mean time to recovery metrics. You can create OpsItems manually as part of your change management process or configure CloudWatch alarms to create OpsItems automatically.

Figure 2 visualizes these steps. This setup can be replicated to a group of accounts of one or multiple teams.

This figure visualizes the aforementioned four steps of our metric calculation process. AWS Lambda functions process all events and publish custom metrics in Amazon CloudWatch.

Figure 2. DORA metric setup for AWS CodePipeline deployments

Walkthrough

Follow these steps to deploy the solution in your AWS accounts.

Prerequisites

For this walkthrough, you should have the following prerequisites:

AWS accounts for tooling, operations, and application environments
Install Python version 3.9 or later
AWS Cloud Development Kit (AWS CDK) v2 installed
Set up an AWS CDK pipeline
Access to AWS CodePipeline, AWS Systems Manager OpsCenter and GitHub

Deploying the solution

Clone the GitHub code repository Balance deployment speed and stability with DORA metrics.

Before you start deploying or working with this code base, there are a few configurations you need to complete in the constants.py file in the cdk/ directory. Open the file in your IDE and update the following constants:

TOOLING_ACCOUNT_ID & TOOLING_ACCOUNT_REGION: These represent the AWS account ID and AWS region for AWS CodePipeline (that is, your tooling account).
OPS_ACCOUNT_ID & OPS_ACCOUNT_REGION: These are for your operations account (used for centralized log aggregation and dashboard).
TOOLING_CROSS_ACCOUNT_LAMBDA_ROLE: The IAM Role for cross-account access that allows AWS Lambda to post metrics from your tooling account to your operations account/Amazon CloudWatch dashboard.
DEFAULT_MAIN_BRANCH: This is the default branch in your code repository that’s used to deploy to your production application environment. It is set to “main” by default, as we assumed feature-driven development (GitFlow) on the main branch; update if you use a different naming convention.
APP_PROD_STAGE_NAME: This is the name of your production stage and set to “DeployPROD” by default. It’s reserved for teams with trunk-based development.

Setting up the environment

To set up your environment on MacOS and Linux:

Create a virtual environment:
```
$ python3 -m venv .venv
```
Activate the virtual environment: On MacOS and Linux:
```
$ source .venv/bin/activate
```

Alternatively, to set up your environment on Windows:

Create a virtual environment:
```
% .venv\Scripts\activate.bat
```
Install the required Python packages:
```
$ pip install -r requirements.txt
```

To configure the AWS Command Line Interface (AWS CLI):

Follow the configuration steps in the AWS CLI User Guide.
```
$ aws configure sso
```
Configure your user profile (for example, Ops for operations account, Tooling for tooling account). You can check user profile names in the credentials file.

Deploying the CloudFormation stacks

Switch directory
```
$ cd cdk
```
Bootstrap CDK
```
$ cdk bootstrap –-profile Ops
```
Synthesize the AWS CloudFormation template for this project:
```
$ cdk synth
```
To deploy a specific stack (see Figure 3 for an overview), specify the stack name and AWS account number(s) in the following command:
```
$ cdk deploy <Stack-Name> --profile {Tooling, Ops}
```
To launch the DoraToolingEventBridgeStack stack in the Tooling account:
```
$ cdk deploy DoraToolingEventBridgeStack --profile Tooling
```
To launch the other stacks in the Operations account (including DoraOpsGitHubLogsStack, DoraOpsDeploymentFrequencyStack, DoraOpsLeadTimeForChangeStack, DoraOpsChangeFailureRateStack, DoraOpsMeanTimeToRestoreStack, DoraOpsMetricsDashboardStack):
```
$ cdk deploy DoraOps* --profile Ops
```

The following figure shows the resources you’ll launch with each CloudFormation stack. This includes six AWS CloudFormation stacks in operations account. The first stack sets up log integration for GitHub commit activity. Four stacks contain a Lambda function which creates one of the DORA metrics. The sixth stack creates the consolidated dashboard in Amazon CloudWatch.

Figure 3. Resources provisioned with this solution

Testing the deployment

To run the provided tests:

$ pytest

Understanding what you’ve built

Deployed resources in tooling account

The DoraToolingEventBridgeStack includes Amazon EventBridge rules with a target of the central event bus in the operations account, plus an AWS IAM role with cross-account access to put events in the operations account. The event pattern for invoking our EventBridge rules listens for deployment state changes in AWS CodePipeline:

{
  "detail-type": ["CodePipeline Pipeline Execution State Change"],
  "source": ["aws.codepipeline"]
}

Deployed resources in operations account

The Lambda function for Deployment Frequency tracks the number of successful deployments to production, and posts the metric data to Amazon CloudWatch. You can add a dimension with the repository name in Amazon CloudWatch to filter on particular repositories/teams.
The Lambda function for the Lead Time For Change metric calculates the duration from the first commit to successful deployment in production. This covers all factors contributing to lead time for changes, including code reviews, build, test, as well as the deployment itself.
The Lambda function for Change Failure Rate keeps track of the count of successful deployments and the count of system impairment records (OpsItems) in production. It publishes both as metrics to Amazon CloudWatch and the latter calculates the ratio, as shown in below example.
The Lambda function for Mean Time To Recovery keeps track of all deployments with status SUCCEEDED in production and whose repository branch name references an existing OpsItem ID. For every matching event, the function gets the creation time of the OpsItem record and posts the duration between OpsItem creation and successful re-deployment to the CloudWatch dashboard.

All Lambda functions publish metric data to Amazon CloudWatch using the PutMetricData API. The final calculation of the four keys is performed on the CloudWatch dashboard. The solution includes a simple CloudWatch dashboard so you can validate the end-to-end data flow and confirm that it has deployed successfully:

This simple CloudWatch dashboard displays the four DORA metrics for three reporting periods: per day, per week, and per month.

Cleaning up

Remember to delete example resources if you no longer need them to avoid incurring future costs.

You can do this via the CDK CLI:

$ cdk destroy <Stack-Name> --profile {Tooling, Ops}

Alternatively, go to the CloudFormation console in each AWS account, select the stacks related to DORA and click on Delete. Confirm that the status of all DORA stacks is DELETE_COMPLETE.

Conclusion

DORA metrics provide a popular method to measure the speed and stability of your deployments. The solution in this blog post helps you bootstrap automatic metric collection in your AWS accounts. The four keys help you gain consensus on team performance and provide data points to back improvement suggestions. We recommend using the solution to gain leadership support for systemic issues inhibiting team satisfaction and user experience. To learn more about developer productivity research, we encourage you to also review alternative frameworks including DevEx and SPACE.

Further resources

If you enjoyed this post, you may also like:

Author bio

Genomics workflows, Part 7: analyze public RNA sequencing data using AWS HealthOmics

2024-06-05 Rostislav Markov

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/genomics-workflows-part-7-analyze-public-rna-sequencing-data-using-aws-healthomics/

Genomics workflows process petabyte-scale datasets on large pools of compute resources. In this blog post, we discuss how life science organizations can use Amazon Web Services (AWS) to run transcriptomic sequencing data analysis using public datasets. This allows users to quickly test research hypotheses against larger datasets in support of clinical diagnostics. We use AWS HealthOmics and AWS Step Functions to orchestrate the entire lifecycle of preparing and analyzing sequence data and remove the associated heavy lifting.

Use case

In genomics, transcription relates to the process of making a ribonucleic acid (RNA) copy from a gene’s deoxyribonucleic acid (DNA). Usually, RNA is single-stranded, although some RNA viruses are double-stranded. With RNA sequencing (RNA-Seq), scientists isolate the RNA, prepare an RNA library, and use next-generation sequencing technology to decode it. Organizations around the world use RNA-Seq to support clinical diagnostics.

In our use case, life science research teams use workflows written in Nextflow to process RNA-Seq datasets in FASTQ file format. Following their initial RNA-Seq studies on internal datasets, scientists can extend their insights by using public datasets. For example, the Gene Expression Omnibus (GEO) functional genomics data repository is hosted by the National Center for Biotechnology Information (NCBI) and offers multiple download options and formats. Scientists can download datasets in FASTQ format from GEO File Transfer Protocol (FTP) and compress them into the .gz format before further analysis.

Scaling and automating the data ingestion can be challenging. For example, scientists might need to do the following:

Manually download FASTQ files and invoke their analysis pipelines
Monitor the workflow runs, which can span hours, days, or weeks
Manage the infrastructure for performance and scale

This blog post presents a solution that removes this undifferentiated heavy lifting.

Prerequisites

To build this solution, you must be analyzing transcriptomic sequencing data with the Nextflow workflow system and make use of GEO FASTQ datasets. In addition, you must do the following:

Create three Amazon Simple Storage Service (Amazon S3) buckets with the following purposes:
- Uploaded GEO Accession IDs (GEO IDs)
- Ingested FASTQ datasets
- RNA-Seq output files
Create one Amazon DynamoDB table to track the status of data ingestion. This helps with checkpointing and avoids repetitive ingestion jobs so that you can keep data ingestion cost to a minimum.

Solution overview

Using AWS, you can automate the entire RNA-Seq Nextflow pipeline. Users only need to provide the GEO IDs, then the pipeline ingests the corresponding FASTQ sample files and performs the subsequent data analysis.

Our solution, shown in Figure 1, uses a combination of AWS HealthOmics and AWS Step Functions. HealthOmics manages the compute, scalability, scheduling, and orchestration required for processing large RNA-Seq datasets. This helps scientists focus on writing their pipelines in Nextflow while AWS takes care of the underlying infrastructure. Step Functions adds reliability to the workflow from dataset ingestion to output archival. Automating the entire workflow also helps with tracing specific invocations and troubleshooting errors.

This figure visualizes the AWS services involved in each processing step, starting with users uploading CSV files with GEO metadata to Amazon S3, and concluding with AWS HealthOmics performing the RNA-Seq analysis and putting the output data on Amazon S3.

Figure 1. RNA sequencing using HealthOmics

Our solution includes the following:

The scientist creates and uploads a CSV file to the GEO metadata S3 bucket. The CSV file includes a reference to the specific GEO ID that is ingested. An Amazon S3 Event Notification configured on s3:ObjectCreated events (in this case, the CSV file upload) invokes an AWS Lambda function.
The Lambda function first extracts the corresponding Sequence Read Run (SRR) IDs of the GEO ID. Next, it starts a Step Functions state machine with the following input parameters: the SRR IDs, species of the samples, and GEO ID. The state machine uses an AWS Batch job queue for parallel ingestion.
The Lambda function writes the following metadata to a DynamoDB table for future reference:
- Ingested GEO ID and corresponding list of SRR IDs
- Amazon S3 output paths to the ingested FASTQ files
- Overall workflow status
- Ingested species
Upon ingestion completion, the state machine puts the RNA-Seq sample sheet into the FASTQ S3 bucket. This invokes a Lambda function, which launches the RNA-Seq analysis workflow with the following input parameters:
- Sample sheet
- GEO ID
- Other relevant metadata
Our RNA-Seq data analysis is run with HealthOmics and the associated sequence store. We use Step Functions to launch this workflow and ingest the relevant files to the sequence store.
Upon workflow completion, HealthOmics writes the output data (BAM files) to the output S3 bucket.

Implementation considerations

Dataset preparation

The Step Functions state machine orchestrates the ingestion of FASTQ files through the following steps:

The state machine invokes the Map state in Step Functions that uses dynamic parallelism for increased scale, with the SRR IDs array as input. You can now launch multiple AWS Batch jobs in parallel to ingest the FASTQ files that correspond to the SRR ID input.
The state machine checks our ingestion DynamoDB table to see if the corresponding SRR ID has already been processed and has ingested the corresponding FASTQ files. If the SRR ID ingested the files, the state machine writes the sample sheet to the FASTQ S3 bucket and terminates successfully.
The state machine uses the NCBI-provided sra-tools Docker container and fasterq-dump command to ingest the FASTQ files. The state machine generates the set of ingestion commands and starts the AWS Batch job. The ingestion commands are a set of shell commands that interact with NCBI for downloading FASTQ files. These commands compress the files with pigz, and then uploads them to an S3 bucket.
The state machine updates the DynamoDB table with the ingestion status.
1. If the ingestion is successful, then the state machine continues to step 5.
2. If the ingestion isn’t successful, the state machine writes a message to Amazon Simple Notification Service (Amazon SNS) to notify scientists of the failure.
A Lambda function generates the RNA-Seq sample sheet with the combined samples to analyze. This sample sheet is a CSV file containing:
1. The paths to the ingested FASTQ files.
2. The names of each corresponding SRR ID as input to the RNA-Seq workflow.
The state machine notifies that the ingestion job is complete by publishing a message to an Amazon SNS topic before terminating itself.

Figure 2 provides a detailed overview of the state machine.

This Map state definition in AWS Step Functions visualizes the aforementioned steps for FASTQ file ingestion including orchestration of the associated AWS Batch job.

Figure 2. RNA sequencing data ingestion

Dataset analysis

A Lambda function divides the RNA-Seq sample sheet in compliance with the Step Functions service quota. This enables parallel processing using a Map state.

Our transcriptomic analysis workflow does the following:

Checks if samples are single-end (one FASTQ file per sample) or paired-end (two sets of FASTQ files per sample).
Ingests the appropriate set of FASTQ files into the HealthOmics sequence store.
Monitors the status until all files are imported.

In parallel, a Lambda function initiates the HealthOmics RNA-Seq workflow.

Upon successful completion, HealthOmics stores the output data in Amazon S3. Finally, our state machine imports the output BAM files into the HealthOmics sequence store for future use.

Figure 3 provides a detailed overview of our state machine.

This AWS Step Functions workflow visualizes the aforementioned steps for data analysis including orchestration of the associated AWS HealthOmics workflow and FASTQ file ingestion into the HealthOmics Sequence Store.

Figure 3. RNA sequencing workflow

Cleanup (optional)

Delete all AWS resources that you no longer want to maintain.

Conclusion

HealthOmics removes the heavy lifting associated with gaining insights from genomics, transcriptomics, and other omics data. We used RNA-Seq analysis to showcase an example scientific workflow that can benefit from HealthOmics. When using HealthOmics in combination with Step Functions, scientists can automate the entire workflow from initial dataset preparation to archival. To learn more, we encourage you to explore our HealthOmics tutorials on GitHub.

Related information

Genomics workflows, Part 6: cost prediction

2024-04-24 Rostislav Markov

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/part-6-genomics-workflows-cost-prediction/

Genomics workflows run on large pools of compute resources and take petabyte-scale datasets as inputs. Workflow runs can cost as much as hundreds of thousands of US dollars. Given this large scale, scientists want to estimate the projected cost of their genomics workflow runs before deciding to launch them.

In Part 6 of this series, we build on the benchmarking concepts presented in Part 5. You will learn how to train machine learning (ML) models on historical data to predict the cost of future runs. While we focus on genomics, the design pattern is broadly applicable to any compute-intensive workflow use case.

Use case

In large life-sciences organizations, multiple research teams often use the same genomics applications. The actual cost of consuming shared resources is only periodically shown or charged back to research teams.

In this blog post’s scenario, scientists want to predict the cost of future workflow runs based on the following input parameters:

Workflow name
Input data set size
Expected output dataset size

In our experience, scientists might not know how to reliably estimate compute cost based on the preceding parameters. This is because workflow run cost doesn’t linearly correlate to the input dataset size. For example, some workflow steps might be highly parallelizable while others aren’t. Otherwise, scientists could simply use the AWS Pricing Calculator or interact programmatically with the AWS Price List API. To solve this problem, we use ML to model the pattern of correlation and predict workflow cost.

Business benefits of predicting the cost of genomics workflow runs

Price prediction brings the following benefits:

Prioritizing workflow runs based on financial impact
Promoting cost awareness and frugality with application users
Supporting enterprise resource planning and prevention of budget overruns by integrating estimation data into management reporting and approval workflows

Prerequisites

To build this solution, you must have workflows running on AWS for which you collect actual cost data after each workflow run. This setup is demonstrated in Part 3 and Part 5 of this blog series. This data provides training data for the solution’s cost prediction models.

Solution overview

This solution includes a friendly user interface, ML models that predict usage parameters, and a metadata storage mechanism to estimate the cost of a workflow run. We use the automated workflow manager presented in Part 3 and the benchmarking solution from Part 5. The data on historical workflow launches and their cost serves as training and testing data for our ML models. We store this in Amazon DynamoDB. We use AWS Amplify to host a serverless user interface and a library/framework such as React to build it.

Scientists input the required parameters about their genomics workflow run to the Amplify frontend React application. The latter makes a request to an Amazon API Gateway REST API. This invokes an AWS Lambda function, which calls an Amazon SageMaker hosted endpoint to return predicted costs (Figure 1).

This visual summarizes the cost prediction and model training processes. Users request cost predictions for future workflow runs on a web frontend hosted in AWS Amplify. The frontend passes the requests to an Amazon API Gateway endpoint with Lambda integration. The Lambda function retrieves the suitable model endpoint from the DynamoDB table and invokes the model via the Amazon SageMaker API. Model training runs on a schedule and is orchestrated by an AWS Step Functions state machine. The state machine queries training datasets from the DynamoDB table. If the new model performs better, it is registered in the SageMaker model registry. Otherwise, the state machine sends a notification to an Amazon Simple Notification Service topic stating that there are no updates.

Figure 1. Automated cost prediction of genomics workflow runs

Each workflow captured in the DynamoDB table has a corresponding ML model trained for the specific use case. Separating out models for specific workflows simplifies the model development process. This solution periodically trains ML models to improve their overall accuracy and performance. A rule in Amazon EventBridge Scheduler invokes model training on a regular basis. An AWS Step Functions state machine automates the model training process.

Implementation considerations

When a scientist submits a request (which includes the name of the workflow they’re running), API Gateway uses Lambda integration. The Lambda function retrieves a record from the DynamoDB table that keeps track of the SageMaker hosted endpoints. The partition key of the table is the workflow name (indicated as workflow_name), as shown in the following example:

This visual displays an exemplary DynamoDB record. The record includes the Amazon SageMaker hosted endpoint that AWS Lambda would retrieve for a regenie workflow.

Using the input parameters, the Lambda function invokes the SageMaker hosted endpoint and returns the inference values back to the frontend.

Automating model training

Our Step Functions state machine for model training uses native SageMaker SDK integration. It runs as follows:

The state machine invokes a SageMaker training job to train a new ML model. The training job uses the historical workflow run data sourced from the DynamoDB table. After the training job completes, it outputs the ML model to an Amazon Simple Storage Service (Amazon S3) bucket.
The state machine registers the new model in the SageMaker model registry.
A Lambda function compares the performance of the new model with the prior version on the training dataset.
If the new model performs better than the prior model, the state machine creates a new SageMaker hosted endpoint configuration and puts the endpoint name in the DynamoDB table.
Otherwise, the state machine sends a notification to an Amazon Simple Notification Service (Amazon SNS) topic stating that there are no updates.

Conclusion

In this blog post, we demonstrated how genomics research teams can build a price estimator to predict genomics workflow run cost. This solution trains ML models for each workflow based on data from historical workflow runs. A state machine helps automate the entire model training process. You can use price estimation to promote cost awareness in your organization and reduce the risk of budget overruns.

Our solution is particularly suitable if you want to predict the price of individual workflow runs. If you want forecast overall consumption of your shared application infrastructure, consider deploying a forecasting workflow with Amazon Forecast. The Build workflows for Amazon Forecast with AWS Step Functions blog post provides details on the specific use case for using Amazon Forecast workflows.

Related information

Simplify document search at scale with intelligent search bot on AWS

2024-03-21 Rostislav Markov

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/simplify-document-search-at-scale-with-intelligent-search-bot-on-aws/

Enterprise document management systems (EDMS) manage the lifecycle and distribution of documents. They often rely on keyword-based search functionality. However, it increasingly becomes hard to discover documents as such repositories grow to tens of thousands of items.

In this blog, we discuss how Amazon Web Services (AWS) built an intelligent search bot on top of the document repository of a global life sciences company. Before this enhancement, the native repository search function relied solely on keywords and document names, leading to a trial-and-error process. Now, scientists can effortlessly query the repository using natural language to locate the right document in a few seconds—even through voice commands while working with lab equipment.

Use case

In life sciences, documentation is critical for regulatory compliance with GxP. Scientists in life sciences use EDMS on a daily basis to retrieve standard operating procedures (SOPs). SOPs contain task-level instructions (for example, how to monitor lab equipment or use utilities such as steam generators and chilled water circulation pumps).

EDMS search capability is often limited to file names and metadata. In our use case, file names were numerical and metadata was typically a single-sentence description plus keywords.

Scientists wanted to query for a particular context and type of task they’re about to perform. However, this data was not extracted from document content and thus not readily available for search purposes. Scientists also wanted to be able to search by using a voice interface (for example, when they operate on lab equipment).

To address this, we designed a conversational bot interface. This bot locates the most relevant SOP based on pre-generated document extracts and returns a hyperlink to the most suitable document, as shown in Figure 1.

Figure 1. Example of document search prompt and chatbot response

Overview of the intelligent search bot solution

Our solution requirements for intelligent search were:

Semantic search index and ranking based on full text
Search capability through voice and text
Out-of-the-box integration with web applications and mobile devices

We choose Amazon Lex to provide the conversational interface using text or speech. Lex bots can be integrated with web and mobile applications using AWS Amplify Interactions. We used Amazon Kendra to create an intelligent search index on top of the data repository, which we hosted on Amazon Simple Storage Service (Amazon S3).

The advantage of using an Amazon Kendra index is its out-of-the-box semantic search and ranking capability based on document content. We use AWS Lambda to take care of Amazon S3 path mappings, and document attribute formatting for replicated documents, in order to make them retrievable by Amazon Kendra.

Figure 2. Intelligent search bot for enterprise document management systems

Benefits of integrating intelligent search bot with your EDMS

The benefits of extending EDMS with intelligent search bot include:

Improved usability by adding text- and speech-based channels to match user situations (for example, scientists operating on lab equipment)
Native, out-of-the-box integration with third-party systems (for example, Adobe Experience Manager, Alfresco, HubSpot, Marketo, Salesforce)
Implementation timeboxed in a two-week agile sprint and requires no data science skills

Extensibility to large language models

The Amazon Kendra Retrieve API allows the extension to a document retrieval chain pattern using a large language model (LLM) from Amazon Bedrock or Amazon SageMaker Jumpstart. With this pattern, Amazon Kendra generated document summaries can be passed to the LLM for correlation, as shown in Figure 3. Consult the LangChain documentation to learn how to configure retrieval chains.

Figure 3. Extending the document search bot to large language model

In our use case, the effort for such extension went beyond the incremental optimizations and limited migration timeframe. Also, preference for a chain pattern should be given to complex correlations across documents.

This wasn’t the case here, as documents were functionally disjunct (for example, by device type, geographical site, and process task). Therefore, document retrieval with the Amazon Kendra API was sufficient and we deferred the extra effort associated with custom-built LLM prompt layers.

Implementation considerations

We started the EDMS migration to AWS by replicating the data repository to Amazon S3 by using AWS DataSync. We stored every document and corresponding metadata files as separate Amazon S3 objects.

For the Amazon Kendra index mappings to initialize properly:

Metadata must be a JSON Amazon S3 object
Follow the name convention <document>.<extension>.metadata.json
Have reserved or common document attributes correctly formatted

The EDMS system did not adhere to this when generating metadata files so we offloaded the transformation to a Lambda function. The function fixed metadata attributes such as, changing version type (_version) from numeric to string and date (_created_at) from string to ISO8601. It also changed metadata names/Amazon S3 paths by enacting new objects (PutObject API) and deleting the original objects (DeleteObject API).

We configured Lambda invocation on Amazon S3 PutObject operations using Amazon S3 event notifications. We set the sync run schedule for the Amazon Kendra index to run on demand.

Alternatively, you can run it on a predefined schedule or as part of each Lambda invocation (using the update_index boto3 operation). Finally, we monitor for sync run fails associated with the Amazon Kendra index using Amazon CloudWatch.

Conclusion

This blog showed how you can enhance the keyword-based search of your EDMS. We embedded document search queries behind a chatbot to simplify user interaction.

You can build this solution quickly, with no machine learning skills, as part of your EDMS cloud migration. In more advanced use cases, including complete refactoring, consider extending this solution to use a large language model from Amazon Bedrock or Amazon SageMaker Jumpstart.

Related information

Reduce archive cost with serverless data archiving

2023-07-07 Rostislav Markov

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/reduce-archive-cost-with-serverless-data-archiving/

For regulatory reasons, decommissioning core business systems in financial services and insurance (FSI) markets requires data to remain accessible years after the application is retired. Traditionally, FSI companies either outsourced data archiving to third-party service providers, which maintained application replicas, or purchased vendor software to query and visualize archival data.

In this blog post, we present a more cost-efficient option with serverless data archiving on Amazon Web Services (AWS). In our experience, you can build your own cloud-native solution on Amazon Simple Storage Service (Amazon S3) at one-fifth of the price of third-party alternatives. If you are retiring legacy core business systems, consider serverless data archiving for cost-savings while keeping regulatory compliance.

Serverless data archiving and retrieval

Modern archiving solutions follow the principles of modern applications:

Serverless-first development, to reduce management overhead.
Cloud-native, to leverage native capabilities of AWS services, such as backup or disaster recovery, to avoid custom build.
Consumption-based pricing, since data archival is consumed irregularly.
Speed of delivery, as both implementation and archive operations need to be performed quickly to fulfill regulatory compliance.
Flexible data retention policies can be enforced in an automated manner.

AWS Storage and Analytics services offer the necessary building blocks for a modern serverless archiving and retrieval solution.

Data archiving can be implemented on top of Amazon S3) and AWS Glue.

Amazon S3 storage tiers enable different data retention policies and retrieval service level agreements (SLAs). You can migrate data to Amazon S3 using AWS Database Migration Service; otherwise, consider another data transfer service, such as AWS DataSync or AWS Snowball.
AWS Glue crawlers automatically infer both database and table schemas from your data in Amazon S3 and store the associated metadata in the AWS Glue Data Catalog.
Amazon CloudWatch monitors the execution of AWS Glue crawlers and notifies of failures.

Figure 1 provides an overview of the solution.

Figure 1. Serverless data archiving and retrieval

Once the archival data is catalogued, Amazon Athena can be used for serverless data query operations using standard SQL.

Amazon API Gateway receives the data retrieval requests and eases integration with other systems via REST, HTTPS, or WebSocket.
AWS Lambda reads parametrization data/templates from Amazon S3 in order to construct the SQL queries. Alternatively, query templates can be stored as key-value entries in a NoSQL store, such as Amazon DynamoDB.
Lambda functions trigger Athena with the constructed SQL query.
Athena uses the AWS Glue Data Catalog to retrieve table metadata for the Amazon S3 (archival) data and to return the SQL query results.

How we built serverless data archiving

An early build-or-buy assessment compared vendor products with a custom-built solution using Amazon S3, AWS Glue, and a user frontend for data retrieval and visualization.

The total cost of ownership over a 10-year period for one insurance core system (Policy Admin System) was $0.25M to build and run the custom solution on AWS compared with >$1.1M for third-party alternatives. The implementation cost advantage of the custom-built solution was due to development efficiencies using AWS services. The lower run cost resulted from a decreased frequency of archival usage and paying only for what you use.

The data archiving solution was implemented with AWS services (Figure 2):

Amazon S3 is used to persist archival data in Parquet format (optimized for analytics and compressed to reduce storage space) that is loaded from the legacy insurance core system. The archival data source was AS400/DB2 and moved with Informatica Cloud to Amazon S3.
AWS Glue crawlers infer the database schema from objects in Amazon S3 and create tables in AWS Glue for the decommissioned application data.
Lambda functions (Python) remove data records based on retention policies configured for each domain, such as customers, policies, claims, and receipts. A daily job (Control-M) initiates the retention process.

Figure 2. Exemplary implementation of serverless data archiving and retrieval for insurance core system

Retrieval operations are formulated and executed via Python functions in Lambda. The following AWS resources implement the retrieval logic:

Athena is used to run SQL queries over the AWS Glue tables for the decommissioned application.
Lambda functions (Python) build and execute queries for data retrieval. The functions render HMTL snippets using Jinja templating engine and Athena query results, returning the selected template filled with the requested archive data. Using Jinja as templating engine improved the speed of delivery and reduced the heavy lifting of frontend and backend changes when modeling retrieval operations by ~30% due to the decoupling between application layers. As a result, engineers only need to build an Athena query with the linked Jinja template.
Amazon S3 stores templating configuration and queries (JSON files) used for query parametrization.
Amazon API Gateway serves as single point of entry for API calls.

The user frontend for data retrieval and visualization is implemented as web application using React JavaScript library (with static content on Amazon S3) and Amazon CloudFront used for web content delivery.

The archiving solution enabled 80 use cases with 60 queries and reduced storage from three terabytes on source to only 35 gigabytes on Amazon S3. The success of the implementation depended on the following key factors:

Appropriate sponsorship from business across all areas (claims, actuarial, compliance, etc.)
Definition of SLAs for responding to courts, regulators, etc.
Minimum viable and mandatory approach
Prototype visualizations early on (fail fast)

Conclusion

Traditionally, FSI companies relied on vendor products for data archiving. In this post, we explored how to build a scalable solution on Amazon S3 and discussed key implementation considerations. We have demonstrated that AWS services enable FSI companies to build a serverless archiving solution while reaching and keeping regulatory compliance at a lower cost.

Learn more about some of the AWS services covered in this blog:

Genomics workflows, Part 5: automated benchmarking

2023-03-24 Rostislav Markov

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/genomics-workflows-part-5-automated-benchmarking/

Launching and running genomics workflows can take hours and involves large pools of compute instances that process data at a petabyte scale. Benchmarking helps you evaluate workflow performance and discover faster and cheaper ways of running them.

In practice, performance evaluations happen irregularly because of the associated heavy lifting. In this blog post, we discuss how life-science research teams can automate evaluations.

Business Benefits

An automated benchmarking solution provides:

more accurate enterprise resource planning by performing historical analytics,
lower cost to the business by comparing performance on different resource types, and
cost transparency to the business by quantifying periodical chargeback.

We’ve used automated benchmarking to compare processing times on different services such as Amazon Elastic Compute Cloud (Amazon EC2), AWS Batch, AWS ParallelCluster, Amazon Elastic Kubernetes Service (Amazon EKS), and on-premises HPC clusters. Scientists, financiers, technical leaders, and other stakeholders can build reports and dashboards to compare consumption data by consumer, workflow type, and time period.

Design pattern

Our automated benchmarking solution measures performance on two dimensions:

Timing: measures the duration of a workflow launch on a specific dataset
Pricing: measures the associated cost

This solution can be extended to other performance metrics such as iterations per second or process/thread distribution across compute nodes.

Our requirements include the following:

Consistent measurement of timing based on workflow status (such as preparing, waiting, ready, running, failed, complete)
Extensible pricing models based on unit prices (the Amazon EC2 Spot price at a specific period of time compared to Amazon EC2 On-Demand pricing)
Scalable, cost-efficient, and flexible data store enabling historical benchmarking and estimations
Minimal infrastructure management overhead

We choose a serverless design pattern using AWS Step Functions orchestration, AWS Lambda for our application code, and Amazon DynamoDB to track workflow launch IDs and states (as described in Part 3). We assume that the genomics workflows run on AWS Batch with genomics data on Amazon FSx for Lustre (Part 1). AWS Step Functions allows us to break down processing into smaller steps and avoid monolithic application code. Our evaluation process runs in four steps:

Monitor for completed workflow launches in the DynamoDB stream using an Amazon EventBridge pipe with a Step Functions workflow as target. This event-driven approach is more efficient than periodic polling and avoids custom code for parsing status and cost values in all records of the DynamoDB stream.
Collect a list of all compute resources associated with the workflow launch. Design a Lambda function that queries the AWS Batch API (see Part 1) to describe compute environment parameters like the Amazon EC2 instance IDs and their details, such as processing times, instance family/size, and allocation strategy (for example, Spot Instances, Reserved Instances, On-Demand Instances).
Calculate the cost of all consumed resources. We achieve this with another Lambda function, which calculates the total price based on unit prices from the AWS Price List Query API.
Our state machine updates the total price in the DynamoDB table without the need for additional application code.

Figure 1 visualizes these steps.

Figure 1. Automated benchmarking of genomics workflows

Implementation considerations

AWS Step Functions orchestrates our benchmarking workflow reliably and makes our application code easy to maintain. Figure 2 summarizes the state machine transitions that we’ll describe.

Figure 2. AWS Step Functions state machine for automated benchmarking

Gather consumption details

Configure the DynamoDB stream view type to New image so that the entire item is passed through as it appears after it was changed. We set up an Amazon EventBridge pipe with event filtering and the DynamoDB stream as a source. Our event filter uses multiple matching on records with a status of COMPLETE, but no cost entry in order to avoid an infinite loop. Once our state machine has updated the DynamoDB item with the workflow price, the resulting record in the DynamoDB stream will not pass our event filter.

The syntax of our event filter is as follows:

{
  "dynamodb": {
    "NewImage": {
      "status": {
        "S": ["COMPLETE"]
      },
      "totalCost": {
        "S": [{
          "exists": false
        }]
      }
    }
  }
}

We use an input transformer to simplify follow-on parsing by removing unnecessary metadata from the event.

The consumed resources included in the stream record are the auto-scaling group ID for AWS Batch and the Amazon FSx for Lustre volume ID. We use the DescribeJobs API (describe_jobs in Boto3) to determine which compute resources were used. If the response is a list of EC2 instances, we then look up consumption information including start and end times using the ListJobs API (list_jobs in Boto3) for each compute node. We use describe_volumes with filters on the identified EC2 instances to obtain the size and type of Amazon Elastic Block Store (Amazon EBS) volumes.

Calculate prices

Another Lambda function obtains the associated unit prices of all consumed resources using the GetProducts request of AWS Price List Query API (get_products in Boto3) and then parsing the pricePerUnit value. For Spot Instances, we use describe_spot_price_history of the EC2 client in Boto3 and specify the time range and instance types for which we want to receive prices.

Calculate the price of workflow launches based on the following factors:

Number and size of EC2 instances in auto-scaling node groups
Size of EBS volumes and Amazon FSx for Lustre
Processing duration

Our Python-based Lambda function calculates the total, rounds it, and delivers the price breakdown in the following format:

total_cost: str, instance_cost: str, volume_cost: str, filesystem_cost: str

Lastly, we put the price breakdown to the DynamoDB table using UpdateItem directly from the Amazon States Language.

Note that AWS credits and enterprise discounts might not be reflected in the responses of the AWS Price List Query API unless applied to the particular AWS account. This is often considered best practice in light of least-privilege considerations.

In the past, we’ve also used AWS Cost Explorer instead of the AWS Price List API. AWS Cost Explorer data is updated at least once every 24 hours. You can denote the pending price status in the DynamoDB table item and use the Wait state to delay the calculation process.

The presented solution can be extended to other compute services such as Amazon Elastic Kubernetes Service (Amazon EKS). For Amazon EKS, events are enriched with the cluster ID from the DynamoDB table and the price calculation should also include control plane costs.

Conclusion

Life-science research teams use benchmarking to compare workflow performance and inform their architectural decisions. Such evaluations are effort-intensive and therefore done irregularly.

In this blog post, we showed how life-science research teams can automate benchmarking for their scientific workflows. The insights teams gain from automated benchmarking indicate continuous optimization opportunities, such as by adjusting compute node configuration. The evaluation data is also available on demand for other purposes including chargeback.

Stay tuned for our next post in which we show how to use historical benchmarking data for price estimations of future workflow launches.

Related information

Genomics workflows, Part 4: processing archival data

2023-01-04 Rostislav Markov

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/genomics-workflows-part-4-processing-archival-data/

Genomics workflows analyze data at petabyte scale. After processing is complete, data is often archived in cold storage classes. In some cases, like studies on the association of DNA variants against larger datasets, archived data is needed for further processing. This means manually initiating the restoration of each archived object and monitoring the progress. Scientists require a reliable process for on-demand archival data restoration so their workflows do not fail.

In Part 4 of this series, we look into genomics workloads processing data that is archived with Amazon Simple Storage Service (Amazon S3). We design a reliable data restoration process that informs the workflow when data is available so it can proceed. We build on top of the design pattern laid out in Parts 1-3 of this series. We use event-driven and serverless principles to provide the most cost-effective solution.

Use case

Our use case focuses on data in Amazon Simple Storage Service Glacier (Amazon S3 Glacier) storage classes. The S3 Glacier Instant Retrieval storage class provides the lowest-cost storage for long-lived data that is rarely accessed but requires retrieval in milliseconds.

The S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive provide further cost savings, with retrieval times ranging from minutes to hours. We focus on the latter in order to provide the most cost-effective solution.

You must first restore the objects before accessing them. Our genomics workflow will pause until the data restore completes. The requirements for this workflow are:

Reliable launch of the restore so our workflow doesn’t fail (due to S3 Glacier service quotas, or because not all objects were restored)
Event-driven design to mirror the event-driven nature of genomics workflows and perform the retrieval upon request
Cost-effective and easy-to-manage by using serverless services
Upfront detection of archived data when formulating the genomics workflow task, avoiding idle computational tasks that incur cost
Scalable and elastic to meet the restore needs of large, archived datasets

Solution overview

Genomics workflows take multiple input parameters to prepare the initiation, such as launch ID, data path, workflow endpoint, and workflow steps. We store this data, including workflow configurations, in an S3 bucket. An AWS Fargate task reads from the S3 bucket and prepares the workflow. It detects if the input parameters include S3 Glacier URLs.

We use Amazon Simple Queue Service (Amazon SQS) to decouple S3 Glacier index creation from object restore actions (Figure 1). This increases the reliability of our process.

Figure 1. Solution architecture for S3 Glacier object restore

An AWS Lambda function creates the index of all objects in the specified S3 bucket URLs and submits them as an SQS message.

Another Lambda function polls the SQS queue and submits the request(s) to restore the S3 Glacier objects to S3 Standard storage class.

The function writes the job ID of each S3 Glacier restore request to Amazon DynamoDB. After the restore is complete, Lambda sets the status of the workflow to READY. Only then can any computing jobs start, such as with AWS Batch.

Implementation considerations

We consider the use case of Snakemake with Tibanna, which we detailed in Part 2 of this series. This allows us to dive deeper on launch details.

Snakemake is an open-source utility for whole-genome-sequence mapping in directed acyclic graph format. Snakemake uses Snakefiles to declare workflow steps and commands. Tibanna is an open-source, AWS-native software that runs bioinformatics data pipelines. It supports Snakefile syntax, plus other workflow languages, including Common Workflow Language and Workflow Description Language (WDL).

We recommend using Amazon Genomics CLI if Tibanna is not needed for your use case, or Amazon Omics if your workflow definitions are compliant with the supported WDL and Nextflow specifications.

Formulate the restore request

The Snakemake Fargate launch container detects if the S3 objects under the requested S3 bucket URLs are stored in S3 Glacier. The Fargate launch container generates and puts a JSON binary base call (BCL) configuration file into an S3 bucket and exits successfully. This file includes the launch ID of the workflow, corresponding with the DynamoDB item key, plus the S3 URLs to restore.

Query the S3 URLs

Once the JSON BCL configuration file lands in this S3 bucket, the S3 Event Notification PutObject event invokes a Lambda function. This function parses the configuration file and recursively queries for all S3 object URLs to restore.

Initiate the restore

The main Lambda function then submits messages to the SQS queue that contains the full list of S3 URLs that need to be restored. SQS messages also include the launch ID of the workflow. This is to ensure we can bind specific restoration jobs to specific workflow launches. If all S3 Glacier objects belong to Flexible Retrieval storage class, the Lambda function puts the URLs in a single SQS message, enabling restoration with Bulk Glacier Job Tier. The Lambda function also sets the status of the workflow to WAITING in the corresponding DynamoDB item. The WAITING state is used to notify the end user that the job is waiting on the data-restoration process and will continue once the data restoration is complete.

A secondary Lambda function polls for new messages landing in the SQS queue. This Lambda function submits the restoration request(s)—for example, as a free-of-charge Bulk retrieval—using the RestoreObject API. The function subsequently writes the S3 Glacier Job ID of each request in our DynamoDB table. This allows the main Lambda function to check if all Job IDs associated with a workflow launch ID are complete.

Update status

The status of our workflow launch will remain WAITING as long as the Glacier object restore is incomplete. The AWS CloudTrail logs of completed S3 Glacier Job IDs invoke our main Lambda function (via an Amazon EventBridge rule) to update the status of the restoration job in our DynamoDB table. With each invocation, the function checks if all Job IDs associated with a workflow launch ID are complete.

After all objects have been restored, the function updates the workflow launch with status READY. This launches the workflow with the same launch ID prior to the restore.

Conclusion

In this blog post, we demonstrated how life-science research teams can make use of their archival data for genomic studies. We designed an event-driven S3 Glacier restore process, which retrieves data upon request. We discussed how to reliably launch the restore so our workflow doesn’t fail. Also, we determined upfront if an S3 Glacier restore is needed and used the WAITING state to prevent our workflow from failing.

With this solution, life-science research teams can save money using Amazon S3 Glacier without worrying about their day-to-day work or manually administering S3 Glacier object restores.

Related information

Genomics workflows, Part 3: automated workflow manager

2022-12-21 Rostislav Markov

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/genomics-workflows-part-3-automated-workflow-manager/

Genomics workflows are high-performance computing workloads. Life-science research teams make use of various genomics workflows. With each invocation, they specify custom sets of data and processing steps, and translate them into commands. Furthermore, team members stay to monitor progress and troubleshoot errors, which can be cumbersome, non-differentiated, administrative work.

In Part 3 of this series, we describe the architecture of a workflow manager that simplifies the administration of bioinformatics data pipelines. The workflow manager dynamically generates the launch commands based on user input and keeps track of the workflow status. This workflow manager can be adapted to many scientific workloads—effectively becoming a bring-your-own-workflow-manager for each project.

Use case

In Part 1, we demonstrated how life-science research teams can use Amazon Web Services to remove the heavy lifting of conducting genomic studies, and our design pattern was built on AWS Step Functions with AWS Batch. We mentioned that we’ve worked with life-science research teams to put failed job logs onto Amazon DynamoDB. Some teams prefer to use command-line interface tools, such as the AWS Command Line Interface; other interfaces, such as PyBDA with Apache Spark, or CWL experimental grammar in combination with the Amazon Simple Storage Service (Amazon S3) API, are also used when access to the AWS Management Console is prohibited. In our use case, scientists used the console to easily update table items, plus initiate retry via DynamoDB streams.

In this blog post, we extend this idea to a new frontend layer in our design pattern. This layer automates command generation and monitors the invocations of a variety of workflows—becoming a workflow manager. Life-science research teams use multiple workflows for different datasets and use cases, each with different syntax and commands. The workflow manager we create removes the administrative burden of formulating workflow-specific commands and tracking their launches.

Solution overview

We allow scientists to upload their requested workflow configuration as objects in Amazon S3. We use S3 Event Notifications on PUT requests to invoke an AWS Lambda function. The function parses the uploaded S3 object and registers the new launch request as a DynamoDB item using the PutItem operation. Each item corresponds with a distinct launch request, stored as key-value pair. Item values store the:

S3 data path containing genomic datasets
Workflow endpoint
Preferred compute service (optional)

Another Lambda function monitors for change data captures in the DynamoDB Stream (Figure 1). With each PutItem operation, the Lambda function prepares a workflow invocation, which includes translating the user input into the syntax and launch commands of the respective workflow.

In the case of Snakemake (discussed in Part 2), the function creates a Snakefile that declares processing steps and commands. The function spins up an AWS Fargate task that builds the computational tasks, distributes them with AWS Batch, and monitors for completion. An AWS Step Functions state machine orchestrates job processing, for example, initiated by Tibanna.

Amazon CloudWatch provides a consolidated overview of performance metrics, like time elapsed, failed jobs, and error types. We store log data, including status updates and errors, in Amazon CloudWatch Logs. A third Lambda function parses those logs and updates the status of each workflow launch request in the corresponding DynamoDB item (Figure 1).

Figure 1. Workflow manager for genomics workflows

Implementation considerations

In this section, we describe some of our past implementation considerations.

Register new workflow requests

DynamoDB items are key-value pairs. We use launch IDs as key, and the value includes the workflow type, compute engine, S3 data path, the S3 object path to the user-defined configuration file and workflow status. Our Lambda function parses the configuration file and generates all commands plus ancillary artifacts, such as Snakefiles.

Launch workflows

Launch requests are picked by a Lambda function from the DynamoDB stream. The function has the following required parameters:

Launch ID: unique identifier of each workflow launch request
Configuration file: the Amazon S3 path to the configuration sheet with launch details (in s3://bucket/object format)
Compute service (optional): our workflow manager allows to select a particular service on which to run computational tasks, such as Amazon Elastic Compute Cloud (Amazon EC2) or AWS ParallelCluster with Slurm Workload Manager. The default is the pre-defined compute engine.

These points assume that the configuration sheet is already uploaded into an accessible location in an S3 bucket. This will issue a new Snakemake Fargate launch task. If either of the parameters is not provided or access fails, the workflow manager returns MissingRequiredParametersError.

Log workflow launches

Logs are written to CloudWatch Logs automatically. We write the location of the CloudWatch log group and log stream into the DynamoDB table. To send logs to Amazon CloudWatch, specify the awslogs driver in the Fargate task definition settings in your provisioning template.

Our Lambda function writes Fargate task launch logs from CloudWatch Logs to our DynamoDB table. For example, OutOfMemoryError can occur if the process utilizes more memory than the container is allocated.

AWS Batch job state logs are written to the following log group in CloudWatch Logs: /aws/batch/job. Our Lambda function writes status updates to the DynamoDB table. AWS Batch jobs may encounter errors, such as being stuck in RUNNABLE state.

Manage state transitions

We manage the status of each job in DynamoDB. Whenever a Fargate task changes state, it is picked up by a CloudWatch rule that references the Fargate compute cluster. This CloudWatch rule invokes a notifier Lambda function that updates the workflow status in DynamoDB.

Conclusion

In this blog post, we demonstrated how life-science research teams can simplify genomic analysis across an array of workflows. These workflows usually have their own command syntax and workflow management system, such as Snakemake. The presented workflow manager removes the administrative burden of preparing and formulating workflow launches, increasing reliability.

The pattern is broadly reusable with any scientific workflow and related high-performance computing systems. The workflow manager provides persistence to enable historical analysis and comparison, which enables us to automatically benchmark workflow launches for cost and performance.

Stay tuned for Part 4 of this series, in which we explore how to enable our workflows to process archival data stored in Amazon Simple Storage Service Glacier storage classes.

Related information

Genomics workflows, Part 2: simplify Snakemake launches

2022-12-09 Rostislav Markov

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/genomics-workflows-part-2-simplify-snakemake-launches/

Genomics workflows are high-performance computing workloads. In Part 1 of this series, we demonstrated how life-science research teams can focus on scientific discovery without the associated heavy lifting. We used regenie for large genome-wide association studies. Our design pattern built on AWS Step Functions with AWS Batch and Amazon FSx for Lustre.

In Part 2, we explore genomics workloads with built-in workflow logic. Historically, running bioinformatics data pipelines was a manual and error-prone task. Over the last years, multiple workflow management systems have emerged. An example of these is the Snakemake workflow management system with Tibanna orchestration. We discuss the solution design and how you can fully automate the launch with Amazon Web Services (AWS).

Use case

We focus on the use case of Snakemake, an open-source utility for whole genome sequence mapping in directed acyclic graph (DAG) format. Snakemake uses Snakefiles to declare workflow steps and commands. A Snakefile extends Python syntax to declare workflow steps such as mapping data sets to DAG structure and identifying variants. Consult the Snakemake tutorial for further information on workflow rules.

Snakefiles provide an exception from the general design pattern and an alternative to granular modeling workflow logic in Amazon States Language. In our real-life use case, we used Tibanna to orchestrate Snakemake. Tibanna is an open-source, AWS-native software that runs bioinformatics data pipelines. It supports Snakefile syntax, plus other workflow languages, including Common Workflow Language and Workflow Description Language (WDL).

We recommend using Amazon Genomics CLI, if Tibanna is not needed for your use case, and Amazon Omics, if your workflow definitions are compliant with the supported WDL and Nextflow specifications.

Solution overview

Snakemake is available as Docker image on GitHub. We push the image to Amazon Elastic Container Registry. Tibanna is also available as Docker image on GitHub—it comes with Snakemake. Consult the Tibanna installation guide for more information.

We store Snakefiles on Amazon Simple Storage Service (Amazon S3). We configure S3 Event Notifications on PUT request operations. The event notification triggers an AWS Lambda function. The Lambda function launches an AWS Fargate task, which overrides the task definition command with the appropriate Snakemake start command and arguments.

The launched AWS Fargate task pulls the Snakefiles at launch time for each job and prepares the Snakemake initiation commands. Once the Snakefiles are downloaded on the Fargate task, the Snakemake head initiation command is invoked to begin launching jobs using Tibanna. Tibanna invokes a Step Functions state machine which orchestrates the launch of Snakemake on Amazon Elastic Compute Cloud (Amazon EC2).

Amazon CloudWatch provides a consolidated overview of performance metrics, including elapsed time, failed jobs, and error types. You can keep logs of your failed jobs in CloudWatch Logs (Figure 1). You can set up filters to match specific error types, plus create subscriptions to deliver a real-time stream of your log events to Amazon Kinesis or Lambda for further retry.

Figure 1. Solution architecture for Snakemake with Tibanna on AWS

Implementation considerations

Here, we describe some of the implementation considerations.

Creating Snakefiles

The launching point for the initiation depends on a Snakefile. Each Snakefile may contain one or more samples to be launched. The sheet resides in an S3 bucket. This adds flexibility and the ability to purge any sensitive or restrictive information after the job has been processed.

Invoking Tibanna

In order to launch Snakemake DAGs using Tibanna, we will need to set up a new Tibanna Unicorn. A Tibanna Unicorn is an Step Functions state machine and a corresponding Lambda function for provisioning EC2 instances.

The state machine runs the following sequence:

Create EC2 instance
Check EC2 status
Exit

After the Tibanna Unicorn has been created, we can start a Snakemake DAG using the following sample commands inside of the Fargate task.

$ export TIBANNA_DEFAULT_STEP_FUNCTION_NAME=YOUR_UNICORN_PROJECT
$ snakemake --tibanna --tibanna-config spot_instance=true --default-remote-prefix=YOUR_S3_BUCKET/BUCKET_PREFIX --retries 3.

The Snakemake command is used with the --tibanna flag to send launch requests to the Step Functions state machine in order to provision EC2 instances and run DAG tasks.

We recommend deploying the solution with AWS Serverless Application Model or the AWS Cloud Development Kit, both of which launch AWS CloudFormation.

Logging and troubleshooting

With this solution, each launch will automatically capture and retain start logs in a centralized location in Amazon CloudWatch Logs for tracing and auditing.

If there are issues during the launch of the Tibanna Step Function state machine, such as Amazon EC2 capacity limits, logs will be available in the S3 bucket that was specified during the Tibanna Unicorn creation process. There will be a file available in the format of <EXECUTION_ID>.log inside of the S3 bucket. This information is easily accessible via the command line interface. Use the following command to display specific log results or error messages.

tibanna log -j <EXECUTION_ID> -T

Retries and EC2 Spot Instances

We advise to use Amazon EC2 Spot Instances, if possible, for additional cost savings. This option is available in the --tibanna-config arguments with the setting spot_instance=true.

This is optional, and you need to create retry logic in the event a Spot Instance gets reclaimed. You can include --retries=3 in your Tibanna launch command. This would ensure all rules are retried three times. You can also specify the number of retries for individual rules when defining the Snakemake DAG definition; for example:

rule a:
    output:
        "test.txt"
    retries: 3
    shell:
        "curl https://some.unreliable.server/test.txt > {output}"

If EC2 Spot Instance capacity is hit, you can automatically switch to using EC2 On-Demand Instances instead. Add the behavior_on_capacity_limit argument and set retry_without_spot=true.

Adding services

The presented solution can be adapted to use other compute services supported by Snakemake. These include Amazon Elastic Kubernetes Service and AWS ParallelCluster with Slurm Workload Manager plus Amazon FSx for Lustre volumes attached to the head node and cluster nodes.

To initiate jobs on ParallelCluster, install the AWS Systems Manager agent on the head node. This is the launching point into the cluster and used for submitting jobs to the initiation queue. Systems Manager is a secure way to remotely invoke commands on an EC2 instance without the need for SSH access. You can restrict access to your EC2 instance through IAM policies.

Conclusion

In this blog post, we demonstrated how life-science research teams can simplify the launch of Snakemake using AWS. We used Snakefiles and Tibanna to orchestrate workflow steps. Snakefiles provide an exception from the general design pattern and an alternative to Amazon States Language. File uploads to Amazon S3 served as our launching point for workflow initiations.

Stay tuned for Part 3 of this series, in which we create a job manager that administrates multiple workflows.

Related information

Automated launch of genomics workflows

2022-11-16 Rostislav Markov

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/automated-launch-of-genomics-workflows/

Genomics workflows are high-performance computing workloads. Traditionally, they run on-premises with a collection of scripts. Scientists run and manage these workflows manually, which slows down the product development lifecycle. Scientists spend time to administer workflows and handle errors on a day-to-day basis. They also lack sufficient compute capacity on-premises.

In this blog post, we demonstrate how life sciences companies can use Amazon Web Services (AWS) to remove the traditional heavy lifting associated with genomic studies. We use AWS Step Functions to orchestrate workflow steps, including error handling. With AWS Batch, we horizontally scale-out the analytic tasks for optimal performance. This allows genome scientists to focus on scientific discovery while AWS runs their workflows.

Use case

Workflow systems used for genomic analysis include Cromwell, Nextflow, and regenie. These high-performance computing systems share the following requirements:

Fast access to datasets at petabyte scale
Parallel task distribution, with horizontal compute scale-out
Data processing in batches following a specific sequence of data analysis steps, which vary by use case

We explore the use case of regenie. regenie is a common, open-source utility for whole-genome regression modelling of large genome-wide association studies (GWAS). GWAS compare DNA datasets of individuals with a specific trait or disease. The intent is to associate the identified trait/disease with DNA variants. Among other positive results, this helps identify at-risk patients, plus testing and prevention opportunities.

regenie is a C++ program that runs in two steps:

The first step searches for variants associated with a specific trait in a dataset of individuals with the trait, in order to create a whole-genome regression model that captures the variance.
The second step validates for association with the identified variants against a larger dataset, typically in the scale of petabytes, and launches a sequence of tasks run on data batches.

Solution overview

The entire regenie workflow and associated tasks of attaching and deleting file-share access to sample data, as well as spinning up compute instances for parallel computing, can be orchestrated with Step Functions. We use Amazon FSx for Lustre as a high-performance, transient file system providing file access to the datasets stored in an Amazon Simple Storage Service (Amazon S3) bucket. AWS Batch allows us to programmatically spin up multiple Amazon Elastic Compute Cloud (Amazon EC2) instances on which regenie can distribute parallel computing tasks. We do this with an AWS Lambda function that calculates the number of required batch jobs based on the requested size of samples per batch.

regenie is available as Docker image on GitHub. We push the image to Amazon Elastic Container Registry from which AWS Batch can pull it with the creation of new jobs at launch time. The Step Functions state machine is initiated by a Lambda function, with interactive user input. In the past, scientists have also directly interacted with the Step Functions API via the AWS Management Console or by running start-execution in the AWS Command Line Interface and passing a JSON file with the input parameters.

Amazon CloudWatch provides a consolidated overview of performance metrics, including elapsed time, failed jobs, and error types. You can keep logs of your failed jobs in Amazon CloudWatch Logs (Figure 1). You can set up filters to match specific error types, plus create subscriptions to deliver a real-time stream of your log events to Amazon Kinesis or AWS Lambda for further retry.

Figure 1. Solution overview for automating regenie workflows on AWS

Alternatively, the Step Functions workflow triggers another Lambda function, which puts failed job logs to Amazon DynamoDB. In the past, we have used this to ease data access and manipulation via the AWS management console. Scientists updated table items and DynamoDB Streams initiated the retry.

Workflow automation

With each invocation, Step Functions initiates a new instance of the state machine. AWS documentation provides an overview of the API quotas. Step Functions allows the modeling of the entire workflow, including custom application error handling. Map state improves performance by parallelizing workflow branches.

The state machine initiates the build of the file system and, once it’s ready, creates a data repository association with the sample data stored on Amazon S3. It waits until the data repository association is complete and proceeds with the calculation of batch jobs, based on a user-defined number of samples to be processed per batch job (Figure 2). This is essential to determine the amount of compute instances required for data processing.

Figure 2. AWS Step Functions workflow for regenie: initialize file access

Next, the state machine builds the commands to launch the regenie steps, as requested by the user, and submit the jobs for AWS Batch (Figure 3). The workflow checks if a specific version of regenie was requested by the user, otherwise, it defaults to the version of regenie on the container.

Then, we build the commands to initiate the two regenie steps. Step 2 may need to run in multiple iterations on different datasets (more often than Step 1). This is also determined with user input at initiation of the workflow. With Step Functions, we create runner logic to build the set of commands dynamically. This pattern is applicable to other scientific workloads, as well.

Figure 3. AWS Step Functions workflow for regenie: prepare and submit jobs

Once jobs are submitted, the workflow proceeds (by default) with the initiation of Step 1 of regenie; if requested by the user, the workflow will proceed directly to step 2 (Figure 4).

Any errors during batch launch leading to the failure of a job are passed, in this case, to a Lambda function. We configure the Lambda function to write the failed job logs to Amazon DynamoDB or as S3 objects.

Figure 4. AWS Step Functions workflow for regenie: launch jobs

Finally, the Step Functions workflow checks for pending errors and confirms that all jobs have finished their initiation. Then, it deletes the file system and data repository association and ends the workflow instance (Figure 5).

Figure 5. AWS Step Functions workflow for regenie: complete error handling and delete file system

As demonstrated, we can automate the entire process, from data access to verifying job completion and cleaning-up transient resources. This removes manual error handling and retry, plus reduces the overall cost of running regenie workflows. We also showed in Figure 3 that you can build commands dynamically for different scientific workloads.

Conclusion

In this blog post, we addressed a common pain point in the daily work of life sciences research teams. Traditionally, they had to run genomics workflows manually on limited compute capacity. Moving those workflows to AWS eliminates the heavy lifting of running scripts manually and expedites computational cycles. This allows research teams to stay focused on scientific discovery.

We recommend a thorough performance testing when setting up your genomics workflows. This includes determining the most suitable EC2 instance size. Some workflows, such as regenie, are single-threaded and benefit from horizontal scale-out of the number of instances but not from vertical scale-out of instance sizes.

Related information

Maintain visibility over the use of cloud architecture patterns

2022-09-19 Rostislav Markov

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/maintain-visibility-over-the-use-of-cloud-architecture-patterns/

Cloud platform and enterprise architecture teams use architecture patterns to provide guidance for different use cases. Cloud architecture patterns are typically aggregates of multiple Amazon Web Services (AWS) resources, such as Elastic Load Balancing with Amazon Elastic Compute Cloud, or Amazon Relational Database Service with Amazon ElastiCache. In a large organization, cloud platform teams often have limited governance over cloud deployments, and, therefore, lack control or visibility over the actual cloud pattern adoption in their organization.

While having decentralized responsibility for cloud deployments is essential to scale, a lack of visibility or controls leads to inefficiencies, such as proliferation of infrastructure templates, misconfigurations, and insufficient feedback loops to inform cloud platform roadmap.

To address this, we present an integrated approach that allows cloud platform engineers to share and track use of cloud architecture patterns with:

AWS Service Catalog to publish an IT service catalog of codified cloud architecture patterns that are pre-approved for use in the organization.
Amazon QuickSight to track and visualize actual use of service catalog products across the organization.

This solution enables cloud platform teams to maintain visibility into the adoption of cloud architecture patterns in their organization and build a release management process around them.

Publish architectural patterns in your IT service catalog

We use AWS Service Catalog to create portfolios of pre-approved cloud architecture patterns and expose them as self-service to end users. This is accomplished in a shared services AWS account where cloud platform engineers manage the lifecycle of portfolios and publish new products (Figure 1). Cloud platform engineers can publish new versions of products within a portfolio and deprecate older versions, without affecting already-launched resources in end-user AWS accounts. We recommend using organizational sharing to share portfolios with multiple AWS accounts.

Application engineers launch products by referencing the AWS Service Catalog API. Access can be via infrastructure code, like AWS CloudFormation and TerraForm, or an IT service management tool, such as ServiceNow. We recommend using a multi-account setup for application deployments, with an application deployment account hosting the deployment toolchain: in our case, using AWS developer tools.

Although not explicitly depicted, the toolchain can be launched as an AWS Service Catalog product and include pre-populated infrastructure code to bootstrap initial product deployments, as described in the blog post Accelerate deployments on AWS with effective governance.

Figure 1. Launching cloud architecture patterns as AWS Service Catalog products

Track the adoption of cloud architecture patterns

Track the usage of AWS Service Catalog products by analyzing the corresponding AWS CloudTrail logs. The latter can be forwarded to an Amazon EventBridge rule with a filter on the following events: CreateProduct, UpdateProduct, DeleteProduct, ProvisionProduct and TerminateProvisionedProduct.

The logs are generated no matter how you interact with the AWS Service Catalog API, such as through ServiceNow or TerraForm. Once in EventBridge, Amazon Kinesis Data Firehose delivers the events to Amazon Simple Storage Service (Amazon S3) from where QuickSight can access them. Figure 2 depicts the end-to-end flow.

Figure 2. Tracking adoption of AWS Service Catalog products with Amazon QuickSight

Depending on your AWS landing zone setup, CloudTrail logs from all relevant AWS accounts and regions need to be forwarded to a central S3 bucket in your shared services account or, otherwise, centralized logging account. Figure 3 provides an overview of this cross-account log aggregation.

Figure 3. Aggregating AWS Service Catalog product logs across AWS accounts

If your landing zone allows, consider giving permissions to EventBridge in all accounts to write to a central event bus in your shared services AWS account. This avoids having to set up Kinesis Data Firehose delivery streams in all participating AWS accounts and further simplifies the solution (Figure 4).

Figure 4. Aggregating AWS Service Catalog product logs across AWS accounts to a central event bus

If you are already using an organization trail, you can use Amazon Athena or AWS Lambda to discover the relevant logs in your QuickSight dashboard, without the need to integrate with EventBridge and Kinesis Data Firehose.

Reporting on product adoption can be customized in QuickSight. The S3 bucket storing AWS Service Catalog logs can be defined in QuickSight as datasets, for which you can create an analysis and publish as a dashboard.

In the past, we have reported on the top ten products used in the organization (if relevant, also filtered by product version or time period) and the top accounts in terms of product usage. The following figure offers an example dashboard visualizing product usage by product type and number of times they were provisioned. Note: the counts of provisioned and terminated products differ slightly, as logging was activated after the first products were created and provisioned for demonstration purposes.

Figure 5. Example Amazon QuickSight dashboard tracking AWS Service Catalog product adoption

Conclusion

In this blog, we described an integrated approach to track adoption of cloud architecture patterns using AWS Service Catalog and QuickSight. The solution has a number of benefits, including:

Building an IT service catalog based on pre-approved architectural patterns
Maintaining visibility into the actual use of patterns, including which patterns and versions were deployed in the organizational units’ AWS accounts
Compliance with organizational standards, as architectural patterns are codified in the catalog

In our experience, the model may compromise on agility if you enforce a high level of standardization and only allow the use of a few patterns. However, there is the potential for proliferation of products, with many templates differing slightly without a central governance over the catalog. Ideally, cloud platform engineers assume responsibility for the roadmap of service catalog products, with formal intake mechanisms and feedback loops to account for builders’ localization requests.

Accelerate deployments on AWS with effective governance

2022-08-12 Rostislav Markov

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/accelerate-deployments-on-aws-with-effective-governance/

Amazon Web Services (AWS) users ask how to accelerate their teams’ deployments on AWS while maintaining compliance with security controls. In this blog post, we describe common governance models introduced in mature organizations to manage their teams’ AWS deployments. These models are best used to increase the maturity of your cloud infrastructure deployments.

Governance models for AWS deployments

We distinguish three common models used by mature cloud adopters to manage their infrastructure deployments on AWS. The models differ in what they control: the infrastructure code, deployment toolchain, or provisioned AWS resources. We define the models as follows:

Central pattern library, which offers a repository of curated deployment templates that application teams can re-use with their deployments.
Continuous Integration/Continuous Delivery (CI/CD) as a service, which offers a toolchain standard to be re-used by application teams.
Centrally managed infrastructure, which allows application teams to deploy AWS resources managed by central operations teams.

The decision of how much responsibility you shift to application teams depends on their autonomy, operating model, application type, and rate of change. The three models can be used in tandem to address different use cases and maximize impact. Typically, organizations start by gathering pre-approved deployment templates in a central pattern library.

Model 1: Central pattern library

With this model, cloud platform engineers publish a central pattern library from which teams can reference infrastructure as code templates. Application teams reuse the templates by forking the central repository or by copying the templates into their own repository. Application teams can also manage their own deployment AWS account and pipeline with AWS CodePipeline), as well as the resource-provisioning process, while reusing templates from the central pattern library with a service like AWS CodeCommit. Figure 1 provides an overview of this governance model.

Figure 1. Deployment governance with central pattern library

The central pattern library represents the least intrusive form of enablement via reusable assets. Application teams appreciate the central pattern library model, as it allows them to maintain autonomy over their deployment process and toolchain. Reusing existing templates speeds up the creation of your teams’ first infrastructure templates and eases policy adherence, such as tagging policies and security controls.

After the reusable templates are in the application team’s repository, incremental updates can be pulled from the central library when the template has been enhanced. This allows teams to pull when they see fit. Changes to the team’s repository will trigger the pipeline to deploy the associated infrastructure code.

With the central pattern library model, application teams need to manage resource configuration and CI/CD toolchain on their own in order to gain the benefits of automated deployments. Model 2 addresses this.

Model 2: CI/CD as a service

In Model 2, application teams launch a governed deployment pipeline from AWS Service Catalog. This includes the infrastructure code needed to run the application and “hello world” source code to show the end-to-end deployment flow.

Cloud platform engineers develop the service catalog portfolio (in this case the CI/CD toolchain). Then, application teams can launch AWS Service Catalog products, which deploy an instance of the pipeline code and populated Git repository (Figure 2).

The pipeline is initiated immediately after the repository is populated, which results in the “hello world” application being deployed to the first environment. The infrastructure code (for example, Amazon Elastic Compute Cloud [Amazon EC2] and AWS Fargate) will be located in the application team’s repository. Incremental updates can be pulled by launching a product update from AWS Service Catalog. This allows application teams to pull when they see fit.

Figure 2. Deployment governance with CI/CD as a service

This governance model is particularly suitable for mature developer organizations with full-stack responsibility or platform projects, as it provides end-to-end deployment automation to provision resources across multiple teams and AWS accounts. This model also adds security controls over the deployment process.

Since there is little room for teams to adapt the toolchain standard, the model can be perceived as very opinionated. The model expects application teams to manage their own infrastructure. Model 3 addresses this.

Model 3: Centrally managed infrastructure

This model allows application teams to provision resources managed by a central operations team as self-service. Cloud platform engineers publish infrastructure portfolios to AWS Service Catalog with pre-approved configuration by central teams (Figure 3). These portfolios can be shared with all AWS accounts used by application engineers.

Provisioning AWS resources via AWS Service Catalog products ensures resource configuration fulfills central operations requirements. Compared with Model 2, the pre-populated infrastructure templates launch AWS Service Catalog products, as opposed to directly referencing the API of the corresponding AWS service (for example Amazon EC2). This locks down how infrastructure is configured and provisioned.

Figure 3. Deployment governance with centrally managed infrastructure

In our experience, it is essential to manage the variety of AWS Service Catalog products. This avoids proliferation of products with many templates differing slightly. Centrally managed infrastructure propagates an “on-premises” mindset so it should be used only in cases where application teams cannot own the full stack.

Models 2 and 3 can be combined for application engineers to launch both deployment toolchain and resources as AWS Service Catalog products (Figure 4), while also maintaining the opportunity to provision from pre-populated infrastructure templates in the team repository. After the code is in their repository, incremental updates can be pulled by running an update from the provisioned AWS Service Catalog product. This allows the application team to pull an update as needed while avoiding manual deployments of service catalog products.

Figure 4. Using AWS Service Catalog to automate CI/CD and infrastructure resource provisioning

Comparing models

The three governance models differ along the following aspects (see Table 1):

Governance level: What component is managed centrally by cloud platform engineers?
Role of application engineers: What is the responsibility split and operating model?
Use case: When is each model applicable?

Table 1. Governance models for managing infrastructure deployments

	Model 1: Central pattern library	Model 2: CI/CD as a service	Model 3: Centrally managed infrastructure
Governance level	Centrally defined infrastructure templates	Centrally defined deployment toolchain	Centrally defined provisioning and management of AWS resources
Role of cloud platform engineers	Manage pattern library and policy checks	Manage deployment toolchain and stage checks	Manage resource provisioning (including CI/CD)
Role of application teams	Manage deployment toolchain and resource provisioning	Manage resource provisioning	Manage application integration
Use case	Federated governance with application teams maintaining autonomy over application and infrastructure	Platform projects or development organizations with strong preference for pre-defined deployment standards including toolchain	Applications without development teams (e.g., “commercial-off-the-shelf”) or with separation of duty (e.g., infrastructure operations teams)

Conclusion

In this blog post, we distinguished three common governance models to manage the deployment of AWS resources. The three models can be used in tandem to address different use cases and maximize impact in your organization. The decision of how much responsibility is shifted to application teams depends on your organizational setup and use case.

Want to learn more?

Save time and effort in assessing your teams’ architectures with pattern-based architecture reviews

2022-07-21 Rostislav Markov

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/save-time-and-effort-in-assessing-your-teams-architectures-with-pattern-based-architecture-reviews/

Enterprise architecture frameworks use architecture reviews as a key governance mechanism to review and approve architecture designs, identify quality enhancements, and align architectural decisions with enterprise-wide standards. Architecture reviews are very thorough, but it typically takes a lot of time and teamwork to prepare for them, which means developers can’t always move as quickly as they’d like.

If your team needs a flexible, faster option to review their architecture, consider adopting pattern-based architecture reviews (PBARs). PBARs may not find every issue that a traditional architecture review will. However, in situations where you need to accommodate tight deadlines or budgets, changing project requirements, or multiple releases, they offer a simpler, quicker, more focused way to address issues and ensure your architecture aligns with business needs.

Pattern-based architecture reviews vs. traditional architecture reviews

PBARs use generic architecture patterns (in other words, generalized, reusable solutions to common design problems) to review non-functional system properties and align architectural patterns to business outcomes.

Traditional architecture reviews

Pattern-based architecture reviews

Consider system architecture as a highly stable documentation of all functional and system needs and their implementation plan

Require detailed architecture documentation
Review functional requirements, infrastructure configuration, and process specifications
Focus on technical configuration completeness
Are used to sign off on architectural decisions and/or authorize the implementation plan

Adopt a continuous architecture mindset that focuses on enhancing system composition and quality attributes

Use generalized solutions with existing architectural documentation
Focus on identifying design inconsistencies and opportunities to improve on required system capabilities and quality attributes via architectural patterns
Increase developer productivity and identify opportunities to shorten the implementation plan through component re-use

PBARs are broadly applicable to any cloud initiative, ranging from migration use cases to complex large-scale development initiatives. Here are just a few examples of ways to use them:

With cloud migrations, PBARs help manage various infrastructure and integration patterns, including like-for-like moves and full refactoring options.
A few infrastructure patterns, such as N-tier architecture, are applicable to many applications. After the pilot phase, these cloud infrastructure patterns serve as reusable blueprints for follow-on migrations, which reduces the amount of repetitive work and ensures compliance with security controls.
With new development use cases, PBARs emphasize composition through reusable code
Teams with novel uses are encouraged to verify the new pattern through early prototyping as opposed to heavy documentation and requirements analysis.

Use case: Applying PBARs across multiple teams to meet stringent go-live date

We introduced PBARs to a global industrial company’s large cloud development initiative where developers had no prior AWS experience and their go-live date was in six months. The initiative spanned over 50 development teams along 10 functional domains and 11 geographical locations from Americas to Asia. Each team was responsible for developing between one and six customer-facing aggregated services exposed via APIs (asset management, tenant billing, customer onboarding, or event analytics).

Socialize initial design patterns

To get the team to use PBARs, we advocated to adopt particular managed/serverless services to reduce management overhead, as shown in Figure 1.

Figure 1. Proposed AWS services for use by development teams

We also shared an initial set of design patterns, including:

Containerized Spring Boot services with Elastic Load Balancing, Amazon Elastic Container Service, and Amazon Aurora
Serverless services with Amazon API Gateway, AWS Lambda, and Amazon DynamoDB, with extensions for data streaming and analysis (Figure 2)
AWS Elastic Beanstalk deployments of Amazon Elastic Compute Cloud (Amazon EC2)-backed web services

Figure 2. Containerized and serverless design patterns

Shorten review times by applying lessons learned from early adopters

PBARs were run by domain architects, team architects, lead engineers, and product owners. We also invited teams with similar use cases and system requirements for joint reviews.

They brought knowledge and experience that allowed the process to conclude within two weeks for all teams with minimal preparation—significantly faster than traditional reviews.

Complete reviews quicker and increase participation and understanding by focusing the review

Because PBARs move quickly, we had to be specific about the areas we chose to focus on improving or evaluating. We worked towards identifying inconsistencies between system requirements and pattern selection, any special needs, and opportunities to improve on non-functional requirements, including:

Security
Availability and operations
Deployment process
Speed and reproducibility
Quality concerns and defects

In narrowing the PBAR’s scope, we were also able to complete the architecture reviews more quickly and increase participants’ understanding of the architecture and critical project needs.

Findings

Our technical findings showed single points of failure, service scalability limits, or opportunities to automate test/deployment/recovery processes.

The PBARs emphasized pattern reuse and, therefore, standardization in the early development phase. This required follow-on tailoring to individual use cases, such as distinguishing data ingest profiles by data type and throughput or moving from containerized deployments for data analytics jobs to AWS Lambda and Amazon Athena.

PBARs also provided actionable feedback on what to address prior to the go-live date.

By emphasizing non-functional aspects, our PBARs helped create a case for zero-defect culture where fixing bugs had priority over new features.
Early-adopter teams of architectural patterns served as internal champions, providing informal support to others on how to address review findings.
Follow-on game days and performance load tests helped teams gain first-hand exposure to PBAR findings in simulated environments.

Introducing pattern-based architecture reviews in your organization

In large enterprises, PBARs serve as a demand intake mechanism for their cloud center of excellence (CoE). They facilitate adoption of established pattern solutions and contribute new use cases to the enterprise-wide roadmap of cloud architectural patterns.

Three organizational disciplines contribute to PBARs:

Application teams envision system capabilities and outcomes and own decision-making on application design and operations.
The enterprise architecture team oversee the adoption of architectural best practices and work closely with application teams and the cloud CoE to review architectural patterns.
The cloud CoE approves and publishes pattern solutions and tracks their adoption in the cloud service catalog. At AWS, we use AWS Service Catalog portfolios to publish pattern solutions to developers.

Figure 3 describes the high-level process tasks and responsibilities:

Application teams solicit PBAR with enterprise architects who help identify and customize suitable design patterns for particular use cases.
If the use case requires novel pattern, architects work with the application team on early prototyping and approval of the novel system architecture. They also work with the cloud CoE team to generalize and publish novel pattern solutions in the service catalog of the cloud CoE.

To better align with agile development cycles, we recommend establishing internal commitments on the time to schedule and conduct PBARs as well as auto-approval options for teams re-using existing pattern solutions, in order to allow developers to move as quickly as possible.

Figure 3. PBAR workflow

PBARs provide lightweight architectural governance across enterprises. They help focus your teams on non-functional system properties and align architectural patterns to business outcomes.

As shown in our use case, PBARs enable teams to build faster and help change the perception of enterprise-wide architecture reviews as a corporate guardrail. For teams with novel use cases, PBARs encourage pattern validation through early prototyping and therefore provide modern alternative for agile cloud projects.

If you are looking to scale your cloud architecture governance effectively, consider adopting PBARs.

Related information

Use the following links to learn more about patterns you can use on your next architecture review:

Queue Integration with Third-party Services on AWS

2021-08-23 Rostislav Markov

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/queue-integration-with-third-party-services-on-aws/

Commercial off-the-shelf software and third-party services can present an integration challenge in event-driven workflows when they do not natively support AWS APIs. This is even more impactful when a workflow is subject to unpredicted usage spikes, and you want to increase decoupling and fault tolerance. Given the third-party nature of services, polling an Amazon Simple Queue Service (SQS) queue and having built-in AWS API handling logic may not be an immediate option.

In such cases, AWS Lambda helps out-task the Amazon SQS queue integration and AWS API handling to an additional layer. The success of this depends on how well exception handling is implemented across the different interacting services. In this blog post, we outline issues to consider when adopting this design pattern. We also share a reusable solution.

Design pattern for third-party integration with SQS

With this design pattern, one or more services (producers) asynchronously invoke other third-party downstream consumer services. They publish messages to an Amazon SQS queue, which acts as buffer for requests. Producers provide all commands and other parameters required for consumer service execution with the message.

As messages are written to the queue, the queue is configured to invoke a message broker (implemented as AWS Lambda) for each message. AWS Lambda can interact natively with target AWS services such as Amazon EC2, Amazon Elastic Container Service (ECS), or Amazon Elastic Kubernetes Service (EKS). It can also be configured to use an Amazon Virtual Private Cloud (VPC) interface endpoint to establish a connection to VPC resources without traversing the internet. The message broker assigns the tasks to consumer services by invoking the RunTask API of Amazon ECS and AWS Fargate (see Figure 1.)

Figure 1. On-premises and AWS queue integration for third-party services using AWS Lambda

The message broker asynchronously invokes the API in ‘fire-and-forget’ mode. Therefore, error handling must be built in to respond to API invocation errors. In an event-driven scenario, an error will be invoked if you asynchronously call the third-party service hundreds or thousands of times and reach Service Quotas. This is a potential issue with RunTask API actions, or a large volume of concurrent tasks running on AWS Fargate. Two mechanisms can help implement troubleshooting API request errors.

API retries with exponential backoff. The message broker retries for a number of times with configurable sleep intervals and exponential backoff in-between. This enforces progressively longer waits between retries for consecutive error responses. If the RunTask API fails to process the request and initiate the third-party service, the message remains in the queue for a subsequent retry. The AWS General Reference provides further guidance.
API error handling. Error handling and consequent logging should be implemented at every step. Since there are several services working together in tandem, crucial debugging information from errors may be lost. Additionally, error handling also provides opportunity to define automated corrective actions or notifications on event occurrence. The message broker can publish failure notifications including the root cause to an Amazon Simple Notification Service (SNS) topic.

SNS topic subscription can be configured via different protocols. You can email a distribution group for active monitoring and processing of errors. If persistence is required for messages that failed to process, error handling can be associated directly with SQS by configuring a dead letter queue.

Reference implementation for third-party integration with SQS

We implemented the design pattern in Figure 1, with Broad Institute’s Cell Painting application workflow. This is for morphological profiling from microscopy cell images running on Amazon EC2. It interacts with CellProfiler version 3.0 cell image analysis software as the downstream consumer hosted on ECS/Fargate. Every invocation of CellProfiler required approximately 1,500 tasks for a single processing step.

Resource constraints determined the rate of scale-out. In this case, it was for an Amazon ECS task creation. Address space for Amazon ECS subnets should be large enough to prevent running out of available IPs within your VPC. If Amazon ECS Service Quotas provide further constraints, a quota increase can be requested.

Exceptions must be handled both when validating and initiating requests. As part of the validation workflow, exceptions are captured as follows, also shown in Figure 2.

1. Invalid arguments exception. The message broker validates that the SQS message contains all the needed information to initiate the ECS task. This information includes subnets, security groups and container names required to start the ECS task, and else raises exception.

2. Retry limit exception. On each iteration, the message broker will evaluate whether the SQS retry limit has been reached, before invoking the RunTask API. It will then exit, by sending failure notification to SNS when the retry limit is reached.

Figure 2. Exception handling flow during request validation

As part of the initiation workflow, exceptions are handled as follows, shown in Figure 3:

1. ECS/Fargate API and concurrent execution limitations. The message broker catches API exceptions when calling the API RunTask operation. These exceptions can include:

- When the call to the launch tasks exceeds the maximum allowed API request limit for your AWS account
- When failing to retrieve security group information
- When you have reached the limit on the number of tasks you can run concurrently

With each of the preceding exceptions, the broker will increase the retry count.

2. Networking and IP space limitations. Network interface timeouts received after initiating the ECS task set off an Amazon CloudWatch Events rule, causing the message broker to re-initiate the ECS task.

Figure 3. Exception handling flow during request initiation

While we specifically address downstream consumer services running on ECS/Fargate, this solution can be adjusted for third-party services running on Amazon EC2 or EKS. With EC2, the message broker must be adjusted to interact with the RunInstances API, and include troubleshooting API request errors. Integration with downstream consumers on Amazon EKS requires that the AWS Lambda function is associated via the IAM role with a Kubernetes service account. A Python client for Kubernetes can be used to simplify interaction with the Kubernetes REST API and AWS Lambda would invoke the run API.

Conclusion

This pattern is useful when queue polling is not an immediate option. This is typical with event-driven workflows involving third-party services and vendor applications subject to unpredictable, intermittent load spikes. Exception handling is essential for these types of workflows. Offloading AWS API handling to a separate layer orchestrated by AWS Lambda can improve the resiliency of such third-party services on AWS. This pattern represents an incremental optimization until the third party provides native SQS integration. It can be achieved with the initial move to AWS, for example as part of the V1 AWS design strategy for third-party services.

Some limitations should be acknowledged. While the pattern enables graceful failure, it does not prevent the overloading of the ECS RunTask API. By invoking Amazon ECS RunTask API in ‘fire-and-forget’ mode, it does not monitor service execution once a task was successfully invoked. Therefore, it should be adopted when direct queue polling is not an option. In our example, Broad Institute’s CellProfiler application enabled direct queue polling with its subsequent product version of Distributed CellProfiler.

Further reading

The referenced deployment with consumer services on Amazon ECS can be accessed via AWSLabs.