Tag Archives: Customer Solutions

Architecting for data residency with AWS Outposts rack and landing zone guardrails

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/architecting-for-data-residency-with-aws-outposts-rack-and-landing-zone-guardrails/

This blog post was written by Abeer Naffa’, Sr. Solutions Architect, Solutions Builder AWS, David Filiatrault, Principal Security Consultant, AWS and Jared Thompson, Hybrid Edge SA Specialist, AWS.

In this post, we will explore how organizations can use AWS Control Tower landing zone and AWS Organizations custom guardrails to enable compliance with data residency requirements on AWS Outposts rack. We will discuss how custom guardrails can be leveraged to limit the ability to store, process, and access data and remain isolated in specific geographic locations, how they can be used to enforce security and compliance controls, as well as, which prerequisites organizations should consider before implementing these guardrails.

Data residency is a critical consideration for organizations that collect and store sensitive information, such as Personal Identifiable Information (PII), financial, and healthcare data. With the rise of cloud computing and the global nature of the internet, it can be challenging for organizations to make sure that their data is being stored and processed in compliance with local laws and regulations.

One potential solution for addressing data residency challenges with AWS is to use Outposts rack, which allows organizations to run AWS infrastructure on premises and in their own data centers. This lets organizations store and process data in a location of their choosing. An Outpost is seamlessly connected to an AWS Region where it has access to the full suite of AWS services managed from a single plane of glass, the AWS Management Console or the AWS Command Line Interface (AWS CLI).  Outposts rack can be configured to utilize landing zone to further adhere to data residency requirements.

The landing zones are a set of tools and best practices that help organizations establish a secure and compliant multi-account structure within a cloud provider. A landing zone can also include Organizations to set policies – guardrails – at the root level, known as Service Control Policies (SCPs) across all member accounts. This can be configured to enforce certain data residency requirements.

When leveraging Outposts rack to meet data residency requirements, it is crucial to have control over the in-scope data movement from the Outposts. This can be accomplished by implementing landing zone best practices and the suggested guardrails. The main focus of this blog post is on the custom policies that restrict data snapshots, prohibit data creation within the Region, and limit data transfer to the Region.

Prerequisites

Landing zone best practices and custom guardrails can help when data needs to remain in a specific locality where the Outposts rack is also located.  This can be completed by defining and enforcing policies for data storage and usage within the landing zone organization that you set up. The following prerequisites should be considered before implementing the suggested guardrails:

1. AWS Outposts rack

AWS has installed your Outpost and handed off to you. An Outpost may comprise of one or more racks connected together at the site. This means that you can start using AWS services on the Outpost, and you can manage the Outposts rack using the same tools and interfaces that you use in AWS Regions.

2. Landing Zone Accelerator on AWS

We recommend using Landing Zone Accelerator on AWS (LZA) to deploy a landing zone for your organization. Make sure that the accelerator is configured for the appropriate Region and industry. To do this, you must meet the following prerequisites:

    • A clear understanding of your organization’s compliance requirements, including the specific Region and industry rules in which you operate.
    • Knowledge of the different LZAs available and their capabilities, such as the compliance frameworks with which you align.
    • Have the necessary permissions to deploy the LZAs and configure it for your organization’s specific requirements.

Note that LZAs are designed to help organizations quickly set up a secure, compliant multi-account environment. However, it’s not a one-size-fits-all solution, and you must align it with your organization’s specific requirements.

3. Set up the data residency guardrails

Using Organizations, you must make sure that the Outpost is ordered within a workload account in the landing zone.

Figure 1 Landing Zone Accelerator Outposts workload on AWS high level Architecture

Figure 1: Landing Zone Accelerator – Outposts workload on AWS high level Architecture

Utilizing Outposts rack for regulated components

When local regulations require regulated workloads to stay within a specific boundary, or when an AWS Region or AWS Local Zone isn’t available in your jurisdiction, you can still choose to host your regulated workloads on Outposts rack for a consistent cloud experience. When opting for Outposts rack, note that, as part of the shared responsibility model, customers are responsible for attesting to physical security, access controls, and compliance validation regarding the Outposts, as well as, environmental requirements for the facility, networking, and power. Utilizing Outposts rack requires that you procure and manage the data center within the city, state, province, or country boundary for your applications’ regulated components, as required by local regulations.

Procuring two or more racks in the diverse data centers can help with the high availability for your workloads. This is because it provides redundancy in case of a single rack or server failure. Additionally, having redundant network paths between Outposts rack and the parent Region can help make sure that your application remains connected and continue to operate even if one network path fails.

However, for regulated workloads with strict service level agreements (SLA), you may choose to spread Outposts racks across two or more isolated data centers within regulated boundaries. This helps make sure that your data remains within the designated geographical location and meets local data residency requirements.

In this post, we consider a scenario with one data center, but consider the specific requirements of your workloads and the regulations that apply to determine the most appropriate high availability configurations for your case.

Outposts rack workload data residency guardrails

Organizations provide central governance and management for multiple accounts. Central security administrators use SCPs with Organizations to establish controls to which all AWS Identity and Access Management (IAM) principals (users and roles) adhere.

Now, you can use SCPs to set permission guardrails.  A suggested preventative controls for data residency on Outposts rack that leverage the implementation of SCPs are shown as follows. SCPs enable you to set permission guardrails by defining the maximum available permissions for IAM entities in an account. If an SCP denies an action for an account, then none of the entities in the account can take that action, even if their IAM permissions let them. The guardrails set in SCPs apply to all IAM entities in the account, which include all users, roles, and the account root user.

Upon finalizing these prerequisites, you can create the guardrails for the Outposts Organization Unit (OU).

Note that while the following guidelines serve as helpful guardrails – SCPs – for data residency, you should consult internally with legal and security teams for specific organizational requirements.

 To exercise better control over workloads in the Outposts rack and prevent data transfer from Outposts to the Region or data storage outside the Outposts, consider implementing the following guardrails. Additionally, local regulations may dictate that you set up these additional guardrails.

  1. When your data residency requirements require restricting data transfer/saving to the Region, consider the following guardrails:

a. Deny copying data from Outposts to the Region for Amazon Elastic Compute Cloud (Amazon EC2), Amazon Relational Database Service (Amazon RDS), Amazon ElastiCache and data sync “DenyCopyToRegion”.

b. Deny Amazon Simple Storage Service (Amazon S3) put action to the Region “DenyPutObjectToRegionalBuckets”.

If your data residency requirements mandate restrictions on data storage in the Region,  consider implementing this guardrail to prevent  the use of S3 in the Region.

Note: You can use Amazon S3 for Outposts.

c. If your data residency requirements mandate restrictions on data storage in the Region, consider implementing “DenyDirectTransferToRegion” guardrail.

Out of Scope is metadata such as tags, or operational data such as KMS keys.

{
  "Version": "2012-10-17",
  "Statement": [
      {
      "Sid": "DenyCopyToRegion",
      "Action": [
        "ec2:ModifyImageAttribute",
        "ec2:CopyImage",  
        "ec2:CreateImage",
        "ec2:CreateInstanceExportTask",
        "ec2:ExportImage",
        "ec2:ImportImage",
        "ec2:ImportInstance",
        "ec2:ImportSnapshot",
        "ec2:ImportVolume",
        "rds:CreateDBSnapshot",
        "rds:CreateDBClusterSnapshot",
        "rds:ModifyDBSnapshotAttribute",
        "elasticache:CreateSnapshot",
        "elasticache:CopySnapshot",
        "datasync:Create*",
        "datasync:Update*"
      ],
      "Resource": "*",
      "Effect": "Deny"
    },
    {
      "Sid": "DenyDirectTransferToRegion",
      "Action": [
        "dynamodb:PutItem",
        "dynamodb:CreateTable",
        "ec2:CreateTrafficMirrorTarget",
        "ec2:CreateTrafficMirrorSession",
        "rds:CreateGlobalCluster",
        "es:Create*",
        "elasticfilesystem:C*",
        "elasticfilesystem:Put*",
        "storagegateway:Create*",
        "neptune-db:connect",
        "glue:CreateDevEndpoint",
        "glue:UpdateDevEndpoint",
        "datapipeline:CreatePipeline",
        "datapipeline:PutPipelineDefinition",
        "sagemaker:CreateAutoMLJob",
        "sagemaker:CreateData*",
        "sagemaker:CreateCode*",
        "sagemaker:CreateEndpoint",
        "sagemaker:CreateDomain",
        "sagemaker:CreateEdgePackagingJob",
        "sagemaker:CreateNotebookInstance",
        "sagemaker:CreateProcessingJob",
        "sagemaker:CreateModel*",
        "sagemaker:CreateTra*",
        "sagemaker:Update*",
        "redshift:CreateCluster*",
        "ses:Send*",
        "ses:Create*",
        "sqs:Create*",
        "sqs:Send*",
        "mq:Create*",
        "cloudfront:Create*",
        "cloudfront:Update*",
        "ecr:Put*",
        "ecr:Create*",
        "ecr:Upload*",
        "ram:AcceptResourceShareInvitation"
      ],
      "Resource": "*",
      "Effect": "Deny"
    },
    {
      "Sid": "DenyPutObjectToRegionalBuckets",
      "Action": [
        "s3:PutObject"
      ],
      "Resource": ["arn:aws:s3:::*"],
      "Effect": "Deny"
    }
  ]
}
  1. If your data residency requirements require limitations on data storage in the Region, consider implementing this guardrail “DenySnapshotsToRegion” and “DenySnapshotsNotOutposts” to restrict the use of snapshots in the Region.

a. Deny creating snapshots of your Outpost data in the Region “DenySnapshotsToRegion”

 Make sure to update the Outposts “<outpost_arn_pattern>”.

b. Deny copying or modifying Outposts Snapshots “DenySnapshotsNotOutposts”

Make sure to update the Outposts “<outpost_arn_pattern>”.

Note: “<outpost_arn_pattern>” default is arn:aws:outposts:*:*:outpost/*

{
  "Version": "2012-10-17",
  "Statement": [

    {
      "Sid": "DenySnapshotsToRegion",
      "Effect":"Deny",
      "Action":[
        "ec2:CreateSnapshot",
        "ec2:CreateSnapshots"
      ],
      "Resource":"arn:aws:ec2:*::snapshot/*",
      "Condition":{
         "ArnLike":{
            "ec2:SourceOutpostArn":"<outpost_arn_pattern>"
         },
         "Null":{
            "ec2:OutpostArn":"true"
         }
      }
    },
    {

      "Sid": "DenySnapshotsNotOutposts",          
      "Effect":"Deny",
      "Action":[
        "ec2:CopySnapshot",
        "ec2:ModifySnapshotAttribute"
      ],
      "Resource":"arn:aws:ec2:*::snapshot/*",
      "Condition":{
         "ArnLike":{
            "ec2:OutpostArn":"<outpost_arn_pattern>"
         }
      }
    }

  ]
}
  1. This guardrail helps to prevent the launch of Amazon EC2 instances or creation of network interfaces in non-Outposts subnets. It is advisable to keep data residency workloads within the Outposts rather than the Region to ensure better control over regulated workloads. This approach can help your organization achieve better control over data residency workloads and improve governance over your AWS Organization.

Make sure to update the Outposts subnets “<outpost_subnet_arns>”.

{
"Version": "2012-10-17",
  "Statement":[{
    "Sid": "DenyNotOutpostSubnet",
    "Effect":"Deny",
    "Action": [
      "ec2:RunInstances",
      "ec2:CreateNetworkInterface"
    ],
    "Resource": [
      "arn:aws:ec2:*:*:network-interface/*"
    ],
    "Condition": {
      "ForAllValues:ArnNotEquals": {
        "ec2:Subnet": ["<outpost_subnet_arns>"]
      }
    }
  }]
}

Additional considerations

When implementing data residency guardrails on Outposts rack, consider backup and disaster recovery strategies to make sure that your data is protected in the event of an outage or other unexpected events. This may include creating regular backups of your data, implementing disaster recovery plans and procedures, and using redundancy and failover systems to minimize the impact of any potential disruptions. Additionally, you should make sure that your backup and disaster recovery systems are compliant with any relevant data residency regulations and requirements. You should also test your backup and disaster recovery systems regularly to make sure that they are functioning as intended.

Additionally, the provided SCPs for Outposts rack in the above example do not block the “logs:PutLogEvents”. Therefore, even if you implemented data residency guardrails on Outpost, the application may log data to CloudWatch logs in the Region.

Highlights

By default, application-level logs on Outposts rack are not automatically sent to Amazon CloudWatch Logs in the Region. You can configure CloudWatch logs agent on Outposts rack to collect and send your application-level logs to CloudWatch logs.

logs: PutLogEvents does transmit data to the Region, but it is not blocked by the provided SCPs, as it’s expected that most use cases will still want to be able to use this logging API. However, if blocking is desired, then add the action to the first recommended guardrail. If you want specific roles to be allowed, then combine with the ArnNotLike condition example referenced in the previous highlight.

Conclusion

The combined use of Outposts rack and the suggested guardrails via AWS Organizations policies enables you to exercise better control over the movement of the data. By creating a landing zone for your organization, you can apply SCPs to your Outposts racks that will help make sure that your data remains within a specific geographic location, as required by the data residency regulations.

Note that, while custom guardrails can help you manage data residency on Outposts rack, it’s critical to thoroughly review your policies, procedures, and configurations to make sure that they are compliant with all relevant data residency regulations and requirements. Regularly testing and monitoring your systems can help make sure that your data is protected and your organization stays compliant.

References

How SafetyCulture scales unpredictable dbt Cloud workloads in a cost-effective manner with Amazon Redshift

Post Syndicated from Anish Moorjani original https://aws.amazon.com/blogs/big-data/how-safetyculture-scales-unpredictable-dbt-cloud-workloads-in-a-cost-effective-manner-with-amazon-redshift/

This post is co-written by Anish Moorjani, Data Engineer at SafetyCulture.

SafetyCulture is a global technology company that puts the power of continuous improvement into everyone’s hands. Its operations platform unlocks the power of observation at scale, giving leaders visibility and workers a voice in driving quality, efficiency, and safety improvements.

Amazon Redshift is a fully managed data warehouse service that tens of thousands of customers use to manage analytics at scale. Together with price-performance, Amazon Redshift enables you to use your data to acquire new insights for your business and customers while keeping costs low.

In this post, we share the solution SafetyCulture used to scale unpredictable dbt Cloud workloads in a cost-effective manner with Amazon Redshift.

Use case

SafetyCulture runs an Amazon Redshift provisioned cluster to support unpredictable and predictable workloads. A source of unpredictable workloads is dbt Cloud, which SafetyCulture uses to manage data transformations in the form of models. Whenever models are created or modified, a dbt Cloud CI job is triggered to test the models by materializing the models in Amazon Redshift. To balance the needs of unpredictable and predictable workloads, SafetyCulture used Amazon Redshift workload management (WLM) to flexibly manage workload priorities.

With plans for further growth in dbt Cloud workloads, SafetyCulture needed a solution that does the following:

  • Caters for unpredictable workloads in a cost-effective manner
  • Separates unpredictable workloads from predictable workloads to scale compute resources independently
  • Continues to allow models to be created and modified based on production data

Solution overview

The solution SafetyCulture used is comprised of Amazon Redshift Serverless and Amazon Redshift Data Sharing, along with the existing Amazon Redshift provisioned cluster.

Amazon Redshift Serverless caters to unpredictable workloads in a cost-effective manner because compute cost is not incurred when there is no workload. You pay only for what you use. In addition, moving unpredictable workloads into a separate Amazon Redshift data warehouse allows each Amazon Redshift data warehouse to scale resources independently.

Amazon Redshift Data Sharing enables data access across Amazon Redshift data warehouses without having to copy or move data. Therefore, when a workload is moved from one Amazon Redshift data warehouse to another, the workload can continue to access data in the initial Amazon Redshift data warehouse.

The following figure shows the solution and workflow steps:

  1. We create a serverless instance to cater for unpredictable workloads. Refer to Managing Amazon Redshift Serverless using the console for setup steps.
  2. We create a datashare called prod_datashare to allow the serverless instance access to data in the provisioned cluster. Refer to Getting started data sharing using the console for setup steps. Database names are identical to allow queries with full path notation database_name.schema_name.object_name to run seamlessly in both data warehouses.
  3. dbt Cloud connects to the serverless instance and models, created or modified, are tested by being materialized in the default database dev, in either each users’ personal schema or a pull request related schema. Instead of dev, you can use a different database designated for testing. Refer to Connect dbt Cloud to Redshift for setup steps.
  4. You can query materialized models in the serverless instance with materialized models in the provisioned cluster to validate changes. After you validate the changes, you can implement models in the serverless instance in the provisioned cluster.

Outcome

SafetyCulture carried out the steps to create the serverless instance and datashare, with integration to dbt Cloud, with ease. SafetyCulture also successfully ran its dbt project with all seeds, models, and snapshots materialized into the serverless instance via run commands from the dbt Cloud IDE and dbt Cloud CI jobs.

Regarding performance, SafetyCulture observed dbt Cloud workloads completing on average 60% faster in the serverless instance. Better performance could be attributed to two areas:

  • Amazon Redshift Serverless measures compute capacity using Redshift Processing Units (RPUs). Because it costs the same to run 64 RPUs in 10 minutes and 128 RPUs in 5 minutes, having a higher number of RPUs to complete a workload sooner was preferred.
  • With dbt Cloud workloads isolated on the serverless instance, dbt Cloud was configured with more threads to allow materialization of more models at once.

To determine cost, you can perform an estimation. 128 RPUs provides approximately the same amount of memory that an ra3.4xlarge 21-node provisioned cluster provides. In US East (N. Virginia), the cost of running a serverless instance with 128 RPUs is $48 hourly ($0.375 per RPU hour * 128 RPUs). In the same Region, the cost of running an ra3.4xlarge 21-node provisioned cluster on demand is $68.46 hourly ($3.26 per node hour * 21 nodes). Therefore, an accumulated hour of unpredictable workloads on a serverless instance is 29% more cost-effective than an on-demand provisioned cluster. Calculations in this example should be recalculated when performing future cost estimations because prices may change over time.

Learnings

SafetyCulture had two key learnings to better integrate dbt with Amazon Redshift, which can be helpful for similar implementations.

First, when integrating dbt with an Amazon Redshift datashare, configure INCLUDENEW=True to ease management of database objects in a schema:

ALTER DATASHARE datashare_name SET INCLUDENEW = TRUE FOR SCHEMA schema;

For example, assume the model customers.sql is materialized by dbt as the view customers. Next, customers is added to a datashare. When customers.sql is modified and rematerialized by dbt, dbt creates a new view with a temporary name, drops customers, and renames the new view to customers. Although the new view carries the same name, it’s a new database object that wasn’t added to the datashare. Therefore, customers is no longer found in the datashare.

Configuring INCLUDENEW=True allows new database objects to be automatically added to the datashare. An alternative to configuring INCLUDENEW=True and providing more granular control is the use of dbt post-hook.

Second, when integrating dbt with more than one Amazon Redshift data warehouse, define sources with database to aid dbt in evaluating the right database.

For example, assume a dbt project is used across two dbt Cloud environments to isolate production and test workloads. The dbt Cloud environment for production workloads is configured with the default database prod_db and connects to a provisioned cluster. The dbt Cloud environment for test workloads is configured with the default database dev and connects to a serverless instance. In addition, the provisioned cluster contains the table prod_db.raw_data.sales, which is made available to the serverless instance via a datashare as prod_db′.raw_data.sales.

When dbt compiles a model containing the source {{ source('raw_data', 'sales') }}, the source is evaluated as database.raw_data.sales. If database is not defined for sources, dbt sets the database to the configured environment’s default database. Therefore, the dbt Cloud environment connecting to the provisioned cluster evaluates the source as prod_db.raw_data.sales, while the dbt Cloud environment connecting to the serverless instance evaluates the source as dev.raw_data.sales, which is incorrect.

Defining database for sources allows dbt to consistently evaluate the right database across different dbt Cloud environments, because it removes ambiguity.

Conclusion

After testing Amazon Redshift Serverless and Data Sharing, SafetyCulture is satisfied with the result and has started productionalizing the solution.

“The PoC showed the vast potential of Redshift Serverless in our infrastructure,” says Thiago Baldim, Data Engineer Team Lead at SafetyCulture. “We could migrate our pipelines to support Redshift Serverless with simple changes to the standards we were using in our dbt. The outcome provided a clear picture of the potential implementations we could do, decoupling the workload entirely by teams and users and providing the right level of computation power that is fast and reliable.”

Although this post specifically targets unpredictable workloads from dbt Cloud, the solution is also relevant for other unpredictable workloads, including ad hoc queries from dashboards. Start exploring Amazon Redshift Serverless for your unpredictable workloads today.


About the authors

Anish Moorjani is a Data Engineer in the Data and Analytics team at SafetyCulture. He helps SafetyCulture’s analytics infrastructure scale with the exponential increase in the volume and variety of data.

Randy Chng is an Analytics Solutions Architect at Amazon Web Services. He works with customers to accelerate the solution of their key business problems.

How Infomedia built a serverless data pipeline with change data capture using AWS Glue and Apache Hudi

Post Syndicated from Gowtham Dandu original https://aws.amazon.com/blogs/big-data/how-infomedia-built-a-serverless-data-pipeline-with-change-data-capture-using-aws-glue-and-apache-hudi/

This is a guest post co-written with Gowtham Dandu from Infomedia.

Infomedia Ltd (ASX:IFM) is a leading global provider of DaaS and SaaS solutions that empowers the data-driven automotive ecosystem. Infomedia’s solutions help OEMs, NSCs, dealerships and 3rd party partners manage the vehicle and customer lifecycle. They are used by over 250,000 industry professionals, across 50 OEM brands and in 186 countries to create a convenient customer journey, drive dealer efficiencies and grow sales.

In this post, we share how Infomedia built a serverless data pipeline with change data capture (CDC) using AWS Glue and Apache Hudi.

Infomedia was looking to build a cloud-based data platform to take advantage of highly scalable data storage with flexible and cloud-native processing tools to ingest, transform, and deliver datasets to their SaaS applications. The team wanted to set up a serverless architecture with scale-out capabilities that would allow them to optimize time, cost, and performance of the data pipelines and eliminate most of the infrastructure management.

To serve data to their end-users, the team wanted to develop an API interface to retrieve various product attributes on demand. Performance and scalability of both the data pipeline and API endpoint were key success criteria. The data pipeline needed to have sufficient performance to allow for fast turnaround in the event that data issues needed to be corrected. Finally, the API endpoint performance was important for end-user experience and customer satisfaction. When designing the data processing pipeline for the attribute API, the Infomedia team wanted to use a flexible and open-source solution for processing data workloads with minimal operational overhead.

They saw an opportunity to use AWS Glue, which offers a popular open-source big data processing framework, and Apache Spark, in a serverless environment for end-to-end pipeline development and deployment.

Solution overview

The solution involved ingesting data from various third-party sources in different formats, processing to create a semantic layer, and then exposing the processed dataset as a REST API to end-users. The API retrieves data at runtime from an Amazon Aurora PostgreSQL-Compatible Edition database for end-user consumption. To populate the database, the Infomedia team developed a data pipeline using Amazon Simple Storage Service (Amazon S3) for data storage, AWS Glue for data transformations, and Apache Hudi for CDC and record-level updates. They wanted to develop a simple incremental data processing pipeline without having to update the entire database each time the pipeline ran. The Apache Hudi framework allowed the Infomedia team to maintain a golden reference dataset and capture changes so that the downstream database could be incrementally updated in a short timeframe.

To implement this modern data processing solution, Infomedia’s team chose a layered architecture with the following steps:

  1. The raw data originates from various third-party sources and is a collection of flat files with a fixed width column structure. The raw input data is stored in Amazon S3 in JSON format (called the bronze dataset layer).
  2. The raw data is converted to an optimized Parquet format using AWS Glue. The Parquet data is stored in a separate Amazon S3 location and serves as the staging area during the CDC process (called the silver dataset layer). The Parquet format results in improved query performance and cost savings for downstream processing.
  3. AWS Glue reads the Parquet file from the staging area and updates Apache Hudi tables stored in Amazon S3 (the golden dataset layer) as part of incremental data processing. This process helps create mutable datasets on Amazon S3 to store the versioned and latest set of records.
  4. Finally, AWS Glue is used to populate Amazon Aurora PostgreSQL-Compatible Edition with the latest version of the records. This dataset is used to serve the API endpoint. The API itself is a Spring Java application deployed as a Docker container in an Amazon Elastic Container Service (Amazon ECS) AWS Fargate environment.

The following diagram illustrates this architecture.

arch diag

AWS Glue and Apache Hudi overview

AWS Glue is a serverless data integration service that makes it easy to prepare and process data at scale from a wide variety of data sources. With AWS Glue, you can ingest data from multiple data sources, extract and infer schema, populate metadata in a centralized data catalog, and prepare and transform data for analytics and machine learning. AWS Glue has a pay-as-you-go model with no upfront costs, and you only pay for resources that you consume.

Apache Hudi is an open-source data management framework used to simplify incremental data processing and data pipeline development by providing record-level insert, update, upsert, and delete capabilities. It allows you to comply with data privacy laws, manage CDC operations, reinstate late-arriving data, and roll back to a particular point in time. You can use AWS Glue to build a serverless Apache Spark-based data pipeline and take advantage of the AWS Glue native connector for Apache Hudi at no cost to manage CDC operations with record-level insert, updates, and deletes.

Solution benefits

Since the start of Infomedia’s journey with AWS Glue, the Infomedia team has experienced several benefits over the self-managed extract, transform, and load (ETL) tooling. With the horizontal scaling of AWS Glue, they were able to seamlessly scale the compute capacity of their data pipeline workloads by a factor of 5. This allowed them to increase both the volume of records and the number of datasets they could process for downstream consumption. They were also able to take advantage of AWS Glue built-in optimizations, such as pre-filtering using pushdown predicates, which allowed the team to save valuable engineering time tuning the performance of data processing jobs.

In addition, Apache Spark-based AWS Glue enabled developers to author jobs using concise Spark SQL and dataset APIs. This allowed for rapid upskilling of developers who are already familiar with database programming. Because developers are working with higher-level constructs across entire datasets, they spend less time solving for low-level technical implementation details.

Also, the AWS Glue platform has been cost-effective when compared against running self-managed Apache Spark infrastructure. The team did an initial analysis that showed an estimated savings of 70% over running a dedicated Spark EC2 infrastructure for their workload. Furthermore, the AWS Glue Studio job monitoring dashboard provides the Infomedia team with detailed job-level visibility that makes it easy to get a summary of the job runs and understand data processing costs.

Conclusion and next steps

Infomedia will continue to modernize their complex data pipelines using the AWS Glue platform and other AWS Analytics services. Through integration with services such as AWS Lake Formation and the AWS Glue Data Catalog, the Infomedia team plans to maintain reference primary datasets and democratize access to high-value datasets, allowing for further innovation.

If you would like to learn more, please visit AWS Glue and AWS Lake Formation to get started on your data integration journey.


About the Authors

Gowtham Dandu is an Engineering Lead at Infomedia Ltd with a passion for building efficient and effective solutions on the cloud, especially involving data, APIs, and modern SaaS applications. He specializes in building microservices and data platforms that are cost-effective and highly scalable.

Praveen Kumar is a Specialist Solution Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-native services. His areas of interests are serverless technology, streaming applications, and modern cloud data warehouses.

How Tricentis unlocks insights across the software development lifecycle at speed and scale using Amazon Redshift

Post Syndicated from Parag Doshi original https://aws.amazon.com/blogs/big-data/how-tricentis-unlocks-insights-across-the-software-development-lifecycle-at-speed-and-scale-using-amazon-redshift/

This is a guest post co-written with Parag Doshi, Guru Havanur, and Simon Guindon from Tricentis.

Tricentis is the global leader in continuous testing for DevOps, cloud, and enterprise applications. It has been well published since the State of DevOps 2019 DORA Metrics were published that with DevOps, companies can deploy software 208 times more often and 106 times faster, recover from incidents 2,604 times faster, and release 7 times fewer defects. Speed changes everything, and continuous testing across the entire CI/CD lifecycle is the key. However, speed is only realized when you have the confidence to release software on demand. Tricentis instills that confidence by providing software tools that enable Agile Continuous Testing (ACT) at scale. Whether exploratory or automated, functional or performance, API or UI, targeting mainframes, custom applications, packaged applications, or cloud-native applications, Tricentis provides a comprehensive suite of specialized continuous testing tools that help its customers achieve the confidence to release on demand.

The next phase of Tricentis’ journey is to unlock insights across all testing tools. Teams may struggle to have a unified view of software quality due to siloed testing across many disparate tools. For users that require a unified view of software quality, this is unacceptable. In this post, we share how the AWS Data Lab helped Tricentis to improve their software as a service (SaaS) Tricentis Analytics platform with insights powered by Amazon Redshift.

The challenge

Tricentis provides SaaS and on-premises solutions to thousands of customers globally. Every change to software worth testing is tracked in test management tools such as Tricentis qTest, test automation tools such as Tosca or Testim, or performance testing tools such as Neoload. Although Tricentis has amassed such data over a decade, the data remains untapped for valuable insights. Each of these tools has its own reporting capabilities that make it difficult to combine the data for integrated and actionable business insights.

Additionally, the scale is significant because the multi-tenant data sources provide a continuous stream of testing activity, and our users require quick data refreshes as well as historical context for up to a decade due to compliance and regulatory demands.

Finally, data integrity is of paramount importance. Every event in the data source can be relevant, and our customers don’t tolerate data loss, poor data quality, or discrepancies between the source and Tricentis Analytics. While aggregating, summarizing, and aligning to a common information model, all transformations must not affect the integrity of data from its source.

The solution

Tricentis Analytics aims to address the challenges of high volume, near-real-time, and visually appealing reporting and analytics across the entire Tricentis product portfolio.

The initial customer objectives were:

  • Provide export of data securely accessible from the AWS Cloud
  • Provide an initial set of pre-built dashboards that provide immediate business insights
  • Beta test a solution with early adopter customers within 6 weeks

Considering the multi-tenant data source, Tricentis and the AWS Data Lab team engineered for the following constraints:

  • Deliver the end-to-end pipeline to load only the eligible customers into an analytics repository
  • Transform the multi-tenant data into single-tenant data isolated for each customer in strictly segregated environments

Knowing that data will be unified across many sources deployed in any environment, the architecture called for an enterprise-grade analytics platform. The data pipeline consists of multiple layers:

  • Ingesting data from the source either as application events or change data capture (CDC) streams
  • Queuing data so that we can rewind and replay the data back in time without going back to the source
  • Light transformations such as splitting multi-tenant data into single tenant data to isolate customer data
  • Persisting and presenting data in a scalable and reliable lake house (data lake and data warehouse) repository

Some customers will access the repository directly via an API with the proper guardrails for stability to combine their test data with other data sources in their enterprise, while other customers will use dashboards to gain insights on testing. Initially, Tricentis defines these dashboards and charts to enable insight on test runs, test traceability with requirements, and many other pre-defined use cases that can be valuable to customers. In the future, more capabilities will be provided to end-users to come up with their own analytics and insights.

How Tricentis and the AWS Data Lab were able to establish business insights in 6 weeks

Given the challenge of Tricentis Analytics with live customers in 6 weeks, Tricentis partnered with the AWS Data Lab. From detailed design to a beta release, Tricentis had customers expecting to consume data from a data lake specific to only their data, and all of the data that had been generated for over a decade. Customers also required their own repository, an Apache Parquet data lake, which would combine with other data in the customer environment to gather even greater insights.

The AWS account team proposed the AWS Data Lab Build Lab session to help Tricentis accelerate the process of designing and building their prototype. The Build Lab is a two-to-five-day intensive build by a team of customer builders with guidance from an AWS Data Lab Solutions Architect. During the Build Lab, the customer will construct a prototype in their environment, using their data, with guidance on real-world architectural patterns and anti-patterns, as well as strategies for building effective solutions, from AWS service experts. Including the pre-lab preparation work, the total engagement duration is 3–6 weeks and in the Tricentis case was 3 weeks: two for the pre-lab preparation work and one for the lab. The weeks that followed the lab included go-to-market activities with specific customers, documentation, hardening, security reviews, performance testing, data integrity testing, and automation activities.

The 2 weeks before the lab were used for the following:

  • Understanding the use case and working backward with an architecture
  • Preparing the Tricentis team for the lab by delivering all the training on the services to be used during the lab

For this solution, Tricentis and AWS built a data pipeline that consumes data from streaming, which was in place before the lab, and this streaming has the database transactions captured through CDC. In the streaming, the data from each table is separated by topic, and data from all the customers comes on the same topic (no isolation). Because of that, a pipeline was created to separate customers to create their tables isolated by the schema on the final destination at Amazon Redshift. The following diagram illustrates the solution architecture.

The main idea of this architecture is to be event-driven with eventual consistency. Any time new test cases or test results are created or modified, events trigger such that processing is immediate and new snapshot files are available via an API or data is pulled at the refresh frequency of the reporting or business intelligence (BI) tool. Every time the Amazon Simple Storage Service (Amazon S3) sink connector from Apache Kafka delivers a file on Amazon S3, Amazon EventBridge triggers an AWS Lambda function to transform the multi-tenant file into separated files, one per customer per table, and land it on specific folders on Amazon S3. As the files are created, another process is triggered to load the data from each customer on their schema or table on Amazon Redshift. On Amazon Redshift, materialized views were used to get the queries for the dashboards ready and easier to be returned to the Apache Superset. Also, the materialized views were configured to refresh automatically (with the autorefresh option), so Amazon Redshift updates the data automatically in the materialized views as soon as possible after base tables changes.

In the following sections, we detail specific implementation challenges and additional features required by customers discovered along the way.

Data export

As stated earlier, some customers want to get an export of their test data and create their data lake. For these customers, Tricentis provides incremental data as Apache Parquet files and will have the ability to filter on specific projects and specific date ranges. To ensure data integrity, Tricentis uses its technology known as Tosca DI (not part of the AWS Data Lab session).

Data security

The solution uses the following data security guardrails:

  • Data isolation guardrails – Tricentis source databases systems are used by all customers, and therefore, data from different customers is in the same database. To isolate customer-specific data, Tricentis has a unique identifier that discriminates customer-specific data. All the queries filter data based on the discriminator to get customer-specific data. EventBridge triggers a Lambda function to transform multi-tenant files to single-tenant (customer) files to land in customer-specific S3 folders. Another Lambda function is triggered to load data from customer-specific folders to their specific schema in Amazon Redshift. The latter Lambda function is data isolation aware and triggers an alert and stops processing further for any data that doesn’t belong to a specific customer.
  • Data access guardrails – To ensure access control, Tricentis applied role-based access control principles to users and service accounts for specific work-related resources. Access to Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon S3, Amazon Relational Database Service (Amazon RDS), and Amazon Redshift was controlled by granting privileges at the role level and assigning those roles appropriate resources.

Pay per use and linear cost scalability

Tricentis’s objective is to pay for the compute and storage used and grow analytics infrastructure with linear cost scalability. To better manage storage costs in the data plane, Tricentis stores all raw and intermediate data in Amazon S3 storage in a compressed format. The Amazon MSK and Amazon Redshift is right-sized for Tricentis Analytics load and is allowed to scale up or down with no downtime based on future business needs. Data on all the stores, including Amazon MSK, Amazon Redshift, and Amazon S3, is subjected to tiered storage and retention policies per the customer data retention and archival requirements to reduce the cost further and provide linear cost scalability.

In the control plane, Debezium and Kafka Connect resources are turned on and off, so you only pay for what you use. Lambda triggers are triggered on an event or a schedule and turned off after completing tasks.

Automated data integrity

High data integrity is a fundamental design principle of Tricentis Analytics. Fortunately, Tricentis has a product called ToscaDI, which is used to automate the measurement of data integrity across many different data sources. The main idea is to use the machine-generated data type and log sequence number (LSN) to reflect the latest snapshot data from the change data capture (CDC) streams. Tricentis reached the data integrity automation milestone outside of the AWS Data Lab window by automatically triggering Tosca DI at various stages of the AWS serverless architecture (illustrated earlier), and because of that Tricentis was able to ensure expected record counts at every step, preventing data loss or inadvertent data manipulation. In future versions, Tricentis will have much deeper data integrity verification record counts and incorporate specific fields to ensure data quality (for example, nullness) and semantic or format validation. To date, the combination of CDC and data cleansing has resulted in ultra-high data integrity when comparing source data to the final Parquet file contents.

Performance and data loss prevention

Performance was tuned for maximum throughput at three stages in the pipeline:

  • Data ingestion – Data integrity during ingestion was dramatically improved using CDC events and allowed us to rely on the well-respected replication mechanisms in PostgreSQL and Kafka, which simplified the system and eliminated a lot of the past data corrections that were in place. The Amazon S3 sink connector further streams data into Amazon S3 in real time by partitioning data into fixed-sized files. Fixed-size data files avoid further latency due to unbound file sizes. As a result, data was higher quality and was streamed in real time at a much faster rate.
  • Data transformation – Batch processing is highly cost efficient and compute efficient, and can mitigate various potential performance issues if appropriately implemented. Tricentis uses batch transformation to move data from multi-tenant Amazon S3 to single-tenant Amazon S3 and between single-tenant Amazon S3 to Amazon Redshift by micro-batch loading. The batch processing is staged to work within the Lamba invocations limits and maximum Amazon Redshift connections limits to keep the cost minimum. However, the transformation pipeline is configurable to go real time by processing every incoming S3 file on an EventBridge event.
  • Data queries – Materialized views with appropriate sort keys significantly improve the performance of repeated and predictable dashboard workloads. Tricentis pipelines use dynamic data loading in views and precomputed results in materialized views to seamlessly improve the performance of dashboards, along with setting up appropriate simple and compound sort keys to accelerate performance. Tricentis query performance is further accelerated by range-restricted predicates in sort keys.

Implementation challenges

Tricentis worked within the default limit of 1,000 concurrent Lambda function runs by keeping track of available functions at any given time and firing only those many functions for which slots are available. For the 10 GB memory limit per function, Tricentis right-sized the Amazon S3 sink connector generated files and single-tenant S3 files to not exceed 4 GB in size. The Lambda function throttling can be prevented by requesting a higher limit of concurrent runs if that becomes necessary later.

Tricentis also experienced some Amazon Redshift connection limitations. Amazon Redshift has quotas and adjustable quotas that limit the use of server resources. To effectively manage Amazon Redshift limits of maximum connections, Tricentis used connection pools to ensure optimal consumption and stability.

Results and next steps

The collaborative approach between Tricentis and the AWS Data Lab allowed considerable acceleration and the ability to meet timelines for establishing a big data solution that will benefit Tricentis customers for years. Since this writing, customer onboarding, observability and alerting, and security scanning were automated as part of a DevSecOps pipeline.

Within 6 weeks, the team was able to beta a data export service for one of Tricentis’ customers.

In the future, Tricentis anticipates adding multiple data sources, unify towards a common, ubiquitous language for testing data, and deliver richer insights so that our customers can have the correct data in a single view and increase confidence in their delivery of software at scale and speed.

Conclusion

In this post, we walked you through the journey the Tricentis team took with the AWS Data Lab during their participation in a Build Lab session. During the session, the Tricentis team and AWS Data Lab worked together to identify a best-fit architecture for their use cases and implement a prototype for delivering new insights for their customers.

To learn more about how the AWS Data Lab can help you turn your ideas into solutions, visit AWS Data Lab.


About the Authors

  Parag Doshi is Vice President of Engineering at Tricentis, where he continues to lead towards the vision of Innovation at the Speed of Imagination. He brings innovation to market by building world-class quality engineering SaaS such as qTest, the flagship test management product, and a new capability called Tricentis Analytics, which unlocks software development lifecycle insights across all types of testing. Prior to Tricentis, Parag was the founder of Anthem’s Cloud Platform Services, where he drove a hybrid cloud and DevSecOps capability and migrated 100 mission-critical applications. He enabled Anthem to build a new pharmacy benefits management business in AWS, resulting in $800 million in total operating gain for Anthem in 2020 per Forbes and CNBC. He also held posts at Hewlett-Packard, having multiple roles including Chief Technologist and head of architecture for DXC’s Virtual Private Cloud, and CTO for HP’s Application Services in the Americas region.

Guru Havanur serves as a Principal, Big Data Engineering and Analytics team in Tricentis. Guru is responsible for data, analytics, development, integration with other products, security, and compliance activities. He strives to work with other Tricentis products and customers to improve data sharing, data quality, data integrity, and data compliance through the modern big data platform. With over 20 years of experience in data warehousing, a variety of databases, integration, architecture, and management, he thrives for excellence.

Simon Guindon is an Architect at Tricentis. He has expertise in large-scale distributed systems and database consistency models, and works with teams in Tricentis around the world on scalability and high availability. You can follow his Twitter @simongui.

Ricardo Serafim is a Senior AWS Data Lab Solutions Architect. With a focus on data pipelines, data lakes, and data warehouses, Ricardo helps customers create an end-to-end architecture and test an MVP as part of their path to production. Outside of work, Ricardo loves to travel with his family and watch soccer games, mainly from the “Timão” Sport Club Corinthians Paulista.

Proactive Insights with Amazon DevOps Guru for RDS

Post Syndicated from Kishore Dhamodaran original https://aws.amazon.com/blogs/devops/proactive-insights-with-amazon-devops-guru-for-rds/

Today, we are pleased to announce a new Amazon DevOps Guru for RDS capability: Proactive Insights. DevOps Guru for RDS is a fully-managed service powered by machine learning (ML), that uses the data collected by RDS Performance Insights to detect and alert customers of anomalous behaviors within Amazon Aurora databases. Since its release, DevOps Guru for RDS has empowered customers with information to quickly react to performance problems and to take corrective actions. Now, Proactive Insights adds recommendations related to operational issues that may prevent potential issues in the future.

Proactive Insights requires no additional set up for customers already using DevOps Guru for RDS, for both Amazon Aurora MySQL-Compatible Edition and Amazon Aurora PostgreSQL-Compatible Edition.

The following are example use cases of operational issues available for Proactive Insights today, with more insights coming over time:

  • Long InnoDB History for Aurora MySQL-Compatible engines – Triggered when the InnoDB history list length becomes very large.
  • Temporary tables created on disk for Aurora MySQL-Compatible engines – Triggered when the ratio of temporary tables created versus all temporary tables breaches a threshold.
  • Idle In Transaction for Aurora PostgreSQL-Compatible engines – Triggered when sessions connected to the database are not performing active work, but can keep database resources blocked.

To get started, navigate to the Amazon DevOps Guru Dashboard where you can see a summary of your system’s overall health, including ongoing proactive insights. In the following screen capture, the number three indicates that there are three ongoing proactive insights. Click on that number to see the listing of the corresponding Proactive Insights, which may include RDS or other Proactive Insights supported by Amazon DevOps Guru.

Amazon DevOps Guru Dashboard where you can see a summary of your system’s overall health, including ongoing proactive insights

Figure 1. Amazon DevOps Guru Dashboard where you can see a summary of your system’s overall health, including ongoing proactive insights.

Ongoing problems (including reactive and proactive insights) are also highlighted against your database instance on the Database list page in the Amazon RDS console.

Proactive and Reactive Insights are highlighted against your database instance on the Database list page in the Amazon RDS console

Figure 2. Proactive and Reactive Insights are highlighted against your database instance on the Database list page in the Amazon RDS console.

In the following sections, we will dive deep on these use cases of DevOps Guru for RDS Proactive Insights.

Long InnoDB History for Aurora MySQL-Compatible engines

The InnoDB history list is a global list of the undo logs for committed transactions. MySQL uses the history list to purge records and log pages when transactions no longer require the history.  If the InnoDB history list length grows too large, indicating a large number of old row versions, queries and even the database shutdown process can become slower.

DevOps Guru for RDS now detects when the history list length exceeds 1 million records and alerts users to close (either by commit or by rollback) any unnecessary long-running transactions before triggering database changes that involve a shutdown (this includes reboots and database version upgrades).

From the DevOps Guru console, navigate to Insights, choose Proactive, then choose “RDS InnoDB History List Length Anomalous” Proactive Insight with an ongoing status. You will notice that Proactive Insights provides an “Insight overview”, “Metrics” and “Recommendations”.

Insight overview provides you basic information on this insight. In our case, the history list for row changes increased significantly, which affects query and shutdown performance.

Long InnoDB History for Aurora MySQL-Compatible engines Insight overview

Figure 3. Long InnoDB History for Aurora MySQL-Compatible engines Insight overview.

The Metrics panel gives you a graphical representation of the history list length and the timeline, allowing you to correlate it with any anomalous application activity that may have occurred during this window.

Long InnoDB History for Aurora MySQL-Compatible engines Metrics panel

Figure 4. Long InnoDB History for Aurora MySQL-Compatible engines Metrics panel.

The Recommendations section suggests actions that you can take to mitigate this issue before it leads to a bigger problem. You will also notice the rationale behind the recommendation under the “Why is DevOps Guru recommending this?” column.

The Recommendations section suggests actions that you can take to mitigate this issue before it leads to a bigger problem

Figure 5. The Recommendations section suggests actions that you can take to mitigate this issue before it leads to a bigger problem.

Temporary tables created on disk for Aurora MySQL-Compatible engines

Sometimes it is necessary for the MySQL database to create an internal temporary table while processing a query. An internal temporary table can be held in memory and processed by the TempTable or MEMORY storage engine, or stored on disk by the InnoDB storage engine. An increase of temporary tables created on disk instead of in memory can impact the database performance.

DevOps Guru for RDS now monitors the rate at which the database creates temporary tables and the percentage of those temporary tables that use disk. When these values cross recommended levels over a given period of time, DevOps Guru for RDS creates an insight exposing this situation before it becomes critical.

From the DevOps Guru console, navigate to Insights, choose Proactive, then choose “RDS Temporary Tables On Disk AnomalousProactive Insight with an ongoing status. You will notice this Proactive Insight provides an “Insight overview”, “Metrics” and “Recommendations”.

Insight overview provides you basic information on this insight. In our case, more than 58% of the total temporary tables created per second were using disk, with a sustained rate of two temporary tables on disk created every second, which indicates that query performance is degrading.

Temporary tables created on disk insight overview

Figure 6. Temporary tables created on disk insight overview.

The Metrics panel shows you a graphical representation of the information specific for this insight. You will be presented with the evolution of the amount of temporary tables created on disk per second, the percentage of temporary tables on disk (out of the total number of database-created temporary tables), and of the overall rate at which the temporary tables are created (per second).

Temporary tables created on disk evolution of the amount of temporary tables created on disk per second

Figure 7. Temporary tables created on disk – evolution of the amount of temporary tables created on disk per second.

Temporary tables created on disk the percentage of temporary tables on disk (out of the total number of database-created temporary tables)

Figure 8. Temporary tables created on disk – the percentage of temporary tables on disk (out of the total number of database-created temporary tables).

Temporary tables created on disk overall rate at which the temporary tables are created (per second)

Figure 9. Temporary tables created on disk – overall rate at which the temporary tables are created (per second).

The Recommendations section suggests actions to avoid this situation when possible, such as not using BLOB and TEXT data types, tuning tmp_table_size and max_heap_table_size database parameters, data set reduction, columns indexing and more.

Temporary tables created on disk actions to avoid this situation when possible, such as not using BLOB and TEXT data types, tuning tmp_table_size and max_heap_table_size database parameters, data set reduction, columns indexing and more

Figure 10. Temporary tables created on disk – actions to avoid this situation when possible, such as not using BLOB and TEXT data types, tuning tmp_table_size and max_heap_table_size database parameters, data set reduction, columns indexing and more.

Additional explanations on this use case can be found by clicking on the “View troubleshooting doc” link.

Idle In Transaction for Aurora PostgreSQL-Compatible engines

A connection that has been idle in transaction  for too long can impact performance by holding locks, blocking other queries, or by preventing VACUUM (including autovacuum) from cleaning up dead rows.
PostgreSQL database requires periodic maintenance, which is known as vacuuming. Autovacuum in PostgreSQL automates the execution of VACUUM and ANALYZE commands. This process gathers the table statistics and deletes the dead rows. When vacuuming does not occur, this negatively impacts the database performance. It leads to an increase in table and index bloat (the disk space that was used by a table or index and is available for reuse by the database but has not been reclaimed), leads to stale statistics and can even end in transaction wraparound (when the number of unique transaction ids reaches its maximum of about two billion).

DevOps Guru for RDS monitors the time spent by sessions in an Aurora PostgreSQL database in idle in transaction state and raises initially a warning notification, followed by an alarm notification if the idle in transaction state continues (the current thresholds are 1800 seconds for the warning and 3600 seconds for the alarm).

From the DevOps Guru console, navigate to Insights, choose Proactive, then choose “RDS Idle In Transaction Max Time AnomalousProactive Insight with an ongoing status. You will notice this Proactive Insights provides an “Insight overview”, “Metrics” and “Recommendations”.

In our case, a connection has been in “idle in transaction” state for more than 1800 seconds, which could impact the database performance.

A connection has been in “idle in transaction” state for more than 1800 seconds, which could impact the database performance

Figure 11. A connection has been in “idle in transaction” state for more than 1800 seconds, which could impact the database performance.

The Metrics panel shows you a graphical representation of when the long-running “idle in transaction” connections started.

The Metrics panel shows you a graphical representation of when the long-running “idle in transaction” connections started

Figure 12. The Metrics panel shows you a graphical representation of when the long-running “idle in transaction” connections started.

As with the other insights, recommended actions are listed and a troubleshooting doc is linked for even more details on this use case.

Recommended actions are listed and a troubleshooting doc is linked for even more details on this use case

Figure 13. Recommended actions are listed and a troubleshooting doc is linked for even more details on this use case.

Conclusion

With Proactive Insights, DevOpsGuru for RDS enhances its abilities to help you monitor your databases by notifying you about potential operational issues, before they become bigger problems down the road. To get started, you need to ensure that you have enabled Performance Insights on the database instance(s) you want monitored, as well as ensure and confirm that DevOps Guru is enabled to monitor those instances (for example by enabling it at account level, by monitoring specific CloudFormation stacks or by using AWS tags for specific Aurora resources). Proactive Insights is available in all regions where DevOps Guru for RDS is supported. To learn more about Proactive Insights, join us for a free hands-on Immersion Day (available in three time zones) on March 15th or April 12th.

About the authors:

Kishore Dhamodaran

Kishore Dhamodaran is a Senior Solutions Architect at AWS.

Raluca Constantin

Raluca Constantin is a Senior Database Engineer with the Relational Database Services (RDS) team at Amazon Web Services. She has 16 years of experience in the databases world. She enjoys travels, hikes, arts and is a proud mother of a 12y old daughter and a 7y old son.

Jonathan Vogel

Jonathan is a Developer Advocate at AWS. He was a DevOps Specialist Solutions Architect at AWS for two years prior to taking on the Developer Advocate role. Prior to AWS, he practiced professional software development for over a decade. Jonathan enjoys music, birding and climbing rocks.

Securely validate business application resilience with AWS FIS and IAM

Post Syndicated from Dr. Rudolf Potucek original https://aws.amazon.com/blogs/devops/securely-validate-business-application-resilience-with-aws-fis-and-iam/

To avoid high costs of downtime, mission critical applications in the cloud need to achieve resilience against degradation of cloud provider APIs and services.

In 2021, AWS launched AWS Fault Injection Simulator (FIS), a fully managed service to perform fault injection experiments on workloads in AWS to improve their reliability and resilience. At the time of writing, FIS allows to simulate degradation of Amazon Elastic Compute Cloud (EC2) APIs using API fault injection actions and thus explore the resilience of workflows where EC2 APIs act as a fault boundary. 

In this post we show you how to explore additional fault boundaries in your applications by selectively denying access to any AWS API. This technique is particularly useful for fully managed, “black box” services like Amazon Simple Storage Service (S3) or Amazon Simple Queue Service (SQS) where a failure of read or write operations is sufficient to simulate problems in the service. This technique is also useful for injecting failures in serverless applications without needing to modify code. While similar results could be achieved with network disruption or modifying code with feature flags, this approach provides a fine granular degradation of an AWS API without the need to re-deploy and re-validate code.

Overview

We will explore a common application pattern: user uploads a file, S3 triggers an AWS Lambda function, Lambda transforms the file to a new location and deletes the original:

S3 upload and transform logical workflow: User uploads file to S3, upload triggers AWS Lambda execution, Lambda writes transformed file to a new bucket and deletes original. Workflow can be disrupted at file deletion.

Figure 1. S3 upload and transform logical workflow: User uploads file to S3, upload triggers AWS Lambda execution, Lambda writes transformed file to a new bucket and deletes original. Workflow can be disrupted at file deletion.

We will simulate the user upload with an Amazon EventBridge rate expression triggering an AWS Lambda function which creates a file in S3:

S3 upload and transform implemented demo workflow: Amazon EventBridge triggers a creator Lambda function, Lambda function creates a file in S3, file creation triggers AWS Lambda execution on transformer function, Lambda writes transformed file to a new bucket and deletes original. Workflow can be disrupted at file deletion.

Figure 2. S3 upload and transform implemented demo workflow: Amazon EventBridge triggers a creator Lambda function, Lambda function creates a file in S3, file creation triggers AWS Lambda execution on transformer function, Lambda writes transformed file to a new bucket and deletes original. Workflow can be disrupted at file deletion.

Using this architecture we can explore the effect of S3 API degradation during file creation and deletion. As shown, the API call to delete a file from S3 is an application fault boundary. The failure could occur, with identical effect, because of S3 degradation or because the AWS IAM role of the Lambda function denies access to the API.

To inject failures we use AWS Systems Manager (AWS SSM) automation documents to attach and detach IAM policies at the API fault boundary and FIS to orchestrate the workflow.

Each Lambda function has an IAM execution role that allows S3 write and delete access, respectively. If the processor Lambda fails, the S3 file will remain in the bucket, indicating a failure. Similarly, if the IAM execution role for the processor function is denied the ability to delete a file after processing, that file will remain in the S3 bucket.

Prerequisites

Following this blog posts will incur some costs for AWS services. To explore this test application you will need an AWS account. We will also assume that you are using AWS CloudShell or have the AWS CLI installed and have configured a profile with administrator permissions. With that in place you can create the demo application in your AWS account by downloading this template and deploying an AWS CloudFormation stack:

git clone https://github.com/aws-samples/fis-api-failure-injection-using-iam.git
cd fis-api-failure-injection-using-iam
aws cloudformation deploy --stack-name test-fis-api-faults --template-file template.yaml --capabilities CAPABILITY_NAMED_IAM

Fault injection using IAM

Once the stack has been created, navigate to the Amazon CloudWatch Logs console and filter for /aws/lambda/test-fis-api-faults. Under the EventBridgeTimerHandler log group you should find log events once a minute writing a timestamped file to an S3 bucket named fis-api-failure-ACCOUNT_ID. Under the S3TriggerHandler log group you should find matching deletion events for those files.

Once you have confirmed object creation/deletion, let’s take away the permission of the S3 trigger handler lambda to delete files. To do this you will attach the FISAPI-DenyS3DeleteObject  policy that was created with the template:

ROLE_NAME=FISAPI-TARGET-S3TriggerHandlerRole
ROLE_ARN=$( aws iam list-roles --query "Roles[?RoleName=='${ROLE_NAME}'].Arn" --output text )
echo Target Role ARN: $ROLE_ARN

POLICY_NAME=FISAPI-DenyS3DeleteObject
POLICY_ARN=$( aws iam list-policies --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" --output text )
echo Impact Policy ARN: $POLICY_ARN

aws iam attach-role-policy \
  --role-name ${ROLE_NAME}\
  --policy-arn ${POLICY_ARN}

With the deny policy in place you should now see object deletion fail and objects should start showing up in the S3 bucket. Navigate to the S3 console and find the bucket starting with fis-api-failure. You should see a new object appearing in this bucket once a minute:

S3 bucket listing showing files not being deleted because IAM permissions DENY file deletion during FIS experiment.

Figure 3. S3 bucket listing showing files not being deleted because IAM permissions DENY file deletion during FIS experiment.

If you would like to graph the results you can navigate to AWS CloudWatch, select “Logs Insights“, select the log group starting with /aws/lambda/test-fis-api-faults-S3CountObjectsHandler, and run this query:

fields @timestamp, @message
| filter NumObjects >= 0
| sort @timestamp desc
| stats max(NumObjects) by bin(1m)
| limit 20

This will show the number of files in the S3 bucket over time:

AWS CloudWatch Logs Insights graph showing the increase in the number of retained files in S3 bucket over time, demonstrating the effect of the introduced failure.

Figure 4. AWS CloudWatch Logs Insights graph showing the increase in the number of retained files in S3 bucket over time, demonstrating the effect of the introduced failure.

You can now detach the policy:

ROLE_NAME=FISAPI-TARGET-S3TriggerHandlerRole
ROLE_ARN=$( aws iam list-roles --query "Roles[?RoleName=='${ROLE_NAME}'].Arn" --output text )
echo Target Role ARN: $ROLE_ARN

POLICY_NAME=FISAPI-DenyS3DeleteObject
POLICY_ARN=$( aws iam list-policies --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" --output text )
echo Impact Policy ARN: $POLICY_ARN

aws iam detach-role-policy \
  --role-name ${ROLE_NAME}\
  --policy-arn ${POLICY_ARN}

We see that newly written files will once again be deleted but the un-processed files will remain in the S3 bucket. From the fault injection we learned that our system does not tolerate request failures when deleting files from S3. To address this, we should add a dead letter queue or some other retry mechanism.

Note: if the Lambda function does not return a success state on invocation, EventBridge will retry. In our Lambda functions we are cost conscious and explicitly capture the failure states to avoid excessive retries.

Fault injection using SSM

To use this approach from FIS and to always remove the policy at the end of the experiment, we first create an SSM document to automate adding a policy to a role. To inspect this document, open the SSM console, navigate to the “Documents” section, find the FISAPI-IamAttachDetach document under “Owned by me”, and examine the “Content” tab (make sure to select the correct region). This document takes the name of the Role you want to impact and the Policy you want to attach as parameters. It also requires an IAM execution role that grants it the power to list, attach, and detach specific policies to specific roles.

Let’s run the SSM automation document from the console by selecting “Execute Automation”. Determine the ARN of the FISAPI-SSM-Automation-Role from CloudFormation or by running:

POLICY_NAME=FISAPI-DenyS3DeleteObject
POLICY_ARN=$( aws iam list-policies --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" --output text )
echo Impact Policy ARN: $POLICY_ARN

Use FISAPI-SSM-Automation-Role, a duration of 2 minutes expressed in ISO8601 format as PT2M, the ARN of the deny policy, and the name of the target role FISAPI-TARGET-S3TriggerHandlerRole:

Image of parameter input field reflecting the instructions in blog text.

Figure 5. Image of parameter input field reflecting the instructions in blog text.

Alternatively execute this from a shell:

ASSUME_ROLE_NAME=FISAPI-SSM-Automation-Role
ASSUME_ROLE_ARN=$( aws iam list-roles --query "Roles[?RoleName=='${ASSUME_ROLE_NAME}'].Arn" --output text )
echo Assume Role ARN: $ASSUME_ROLE_ARN

ROLE_NAME=FISAPI-TARGET-S3TriggerHandlerRole
ROLE_ARN=$( aws iam list-roles --query "Roles[?RoleName=='${ROLE_NAME}'].Arn" --output text )
echo Target Role ARN: $ROLE_ARN

POLICY_NAME=FISAPI-DenyS3DeleteObject
POLICY_ARN=$( aws iam list-policies --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" --output text )
echo Impact Policy ARN: $POLICY_ARN

aws ssm start-automation-execution \
  --document-name FISAPI-IamAttachDetach \
  --parameters "{
      \"AutomationAssumeRole\": [ \"${ASSUME_ROLE_ARN}\" ],
      \"Duration\": [ \"PT2M\" ],
      \"TargetResourceDenyPolicyArn\": [\"${POLICY_ARN}\" ],
      \"TargetApplicationRoleName\": [ \"${ROLE_NAME}\" ]
    }"

Wait two minutes and then examine the content of the S3 bucket starting with fis-api-failure again. You should now see two additional files in the bucket, showing that the policy was attached for 2 minutes during which files could not be deleted, and confirming that our application is not resilient to S3 API degradation.

Permissions for injecting failures with SSM

Fault injection with SSM is controlled by IAM, which is why you had to specify the FISAPI-SSM-Automation-Role:

Visual representation of IAM permission used for fault injections with SSM. It shows the SSM execution role permitting access to use SSM automation documents as well as modify IAM roles and policies via the SSM document. It also shows the SSM user needing to have a pass-role permission to grant the SSM execution role to the SSM service.

Figure 6. Visual representation of IAM permission used for fault injections with SSM.

This role needs to contain an assume role policy statement for SSM to allow assuming the role:

      AssumeRolePolicyDocument:
        Statement:
          - Action:
             - 'sts:AssumeRole'
            Effect: Allow
            Principal:
              Service:
                - "ssm.amazonaws.com"

The role also needs to contain permissions to describe roles and their attached policies with an optional constraint on which roles and policies are visible:

          - Sid: GetRoleAndPolicyDetails
            Effect: Allow
            Action:
              - 'iam:GetRole'
              - 'iam:GetPolicy'
              - 'iam:ListAttachedRolePolicies'
            Resource:
              # Roles
              - !GetAtt EventBridgeTimerHandlerRole.Arn
              - !GetAtt S3TriggerHandlerRole.Arn
              # Policies
              - !Ref AwsFisApiPolicyDenyS3DeleteObject

Finally the SSM role needs to allow attaching and detaching a policy document. This requires

  1. an ALLOW statement
  2. a constraint on the policies that can be attached
  3. a constraint on the roles that can be attached to

In the role we collapse the first two requirements into an ALLOW statement with a condition constraint for the Policy ARN. We then express the third requirement in a DENY statement that will limit the '*' resource to only the explicit role ARNs we want to modify:

          - Sid: AllowOnlyTargetResourcePolicies
            Effect: Allow
            Action:  
              - 'iam:DetachRolePolicy'
              - 'iam:AttachRolePolicy'
            Resource: '*'
            Condition:
              ArnEquals:
                'iam:PolicyARN':
                  # Policies that can be attached
                  - !Ref AwsFisApiPolicyDenyS3DeleteObject
          - Sid: DenyAttachDetachAllRolesExceptApplicationRole
            Effect: Deny
            Action: 
              - 'iam:DetachRolePolicy'
              - 'iam:AttachRolePolicy'
            NotResource: 
              # Roles that can be attached to
              - !GetAtt EventBridgeTimerHandlerRole.Arn
              - !GetAtt S3TriggerHandlerRole.Arn

We will discuss security considerations in more detail at the end of this post.

Fault injection using FIS

With the SSM document in place you can now create an FIS template that calls the SSM document. Navigate to the FIS console and filter for FISAPI-DENY-S3PutObject. You should see that the experiment template passes the same parameters that you previously used with SSM:

Image of FIS experiment template action summary. This shows the SSM document ARN to be used for fault injection and the JSON parameters passed to the SSM document specifying the IAM Role to modify and the IAM Policy to use.

Figure 7. Image of FIS experiment template action summary. This shows the SSM document ARN to be used for fault injection and the JSON parameters passed to the SSM document specifying the IAM Role to modify and the IAM Policy to use.

You can now run the FIS experiment and after a couple minutes once again see new files in the S3 bucket.

Permissions for injecting failures with FIS and SSM

Fault injection with FIS is controlled by IAM, which is why you had to specify the FISAPI-FIS-Injection-EperimentRole:

Visual representation of IAM permission used for fault injections with FIS and SSM. It shows the SSM execution role permitting access to use SSM automation documents as well as modify IAM roles and policies via the SSM document. It also shows the FIS execution role permitting access to use FIS templates, as well as the pass-role permission to grant the SSM execution role to the SSM service. Finally it shows the FIS user needing to have a pass-role permission to grant the FIS execution role to the FIS service.

Figure 8. Visual representation of IAM permission used for fault injections with FIS and SSM. It shows the SSM execution role permitting access to use SSM automation documents as well as modify IAM roles and policies via the SSM document. It also shows the FIS execution role permitting access to use FIS templates, as well as the pass-role permission to grant the SSM execution role to the SSM service. Finally it shows the FIS user needing to have a pass-role permission to grant the FIS execution role to the FIS service.

This role needs to contain an assume role policy statement for FIS to allow assuming the role:

      AssumeRolePolicyDocument:
        Statement:
          - Action:
              - 'sts:AssumeRole'
            Effect: Allow
            Principal:
              Service:
                - "fis.amazonaws.com"

The role also needs permissions to list and execute SSM documents:

            - Sid: RequiredReadActionsforAWSFIS
              Effect: Allow
              Action:
                - 'cloudwatch:DescribeAlarms'
                - 'ssm:GetAutomationExecution'
                - 'ssm:ListCommands'
                - 'iam:ListRoles'
              Resource: '*'
            - Sid: RequiredSSMStopActionforAWSFIS
              Effect: Allow
              Action:
                - 'ssm:CancelCommand'
              Resource: '*'
            - Sid: RequiredSSMWriteActionsforAWSFIS
              Effect: Allow
              Action:
                - 'ssm:StartAutomationExecution'
                - 'ssm:StopAutomationExecution'
              Resource: 
                - !Sub 'arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:automation-definition/${SsmAutomationIamAttachDetachDocument}:$DEFAULT'

Finally, remember that the SSM document needs to use a Role of its own to execute the fault injection actions. Because that Role is different from the Role under which we started the FIS experiment, we need to explicitly allow SSM to assume that role with a PassRole statement which will expand to FISAPI-SSM-Automation-Role:

            - Sid: RequiredIAMPassRoleforSSMADocuments
              Effect: Allow
              Action: 'iam:PassRole'
              Resource: !Sub 'arn:aws:iam::${AWS::AccountId}:role/${SsmAutomationRole}'

Secure and flexible permissions

So far, we have used explicit ARNs for our guardrails. To expand flexibility, we can use wildcards in our resource matching. For example, we might change the Policy matching from:

            Condition:
              ArnEquals:
                'iam:PolicyARN':
                  # Explicitly listed policies - secure but inflexible
                  - !Ref AwsFisApiPolicyDenyS3DeleteObject

or the equivalent:

            Condition:
              ArnEquals:
                'iam:PolicyARN':
                  # Explicitly listed policies - secure but inflexible
                  - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:policy/${FullPolicyName}

to a wildcard notation like this:

            Condition:
              ArnEquals:
                'iam:PolicyARN':
                  # Wildcard policies - secure and flexible
                  - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:policy/${PolicyNamePrefix}*'

If we set PolicyNamePrefix to FISAPI-DenyS3 this would now allow invoking FISAPI-DenyS3PutObject and FISAPI-DenyS3DeleteObject but would not allow using a policy named FISAPI-DenyEc2DescribeInstances.

Similarly, we could change the Resource matching from:

            NotResource: 
              # Explicitly listed roles - secure but inflexible
              - !GetAtt EventBridgeTimerHandlerRole.Arn
              - !GetAtt S3TriggerHandlerRole.Arn

to a wildcard equivalent like this:

            NotResource: 
              # Wildcard policies - secure and flexible
              - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:role/${RoleNamePrefixEventBridge}*'
              - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:role/${RoleNamePrefixS3}*'
and setting RoleNamePrefixEventBridge to FISAPI-TARGET-EventBridge and RoleNamePrefixS3 to FISAPI-TARGET-S3.

Finally, we would also change the FIS experiment role to allow SSM documents based on a name prefix by changing the constraint on automation execution from:

            - Sid: RequiredSSMWriteActionsforAWSFIS
              Effect: Allow
              Action:
                - 'ssm:StartAutomationExecution'
                - 'ssm:StopAutomationExecution'
              Resource: 
                # Explicitly listed resource - secure but inflexible
                # Note: the $DEFAULT at the end could also be an explicit version number
                # Note: the 'automation-definition' is automatically created from 'document' on invocation
                - !Sub 'arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:automation-definition/${SsmAutomationIamAttachDetachDocument}:$DEFAULT'

to

            - Sid: RequiredSSMWriteActionsforAWSFIS
              Effect: Allow
              Action:
                - 'ssm:StartAutomationExecution'
                - 'ssm:StopAutomationExecution'
              Resource: 
                # Wildcard resources - secure and flexible
                # 
                # Note: the 'automation-definition' is automatically created from 'document' on invocation
                - !Sub 'arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:automation-definition/${SsmAutomationDocumentPrefix}*'

and setting SsmAutomationDocumentPrefix to FISAPI-. Test this by updating the CloudFormation stack with a modified template:

aws cloudformation deploy --stack-name test-fis-api-faults --template-file template2.yaml --capabilities CAPABILITY_NAMED_IAM

Permissions governing users

In production you should not be using administrator access to use FIS. Instead we create two roles FISAPI-AssumableRoleWithCreation and FISAPI-AssumableRoleWithoutCreation for you (see this template). These roles require all FIS and SSM resources to have a Name tag that starts with FISAPI-. Try assuming the role without creation privileges and running an experiment. You will notice that you can only start an experiment if you add a Name tag, e.g. FISAPI-secure-1, and you will only be able to get details of experiments and templates that have proper Name tags.

If you are working with AWS Organizations, you can add further guard rails by defining SCPs that control the use of the FISAPI-* tags similar to this blog post.

Caveats

For this solution we are choosing to attach policies instead of permission boundaries. The benefit of this is that you can attach multiple independent policies and thus simulate multi-step service degradation. However, this means that it is possible to increase the permission level of a role. While there are situations where this might be of interest, e.g. to simulate security breaches, please implement a thorough security review of any fault injection IAM policies you create. Note that modifying IAM Roles may trigger events in your security monitoring tools.

The AttachRolePolicy and DetachRolePolicy calls from AWS IAM are eventually consistent, meaning that in some cases permission propagation when starting and stopping fault injection may take up to 5 minutes each.

Cleanup

To avoid additional cost, delete the content of the S3 bucket and delete the CloudFormation stack:

# Clean up policy attachments just in case
CLEANUP_ROLES=$(aws iam list-roles --query "Roles[?starts_with(RoleName,'FISAPI-')].RoleName" --output text)
for role in $CLEANUP_ROLES; do
  CLEANUP_POLICIES=$(aws iam list-attached-role-policies --role-name $role --query "AttachedPolicies[?starts_with(PolicyName,'FISAPI-')].PolicyName" --output text)
  for policy in $CLEANUP_POLICIES; do
    echo Detaching policy $policy from role $role
    aws iam detach-role-policy --role-name $role --policy-arn $policy
  done
done
# Delete S3 bucket content
ACCOUNT_ID=$( aws sts get-caller-identity --query Account --output text )
S3_BUCKET_NAME=fis-api-failure-${ACCOUNT_ID}
aws s3 rm --recursive s3://${S3_BUCKET_NAME}
aws s3 rb s3://${S3_BUCKET_NAME}
# Delete cloudformation stack
aws cloudformation delete-stack --stack-name test-fis-api-faults
aws cloudformation wait stack-delete-complete --stack-name test-fis-api-faults

Conclusion 

AWS Fault Injection Simulator provides the ability to simulate various external impacts to your application to validate and improve resilience. We’ve shown how combining FIS with IAM to selectively deny access to AWS APIs provides a generic path to explore fault boundaries across all AWS services. We’ve shown how this can be used to identify and improve a resilience problem in a common S3 upload workflow. To learn about more ways to use FIS, see this workshop.

About the authors:

Dr. Rudolf Potucek

Dr. Rudolf Potucek is Startup Solutions Architect at Amazon Web Services. Over the past 30 years he gained a PhD and worked in different roles including leading teams in academia and industry, as well as consulting. He brings experience from working with academia, startups, and large enterprises to his current role of guiding startup customers to succeed in the cloud.

Rudolph Wagner

Rudolph Wagner is a Premium Support Engineer at Amazon Web Services who holds the CISSP and OSCP security certifications, in addition to being a certified AWS Solutions Architect Professional. He assists internal and external Customers with multiple AWS services by using his diverse background in SAP, IT, and construction.

How Vanguard made their technology platform resilient and efficient by building cross-Region replication for Amazon Kinesis Data Streams

Post Syndicated from Raghu Boppanna original https://aws.amazon.com/blogs/big-data/how-vanguard-made-their-technology-platform-resilient-and-efficient-by-building-cross-region-replication-for-amazon-kinesis-data-streams/

This is a guest post co-written with Raghu Boppanna from Vanguard. 

At Vanguard, the Enterprise Advice line of business improves investor outcomes through digital access to superior, personalized, and affordable financial advice. They made it possible, in part, by driving economies of scale across the globe for investors with a highly resilient and efficient technical platform. Vanguard opted for a multi-Region architecture for this workload to help protect against impairments of Regional services. For high availability purposes, there is a need to make the data used by the workload available not just in the primary Region, but also in the secondary Region with minimal replication lag. In the event of a service impairment in the primary Region, the solution should be able to fail over to the secondary Region with as little data loss as possible and the ability to resume data ingestion.

Vanguard Cloud Technology Office and AWS partnered to build an infrastructure solution on AWS that met their resilience requirements. The multi-Region solution enables a robust fail-over mechanism, with built-in observability and recovery. The solution also supports streaming data from multiple sources to different Kinesis data streams. The solution is currently being rolled out to the different lines of business teams to improve the resilience posture of their workloads.

The use case discussed here requires Change Data Capture (CDC) to stream data from a remote data source (mainframe DB2) to Amazon Kinesis Data Streams, because the business capability depends on this data. Kinesis Data Streams is a fully managed, massively scalable, durable, and low-cost streaming service that can continuously capture and stream large amounts of data from multiple sources, and makes the data available for consumption within milliseconds. The service is built to be highly resilient and uses multiple Availability Zones to process and store data.

The solution discussed in this post explains how AWS and Vanguard innovated to build a resilient architecture to meet their high availability goals.

Solution overview

The solution uses AWS Lambda to replicate data from Kinesis data streams in the primary Region to a secondary Region. In the event of any service impairment impacting the CDC pipeline, the failover process promotes the secondary Region to primary for the producers and consumers. We use Amazon DynamoDB global tables for replication checkpoints that allows to resume data streaming from the checkpoint and also maintains a primary Region configuration flag that prevents an infinite replication loop of the same data back and forth.

The solution also provides the flexibility for Kinesis Data Streams consumers to use the primary or any secondary Region within the same AWS account.

The following diagram illustrates the reference architecture.

Let’s look at each component in detail:

  1. CDC processor (producer) – In this reference architecture, the producer is deployed on Amazon Elastic Compute Cloud (Amazon EC2) in both the primary and secondary Regions, and is active in the primary Region and on standby mode in the secondary Region. It captures CDC data from the external data source (like a DB2 database as shown in the architecture above), and streams to Kinesis Data Streams in the primary Region. Vanguard uses a 3rd party tool Qlik Replicate as their CDC Processor. It produces a well-formed payload including the DB2 commit timestamp to the Kinesis data stream, in addition to the actual row data from the remote data source. (example-stream-1 in this example). The following code is a sample payload containing only the primary key of the record that changed and the commit timestamp (for simplicity, the rest of the table row data is not shown below):
    {
        "eventSource": "aws:kinesis",
        "kinesis": 
        {
             "ApproximateArrivalTimestamp": "Mon July 18 20:00:00 UTC 2022",
             "SequenceNumber": "49544985256907370027570885864065577703022652638596431874",
             "PartitionKey": "12349999",
             "KinesisSchemaVersion": "1.0",
             "Data": "eyJLZXkiOiAxMjM0OTk5OSwiQ29tbWl0VGltZXN0YW1wIjogIjIwMjItMDctMThUMjA6MDA6MDAifQ=="
        },
        "eventId": "shardId-000000000000:49629136582982516722891309362785181370337771525377097730",
        "invokeIdentityArn": "arn:aws:iam::6243876582:role/kds-crr-LambdaRole-1GZWP67437SD",
        "eventName": "aws:kinesis:record",
        "eventVersion": "1.0",
        "eventSourceARN": "arn:aws:kinesis:us-east-1:6243876582:stream/kds-stream-1/consumer/kds-crr:6243876582",
        "awsRegion": "us-east-1"
    }

    The Base64 decoded value of Data is as follows. The actual Kinesis record would contain the entire row data of the table row that changed, in addition to the primary key and the commit timestamp.

    {"Key": 12349999,"CommitTimestamp": "2022-07-18T20:00:00"}

    The CommitTimestamp in the Data field is used in the replication checkpoint and is critical to accurately track how much of the stream data has been replicated to the secondary Region. The checkpoint can then be used to facilitate a CDC processor (producer) failover and accurately resume producing data from the replication checkpoint timestamp onwards.

    The alternative to using a remote data source CommitTimestamp (if unavailable) is to use the ApproximateArrivalTimestamp (which is the timestamp when the record is actually written to the data stream).

  2. Cross-Region replication Lambda function – The function is deployed to both primary and secondary Regions. It’s set up with an event source mapping to the data stream containing CDC data. The same function can be used to replicate data of multiple streams. It’s invoked with a batch of records from Kinesis Data Streams and replicates the batch to a target replication Region (which is provided via the Lambda configuration environment). For cost considerations, if the CDC data is actively produced into the primary Region only, the reserved concurrency of the function in the secondary Region can be set to zero, and modified during regional failover. The function has AWS Identity and Access Management (IAM) role permissions to do the following:
    • Read and write to the DynamoDB global tables used in this solution, within the same account.
    • Read and write to Kinesis Data Streams in both Regions within the same account.
    • Publish custom metrics to Amazon CloudWatch in both Regions within the same account.
  3. Replication checkpoint – The replication checkpoint uses the DynamoDB global table in both the primary and secondary Regions. It’s used by the cross-Region replication Lambda function to persist the commit timestamp of the last replication record as the replication checkpoint for every stream that is configured for replication. For this post, we create and use a global table called kdsReplicationCheckpoint.
  4. Active Region config – The active Region uses the DynamoDB global table in both primary and secondary Regions. It uses the native cross-Region replication capability of the global table to replicate the configuration. It’s pre-populated with data about which is the primary Region for a stream, to prevent replication back to the primary Region by the Lambda function in the standby Region. This configuration may not be required if the Lambda function in the standby Region has a reserved concurrency set to zero, but can serve as a safety check to avoid infinite replication loop of the data. For this post, we create a global table called kdsActiveRegionConfig and put an item with the following data:
    {
     "stream-name": "example-stream-1",
     "active-region" : "us-east-1"
    }
    
  5. Kinesis Data Streams – The stream to which the CDC processor produces the data. For this post, we use a stream called example-stream-1 in both the Regions, with the same shard configuration and access policies.

Sequence of steps in cross-Region replication

Let’s briefly look at how the architecture is exercised using the following sequence diagram.

The sequence consists of the following steps:

  1. The CDC processor (in us-east-1) reads the CDC data from the remote data source.
  2. The CDC processor (in us-east-1) streams the CDC data to Kinesis Data Streams (in us-east-1).
  3. The cross-Region replication Lambda function (in us-east-1) consumes the data from the data stream (in us-east-1). The enhanced fan-out pattern is recommended for dedicated and increased throughput for cross-Region replication.
  4. The replicator Lambda function (in us-east-1) validates its current Region with the active Region configuration for the stream being consumed, with the help of the kdsActiveRegionConfig DynamoDB global tableThe following sample code (in Java) can help illustrate the condition being evaluated:
    // Fetch the current AWS Region from the Lambda function’s environment
    String currentAWSRegion = System.getenv(“AWS_REGION”);
    // Read the stream name from the first Kinesis Record once for the entire batch being processed. This is done because we are reusing the same Lambda function for replicating multiple streams.
    String currentStreamNameConsumed = kinesisRecord.getEventSourceARN().split(“:”)[5].split(“/”)[1];
    // Build the DynamoDB query condition using the stream name
    Map<String, Condition> keyConditions = singletonMap(“streamName”, Condition.builder().comparisonOperator(EQ).attributeValueList(AttributeValue.builder().s(currentStreamNameConsumed).build()).build());
    // Query the DynamoDB Global Table
    QueryResponse queryResponse = ddbClient.query(QueryRequest.builder().tableName("kdsActiveRegionConfig").keyConditions(keyConditions).attributesToGet(“ActiveRegion”).build());
  5. The function evaluates the response from DynamoDB with the following code:
    // Evaluate the response
    if (queryResponse.hasItems()) {
           AttributeValue activeRegionForStream = queryResponse.items().get(0).get(“ActiveRegion”);
           return currentAWSRegion.equalsIgnoreCase(activeRegionForStream.s());
    }
  6. Depending on the response, the function takes the following actions:
    1. If the response is true, the replicator function produces the records to Kinesis Data Streams in us-east-2 in a sequential manner.
      • If there is a failure, the sequence number of the record is tracked and the iteration is broken. The function returns the list of failed sequence numbers. By returning the failed sequence number, the solution uses the feature of Lambda checkpointing to be able to resume processing of a batch of records with partial failures. This is useful when handling any service impairments, where the function tries to replicate the data across Regions to ensure stream parity and no data loss.
      • If there are no failures, an empty list is returned, which indicates the batch was successful.
    2. If the response is false, the replicator function returns without performing any replication. To reduce the cost of the Lambda invocations, you can set the reserved concurrency of the function in the DR Region (us-east-2) to zero. This will prevent the function from being invoked. When you failover, you can update this value to an appropriate number based on the CDC throughput and set the reserved concurrency of the function in us-east-1 to zero to prevent it from executing unnecessarily.
  7. After all the records are produced to Kinesis Data Streams in us-east-2, the replicator function checkpoints to the kdsReplicationCheckpoint DynamoDB global table (in us-east-1) with the following data:
    { "streamName": "example-stream-1", "lastReplicatedTimestamp": "2022-07-18T20:00:00" }
    
  8. The function returns after successfully processing the batch of records.

Performance considerations

The performance expectations of the solution should be understood with respect to the following factors:

  • Region selection – The replication latency is directly proportional to the distance being traveled by the data, so understand your Region selection
  • Velocity – The incoming velocity of the data or the volume of data being replicated
  • Payload size – The size of the payload being replicated

Monitor the Cross-Region replication

It’s recommended to track and observe the replication as it happens. You can tailor the Lambda function to publish custom metrics to CloudWatch with the following metrics at the end of every invocation. Publishing these metrics to both the primary and secondary Regions helps protect yourself from impairments affecting observability in the primary Region.

  • Throughput – The current Lambda invocation batch size
  • ReplicationLagSeconds – The difference between the current timestamp (after processing all the records) and the ApproximateArrivalTimestamp of the last record that was replicated

The following example CloudWatch metric graph shows the average replication lag was 2 seconds with a throughput of 100 records replicated from us-east-1 to us-east-2.

Common failover strategy

During any impairments impacting the CDC pipeline in the primary Region, business continuity or disaster recovery needs may dictate a pipeline failover to the secondary (standby) Region. This means a couple of things need to be done as part of this failover process:

  • If possible, stop all the CDC tasks in the CDC processor tool in us-east-1.
  • The CDC processor must be failed over to the secondary Region, so that it can read the CDC data from the remote data source while operating out of the standby Region.
  • The kdsActiveRegionConfig DynamoDB global table needs to be updated. For instance, for the stream example-stream-1 used in our example, the active Region is changed to us-east-2:
{
"stream-name": "example-stream-1",
"active-Region" : "us-east-2"
}
  • All the stream checkpoints need to be read from the kdsReplicationCheckpoint DynamoDB global table (in us-east-2), and the timestamps from each of the checkpoints are used to start the CDC tasks in the producer tool in us-east-2 Region. This minimizes the chances of data loss and accurately resumes streaming the CDC data from the remote data source from the checkpoint timestamp onwards.
  • If using reserved concurrency to control Lambda invocations, set the value to zero in the primary Region(us-east-1) and to a suitable non-zero value in the secondary Region(us-east-2).

Vanguard’s multi-step failover strategy

Some of the third-party tools that Vanguard uses have a two-step CDC process of streaming data from a remote data source to a destination. Vanguard’s tool of choice for their CDC processor follows this two-step approach:

  1. The first step involves setting up a log stream task that reads the data from the remote data source and persists in a staging location.
  2. The second step involves setting up individual consumer tasks that read data from the staging location—which could be on Amazon Elastic File System (Amazon EFS) or Amazon FSx, for example—and stream it to the destination. The flexibility here is that each of these consumer tasks can be triggered to stream from different commit timestamps. The log stream task usually starts reading data from the minimum of all the commit timestamps used by the consumer tasks.

Let’s look at an example to explain the scenario:

  • Consumer task A is streaming data from a commit timestamp 2022-07-19T20:00:00 onwards to example-stream-1.
  • Consumer task B is streaming data from a commit timestamp 2022-07-19T21:00:00 onwards to example-stream-2.
  • In this situation, the log stream should read data from the remote data source from the minimum of the timestamps used by the consumer tasks, which is 2022-07-19T20:00:00.

The following sequence diagram demonstrates the exact steps to run during a failover to us-east-2 (the standby Region).

The steps are as follows:

  1. The failover process is triggered in the standby Region (us-east-2 in this example) when required. Note that the trigger can be automated using comprehensive health checks of the pipeline in the primary Region.
  2. The failover process updates the kdsActiveRegionConfig DynamoDB global table with the new value for the Region as us-east-2 for all the stream names.
  3. The next step is to fetch all the stream checkpoints from the kdsReplicationCheckpoint DynamoDB global table (in us-east-2).
  4. After the checkpoint information is read, the failover process finds the minimum of all the lastReplicatedTimestamp.
  5. The log stream task in the CDC processor tool is started in us-east-2 with the timestamp found in Step 4. It begins reading CDC data from the remote data source from this timestamp onwards and persists them in the staging location on AWS.
  6. The next step is to start all the consumer tasks to read data from the staging location and stream to the destination data stream. This is where each consumer task is supplied with the appropriate timestamp from the kdsReplicationCheckpoint table according to the streamName to which the task streams the data.

After all the consumer tasks are started, data is produced to the Kinesis data streams in us-east-2. From there on, the process of cross-Region replication is the same as described earlier – the replication Lambda function in us-east-2 starts replicating data to the data stream in us-east-1.

The consumer applications reading data from the streams are expected to be idempotent to be able to handle duplicates. Duplicates can be introduced in the stream due to many reasons, some of which are called out below.

  • The Producer or the CDC Processor introduces duplicates into the stream while replaying the CDC data during a failover
  • DynamoDB Global Table uses asynchronous replication of data across Regions and if the kdsReplicationCheckpoint table data has a replication lag, the failover process may potentially use an older checkpoint timestamp to replay the CDC data.

Also, consumer applications should checkpoint the CommitTimestamp of the last record that was consumed. This is to facilitate better monitoring and recovery.

Path to maturity: Automated recovery

The ideal state is to fully automate the failover process, reducing time to recover and meeting the resilience Service Level Objective (SLO). However, in most organizations, the decision to fail over, fail back, and trigger the failover requires manual intervention in assessing the situation and deciding the outcome. Creating scripted automation to perform the failover that can be run by a human is a good place to start.

Vanguard has automated all of the steps of failover, but still have humans make the decision on when to invoke it. You can customize the solution to meet your needs and depending on the CDC processor tool you use in your environment.

Conclusion

In this post, we described how Vanguard innovated and built a solution for replicating data across Regions in Kinesis Data Streams to make the data highly available. We also demonstrated a robust checkpoint strategy to facilitate a Regional failover of the replication process when needed. The solution also illustrated how to use DynamoDB global tables for tracking the replication checkpoints and configuration. With this architecture, Vanguard was able to deploy workloads depending on the CDC data to multiple Regions to meet business needs of high availability in the face of service impairments impacting CDC pipelines in the primary Region.

If you have any feedback please leave a comment in the Comments section below.


About the authors

Raghu Boppanna works as an Enterprise Architect at Vanguard’s Chief Technology Office. Raghu specializes in Data Analytics, Data Migration/Replication including CDC Pipelines, Disaster Recovery and Databases. He has earned several AWS Certifications including AWS Certified Security – Specialty & AWS Certified Data Analytics – Specialty.

Parameswaran V Vaidyanathan is a Senior Cloud Resilience Architect with Amazon Web Services. He helps large enterprises achieve the business goals by architecting and building scalable and resilient solutions on the AWS Cloud.

Richa Kaul is a Senior Leader in Customer Solutions serving Financial Services customers. She is based out of New York. She has extensive experience in large scale cloud transformation, employee excellence, and next generation digital solutions. She and her team focus on optimizing value of cloud by building performant, resilient and agile solutions. Richa enjoys multi sports like triathlons, music, and learning about new technologies.

Mithil Prasad is a Principal Customer Solutions Manager with Amazon Web Services. In his role, Mithil works with Customers to drive cloud value realization, provide thought leadership to help businesses achieve speed, agility, and innovation.

How Ruparupa gained updated insights with an Amazon S3 data lake, AWS Glue, Apache Hudi, and Amazon QuickSight

Post Syndicated from Adrianus Kurnadi original https://aws.amazon.com/blogs/big-data/how-ruparupa-gained-updated-insights-with-an-amazon-s3-data-lake-aws-glue-apache-hudi-and-amazon-quicksight/

This post is co-written with Olivia Michele and Dariswan Janweri P. at Ruparupa.

Ruparupa was built by PT. Omni Digitama Internusa with the vision to cultivate synergy and create a seamless digital ecosystem within Kawan Lama Group that touches and enhances the lives of many.

Ruparupa is the first digital platform built by Kawan Lama Group to give the best shopping experience for household, furniture, and lifestyle needs. Ruparupa’s goal is to help you live a better life, shown by the meaning of the word ruparupa, which means “everything.” We believe that everyone deserves the best, and home is where everything starts.

In this post, we show how Ruparupa implemented an incrementally updated data lake to get insights into their business using Amazon Simple Storage Service (Amazon S3), AWS Glue, Apache Hudi, and Amazon QuickSight. We also discuss the benefits Ruparupa gained after the implementation.

The data lake implemented by Ruparupa uses Amazon S3 as the storage platform, AWS Database Migration Service (AWS DMS) as the ingestion tool, AWS Glue as the ETL (extract, transform, and load) tool, and QuickSight for analytic dashboards.

Amazon S3 is an object storage service with very high scalability, durability, and security, which makes it an ideal storage layer for a data lake. AWS DMS is a database migration tool that supports many relational database management services, and also supports Amazon S3.

An AWS Glue ETL job, using the Apache Hudi connector, updates the S3 data lake hourly with incremental data. The AWS Glue job can transform the raw data in Amazon S3 to Parquet format, which is optimized for analytic queries. The AWS Glue Data Catalog stores the metadata, and Amazon Athena (a serverless query engine) is used to query data in Amazon S3.

AWS Secrets Manager is an AWS service that can be used to store sensitive data, enabling users to keep data such as database credentials out of source code. In this implementation, Secrets Manager is used to store the configuration of the Apache Hudi job for various tables.

Data analytic challenges

As an ecommerce company, Ruparupa produces a lot of data from their ecommerce website, their inventory systems, and distribution and finance applications. The data can be structured data from existing systems, and can also be unstructured or semi-structured data from their customer interactions. This data contains insights that, if unlocked, can help management make decisions to help increase sales and optimize cost.

Before implementing a data lake on AWS, Ruparupa had no infrastructure capable of processing the volume and variety of data formats in a short time. Data had to be manually processed by data analysts, and data mining took a long time. Because of the fast growth of data, it took 1–1.5 hours just to ingest data, which was hundreds of thousands of rows.

The manual process caused inconsistent data cleansing. After the data had been cleansed, some processes were often missing, and all the data had to go through another process of data cleansing.

This long processing time reduced the analytic team’s productivity. The analytic team could only produce weekly and monthly reports. This delay in report frequency impacted delivering important insights to management, and they couldn’t move fast enough to anticipate changes in their business.

The method used to create analytic dashboards was manual and could only produce a few routine reports. The audience of these few reports was limited—a maximum of 20 people from management. Other business units in Kawan Lama Group only consumed weekly reports that were prepared manually. Even the weekly reports couldn’t cover all important metrics, because some metrics were only available in monthly reports.

Initial solution for a real-time dashboard

The following diagram illustrates the initial solution Ruparupa implemented.

Initial solution architecture

Ruparupa started a data initiative within the organization to create a single source of truth within the company. Previously, business users could only get the sales data from the day before, and they didn’t have any visibility to current sales activities in their stores and websites.

To gain trust from business users, we wanted to provide the most updated data in an interactive QuickSight dashboard. We used an AWS DMS replication task to stream real-time change data capture (CDC) updates to an Amazon Aurora MySQL-Compatible Edition database, and built a QuickSight dashboard to replace the static presentation deck.

This pilot dashboard was accepted extremely well by the users, who now had visibility to their current data. However, the data source for the dashboard still resided in an Aurora MySQL database and only covered a single data domain.

The initial design had some additional challenges:

  • Diverse data source – The data source in an ecommerce platform consists of structured, semi-structured, and unstructured data, which requires flexible data storage. The initial data warehouse design in Ruparupa only stored transactional data, and data from other systems including user interaction data wasn’t consolidated yet.
  • Cost and scalability – Ruparupa wanted to build a future-proof data platform solution that could scale up to terabytes of data in the most cost-effective way.

The initial design also had some benefits:

  • Data updates – Data inside the initial data warehouse was delayed by 1 day. This was an improvement over the weekly report, but still not fast enough to make quicker decisions.

This solution only served as a temporary solution; we needed a more complete analytics solution that could serve more complex and larger data sources, faster, and cost-effectively.

Real-time data lake solution

To fulfill their requirements, Ruparupa introduced a mutable data lake, as shown in the following diagram.

Real time data lake solutions architecture

Let’s look at each main component in more detail.

AWS DMS CDC process

To get the real-time data from the source, we streamed the database CDC log using AWS DMS (component 1 in the architecture diagram). The CDC records consist of all inserts, updates, and deletes from the source database. This raw data is stored in the raw layer of the S3 data lake.

An S3 lifecycle policy is used to manage data retention, where the older data is moved to Amazon S3 Glacier.

AWS Glue ETL job

The second S3 data lake layer is the transformed layer, where the data is transformed to an optimized format that is ready for user query. The files are transformed to Parquet columnar format with snappy compression and table partitioning to optimize SQL queries from Athena.

In order to create a mutable data lake that can merge changes from the data source, we introduced an Apache Hudi data lake framework. With Apache Hudi, we can perform upserts and deletes on the transformed layer to keep the data consistent in a reliable manner. With a Hudi data lake, Ruparupa can create a single source of truth for all our data sources quickly and easily. The Hudi framework takes care of the underlying metadata of the updates, making it simple to implement across hundreds of tables in the data lake. We only need to configure the writer output to create a copy-on-write table depending on the access requirements. For the writer, we use an AWS Glue job writer combined with an AWS Glue Hudi connector frrom AWS Marketplace. The additional library from the connector helps AWS Glue understand how to write to Hudi.

An AWS Glue ETL job is used to get the changes from the raw layer and merge the changes in the transformed layer (component 2 in the architecture diagram). With AWS Glue, we are able to create a PySpark job to get the data, and we use the AWS Glue Connector for Apache Hudi to simplify the Hudi library import to the AWS Glue job. With AWS Glue, all the changes from AWS DMS can be merged easily to the Hudi data lake. The jobs are scheduled every hour using a built-in scheduler in AWS Glue.

Secrets Manager is used to store all the related parameters that are required to run the job. Instead of making one transformation job for each table, Ruparupa creates a single generic job that can transform multiple tables by using several parameters. The parameters that give details about the table structure are stored in Secrets Manager and can be retrieved using the name of the table as key. With these custom parameters, Ruparupa doesn’t need to create a job for every table—we can utilize a single job that can ingest the data for all different tables by passing the name of the table to the job.

All the metadata of the tables is stored in the AWS Glue Data Catalog, including the Hudi tables. This catalog is used by the AWS Glue ETL job, Athena query engine, and QuickSight dashboard.

Athena queries

Users can then query the latest data for their report using Athena (component 3 in the architecture diagram). Athena is serverless, so there is no infrastructure to provision or maintain. We can immediately use SQL to query the data lake to create a report or ingest the data to the dashboard.

QuickSight dashboard

Business users can use a QuickSight dashboard to query the data lake (component 4 in the architecture diagram). The existing dashboard is modified to get data from Athena, replacing the previous database. New dashboards were also created to fulfill continuously evolving needs for insights from multiple business units.

QuickSight is also used to notify certain parties when a value is reaching a certain threshold. An email alert is sent to an external notification and messaging platform so it can reach the end-user.

Business results

The data lake implementation in Ruparupa took around 3 months, with an additional month for data validation, before it was considered ready for production. With this solution, management can get the latest information view of their current state up to the last 1 hour. Previously, they could only generate weekly reports—now insights are available 168 times faster.

The QuickSight dashboard, which can be updated automatically, shortens the time required by the analytic team. The QuickSight dashboard now has more content—not only is transactional data reported, but also other metrics like new SKU, operation escalation for free services to customers, and monitoring SLA. Since April 2021 when Ruparupa started their QuickSight pilot, the number of dashboards has increased to around 70 based on requests from business users.

Ruparupa has hired new personnel to join the data analytic team to explore new possibilities and new use cases. The analytic team has grown from just one person to seven to handle new analytic use cases:

  • Merchandising
  • Operations
  • Store manager performance measurement
  • Insights of trending SKUs

Kawan Lama Group also has offline stores besides the ecommerce platform managed by Ruparupa. With the new dashboard, it’s easier to compare transaction data from online and offline stores because they now use the same platform.

The new dashboards also can be consumed by a broader audience, including other business units in Kawan Lama Group. The total users consuming the dashboard increased from just 20 users from management to around 180 users (9 times increase).

Since the implementation, other business units in Kawan Lama Group have increased their trust in the S3 data lake platform implemented by Ruparupa, because the data is more up to date and they can drill down to the SKU level to validate that the data is correct. Other business units can now act faster after an event like a marketing campaign. This data lake implementation has helped increase sales revenue in various business units in Kawan Lama Group.

Conclusion

Implementing a real-time data lake using Amazon S3, Apache Hudi, AWS Glue, Athena, and QuickSight gave Ruparupa the following benefits:

  • Yielded faster insights (hourly compared to weekly)
  • Unlocked new insights
  • Enabled more people in more business units to consume the dashboard
  • Helped business units in Kawan Lama Group act faster and increase sales revenue

If you’re interested in gaining similar benefits, check out Build a Data Lake Foundation with AWS Glue and Amazon S3.

You can also learn how to get started with QuickSight in the Getting Started guide.

Last but not least, you can learn about running Apache Hudi on AWS Glue in Writing to Apache Hudi tables using AWS Glue Custom Connector.


About the Authors

Olivia Michele is a Data Scientist Lead at Ruparupa, where she has worked in a variety of data roles over the past 5 years, including building and integrating Ruparupa data systems with AWS to improve user experience with data and reporting tools. She is passionate about turning raw information into valuable actionable insights and delivering value to the company.

Dariswan Janweri P. is a Data Engineer at Ruparupa. He considers challenges or problems as interesting riddles and finds satisfaction in solving them, and even more satisfaction by being able to help his colleagues and friends, “two birds one stone.” He is excited to be a major player in Indonesia’s technology transformation.

Adrianus Budiardjo Kurnadi is a Senior Solutions Architect at Amazon Web Services Indonesia. He has a strong passion for databases and machine learning, and works closely with the Indonesian machine learning community to introduce them to various AWS Machine Learning services. In his spare time, he enjoys singing in a choir, reading, and playing with his two children.

Nico Anandito is an Analytics Specialist Solutions Architect at Amazon Web Services Indonesia. He has years of experience working in data integration, data warehouses, and big data implementation in multiple industries. He is certified in AWS data analytics and holds a master’s degree in the data management field of computer science.

Enabling Microsoft Defender Credential Guard on Amazon EC2

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/enabling-microsoft-defender-credential-guard-on-amazon-ec2/

This blog post is written by Jason Nicholls, Principal Solutions Architect AWS.

In this post we show you how to enable that Windows Defender Credential Guard (Credential Guard) on Amazon Elastic Compute Cloud (Amazon EC2) running Microsoft Windows Server. Credential Guard, when enabled on Amazon EC2 Windows Instances protects sensitive user login information from being extracted from the Operating System (OS) memory.

Microsoft Windows stores credential material such as authentication tokens in the Local Security Authority (LSA), a process in the Microsoft Windows operating system that is responsible for enforcing the security policy of the system. Traditionally these credentials are stored in LSA’s process memory accessible to the rest of the OS, making it possible for a privileged user to extract these credentials. The use of complex passwords, limiting privileged account use, and training users not to use the same password across multiple systems are steps that limit the use of credentials extracted from the LSA, but customers want to isolate these credentials from other rest of the OS.

With Credential Guard enabled, the LSA is isolated by Windows virtualization-based security (VBS). VBS is a suite of Windows security mechanisms that use hardware virtualization features to create an isolated compute environment to store user credentials referred to as the isolated LSA. The isolated LSA is inaccessible to the rest of the OS.

Prerequisites

Credential Guard requires a virtualization technology that supports Virtualization-based Security (VBS) and Unified Extensible Firmware Interface (UEFI) Secure Boot. Optionally, Credential Guard can leverage a Trusted Platform Module (TPM) to further secure credentials. In this walk-through we will use NitroTPM, a virtual TPM 2.0-compliant (ISO/IEC 11889:2015) module for your Amazon EC2 instances to enhance the security of the isolated LSA.

Launch your instance

A list of Windows Amazon Machine Image (AMI)s preconfigured to enable UEFI Secure Boot is available in our “Launch an instance with UEFI Secure Boot support ” guide. You can verify that your AMI supports UEFI Secure Boot and NitroTPM using the AWS Command Line Interface (AWS CLI) describe-images command as follows:

aws ec2 describe-images --image-ids ami-0123456789

When UEFI Secure Boot and NitroTPM are enabled for the AMI, “TpmSupport“: “v2.0“, and “BootMode”: “uefi” appear in the output respectively, such as in the following example.

{
   "Images": [
      {
         ...
         "BootMode": "uefi",
         "TpmSupport": "v2.0"
      }
   ]
}

Before launching the AMI verify that the instance type you want to launch supports UEFI boot mode and uses the Nitro System by using the DescribeInstanceTypes API call. Using the AWS CLI and calling describe-instance-types as follows:

aws ec2 describe-instance-types --instance-types [INSTANCE_TYPE] --region [REGION]

Where INSTANCE_TYPE is a supported instance type as defined in the documentation, an example is c5.large and REGION is a supported AWS Region.

The output of the command should display nitro as the hypervisor and list uefi as a supported boot mode. For example:

{
   "InstanceTypes": [
   {
      ...
      "Hypervisor": "nitro",
      ...
      "SupportedBootModes": [
                "legacy-bios",
                "uefi"
      ]
   }
}

Use the Amazon EC2 console or AWS Command Line Interface (AWS CLI) to launch an Amazon EC2 instance which has “uefi” boot mode enabled. Administrator access to the Amazon EC2 instance is required to enable Credential Guard.

Walkthrough:

Enabling Credential Guard with Amazon EC2 Launch an Amazon EC2 Windows Instance

Launch an Amazon EC2 Windows Instance using a Windows AMI preconfigured to enable UEFI Secure Boot with Microsoft Windows Secure Boot Keys on an instance type that supports UEFI Secure Boot.

You can use the launch wizard in the Amazon EC2 console or run-instances command via the AWS CLI to launch an instance that can support Credential Guard. You need a compatible AMI ID for launching your instance which is unique for each AWS Region. You can use the following link to discover and launch instances with compatible Amazon-provided AMIs in the Amazon EC2 console:

Enable Credential Guard on the launched Instance

Now that you know how to create an AMI with UEFI Secure Boot support enabled, let’s create a Windows instance and configure Credential Guard.

Credential Guard can be enabled either by using Group Policies, the Windows Registry, or the Windows Defender Device Guard and Windows Defender Credential Guard hardware readiness tool (DG-Readiness tool). In this post we will show you how to enable Credential Guard using the Windows Registry and via the DG-Readiness tool.

Option 1: Enabling Credential Guard via the Windows Registry

Start the Amazon EC2 instance using an AMI that has the BootMode set to “uefi“. Windows AMI must be preconfigured to enable UEFI Secure Boot with Microsoft Windows Secure Boot keys as we defined earlier. Once you’re connected to the instance use the Windows System Information tool to check that Credential Guard isn’t running on the instance:

  1. Select Start, type msinfo32.exe, and then select System Information.
  2. Select System Summary on the left.
  3. Confirm that Virtualization-based security is Not Enabled.

Figure 1 Overview of System Information in WindowsFigure 1 Overview of System Information in Windows

Figure 2 System Information confirming that Credential Guard is Not Enabled

Figure 2 System Information confirming that Credential Guard is Not Enabled

4. Open the Windows Command Shell by selecting Start, type cmd.exe, and then press Enter.

5. Run the following commands from the Windows Command Shell to enable Credential Guard using the Windows Registry:

REG ADD "HKLM\System\CurrentControlSet\Control\DeviceGuard" /v EnableVirtualizationBasedSecurity /t REG_DWORD /d 1 /f

REG ADD "HKLM\System\CurrentControlSet\Control\DeviceGuard" /v RequirePlatformSecurityFeatures /t REG_DWORD /d 1 /f

REG ADD "HKLM\System\CurrentControlSet\Control\LSA" /v LsaCfgFlags /t REG_DWORD /d 2 /f

REG ADD "HKLM\SYSTEM\CurrentControlSet\Control\DeviceGuard\Capabilities\" /v "HyperVEnabled" /t REG_DWORD /d 1 /f

Figure 3 Update the Windows Registry to enable Credential GuardFigure 3 Update the Windows Registry to enable Credential Guard

6. Once the changes have been made, reboot the instance.

7. When the instance is back up, reconnect to the instance and select Start, type msinfo32.exe, and then select System Information.

8. Select System Summary on the left.

9. Note that Virtualisation-based security is now changed to Running, and that Secure Boot and Credential Guard are enabled.

Figure 4 Overview of System Information showing Credential Guard is EnabledFigure 4 Overview of System Information showing Credential Guard is Enabled

Figure 5 System Information shows Credential Guard is now EnabledFigure 5 System Information shows Credential Guard is now Enabled

Option 2: Enabling Credential Guard via the DG-Readiness tool

Option 1 requires a remote desktop session to your Windows Instance. Option 2 is run via PowerShell which can either be done in a remote desktop session as described here or via a remote PowerShell session.

Start the Amazon EC2 instance using an AMI that has the BootMode set to “uefi“. Windows AMI must be preconfigured to enable UEFI Secure Boot with Microsoft Windows Secure Boot keys as we defined earlier. Once you’re connected to the instance start a PowerShell session:

  1. Select Start, type PowerShell, and then click Windows PowerShell.

Figure 6 Start PowerShell via the Start BarFigure 6 Start PowerShell via the Start Bar

      2. Download the DG-Readiness tool by running the command:

wget https://download.microsoft.com/download/B/D/8/BD821B1F-05F2-4A7E-AA03-DF6C4F687B07/dgreadiness_v3.6.zip -outfile dgreadiness.zip

Figure 7 Download the DG-Readiness toolFigure 7 Download the DG-Readiness tool

  1. Uncompress the downloaded zip file using the Expand-Archive function within PowerShell
Expand-Archive -Path C:\Users\Administrator\dgreadiness.zip -DestinationPath C:\dgreadiness
  1. Move the DG-Readiness tool to the current folder

copy C:\dgreadiness\dgreadiness_v3.6\DG_Readiness_Tool_v3.6.ps1 .\

Figure 8 Copy the DG-Readiness tool to the current folderFigure 8 Copy the DG-Readiness tool to the current folder

  1. Confirm that Credential Guard is disabled by running the DG-Readiness tool with the -Ready option, as follows:
DG_Readiness_Tool_v3.6.ps1 -Ready

Figure 9 Using the DG-Readiness tool to confirm Credential Guard is not enabledFigure 9 Using the DG-Readiness tool to confirm Credential Guard is not enabled

  1. Enable Credential Guard using the -Enable -CG options as follows:
DG_Readiness_Tool_v3.6.ps1 -Enable -CG
  1. Reboot the instance
  2. Reconnect to the instance after the reboot and confirm that Credential Guard is now running by running the command:
DG_Readiness_Tool_v3.6.ps1 -Ready

Figure 10 DG Readiness Tool shows Credential Guard is enabled and runningFigure 10 DG Readiness Tool shows Credential Guard is enabled and running

Conclusion

With support for Windows Defender Credential Guard on Amazon EC2 Windows Instances customers can create an isolated compute environment that is inaccessible to the rest of the OS. Credential Guard requires UEFI Secure Boot support. Credential Guard can leverage NitroTPM to further secure credentials.

To learn more about the support of Credential Guard visit documentation.

Maintaining Code Quality with Amazon CodeCatalyst Reports

Post Syndicated from Imtranur Rahman original https://aws.amazon.com/blogs/devops/maintaining-code-quality-with-amazon-codecatalyst-reports/

Amazon CodeCatalyst reports contain details about tests that occur during a workflow run. You can create tests such as unit tests, integration tests, configuration tests, and functional tests. You can use a test report to help troubleshoot a problem during a workflow.

Introduction

In prior posts in this series, I discussed reading The Unicorn Project, by Gene Kim, and how the main character, Maxine, struggles with a complicated Software Development Lifecycle (SDLC) after joining a new team. One of the challenges she encounters is the difficulties in shipping secure, functioning code without an automated testing mechanism. To quote Gene Kim, “Without automated testing, the more code we write, the more money it takes for us to test.”

Software Developers know that shipping vulnerable or non-functioning code to a production environment is to be avoided at all costs; the monetary impact is high and the toll it takes on team morale can be even greater. During the SDLC, developers need a way to easily identify and troubleshoot errors in their code.

In this post, I will focus on how developers can seamlessly run tests as a part of workflow actions as well as configure unit test and code coverage reports with Amazon CodeCatalyst. I will also outline how developers can access these reports to gain insights into their code quality.

Prerequisites

If you would like to follow along with this walkthrough, you will need to:

Walkthrough

As with the previous posts in the CodeCatalyst series, I am going to use the Modern Three-tier Web Application blueprint. Blueprints provide sample code and CI/CD workflows to help you get started easily across different combinations of programming languages and architectures. To follow along, you can re-use a project you created previously, or you can refer to a previous post that walks through creating a project using the Three-tier blueprint.

Once the project is deployed, CodeCatalyst opens the project overview. This view shows the content of the README file from the project’s source repository, workflow runs, pull requests, etc. The source repository and workflow are created for me by the project blueprint. To view the source code, I select Code → Source Repositories from the left-hand navigation bar. Then, I select the repository name link from the list of source repositories.

Figure 1. List of source repositories including Mythical Mysfits source code.

Figure 1. List of source repositories including Mythical Mysfits source code.

From here I can view details such as the number of branches, workflows, commits, pull requests and source code of this repo. In this walkthrough, I’m focused on the testing capabilities of CodeCatalyst. The project already includes unit tests that were created by the blueprint so I will start there.

From the Files list, navigate to web → src → components→ __tests__ → TheGrid.spec.js. This file contains the front-end unit tests which simply check if the strings “Good”, “Neutral”, “Evil” and “Lawful”, “Neutral”, “Chaotic” have rendered on the web page. Take a moment to examine the code. I will use these tests throughout the walkthrough.

Figure 2. Unit test for the front-end that test strings have been rendered properly.

Figure 2. Unit test for the front-end that test strings have been rendered properly. 

Next, I navigate to the  workflow that executes the unit tests. From the left-hand navigation bar, select CI/CD → Workflows. Then, find ApplicationDeploymentPipeline, expand Recent runs and select  Run-xxxxx . The Visual tab shows a graphical representation of the underlying YAML file that makes up this workflow. It also provides details on what started the workflow run, when it started,  how long it took to complete, the source repository and whether it succeeded.

Figure 3. The Deployment workflow open in the visual designer.

Figure 3. The Deployment workflow open in the visual designer.

Workflows are comprised of a source and one or more actions. I examined test reports for the back-end in a prior post. Therefore, I will focus on the front-end tests here. Select the build_and_test_frontend action to view logs on what the action ran, its configuration details, and the reports it generated. I’m specifically interested in the Unit Test and Code Coverage reports under the Reports tab:

Figure 4. Reports tab showing line and branch coverage.

Figure 4. Reports tab showing line and branch coverage.

Select the report unitTests.xml (you may need to scroll). Here, you can see an overview of this specific report with metrics like pass rate, duration, test suites, and the test cases for those suites:

Figure 5. Detailed report for the front-end tests

Figure 5. Detailed report for the front-end tests.

This report has passed all checks.  To make this report more interesting, I’ll intentionally edit the unit test to make it fail. First, navigate back to the source repository and open web → src → components→ __tests__→TheGrid.spec.js. This test case is looking for the string “Good” so change it to say “Best” instead and commit the changes.

Figure 6. Front-End Unit Test Code Change.

Figure 6. Front-End Unit Test Code Change.

This will automatically start a new workflow run. Navigating back to CI/CD →  Workflows, you can see a new workflow run is in progress (takes ~7 minutes to complete).

Once complete, you can see that the build_and_test_frontend action failed. Opening the unitTests.xml report again, you can see that the report status is in a Failed state. Notice that the minimum pass rate for this test is 100%, meaning that if any test case in this unit test ever fails, the build fails completely.

There are ways to configure these minimums which will be explored when looking at Code Coverage reports. To see more details on the error message in this report, select the failed test case.

Figure 7. Failed Test Case Error Message.

Figure 7. Failed Test Case Error Message.

As expected, this indicates that the test was looking for the string “Good” but instead, it found the string “Best”. Before continuing, I return to the TheGrid.spec.js file and change the string back to “Good”.

CodeCatalyst also allows me to specify code and branch coverage criteria. Coverage is a metric that can help you understand how much of your source was tested. This ensures source code is properly tested before shipping to a production environment. Coverage is not configured for the front-end, so I will examine the coverage of the back-end.

I select Reports on the left-hand navigation bar, and open the report called backend-coverage.xml. You can see details such as line coverage, number of lines covered, specific files that were scanned, etc.

Figure 8. Code Coverage Report Succeeded.

Figure 8. Code Coverage Report Succeeded.

The Line coverage minimum is set to 70% but the current coverage is 80%, so it succeeds. I want to push the team to continue improving, so I will edit the workflow to raise the minimum threshold to 90%. Navigating back to CI/CD → Workflows → ApplicationDeploymentPipeline, select the Edit button. On the Visual tab, select build_backend. On the Outputs tab, scroll down to Success Criteria and change Line Coverage to 90%.

Figure 9. Configuring Code Coverage Success Criteria.

Figure 9. Configuring Code Coverage Success Criteria.

On the top-right, select Commit. This will push the changes to the repository and start a new workflow run. Once the run has finished, navigate back to the Code Coverage report. This time, you can see it reporting a failure to meet the minimum threshold for Line coverage.

Figure 10. Code Coverage Report Failed.

There are other success criteria options available to experiment with. To learn more about success criteria, see Configuring success criteria for tests.

Cleanup

If you have been following along with this workflow, you should delete the resources you deployed so you do not continue to incur charges. First, delete the two stacks that CDK deployed using the AWS CloudFormation console in the AWS account you associated when you launched the blueprint. These stacks will have names like mysfitsXXXXXWebStack and mysfitsXXXXXAppStack. Second, delete the project from CodeCatalyst by navigating to Project settings and choosing Delete project.

Summary

In this post, I demonstrated how Amazon CodeCatalyst can help developers quickly configure test cases, run unit/code coverage tests, and generate reports using CodeCatalyst’s workflow actions. You can use these reports to adhere to your code testing strategy as a software development team. I also outlined how you can use success criteria to influence the outcome of a build in your workflow.  In the next post, I will demonstrate how to configure CodeCatalyst workflows and integrate Software Composition Analysis (SCA) reports. Stay tuned!

About the authors:

Imtranur Rahman

Imtranur Rahman is an experienced Sr. Solutions Architect in WWPS team with 14+ years of experience. Imtranur works with large AWS Global SI partners and helps them build their cloud strategy and broad adoption of Amazon’s cloud computing platform.Imtranur specializes in Containers, Dev/SecOps, GitOps, microservices based applications, hybrid application solutions, application modernization and loves innovating on behalf of his customers. He is highly customer obsessed and takes pride in providing the best solutions through his extensive expertise.

Wasay Mabood

Wasay is a Partner Solutions Architect based out of New York. He works primarily with AWS Partners on migration, training, and compliance efforts but also dabbles in web development. When he’s not working with customers, he enjoys window-shopping, lounging around at home, and experimenting with new ideas.

Using Porting Advisor for Graviton

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/using-porting-advisor-for-graviton/

This blog post is written by Ryan Doty Solutions Architect , AWS and Vishal Manan Sr. SSA, EC2 Graviton , AWS.

Introduction

AWS customers recognize that Graviton-based EC2 instances deliver price-performance benefits but many are concerned about the effort to port existing applications. Porting code from one architecture to another can result in a substantial investment in time and effort. AWS has worked continuously to improve the migration process for customers. We recently introduced the Porting Advisor for Graviton as a tool to further simplify the migration process. In this blog, we’ll walk you through how to use Porting Advisor for Graviton so that you can learn how to use it.

Porting Advisor for Graviton is an open-source, command-line tool that analyzes source code and generates a report highlighting missing or outdated libraries and code constructs that may require modification and provides a user with alternative recommendations. It helps customers and developers accelerate their transition to Graviton-based Amazon EC2 instances by reducing the iterative process of identifying and resolving source code and library dependencies. This blog post will provide you with a step-by-step implementation on how to use Porting Advisor for Graviton. At the end of the blog, you will be able to run Porting Advisor for Graviton on your source code tree, generating findings that will help simplify the effort required to port your application.

Porting Advisor for Graviton scans for potentially unsupported or non-portable arm64 code in source code trees. The tool only scans source code files for the programming languages C/C++, Fortran, Go 1.11+, Java 8+, Python 3+, and dependency files such as project/requirements.txt file(s). Most importantly the Porting Advisor doesn’t make any code modifications, API-level recommendations, or send data back to AWS.

You can utilize Porting Advisor for Graviton either as a Python script or compiled into a binary and run on x86-64 or arm64 systems. Therefore, it can be easily implemented as part of your build processes.

Expected Results

Porting Advisor for Graviton reports the following issues:

  1. Inline assembly with no corresponding arm64 inline assembly.
  2. Assembly source files with no corresponding arm64 assembly source files.
  3. Missing arm64 architecture detection in autoconf config.guess scripts.
  4. Linking against libraries that aren’t available on the arm64 architecture.
  5. Use  architecture specific intrinsics.
  6. Preprocessor errors that trigger when compiling on arm64.
  7. Usages of old Visual C++ runtime (Windows specific).

Compiler specific code guarded by compiler specific pre-defined macros is detected, but not reported by default. The following cross-compile specific issues are detected, but not reported by default:

  • Architecture detection that depends on the host rather than the target.
  • Use of build artifacts in the build process.

Skillsets needed for using the tool

Porting Advisor for Graviton is designed to be easy to use. Users though should be versed in the following skills in order to take advantage of the recommendations the tool provides:

  • An understanding of the build system – project requirements and dependencies, versioning, etc.
  • Basic scripting language skills around Python, PowerShell/Bash.
  • Understanding hardware (inline assembly for C/C++) or compiler specific (intrinsic for C/C++) constructs when applicable.
  • The ability to follow best practices in the AWS Graviton Technical Guide for code optimization.

How to use Porting Advisor for Graviton

Prerequisites

The tool requires a minimum version of Python 3.10 and Java (8+ to be installed). The installation of Java is also optional and only required if you want to scan JAR files for native method calls. You can run the tool on a Windows/Linux/MacOS machine, or in an EC2 instance. I will show case usage on Windows and Amazon Linux 2(AL2) running on an EC2 instance. It supports both arm64 and x86-64 processors.

You don’t need to be on an arm64-based processor to run the tool.

The tool doesn't need a lot of CPU Horsepower and even a system with few processors will do

You can run the tool as a Python script or an executable. The executable requires extra steps to build. However, it can be used on another machine without the need to install Python packages.

You must copy the complete “dist” folder for it to work, as there are libraries that are present in that folder along with the executable.

Porting Advisor for Graviton can be run as a binary or as a script. You must run the script to generate the binaries

./porting-advisor-linux-x86_64 ~/my/path/to/my/repo --output report.html 
./porting-advisor-linux-x86_64.exe ../test/CppCode/inline_assembly --output report.html

Running Porting Advisor for Graviton as an executable

The first step is to build the executable.

Building the executable

The first step is to set up the Python Environment. If you run into any errors building the binary, see the Troubleshooting section for more details.

Building the binary on Windows

Building Porting Advisor using Powershell

Building Porting Advisor binary

Building the binary on Linux/Mac

Using shell script to build on Linux or macOS

Porting Advisor binary saved in dist folder

Running the binary

Here you can see how you can run the tool on Linux as a binary for a C++ project.

Porting advisor binary run on a C++ codebase with 350 files

The “dist” folder will have the executable.

Running Porting Advisor for Graviton as a script

Enable the Python environment for the following:

Linux/Mac:

$. python3 -m venv .venv
$. source .venv/bin/activate

PowerShell:

PS> python -m venv .venv
PS> .\.venv\Scripts\Activate.ps1

The following shows how the tool would work on Windows when run as a script:

Running Porting Advisor on Windows as a powershell script

Running the Porting Advisor for Graviton on Linux as a Script

Setting up Python environment to run the Porting Advisor as a script

Running Porting Advisor on Linux as a script

Output of the Porting Advisor for Graviton

The Porting Advisor for Graviton requires the directory parameter to point to the folder where your source code lives. If there are multiple locations, then you can run the tool as part of the script.

If no output file is supplied only standard output will be produced. The following is the output of the tool in HTML format. The first line shows the total files scanned.

  1. If no issues are found, then you’ll see an output like the following:

Results of Porting Advisor run on C++ code with 2836 files with no issues found

  1. With x86-64 specific intrinsics, such as _bswap64, you’ll see it flagged. arm64 specific intrinsics won’t be flagged. Therefore, if your code has arm64 specific intrinsics, then the porting advisor will only flag x86-64 to arm64 and not vice versa. The primary goal of the tool is determining the arm64 readiness of your code.

Porting Advisor reporting inline assembly files in C++ code

  1. The scanner typically looks for source code files only, but it can also look for assembly files with *.s extensions. An example of a file with the C++ code with inline assembly code is as follows:

Porting Advisor reporting use of intrinsics in C++ code

  1. The tool has pointed out errors such as a preprocessor error and architecture-specific intrinsic usage errors.

Porting Advisor results run on C++ code pointing out missing preprocessor macros for arm64 and x86-64 specific intrinsics

Next steps

If you don’t see any issues reported by the tool, then you are in good shape for doing the port. If no issues are reported it does not guarantee that it will work as port. Porting Advisor for Graviton is a tool used best as a helper. Regardless of the issues reported by the tool, we still highly suggest testing the application thoroughly.

As a good practice, we recommend that you use the latest version of your libraries.

Based on the issue, you can consider further actions. For Compiler intrinsic errors, we recommend studying Intel and arm64 intrinsics.

Once you’ve migrated your code and are using Gravition, you can start to look at taking steps to optimize your performance. If you’re interested in doing that please look at our Getting Started Guide.

If you run into any issues, then see our CONTRIBUTING file.

FAQs

  1. How fast is the tool? The tool is able to scan 4048 files in 1.18 seconds.

On an arm64 Based Instance:

Porting Advisor scans 4048 files in 1.18 seconds

  1. False Positives?

This tool scans all files in a source tree, regardless of whether they are included by the build system or not. Therefore, it may misreport issues in files that appear in the source tree but are excluded by the build system. Currently, the tool supports the following languages/dependencies: C/C++, Fortran, Go 1.11+, Java 8+, and Python 3+.

For example: You may have legacy code using Python version 2.7 that is in the source tree but isn’t being used. The tool will scan the code base and point out issues in that particular codebase even though you may not be using that piece of code.  To mitigate, either  remove that folder from the source code or ignore the error pointed by the tool.

  1. I see mention of Ruby and .Net in the Open source tool, but they don’t work on my tool.

Ruby and .Net haven’t been implemented yet, but please consider contributing to it and open an issue requesting support. If you need support then see our CONTRIBUTING file.

Troubleshooting

Errors that you may encounter while building the tool binary:

PyInstaller needs a shared version of libraries.

  1. Python 3.10+ not having shared libraries for use by the PyInstaller tool.

Building Porting Advisor binaries

Pyinstaller failed at building binary and suggesting building Python configure script with --enable-shared on Linux or --enable-framework on macOS

The fix for this is to build your version of Python(3.10+) with the right flags:

./configure --enable-optimizations --enable-shared

If the two flags don’t work together, try doing the build with each flag enabled sequentially.

pyinstaller tool needs python configure script with --enable-shared and enable-optimizations flag

  1. Incorrect Python version (version less than 3.10).If you aren’t on the correct version of Python:

You will get errors similar to the ones here:

Python version on host is 3.7.15 which is less than the recommended version

If you want to run the tool in an EC2 instance on Amazon Linux 2(AL2), then you could try upgrading/installing Python 3.10 as pointed out here.

If you run into any issues, then see our CONTRIBUTING file.

Trying to run Porting Advisor as script will result in Syntax errors on Python version less than 3.10

Conclusion

Porting Advisor for Graviton helps customers quantify the amount of work that is required to port an application. It accelerates your ability to transition to Graviton-based Amazon EC2 instances by reducing the iterative process of identifying and resolving source code and library dependencies.

Resources

To learn how to migrate your workloads to Graviton-based instances, see the AWS Graviton Technical Guide GitHub Repository and AWS Graviton Transition Guide. To get started with Graviton-based Amazon EC2 instances, see the AWS Management Console, AWS Command Line Interface (AWS CLI), and AWS SDKs.

Some other resources include:

How SafeGraph built a reliable, efficient, and user-friendly Apache Spark platform with Amazon EMR on Amazon EKS

Post Syndicated from Nan Zhu original https://aws.amazon.com/blogs/big-data/how-safegraph-built-a-reliable-efficient-and-user-friendly-spark-platform-with-amazon-emr-on-amazon-eks/

This is a guest post by Nan Zhu, Engineering Manager/Software Engineer, SafeGraph, and Dave Thibault, Sr. Solutions Architect – AWS

SafeGraph is a geospatial data company that curates over 41 million global points of interest (POIs) with detailed attributes, such as brand affiliation, advanced category tagging, and open hours, as well as how people interact with those places. We use Apache Spark as our main data processing engine and have over 1,000 Spark applications running over massive amounts of data every day. These Spark applications implement our business logic ranging from data transformation, machine learning (ML) model inference, to operational tasks.

SafeGraph found itself with a less-than-optimal Spark environment with their incumbent Spark vendor. Their costs were climbing. Their jobs would suffer frequent retries from Spot Instance termination. Developers spent too much time troubleshooting and changing job configurations and not enough time shipping business value code. SafeGraph needed to control costs, improve developer iteration speed, and improve job reliability. Ultimately, SafeGraph chose Amazon EMR on Amazon EKS to meet their needs and realized 50% savings relative to their previous Spark managed service vendor.

If building Spark applications for our product is like cutting a tree, having a sharp saw becomes crucial. The Spark platform is the saw. The following figure highlights the engineering workflow when working with Spark, and the Spark platform should support and optimize each action in the workflow. The engineers usually start with writing and building the Spark application code, then submit the application to the computing infrastructure, and finally close the loop by debugging the Spark applications. Additionally, platform and infrastructure teams need to continually operate and optimize the three steps in the engineering workflow.

Figure 1 engineering workflow of Spark applications

There are various challenges involved in each action when building a Spark platform:

  • Reliable dependency management – A complicated Spark application usually brings many dependencies. To run a Spark application, we need to identify all dependencies, resolve any conflicts, pack dependent libraries reliably, and ship them to the Spark cluster. Dependency management is one of the biggest challenges for engineers, especially when they work with PySpark applications.
  • Reliable computing infrastructure – The reliability of the computing infrastructure hosting Spark applications is the foundation of the whole Spark platform. Unstable resource provisioning will not only cause negative impact over engineering efficiency, but it will also increase infrastructure costs due to reruns of the Spark applications.
  • Convenient debugging tools for Spark applications – The debugging tooling plays a key role for engineers to iterate fast on Spark applications. Performant access to the Spark History Server (SHS) is a must for developer iteration speed. Conversely, poor SHS performance slows developers and increases the cost of goods sold for software companies.
  • Manageable Spark infrastructure – A successful Spark platform engineering involves multiple aspects, such as Spark distribution version management, computing resource SKU management and optimization, and more. It largely depends on whether the Spark service vendors provide the right foundation for platform teams to use. The wrong abstraction over distribution version and computing resources, for example, could significantly reduce the ROI of platform engineering.

At SafeGraph, we experienced all of the aforementioned challenges. To resolve them, we explored the marketplace and found that building a new Spark platform on top of EMR on EKS was the solution to our roadblocks. In this post, we share our journey of building our latest Spark platform and how EMR on EKS serves as a robust and efficient foundation for it.

Reliable Python dependency management

One of the biggest challenges for our users to write and build Spark application code is the struggle of managing dependencies reliably, especially for PySpark applications. Most of our ML-related Spark applications are built with PySpark. With our previous Spark service vendor, the only supported way to manage Python dependencies was via a wheel file. Despite its popularity, wheel-based dependency management is fragile. The following figure shows two types of reliability issues faced with wheel-based dependency management:

  • Unpinned direct dependency – If the .whl file doesn’t pinpoint the version of a certain direct dependency, Pandas in this example, it will always pull the latest version from upstream, which may potentially contain a breaking change and take down our Spark applications.
  • Unpinned transitive dependency – The second type of reliability issue is more out of our control. Even though we pinned the direct dependency version when building the .whl file, the direct dependency itself could miss pinpointing the transitive dependencies’ versions (MLFlow in this example). The direct dependency in this case always pulls the latest versions of these transitive dependencies that potentially contain breaking changes and may take down our pipelines.

Figure 2 fragile wheel based dependency management

The other issue we encountered was the unnecessary installation of all Python packages referred by the wheel files for every Spark application initialization. With our previous setup, we needed to run the installation script to install wheel files for every Spark application upon starting even if there is no dependency change. This installation prolongs the Spark application start time from 3–4 minutes to at least 7–8 minutes. The slowdown is frustrating especially when our engineers are actively iterating over changes.

Moving to EMR on EKS enables us to use pex (Python EXecutable) to manage Python dependencies. A .pex file packs all dependencies (including direct and transitive) of a PySpark application in an executable Python environment in the spirit of virtual environments.

The following figure shows the file structure after converting the wheel file illustrated earlier to a .pex file. Compared to the wheel-based workflow, we don’t have transitive dependency pulling or auto-latest version fetching anymore. All versions of dependencies are fixed as x.y.z, a.b.c, and so on when building the .pex file. Given a .pex file, all dependencies are fixed so that we don’t suffer from the slowness or fragility issues in a wheel-based dependency management anymore. The cost of building a .pex file is a one-off cost, too.

Figure 3 PEX file structure

Reliable and efficient resource provisioning

Resource provisioning is the process for the Spark platform to get computing resources for Spark applications, and is the foundation for the whole Spark platform. When building a Spark platform in the cloud, using Spot Instances for cost optimization makes resource provisioning even more challenging. Spot Instances are spare compute capacity available to you at a savings of up to 90% off compared to On-Demand prices. However, when the demand for certain instance types grows suddenly, Spot Instance termination can happen to prioritize meeting those demands. Because of these terminations, we saw several challenges in our earlier version of Spark platform:

  • Unreliable Spark applications – When the Spot Instance termination happened, the runtime of Spark applications got prolonged significantly due to the retried compute stages.
  • Compromised developer experience – The unstable supply of Spot Instances caused frustration among engineers and slowed our development iterations because of the unpredictable performance and low success rate of Spark applications.
  • Expensive infrastructure bill – Our cloud infrastructure bill increased significantly due to the retry of jobs. We had to buy more expensive Amazon Elastic Compute Cloud (Amazon EC2) instances with higher capacity and run in multiple Availability Zones to mitigate issues but in turn paid for the high cost of cross-Availability Zone traffic.

Spark Service Providers (SSPs) like EMR on EKS or other third-party software products serve as the intermediate between users and Spot Instance pools, and play a key role to ensure the sufficient supply of Spot Instances. As shown in the following figure, users launch Spark jobs with job orchestrators, notebooks, or services via SSPs. The SSP implements their internal functionality to access the unused instances in the Spot Instance pool in cloud services like AWS. One of the best practices of using Spot Instances is to diversify instance types (for more information, see Cost Optimization using EC2 Spot Instances). Specifically, there are two key features for a SSP to achieve instance diversification:

  • The SSP should be able to access all types of instances in the Spot Instance pool in AWS
  • The SSP should provide functionality for users to use as many instance types as possible when launching Spark applications

Figure 4 SSP provides the access to the unused instances in Cloud Service Provider

Our last SSP doesn’t provide the expected solution to these two points. They only support a limited set of Spot Instance types and by default, allow only a single Spot Instance type to be selected when launching Spark jobs. As a result, each Spark application only runs with a small capacity of Spot Instances and is vulnerable to Spot Instance terminations.

EMR on EKS uses Amazon Elastic Kubernetes Service (Amazon EKS) for accessing Spot Instances in AWS. Amazon EKS supports all available EC2 instance types, bringing a much higher capacity pool to us. We use the features of Amazon EKS managed node groups and node selectors and taints to assign each Spark application to a node group that is made of multiple instance types. After moving to EMR on EKS, we observed the following benefits:

  • Spot Instance termination was less frequent and our Spark applications’ runtime became shorter and stayed stable.
  • Engineers were able to iterate faster as they saw improvement in the predictability of application behaviors.
  • The infrastructure costs dropped significantly because we no longer needed costly workarounds and, simultaneously, we had a sophisticated selection of instances in each node group of Amazon EKS. We were able to save approximately 50% of computing costs without the workarounds like running in multiple Availability Zones and simultaneously provide the expected level of reliability.

Smooth debugging experience

An infrastructure that supports engineers conveniently debugging the Spark application is critical to close the loop of our engineering workflow. Apache Spark uses event logs to record the activities of a Spark application, such as task start and finish. These events are formatted in JSON and are used by SHS to rerender the UI of Spark applications. Engineers can access SHS to debug task failure reasons or performance issues.

The major challenge for engineers in SafeGraph was the scalability issue in SHS. As shown in the left part of the following figure, our previous SSP forced all engineers to share the same SHS instance. As a result, SHS was under intense resource pressure due to many engineers accessing at the same time for debugging their applications, or if a Spark application had a large event log to be rendered. Prior to moving to EMR on EKS, we frequently experienced either slowness of SHS or SHS crashed completely.

As shown in the following figure, for every request to view Spark history UI, EMR on EKS starts an independent SHS instance container in an AWS-managed environment. The benefit of this architecture is two-fold:

  • Different users and Spark applications won’t compete for SHS resources anymore. Therefore, we never experience slowness or crashes of SHS.
  • All SHS containers are managed by AWS; users don’t need pay additional financial or operational costs to enjoy the scalable architecture.

Figure 5 SHS provisioning architecture in previous SSP and EMR on EKS

Manageable Spark platform

As shown in the engineering workflow, building a Spark platform is not a one-off effort, and platform teams need to manage the Spark platform and keep optimizing each step in the engineer development workflow. The role of the SSP should provide the right facilities to ease operational burden as much as possible. Although there are many types of operational tasks, we focus on two of them in this post: computing resource SKU management and Spark distro version management.

Computing resource SKU management refers to the design and process for a Spark platform to allow users to choose different sizes of computing instances. Such a design and process would largely rely on the relevant functionality implemented from SSPs.

The following figure shows the SKU management with our previous SSP.

Figure 6 (a) Previous SSP: Users have to explicitly specify instance type and availability zone

The following figure shows SKU management with EMR on EKS.

Figure 6 (b) EMR on EKS helps abstracting out instance types from users and make it easy to manage computing SKU

With our previous SSP, job configuration only allowed explicitly specifying a single Spot Instance type, and if that type ran out of Spot capacity, the job switched to On-Demand or fell into reliability issues. This left platform engineers with the choice of changing the settings across the fleet of Spark jobs or risking unwanted surprises for their budget and cost of goods sold.

EMR on EKS makes it much easier for the platform team to manage computing SKUs. In SafeGraph, we embedded a Spark service client between users and EMR on EKS. The Spark service client exposes only different tiers of resources to users (such as small, medium, and large). Each tier is mapped to a certain node group configured in Amazon EKS. This design brings the following benefits:

  • In the case of prices and capacity changes, it’s easy for us to update configurations in node groups and keep it abstracted from users. Users don’t change anything, or even feel it, and continue to enjoy the stable resource provisioning while we keep costs and operational overhead as low as possible.
  • When choosing the right resources for the Spark application, end-users don’t need to do any guess work because it’s easy to choose with simplified configuration.

Improved Spark distro release management is the other benefit we gain from EMR on EKS. Prior to using EMR on EKS, we suffered from the non-transparent release of Spark distro in our SSP. Every 1–2 months, there is a new patched version of Spark distro released to users. These versions are all exposed to users via their UI. This resulted in engineers choosing various versions of distro, some of which hadn’t been tested with our internal tools. It significantly increased the breaking rate of our pipelines, internal systems, and the support burden of platform teams. We expect that the risk from releases of Spark distros should be minimum and transparent to users with an EMR on EKS architecture.

EMR on EKS follows the best practices with a stable base Docker image containing a fixed version of Spark distro. For any change of Spark distro, we have to explicitly rebuild and roll out the Docker image. With EMR on EKS, we can keep a new version of Spark distro hidden from users before testing it with our internal toolings and systems and make a formal release.

Conclusion

In this post, we shared our journey building a Spark platform on top of EMR on EKS. EMR on EKS as the SSP serves as a strong foundation of our Spark platform. With EMR on EKS, we were able to resolve challenges ranging from dependency management, resource provisioning, and debugging experience, and also significantly reduce our computing cost by 50% due to higher availability of Spot Instance types and sizes.

We hope this post could share some insights to the community when choosing the right SSP for your business. Learn more about EMR on EKS, including benefits, features, and how to get started.


About the Authors

Nan Zhu is the engineering lead of the platform team in SafeGraph. He leads the team to build a broad range of infrastructure and internal toolings to improve the reliability, efficiency and productivity of the SafeGraph engineering process, e.g. internal Spark ecosystem, metrics store and CI/CD for large mono repos, etc. He is also involved in multiple open source projects like Apache Spark, Apache Iceberg, Gluten, etc.

Dave Thibault is a Sr. Solutions Architect serving AWS’s independent software vendor (ISV) customers. He’s passionate about building with serverless technologies, machine learning, and accelerating his AWS customers’ business success. Prior to joining AWS, Dave spent 17 years in life sciences companies doing IT and informatics for research, development, and clinical manufacturing groups. He also enjoys snowboarding, plein air oil painting, and spending time with his family.

Detecting solar panel damage with Amazon Rekognition Custom Labels

Post Syndicated from Ramakant Joshi original https://aws.amazon.com/blogs/architecture/detecting-solar-panel-damage-with-amazon-rekognition-custom-labels/

Enterprises perform quality control to ensure products meet production standards and avoid potential brand reputation damage. As the cost of sensors decreases and connectivity increases, industries adopt real-time imagery analysis to detect quality issues.

At the same time, artificial intelligence (AI) advancements enable advanced automation, reduce overall cost and project time, and produce accurate defect detection results in manufacturing plants. As these technologies mature, AI-driven inspections are more common outside of the plant environment.

Overview of solution

This post describes our SOLVED (Solar Roving Eye Detector) project leveraging machine learning (ML) to identify damaged solar panels using Amazon Rekognition Custom Labels and alert operators to take corrective action.

As solar adoption increases, so does the need to detect panel damage. Applying AWS-managed AI services is a simpler, more cost-effective approach than human solar panel inspection or custom-built production applications.

Customers can capture and process videos from the field and build effective computer vision models without creating a dedicated data science team. This approach can be generalized for use cases across industries to detect defects in wind turbines, cell phone towers, automotive parts, and other field components.

Amazon Rekognition Custom Labels builds off of existing service capabilities already trained to identify the objects and scenes in millions of cross-category images. You upload a small set of training images—typically a few hundred or less—into our console. The solution automatically loads and inspects the training data, selects the right ML algorithms, trains a model, and provides model performance metrics. You can then integrate your custom model into your applications through the Amazon Rekognition Custom Labels API.

Walkthrough

This post introduces the SOLVED project featured at the re:Invent 2021 Builders Fair. It will:

  • Review the need for solar panel damage detection
  • Discuss a cloud-based approach to ingest, store, process, analyze, and detect damaged solar panels
  • Present a diagram streaming videos from a Raspberry Pi, storing them on Amazon Simple Storage Service (Amazon S3), processing them using an AWS video-on-demand solution, and inferring damage using Amazon Rekognition
  • Introduce a console to mimic an operation center for appropriate action
  • Demonstrate the integration of AWS IoT Core with a Philips Hue bulb for operator alerts

Prerequisites

Before getting started, review the following prerequisites for this solution:

The SOLVED project

The SOLVED project leverages ML to identify damaged solar panels using Amazon Rekognition Custom Labels. It involves four steps:

  1. Data ingestion: Live solar panel video ingested from moving rover into an Amazon S3 bucket
  2. Pre-processing: Captured video split into thumbnail images
  3. Processing and visualization: ML models making real-time inferences to identify defective panels with a dashboard to review images and prediction scores
  4. Alerting: Defective panels result in notification sent through MQTT messages to light a smart bulb

Figure 1 shows the SOLVED project system architecture.

The SOLVED project system architecture

Figure 1. The SOLVED project system architecture

Installation steps

Let’s review each of the steps in this use case.

Data ingestion

The data ingestion layer of the SOLVED project consists of a continuous video stream captured as a rover moves through a field of solar panels.

We used a Freenove 4WD Smart Car rover with Raspberry Pi. The mounted camera captures video as it moves through the field. We installed an Amazon Kinesis Video Streams Producer on the Pi and streamed the live video to a Kinesis Video Stream named reinventbuilder2021.

Figure 2 shows the Kinesis Video Stream setup window for reinventbuilder2021.

Kinesis Video Stream setup for reinventbuilder2021

Figure 2. Kinesis Video Stream setup for reinventbuilder2021

To start streaming, use the following steps.

  1. Create a new Kinesis Video Stream using this Amazon Kinesis Video Streams Developer Guide
  2. Make a note of the Amazon Resource Name (ARN)
  3. On the Pi, access the command prompt and use aws sts get-session-token for temporary credentials. The IAM user should have the permissions for Kinesis Video Streams PutMedia.
  4. Set the following environment variables:
    export AWS_DEFAULT_REGION="us-east-1"
    export AWS_ACCESS_KEY_ID="xxxxx"
    export AWS_SECRET_ACCESS_KEY="yyyyy"
    export AWS_SESSION_TOKEN=“zzzzz”
  5. Start the streamer using the following command:
    cd ~/amazon-kinesis-video-streams-producer-sdk-cpp/build
    ./kvs_gstreamer_sample reinventbuilder2021
  6. Validate the captured stream by viewing the Media playback on the console.

Figure 3 shows the video stream console, including the Media playback option.

Video stream console with Media playback option

Figure 3. Video stream console with Media playback option

There are two ways to clip video snippets, which we’ll do next.

You can use the Download clip button on the video stream console as shown in Figure 4.

Choose your video streaming clip duration

Figure 4. Choose your video streaming clip duration

Alternately, you can use a script from the following command line:

ONE_MIN_AGO=$(date -v -30S -u "+%FT%T+0000")
NOW=$(date -u "+%FT%T+0000")

FILE_NAME=reinventbuilder-solved-$RANDOM.mp4
echo $FILE_NAME
S3_PATH=s3://videoondemandsplitter-source-e6lyof9qjv1j/

aws kinesis-video-archived-media get-clip --endpoint-url $KVS_DATA_ENDPOINT \
--stream-name reinventbuilder2021 \
--clip-fragment-selector "FragmentSelectorType=SERVER_TIMESTAMP,TimestampRange={StartTimestamp=$ONE_MIN_AGO,EndTimestamp=$NOW}" \
$FILE_NAME

echo "Running get-clip for stream"

sleep 45

aws s3 cp $FILE_NAME $S3_PATH
echo "copying file $FILE_NAME TO $S3_PATH"

The clip is available in the Amazon S3 source folder created using AWS CloudFormation, as shown in Figure 5.

Access your clip in the Amazon S3 source folder

Figure 5. Access your clip in the Amazon S3 source folder

Pre-processing

To process the video, we leverage Video on Demand at AWS. This solution encodes video files with AWS Elemental MediaConvert. Out of the box, it:

1. Automatically transcodes videos uploaded to Amazon S3 into formats suitable for playback on a range of devices using MediaConvert
2. Customizes MediaConvert job settings by uploading a custom file and using different settings per input
3. Stores transcoded files in a destination Amazon S3 bucket and uses CloudFront to deliver them to end viewers
4. Provides outputs including input file metadata, job settings, and output details in addition to transcoded video. These outputs are stored in a separate JSON file, available for further processing

For our use case, we used the frame capture feature to create a set of thumbnails from the source videos. The thumbnails are stored in the Amazon S3 bucket with the video output.

To deploy this solution, use the CloudFormation stack.

Processing and visualization

Every trained ML model requires quality training data. We began with publicly available solar panel images that were categorized as “good” or “defective” and uploaded the images to an Amazon S3 bucket into corresponding folders.

Next, we configured Amazon Rekognition Custom Labels with the folders to indicate the labels to use in training and deploying the model. Using the rover images, we tested the model.

We used the rover to record videos of good and damaged solar panels over an extended period and label the outcome favorably. The video was then split into individual frames using MediaConvert, giving us a well-labeled dataset that we trained our model with using Amazon Rekognition Custom Labels.

We used the model endpoint to infer outcomes on solar panels with varying damage footprints across multiple locations. AWS Elemental Mediaconvert expedited the process of curating the training set, and creating the model and endpoint using Amazon Rekognition was straightforward.

As shown in Figure 6, we used a training set of 7,000 images with an even mix of good and damaged panels.

A training set of images

Figure 6. A training set of images

Examples of good panel images are depicted in Figure 7.

Good panel images

Figure 7. Good panel images

Examples of damaged panel images are depicted in Figure 8.

Damaged panel images

Figure 8. Damaged panel images

In this use case, 90 percent model accuracy was achieved.

To visualize the results, we leveraged AWS Amplify to provide an operator interface to identify the damaged panels.

Figure 9 shows screenshots from the operator dashboard with output from the Amazon Custom Labels Rekognition model for good and defective panels.

Operator dashboard in AWS Amplify

Figure 9. Operator dashboard in AWS Amplify

Alerting

Maintenance teams must be notified of defective panels to take corrective action. To create alerts, we configured AWS IoT Core to send MQTT messages to a Philips Hue smart bulb, with red bulbs indicating defective panels. To set up the Philips Hue API, use the How to develop for Hue guide.

For example, here’s the API to change color:

PUT https://192.xx.xx.xx/api/xxxxxxx/lights/1/state

{"on":true, "sat":254, "bri":254,"hue":20000} 

turns color to green

{"on":true, "sat":254, "bri":254,"hue":1000}

turns to red.

We set up a client on the Pi that listens on an AWS IoT Core MQTT topic and makes an API request to Philips Hue.

To connect a device to AWS IoT, complete these steps:

  1. Create an IoT thing, a device certificate, and an AWS IoT policy. An AWS IoT thing represents a physical device (in this case, Raspberry Pi) and contains static device metadata, as shown in Figure 10.
    AWS IoT Thing

    Figure 10. AWS IoT Thing

    2. Create a device certificate, required to connect to and authenticate with AWS IoT. An example is shown in Figure 11.

Device certificate

Figure 11. Device certificate

3. Associate an AWS IoT policy with each device certificate. They determine which AWS IoT resources the device can access. In this case, we allowed iot.*, giving the device access to all IoT resources, as shown in Figure 12.

IoT policy

Figure 12. IoT policy

Devices and other clients use an AWS IoT root CA certificate to authenticate the server they’re communicating with. For more on how devices authenticate with AWS IoT Core, see Server authentication in the AWS IoT Core Developer Guide. Copy the certificate chain to the Raspberry Pi.

For communication with the Philips Hue, we used the Qhue wrapper as shown in Figure 13.

Qhue wrapper

Figure 13. Qhue wrapper

The authors presented a demo of this solution at re:Invent 2021 Builder’s Fair.

Author demo at re:Invent 2021 Builder's Fair

Figure 14. Author demo at re:Invent 2021 Builder’s Fair

Clean up

If you used the CloudFormation stack, delete it to avoid unexpected future charges. Delete Amazon S3 buckets and terminate Amazon Rekognition jobs to stop accruing charges.

Conclusion

Amazon Rekognition helps customers collect images in the field and apply AI-based analysis to interpret the condition of assets within the images.

In this post, you learned how to configure the Kinesis Video Stream producer on a Raspberry Pi to upload captured videos to Amazon Kinesis Video streams. You also learned how to save video streams to Amazon S3 and leverage the Video on Demand at AWS solution.

Using AWS MediaConvert, we transcoded the videos and create a set of thumbnails from the source videos. We then used Amazon Rekognition Custom Labels to train and deploy models for solar panel damage detection. Finally, we configured AWS IoT core to send MQTT messages to a Philips Hue smart bulb for notifications.

In this post, we presented a serverless architecture on AWS to detect defective solar panels. The reference architecture diagram is adaptable to solve inspection and damage detection problems across other industries.

How Strategic Blue uses Amazon QuickSight and AWS Cost and Usage Reports to help their customers save millions

Post Syndicated from Frank Contrepois original https://aws.amazon.com/blogs/big-data/how-strategic-blue-uses-amazon-quicksight-and-aws-cost-and-usage-reports-to-help-their-customers-save-millions/

This is a guest post co-written with Frank Contrepois from Strategic Blue.

For over 10 years, Strategic Blue has helped organizations unlock the most value from the cloud by enabling their customers to purchase non-standard commitments. By taking a commodity trading approach to purchasing from AWS, Strategic Blue helps customers purchase commitments for varying lengths of time, such as a 9-month Reserved Instance or an 18-month Savings Plan. Over the years, they’ve been able to help customers save millions by maximizing commitments.

In this post, we share how Strategic Blue uses Amazon QuickSight and AWS Cost and Usage Reports to help their customers save costs.

The challenge

Buying and selling AWS differently means that the cost and usage data available to Strategic Blue on the AWS Management Console only matches what they purchased—not what their customers purchased. In order to accurately bill their customers, Strategic Blue has built a billing system that takes AWS Cost and Usage Reports (CUR) as input and outputs a pro forma CUR that matches what their customers would expect to see. Although the CUR remains a critical source of data, it can have hundreds of columns and millions of rows, making it very difficult for Strategic Blue’s customers to visualize and understand.

Initially, in order to give their customers a way to visualize and understand their spend, Strategic Blue started building a dashboard from scratch using QuickSight, AWS Glue, and Amazon Athena. The idea was to build a dashboard accessible to all of Strategic Blue’s customers (as soon as they were onboarded) that enabled them to see their true cost and usage. QuickSight provides modern interactive dashboards that enable fast time-to-insights and have features that allow for the easy embedding of dashboards into web applications linked to a customer’s single sign-on (SSO) solution. Combined with row-level-security, a dashboard can be pre-filtered to show an individual user’s cost and usage information for only the accounts they are responsible for.

However, figuring out how to accurately show cost and usage insights from the CUR from scratch was a challenge. The amount of work required to decode the raw content of the CUR to build something that matches what you would see in AWS Cost Explorer quickly became overwhelming.

The solution

To help you visualize the treasure trove of data available in a CUR file, AWS created the Cloud Intelligence Dashboards solution. The Cloud Intelligence Dashboards provide a full stack of capabilities, deployed via AWS CloudFormation or other methods, that include six different QuickSight dashboards geared towards helping you optimize your spend on AWS.

Strategic Blue deployed the CUDOS Dashboard (one of the six dashboards available as of this writing) to solve their business case for providing their customers with accurate, comprehensive insights into their cost and usage.

“The Cloud Intelligence Dashboards provided all of the accurate CUR calculations and maintenance we need. This saves us many hours of reverse engineering to keep pace with AWS innovation and changes as their product and billing mechanisms mature.”

– Frank Contrepois, Head of FinOps at Strategic Blue.

The following screenshot shows an example of the dashboard.

Embedding and scaling to all customers

Strategic Blue uses the embedding and row-level security features of QuickSight to scale the dashboards to every customer the moment they are onboarded. By integrating their SSO Amazon Cognito with QuickSight, Strategic Blue is able to provide customers instant access to an instance of CUDOS through a portal as soon as they are onboarded to Strategic Blue’s platform. The row-level security setup managed by administrators within Strategic Blue makes sure that their customers only ever see their own cost and usage data. This helps Strategic Blue offer the protection and security customers expect.

To see the Strategic Blue portal with embedded CUDOS in action, watch the following demo video, or to unlock your savings potential in the cloud, sign up with Strategic Blue now.

Now that CUDOS is embedded into Strategic Blue’s portal, they are able to offer additional value free of charge to their customers. Strategic Blue’s portal uses the customer’s CUDOS dashboard to find opportunities to save and optimize. After the engagement, Strategic Blue will continue to analyze the impact of their customer’s cost optimization measures, reporting back to them what they’ve saved and what additional opportunities there are to further optimize.

Strategic Blue is already seeing an impact. Since enabling an embedded version of CUDOS for every customer and for internal teams, Strategic Blue is seeing a 27% increase in positive feedback for regular meetings, 31% fewer support tickets related to questions about cost, and a move from tactical to strategic FinOps conversations.

The future

Strategic Blue plans to evaluate and embed the rest of the Cloud Intelligence Dashboards, such as the KPI Dashboard to help customers track goals and metrics around cost optimization, and the Trusted Advisor Organizational Dashboard, which helps customers stay on top of underutilized resources as well as operational information such as security and fault tolerance postures. With the foundation in place and QuickSight’s ease of use, Strategic Blue will continue working closely with customers to customize and refine the dashboards to meet specific customer needs, improving and expanding their offerings.

Conclusion

In this post, we covered how Strategic Blue used Cloud Intelligence Dashboards built on QuickSight to help customers get accurate insights into their cloud cost and usage. Strategic Blue’s customers can access stunning and interactive dashboards that can be customized with ease and without the need for technical skills. The comprehensive embedding capabilities, granular row-level-security, and SSO integrations of QuickSight ensure that Strategic Blue is able to provide customers secure and seamless access to their Cloud Intelligence Dashboards.

It’s easy to install Cloud Intelligence Dashboards in your account. Visit Cloud Intelligence Dashboards for details on how to deploy the dashboards today, and let us know how it goes.


About the Authors

Frank Contrepois is the Head of FinOps at Strategic Blue. Frank and his team support a wide range of customers to implement Cloud FinOps including small-and medium-sized organizations, private equity (PE) funds and investors, enterprises and global cloud resellers striking significant deals. He is an AWS APN Ambassador and has several AWS and GCP pro-level certifications. Outside of work, he is a husband, father of two wonderful boys, and co-host of the “What’s new with in Cloud FinOps” podcast.

Aaron Edell is Head of GTM for Customer Cloud Intelligence for Amazon Web Services. He is responsible for building and scaling businesses around Cloud Financial Management, FinOps, and the Well-Architected Cost Optimization pillar. He focuses his GTM efforts on the Cloud Intelligence Dashboards and remains obsessed with helping all customers get better visibility and access to their cost and usage data.

Automating adverse events reporting for pharma with Amazon Connect and Amazon Lex

Post Syndicated from Siva Thangavel original https://aws.amazon.com/blogs/architecture/automating-adverse-events-reporting-for-pharma-with-amazon-connect-and-amazon-lex/

Every pharmaceutical company manufacturing medicine must provide customers nationwide with a method to report adverse events following medicine usage as well as emergency assistance as needed. To comply with regulatory policy and enable an Adverse Events Reporting System (AERS), pharma companies must provide dedicated, toll-free phone numbers and contact center agents to handle inbound calls.

But they must also be prepared for sudden spikes in call volume, which can increase contact center agents’ workloads and lead to long wait times for customers. With these limitations comes the possibility that customers may not be able to report adverse events.

Further still, as medicine status keeps changing, all agents must be retrained to handle calls and extend support. Pharma companies incur significant costs for training and onboarding additional agents, as well as the physical infrastructure to support their work.

To overcome these challenges, we designed a self-service Interactive Voice Response (IVR) solution with Amazon Connect. The IVR solution handles customer calls without agent involvement. It captures customer information and records data into an enterprise AERS. It also provides an option to receive a link to an Adverse Events (AE) portal using Short Message Service (SMS), or to be routed to a live agent queue.

In this blog post, we introduce a reference architecture for this use case. This framework can help other pharma companies solve similar problems.

Solution overview

Let’s explore how the IVR solution architecture routes customer calls step by step, as shown in Figure 1.

Adverse events reporting architecture diagram

Figure 1. Adverse events reporting architecture diagram

  1.  Callers who dial in to report a medicine-related AE are routed to the Amazon Lex chatbot through IVR in Amazon Connect.
  2. Callers can proceed to IVR self-service functions, such as understanding the intent of a customer call and the AE.
  3. AEs are analyzed with Amazon SageMaker for a decision on whether to complete the call on IVR or forward it to an agent.
  4. If the caller remains on the self-service option, the bot captures information from 15 to 20 essential questions.
  5. The bot follows a hybrid workflow that allows for guided responses where appropriate and free-text conversations using AWS Lambda. It confirms captured AE information with the caller before closing the call and submitting information to the AERS system.
  6. The bot provides the option to route the call to a human agent contextually.
  7. The bot provides the option to share an AE reporting link over SMS using Amazon Simple Notification System (SNS), so the caller can access it through a mobile device to continue AE reporting outside of the call.
  8. The bot records customer interactions in AERS using Amazon DynamoDB, leveraging the current validated process used by the AE portal team
  9. The bot makes call recordings available for auditing, monitoring, and training purposes. These recordings are not be provided to live agents.
  10. Standard analytics are available to help the business continuously train the bot and measure its performance.

Leveraging IVR as an extended solution

Recorded customer calls can be used for further analytics with Amazon Transcribe. Actionable insights can be derived from the text using a machine learning (ML) model such as AE detection at scale. A (Named Entity Recognition model (NER) model can also identify medicines and caller types.

Further, all recorded calls may be stored in a secure AWS ecosystem and archived for longer durations for compliance purposes. Storage costs can be optimized by setting up a policy to move old calls to Amazon Simple Storage Service Glacier (Amazon S3 Glacier) storage classes and calls over two years old to the Amazon S3 Deep Glacier storage class. This results in significant cost saving and helps companies archive at scale.

Finally, the Amazon Lex bot can be enhanced and continuously trained with additional intents and utterances to handle complex AE reporting for various drugs. This provides significant cost saving and operational efficiencies as bots can be trained faster than human agents, as well as at scale.

Conclusion: Using IVR to better manage AE reporting

This IVR solution was deployed for a pharma company and helped handle unusually high call volumes for AE reporting with its current agent population. It resulted in cost savings in contact center operations and significantly improved the customer experience by reducing wait times.

The IVR solution can also be used with any existing contact center platform to first forward the calls to Amazon Connect for initial triage, and then handover to existing platforms for agent involvement. This adds intelligence to existing contact centers.

This blog post demonstrates how pharma companies can leverage the self-service option for AERS to handle any AE reporting call. With solution enhancements using Amazon SageMaker models, it can quickly be transformed to handle calls for any medicine. They can also:

  • Incorporate related information into the model, such as the age, gender, or existing AEs to further improve the ML prediction performance
  • Leverage audio data augmentation plus handcrafted features to help yield better predictions
  • Use the audio-based diagnostic prediction in an Amazon Connect contact flow to triage the targeted group of incoming calls and escalate to a doctor for follow up if necessary
  • Allow call center agents to use the intelligence provided by the acoustic classification in conjunction with Contact Lens for Amazon Connect, which provides a turn-by-turn transcript; real-time alerts; automated call categorization based on keywords and phrases; sentiment analysis, and sensitive data redaction—truly making it a real-time intelligent solution.

The IVR solution can also be used for other industry use cases where a series of data is collected from customers. This solution improves the customer experience and can be implemented without increasing call center agent counts.

Chargeback Gurus empowers eCommerce merchants with advanced chargeback intelligence to recover millions using Amazon Quicksight

Post Syndicated from Suresh Dakshina original https://aws.amazon.com/blogs/big-data/chargeback-gurus-empowers-ecommerce-merchants-with-advanced-chargeback-intelligence-to-recover-millions-using-amazon-quicksight/

This is a guest post by Suresh Dakshina and Damodharan Sampathkumar from Chargeback Gurus.

Chargeback Gurus, a global financial technology company helps businesses fight, prevent, and win chargebacks. To date, we have helped businesses worldwide recover over $2 billion in lost revenue. As trusted advisors to card networks and Fortune 500 companies, we are known for our expertise in the areas of transaction risk management, chargeback mitigation, fraud prevention, and dispute intelligence. Our Veda and Ari product lines are an industry breakthrough helping merchants identify the risk in online transactions and take insightful actions to protect revenue.

Chargebacks are costing merchants up to 40% of their overall revenue if left unchecked. In addition, as per a Javelin study, fraud and chargeback management are consuming 13-20% of a merchant’s operational budget. As a result, it is of utmost importance for merchants to gain access to transaction intelligence that can help them minimize these costs. The vast majority of these chargebacks are preventable, which is why our advanced dispute intelligence platform – Veda, along with our ability to predict chargebacks accurately using Ari have helped us go a long way in effectively preventing and recovering chargebacks for the merchants.

When seeking the right balance of flexibility, confidentiality, big data, cost, and customization capabilities, Amazon QuickSight stood out as the clear winner. In this post, we go over what we needed in a business intelligence (BI) tool, and why QuickSight was the right choice for us.

Helping customers solve problems with data

Our customers face numerous challenges on a daily basis, including the need to adapt to changing payment methods and products, navigate complex payments ecosystems, and address resource shortages. They also frequently encounter difficulties in identifying the root causes of their chargebacks and fraud. As a result, they are constantly confronted with challenges to protect their revenue. It is our mission to utilize data intelligence to address their biggest pain point and make a significant impact on their bottom line.

Our industry-first Advanced Dispute Intelligence platform, Veda, provides merchants with a wealth of data analytics, which would help them minimize revenue leakages. Empowered with this data, our merchants can spot inefficiencies in their operations and proactively fix issues to prevent chargebacks and fraud. Veda’s dashboard can be customized to display the information that’s most relevant to each merchant, allowing them to get to the root causes of their underlying challenges with a high level of accuracy.

The following sample screenshot highlights Veda’s ability to accurately forecast chargeback performance and trends using AWS QuickSight.

To power Veda, we were looking for a balance of flexibility and customization. We needed a BI tool that would meet our data analytics needs, and also provide us with the agility to make periodic changes to align with business requirements. The main reasons we chose QuickSight were:

  • Fast setup – Speed and agility can make all the difference when it comes to gathering insights that can impact revenue lost from chargebacks.
  • Comprehensive out-of-the-box functionality – Having a tool that has all the business intelligence and embedded analytics capabilities that we needed from the start was very important.
  • SPICE engine – SPICE (Super-fast, Parallel, In-memory Calculation Engine) is the robust in-memory engine that QuickSight uses. As it’s engineered to rapidly perform advanced calculations and serve data, it was a big selling point for us.

QuickSight’s out-of-the-box features and functionality were the perfect fit, enabling us to get started with providing insights to our customers right away.

User-friendliness is a key differentiator

Our Gurus provide deep analytics and data-driven recommendations to help businesses identify the problems and inefficiencies that are leading to chargebacks. With this guidance, companies can permanently improve their business processes and increase customer retention. Though we have the expertise and acumen to assist with all aspects of chargeback mitigation, management, and forecasting, we needed that same level of expertise and acumen when it came to translating ideas to products in the BI space.

QuickSight has met all our needs to provide dashboards and reports, with insightful visualizations that help our customers run their businesses more efficiently. Better still, it has done so through an intuitive, user-friendly interface that enables people from all technical backgrounds to use it with very little time spent on training. Having a minimal training timeframe is a key factor when it comes to onboarding new clients.

Partners in customer obsession

One of the things that has been a delightful surprise in our experience with QuickSight is how invested the development team is in collecting our feedback and using it to help inform plans for future releases. We have agreed to be beta testers, and so we receive proactive updates on progress from the project team, providing us with informative visibility on upcoming changes. Being understanding of evolving customer and data needs, and being receptive to feedback has made this a fantastic partnership. We look forward to continuing to work with QuickSight to expand our BI capabilities, empowering our customers with the data they need to be efficient and successful.

To learn more about how QuickSight can help your business with dashboards, reports, and more, visit Amazon QuickSight.

To learn more about how Chargeback Gurus can help you solve your chargeback and fraud challenges, visit www.chargebackgurus.com or email [email protected].


About the authors

Suresh Dakshina is the Co-founder & President at Chargeback Gurus. A pioneer in data analytics and industry-specific risk management, he is a certified e-commerce fraud prevention specialist and Certified Payments Professional. He understands first-hand the challenges that business owners face, especially when it comes to chargebacks and fraud. He is a veteran speaker, and works closely with Card Networks like Visa and American Express on chargeback process optimization and compelling evidence policies. He loves spending time with his family and his Labradoodle, Joy, when not working.

Damodharan Sampathkumar is the Chief Product Officer & GM-India at Chargeback Gurus. He is responsible for product strategy, platform development, and leading innovation for the next generation in payments technology with a specific focus on Chargebacks and Risk mitigation at Chargeback Gurus. He specializes in cloud-native, mission-critical, real-time payment systems, group-up technology platform setup, and operations. He has successfully championed multiple new-age digital payment platforms across North America, Europe, India, and the Middle East for Central Banks, Acquirer processors, and Fintech innovators. He is an avid cyclist and regular endurance rider who loves to ride during his time off to bring the right balance between work and life.

How OLX Group migrated to Amazon Redshift RA3 for simpler, faster, and more cost-effective analytics

Post Syndicated from Miguel Chin original https://aws.amazon.com/blogs/big-data/how-olx-group-migrated-to-amazon-redshift-ra3-for-simpler-faster-and-more-cost-effective-analytics/

This is a guest post by Miguel Chin, Data Engineering Manager at OLX Group and David Greenshtein, Specialist Solutions Architect for Analytics, AWS.

OLX Group is one of the world’s fastest-growing networks of online marketplaces, operating in over 30 countries around the world. We help people buy and sell cars, find housing, get jobs, buy and sell household goods, and much more.

We live in a data-producing world, and as companies want to become data driven, there is the need to analyze more and more data. These analyses are often done using data warehouses. However, a common data warehouse issue with ever-growing volumes of data is storage limitations and the degrading performance that comes with it. This scenario is very familiar to us in OLX Group. Our data warehouse is built using Amazon Redshift and is used by multiple internal teams to power their products and data-driven business decisions. As such, it’s crucial to maintain a cluster with high availability and performance while also being storage cost-efficient.

In this post, we share how we modernized our Amazon Redshift data warehouse by migrating to RA3 nodes and how it enabled us to achieve our business expectations. Hopefully you can learn from our experience in case you are considering doing the same.

Status quo before migration

Here at OLX Group, Amazon Redshift has been our choice for data warehouse for over 5 years. We started with a small Amazon Redshift cluster of 7 DC2.8xlarge nodes, and as its popularity and adoption increased inside the OLX Group data community, this cluster naturally grew.

Before migrating to RA3, we were using a 16 DC2.8xlarge nodes cluster with a highly tuned workload management (WLM), and performance wasn’t an issue at all. However, we kept facing challenges with storage demand due to having more users, more data sources, and more prepared data. Almost every day we would get an alert that our disk space was close to 100%, which was about 40 TB worth of data.

Our usual method to solve storage problems used to be to simply increase the number of nodes. Overall, we reached a cluster size of 18 nodes. However, this solution wasn’t cost-efficient enough because we were adding compute capacity to the cluster even though computation power was underutilized. We saw this as a temporary solution, and we mainly did it to buy some time to explore other cost-effective alternatives, such as RA3 nodes.

Amazon Redshift RA3 nodes along with Redshift Managed Storage (RMS) provided separation of storage and compute, enabling us to scale storage and compute separately to better meet our business requirements.

Our data warehouse had the following configuration before the migration:

  • 18 x DC2.8xlarge nodes
  • 250 monthly active users, consistently increasing
  • 10,000 queries per hour, 30 queries in parallel
  • 40 TB of data, consistently increasing
  • 100% disk space utilization

This cluster’s performance was generally good, ETL (extract, transform, and load) and interactive queries barely had any queue time, and 80% of them would finish in under 5 minutes.

Evaluating the performance of Amazon Redshift clusters with RA3 nodes

In this section, we discuss how we conducted a performance evaluation of RA3 nodes with an Amazon Redshift cluster.

Test environment

In order to be confident with the performance of the RA3 nodes, we decided to stress test them in a controlled environment before making the decision to migrate. To assess the nodes and find an optimal RA3 cluster configuration, we collaborated with AllCloud, the AWS premier consulting partner. The following figures illustrate the approach we took to evaluate the performance of RA3.

Test setup

This strategy aims to replicate a realistic workload in different RA3 cluster configurations and compare them with our DC2 configuration. To do this, we required the following:

  • A reference cluster snapshot – This ensures that we can replay any tests starting from the same state.
  • A set of queries from the production cluster – This set can be reconstructed from the Amazon Redshift logs (STL_QUERYTEXT) and enriched by metadata (STL_QUERY). It should be noted that we only took into consideration SELECT and FETCH query types (to simplify this first stage of performance tests). The following chart shows what the profile of our test set looked like.

  • A replay tool to orchestrate all the query operations – AllCloud developed a Python application for us for this purpose.

For more details about approach we used, including using the Amazon Redshift Simple Replay utility, refer to Compare different node types for your workload using Amazon Redshift.

Next, we picked which cluster configurations we wanted to test, which RA3 type, and how many nodes. For the specifications of each node type, refer to Amazon Redshift pricing.

First, we decided to test the same DC2 cluster we had in production as a way to validate our test environment, followed by RA3 clusters using RA3.4xlarge nodes with various numbers of nodes. We used RA3.4xlarge because it gives us more flexibility to fine-tune how many nodes we need compared to the RA3.16xlarge instance (1 x RA3.16xlarge node is equivalent to 4 x RA3.4xlarge nodes in terms of CPU and memory). With this in mind, we tested the following cluster configurations and used the replay tool to take measurements of the performance of each cluster.

18 x DC2

(Reference)

18 x RA3

(Before Classic Resize)

18 x RA3 6 x RA3
Queries Number 1560 1560 1560 1560
Timeouts 25 66 127
Duration/s Mean 1.214 1.037 1.167 1.921
Std. 2.268 2.026 2.525 3.488
Min. 0.003 0.000 0.002 0.002
Q 25% 0.005 0.004 0.004 0.004
Q 50% 0.344 0.163 0.118 0.183
Q 75% 1.040 0.746 1.076 2.566
Max. 25.411 15.492 19.770 19.132

These results show how the DC2 cluster compares with other RA3 configurations. For 50% of the faster queries (quantile 50%) they ran faster than on DC2. Regarding the number of RA3 nodes, six nodes were clearly slower, particularly noticeable on quantile 75% of query durations.

We used the following steps to deploy different clusters:

  1. Use 18 x DC2.8xlarge, restored from the original snapshot (18 x DC2.8xlarge).
  2. Take measurements 18 x DC2.
  3. Use 18 x RA3.4xlarge, restored from the original snapshot (18 x DC2.8xlarge).
  4. Take measurements 18 x RA3 (before classic resize).
  5. Use 6 x RA3.4xlarge, classic resize from 18 x RA3.4xlarge.
  6. Take snapshot from 6 x RA3.4xlarge.
  7. Take measurements 6 x RA3.
  8. Use 6 x RA3.4xlarge, restored from 6 x RA3.4xlarge snapshot.
  9. Use 18 x RA3.4xlarge, elastic resize from 6 x RA3.4xlarge.
  10. Take measurements 18x RA3.

Although these are promising results, there were some limitations in the test environment setup. We were concerned that we weren’t stressing the clusters enough, queries were only running in sequence using a single client, and the fact that we were using only SELECT and FETCH query types moved us away from a realistic workload. Therefore, we proceeded to the second stage of our tests.

Concurrency stress test

To stress the clusters, we changed our replay tool to run multiple queries in parallel. Queries extracted from the log files were queued with the same frequency as they were originally run in the reference cluster. Up to 50 clients take queries from the queue and send them to Amazon Redshift. The timing of all queries is recorded for comparison with the reference cluster.

The cluster performance is evaluated by measuring the temporal course of the query concurrency. If a cluster is equally performant as the reference cluster, the concurrency will closely follow the concurrency of the reference cluster. Queries pushed to the query queue are immediately picked up by a client and sent to the cluster. If the cluster isn’t capable of handling the queries as fast as the reference cluster, the number of running concurrent queries will increase when compared to the reference cluster. We also decided to keep concurrency scaling disabled during this test because we wanted to focus on node types instead of cluster features.

The following table shows the concurrent queries running on a DC2 and RA3 (both 18 nodes) with two different query test sets (3:00 AM and 1:00 PM). These were selected so we could test both our day and overnight workloads. 3:00 AM is when we have a peak of automated ETL jobs running, and 1:00 PM is when we have high user activity.

The median of running concurrent queries on the RA3 cluster is much higher than the DC2 one. This led us to conclude that a cluster of 18 RA3.4xlarge might not be enough to handle this workload reliably.

Concurrency 18 x DC2.8xlarge 18 x RA3.4xlarge
Starting 3:00 AM 1:00 PM 3:00 AM 1:00 PM
Mean 5 7 10 5
STD 11 13 7 4
25% 1 1 5 2
50% 2 2 8 4
75% 4 4 13 7
Max 50 50 50 27

RA3.16xlarge

Initially, we chose the RA3.4xlarge node type for more granular control in fine-tuning the number of nodes. However, we overlooked one important detail: the same instance type is used for worker and leader nodes. A leader node needs to manage all the parallel processing happening in the cluster, and a single RA3.4xlarge wasn’t enough to do so.

With this in mind, we tested two more cluster configurations: 6 x RA3.16xlarge and 8 x RA3.16xlarge, and once again measured concurrency. This time the results were much better; RA3.16xlarge was able to keep up with the reference concurrency, and the sweet spot seemed to be between 6–8 nodes.

Concurrency 18 x DC2.8xlarge 18 x RA3.4xlarge 6 x RA3.16xlarge 8 x RA3.16xlarge
Starting 3:00 AM 1:00 PM 3:00 AM 1:00 PM 3:00 AM 3:00 AM
Mean 5 7 10 5 3 1
STD 11 13 7 4 4 1
25% 1 1 5 2 2 0
50% 2 2 8 4 3 1
75% 4 4 13 7 4 2
Max 50 50 50 27 38 9

Things were looking better and our target configuration was now a 7 x RA3.16xlarge cluster. We were now confident enough to proceed with the migration.

The migration

Regardless of how excited we were to proceed, we still wanted to do a calculated migration. It’s best practice to have a playbook for migrations—a step-by-step guide on what needs to be done and also a contingency plan that includes a rollback plan. For simplicity reasons, we list here only the relevant steps in case you are looking for inspiration.

Migration plan

The migration plan included the following key steps:

  1. Remove the DNS from the current cluster, in our case in Amazon Route 53. No users should be able to query after this.
  2. Check if any sessions are still running a query, and decide to wait or stop it. This strongly indicates these users are using the direct cluster URL to connect.
    1. To check running sessions, use SELECT * FROM STV_SESSIONS.
    2. To check stopped sessions, use SELECT PG_TERMINATE_BACKEND(xxxxx);.
  3. Create a snapshot of the DC2 cluster.
  4. Pause the DC2 cluster.
  5. Create an RA3 cluster from the snapshot with the following configuration:
    1. Node type – RA3.16xlarge
    2. Number of nodes – 7
    3. Database name – Same as the DC2
    4. Associated IAM roles – Same as the DC2
    5. VPC – Same as the DC2
    6. VPC security groups – Same as the DC2
    7. Parameter groups – Same as the DC2
  6. Wait for SELECT COUNT(1) FROM STV_UNDERREPPED_BLOCKS to return 0. This is related to the hydration process of the cluster.
  7. Point the DNS to the RA3 cluster.
  8. Users can now query the cluster again.

Contingency plan

In case the performance of hourly and daily ETL is not acceptable, the contingency plan is triggered:

  1. Add one more node to deal with the unexpected workload.
  2. Increase the limit of concurrency scaling hours.
  3. Reassess the parameter group.

Following this plan, we migrated from DC2 to RA3 nodes in roughly 3.5 hours, from stopping the old cluster to booting the new one and letting our processes fully synchronize. We then proceeded to monitor performance for a couple of hours. Storage capacity was looking great and everything was running smoothly, but we were curious to see how the overnight processes would perform.

The next morning, we woke up to what we dreaded: a slow cluster. We triggered our contingency plan and in the following few days we ended up implementing all three actions we had in the contingency plan.

Adding one extra node itself didn’t provide much help, however users did experience good performance during the hours concurrency scaling was on. The concurrency scaling feature allows Amazon Redshift to temporarily increase cluster capacity whenever the workload requires it. We configured it to allow a maximum of 4 hours per day—1 hour for free and 3 hours paid. We chose this particular value because price-wise it is equivalent to adding one more node (taking us to nine nodes) with the added advantage of only using and paying for it when the workload requires it.

The last action we took was related to the parameter group, in particular, the WLM. As initially stated, we had a manually fine-tuned WLM, but it proved to be inefficient for this new RA3 cluster. Therefore, we decided to try auto WLM with the following configuration.

Manual WLM before introducing auto WLM Queue 1 Data Team ETL queue (daily and hourly), admin, monitoring, data quality queries
Queue 2 Users queue (for both their ETL and ad hoc queries)
Auto WLM Queue 1: Priority highest Daily Data Team ETL queue
Queue 2: Priority high Admin queries
Queue 3: Priority normal User queries and hourly Data Team ETL
Queue 4: Priority low Monitoring, data quality queries

Manual WLM requires you to manually allocate a percentage of resources and define a number of slots per queue. Although this gives you resource segregation, it also means resources are constantly allocated and can go to waste if they’re not used. Auto WLM dynamically sets these variables depending on each queue’s priority and workload. This means that a query in the highest priority queue will get all the resources allocated to it, while lower priority queues will need to wait for available resources. With this in mind, we split our ETL depending on its priority: daily ETL to highest, hourly ETL to normal (to give a fair chance for user queries to compete for resources), and monitoring and data quality to low.

After applying concurrency scaling and auto WLM, we achieved stable performance for a whole week, and considered the migration a success.

Status quo after migration

Almost a year has passed since we migrated to RA3 nodes, and we couldn’t be more satisfied. Thanks to Redshift Managed Storage (RMS), our disk space issues are a thing of the past, and performance has been generally great compared to our previous DC2 cluster. We are now at 300 monthly active users. Cluster costs did increase due to the new node type and concurrency scaling, but we now feel prepared for the future and don’t expect any cluster resizing anytime soon.

Looking back, we wanted to have a carefully planned and prepared migration, and we were able to learn more about RA3 with our test environment. However, our experience also shows that test environments aren’t always bulletproof, and some details may be overlooked. In the end, these are our main takeaways from the migration to RA3 nodes:

  • Pick the right node type according to your workload. An RA3.16xlarge cluster provides more powerful leader and worker nodes.
  • Use concurrency scaling to provision more resources when the workload demands it. Adding a new node is not always the most cost-efficient solution.
  • Manual WLM requires a lot of adjustments; using auto WLM allows for a better and fairer distribution of cluster resources.

Conclusion

In this post, we covered how OLX Group modernized our Amazon Redshift data warehouse by migrating to RA3 nodes. We detailed how we tested before migration, the migration itself, and the outcome. We are now starting to explore the possibilities provided by the RA3 nodes. In particular, the data sharing capabilities together with Redshift Serverless open the door for exciting architecture setups that we are looking forward to.

If you are going through the same storage issues we used to face with your Amazon Redshift cluster, we highly recommend migrating to RA3 nodes. Its RMS feature decouples the scalability of compute and storage power, providing a more cost-efficient solution.

Thanks for reading this post and hopefully you found it useful. If you’re going through the same scenario and have any questions, feel free to reach out.


About the author

Miguel Chin is a Data Engineering Manager at OLX Group, one of the world’s fastest-growing networks of trading platforms. He is responsible for managing a domain-oriented team of data engineers that helps shape the company’s data ecosystem by evangelizing cutting-edge data concepts like data mesh.

David Greenshtein is a Specialist Solutions Architect for Analytics at AWS with a passion for ETL and automation. He works with AWS customers to design and build analytics solutions enabling business to make data-driven decisions. In his free time, he likes jogging and riding bikes with his son.

Building an event-driven solution for AvalonBay property leasing and search

Post Syndicated from Kausik Dey original https://aws.amazon.com/blogs/architecture/building-an-event-driven-solution-for-avalonbay-property-leasing-and-search/

In this blog post, we show you how to build an event-driven and serverless solution for property leasing and search that is scalable and resilient. This solution was created for AvalonBay Communities, Inc.—a leading residential Real Estate Investment Trusts (REITs). It enables:

  • More than 150,000 multi-parameter searches per day
  • The processing of more than 3,500 lease applications and 85,000 individual rent payments per month

Introduction

AvalonBay is an equity REIT. The company has a long track record of developing, redeveloping, acquiring, and managing apartment homes in top U.S. markets. AvalonBay builds long-term value for customers using innovative technology solutions.

The company understands that data-driven insights contribute to targeted business growth. But AvalonBay found that managing the complex interdependencies between multiple data sets—from real estate and property management systems to financial and payment systems—required a new solution.

The challenge

AvalonBay owned or held a direct or indirect ownership interest in 293 apartment communities containing 88,405 apartment homes in 12 states and Washington, D.C., as of September 30, 2022.

Of these, 18 communities were under development and one was under redevelopment. This presented a unique challenge to both internal and external users looking to search and lease apartment units based on multi-parameter selection criteria in geographically dispersed regions. For example, finding units in buildings with specific amenities, lease terms, furnishings and availability dates.

Overview of solution

AvalonBay’s fully managed leasing solution for applicants and residents is hosted by Amazon Web Services (AWS). The solution is secure, autoscaling, and multi-region, ensuring resiliency and performance with efficient resource usage.

In this event-driven solution, AvalonBay’s leasing service is hosted in multiple AWS Regions to provide low latency response to users across various geolocations. This blog post focuses on showing use case implementation in only one region—Region East— as shown in Figure 1.

AvalonBay lease processing platform

Figure 1. AvalonBay lease processing platform

Several AWS services come together in this solution to meet key company objectives. Let’s explore each one and its purpose within the architecture.

    1. Amazon Route 53: For AvalonBay’s Lease Processing solution, any non-transient service failure is unacceptable. In addition to providing a high degree of resiliency through a Multi-AZ architecture, Lease Processing also provides regional-level high availability through its multi-Region active-active architecture. Route 53 with latency-based routing allows dynamic rerouting of requests within seconds to alternate Regions.
    2. Amazon API Gateway: Route 53 latency-based routing is configured across multiple AWS Regions as Route 53 to route traffic to an API Gateway endpoint. API Gateway authorizers were added to control access to APIs using an Amazon Cognito user pool.
    3. AWS Lambda with provisioned concurrency: The Lambda services are set up for automatic scaling, secured through a private subnet, and span across Availability Zones. This provides horizontal scaling capability, self-healing capacity, and resiliency across Availability Zones. Provisioned concurrency minimizes the estimate of cold starts by generating execution environments. It also greatly reduces time spent on APIs invocations.
    4. Amazon Aurora V2 for Amazon Relational Database Service (Amazon RDS) PostgreSQL-Compatible Edition: Aurora Serverless V2 is an on-demand, autoscaling configuration for Aurora. Serverless Aurora V2 PostgreSQL-Compatible is used for the Lease Processing solution. The global database was configured with two Regions; us-east-1 as the primary cluster and us-west-2 as the secondary cluster. Automated Aurora global database endpoint management for planned and unplanned failover is configured through a Route 53 private hosted zone, Amazon EventBridge, and Lambda.
    5. Amazon RDS Proxy for Aurora: Amazon RDS Proxy allows the leasing application to pool and share database connections to improve its ability to scale. It also makes the leasing solution more resilient to database failures by automatically connecting to a standby database instance while preserving application connections.
    6. Amazon EventBridge: EventBridge supports the solution through two primary purposes:
      1. Oversees lease flow events – During the lease application process, the solution generates various events which are consumed by AvalonBay and external applications such as property management, finance portals, administration, and more. Leasing events are sent to EventBridge and various event rules are configured for multiple destinations including Lambda, Amazon Simple Notification Service (Amazon SNS), and external API endpoints.
      2. Handles global Aurora V2 failover – Aurora generates events when certain actions are taken or events occur, including any type of global database activity.
        • When a managed planned failover is initiated—either via the AWS Command Line Interface (AWS CLI), API, or console—the global database failover process starts and generates an event.
        • When an Aurora cluster is removed from a global cluster through the AWS CLI, API, or console, the Aurora cluster is promoted as a single primary cluster. Once this process is completed, it generates an event.

An EventBridge rule is created to match an event pattern any time a global database managed planned failover completes successfully in a Region. When a failover is completed, the completion event is detected, and this rule is triggered. The event rule is configured to invoke a Lambda function that is triggered on global database failover and updates the Amazon CloudFront CNAME record to the correct value.

Scalable search solution

Leasing professionals need to easily scan a huge amount of property information using AvalonBay’s Search Solution to obtain required information.

Using the Amazon OpenSearch Service, agents can generate property profiles and other asset data to identify matching units and quickly respond to end customers. OpenSearch is a fully open-source search and analytics engine that securely unlocks real-time search, monitoring, and analysis of business and operational data. It is employed for use cases such as application monitoring, log analytics, observability, and website search.

The AvalonBay Search Service solution architecture featuring OpenSearch is shown in Figure 2.

AvalonBay Search Service solution architecture

Figure 2: AvalonBay Search Service solution architecture

AvalonBay search requires search criteria including keyword and Universal Resource Identifier (URI) search, SQL-based search, and custom package search, all of which are detailed in the Amazon OpenSearch Service Developer Guide.

OpenSearch automatically detects and replaces failed OpenSearch Service nodes, reducing the overhead associated with self-managed infrastructures.

Let’s explore this architecture further by step.

  1. Amazon Kinesis event stream – The AvalonBay community requires near real-time updates to search attributes such as amenities, features, promotion, and pricing. Events created through various producers are streamed through Kinesis and inserted or updated through OpenSearch.
  2. Amazon OpenSearch – OpenSearch is used for end-to-end community search—a managed service making it easy to deploy, operate, and scale OpenSearch clusters in the AWS Cloud. As community search data is read-only, UltraWarm and cold storage are also used based on usage frequency.
  3. Amazon Simple Storage Service (S3): Various community documents, policies, and image or video files are key search elements. They must be maintained securely and reliably for years due to contractual obligations. Amazon S3 simplifies this task with high durabilitylifecycle rules, and varied controls for retention.

Conclusion

This post showed how AvalonBay has built and deployed custom leasing and search solutions on AWS serverless platforms without compromising resiliency, performance, and capacity requirements. This is a 24/7, fully managed solution with no additional equipment on-premises.

Choosing AWS for leasing and search solutions gives AvalonBay the ability to dynamically scale and meet future growth demands while introducing cost advantages. In addition, the global availability of AWS services makes it possible to deploy services across geographic locations to meet performance requirements.

Scaling an ASG using target tracking with a dynamic SQS target

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/scaling-an-asg-using-target-tracking-with-a-dynamic-sqs-target/

This blog post is written by Wassim Benhallam, Sr Cloud Application Architect AWS WWCO ProServe, and Rajesh Kesaraju, Sr. Specialist Solution Architect, EC2 Flexible Compute.

Scaling an Amazon EC2 Auto Scaling group based on Amazon Simple Queue Service (Amazon SQS) is a commonly used design pattern in decoupled applications. For example, an EC2 Auto Scaling Group can be used as a worker tier to offload the processing of audio files, images, or other files sent to the queue from an upstream tier (e.g., web tier). For latency-sensitive applications, AWS guidance describes a common pattern that allows an Auto Scaling group to scale in response to the backlog of an Amazon SQS queue while accounting for the average message processing duration (MPD) and the application’s desired latency.

This post builds on that guidance to focus on latency-sensitive applications where the MPD varies over time. Specifically, we demonstrate how to dynamically update the target value of the Auto Scaling group’s target tracking policy based on observed changes in the MPD. We also cover the utilization of Amazon EC2 Spot instances, mixed instance policies, and attribute-based instance selection in the Auto Scaling Groups as well as best practice implementation to achieve greater cost savings.

The challenge

The key challenge that this post addresses is applications that fail to honor their acceptable/target latency in situations where the MPD varies over time. Latency refers here to the time required for any queue message to be consumed and fully processed.

Consider the example of a customer using a worker tier to process image files (e.g., resizing, rescaling, or transformation) uploaded by users within a target latency of 100 seconds. The worker tier consists of an Auto Scaling group configured with a target tracking policy. To achieve the target latency mentioned previously, the customer assumes that each image can be processed in one second, and configures the target value of the scaling policy so that the average image backlog per instance is maintained at approximately 100 images.

In the first week, the customer submits 1000 images to the Amazon SQS queue for processing, each of which takes one second of processing time. Therefore, the Auto Scaling group scales to 10 instances, each of which processes 100 images in 100s, thereby honoring the target latency of 100s.

In the second week, the customer submits 1000 slightly larger images for processing. Since an image’s processing duration scales with its size, each image takes two seconds to process. As in the first week, the Auto Scaling group scales to 10 instances, but this time each instance processes 100 images in 200s, which is twice as long as was needed in the first round. As a result, the application fails to process the latter images within its acceptable latency.

Therefore, the challenge is common to any latency-sensitive application where the MPD is subject to change. Applications where the processing duration scales with input data size are particularly vulnerable to this problem. This includes image processing, document processing, computational jobs, and others.

Solution overview

Before we dive into the solution, let’s briefly review the target tracking policy’s scaling metric and its corresponding target value. A target tracking scaling policy works by adjusting the capacity to keep a scaling metric at, or close to, the specified target value. When scaling in response to an Amazon SQS backlog, it’s good practice to use a scaling metric known as the Backlog Per Instance (BPI) and a target value based on the acceptable BPI. These are computed as follows:

BPI equation, Saling metric and target value.

Given the acceptable BPI equation, a longer MPD requires us to use a smaller target value if we are to process these messages in the same acceptable latency, and vice versa. Therefore, the solution we propose here works by monitoring the average MPD over time and dynamically adjusting the target value of the Auto Scaling group’s target tracking policy (acceptable BPI) based on the observed changes in the MPD. This allows the scaling policy to adapt to variations in the average MPD over time, and thus enables the application to honor its acceptable latency.

Solution architecture

To demonstrate how the approach above can be implemented in practice, we put together an example architecture highlighting the services involved (see the following figure). We also provide an automated deployment solution for this architecture using an AWS Serverless Application Model (AWS SAM) template and some Python code (repository link). The repository also includes a README file with detailed instructions that you can follow to deploy the solution. The AWS SAM template deploys several resources, including an autoscaling group, launch template, target tracking scaling policy, an Amazon SQS queue, and a few AWS Lambda functions that serve various functions described as follows.

The Amazon SQS queue is used to accumulate messages intended for processing, while the Auto Scaling group instances are responsible for polling the queue and processing any messages received. To do this, a launch template defines a bootstrap script that allows the group’s instances to download and execute a Python code when first launched. The Python code consumes messages from the Amazon SQS queue and simulates their processing by sleeping for the MPD duration specified in the message body. After processing each message, the instance publishes the MPD as an Amazon CloudWatch metric (see the following figure).

Figure 1: Architecture diagram showing the components deployed by the AWS SAM template. These include an SQS queue, an Auto Scaling group responsible for polling and processing queue messages, a Lambda function that regularly updates the BPI CloudWatch metric, and a “Target Setter” Lambda function that regularly updates the Auto Scaling group’s target tracking scaling policy.

Figure 1: Architecture diagram showing the components deployed by the AWS SAM template.

To enable scaling, the Auto Scaling group is configured with a target tracking scaling policy that specifies BPI as the scaling metric and with an initial target value provided by the user.

The BPI CloudWatch metric is calculated and published by the “Metric-Publisher” Lambda function which is invoked every one minute using an Amazon EventBridge rate expression. To calculate BPI, the Lambda function simply takes the ratio of the number of messages visible in the Amazon SQS queue by the total number of in-service instances in the Auto Scaling group, as shown in equation (2) above.

On the other hand, the scaling policy’s target value is updated by the “Target-Setter” Lambda function, which is invoked every 30 minutes using another EventBridge rate expression. To calculate the new target value, the Lambda function simply takes the ratio of the user-defined acceptable latency value by the current average MPD queried from the corresponding CloudWatch metric, as shown in the previous equation (1).

Finally, to help you quickly test this solution, a Lambda function “Testing Lambda” is also provided and can be used to send messages to the Amazon SQS queue with a processing duration of your choice. This is specified within each message’s body. You can invoke this Lambda function with different MPDs (by modifying the corresponding environment variable) to verify how the Auto Scaling group scales in response. A CloudWatch dashboard is also deployed to enable you to track key scaling metrics through time. These include the number of messages visible in the queue, the number of in-service instances in the Auto Scaling group, the MPD, and BPI vs acceptable BPI.

Solution testing

To demonstrate the solution in action and its impact on application latency, we conducted two tests that you can reproduce by following the instructions described in the “Testing” section of the repository’s README file (repository link). In both tests, we assume a hypothetical application with a target latency of 300s. We also modified the invocation frequency of the “Target Setter” Lambda function to one minute to quickly assess the impact of target value changes. In both tests, we submit 50 messages to the Amazon SQS queue through the provided helper lambda. An MPD of 25s and 50s was used for the first and second test, respectively. The provided CloudWatch dashboard shows that the ASG scales to a total of four instances in the first test, and eight instances in the second (see the following figure). See README file for a detailed description of how various metrics evolve over time.

Comparison of Tests 1 and 2

Since Test 2 messages take twice as long to process, the Auto Scaling group launched twice as many instances to attempt to process all of the messages in the same amount of time as Test 1 (latency). The following figure shows that the total time to process all 50 messages in Test 1 was 9 mins vs 10 mins in Test 2. In contrast, if we were to use a static/fixed acceptable BPI of 12, then a total of four instances would have been operational in Test 2, thereby requiring double the time of Test 1 (~20 minutes) to process all of the messages. This demonstrates the value of using a dynamic scaling target when processing messages from Amazon SQS queues, especially in circumstances where the MPD is prone to vary with time.

Figure 2: CloudWatch dashboard showing Auto Scaling group scaling test results (Test 1 and 2). Although Test 2 messages require double the MPD of Test 1 messages, the Auto Scaling group processed Test 2 messages in the same amount of time as Test 1 by launching twice as many instances.

Figure 2: CloudWatch dashboard showing Auto Scaling group scaling test results (Test 1 and 2).

Recommended best practices for Auto Scaling groups

This section highlights a few key best practices that we recommend adopting when deploying and working with Auto Scaling groups.

Reducing cost using EC2 Spot instances

Amazon SQS helps build loosely coupled application architectures, while providing reliable asynchronous communication between the various layers/components of an application. If a worker node fails to process a message within the Amazon SQS message visibility time-out, then the message is returned to the queue and another worker node can pick up and process that message. This makes Amazon SQS-backed applications fault-tolerant by design and thus a great fit for EC2 Spot instances. EC2 spot instances are spare compute capacity in the AWS cloud that is available to you at steep discounts as compared to On-Demand prices.

Maximizing capacity using attribute-based instance selection

With the recently released attribute-based instance selection feature, you can define infrastructure requirements based on application needs such as vCPU, RAM, and processor family (e.g., x86, ARM). This removes the need to define specific instances in your Auto Scaling group configuration, and it eliminates the burden of identifying the correct instance families and sizes. In addition, newly released instance types will be automatically considered if they fit your requirements. Attribute-based instance selection lets you tap into hundreds of different EC2 instance pools, which increases the chance of getting EC2 (Spot/On-demand) instances. When using attribute-based instance selection with the capacity optimized allocation strategy, Amazon EC2 allocates instances from deeper Spot capacity pools, thereby further reducing the chance of Spot interruption.

The following sample configuration creates an Auto Scaling group with attribute-based instance selection:

AutoScalingGroupName: 'my-asg' # [REQUIRED] 
MixedInstancesPolicy:
  LaunchTemplate:
    LaunchTemplateSpecification:
      LaunchTemplateId: 'lt-0537239d9aef10a77'
    Overrides:
    - InstanceRequirements:
        VCpuCount: # [REQUIRED] 
          Min: 2
          Max: 4
        MemoryMiB: # [REQUIRED] 
          Min: 2048
  InstancesDistribution:
    SpotAllocationStrategy: 'capacity-optimized'
MinSize: 0 # [REQUIRED] 
MaxSize: 100 # [REQUIRED] 
DesiredCapacity: 4
VPCZoneIdentifier: 'subnet-e76a128a,subnet-e66a128b,subnet-e16a128c'

Conclusion

As can be seen from the test results, this approach demonstrates how an Auto Scaling group can honor a user-provided acceptable latency constraint while accomodating variations in the MPD over time. This is possible because the average MPD is monitored and regularly updated as a CloudWatch metric. In turn, this is continously used to update the target value of the group’s target tracking policy. Moreover, we have covered additional Auto Scaling group best practices suitable for this use case, including the use of Spot instances to reduce costs and attribute-based instance selection to simplify the selection of relevant instance types.

For more information on scaling options for Auto Scaling groups, visit the Amazon EC2 Auto Scaling documentation page and the SQS-based scaling guide.

Using GitHub Actions with Amazon CodeCatalyst

Post Syndicated from Dr. Rahul Sharad Gaikwad original https://aws.amazon.com/blogs/devops/using-github-actions-with-amazon-codecatalyst/

An Amazon CodeCatalyst workflow is an automated procedure that describes how to build, test, and deploy your code as part of a continuous integration and continuous delivery (CI/CD) system. You can use GitHub Actions alongside native CodeCatalyst actions in a CodeCatalyst workflow.

Introduction:

In a prior post in this series, Using Workflows to Build, Test, and Deploy with Amazon CodeCatalyst, I discussed creating CI/CD pipelines in CodeCatalyst and how that relates to The Unicorn Project’s main protagonist, Maxine. CodeCatalyst workflows help you reliably deliver high-quality application updates frequently, quickly, and securely. CodeCatalyst allows you to quickly assemble and configure actions to compose workflows that automate your CI/CD pipeline, test reporting, and other manual processes. Workflows use provisioned compute, Lambda compute, custom container images, and a managed build infrastructure to scale execution easily without sacrificing flexibility. In this post, I will return to workflows and discuss running GitHub Actions alongside native CodeCatalyst actions.

Prerequisites

If you would like to follow along with this walkthrough, you will need to:

Walkthrough

As with the previous posts in the CodeCatalyst series, I am going to use the Modern Three-tier Web Application blueprint. Blueprints provide sample code and CI/CD workflows to help you get started easily across different combinations of programming languages and architectures. To follow along, you can re-use a project you created previously, or you can refer to a previous post that walks through creating a project using the Three-tier blueprint.

As the team has grown, I have noticed that code quality has decreased. Therefore, I would like to add a few additional tools to validate code quality when a new pull request is submitted. In addition, I would like to create a Software Bill of Materials (SBOM) for each pull request so I know what components are used by the code. In the previous post on workflows, I focused on the deployment workflow. In this post, I will focus on the OnPullRequest workflow. You can view the OnPullRequest pipeline by expanding CI/CD from the left navigation, and choosing Workflows. Next, choose OnPullRequest and you will be presented with the workflow shown in the following screenshot. This workflow runs when a new pull request is submitted and currently uses Amazon CodeGuru to perform an automated code review.

OnPullRequest Workflow with CodeGuru code review

Figure 1. OnPullRequest Workflow with CodeGuru code review

While CodeGuru provides intelligent recommendations to improve code quality, it does not check style. I would like to add a linter to ensure developers follow our coding standards. While CodeCatalyst supports a rich collection of native actions, this does not currently include a linter. Fortunately, CodeCatalyst also supports GitHub Actions. Let’s use a GitHub Action to add a linter to the workflow.

Select Edit in the top right corner of the Workflow screen. If the editor opens in YAML mode, switch to Visual mode using the toggle above the code. Next, select “+ Actions” to show the list of actions. Then, change from Amazon CodeCatalyst to GitHub using the dropdown. At the time this blog was published, CodeCatalyst includes about a dozen curated GitHub Actions. Note that you are not limited to the list of curated actions. I’ll show you how to add GitHub Actions that are not on the list later in this post. For now, I am going to use Super-Linter to check coding style in pull requests. Find Super-Linter in the curated list and click the plus icon to add it to the workflow.

Super-Linter action with add icon

Figure 2. Super-Linter action with add icon

This will add a new action to the workflow and open the configuration dialog box. There is no further configuration needed, so you can simply close the configuration dialog box. The workflow should now look like this.

Workflow with the new Super-Linter action

Figure 3. Workflow with the new Super-Linter action

Notice that the actions are configured to run in parallel. In the previous post, when I discussed the deployment workflow, the steps were sequential. This made sense since each step built on the previous step. For the pull request workflow, the actions are independent, and I will allow them to run in parallel so they complete faster. I select Validate, and assuming there are no issues, I select Commit to save my changes to the repository.

While CodeCatalyst will start the workflow when a pull request is submitted, I do not have a pull request to submit. Therefore, I select Run to test the workflow. A notification at the top of the screen includes a link to view the run. As expected, Super Linter fails because it has found issues in the application code. I click on the Super Linter action and review the logs. Here are few issues that Super Linter reported regarding app.py used by the backend application. Note that the log has been modified slightly to fit on a single line.

/app.py:2:1: F401 'os' imported but unused
/app.py:2:1: F401 'time' imported but unused
/app.py:2:1: F401 'json' imported but unused
/app.py:2:10: E401 multiple imports on one line
/app.py:4:1: F401 'boto3' imported but unused
/app.py:6:9: E225 missing whitespace around operator
/app.py:8:1: E402 module level import not at top of file
/app.py:10:1: E402 module level import not at top of file
/app.py:15:35: W291 trailing whitespace
/app.py:16:5: E128 continuation line under-indented for visual indent
/app.py:17:5: E128 continuation line under-indented for visual indent
/app.py:25:5: E128 continuation line under-indented for visual indent
/app.py:26:5: E128 continuation line under-indented for visual indent
/app.py:33:12: W292 no newline at end of file

With Super-Linter working, I turn my attention to creating a Software Bill of Materials
(SBOM). I am going to use OWASP CycloneDX to create the SBOM. While there is a GitHub Action for CycloneDX, at the time I am writing this post, it is not available from the list of curated GitHub Actions in CodeCatalyst. Fortunately, CodeCatalyst is not limited to the curated list. I can use most any GitHub Action in CodeCatalyst. To add a GitHub Action that is not in the curated list, I return to edit mode, find GitHub Actions in the list of curated actions, and click the plus icon to add it to the workflow.

Figure 4. GitHub Action with add icon

Figure 4. GitHub Action with add icon

CodeCatalyst will add a new action to the workflow and open the configuration dialog box. I choose the Configuration tab and use the pencil icon to change the Action Name to Software-Bill-of-Materials. Then, I scroll down to the configuration section, and change the GitHub Action YAML. Note that you can copy the YAML from the GitHub Actions Marketplace, including the latest version number. In addition, the CycloneDX action expects you to pass the path to the Python requirements file as an input parameter.

GitHub Action YAML configuration

Figure 5. GitHub Action YAML configuration

Since I am using the generic GitHub Action, I must tell CodeCatalyst which artifacts are produced by the action and should be collected after execution. CycloneDX creates an XML file called bom.xml which I configure as an artifact. Note that a CodeCatalyst artifact is the output of a workflow action, and typically consists of a folder or archive of files. You can share artifacts with subsequent actions.

Artifact configuration with the path to bom.xml

Figure 6. Artifact configuration with the path to bom.xml

Once again, I select Validate, and assuming there are no issues, I select Commit to save my changes to the repository. I now have three actions that run in parallel when a pull request is submitted: CodeGuru, Super-Linter, and Software Bill of Materials.

Figure 7. Workflow including the software bill of materials

Figure 7. Workflow including the software bill of materials

As before, I select Run to test my workflow and click the view link in the notification. As expected, the workflow fails because Super-Linter is still reporting issues. However, the new Software Bill of Materials has completed successfully. From the artifacts tab I can download the SBOM.

Figure 8. Artifacts tab listing code review and SBOM

Figure 8. Artifacts tab listing code review and SBOM

The artifact is a zip archive that includes the bom.xml created by CycloneDX. This includes, among other information, a list of components used in the backend application.

    <components>
        <component type="library" bom-ref="7474f0f6-8aa2-46db-bebf-a7648cff84e1">
            <name>Jinja2</name>
            <version>3.1.2</version>
            <purl>pkg:pypi/[email protected]</purl>
        </component>
        <component type="library" bom-ref="fad0708b-d007-4f98-a80c-056b136015df">
            <name>aws-cdk-lib</name>
            <version>2.43.0</version>
            <purl>pkg:pypi/[email protected]</purl>
        </component>
        <component type="library" bom-ref="23e3aaae-b4e1-4f3b-b026-fcd298c9cb9b">
            <name>aws-cdk.aws-apigatewayv2-alpha</name>
            <version>2.43.0a0</version>
            <purl>pkg:pypi/[email protected]</purl>
        </component>
        <component type="library" bom-ref="d283cf17-9125-422c-b55c-cabb64d18f79">
            <name>aws-cdk.aws-apigatewayv2-integrations-alpha</name>
            <version>2.43.0a0</version>
            <purl>pkg:pypi/[email protected]</purl>
        </component>
        <component type="library" bom-ref="0f095c84-c9e9-4d6c-a4ed-c4a6c7605426">
            <name>aws-cdk.aws-lambda-python-alpha</name>
            <version>2.43.0a0</version>
            <purl>pkg:pypi/[email protected]</purl>
        </component>
        <component type="library" bom-ref="b248b85b-ba27-4796-bcdf-6bd82ad47295">
            <name>constructs</name>
            <version>&gt;=10.0.0,&lt;11.0.0</version>
            <purl>pkg:pypi/constructs@%3E%3D10.0.0%2C%3C11.0.0</purl>
        </component>
        <component type="library" bom-ref="72b1da33-19c2-4b5c-bd58-7f719dafc28a">
            <name>simplejson</name>
            <version>3.17.6</version>
            <purl>pkg:pypi/[email protected]</purl>
        </component>
    </components>

The workflow is now enforcing code quality and generating a SBOM like I wanted. Note that while this is a great start, there is still room for improvement. First, I could collect reports generated by the actions in my workflow, and define success criteria for code quality. Second, I could scan the SBOM for known security vulnerabilities using a Software Composition Analysis (SCA) solution. I will be covering this in a future post in this series.

Cleanup

If you have been following along with this workflow, you should delete the resources you deployed so you do not continue to incur charges. First, delete the two stacks that CDK deployed using the AWS CloudFormation console in the AWS account you associated when you launched the blueprint. These stacks will have names like mysfitsXXXXXWebStack and mysfitsXXXXXAppStack. Second, delete the project from CodeCatalyst by navigating to Project settings and choosing Delete project.

Conclusion

In this post, you learned how to add GitHub Actions to a CodeCatalyst workflow. I used GitHub Actions alongside native CodeCatalyst actions in my workflow. I also discussed adding actions from both the curated list of actions and others not in the curated list. Read the documentation to learn more about using GitHub Actions in CodeCatalyst.

About the authors:

Dr. Rahul Gaikwad

Dr. Rahul is a DevOps Lead Consultant at AWS. He helps customers to migrate and modernize workloads to AWS Cloud with a special focus on DevOps and IaC. He is passionate about building innovative solutions using technology and enjoys collaborating with customers and peers. He contributes to open-source community projects. Outside of work, Rahul has completed Ph.D. in AIOps and he enjoys travelling and spending time with his family.

Anirudh Sharma

Anirudh is a Cloud Support Engineer 2 with an extensive background in DevOps offerings at AWS, he is also a Subject Matter Expert in AWS ElasticBeanstalk and AWS CodeDeploy services. He loves helping customers and learning new services and technologies. He also loves travelling and has a goal to visit Japan someday, is a Golden State Warriors fan and loves spending time with his family.

Navdeep Pareek

Navdeep is Lead Migration Consultant at AWS. He helps customer to migrate and modernize customer workloads to AWS Cloud and have specialisation in automation, DevOps. In his spare time, he enjoys travelling, cooking and spending time with family and friends.