
How Volkswagen Autoeuropa built a data solution with a robust governance framework, simplifying access to quality data using Amazon DataZone

Post Syndicated from Dhrubajyoti Mukherjee original https://aws.amazon.com/blogs/big-data/how-volkswagen-autoeuropa-built-a-data-solution-with-a-robust-governance-framework-simplifying-access-to-quality-data-using-amazon-datazone/

This is a joint post co-authored with Martin Mikoleizig from Volkswagen Autoeuropa.

This is the second post of a two-part series that details how Volkswagen Autoeuropa, a Volkswagen Group plant, together with AWS, built a data solution with a robust governance framework using Amazon DataZone to become a data-driven factory. Part 1 of this series focused on the customer challenges, overall solution architecture and solution features, and how they helped Volkswagen Autoeuropa overcome their challenges. This post dives into the technical details, highlighting the robust data governance framework that enables ease of access to quality data using Amazon DataZone.

At Amazon, we work backward, a systematic way to vet ideas and create new products. The key tenet of this approach is to start by defining the customer experience, then iteratively work backward from that point until the team achieves clarity of thought around what to build. The first section of this post discusses how we aligned the technical design of the data solution with the data strategy of Volkswagen Autoeuropa. Next, we detail the governance guardrails of the Volkswagen Autoeuropa data solution. Finally, we highlight the key business outcomes.

Aligning the solution with the data strategy

At an early stage of the project, the Volkswagen Autoeuropa and AWS team identified that a data mesh architecture for the data solution aligns with Volkswagen Autoeuropa’s vision of becoming a data-driven factory. With this in mind, the team implemented the following steps:

  • Define data domains – In a workshop, the team identified the data landscape and its distribution in Volkswagen Autoeuropa. Next, the team grouped the data assets of the organization along the lines of business and defined the data domains. Because Volkswagen Autoeuropa is at an early stage of their data mesh journey, defining data domains along the lines of business is the recommended approach. As the data solution evolves, Volkswagen Autoeuropa might consider other criteria such as business subdomains to define data domains. The team defined more than five data domains, such as production, quality, logistics, planning, and finance.
  • Identify pioneer cases – The team identified the pioneer use cases that onboard the data solution first, to validate its business value. The team identified two use cases. The first use case helps predict test results during the car assembly process. The second use case enables the creation of reports containing shop floor key metrics for different management levels. The following criteria were considered to identify these use cases:
    • Use cases that deliver measurable business value for Volkswagen Autoeuropa.
    • Use cases with high AWS maturity.
    • Use cases whose requirements can be met with the first release version of the data solution.
  • Onboard key data products – The team identified the key data products that enabled these two use cases and aligned to onboard them into the data solution. These data products belonged to data domains such as production, finance, and logistics. In addition, the team aligned on business metadata attributes that would help with data discovery. The data products are classified as either source-based data or consumer-based data. Source-based data is the unaltered, raw data that is generated from source systems (for example, quality data, safety data) and is useful for other business use cases. Consumer-based data is the aggregated and transformed data from source systems. Reuse of consumer-based data saves cost in extract, transform, and load (ETL) implementation and system maintenance.

In addition to the preceding steps, the team established a data quality framework to improve the quality of the data product registered in the data solution. The following table shows the mapping of the data mesh-based solution components to Amazon DataZone and AWS Glue features. The table also provides generic examples of the components in the automotive industry.

Data Solution Components | AWS Service Features | Generic Examples
Data domains | Amazon DataZone projects and Amazon DataZone domain units | Production, logistics
Use cases | Amazon DataZone projects | Smart manufacturing, predictive maintenance
Data products | Amazon DataZone assets | Sales data, sensor data
Business metadata | Amazon DataZone glossaries and metadata forms | Data product owner information, data refresh frequency
Data quality framework | AWS Glue Data Quality | A quality score of 92%

Empowering teams with a governance framework

This section discusses the governance framework that was put in place to empower the teams at Volkswagen Autoeuropa by enhancing their analytics journey. It highlights the guardrails that enable ease of access to quality data.

Business metadata

Business metadata helps users understand the context of the data, which can lead to increased trust in the data. Moreover, establishing a common set of attributes of the data products promotes a consistent experience for the users. In addition to the business context, at Volkswagen Autoeuropa, the metadata includes information related to data classification and if the data contains personally identifiable information (PII). The data solution uses Amazon DataZone glossaries and metadata forms to provide business context to their data. Apart from the previous benefits, using the appropriate keywords in Amazon DataZone glossary terms and metadata forms can help with the search and filtering capability of data products in the Amazon DataZone data portal.

Data quality framework

The data quality framework is a comprehensive solution designed to streamline the process of data quality checks and publishing a quality score. It uses AWS Glue Data Quality to generate recommendation rulesets, run orchestrated jobs, store results, and send notifications. This framework can be seamlessly integrated into an AWS Glue job, providing a quality score for data pipeline jobs. The quality score of a data product is published in the Amazon DataZone data portal for consumers to evaluate. The key components of the solution are as follows:

  • Recommendation ruleset generation – The framework generates tailored rulesets based on metadata from the AWS Glue Data Catalog table, providing relevant and comprehensive quality checks.
  • Orchestrated job execution – Jobs are run in AWS Step Functions to perform data quality checks using the generated rulesets against data sources, evaluating data quality based on defined rules and criteria.
  • Result storage and notification – Results, including quality scores, quality status, and rulesets checked, are stored in an Amazon Simple Storage Service (Amazon S3) bucket, maintaining a historical record. End-users receive notifications with relevant details.
  • Data quality score publishing – The quality scores are published in the Amazon DataZone data portal, enabling consumers to access and evaluate data quality.
  • Subscription and quality score requirements – Consumers can subscribe to data sources or targets based on their desired quality score thresholds, making sure they receive data that meets their specific needs and standards.
  • Integration and extensibility – The framework is designed for seamless integration into existing AWS Glue jobs or data pipelines and provides a flexible and extensible architecture for customization and enhancement.
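The scoring and threshold ideas above can be illustrated with a minimal sketch. It assumes the quality score is the percentage of rules that passed; the `RuleResult` structure, the column names, and the threshold values are hypothetical and are not taken from the Volkswagen Autoeuropa implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleResult:
    """Outcome of one data quality rule (hypothetical structure)."""
    rule: str
    passed: bool


def quality_score(results: list[RuleResult]) -> float:
    """Return the share of passed rules as a percentage between 0 and 100."""
    if not results:
        raise ValueError("at least one rule result is required")
    passed = sum(1 for r in results if r.passed)
    return 100.0 * passed / len(results)


def meets_threshold(score: float, threshold: float) -> bool:
    """True when a data product's score satisfies a consumer's minimum."""
    return score >= threshold


# Illustrative rule outcomes; the rule text and column names are made up.
results = [
    RuleResult('IsComplete "vin"', True),
    RuleResult('IsUnique "vin"', True),
    RuleResult('ColumnValues "torque" > 0', False),
    RuleResult("RowCount > 1000", True),
]
score = quality_score(results)  # 3 of 4 rules passed
print(score, meets_threshold(score, 70.0))
```

A consumer with a stricter requirement, for example a minimum score of 92, would not subscribe to this data product until the failing rule is resolved.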

Federated governance

Federated governance empowers producer and consumer teams to operate independently while adhering to a central governance model. For the data solution at Volkswagen Autoeuropa, this meant a centralized team defined the governance guardrails and decentralized data teams employed those guardrails. The following are a few examples of how the team established federated governance in Volkswagen Autoeuropa:

  • Management of Amazon DataZone glossaries and metadata forms – In this mechanism, the Volkswagen Autoeuropa IT team defined the Amazon DataZone glossaries and metadata forms centrally. The data teams used them to publish the data assets in Amazon DataZone. This provides consistency of business metadata across the organization. The following figure explains the process.
    The workflow in the Amazon DataZone data portal consists of the following steps:
    1. The data solution administrator belonging to the Volkswagen Autoeuropa IT team aligns with stakeholders such as data producers, data consumers, and source system owners, and maintains the business metadata using the Amazon DataZone glossaries and metadata forms.
    2. The producer project teams use the Amazon DataZone glossary terms and fill the Amazon DataZone metadata forms to enrich the inventory assets.
    3. After the business metadata is populated, the team publishes the assets in the Amazon DataZone data portal.
  • Management of Amazon DataZone project membership – In this scenario, the management of Amazon DataZone project membership is delegated to a designated administrator of the project. The following figure explains the process.
    The workflow consists of the following steps:
    1. The data solution administrator belonging to the Volkswagen Autoeuropa IT team provisions the Amazon DataZone project and environment using automation. The data solution administrator is the owner of the project.
    2. The data solution administrator delegates the management of the Amazon DataZone project membership to a designated administrator by assigning the owner role.
    3. The Amazon DataZone project administrator assigns the contributor role to eligible users.
    4. The users access the Amazon DataZone project and its assets from the Amazon DataZone data portal.

Authentication and authorization

The Amazon DataZone portal supports two types of authorizations: AWS Identity and Access Management (IAM) roles and AWS IAM Identity Center users. The data solution supports both of these authorization methods. The choice of authentication mechanism is a function of the type of authorization used for Amazon DataZone.

For IAM role authorization, an IAM role is created for each user, incorporating a prefix. Each data solution user role has a permission to list the Amazon DataZone domains (datazone:ListDomains) and to get the data portal login URL (datazone:GetIamPortalLoginUrl) in the Amazon DataZone AWS account. For reasons that are out of scope for this post, there could only be three SAML federated roles in an AWS account in the customer environment. As such, the team didn’t have a dedicated SAML federated role for each Amazon DataZone user. The data solution user role implemented a trust policy allowing the user’s AWS Security Token Service (AWS STS) federated user session principal Amazon Resource Name (ARN). If you don’t have limitations on the number of SAML federated roles per AWS account, you can make all data solution user roles SAML federated roles and update the trust policy accordingly.
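As a rough illustration of the IAM role setup described above, the following sketch builds the two policy documents for one data solution user role. The action names come from this post; the `"Resource": "*"` scope, the `sts:AssumeRole` action, and the example principal ARN are assumptions made for illustration, and the exact session principal ARN depends on how federation is configured in your environment.

```python
import json


def portal_permissions_policy() -> dict:
    """Permissions the post lists for each data solution user role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["datazone:ListDomains", "datazone:GetIamPortalLoginUrl"],
            "Resource": "*",  # assumption: narrow this scope where possible
        }],
    }


def trust_policy(session_principal_arn: str) -> dict:
    """Trust policy that lets one federated session assume the user role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": session_principal_arn},
            "Action": "sts:AssumeRole",  # assumption: role chaining from the federated session
        }],
    }


# Hypothetical principal ARN, shown only to make the sketch concrete.
example_arn = "arn:aws:sts::123456789012:assumed-role/ExampleFederatedRole/jane.doe"
print(json.dumps(portal_permissions_policy(), indent=2))
print(json.dumps(trust_policy(example_arn), indent=2))
```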

For IAM Identity Center authorization, the configuration is done either at the AWS Organizations level or AWS account level in IAM Identity Center. Because there are currently no APIs available for identity source configuration in IAM Identity Center, the team followed the appropriate instructions to configure the identity source on the AWS Management Console.

After the chosen authorization option is activated, Amazon DataZone administrators grant the IAM principals (IAM role or IAM Identity Center user) access to the Amazon DataZone portal. For more details, refer to Manage users in the Amazon DataZone console.

Business outcomes

Volkswagen Autoeuropa and AWS established an iterative mechanism to enable the continuous growth of the data solution. This iterative improvement is expressed as a flywheel as shown in the following figure.

The outcome of each component of the flywheel powers the next component, creating a virtuous cycle. The data solution flywheel consists of five components:

  1. Data solution growth – The primary focus of the flywheel is to accelerate the growth of the data solution. This growth is measured by metrics such as number of data products, number of use cases onboarded into the solution, and number of users.
  2. Enhancing user experience – This component focuses on enhancing the user experience of the data solution. One way to measure the user experience is through user feedback surveys.
  3. Data solution use cases – Improved, positive user experience with the data solution contributes to the increased number of use cases that want to onboard the data solution.
  4. Data producers and consumers – As the number of use cases increases, so does the number of data producers and consumers. Data producers make data available to power the use cases. Data consumers use the data to drive the use cases.
  5. Selection of data products – After data producers onboard the data solution, they publish the assets in the Amazon DataZone data portal. This leads to a larger selection of data products. This, in turn, creates a positive experience for the data solution users.

In addition to the previous components, the positive user experience is reinforced by improving governance guardrails, increasing the number of reusable assets, and maximizing operational excellence.

As of writing this post, Volkswagen Autoeuropa reduced the time to discover data from days to minutes using the data solution. This led to approximately 384 times improvement in data discovery time. Data access took several weeks before the Volkswagen Autoeuropa and AWS collaboration. With the help of the data solution powered by Amazon DataZone, the data access time was reduced to minutes. Overall, the data solution resulted in regaining between 48 hours and weeks of customer productivity over the course of a month.

The data solution powered by Amazon DataZone is driving measurable business impact for Volkswagen Autoeuropa. It enables Volkswagen Autoeuropa to deliver digital use cases faster, with less effort, and a higher overall quality. Volkswagen Autoeuropa believes that Amazon DataZone will be key in their journey to become a data-driven factory and to leverage AI.

Conclusion

This post explored how Volkswagen Autoeuropa built a robust and scalable data solution using Amazon DataZone. The first step was to align the solution with Volkswagen Autoeuropa’s overarching data strategy to drive business value.

The establishment of a comprehensive governance framework was central to this effort. This framework encompasses key components, such as business metadata, data quality, federated governance, access controls, and security, which maintain the trustworthiness and reliability of Volkswagen Autoeuropa’s data assets. The post highlighted the Volkswagen Autoeuropa data solution flywheel, showcasing how the solution enabled improved decision-making, increased operational efficiency, and accelerated digital transformation initiatives across the organization.

The data solution built at Volkswagen Autoeuropa is one of the first implementations within the Volkswagen Group and is a blueprint for other Volkswagen production plants.

“This project is a blueprint for other Volkswagen production plants. By involving the AWS team and using Amazon DataZone, we are able to govern our data centrally and make it accessible in an automated and secure way.”

– Daniel Madrid, Head of IT, Volkswagen Autoeuropa.

If you’re looking to harness the power of data mesh to drive innovation and business value within your organization, we’ve got you covered. In Strategies for building a data mesh-based enterprise solution on AWS, we dive deep into the key considerations and current recommendations to establish a robust, scalable, and well-governed data mesh on AWS. This documentation covers everything from aligning your data mesh with overall business strategy to implementing the data mesh strategy framework.

To get hands-on experience with real-world code examples, see our GitHub repository. This open source project provides a step-by-step blueprint for constructing a data mesh architecture using the powerful capabilities of Amazon DataZone, AWS Cloud Development Kit (AWS CDK), and AWS CloudFormation.


About the Authors

Dhrubajyoti Mukherjee is a Cloud Infrastructure Architect with a strong focus on data strategy, data analytics, and data governance at AWS. He uses his deep expertise to provide guidance to global enterprise customers across industries, helping them build scalable and secure AWS solutions that drive meaningful business outcomes. Dhrubajyoti is passionate about creating innovative, customer-centric solutions that enable digital transformation, business agility, and performance improvement. An active contributor to the AWS community, Dhrubajyoti authors AWS Prescriptive Guidance publications, blog posts, and open source artifacts, sharing his insights and best practices with the broader community. Outside of work, Dhrubajyoti enjoys spending quality time with his family and exploring nature through his love of hiking mountains.

Ravi Kumar is a Data Architect and Analytics expert at AWS, where he finds immense fulfilment in working with data. His days are dedicated to designing and analyzing complex data systems, uncovering valuable insights that drive business decisions. Outside of work, he unwinds by listening to music and watching movies, activities that allow him to recharge after a long day of data wrangling.

Martin Mikoleizig studied mechanical engineering and production technology at the RWTH Aachen University before starting to work at Dr. h.c. Ing. F. Porsche AG in 2015 as a production planner for the engine assembly. Over several years as a Project Manager on Testing Technology for new engine models, he also introduced several innovations like human-machine collaborations and intelligent assistance systems. Starting in 2017, he was responsible for the shop floor IT team of the module lines in Zuffenhausen before he became responsible for the planning of the E-Drive assembly at Porsche. Additionally, he was responsible for the Digitalisation Strategy of the Production Ressort at Porsche. In October 2022, he was assigned to Volkswagen Autoeuropa in Portugal in the role of a Digital Transformation Manager for the plant, driving the digital transformation towards a data-driven factory.

Weizhou Sun is a Lead Architect at AWS, specializing in digital manufacturing solutions and IoT. With extensive experience in Europe, she has enhanced operational efficiencies, reducing latency and increasing throughput. Weizhou’s expertise includes industrial computer vision, predictive maintenance, and predictive quality, consistently delivering top performance and client satisfaction. A recognized thought leader in IoT and remote driving, she has contributed to business growth through innovations and open source work. Committed to knowledge sharing, Weizhou mentors colleagues and contributes to practice development. Known for her problem-solving skills and customer focus, she delivers solutions that exceed expectations. In her free time, Weizhou explores new technologies and fosters a collaborative culture.

Ajinkya Patil is a Senior Security Architect with AWS Professional Services, specializing in security consulting for customers in the automotive industry. Since joining AWS in 2019, he has played a key role in helping automotive companies design and implement robust security solutions on AWS. Ajinkya is an active contributor to the AWS community, having presented at AWS re:Inforce and authored articles for the AWS Security Blog and AWS Prescriptive Guidance. Outside of his professional pursuits, Ajinkya is passionate about travel and photography, often capturing the diverse landscapes he encounters on his journeys.

Adjoa Taylor has over 20 years of experience in industrial manufacturing, providing industry and technology consulting services, digital transformation, and solution delivery. Currently, Adjoa leads Product Centric Digital Transformation, enabling customers in solving complex manufacturing problems using smart factory and industry-leading transformation mechanisms. Most recently, she drives value with AI/ML and generative AI use cases for the plant floor. Adjoa is an experienced leader, having spent over 20 years of her career delivering projects in countries throughout North America, Latin America, Europe, and Asia. Adjoa brings deep experience across multiple business segments with a focus on business outcome-driven solutions. Adjoa is passionate about helping customers solve problems while realizing the art of the possible through implementing value-based solutions.

Achieve data resilience using Amazon OpenSearch Service disaster recovery with snapshot and restore

Post Syndicated from Samir Patel original https://aws.amazon.com/blogs/big-data/achieve-data-resilience-using-amazon-opensearch-service-disaster-recovery-with-snapshot-and-restore/

Amazon OpenSearch Service is a fully managed service offered by AWS that enables you to deploy, operate, and scale OpenSearch domains effortlessly. OpenSearch is an open source, distributed search and analytics engine. OpenSearch Service seamlessly integrates with other AWS offerings, providing a robust solution for building scalable and resilient search and analytics applications in the cloud.

Disaster recovery is vital for organizations, offering a proactive strategy to mitigate the impact of unforeseen events like system failures, natural disasters, or cyberattacks.

In Disaster Recovery (DR) Architecture on AWS, Part I: Strategies for Recovery in the Cloud, we introduced four major strategies for disaster recovery (DR) on AWS. These strategies enable you to prepare for and recover from a disaster. By using the best practices provided in the AWS Well-Architected Reliability Pillar to design your DR strategy, your workloads can remain available despite disaster events such as natural disasters, technical failures, or human actions. OpenSearch Service provides various DR solutions, including active-passive and active-active approaches. This post focuses on introducing an active-passive approach using a snapshot and restore strategy.

Snapshot and restore in OpenSearch Service

The snapshot and restore strategy in OpenSearch Service involves creating point-in-time backups, known as snapshots, of your OpenSearch domain. These snapshots capture the entire state of the domain, including indexes, mappings, and settings. In the event of data loss or system failure, you can use these snapshots to restore the domain to a specific point in time. Implementing a snapshot and restore strategy helps organizations meet Recovery Point Objectives (RPOs) and Recovery Time Objectives (RTOs), limiting data loss and speeding up system recovery in case of disasters.

Compared with other DR strategies, snapshot and restore results in longer downtime and greater data loss between the disaster event and recovery. However, it can still be the right strategy for your workload because it is the most straightforward and least expensive strategy to implement. Additionally, not all workloads require an RTO and RPO of minutes or less.

Solution overview

The following architecture diagram illustrates how manual snapshots are taken from the OpenSearch Service domain in the primary AWS Region and stored in an Amazon Simple Storage Service (Amazon S3) bucket in the secondary Region.

We walk through each step and discuss scenarios for failing over to the OpenSearch Service domain in the secondary Region in the event of a disaster in the primary Region, as well as how to fail back to the OpenSearch Service domain to resume operations in the primary Region.


The workflow consists of the following initial steps:

  1. OpenSearch Service is hosted in the primary Region, and all the active traffic is routed to the OpenSearch Service domain in the primary Region.
  2. The manual snapshots from the OpenSearch Service domain in the primary Region are transferred to the S3 bucket in the secondary Region on a predefined schedule.

This process can be programmatically scheduled using an AWS Lambda function, as described in Unleash the power of Snapshot Management to take automated snapshots using Amazon OpenSearch Service. This gives you the most effective protection from disasters of any scope of impact. In the event of a disaster in the primary Region, in addition to OpenSearch data recovery from backup, you must also be able to restore your infrastructure in the secondary Region. Infrastructure as code (IaC) methods such as using AWS CloudFormation or the AWS Cloud Development Kit (AWS CDK) enable you to deploy consistent infrastructure across Regions.

The following diagram illustrates the architecture in the event of a disaster.


The workflow consists of the following steps:

  1. In the event of a disaster making the OpenSearch Service domain in the primary Region unavailable, all active traffic routed to the primary Region’s OpenSearch Service domain will cease.
  2. When the OpenSearch Service domain becomes unavailable, the manual snapshots to Amazon S3 will no longer be taken at the predefined intervals.
  3. To fail over, launch the OpenSearch Service domain in the secondary Region using IaC. Restore manual snapshots from the S3 bucket in the secondary Region to the OpenSearch Service domain in the secondary Region. For log workloads, restore only recent or relevant logs to save time and use this opportunity to purge unnecessary documents or indexes.
  4. Update the DNS controller (Amazon Route 53) to redirect traffic to the OpenSearch Service domain in the secondary Region.
  5. When the primary Region becomes available, set up manual snapshots from the OpenSearch Service domain in the secondary Region to the S3 bucket in the primary Region.
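The restore in step 3 can be sketched as a small request builder. This is a minimal sketch, assuming the standard `_restore` snapshot API; the function name, the endpoint, and the index patterns are hypothetical, and the request would still need to be signed before it is sent to the domain in the secondary Region.

```python
import json


def build_restore_request(endpoint, repository, snapshot, index_patterns):
    """Return (method, url, body) for restoring a snapshot.

    With no patterns, restore everything except the Dashboards and security
    system indexes, which already exist on a new domain and would conflict.
    """
    if index_patterns:
        # Restore only the recent or relevant indexes, as suggested for logs.
        indices = ",".join(index_patterns)
    else:
        indices = "-.kibana*,-.opendistro*"
    url = f"https://{endpoint}/_snapshot/{repository}/{snapshot}/_restore"
    body = {"indices": indices, "include_global_state": False}
    return "POST", url, json.dumps(body)


# Hypothetical endpoint and index pattern for the secondary Region domain.
method, url, body = build_restore_request(
    "search-dr.example.com", "os-snapshot-repo", "2023-11-18", ["logs-2023-11-*"]
)
print(method, url, body)
```

Restricting the index patterns shortens recovery time because less data is copied from Amazon S3 back into the domain.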

The following diagram illustrates the architecture after the primary Region becomes available.


The workflow consists of the following steps:

  1. When the primary Region becomes available again, destroy the existing OpenSearch domain in the primary Region. Launch a new OpenSearch Service domain in the primary Region.
  2. Restore manual snapshots from the S3 bucket in the primary Region to the new OpenSearch Service domain created in the previous step.
  3. Update Route 53 to redirect traffic to the new OpenSearch Service domain in the primary Region.
  4. Set up manual snapshots from the new OpenSearch Service domain in the primary Region to a new prefix in the S3 bucket in the secondary Region.
  5. After successfully failing back to the OpenSearch Service domain in the primary Region, destroy the OpenSearch Service domain in the secondary Region.

In this post, we demonstrate how to launch an OpenSearch Service domain in the primary Region and set up manual snapshots to an S3 bucket in the secondary Region. Then we simulate a failover to resume operations using the OpenSearch Service domain in the secondary Region in the event of a disaster. Finally, we illustrate the failback mechanism by reverting to the OpenSearch Service domain in the primary Region.

Regular operations

In this section, we discuss the regular operations to set up the solution architecture.

Launch an OpenSearch Service domain in the primary Region

Create an OpenSearch Service domain in the primary Region by following the instructions in Creating and managing Amazon OpenSearch Service domains with fine-grained access control enabled. Do not enable standby mode. Create indexes and populate them with documents.

Create an S3 bucket in the secondary Region

To store OpenSearch snapshots in the secondary Region, you need to create an S3 bucket in that Region. For instructions, see Creating a bucket.

Create the snapshot IAM role

The snapshot AWS Identity and Access Management (IAM) role is necessary to grant permissions specifically for managing snapshots within the OpenSearch Service domain. For instructions, see Creating an IAM role (console). We refer to this role as TheSnapshotRole in this post.

  1. Attach the following IAM policy to TheSnapshotRole:
    {
      "Version": "2012-10-17",
      "Statement": [{
          "Action": [
            "s3:ListBucket"
          ],
          "Effect": "Allow",
          "Resource": [
            "arn:aws:s3:::s3-bucket-name"
          ]
        },
        {
          "Action": [
            "s3:GetObject",
            "s3:PutObject",
            "s3:DeleteObject"
          ],
          "Effect": "Allow",
          "Resource": [
            "arn:aws:s3:::s3-bucket-name/*"
          ]
        }
      ]
    }

  2. Edit the trust relationship of TheSnapshotRole to specify OpenSearch Service in the Principal statement, as shown in the following example:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "",
    "Effect": "Allow",
    "Principal": {
      "Service": "es.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}

To register the snapshot repository, you need to be able to pass TheSnapshotRole to OpenSearch Service. You also need access to the es:ESHttpPut action.

  3. To grant both of these permissions, attach the following policy to the IAM role whose credentials are being used to sign the request:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::123456789012:role/TheSnapshotRole"
    },
    {
      "Effect": "Allow",
      "Action": "es:ESHttpPut",
      "Resource": "arn:aws:es:region:123456789012:domain/domain-name/*"
    }
  ]
}

Associate the IAM role or user to the OpenSearch security role for manual snapshots

Fine-grained access control introduces an additional step when registering a repository. Even if you use HTTP basic authentication for all other purposes, you need to map the manage_snapshots role to your IAM role that has iam:PassRole permissions to pass TheSnapshotRole. Snapshots can only be taken by a process or user associated with an IAM identity. This makes sure only authorized entities can create, manage, or restore snapshots.

One such method is to use Amazon Cognito. With Amazon Cognito, users can sign in with IAM credentials indirectly, either using proxy mapping with SAML or through user pool credentials. This setup provides a secure way to manage access while using the capabilities of IAM. The preferred method is to use a process that signs requests with AWS SigV4. This approach involves programmatically signing each request to OpenSearch with the appropriate IAM credentials, making sure only authorized processes can manage snapshots. This method is recommended because it provides a higher level of security and can be automated using Lambda functions as part of your backup and DR workflows.

  1. On OpenSearch Dashboards, navigate to the main menu and choose Security.
  2. Choose Roles and search for the manage_snapshots role.
  3. Choose Mapped users and choose Manage mappings.
  4. Add the Amazon Resource Name (ARN) of TheSnapshotRole to the backend roles.


Register a snapshot repository on the OpenSearch Service domain

To register a snapshot repository, send a signed PUT request to the OpenSearch Service domain endpoint using curl, Postman, an integrated development environment (IDE) such as PyCharm or VS Code, or another method. Using a PUT request in OpenSearch Dashboards for repository registration is not supported. For more details, see Using OpenSearch Dashboards with Amazon OpenSearch Service.

The curl command is as follows:

curl --aws-sigv4 "aws:amz:us-east-1:es" --user "ACCESS_KEY:SECRET_KEY" -XPUT "https://DOMAIN_ENDPOINT/_snapshot/REPOSITORY_NAME" -H 'Content-Type: application/json' -d '{ "type": "s3", "settings": { "bucket": "BUCKET_NAME", "endpoint": "s3.amazonaws.com", "role_arn": "ROLE_ARN" }}'

Use the curl command to register a snapshot repository in the OpenSearch Service domain in the primary Region pointing to the S3 bucket in the secondary Region.
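If you register repositories from a script rather than from the command line, the same request can be assembled programmatically. The following sketch only builds the URL and JSON body shown in the curl command; the function name is hypothetical, and signing and sending the request are left to whichever SigV4-capable client you use.

```python
import json


def build_repository_request(domain_endpoint: str, repository_name: str,
                             bucket: str, role_arn: str,
                             s3_endpoint: str = "s3.amazonaws.com"):
    """Return the (url, json_body) pair for registering an S3 snapshot repository.

    The settings mirror the curl command in this post. The caller is
    responsible for sending a SigV4-signed PUT request with this body.
    """
    url = f"https://{domain_endpoint}/_snapshot/{repository_name}"
    body = {
        "type": "s3",
        "settings": {
            "bucket": bucket,
            "endpoint": s3_endpoint,
            "role_arn": role_arn,
        },
    }
    return url, json.dumps(body)


# Placeholder values for illustration only.
url, body = build_repository_request(
    "DOMAIN_ENDPOINT", "os-snapshot-repo", "BUCKET_NAME", "ROLE_ARN"
)
print(url)
print(body)
```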

To verify the snapshot repository creation, run the following query:

GET /_snapshot/os-snapshot-repo

Take manual snapshots

To take a manual snapshot, perform the following steps from OpenSearch Dashboards. To include or exclude certain indexes and specify other settings, add a request body. For the request structure, see Take snapshots in the OpenSearch documentation.

  1. To create a manual snapshot, use the following query. In this query, the repository name is os-snapshot-repo and the snapshot name is 2023-11-18.

PUT /_snapshot/os-snapshot-repo/2023-11-18

  2. Verify that the snapshot has been created and check which indexes it includes:

GET /_snapshot/os-snapshot-repo/_all

  3. Schedule your manual snapshot at a defined interval (for example, every 1 hour) based on your RPO requirements.

You can schedule this by creating an Amazon EventBridge rule to invoke a Lambda function every hour. For instructions, see Tutorial: Create an EventBridge scheduled rule for AWS Lambda functions. The Lambda function will transfer incremental manual snapshots into Amazon S3. For more information, see Unleash the power of Snapshot Management to take automated snapshots using Amazon OpenSearch Service.
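As a minimal sketch of what the scheduled Lambda function might compute on each run, the following helpers derive a unique, time-based snapshot name and the corresponding request path. The `snapshot-` prefix and the timestamp format are assumptions for illustration (the earlier example simply used the date as the name), and the signed PUT request itself is omitted.

```python
from datetime import datetime, timezone


def snapshot_name(now: datetime) -> str:
    """Build a lowercase, time-based snapshot name in UTC (assumed format)."""
    return now.astimezone(timezone.utc).strftime("snapshot-%Y-%m-%d-%H-%M")


def snapshot_path(repository: str, now: datetime) -> str:
    """Return the request path for a PUT that creates the snapshot."""
    return f"/_snapshot/{repository}/{snapshot_name(now)}"


# Example for a run at 10:00 UTC on 2023-11-18.
run_time = datetime(2023, 11, 18, 10, 0, tzinfo=timezone.utc)
print(snapshot_path("os-snapshot-repo", run_time))
```

Including the hour and minute keeps names unique across hourly runs, which matters because a snapshot name can't be reused within the same repository.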

Failover scenario

In a disaster, if your OpenSearch Service domain in the primary Region goes down, you can fail over to a domain in the secondary Region. This provides business continuity and minimizes downtime during unexpected Region failures.

To maintain business continuity during a disaster, you can use message queues like Amazon Simple Queue Service (Amazon SQS) and streaming solutions like Apache Kafka or Amazon Kinesis. These tools buffer incoming data in the primary Region, allowing you to replay traffic for a predefined period in the secondary Region when you fail over, which keeps the OpenSearch Service domain up to date with all recent changes.
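The replay idea can be sketched as a simple filter over buffered events: everything received after the last successful snapshot (minus a safety margin) is re-indexed in the secondary Region. The `BufferedEvent` type, the five-minute margin, and the function name are hypothetical; a real implementation would read from your queue or stream and index documents with stable IDs so that replaying an overlapping window is idempotent.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class BufferedEvent:
    received_at: datetime
    document_id: str


def events_to_replay(events, last_snapshot_at: datetime,
                     safety_margin: timedelta = timedelta(minutes=5)):
    """Return events received since the last snapshot, oldest first.

    The safety margin deliberately overlaps with the snapshot so that
    documents indexed while the snapshot was being taken are not missed.
    """
    cutoff = last_snapshot_at - safety_margin
    recent = [e for e in events if e.received_at >= cutoff]
    return sorted(recent, key=lambda e: e.received_at)


utc = timezone.utc
buffered = [
    BufferedEvent(datetime(2023, 11, 18, 10, 30, tzinfo=utc), "doc-3"),
    BufferedEvent(datetime(2023, 11, 18, 9, 50, tzinfo=utc), "doc-1"),
    BufferedEvent(datetime(2023, 11, 18, 9, 57, tzinfo=utc), "doc-2"),
]
last_snapshot = datetime(2023, 11, 18, 10, 0, tzinfo=utc)
print([e.document_id for e in events_to_replay(buffered, last_snapshot)])
```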

Launch an OpenSearch Service domain in the secondary Region

Create an OpenSearch Service domain in the secondary Region by following the instructions in Creating and managing Amazon OpenSearch Service domains with fine-grained access control enabled. Do not enable standby mode.

Your RTO requirements determine when to launch the domain. If your RTO is less than 1 hour, keep the OpenSearch Service domain in the secondary Region up and running, which incurs additional costs. If your RTO is more than 1 hour, you can launch a new OpenSearch Service domain in the secondary Region during the failover activity to reduce operational costs.

Associate the IAM role or user to the OpenSearch security role for manual snapshots

Follow the instructions in the previous section to associate the IAM role with the OpenSearch security role.

Register a snapshot repository on the OpenSearch Service domain

To make sure your data is available for failover, register a snapshot repository on the OpenSearch Service domain in the secondary Region so that the snapshots taken from your OpenSearch Service domain in the primary Region can be restored there. Use the following command:

curl --aws-sigv4 "aws:amz:us-west-2:es" --user "ACCESS_KEY:SECRET_KEY" -XPUT "https://DOMAIN_ENDPOINT/_snapshot/REPOSITORY_NAME" -H 'Content-Type: application/json' -d '{ "type": "s3", "settings": { "bucket": "BUCKET_NAME", "endpoint": "s3.amazonaws.com", "role_arn": "ROLE_ARN" }}'

The S3 bucket should be the bucket created in the secondary Region where the snapshots from your OpenSearch Service domain in the primary Region are stored.

Restore snapshots

Before you restore a snapshot, make sure that the destination domain doesn’t use Multi-AZ with standby.

After you register the snapshot repository on your OpenSearch Service domain in the secondary Region, the next step is to restore the desired indexes from the snapshot repository so that your data is available in the secondary Region. You can selectively restore specific indexes from your snapshot, which gives you the flexibility to recover only the necessary data. Use the following command:

POST /_snapshot/<REPOSITORY_NAME>/<SNAPSHOT_NAME>/_restore
{
"indices": "movie-index"
}

Verify that all the necessary indexes have been restored to the OpenSearch Service domain in the secondary Region.
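When you restore more than one index, it can help to generate the request instead of editing it by hand. The following sketch builds the restore path and body for a list of indexes. The function name is hypothetical, and setting `include_global_state` to false is an assumption that suits restoring data indexes only; adjust it to your needs.

```python
import json


def build_restore_request(repository: str, snapshot: str, indices,
                          include_global_state: bool = False):
    """Return the (path, json_body) pair for a snapshot restore request.

    indices is a list of index names or patterns; the restore API accepts
    them as a single comma-separated string.
    """
    path = f"/_snapshot/{repository}/{snapshot}/_restore"
    body = {
        "indices": ",".join(indices),
        "include_global_state": include_global_state,
    }
    return path, json.dumps(body)


path, body = build_restore_request(
    "os-snapshot-repo", "2023-11-18", ["movie-index"]
)
print("POST", path)
print(body)
```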

Update Route 53 to redirect traffic to the OpenSearch Service domain in the secondary Region

After you restore the snapshots to the OpenSearch Service domain in the secondary Region, update the DNS settings (Route 53) with the new OpenSearch Service domain endpoint to redirect indexing traffic to the OpenSearch Service domain in the secondary Region. Route 53, a scalable DNS service, can seamlessly redirect traffic to the new OpenSearch endpoint by updating its DNS records.

A Route 53 resource record set directs internet traffic to specific resources, such as an OpenSearch Service domain. It includes a domain name, a record type (for example, CNAME), and the DNS name or IP address of the endpoint. To redirect traffic to a new endpoint, update or create a new record set.
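As an illustration of such a record set update, the following sketch builds a change batch that upserts a CNAME record pointing at the new endpoint. The record name, endpoint, and 60-second TTL are placeholder assumptions; the resulting dictionary has the shape that the Route 53 `ChangeResourceRecordSets` API expects as its change batch, and you would submit it with the AWS SDK or CLI together with your hosted zone ID.

```python
def build_failover_change_batch(record_name: str, target_endpoint: str,
                                ttl: int = 60) -> dict:
    """Build a Route 53 change batch that points a CNAME at target_endpoint.

    UPSERT creates the record if it doesn't exist and updates it otherwise.
    A short TTL (assumed here) helps clients pick up the change quickly.
    """
    return {
        "Comment": "Redirect traffic to the failover OpenSearch Service domain",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target_endpoint}],
                },
            }
        ],
    }


# Placeholder names for illustration only.
batch = build_failover_change_batch("search.example.com", "SECONDARY_DOMAIN_ENDPOINT")
print(batch["Changes"][0]["ResourceRecordSet"]["ResourceRecords"])
```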

Set up manual snapshots from the OpenSearch Service domain in the secondary Region to the Amazon S3 bucket in the primary Region

Complete the following steps to set up manual snapshots from the OpenSearch Service domain in the secondary Region to the S3 bucket in the primary Region:

  1. Create an S3 bucket in the primary Region, following the steps from earlier in this post.
  2. Associate the IAM role or user to the OpenSearch security role for taking manual snapshots in your OpenSearch Service domain in the secondary Region. For instructions, refer to the earlier section in this post.
  3. Register a snapshot repository on the OpenSearch Service domain in the secondary Region pointing to the S3 bucket in the primary Region. For instructions, refer to the earlier section in this post.
  4. Take manual snapshots of the OpenSearch Service domain in the secondary Region to the S3 bucket in the primary Region, following the instructions from earlier in this post.
  5. Schedule your manual snapshot from the OpenSearch Service domain in the secondary Region to the S3 bucket in the primary Region at a defined interval (for example, every 1 hour) based on your RPO requirements.

Failback scenario

When the primary Region becomes available again, you can seamlessly revert to the OpenSearch Service domain in the primary Region. This failback process involves the following steps.

Destroy an existing OpenSearch Service domain in the primary Region

When the primary Region becomes available again, destroy the existing OpenSearch Service domain in the primary Region from the OpenSearch Service console. In the following screenshot, the primary Region is US East (N. Virginia).

bdb-4227-DestroyDomain

Launch a new OpenSearch Service domain in the primary Region

Create an OpenSearch Service domain in the primary Region by following the instructions in Creating and managing Amazon OpenSearch Service domains with fine-grained access control. Do not enable standby mode.

Associate the IAM role or user to the OpenSearch security role for restoring manual snapshots

Follow the instructions from earlier in this post to associate the IAM role or user to the OpenSearch security role.

Register a snapshot repository on the OpenSearch Service domain

To make sure your data is available for failback, register a snapshot repository on the new OpenSearch Service domain in the primary Region so that the snapshots taken from your OpenSearch Service domain in the secondary Region can be restored there. Use the following command:

curl --aws-sigv4 "aws:amz:us-east-1:es" --user "ACCESS_KEY:SECRET_KEY" -XPUT "https://DOMAIN_ENDPOINT/_snapshot/REPOSITORY_NAME" -H 'Content-Type: application/json' -d '{ "type": "s3", "settings": { "bucket": "BUCKET_NAME", "endpoint": "s3.amazonaws.com", "role_arn": "ROLE_ARN" }}'

The S3 bucket should be the bucket created in the primary Region where the snapshots from your OpenSearch Service domain in the secondary Region are stored.

Restore manual snapshots from the S3 bucket in the primary Region to the new OpenSearch Service domain in the primary Region

To restore the manual snapshots, complete the following steps:

  1. Use the following code to restore the manual snapshots from the S3 bucket in the primary Region to the new OpenSearch Service domain in the primary Region:

POST /_snapshot/os-snapshot-repo/2023-11-18/_restore
{
"indices": "movie-index"
}

  2. Verify data integrity and make sure the primary domain is up to date by checking the document count of the index:

GET movie-index/_count

  3. Update Route 53 to redirect traffic to the new OpenSearch Service domain in the primary Region.
  4. Set up manual snapshots from the new OpenSearch Service domain in the primary Region to a new prefix in the S3 bucket in the secondary Region.
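The document count check in the steps above can be automated by comparing the `_count` responses from the two domains before you switch traffic. This is a minimal sketch: the function name and the optional tolerance are assumptions, and the inputs are the parsed JSON responses of `GET movie-index/_count` from each domain.

```python
def counts_match(source_response: dict, restored_response: dict,
                 tolerance: int = 0) -> bool:
    """Compare the document counts of two _count responses.

    tolerance allows for a small, expected difference, for example when
    the source domain is still receiving writes during the comparison.
    """
    difference = abs(source_response["count"] - restored_response["count"])
    return difference <= tolerance


# Example responses trimmed to the field that matters here.
secondary = {"count": 1500}
primary = {"count": 1500}
print(counts_match(secondary, primary))
```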

Destroy the OpenSearch Service domain in the secondary Region

After you have successfully failed back to the OpenSearch Service domain in the primary Region, destroy the OpenSearch Service domain in the secondary Region. In the following screenshot, the secondary Region is US West (Oregon).

bdb-4227-DestroyDomain2

Conclusion

In this post, we explained how you can implement a DR pattern on OpenSearch Service using a snapshot and restore strategy. It’s highly recommended to define your RPO and RTO for your workload and choose an appropriate DR strategy. Then, using AWS services, you can design an architecture that achieves the RTO and RPO for your business needs.


About the Authors

Samir Patel is a Senior Data Architect at Amazon Web Services, where he specializes in OpenSearch, data analytics, and cutting-edge generative AI technologies. Samir works directly with enterprise customers to design and build customized solutions catered to their data analytics and cybersecurity needs. When not immersed in technical work, Samir pursues his passion for outdoor activities, including hiking, pickleball, and grilling with family and friends.

Sesha Sanjana Mylavarapu is an Associate Data Lake Consultant at AWS Professional Services. She specializes in cloud-based data management and collaborates with enterprise clients to design and implement scalable data lakes. She has a strong interest in data analytics and enjoys helping customers solve their business and technical challenges. Beyond her professional pursuits, Sanjana enjoys hiking, playing guitar, and is passionate about teaching yoga.

Vivek Gautam is a Senior Data Architect with specialization in data analytics at AWS Professional Services. He works with enterprise customers building data products, analytics platforms, streaming, and search solutions on AWS. When not building and designing data products, Vivek is a food enthusiast who also likes to explore new travel destinations and go on hikes.

Channel deflection from voice to chat using Amazon Connect

Post Syndicated from Siva Thangavel original https://aws.amazon.com/blogs/architecture/channel-deflection-from-voice-to-chat-using-amazon-connect/

This post was co-written with Sagar Bedmutha, senior solutions architect at Tata Consultancy Services, and Rajiya Patan, AWS developer at Tata Consultancy Services.

Service excellence helps cultivate customer satisfaction and brand loyalty. According to Gartner, one service excellence challenge is long wait times on interactive voice response (IVR) systems. Long wait times can translate into frustrated customers and potentially lost business. To maintain and grow business, companies must examine the shape of their customer service—avoiding long wait times, offering alternative communication channels such as chat, and designing easier-to-use, more efficient systems.

Amazon Connect, an AWS cloud-based contact center solution, is specialized in both voice and chat communication. This powerful tool can open up new avenues for businesses to enhance their customer service experience. Through Amazon Connect, companies can implement strategies like transferring a voice call to a chat channel. This can help resolve the pain point of wait times while maintaining the continuity of the engagement with customers.

This post outlines an Amazon Connect architecture pattern for transitioning between voice and chat channels. With this solution, a customer in a long queue on a voice call can choose a callback or to continue the conversation with an agent through chat.

Prerequisites

To implement this solution, you’ll need the following:

Solution overview

Our solution provides an alternate channel and a callback option if there is a long wait time in IVR. Customers can transition from voice to chat or email instantly without additional work.

We designed this solution by using the following AWS services and custom widget:

  •  Amazon Connect – Omnichannel cloud contact center that helps you provide superior customer service at a lower cost. Amazon Connect contact flows define the customer experience from start to finish.
  •  Lambda – Serverless, event-driven compute service that lets you run code for virtually any type of application or backend service, without you needing to provision or manage servers.
  •  CloudFront – Content delivery network (CDN) that speeds up delivery of static and dynamic web content, such as HTML, CSS, JavaScript, and images. CloudFront caches content at edge locations closer to end users.
  •  Amazon Pinpoint – Flexible, scalable marketing communications service that connects you with customers over email, SMS, push notifications, or voice.
  •  Customized chat widget – Hosted in an Amazon S3 bucket, the widget provides the interface for chat interactions. It is developed using HTML, Vanilla JavaScript, and customized styling.

The following high-level architecture diagram outlines the flow of the process.

Architecture diagram showing the flow from the customer call to chatting with a live agent. Detailed description follows in text.

Channel deflection architecture diagram

  1. The customer initiates a call to the IVR system for customer support.
  2. If there is a long wait time, the IVR system provides an option for callback through the voice channel or the ability to switch to another channel like chat or SMS.
  3. The customer selects the option to transition the call to a chat channel.
  4. The Amazon Connect flow invokes a Lambda function to create a chat session for the customer. The Lambda function generates a secure, time-limited signed URL for the chat channel, including relevant context.
  5. The solution sends the URL to the customer’s registered mobile number and email address through Amazon Pinpoint.
  6. The customer receives the chat link on their mobile device or email, then they select the link.
  7. A chat session initiates in a web browser, and a live agent is connected to assist the customer.

Note: The chat link expires if the customer doesn’t open it within the designated time window.
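The post doesn't show how the Lambda function signs the chat link, so the following is only one possible approach, sketched with the Python standard library: an HMAC signature over the contact ID and an expiry timestamp. The parameter names (`contact_id`, `expires`, `sig`), the base URL, and the secret handling are all hypothetical; in production the secret would come from a secrets store, and you might use CloudFront signed URLs instead.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse


def _signature(contact_id: str, expires: int, secret: bytes) -> str:
    """Compute a hex HMAC-SHA256 over the contact ID and expiry time."""
    payload = f"{contact_id}:{expires}".encode("utf-8")
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def sign_chat_url(base_url: str, contact_id: str, secret: bytes,
                  ttl_seconds: int, now=None) -> str:
    """Return a chat URL that is valid for ttl_seconds from now."""
    issued = int(time.time() if now is None else now)
    expires = issued + ttl_seconds
    query = urlencode({
        "contact_id": contact_id,
        "expires": expires,
        "sig": _signature(contact_id, expires, secret),
    })
    return f"{base_url}?{query}"


def verify_chat_url(url: str, secret: bytes, now=None) -> bool:
    """Check that the signature matches and the link has not expired."""
    params = parse_qs(urlparse(url).query)
    try:
        contact_id = params["contact_id"][0]
        expires = int(params["expires"][0])
        provided = params["sig"][0]
    except (KeyError, ValueError):
        return False
    expected = _signature(contact_id, expires, secret)
    current = time.time() if now is None else now
    # Constant-time comparison avoids leaking signature details via timing.
    return hmac.compare_digest(provided.encode(), expected.encode()) and current <= expires


link = sign_chat_url("https://example.com/chat", "abc-123", b"demo-secret", 900)
print(verify_chat_url(link, b"demo-secret"))
```

Because the expiry time is part of the signed payload, a customer can't extend the link's lifetime by editing the `expires` parameter.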

Implementation considerations

When implementing this voice-to-chat transition solution, it’s important to consider the following:

  • Ensure that your AWS account has the necessary permissions, and that you’ve set up appropriate IAM roles and policies for secure access to Amazon Connect, Lambda, Amazon S3, CloudFront, and Amazon Pinpoint.
  • Ensure that you have the necessary technical knowledge. Familiarity with Amazon Connect contact flows is crucial, as is proficiency in developing and deploying Lambda functions. You must create custom Lambda functions to handle the chat session creation and generate secure, time-limited signed URLs.
  • Set up an S3 bucket to host your custom chat widget, and configure a CloudFront distribution for performance and security.
  • Integrate Amazon Pinpoint for communication delivery. This requires careful setup to handle email and SMS notifications effectively.
  • When developing the custom chat widget, focus on creating a user-friendly interface that integrates with the Amazon Connect chat API. Pay special attention to security measures, particularly in generating and managing the signed URLs for chat access.
  • Complete testing to confirm smooth operations across various scenarios, including edge cases like expired chat links.
  • Remember to monitor the solution’s performance in production and consider scalability as your customer base grows.

By addressing these implementation considerations, you’ll be well-positioned to deploy a robust and effective voice-to-chat transition system that enhances your customer service capabilities.

Extended use cases

You can extend this solution for solving other contact center use cases with minimal or no modification. The following are some examples:

  • Assisting customers with complex technical issues that require a step-by-step guide.
  • Helping customers to follow instructions by reading the manual to complete backend processes, like profile updates.
  • Overcoming language barriers with international customer support by using writing instead of voice.
  • Authenticating customers using address, zip code, or other demographics.
  • Offering chat functionality to customers who prefer to multitask during interactions.
  • Deflecting traffic to alternate channels to improve customer experience and reduce costs.
  • Offering a method for secure document exchange, such as during financial services consultations.

Conclusion

Using Amazon Connect and other AWS services, this solution offers an implementation that can transition voice calls to a chat channel. This approach provides flexibility to your customer so that they can switch between channels. This helps to improve the total customer experience, the company’s efficiency, and the agent’s productivity. The flow provides continuity in conversations, so that agents can resume conversations with clients across channels and still maintain context. In the end, this solution empowers companies to deliver exceptional customer service and drive positive outcomes for their business. You can learn more about using Amazon Connect by visiting our Amazon Connect Resources page.

How Amazon Q reduced the time Amazon developers spent waiting for technical answers by 450k hours this year

Post Syndicated from Hank Cycyota original https://aws.amazon.com/blogs/devops/reducing-time-spent-waiting-with-amazon-q/

Introduction

Software development is complex and time consuming. Developers frequently need to stop building to get answers to hard, technical questions. What is the error in my code? How do I debug the logic? Where do I go to find this information? In the 2024 Stack Overflow Developer Survey, 53% of respondents agreed that waiting on answers disrupts their workflow, even if they know where to go find those answers. Similarly, 30% of respondents said knowledge silos impact their productivity ten times or more per week.

Our team – Amazon Software Builder Experience (ASBX) – leaned in to help solve this problem on behalf of Amazon developers. The ASBX organization has a mission to improve the software builder experience across tens of thousands of software engineers that work in all Amazon businesses. This includes the discovery tools that developers use to build and innovate on behalf of our customers. At Amazon, we have a wealth of software development expertise and an extensive knowledge base, but individual developers often have a hard time finding the exact information they need for the task at hand. They’re looking for a needle in a haystack. We have a few internal tools where Amazon developers go to connect with subject matter experts (SMEs), but for the most complex questions, they might need to wait hours before they get a response. However, often the information they need is buried somewhere deep in our knowledge base. We saw an opportunity to pair new techniques in generative AI like retrieval augmented generation (RAG) with our extensive knowledge base to reduce the demand on those SMEs in the tools that our developers use every day. In this way, we would reduce the time our developers had to spend waiting for answers, allowing them to stay in their workflow and continue delighting customers.

While we thought RAG would enable us to better assist our developers, we knew that our solution would need to pass rigorous security and privacy bars and scale to support the volume of documentation and questions that Amazon developers generate. Amazon Q Business met those requirements as well as removed some of the duplicative work that comes with managing a separate index and large language model. Additionally, Amazon Q Business comes with some out-of-the-box APIs that would allow us to integrate it with the tools that Amazon developers use as part of their discovery workflows. Finally, those APIs come with important hooks that would let us use the context within those tools to improve information retrieval and answer relevancy.

This year, Amazon Q Business has helped tens of thousands of Amazon developers answer questions and get back to building. With Amazon Q Business connected to our internal knowledge repositories, our developers are getting unblocked quicker to deliver results for customers; we’ve been able to reduce the time it takes for Amazon developers to find answers to technical questions from hours to just a few seconds. This year, Amazon Q Business has resolved over 1 million internal Amazon developer questions, reducing time spent churning on manual technical investigations by more than 450k hours.

Closing the discovery loop

Over its 30-year history, Amazon developers have generated a vast corpus of content to help them delight their customers. This includes community-generated content such as runbooks, dashboards, service-level documentation, structured Q&A, and program information. While we have an abundance of knowledge, finding the right answer to some of our more complex technical questions has been a challenge and the act of finding information pulls developers out of their workflow. When a developer is struggling to find an answer for a specific question, there are a few popular internal resources to get support from technical experts.

  • Developers can post a question on our internal Q&A boards (a tool called Sage) and wait for a reply. These questions are often complex in nature and require specific expertise to answer. The challenge is, answers aren’t immediate and the more obscure the question, the longer it can take for an SME to review and respond.
  • Developers can find the appropriate interest channel in Slack and ask there (often channels that are set up by a team of experts to provide support for a given problem domain). These questions are similarly complex and a developer benefits from asynchronous back and forth with the SME. This route can be faster than our Q&A boards, but it can still take hours to get a reply.

In both cases, the developer needs to wait for an answer from those SMEs to unblock their task. They could transition to the next task on their sprint board but now they need to context switch to something new (and often quickly switch back once the SME finally responds). What’s more, the answers to these questions often already exist somewhere in our knowledge base but the developer couldn’t find that needle in the collective haystack of tools and repositories that we have at Amazon.

We saw an opportunity to better connect Amazon developers with answers to their questions by integrating Amazon Q Business with those tools they were already using. We wanted this solution to take on the role of a subject matter expert in tools like Sage and Slack to provide a fast answer to questions to get the developer unstuck. We ingested our internal knowledge repository consisting of millions of documents into Amazon Q Business so our developers could get answers based on data across those repositories. Then, we integrated our Amazon Q Business instance with the tools where our developers commonly ask questions. Finally, we used context inherent to the tools themselves (e.g., the Slack channel in which a user was asking a question or the specific Sage subject they were posting to) to provide more useful responses. This approach has resulted in three primary benefits:

  • Swift adoption: We integrated the Q&A capabilities of Amazon Q Business into tools like Slack and Sage to make it part of the workflows our developers use every day. It also prevented us from having to create (yet) another tool that our developers needed to visit to get answers to questions. As a result, Amazon Q Business has already answered over 1 million internal Amazon developer questions this year.
  • More precise answers: This approach has enabled better responses to developer questions via Amazon Q Business by using the context that is readily available on the tool. In this way, we can narrow the retrieval scope from potentially millions of documents. Developers report that answers are more helpful when we scope retrieval in this manner (versus when no scope is present).
  • Faster answers: Leveraging Amazon Q Business has reduced the time it takes for a developer to get an answer to seconds, getting them unblocked so they can delight customers.

The remainder of this article touches briefly on how we set up our Amazon Q Business instance and how we integrated it with downstream tools. We then go into more depth around how we leveraged tool-specific context to provide better answers in service of getting our developers unstuck faster.

Setting up Amazon Q Business: Integrating Q Business with our knowledge repositories

We took a straightforward approach to setting up our Amazon Q Business application and followed the steps outlined in the service documentation here. We staged our knowledge repositories in S3 and used the Amazon Q Business S3 Connector to ingest those documents into our index (more info on how to use those S3 connectors here). We have a lot of content that isn’t useful in retrieval – pre-processing of our documents in S3 to remove stale content allows us to reduce our repository of over ten million potentially useful documents to under four million relevant ones. We also stage in S3 to enrich content with metadata (e.g., document type, team or service name, URL hierarchy) that isn’t in the source repository in order to take advantage of more Q features (e.g., filtering, boosting).
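The pre-processing described above can be sketched as a filter-and-enrich pass over the staged documents. This is an illustration rather than our actual pipeline: the age-based staleness rule, the two-year threshold, the `SourceDocument` type, and the metadata field names are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from urllib.parse import urlparse


@dataclass
class SourceDocument:
    url: str
    last_modified: datetime
    doc_type: str


def is_stale(doc: SourceDocument, now: datetime,
             max_age: timedelta = timedelta(days=730)) -> bool:
    """Treat a document as stale if it hasn't changed within max_age (assumed rule)."""
    return now - doc.last_modified > max_age


def build_metadata(doc: SourceDocument) -> dict:
    """Derive metadata that can later be used for filtering and boosting."""
    hierarchy = [part for part in urlparse(doc.url).path.split("/") if part]
    return {
        "document_type": doc.doc_type,
        "url_hierarchy": hierarchy,
        "source_uri": doc.url,
    }


def prepare_for_ingestion(docs, now: datetime):
    """Drop stale documents and pair each remaining one with its metadata."""
    return [(doc, build_metadata(doc)) for doc in docs if not is_stale(doc, now)]


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
docs = [
    SourceDocument("https://wiki.example.com/teams/payments/runbook",
                   datetime(2024, 1, 15, tzinfo=timezone.utc), "runbook"),
    SourceDocument("https://wiki.example.com/teams/legacy/notes",
                   datetime(2019, 3, 1, tzinfo=timezone.utc), "notes"),
]
for doc, metadata in prepare_for_ingestion(docs, now):
    print(doc.url, metadata["url_hierarchy"])
```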

Integrating Amazon Q Business: Putting Q Business where users are

We leveraged a single Amazon Q Business application that hosts all our curated content to quickly release our desired experience in places where our developers ask questions. Also, leveraging one Amazon Q Business application lets us provide a consistent experience on the tools we integrate with, ensuring users are getting the same answers from the same dataset.

As we considered the UX for these integrations, our goal was that these experiences complement the existing workflow in the tools developers use. For instance, rather than increasing the cognitive load on the developer by adding a chat experience in the corner of Sage, we chose to automatically respond with answers to questions on the Q&A board. Similarly, developers can invoke Amazon Q Business on Slack in an interest channel to get answers to questions. We learned early that users have expectations around what sort of questions a bot is able to answer based on where they are interacting with it. For instance, customers were frustrated at a prototype experience on Sage that didn’t actually include the Q&A data native to that tool. Our prerequisite for onboarding new tools is to make sure we have the data for that tool and have considered other expectations users might have.

Enhancing our solution: Improving retrieval through surface-level context

By integrating Amazon Q Business directly with applications that developers use every day we’ve been able to leverage that information as context to improve retrieval accuracy and give better responses to developers. A common technical question our Amazon Q Business application answers is “How do I onboard to XX service?” Depending on the service in question (e.g., How common is the service name? Have many other actors onboarded to it?), Amazon Q Business might retrieve context outside of the authoritative set of documentation we want it to use. This can be confusing to the reader and reduce trust in the answers. However, if we know that a developer is asking that question in the XX-service@ Slack channel, we can narrow the retrieval scope to just the content related to XX-service.

As an example of where this helps, there is an internal Amazon service framework called Coral. Take a common question like “How do I maintain my Coral dashboard?” The answer to this question can vary by team, most of which maintain specific documentation about their dashboards. Asking this question without scoping will list a lot of good best practices around dashboards. However, if we know that the question is being asked in a specific team channel on Slack, it can use that context to scope to that specific team’s documentation to provide a more precise answer.

We scope documents with the Amazon Q Business filter attribute, which allows us to customize and control chat responses based on the document metadata reflected in index fields. We can break this down to the following steps:

  • We capture and associate document-specific metadata during ingestion.
  • We enable SMEs to define a topic space based on that metadata. A topic space is a collection of authoritative documents (spread across multiple repositories) specific to their knowledge domain. To use an example for an external-facing repository, a topic space could encompass all the Dynamo content on AWS Docs as specified by a root URL (like https://docs.aws.amazon.com/dynamodb). In this case, we would include all the URLs under that root in that topic space.
  • With the topic space defined, we leverage the Amazon Q Business metadata filters to restrict its RAG down at runtime to just the content in the topic space. Then, when a user asks a question about that topic space, say in a Slack interest channel on DynamoDB, the filters are applied and a domain-specific answer is generated from the SME-specified, authoritative content. Our call looks like this and the service documentation to set up your own is here.

  • In cases where the topic space does not contain an answer to the developer’s question (maybe the documentation hasn’t been updated or the question is too specific), we also added an opt-in user configuration to enable falling back to general knowledge. This ignores the filters and retries the original query to take advantage of the whole dataset.
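The scoping and fallback steps above can be sketched as follows. This is a hedged illustration rather than our production call: the `topic_space` attribute name is hypothetical, the `equalsTo` structure reflects our understanding of the Amazon Q Business attribute filter shape and should be checked against the service documentation, and `chat` is a stand-in callable for whatever function sends the question (and optional filter) to Amazon Q Business and returns the answer text, with an empty string assumed to mean no answer was found.

```python
from typing import Callable, Optional


def topic_space_filter(topic_space: str) -> dict:
    """Build an attribute filter that restricts retrieval to one topic space."""
    return {
        "equalsTo": {
            "name": "topic_space",  # hypothetical custom metadata attribute
            "value": {"stringValue": topic_space},
        }
    }


def answer_with_fallback(question: str, topic_space: str,
                         chat: Callable[[str, Optional[dict]], str],
                         allow_fallback: bool) -> str:
    """Ask within the topic space first; optionally retry without the filter."""
    scoped_answer = chat(question, topic_space_filter(topic_space))
    if scoped_answer or not allow_fallback:
        return scoped_answer
    # Opt-in fallback: ignore the filter and search the whole dataset.
    return chat(question, None)


def fake_chat(question: str, attribute_filter: Optional[dict]) -> str:
    """Stand-in for the real service call, used only to demonstrate the flow."""
    return "" if attribute_filter else "General answer from the full dataset"


print(answer_with_fallback("How do I onboard?", "dynamodb", fake_chat, True))
```

Keeping the fallback opt-in means a team that wants only authoritative, SME-curated answers in its channel never receives responses drawn from outside its topic space.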

Conclusion

This post described our process for integrating Amazon Q Business into tools where internal Amazon developers are looking for support. It also described how we used context from these tools to improve the responses we provide to customers by using the inbuilt Amazon Q Business filter attributes.

You can also leverage this approach to make it easier for your developers to find answers to questions. For more information on how to do this, along with other ways you can make Amazon Q work for you, visit Getting started with Amazon Q Business.

Alexandru Baluta

Alexandru Baluta is a Senior Software Development Engineer at Amazon Web Services leading teams that specialize in Generative AI and Knowledge Discovery. His work enables enterprise decision-making, information flow, and knowledge sharing within Amazon’s engineering community and beyond. In his free time, he enjoys exploring new places and discovering unique restaurants.

Hank Cycyota

Hank Cycyota is Product Manager at Amazon Web Services and leads the teams that help Amazonians find the information they need so they can delight customers. He loves helping his teams work backwards from ambiguous problem statements to data-driven solutions. Outside of work, Hank spends time running and trying new recipes from his library of cookbooks.

How Volkswagen Autoeuropa built a data mesh to accelerate digital transformation using Amazon DataZone

Post Syndicated from Dhrubajyoti Mukherjee original https://aws.amazon.com/blogs/big-data/how-volkswagen-autoeuropa-built-a-data-mesh-to-accelerate-digital-transformation-using-amazon-datazone/

This is a joint blog post co-authored with Martin Mikoleizig from Volkswagen Autoeuropa.

Volkswagen Autoeuropa is a Volkswagen Group plant that produces the T-Roc. The plant is located near Lisbon, Portugal, and produces about 934 cars per day. In 2023, Volkswagen Autoeuropa represented 1.3% of the national GDP of Portugal and 4% of the national export of goods, with a sales volume of 3.3511 billion euros. Volkswagen Autoeuropa aims to become a data-driven factory and has been using cutting-edge technologies to enhance its digitalization efforts.

In this post, we discuss how Volkswagen Autoeuropa used Amazon DataZone to build a data marketplace based on data mesh architecture to accelerate their digital transformation. The data mesh, built on Amazon DataZone, simplified data access, improved data quality, and established governance at scale to power analytics, reporting, AI, and machine learning (ML) use cases. As a result, the data solution offers benefits such as faster access to data, expeditious decision making, accelerated time to value for use cases, and enhanced data governance.

Understanding Volkswagen Autoeuropa’s challenges

At the time of writing this post, Volkswagen Autoeuropa has already implemented more than 15 successful digital use cases in the context of real-time visualization, business intelligence, industrial computer vision, and AI.

Before the AWS partnership, Volkswagen Autoeuropa faced the following challenges.

  • Long lead time to access data – The digital use cases launched by Volkswagen Autoeuropa spent most of their project time getting access to the data that was relevant to their use cases. After the right data for the use case was found, the IT team provided access to the data through manual configuration. The lead time to access data was often from several days to weeks.
  • Insufficient data governance and auditing – Data was shared directly with use cases by copying it. Therefore, the IT team connected the data manually from its sources to the desired destinations multiple times. This process wasn’t centrally tracked, so there was no record of whether the data had been copied in the past, how many use cases had access to the data, when access was granted, or who granted it.
  • Redundant effort to process the same information – Because the IT team copied the data sources based on the exact use case requirements, they shared specific columns of the tables from the data. As additional use cases requested access to the same data with different column requirements, even more copies of the data were created.
  • Repeated process to establish security and governance guardrails – Each time the IT and the security team provided a connection to a new data source, they had to set up the security and governance guardrails. This required repeated manual effort.
  • Data quality issues – Because the data was processed redundantly and shared multiple times, there was no guarantee of or control over the quality of the data. This led to reduced trust in the data.
  • Absence of data catalog and metadata management – Data didn’t have any metadata associated with it, and so use cases couldn’t consume the data without further explanation from the data source owners and specialists. Furthermore, no process to discover new data existed. Similar to the consumption process, use cases would consult specialists to understand the context of the data and if it could provide value.

Envisioning a data solution for Volkswagen Autoeuropa

To address these challenges, Volkswagen Autoeuropa embarked on a bold vision. They envisioned a seamless data consumption process, similar to an online shopping experience. They envisioned a data marketplace where data users could browse and access high-quality, secure data with clear specifications, business context, and relevant attributes. This vision materialized into a project aimed at transforming data accessibility and governance as the foundation for the digital ecosystem. The vision to be realized: Data as seamless as online shopping.

In collaboration with Amazon Web Services (AWS), Volkswagen Autoeuropa joined the Enhanced Plant Onboarding Program of the Global Volkswagen Group’s Digital Production Platform (DPP EPO) strategy. Through this partnership, AWS and Volkswagen Autoeuropa created a data marketplace that significantly improved data availability.

In the discovery phase of the project, Volkswagen Autoeuropa and AWS evaluated several options to build the data solution. In the end, Volkswagen Autoeuropa chose a solution based on data mesh architecture using Amazon DataZone. As a managed service, Amazon DataZone provided the necessary speed and agility to build the solution. At the same time, it led to higher operational efficiency and lower operational overhead. The team adopted a data mesh architecture because the principles of the data mesh aligned with Volkswagen Autoeuropa’s vision of being a data-driven factory.

Solution overview

This section describes the key features and architecture of the Volkswagen Autoeuropa data solution. The solution is based on a data mesh architecture.

Data solution features

The following figure shows the key capabilities of the Volkswagen Autoeuropa data solution.

The key capabilities of the solution are:

  • Data quality – In the solution, we’ve built a data quality framework to streamline the process of data quality checks and publishing quality scores. It uses AWS Glue Data Quality to generate recommendation rulesets, run orchestrated jobs, store results, and send notifications to users. This framework can be seamlessly integrated into AWS Glue jobs, providing a quality score for data pipeline jobs. In addition, the quality score is published in the Amazon DataZone data portal, allowing consumers to subscribe to the data based on its quality score. Assigning a quality score to the data helps build trust in the data, and shifts the responsibility of maintaining data quality to the data owner. As a result, the quality of the results delivered by these use cases improves.
  • Data registration – The producers sign in to the Amazon DataZone data portal using their AWS Identity and Access Management (IAM) credentials or single sign-on with integration through AWS IAM Identity Center. They register their data assets, which are stored in Amazon Simple Storage Service (Amazon S3), in the Amazon DataZone data catalog. The metadata of the data assets is stored in an AWS Glue catalog and made available in the business data catalog of Amazon DataZone and in the Amazon DataZone data source. The producers add business context such as business unit name, data owner contact information, and data refresh frequency using Amazon DataZone glossaries and metadata forms. In addition, they use generative AI capabilities to generate business metadata. After the business metadata is generated, they review the changes and modify the metadata if needed. Because all data products in Volkswagen Autoeuropa are now registered in the same location, the likelihood of data duplication is significantly reduced. Moreover, the data producers are improving the quality of the data by adding business context to it.
  • Data discovery – The consumers sign in to the Amazon DataZone data portal using their IAM credentials or single sign-on with integration through IAM Identity Center and search the data using keywords in the search bar. After the results are returned, they can further filter the results using glossary terms and project names. Finally, they review the business metadata of the data assets to evaluate if the data is relevant to their business use cases. They can check the quality score of the data assets and the refresh schedule for their use cases. With a data discovery capability in place, consumers can gain information about the data without the need to consult the source system owners or specialists.
  • Data access management – When the consumers find a data asset that’s relevant to their use case, they request access to it using the subscription feature of Amazon DataZone. Data is classified as public, internal, and confidential. For public and internal data assets, the access request is automatically approved. For confidential data assets, the data producer team reviews the access request and either accepts or rejects the subscription request. With a central place to manage data access, data owners can view which use cases have access to their data and when the access request was granted. The fine-grained access control feature of Amazon DataZone gives data owners granular control of their data at the row and column levels.
  • Data consumption – Upon approval of the subscription request, Amazon DataZone provisions the backend infrastructure to make the data accessible to the corresponding consumers. After this process is complete, the consumers can access the data through Amazon Athena using the deep link feature of Amazon DataZone. The data consumption pattern in Volkswagen Autoeuropa supports two use cases:
    • Cloud-to-cloud consumption – Both data assets and consumer teams or applications are hosted in the cloud.
    • Cloud-to-on-premises consumption – Data assets are hosted in the cloud and consumer use cases or applications are hosted on-premises.

Each use case requires access only to the data assets relevant to it, and sharing data with use cases using Amazon DataZone doesn’t require creating multiple copies. As a result, duplication and redundant processing of data are reduced. Furthermore, by reducing the number of copies of the data, the overall quality of the data products improves. In addition, the backend automation of Amazon DataZone that makes data available to use cases reduces manual effort and shortens the lead time to access data.

  • Single collaborative environment – The Amazon DataZone data portal provides a single collaborative environment to the users in Volkswagen Autoeuropa. Data consumers such as use case owners, data engineers, data scientists, and ML engineers can browse and request access to data assets. At the same time, data producers, such as use case owners and source system owners, can publish and curate their data in the Amazon DataZone data portal. This collaborative experience promotes teamwork and accelerates the realization of business value. Furthermore, the security and governance guardrails scale across the organization as the number of use cases increases.
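The approval rule described under data access management (public and internal assets are approved automatically, confidential assets go to the producer team) can be expressed as a small decision function. This is only a sketch of the policy as stated in the post; the `Classification` and `Decision` names are hypothetical, and the code does not represent how Amazon DataZone implements or configures approvals.

```python
from enum import Enum


class Classification(Enum):
    """Data classification levels named in the post."""
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"


class Decision(Enum):
    """Hypothetical outcomes for an incoming subscription request."""
    AUTO_APPROVED = "auto_approved"
    PENDING_PRODUCER_REVIEW = "pending_producer_review"


def route_subscription_request(classification: Classification) -> Decision:
    """Apply the approval policy: only confidential data needs a manual review."""
    if classification is Classification.CONFIDENTIAL:
        return Decision.PENDING_PRODUCER_REVIEW
    return Decision.AUTO_APPROVED


if __name__ == "__main__":
    for level in Classification:
        print(level.value, "->", route_subscription_request(level).value)
```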

Data solution architecture

The following figure displays the reference architecture of the data solution at Volkswagen Autoeuropa. In the next part of the post, we discuss how we arrived at the solution.

The architecture includes:

  1. The data from SAP applications, manufacturing execution systems (MES), and supervisory control and data acquisition (SCADA) systems is ingested into the producer accounts of Volkswagen Autoeuropa.
  2. In the producer account, raw data is transformed using AWS Glue. The technical metadata of the data is stored in AWS Glue catalog. The data quality is measured using the data quality framework. The data stored in Amazon Simple Storage Service (Amazon S3) is registered as an asset in the Amazon DataZone data catalog hosted in the central governance account.
  3. The central governance account hosts the Amazon DataZone domain and the related Amazon DataZone data portal. The AWS accounts of the data producers and consumers are associated with the Amazon DataZone domain. Amazon DataZone projects belonging to the data producers and consumers are created under the related Amazon DataZone domain units.
  4. Consumers of the data products sign in to the Amazon DataZone data portal hosted in the central governance account using their IAM credentials or single sign-on with integration through IAM Identity Center. They search, filter, and view asset information (for example, data quality, business, and technical metadata).
  5. After the consumer finds the asset they need, they request access to the asset using the subscription feature of Amazon DataZone. Based on the validity of the request, the asset owner approves or rejects the request.
  6. After the subscription request is granted and fulfilled, the asset is accessed in the consumer account for a one-time query using Athena and Microsoft Power BI applications hosted on premises. This consumption pattern can be extended for AI and machine learning (AI/ML) model development using Amazon SageMaker and reporting purposes using Amazon QuickSight.
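Steps 4 and 5 are performed in the data portal, but the same search and subscribe flow can also be driven programmatically. The following sketch assumes the Boto3 `datazone` client and its `search_listings` and `create_subscription_request` operations; the helper names and all identifiers in the usage comment are placeholders invented for illustration.

```python
from typing import Any, Dict, List


def find_listings(datazone: Any, domain_id: str, keyword: str) -> List[Dict[str, Any]]:
    """Search the catalog by keyword and return the published asset listings."""
    response = datazone.search_listings(domainIdentifier=domain_id, searchText=keyword)
    return [
        item["assetListing"]
        for item in response.get("items", [])
        if "assetListing" in item
    ]


def request_access(
    datazone: Any, domain_id: str, listing_id: str, project_id: str, reason: str
) -> str:
    """Create a subscription request for a consumer project and return its status."""
    response = datazone.create_subscription_request(
        domainIdentifier=domain_id,
        requestReason=reason,
        subscribedListings=[{"identifier": listing_id}],
        subscribedPrincipals=[{"project": {"identifier": project_id}}],
    )
    return response["status"]


# Usage with the real service (requires boto3 and AWS credentials; IDs are placeholders):
#   import boto3
#   datazone = boto3.client("datazone")
#   listings = find_listings(datazone, "dzd_example", "paint shop")
#   request_access(datazone, "dzd_example", listings[0]["listingId"],
#                  "consumer-project-id", "Quality dashboard use case")
```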

User journey

After discussing the desired system with the use case teams and stakeholders and analyzing the current workflow, Volkswagen Autoeuropa grouped the user personas of the data solution into three main categories: data producer, data consumer, and data solution administrator. This sets the foundation for the desired user experience and what’s needed to achieve the solution goals.

Data producer

Data producers create the data products in the data solution. There are two types of data producers.

  • Data source owners – Data source owners publish the raw data in the Amazon DataZone data portal. These data products are attributed as source-based data.
  • Use case owners – Use case owners publish data that’s fit for consumption by other use cases. These data products are called consumer-based data.

The following figure shows the user journey of a data producer:

 

A data producer’s journey includes:

  1. Identify data of interest
    1. Identify data (Volkswagen Autoeuropa network).
    2. Perform data quality checks (Volkswagen Autoeuropa network).
  2. Connect data to the data solution
    1. Ingest data into the data solution (Amazon DataZone portal).
    2. Start process to connect data using AWS Glue.
  3. Locate the data source in the data solution
    1. Register data (Amazon DataZone portal).
    2. Add data to the inventory in Amazon DataZone.
  4. Add or edit metadata
    1. Add or edit metadata (Amazon DataZone portal).
    2. Publish data assets (Amazon DataZone portal).
  5. Approve or reject subscription request
    1. Review subscription requests.
  6. Maintain data assets
    1. Manage data assets (Amazon DataZone portal).

Data consumer

Data consumers use data for business analytics, machine learning, AI, and business reporting. Data consumers are data engineers, data scientists, ML engineers, and business users. The following diagram shows the journey of a data consumer.

A data consumer’s journey includes:

  1. Access Amazon DataZone portal
    1. Amazon DataZone portal – Access is granted based on the user’s assigned domain and projects.
  2. Search for data assets
    1. Data assets in Amazon DataZone portal – Search for data and browse the results by glossary terms or the project name. Use additional filters to refine the results.
  3. View business metadata
    1. Select a data asset to see additional information – Review the description, data quality score and metadata.
  4. Request access to data (subscribe)
    1. Subscribe to request access.
    2. After the subscription request is approved, review the data products that you have access to.
    3. Query the data to view and consume the data.
  5. Retrieve additional data
    1. Repeat the steps as needed to access and retrieve additional data.

Data solution administrator

Data solution administrators are responsible for performing administrative tasks on the data solution. The following figure shows the common tasks performed by the data solution administrator.

A data administrator’s journey includes:

  1. Manage projects
    1. Manage Amazon DataZone domain.
    2. Manage Amazon DataZone projects within the domain.
  2. Manage environment
    1. Set up the environment to manage the infrastructure.
  3. Manage business metadata glossary
    1. Manage and enable Amazon DataZone glossaries and metadata forms.
  4. Manage data assets
    1. Manage assets.
    2. Query the data to view and consume the data.
  5. Manage access to data solution
    1. Monitor and revoke access when appropriate.

Conclusion

In this post, you learned how Volkswagen Autoeuropa embarked on a bold vision to become a data-driven factory and how this vision was put into action by building a data solution based on data mesh architecture using Amazon DataZone. The post highlighted the key features and architecture of the data solution and presented the user journey. As of writing this post, Volkswagen Autoeuropa has reduced the data discovery time from days to minutes using the data solution. Before the Volkswagen Autoeuropa and AWS collaboration, getting access to data took several weeks. Now, with the help of the data solution, the data access time has been reduced to several minutes.

In May 2024, the team achieved a major milestone by successfully offering data on the data solution and transporting it instantly to Power BI, a process that previously took several weeks.

“After one year of work, we did the full roundtrip from offering data on our new data marketplace built using Amazon DataZone to transporting it instantly to third-party tools, a process that previously took several weeks. This was a big achievement for our team.”

– Jorge Paulino, Product owner of the data solution. Volkswagen Autoeuropa.

The next post of the two-part series discusses how we built the solution, its technical details, and the business value created.

If you want to harness the agility and scalability of a data mesh architecture and Amazon DataZone to accelerate innovation and drive business value for your organization, we have the resources to get you started. Be sure to check out the AWS Prescriptive Guidance: Strategies for building a data mesh-based enterprise solution on AWS. This comprehensive guide covers the key considerations and best practices for establishing a robust, well-governed data mesh on AWS. From aligning your data mesh with overall business strategy to scaling the data mesh across your organization, this Prescriptive Guidance provides a clear roadmap to help you succeed.

If you’re curious to get hands-on, see the GitHub repository: Building an enterprise Data Mesh with Amazon DataZone, AWS CDK, and AWS CloudFormation. This open source project delivers a step-by-step guide to build a data mesh architecture using Amazon DataZone, AWS Cloud Development Kit (AWS CDK), and AWS CloudFormation.


About the Authors

Dhrubajyoti Mukherjee is a Cloud Infrastructure Architect with a strong focus on data strategy, data analytics, and data governance at Amazon Web Services (AWS). He uses his deep expertise to provide guidance to global enterprise customers across industries, helping them build scalable and secure AWS solutions that drive meaningful business outcomes. Dhrubajyoti is passionate about creating innovative, customer-centric solutions that enable digital transformation, business agility, and performance improvement. An active contributor to the AWS community, Dhrubajyoti authors AWS Prescriptive Guidance publications, blog posts, and open-source artifacts, sharing his insights and best practices with the broader community. Outside of work, Dhrubajyoti enjoys spending quality time with his family and exploring nature through his love of hiking mountains.

Ravi Kumar is a Data Architect and Analytics expert at Amazon Web Services; he finds immense fulfillment in working with data. His days are dedicated to designing and analyzing complex data systems, uncovering valuable insights that drive business decisions. Outside of work, he unwinds by listening to music and watching movies, activities that allow him to recharge after a long day of data wrangling.

Martin Mikoleizig studied mechanical engineering and production technology at the RWTH Aachen University before starting to work at Dr. h.c. Ing. F. Porsche AG in 2015 as a production planner for the engine assembly. In several years as a Project Manager on Testing Technology for new engine models, he also introduced several innovations like human-machine collaboration and intelligent assistance systems. From 2017, he was responsible for the Shopfloor IT team of the module lines in Zuffenhausen before he became responsible for the Planning of the E-Drive assembly at Porsche. Besides this, he was responsible for the Digitalisation Strategy of the Production Ressort at Porsche. Since October 2022, he has been assigned to Volkswagen Autoeuropa in Portugal in the role of a Digital Transformation Manager for the plant, driving the Digital Transformation towards a Data Driven Factory.

Weizhou Sun is a Lead Architect at Amazon Web Services, specializing in digital manufacturing solutions and IoT. With extensive experience in Europe, she has enhanced operational efficiencies, reducing latency and increasing throughput. Weizhou’s expertise includes Industrial Computer Vision, predictive maintenance, and predictive quality, consistently delivering top performance and client satisfaction. A recognized thought leader in IoT and remote driving, she has contributed to business growth through innovations and open-source work. Committed to knowledge sharing, Weizhou mentors colleagues and contributes to practice development. Known for her problem-solving skills and customer focus, she delivers solutions that exceed expectations. In her free time, Weizhou explores new technologies and fosters a collaborative culture.

Shameka Almond is an Advisory Consultant at Amazon Web Services. She works closely with enterprise customers to help them better understand the business impact and value of implementing data solutions, including data governance best practices. Shameka has over a decade of wide-ranging IT experience in the manufacturing and aerospace industries, and the nonprofit sector. She has supported several data governance initiatives, helping both public and private organizations identify opportunities for improvement and increased efficiency. Outside of the office she enjoys hosting large family gatherings, and supporting community outreach events dedicated to introducing students in K-12 to STEM.

Adjoa Taylor has over 20 years of experience in industrial manufacturing, providing industry and technology consulting services, digital transformation, and solution delivery. Currently Adjoa leads Product Centric Digital Transformation, enabling customers to solve complex manufacturing problems by leveraging Smart Factory and Industry leading transformation mechanisms. Most recently driving value with AI/ML and generative AI use-cases for the plant floor. Adjoa is an experienced leader spending over 20 years of her career delivering projects in countries throughout North America, Latin America, Europe, and Asia. Through prior roles, Adjoa brings deep experience across multiple business segments with a focus on business outcome driven solutions. Adjoa is passionate about helping customers solve problems while realizing the art of the possible via the right impacting value-based solution.

Modernize your legacy databases with AWS data lakes, Part 3: Build a data lake processing layer

Post Syndicated from Anoop Kumar K M original https://aws.amazon.com/blogs/big-data/modernize-your-legacy-databases-with-aws-data-lakes-part-3-build-a-data-lake-processing-layer/

This is the final part of a three-part series where we show how to build a data lake on AWS using a modern data architecture. This post shows how to process data with Amazon Redshift Spectrum and create the gold (consumption) layer. To review the first two parts of the series where we load data from SQL Server into Amazon Simple Storage Service (Amazon S3) using AWS Database Migration Service (AWS DMS) and load the data into the silver layer of the data lake, see the following:

Solution overview

Choosing the right tools and technology stack is critical to building a scalable data lake and shortening time to market. In this post, we go over the process of building a data lake, provide the rationale behind the different decisions, and share best practices for building such a data solution.

The following diagram illustrates the different layers of the data lake.

The data lake is designed to serve a multitude of use cases. In the silver layer of the data lake, the data is stored as it is loaded from sources, preserving the table and schema structure. In the gold layer, we create data marts by combining, aggregating, and enriching data as required by our use cases. The gold layer is the consumption layer for the data lake. In this post, we describe how you can use Redshift Spectrum as an API to query data.

To create data marts, we use Amazon Redshift Query Editor. It provides a web-based analyst workbench to create, explore, and share SQL queries. In our use case, we use Redshift Query Editor to create data marts using SQL code. We also use Redshift Spectrum, which allows you to efficiently query and retrieve structured and semi-structured data from files stored on Amazon S3 without having to load the data into the Redshift tables. The Apache Iceberg tables, which we created and cataloged in Part 2, can be queried using Redshift Spectrum. For the latest information on Redshift Spectrum integration with Iceberg, see Using Apache Iceberg tables with Amazon Redshift.

We also show how to use RedshiftDataAPIService to run SQL commands to query the data mart using a Boto3 Python SDK. You can use the Redshift Data API to create the resulting datasets on Amazon S3, and then use the datasets in use cases such as business intelligence dashboards and machine learning (ML).

In this post, we walk through the following steps:

  1. Set up a Redshift cluster.
  2. Set up a data mart.
  3. Query the data mart.

Prerequisites

To follow the solution, you need to set up certain access rights and resources:

  • An AWS Identity and Access Management (IAM) role for the Redshift cluster with access to an external data catalog in AWS Glue and data files in Amazon S3 (these are the data files populated by the silver layer in Part 2). The role also needs Redshift cluster permissions. This policy must include permissions to do the following:
    • Run SQL commands to copy, unload, and query data with Amazon Redshift.
    • Grant permissions to run SELECT statements for related services, such as Amazon S3, Amazon CloudWatch logs, Amazon SageMaker, and AWS Glue.
    • Manage AWS Lake Formation permissions (in case the AWS Glue Data Catalog is managed by Lake Formation).
  • An IAM execution role for AWS Lambda with permissions to access Amazon Redshift and AWS

For more information about setting up IAM roles for Redshift Spectrum, see Getting started with Amazon Redshift Spectrum.

Set up a Redshift cluster

Redshift Spectrum is a feature of Amazon Redshift that queries data stored in Amazon S3 directly, without having to load it into Amazon Redshift. In our use case, we use Redshift Spectrum to query Iceberg data stored as Parquet files on Amazon S3. To use Redshift Spectrum, we first need a Redshift cluster to run the Redshift Spectrum compute jobs. Complete the following steps to provision a Redshift cluster:

  1. On the Amazon Redshift console, choose Clusters in the navigation pane.
  2. Choose Create cluster.
  3. For Cluster identifier, enter a name for your cluster.
  4. For Choose the size of the cluster, select I’ll choose.
  5. For Node type, choose xlplus.
  6. For Number of nodes, enter 1.
  7. For Admin password, select Manage admin credentials in AWS Secrets Manager if you want to use Secrets Manager; otherwise, you can generate and store the credentials manually.
  8. For the IAM role, choose the IAM role created in the prerequisites.
  9. Choose Create cluster.

We chose the cluster Availability Zone, number of nodes, compute type, and size for this post to minimize costs. If you’re working on larger datasets, we recommend reviewing the different instance types offered by Amazon Redshift to select the one that is appropriate for your workloads.
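If you prefer to script the provisioning instead of using the console, the same choices map onto the Boto3 Redshift `create_cluster` call. The sketch below only assembles the request parameters. The `cluster_settings` helper, the `awsuser` admin name, and the full node type name `ra3.xlplus` (the post says only "xlplus") are assumptions made for illustration.

```python
from typing import Any, Dict


def cluster_settings(
    cluster_id: str,
    iam_role_arn: str,
    admin_user: str = "awsuser",      # placeholder admin user name
    node_type: str = "ra3.xlplus",    # assumed full name of the "xlplus" node type
    nodes: int = 1,
) -> Dict[str, Any]:
    """Assemble keyword arguments for redshift.create_cluster()."""
    settings: Dict[str, Any] = {
        "ClusterIdentifier": cluster_id,
        "NodeType": node_type,
        "MasterUsername": admin_user,
        # Let Amazon Redshift create and store the admin password in Secrets Manager.
        "ManageMasterPassword": True,
        "IamRoles": [iam_role_arn],
    }
    if nodes == 1:
        # A single-node cluster is requested by type, without NumberOfNodes.
        settings["ClusterType"] = "single-node"
    else:
        settings["ClusterType"] = "multi-node"
        settings["NumberOfNodes"] = nodes
    return settings


# Usage with the real service (requires boto3 and AWS credentials; values are placeholders):
#   import boto3
#   redshift = boto3.client("redshift")
#   redshift.create_cluster(**cluster_settings(
#       "datalake-gold", "arn:aws:iam::123456789012:role/MyRedshiftRole"))
```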

Set up a data mart

A data mart is a collection of data organized around a specific business area or use case, providing focused and quickly accessible data for analysis or consumption by applications or users. Unlike a data warehouse, which serves the entire organization, a data mart is tailored to the specific needs of a particular department, allowing for more efficient and targeted data analysis. In our use case, we use data marts to create aggregated data from the silver layer and store it in the gold layer for consumption. We use the schema HumanResources in the AdventureWorks sample database we loaded in Part 1. This database contains a factory’s employee shift information for different departments. We use this database to create a summary of the shift rate changes for different departments, years, and shifts to see which years had the most rate changes.

We recommend using the auto mount feature in Redshift Spectrum. This feature removes the need to create an external schema in Amazon Redshift to query tables cataloged in the Data Catalog.

Complete the following steps to create a data mart:

  1. On the Amazon Redshift console, choose Query editor v2 in the navigation pane.
  2. Choose the cluster you created and choose AWS Secrets Manager or Database username and password depending on how you chose to store the credentials.
  3. After you’re connected, open a new query editor.

You will be able to see the AdventureWorks database under awsdatacatalog. You can now start querying the Iceberg database in the query editor.

If you encounter permission issues, choose the options menu (three dots) next to the cluster, choose Edit connection, and connect using Secrets Manager or your database user name and password. Then grant privileges for the IAM user or role with the following command, and reconnect with your IAM identity:

GRANT USAGE ON DATABASE awsdatacatalog to "IAMR:MyRole"

For more information, see Querying the AWS Glue Data Catalog.

Next, you create a local schema to store the definition and data for the view.

  1. On the Create menu, choose Schema.
  2. Provide a name and set the type as local.
  3. For the data mart, create a dataset that combines different tables in the silver layer to generate a report of the total shift rate changes by department, year, and shift. The following SQL code will return the required dataset:
SELECT dep.name AS "Department Name",
extract(year from emp_pay_hist.ratechangedate) AS "Rate Change Year",
shift.name AS "Shift",
COUNT(emp_pay_hist.rate) AS "Rate Changes"
FROM "dev"."{redshift_schema_name}"."department" dep
INNER JOIN "dev"."{redshift_schema_name}"."employeedepartmenthistory" emp_hist
ON dep.departmentid = emp_hist.departmentid
INNER JOIN "dev"."{redshift_schema_name}"."employeepayhistory" emp_pay_hist
ON emp_pay_hist.businessentityid = emp_hist.businessentityid
INNER JOIN "dev"."{redshift_schema_name}"."employee" emp
ON emp_hist.businessentityid = emp.businessentityid
INNER JOIN "dev"."{redshift_schema_name}"."shift" shift
ON emp_hist.shiftid = shift.shiftid
WHERE emp.currentflag = 'true'
GROUP BY dep.name, extract(year from emp_pay_hist.ratechangedate), shift.name;
  4. Create an internal schema where you want Amazon Redshift to store the view definition:

CREATE SCHEMA IF NOT EXISTS {internal_schema_name};

  5. Create a view in Amazon Redshift that you can query to get the dataset:
CREATE OR REPLACE VIEW {internal_schema_name}.rate_changes_by_department_year AS
SELECT dep.name AS "Department Name",
extract(year from emp_pay_hist.ratechangedate) AS "Rate Change Year",
shift.name AS "Shift",
COUNT(emp_pay_hist.rate) AS "Rate Changes"
FROM "dev"."{redshift_schema_name}"."department" dep
INNER JOIN "dev"."{redshift_schema_name}"."employeedepartmenthistory" emp_hist
ON dep.departmentid = emp_hist.departmentid
INNER JOIN "dev"."{redshift_schema_name}"."employeepayhistory" emp_pay_hist
ON emp_pay_hist.businessentityid = emp_hist.businessentityid
INNER JOIN "dev"."{redshift_schema_name}"."employee" emp
ON emp_hist.businessentityid = emp.businessentityid
INNER JOIN "dev"."{redshift_schema_name}"."shift" shift
ON emp_hist.shiftid = shift.shiftid
WHERE emp.currentflag = 'true'
GROUP BY dep.name, extract(year from emp_pay_hist.ratechangedate), shift.name
WITH NO SCHEMA BINDING;

If the SQL takes a long time to run or produces a large result set, consider using Redshift materialized views. Unlike regular views, which are computed in the moment, the results from materialized views can be pre-computed and stored on Amazon S3. When the data is requested, Amazon Redshift can point to an Amazon S3 location where the results are stored. Materialized views can be refreshed on demand and on a schedule.

Query the data mart

Lastly, we query the data mart using a Lambda function to show how the data can be retrieved using an API. The Lambda function requires an IAM role to access Secrets Manager, where the Redshift user credentials are stored. We use the Redshift Data API to retrieve the dataset we created in the previous step. First, we call execute_statement() to run a query against the view. Next, we check the status of the run by calling describe_statement(). Finally, when the statement has successfully run, we call get_statement_result() to get the result set. The Lambda function shown in the following code implements this logic and returns the result set from querying the view rate_changes_by_department_year:

import time

import boto3

def lambda_handler(event, context):
    client = boto3.client('redshift-data')

    # Use the Redshift Data API execute_statement call to query the data mart
    response = client.execute_statement(
        ClusterIdentifier='{redshift cluster name}',
        Database='dev',
        SecretArn='{redshift cluster secrets manager secret arn}',
        Sql='select * from {internal_schema_name}.rate_changes_by_department_year',
        StatementName='query data mart'
    )

    statement_id = response["Id"]
    result_set = []

    # Poll the status of the SQL statement; once it has finished, retrieve the result set
    while True:
        status = client.describe_statement(Id=statement_id)["Status"]

        if status == "FINISHED":
            print("SQL statement has finished successfully and we can get the result set")

            response = client.get_statement_result(Id=statement_id)
            columns = response["ColumnMetadata"]
            results = response["Records"]
            # Follow NextToken until every page of the result set has been collected
            while "NextToken" in response:
                response = client.get_statement_result(
                    Id=statement_id,
                    NextToken=response["NextToken"]
                )
                results.extend(response["Records"])

            # Header row built from the four column labels of the view
            result_set.append(",".join(str(column.get("label")) for column in columns[:4]))

            # One comma-separated row per record: name, year, shift, count
            for result in results:
                result_set.append(str(result[0].get("stringValue")) + "," + str(result[1].get("longValue")) + "," + str(result[2].get("stringValue")) + "," + str(result[3].get("longValue")))
            break

        # If the statement failed or was aborted, stop without retrieving a result set
        if status in ("ABORTED", "FAILED"):
            print("SQL statement has failed or aborted")
            break

        # To avoid spamming the API with status requests, wait 2 seconds between calls
        print("Query Status ::" + status)
        time.sleep(2)

    return {
        'statusCode': 200,
        'body': result_set
    }

The Redshift Data API allows you to access data from many different types of traditional, cloud-based, containerized, web service-based, and event-driven applications. The API is available in many programming languages and environments supported by the AWS SDK, such as Python, Go, Java, Node.js, PHP, Ruby, and C++. For larger datasets that don’t fit into memory, such as ML training datasets, you can use the Redshift UNLOAD command to move the results of the query to an Amazon S3 location.
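The Lambda function above hard-codes which typed key (stringValue or longValue) to read for each column. As a minimal sketch, assuming each field returned by get_statement_result() is a dictionary holding a single typed key such as stringValue, longValue, doubleValue, booleanValue, or isNull, the following helpers build the same comma-separated rows without knowing the column types in advance. The function names field_value and records_to_csv_rows and the sample data are illustrative and not part of the Data API.

```python
def field_value(field):
    """Return the Python value held by one Data API field dictionary."""
    if field.get("isNull"):
        return None
    for key in ("stringValue", "longValue", "doubleValue", "booleanValue"):
        if key in field:
            return field[key]
    # Other types (for example blobValue) are treated as empty in this sketch
    return None


def records_to_csv_rows(column_metadata, records):
    """Build a header row plus one comma-separated row per record.

    Values are joined as-is; commas inside values are not escaped.
    """
    rows = [",".join(str(col.get("label")) for col in column_metadata)]
    for record in records:
        values = [field_value(field) for field in record]
        rows.append(",".join("" if v is None else str(v) for v in values))
    return rows


# Made-up sample shaped like a get_statement_result() response
sample_columns = [{"label": "Department Name"}, {"label": "Rate Changes"}]
sample_records = [
    [{"stringValue": "Engineering"}, {"longValue": 3}],
    [{"isNull": True}, {"longValue": 0}],
]
print(records_to_csv_rows(sample_columns, sample_records))
```

In the Lambda function, the two loops that build the header and data rows could then be replaced with a single call that passes the ColumnMetadata and Records collected from the paginated responses.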

Clean up

In this post, you created an IAM role, Redshift cluster, and Lambda function. To clean up your resources, complete the following steps:

  1. Delete the IAM role:
    1. On the IAM console, choose Roles in the navigation pane.
    2. Select the role and choose Delete.
  2. Delete the Redshift cluster:
    1. On the Amazon Redshift console, choose Clusters in the navigation pane.
    2. Select the cluster you created and on the Actions menu, choose Delete.
  3. Delete the Lambda function:
    1. On the Lambda console, choose Functions in the navigation pane.
    2. Select the function you created and on the Actions menu, choose Delete.

Conclusion

In this post, we showed how you can use Redshift Spectrum to create data marts on top of the data in your data lake. Redshift Spectrum can query Iceberg data stored in Amazon S3 and cataloged in AWS Glue. You can create views in Amazon Redshift that compute the results from the underlying data on demand, or pre-compute results and store them (using materialized views). Lastly, the Redshift Data API is a great tool for running SQL queries on the data lake from a wide variety of sources.

For more insights into the Redshift Data API and how to use it, refer to Using the Amazon Redshift Data API to interact with Amazon Redshift clusters. To continue to learn more about building a modern data architecture, refer to Analytics on AWS.


About the Authors

Shaheer Mansoor is a Senior Machine Learning Engineer at AWS, where he specializes in developing cutting-edge machine learning platforms. His expertise lies in creating scalable infrastructure to support advanced AI solutions. His focus areas are MLOps, feature stores, data lakes, model hosting, and generative AI.

Anoop Kumar K M is a Data Architect at AWS with a focus on data and analytics. He helps customers build scalable data platforms and shape their enterprise data strategy. His areas of interest are data platforms, data analytics, security, file systems, and operating systems. Anoop loves to travel and enjoys reading books in the crime fiction and financial domains.

Sreenivas Nettem is a Lead Database Consultant at AWS Professional Services. He has experience working with Microsoft technologies with a specialization in SQL Server. He works closely with customers to help migrate and modernize their databases to AWS.

How BMW streamlined data access using AWS Lake Formation fine-grained access control

Post Syndicated from Ruben Simon original https://aws.amazon.com/blogs/big-data/how-bmw-streamlined-data-access-using-aws-lake-formation-fine-grained-access-control/

This post is cowritten with Ruben Simon and Khalid Al Khalili from BMW.

BMW’s ambition is to continuously accelerate innovation and improve decision-making across their global operations. To achieve this, they aimed to break down data silos and centralize data from various business units and countries into the BMW Cloud Data Hub (CDH). The CDH is used to create, discover, and consume data products through a central metadata catalog, while enforcing permission policies and tightly integrating data engineering, analytics, and machine learning services to streamline the user journey from data to insight. By building the CDH, BMW realized improved efficiency, performance and sustainability throughout the automotive lifecycle, from design to after-sales services.

With over 10 PB of data across 1,500 data assets, 1,000 data use cases, and more than 9,000 users, the BMW CDH has become a resounding success since BMW decided to build it in a strategic collaboration with Amazon Web Services (AWS) in 2020. However, the initial version of the CDH supported only coarse-grained access control to entire data assets, so it was not possible to scope access to subsets of a data asset. This led to inefficiencies in data governance and access control.

AWS Lake Formation is a service that streamlines and centralizes the data lake creation and management process. One of its key features is fine-grained access control, which allows customers to granularly control access to their data lake resources at the table, column, and row levels. This level of control is essential for organizations that need to comply with data governance and security regulations, or those that deal with sensitive data.

With fine-grained access control, customers can define and enforce data access policies based on various criteria, such as user roles, data classifications, or data sensitivity levels. This not only makes sure that only authorized users or applications can access specific datasets or portions of data, but also reduces the risk of unauthorized access or data breaches. Additionally, Lake Formation integrates with AWS Identity and Access Management (IAM) and other AWS services so customers can use existing security and access management practices within their data lake environment.

This post explores how BMW implemented AWS Lake Formation’s fine-grained access control (FGAC) in the CDH and how this saves them up to 25% on compute and storage costs.

The Solution: How BMW CDH solved data duplication

The CDH is a company-wide data lake built on Amazon Simple Storage Service (Amazon S3). It serves as a centralized repository for petabytes of data from engineering, manufacturing, sales, and vehicle performance. It provides BMW employees with a unified view of the organization and acts as a starting point for new development initiatives. The CDH streamlines access to various AWS services, including Amazon QuickSight for building business intelligence (BI) dashboards and Amazon Athena for exploring data. Many of these services are embedded into the CDH data portal, which offers a web-based user interface for accessing and interacting with the platform. It allows users to discover datasets, manage data assets, and consume data for their use cases. The architecture is shown in the following figure.

The BMW CDH follows a decentralized, multi-account architecture to foster agility, scalability, and accountability. It comprises distinct AWS account types, each serving a specific purpose. The following account types are relevant for implementation:

  • Resource accounts: Used as centralized storage repositories, hosting the datasets and their associated metadata across different stages (such as development, integration, and production) and AWS Regions.
  • Consumer accounts: Used by data consumers to implement use cases, derive insights, and build applications tailored to their business needs.
  • CDH control plane account: This account contains the APIs for creating filter packages and controlling access. A filter package provides a restricted view of a data asset by defining column and row filters on the tables.

The following are the three key roles within the CDH’s decentralized architecture:

  • Data providers, who provision data assets in resource accounts
  • Data stewards, who govern data assets
  • Use cases (data consumers), which use data assets to derive insights and build applications inside of consumer accounts to support decision-making processes.

For example, a global sales dataset is created by a team of data engineers with the data provider role. A data analyst in a local market who wants to derive insights from the global sales data can create a use case with a dedicated AWS consumer account and request access to the dataset from a data steward.

This multi-account strategy promotes a clear separation of concerns, empowering data producers and consumers to operate independently while using the centralized governance and services provided by the solution. The following figure illustrates how Lake Formation is used across the resource and consumer accounts in the CDH to provide FGAC to use cases.

The CDH uses the AWS Glue Data Catalog in resource accounts as a technical metadata catalog, and data assets are stored in Amazon S3. Both the data catalog and the locations in Amazon S3 are registered with Lake Formation so that it can govern data access. Data catalogs and tables are shared with consumer accounts and use cases through AWS Resource Access Manager (AWS RAM). With Lake Formation, BMW can control access to data assets at different granularities, such as permissions at the table, column, or row level. Users can then use a Lake Formation integrated engine such as Amazon Athena to access only the data they need, removing the need to duplicate data. For example, to restrict access to a global sales data asset, BMW can now specify row filters in Lake Formation using the PartiQL language, filtering rows based on the country column of the data asset.

Data stewardship: Managing fine-grained access control

At the core of the CDH FGAC implementation lies the concept of filter packages. A filter package provides a selective view of a data asset by defining column and row filters on the tables. Multiple filter packages can be defined for a data asset to create suitable views for different use cases. In our example of the global sales dataset, a data steward creates a filter package for each local market that restricts access to the relevant rows and columns. Data stewards create and manage these packages through the CDH interface. These filter packages are implemented using Lake Formation row-level and column-level access control mechanisms. The following figure illustrates these concepts.

When creating a filter package, data stewards can specify the desired access level for individual tables within their data asset: Full access grants permissions to all columns and rows, None denies access to an entire table, while Filtered allows for granular row-level and column-level access controls.

For filtered access, data stewards use PartiQL queries to define row-level filters on tables, selecting only the rows that meet specific criteria. Additionally, they can specify column-level filters by selecting the accessible columns.
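To illustrate the three access levels, the following sketch maps one table entry of a filter package to the TableData argument of the Lake Formation CreateDataCellsFilter API. The CDH implementation itself is not shown in this post, so the function name, the access-level strings, and the example account ID, database, table, filter, and column names are all assumptions made for illustration.

```python
def build_data_cells_filter(catalog_id, database, table, name,
                            access_level, row_filter=None, columns=None):
    """Return a TableData dict for create_data_cells_filter, or None.

    access_level is "full", "filtered", or "none" (hypothetical labels
    mirroring the Full, Filtered, and None levels described above).
    """
    if access_level == "none":
        return None  # no filter is created, so no access is granted

    if access_level not in ("full", "filtered"):
        raise ValueError(f"unknown access level: {access_level}")

    table_data = {
        "TableCatalogId": catalog_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": name,
        # Defaults expose every row and every column
        "RowFilter": {"AllRowsWildcard": {}},
        "ColumnWildcard": {},
    }

    if access_level == "filtered":
        if row_filter:
            # PartiQL predicate selecting the visible rows
            table_data["RowFilter"] = {"FilterExpression": row_filter}
        if columns:
            # An explicit column list replaces the wildcard
            del table_data["ColumnWildcard"]
            table_data["ColumnNames"] = list(columns)

    return table_data


# Hypothetical filter for one local market of a global sales table
german_sales = build_data_cells_filter(
    catalog_id="111122223333",
    database="sales",
    table="global_sales",
    name="sales_de",
    access_level="filtered",
    row_filter="country = 'DE'",
    columns=["order_id", "country", "revenue"],
)
print(german_sales)
```

With AWS credentials available, the resulting dictionary could be passed to boto3.client("lakeformation").create_data_cells_filter(TableData=german_sales), and the filter could then be granted to a consumer principal.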

After filter packages have been created and published, they can be requested. Data stewards can review incoming requests and grant or deny access through the CDH interface, making sure that only authorized environments can access sensitive data.

Using fine-grained access control in use cases

Use case owners can browse and search for relevant data assets in the CDH, and then request full or scoped access. The CDH provides a clear overview of the available filter packages, allowing them to select the appropriate level of access based on their use case.

After access is granted to a filter package by the data steward, the filters are enforced for the use case using Lake Formation. Use case owners can further control access at the row and column level for individual users or roles within their use case account using Lake Formation. For example, they can create another column filter to hide a particular column for a particular group of users and provide unfiltered access to another group of users.

Gradual deployment with Lake Formation hybrid access mode

One of the challenges in implementing changes in access control within an existing data lake such as the CDH is the need to coordinate migration between data providers and consumers. To address this, Lake Formation offers a hybrid access mode to facilitate a gradual transition to FGAC without disrupting existing data access patterns.

In hybrid access mode, data providers can activate Lake Formation for new dataset consumers while existing consumers continue to access the data using the legacy permission model. This approach lets consumers migrate to FGAC at their own pace, minimizing the impact on their existing workloads and processes. A use case account is only switched to Lake Formation permissions for a dataset when it requests access to a filter package, which keeps the transition to the new access control model smooth for providers and consumers alike.

How BMW saves money by using Lake Formation

As the CDH grew, it became apparent that data was often duplicated for access control purposes. This issue was particularly evident with data assets containing sales data of all markets where BMW operates. Local markets were only eligible to see their own data, and to achieve this, subsets of global data assets had to be duplicated to create isolated local variants. While this approach succeeded in fulfilling access control requirements, it led to increased storage costs, higher compute expenses for data processing and drift detection, and project delays because of time-consuming provisioning processes and governance overhead. At one point, 25% of all data assets in the CDH were duplicates, a natural consequence of these measures.

With Lake Formation, creating these duplicates is no longer necessary. Data stewards can restrict access to global datasets at the column and row level to comply with governance requirements. Not only does this reduce the cost of data processing, storage, development, and maintenance, it also minimizes the opportunity cost of delayed data access.

Conclusion

By using AWS Lake Formation fine-grained access control capabilities, BMW has transparently implemented finer data access management within the Cloud Data Hub. The integration of Lake Formation has enabled data stewards to scope and grant granular access to specific subsets of data, reducing costly data duplication. This approach enables BMW to save up to 25% on compute and storage costs while reducing governance overhead costs. The hybrid access mode implementation further facilitates a smooth transition to the new access control model, allowing data providers and consumers to migrate at their own pace without disrupting existing workloads and processes. To dive deeper into how to replicate BMW’s data success story, check out the AWS blog post on building a data mesh with AWS Lake Formation and AWS Glue.


About the authors

Ruben Simon is a Head of Product for BMW’s Cloud Data Hub, the company’s largest data platform. He is passionate about driving digital transformation in data, analytics, and AI, and thrives on collaborating with international teams. Outside the office, Ruben cherishes family time and has a keen interest in continual learning.

Khalid Al Khalili is a Data Architect at BMW Group, leading the architecture of the Cloud Data Hub, BMW’s central platform for data innovation. He is a strong advocate for creating seamless data experiences, transforming complex requirements into efficient, user-friendly solutions. When he’s not building new features, Khalid enjoys collaborating with his peers and cross-functional teams to advance and shape BMW’s data strategy, ensuring it stays ahead in a rapidly evolving landscape.

Florian Seidel is a Global Solutions Architect specializing in the automotive sector at AWS. He guides strategic customers in harnessing the full potential of cloud technologies to drive innovation in the automotive industry. With a passion for analytics, machine learning, AI, and resilient distributed systems, Florian helps transform cutting-edge concepts into practical solutions. When not architecting cloud strategies, he enjoys cooking for family and friends and experimenting with electronic music production.

Aishwarya Lakshmi Krishnan is a Senior Customer Solutions Manager with AWS Automotive. She is passionate about solving business problems using generative AI and cloud-based technologies.

Durga Mishra is a Principal Solutions Architect at AWS. Outside of work, Durga enjoys building new things, spending time with family, hiking on Appalachian trails, and being out in nature.

How Getir unleashed data democratization using a data mesh architecture with Amazon Redshift

Post Syndicated from Asser Moustafa original https://aws.amazon.com/blogs/big-data/how-getir-unleashed-data-democratization-using-a-data-mesh-architecture-with-amazon-redshift/

This blog post is co-written with Pinar Yasar from Getir.

Amazon Redshift is a fully managed cloud data warehouse that’s used by tens of thousands of customers for price-performance, scale, and advanced data analytics. Amazon Redshift enables data warehousing by seamlessly integrating with other data stores and services in the modern data organization through features such as Zero-ETL, data sharing, streaming ingestion, data lake integration, and Redshift ML.

In this post, we explain how ultrafast delivery pioneer, Getir, unleashed the power of data democratization on a large scale through their data mesh architecture using Amazon Redshift.

We start by introducing Getir and their vision—to seamlessly, securely, and efficiently share business data across different teams within the organization for BI, extract, transform, and load (ETL), and other use cases. We’ll then explore how Amazon Redshift data sharing powered the data mesh architecture that allowed Getir to achieve this transformative vision. We will also explain how Getir’s data mesh architecture enabled data democratization, shorter time-to-market, and cost-efficiencies. Next, we’ll provide a broader overview of modern data trends reinforced by Getir’s vision. In conclusion, we’ll offer some thoughts on how you can apply a similar approach to eliminate costly and barrier-inducing data silos using Amazon Redshift.

Who is Getir?

Getir is an ultrafast delivery pioneer that revolutionized last-mile delivery in 2015 with its 10-minute grocery delivery proposition. Getir’s story started in Istanbul, and they have launched multiple products since inception: GetirFood, GetirMore, GetirWater, GetirLocals, GetirBitaksi (taxi service), GetirDrive (car rental service), and GetirJobs (recruitment).

Getir serves dozens of cities throughout the world with more than 30,000 employees. The following figure shows the Getir app.

Figure 1: Getir app


Overview of Getir’s main use case

Getir’s business is characterized by a tremendous volume of data generation and growth, in addition to ample opportunities to gain valuable insights. However, siloing this data and creating friction for teams trying to access the information they needed wasn’t a viable option. Allowing teams to duplicate data wherever required can be an anti-pattern, leading to operational complexity, cost overruns, and fragile data storage bloat.

Similarly, relying on dedicated teams to create data extracts or insights for downstream consumers introduces bottlenecks, stifles innovation, and increases the time-to-market. This approach isn’t optimal for a data-driven organization like Getir, which needs to empower its teams with seamless access to the information they require to drive the business forward. The various business lines within the organization made it abundantly clear that they wanted unfettered access to the company’s entire data ecosystem in a secure, cost-efficient, near real-time, and well-governed manner.

Furthermore, the organization was anticipating the emergence of data-as-a-service and generative AI use cases in the near future. This would necessitate the ability to securely share and potentially monetize the company’s data with external partners, such as franchises.

Overview of Getir’s use of Amazon Redshift and modern data architecture

To strike a balance that addresses these concerns and enables Getir teams to effectively use the wealth of data to generate meaningful insights and drive strategic decision-making across the organization, we chose a data mesh architecture.

Getir’s data analytics environment encompasses hundreds of terabytes of data, thousands of tables, and billions upon billions of data rows. Additionally, it processes millions of messaging events daily, all of which must be ingested, refined, and made available to analysts querying multiple Amazon Redshift warehouses. The end-to-end service level agreements (SLAs) for this data ecosystem can be extremely aggressive, with requirements that can be as stringent as single-digit minutes to single-digit seconds. This underscores the scale and complexity of Getir’s data analytics capabilities, which must operate with the utmost efficiency and responsiveness to meet the demands of the business. We were able to easily implement the envisioned data mesh architecture using Amazon Redshift’s native data sharing capabilities.

Figure 2: Data mesh architecture using Amazon Redshift data sharing


As the preceding diagram shows, at the heart of Getir’s architecture was an ETL Redshift data warehouse that was used for various data sets from all over the organization, creating a refined 360-degree view of critical assets. It was also a producer for downstream Redshift data warehouses.

The demand was quite heavy on this main ETL cluster, so we relied on data sharing to isolate noisy workloads on a different Redshift data warehouse without having to duplicate the data on the main ETL cluster.

Using Redshift data sharing, individual business line teams could now rely solely on their dedicated Redshift cluster to provide them not only with their own data and analytics capabilities, but also with the refined 360-degree views of data generated from all over the organization, without any data duplication or overstepping compute boundaries. BI analysts gained access to all of the data they needed to power their most complex dashboards with consistent performance free of noisy jobs. Additional warehouses were integrated into the data mesh for visualization, reporting, and machine learning.

Another benefit of Amazon Redshift data sharing and the data mesh architecture was the relative ease with which we were able to maintain a chargeback model for ensuring costs were spread fairly across different teams.

Finally, the data sharing capability also enabled the seamless propagation of newly created tables within a schema to the subscribed consumers.
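The producer and consumer setup described in this section can be sketched with the Amazon Redshift data sharing SQL commands. The following helpers only assemble the statements; the datashare, schema, and database names and the namespace IDs are placeholders, not Getir’s actual configuration. SET INCLUDENEW = TRUE is the setting that adds tables created later in the schema to the datashare automatically.

```python
def producer_statements(datashare, schema, consumer_namespace):
    """SQL to run on the producer (for example, the main ETL warehouse)."""
    return [
        f"CREATE DATASHARE {datashare};",
        f"ALTER DATASHARE {datashare} ADD SCHEMA {schema};",
        f"ALTER DATASHARE {datashare} ADD ALL TABLES IN SCHEMA {schema};",
        # Share tables created in this schema in the future as well
        f"ALTER DATASHARE {datashare} SET INCLUDENEW = TRUE FOR SCHEMA {schema};",
        f"GRANT USAGE ON DATASHARE {datashare} "
        f"TO NAMESPACE '{consumer_namespace}';",
    ]


def consumer_statements(datashare, local_database, producer_namespace):
    """SQL to run on a consumer warehouse to expose the shared data."""
    return [
        f"CREATE DATABASE {local_database} FROM DATASHARE {datashare} "
        f"OF NAMESPACE '{producer_namespace}';",
    ]


# Placeholder names and namespace IDs for illustration only
for statement in producer_statements(
    "refined_share", "refined", "<consumer-namespace-id>"
):
    print(statement)
for statement in consumer_statements(
    "refined_share", "refined_db", "<producer-namespace-id>"
):
    print(statement)
```

Because the consumer queries the shared objects in place, no copy of the data is created on the consumer warehouse, and each warehouse uses its own compute.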

Modern data trends reinforced by Getir’s case study

Getir’s case study showcases the strategic uses of a data mesh architecture and Amazon Redshift, but more importantly provides tremendous insights into five key trends across all industries as modern data organizations move away from costly data silos that hinder collaboration, business insights, and time-to-market. As highlighted in the following diagram, those trends are (1) interconnected, purpose-built data stores that enable users to access data regardless of its physical location, (2) data democratization empowering users with self-service analytics capabilities, (3) real-time insights to drive greater value from data, (4) resilient data services ensuring business continuity, and (5) leveraging generative AI to extract even deeper insights from data more expeditiously.

Figure 3: Key trends in the modern data organization reinforced by Getir's use case and solution


As Getir showed, the modern data organization is adopting data architectures that democratize data securely and enable self-service analytics. To realize data’s true potential, the modern data organization has progressed beyond basic dashboarding and reporting on limited, point-in-time data sets, and evolved to use more sophisticated ETL processes that can ingest data from diverse sources. Near real-time analytics in addition to predictive models have become standard fare, significantly reducing the time to actionable insights.

Furthermore, the data landscape has been democratized to empower analysts in numerous ways through the rise of transactional data lakes powered by open table formats such as Apache Iceberg and the assistance of generative AI. This holistic approach has elevated data organizations’ capabilities well beyond traditional reporting, unlocking greater business value from the wealth of data available.

Using generative AI with data mesh architecture

In addition to the five key trends previously mentioned, the present-day data landscape is characterized by three key facts that are leading data organizations like Getir to increasingly harness the power of generative AI to drive the next evolution of data-informed decision-making.

  • Data is an organization’s most valuable asset and the ability to effectively use data is central to an organization’s success and growth.
  • Data analytics and insights are absolutely crucial to strengthening and expanding the business. Deriving meaningful insights from data is essential for making informed, strategic decisions.
  • Democratizing data and enabling self-service analytics can greatly expand the range of business insights, while reducing the time to market for those insights. Empowering users across the organization to access and analyze data can unlock tremendous value.

Generative AI’s ability to respond to natural language prompts, explore and analyze complex data, and summarize lengthy content makes it a valuable tool for translating large amounts of data into valuable insights. However, the true potential of generative AI for organizations lies in Retrieval Augmented Generation (RAG).

Out of the box, generative AI models start with a relatively generic knowledge base, which can lead to unreliable or inaccurate information. RAG addresses this by introducing the model to additional datasets that are specific to the organization or context. This allows generative AI models to produce far more accurate, attributable, and highly contextualized outputs to support decision-making.

Data mesh architecture can play a crucial role in enabling and facilitating RAG. By facilitating access to multiple data sources within the organization, the data mesh provides the necessary fuel for the generative AI model to draw from, resulting in more reliable and insightful information. This, in turn, empowers data-driven decision-making and helps organizations harness the full potential of their data assets.

Conclusion

In this post, we examined how Getir implemented a data mesh architecture and Amazon Redshift data sharing to meet their evolving data requirements. This entailed dedicated data warehouses tailored to different business lines and needs, while maintaining robust data governance and secure data access. Additionally, we highlighted the key industry trends that Getir’s case study reinforces across the broader data landscape. For more information, contact AWS or connect with your AWS Technical Account Manager or Solutions Architect, who will be happy to provide more detailed guidance and support.


About the Authors

Asser Moustafa is a Principal Worldwide Specialist Solutions Architect at AWS, based in Dallas, Texas, USA. He partners with customers worldwide, advising them on all aspects of their data architectures, migrations, and strategic data visions to help organizations adopt cloud-based solutions, maximize the value of their data assets, modernize legacy infrastructures, and implement cutting-edge capabilities like machine learning and advanced analytics. Prior to joining AWS, Asser held various data and analytics leadership roles, completing an MBA from New York University and an MS in Computer Science from Columbia University in New York. He is passionate about empowering organizations to become truly data-driven and unlock the transformative potential of their data.

Pinar Yasar is the Data Engineering Manager at Getir. Her passion is to accelerate self-service analytics for her internal customers and build highly scalable and cost-effective solutions in the cloud.

How to use the Amazon Detective API to investigate GuardDuty security findings and enrich data in Security Hub

Post Syndicated from Nicholas Jaeger original https://aws.amazon.com/blogs/security/how-to-use-the-amazon-detective-api-to-investigate-guardduty-security-findings-and-enrich-data-in-security-hub/

Understanding risk and identifying the root cause of an issue in a timely manner is critical to businesses. Amazon Web Services (AWS) offers multiple security services that you can use together to perform more timely investigations and improve the mean time to remediate issues. In this blog post, you will learn how to integrate Amazon Detective with AWS Security Hub, giving you better visibility into threat indicators and investigative data directly from Security Hub, which provides you with a centralized view of your overall security posture across your AWS accounts.

Amazon GuardDuty is an intelligent threat detection service that continuously monitors your AWS accounts, workloads, runtime activity, and data for potential malicious activity. If suspicious activity, such as anomalous behavior or credential exfiltration, is detected, GuardDuty generates detailed security findings. When you enable GuardDuty and Security Hub in the same account within the same AWS Region, GuardDuty sends its generated findings to Security Hub.

AWS Security Hub is a cloud security posture management tool that automatically detects when your AWS accounts and resources deviate from security best practices, aggregates security alerts into a single place and format, and provides insight into your security posture across your AWS accounts.

Amazon Detective makes it easier for you to analyze, investigate, and quickly identify the root cause of potential security issues or suspicious activities. Detective supports the ability to automatically investigate AWS Identity and Access Management (IAM) users and roles for indicators of compromise (IoC). This capability helps security analysts determine whether IAM users and IAM roles have potentially been compromised or involved in any known tactics, techniques, and procedures (TTPs) from the MITRE ATT&CK framework. In this post, we show you an example of how to programmatically use the Detective Investigation API to help investigate potential security issues.

The example architecture we provide in this post performs enrichment automatically for CRITICAL, HIGH, and MEDIUM severity findings and gives you the flexibility to initiate additional investigations and enrichment on-demand. You then have the option to review those enriched findings directly in the Security Hub console, or you can enable an integration to review the enriched findings in the AWS service or AWS Partner Network (APN) solution of your choice. This post gives an overview of what you need to do to build the example architecture, but if you prefer step-by-step instructions, check out the workshop version of the instructions.

This integration and finding enrichment is made possible through the use of the Detective Investigation API. You must have GuardDuty, Detective, and Security Hub enabled for this to work. We recommend that you build this architecture in the account you are using as a delegated admin for GuardDuty, Detective, and Security Hub, and in the Region where Security Hub aggregates findings (if finding aggregation is configured).

Solution architecture

Security Hub automatically ingests findings from GuardDuty. You can integrate Security Hub with Detective using Amazon EventBridge rules and an AWS Lambda function. To make the solution more manageable and customizable, you can configure a Security Hub custom action and a Security Hub automation rule. The custom action identifies findings that you manually select for investigation. The automation rule identifies and flags findings that you want investigated automatically. Two EventBridge rules invoke the Lambda function for each finding you want to investigate and enrich. The Lambda function processes the finding it receives, makes API calls to Detective, and then makes an API call back to Security Hub to update and enrich the finding. The Lambda function is invoked one time for each finding. Figure 1 illustrates this solution.

Figure 1: The solution architecture, including GuardDuty, Security Hub, EventBridge, Lambda, and Detective

The workflow is as follows:

  1. Security Hub automatically ingests the findings from GuardDuty. As Security Hub ingests the findings, it applies one or more enabled automation rules to modify the findings. You can use rules to add a user-defined field to mark which findings you want automatically processed, such as those of CRITICAL, HIGH, and MEDIUM severity.
  2. A. Security Hub emits an event for each new and updated imported finding after applying the enabled automation rules. Each event includes one finding (after automation rules are applied).
    B. Security Hub emits an event for each execution of a custom action. The event includes the findings that are selected when the custom action is initiated.
  3. A. An EventBridge rule evaluates events that match Security Hub Findings – Imported and sends the events to a target Lambda function for processing. You can further adjust the event pattern to only send findings that contain a user-defined field.
    B. A second EventBridge rule evaluates events that match Security Hub Findings – Custom Action (the specific custom action) and sends the events to the same target Lambda function for processing.
  4. The target Lambda function processes the finding in the event, makes API calls to Detective to start an investigation for the related IAM user or IAM role (if there is one) and fetches the results. It then makes an API call to Security Hub to update the finding. The function adds a note with a summary of the investigation, a link to the full investigation result, and a user-defined field that can be used to filter for findings that have been investigated.

In the following sections of this post, we provide more detail on the architecture components and setup. As a prerequisite, you must have GuardDuty, Detective, and Security Hub enabled.

Perform investigations with Detective using Lambda

You can start investigations in Detective and retrieve the results through the API. AWS Lambda supports several programming languages, but you will use JavaScript (Node.js 20.x) in this example. To start an investigation, supply the Amazon Resource Name (ARN) of an IAM role or user, the start time, the end time, and the ARN of the Detective behavior graph. The Detective API will fetch the results of the investigation, including IoCs, TTPs, and a categorical severity score. The severity score returned is computed using two dimensions, confidence and impact, where the confidence represents the likelihood that the events are anomalous and not normal user behavior, while the impact quantifies harm that could occur from the events as a measure of the TTPs’ effect.

You can use the example Lambda function in code example 1 as the target of the EventBridge rules in the architecture previously described. The function takes the ARN from a GuardDuty security finding that was aggregated by Security Hub and calls the Investigation API. When the result is returned, the function formats the data into the AWS Security Finding Format (ASFF) used by Security Hub and calls the BatchUpdateFindings API to send the enriched, updated finding back to Security Hub. Review the function so that you understand in detail how it works.

Code example 1: Example JavaScript Lambda function using Node.js 20.x

"use strict";
import {
  DetectiveClient,
  GetInvestigationCommand,
  ListGraphsCommand,
  StartInvestigationCommand,
} from "@aws-sdk/client-detective";
import { BatchUpdateFindingsCommand, SecurityHubClient } from "@aws-sdk/client-securityhub";

const SHClient = new SecurityHubClient();
const detectiveClient = new DetectiveClient();

export const handler = async (event) => {
  try {
    // Handle only one (the first) finding per function call
    const finding = event.detail.findings[0];

    if (finding.ProductName !== "GuardDuty") {
      // Handle only GuardDuty findings
      throw new Error("This is not a GuardDuty finding!");
    }

    const listgraphs = new ListGraphsCommand({});
    const graphs = await detectiveClient.send(listgraphs);
    const graphArn = graphs.GraphList[0].Arn;

    // Collect the IAM entities referenced by the finding
    const IAMResourceARNs = finding.Resources.filter((resource) => {
      return (
        resource.Type === "AwsIamAccessKey" ||
        resource.Type === "AwsIamRole" ||
        resource.Type === "AwsIamUser"
      );
    }).map((resource) => {
      if (resource.Type === "AwsIamRole" || resource.Type === "AwsIamUser") {
        return {
          arn: resource.Id,
          type: resource.Type === "AwsIamRole" ? "role" : "user",
        };
      }
      // Remaining case is AwsIamAccessKey. Note: this example assumes the key's
      // principal is an IAM role and builds a role ARN from the principal name.
      return {
        arn: `arn:aws:iam::${finding.AwsAccountId}:role/${resource.Details.AwsIamAccessKey.PrincipalName}`,
        type: "role",
      };
    });

    if (IAMResourceARNs.length === 0) {
      throw new Error("No IAM resource!");
    }

    // Investigate the first IAM role or user identified in the finding
    const investigationTarget = IAMResourceARNs[0].arn;
    const investigationTargetType = IAMResourceARNs[0].type;

    const investigationEndTime = new Date(Date.now());

    let investigationStartTime;
    if (finding.FirstObservedAt) {
      investigationStartTime = new Date(finding.FirstObservedAt);
    } else if (finding.CreatedAt) {
      investigationStartTime = new Date(finding.CreatedAt);
    } else if (finding.ProcessedAt) {
      investigationStartTime = new Date(finding.ProcessedAt);
    } else {
      throw new Error("Investigation start time invalid!");
    }
    investigationStartTime.setHours(investigationStartTime.getHours() - 24);

    const totalInvestigationTime = Math.round(
      (investigationEndTime.getTime() - investigationStartTime.getTime()) / (1000 * 60 * 60),
    ); // Hours

    const startInvestigationRequest = {
      GraphArn: graphArn,
      EntityArn: investigationTarget,
      ScopeStartTime: investigationStartTime,
      ScopeEndTime: investigationEndTime,
    };

    const startinvestigation = new StartInvestigationCommand(startInvestigationRequest);
    const investigation = await detectiveClient.send(startinvestigation);
    const investigationId = investigation.InvestigationId;

    const getInvestigationRequest = {
      GraphArn: graphArn,
      InvestigationId: investigationId,
    };

    // Poll every 30 seconds until the investigation leaves the RUNNING state.
    // The Lambda timeout is the only upper bound on this loop.
    let investigationResult = { Status: "RUNNING" };
    while (investigationResult.Status === "RUNNING") {
      await new Promise((r) => setTimeout(r, 30000));
      const getinvestigation = new GetInvestigationCommand(getInvestigationRequest);
      investigationResult = await detectiveClient.send(getinvestigation);
      if (investigationResult.Status === "FAILED") {
        throw new Error("Investigation failed!");
      }
    }

    let investigationSummary = "";
    switch (investigationResult.Severity) {
      case "INFORMATIONAL":
      case "LOW":
        investigationSummary += `We did not observe uncommon behavior for the associated ${investigationTargetType} during the ${totalInvestigationTime} hour investigation window.`;
        break;
      case "MEDIUM":
        investigationSummary += `We observed anomalous behavior for the associated ${investigationTargetType} during the ${totalInvestigationTime} hour investigation window which might be indicative of compromise.`;
        break;
      case "HIGH":
      case "CRITICAL":
        investigationSummary += `We observed anomalous behavior for the associated ${investigationTargetType} during the ${totalInvestigationTime} hour investigation window indicating potential compromise.`;
        break;
      default:
        throw new Error("Severity information not found!");
    }

    investigationSummary += " For more information, visit ";
    investigationSummary += `https://${process.env.AWS_REGION}.console.aws.amazon.com/detective/home?region=${process.env.AWS_REGION}#investigationReport/${investigationResult.InvestigationId}`;

    const findingUpdateInput = {
      FindingIdentifiers: [
        {
          Id: finding.Id,
          ProductArn: finding.ProductArn,
        },
      ],
      Note: {
        Text: investigationSummary.substring(0, 512),
        UpdatedBy: "Detective Investigation Lambda function.",
      },
      UserDefinedFields: {
        investigate: "complete",
      },
    };

    const batchUpdateCommand = new BatchUpdateFindingsCommand(findingUpdateInput);
    const updatedFinding = await SHClient.send(batchUpdateCommand);

    return updatedFinding;
  } catch (error) {
    console.error("Error:", error);
    throw error;
  }
};
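
The function above builds a role ARN for every AwsIamAccessKey resource. If the access key belongs to an IAM user, that ARN will not point at the right entity. The following is a minimal sketch of a more defensive mapping. It assumes that the ASFF `PrincipalType` field is populated and that `IAMUser`, `Role`, and `AssumedRole` are the values that appear in your findings; both are assumptions to verify against your own data. The helper name `accessKeyToEntity` is hypothetical.

```javascript
// Hypothetical helper: derive an investigation target from an ASFF
// AwsIamAccessKey resource. The PrincipalType values are assumptions.
function accessKeyToEntity(accountId, resource) {
  const details = resource?.Details?.AwsIamAccessKey;
  if (!details || !details.PrincipalName) {
    return null;
  }
  switch (details.PrincipalType) {
    case "IAMUser":
      return {
        arn: `arn:aws:iam::${accountId}:user/${details.PrincipalName}`,
        type: "user",
      };
    case "Role":
    case "AssumedRole":
      return {
        arn: `arn:aws:iam::${accountId}:role/${details.PrincipalName}`,
        type: "role",
      };
    default:
      // Root, federated, or unknown principals are skipped in this sketch
      return null;
  }
}
```

Unlike the original mapping, this sketch returns null for principals it does not recognize, so the caller would need to filter out null entries before choosing an investigation target.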

For this function to work as intended, you need to change its permissions and timeout. The permissions must include the Detective and Security Hub actions that the function calls. Attach the policy shown in code example 2 to the role used by the function. Then set the timeout of the function to 15 minutes to allow Detective to complete the investigation. Note that you can change "Resource": "*" to restrict permissions as needed.

Code example 2: Permissions required by the Lambda function

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "detective:ListGraphs",
        "detective:SearchGraph",
        "detective:StartInvestigation",
        "detective:UpdateInvestigationState",
        "detective:GetInvestigation",
        "detective:ListInvestigations",
        "detective:ListIndicators",
        "securityhub:BatchUpdateFindings",
        "securityhub:UpdateFindings"
      ],
      "Resource": "*"
    }
  ]
}
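
The Lambda function shown earlier polls Detective every 30 seconds until the investigation leaves the RUNNING state, so the function timeout is the only upper bound on that loop. As a sketch, the polling could be given an explicit attempt limit so that the function fails with a clear error instead of timing out. The helper name `pollUntilComplete` and its `fetchStatus`, `intervalMs`, and `maxAttempts` parameters are hypothetical; `fetchStatus` stands in for the `GetInvestigationCommand` call.

```javascript
// Hypothetical helper: poll until the status is no longer RUNNING,
// giving up after maxAttempts checks (25 x 30 s stays under 15 minutes).
async function pollUntilComplete(fetchStatus, { intervalMs = 30000, maxAttempts = 25 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Wait before each check, as the original loop does
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
    const result = await fetchStatus();
    if (result.Status === "FAILED") {
      throw new Error("Investigation failed!");
    }
    if (result.Status !== "RUNNING") {
      return result;
    }
  }
  throw new Error(`Investigation still running after ${maxAttempts} checks`);
}
```

In the handler, this could replace the while loop by passing a callback that sends the `GetInvestigationCommand` and returns its response.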

Initiate automated investigations and finding enrichment

Now that you’ve set up the Lambda function, you’re ready to set up the two methods of initiating the investigations. The first approach involves automatically investigating and enriching CRITICAL, HIGH, and MEDIUM severity GuardDuty findings. This can accelerate investigations for the highest severity findings because you don’t need to go into Security Hub or Detective and manually select the findings for investigation.

In this approach, the investigation Lambda function you previously created is automatically invoked by using Security Hub automations and an EventBridge rule. Using Security Hub automations allows you to configure and update which findings get automatically investigated and enriched without ongoing code changes. (Automation rules use a UI with dropdown options for criteria.)

Set up an automation rule from the Automations page in Security Hub. Use these criteria for the rule:

  • ProductName equals GuardDuty
  • SeverityLabel equals CRITICAL, HIGH, or MEDIUM
  • ResourceType equals AwsIamUser or AwsIamRole (shown in Figure 2)

In the future, if you want to modify which findings are automatically investigated, you can revisit the rule and select new criteria to specify which findings receive the user-defined field.

Figure 2: Example criteria for automation rule in Security Hub

For the automated actions for the rule, add a user-defined field as follows:

  • Key: investigate, Value: true (shown in Figure 3)

Figure 3: Define the user-defined field for the automation rule in Security Hub
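
If you prefer to create the rule programmatically rather than in the console, the criteria and action described above map onto the input of the Security Hub CreateAutomationRule API. The following sketch only builds the input object. The rule name, description, and rule order are placeholder values, and `buildInvestigateRuleInput` is a hypothetical helper; check the field names against the current API reference before use.

```javascript
// Hypothetical helper: input for the Security Hub CreateAutomationRule API.
function buildInvestigateRuleInput() {
  // Turn a list of values into EQUALS string filters
  const equals = (values) => values.map((Value) => ({ Value, Comparison: "EQUALS" }));
  return {
    RuleName: "flag-guardduty-iam-findings", // placeholder
    Description: "Flag GuardDuty IAM findings for Detective investigation", // placeholder
    RuleOrder: 1, // placeholder
    RuleStatus: "ENABLED",
    Criteria: {
      ProductName: equals(["GuardDuty"]),
      SeverityLabel: equals(["CRITICAL", "HIGH", "MEDIUM"]),
      ResourceType: equals(["AwsIamUser", "AwsIamRole"]),
    },
    Actions: [
      {
        Type: "FINDING_FIELDS_UPDATE",
        FindingFieldsUpdate: {
          // The user-defined field that the EventBridge rule matches on
          UserDefinedFields: { investigate: "true" },
        },
      },
    ],
  };
}
```

The returned object could be passed to `CreateAutomationRuleCommand` from `@aws-sdk/client-securityhub`, using a role that is allowed to create automation rules.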

Next, set up an EventBridge rule that determines which Security Hub Findings – Imported events are investigated, based on the user-defined field investigate. Each Security Hub Findings – Imported event contains a single finding. Use the JSON pattern shown in code example 3 to match findings in the rule, and set the target of this rule to the Lambda function you set up earlier.

Code example 3: The pattern used in your EventBridge rule

{
  "source": ["aws.securityhub"],
  "detail": {
    "findings": {
      "UserDefinedFields": {
        "investigate": ["true"]
      }
    }
  }
}
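
Before deploying the rule, it can help to check sample events against the pattern locally. The following sketch implements only a simplified subset of EventBridge matching (nested objects, lists of literal values, and event arrays that match when any element matches); it is not the service's full matching logic. The names `matchesPattern` and `investigatePattern` are hypothetical.

```javascript
// The pattern above, expressed as a JavaScript object
const investigatePattern = {
  source: ["aws.securityhub"],
  detail: { findings: { UserDefinedFields: { investigate: ["true"] } } },
};

// Simplified matcher: a pattern leaf is a list of allowed literal values;
// an array in the event matches if any of its elements matches.
function matchesPattern(pattern, value) {
  const candidates = Array.isArray(value) ? value : [value];
  if (Array.isArray(pattern)) {
    return candidates.some((candidate) => pattern.includes(candidate));
  }
  return candidates.some(
    (candidate) =>
      candidate !== null &&
      typeof candidate === "object" &&
      Object.entries(pattern).every(([key, sub]) => matchesPattern(sub, candidate[key])),
  );
}

const sampleEvent = {
  source: "aws.securityhub",
  detail: { findings: [{ Id: "example", UserDefinedFields: { investigate: "true" } }] },
};
console.log(matchesPattern(investigatePattern, sampleEvent));
```

If you also want the rule to ignore custom action events, EventBridge patterns accept a `detail-type` key, which for imported findings is `Security Hub Findings - Imported`.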

As new findings are aggregated in Security Hub, they are evaluated and updated by the automation rule. Findings that receive the user-defined field will initiate the Lambda function. After the Lambda function is initiated, it might take a couple of minutes for the execution to complete and appear in Security Hub. When it does, you will notice a new Notes field, as shown in Figure 4, and additional data in the finding JSON.

Figure 4: See that the enriched finding now includes a Notes section

You can also see what updates were made to the finding on the History tab of the finding, as shown in Figure 5.

Figure 5: See the fields that were updated for the finding under the History tab

If you want to modify which findings start this flow, you can modify the automation rule in the Security Hub console. For example, you might also want to investigate findings from other services or with other severity labels. Keep in mind that Detective only supports IAM users and IAM roles.

You might want to add additional criteria to help prevent repeat investigations on the same findings. For example, you might not want to have the investigation flow initiated every time a finding receives an update. To help prevent this behavior, you can add criteria to the automation rule where the user-defined field, investigate, does not equal complete.
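
A similar check can be made inside the Lambda function as a second line of defense. This is a minimal sketch, assuming the event still carries the `complete` value that the function writes after an investigation (for example, when someone runs the custom action on a finding that was already enriched). The helper name `alreadyInvestigated` is hypothetical.

```javascript
// Hypothetical guard: true when the finding was already enriched by the function
function alreadyInvestigated(finding) {
  return finding?.UserDefinedFields?.investigate === "complete";
}

// Example: the handler could return early instead of starting a new investigation
const enrichedFinding = { Id: "example", UserDefinedFields: { investigate: "complete" } };
if (alreadyInvestigated(enrichedFinding)) {
  console.log("Skipping finding that was already investigated");
}
```

This guard does not replace the automation rule criterion, because the automation rule runs before the event reaches the function and can set the field back to `true`.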

On-demand finding investigation and enrichment

The second approach involves investigating and enriching findings on-demand. You might want to use both approaches in case there are findings that don’t meet the criteria of your earlier automation that you still want to investigate.

In this approach, initiate the Lambda function through the use of a feature in Security Hub called custom actions. To use a Security Hub custom action to send findings to EventBridge, you first create the custom action in Security Hub. Name it Investigate. Then, define a rule in EventBridge that applies to your custom action (using the ARN of the custom action) and that uses the same Lambda function as the target to orchestrate the automation. The pattern of your EventBridge rule will be similar to the one shown in Figure 6, but uses the ARN of the custom action you create in Security Hub.

Figure 6: The EventBridge rule for the second approach
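
Because the figure is an image, the following sketch shows the general shape that an EventBridge pattern for a Security Hub custom action takes; it is not necessarily identical to the rule in Figure 6. The helper name `buildCustomActionPattern` is hypothetical, and the Region and account ID in the example ARN are placeholders.

```javascript
// Hypothetical helper: event pattern for a Security Hub custom action
function buildCustomActionPattern(customActionArn) {
  return {
    source: ["aws.securityhub"],
    "detail-type": ["Security Hub Findings - Custom Action"],
    // Custom action events carry the ARN of the action in "resources"
    resources: [customActionArn],
  };
}

const customActionPattern = buildCustomActionPattern(
  "arn:aws:securityhub:us-east-1:111122223333:action/custom/Investigate", // placeholder ARN
);
console.log(JSON.stringify(customActionPattern, null, 2));
```

The printed JSON can be pasted into the event pattern editor of the EventBridge rule, with the ARN replaced by the one Security Hub shows for your custom action.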

After you set up the custom action and the EventBridge rule, you can select a finding and choose Investigate from the Actions dropdown list to initiate the processing, as shown in Figure 7.

Figure 7: Initiate the on-demand finding enrichment

Because both approaches to initiating the investigation use the same Lambda function, the resulting enriched finding in Security Hub will be the same.

Limitations and further customization

We encourage you to try, test, and customize the architecture and example code. To simplify the example, there are some limitations coded in the Lambda function. For example, the Lambda function processes only the first finding it receives (per execution) and proceeds only if the finding originates from GuardDuty. The function also only begins an investigation into the first IAM user or IAM role it identifies that is associated with the finding. If you have a use case requiring that the Lambda function handle multiple findings at a time, findings from other services, or other problems, you will need to make code or architectural changes to accommodate those requirements (such as incorporating the use of AWS Step Functions or Amazon Simple Queue Service (Amazon SQS)), and perform the relevant testing.
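
As a starting point for handling more than one finding per invocation, the per-finding logic could be moved into its own function and called for every finding in the event. The following is a sketch only: `investigateAll` and the `investigateOne` callback are hypothetical names, and running several investigations concurrently might run into the function timeout or API rate limits, which is where Step Functions or Amazon SQS become relevant.

```javascript
// Hypothetical helper: run the per-finding logic for every finding in the event
// and report each outcome without letting one failure stop the others.
async function investigateAll(findings, investigateOne) {
  const settled = await Promise.allSettled(
    findings.map(async (finding) => investigateOne(finding)),
  );
  return settled.map((outcome, index) => ({
    findingId: findings[index].Id,
    ok: outcome.status === "fulfilled",
    detail: outcome.status === "fulfilled" ? outcome.value : String(outcome.reason),
  }));
}
```

In the handler, `findings` would be `event.detail.findings`, and `investigateOne` would wrap the steps that the example function currently applies to the first finding only.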

Conclusion

Use the example code provided here or the detailed workshop version of the instructions to try out the Detective API and enrich findings in Security Hub with investigative data. This can help you reduce mean time to respond by automatically investigating IAM entities, providing investigation details within the findings, and giving you a direct link into the details of the Detective investigation. Visit Getting started with AWS Security Hub, Getting started with Amazon Detective, and Getting started with Amazon GuardDuty to learn more.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Nicholas Jaeger

Nicholas is a Principal Security Solutions Architect at AWS, where he provides guidance to customers focused on operating their business as securely as possible on AWS. His background includes software engineering, teaching, solutions architecture, and AWS security. Nicholas also hosts AWS Security Activation Days to provide customers with prescriptive guidance while using AWS security services. https://awsactivationdays.splashthat.com/

Rima Tanash

Rima, a Senior Security Engineer and researcher at AWS, specializes in developing innovative cloud security features that use machine learning and automated reasoning. Her work encompasses modeling AWS API sequences, automating risk identification, building investigative playbooks, and applying graph analytics for threat modeling. She holds a PhD from Rice University and a Master’s from Johns Hopkins University.

Infor’s Amazon OpenSearch Service Modernization: 94% faster searches and 50% lower costs

Post Syndicated from Allan Pienaar original https://aws.amazon.com/blogs/big-data/infors-amazon-opensearch-service-modernization-94-faster-searches-and-50-lower-costs/

This post is cowritten by Arjan Hammink from Infor.

Robust storage and search capabilities are critical components of Infor’s enterprise business cloud software. Infor’s Intelligent Open Network (ION) OneView platform provides real-time reporting, dashboards, and data visualization to help customers access and analyze information across their organization. To enhance the search functionality within ION OneView, Infor used Amazon OpenSearch Service to improve their software products and offer better service to their customers by providing real-time visibility. By modernizing their use of OpenSearch Service, Infor has been able to deliver a 94% improvement in search performance for customers, along with a 50% reduction in storage costs.

In this post, we’ll explore Infor’s journey to modernize its search capabilities, the key benefits they achieved, and the technologies that powered this transformation. We’ll also discuss how Infor’s customers are now able to more effectively search through business messages, documents, and other critical data within the ION OneView platform.

Where Infor started

Infor’s ION OneView was built on top of Elasticsearch v5.x on Amazon OpenSearch Service, hosted across eight AWS Regions. This architecture enabled users to track business documents from a consolidated view, search using various criteria, and correlate messages while viewing content based on user roles. Over time, Infor expanded its functionality to include “Enrich” and “Archive” capabilities, which added significant complexity. The Enrich process built searchable messages by aggregating related events, requiring constant document updates to the OpenSearch indices. The Archive process then moved these messages and events to Amazon Simple Storage Service (Amazon S3), while using a delete_by_query to remove the corresponding documents from OpenSearch Service. These read-update-write-delete workloads, coupled with large all-encompassing indices with shard sizes of over 100 GB, resulted in high volumes of deleted documents and exponential data growth that the system struggled to keep up with. To address increasing performance needs, Infor repeatedly scaled out their OpenSearch Service domain horizontally.

Challenges 

The key challenges Infor faced underscored the need for a more scalable, resilient, and cost-effective search capability that could seamlessly integrate with their cloud environment. These included the inability to effectively archive data because of high ingestion rates, resulting in longer upgrade and recovery times. Escalating costs from scaling the solution and the need for custom development to enable newer OpenSearch Service features created significant operational burdens. Additionally, Infor was seeing increasing search latency, with CPU utilization peaking at 75% and occasionally spiking above 90% (as shown in the following figures), demonstrating the performance limitations of Infor’s existing infrastructure. Collectively, these issues drove Infor’s need for a modernized search solution.

SearchLatency Pre-Modernization

Screenshot shows CloudWatch metric SearchLatency before Modernization

CPUUtilization Pre-Modernization

Screenshot shows CloudWatch metric CPUUtilization before Modernization

Infor’s journey to modernize search with OpenSearch Service

To address the growing challenges with ION OneView, Infor partnered with AWS to undertake a comprehensive modernization effort. This involved optimizing operational processes, storage configurations, and instance selections, while also upgrading to the later versions within OpenSearch Service.

Operational review and enhancements

As a collaborative effort between Infor and AWS, the teams undertook a comprehensive operational review of Infor’s OpenSearch Service cluster. Using slow logs with adjusted logging thresholds, the review identified long-running queries and the archival process as the largest consumers of CPU capacity. Infor rewrote the long-running queries that used high cardinality fields, reducing the average query time.

Next, the team turned their attention to redesigning Infor’s archival process to reduce stress on the CPU. Instead of a single large index, the team implemented independent indices based on customer license types. This improved delete performance by allowing the team to target old indices, using index aliases to manage the transition. The team also replaced the delete_by_query approach, where a query is sent to locate documents prior to a delete, with a standard delete that passes document IDs directly, because all the document IDs to be archived were known ahead of time. This reduced round-trip time and CPU stress compared to the sequential search requests performed by delete_by_query. Finally, the team tuned the refresh interval based on the workload requirements, which improved indexing performance as well as memory and CPU utilization.
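
To illustrate deleting by known document IDs, the following sketch builds a request body for the `_bulk` API. The post does not say whether Infor issued single-document deletes or bulk requests, so the use of `_bulk` here is an assumption, as are the helper name `buildBulkDeleteBody` and the index name. The action format shown applies to Elasticsearch 7.x and OpenSearch; earlier Elasticsearch versions also expect a `_type` in each action.

```javascript
// Hypothetical helper: newline-delimited _bulk body that deletes known document IDs
function buildBulkDeleteBody(indexName, documentIds) {
  if (documentIds.length === 0) {
    return "";
  }
  const lines = documentIds.map((id) =>
    JSON.stringify({ delete: { _index: indexName, _id: id } }),
  );
  // The _bulk API requires each action on its own line and a trailing newline
  return lines.join("\n") + "\n";
}

// Example with a placeholder index name and IDs
console.log(buildBulkDeleteBody("messages-example", ["doc-1", "doc-2"]));
```

Because the IDs are supplied directly, the cluster does not have to run a search to find the documents first, which is the overhead that delete_by_query adds.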

Storage optimization

The team switched from gp2 to gp3 storage, provisioning additional input/output operations per second (IOPS) and throughput only when needed. This resulted in a 9% reduction in storage costs for most of Infor’s workloads. In use cases where IOPS was a bottleneck, the team was able to provision additional IOPS and throughput independent of the volume size using gp3, further reducing Infor’s overall storage costs. Additionally, the team implemented a shard size-based rollover strategy and a sharding strategy in which the total number of shards is divisible by the number of nodes, reducing the shard size to the recommended value of less than 50 GiB. This helped ensure an even distribution of data and workloads across the nodes for each index. The performance results also indicated that more vCPU would be beneficial, given the thread pool queues and latencies. Appropriate master and data node instance types were chosen based on the new storage requirements. To support the reindexing process, the team also temporarily scaled up the storage and compute resources.
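
The sharding rule described above can be expressed as a small calculation. This is a sketch under stated assumptions: it treats "total shards" as primaries plus replicas, uses 50 GiB as the per-shard ceiling, and the function name `planPrimaryShards` and the example sizes are hypothetical rather than Infor's actual values.

```javascript
// Hypothetical helper: pick a primary shard count that keeps shards at or
// under maxShardGiB and makes total shards divisible by the data node count.
function planPrimaryShards(indexSizeGiB, dataNodes, { maxShardGiB = 50, replicas = 1 } = {}) {
  // Smallest primary count that keeps each shard at or under the size ceiling
  let primaries = Math.max(1, Math.ceil(indexSizeGiB / maxShardGiB));
  // Grow until primaries plus replica copies divide evenly across the nodes
  while ((primaries * (1 + replicas)) % dataNodes !== 0) {
    primaries++;
  }
  return {
    primaries,
    totalShards: primaries * (1 + replicas),
    shardSizeGiB: indexSizeGiB / primaries,
  };
}

// Example: a 400 GiB index on 6 data nodes with one replica
console.log(planPrimaryShards(400, 6));
```

For the example input, the size ceiling alone would call for 8 primaries, but 16 total shards do not divide evenly across 6 nodes, so the sketch moves up to 9 primaries (18 total shards, 3 per node).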

Upgrading OpenSearch Service

After optimizing the storage and compute configurations based on best practices, the Infor ION team turned their attention to using the latest features of OpenSearch Service. With the shards now at an appropriate boundary and the memory and CPU utilization at the right levels, the team was able to seamlessly upgrade from Elasticsearch version 5.x to 6.x and then to 7.x in OpenSearch Service. Each major version upgrade required careful testing and client-side code changes to make sure that the appropriate compatible client libraries were used, and the team took the necessary time after each upgrade to thoroughly validate the system and provide a smooth transition for Infor’s customers. This commitment to a methodical upgrade process allowed Infor to take advantage of the latest OpenSearch Service features, such as Graviton support, performance improvements, bug fixes, and security posture improvements, while minimizing disruption to their users.

Optimizing instance selection for performance

In collaboration with the AWS team, Infor carefully evaluated local non-volatile memory express (NVMe)-backed instance types for their ION OneView search cluster, comparing options such as i3 and R6gd instances to balance memory, latency, and storage requirements. For write-heavy workloads, the team found that using NVMe storage provided better performance and price compared to Amazon Elastic Block Store (Amazon EBS) volumes because of the high IOPS requirement of the workload, allowing them to be less reliant on off-heap memory usage. By selecting the most appropriate instance types, the ION OneView search cluster was able to resize and scale down the number of data nodes by 63% while still achieving improved throughput and reduced latency. Staying on the latest AWS instance families was also a key consideration, and the team further optimized costs by purchasing Reserved Instances after establishing a good baseline for their performance and compute consumption, with discounts ranging from 30% to 50% depending on the commitment term.

Results

The following figures show the improvements of the modernization.

New indices with the correct shard size can be seen in the increase in shards, shown in the following figure.

Figure showing increase in shards with new indices and correct shard size

The updated shard strategy combined with a version upgrade led to a ten-fold increase in the volume of traffic and efficient archiving as shown in the following figure.

Figure illustrates 10x increase in traffic volume and improved archiving due to updated shard strategy and version upgrade

The SearchRate increase is shown in the following figure.

Figure shows increase in SearchRate

The following figure shows that the CPU increase was minimal compared to the traffic increase.

Figure demonstrates CPU increase was minimal compared to traffic increase

The SearchLatency reduction post upgrade and implementation of the new indexing and shard strategy is shown in the following figure.

Figure illustrates reduction in CloudWatch metric SearchLatency after upgrade and new indexing/shard strategy implementation

The following figure shows the monthly spend over the past 4 quarters for two Infor ION products.

Figure shows the monthly spend over 4 quarters for two Infor ION products.

Conclusion

Through their careful modernization of the OpenSearch Service infrastructure, Infor was able to achieve a 50% reduction in infrastructure costs coupled with a 94% improvement in cluster performance. The optimized clusters are now healthier and more resilient, enabling faster blue/green deployments to process even greater data volumes.

This successful transformation was driven by Infor’s close collaboration with the AWS team, using deep technical expertise and best practices to accelerate the optimization process and unlock the full potential of OpenSearch Service. Infor’s OpenSearch Service modernization has empowered the company to provide an improved, high-performing search experience for their customers at a significantly lower cost, positioning their ION OneView platform for continued growth and success.

Every workload is unique, with its own distinct characteristics. While the best practices outlined in the Amazon OpenSearch Service developer guide serve as a valuable guide, the most important step is to deploy, test, and continuously tune your own domains to find the optimal configuration, stability, and cost for your specific needs.


About the Authors

Allan Pienaar is an OpenSearch SME and Customer Success Engineer at AWS. He works closely with enterprise customers in ensuring operational excellence, maintaining production stability and optimizing cost using the Amazon OpenSearch Service.

Gokul Sarangaraju is a Senior Solutions Architect at AWS. He helps customers adopt AWS services and provides guidance in AWS cost and usage optimization. His areas of expertise include building scalable and cost-effective data analytics solutions using AWS services and tools.

Arjan Hammink is a Senior Director of Software Development at Infor, bringing over 25 years of expertise in software development and team management. He currently oversees Infor ION, a project he has been integral to since its inception in 2010 when he began as a Software Engineer. Infor ION is a robust middleware designed to streamline software integration, a key component of Infor OS, Infor’s cloud technology platform.

A customer’s journey with Amazon OpenSearch Ingestion pipelines

Post Syndicated from Navnit Shukla original https://aws.amazon.com/blogs/big-data/a-customers-journey-with-amazon-opensearch-ingestion-pipelines/

This is a guest post co-written with Mike Mosher, Sr. Principal Cloud Platform Network Architect at a multi-national financial credit reporting company.

I work for a multi-national financial credit reporting company that offers credit risk, fraud, targeted marketing, and automated decisioning solutions. We are an AWS early adopter and have embraced the cloud to drive digital transformation efforts. Our Cloud Center of Excellence (CCoE) team operates a global AWS Landing Zone, which includes a centralized AWS network infrastructure. We are also an AWS PrivateLink Ready Partner and offer our E-Connect solution to allow our B2B customers to connect to a range of products through private, secure, and performant connectivity.

Our E-Connect solution is a platform comprised of multiple AWS services like Application Load Balancer (ALB), Network Load Balancer (NLB), Gateway Load Balancer (GWLB), AWS Transit Gateway, AWS PrivateLink, AWS WAF, and third-party security appliances. All of these services and resources, as well as the large amount of network traffic across the platform, create a large number of logs, and we needed a solution to aggregate and organize these logs for quick analysis by our operations teams when troubleshooting the platform.

Our original design consisted of Amazon OpenSearch Service, selected for its ability to return specific log entries from extensive datasets in seconds. We also complemented this with Logstash, allowing us to use multiple filters to enrich and augment the data before sending to the OpenSearch cluster, facilitating a more comprehensive and insightful monitoring experience.

In this post, we share our journey, including the hurdles we faced, the solutions we thought about, and why we went with Amazon OpenSearch Ingestion pipelines to make our log management smoother.

Overview of the initial solution

We originally wanted to store and analyze the logs in an OpenSearch cluster, and decided to use the AWS-managed service for OpenSearch called Amazon OpenSearch Service. We also wanted to enrich these logs with Logstash, but there was no AWS-managed service for this, so we needed to deploy the application on an Amazon Elastic Compute Cloud (Amazon EC2) server. This setup meant that we had to take on a lot of server maintenance, including using AWS CodePipeline and AWS CodeDeploy to push new Logstash configurations to the server and restart the service. We also needed to perform server maintenance tasks such as patching and updating the operating system (OS) and the Logstash application, and monitor server resources such as Java heap, CPU, memory, and storage.

The complexity extended to validating the network path from the Logstash server to the OpenSearch cluster, incorporating checks on Access Control Lists (ACLs) and security groups, as well as routes in the VPC subnets. Scaling beyond a single EC2 server introduced considerations for managing an auto scaling group, Amazon Simple Queue Service (Amazon SQS) queues, and more. Maintaining the continuous functionality of our solution became a significant effort, diverting focus from the core tasks of operating and monitoring the platform.

The following diagram illustrates our initial architecture.

Possible solutions for us:

Our team looked at multiple options to manage the logs from this platform. We possess a Splunk solution for storing and analyzing logs, and we did assess it as a potential competitor to OpenSearch Service. However, we opted against it for several reasons:

  • Our team is more familiar with OpenSearch Service and Logstash than Splunk.
  • Amazon OpenSearch Service, being a managed service in AWS, facilitates a smoother log transfer process compared to our on-premises Splunk solution. Also, transporting logs to the on-premises Splunk cluster would incur high costs, consume bandwidth on our AWS Direct Connect connections, and introduce unnecessary complexity.
  • Splunk’s pricing structure, based on storage in GBs, proved cost-prohibitive for the volume of logs we intended to store and analyze.

Initial designs for an OpenSearch Ingestion pipeline solution

The Amazon team approached me about a new feature they were launching: Amazon OpenSearch Ingestion. This feature offered a great solution to the problems we were facing with managing EC2 instances for Logstash. First, the new feature removed all the heavy lifting from our team of managing multiple EC2 instances, scaling the servers up and down based on traffic, and monitoring the ingestion of logs and the resources of the underlying servers. Second, Amazon OpenSearch Ingestion pipelines supported most if not all of the Logstash filters we were using in our current solution, which allowed us to use the same functionality of our current solution for enriching the logs.

We were thrilled to be accepted into the AWS beta program, emerging as one of its earliest and largest adopters. Our journey began with ingesting VPC flow logs for our internet ingress platform, alongside Transit Gateway flow logs connecting all VPCs in the AWS Region. Handling such a substantial volume of logs proved to be a significant task, with Transit Gateway flow logs alone reaching upwards of 14 TB per day. As we expanded our scope to include other logs like ALB and NLB access logs and AWS WAF logs, the scale of the solution translated to higher costs.

However, our enthusiasm was somewhat dampened by the challenges we faced initially. Despite our best efforts, we encountered performance issues with the domain. Through collaborative efforts with the AWS team, we uncovered misconfigurations within our setup. We had been using instances that were inadequately sized for the volume of data we were handling. Consequently, these instances were constantly operating at maximum CPU capacity, resulting in a backlog of incoming logs. This bottleneck cascaded into our OpenSearch Ingestion pipelines, forcing them to scale up unnecessarily, even as the OpenSearch cluster struggled to keep pace.

These challenges led to suboptimal performance from our cluster. We found ourselves unable to analyze flow logs or access logs promptly, sometimes waiting days after their creation. Additionally, the costs associated with these inefficiencies far exceeded our initial expectations.

However, with the assistance of the AWS team, we successfully addressed these issues, optimizing our setup for improved performance and cost-efficiency. This experience underscored the importance of proper configuration and collaboration in maximizing the potential of AWS services, ultimately leading to a more positive outcome for our data ingestion processes.

Optimized design for our OpenSearch Ingestion pipelines solution

We collaborated with AWS to enhance our overall solution, building one that is high performing, cost-effective, and aligned with our monitoring requirements. The solution involves selectively ingesting specific log fields into the OpenSearch Service domain using an Amazon S3 Select pipeline in the pipeline source. Alternatively, selective ingestion can be done by filtering within pipelines: you can use include_keys and exclude_keys in your sink to filter the data that’s routed to the destination. We also used the built-in Index State Management feature to remove logs older than a predefined period to reduce the overall cost of the cluster.
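To make the two cost levers above concrete, here is a minimal Python sketch. The filter_event helper is hypothetical: it only imitates the idea of include_keys and exclude_keys on a single log event and is not the OpenSearch Ingestion implementation. The build_retention_policy helper assembles an Index State Management policy body that deletes indices after a given age. The index pattern, retention period, and field names are assumptions chosen for illustration.

```python
def filter_event(event, include_keys=None, exclude_keys=None):
    """Mimic sink-side field filtering on one log event (a dict).

    Hypothetical helper for illustration only. This sketch treats
    include_keys and exclude_keys as alternatives.
    """
    if include_keys is not None and exclude_keys is not None:
        raise ValueError("use include_keys or exclude_keys, not both")
    if include_keys is not None:
        return {k: v for k, v in event.items() if k in include_keys}
    if exclude_keys is not None:
        return {k: v for k, v in event.items() if k not in exclude_keys}
    return dict(event)


def build_retention_policy(index_pattern, max_age_days):
    """Build an ISM policy body that deletes indices older than max_age_days."""
    return {
        "policy": {
            "description": f"Delete {index_pattern} after {max_age_days} days",
            "default_state": "hot",
            "states": [
                {
                    "name": "hot",
                    "actions": [],
                    # Move to the delete state once the index is old enough.
                    "transitions": [
                        {
                            "state_name": "delete",
                            "conditions": {"min_index_age": f"{max_age_days}d"},
                        }
                    ],
                },
                {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
            ],
            # Attach the policy automatically to matching new indices.
            "ism_template": {"index_patterns": [index_pattern], "priority": 100},
        }
    }


# Assumed flow log fields; only two of them are kept for indexing.
sample_event = {"srcaddr": "10.0.0.1", "dstaddr": "10.0.1.5", "bytes": 840, "tcp_flags": 2}
slim_event = filter_event(sample_event, include_keys=["srcaddr", "bytes"])
policy = build_retention_policy("flow-logs-*", 30)
```

Keeping fewer fields per document reduces storage and indexing work on the domain, while the retention policy bounds how long any index is kept.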

The ingested logs in OpenSearch Service empower us to derive aggregate data, providing insights into trends and issues across the entire platform. For additional detailed analysis of these logs including all original log fields, we use Amazon Athena tables with partitioning to quickly and cost-effectively query Amazon Simple Storage Service (Amazon S3) for logs stored in Parquet format.

This comprehensive solution significantly enhances our platform visibility, reduces overall monitoring costs for handling a large log volume, and expedites our time to identify root causes when troubleshooting platform incidents.

The following diagram illustrates our optimized architecture.

Performance comparison

The following table compares the performance of the initial design with Logstash on Amazon EC2, the original OpenSearch Ingestion pipeline solution, and the optimized OpenSearch Ingestion pipeline solution.

Criteria | Initial Design with Logstash on Amazon EC2 | Original Ingestion Pipeline Solution | Optimized Ingestion Pipeline Solution
Maintenance Effort | High: Solution required the team to manage multiple services and instances, taking effort away from managing and monitoring our platform. | Low: OpenSearch Ingestion managed most of the undifferentiated heavy lifting, leaving the team to only maintain the ingestion pipeline configuration file. | Low: OpenSearch Ingestion managed most of the undifferentiated heavy lifting, leaving the team to only maintain the ingestion pipeline configuration file.
Performance | High: EC2 instances with Logstash could scale up and down as needed in the auto scaling group. | Low: Due to insufficient resources on the OpenSearch cluster, the ingestion pipelines were constantly at max OpenSearch Compute Units (OCUs), causing log delivery to be delayed by multiple days. | High: Ingestion pipelines can scale up and down in OCUs as needed.
Real-time Log Availability | Medium: In order to pull, process, and deliver the large number of logs in Amazon S3, we needed a large number of EC2 instances. To save on cost, we ran fewer instances, which led to slower log delivery to OpenSearch. | Low: Due to insufficient resources on the OpenSearch cluster, the ingestion pipelines were constantly at max OCUs, causing log delivery to be delayed by multiple days. | High: The optimized solution was able to deliver a large number of logs to OpenSearch to be analyzed in near real time.
Cost Saving | Medium: Running multiple services and instances to send logs to OpenSearch increased the cost of the overall solution. | Low: Due to insufficient resources on the OpenSearch cluster, the ingestion pipelines were constantly at max OCUs, increasing the cost of the service. | High: The optimized solution was able to scale the ingestion pipeline OCUs up and down as needed, which kept the overall cost low.
Overall Benefit | Medium | Low | High

Conclusion

In this post, we highlighted our journey to build a solution using OpenSearch Service and OpenSearch Ingestion pipelines. This solution allows us to focus on analyzing logs and supporting our platform, without needing to support the infrastructure to deliver logs to OpenSearch. We also highlighted the need to optimize the service in order to increase performance and reduce cost.

As our next steps, we aim to explore the recently announced Amazon OpenSearch Service zero-ETL integration with Amazon S3 (in preview) feature within OpenSearch Service. This step is intended to further reduce the solution’s costs and provide flexibility in the timing and number of logs that are ingested.


About the Authors

Navnit Shukla serves as an AWS Specialist Solutions Architect with a focus on analytics. He possesses a strong enthusiasm for assisting clients in discovering valuable insights from their data. Through his expertise, he constructs innovative solutions that empower businesses to arrive at informed, data-driven choices. Notably, Navnit Shukla is the accomplished author of the book titled “Data Wrangling on AWS.” He can be reached via LinkedIn.

Mike Mosher is a Senior Principal Cloud Platform Network Architect at a multi-national financial credit reporting company. He has more than 16 years of experience in on-premises and cloud networking and is passionate about building new architectures on the cloud that serve customers and solve problems. Outside of work, he enjoys time with his family and traveling back home to the mountains of Colorado.

Amazon EMR on EC2 cost optimization: How a global financial services provider reduced costs by 30%

Post Syndicated from Omar Gonzalez original https://aws.amazon.com/blogs/big-data/amazon-emr-on-ec2-cost-optimization-how-a-global-financial-services-provider-reduced-costs-by-30/

In this post, we highlight key lessons learned while helping a global financial services provider migrate their Apache Hadoop clusters to AWS and best practices that helped reduce their Amazon EMR, Amazon Elastic Compute Cloud (Amazon EC2), and Amazon Simple Storage Service (Amazon S3) costs by over 30% per month.

We outline cost-optimization strategies and operational best practices achieved through a strong collaboration with their DevOps teams. We also discuss a data-driven approach using a hackathon focused on cost optimization along with Apache Spark and Apache HBase configuration optimization.

Background

In early 2022, a business unit of a global financial services provider began their journey to migrate their customer solutions to AWS. This included web applications, Apache HBase data stores, Apache Solr search clusters, and Apache Hadoop clusters. The migration included over 150 server nodes and 1 PB of data. The on-premises clusters supported real-time data ingestion and batch processing.

Because of aggressive migration timelines driven by the closure of data centers, they implemented a lift-and-shift rehosting strategy of their Apache Hadoop clusters to Amazon EMR on EC2, as highlighted in the Amazon EMR migration guide.

Amazon EMR on EC2 provided the flexibility for the business unit to run their applications with minimal changes on managed Hadoop clusters with the required Spark, Hive, and HBase software and versions installed. Because the clusters are managed, they were able to decompose their large on-premises cluster and deploy purpose-built transient and persistent clusters for each use case on AWS without increasing operational overhead.

Challenge

Although the lift-and-shift strategy allowed the business unit to migrate with lower risk and allowed their engineering teams to focus on product development, this came with increased ongoing AWS costs.

The business unit deployed transient and persistent clusters for different use cases. Several application components relied on Spark Streaming for real-time analytics, which was deployed on persistent clusters. They also deployed the HBase environment on persistent clusters.

After the initial deployment, they discovered several configuration issues that led to suboptimal performance and increased cost. Despite using Amazon EMR managed scaling for persistent clusters, the configuration wasn’t efficient due to setting a minimum of 40 core nodes and task nodes, resulting in wasted resources. Core nodes were also misconfigured to auto scale. This led to scale-in events shutting down core nodes with shuffle data. The business unit also implemented Amazon EMR auto-termination policies. Because of shuffle data loss on the EMR on EC2 clusters running Spark applications, certain jobs ran five times longer than planned. Here, auto-termination policies didn’t mark a cluster as idle because a job was still running.

Lastly, there were separate environments for development (dev), user acceptance testing (UAT), and production (prod). These were also over-provisioned, with the minimum capacity units for the managed scaling policies configured too high, leading to higher costs as shown in the following figure.

Short-term cost-optimization strategy

The business unit completed the migration of applications, databases, and Hadoop clusters in 4 months. Their immediate goal was to get out of their data centers as quickly as possible, followed by cost optimization and modernization. Although they expected greater upfront costs because of the lift-and-shift approach, their costs were 40% higher than forecasted. This sped up their need to optimize.

They engaged with their shared services team and the AWS team to develop a cost-optimization strategy. The business unit began by focusing on cost-optimization best practices that they could implement immediately and that didn’t require product development team engagement or impact their productivity. They performed a cost analysis and determined that the largest contributors to cost were EMR on EC2 clusters running Spark, EMR on EC2 clusters running HBase, Amazon S3 storage, and EC2 instances running Solr.

The business unit started by enforcing auto-termination of EMR clusters in their dev environments by using automation. They considered using Amazon EMR isIdle Amazon CloudWatch metrics to build an event-driven solution with AWS Lambda, as described in Optimize Amazon EMR costs with idle checks and automatic resource termination using advanced Amazon CloudWatch metrics and AWS Lambda. They implemented a stricter policy to shut down clusters in their lower environments after 3 hours, regardless of usage. They also updated managed scaling policies in dev and UAT and set the minimum cluster size to three instances to allow clusters to scale up as needed. This resulted in a 60% savings in monthly dev and UAT costs over 5 months, as shown in the following figure.
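The 3-hour shutdown rule can be sketched as a small selection function. This is a minimal illustration and not the business unit's actual automation: it assumes cluster summaries shaped like the items returned by the Boto3 EMR list_clusters call (Id, Status.State, and Status.Timeline.CreationDateTime), and the cluster IDs below are made up.

```python
from datetime import datetime, timedelta, timezone

# States in which a cluster still holds (and bills for) EC2 capacity.
ACTIVE_STATES = {"STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING"}


def clusters_to_terminate(clusters, now, max_age=timedelta(hours=3)):
    """Return IDs of active clusters that have existed for max_age or longer."""
    expired = []
    for cluster in clusters:
        status = cluster["Status"]
        created = status["Timeline"]["CreationDateTime"]
        if status["State"] in ACTIVE_STATES and now - created >= max_age:
            expired.append(cluster["Id"])
    return expired


# Hypothetical sample data for a dev account.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
sample_clusters = [
    {"Id": "j-OLD", "Status": {"State": "WAITING",
        "Timeline": {"CreationDateTime": now - timedelta(hours=5)}}},
    {"Id": "j-NEW", "Status": {"State": "RUNNING",
        "Timeline": {"CreationDateTime": now - timedelta(hours=1)}}},
    {"Id": "j-DONE", "Status": {"State": "TERMINATED",
        "Timeline": {"CreationDateTime": now - timedelta(hours=9)}}},
]
expired_ids = clusters_to_terminate(sample_clusters, now)
```

In a scheduled job, the returned IDs could be passed to the EMR client's terminate_job_flows call. Because the rule ignores usage, it is only suitable for lower environments where interrupting work is acceptable.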

For the initial production deployment, they had a subset of Spark jobs running on a persistent cluster with an older Amazon EMR 5.(x) release. To optimize costs, they split smaller jobs and larger jobs to run on separate persistent clusters and configured the minimum number of core nodes required to support jobs in each cluster. Setting the core nodes to a constant size while using managed scaling for only task nodes is a recommended best practice and eliminated the issue of shuffle data loss. This also improved the time to scale in and out, because task nodes don’t store data in Hadoop Distributed File System (HDFS).
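One way to express "constant core nodes, scale only task nodes" is through the compute limits of an EMR managed scaling policy. The sketch below builds such a policy body. The node counts are assumptions for illustration, and the dictionary follows the ManagedScalingPolicy shape accepted by the Boto3 EMR put_managed_scaling_policy call.

```python
def task_only_scaling_policy(core_nodes, max_nodes):
    """Build a managed scaling policy that keeps core nodes fixed.

    Setting the minimum capacity and the maximum core capacity to the same
    value means any capacity above core_nodes is added as task nodes only.
    """
    if core_nodes < 1:
        raise ValueError("core_nodes must be at least 1")
    if max_nodes < core_nodes:
        raise ValueError("max_nodes must be >= core_nodes")
    return {
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": core_nodes,
            "MaximumCapacityUnits": max_nodes,
            "MaximumCoreCapacityUnits": core_nodes,
        }
    }


# Assumed sizing: 5 fixed core nodes, up to 40 nodes in total.
policy = task_only_scaling_policy(core_nodes=5, max_nodes=40)
```

Because core nodes never scale in under this policy, shuffle data and HDFS blocks stored on them are not lost during scale-in events.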

Solr clusters ran on EC2 instances. To optimize this environment, they ran performance tests to determine the best EC2 instances for their workload.

With over one petabyte of data, Amazon S3 contributed to over 15% of monthly costs. The business unit enabled the Amazon S3 Intelligent-Tiering storage class to optimize storage expenses for historical data and reduce their monthly Amazon S3 costs by over 40%, as shown in the following figure. They also migrated Amazon Elastic Block Store (Amazon EBS) volumes from gp2 to gp3 volume types.
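The post does not say how Intelligent-Tiering was enabled. One common option, shown here purely as an assumption, is an S3 lifecycle rule that transitions objects under a prefix to the INTELLIGENT_TIERING storage class. The rule ID and prefix are hypothetical.

```python
def intelligent_tiering_rule(prefix, days=0, rule_id="move-to-intelligent-tiering"):
    """Build one S3 lifecycle rule that transitions objects to Intelligent-Tiering."""
    if days < 0:
        raise ValueError("days must be zero or positive")
    return {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [{"Days": days, "StorageClass": "INTELLIGENT_TIERING"}],
    }


# Hypothetical prefix holding historical data.
lifecycle_configuration = {"Rules": [intelligent_tiering_rule("historical/")]}
```

A configuration like this could be applied with the S3 client's put_bucket_lifecycle_configuration call, passing it as the LifecycleConfiguration argument.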

Longer-term cost-optimization strategy

After the business unit realized initial cost savings, they engaged with the AWS team to organize a financial hackathon (FinHack) event. The goal of the hackathon was to reduce costs further by using a data-driven process to test cost-optimization strategies for Spark jobs. To prepare for the hackathon, they identified a set of jobs to test using different Amazon EMR deployment options (Amazon EC2, Amazon EMR Serverless) and configurations (Spot, AWS Graviton, Amazon EMR managed scaling, EC2 instance fleets) to arrive at the most cost-optimized solution for each job. A sample test plan for a job is shown in the following table. The AWS team also assisted with analyzing Spark configurations and job execution during the event.

Job | Test | Description | Configuration
Job 1 | 1 | Run an EMR on EC2 job with default Spark configurations | Non-Graviton, On-Demand Instances
Job 1 | 2 | Run an EMR Serverless job with default Spark configurations | Default configuration
Job 1 | 3 | Run an EMR on EC2 job with default Spark configuration and Graviton instances | Graviton, On-Demand Instances
Job 1 | 4 | Run an EMR on EC2 job with default Spark configuration and Graviton instances. Hybrid Spot Instance allocation. | Graviton, On-Demand and Spot Instances

The business unit also performed extensive testing using Spot Instances before and during the FinHack. They initially used the Spot Instance advisor and Spot Blueprints to create optimal instance fleet configurations. They automated the selection of the best Availability Zone to run jobs by querying for the Spot placement scores using the get_spot_placement_scores API before launching new jobs.
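Selecting an Availability Zone from placement scores can be sketched as follows. This is a minimal illustration that assumes a response shaped like the one returned by the Boto3 EC2 get_spot_placement_scores call when SingleAvailabilityZone is set to True. The sample scores and the min_score threshold are invented for the example.

```python
def best_availability_zone(response, min_score=1):
    """Return the AvailabilityZoneId with the highest Spot placement score.

    Returns None when no zone meets min_score, so the caller can fall back
    to On-Demand capacity or retry later.
    """
    candidates = [
        entry
        for entry in response.get("SpotPlacementScores", [])
        if "AvailabilityZoneId" in entry and entry["Score"] >= min_score
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda entry: entry["Score"])["AvailabilityZoneId"]


# Invented sample response for one Region.
sample_response = {
    "SpotPlacementScores": [
        {"Region": "us-east-1", "AvailabilityZoneId": "use1-az1", "Score": 4},
        {"Region": "us-east-1", "AvailabilityZoneId": "use1-az2", "Score": 9},
        {"Region": "us-east-1", "AvailabilityZoneId": "use1-az4", "Score": 7},
    ]
}
chosen_zone = best_availability_zone(sample_response)
```

Scores range from 1 to 10, and a higher score indicates a greater likelihood that the Spot request will succeed in that zone, so the job launcher can place the cluster's subnet accordingly.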

During the FinHack, they also developed an EMR job tracking script and report to granularly track cost per job and measure ongoing improvements. They used the AWS SDK for Python (Boto3) to list the status of all transient clusters in their account and report on cluster-level configurations and instance hours per job.
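A simplified version of such a report might aggregate instance hours by job. The sketch below is not the business unit's script: it assumes cluster summaries shaped like Boto3 EMR list_clusters items (Name and NormalizedInstanceHours) and assumes, hypothetically, that each transient cluster is named after the job it runs.

```python
from collections import defaultdict


def instance_hours_by_job(clusters):
    """Sum normalized instance hours per job, keyed by cluster name."""
    totals = defaultdict(int)
    for cluster in clusters:
        totals[cluster["Name"]] += cluster.get("NormalizedInstanceHours", 0)
    return dict(totals)


# Invented sample data: two runs of one job and one run of another.
sample_clusters = [
    {"Id": "j-1", "Name": "daily-aggregation", "NormalizedInstanceHours": 128},
    {"Id": "j-2", "Name": "daily-aggregation", "NormalizedInstanceHours": 96},
    {"Id": "j-3", "Name": "index-refresh", "NormalizedInstanceHours": 32},
]
report = instance_hours_by_job(sample_clusters)
```

Comparing these totals before and after a configuration change gives a per-job measure of whether the change reduced compute consumption.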

As they executed the test plan, they found several additional areas of enhancement:

  • One of the test jobs makes API calls to Solr clusters, which introduced a bottleneck in the design. To prevent Spark jobs from overwhelming the clusters, they fine-tuned the spark.executor.cores and spark.dynamicAllocation.maxExecutors properties.
  • Task nodes were over-provisioned with large EBS volumes. They reduced the size to 100 GB for additional cost savings.
  • They updated their instance fleet configuration by setting the units/weights in proportion to the instance types selected.
  • During the initial migration, they set the spark.sql.shuffle.partitions configuration too high. The configuration was fine-tuned for their on-premises cluster but not updated to align with their EMR clusters. They optimized the configuration by setting the value to one or two times the number of vCores in the cluster.
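The shuffle partition rule in the last item is simple arithmetic, sketched below. The cluster size (10 nodes with 16 vCores each) is an assumed example, not a figure from the migration.

```python
def shuffle_partitions(total_vcores, multiplier=2):
    """Return a spark.sql.shuffle.partitions value of 1x to 2x the vCore count."""
    if total_vcores < 1:
        raise ValueError("total_vcores must be at least 1")
    if not 1 <= multiplier <= 2:
        raise ValueError("multiplier should be between 1 and 2")
    return int(total_vcores * multiplier)


def spark_shuffle_conf(nodes, vcores_per_node, multiplier=2):
    """Build the Spark property for a cluster of identical nodes."""
    partitions = shuffle_partitions(nodes * vcores_per_node, multiplier)
    return {"spark.sql.shuffle.partitions": str(partitions)}


# Assumed cluster: 10 nodes x 16 vCores = 160 vCores in total.
conf = spark_shuffle_conf(nodes=10, vcores_per_node=16)
```

For this assumed cluster the rule yields a range of 160 to 320 partitions, which is far easier to reason about than a value carried over from a differently sized on-premises cluster.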

Following the FinHack, they enforced a cost allocation tagging strategy for persistent clusters that are deployed using Terraform and transient clusters deployed using Amazon Managed Workflows for Apache Airflow (Amazon MWAA). They also deployed an EMR Observability dashboard using Amazon Managed Service for Prometheus and Amazon Managed Grafana.

Results

The business unit reduced monthly costs by 30% over 3 months. This allowed them to continue migration efforts of remaining on-premises workloads. Most of their 2,000 jobs per month now run on EMR transient clusters. They have also increased AWS Graviton usage to 40% of total usage hours per month and Spot usage to 10% in non-production environments.

Conclusion

Through a data-driven approach involving cost analysis, adherence to AWS best practices, configuration optimization, and extensive testing during a financial hackathon, the global financial services provider successfully reduced their AWS costs by 30% over 3 months. Key strategies included enforcing auto-termination policies, optimizing managed scaling configurations, using Spot Instances, adopting AWS Graviton instances, fine-tuning Spark and HBase configurations, implementing cost allocation tagging, and developing cost tracking dashboards. Their partnership with AWS teams and a focus on implementing short-term and longer-term best practices allowed them to continue their cloud migration efforts while optimizing costs for their big data workloads on Amazon EMR.

For additional cost-optimization best practices, we recommend visiting AWS Open Data Analytics.


About the Authors

Omar Gonzalez is a Senior Solutions Architect at Amazon Web Services in Southern California with more than 20 years of experience in IT. He is passionate about helping customers drive business value through the use of technology. Outside of work, he enjoys hiking and spending quality time with his family.

Navnit Shukla, an AWS Specialist Solution Architect specializing in Analytics, is passionate about helping clients uncover valuable insights from their data. Leveraging his expertise, he develops inventive solutions that empower businesses to make informed, data-driven decisions. Notably, Navnit Shukla is the accomplished author of the book Data Wrangling on AWS, showcasing his expertise in the field. He also runs the YouTube channel Cloud and Coffee with Navnit, where he shares insights on cloud technologies and analytics. Connect with him on LinkedIn.

Accelerate Serverless Streamlit App Deployment with Terraform

Post Syndicated from Kevon Mayers original https://aws.amazon.com/blogs/devops/accelerate-serverless-streamlit-app-deployment-with-terraform/

Image depicting the HashiCorp Terraform and Amazon Web Services (AWS) logos. Underneath the AWS logo are AWS service logos for Amazon Elastic Container Service (ECS), AWS CodePipeline, AWS CodeBuild, and Amazon CloudFront

Graphic created by Kevon Mayers.

Introduction

As customers increasingly seek to harness the power of generative AI (GenAI) and machine learning to deliver cutting-edge applications, the need for a flexible, intuitive, and scalable development platform has never been greater. In this landscape, Streamlit has emerged as a standout tool, making it easy for developers to prototype, build, and deploy GenAI-powered apps with minimal friction. It is an open-source Python framework designed to simplify the development of custom web applications for data science, machine learning, and GenAI projects. With Streamlit, developers can quickly transform Python scripts into interactive dashboards, LLM-powered chatbots, and web apps, using just a few lines of code. Its unique combination of simplicity, interactivity, and speed is the perfect complement to the rapid advancements in AI.

When deploying Streamlit applications, customers often face the challenge of ensuring their applications are highly available and can scale to meet a variable amount of demand. To achieve these goals, customers are looking at serverless approaches to deploying their Streamlit apps. With a serverless application, you only pay for the resources required and do not have to worry about managing servers or capacity planning.

In this post, we will walk you through deploying containerized, serverless Streamlit applications automatically via HashiCorp Terraform, an Infrastructure as Code (IaC) tool that enables users to define and provision infrastructure across cloud platforms.

Solution Overview

For this solution, we have the Streamlit app running on an Amazon Elastic Container Service (ECS) cluster across multiple availability zones (AZs), using AWS Fargate to manage the compute. Fargate is a serverless, pay-as-you-go compute engine that lets you focus on building apps without managing servers. Using Fargate helps reduce the undifferentiated heavy lifting that can come with building and maintaining web applications. It is also often desirable to use a Content Delivery Network (CDN) to ensure low latency for users globally by caching the content at edge locations closer to where the users are geographically located.

Let’s zoom in on the two architectures – the Streamlit App hosting architecture, and the Streamlit App deployment pipeline.

Streamlit app hosting

Image depicting the AWS data flow architecture for the solution. The architecture shows an Amazon Elastic Container Service (ECS) cluster that spans across two availability zones. Within each availability zone are a public and private subnet. A NAT gateway is within the public subnet, and an ECS Cluster with AWS Fargate deployment type is in the private subnet. An Internet Gateway (IGW) is used to allow traffic to flow through the NAT Gateway out to the internet. An Application Load Balancer (ALB) is used to distribute the load to the ECS cluster. Amazon CloudFront is used as the content delivery network (CDN).

In the above architecture, the following flow applies:

  1. Users access the Streamlit App using the public DNS endpoint for an Amazon CloudFront distribution.
  2. Using an Internet Gateway (IGW), user requests are routed to a public-facing Application Load Balancer (ALB).
  3. This ALB has target groups which map to ECS task nodes that are part of an ECS cluster running in two AZs (us-east-1a and us-east-1b in this example).
  4. Fargate will automatically scale the underlying compute nodes in the ECS cluster based on the demand.

Streamlit app deployment pipeline

Image depicting the Streamlit app deployment pipeline architecture. Within it, a developer uploads a .zip file called streamlit-app-assets.zip to an Amazon S3 Bucket. This upload event is processed by Amazon EventBridge, which in turn invokes an AWS CodePipeline to run. Related artifacts are stored in a connected CodePipeline S3 bucket. CodePipeline orchestrates an AWS CodeBuild project that creates a new Docker image using the .zip file that was uploaded, and stores it in an Amazon Elastic Container Registry (ECR) repository. This image upload triggers a new Amazon Elastic Container Service (ECS) deployment. Terraform then creates an Amazon CloudFront invalidation to serve the new version of the application to customers.

In the above architecture, the following flow applies:

  1. The user develops a Streamlit App locally and defines the path to its assets in the module configuration. Running terraform apply generates a local .zip file of the Streamlit App directory and uploads it to an Amazon S3 bucket (Streamlit Assets) with versioning enabled. The bucket is configured to trigger the Streamlit CI/CD pipeline.
  2. AWS CodePipeline (Streamlit CI/CD pipeline) begins running. The pipeline copies the .zip file from the Streamlit Assets S3 Bucket, stores the contents in a connected CodePipeline Artifacts S3 bucket, and passes the asset to the AWS CodeBuild project that is also part of the pipeline.
  3. CodeBuild (Streamlit CodeBuild Project) configures a compute/build environment and fetches a Python Docker Image from a public Amazon ECR repository. CodeBuild uses Docker to build a new Streamlit App image based on what is defined in the Dockerfile within the .zip file. It tags the image with latest, an app_version (user-defined in Terraform), and the S3 Version ID of the .zip file, then pushes the new image to a private ECR repository.
  4. ECS has a task definition that references the image in ECR based on the S3 Version ID tag which will always be a unique value, as it is generated whenever a new version of the file is created. This also serves as data lineage so versions of the Streamlit App .zip files in S3 can be linked to versions of the image stored in ECR. Once a new image is pushed to ECR (with a unique image tag), the task definition is updated and the ECS service begins a new deployment using the new version of the Streamlit App.
  5. When a new image is pushed to ECR, the Terraform Module is configured to use the local-exec provisioner to run an AWS CLI command that creates a CloudFront invalidation. This enables users of the Streamlit app to use the new version without waiting for the time-to-live (TTL) of the cached file to expire on the edge locations (default is 24 hours).

Both of these pipelines are built and packaged into a Terraform module that can be reused efficiently with only a few lines of code.
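Step 5 of the flow above uses an AWS CLI command to create the CloudFront invalidation. For readers who prefer to script the same request in Python, the following is a minimal sketch of how the request could be assembled. It is not the module's implementation: the function name, the distribution ID, and the choice to invalidate every path with /* are assumptions made for illustration.

```python
import time
from typing import Any, Dict, List, Optional


def build_invalidation_request(
    distribution_id: str,
    paths: Optional[List[str]] = None,
    caller_reference: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for CloudFront's CreateInvalidation API."""
    items = paths or ["/*"]  # assumption: invalidate every cached path
    return {
        "DistributionId": distribution_id,
        "InvalidationBatch": {
            "Paths": {"Quantity": len(items), "Items": items},
            # CallerReference must be unique per request; a timestamp is one option.
            "CallerReference": caller_reference or str(time.time_ns()),
        },
    }


# "EDFDVBD6EXAMPLE" is a placeholder distribution ID.
request = build_invalidation_request("EDFDVBD6EXAMPLE", caller_reference="streamlit-v1.1.0")
print(request["InvalidationBatch"]["Paths"])

# With boto3 installed and credentials configured, the request could be sent with:
#   boto3.client("cloudfront").create_invalidation(**request)
```

Keeping the request builder separate from the API call makes it possible to inspect the payload without AWS credentials.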

Prerequisites

This solution requires the following prerequisites:

  • An AWS account. If you don’t have an account, you can sign up for one.
  • Terraform v1.0.0 or newer installed.
  • Python v3.8 or newer installed.
  • A Streamlit app. If you don’t have a Streamlit project already, you can download this app directory as a sample Streamlit app for this post and save it to a local folder.

Your folder structure will look something like this:

terraform_streamlit_folder
├── README.md
└── app                 # Streamlit app directory
    ├── home.py         # Streamlit app entry point
    ├── Dockerfile      # Dockerfile
    └── pages/          # Streamlit pages
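
Before running Terraform, it can help to confirm that the app directory matches the layout above. The following is a small pre-flight sketch, assuming the file names from the sample layout (home.py and Dockerfile); the helper name missing_app_files is hypothetical and is not part of the Terraform module.

```python
from pathlib import Path
from typing import List

# File names taken from the sample folder layout; adjust for your own app.
REQUIRED_FILES = ("home.py", "Dockerfile")


def missing_app_files(app_dir: str) -> List[str]:
    """Return the required files that are absent from the Streamlit app directory."""
    root = Path(app_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_app_files("./app")
    if missing:
        print("Missing from ./app: " + ", ".join(missing))
    else:
        print("App directory looks complete.")
```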

Create and initialize a Terraform project

In the same folder where your Streamlit app is saved (terraform_streamlit_folder in the example above), create and initialize a new Terraform project.

  1. In your preferred terminal, create a new file named main.tf by running the following command on Unix/Linux machines, or an equivalent command on Windows machines:
    touch main.tf
  2. Open up the main.tf file and add the following code to it:
    module "serverless-streamlit-app" {
      source          = "aws-ia/serverless-streamlit-app/aws"
      app_name        = "streamlit-app"
      app_version     = "v1.1.0" 
      path_to_app_dir = "./app" # Replace with path to your app
    }

    This code utilizes a module block with a source pointing to the Terraform module, and the appropriate input variables passed in. When Terraform encounters a module block, it loads and processes that module’s configuration files using the source. The Serverless Streamlit App Terraform module has many optional input variables. If you have existing resources, such as an existing VPC, subnets, and security groups that you’d like to reuse instead of deploying new ones, you can use the module’s input variables to reference your existing resources. However, in this post, we’re deploying all of the resources in the above architecture from scratch. Here, we simply define the source that references the module hosted in the Terraform Registry, provide an app_name that will be used as a prefix for naming your resources, the app_version that is used for tracking changes to your app, and the path_to_app_dir which is the path to the local directory where the assets for your Streamlit app are stored.

  3. Save the file.
  4. To initialize the Terraform working directory, run the following command in your terminal:
    terraform init

    The output will contain a success message like the following:

    "Terraform has been successfully initialized"

Output the CloudFront URL

To easily access the CloudFront URL of the deployed Streamlit application, you can add the URL as a Terraform output.

  1. In your terminal, create a new file named outputs.tf by running the following command on Unix/Linux machines, or an equivalent command on Windows machines:
    touch outputs.tf
  2. Open up the outputs.tf file and add the following code to it:
    output "streamlit_cloudfront_distribution_url" {
      value = module.serverless-streamlit-app.streamlit_cloudfront_distribution_url
    }
  3. Save the file.
    Now, your folder structure will look like:

    terraform_streamlit_folder
    ├── README.md
    ├── app                 # Streamlit app directory
    │   ├── home.py         # Streamlit app entry point
    │   ├── Dockerfile      # Dockerfile
    │   └── pages/          # Streamlit pages
    │     
    ├── main.tf             # Terraform Code (where you call the module) 
    └── outputs.tf          # Outputs definition

Deploy the solution

Now you can use Terraform to deploy the resources defined in your main.tf file.

  1. In your terminal, run the following command to deploy the infrastructure. This includes the hosting for your Streamlit application using ECS and CloudFront, as well as the pipeline that is used to push updates.
    terraform apply

    When the apply command finishes running, you’ll see the Terraform outputs displayed in the terminal.

  2. Navigate to the streamlit_cloudfront_distribution_url to see your Streamlit application that is hosted on AWS.
  3. When you make changes to your Streamlit codebase, you can go ahead and re-run terraform apply to push your new changes to your cloud environment.

When you update the Streamlit codebase, the CodePipeline and CodeBuild processes start automatically and deploy your changes to your Streamlit application. CodePipeline automates the entire software release process, managing stages like source retrieval, building, testing, and deployment. It integrates with AWS services and third-party tools (such as GitHub and Jenkins) to enhance automation, speed, and security. CodeBuild focuses on automating code compilation, testing, and packaging, supporting multiple languages and custom Docker environments, while integrating with CodePipeline for scalable, secure builds. With this CI/CD pipeline, when you make changes to your code, all you need to run is terraform apply to update your cloud environment. For an example buildspec, see the example in the repo.

You can find full examples of deploying the infrastructure with and without existing resources in the GitHub repository.
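
If you want to pick up the deployed URL from a script rather than reading it in the terminal, Terraform can print its outputs as JSON with terraform output -json. The following sketch parses that JSON in Python. The helper name read_terraform_output is hypothetical, and the sample payload and URL are placeholders rather than real output from this module.

```python
import json


def read_terraform_output(outputs_json: str, name: str) -> str:
    """Return the value of one output from `terraform output -json` text."""
    outputs = json.loads(outputs_json)
    if name not in outputs:
        raise KeyError(f"Terraform output {name!r} not found")
    return outputs[name]["value"]


# Hypothetical sample of what `terraform output -json` prints; the URL is a placeholder.
sample = json.dumps({
    "streamlit_cloudfront_distribution_url": {
        "sensitive": False,
        "type": "string",
        "value": "https://d111111abcdef8.cloudfront.net",
    }
})
print(read_terraform_output(sample, "streamlit_cloudfront_distribution_url"))

# In a real project, the JSON text would come from the Terraform CLI, for example:
#   subprocess.run(["terraform", "output", "-json"],
#                  capture_output=True, text=True, check=True).stdout
```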

Clean up

When you no longer need the resources deployed in this post, you can clean them up by running terraform destroy. This will remove all of the resources you have deployed in this post with Terraform.

Conclusion

Building serverless Streamlit applications with Terraform on AWS offers a powerful combination of scalability, efficiency, and automation. As you continue to build and refine your Streamlit applications, Terraform’s flexibility ensures that your infrastructure can evolve seamlessly, supporting rapid innovation and agile development. With Streamlit and Terraform, you have the tools to create dynamic, serverless applications that scale effortlessly and operate reliably in the cloud.

Authors

Image depicting Kevon Mayers, a Solutions Architect at AWS

Kevon Mayers

Kevon Mayers is a Solutions Architect at AWS. Kevon is a Terraform Contributor and has led multiple Terraform initiatives within AWS. Prior to joining AWS he was working as a DevOps Engineer and Developer, and before that was working with the GRAMMYs/The Recording Academy as a Studio Manager, Music Producer, and Audio Engineer. He also owns a professional production company, MM Productions.

Image depicting Alexa Perlov, a Prototyping Architect at AWS

Alexa Perlov

Alexa Perlov is a Prototyping Architect with the Prototyping Acceleration team at AWS. She helps customers build with emerging technologies by open sourcing repeatable projects. She is currently based out of Pittsburgh, PA.

Image depicting Shravani Malipeddi, a Solutions Architect at AWS

Shravani Malipeddi

Shravani Malipeddi is a Solutions Architect at AWS who came out of the TechU Program. She currently supports strategic accounts and is based out of San Francisco, CA.

How Banfico built an Open Banking and Payment Services Directive (PSD2) compliance solution on AWS

Post Syndicated from Otis Antoniou original https://aws.amazon.com/blogs/architecture/how-banfico-built-an-open-banking-and-payment-services-directive-psd2-compliance-solution-on-aws/

This post was co-written with Paulo Barbosa, the COO of Banfico. 

Introduction

Banfico is a London-based FinTech company, providing market-leading Open Banking regulatory compliance solutions. Over 185 leading Financial Institutions and FinTech companies use Banfico to streamline their compliance process and deliver the future of banking.

Under the EU’s revised PSD2, banks can use application programming interfaces (APIs) to securely share financial data with licensed and approved third-party providers (TPPs), when there is customer consent. For example, this can allow you to track your bank balances across multiple accounts in a single budgeting app.

PSD2 requires that all parties in the open banking system are identified in real time using secured certificates. Banks must also provide a service desk to TPPs, and communicate any planned or unplanned downtime that could impact the shared services.

In this blog post, you will learn how the Red Hat OpenShift Service on AWS helped Banfico deliver their highly secure, available, and scalable Open Banking Directory — a product that enables seamless and compliant connectivity between banks and FinTech companies.

Using this modular architecture, Banfico can also serve other use cases such as confirmation of payee, which is designed to help consumers verify that the name of the recipient account, or business, is indeed the name that they intended to send money to.

Design Considerations

Banfico prioritized the following design principles when building their product:

  1. Scalability: Banfico needed their solution to be able to scale up seamlessly as more financial institutions and TPPs begin to utilize the solution, without any interruption to service.
  2. Leverage Managed Solutions and Minimize Administrative Overhead: The Banfico team wanted to focus on their areas of core competency around the product, financial services regulation, and open banking. They wanted to leverage solutions that could minimize the amount of infrastructure maintenance they have to perform.
  3. Reliability: Because the PSD2 regulations require real-time identification and up-to-date communication about planned or unplanned downtime, reliability was a top priority to enable stable communication channels between TPPs and banks. The Open Banking Directory therefore needed to reach availability of 99.95%.
  4. Security and Compliance: The Open Banking Directory needed to be highly secure, ensuring that sensitive data is protected at all times. This was also important due to Banfico’s ISO27001 certification.
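
To put the 99.95% availability target in concrete terms, it can be converted into a downtime budget. The following is a simple arithmetic sketch; the period lengths (a 365-day year and a 30-day month) are assumptions chosen for illustration and are not figures from Banfico.

```python
def downtime_budget_hours(availability_percent: float, period_hours: float) -> float:
    """Hours of downtime allowed in a period at the given availability target."""
    return period_hours * (1 - availability_percent / 100)


HOURS_PER_YEAR = 365 * 24      # non-leap year
HOURS_PER_30_DAYS = 30 * 24

yearly = downtime_budget_hours(99.95, HOURS_PER_YEAR)
monthly_minutes = downtime_budget_hours(99.95, HOURS_PER_30_DAYS) * 60
print(f"{yearly:.2f} hours per year, {monthly_minutes:.1f} minutes per 30 days")
# prints: 4.38 hours per year, 21.6 minutes per 30 days
```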

To address these requirements, Banfico decided to partner with AWS and Red Hat and use the Red Hat OpenShift Service on AWS (ROSA). This is a service operated by Red Hat and jointly supported with AWS to provide a fully managed Red Hat OpenShift platform that gives them a scalable, secure, and reliable way to build their product. They also leveraged other AWS Managed Services to minimize infrastructure management tasks and focus on delivering business value for their customers.

To understand how they were able to architect a solution that addressed their needs while following the design considerations, see the following reference architecture diagram.

Banfico’s Open Banking Directory Architecture Overview:

Banfico's open banking directory architecture overview diagram

Breakdown of key components:

Red Hat OpenShift Service on AWS (ROSA) cluster: The Banfico Open Banking SaaS key services are built on a ROSA cluster that is deployed across three Availability Zones for high availability and fault tolerance. These key services support the following fundamental business capabilities:

  • Their core aggregated API platform that integrates with, and provides access to banking information for TPPs.
  • Facilitating transactions and payment authorizations.
  • TPP authentication and authorization, more specifically:
    • Checking if a certain TPP is authorized by each country’s central bank to check account information and initiate payments.
    • Validating TPP certificates that are issued by Qualified Trust Service Provider (QTSPs), which are: “regulated (Qualified) to provide trusted digital certificates under the electronic Identification and Signature (eIDAS) regulation. PSD2 also requires specific types of eIDAS certificates to be issued.” – Planky Open Banking Glossary
  • Certificate issuing and management. Banfico is able to issue, manage, and store digital certificates that TPPs can use to interact with Open Banking APIs.
  • Collecting regulated entity details from central banks across the world.

Elastic Load Balancer (ELB): A load balancer helps Banfico deliver their highly-available and scalable product. It allows them to route traffic across their containers as they grow, and perform health checks accordingly, and it provides Banfico customers access to the application workloads running on ROSA through the ROSA router layers.

Amazon Elastic File System (Amazon EFS): During the collection of data from central banks, either through APIs or by scraping HTML, Banfico’s workloads and apps use the highly-scalable and durable Amazon EFS for shared storage. Amazon EFS automatically scales and provides high availability, simplifying operations and enabling Banfico to focus on application development and delivery.

Amazon Simple Storage Service (Amazon S3): To store digital certificates issued and managed by Banfico’s Open Banking Directory, they rely on Amazon S3, which is a highly-durable, available, and scalable object storage service.

Amazon Relational Database Service (Amazon RDS): The Open Banking Directory uses Amazon RDS PostgreSQL to store application data coming from their different containerized services. Using Amazon RDS, they are able to have a highly-available managed relational database which they also replicate to a secondary Region for disaster recovery purposes.

AWS Key Management Service (AWS KMS): Banfico uses AWS KMS to encrypt all data stored on the volumes used by Amazon RDS to make sure their data is secured.

AWS Identity and Access Management (IAM): Leveraging IAM with the principle of least privilege allows the product to follow security best practices.

AWS Shield: Banfico’s product relies on AWS Shield for DDoS protection, which helps in dynamic detection and automatic inline mitigation.

Amazon Route 53: Amazon Route 53 routes end users to Banfico’s site reliably with globally dispersed Domain Name System (DNS) servers and automatic scaling. It can be set up in minutes, and custom routing policies help Banfico maintain compliance.

Using this architecture and AWS technologies, Banfico is able to deliver their Open Banking Directory to their customers, through a SaaS frontend as shown in the following image.

Banfico's Open Banking Directory SaaS front-end

Conclusion

This AWS solution has proven instrumental in meeting Banfico’s critical business needs, delivering 99.95% availability and scalability. Through the utilization of AWS services, the Open Banking Directory product seamlessly accommodates the entirety of Banfico’s client traffic across Europe. This heightened agility not only facilitates rapid feature deployment (40% faster application development), but also enhances user satisfaction. Looking ahead, Banfico’s Open Banking Directory remains committed to fostering safety and trust within the open banking ecosystem, with AWS standing as a valued partner in Banfico’s journey toward sustained success. Customers who are looking to build their own secure and scalable products in the Financial Services Industry have access to industry AWS Specialists; contact us for help in your cloud journey. You can also learn more about AWS services and solutions for financial services by visiting AWS for Financial Services.

Exploring Telemetry Events in Amazon Q Developer

Post Syndicated from David Ernst original https://aws.amazon.com/blogs/devops/exploring-telemetry-events-in-amazon-q-developer/

As organizations increasingly adopt Amazon Q Developer, understanding how developers use it is essential. Diving into specific telemetry events and user-level data clarifies how users interact with Amazon Q Developer, offering insights into feature usage and developer behaviors. This granular view, accessible through logs, is vital for identifying trends, optimizing performance, and enhancing the overall developer experience. This blog is intended to give visibility to key telemetry events logged by Amazon Q Developer and how to explore this data to gain insights.

To help you get started, the following sections will walk through several practical examples that showcase how to extract meaningful insights from AWS CloudTrail. By reviewing the logs, organizations can track usage patterns, identify top users, and empower them to train and mentor other developers, ultimately fostering broader adoption and engagement across teams.

Although the examples here focus on Amazon Athena for querying logs, the methods can be adapted to integrate with other tools like Splunk or Datadog for further analysis. Through this exploration, you will learn how to query the log data to better understand how Amazon Q Developer is used within your organization.

Solution Overview 

Architecture diagram illustrating the solution using Amazon Q Developer's logs from the IDE and terminal, captured in AWS CloudTrail. The logs are stored in Amazon S3 and queried using Amazon Athena to analyze feature usage, including in-line code suggestions, chat interactions, and security scanning events.

This solution leverages Amazon Q Developer’s logs from the Integrated Development Environment (IDE) and terminal, captured in AWS CloudTrail. The logs will be queried directly using Amazon Athena from Amazon Simple Storage Service (Amazon S3) to analyze feature usage, such as in-line code suggestions, chat interactions, and security scanning events.

Analyzing Telemetry Events in Amazon Q Developer

Amazon Athena is used to query the CloudTrail logs directly to analyze this data. By utilizing Athena, queries can be run on existing CloudTrail records, making it simple to extract insights from the data in its current format.

Ensure CloudTrail is set up to log the data events:

  1. Navigate to the AWS CloudTrail Console.
  2. Edit an Existing Trail:
    • If you have a trail, verify it is configured to log data events for Amazon CodeWhisperer.
    • Note: As of 4/30/24, CodeWhisperer has been renamed to Amazon Q Developer. All the functionality previously provided by CodeWhisperer is now part of Amazon Q Developer. However, for consistency, the original API names have been retained. 
  3. Click on your existing trail in CloudTrail. Find the Data Events section and click edit.
    • For CodeWhisperer:
      • Data event type: CodeWhisperer
      • Log selector template: Log all events
  4. Save your changes.
  5. Note your “Trail log location.” This S3 bucket will be used in our Athena setup.

If you don’t have an existing trail, follow the instructions in the AWS CloudTrail User Guide to set up a new trail.

Below is a screenshot of the data events addition:

Screenshot showing the configuration of data events in AWS CloudTrail. The image illustrates the setup for logging data events for CodeWhisperer, including log selector templates ("Log all events").

Steps to Create an Athena Table from CloudTrail Logs: This step aims to turn CloudTrail events into a queryable Athena table.

 1. Navigate to the AWS Management Console > Athena > Editor.

 2. Click on the plus to create a query tab.

 3. Run the following query to create a database and table. Be sure to update the location to point to your S3 bucket.

-- Step 1: Create a new database (if it doesn't exist)
CREATE DATABASE IF NOT EXISTS amazon_q_metrics;

-- Step 2: Create the external table explicitly within the new database
CREATE EXTERNAL TABLE amazon_q_metrics.cloudtrail_logs (

    userIdentity STRUCT<
        accountId: STRING,
        onBehalfOf: STRUCT<
            userId: STRING,
            identityStoreArn: STRING
        >
    >,  
    eventTime STRING,
    eventSource STRING,
    eventName STRING,
    requestParameters STRING,
    requestId STRING,
    eventId STRING,
    resources ARRAY<STRUCT<
        arn: STRING,
        accountId: STRING,
        type: STRING
    >>,
    recipientAccountId STRING

)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://{Insert Bucket Name from CloudTrail}/'
TBLPROPERTIES ('classification'='cloudtrail');

 4. Click Run

 5. Run a quick query to view the data.

SELECT 
    eventTime,
    userIdentity.onBehalfOf.userId AS user_id,
    eventName,
    requestParameters
FROM 
    amazon_q_metrics.cloudtrail_logs AS logs
WHERE 
    eventName = 'SendTelemetryEvent'
LIMIT 10;

In this section, the significance of the telemetry events captured in the requestParameters field will be explained. The query begins by displaying key fields and their data, offering insights into how users interact with various features of Amazon Q Developer.

Query Breakdown:

  1. eventTime: This field captures the time the event was recorded, providing insights into when specific user interactions took place.
  2. userIdentity.onBehalfOf.userId: This extracts the userId of the user. This is critical for attributing interactions to the correct user, which will be covered in more detail later in the blog.
  3. eventName: The query is filtered on SendTelemetryEvent. Telemetry events are triggered when the user interacts with particular features or when a developer uses the service.
  4. requestParameters: The requestParameters field is crucial because it holds the details of the telemetry events. Its contents depend on the type of interaction and the feature the developer uses, and can include details such as the programming language, completion type, or code modifications.
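
Because the table stores eventTime as a string, some post-processing is usually needed to see when interactions occur. The following sketch groups exported eventTime values by UTC calendar day in Python. It assumes the timestamps follow the CloudTrail format shown in the comment; the function name and the sample values are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Dict, Iterable


def events_per_day(event_times: Iterable[str]) -> Dict[str, int]:
    """Count events per UTC calendar day from CloudTrail-style timestamps."""
    counts: Counter = Counter()
    for raw in event_times:
        # CloudTrail records eventTime in UTC, e.g. "2024-09-10T14:03:22Z".
        parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        counts[parsed.date().isoformat()] += 1
    return dict(sorted(counts.items()))


# Hypothetical eventTime values, as they might appear in query results.
sample_times = ["2024-09-10T14:03:22Z", "2024-09-10T16:45:00Z", "2024-09-11T09:12:05Z"]
print(events_per_day(sample_times))
# prints: {'2024-09-10': 2, '2024-09-11': 1}
```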

In the context of the SendTelemetryEvent, various telemetry events are captured in the requestParameters field of CloudTrail logs. These events provide insights into user interactions, overall usage, and the effectiveness of Amazon Q Developer’s suggestions. Here are the key telemetry events along with their descriptions:

  1. UserTriggerDecisionEvent
    • Description: This event is triggered when a user interacts with a suggestion made by Amazon Q Developer. It captures whether the suggestion was accepted or rejected, along with relevant metadata.
    • Key Fields:
      • completionType: Whether the completion was a block or a line.
      • suggestionState: Whether the user accepted, rejected, or discarded the suggestion.
      • programmingLanguage: The programming language associated with the suggestion.
      • generatedLine: The number of lines generated by the suggestion.
  2. CodeScanEvent
    • Description: This event is logged when a code scan is performed. It helps track the scope and result of the scan, providing insights into security and code quality checks.
    • Key Fields:
      • codeAnalysisScope: Whether the scan was performed at the file level or the project level.
      • programmingLanguage: The language being scanned.
  3. CodeScanRemediationsEvent
    • Description: This event captures user interactions with Amazon Q Developer’s remediation suggestions, such as applying fixes or viewing issue details.
    • Key Fields:
      • CodeScanRemediationsEventType: The type of remediation action taken (e.g., viewing details or applying a fix).
      • includesFix: A boolean indicating whether the user applied a fix.
  4. ChatAddMessageEvent
    • Description: This event is triggered when a new message is added to an ongoing chat conversation. It captures the user’s intent which refers to the purpose or goal the user is trying to achieve with the chat message. The intent can include various actions, such as suggesting alternate implementations of the code, applying common best practices, improving the quality or performance of the code.
    • Key Fields:
      • conversationId: The unique identifier for the conversation.
      • messageId: The unique identifier for the chat message.
      • userIntent: The user’s intent, such as improving code or explaining code.
      • programmingLanguage: The language related to the chat message.
  5. ChatInteractWithMessageEvent
    • Description: This event captures when users interact with chat messages, such as copying code snippets, clicking links, or hovering over references.
    • Key Fields:
      • interactionType: The type of interaction (e.g., copy, hover, click).
      • interactionTarget: The target of the interaction (e.g., a code snippet or a link).
      • acceptedCharacterCount: The number of characters from the message that were accepted.
      • acceptedSnippetHasReference: A boolean indicating if the accepted snippet included a reference.
  6. TerminalUserInteractionEvent
    • Description: This event logs user interactions with terminal commands or completions in the terminal environment.
    • Key Fields:
      • terminalUserInteractionEventType: The type of interaction (e.g., terminal translation or code completion).
      • isCompletionAccepted: A boolean indicating whether the completion was accepted by the user.
      • terminal: The terminal environment in which the interaction occurred.
      • shell: The shell used for the interaction (e.g., Bash, Zsh).

For a full exploration of all event types and their detailed fields, you can refer to the official schema reference for Amazon Q Developer.

Telemetry events are key to understanding how users engage with Amazon Q Developer. They track interactions such as code completion, security scans, and chat-based suggestions. Analyzing the data in the requestParameters field helps reveal usage patterns and behaviors that offer valuable insights.

By exploring events such as UserTriggerDecisionEvent, ChatAddMessageEvent, TerminalUserInteractionEvent, and others in the schema, organizations can assess the effectiveness of Amazon Q Developer and identify areas for improvement.
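
When working with exported query results outside Athena, the first step is usually to determine which telemetry event a record contains. The following is a minimal Python sketch that assumes requestParameters is a JSON string with the event nested under a telemetryEvent key, matching the JSON paths used in the queries in this post. The function name and sample record are hypothetical.

```python
import json
from typing import Optional


def telemetry_event_type(request_parameters: str) -> Optional[str]:
    """Return the event-type key nested under telemetryEvent, or None."""
    try:
        payload = json.loads(request_parameters)
    except (TypeError, ValueError):
        return None  # missing or malformed JSON
    if not isinstance(payload, dict):
        return None
    event = payload.get("telemetryEvent")
    if not isinstance(event, dict) or not event:
        return None
    # Assumption: each record carries exactly one event type, keyed in camelCase.
    return next(iter(event))


# Hypothetical record; the field values are made up.
sample = '{"telemetryEvent": {"codeScanEvent": {"codeScanJobId": "job-1"}}}'
print(telemetry_event_type(sample))
# prints: codeScanEvent
```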

Example Queries for Analyzing Developer Engagement

To gain deeper insights into how developers interact with Amazon Q Developer, the following queries can help analyze key telemetry data from CloudTrail logs. These queries track in-line code suggestions, chat interactions, and code-scanning activities. By running these queries, you can uncover valuable metrics such as the frequency of accepted suggestions, the types of chat interactions, and the programming languages most frequently scanned. This analysis helps paint a clear picture of developer engagement and usage patterns, guiding efforts to enhance productivity.

These four examples only cover a sample set of the available telemetry events, but they serve as a starting point for further exploration of Amazon Q Developer’s capabilities.

Query 1: Analyzing Accepted In-Line Code Suggestions

SELECT 
    eventTime,
    userIdentity.onBehalfOf.userId AS user_id,
    eventName,
    json_extract_scalar(requestParameters, '$.telemetryEvent.userTriggerDecisionEvent.suggestionState') AS suggestionState,
    json_extract_scalar(requestParameters, '$.telemetryEvent.userTriggerDecisionEvent.completionType') AS completionType
FROM 
    amazon_q_metrics.cloudtrail_logs
WHERE 
    eventName = 'SendTelemetryEvent'
    AND json_extract(requestParameters, '$.telemetryEvent.userTriggerDecisionEvent') IS NOT NULL
    AND json_extract_scalar(requestParameters, '$.telemetryEvent.userTriggerDecisionEvent.suggestionState') = 'ACCEPT';

Use Case: This use case focuses on how developers interact with in-line code suggestions by analyzing accepted snippets. It helps identify which users are accepting suggestions, the type of snippets being accepted (blocks or lines), and the programming languages involved. Understanding these patterns can reveal how well Amazon Q Developer aligns with the developers’ expectations.

Query Explanation: The query retrieves the event time, user ID, event name, suggestion state (filtered to show only ACCEPT), and completion type. TotalGeneratedLinesBlockAccept, totalGeneratedLinesLineAccept, and discarded suggestions are not included. Even so, the results show which developers use the service for in-line code suggestions and whether they accepted lines or blocks. Additionally, the programming language field can be extracted to see which languages are used during these interactions.
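
Query 1 returns only accepted suggestions. To relate them to all suggestions, one option is to export the requestParameters values without the ACCEPT filter and compute an acceptance rate per completion type. The following Python sketch shows the idea. The sample records are hypothetical, and the exact values for completionType (BLOCK, LINE) and for a rejected suggestionState (REJECT) are assumptions.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def acceptance_rate_by_completion_type(request_parameters: Iterable[str]) -> Dict[str, float]:
    """Share of suggestions in the ACCEPT state, grouped by completionType."""
    totals: Dict[str, int] = defaultdict(int)
    accepted: Dict[str, int] = defaultdict(int)
    for raw in request_parameters:
        decision = json.loads(raw).get("telemetryEvent", {}).get("userTriggerDecisionEvent")
        if not decision:
            continue  # not an in-line suggestion event
        kind = decision.get("completionType", "UNKNOWN")
        totals[kind] += 1
        if decision.get("suggestionState") == "ACCEPT":
            accepted[kind] += 1
    return {kind: accepted[kind] / totals[kind] for kind in totals}


def _decision(state: str, kind: str) -> str:
    """Build a hypothetical requestParameters string for the example."""
    return json.dumps({"telemetryEvent": {"userTriggerDecisionEvent": {
        "suggestionState": state, "completionType": kind}}})


records = [
    _decision("ACCEPT", "BLOCK"),
    _decision("REJECT", "BLOCK"),
    _decision("ACCEPT", "LINE"),
    json.dumps({"telemetryEvent": {"codeScanEvent": {}}}),  # ignored
]
print(acceptance_rate_by_completion_type(records))
# prints: {'BLOCK': 0.5, 'LINE': 1.0}
```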

Query 2: Analyzing Chat Interactions

SELECT 
    userIdentity.onBehalfOf.userId AS userId,
    json_extract_scalar(requestParameters, '$.telemetryEvent.chatInteractWithMessageEvent.interactionType') AS interactionType,
    COUNT(*) AS eventCount
FROM 
    amazon_q_metrics.cloudtrail_logs
WHERE 
    eventName = 'SendTelemetryEvent'
    AND json_extract(requestParameters, '$.telemetryEvent.chatInteractWithMessageEvent') IS NOT NULL
GROUP BY 
    userIdentity.onBehalfOf.userId,
    json_extract_scalar(requestParameters, '$.telemetryEvent.chatInteractWithMessageEvent.interactionType')
ORDER BY 
    eventCount DESC;

Use Case: This use case looks at how developers use chat options like upvoting, downvoting, and copying code snippets. Understanding the chat usage patterns shows which interactions are most used and how developers engage with Amazon Q Developer chat. As an organization, this insight can help support other developers in successfully leveraging this feature.

Query Explanation: The query provides insights into chat interactions within Amazon Q Developer by retrieving user IDs, interaction types, and event counts. This query aggregates data based on the interactionType field within chatInteractWithMessageEvent, showcasing various user actions such as UPVOTE, DOWNVOTE, INSERT_AT_CURSOR, COPY_SNIPPET, COPY, CLICK_LINK, CLICK_BODY_LINK, CLICK_FOLLOW_UP, and HOVER_REFERENCE.

This analysis highlights how users engage with the chat feature and the interactions, offering a view of interaction patterns. By focusing on the interactionType field, you can better understand how developers interact with the chat feature of Amazon Q Developer.

Query 3: Analyzing Code Scanning Jobs Across Programming Languages

SELECT 
    userIdentity.onBehalfOf.userId AS userId,
    json_extract_scalar(requestParameters, '$.telemetryEvent.codeScanEvent.programmingLanguage.languageName') AS programmingLanguage,
    COUNT(json_extract_scalar(requestParameters, '$.telemetryEvent.codeScanEvent.codeScanJobId')) AS jobCount
FROM 
    amazon_q_metrics.cloudtrail_logs
WHERE 
    eventName = 'SendTelemetryEvent'
    AND json_extract(requestParameters, '$.telemetryEvent.codeScanEvent') IS NOT NULL
GROUP BY 
    userIdentity.onBehalfOf.userId,
    json_extract_scalar(requestParameters, '$.telemetryEvent.codeScanEvent.programmingLanguage.languageName')
ORDER BY 
    jobCount DESC;

Use Case: Amazon Q Developer includes security scanning, and this section helps determine how the security scanning feature is being used across different users and programming languages within the organization. Understanding these trends provides valuable insights into which users actively perform security scans and the specific languages targeted for these scans.

Query Explanation: The query provides insights into the distribution of code scanning jobs across different programming languages in Amazon Q Developer. It retrieves user IDs and the count of code-scanning jobs by programming language. This analysis focuses on the CodeScanEvent, aggregating data to show the total number of jobs executed per language.

By counting the code scanning jobs per programming language, this query helps to understand which languages are most frequently analyzed. It provides a view of how users are leveraging the code-scanning feature. This can be useful for identifying trends in language usage and optimizing code-scanning practices.

Query 4: Analyzing User Activity Across Features

SELECT 
    userIdentity.onBehalfOf.userId AS user_id,
    COUNT(DISTINCT CASE 
        WHEN json_extract(requestParameters, '$.telemetryEvent.userTriggerDecisionEvent') IS NOT NULL 
        THEN eventId END) AS inline_suggestions_count,
    COUNT(DISTINCT CASE 
        WHEN json_extract(requestParameters, '$.telemetryEvent.chatInteractWithMessageEvent') IS NOT NULL 
        THEN eventId END) AS chat_interactions_count,
    COUNT(DISTINCT CASE 
        WHEN json_extract(requestParameters, '$.telemetryEvent.codeScanEvent') IS NOT NULL 
        THEN eventId END) AS security_scans_count,
    COUNT(DISTINCT CASE 
        WHEN json_extract(requestParameters, '$.telemetryEvent.terminalUserInteractionEvent') IS NOT NULL 
        THEN eventId END) AS terminal_interactions_count
FROM 
    amazon_q_metrics.cloudtrail_logs
WHERE 
    eventName = 'SendTelemetryEvent'
GROUP BY 
    userIdentity.onBehalfOf.userId

Use Case: This query looks at how developers use Amazon Q Developer across different features: in-line code suggestions, chat interactions, security scans, and terminal interactions. By tracking usage, organizations can see overall engagement and identify areas where developers may need more support or training. This helps teams get the most out of Amazon Q Developer.

Query Explanation: This query combines the events from the prior queries with additional events to tie everything together. It provides a comprehensive view of user activity within Amazon Q Developer by tracking the number of in-line code suggestions, chat interactions, security scans, and terminal interactions performed by each user. By analyzing these events, organizations can gain a better understanding of how developers are using these key features.

By summing up the interactions for each feature, this query helps identify which users are most active in each category, offering insights into usage patterns and areas where additional training or support may be needed.
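To make the counting logic of Query 4 concrete, the following minimal Python sketch applies the same idea to in-memory records instead of Athena. The sample records and the helper names (make_record, count_feature_usage) are hypothetical; only the telemetry event names and output column names come from the query above.

```python
import json
from collections import defaultdict

# Telemetry event keys from Query 4, mapped to the query's output column names.
FEATURE_COLUMNS = {
    "userTriggerDecisionEvent": "inline_suggestions_count",
    "chatInteractWithMessageEvent": "chat_interactions_count",
    "codeScanEvent": "security_scans_count",
    "terminalUserInteractionEvent": "terminal_interactions_count",
}


def make_record(event_id, user_id, telemetry, event_name="SendTelemetryEvent"):
    """Build a hypothetical, heavily simplified CloudTrail record."""
    return {
        "eventId": event_id,
        "eventName": event_name,
        "userIdentity": {"onBehalfOf": {"userId": user_id}},
        # requestParameters is stored as a JSON string, as in the Athena table.
        "requestParameters": json.dumps({"telemetryEvent": telemetry}),
    }


def count_feature_usage(records):
    """Count distinct event IDs per user and feature (hypothetical helper)."""
    seen = defaultdict(lambda: {column: set() for column in FEATURE_COLUMNS.values()})
    for record in records:
        if record.get("eventName") != "SendTelemetryEvent":
            continue  # same filter as the WHERE clause
        per_user = seen[record["userIdentity"]["onBehalfOf"]["userId"]]
        telemetry = json.loads(record["requestParameters"]).get("telemetryEvent", {})
        for event_key, column in FEATURE_COLUMNS.items():
            if telemetry.get(event_key) is not None:
                per_user[column].add(record["eventId"])
    return {
        user: {column: len(ids) for column, ids in columns.items()}
        for user, columns in seen.items()
    }


sample_records = [
    make_record("e1", "user-1", {"codeScanEvent": {"codeScanJobId": "job-1"}}),
    make_record("e2", "user-1", {"userTriggerDecisionEvent": {"suggestionState": "ACCEPT"}}),
    make_record("e3", "user-2", {"chatInteractWithMessageEvent": {"interactionType": "UPVOTE"}}),
    make_record("e4", "user-2", {"codeScanEvent": {}}, event_name="OtherEvent"),
]
usage = count_feature_usage(sample_records)
```

Using a set of event IDs per feature corresponds to COUNT(DISTINCT ...) in the query, so a duplicated log record is not counted twice.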

Enhancing Metrics with Display Names and Usernames

The previous queries had userid as a field; however, many customers would prefer to see a user alias (such as username or display name). The following section illustrates enhancing these metrics by augmenting user IDs with display names and usernames from the AWS IAM Identity Center. This will provide more human-readable user names.

In this example, the export is run locally to enhance user metrics with IAM Identity Center for simplicity. This method works well for demonstrating how to access and work with the data, but it provides a static snapshot of the users at the time of export. In a production environment, an automated solution would be preferable to capture newly added users continuously. For the purposes of this blog, this straightforward approach is used to focus on data access.

To proceed, install Python 3.8+ and Boto3, and configure AWS credentials via the CLI. Then, run the following Python script locally to export the data:

import csv

import boto3

# Replace this with the Region of your IAM Identity Center instance
REGION_NAME = 'us-east-1'

# Create the clients
idstore_client = boto3.client('identitystore', region_name=REGION_NAME)
sso_admin_client = boto3.client('sso-admin', region_name=REGION_NAME)

# Look up the identity store that backs the first Identity Center instance
instances = sso_admin_client.list_instances()['Instances']
identity_store_id = instances[0]['IdentityStoreId']

# Query all users, following NextToken until every page has been read
user_rows = []
request_args = {'IdentityStoreId': identity_store_id}
while True:
    response = idstore_client.list_users(**request_args)
    for user in response['Users']:
        # DisplayName and UserName may be absent, so fall back to an empty string
        user_rows.append([user.get('DisplayName', ''), user.get('UserName', ''), user['UserId']])
    next_token = response.get('NextToken')
    if not next_token:
        break
    request_args['NextToken'] = next_token

# Write the query results to IDCUserInfo.csv
with open('IDCUserInfo.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file, quoting=csv.QUOTE_ALL)
    writer.writerow(['DisplayName', 'UserName', 'UserId'])
    writer.writerows(user_rows)

This script will query the IAM Identity Center for all users and write the results to a CSV file, including DisplayName, UserName, and UserId. After generating the CSV file, upload it to an S3 bucket. Please make note of this S3 location.

Steps to Create an Athena Table from the above CSV output: Create a table in Athena to join the existing table with the user details.

 1. Navigate to the AWS Management Console > Athena > Editor.

 2. Click on the plus to create a query tab.

 3. Run the following query to create the table. Update the location to point to your S3 bucket.

CREATE EXTERNAL TABLE amazon_q_metrics.user_data (
    DisplayName STRING,
    UserName STRING,
    UserId STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   'separatorChar' = ',',
   'quoteChar'     = '"'
)
STORED AS TEXTFILE
LOCATION 's3://{Update to your S3 object location}/'  -- Path containing CSV file
TBLPROPERTIES ('skip.header.line.count'='1');

 4. Click Run

 5. Now, let’s run a quick query to verify the data in the new table.

SELECT * FROM amazon_q_metrics.user_data limit 10;  

The first query creates an external table in Athena from user data stored in a CSV file in S3. The user_data table has three fields: DisplayName, UserName, and UserId. To specify the correct parsing of the CSV, separatorChar is specified as a comma and quoteChar as a double quote. Additionally, the TBLPROPERTIES ('skip.header.line.count'='1') flag skips the header row in the CSV file, ensuring that column names aren’t treated as data.

The user_data table holds key details: DisplayName (full name), UserName (username), and UserId (unique identifier). This table will be joined with the cloudtrail_logs table using the userId field from the onBehalfOf struct, enriching the interaction logs with human-readable user names and display names instead of user IDs.

In the previous analysis of in-line code suggestions, the focus was on retrieving key metrics related to user interactions with Amazon Q Developer. The query below follows a similar structure but now includes a join with the user_data table to enrich insights with additional user details such as DisplayName and Username.

To join with the user_data table, the query needs a key shared between the cloudtrail_logs and user_data tables. This example uses the user ID: userIdentity.onBehalfOf.userId in cloudtrail_logs and UserId in user_data.

SELECT 
    logs.eventTime,
    user_data.displayname,  -- Additional field from user_data table
    user_data.username,     -- Additional field from user_data table
    json_extract_scalar(logs.requestParameters, '$.telemetryEvent.userTriggerDecisionEvent.suggestionState') AS suggestionState,
    json_extract_scalar(logs.requestParameters, '$.telemetryEvent.userTriggerDecisionEvent.completionType') AS completionType
FROM 
    amazon_q_metrics.cloudtrail_logs AS logs  -- Specified database for cloudtrail_logs
JOIN 
    amazon_q_metrics.user_data  -- Specified database for user_data
ON 
    logs.userIdentity.onBehalfOf.userId = user_data.userid
WHERE 
    logs.eventName = 'SendTelemetryEvent'
    AND json_extract_scalar(logs.requestParameters, '$.telemetryEvent.userTriggerDecisionEvent.suggestionState') = 'ACCEPT';

This approach allows for a deeper analysis by integrating user-specific information with the telemetry data, helping you better understand how individual users interact with the in-line suggestions and other features of Amazon Q Developer.
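The enrichment step can also be illustrated outside Athena. The following is a minimal Python sketch, assuming a CSV in the same layout the export script writes; the sample users, the sample events, and the helper names load_users and enrich are hypothetical.

```python
import csv
import io

# Hypothetical CSV content with the DisplayName, UserName, UserId layout.
csv_text = (
    '"DisplayName","UserName","UserId"\n'
    '"Alice Example","alice","1111"\n'
    '"Bob Example","bob","2222"\n'
)


def load_users(csv_file):
    """Index the exported users by UserId (hypothetical helper)."""
    return {row["UserId"]: row for row in csv.DictReader(csv_file)}


def enrich(events, users):
    """Attach display name and username to each event, like the SQL inner join."""
    enriched = []
    for event in events:
        user = users.get(event["userId"])
        if user is None:
            continue  # an inner join drops events without a matching user
        enriched.append(
            {**event, "displayname": user["DisplayName"], "username": user["UserName"]}
        )
    return enriched


users = load_users(io.StringIO(csv_text))
events = [
    {"userId": "1111", "suggestionState": "ACCEPT"},
    {"userId": "9999", "suggestionState": "ACCEPT"},  # not present in the export
]
enriched_events = enrich(events, users)
```

Because the export is a static snapshot, events from users added after the export have no match and drop out of the result, which is the same behavior the inner join in the query has.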

Cleanup

If you have been following along with this workflow, it is important to clean up the resources to avoid unnecessary charges. You can perform the cleanup by running the following statements in the Amazon Athena console. Athena runs one statement per query, so run each statement separately:

-- Step 1: Drop the tables
DROP TABLE IF EXISTS amazon_q_metrics.cloudtrail_logs;
DROP TABLE IF EXISTS amazon_q_metrics.user_data;

-- Step 2: Drop the database after the tables are removed
DROP DATABASE IF EXISTS amazon_q_metrics CASCADE;

These statements remove both the cloudtrail_logs and user_data tables, followed by the amazon_q_metrics database.

Remove the S3 objects used to store the CloudTrail logs and user data by navigating to the S3 console, selecting the relevant buckets or objects, and choosing “Delete.”

If a new CloudTrail trail was created, consider deleting it to stop further logging. For instructions, see Deleting a Trail. If an existing trail was used, remove the CodeWhisperer data events to prevent continued logging of those events.

Conclusion

By tapping into Amazon Q Developer’s logging capabilities, organizations can unlock detailed insights that drive better decision-making and boost developer productivity. The ability to analyze user-level interactions provides a deeper understanding of how the service is used.

Now that you have these insights, the next step is leveraging them to drive improvements. For example, organizations can use this data to identify opportunities for proofs of concept (PoCs) and pilot programs that further demonstrate the value of Amazon Q Developer. By focusing on areas where engagement is high, you can support the most engaged developers as champions to advocate for the tool across the organization, driving broader adoption.

The true potential of these insights lies in the “art of the possible.” With the data provided, it is up to you to explore how to query or visualize it further. Whether you’re examining metrics for in-line code suggestions, interactions, or security scanning, this foundational analysis is just the beginning.

As Amazon Q Developer continues to evolve, staying updated with emerging telemetry events is crucial for maintaining visibility into the available metrics. You can do this by regularly visiting the official Amazon Q Developer documentation and the Amazon Q Developer changelog to stay up to date with the latest information and insights.

About the authors:

David Ernst

David Ernst is an AWS Sr. Solution Architect with a DevOps and Generative AI background, leveraging over 20 years of IT experience to drive transformational change for AWS’s customers. Passionate about leading teams and fostering a culture of continuous improvement, David excels in architecting and managing cloud-based solutions, emphasizing automation, infrastructure as code, and continuous integration/delivery.

Joe Miller

Joseph Miller is an AWS Software Engineer working to illuminate Q usage insights. He specializes in Distributed Systems and Big Data applications. Joseph is passionate about high performance distributed computing, and is proficient in C++, Java and Python. In his free time, he skis and rock climbs.

How to implement relationship-based access control with Amazon Verified Permissions and Amazon Neptune

Post Syndicated from Henry Ho original https://aws.amazon.com/blogs/security/how-to-implement-relationship-based-access-control-with-amazon-verified-permissions-and-amazon-neptune/

Externalized authorization for custom applications is a security approach where access control decisions are managed outside of the application logic. Instead of embedding authorization rules within the application’s code, these rules are defined as policies, which are evaluated by a separate system to make an authorization decision. This separation enhances an application’s security posture by aligning with Zero Trust principles of continual real-time authorization, simplifies the management of security policies, and enables consistent policy enforcement across multiple applications. Amazon Verified Permissions is a scalable permissions management and fine-grained authorization service that you can use to externalize application authorization.

Two common access control models that you might consider when implementing your authorization system are role-based access control (RBAC) and attribute-based access control (ABAC). RBAC grants permissions to users based on their assigned roles within an organization, simplifying the management of access by grouping permissions into roles that correspond to job functions. ABAC grants permissions based on a set of attributes associated with users, resources, and the context, allowing for more fine-grained and dynamic authorization decisions. However, as systems become more complex and have more interconnected data—especially in environments like social networks, collaborative environments, and multi-tenant applications—the limitations of RBAC and ABAC become apparent. These models often fail to effectively capture the relationships between entities. Relationship-based access control (ReBAC) offers a more nuanced approach by using the relationships between users and resources to make decisions about permitted actions, thus addressing scenarios more efficiently than other models.

In this blog post, we show you how to implement ReBAC using Verified Permissions and Amazon Neptune, a managed, serverless graph database on AWS.

What is relationship-based access control?

The core principle of ReBAC is that authorization decisions are based on the relationships between the principal requesting access and the resource being accessed. These relationships can be of several types—ownership, collaboration, or membership relationships—that form hierarchical structures. Examples of ReBAC can be found in multiple domains, including social media sites, project management tools, and content management systems. For example, in a social media application, ReBAC can be used to control who can view, comment, or share a post based on the relationships between the poster, their connections, and the content itself.

Conceptually, roles are types of relationships, and relationships are subsets of attributes.

Benefits of ReBAC

In some types of applications, relationships change dynamically. For example, in a collaborative or social media application, relationships such as contributor or co-owner are continually being established between individual users and resources. Compared to traditional access control models, ReBAC offers the following benefits in these use cases.

  • Fine-grained access control – ReBAC grants access at the level of an individual resource based on a user’s relationship with that resource. For example, a user can update individual photo albums with which they have a contributor relationship.
  • Scalability and adaptability – Relationships can change dynamically. Access permissions are updated automatically when a relationship changes. For example, when the contributor relationship is removed, the user no longer has access.
  • Support for hierarchies – ReBAC can handle hierarchical relationships. For example, the contributor relationship can be inherited down through an album hierarchy, permitting the user to update photo albums that are members of the album with which they have the relationship.

Common relationship models in ReBAC

Here are some common relationship models, also shown in Figure 1, for consideration when building the application and its authorization system:

  • Resource ownership – Permissions to access or manipulate a resource are granted based on whether a user owns that resource. For example, you can delete a GitHub repository if you are the owner of the repository.
  • Resource hierarchies – Permissions to access or manipulate a resource are granted based on the permissions that a principal has for the parent resource. For example, a GitHub repository contributor can close issues that belong to that repository.
  • User hierarchies – These are similar to AWS Identity and Access Management (IAM) user groups. Principals that belong to a group will have the permissions granted to that group.

Figure 1: Common relationship models in ReBAC


In a relationship model, direct relationships represent clear, explicit links between users and resources, such as an employee who owns their expense reports or a file that is a member of a folder. These connections are straightforward and simple to define.

However, relationship models often extend beyond these direct links to include hierarchical structures. These create indirect relationships that are more complex in nature. For example, team managers might have access to all expense reports filed by their subordinates, even though they don’t directly own these reports. Similarly, folder owners might have access to all files within their subfolders, regardless of who created those files.

These indirect relationships are derived from a series of direct relationships. They form a relationship chain that, while not explicitly defined, is implied by the hierarchical structure. Because of their complexity and potential for far-reaching implications, these indirect relationships require careful consideration when designing an authorization system.

In this blog post, we focus on the implementation of the relationship models that use resource ownership and resource hierarchies, and relationship hierarchies in these models.

Example scenario

Consider a video application that allows users to manage and share videos of their pets. Alice and Bob are individual users within the environment and so they only have access permissions to their own directory or videos. Because Alice and Bob directly own their resources, they have direct OWNER relationships to these resources, represented as solid lines in Figure 2. aliceCatVideo.mp4 is a video resource stored in the aliceVideoDirectory directory. There is a MemberOf relationship between these resources.

Figure 2: Alice has a direct relationship to the resources that she directly owns

Charlie has a direct OWNER relationship to the root directory petVideosDirectory. Because aliceVideoDirectory is a subdirectory of petVideosDirectory, Charlie inherits an OWNER relationship to aliceVideoDirectory and the video resource aliceCatVideo.mp4 inside. This indirect OWNER relationship is inherited through the MemberOf relationship between resources and is represented as dotted lines in Figure 3.

Figure 3: Charlie has an indirect relationship to resources, inherited through the MemberOf relationship

When implementing access control for this scenario, both RBAC and ABAC offer distinct approaches. In RBAC, you might define roles such as OWNER and VIEWER, and grant Charlie full access to each resource through the OWNER role. While initially straightforward, this method can become inflexible as the application grows, potentially leading to role proliferation. For example, you might want to have separate roles to manage different resources (such as photos or videos) for each type of pet (such as cats or dogs). In ABAC, you might assign attributes such as OWNER and VIEWER and grant each user permissions to resources with specific attributes. This approach offers more flexibility, but fine-grained control can be more complex to set up and manage. As the application’s hierarchy becomes more intricate, both models face challenges in scaling while maintaining proper access control.

ReBAC addresses these limitations by implementing an access control model that uses direct and indirect relationships between principals and resources. In the example scenario, when Charlie requests access to the video resource aliceCatVideo.mp4, the application traverses the relationship graph in Neptune to retrieve the inherited OWNER relationship through the MemberOf relationship and make the authorization decision.

Overview of a ReBAC application

In this solution, relationship data is stored in Neptune. Prior to requesting an authorization decision from Verified Permissions, the application runs a Neptune query that traverses the relationship graph to retrieve the set of principals that have a specific relationship with the resource. The application then constructs an authorization request for Verified Permissions, using the results of this query to populate the entity data in the request.

In the Cedar schema, the resource has an attribute, named for the relationship, that contains the set of principals that have that relationship with the resource. In our sample application, entities of type Video have an attribute called owner, which contains the set of users that have an OWNER relationship, directly or indirectly, with a video. Each potential relationship is represented by a distinct resource attribute and requires a dedicated query to fetch the set of principals that have that relationship.

See the GitHub repository for the step-by-step walkthrough. In this post, we focus on the key concepts of the solution.

Architecture

Figure 4: Solution architecture


The solution architecture, as shown in Figure 4, includes the following:

  1. The user authenticates with Amazon Cognito and obtains an access token and an ID token.
  2. The user accesses the application through Amazon API Gateway with the provided token.
  3. An application AWS Lambda function traverses the relationship graph in Neptune and returns the set of principals that have a specific relationship with the resource.
  4. The application Lambda function constructs the requests by putting relationship data in the entities field and passes the requests to Verified Permissions. Verified Permissions acts as the policy decision point (PDP) and evaluates the Cedar policies to arrive at an authorization decision.
  5. The application Lambda function acts as the policy enforcement point (PEP) to enforce the authorization decision returned by Verified Permissions by allowing or denying access to the API.

Data modelling and queries in Neptune

Relationships between entities are created and stored in Neptune as a property graph. A property graph is a set of vertices and edges with respective properties (key-value pairs). The vertices represent entities such as User, Directory, and Video in our example, and the edges represent directional relationships between vertices. Each edge has a label that denotes the type of relationship.

Neptune supports multiple graph query languages, including Gremlin, openCypher, and SPARQL, to access a graph. In this solution, we use Gremlin as the graph query language. For more information about Gremlin, see the documentation from Apache TinkerPop. You can use Neptune graph notebooks to work with a Neptune graph.

You can visualize the relationship graph (Figure 5) using the following query. We use elementMap() to include attributes to represent a vertex or an edge.

# Visualizing the relationship graph and extracting the attributes of each vertex and edge
%%gremlin -p v,oute,inv
g.V().outE().inV().path().by(elementMap('name','directoryId','videoId','ownerName','ownerId','userId','isPublic').order().by(keys))

Figure 5: Relationship graph in Neptune

The following code snippet shows how to add a vertex for an entity and an edge for a relationship in a relationship graph. Static attributes such as ownerId, ownerName, and isPublic are defined as properties of a vertex. In our example, we define two relationships, MEMBEROF and OWNER, to denote the direct relationships between two resources and between a user and a resource, respectively.

# Adding video vertices (eg. aliceCatVideo_vertex)
g.addV('video').property('name', 'aliceCatVideo.mp4').property('videoId', aliceCatVideo_id).property('ownerId', alice_id).property('ownerName', 'alice').property('isPublic', False)

# Adding relationship edges
g.V(aliceCatVideo_vertex).addE('MEMBEROF').to(aliceVideosDir_vertex)
g.V(alice_vertex).addE('OWNER').to(aliceCatVideo_vertex)

It’s a best practice to assign universally unique identifiers (UUIDs) for all principal and resource identifiers. Another best practice is to not include personally identifying, confidential, or sensitive information as part of the unique identifier for your principals or resources.

To traverse the relationship graph to obtain the owner vertex of a resource vertex, you can use the following query. This query returns the vertex that has a direct OWNER relationship to the resource vertex aliceCatVideo.mp4.

# Retrieve the direct owner of a specific video
g.V().hasLabel('video').has('name', 'aliceCatVideo.mp4').in('OWNER').values('name')

You can use the following query to discover inherited OWNER relationships through MemberOf relationships between resources. The query traverses the relationship graph starting from a video vertex and returns the OWNER vertex of each resource vertex along the path to the root directory petVideosDirectory. It outputs the set of owners after deduplication. The application uses this query to discover the inherited owners in the file system hierarchy and includes them in the entities list of authorization requests.

# Retrieve the direct and transitive owners of a specific video
g.V().hasLabel('video').has('videoId',video_id).union(in('OWNER'),repeat(out('MEMBEROF')).until(has('name', 'petVideosDirectory')).in('OWNER')).dedup().values('userId').toList()
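To illustrate the traversal without a Neptune cluster, the following minimal Python sketch walks a small in-memory copy of the example graph. It follows the prose description above (collect the owners of every resource on the path from the video up to the root directory) and is not a translation of the Gremlin query. The dictionaries and the owners_of helper are hypothetical, and resource names stand in for the UUIDs a real graph would use.

```python
# Hypothetical in-memory copy of the example graph.
# MEMBER_OF maps a resource to its parent directory (the MEMBEROF edge).
MEMBER_OF = {
    "aliceCatVideo.mp4": "aliceVideoDirectory",
    "aliceVideoDirectory": "petVideosDirectory",
}

# OWNERS maps a resource to the users with a direct OWNER edge to it.
OWNERS = {
    "aliceCatVideo.mp4": {"alice"},
    "aliceVideoDirectory": {"alice"},
    "petVideosDirectory": {"charlie"},
}

# The root directory acts as the permission boundary for inherited ownership.
ROOT = "petVideosDirectory"


def owners_of(resource):
    """Return direct and inherited owners of a resource (hypothetical helper)."""
    owners = set()
    visited = set()
    current = resource
    while current is not None and current not in visited:
        visited.add(current)  # guards against accidental cycles in the data
        owners |= OWNERS.get(current, set())
        if current == ROOT:
            break  # stop at the permission boundary
        current = MEMBER_OF.get(current)
    return owners


video_owners = owners_of("aliceCatVideo.mp4")
```

Returning a set plays the role of dedup() in the query: alice owns both the video and its directory but appears only once in the result.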

Cedar policy design

Verified Permissions uses the Cedar policy language to define fine-grained permissions. The default decision for an authorization response is DENY. The first policy permits a principal to perform actions in the action group OwnerActions on resources in petVideosDirectory only when the same principal is included in the set of resource owners.

// Resource owner and related persons can access the resources
permit (
	principal,
	action in [PetVideosApp::Action::"OwnerActions"],
	resource in PetVideosApp::Directory::"<petVideosDirectory_UUID>" )
when { 
	resource has owner && 
	principal in resource.owner };

The second policy is an ABAC policy that permits a principal to perform actions in the action group PublicActions on resources in petVideosDirectory only when the resource has the static attribute isPublic and its value is true.

// Allow public access to the resources
permit (
	principal,
	action in [PetVideosApp::Action::"PublicActions"], 
	resource in PetVideosApp::Directory::"<petVideosDirectory_UUID>" )
when { 
	resource has isPublic &&
	resource.isPublic == true };
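The combined effect of the two policies can be illustrated in plain Python. This is a simplified sketch, not Cedar and not how Verified Permissions evaluates policies. It assumes that ViewVideo belongs to both action groups (consistent with the requests shown later in this post), while DeleteVideo, the sample resources, and the is_authorized helper are hypothetical.

```python
# Stands in for <petVideosDirectory_UUID> in the policies above.
ROOT_DIRECTORY = "petVideosDirectory"

# Assumed action group membership; DeleteVideo is a hypothetical action.
ACTION_GROUPS = {
    "OwnerActions": {"ViewVideo", "DeleteVideo"},
    "PublicActions": {"ViewVideo"},
}


def is_authorized(principal, action, resource):
    """Emulate the two permit policies with a default decision of DENY."""
    # Both policies are scoped to resources inside the root directory.
    if ROOT_DIRECTORY not in resource.get("parents", []):
        return "DENY"
    # Policy 1: the principal is in the resource's set of owners.
    if action in ACTION_GROUPS["OwnerActions"] and principal in resource.get("owner", set()):
        return "ALLOW"
    # Policy 2: the resource is marked public.
    if action in ACTION_GROUPS["PublicActions"] and resource.get("isPublic") is True:
        return "ALLOW"
    return "DENY"


alice_video = {
    "parents": ["aliceVideoDirectory", "petVideosDirectory"],
    "owner": {"alice", "charlie"},
    "isPublic": False,
}
public_video = {
    "parents": ["petVideosDirectory"],
    "owner": {"bob", "charlie"},
    "isPublic": True,
}
outside_video = {"parents": [], "owner": {"alice"}, "isPublic": True}
```

In this sketch, bob is denied ViewVideo on alice_video, while any principal can view public_video but only its owners can perform DeleteVideo on it.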

Implementing ReBAC using this Cedar design pattern in conjunction with a relationship graph requires the careful construction of queries. Verified Permissions will validate that the Cedar policies are correct, based on the Cedar schema, but cannot validate that the Neptune queries correctly traverse the graph to return the correct set of principals with the referenced relationship.

When designing your policies and queries, take the following guidelines into account.

  • Each Cedar policy governs the behaviors of a specific relationship, in this case OWNER. Use a distinct Cedar policy for each relationship in your use cases.
  • Define action groups for each relationship in your use cases.
  • Each new relationship referenced in a Cedar policy requires its own query, and the application needs to run this query if the relationship is relevant to the authorization request. Policy writers must collaborate closely with the application developer to help ensure that the application fetches all data that’s relevant to the authorization request.
  • Indirect relationships can be hard to intuit and prone to errors. The example here of an OWNER relationship inherited through the MEMBEROF relationship is relatively intuitive. However, we recommend avoiding policies that rely on indirect relationships that are derived from multiple different types of direct relationship.
  • Indirect relationships can be over-permissive when there is no permission boundary defined. In our example, the boundary for inherited relationship is defined at the root level of the directory (petVideosDirectory). Follow the least privilege principle to limit inherited relationship within a clearly defined permission boundary.
  • Use MEMBEROF to denote the parent relationship in your graph to align with Cedar policy terminology. However, remember that Verified Permissions cannot auto-discover the Neptune graph, so your queries will still need to be designed to traverse it correctly.

Authorization request to Verified Permissions

The following example shows the structure of an authorization request made to Verified Permissions. In the example, Amazon Cognito is used as the identity source of the Verified Permissions policy store. Cognito user ID claims are mapped to the user entity PetVideosApp::User. Tokens issued by Cognito are mapped to a principal ID in the format <user pool ID>|<sub> by Verified Permissions.

The following request was made for the action ViewVideo on the video resource entity with UUID 878c101a-ca0e-4733-904d-af3f252abf50 (the video ID of aliceCatVideo.mp4) using the ID token of alice. The application traversed the relationship graph in Neptune to fetch the users with the OWNER relationship, which returned the user IDs for alice and charlie, and included them in the owner attribute in the entities field. The entities field is an array of attributes that Verified Permissions can examine when evaluating the policies. The resource hierarchy of this video resource is represented by including the parent directories (petVideosDirectory and aliceVideosDirectory) as the parent entities in the authorization request.

With reference to the Cedar policy <Resource owner and related persons can access the resources>, the following authorization request returns an ALLOW decision.

{
    "policyStoreId": "HhuNNuHBJJYJd4MfEhAZzD",
    "identityToken": [ID Token Redacted],
    "action": {
        "actionType": "PetVideosApp::Action",
        "actionId": "ViewVideo"
    },
    "resource": {
        "entityType": "PetVideosApp::Video",
        "entityId": "878c101a-ca0e-4733-904d-af3f252abf50"
    },
    "entities": {
        "entityList": [
            {
                "identifier": {
                    "entityType": "PetVideosApp::Video",
                    "entityId": "878c101a-ca0e-4733-904d-af3f252abf50"
                },
                "attributes": {
                    "owner": {
                        "set": [
                            {
                                "entityIdentifier": {
                                    "entityType": "PetVideosApp::User",
                                    "entityId": "ap-southeast-2_K9khoza7q|696e7428-e021-708d-7996-2d322fcf4b29"
                                }
                            },
                            {
                                "entityIdentifier": {
                                    "entityType": "PetVideosApp::User",
                                    "entityId": "ap-southeast-2_K9khoza7q|f91eb468-2001-7080-b860-eff8e20c333c"
                                }
                            }
                        ]
                    },
                    "isPublic": {
                        "boolean": false
                    }
                },
                "parents": [
                    {
                        "entityType": "PetVideosApp::Directory",
                        "entityId": "8e46133a-18da-47dc-bb7c-5e8640f45043"
                    },
                    {
                        "entityType": "PetVideosApp::Directory",
                        "entityId": "5e732639-692b-4fb0-8b69-d305926144fe"
                    }
                ]
            }
        ]
    }
}
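As an illustration of how an application might assemble a request like the one above from the Neptune query results, here is a minimal Python sketch. The function name and its parameters are hypothetical; the dictionary keys and entity types mirror the request shown above, and the principal IDs use the <user pool ID>|<sub> format described earlier.

```python
def build_authorization_request(policy_store_id, identity_token, action_id,
                                video_id, owner_subs, user_pool_id,
                                is_public, parent_directory_ids):
    """Assemble an authorization request dictionary (hypothetical helper)."""
    video = {"entityType": "PetVideosApp::Video", "entityId": video_id}
    # One entity identifier per owner returned by the graph traversal.
    owners = [
        {
            "entityIdentifier": {
                "entityType": "PetVideosApp::User",
                "entityId": f"{user_pool_id}|{sub}",
            }
        }
        for sub in owner_subs
    ]
    parents = [
        {"entityType": "PetVideosApp::Directory", "entityId": directory_id}
        for directory_id in parent_directory_ids
    ]
    return {
        "policyStoreId": policy_store_id,
        "identityToken": identity_token,
        "action": {"actionType": "PetVideosApp::Action", "actionId": action_id},
        "resource": dict(video),
        "entities": {
            "entityList": [
                {
                    "identifier": dict(video),
                    "attributes": {
                        "owner": {"set": owners},
                        "isPublic": {"boolean": is_public},
                    },
                    "parents": parents,
                }
            ]
        },
    }


request = build_authorization_request(
    policy_store_id="HhuNNuHBJJYJd4MfEhAZzD",
    identity_token="<ID token>",  # placeholder, not a real token
    action_id="ViewVideo",
    video_id="878c101a-ca0e-4733-904d-af3f252abf50",
    owner_subs=[
        "696e7428-e021-708d-7996-2d322fcf4b29",
        "f91eb468-2001-7080-b860-eff8e20c333c",
    ],
    user_pool_id="ap-southeast-2_K9khoza7q",
    is_public=False,
    parent_directory_ids=[
        "8e46133a-18da-47dc-bb7c-5e8640f45043",
        "5e732639-692b-4fb0-8b69-d305926144fe",
    ],
)
```

A dictionary of this shape corresponds to the parameters of the Verified Permissions IsAuthorizedWithToken operation; the call itself is omitted here because it requires a deployed policy store and a valid token.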

Combining ReBAC policies with ABAC policies

ReBAC policies are a great fit when you want to create access based on a relationship between the principal and the resource. However, there can be cases where an ABAC policy is a more intuitive expression of a business rule. For example, in the sample application, you might want to grant all principals permission to view any public resource.

With ReBAC, you would need to create a vertex public in the relationship graph, create MEMBEROF relationships between all public resources and this vertex, and then create a VIEWER relationship between all principals and the vertex public.

With Cedar, you can create a policy store that is a mix of ReBAC and ABAC policies, enabling you to express this access rule with a single ABAC policy that allows public access to resources, as described in the section Cedar Policy Design. This policy grants broad access on resources with the attribute isPublic set to true.
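
To make the trade-off concrete, here is a minimal Python sketch that uses an in-memory edge set instead of Neptune. The principal and video names, the public vertex, and the helper functions are all hypothetical; they only illustrate how much bookkeeping each approach needs, not how the sample application is implemented.

```python
# Toy in-memory model; all names are illustrative and not part of the sample app.
principals = ["alice", "bob", "charlie"]
videos = {
    "bobDogVideo.mp4": {"isPublic": True},
    "aliceCatVideo.mp4": {"isPublic": False},
}


def rebac_edges_for_public_access(principals, videos):
    """Edges a pure ReBAC model needs to express 'anyone can view public videos'."""
    edges = set()
    for name, attrs in videos.items():
        if attrs["isPublic"]:
            edges.add((name, "MEMBEROF", "public"))
    for principal in principals:
        edges.add((principal, "VIEWER", "public"))
    return edges


def rebac_can_view(edges, principal, video):
    """Allow when the principal is a VIEWER of a vertex the video is a MEMBEROF."""
    groups = {dst for src, label, dst in edges if src == video and label == "MEMBEROF"}
    return any((principal, "VIEWER", group) in edges for group in groups)


def abac_can_view(videos, video):
    """Allow when the resource carries isPublic == True; no edges are required."""
    return videos[video]["isPublic"] is True


edges = rebac_edges_for_public_access(principals, videos)
# One MEMBEROF edge per public video plus one VIEWER edge per principal.
print(len(edges))
print(rebac_can_view(edges, "alice", "bobDogVideo.mp4"), abac_can_view(videos, "bobDogVideo.mp4"))
```

The ReBAC variant has to maintain an edge for every principal and every public video, whereas the ABAC check reads a single attribute on the resource.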

You can use the following Gremlin query to modify the static property isPublic of the video resource vertex bobDogVideo.mp4 to true.

# Set the property "isPublic" to "true" for a specific video
g.V().hasLabel('video').has('name','bobDogVideo.mp4').property(single,'isPublic',true)

You can verify the value of property isPublic of bobDogVideo.mp4 with the following Gremlin query.

# Verify the value of property "isPublic" of a specific video
g.V().hasLabel('video').has('name','bobDogVideo.mp4').values('isPublic')

The following authorization request is made to Verified Permissions using the principal alice after you have set the isPublic property of the video resource bobDogVideo.mp4. In the entities field, there is the attribute isPublic with true as the value.

With reference to the Cedar policy <Allow public access to the resources>, the following authorization request returns ALLOW.

{
    "policyStoreId": "HhuNNuHBJJYJd4MfEhAZzD",
    "identityToken": "[ID Token Redacted]",
    "action": {
        "actionType": "PetVideosApp::Action",
        "actionId": "ViewVideo"
    },
    "resource": {
        "entityType": "PetVideosApp::Video",
        "entityId": "8646429e-dca1-4229-aa26-9afcf75f053b"
    },
    "entities": {
        "entityList": [
            {
                "identifier": {
                    "entityType": "PetVideosApp::Video",
                    "entityId": "8646429e-dca1-4229-aa26-9afcf75f053b"
                },
                "attributes": {
                    "owner": {
                        "set": [
                            {
                                "entityIdentifier": {
                                    "entityType": "PetVideosApp::User",
                                    "entityId": "ap-southeast-2_K9khoza7q|b99ee448-f081-7078-5343-826a680f781f"
                                }
                            },
                            {
                                "entityIdentifier": {
                                    "entityType": "PetVideosApp::User",
                                    "entityId": "ap-southeast-2_K9khoza7q|f91eb468-2001-7080-b860-eff8e20c333c"
                                }
                            }
                        ]
                    },
                    "isPublic": {
                        "boolean": true
                    }
                },
                "parents": [
                    {
                        "entityType": "PetVideosApp::Directory",
                        "entityId": "b1551923-838e-43dc-946c-9fc63a85f445"
                    },
                    {
                        "entityType": "PetVideosApp::Directory",
                        "entityId": "5e732639-692b-4fb0-8b69-d305926144fe"
                    }
                ]
            }
        ]
    }
}
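
As a rough illustration of how such a request body could be assembled in application code, the following Python sketch builds the same structure from plain values. The helper names (entity_ref, build_video_entity, build_request) are hypothetical, and the sketch only constructs the payload; it does not call Verified Permissions.

```python
import json

NAMESPACE = "PetVideosApp"  # namespace used in the requests above


def entity_ref(entity_type, entity_id):
    """Return an entity identifier in the shape used by the request."""
    return {"entityType": f"{NAMESPACE}::{entity_type}", "entityId": entity_id}


def build_video_entity(video_id, owner_ids, is_public, directory_ids):
    """Describe one video with its owner set, isPublic attribute, and parent directories."""
    return {
        "identifier": entity_ref("Video", video_id),
        "attributes": {
            "owner": {"set": [{"entityIdentifier": entity_ref("User", o)} for o in owner_ids]},
            "isPublic": {"boolean": bool(is_public)},
        },
        "parents": [entity_ref("Directory", d) for d in directory_ids],
    }


def build_request(policy_store_id, identity_token, video_entity):
    """Assemble a ViewVideo authorization request for the given video entity."""
    return {
        "policyStoreId": policy_store_id,
        "identityToken": identity_token,
        "action": {"actionType": f"{NAMESPACE}::Action", "actionId": "ViewVideo"},
        "resource": video_entity["identifier"],
        "entities": {"entityList": [video_entity]},
    }


video = build_video_entity(
    "8646429e-dca1-4229-aa26-9afcf75f053b",
    ["ap-southeast-2_K9khoza7q|b99ee448-f081-7078-5343-826a680f781f"],
    True,
    ["b1551923-838e-43dc-946c-9fc63a85f445"],
)
request = build_request("HhuNNuHBJJYJd4MfEhAZzD", "<ID token placeholder>", video)
print(json.dumps(request, indent=4))
```

In a real application, the attribute values and parents would come from the graph queries against Neptune rather than from literals.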

Conclusion

In this post, we showed you what ReBAC is and its benefits and demonstrated the implementation of ReBAC using Amazon Verified Permissions and Amazon Neptune. We also reviewed Cedar policy design patterns and considerations, in addition to the authorization request structure for a ReBAC application. You also saw how to combine ReBAC policies with ABAC policies.

To learn more about this solution and the source code, visit the GitHub repository. For more information, see Cedar Policies, Amazon Verified Permissions, and Amazon Neptune.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
 

Henry Ho

Henry is a Senior Solutions Architect at AWS, dedicated to serving enterprise customers in Hong Kong. He specializes in cybersecurity and works with customers from different segments to establish secure landing zones on AWS, elevate their cloud security postures, and advocate cloud security.
Christine Chan

Christine is an Enterprise Support Technical Account Manager (TAM) based in Hong Kong. She focuses on serving large customers from different industries, using her expertise to provide guidance and technical support. She assists in delivering scalable, resilient, and cost-effective solutions. Apart from work, she also enjoys doing sports.

Securing Your Software Supply Chain with Amazon CodeCatalyst and Amazon Inspector

Post Syndicated from Piyush Mattoo original https://aws.amazon.com/blogs/devops/securing-your-software-supply-chain-with-amazon-codecatalyst-and-amazon-inspector/

Amazon CodeCatalyst is a unified service that streamlines the entire software development lifecycle, empowering teams to build, deliver, and scale applications on AWS.

DevSecOps is the practice of integrating security into all stages of software development. Rather than prioritizing features, it injects security into an earlier phase of the development process – baking it into design, coding, testing, deployment, and operations from the start. Extensive automation like policy checks, scanning, and more proactively uncovers risks.

Amazon Inspector Scan is a CodeCatalyst Action, a logical unit of work performed during a workflow run. It uses the software bill of materials (SBOM) generator (sbomgen) to produce an SBOM, and ScanSbom to scan a provided CycloneDX 1.5 SBOM and report any vulnerabilities discovered in it. An SBOM inventories third-party and open-source components in an application, documenting names, versions, licenses, dependencies, and more. It enables vital DevSecOps initiatives, such as checking an SBOM against CVE databases to rapidly identify vulnerable libraries needing remediation.

Introduction

This blog talks about the benefits of DevSecOps in general and the SBOM in particular. It provides a walkthrough of adding SBOM generation and scanning as a CodeCatalyst Action to an existing CodeCatalyst Workflow. A workflow is an automated procedure that describes how to build, test, and deploy your code as part of a CI/CD system. First, you will create a new Amazon CodeCatalyst project in the CodeCatalyst console. Next, you will modify the workflow to add the Amazon Inspector Scan action. Lastly, you will run the workflow and view the SBOM and vulnerabilities report.

Pre-requisites

Walkthrough

First, you will create a project using CodeCatalyst Blueprints. Blueprints set up a code repository with a working sample app, define cloud infrastructure, and run pre-configured CI/CD workflows for your project.

Create a project from a blueprint

Go to your space by clicking your space name in the CodeCatalyst console. From your space, click Create Project. Upon selecting Start with a blueprint, you will select the Single-page application blueprint and click Next as shown in figure 1.

This diagram shows the Amazon CodeCatalyst Create Project screen where you can create the project using a specific blueprint

Figure 1 Amazon CodeCatalyst Create Project screen

You will then pick a suitable name for your project. For this post, I will use SafeWebShip. Select the AWS IAM role associated with the space and account connection, then click Create Project.

Next, you will take a look at the current workflow and add the Amazon Inspector Scan action to identify the packages and libraries that make up a software application and scan for vulnerabilities from the associated packages and libraries.

Review the current workflow

A workflow defines a series of steps, or actions, to take during a workflow run and can be assembled using YAML or a visual editor. Actions which require interaction with AWS resources like creation, modification, reading, and deletion occur in the customer’s AWS account, such as creating an Amazon Inspector task to scan an SBOM report.

Add SBOM generation and scanning to the workflow

To add the Amazon Inspector Scan action to the workflow:

  • Navigate to the CI/CD menu on the left side of your screen, and then click Workflows
  • Click on the onPushToMainPipeline workflow
  • Click the edit button to make changes to the workflow
  • Ensure the Visual tab is selected, then add a CodeCatalyst Action by clicking Actions at the top left of the screen
  • In the new Actions catalog pop-up, search Amazon Inspector Scan. Click the + at the bottom right of the action card as shown in figure 2

This diagram shows an Amazon CodeCatalyst Actions Catalog where the results are filtered based on your search. You can search for "Amazon Inspector Scan" to integrate that action into your Amazon CodeCatalyst workflow

Figure 2 Amazon CodeCatalyst Actions Catalog

  • Click the Configuration tab of the action and rename the action to inspector_sbom by clicking the pencil under action name
  • Select the environment from the Environment dropdown, AWS account connection and the Role you created earlier in the pre-requisite
  • Scroll down to Path and ensure it is “./” which represents the root of the source repository. The tool will traverse all of the directories of the source repository for supported manifest files to scan
  • The Scan Source should be REPO. The action can scan directories or source repository and a container image. For the purpose of this blog, you will be scanning an existing source repository
  • The tool can be configured to run scanners that will inspect container images, packages, archive, directory, and binary scanners. For the purpose of this blog, you will be using javascript-nodejs scanner. You can skip the rest of the scanners. You can read the action’s documentation from the action’s catalog page for a full list of supported scanners
  • Scroll down to Severity Threshold, type medium to fail the action if a vulnerability of medium or greater is found
  • Skip Files determines the files to skip and should be public/ since you are scanning a public repository
  • Depth specifies the depth of directory traversal when generating the SBOM. You should pick Depth as 1 to scan all the files in the root directory of the public repository. Other inputs that are relevant to container images can be ignored as the source repository is only being scanned
  • The action produces two files, the SBOM in CycloneDX v1.5 format from sbomgen and the vulnerability report from ScanSbom. Click the Outputs tab of the action, under Artifacts click Add artifact
  • Name the Build artifact name as SBOM
  • For Files produced by build, paste the following: inspector_sbom_report.json
  • Click Add artifact again
  • Name the Build artifact name as SBOM_VULNERABILITIES
  • For Files produced by build, paste the following: inspector_scan_report.json

This diagram shows the Amazon CodeCatalyst Workflow Screen with input options pre-filled. The inputs depend on whether you want to scan a local application or a container image

Figure 3 Amazon CodeCatalyst Workflow Screen

As a best practice, you don’t want to build your project unless it has passed the security scan.

Do the following:

  • From the visual diagram of the workflow, click the Build action
  • In the menu that opens on the right, on the pre-loaded Inputs tab, under Depends on – optional, open the Add actions dropdown menu and select the inspector_sbom action
  • Click x next to the action name to leave the action input menu

Finally, in order to save our changes to the workflow do the following:

  • Click Validate
  • Once you see a banner at the top of the page that says the workflow definition is valid, click Commit then click Commit once more to publish the changes to the workflow.

Run the workflow and view artifacts

After you commit your changes to the workflow, a new workflow run should start automatically, because the workflow is triggered by a code push.

Currently, CodeCatalyst Reports does not support SBOMs in CycloneDX v1.5 format, so the report cannot be visualized under the Reports feature of Amazon CodeCatalyst. However, the SBOM in CycloneDX v1.5 and the scan report are stored as part of the workflow run and provided as artifacts for you to download.

To access the reports, do the following:

  • Navigate to the CI/CD menu on the left side of your screen, and then click Workflows
  • Click on the onPushToMainPipeline workflow
  • The current view is Latest state. Click Runs
  • When the workflow, or CI/CD pipeline, runs, it is referred to as a run. Runs that are in progress are under Active runs and Runs history contains all previous workflow runs. Click the Run ID under latest run
  • Once the workflow run page loads, click Artifacts
  • On the Artifacts page, you should see two artifacts: SBOM (the SBOM in CycloneDX v1.5 format) and SBOM_VULNERABILITIES (the scan of the SBOM for vulnerabilities), as shown in figure 4. Both of these artifacts can be downloaded and viewed on your local machine

This diagram shows the Amazon CodeCatalyst Workflow Artifact Screen showing the two artifacts resulting from the Amazon CodeCatalyst workflow execution

Figure 4 Amazon CodeCatalyst Workflow Artifact Screen

The SBOM report will be downloaded as inspector_sbom_report.json. In the SBOM report, all the components that make up the software application are available to view. Each listed component is identified by its name and version as shown in figure 5.

This diagram shows the Amazon Inspector Scan SBOM Artifact listing all the components that make up the software application.

Figure 5 Amazon Inspector Scan SBOM Artifact
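
If you prefer to inspect the downloaded SBOM programmatically, a short Python sketch such as the following can list each component by name and version. It assumes the file is a CycloneDX document with a top-level components array; the inline sample and its component names are made up for illustration. If your report nests the document under another key, adjust the lookup accordingly.

```python
import json

# Tiny inline stand-in for inspector_sbom_report.json; component names are made up.
SAMPLE_SBOM = """
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "components": [
    {"type": "library", "name": "example-lib", "version": "1.2.3"},
    {"type": "library", "name": "another-lib", "version": "0.4.0"}
  ]
}
"""


def list_components(sbom):
    """Return sorted (name, version) pairs for every component in the SBOM."""
    return sorted(
        (component.get("name", "?"), component.get("version", "?"))
        for component in sbom.get("components", [])
    )


sbom = json.loads(SAMPLE_SBOM)
for name, version in list_components(sbom):
    print(f"{name}=={version}")
```

To run it against the real artifact, replace the inline sample with the contents of the downloaded file, for example by reading it with json.load.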

The scan of the SBOM report for vulnerabilities will be downloaded as inspector_scan_report.json. In the scan report example below, there are 41 medium vulnerabilities and 32 high vulnerabilities. Since the threshold was set to “medium,” the build failed so these vulnerabilities can be addressed.

This diagram shows the Amazon Inspector Scan Vulnerability Artifact. You can see that there are 41 medium vulnerabilities and 32 high vulnerabilities.

Figure 6 Amazon Inspector Scan Vulnerability Artifact

The scan report details the vulnerabilities, links to the affected components, and contains information on the vulnerability like description and CVE reference identifier as shown in figure 6 and figure 7.

This diagram is a continuation of the Amazon Inspector Scan Vulnerability Artifact. The scan report details the vulnerabilities, links to the affected components, and contains information on the vulnerability like description and CVE reference identifier

Figure 7 Amazon Inspector Scan Vulnerability Artifact continued
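
The severity threshold behavior described earlier can be approximated offline with a short Python sketch. It assumes the scan report follows the CycloneDX layout, with a top-level vulnerabilities array whose entries carry ratings with a severity field; the sample identifiers are placeholders, and the helper names are hypothetical. This is not the action's own implementation, only an illustration of the gating logic.

```python
from collections import Counter

# Severity levels from least to most severe; anything else ranks below "none".
SEVERITY_ORDER = ["none", "info", "low", "medium", "high", "critical"]

# Placeholder data standing in for inspector_scan_report.json.
SAMPLE_REPORT = {
    "vulnerabilities": [
        {"id": "CVE-0000-0001", "ratings": [{"severity": "medium"}]},
        {"id": "CVE-0000-0002", "ratings": [{"severity": "low"}, {"severity": "high"}]},
        {"id": "CVE-0000-0003", "ratings": [{"severity": "medium"}]},
        {"id": "CVE-0000-0004", "ratings": [{"severity": "low"}]},
    ]
}


def rank(severity):
    """Map a severity name to a sortable rank; unrecognized values get -1."""
    return SEVERITY_ORDER.index(severity) if severity in SEVERITY_ORDER else -1


def highest_severity(vulnerability):
    """Pick the most severe rating attached to one vulnerability."""
    severities = [r.get("severity", "unknown") for r in vulnerability.get("ratings", [])]
    return max(severities, key=rank, default="unknown")


def severity_counts(report):
    """Count vulnerabilities by their highest severity."""
    return Counter(highest_severity(v) for v in report.get("vulnerabilities", []))


def exceeds_threshold(report, threshold):
    """Return True when any vulnerability is at or above the threshold severity."""
    if threshold not in SEVERITY_ORDER:
        raise ValueError(f"unknown severity threshold: {threshold}")
    limit = rank(threshold)
    return any(rank(highest_severity(v)) >= limit for v in report.get("vulnerabilities", []))


counts = severity_counts(SAMPLE_REPORT)
print(dict(counts))
print("fail the action:", exceeds_threshold(SAMPLE_REPORT, "medium"))
```

With a threshold of medium, any medium, high, or critical finding causes the gate to report a failure, which mirrors why the run in this walkthrough stopped before the build.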

Clean Up

If you have been following along with building this workflow, you should delete the resources you deployed so you do not continue to incur charges.

First, delete the stack titled DevelopmentFrontendStack-* that has been deployed from the AWS CloudFormation console in the AWS account you associated when you launched the blueprint. Second, delete the project from CodeCatalyst by navigating to Project settings and clicking the Delete project button.

Conclusion

In this blog, we demonstrated how you can integrate security practices into a development pipeline using Amazon CodeCatalyst and Amazon Inspector. You created a project from a blueprint that came pre-configured with a workflow. Next, you modified the workflow to add DevSecOps practices to the pipeline through SBOM generation and scanning. Finally, you ran the workflow and viewed the SBOM and vulnerabilities report. It is essential to secure application dependencies during modern software development. For improved software supply chain security, Amazon CodeCatalyst and Amazon Inspector connect effortlessly. Add this action to your existing or new workflows to improve code security. This is necessary in today’s circumstances to protect your software supply chain. Learn more about Amazon CodeCatalyst and get started today!

Piyush Mattoo

Piyush Mattoo is a Senior Solution Architect for the Financial Services Data Provider segment at Amazon Web Services. He is a software technology leader with over a decade of experience building scalable and distributed software systems to enable business value through the use of technology. He has an educational background in Computer Science with a Master's degree in Computer and Information Science from University of Massachusetts. He is based out of Southern California, and his current interests include outdoor camping and nature walks.

Omar Faruk

Omar Faruk is a Partner Solutions Architect at Amazon Web Services. He helps long-tail technology and consulting partners to design, build and operate their and shared customers’ workloads in AWS. He is passionate about serverless and DevOps. Outside work, he enjoys family time and travel.

Jeff Graham

Jeff Graham is a Partner Solutions Architect at Amazon Web Services. He helps MSP partners design, implement, and optimize scalable, secure, and cost-effective solutions on AWS while providing technical guidance and enablement. He is passionate about serverless and front-end technologies. Outside work, he enjoys travel and working out.

James Rehfeld

James Rehfeld is a Senior Cloud Application Architect within ProServe at Amazon Web Services. He helps customers design and build applications that tackle complex business challenges. He is passionate about automation and well-architected solutions. Outside work, he enjoys running, video games and traveling with his family.

Amazon ECS Multi-region Deployment with Amazon CodeCatalyst

Post Syndicated from Piyush Mattoo original https://aws.amazon.com/blogs/devops/amazon-ecs-multi-region-deployment-with-amazon-codecatalyst/

Many AWS customers run their mission-critical workloads across multiple AWS regions to serve a geographically dispersed customer base, meet disaster recovery objectives, or address local laws and regulations. Amazon CodeCatalyst is a unified software development service designed to streamline and accelerate the process of building and delivering applications on AWS. It is an all-in-one platform for managing your entire development lifecycle, from planning and collaboration to continuous integration, deployment, and scaling. Amazon CodeCatalyst aims to boost developer productivity, ensure consistency, and improve the overall software development experience on AWS. By leveraging Amazon CodeCatalyst for multi-region deployments, AWS customers can ensure high availability and disaster recovery and comply with various regulatory requirements, all while improving their development and deployment process.

In this post, we will walk you through a solution which allows you to easily control updates to applications that are deployed across multiple AWS regions using Amazon CodeCatalyst.

Architecture

In this post, we are going to consider a containerized application running on Amazon Elastic Container Service (Amazon ECS) deployed in two different regions us-east-1 and us-west-2. We will walk you through how to configure an Amazon CodeCatalyst workflow to perform the deployment in stages, limiting the deployment scope to one region at a time (Figure 1).

This diagram shows an Amazon CodeCatalyst workflow that begins with the user pushing code to a repository; the workflow then deploys to Amazon ECS in region 1 and then to Amazon ECS in region 2
Figure 1: Amazon ECS Multi Region deployment

Here are the high-level steps in the multi-region deployment process.

  1. The developer makes updates to the application code base and pushes the code changes to the source repository hosted in Amazon CodeCatalyst
  2. This code push invokes an Amazon CodeCatalyst workflow for the multi-region deployment. In this example, the workflow deploys changes to a containerized application running on Amazon ECS in two regions.
  3. The deployment to different regions happens in stages. In Step 3, the updates get deployed to region 1. This staged approach allows for initial testing and validation in one AWS region before proceeding.
  4. Once the deployment (and any associated validation steps) is completed successfully in region 1, the workflow proceeds with deployment to the second AWS region.

Limiting the scope of each individual deployment limits the potential impact on customers from failed production deployments and prevents a multi-region impact.
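
The staged approach can be summarized with a small Python sketch. The staged_deploy function and the fake deploy and validate callables are hypothetical stand-ins for the workflow's deployment and validation steps; the point is only that a failure in the first region stops the rollout before the second region is touched.

```python
def staged_deploy(regions, deploy, validate):
    """Deploy to regions one at a time, stopping at the first failure.

    Returns (completed_regions, failed_region); failed_region is None on success.
    """
    completed = []
    for region in regions:
        if not deploy(region) or not validate(region):
            return completed, region
        completed.append(region)
    return completed, None


# Hypothetical stand-ins for the real deployment and validation steps.
attempted = []


def fake_deploy(region):
    attempted.append(region)
    return True


def fake_validate(region):
    # Simulate a failed validation in the first region.
    return region != "us-east-1"


completed, failed = staged_deploy(["us-east-1", "us-west-2"], fake_deploy, fake_validate)
print(completed, failed, attempted)
```

Because the loop returns as soon as one region fails, the blast radius of a bad release stays within a single region.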

Prerequisites

  • You need access to an AWS account. If you don’t have one, you can create a new AWS account.
  • Follow Amazon ECS Multi-Region Workshop to deploy a simple containerized application across two different AWS regions. Clone the repository and then follow steps to deploy the foundation, data and backend stack. Note the outputs from workshop-backend-main & workshop-backend-secondary, we’ll use them later on in this post.
  • Follow the Amazon ECR user guide to create an Amazon Elastic Container Registry (Amazon ECR) repository named codecatalyst-ecs-image-repo.
  • Create an Amazon CodeCatalyst space, with an empty Amazon CodeCatalyst project named codecatalyst-ecs-project and an Amazon CodeCatalyst environment called codecatalyst-ecs-environment. Associate your AWS account to the CodeCatalyst space. Follow the Amazon CodeCatalyst tutorial to set these up.
  • Create an AWS Identity and Access Management (IAM) role in the Amazon CodeCatalyst space to give the Amazon CodeCatalyst service permissions to build and deploy applications. Note the name of this role as you’ll use it later in this post.
  • Create an Amazon CodeCatalyst source repository titled ecs-multi-region-repo following the instructions in the documentation.
  • Local installation of Visual Studio Code & Remote Development extension pack.

Walkthrough

Step 1: Create an Amazon CodeCatalyst Dev Environment

In this step, you will create an Amazon CodeCatalyst Dev Environment directly linked to your source repository ecs-multi-region-repo allowing you to work on your Amazon ECS Multi Region application code and configuration files.

  • Open Amazon CodeCatalyst and navigate to your project
  • In the left navigation pane, choose Code and then choose Source repositories
  • Choose the source repository ecs-multi-region-repo for the Amazon ECS multi-region application
  • Choose Create Dev Environment
  • Choose Visual Studio Code  from the drop-down menu
  • In Create Dev Environment and open with Visual Studio Code page (Figure 2), choose Create to create a Visual Studio Code development environment


Figure 2: Create Dev Environment in Amazon CodeCatalyst

  • Choose Open in Visual Studio Code when prompted (Figure 3). This establishes a remote connection to the Dev Environment from your local Visual Studio Code. Keep this window open, as you will need it for Step 2


Figure 3: Open Dev Environment with Visual Studio Code

Step 2: Add Source files to Amazon CodeCatalyst source repository

In this step, you will add the necessary source files to the Amazon CodeCatalyst repository you created in the prerequisites, including the sample Amazon ECS multi-region application and the Amazon ECS task definition file.

  • Inside the Visual Studio Code IDE, choose Terminal in the top menu.
  • Select New Terminal or use an existing terminal window if you prefer.
  • Clone the GitHub project inside your project folder by running the following commands in the terminal.
    git clone https://github.com/aws-samples/amazon-ecs-multi-region.git
    rm -rf amazon-ecs-multi-region/.git
    cp -r amazon-ecs-multi-region/ ecs-multi-region-repo
    cd ecs-multi-region-repo/

  • You need to create an Amazon ECS task definition file for the sample application. Create a file named task.json inside the app folder. Paste the below contents into the task.json replacing placeholder <Account_ID> with your AWS Account ID and <ecsTaskExecutionRole> with your role from workshop-backend-main outputs.
    {
      "executionRoleArn": "arn:aws:iam::<Account_ID>:role/<ecsTaskExecutionRole>",
      "containerDefinitions": [
        {
          "name": "web",
          "image": "$REPOSITORY_URI:$IMAGE_TAG",
          "essential": true,
          "portMappings": [
            {
              "hostPort": 5000,
              "protocol": "tcp",
              "containerPort": 5000
            }
          ],
          "environment": [
            {
              "name": "DYNAMODB_TABLE_NAME",
              "value": "workshop-table"
            }
          ]
        }
      ],
      "requiresCompatibilities": [
        "FARGATE"
      ],
      "networkMode": "awsvpc",
      "cpu": "256",
      "memory": "512",
      "family": "CdkEcsInfraStackTaskDef"
    }

Commit the changes to the Amazon CodeCatalyst repository by issuing the following commands inside the Visual Studio Code IDE terminal window. You will need to update the <your_email> and <your_name> with your email and name.

git config user.email "<your_email>"
git config user.name "<your_name>"
git add .
git commit -m "Initial commit"
git push

In this example, we are using a single task definition file (task.json), which Amazon CodeCatalyst will use to render task definitions in both regions. However, if your workload requires different task definition files across regions (for example, region-specific resource requirements, compliance requirements, or environment-specific configurations), you can create multiple task definition files in the Amazon CodeCatalyst repository and configure the RenderAmazonECStaskdefinition action for each region with a different task definition file.
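
Conceptually, rendering a task definition means taking the template and setting the image of one named container. The following Python sketch illustrates that idea only; render_task_definition is a hypothetical helper, not the implementation of the CodeCatalyst render action, and the account ID and tag in the example are placeholders.

```python
import copy


def render_task_definition(task_def, container_name, image):
    """Return a copy of task_def with the named container's image replaced."""
    rendered = copy.deepcopy(task_def)
    matches = [c for c in rendered["containerDefinitions"] if c["name"] == container_name]
    if not matches:
        raise ValueError(f"no container named {container_name!r} in task definition")
    for container in matches:
        container["image"] = image
    return rendered


# Trimmed-down version of task.json from this step.
template = {
    "containerDefinitions": [
        {"name": "web", "image": "$REPOSITORY_URI:$IMAGE_TAG", "essential": True}
    ],
    "family": "CdkEcsInfraStackTaskDef",
}

# Placeholder account ID and tag, used only for illustration.
image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/codecatalyst-ecs-image-repo:abc123"
rendered = render_task_definition(template, "web", image)
print(rendered["containerDefinitions"][0]["image"])
```

Because the template is copied rather than modified, the same task.json can be rendered once per region with a different image or other region-specific values.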

Step 3: Create Amazon CodeCatalyst Workflow for multi-region deployment

Amazon CodeCatalyst workflow is an automated procedure that describes how to build, test, and deploy your code as part of a continuous integration and continuous delivery (CI/CD) system. A workflow defines a series of steps, or actions, to be executed during a workflow run. You can group actions into action groups to keep your workflow organized and configure dependencies between different groups.

  • In the navigation pane, choose CI/CD, and then choose Workflows
  • Choose Create workflow. Select ecs-multi-region-repo from the Source repository dropdown
  • Choose main in the branch. Select Create (Figure 4). The workflow definition file appears in the Amazon CodeCatalyst console’s YAML editor


Figure 4: Create Workflow page in Amazon CodeCatalyst

  • In the YAML editor, you will replace the default content with the below provided workflow definition. Replace <Account_ID> with your AWS account ID.
  • Replace <EcsRegionNameMain>, <EcsClusterNameMain>, <EcsServiceNameMain>, <EcsRegionNameSecondary>, <EcsClusterNameSecondary>, <EcsServiceNameSecondary>. For values with “main” refer to output from workshop-backend-main, and values with “secondary” refer to output from workshop-backend-secondary.
    • Otherwise use your own Amazon ECS Region, Amazon ECS Cluster ARN, Amazon ECS Service Name values.
  • Replace <CodeCatalyst-Dev-Admin-Role> with the Role Name from the pre-requisite
    Name: BuildAndDeployToECS
    SchemaVersion: "1.0"
    
    # Set automatic triggers on code push.
    Triggers:
      - Type: Push
        Branches:
          - main
    
    Actions:
      Build_application_Multi_Region:
            Identifier: aws/build@v1
            Inputs:
              Sources:
                - WorkflowSource
              Variables:
                - Name: region
                  Value: <EcsRegionNameMain>
                - Name: registry
                  Value: <Account_ID>.dkr.ecr.<EcsRegionNameMain>.amazonaws.com
                - Name: image
                  Value: codecatalyst-ecs-image-repo
            Outputs:
              AutoDiscoverReports:
                Enabled: false
              Variables:
                - IMAGE
            Compute:
              Type: EC2
            Environment:
              Connections:
                - Role: <CodeCatalyst-Dev-Admin-Role>
                  Name: "<Account_ID>"
              Name: codecatalyst-ecs-environment
            Configuration:
              Steps:
                - Run: export account=`aws sts get-caller-identity --output text | awk '{ print $1
                    }'`
                - Run: aws ecr get-login-password --region ${region} | docker login --username AWS
                    --password-stdin ${registry}
                - Run: docker build -t appimage app
                - Run: docker tag appimage ${registry}/${image}:${WorkflowSource.CommitId}
                - Run: docker push --all-tags ${registry}/${image}
                - Run: export IMAGE=${registry}/${image}:${WorkflowSource.CommitId}
      build-deploy-region-one:
        Actions:
          RenderAmazonECStaskdefinition_Region_One:
            Identifier: aws/ecs-render-task-definition@v1
            Configuration:
              image: ${Build_application_Multi_Region.IMAGE}
              container-name: web
              task-definition: app/task.json
            Outputs:
              Artifacts:
                - Name: TaskDefinitionOne
                  Files:
                    - task-definition*
            DependsOn:
              - Build_application_Multi_Region
            Inputs:
              Sources:
                - WorkflowSource
          DeploytoAmazonECS_Region_One:
            Identifier: aws/ecs-deploy@v1
            Configuration:
              task-definition: /artifacts/build-deploy-region-one@DeploytoAmazonECS_Region_One/TaskDefinitionOne/${RenderAmazonECStaskdefinition_Region_One.task-definition}
              service: <EcsServiceNameMain>
              cluster: <EcsClusterNameMain>
              region: <EcsRegionNameMain>
            Compute:
              Type: EC2
              Fleet: Linux.x86-64.Large
            Environment:
              Connections:
                - Role: <CodeCatalyst-Dev-Admin-Role>
                  Name: "<Account_ID>"
              Name: codecatalyst-ecs-environment
            DependsOn:
              - RenderAmazonECStaskdefinition_Region_One
            Inputs:
              Artifacts:
                - TaskDefinitionOne
              Sources:
                - WorkflowSource
      build-deploy-region-two:
        DependsOn:
          - build-deploy-region-one
        Actions:
          RenderAmazonECSTaskDefinition_Region_Two:
            # Identifies the action. Do not modify this value.
            Identifier: aws/ecs-render-task-definition@v1
            # Defines the action's properties.
            Configuration:
              image: ${Build_application_Multi_Region.IMAGE}
              container-name: web
              task-definition: app/task.json
            Outputs:
              Artifacts:
                - Name: TaskDefinitionTwo
                  Files:
                    - task-definition*
            DependsOn:
              - Build_application_Multi_Region
            # Specifies the source and/or artifacts to pass to the action as input.
            Inputs:
              # Optional
              Sources:
                - WorkflowSource # This specifies that the action requires this Workflow as a source
          DeployToAmazonECS_Region_Two:
            Identifier: aws/ecs-deploy@v1
            # Defines the action's properties.
            Configuration:
              task-definition: /artifacts/build-deploy-region-two@DeployToAmazonECS_Region_Two/TaskDefinitionTwo/${RenderAmazonECSTaskDefinition_Region_Two.task-definition}
              service: <EcsServiceNameSecondary>
              cluster: <EcsClusterNameSecondary>
              region: <EcsRegionNameSecondary>
            # Required; You can use an environment to access AWS resources.
            Environment:
              Connections:
                - Role: <CodeCatalyst-Dev-Admin-Role>
                  Name: "<Account_ID>"
              Name: codecatalyst-ecs-environment
            DependsOn:
              - RenderAmazonECSTaskDefinition_Region_Two
            # Specifies the source and/or artifacts to pass to the action as input.
            Inputs:
              Artifacts:
                - TaskDefinitionTwo
              # Optional
              Sources:
                - WorkflowSource # This specifies that the action requires this Workflow as a source

Figure 5: Amazon CodeCatalyst Workflow Screen

The workflow above (Figure 5) does the following:

  • Whenever code changes are pushed to the repository, a Build action is invoked automatically. The Build action builds a container image and pushes the image to the Amazon Elastic Container Registry (Amazon ECR) repository in the primary region. In this example, we are storing the container image only within the primary region. If you are implementing multi-region for disaster recovery, enable cross-region replication on Amazon ECR to automatically replicate images to repositories in other regions. You will also need to update the task definition files to reference the Amazon ECR repository in the same region where the task will run.
  • Once the Build stage is complete, the Amazon ECS task definition is updated with the new Amazon ECR repository image.
  • The DeployToECS action then deploys the new image to Amazon ECS in the first region.
  • Once the first action group execution succeeds, the Amazon CodeCatalyst workflow invokes the second action group, repeating the last two steps (Render Task Definition, Deploy) for the second region.
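
The first bullet notes that, with cross-region replication enabled, each region's task definition should reference the Amazon ECR repository in its own region. The following is a minimal Python sketch of that rewrite, not part of the workflow in this post. It assumes the standard `<account>.dkr.ecr.<region>.amazonaws.com/<repository>` image URI layout; the function names `regional_image_uri` and `localize_task_definition` are hypothetical.

```python
import re

# Assumed ECR image URI layout: <account>.dkr.ecr.<region>.amazonaws.com/<repo>[:tag]
_ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)"
    r"\.amazonaws\.com/(?P<rest>.+)$"
)


def regional_image_uri(image_uri: str, target_region: str) -> str:
    """Return the same image reference, pointed at the registry in target_region."""
    match = _ECR_URI.match(image_uri)
    if match is None:
        raise ValueError(f"not an ECR image URI: {image_uri}")
    return (
        f"{match['account']}.dkr.ecr.{target_region}"
        f".amazonaws.com/{match['rest']}"
    )


def localize_task_definition(task_def: dict, target_region: str) -> dict:
    """Copy a task definition, rewriting every container image to target_region."""
    containers = [
        {**container, "image": regional_image_uri(container["image"], target_region)}
        for container in task_def.get("containerDefinitions", [])
    ]
    return {**task_def, "containerDefinitions": containers}
```

The input dictionary is left unchanged, so the same rendered task definition can be localized once per region before each deploy action.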

As you may have noticed, the build action is separated from the deployment actions in this example. This way, we are building the container image only once and deploying the same image across multiple regions. But, if you have specific build steps that are region-specific, you can include those actions in the region-specific action groups. This allows for customizations based on regional requirements while maintaining overall consistency.

To check the syntax and structure of your workflow definition:

  • Choose the Validate button. It should add a green banner with “The workflow definition is valid” at the top
  • Select Commit to add the workflow to the repository (Figure 6)

Figure 6: Commit workflow page in Amazon CodeCatalyst

The workflow file is stored in a ~/.codecatalyst/workflows/ folder in the root of your source repository. The file can have a .yml or .yaml extension.

Using the URL of the Application Load Balancer that you noted in the prerequisites for either of the two regions, append /healthcheck and load the health check page in your browser. You’ll see the message on the health check page, as shown in Figure 7.

Figure 7: ECS Multi Region Application (US-West-1)

Step 4: Validate the setup

To validate the setup, you will make a small change to the Health check of the sample application.

    • Open Amazon CodeCatalyst dev environment (Visual Studio Code) that you created in Step 1.
    • Update your local copy of the repository. In the terminal run the below command
      git pull

    • Inside the Visual Studio Code IDE, open app.py present inside the app folder.
    • Inside the healthcheck() method, on line 13, update the string from ok to ok v1.
    • Commit the changes to the repository using the below commands:
      git add .
      git commit -m "Updating health check static text"
      git push

    After the change is committed, the Amazon CodeCatalyst workflow should start running automatically. Once the Amazon CodeCatalyst workflow finishes execution, paste the Application Load Balancer URL for each region into your browser and append /healthcheck to reach the health check page. You will see the updated message on the health check page, as shown in Figures 8 and 9.

    Figure 8: ECS Multi Region Application (US-East-1)

    Figure 9: ECS Multi Region Application (US-West-1)
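
If you prefer to script this check instead of using a browser, the following is a minimal Python sketch that requests /healthcheck from each region's load balancer and reports whether the expected text is present. The region keys and URLs are placeholders you would supply, and `check_regions` is a hypothetical helper name, not something the sample application provides.

```python
from urllib.request import urlopen


def fetch_text(url: str, timeout: float = 5.0) -> str:
    """Fetch a URL and return the response body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def check_regions(alb_urls: dict, expected: str, fetch=fetch_text) -> dict:
    """Map each region to True if its /healthcheck page contains `expected`."""
    results = {}
    for region, base_url in alb_urls.items():
        try:
            body = fetch(base_url.rstrip("/") + "/healthcheck")
            results[region] = expected in body
        except OSError:
            # Network failures (including urllib's URLError) count as not updated.
            results[region] = False
    return results
```

Call it with a dictionary such as `{"us-east-1": "<ALB URL>", "us-west-1": "<ALB URL>"}` and the string `ok v1`. The `fetch` parameter exists only so the check can be exercised without network access.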

    Considerations for multi-region deployments

    In this post, we considered a deployment scenario across two regions. Many organizations have workloads running across many regions, serving customers across the globe. The Amazon CodeCatalyst workflow that we created in this post can be extended to more than two regions.

    Amazon CodeCatalyst allows fine-grained control for progressive wave-based deployments across multiple regions. This is achieved by using multiple action groups and sequencing those action groups using dependencies in the Amazon CodeCatalyst workflow. For example, in the workflow discussed in Step 3, you defined two action groups, build-deploy-region-one and build-deploy-region-two. We set up build-deploy-region-two to depend on build-deploy-region-one using the DependsOn property, so that the deployment to the second region starts only after the first region completes. This approach allows for staggered deployments, mitigating risks by preventing issues in one AWS region from impacting others.

    For workloads spanning multiple regions, the same staggered deployment approach can be extended with more action groups. Each action group can contain a list of regions to deploy to in parallel. Dependencies between action groups ensure the deployment happens sequentially. Below is a high-level architecture (Figure 10) of a 3-stage deployment process for a workload running across 6 regions.

    Figure 10: Staggered Deployment architecture
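
To illustrate how DependsOn relationships translate into deployment stages, here is a minimal Python sketch that orders action groups so that each one runs only after the groups it depends on. The wave names and the assignment of regions to waves are hypothetical, chosen only to match the "3 stages, 6 regions" shape described above. CodeCatalyst performs this sequencing itself; the sketch only models the idea.

```python
def deployment_order(action_groups: dict) -> list:
    """Group action-group names into stages that respect their dependencies.

    action_groups maps a name to {"regions": [...], "depends_on": [...]}.
    Groups in the same stage have no dependency on each other.
    """
    remaining = dict(action_groups)
    done = set()
    stages = []
    while remaining:
        ready = sorted(
            name
            for name, group in remaining.items()
            if set(group.get("depends_on", [])) <= done
        )
        if not ready:
            raise ValueError("circular or missing dependency")
        stages.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return stages


# Hypothetical 3-stage rollout across 6 regions.
waves = {
    "wave-1": {"regions": ["us-east-1"], "depends_on": []},
    "wave-2": {"regions": ["us-west-1", "eu-west-1"], "depends_on": ["wave-1"]},
    "wave-3": {
        "regions": ["ap-southeast-1", "ap-northeast-1", "sa-east-1"],
        "depends_on": ["wave-2"],
    },
}

for stage in deployment_order(waves):
    for name in stage:
        print(name, "->", ", ".join(waves[name]["regions"]))
```

Starting with a single region in the first wave and widening later waves limits how many regions are exposed to a faulty release before it is caught.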

    Cleanup

    If you have been following along with the post, you should delete the resources you deployed so you do not continue to incur charges.

    • Manually delete Amazon CodeCatalyst dev environment, source repository and project from your CodeCatalyst Space.
    • Clean up resources created with the CDK
    cdk destroy workshop-backend-secondary
    cdk destroy workshop-backend-main
    cdk destroy workshop-data
    cdk destroy workshop-foundation-secondary
    cdk destroy workshop-foundation-main

    Conclusion

    In conclusion, we demonstrated how you can set up multi-region deployments for Amazon ECS workloads using Amazon CodeCatalyst workflows. We showed how to configure the Amazon CodeCatalyst workflow to deploy to one region at a time, allowing for validation before proceeding to additional regions. The pattern can be extended to more than two AWS regions using additional action groups and dependencies. This solution addresses key challenges in multi-region deployments, such as maintaining consistency while ensuring high availability. To learn more about multi-region architectures, see the AWS Multi-Region Fundamentals whitepaper.

    Piyush Mattoo

    Piyush Mattoo is a Senior Solution Architect for Financial Services Data Provider segment at Amazon Web Services. He is a software technology leader with over a decade long experience building scalable and distributed software systems to enable business value through the use of technology. He has an educational background in Computer Science with a Masters degree in Computer and Information Science from University of Massachusetts. He is based out of Southern California and current interests include outdoor camping and nature walks.

    William Cardoso

    William Cardoso is a Solutions Architect at Amazon Web Services based in South Florida area. He has 20+ years of experience in designing and developing enterprise systems. He leverages his real world experience in IT operations to work with AWS customers providing architectural and best practice recommendations for new and existing solutions. Outside of work, William enjoys woodworking, walking and cooking for friends and family.

    Hareesh Iyer

    Hareesh Iyer is a Senior Solutions Architect at AWS. He helps customers build scalable, secure, resilient and cost-efficient architectures on AWS. He is passionate about cloud-native patterns, containers and microservices.

Automate detection and response to website defacement with Amazon CloudWatch Synthetics

Post Syndicated from Agus Komang original https://aws.amazon.com/blogs/security/automate-detection-and-response-to-website-defacement-with-amazon-cloudwatch-synthetics/

Website defacement occurs when threat actors gain unauthorized access to a website, most commonly a public website, and replace content on the site with their own messages. In this blog post, we show you how to detect website defacement, and then automate both defacement verification and your defacement response by using Amazon CloudWatch Synthetics visual monitoring canaries. Canaries are configurable scripts that run on a schedule and compare screenshots taken during a canary run with screenshots taken during a baseline canary run. If the discrepancy between the two screenshots exceeds a threshold percentage, the canary fails. We will show you how to quickly deploy a maintenance page through AWS WAF after you verify the defacement.

Common causes of defacement include unauthorized access, SQL injection, cross-site scripting (XSS), or malware. You can use AWS services such as AWS WAF, Amazon Route 53, and Amazon GuardDuty to put additional mechanisms in place to help improve your security posture.

Solution overview

The architectural diagram in Figure 1 shows a typical web application where users access the application by using Amazon CloudFront protected by AWS WAF.

Figure 1: Defacement detection and response with CloudWatch Synthetics

As shown in the diagram, the solution consists of two parts: 1) visual monitoring for defacement detection, and 2) automation of the verification and defacement response.

Part 1: Visual monitoring for defacement detection

Defacement detection uses CloudWatch Synthetics visual monitoring canaries to perform visual monitoring. You can create canaries in CloudWatch Synthetics that periodically take a screenshot of the monitored URLs. Because the canaries only need network access to the monitored URLs, you can implement this solution without affecting the application or modifying its code. For more details on how to create CloudWatch Synthetics visual monitoring canaries, see Visual monitoring of applications with Amazon CloudWatch Synthetics.

You can use the CloudWatch Synthetics visual monitoring blueprint to compare screenshots taken during a canary run with screenshots taken during a baseline canary run. This solution is suitable for static websites, where a discrepancy between the two screenshots that exceeds a threshold percentage could indicate a possible defacement attempt and causes the canary to fail.

The threshold percentage is defined by the visual variance that occurs when the current screenshot differs from the baseline screenshot that was captured during the first run of the canary. To reduce false positives, you can adjust the threshold for detecting visual variance.

In the following script, we updated the visual variance to 5% in the visual monitoring blueprint:

// Set the visual variance threshold to 5%
syntheticsConfiguration.withVisualVarianceThresholdPercentage(5);

Figure 2 shows the first baseline screenshot of a webpage with visual variance set to 5%.

Figure 2: Image taken during a baseline canary run

Figure 3 shows the visual variance of a defaced webpage. In this case, the visual variance was set to 5% in the script, and the visual variance detected was 30.92%.

Figure 3: Failed canary run due to differences from the baseline screenshot

Figure 4 shows a webpage with dynamic content that triggered a false positive because the visual monitoring canary was unable to differentiate between real dynamic content and variation from the baseline. In this case, the visual variance was set to 5% in the script, and the visual variance detected was 5.25%.

Figure 4: Dynamic content in Feedback form that triggered canary failure

You can select the dynamic content to exclude it from the visual comparison for subsequent canary runs. To exclude the dynamic content, edit the baseline screenshot in CloudWatch Synthetics. Using a simple click-drag, you can select the area to exclude from visual comparison for subsequent canary runs, as shown in Figure 5.

Figure 5: Exclusion of dynamic content

If your applications have additional areas with dynamic content, you can select more than one area to exclude from comparison.
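
To make the threshold and exclusion behavior concrete, the following Python sketch computes a visual variance as the percentage of compared pixels that differ from the baseline, skipping any excluded rectangles. This is a simplified illustration only: it is not how CloudWatch Synthetics computes variance internally, and the function names and the `(left, top, width, height)` rectangle format are assumptions.

```python
def in_excluded_area(x, y, excluded_rects):
    """True if pixel (x, y) lies inside any (left, top, width, height) rectangle."""
    return any(
        left <= x < left + width and top <= y < top + height
        for left, top, width, height in excluded_rects
    )


def visual_variance_percent(baseline, current, excluded_rects=()):
    """Percentage of compared pixels that differ between two same-sized images.

    Images are lists of rows; a pixel can be any comparable value.
    """
    compared = changed = 0
    for y, (row_a, row_b) in enumerate(zip(baseline, current)):
        for x, (pixel_a, pixel_b) in enumerate(zip(row_a, row_b)):
            if in_excluded_area(x, y, excluded_rects):
                continue  # dynamic content: ignored by the comparison
            compared += 1
            changed += pixel_a != pixel_b
    return 100.0 * changed / compared if compared else 0.0


def canary_fails(baseline, current, threshold_percent=5.0, excluded_rects=()):
    """The run fails when the variance exceeds the threshold."""
    return visual_variance_percent(baseline, current, excluded_rects) > threshold_percent
```

In this model, a change confined to an excluded area contributes nothing to the variance, which is why excluding a dynamic region can turn a marginal failure into a passing run.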

Figure 6 shows a successful canary run after exclusion of the area that contains the dynamic content.

Figure 6: Canary succeeded after the exclusion of dynamic content

You can automate the defacement response by using Amazon EventBridge rules to trigger Amazon Simple Notification Service (Amazon SNS) when a canary run fails. By using the publish-subscribe pattern, you can customize and add on the response functions based on your organization’s needs.

The following shows the event pattern script in EventBridge. Make sure to update the canary name with the name of the CloudWatch Synthetics visual monitoring canary that you created earlier to serve as the event source.

 // Event patterns in EventBridge to get event source from canary

{
  "source": ["aws.synthetics"],
  "detail-type": ["Synthetics Canary TestRun Failure"],
  "detail": {
    "canary-name": ["<replace-with-canary-name>"]
  }
}
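
As a rough illustration of how this pattern selects events, the Python sketch below builds the same pattern for a given canary name and applies a minimal exact-value matcher to a sample event. EventBridge supports far richer matching (prefix, numeric, and other operators) than this matcher models; `build_pattern` and `matches` are hypothetical helper names, and the sample canary name is made up.

```python
import json


def build_pattern(canary_name: str) -> str:
    """Return the event pattern from this post as a JSON string."""
    return json.dumps(
        {
            "source": ["aws.synthetics"],
            "detail-type": ["Synthetics Canary TestRun Failure"],
            "detail": {"canary-name": [canary_name]},
        }
    )


def matches(pattern: dict, event: dict) -> bool:
    """Minimal matcher: a list means 'one of these values'; a dict recurses."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


pattern = json.loads(build_pattern("homepage-visual"))  # hypothetical canary name
sample_event = {
    "source": "aws.synthetics",
    "detail-type": "Synthetics Canary TestRun Failure",
    "detail": {"canary-name": "homepage-visual"},
}
print(matches(pattern, sample_event))
```

The JSON string returned by `build_pattern` is the kind of value you would supply as the event pattern when creating the rule.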

When the event pattern matches the rules that you configured in EventBridge, the Amazon SNS topic triggers the approval flow, as shown in Figure 7. This begins automation of the verification and defacement response, which we describe in the next section.

Figure 7: Amazon SNS topic triggered when the event pattern matches

Part 2: Automation of the verification and defacement response

Figure 8 outlines how to automate the verification and defacement response. When alerts are received upon detection of defacement, the notified team can choose to verify the defacement. This defacement monitor uses CloudWatch Synthetics while maintaining the flexibility to configure and verify threshold settings through manual verification. If you are confident in your thresholds, you can bypass the approval flow and directly block site traffic by using an AWS WAF rule during a defacement attempt.

Figure 8: Defacement detection and response with CloudWatch Synthetics

As shown in the diagram, this is what the traffic flow looks like during a defacement:

  1. The canary from the CloudWatch Synthetics visual monitor identifies defacement through visual variance against the baseline screenshot taken during the first canary run and emits an event.
  2. If the emitted event matches the rules configured in EventBridge, Amazon SNS is triggered. This triggers the subscribed AWS Lambda function that sends a Slack notification with the event details asking for approval.
  3. The notified team receives a Slack message about the defacement and makes an approval decision.
  4. If approval is granted, an AWS WAF rule is added to block traffic and a maintenance page is served to users.
  5. The user that accessed the origin is shown a maintenance page served by AWS WAF.

Although this example shows the use of Slack as an approval mechanism, you can use the communication mechanism of your choice.
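
As a sketch of step 2, the following Python function shows how a Lambda function subscribed to the SNS topic might turn the incoming event into an approval request. It assumes the SNS message body is the EventBridge event as JSON. The returned payload shape (`text` and `actions`) is hypothetical and would need to be adapted to whichever messaging tool you use.

```python
import json


def build_approval_message(sns_event: dict) -> dict:
    """Build an approval request from an SNS-triggered Lambda event."""
    # Lambda receives SNS notifications under Records[n].Sns.Message.
    record = sns_event["Records"][0]["Sns"]
    detail = json.loads(record["Message"]).get("detail", {})
    canary = detail.get("canary-name", "unknown canary")
    return {
        "text": (
            f"Possible defacement detected by canary '{canary}'. "
            "Block traffic and serve the maintenance page?"
        ),
        # Hypothetical choices presented to the notified team.
        "actions": ["approve", "reject"],
    }


def lambda_handler(event, context):
    message = build_approval_message(event)
    # Sending the message (for example, to a chat webhook) is left out here.
    return message
```

Keeping the message construction separate from the delivery step makes it straightforward to swap the approval channel without touching the event parsing.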

Conclusion

In this post, you learned how to use CloudWatch Synthetics to monitor for defacement and display a maintenance page through AWS WAF and CloudFront while you work on recovering the service. You also learned how to use manual approval to identify the optimal threshold and exclude the area that contains dynamic content to reduce false positives.

Because most web applications already use CloudFront and AWS WAF, you can integrate this solution into your existing environment without affecting the application or modifying its code. This solution helps detect potential defacement, providing you with an additional layer of protection for your environment.

We recommend that you explore the capabilities of CloudWatch Synthetics monitoring to detect defacement, and that you use services such as EventBridge, Amazon SNS, and Lambda to automate your response. This can help you proactively protect your application against defacement attempts.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Agus Komang

Agus is a Principal Solutions Architect (Security) at AWS. He helps public sector customers to build secure, resilient, and compliant workloads in the cloud.
Jessica Ang

Jessica is a Solutions Architect at AWS specializing in security. She helps customers to innovate securely and efficiently through the use of AWS services and security automation.

Streamline SMS and Emailing Marketing Compliance with Amazon Comprehend

Post Syndicated from Koushik Mani original https://aws.amazon.com/blogs/messaging-and-targeting/streamline-sms-and-emailing-marketing-compliance-with-amazon-comprehend/

In today’s digital landscape, businesses heavily rely on SMS and email campaigns to engage with customers and deliver timely, relevant messages. The shift towards digital marketing has increased customer engagement, accelerated delivery, and expanded personalization options. Email and SMS marketing is essential to digital strategies according to 44% of Chief Marketing Officers, who allocate approximately 8% of their marketing budget to it. However, industries face stringent restrictions on the content they can send due to legal regulations and carrier filtering policies.

Failing to comply with these restrictions can result in severe consequences, including legal action, fines, and irreparable damage to a brand’s reputation. Marketers need a solution that proactively scans the content used in their campaigns and flags restricted content before it is sent to customers, so that they avoid penalties and loss of trust. Messages related to the following subjects are considered restricted and are subject to heavy filtering or even being blocked outright:

  • Gambling
  • High-risk financial services
  • Debt forgiveness
  • S.H.A.F.T (Sex, Hate, Alcohol, Firearms, and Tobacco)
  • Illegal substances

In this blog post, we explore how to use Amazon Comprehend, Amazon S3, and AWS Lambda to proactively scan text-based marketing campaigns before publishing content. This solution enables businesses to enhance their marketing efforts while maintaining compliance with industry regulations, avoiding costly fines, preserving their hard-earned reputation, and conforming to best practices.

Solution Overview

AWS provides a robust suite of services to meet the infrastructure needs of the booming digital marketing industry, including messaging capabilities through email, SMS, push, and other channels through Amazon Simple Email Service, Amazon Simple Notification Service, or Amazon Pinpoint.

The main goal for this approach is to flag any message that contains restricted content mentioned above before distribution.

Figure 1: Architecture for proactive scanning of marketing content

Following are the high-level steps:

  1. Upload documents to be scanned to the S3 bucket.
  2. Utilize Amazon Comprehend custom classification for categorizing the documents uploaded.
  3. Create an Amazon Comprehend endpoint to perform analysis.
  4. Inference output is published to the destination S3 bucket.
  5. Utilize AWS Lambda function to consume the output from the destination S3 bucket.
  6. Send the compliant messages through various messaging channels.

Solution Walkthrough

Step 1: Upload Documents to Be Scanned to S3

  1. Sign in to the AWS Management Console and open the Amazon S3 console
  2. In the navigation bar on the top of the page, choose the name of the currently displayed AWS Region. Next, choose the Region in which you want to create a bucket.
  3. In the left navigation pane, choose Buckets.
  4. Choose Create bucket.
    • The Create bucket page opens.
  5. Under General configuration, view the AWS Region where your bucket will be created.
  6. Under Bucket type, choose General purpose.
  7. For Bucket name, enter a name for your bucket.
    • The bucket name must:
      • Be unique within a partition. A partition is a grouping of Regions. AWS currently has three partitions: aws (Standard Regions), aws-cn (China Regions), and aws-us-gov (AWS GovCloud (US) Regions).
      • Be between 3 and 63 characters long.
      • Consist only of lowercase letters, numbers, dots (.), and hyphens (-). For best compatibility, we recommend that you avoid using dots (.) in bucket names, except for buckets that are used only for static website hosting.
      • Begin and end with a letter or number.
  8. In the Buckets list, choose the name of the bucket that you want to upload your folders or files to.
  9. Choose Upload.
  10. In the Upload window, do one of the following:
    • Drag and drop files and folders to the Upload window.
    • Choose Add file or Add folder, choose the files or folders to upload, and choose Open.
    • To enable versioning, under Destination, choose Enable Bucket Versioning.
    • To upload the listed files and folders without configuring additional upload options, at the bottom of the page, choose Upload.
  11. Amazon S3 uploads your objects and folders. When the upload is finished, you see a success message on the Upload: status page.

Step 2: Creating a Custom Classification Model

Custom Classification Model

Out-of-the-box models may not capture nuances and terminology specific to an organization’s industry or use case. Therefore, we train a custom model to identify compliant messages.

A custom classification model is a feature that allows you to train a machine learning model to classify text data based on categories that are specific to your use case or industry. It trains the model to recognize and sort different types of content which is used to power the endpoint. A custom classification model is designed to save costs and promote compliant messages and further prevent marketing companies from potential fines.

Requirements for custom classification:

  • Dataset creation
    • A CSV dataset with 1000 examples of marketing messages, each labeled as compliant (1) or non-compliant (0).
    • Designed to train a model for accurate predictions on marketing message compliance.

Figure 2: Screenshot of dataset – 20 entries of censored marketing messages

  • Creating a Test Data Set
    In addition to providing a dataset to train your custom classification model, you can provide a test dataset that is used to evaluate the trained model. Without a test dataset, Amazon Comprehend trains the model with 90 percent of the training data and reserves the remaining 10 percent for testing. When using a test dataset, the test data must include at least one example for each unique label (0 or 1) in the training dataset.
  1. Upload the data set and test data set to an S3 Bucket, by following the steps in this user guide.
  2. In the AWS Console, search for Amazon Comprehend.
  3. Once selected, select custom classification on the left panel.
  4. Once there, select Create new model.
  5. Next specify model settings:
    • Model name
    • Specify the version (optional)
    • Language: English
  6. Specify the data specifications:
    • Training model type: Plain Text Documents
    • Data format: CSV File
    • Classifier Mode: Using Single-Label Mode
    • Training Dataset: Give the name of the bucket you created in step 1
    • Test dataset: Autosplit, which lets Amazon Comprehend decide how much of your data is used for training and testing.
  7. Specify the location of the model output in S3.
  8. Create an IAM Role with permissions to access the training, test, and output data (if specified) in your S3 buckets.
  9. Once all parameters have been identified, select Create.
  10. Once the model has been created, you can view it under Custom Classification. To check and verify the accuracy and F1 score, select the version number of the model. Here, you can view the model details under the Performance tab.
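
If you want to prepare your own test dataset rather than rely on the automatic split, the following Python sketch shows one way to do it. It assumes a two-column CSV with the label first and the message text second, and no header row. It holds out roughly 10 percent of the examples per label and keeps at least one example of every label in the test set. The function names and the sample messages used with them are hypothetical.

```python
import csv
import io
import random


def split_dataset(rows, test_fraction=0.1, seed=0):
    """Split (label, text) rows into train and test sets.

    Every label contributes at least one example to the test set.
    """
    rng = random.Random(seed)
    by_label = {}
    for label, text in rows:
        by_label.setdefault(label, []).append((label, text))

    train, test = [], []
    for items in by_label.values():
        rng.shuffle(items)
        n_test = max(1, round(len(items) * test_fraction))
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


def to_csv(rows) -> str:
    """Serialize rows as CSV text: label first, message second, no header."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()
```

Splitting per label rather than over the whole file avoids the case where a small class ends up entirely in the training set and leaves the test set without that label.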

Step 3: Creating an Endpoint in Amazon Comprehend

Next, an endpoint needs to be created to analyze documents. To create an endpoint:

  • Select endpoint on the left panel in Amazon Comprehend.
  • Select Create endpoint in the left panel.
  • Specify Endpoints Settings :
    • Provide a name
    • Custom model type: Custom Classification
    • Choose a custom model that you want to attach to the new endpoint. From the dropdown, you can search by model name.
    Figure 8: Amazon Comprehend – Endpoint settings
  • Provide the number of inference units (IUs): 1
  • Once all the parameters have been provided, ensure that the Acknowledge checkbox has been selected.
  • Finally, select Create endpoint.
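
The console steps above can also be scripted. As a minimal sketch, the following Python function assembles the parameters for the Amazon Comprehend `create_endpoint` API call in boto3. The endpoint name and model ARN are placeholders you would replace, and `endpoint_request` is a hypothetical helper name.

```python
def endpoint_request(name: str, model_arn: str, inference_units: int = 1) -> dict:
    """Assemble keyword arguments for the Comprehend create_endpoint call."""
    if inference_units < 1:
        raise ValueError("at least one inference unit is required")
    return {
        "EndpointName": name,
        "ModelArn": model_arn,
        "DesiredInferenceUnits": inference_units,
    }


# Example use (requires AWS credentials, so it is left commented out):
# import boto3
# comprehend = boto3.client("comprehend")
# response = comprehend.create_endpoint(
#     **endpoint_request("marketing-compliance", "<your-model-arn>")
# )
# endpoint_arn = response["EndpointArn"]
```

An endpoint is billed while it exists, so keeping the number of inference units at 1 is a reasonable starting point for testing.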

Step 4: Scanning Text with the Custom Classification Endpoint

Once the endpoint has been successfully created, it can be used for real-time analysis or batch-processing jobs. Below is a walkthrough of how both options can be achieved.

Real-time analysis:

  • On the left panel, select Realtime Analysis.
  • Pick Analysis type: custom, to view real-time insights based on the custom models from an endpoint you’ve created
  • Select custom model type
  • Select your Endpoint
  • Insert your input text.
    • For this example, we have used a non-compliant message: Huge sale at Pine County Shooting Range 25% off for 6mm and 9mm bullets! Lazer add-ons on clearance too
  • Once inserted, click Analyze.
  • Once analyzed, you will see a confidence score under Classes. Because the dataset labels non-compliant messages as 0 and compliant messages as 1, and the message that was inserted was non-compliant, the real-time analysis returns a high confidence score for class 0 (non-compliant).
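
The same real-time check can be made programmatically with the Amazon Comprehend `classify_document` API, which returns a list of classes with confidence scores. The following Python sketch interprets such a response using the 0/1 labels from this post. The `min_confidence` cut-off and the "needs review" outcome are assumptions added for illustration, and the response in the usage comment is a made-up example.

```python
# Label meanings used in this post's dataset.
LABELS = {"0": "non-compliant", "1": "compliant"}


def interpret(response: dict, min_confidence: float = 0.5):
    """Return (verdict, score) for the highest-scoring class in the response."""
    top = max(response["Classes"], key=lambda item: item["Score"])
    verdict = LABELS.get(top["Name"], "unknown")
    if top["Score"] < min_confidence:
        # Hypothetical policy: low-confidence results go to a human reviewer.
        verdict = "needs review"
    return verdict, top["Score"]


# Example use (requires AWS credentials, so it is left commented out):
# import boto3
# comprehend = boto3.client("comprehend")
# response = comprehend.classify_document(
#     Text="<marketing message>", EndpointArn="<your-endpoint-arn>"
# )
# print(interpret(response))
```

Routing low-confidence results to manual review is one way to limit the cost of a misclassification in either direction.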

Batch processing in Amazon Comprehend:

  • On the left panel in Amazon Comprehend, select Analysis Jobs.
  • Select the Create Job button.
  • Configure Job settings:
    • Enter the Name
    • Analysis Type: Custom Classification
    • Classifications Model: The model you have created for your Classifier, as well as the version number of that model you would like to use for this job.
  • Enter the location of the Input Data and Output Data in the form of an S3 bucket URL.
  • Before creating the job, provide the right access permissions by creating an IAM role that gives access to the S3 input and output locations.
  • Once the batch processing job shows a status of completed, you can view the results in the output S3 bucket that you identified earlier. The results will be in a JSON file where each line represents the confidence score for each marketing message.
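
To show how the batch results might be consumed, the following Python sketch reads a JSON Lines result and returns the input line numbers whose top class is the compliant label. It assumes each output line carries a `Line` field and a `Classes` list of name and score pairs, similar to the real-time response. Verify the field names against your own job output. The `min_score` cut-off is a hypothetical policy choice.

```python
import json


def compliant_lines(jsonl_text: str, compliant_label: str = "1", min_score: float = 0.8):
    """Return input line numbers whose top class is the compliant label."""
    kept = []
    for raw in jsonl_text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        record = json.loads(raw)
        top = max(record["Classes"], key=lambda item: item["Score"])
        if top["Name"] == compliant_label and top["Score"] >= min_score:
            kept.append(int(record["Line"]))
    return kept
```

The returned line numbers can be used to look up the original messages in the input file, because the batch job reports results per input line.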

Step 5 (optional): Publish message to communication service

The result from the batch processing is automatically uploaded to the output S3 bucket. For each json file uploaded, S3 will initiate an S3 Event Notification which will inform a Lambda function that a new S3 object has been created.

The Lambda function will evaluate the results and automatically identify the messages labeled as compliant (label 1). These compliant messages will then be published through the API of the desired communication service, such as Amazon Simple Email Service, Amazon Simple Notification Service, or Amazon Pinpoint.

The Lambda function is triggered automatically, uses the boto3 API to read the files uploaded to the S3 bucket, and displays the data using the Python pandas library. To set it up, complete the following steps:

  1. Create an IAM Role in AWS.
  2. Create an AWS S3 bucket.
  3. Create the AWS Lambda function with S3 triggers enabled.
  4. Update the Lambda code with a Python script to read the data and send the communication to customer.
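
The following is a minimal Python sketch of such a Lambda function, using only the standard library plus boto3 (which the Lambda Python runtime provides) rather than pandas. It assumes the uploaded object is a plain-text JSON Lines file as described above and leaves the publishing step as a comment, because the call depends on the communication service you choose. The helper name `s3_objects` is hypothetical.

```python
from urllib.parse import unquote_plus


def s3_objects(event: dict):
    """Extract (bucket, key) pairs from an S3 event notification."""
    return [
        # Object keys arrive URL-encoded in S3 event notifications.
        (record["s3"]["bucket"]["name"], unquote_plus(record["s3"]["object"]["key"]))
        for record in event.get("Records", [])
    ]


def lambda_handler(event, context):
    import boto3  # imported here so the helper above works without boto3

    s3 = boto3.client("s3")
    processed = 0
    for bucket, key in s3_objects(event):
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        for line in body.splitlines():
            if line.strip():
                processed += 1
                # Evaluate the result here and publish compliant messages
                # through your chosen communication service.
    return {"processed": processed}
```

The IAM role created in step 1 needs permission to read from the output bucket, in addition to whatever permissions the chosen messaging service requires.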

Conclusion

Proactively scanning and classifying marketing content for compliance is a critical aspect of ensuring successful digital marketing campaigns while adhering to industry regulations. Leveraging the powerful combination of Amazon Comprehend, Amazon S3, and AWS Lambda enables the automatic analysis of text-based marketing messages and flagging of any non-compliant content before sending them to your customer. Following these steps provides you with the tools and knowledge to implement proactive scanning for your marketing content. This solution will help mitigate the risks of non-compliance, avoiding costly fines and reputational damage, while freeing up time for your content creation teams to focus on ideation and crafting compelling marketing messages. Regular monitoring and fine-tuning of the custom classification model should be conducted to ensure accurate identification of non-compliant language.

To get started with proactively scanning and classifying marketing content for compliance, see Amazon Comprehend Custom Classification.

About the Authors

Caroline Des Rochers

Caroline is a Solutions Architect at Amazon Web Services, based in Montreal, Canada. She works closely with customers, in both English and French, to accelerate innovation and advise them through technical challenges. In her spare time, she is a passionate skier and cyclist, always ready for new outdoor adventures.

Erika Houde Pearce

Erika Houde-Pearce is a bilingual Solutions Architect in Montreal, guiding small and medium businesses in eastern Canada through their cloud transformations. Her expertise empowers organizations to unlock the full potential of cloud technology, accelerating their growth and agility. Away from work, she spends her spare time with her golden retriever, Scotia.

Koushik Mani

Koushik Mani is a Solutions Architect at AWS. He previously worked as a Software Engineer for two years at Telstra, focusing on machine learning and cloud computing use cases. He completed his master’s degree in computer science at the University of Southern California. He is passionate about machine learning and generative AI use cases and building solutions.