Tag Archives: Manufacturing

Breaking down data silos: Volkswagen’s approach with Amazon DataZone

Post Syndicated from Bandana Das original https://aws.amazon.com/blogs/big-data/breaking-down-data-silos-volkswagens-approach-with-amazon-datazone/

Over the years, organizations have invested in building purpose-built cloud-based data warehouses that are siloed from one another. One of the major challenges these organizations encounter today is enabling cross-organization discovery and access to data across these siloed data warehouses built using different technology stacks. The data mesh pattern addresses these issues, founded in four principles: domain-oriented decentralized data ownership and architecture, treating data as a product, providing self-serve data infrastructure as a platform, and implementing federated governance. The data mesh pattern helps organizations mimic their organizational structure into data domains and makes it possible to share the data across the organization and beyond to improve their business models.

In 2019, Volkswagen AG and Amazon Web Services (AWS) started their collaboration to co-develop the Digital Production Platform (DPP), with the goal of enhancing production and logistics efficiency by 30% while reducing production costs by the same margin. The DPP was developed to streamline access to data from shop floor devices and manufacturing systems by handling integrations and providing a range of standardized interfaces. However, as applications and use cases evolved on the platform, a significant challenge emerged: the ability to share data across applications stored in isolated data warehouses (within Amazon Redshift in isolated AWS accounts designated for specific use cases), without the need to consolidate data into a central data warehouse. Another challenge was discovering all the available data stored across multiple data warehouses and facilitating a workflow to request access to data across business domains within each plant. The common method used was largely manual, relying on emails and general communication (through tickets and emails). The manual approach not only increased the overhead but also varied from one use case to another in terms of data governance.

In this post, we introduce Amazon DataZone and explore how Volkswagen used Amazon DataZone to build their data mesh, tackle the challenges encountered, and break the data silos. A key aspect of the solution was enabling data providers to automatically publish their data products to Amazon DataZone, serving as a central data mesh for enhanced data discoverability. Additionally, we provide code to guide you through the deployment and implementation process.

Introduction to Amazon DataZone

Amazon DataZone is a data management service that makes it faster and straightforward to catalog, discover, share, and govern data stored across AWS, on-premises, and third-party sources. Key features of Amazon DataZone include the business data catalog, with which users can search for published data, request access, and start working on data in days instead of weeks. In addition, the service facilitates collaboration across teams and helps them manage and monitor data assets across different organizational units. The service also includes the Amazon DataZone portal, which offers a personalized analytics experience for data assets through a web-based application or API. Lastly, Amazon DataZone offers governed data sharing, which makes sure the right data is accessed by the right user for the right purpose with a governed workflow.

Solution overview

The following architecture diagram represents a high-level design that is built on top of the data mesh pattern. It separates source systems, data domain producers (data publishers), data domain subscribers (data consumers), and central governance to highlight the key aspects. This data mesh architecture is specially tailored for cross-AWS account usage. The objective of this approach is to create a foundation for building data governance on a scale, supporting the objectives of data producers and consumers with strong and consistent governance.

This architecture allows for the integration of multiple data warehouses into a centralized governance account that stores all the metadata from each environment.

A data domain producer uses Amazon Redshift as their analytical data warehouse to store, process, and manage structured and semi-structured data. The data domain producers load data into their respective Amazon Redshift clusters through extract, transform, and load (ETL) pipelines they manage, own, and operate. The producers maintain control over their data through Amazon Redshift security features, including column-level access controls and dynamic data masking, supporting data governance at the source. A data domain producer uses Amazon Redshift ETL and Amazon Redshift Spectrum to process and transform raw data into consumable data products. The data products could be Amazon Redshift tables, views, or materialized views.

Data domain producers expose datasets to the rest of the organization by registering them to Amazon DataZone service, which acts as a central data catalog. They can choose what data assets to share, for how long, and how consumers can interact with these. They’re also responsible for maintaining the data and making sure it’s accurate and current.

The data assets from the producers are then published using the data source run to Amazon DataZone in the central governance account. This process populates the technical metadata into the business data catalog for each data asset. The business metadata can be added by business users (data analysts) to provide business context, tags, and data classification for the datasets. This approach provides the required features to allow producers to create catalog entries with Amazon Redshift from all their data warehouses built in with Redshift clusters. In addition, the central data governance account is used to share datasets securely between producers and consumers. It’s important to note that sharing is done through metadata linking alone. No data (except logs) exists in the governance account. The data isn’t copied to the central account; just a reference to the data is used, so that the data ownership remains with the producer.

Amazon DataZone provides a streamlined way to search for data. The Amazon DataZone data portal provides a personalized view for users to discover and search data assets. An Amazon DataZone user (consumer) with permissions to access the data portal can search for assets and submit requests for subscription of data assets using a web-based application. An approver can then approve or reject the subscription request.

When a data domain consumer has access to an asset in the catalog, they can consume it (query and analyze) using the Amazon Redshift query editor. Each consumer runs their own workload based on their use case. In this way, the team can choose the tools for the job to perform analytics and machine learning activities in its AWS consumer environment.

Publishing and registering data assets to Amazon DataZone

To publish a data asset from the producer account, each asset must be registered in Amazon DataZone for consumer subscription. For more information, refer to Create and run an Amazon DataZone data source for Amazon Redshift. In the absence of an automated registration process, required tasks must be completed manually for each data asset.

Using the automated registration workflow, the manual steps can be automated for the Amazon Redshift data asset (Redshift table or view) that needs to be published in an Amazon DataZone domain or when there’s a schema change in an already published data asset.

The following architecture diagram represents how data assets from Amazon Redshift data warehouses have been automatically published to the data mesh created with Amazon DataZone.

The process consists of the following steps:

  1. In the producer account (Account B), the data to be shared resides in a Redshift cluster.
  2. The producer account (Account B) uses a mechanism to trigger the dataset registration AWS Lambda function with a specific payload containing the information and name of the database, schema, table, or view that has a change in metadata.
  3. The Lambda function performs the steps to automatically register and publish the dataset in Amazon DataZone:
    1. Get the Amazon Redshift clusterName, dbName, schemas, and tables from the JSON payload, which is used as the event to trigger the Lambda function.
    2. Get the Amazon DataZone data warehouse blueprint ID.
    3. Enable the blueprint in the data producer account.
    4. Identify the Amazon DataZone Domain ID and project ID for the producer via assuming role in Amazon DataZone account (Account A).
    5. Check if an environment already exists in the project. If not, create an environment.
    6. Create a new Redshift data source by providing the correct Redshift database information in the newly created environment.
    7. Initiate a data source run request in the data source to make the Redshift tables or views available in Amazon DataZone.
    8. Publish the tables or views in the Amazon DataZone catalog.

Prerequisites

The following prerequisites are required before starting:

  • Two AWS accounts to implement the solution have been described in this post. However, you can also use Amazon DataZone to publish data within a single account or across multiple accounts.
    • Amazon DataZone account (Account A) – This is the central data governance account, which will have the Amazon DataZone domain and project.
    • Data domain producer account (Account B) – This account acts as the data domain producer. It has been added as an associated account to Account A.

Prerequisites in data domain producer account (Account B)

As part of this post, we want to publish assets and subscribe to assets from a Redshift cluster that already exists. Complete the following prerequisite steps to set up Account B:

  1. Set up the Redshift cluster, including database, schema, tables, and views (optional). The node type must be from the RA3 family. For more information, see Amazon Redshift provisioned clusters.

    Create a superuser in Amazon Redshift for Amazon DataZone. For the Redshift cluster, the database user you provide in AWS Secrets Manager must have superuser permissions. For reference please see the note section in this QuickStart guide with sample Amazon Redshift data

  2. Store the user’s credentials in Secrets Manager. Select the credential type, enter the credential values, and choose the AWS Key Management Service (AWS KMS) key with which to encrypt the secret.
  3. Add the tags to the Secret Manager secret to allow Amazon DataZone to find this secret and limit the access to a particular Amazon DataZone domain and Amazon DataZone project. The Redshift cluster Amazon Resource Name (ARN) must be added as a tag so it can be used by Amazon Redshift as a valid credential. For reference please see the note section in this QuickStart guide with sample Amazon Redshift data
  4. Add an Amazon DataZone provisioning IAM role and Amazon Redshift manage access IAM role in the secret’s resource policy. The AWS Identity and Access Management (IAM) roles are created as part of the AWS Cloud Development Kit (AWS CDK) deployment (discussed later in this post). The following code shows an example of the Secrets Manager secret’s resource policy. Store the secret ARN in an AWS Systems Manager parameter.
    {
      "Version" : "2012-10-17",
      "Statement" : [ {
        "Effect" : "Allow",
        "Principal" : "*",
        "Action" : "secretsmanager:GetSecretValue",
        "Resource" : "*",
        "Condition" : {
          "ArnEquals" : {
            "aws:PrincipalArn" : [ 
              "arn:aws:iam::<<Data_Producer_Acct_Id(Account B)>>:role/DzRedshiftAccess-<<AWS_Region>>-<< Amazon_DataZone _Domain_Name>>",
              "arn:aws:iam::<<Data_Producer_Acct_Id(Account B)>>:role/DataZoneProvisioning-<< Amazon_DataZone_Account_id(Account A)>>"
            ]
          }
        }
      } ]
    }

    If your secret is encrypted with a custom KMS key, append the key policy with the following statement and add a tag to the key: AmazonDatazoneEnvironment = All. You can skip this step if you’re using an AWS managed KMS key.

    {
        "Effect": "Allow",
        "Principal": {
            "Service": "logs.<<AWS_Region>>.amazonaws.com",
            "AWS": "arn:aws:iam::<<Data_Producer_Acct_Id(Account B)>>:root"
        },
        "Action": [
            "kms:Decrypt",
            "kms:Encrypt",
            "kms:GenerateDataKey*",
            "kms:ReEncrypt*"
        ],
        "Resource": "*"
     },
     {
        "Sid": "AllowDatazoneRoles-DEV",
        "Effect": "Allow",
        "Principal": {
            "AWS": "*"
        },
        "Action": [
            "kms:Decrypt",
            "kms:Describe*",
            "kms:Get*",
            "kms:Encrypt",
            "kms:GenerateDataKey",
            "kms:ReEncrypt*",
            "kms:CreateGrant"
        ],
        "Resource": "*",
        "Condition": {
            "StringLike": {
                "aws:PrincipalArn": [
                    "arn:aws:iam::<<Data_Producer_Acct_Id(Account B)>>:role/aws-service-role/redshift.amazonaws.com/AWSServiceRoleForRedshift",
                    "arn:aws:iam::<<Data_Producer_Acct_Id(Account B)>>:role/datazone_*",
                    "arn:aws:iam::<<Data_Producer_Acct_Id(Account B)>>:role/<<Redshift_Cluster_IAM_Role>>",
                    "arn:aws:iam::<<Data_Producer_Acct_Id(Account B)>>:role/service-role/AmazonDataZoneRedshiftAccess-<<AWS_Region>>-*"
                ]
             }
         }
     } 

  5. Place a mechanism to generate the following payload to trigger the dataset registration Lambda function. The payload must contain the relevant Redshift database, schema, and table or view that you want to publish in the Amazon DataZone domain. The following example code assumes you have three databases in your Redshift cluster and within those databases you have different schemas, tables, and views. You should adjust the payload based on your use case.
    {
        "source": "redshift-user-initiated",
        "detail-type": "Amazon Redshift dataset registration in Amazon DataZone",
        "datasets": [
            {
                "clusterName": "<<YOUR_REDSHIFT_CLUSTER_NAME>>",
                "dbName":"<<YOUR_REDSHIFT_DATABASE_NAME_1>>",
                "schemas": [
                    {
                        "schemaName":"<<YOUR_REDSHIFT_SCHEMA_NAME>>",
                        "addAllTables":false,
                        "addAllViews":false,
                        "tables":[
                            "<<YOUR_REDSHIFT_TABLE_NAME>>",
                            "<<YOUR_REDSHIFT_TABLE_NAME>>"
                        ],
                        "views":[
                            "<<YOUR_REDSHIFT_VIEW_NAME>>"
                        ]
                    }
                ]
            },
            {
                "clusterName": "<<YOUR_REDSHIFT_CLUSTER_NAME>>",
                "dbName":"<<YOUR_REDSHIFT_DATABASE_NAME_2>>",
                "schemas": [
                    {
                        "schemaName":"<<YOUR_REDSHIFT_SCHEMA_NAME>>",
                        "addAllTables":true,
                        "addAllViews":true,
                        "tables":[],
                        "views":[]
                    }
                ]
            },
            {
                "clusterName": "<<YOUR_REDSHIFT_CLUSTER_NAME>>",
                "dbName":"<<YOUR_REDSHIFT_DATABASE_NAME_3>>",
                "schemas": [
                    {
                        "schemaName":"<<YOUR_REDSHIFT_SCHEMA_NAME>>",
                        "addAllTables":true,
                        "addAllViews":false,
                        "tables":[],
                        "views":[
                            "<<YOUR_REDSHIFT_VIEW_NAME>>"
                        ]
                    }
                ]
            }
        ]
    }

Prerequisites in Amazon DataZone account (Account A)

Complete the following steps to set up your Amazon DataZone account (Account A):

  1. Sign in to Account A and make sure you have already deployed an Amazon DataZone domain and a project within that domain. Refer to Create Amazon DataZone domains for instructions to create a domain.
  2. If your Amazon DataZone domain is encrypted with a KMS key, add the data domain account (Account B) to the KMS key policy with the following actions:
    "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:ReEncrypt*",
        "kms:GenerateDataKey*",
        "kms:DescribeKey"
    ]

  3. Create an IAM role that is assumable by Account B and make sure the role has a following policy attached and is a member (as contributor) of your Amazon DataZone project. For this post, we call the role dz-assumable-env-dataset-registration-role. By adding this role, you can successfully run the registration Lambda function.
    1. In the following policy, provide the AWS Region and account ID corresponding to where your Amazon DataZone domain is created, and the KMS key ARN used to encrypt the domain:
        {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Action": [
                      "datazone:CreateDataSource",
                      "datazone:CreateEnvironment",
                      "datazone:CreateEnvironmentProfile",
                      "datazone:GetDataSource",
                      "datazone:GetDataSourceRun",
                      "datazone:GetEnvironment",
                      "datazone:GetEnvironmentProfile",
                      "datazone:GetIamPortalLoginUrl",
                      "datazone:ListDataSources",
                      "datazone:ListDomains",
                      "datazone:ListEnvironmentProfiles",
                      "datazone:ListEnvironments",
                      "datazone:ListProjectMemberships",
                      "datazone:ListProjects",
                      "datazone:StartDataSourceRun",
                      "datazone:UpdateDataSource",
                      "datazone:SearchUserProfiles"
                  ],
                  "Resource": "*",
                  "Effect": "Allow"
              },
              {
                  "Action": [
                      "kms:Decrypt",
                      "kms:DescribeKey",
                      "kms:GenerateDataKey"
                  ],
                  "Resource": "arn:aws:kms:<<account_region>>:<<Datazone_Account_id(Account A)>>
      
      
      }:key/${DataZonekmsKey}",
                  "Effect": "Allow"
              }
          ]
      }

    2. Add Account B in the trust relationship of this role with the following trust relationship:
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                      "AWS": [
                          "arn:aws:iam::<<Data_Producer_Acct_Id(Account B)>>:root",
                          "arn:aws:iam::<<Datazone_Account_id(Account A)>>:root",
                      ]
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      }

    3. Add the role as a member of the Amazon DataZone project in which you want to register your data sources. For more information, see Add members to a project.

Additional tools

The following tools are needed to deploy the solution using the AWS CDK:

Deploy the solution

After you complete the prerequisites, use the AWS CDK stack provided on the GitHub repo to deploy the solution for automatic registration of data assets into the Amazon DataZone domain. Complete the following steps:

  1. Clone the repository from GitHub to your preferred integrated development environment (IDE) using the following commands:
    git clone https://github.com/aws-samples/sample-how-to-automate-amazon-redshift-cluster-data-asset-publish-to-amazon-datazone

    $ cd sample-how-to-automate-amazon-redshift-cluster-data-asset-publish-to-amazon-datazone

  2. At the base of the repository folder, run the following commands to build and deploy resources to AWS:
    $ npm install

    $ npm run lint

  3. Sign in to Account B (the data domain producer account) using the AWS CLI with your profile name.
  4. Make sure you have configured the Region in your credential’s configuration file.
  5. Bootstrap the AWS CDK environment with the following commands at the base of the repository folder. Provide the profile name of your deployment account (Account B). Bootstrapping is a one-time activity and is not needed if your AWS account is already bootstrapped.
    $ export AWS_PROFILE=<<PROFILE_NAME>>

    $ npm run cdk bootstrap

  6. Replace the placeholder parameters (marked with the suffix _PLACEHOLDER) in the file config/DataZoneConfig.ts:
    1. Amazon DataZone domain and project name of your Amazon DataZone instance. Make sure all names are in lowercase.
    2. The AWS account ID of the Amazon DataZone account (Account A).
    3. The assumable IAM role from the prerequisites.
    4. The AWS Systems Manager parameter name containing the Secrets Manager secret ARN of the Amazon Redshift credentials.

  7. Use the following command in the base folder to deploy the AWS CDK solution. During deployment, enter y if you want to deploy the changes for some stacks when you see the prompt Do you wish to deploy these changes (y/n)?
    npm run cdk deploy --all

  8. After the deployment is complete, sign in to Account B and open the AWS CloudFormation console to verify that the infrastructure was deployed.

Test automatic data registration to Amazon DataZone

Complete the following steps to test the solution:

  1. Sign in to Account B (producer account).
  2. On the Lambda console, open the datazone-redshift-dataset-registration function.
  3. Under TEST EVENTS, choose Create new test event.
  4. For Event name, enter Redshift, and for Event JSON, enter the following JSON structure (change the cluster, schema, database, and table names according to your environment):
    {
      "source": "redshift-user-initiated",
      "detail-type": "Amazon Redshift dataset registration in Amazon DataZone",
      "datasets": [
        {
          "clusterName": "YOUR_REDSHIFT_CLUSTER_NAME",
          "dbName": "DATABASE_NAME",
          "schemas": [
            {
              "schemaName": "SCHEMA_NAME_1",
              "addAllTables": false,
              "addAllViews": false,
              "tables": [
                "TABLE_NAME"
              ],
              "views": []
            },
            {
              "schemaName": "SCHEMA_NAME_2",
              "addAllTables": false,
              "addAllViews": false,
              "tables": [],
              "views": [
                "VIEW_NAME"
              ]
            }
          ]
        }
      ]
    }

  5. Choose Save.
  6. Choose Invoke.
  7. Open the Amazon DataZone console in Account A where you deployed the resources.
  8. Choose Domains in the navigation pane, then open your domain.
  9. On the domain details page, locate the Amazon DataZone data portal URL in the Summary section. Choose the link to the data portal.

    For more details about accessing Amazon DataZone, refer to How can I access Amazon DataZone?

  10. In the data portal, open your project and choose the Data tab.
  11. In the navigation pane, choose Data sources and find the newly created data source for Amazon Redshift.
  12. Verify that the data source has been successfully published.

After the data sources are published, users can discover the published data and submit a subscription request. The data producer can approve or reject requests. Upon approval, users can consume the data by querying the data in the Amazon Redshift query editor. The following screenshot illustrates data discovery in the Amazon DataZone data portal.

Clean up

Complete the following steps to clean up the resources deployed through the AWS CDK:

  1. Sign in to Account B, go to the Amazon DataZone domain portal, and check there is no subscription for your published data asset. If there is a subscription, either ask the subscriber to unsubscribe or revoke the subscription request.
  2. Delete the published data assets that were created in the Amazon DataZone project by the dataset registration Lambda function.
  3. Delete the remaining resources created using the following command in the base folder:
    npm run cdk destroy –all

Conclusion

Amazon DataZone offers a seamless integration with AWS services, providing a powerful solution for organizations like Volkswagen to break down their data silos and implement effective data mesh architectures through a straightforward implementation highlighted in this post. By using Amazon DataZone, Volkswagen addressed its immediate data sharing hurdles and laid the groundwork for a more agile, data-driven future in automotive manufacturing. The automated data publishing from various warehouses, coupled with standardized governance workflows, has significantly reduced the manual overhead that once slowed down Volkswagen’s data engineering teams. Now, instead of navigating a labyrinth of emails, tickets, and communication, Volkswagen’s data engineers and data scientists can quickly discover and access the data they need, all while maintaining their security and compliance standards.

By using Amazon DataZone, organizations can bring their isolated data together in ways that make it simpler for teams to collaborate while maintaining security and compliance at scale. This approach not only addresses current data governance challenges but also creates a highly scalable foundation for future data-driven innovations. For guidance on establishing your organization’s data mesh with Amazon DataZone, contact your AWS team today.


About the Authors

Bandana Das

Bandana Das

Bandana is a Senior Data Architect in AWS and specializes in data and analytics. She builds event-driven data architectures to support customers in data management and data-driven decision-making. She is also passionate about helping customers on their data management journey to the cloud.

Anirban Saha

Anirban Saha

Anirban is a DevOps Architect at AWS, specializing in architecting and implementation of solutions for customer challenges in the automotive domain. He is passionate about well-architected infrastructures, automation, data-driven solutions, and helping make the customer’s cloud journey as seamless as possible. In his spare time, he likes to keep himself engaged with reading, painting, language learning, and traveling.

Stoyan Stoyanov

Stoyan Stoyanov

Stoyan works for AWS as a DevOps Engineer. He has more than 10 years of experience in software engineering, cloud technologies, DevOps, data engineering, and security.

Sindi Cali

Sindi Cali

Sindi is a ProServe Associate Consultant with AWS Professional Services. She supports customers in building data-driven applications in AWS.

How Volkswagen Autoeuropa built a data mesh to accelerate digital transformation using Amazon DataZone

Post Syndicated from Dhrubajyoti Mukherjee original https://aws.amazon.com/blogs/big-data/how-volkswagen-autoeuropa-built-a-data-mesh-to-accelerate-digital-transformation-using-amazon-datazone/

This is a joint blog post co-authored with Martin Mikoleizig from Volkswagen Autoeuropa.

Volkswagen Autoeuropa is a Volkswagen Group plant that produces the T-Roc. The plant is located near Lisbon, Portugal and produces about 934 cars per day. In 2023, Volkswagen Autoeuropa represented 1.3% of the national GDP of Portugal and 4% in national export of goods impact with a sales volume of 3.3511 billion Euros. Volkswagen Autoeuropa aims to become a data-driven factory and has been using cutting-edge technologies to enhance digitalization efforts.

In this post, we discuss how Volkswagen Autoeuropa used Amazon DataZone to build a data marketplace based on data mesh architecture to accelerate their digital transformation. The data mesh, built on Amazon DataZone, simplified data access, improved data quality, and established governance at scale to power analytics, reporting, AI, and machine learning (ML) use cases. As a result, the data solution offers benefits such as faster access to data, expeditious decision making, accelerated time to value for use cases, and enhanced data governance.

Understanding Volkswagen Autoeuropa’s challenges

At the time of writing this post, Volkswagen Autoeuropa has already implemented more than 15 successful digital use cases in the context of real-time visualization, business intelligence, industrial computer vision, and AI.

Before the AWS partnership, Volkswagen Autoeuropa faced the following challenges.

  • Long lead time to access data – The digital use cases launched by Volkswagen Autoeuropa spent most of their project time getting access to the data that was relevant to their use cases. After the right data for the use case was found, the IT team provided access to the data through manual configuration. The lead time to access data was often from several days to weeks.
  • Insufficient data governance and auditing – Data was shared directly to use cases by copying it. Therefore, the IT team connected the data manually from their sources to the desired destinations multiple times. This process wasn’t centrally tracked to discover any information on the data sharing process. For example, if the data was copied in the past, how many use cases have access to the data, when access was granted, and who granted the access.
  • Redundant effort to process the same information – Because the IT team copied the data sources based on the exact use case requirements, they shared specific columns of the tables from the data. As additional use cases requested access to the same data with different column requirements, even more copies of the data were created.
  • Repeated process to establish security and governance guardrails – Each time the IT and the security team provided a connection to a new data source, they had to set up the security and governance guardrails. This required repeated manual effort.
  • Data quality issues – Because the data was processed redundantly and shared multiple times, there was no guarantee of or control over the quality of the data. This led to reduced trust in the data.
  • Absence of data catalog and metadata management – Data didn’t have any metadata associated with it, and so use cases couldn’t consume the data without further explanation from the data source owners and specialists. Furthermore, no process to discover new data existed. Similar to the consumption process, use cases would consult specialists to understand the context of the data and if it could provide value.

Envisioning a data solution for Volkswagen Autoeuropa

To address these challenges, Volkswagen Autoeuropa embarked on a bold vision. They envisioned a seamless data consumption process, similar to an online shopping experience. They envisioned a data marketplace where data users could browse and access high-quality, secure data with clear specifications, business context, and relevant attributes. This vision materialized into a project aimed at transforming data accessibility and governance as the foundation for the digital ecosystem. The vision to be realized: Data as seamless as online shopping.

In collaboration with Amazon Web Services (AWS), Volkswagen Autoeuropa joined the Enhanced Plant Onboarding Program of the Global Volkswagen Group’s Digital Production Platform (DPP EPO) strategy. Through this partnership, AWS and Volkswagen Autoeuropa created a data marketplace that significantly improved data availability.

In the discovery phase of the project, Volkswagen Autoeuropa and AWS evaluated several options to build the data solution. In the end, Volkswagen Autoeuropa chose a solution based on data mesh architecture using Amazon DataZone. Being a managed service, Amazon DataZone provided the necessary speed and agility to build the solution. At the same time, it led to higher operational efficiencies and lower operational overhead. The team adopted a data mesh architecture because the principles of the data mesh aligned with Volkswagen Autoeuropa’s vision of being a data driven factory.

Solution overview

This section describes the key features and architecture of the Volkswagen Autoeuropa data solution. The solution is based on a data mesh architecture.

Data solution features

The following figure shows the key capabilities of the Volkswagen Autoeuropa data solution.

The key capabilities of the solution are:

  • Data quality – In the solution, we’ve built a data quality framework to streamline the process of data quality checks and publishing quality scores. It uses AWS Glue Data Quality to generate recommendation rulesets, run orchestrated jobs, store results, and send notifications to users. This framework can be seamlessly integrated into AWS Glue jobs, providing a quality score for data pipeline jobs. In addition, the quality score is published in the Amazon DataZone data portal, allowing consumers to subscribe to the data based on its quality score.Assigning a quality score to the data helps build trust in the data, and shifts the responsibility of maintaining data quality to the data owner. As a result, the quality of the results delivered by these use cases improves.
  • Data registration – The producers sign in to the Amazon DataZone data portal using their AWS Identity and Access Management (IAM) credentials or single sign-on with integration through AWS IAM Identity Center. They register their data assets, which are stored in Amazon Simple Storage Service (Amazon S3), in the Amazon DataZone data catalog. The metadata of the data assets is stored in an AWS Glue catalog and made available in the business data catalog of Amazon DataZone and in the Amazon DataZone data source. The producers add business context such as business unit name, data owner contact information, and data refresh frequency using Amazon DataZone glossaries and metadata forms. In addition, they use generative AI capabilities to generate business metadata. After the business metadata is generated, they review the changes and modify the metadata if needed.Because all data products in Volkswagen Autoeuropa are now registered in the same location, the likelihood of data duplication is significantly reduced. Moreover, the data producers are improving the quality of the data by adding business context to it.
  • Data discovery – The consumers sign in to the Amazon DataZone data portal using their IAM credentials or single sign-on with integration through IAM Identity Center and search the data using keywords in the search bar. After the results are returned, they can further filter the results using glossary terms and project names. Finally, they review the business metadata of the data assets to evaluate if the data is relevant to their business use cases. They can check the quality score of the data assets and the refresh schedule for their use cases.With a data discovery capability in place, consumers can gain information about the data without the need to consult the source system owners or specialists.
  • Data access management – When the consumers find a data asset that’s relevant to their use case, they request access to it using the subscription feature of Amazon DataZone. Data is classified as public, internal, and confidential. For public and internal data assets, the access request is automatically approved. For confidential data assets, the data producer team reviews the access request and either accepts or rejects the subscription request.With a central place to manage data access, data owners can view which use cases have access to their data and when the access request was granted. The fine-grained access control feature of Amazon DataZone gives data owners granular control of their data at the row and column levels.
  • Data consumption – Upon approval of the subscription request, Amazon DataZone provisions the backend infrastructure to make the data accessible to the corresponding consumers. After this process is complete, the consumers can access the data through Amazon Athena using the deep link feature of Amazon DataZone. The data consumption pattern in Volkswagen Autoeuropa supports two use cases:
    • Cloud-to-cloud consumption – Both data assets and consumer teams or applications are hosted in the cloud.
    • Cloud-to-on-premises consumption – Data assets are hosted in the cloud and consumer use cases or applications are hosted on-premises.

Requirements specific to a use case requires access to the relevant data assets; sharing data to use cases using Amazon DataZone doesn’t require creating multiple copies. As a result, duplication and processing of data. Furthermore, by reducing the number of copies of the data, the overall quality of the data products improves. In addition, the backend automation of Amazon DataZone to make data available to use cases reduces the manual effort and improves the lead time to access data.

  • Single collaborative environment – The Amazon DataZone data portal provides a single collaborative environment to the users in Volkswagen Autoeuropa. Data consumers such as use case owners, data engineers, data scientists, and ML engineers can browse and request access to data assets. At the same time, data producers, such as use case owners and source system owners, can publish and curate their data in the Amazon DataZone data portal. This collaborative experience promotes teamwork and accelerates the realization of business value. Furthermore, the security and governance guardrails scales across the organization as the number of use cases increases.

Data solution architecture

The following figure displays the reference architecture of the data solution at Volkswagen Autoeuropa. In the next part of the post, we discuss how we arrived at the solution.

The architecture includes:

  1. The data from SAP applications, manufacturing execution systems (MES), and supervisory control and data acquisition (SCADA) systems is ingested into the producer accounts of Volkswagen Autoeuropa.
  2. In the producer account, raw data is transformed using AWS Glue. The technical metadata of the data is stored in AWS Glue catalog. The data quality is measured using the data quality framework. The data stored in Amazon Simple Storage Service (Amazon S3) is registered as an asset in the Amazon DataZone data catalog hosted in the central governance account.
  3. The central governance account hosts the Amazon DataZone domain and the related Amazon DataZone data portal. The AWS accounts of the data producers and consumers are associated with the Amazon DataZone domain. Amazon DataZone projects belonging to the data producers and consumers are created under the related Amazon DataZone domain units.
  4. Consumers of the data products sign in to the Amazon DataZone data portal hosted in the central governance account using their IAM credentials or single sign-on with integration through IAM Identity Center. They search, filter, and view asset information (for example, data quality, business, and technical metadata).
  5. After the consumer finds the asset they need, they request access to the asset using the subscription feature of Amazon DataZone. Based on the validity of the request, the asset owner approves or rejects the request.
  6. After the subscription request is granted and fulfilled, the asset is accessed in the consumer account for a one-time query using Athena and Microsoft Power BI applications hosted on premises. This consumption pattern can be extended for AI and machine learning (AI/ML) model development using Amazon SageMaker and reporting purposes using Amazon QuickSight.

User journey

After discussing the desired system with the use case teams and stakeholders and analyzing the current workflow, Volkswagen Autoeuropa grouped the user personas of the data solution into three main categories: data producer, data consumer, and data solution administrator. This sets the foundation for the desired user experience and what’s needed to achieve the solution goals.

Data producer

Data producers create the data products in the data solution. There are two types of data producers.

  • Data source owners – Data source owners publish the raw data in the Amazon DataZone data portal. These data products are attributed as source-based data.
  • Use case owners – Use case owners publish data that’s fit for consumption by other use cases. These data products are called consumer-based data.

The following figure shows the user journey of a data producer:

 

A data producer’s journey includes:

  1. Identify data of interest
    1. Identify data (Volkswagen Autoeuropa network).
    2. Perform data quality checks (Volkswagen Autoeuropa network).
  2. Connect data to the data solution
    1. Ingest data into the data solution (Amazon DataZone portal).
    2. Start process to connect data using AWS Glue.
  3. Locate the data source in the data solution
    1. Register data (Amazon DataZone portal).
    2. Add data to the inventory in Amazon DataZone.
  4. Add or edit metadata
    1. Add or edit metadata (Amazon DataZone portal).
    2. Publish data assets (Amazon DataZone portal).
  5. Approve or reject subscription request
    1. Review subscription requests.
  6. Maintain data assets
    1. Manage data assets (Amazon DataZone portal).

Data consumer

Data consumers use data for business analytics, machine learning, AI, and business reporting. Data consumers are data engineers, data scientists, ML engineers, and business users. The following diagram shows the journey of a data consumer.

A data consumer’s journey includes:

  1. Access Amazon DataZone portal
    1. Amazon DataZone portal – Access is granted based on the user’s assigned domain and projects.
  2. Search for data assets
    1. Data assets in Amazon DataZone portal – Search for data and brows the results by glossary terms or the project name. Use additional filters to refine the results.
  3. View business metadata
    1. Select a data asset to see additional information – Review the description, data quality score and metadata.
  4. Request access to data (subscribe)
    1. Subscribe to request access.
    2. After the subscription request is approved, review the data products that you have access to.
    3. Query the data to view and consume the data.
  5. Retrieve additional data
    1. Repeat the steps as needed to access and retrieve additional data.

Data solution administrator

Data solution administrators are responsible for performing administrative tasks on the data solution. The following figure shows the common tasks performed by the data solution administrator.

A data administrator’s journey includes:

  1. Manage projects
    1. Manage Amazon DataZone domain.
    2. Manage Amazon DataZone projects within the domain.
  2. Manage environment
    1. Set up the environment to manage the infrastructure.
  3. Manage business metadata glossary
    1. Manage and enable Amazon DataZone glossaries and metadata forms.
  4. Manage data assets
    1. Manage assets.
    2. Query the data to view and consume the data.
  5. Manage access to data solution
    1. Monitor and revoke access when appropriate.

Conclusion

In this post, you learned how Volkswagen Autoeuropa embarked on a bold vision to become a data driven factory. It shows how this vision was put into action by building a data solution based on data mesh architecture using Amazon DataZone. It highlights the key features and architecture of the data solutions and presents the user journey. As of writing this post, Volkswagen Autoeuropa reduced the data discovery time from days to minutes using the data solution. The time to access data took several weeks before the Volkswagen Autoeuropa and AWS collaboration. Now, with the help of the data solution, the data access time has been reduced to several minutes.

In May 2024, the team achieved a major milestone by successfully offering data on the data solution and transporting it instantly to Power BI, a process that previously took several weeks.

“After one year of work, we did the full roundtrip from offering data on our new data marketplace built using Amazon DataZone to transporting it instantly to third-party tools, a process that previously took several weeks. This was a big achievement for our team.”

– Jorge Paulino, Product owner of the data solution. Volkswagen Autoeuropa.

The next post of the two-part series details discusses how we built the solution, its technical details, and the business value created.

If you want to harness the agility and scalability of a data mesh architecture and Amazon DataZone to accelerate innovation and drive business value for your organization, we have the resources to get you started. Be sure to check out the AWS Prescriptive Guidance: Strategies for building a data mesh-based enterprise solution on AWS. This comprehensive guide covers the key considerations and best practices for establishing a robust, well-governed data mesh on AWS. From aligning your data mesh with overall business strategy to scaling the data mesh across your organization, this Prescriptive Guidance provides a clear roadmap to help you succeed.

If you’re curious to get hands-on, see the GitHub repository: Building an enterprise Data Mesh with Amazon DataZone, Amazon DataZone, AWS CDK, and AWS CloudFormation. This open source project delivers a step-by-step guide to build a data mesh architecture using Amazon DataZone, AWS Cloud Development Kit (AWS CDK), and AWS CloudFormation.


About the Authors

Dhrubajyoti Mukherjee is a Cloud Infrastructure Architect with a strong focus on data strategy, data analytics, and data governance at Amazon Web Services (AWS). He uses his deep expertise to provide guidance to global enterprise customers across industries, helping them build scalable and secure AWS solutions that drive meaningful business outcomes. Dhrubajyoti is passionate about creating innovative, customer-centric solutions that enable digital transformation, business agility, and performance improvement. An active contributor to the AWS community, Dhrubajyoti authors AWS Prescriptive Guidance publications, blog posts, and open-source artifacts, sharing his insights and best practices with the broader community. Outside of work, Dhrubajyoti enjoys spending quality time with his family and exploring nature through his love of hiking mountains.

Ravi Kumar is a Data Architect and Analytics expert at Amazon Web Services; he finds immense fulfillment in working with data. His days are dedicated to designing and analyzing complex data systems, uncovering valuable insights that drive business decisions. Outside of work, he unwinds by listening to music and watching movies, activities that allow him to recharge after a long day of data wrangling.

Martin Mikoleizig studied mechanical engineering and production technology at the RWTH Aachen University before starting to work in Dr. h.c. Ing. F. Porsche AG 2015 as a production planner for the engine assembly. In several years as a Project Manager on Testing Technology for new engine models he also introduced several innovations like human-machine-collaborations and intelligent assistance systems. From 2017, he was responsible for the Shopfloor IT team of the module lines in Zuffenhausen before he became responsible for the Planning of the E-Drive assembly at Porsche. Beside this he was responsible for the Digitalisation Strategy of the Production Ressort at Porsche. Since October 2022, he has been assigned to Volkswagen Autoeuropa in Portugal in the role of a Digital Transformation Manager for the plant driving the Digital Transformation towards a Data Driven Factory.

Weizhou Sun is a Lead Architect at Amazon Web Services, specializing in digital manufacturing solutions and IoT. With extensive experience in Europe, she has enhanced operational efficiencies, reducing latency and increasing throughput. Weizhou’s expertise includes Industrial Computer Vision, predictive maintenance, and predictive quality, consistently delivering top performance and client satisfaction. A recognized thought leader in IoT and remote driving, she has contributed to business growth through innovations and open-source work. Committed to knowledge sharing, Weizhou mentors colleagues and contributes to practice development. Known for her problem-solving skills and customer focus, she delivers solutions that exceed expectations. In her free time, Weizhou explores new technologies and fosters a collaborative culture.

Shameka Almond is an Advisory Consultant at Amazon Web Services. She works closely with enterprise customers to help them better understand the business impact and value of implementing data solutions, including data governance best practices. Shameka has over a decade of wide-ranging IT experience in the manufacturing and aerospace industries, and the nonprofit sector. She has supported several data governance initiatives, helping both public and private organizations identify opportunities for improvement and increased efficiency. Outside of the office she enjoys hosting large family gatherings, and supporting community outreach events dedicated to introducing students in K-12 to STEM.

Adjoa Taylor has over 20 years of experience in industrial manufacturing, providing industry and technology consulting services, digital transformation, and solution delivery. Currently Adjoa leads Product Centric Digital Transformation, enabling customers to solve complex manufacturing problems by leveraging Smart Factory and Industry leading transformation mechanisms. Most recently driving value with AI/ML and generative AI use-cases for the plant floor. Adjoa is an experienced leader spending over 20 years of her career delivering projects in countries throughout North America, Latin America, Europe, and Asia. Through prior roles, Adjoa brings deep experience across multiple business segments with a focus on business outcome driven solutions. Adjoa is passionate about helping customers solve problems while realizing the art of the possible via the right impacting value-based solution.

How Volkswagen streamlined access to data across multiple data lakes using Amazon DataZone – Part 1

Post Syndicated from Bandana Das original https://aws.amazon.com/blogs/big-data/how-volkswagen-streamlined-access-to-data-across-multiple-data-lakes-using-amazon-datazone-part-1/

Over the years, organizations have invested in creating purpose-built, cloud-based data lakes that are siloed from one another. A major challenge is enabling cross-organization discovery and access to data across these multiple data lakes, each built on different technology stacks. A data mesh addresses these issues with four principles: domain-oriented decentralized data ownership and architecture, treating data as a product, providing self-serve data infrastructure as a platform, and implementing federated governance. Data mesh enables organizations to organize around data domains with a focus on delivering data as a product.

In 2019, Volkswagen AG (VW) and Amazon Web Services (AWS) formed a strategic partnership to co-develop the Digital Production Platform (DPP), aiming to enhance production and logistics efficiency by 30 percent while reducing production costs by the same margin. The DPP was developed to streamline access to data from shop-floor devices and manufacturing systems by handling integrations and providing standardized interfaces. However, as applications evolved on the platform, a significant challenge emerged: sharing data across applications stored in multiple isolated data lakes in Amazon Simple Storage Service (Amazon S3) buckets in individual AWS accounts without having to consolidate data into a central data lake. Another challenge is discovering available data stored across multiple data lakes and facilitating a workflow to request data access across business domains within each plant. The current method is largely manual, relying on emails and general communication, which not only increases overhead but also varies from one use case to another in terms of data governance. This blog post introduces Amazon DataZone and explores how VW used it to build their data mesh to enable streamlined data access across multiple data lakes. It focuses on the key aspect of the solution, which was enabling data providers to automatically publish data assets to Amazon DataZone, which served as the central data mesh for enhanced data discoverability. Additionally, the post provides code to guide you through the implementation.

Introduction to Amazon DataZone

Amazon DataZone is a data management service that makes it faster and easier for customers to catalog, discover, share, and govern data stored across AWS, on premises, and third-party sources. Key features of Amazon DataZone include a business data catalog that allows users to search for published data, request access, and start working on data in days instead of weeks. Amazon DataZone projects enable collaboration with teams through data assets and the ability to manage and monitor data assets across projects. It also includes the Amazon DataZone portal, which offers a personalized analytics experience for data assets through a web-based application or API. Lastly, Amazon DataZone governed data sharing ensures that the right data is accessed by the right user for the right purpose with a governed workflow.

Architecture for Data Management with Amazon DataZone

Figure 1: Data mesh pattern implementation on AWS using Amazon DataZone

The architecture diagram (Figure 1) represents a high-level design based on the data mesh pattern. It separates source systems, data domain producers (data publishers), data domain consumers (data subscribers), and central governance to highlight key aspects. This cross-account data mesh architecture aims to create a scalable foundation for data platforms, supporting producers and consumers with consistent governance.

  1. A data domain producer resides in an AWS account and uses Amazon S3 buckets to store raw and transformed data. Producers ingest data into their S3 buckets through pipelines they manage, own, and operate. They are responsible for the full lifecycle of the data, from raw capture to a form suitable for external consumption.
  2. A data domain producer maintains its own ETL stack using AWS Glue, AWS Lambda to process, AWS Glue Databrew to profile the data and prepare the data asset (data product) before cataloguing it into AWS Glue Data Catalog in their account.
  3. A second pattern could be that a data domain producer prepares and stores the data asset as table within Amazon Redshift using AWS S3 Copy.
  4. Data domain producers publish data assets using datasource run to Amazon DataZone in the Central Governance account. This populates the technical metadata in the business data catalog for each data asset. The business metadata, can be added by business users to provide business context, tags, and data classification for the datasets. Producers control what to share, for how long, and how consumers interact with it.
  5. Producers can register and create catalog entries with AWS Glue from all their S3 buckets. The central governance account securely shares datasets between producers and consumers via metadata linking, with no data (except logs) existing in this account. Data ownership remains with the producer.
  6. With Amazon DataZone, once data is cataloged and published into the DataZone domain, it can be shared with multiple consumer accounts.
  7. The Amazon DataZone Data portal provides a personalized view for users to discover/search and submit requests for subscription of data assets using a web-based application. The data domain producer receives the notification of subscription requests in the Data portal and can approve/reject the requests.
  8. Once approved, the consumer account can read and further process data assets to implement various use cases with AWS Lambda, AWS Glue, Amazon Athena, Amazon Redshift query editor v2, Amazon QuickSight (Analytics use cases) and with Amazon Sagemaker (Machine learning use cases).

Manual process to publish data assets to Amazon DataZone

To publish a data asset from the producer account, each asset must be registered in Amazon DataZone as a data source for consumer subscription. The Amazon DataZone User Guide provides detailed steps to achieve this. In the absence of an automated registration process, all required tasks must be completed manually for each data asset.

How to automate publishing data assets from AWS Glue Data Catalog from the producer account to Amazon DataZone

Using the automated registration workflow, the manual steps can be automated for any new data asset that needs to be published in an Amazon DataZone domain or when there’s a schema change in an already published data asset.

The automated solution reduces the repetitive manual steps to publish the data sources (AWS Glue tables) into an Amazon DataZone domain.

Architecture for automated data asset publish

Figure 2 Architecture for automated data publish to Amazon DataZone

To automate publishing data assets:

  1. In the producer account (Account B), the data to be shared resides in an Amazon S3 bucket (Figure 2). An AWS Glue crawler is configured for the dataset to automatically create the schema using AWS Cloud Development Kit (AWS CDK).
  2. Once configured, the AWS Glue crawler crawls the Amazon S3 bucket and updates the metadata in the AWS Glue Data Catalog. The successful completion of the AWS Glue crawler generates an event in the default event bus of Amazon EventBridge.
  3. An EventBridge rule is configured to detect this event and invoke a dataset-registration AWS Lambda function.
  4. The AWS Lambda function performs all the steps to automatically register and publish the dataset in Amazon Datazone.

Steps performed in the dataset-registration AWS Lambda function

    • The AWS Lambda function retrieves the AWS Glue database and Amazon S3 information for the dataset from the Amazon Eventbridge event triggered by the successful run of the AWS Glue crawler.
    • It obtains the Amazon DataZone Datalake blueprint ID from the producer account and the Amazon DataZone domain ID and project ID by assuming an IAM role in the central governance account where the Amazon Datazone domain exists.
    • It enables the Amazon DataZone Datalake blueprint in the producer account.
    • It checks if the Amazon Datazone environment already exists within the Amazon DataZone project. If it does not, then it initiates the environment creation process. If the environment exists, it proceeds to the next step.
    • It registers the Amazon S3 location of the dataset in Lake Formation in the producer account.
    • The function creates a data source within the Amazon DataZone project and monitors the completion of the data source creation.
    • Finally, it checks whether the data source sync job in Amazon DataZone needs to be started. If new AWS Glue tables or metadata is created or updated, then it starts the data source sync job.

Prerequisites

As part of this solution, you will publish data assets from an existing AWS Glue database in a producer account into an Amazon DataZone domain for which the following prerequisites need to be performed.

  1. You need two AWS accounts to deploy the solution.
    • One AWS account will act as the data domain producer account (Account B) which will contain the AWS Glue dataset to be shared.
    • The second AWS account is the central governance account (Account A), which will have the Amazon DataZone domain and project deployed. This is the Amazon DataZone account.
    • Ensure that both the AWS accounts belong to the same AWS Organization
  2. Remove the IAMAllowedPrincipals permissions from the AWS Lake Formation tables for which Amazon DataZone handles permissions.
  3. Make sure in both AWS accounts that you have cleared the checkbox for Default permissions for newly created databases and tables under the Data Catalog settings in Lake Formation (Figure 3).

    Figure 3: Clear default permissions in AWS Lake Formation

  4. Sign in to Account A (central governance account) and make sure you have created an Amazon DataZone domain and a project within the domain.
  5. If your Amazon DataZone domain is encrypted with an AWS Key Management Service (AWS KMS) key, add Account B (producer account) to the key policy with the following actions:
    {
      "Sid": "Allow use of the key",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<Account B>:root"
      },
      "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:ReEncrypt*",
        "kms:GenerateDataKey*",
        "kms:DescribeKey"
      ],
      "Resource": "*"
    }

  6. Ensure you have created an AWS Identity and Access Management (IAM) role that Account B (producer account) can assume and this IAM role is added as a member (as contributor) of your Amazon DataZone project. The role should have the following permissions:
    • This IAM role is called dz-assumable-env-dataset-registration-role in this example. Adding this role will enable you to successfully run the dataset-registration Lambda function. Replace the account-region, account id, and DataZonekmsKey in the following policy with your information. These values correspond to where your Amazon DataZone domain is created and the AWS KMS key Amazon Resource Name (ARN) used to encrypt the Amazon DataZone domain.
      {
          "Version": "2012-10-17",
          "Statement": [
               {
                  "Action": [
                      "DataZone:CreateDataSource",
                     "DataZone:CreateEnvironment",
                     "DataZone:CreateEnvironmentProfile",
                     "DataZone:GetDataSource",
                     "DataZone:GetEnvironment",
                     "DataZone:GetEnvironmentProfile",
                     "DataZone:GetIamPortalLoginUrl",
                     "DataZone:ListDataSources",
                      "DataZone:ListDomains",
                      "DataZone:ListEnvironmentProfiles",
                      "DataZone:ListEnvironments",
                      "DataZone:ListProjectMemberships",
                     "DataZone:ListProjects",
                      "DataZone:StartDataSourceRun"
                  ],
                  "Resource": "*",
                  "Effect": "Allow"
              },
              {
                  "Action": [
                       "kms:Decrypt",
                      "kms:DescribeKey",
                      "kms:GenerateDataKey"
                  ],
                 "Resource": "arn:aws:kms:${account_region}:${account_id}:key/${DataZonekmsKey}",
                  "Effect": "Allow"
              }
          ]
      }

    • Add the AWS account in the trust relationship of this role with the following trust relationship. Replace ProducerAccountId with the AWS account ID of Account B (data domain producer account).
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                      "AWS": [
                          "arn:aws:iam::${ProducerAccountId}:root",
                      ]
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      } }

  7. The following tools are needed to deploy the solution using AWS CDK:

Deployment Steps

After completing the pre-requisites, use the AWS CDK stack provided on GitHub to deploy the solution for automatic registration of data assets into DataZone domain

  1. Clone the repository from GitHub to your preferred IDE using the following commands.
    git clone https://github.com/aws-samples/automate-and-simplify-aws-glue-data-asset-publish-to-amazon-datazone.git
    
    cd automate-and-simplify-aws-glue-data-asset-publish-to-amazon-datazone

  2. At the base of the repository folder, run the following commands to build and deploy resources to AWS.
    npm install 
    npm run lint

  3. Sign in to the AWS account B (the data domain producer account) using AWS Command Line Interface (AWS CLI) with your profile name.
  4. Ensure you have configured the AWS Region in your credential’s configuration file.
  5. Bootstrap the CDK environment with the following commands at the base of the repository folder. Replace <PROFILE_NAME> with the profile name of your deployment account (Account B). Bootstrapping is a one-time activity and is not needed if your AWS account is already bootstrapped.
    export AWS_PROFILE=<PROFILE_NAME>
    npm run cdk bootstrap

  6. Replace the placeholder parameters (marked with the suffix _PLACEHOLDER) in the file config/DataZoneConfig.ts (Figure 4).
    • Amazon DataZone domain and project name of your Amazon DataZone instance. Make sure all names are in lowercase.
    • The AWS account ID and Region.
    • The assumable IAM role from the prerequisites.
    • The deployment role starting with cfn-xxxxxx-cdk-exec-role-.

Figure 4: Edit the DataZoneConfig file

  1. In the AWS Management Console for Lake Formation, select Administrative roles and tasks from the navigation pane (Figure 5) and make sure the IAM role for AWS CDK deployment that starts with cfn-xxxxxx-cdk-exec-role- is selected as an administrator in Data lake administrators. This IAM role needs permissions in Lake Formation to create resources, such as an AWS Glue database. Without these permissions, the AWS CDK stack deployment will fail.

Figure 5: Add cfn-xxxxxx-cdk-exec-role- as a Data Lake administrator

  1. Use the following command in the base folder to deploy the AWS CDK solution
    npm run cdk deploy --all

During deployment, enter y if you want to deploy the changes for some stacks when you see the prompt Do you wish to deploy these changes (y/n)?

  1. After the deployment is complete, sign in to your AWS account B (producer account) and navigate to the AWS CloudFormation console to verify that the infrastructure deployed. You should see a list of the deployed CloudFormation stacks as shown in Figure 6.

Figure 6: Deployed CloudFormation stacks

Test automatic data registration to Amazon DataZone

To test, we use the Online Retail Transactions dataset from Kaggle as a sample dataset to demonstrate the automatic data registration.

  1. Download the Online Retail.csv file from Kaggle dataset.
  2. Login to AWS Account B (producer account) and navigate to the Amazon S3 console, find the DataZone-test-datasource S3 bucket, and upload the csv file there (Figure 7).

Figure 7: Upload the dataset CSV file

  1. The AWS Glue crawler is scheduled to run at a specific time each day. However for testing, you can manually run the crawler by going to the AWS Glue console and selecting Crawlers from the navigation pane. Run the on-demand crawler starting with DataZone-. After the crawler has run, verify that a new table has been created.
  2. Go to the Amazon DataZone console in AWS account A (central governance account) where you deployed the resources. Select Domains in the navigation pane (Figure 8), then Select and open your domain.

    Figure 8: Amazon DataZone domains

  3. After you open the Datazone Domain, you can find the Amazon Datazone data portal URL in the Summary section (Figure 9). Select and open data portal.

    Figure 9: Amazon DataZone data portal URL

  4. In the data portal find your project (Figure 10). Then select the Data tab at the top of the window.

    Figure 10: Amazon DataZone Project overview

  5. Select the section Data Sources (Figure 11) and find the newly created data source DataZone-testdata-db.

    Figure 11:  Select Data sources in the Amazon Datazone Domain Data portal

  6. Verify that the data source has been successfully published (Figure 12).

    Figure 12:  The data sources are visible in the Published data section

  7. After the data sources are published, users can discover the published data and can submit a subscription request. The data producer can approve or reject requests. Upon approval, users can consume the data by querying data in Amazon Athena. Figure 13 illustrates data discovery in the Amazon DataZone data portal.

    Figure 13: Example data discovery in the Amazon DataZone portal

Clean up

Use the following steps to clean up the resources deployed through the CDK.

  1. Empty the two S3 buckets that were created as part of this deployment.
  2. Go to the Amazon DataZone domain portal and delete the published data assets that were created in the Amazon DataZone project by the dataset-registration Lambda function.
  3. Delete the remaining resources created using the following command in the base folder:
    npm run cdk destroy --all

Conclusion

By using AWS Glue and Amazon DataZone, organizations can make their data management easier and allow teams to share and collaborate on data smoothly. Automatically sending AWS Glue data to Amazon DataZone not only makes the process simple but also keeps the data consistent, secure, and well-governed. Simplify and standardize publishing data assets to Amazon DataZone and streamline data management with Amazon DataZone. For guidance on establishing your organization’s data mesh with Amazon DataZone, contact your AWS team today.


About the Authors

Bandana Das is a Senior Data Architect at Amazon Web Services and specializes in data and analytics. She builds event-driven data architectures to support customers in data management and data-driven decision-making. She is also passionate about enabling customers on their data management journey to the cloud.

Anirban Saha is a DevOps Architect at AWS, specializing in architecting and implementation of solutions for customer challenges in the automotive domain. He is passionate about well-architected infrastructures, automation, data-driven solutions and helping make the customer’s cloud journey as seamless as possible. Personally, he likes to keep himself engaged with reading, painting, language learning and traveling.

Chandana Keswarkar is a Senior Solutions Architect at AWS, who specializes in guiding automotive customers through their digital transformation journeys by using cloud technology. She helps organizations develop and refine their platform and product architectures and make well-informed design decisions. In her free time, she enjoys traveling, reading, and practicing yoga.

Sindi Cali is a ProServe Associate Consultant with AWS Professional Services. She supports customers in building data driven applications in AWS.

How to secure an enterprise scale ACM Private CA hierarchy for automotive and manufacturing

Post Syndicated from Anthony Pasquariello original https://aws.amazon.com/blogs/security/how-to-secure-an-enterprise-scale-acm-private-ca-hierarchy-for-automotive-and-manufacturing/

In this post, we show how you can use the AWS Certificate Manager Private Certificate Authority (ACM Private CA) to help follow security best practices when you build a CA hierarchy. This blog post walks through certificate authority (CA) lifecycle management topics, including an architecture overview, centralized security, separation of duties, certificate issuance auditing, and certificate sharing by means of templates. These topics provide best practices surrounding your ACM Private CA hierarchy so that you can build the right CA hierarchy for your organization.

With ACM Private CA, you can create private certificate authority hierarchies, including root and subordinate CAs, without the upfront investment and ongoing maintenance costs of operating your own private CA. You can issue certificates for authenticating internal users, computers, applications, services, servers or other devices, and code signing.

This post includes the following Amazon Web Services (AWS) services:

Solution overview

In this blog post, you’ll see an example automotive manufacturing company and their supplier companies. Each will have associated AWS accounts, which we will call Manufacturer Account(s) and Supplier Account(s), respectively.

Automotive manufacturing companies usually have modules that come from different suppliers. Modules, in the automotive context, are embedded systems that control electrical systems in the vehicle. These modules might be interconnected throughout the in-vehicle network or provide connectivity external to the vehicle, for example, for navigation or sending telemetry to off-board systems.

The architecture needs to allow the Manufacturer to retain control of their CA hierarchy, while giving their external Suppliers limited access to sign the certificates on these modules with the Manufacturer’s CA hierarchy. The architecture we provide here gives you the basic information you need to cover the following objectives:

  1. Creation of accounts that logically separate CAs in a hierarchy
  2. IAM role creation for specific personas to manage the CA lifecycle
  3. Auditing the CA hierarchy by using audit reports
  4. Cross-account sharing by using AWS RAM with certificate template scoping

Architecture overview

Figure 1 shows the solution architecture.

Figure 1: Multi-account certificate authority hierarchy using ACM Private CA

Figure 1: Multi-account certificate authority hierarchy using ACM Private CA

The Manufacturer has two categories of AWS accounts:

  1. A dedicated account to hold the Manufacturer’s root CA
  2. An account to hold their subordinate CA

Note: The diagram shows two subordinate CAs in the Manufacturer account. However, depending on your security needs, you can have a subordinate CA per account per supplier.

Additionally, each Supplier has one AWS account. These accounts will have the Manufacturer’s subordinate CA shared by using AWS RAM. The Manufacturer will have a subordinate CA for each Supplier.

Logically separate accounts

In order to minimize the scope of impact and scope users to actions within their duties, it’s critical that you logically separate AWS accounts based on workload within the CA hierarchy. The following section shows a recommendation for how to do that.

AWS account that holds the root CA

You, the Manufacturer, should place the ACM Private root CA within its own dedicated AWS account to segment and tightly control access to the root CA. This limits access at the account level and only uses the dedicated account for a single purpose: holding the root CA for your organization. This account will only have access from IAM principals that maintain the CA hierarchy through a federation service like AWS Single Sign-On (AWS SSO) or direct federation to IAM through an existing identity provider. This account also has AWS CloudTrail enabled and configured for business-specific alerting, including actions like creation, updating, or deletion of the root CA.

AWS account that holds the subordinate CAs

You, the Manufacturer, will have a dedicated account where the entire CA hierarchy below the root will be located. You should have a separate subordinate CA for each Supplier, and in some cases a separate subordinate CA for each hardware module the Supplier is building. The subordinate CAs can issue certificates for specific hardware modules within the Supplier account.

This Manufacturer account shares each subordinate CA to the respective Supplier’s AWS account by using AWS RAM. This provides joint control to the shared subordinate CA, creating isolation between individual Suppliers. AWS RAM allows Suppliers to control certificate issuance and revocation if this is allowed by the Manufacturer. Each Supplier is only shared certificate provisioning access through AWS RAM configuration, which means that you can tightly monitor and revoke access through AWS RAM. Given this sharing through AWS RAM, the Suppliers don’t have access to modify or delete the CA hierarchy itself and can only provision certificates from it.

Supplier AWS account(s)

These AWS accounts are owned by each respective Supplier. For example, you might partner with radio, navigation system, and telemetry suppliers. Each Supplier would have their own AWS account, which they control. The Supplier accepts an invitation from the manufacturer through AWS RAM, sharing the subordinate CA. The subordinate is allowed to take only certain actions, based on how the Manufacturer configured the share (more on this later in the post).

Separation of duties by means of IAM role creation

In order to follow least privilege best practices when you create a CA hierarchy with ACM Private CA, you must create IAM roles that are specific to each job function. The recommended method is to separate administrator and certificate issuer roles.

For this automotive manufacturing use case, we recommend the following roles:

  1. Manufacturer IAM roles:
    • A CA admin role with CA disable permission
    • A CA admin role with CA delete permission
  2. Supplier certificate issuer IAM role:

Manufacturer IAM role overview

In this flow, one IAM role is able to disable the CA, and a second principal can delete the CA. This enables two-person control for this highly privileged action—meaning that you need a two-person quorum to rotate the CA certificate.

Day-to-day CA admin policy (with CA disable)

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "acm-pca:ImportCertificateAuthorityCertificate",
                "acm-pca:DeletePolicy",
                "acm-pca:PutPolicy",
                "acm-pca:TagCertificateAuthority",
                "acm-pca:ListTags",
                "acm-pca:GetCertificate",
                "acm-pca:CreateCertificateAuthority",
                "acm-pca:ListCertificateAuthorities",
                "acm-pca:UntagCertificateAuthority",
                "acm-pca:GetCertificateAuthorityCertificate",
                "acm-pca:RevokeCertificate",
                "acm-pca:UpdateCertificateAuthority",
                "acm-pca:GetPolicy",
                "acm-pca:IssueCertificate",
                "acm-pca:DescribeCertificateAuthorityAuditReport",
                "acm-pca:CreateCertificateAuthorityAuditReport",
                "acm-pca:RestoreCertificateAuthority",
                "acm-pca:GetCertificateAuthorityCsr",
                "acm-pca:DeletePermission",
                "acm-pca:DescribeCertificateAuthority",
                "acm-pca:CreatePermission",
                "acm-pca:ListPermissions"
            ],
            "Resource": “*”
        },
        {
            "Effect": "Deny",
            "Action": [
                "acm-pca:DeleteCertificateAuthority"
            ],
            "Resource": <Enter Root CA ARN Here>
        }
    ]
}

Privileged CA admin policy (with CA delete)

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "acm-pca:ImportCertificateAuthorityCertificate",
                "acm-pca:DeletePolicy",
                "acm-pca:PutPolicy",
                "acm-pca:TagCertificateAuthority",
                "acm-pca:ListTags",
                "acm-pca:GetCertificate",
                "acm-pca:UntagCertificateAuthority",
                "acm-pca:GetCertificateAuthorityCertificate",
                "acm-pca:RevokeCertificate",
                "acm-pca:GetPolicy",
    "acm-pca:CreateCertificateAuthority",
                "acm-pca:ListCertificateAuthorities",
                "acm-pca:DescribeCertificateAuthorityAuditReport",
                "acm-pca:CreateCertificateAuthorityAuditReport",
                "acm-pca:RestoreCertificateAuthority",
                "acm-pca:GetCertificateAuthorityCsr",
                "acm-pca:DeletePermission",
    "acm-pca:IssueCertificate",
                "acm-pca:DescribeCertificateAuthority",
                "acm-pca:CreatePermission",
                "acm-pca:ListPermissions",
                "acm-pca:DeleteCertificateAuthority"
            ],
            "Resource": “*”
        },
        {
            "Effect": "Deny",
            "Action": [
                "acm-pca:UpdateCertificateAuthority"
            ],
            "Resource": <Enter Root CA ARN Here>
        }
    ]
}

We recommend that you, the Manufacturer, create a two-person process for highly privileged events like CA certificate rotation ceremonies. The preceding policies serve two purposes. First, they allow you to designate separation of management duties between day-to-day CA admin tasks and infrequent root CA rotation ceremonies. The day-to-day CA admin policy allows all ACM Private CA actions except the ability to delete the root CA. This is because the day-to-day CA admin should not be deleting the root CA. Meanwhile, the privileged CA admin policy has the ability to call DeleteCertificateAuthority. However, in order to call DeleteCertificateAuthority, you first need to have the day-to-day CA admin role disable the root CA.

This means that both roles listed here are necessary to perform a root CA deletion for a rotation or replacement ceremony. This arrangement creates a way to control the deletion of the CA resource by requiring two separate actors to disable and delete. It’s crucial that the two roles are assumed by two different people at the identity provider. Having one person assume both of these roles negates the increased security created by each role.

You might also consider enforcing tagging of CAs at the organization level so that each new CA has relevant tags. The blog post Securing resource tags used for authorization using a service control policy in AWS Organizations illustrates in detail how to secure tags using service control policies (SCPs), so that only authorized users can modify tags.

Supplier IAM role overview

Your Suppliers should also follow least privilege when creating IAM roles within their own accounts. However, as we’ll see in the Cross-account sharing by using AWS RAM section, even if the Suppliers don’t follow best practices, the Manufacturer’s ACM Private CA hierarchy is still isolated and secure.

That being said, here are common IAM roles that your Suppliers should create within their own accounts:

  1. Developers who provision certificates for development and QA workloads
  2. Developers who provision certificates for production

These certificate issuing roles give the Supplier the ability to issue end-entity certificates from the CA hierarchy. In this use case, the Supplier needs two different levels of permissions: non-production certificates and production certificates. To simplify the roles within IAM, the Supplier decided to use ABAC. These ABAC policies allow operations when the principal’s tag matches the resource tag. Because the Supplier has many similar policies, each with a different set of users, they use ABAC to create a single IAM policy that uses principal tags rather than creating multiple slightly different IAM policies.

Certificate issuing policy that uses ABAC

{
	"Version": "2012-10-17",
	"Statement": [
	{
		"Effect": "Allow",
		"Action": [
			"acm-pca:IssueCertificate",
			"acm-pca:ListTags",
			"acm-pca:GetCertificate",
			"acm-pca:ListCertificateAuthorities"
		],
		"Resource": "*",
		"Condition": {
			"StringEquals": {
				"aws:ResourceTag/access-project": "${aws:PrincipalTag/access-project}",
				"aws:ResourceTag/access-team": "${aws:PrincipalTag/access-team}"
			}
		}
	}
}

This single policy enables all personas to be scoped to least privilege access. If you look at the Condition portion of the IAM policy, you can see the power of ABAC. This condition verifies that the PrincipalTag matches the ResourceTag. The Supplier is federating into IAM roles through AWS SSO and tagging the Supplier’s principals within its selected identity providers.

Because you as the Manufacturer have tagged the subordinate CAs that are shared with the Supplier, the Supplier can use identity provider (IdP) attributes as tags to simplify the Supplier’s IAM strategy. In this example, the Supplier configures each relevant user in the IdP with the attribute (tag) key: access-team. This tag matches the tagging strategy used by the Manufacturer. Here’s the mapping for each persona within the use case:

  • Dev environment:
    • access-team: DevTeam
  • Production environment:
    • access-team: ProdTeam

You can choose to add or remove tags depending on your use case, and the preceding scenario serves as a simple example. This offloads the need to create new IAM policies as the number of subordinate CAs grow. If you decide to use ABAC, make sure that you require both principal tagging and resource tagging upon creation of each, because these tags become your authorization mechanism.

CA lifecycle: Audit report published by the Manufacturer

In terms of auditing and monitoring, we recommend that the Manufacturer have a mechanism to track how many certificates were issued for a specific Supplier or module. Within the Manufacturer accounts, you can generate audit reports through the console or CLI. This allows you, the manufacturer, to gather metrics on certificate issuance and revocation. Following is an example of a certificate issuance.

Figure 2: Audit report output for certificate issuance

Figure 2: Audit report output for certificate issuance

For more information on generating an audit report, see Using audit reports with your private CA.

Cross-account sharing by using AWS RAM

With AWS RAM, you can share CAs with another account. We recommend that you, as a Manufacturer, use AWS RAM to share CAs with Suppliers so that they can issue certificates without administrator access to the CA. This arrangement allows you as the Manufacturer to more easily limit and revoke access if you change Suppliers. The Suppliers can create certificates through the ACM console or through the CLI, API, or AWS CloudFormation. Manufacturers are only sharing the ability to create, manage, bind, and export certificates from the CA hierarchy. The CA hierarchy itself is contained within the Manufacturers’ accounts, and not within the Suppliers’ accounts. By using AWS RAM, the Suppliers don’t have any administrator access to the CA hierarchy. From a cost perspective, you can centrally control and monitor the costs of your private CA hierarchy without having to deal with cost-sharing across Suppliers.

Refer to How to use AWS RAM to share your ACM Private CA cross-account for a full walkthrough on how to use RAM with ACM Private CA.

Certificate templates with AWS RAM managed permissions

AWS RAM has the ability to create managed permissions in order to define the actions that can be performed on shared resources. For each shareable resource type, you can use AWS RAM managed permissions to define which permissions to grant to whom for shared resource types that support additional managed permissions. This means that when you use AWS RAM to share a resource (in this case ACM Private CA), you can now specify which IAM actions can take place on that resource. AWS RAM managed permissions integrate with the following ACM Private CA certificate templates:

  • Permission 1: BlankEndEntityCertificate_APICSRPassthrough
  • Permission 2: EndEntityClientAuthCertificate
  • Permission 3: EndEntityServerAuthCertificate
  • Permission 4: subordinatesubordinateCACertificate_PathLen0
  • Permission 5: RevokeCertificate

These five certificate templates allow a Manufacturer to scope its Suppliers to the certificate template provisioning level. This means that you can limit which certificate templates can be issued by the Suppliers.

Let’s assume you have a Supplier that is supplying a module that has infotainment media capability, and you, the manufacturer, want the Supplier to provision the end-entity client certificate but you don’t want them to be able to revoke that certificate. You can use AWS RAM managed permissions to scope that Supplier’s shared private CA to allow the EndEntityClientAuthCertificate issuance template, which implicitly denies RevokeCertificate template actions. This further scopes down what the Supplier is authorized to issue on the shared CA, gives the responsibility for revoking infotainment device certificates to the Manufacturer, but still allows the Supplier to load devices with a certificate upon creation.

Example of creating a resource share in AWS RAM by using the AWS CLI

This walkthrough shows you the general process of sharing a private CA by using AWS RAM and then accepting that shared resource in the partner account.

  1. Create your shared resource in AWS RAM from the Manufacturer subordinate CA account. Notice that in the example that follows, we selected one of the certificate templates within the managed permissions option. This limits the shared CA so that it can only issue certain types of certificate templates.

    Note: Replace the <variable> placeholders with your own values.

    aws ram create-resource-share
    		--name Shared_Private_CA
    		--resource-arns arn:aws:acm-pca:<region:111122223333>:certificate-authority/<xxxx-xxxx-xxxx-xxxx-example>
    		--permission-arns "arn:aws:ram::aws:permission/<AWSRAMBlankEndEntityCertificateAPICSRPassthroughIssuanceCertificateAuthority>"
    		--principals <444455556666>

  2. From the Supplier account, the Supplier administrator will accept the resource. Follow How to use AWS RAM to share your ACM Private CA cross-account to complete the shared resource acceptance and issue an end entity certificate.

Conclusion

In this blog post, you learned about the various considerations for building a secure public key infrastructure (PKI) hierarchy by using ACM Private CA through an example customer’s prescriptive setup. You learned how you can use AWS RAM to share CAs across accounts easily and securely. You also learned about sharing specific CAs through the ability to define permissions to specific principals across accounts, allowing for granular control of permissions on principals that might act on those resources.

The main takeaways of this post are how to create least privileged roles within IAM in order to scope down the activities of each persona and limit the potential scope of impact for your organization’s private CA hierarchy. Although these best practices are specific to manufacturer business requirements, you can alter them based on your business needs. With the managed permissions in AWS RAM, you can further scope down the actions that principals can perform with your CA by limiting the certificate templates allowed on that CA when you share it. Using all of these tools, you can help your PKI hierarchy to have a high level of security. To learn more, see the other ACM Private CA posts on the AWS Security Blog.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Anthony Pasquariello

Anthony Pasquariello

Anthony is a Senior Solutions Architect at AWS based in New York City. He specializes in modernization and security for our advanced enterprise customers. Anthony enjoys writing and speaking about all things cloud. He’s pursuing an MBA, and received his MS and BS in Electrical & Computer Engineering.

Omar Zoma

Omar Zoma

Omar is a senior AWS Security Solutions Architect that lives in metro Detroit. Omar is passionate about helping customers solve cloud and vehicle security problems at a global scale. In his free time, Omar trains hundreds of students a year in security and cloud through universities and training programs.

Detect Real-Time Anomalies and Failures in Industrial Processes Using Apache Flink

Post Syndicated from Hubert Asamer original https://aws.amazon.com/blogs/architecture/detect-real-time-anomalies-and-failures-in-industrial-processes-using-apache-flink/

For a long time, industrial control systems were the heart of the manufacturing process which allows collecting, processing, and acting on data from the shop floor. Process manufacturers used a distributed control system (DCS) to do the automated control and operation of an industrial process or plant.

With the convergence of operational technology and information technology (IT), customers such as Yara are integrating their DCS with additional intelligence from the IT side. This provides customers with a holistic view of the different data sources to make more complex decisions with advanced analytics.

In this blog post, we show how to start with advanced analytics on streaming data coming from the shop floor. The sensor data, such as pressure and temperature, is typically published by a DCS. It is then ingested with a local edge gateway and streamed to the cloud with streaming and industrial internet of things (IoT) technology. Analytics on the streaming data is typically done before all data points are stored in the data layer. Figure 1 shows how the data flow can be modeled and visualized with AWS services.

Figure 1: High-level ingestion and analytics architecture

Figure 1: High-level ingestion and analytics architecture

In this example, we are concentrating on the streaming analytics part in the Cloud. We will generate data from a simulated DCS to Amazon Kinesis Data Streams where you have a gateway such as AWS IoT Greengrass and maybe other IoT services in-between.

For the simulated process that the DCS is controlling, we use a well-documented industrial process for creating a chemical compound (acetic anhydride) called the Tennesee Eastman process (TEP). There are several simulations available as open source. We demonstrate how to use this data as a constant stream with more than 30 real-time measurement parameters, ingest to Kinesis Data Streams, and run in-stream analytics using Apache Flink. Within Apache Flink, data is grouped and mapped to the respective stages and parts of the industrial process, and constantly analyzed by calculating anomalies of all process stages. All raw data, plus the derived anomalies and failure patterns, are then ingested from Apache Flink to Amazon Timestream for further use in near real-time dashboards.

Overview of solution

Note: Refer to steps 1 to 6 in Figure 2.

As a starting point for a realistic and data intensive measurement source, we use an already existing (TEP) simulation framework written in C++ originally created from National Institute of Standards and Technology, and published as open source. The GitHub Blog repository contains a small patch which adds AWS connectivity with the software development kits (SDKs) and modifications to the command line arguments. The programs provided by this framework are (step 1) a simulation process starter with configurable starting conditions and timestep configurations and a real-time client (step 2) which connects to the simulation and sends the simulation output data to the AWS Cloud.

Tennesee Eastman process (TEP) background

A paper by Downs & Vogel, A plant-wide industrial process control problem, from 1991 states:

“This chemical standard process consists of a reactor/separator/recycle arrangement involving two simultaneous gas-liquid exothermic reactions.”

“The process produces two liquid products from four reactants. Also present are an inert and a byproduct making a total of eight components. Two additional byproduct reactions also occur. The process has 12 valves available for manipulation and 41 measurements available for monitoring or control.“

The simulation framework used can control all of the 12 valve settings and produces 41 measurement variables with varying sampling frequency.

Data ingestion

The 41 measurement variables, named xmeas_1 to xmeas_41, are emitted by the real-time client (step 2) as key-value JSON messages. The client code is configured to produce 100 messages per second. A built-in C++ Kinesis SDK allows the real-time client to directly stream JSON messages to a Kinesis data stream (step 3).

Figure 2: Detailed system architecture

Figure 2 – Detailed system architecture

Stream processing with Apache Flink

Messages sent to Amazon Kinesis Data Stream are processed in configurable batch sizes by an Apache Flink application, deployed in Amazon Kinesis Data Analytics. Apache Flink is an open-source stream processing framework, written and usable in Java or Scala. As described in Figure 3, it allows the definition of various data sources (for example, a Kinesis data stream) and data sinks for storing processing results. In-between data can be processed by a range of operators—typically mapping and reducing functions (step 4).

In our case, we use a mapping operator where each batch of incoming messages is processed. In Code snippet 1, we apply a custom mapping function to the raw data stream. For rapid and iterative development purposes it’s possible to have the complete stream processing pipeline running in a local Java or Scala IDE such as Maven, Eclipse, or IntelliJ.

Figure 3: Flink execution plan (green: streaming data sources; yellow: data sinks)

Figure 3: Flink execution plan (green: streaming data sources; yellow: data sinks)

public class StreamingJob extends AnomalyDetector {
---
  public static DataStream<String> createKinesisSource
    (StreamExecutionEnvironment env, 
     ParameterTool parameter)
    {
    // create Stream
    return kinesisStream;
  }
---
  public static void main(String[] args) {
    // set up the execution environment
    final StreamExecutionEnvironment env = 
      StreamExecutionEnvironment.getExecutionEnvironment();
---
    DataStream<List<TimestreamPoint>> mainStream =
      createKinesisSource(env, parameter)
      .map(new AnomalyJsonToTimestreamPayloadFn(parameter))
      .name("MaptoTimestreamPayload");
---
    env.execute("Amazon Timestream Flink Anomaly Detection Sink");
  }
}

Code snippet 1: Flink application main class

In-stream anomaly detection

Within the Flink mapping operator, a statistical outlier detection (anomaly detection) is implemented. Flink allows the inclusion of custom libraries within its operators. The library used here is published by AWS—a Random Cut Forest implementation available from GitHub. Random Cut Forest is a well understood statistical method which can operate on batches of measurements. It then calculates an anomaly score for each new measurement by comparing a new value with a cached pool (=forest) of older values.

The algorithm allows the creation of grouped anomaly scores, where a set of variables is combined to calculate a single anomaly score. In the simulated chemical process (TEP), we can group the measurement variables into three process stages:

  1. reactor feed analysis
  2. purge gas analysis
  3. product analysis.

Each group consists of 5–10 measurement variables. We’re getting anomaly scores for a, b, and c. In Code snippet 2 we can learn how an anomaly detector is created. The class AnomalyDetector is instantiated and extended then three times (for our three distinct process stages) within the mapping function as described in Code snippet 3.

Flink distributes this calculation across its worker nodes and handles data deduplication processes within its system.

---
public class AnomalyDetector {
    protected final ParameterTool parameter;
    protected final Function<RandomCutForest, LineTransformer> algorithmInitializer;
    protected LineTransformer algorithm;
    protected ShingleBuilder shingleBuilder;
    protected double[] pointBuffer;
    protected double[] shingleBuffer;
    public AnomalyDetector(
      ParameterTool parameter,
      Function<RandomCutForest,LineTransformer> algorithmInitializer)
    {
      this.parameter = parameter;
      this.algorithmInitializer = algorithmInitializer;
    }
    public List<String> run(Double[] values) {
            if (pointBuffer == null) {
                prepareAlgorithm(values.length);
            }
      return processLine(values);
    }
    protected void prepareAlgorithm(int dimensions) {
---
      RandomCutForest forest = RandomCutForest.builder()
        .numberOfTrees(Integer.parseInt(
          parameter.get("RcfNumberOfTrees", "50")))
        .sampleSize(Integer.parseInt(
          parameter.get("RcfSampleSize", "8192")))
        .dimensions(shingleBuilder.getShingledPointSize())
        .lambda(Double.parseDouble(
          parameter.get("RcfLambda", "0.00001220703125")))
        .randomSeed(Integer.parseInt(
          parameter.get("RcfRandomSeed", "42")))
      .build();
---
    algorithm = algorithmInitializer.apply(forest);
  }

Code snippet 2: AnomalyDetector base class, which gets extended by the streaming applications main class

public class AnomalyJsonToTimestreamPayloadFn extends 
    RichMapFunction<String, List<TimestreamPoint>> {
  protected final ParameterTool parameter;
  private final Logger logger = 

  public AnomalyJsonToTimestreamPayloadFn(ParameterTool parameter) {
    this.parameter = parameter;
  }

  // create new instance of StreamingJob for running our Forest
  StreamingJob overallAnomalyRunner1;
  StreamingJob overallAnomalyRunner2;
  StreamingJob overallAnomalyRunner3;
---

  // use `open`method as RCF initialization
  @Override
  public void open(Configuration parameters) throws Exception {
    overallAnomalyRunner1 = new StreamingJob(parameter);
    overallAnomalyRunner2 = new StreamingJob(parameter);
    overallAnomalyRunner3 = new StreamingJob(parameter);
  super.open(parameters);
}
---

Code snippet 3: Mapping Function uses the Flink RichMapFunction open routine to initialize three distinct Random Cut Forests

Data persistence – Flink data sinks

After all anomalies are calculated, we can decide where to send this data. Flink provides various ready-to-use data sinks. In these examples, we fan out all (raw and processed) data to Amazon Kinesis Data Firehose for storing in Amazon Simple Storage Service (Amazon S3) (long term) (step 5) and to Amazon Timestream (short term) (step 5). Kinesis Data Firehose is configured with a small AWS Lambda function to reformat data from JSON to CSV, and data is stored with automated partitioning to Amazon S3. A Timestream data sink does not come pre-bundled with Flink. A custom Timestream ingestion code is used in these examples. Flink provides extensible operator interfaces for the creation of custom map and sink functions.

Timeseries handling

Timestream, in combination with Grafana, is used for near real-time monitoring. Grafana comes bundled with a Timestream data source plugin and can constantly query and visualize Timestream data (step 6).

Walkthrough

Our architecture is available as a deployable AWS CloudFormation template. The simulation framework comes packed as a docker image, with an option to install it locally on a linux host.

Prerequisites

To implement this architecture, you will need:

  • An AWS account
  • Docker (CE) Engine v18++
  • Java JDK v11++
  • maven v3.6++

We recommend running a local and recent Linux environment. It is assumed that you are using AWS Cloud9, deployed with CloudFormation, within your AWS account.

Steps

Follow these steps to deploy the solution and play with the simulation framework. At the end, detected anomalies derived from Flink are stored next to all raw data in Timestream and presented in Grafana. We’re using AWS Cloud9 and its Linux terminal capabilities here to fire up a Grafana instance, then manually run the simulation to ingest data to Kinesis and optionally manually start the Flink app from the console using Maven.

Deploy stack

After you’re logged in to the AWS Management console you can deploy the CloudFormation stack. This stack creates a fully configured AWS Cloud9 environment with the related GitHub Repo already in place, a Kinesis data stream, Kinesis Data Firehose delivery stream, Kinesis Data Analytics with Flink app deployed, Timestream database, and an S3 bucket.

launch stack button

After successful deployment, record two important facts from the CloudFormation console: the chosen stack name and the attribute 03Cloud9EnvUrl displayed in the Output Section of the stack. The attribute’s URL will take you directly to our deployed AWS Cloud9 environment.

Run post install step within AWS Cloud9

The deployed stack created an AWS Cloud9 environment and an AWS Identity and Access Management (IAM) instance profile. We apply this instance profile to AWS Cloud9 to interact with Kinesis, Timestream, and Amazon S3 throughout the next steps. The used script also configures and installs other required tools.

1.       Open a terminal window.

$ cd flinkAnomalySteps/deployment
$ source c9-postInstall.sh
---SETTING UP IAM INSTANCE PROFILE
Please enter cloudformation stack name (default: flink-rcf-app):
# enter your stack name

Start a Grafana development server

In this section we are starting a Grafana server using docker. Cloud 9 allows us to expose web applications (for demo & development purposes) on container port 8080.

1.       Open a terminal window.

$ cd ../src/grafana-dashboard
$ docker volume create grafana-storage
# this creates a docker volume for persisting your Grafana settings
$./start-grafana.sh
# this starts a recent Grafana using docker with Timestream plugin and a pre-configured dashboard in place

2.       Open the preview panel by selecting Preview, and then select Preview Running Application.

Cloud9 screenshot

3.       Next, in the preview pane, select Pop out into new Window.

Cloud9 screenshot2

4.       A new browser tab with Grafana opens.

5.       Choose any username and password combination.

6.       In Grafana use the “Search dashboards” icon on the left and choose “TEP-SIM-DEV”. This pre-configured dashboard displays data from Amazon Timestream (see step “Open Grafana”).

TEP simulation procedure

Within your local or AWS Cloud9 Linux environment, fetch the simulation docker image from the public AWS container registry, or build the simulation binaries locally, for building manually check the GitHub repo.

Start simulation (in separate terminal)

# starts container and switch into the container-shell
$ docker run -it --rm \
  --network host \
  --name tesim-runner \
  tesim-runner:01 \
 /bin/bash
# then inside container
$ ./tesim --simtime 100  --external-ctrl
# simulation started…

Manipulate simulation (in separate terminal)

Follow the steps here for a basic process disturbance task. Review the aspects of influencing the simulation in the GitHub-Repo. The rtclient program has a range of commands to use for introducing disturbances.

# first switch into running simulation container
$ docker exec -it tesim-runner /bin/bash
# now we can access the shared storage of the simulation process…
$ ./rtclient –setidv 6
# this enables one of the built in process disturbances (1-20)
$ ./rtclient –setidv 7
$ ./rtclient –setidv 8
$ …

Stream Simulation data to Amazon Kinesis DataStream (in separate terminal)

The client has a built-in record frequency of 50 messages per second. One message contains more than 50 measurements, so we have approximately 2,500 measurements per second.

$ ./rtclient -k

AWS libcrypto resolve: found static libcrypto 1.1.1 HMAC symbols
AWS libcrypto resolve: found static libcrypto 1.1.1 EVP_MD symbols
{"xmeas_1": 3649.739476,"xmeas_2": 4451.32071,"xmeas_3": 9.223142558,"xmeas_4": 32.39290913,"xmeas_5": 47.55975621,"xmeas_6": 2798.975688,"xmeas_7": 64.99582601,"xmeas_8": 122.8987929,"xmeas_9": 0.1978264656,…}
# Messages in JSON sent to Kinesis DataStream visible via stdout

Compile and start Flink Application (optional step)

If you want deeper insights into the Flink Application, we can start this as well from the AWS Cloud9 instance. Note: this is only appropriate in development.

$ cd flinkAnomalySteps/src
$ cd flink-rcf-app
$ mvn clean compile
# the Flink app gets compiled
$ mvn exec:java -Dexec.mainClass= \
    "com.amazonaws.services.kinesisanalytics.StreamingJob"
# Flink App is started with default settings in place…
…

Open Grafana dashboard (from the step Start a Grafana development server)

Process anomalies are visible instantly after you start the simulation. Use Grafana to drill down into the data as needed.

/**example - simplest possible Timestream query used for Viz:**/

SELECT CREATE_TIME_SERIES(time, measure_value::double) as anomaly_stream6 FROM "kdaflink"."kinesisdata1"

    WHERE measure_name='anomaly_score_stream6' AND

    time between ago(15m) and now()

Code snippet 4: Timestream SQL example; Timestream database is `kdaflink` – table is `kinesisdata1`

Figure 4 - Grafana dashboard showing near real-time simulation data

Figure 4 – Grafana dashboard showing near real-time simulation data, three anomalies, mapped to the TEP process, are constantly calculated by Flink

S3 raw metrics bucket

For the sake of completeness and potential usefulness, the Flink Application emits all raw data in an intermediate step to Kinesis Data Firehose. The service converts all JSON data to CSV format by using a small AWS Lambda function.

$ aws s3 ls flink-rcf-app-rawmetricsbucket-<CFN-UUID>/tep-raw-csv/2021/11/26/19/

Cleaning up

Delete the deployed CloudFormation stack. All resources (excluding S3 buckets) are permanently deleted.

Conclusion

In this blog post, we learned that in-stream anomaly detection and constant measurement data insights can work together. The Apache Flink framework offers a ready-to-use platform that is mission critical for future adoption across manufacturing and other industries. Other applications of the presented Flink pattern can run on capable edge compute devices. Integration with AWS IoT Greengrass and AWS Greengrass Stream Manager are part of the GitHub Blog repository.

Another extension includes measurement data pattern detection routines, which can coexist with in-stream anomaly detection and can detect specific failure patterns over time using time-windowing features of the Flink framework. You can refer to the GitHub repo which accompanies this blog post. Give it a try and let us know your feedback in the comments!

Optimize your IoT Services for Scale with IoT Device Simulator

Post Syndicated from Ajay Swamy original https://aws.amazon.com/blogs/architecture/optimize-your-iot-services-for-scale-with-iot-device-simulator/

The IoT (Internet of Things) has accelerated digital transformation for many industries. Companies can now offer smarter home devices, remote patient monitoring, connected and autonomous vehicles, smart consumer devices, and many more products. The enormous volume of data emitted from IoT devices can be used to improve performance, efficiency, and develop new service and business models. This can help you build better relationships with your end consumers. But you’ll need an efficient and affordable way to test your IoT backend services without incurring significant capex by deploying test devices to generate this data.

IoT Device Simulator (IDS) is an AWS Solution that manufacturing companies can use to simulate data, test device integration, and improve the performance of their IoT backend services. The solution enables you to create hundreds of IoT devices with unique attributes and properties. You can simulate data without configuring and managing physical devices.

An intuitive UI to create and manage devices and simulations

IoT Device Simulator comes with an intuitive user interface that enables you to create and manage device types for data simulation. The solution also provides you with a pre-built autonomous car device type to simulate a fleet of connected vehicles. Once you create devices, you can create simulations and generate data (see Figure 1.)

Figure 1. The landing page UI enables you to create devices and simulation

Figure 1. The landing page UI enables you to create devices and simulation

Create devices and simulate data

With IDS, you can create multiple device types with varying properties and data attributes (see Figure 2.) Each device type has a topic where simulation data is sent. The supported data types are object, array, sinusoidal, location, Boolean, integer, float, and more. Refer to this full list of data types. Additionally, you can import device types via a specific JSON format or use the existing automotive demo to pre-populate connected vehicles.

Figure 2. Create multiple device types and their data attributes

Figure 2. Create multiple device types and their data attributes

Create and manage simulations

With IDS, you can create simulations with one device or multiple device types (see Figure 3.) In addition, you can specify the number of devices to simulate for each device type and how often data is generated and sent.

Figure 3. Create simulations for multiple devices

Figure 3. Create simulations for multiple devices

You can then run multiple simulations (see Figure 4) and use the data generated to test your IoT backend services and infrastructure. In addition, you have the flexibility to stop and restart the simulation as needed.

Figure 4. Run and stop multiple simulations

Figure 4. Run and stop multiple simulations

You can view the simulation in real time and observe the data messages flowing through. This way you can ensure that the simulation is working as expected (see Figure 5.) You can stop the simulation or add a new simulation to the mix at any time.

Figure 5. Observe your simulation in real time

Figure 5. Observe your simulation in real time

IoT Device Simulator architecture

Figure 6. IoT Device Simulator architecture

Figure 6. IoT Device Simulator architecture

The AWS CloudFormation template for this solution deploys the following architecture, shown in Figure 6:

  1. Amazon CloudFront serves the web interface content from an Amazon Simple Storage Service (Amazon S3) bucket.
  2. The Amazon S3 bucket hosts the web interface.
  3. Amazon Cognito user pool authenticates the API requests.
  4. An Amazon API Gateway API provides the solution’s API layer.
  5. AWS Lambda serves as the solution’s microservices and routes API requests.
  6. Amazon DynamoDB stores simulation and device type information.
  7. AWS Step Functions include an AWS Lambda simulator function to simulate devices and send messages.
  8. An Amazon S3 bucket stores pre-defined routes that are used for the automotive demo (which is a pre-built example in the solution).
  9. AWS IoT Core serves as the endpoint to which messages are sent.
  10. Amazon Location Service provides the map display showing the location of automotive devices for the automotive demo.

The IoT Device Simulator console is hosted on an Amazon S3 bucket, which is accessed via Amazon CloudFront. It uses Amazon Cognito to manage access. API calls, such as retrieving or manipulating information from the databases or running simulations, are routed through API Gateway. API Gateway calls the microservices, which will call the relevant service.

For example, when creating a new device type, the request is sent to API Gateway, which then routes the request to the microservices Lambda function. Based on the request, the microservices Lambda function recognizes that it is a request to create a device type and saves the device type to DynamoDB.

Running a simulation

When running a simulation, the microservices Lambda starts a Step Functions workflow. First, the request contains information about the simulation to be run, including the unique device type ID. Then, using the unique device type ID, Step Functions retrieves all the necessary information about each device type to run the simulation. Once all the information has been retrieved, the simulator Lambda function is run. The simulator Lambda function uses the device type information, including the message payload template. The Lambda function uses this template to build the message sent to the IoT topic specified for the device type.

When running a custom device type, the simulator generates random information based on the values provided for each attribute. For example, when the automotive simulation is run, the simulation runs a series of calculations to simulate an automobile moving along a series of pre-defined routes. Pre-defined routes are created and stored in an S3 bucket, when the solution is launched. The simulation retrieves the routes at random each time the Lambda function runs. Automotive demo simulations also show a map generated from Amazon Location Service and display the device locations as they move.

The simulator exits once the Lambda function has completed or has reached the fifteen-minute execution limit. It then passes all the necessary information back to the Step Function. Step Functions then enters a choice state and restarts the Lambda function if it has not yet surpassed the duration specified for the simulation. It then passes all the pertinent information back to the Lambda function so that it can resume where it left off. The simulator Lambda function also checks DynamoDB every thirty seconds to see if the user has manually stopped the simulation. If it has, it will end the simulation early. Once the simulation is complete, the Step Function updates the DynamoDB table.

The solution enables you to launch hundreds of devices to test backend infrastructure in an IoT workflow. The solution contains an Import/Export feature to share device types. Exporting a device type generates a JSON file that represents the device type. The JSON file can then be imported to create the same device type automatically. The solution allows the viewing of up to 100 messages while the solution is running. You can also filter the messages by topic and device and see what data each device emits.

Conclusion

IoT Device Simulator is designed to help customers test device integration and IoT backend services more efficiently without incurring capex for physical devices. This solution provides an intuitive web-based graphic user interface (GUI) that enables customers to create and simulate hundreds of connected devices. It is not necessary to configure and manage physical devices or develop time-consuming scripts. Although we’ve illustrated an automotive application in this post, this simulator can be used for many different industries, such as consumer electronics, healthcare equipment, utilities, manufacturing, and more.

Get started with IoT Device Simulator today.

Field Notes: Building an Industrial Data Platform to Gather Insights from Operational Data

Post Syndicated from Shailaja Suresh original https://aws.amazon.com/blogs/architecture/field-notes-building-an-industrial-data-platform-to-gather-insights-from-operational-data/

Co-authored with Russell de Pina, former Sr. Partner Solutions Architect at AWS

Manufacturers looking to achieve greater operational efficiency need actionable insights from their operational data. Traditionally, operational technology (OT) and enterprise information technology (IT) existed in silos with various manual processes to clean and process data. Leveraging insights at a process level requires converging data across silos and abstracting the right data for the right use case in a scalable fashion.

The AWS Industrial Data Platform (IDP) facilitates the convergence of enterprise and operational data for gathering insights that lead to overall operational efficiency of the plant. In this blog, we demonstrate a sample AWS IDP architecture along with the steps to gather data across multiple data sources on the shop floor.

Overview of solution

With the IDP Reference Architecture as the backdrop, let us walk through the steps to implement this solution. Let us say we have gas compressors being used in our factory. The Operations Team has identified an anomaly in the wear of the bearings used in some of these compressors. Currently the bearings are supplied by two manufacturers, ABC and XYZ.

The graphs in Figures 1 and 2 show the expected and actual performance data of the bearings with respect to the vibration (actual and expected) against time. The points represent a cluster of plots which are represented as single dots here. The Mean Time Between Maintenance (MTBM) per the manufacturer for these bearings is five years. The factory has detected that for ABC bearings, the actual vibrations detected during its half-life is much more than the normal (expected) as shown. This clearly requires further analysis to identify the root cause for this discrepancy to prevent compressor breakdowns, unplanned downtime and cost.

ABC Bearings Anomoly

Figure 1 – ABC Bearings Anomaly

XYZ Bearings Vibration

Figure 2 – XYZ Bearings Vibration

Although the deviation observed from expected is much less in XYZ bearings, there is still a deviation which needs to be understood. This blog provides an overview of how the various AWS services of the IDP architecture help with this analysis. Figure 3, shows the AWS architecture used for this specific use case.

IDP AWS Architecture to solve for the Bearings Anomaly 

Figure 3 – IDP AWS Architecture to solve for the Bearings Anomaly

Tracking Anomaly Data

The sensors on the compressor bearings send the vibration/chatter data to AWS through AWS IoT core. Amazon Lookout for Equipment is configured as detailed out in the user guide with the necessary data formatting guidelines.

The raw sensor data from IoT Core which is in a JSON format is converted into a CSV format with the necessary headers to match the schema (one schema for each sensor) in Figure 7 using AWS Lambda. This CSV conversion is needed for the data to be processed by Amazon Lookout for Equipment. A sample sensor data from the ABC sensor, the AWS Lambda code snippet to convert this JSON to CSV and the output CSV which is to be ingested into Lookout for Equipment are shown in Figure 4,5 and 6 respectively. For detailed steps on how a dataset is created to be ingested into Lookout For Equipment, please refer the user guide.

Figure 4: Sample Sensor Data from ABC Bearings’ Sensor

Figure 4: Sample Sensor Data from ABC Bearings’ Sensor

import boto3
import botocore
import csv
import json

def lambda_handler(event, context):
    BUCKET_NAME = 'L4EData' 
    OUTPUT_KEY = 'csv/ABCSensor.csv' # OUTPUT FILE 
    INPUT_KEY = 'ABCSensor.json'# INPUT FILE 
    s3client = boto3.client('s3')
    s3 = boto3.resource('s3') 
    
    obj = s3.Object(BUCKET_NAME, INPUT_KEY)
    data = obj.get()['Body'].read().decode('utf-8')
    json_data = json.loads(data)
    print(json_data)

    
    output_file_path = "/tmp/data.csv"
    
    with open(output_file_path, "w") as file:
        csv_file = csv.writer(file)
        csv_file.writerow(['TIMESTAMP', 'Sensor'])
        for item in json_data:
            csv_file.writerow([item.get('TIMESTAMP'),item.get('Sensor')])
    
    csv_binary = open(output_file_path, 'rb').read()
    
    
    try:
        obj = s3.Object(BUCKET_NAME, OUTPUT_KEY)
        obj.put(Body=csv_binary)
    except botocore.exceptions.ClientError as e:
        if e.response['Error']['Code'] == "404":
            print("The object does not exist.")
        else:
            raise
    
    try:
        download_url = s3client.generate_presigned_url(
                         'get_object',
                          Params={
                              'Bucket': BUCKET_NAME,
                              'Key': OUTPUT_KEY
                              },
                          ExpiresIn=3600
        )
        return json.dumps(download_url)
    except Exception as e:
        raise utils_exception.ErrorResponse(400, e, Log)

Figure 5 –  AWS Lambda Python Code snippet to Convert CSV to JSON

Figure 6: Output CSV from AWS Lambda with Timestamp and Vibration values from Sensor

Figure 6 – Output CSV from AWS Lambda with Timestamp and Vibration values from Sensor

Figure 7: Data Schema used for Dataset Ingestion in Amazon Lookout for Equipment for ABC Vibration Sensor

Figure 7 – Data Schema used for Dataset Ingestion in Amazon Lookout for Equipment for ABC Vibration Sensor

Once the model is trained the anomaly is detected on bearing ABC as already stated.

Analyze Factory Environment Data

For our analysis, it is important to factor in the factory environment conditions. There are variables like factory humidity, machine temperature, bearing lubrication levels that could attribute to increased vibration. We have all the environment data gathered from factory sensors made available through AWS IoT Core as shown in Figure 3. Since the number of parameters measured through the factory sensors can change over time, the architecture uses a no-SQL database, Amazon DynamoDB, to persist this data for downstream analysis.

Since we only need sensor events that exceed a particular threshold, we set rules under AWS IoT Core to capture those events that could potentially cause increased vibration on the bearings. For example, we just want only those events that exceed a temperature threshold, say anything above 75-degree Fahrenheit along with the timestamp in Amazon DynamoDB. This is beyond the normal operating temperature for the bearings and certainly demands our interest.

So, we set a rule trigger as shown in Figure 8 in the Rule query statement while defining an action under IoT Core. The flow of events from an IoT sensor to Amazon DynamoDB is illustrated in Figure 9.

Figure 8 - Define a Rule in IoT Core

Figure 8 – Define a Rule in IoT Core

The events are received on a minute basis. So, if there are 3 records inserted in Amazon DynamoDB on a day, it means that there has been a temperature threshold breach for a maximum of 3 minutes.

We set up similar rules and actions to DynamoDB for humidity and compressor lubrication levels.

Figure 9: Event from IoT factory sensor to Amazon DynamoDB

Figure 9: Event from IoT factory sensor to Amazon DynamoDB

1)      Factory Data Storage

It is required to factor in both the factory data like machine id, shift number, shop floor and the operator data like the operator name, machine-operator relationships, operator’s shift, etc., for this analysis since we are drilling down to the granular details of which operators where working on the machines which have ABC bearings. Since this data is relational, it is stored in Amazon Aurora. We opt for the serverless option of the Aurora database since the operations team requires a database that is fully managed.

The  data of the vendors who supply ABC and XYZ bearings and their contracts  are stored in Amazon S3. We also have the operators’ performance scores pre-calculated and stored in Amazon S3.

2)      Querying the Data Lake

Now that we have data ingested and stored in various channels – Amazon S3, AWS IoT Core, Amazon DynamoDB and Amazon Aurora, it is required that we collate this data for further analysis and querying. We use Athena Federated Query  under Amazon Athena for querying across these multiple data sources and store the results in Amazon S3 as shown in Figure 10. It is required that we create a workgroup and configure connectors to set up Amazon S3, Amazon DynamoDB and Amazon Aurora as the data sources under Amazon Athena. The detailed steps to create and connect the data sources are provided in the user guide.

Figure 10: Athena Federated Query across Data Sources

Figure 10: Athena Federated Query across Data Sources

We are now interested to gather some insights across the three different data sources. Let us find all those machines and their corresponding operators with performance scores in factory shop floor 1 which have the lubrication levels as demonstrated in Figure 11.

Figure 11: Query to Determine Operators and their Performance

Figure 11: Query to Determine Operators and their Performance

From the output in Table 1, we see that operators on these machines which have the machine temperature levels of above the threshold to be having good scores (above 5, which is the average) based on past performance. Hence, we could conclude for now that they have been operating these machines under the normal expected conditions with the right controls and settings.

Table 1: Operators and their Scores

Table 1: Operators and their Scores

We would then want to find all those machines which have ABC Bearings and the vendors who supplied them as demonstrated in Figure 12.

Figure 12: Vendors Supplying ABC Bearings

Figure 12: Vendors Supplying ABC Bearings

Table 2: Names of Vendors Who Supply ABC Bearings

Table 2: Names of Vendors Who Supply ABC Bearings

As a next step, we would want to reach out to the vendors, Jane Doe Enterprises and Scott Inc.(see Table 2) to report the issue with ABC Bearings. The next step is to potentially modify the supply contracts and purchase only from XYZ manufacturer and/or to look for other bearing manufacturers in the vendor supply chain.

Conclusion

In this blog, we covered a sample use case to demonstrate how the AWS Industrial Data Platform can help gather actionable insights from factory floor data to gain operation efficiencies. We also did a walkthrough of a sample use case to demonstrate how the various AWS services can be used to build a scalable data platform for seamless querying and processing of data coming in with various formats.

New Amazon Virtual Andon 3.0 – Automate Issue Resolution via APIs and Predictive Services

Post Syndicated from Ajay Swamy original https://aws.amazon.com/blogs/architecture/new-amazon-virtual-andon-3-0-automate-issue-resolution-via-apis-and-predictive-services/

Developing a modern manufacturing enterprise requires careful thought and attention to several priorities. Predictive maintenance and issue resolution automation are likely high on your list. Maximizing your operational efficiency and optimizing output are critical in this competitive global market. As demand grows, manufacturers are under pressure to fulfill increased production needs.

A recent report from McKinsey on Industry 4.0 technologies discusses pandemic implementations such as digital issue detection and resolution. These solutions are critical for crisis response in the COVID-19 era. 30% of manufacturers have highlighted increased operational productivity, reduced time-to-market, and reduction in cost as major strategic imperatives for their Industry 4.0 transformation.

Amazon Virtual Andon 3.0 (AVA) is an Amazon Web Services (AWS) Solution that provides a scalable, digital Andon system to help detect and resolve issues. It optimizes processes, supports your transition to predictive maintenance, and helps prevent future equipment failures.

Overview of new features

AVA provides factory and fulfillment center associates with an intuitive, responsive web interface and workflow. This can be used to raise issues, route those issues to the appropriate engineers, and resolve them in a timely way. With AVA, you can associate explanatory root causes with resolved issues for more insightful reporting. Issue raising and resolution happen digitally. There’s no need for manual intervention (such as raising a manual alarm at a factory workstation.)

AVA introduces the capability to raise issues directly from devices and automated APIs. Flexible integrations can be made directly with your factory devices and systems. Additionally, the APIs integrate with Amazon Machine Learning (ML) services, such as Amazon Lookout for Equipment and Amazon Lookout for Vision. This enables you to automate ML inference into your AVA-based workflows.

With AVA 3.0, you can now monitor your factory floors for disruptions in near-real-time. You can respond and resolve issues quickly, minimizing production disruption. AVA 3.0 provides users with an analytics pipeline so you can create dashboards and custom reports.

Introducing GraphQL APIs

The solution introduces GraphQL APIs via AWS AppSync, secured through AWS Identity and Access Management (IAM) policies. This powers the web interface and functionality to create, route, and manage issues. With AVA GraphQL APIs, you can create and manage site hierarchies, devices, events, and raise and manage issues. For example, you can create an issue by calling the createIssue API to raise issues and track them to completion automatically.

createIssue
{
    id: "<string>",
    siteName: "<string>",
    areaName: "<string>",
    stationName: "<string>",
    deviceName: "<string>",
    processName: "<string>",
    eventId: "<string>",
    eventDescription: "<string>",
    eventType: "<string>",
    issueSource: "<string>",
    priority: "<string>",
    status: "<string>",
    created: "AWSDateTime",
    acknowledged: "AWSDateTime",
    closed: "AWSDateTime",
    acknowledgedTime: "<number>",
    resolutionTime: "<number>",
    createdBy: "<string>",
    additionalDetails: "<string>"
  }

Integration with Amazon Machine Learning services to raise issues automatically

With AVA APIs, you can integrate with predictive services like Amazon Lookout for Equipment (L4E). AVA provides an AWS Lambda function for detecting anomalies generated via L4E. It automatically calls the APIs to create issues for abnormal events.

When L4E detects anomalies in your machinery or production line, AVA can automatically raise issues for those anomalies (Figure 1). This gives you the visibility and tracking mechanism to ensure anomalies are resolved. You can create automated events and issues via APIs by integrating with Amazon Lookout for Equipment. You’re able to track anomalies with their details in near-real-time.

Figure 1. AVA displays anomalies raised from L4E. A prediction score of > 0 indicates that an anomaly has been detected, and AVA will raise an issue with the underlying sensor details.” width=”1063″ height=”611″></p>
<p id=Figure 1. AVA displays anomalies raised from L4E. A prediction score of > 0 indicates that an anomaly has been detected, and AVA will raise an issue with the underlying sensor details.

Direct device integration

AVA provides the capability to integrate with IoT devices via AWS IoT Core. You can configure your IoT devices to send data to the ava/issues AWS IoT Core topic. Additionally, AVA can send messages to the ava/devices AWS IoT Core topic to automatically raise issues via MQTT or HTTPS. It maps your machine name to an AVA device and a tag/value combination to an AVA event.

{
  "id": <ID!>,
  "eventId": String,
  "eventDescription": String,
  "type": String,
  "priority": String,
  "siteName": String,
  "processName": String,
  "areaName":" String,
  "stationName": String,
  "deviceName": String,
  "created": AWSDateTime,
  "acknowledged": AWSDateTime,
  "closed": AWSDateTime,
  "status": "open"
}

Analytics pipeline for custom dashboards

AVA uses Amazon DynamoDB to store factory configuration and issues data. All AVA data is exported from the DynamoDB database to an Amazon S3 bucket via an AWS Glue workflow. You can then use Amazon Athena to query underlying data and create custom reports using business intelligence (BI) solutions like Amazon QuickSight (Figure 2). With the analytics pipeline, you can create custom dashboards and monitor your factory operations holistically.

Figure 2. Custom dashboard generated via Amazon QuickSight

Figure 2. Custom dashboard generated via Amazon QuickSight

Architecture and workflow

Figure 3. AVA 3.0 architecture

Figure 3. AVA 3.0 architecture

The AWS CloudFormation template deploys the following infrastructure, shown in Figure 3:

  1. The AWS CloudFormation template provides an Amazon CloudFront web interface that deploys into an Amazon Simple Storage Service (Amazon S3) bucket configured for web hosting.
  2. An Amazon Cognito user pool allows this solution’s administrators to register users and groups using the web interface.
  3. AWS AppSync GraphQL APIs and AWS Amplify power the web interface. Amazon DynamoDB tables store the factory data.
  4. An AWS IoT rule engine helps you monitor manufacturing workstations or devices for events. It then routes the event to the correct engineer for resolution in real time.
  5. Authorized users can interact with and receive notifications from this solution. An AWS Lambda function and Amazon Simple Notification Service (Amazon SNS) send emails and SMS notifications.
  6. Issues created, acknowledged, and closed in the web interface are recorded and updated using AWS AppSync and DynamoDB.
  7. The AWS AppSync GraphQL APIs can be called directly with HTTP POST requests.
  8. If you are using L4E to monitor your machines, enter the name of the Amazon S3 bucket where inference files will be delivered in the Anomaly Detection Output Bucket CloudFormation parameter. This solution can be configured to automatically raise issues if an anomaly is detected.
  9. When the Activate Glue Workflow CloudFormation parameter is set to “Yes”, an AWS Glue workflow will be created to extract data from DynamoDB. It delivers this data via an AWS Glue Data Catalog into Amazon S3. For more information, refer to the Data Analysis section.
  10. You can use an existing SAML provider as an additional identity provider for access to this solution. You can configure the Amazon Cognito Domain Prefix, SAML Provider Name, and SAML Provider Metadata URL CloudFormation parameters. For more information, refer to SAML identity provider.

An intuitive UI with nested events

With AVA, you also can create nested events / issues, so you can more easily get to the root of the problem and resolve issues quickly. The solution builds upon the same intuitive UI as highlighted in this previous Amazon Virtual Andon blog post. In addition, it allows you to manage events granularly, create subevents, raise, and resolve issues and sub issues (Figure 4).

Figure 4. AVA allows you to raise issues and sub issues (for example, ‘Out of boxes’ allows you to specify the kind of boxes you need to resolve the issue)

Figure 4. AVA allows you to raise issues and sub issues (for example, ‘Out of boxes’ allows you to specify the kind of boxes you need to resolve the issue)

Conclusion

Amazon Virtual Andon (AVA) is a self-deployable solution that provides you the digital capability to create, route, manage, and resolve issues. With AVA, you can monitor your overall enterprise for issues and engage with the right engineers to resolve them promptly. It offers a clear, intuitive user interface and straightforward workflow to help team members resolve issues.

Get started with Amazon Virtual Andon today.

Serverless Architecture for a Structured Data Mining Solution

Post Syndicated from Uri Rotem original https://aws.amazon.com/blogs/architecture/serverless-architecture-for-a-structured-data-mining-solution/

Many businesses have an essential need for structured data stored in their own database for business operations and offerings. For example, a company that produces electronics may want to store a structured dataset of parts. This requires the following properties: color, weight, connector type, and more.

This data may already be available from external sources. In many cases, one source is sufficient. But often, multiple data sources from different vendors must be incorporated. Each data source might have a different structure for the same data field, which is problematic. Achieving one unified structure from variable sources can be difficult, and is a classic data mining problem.

We will break the problem into two main challenges:

  1. Locate and collect data. Collect from multiple data sources and load data into a data store.
  2. Unify the collected data. Since the collected data has no constraints, it might be stored in different structures and file formats. To use the collected data, it must be unified by performing an extract, transform, load (ETL) process. This matches the different data sources and creates one unified data store.

In this post, we demonstrate a pipeline of services, built on top of a serverless architecture that will handle the preceding challenges. This architecture supports large-scale datasets. Because it is a serverless solution, it is also secure and cost effective.

We use Amazon SageMaker Ground Truth as a tool for classifying the data, so that no custom code is needed to classify different data sources.

Data mining and structuring

There are three main steps to explore in order to solve these challenges:

  1. Collect the data – Data mine from different sources
  2. Map the data – Construct a dictionary of key-value pairs without writing code
  3. Structure the collected data – Enrich your dataset with a unified collection of data that was collected and mapped in steps 1 and 2

Following is an example of a use case and solution flow using this architecture:

  • In this scenario, a company must enrich an empty data base with items and properties, see Figure 1.
Figure 1. Company data before data mining

Figure 1. Company data before data mining

  • Data will then be collected from multiple data sources, and stored in the cloud, as shown in Figure 2.
Figure 2. Collecting the data by SKU from different sources

Figure 2. Collecting the data by SKU from different sources

  • To unify different property names, SageMaker Ground Truth is used to label the property names with a list of properties. The results are stored in Amazon DynamoDB, shown in Figure 3.
Figure 3. Mapping the property names to match a unified name

Figure 3. Mapping the property names to match a unified name

  • Finally, the database is populated and enriched by the mapped properties from the different data sources. This can be iterated with new sources to further enrich the data base, see Figure 4.
Figure 4. Company data after data mining, mapping, and structuring

Figure 4. Company data after data mining, mapping, and structuring

1. Collect the data

Using this serverless architecture illustrated in Figure 5, your teams can minimize the effort and cost. You’ll be able to handle large-scale datasets to collect and store the data required for your business.

Figure 5. Serverless architecture for parallel data collection

Figure 5. Serverless architecture for parallel data collection

We use Amazon S3 as it is a highly scalable and durable object storage service, and can store the original dataset. It will initiate an event that will invoke a Lambda function to start a state machine, using the original dataset as its input.

AWS Step Functions are used to orchestrate the process of preparing the dataset for parallel scraping of the items. It will automatically manage the queue of items to be processed when the dataset is large. Step Functions ensures visibility of the process, reports errors, and decouples the compute-intensive scraping operation per item.

The state machine has two steps:

  1. ETL the data to clean and standardize it. Store each item in Amazon DynamoDB, a fast and flexible NoSQL database service for any scale. The ETL function will create an array of all the items identifiers. The identifier is a unique describer of the item, such as manufacturer ID and SKU.
  2. Using the Map functionality of Step Functions, a Lambda function will be invoked for each item. This runs all your scrapers for that item and stores the results in an S3 bucket.

This solution requires custom implementation of only these two functions, according to your own dataset and scraping sources. The ETL Lambda function will contain logic needed to transform your input into an array of identifiers. The scraper Lambda function will contain logic to locate the data in the source and then store it.

Scraper function flow

For each data source, write your own scraper. The Lambda function can run them sequentially.

  1. Use the identifier input to locate the item in each one of the external sources. The data source can be an API, a webpage, a PDF file, or other source.
    • API: Collecting this data will be specific to the interface provided.
    • Webpages: Data is collected with custom code. There are open source libraries that are popular for this task, such as Beautiful Soup.
    • PDF files: Consider using Amazon Textract. Amazon Textract will give you key-value pairs and table analysis.
  2. Transform the response to key-value pairs as part of the scraper logic.
  3. Store the key-value pairs in a sub folder of the scraper responses S3 bucket, and name it after that data source.

2. Mapping the responses

Figure 6. Pipeline for property mapping

Figure 6. Pipeline for property mapping

This pipeline is initiated after the data is collected. It creates a labeling job of Named Entity Recognition, with a pre-defined set of labels. The labeling work will be split among your Workforces. When the job is completed, the output manifest file for named entity recognition is used for the final ETL Lambda. This manually locates the labeling key and values detected by your workforce, and places the results in a reusable mapping table in DynamoDB.

Services used:

Amazon SageMaker Ground Truth is a fully managed data labeling service that helps you build highly accurate training datasets for machine learning (ML). By using Ground Truth, your teams can unify different data sources to match each other, so they can be identified and used in your applications.

Figure 7. Example of one line item being labeled by one of the Workforce team members

Figure 7. Example of one line item being labeled by one of the Workforce team members

3. Structure the collected data

Figure 8. Architecture diagram of entire data collection and classification process

Figure 8. Architecture diagram of entire data collection and classification process

Using another Lambda function (see in Figure 8, populate items properties), we use the collected data (1), and the mapping (2), to populate the unified dataset into the original data DynamoDB table (3).

Conclusion

In this blog, we showed a solution to automatically collect and structure data. We used a serverless architecture that requires minimal effort, to build a reusable asset that can unify different property definitions from different data sources. Minimal effort is involved in structuring this data, as we use Amazon SageMaker Ground Truth to match and reconcile the new data sources.

For further reading:

Securely Ingest Industrial Data to AWS via Machine to Cloud Solution

Post Syndicated from Ajay Swamy original https://aws.amazon.com/blogs/architecture/securely-ingest-industrial-data-to-aws-via-machine-to-cloud-solution/

As a manufacturing enterprise, maximizing your operational efficiency and optimizing output are critical factors in this competitive global market. However, many manufacturers are unable to frequently collect data, link data together, and generate insights to help them optimize performance. Furthermore, decades of competing standards for connectivity have resulted in the lack of universal protocols to connect underlying equipment and assets.

Machine to Cloud Connectivity Framework (M2C2) is an Amazon Web Services (AWS) Solution that provides the secure ingestion of equipment telemetry data to the AWS Cloud. This allows you to use AWS services to conduct analysis on your equipment data, instead of managing underlying infrastructure operations. The solution allows for robust data ingestion from industrial equipment that use OPC Data Access (OPC DA) and OPC Unified Access (OPC UA) protocols.

Secure, automated configuration and ingestion of industrial data

M2C2 allows manufacturers to ingest their shop floor data into various data destinations in AWS. These include AWS IoT SiteWise, AWS IoT Core, Amazon Kinesis Data Streams, and Amazon Simple Storage Service (S3). The solution is integrated with AWS IoT SiteWise so you can store, organize, and monitor data from your factory equipment at scale. Additionally, the solution provides customers an intuitive user interface to create, configure, monitor, and manage connections.

Automated setup and configuration

Figure 1. Automatically create and configure connections

Figure 1. Automatically create and configure connections

With M2C2, you can connect to your operational technology assets (see Figure 1). The solution automatically creates AWS IoT certificates, keys, and configuration files for AWS IoT Greengrass. This allows you to set up Greengrass to run on your industrial gateway. It also automates the deployment of any Greengrass group configuration changes required by the solution. You can define a connection with the interface, and specify attributes about equipment, tags, protocols, and read frequency for equipment data.

Figure 2. Send data to different destinations in the AWS Cloud

Figure 2. Send data to different destinations in the AWS Cloud

Once the connection details have been specified, you can send data to different destinations in AWS Cloud (see Figure 2). M2C2 provides capability to ingest data from industrial equipment using OPC-DA and OPC-UA protocols. The solution collects the data, and then publishes the data to AWS IoT SiteWise, AWS IoT Core, or Kinesis Data Streams.

Publishing data to AWS IoT SiteWise allows for end-to-end modeling and monitoring of your factory floor assets. When using the default solution configuration, publishing data to Kinesis Data Streams allows for ingesting and storing data in an Amazon S3 bucket. This gives you the capability for custom advanced analytics use cases and reporting.

You can choose to create multiple connections, and specify sites, areas, processes, and machines, by using the setup UI.

Management of connections and messages

Figure 3. Manage your connections

Figure 3. Manage your connections

M2C2 provides a straightforward connections screen (see Figure 3), where production managers can monitor and review the current state of connections. You can start and stop connections, view messages and errors, and gain connectivity across different areas of your factory floor. The Manage connections UI allows you to holistically manage data connectivity from a centralized place. You can then make changes and corrections as needed.

Architecture and workflow

Figure 4. Machine to Cloud Connectivity (M2C2) Framework architecture

Figure 4. Machine to Cloud Connectivity (M2C2) Framework architecture

The AWS CloudFormation template deploys the following infrastructure, shown in Figure 4:

  1. An Amazon CloudFront user interface that deploys into an Amazon S3 bucket configured for web hosting.
  2. An Amazon API Gateway API provides the user interface for client requests.
  3. An Amazon Cognito user pool authenticates the API requests.
  4. AWS Lambda functions power the user interface, in addition to the configuration and deployment mechanism for AWS IoT Greengrass and AWS IoT SiteWise gateway resources. Amazon DynamoDB tables store the connection metadata.
  5. An AWS IoT SiteWise gateway configuration can be used for any OPC UA data sources.
  6. An Amazon Kinesis Data Streams data stream, Amazon Kinesis Data Firehose, and Amazon S3 bucket to store telemetry data.
  7. AWS IoT Greengrass is installed and used on an on-premises industrial gateway to run protocol connector Lambda functions. These connect and read telemetry data from your OPC UA and OPC DA servers.
  8. Lambda functions are deployed onto AWS IoT Greengrass Core software on the industrial gateway. They connect to the servers and send the data to one or more configured destinations.
  9. Lambda functions that collect the telemetry data write to AWS IoT Greengrass stream manager streams. The publisher Lambda functions read from the streams.
  10. Publisher Lambda functions forward the data to the appropriate endpoint.

Data collection

The Machine to Cloud Connectivity solution uses Lambda functions running on Greengrass to connect to your on-premises OPC-DA and OPC-UA industrial devices. When you deploy a connection for an OPC-DA device, the solution configures a connection-specific OPC-DA connector Lambda. When you deploy a connection for an OPC-UA device, the solution uses the AWS IoT SiteWise Greengrass connector to collect the data.

Regardless of protocol, the solution configures a publisher Lambda function, which takes care of sending your streaming data to one or more desired destinations. Stream Manager enables the reading and writing of stream data from multiple sources and to multiple destinations within the Greengrass core. This enables each configured collector to write data to a stream. The publisher reads from that stream and sends the data to your desired AWS resource.

Conclusion

Machine to Cloud Connectivity (M2C2) Framework is a self-deployable solution that provides secure connectivity between your technology (OT) assets and the AWS Cloud. With M2C2, you can send data to AWS IoT Core or AWS IoT SiteWise for analytics and monitoring. You can store your data in an industrial data lake using Kinesis Data Streams and Amazon S3. Get started with Machine to Cloud Connectivity (M2C2) Framework today.

Scaling Data Analytics Containers with Event-based Lambda Functions

Post Syndicated from Brian Maguire original https://aws.amazon.com/blogs/architecture/scaling-data-analytics-containers-with-event-based-lambda-functions/

The marketing industry collects and uses data from various stages of the customer journey. When they analyze this data, they establish metrics and develop actionable insights that are then used to invest in customers and generate revenue.

If you’re a data scientist or developer in the marketing industry, you likely often use containers for services like collecting and preparing data, developing machine learning models, and performing statistical analysis. Because the types and amount of marketing data collected are quickly increasing, you’ll need a solution to manage the scale, costs, and number of required data analytics integrations.

In this post, we provide a solution that can perform and scale with dynamic traffic and is cost optimized for on-demand consumption. It uses synchronous container-based data science applications that are deployed with asynchronous container-based architectures on AWS Lambda. This serverless architecture automates data analytics workflows utilizing event-based prompts.

Synchronous container applications

Data science applications are often deployed to dedicated container instances, and their requests are routed by an Amazon API Gateway or load balancer. Typically, an Amazon API Gateway routes HTTP requests as synchronous invocations to instance-based container hosts.

The target of the requests is a container-based application running a machine learning service (SciLearn). The service container is configured with the required dependency packages such as scikit-learn, pandas, NumPy, and SciPy.

Containers are commonly deployed on different targets such as on-premises, Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Compute Cloud (Amazon EC2), and AWS Elastic Beanstalk. These services run synchronously and scale through Amazon Auto Scaling groups and a time-based consumption pricing model.

Figure 1. Synchronous container applications diagram

Figure 1. Synchronous container applications diagram

Challenges with synchronous architectures

When using a synchronous architecture, you will likely encounter some challenges related to scale, performance, and cost:

  • Operation blocking. The sender does not proceed until Lambda returns, and failure processing, such as retries, must be handled by the sender.
  • No native AWS service integrations. You cannot use several native integrations with other AWS services such as Amazon Simple Storage Service (Amazon S3), Amazon EventBridge, and Amazon Simple Queue Service (Amazon SQS).
  • Increased expense. A 24/7 environment can result in increased expenses if resources are idle, not sized appropriately, and cannot automatically scale.

The New for AWS Lambda – Container Image Support blog post offers a serverless, event-based architecture to address these challenges. This approach is explained in detail in the following section.

Benefits of using Lambda Container Image Support

In our solution, we refactored the synchronous SciLearn application that was deployed on instance-based hosts as an asynchronous event-based application running on Lambda. This solution includes the following benefits:

  • Easier dependency management with Dockerfile. Dockerfile allows you to install native operating system packages and language-compatible dependencies or use enterprise-ready container images.
  • Similar tooling. The asynchronous and synchronous solutions use Amazon Elastic Container Registry (Amazon ECR) to store application artifacts. Therefore, they have the same build and deployment pipeline tools to inspect Dockerfiles. This means your team will likely spend less time and effort learning how to use a new tool.
  • Performance and cost. Lambda provides sub-second autoscaling that’s aligned with demand. This results in higher availability, lower operational overhead, and cost efficiency.
  • Integrations. AWS provides more than 200 service integrations to deploy functions as container images on Lambda, without having to develop it yourself so it can be deployed faster.
  • Larger application artifact up to 10 GB. This includes larger application dependency support, giving you more room to host your files and packages in oppose to hard limit of 250 MB of unzipped files for deployment packages.

Scaling with asynchronous events

AWS offers two ways to asynchronously scale processing independently and automatically: Elastic Beanstalk worker environments and asynchronous invocation with Lambda. Both options offer the following:

  • They put events in an SQS queue.
  • They can be designed to take items from the queue only when they have the capacity available to process a task. This prevents them from becoming overwhelmed.
  • They offload tasks from one component of your application by sending them to a queue and process them asynchronously.

These asynchronous invocations add default, tunable failure processing and retry mechanisms through “on failure” and “on success” event destinations, as described in the following section.

Integrations with multiple destinations

“On failure” and “on success” events can be logged in an SQS queue, Amazon Simple Notification Service (Amazon SNS) topic, EventBridge event bus, or another Lambda function. All four are integrated with most AWS services.

“On failure” events are sent to an SQS dead-letter queue because they cannot be delivered to their destination queues. They will be reprocessed them as needed, and any problems with message processing will be isolated.

Figure 2 shows an asynchronous Amazon API Gateway that has placed an HTTP request as a message in an SQS queue, thus decoupling the components.

The messages within the SQS queue then prompt a Lambda function. This runs the machine learning service SciLearn container in Lambda for data analysis workflows, which are integrated with another SQS dead letter queue for failures processing.

Figure 2. Example asynchronous-based container applications diagram

Figure 2. Example asynchronous-based container applications diagram

When you deploy Lambda functions as container images, they benefit from the same operational simplicity, automatic scaling, high availability, and native integrations. This makes it an appealing architecture for our data analytics use cases.

Design considerations

The following can be considered when implementing Docker Container Images with Lambda:

  • Lambda supports container images that have manifest files that follow these formats:
    • Docker image manifest V2, schema 2 (Docker version 1.10 and newer)
    • Open Container Initiative (OCI) specifications (v1.0.0 and up)
  • Storage in Lambda:
  • On create/update, Lambda will cache the image to speed up the cold start of functions during execution. Cold starts occur on an initial request to Lambda, which can lead to longer startup times. The first request will maintain an instance of the function for only a short time period. If Lambda has not been called during that period, the next invocation will create a new instance
  • Fine grain role policies are highly recommended for security purposes
  • Container images can use the Lambda Extensions API to integrate monitoring, security and other tools with the Lambda execution environment.

Conclusion

We were able to architect this synchronous service based on a previously deployed on instance-based hosts and design it to become asynchronous on Amazon Lambda.

By using the new support for container-based images in Lambda and converting our workload into an asynchronous event-based architecture, we were able to overcome these challenges:

  • Performance and security. With batch requests, you can scale asynchronous workloads and handle failure records using SQS Dead Letter Queues and Lambda destinations. Using Lambda to integrate with other services (such as EventBridge and SQS) and using Lambda roles simplifies maintaining a granular permission structure. When Lambda uses an Amazon SQS queue as an event source, it can scale up to 60 more instances per minute, with a maximum of 1,000 concurrent invocations.
  • Cost optimization. Compute resources are a critical component of any application architecture. Overprovisioning computing resources and operating idle resources can lead to higher costs. Because Lambda is serverless, it only incurs costs on when you invoke a function and the resources allocated for each request.

Architecture Monthly Magazine: Manufacturing

Post Syndicated from Jane Scolieri original https://aws.amazon.com/blogs/architecture/architecture-monthly-magazine-manufacturing/

Architecture Monthly Magazine - Manufacturing

How is operational data being used to transform manufacturing? Steve Blackwell, AWS Worldwide Tech Leader for manufacturing, speaks about considerations for architecting for manufacturing, considerations for Industry 4.0 applications and data, and AWS for Industrial.

We hope you’ll find this edition of Architecture Monthly useful. My team would like your feedback, so please give us a rating and include your comments on the Amazon Kindle page. You can view past issues and reach out to [email protected] anytime with your questions and comments.

In this month’s Manufacturing issue:

  • Ask an Expert: Steve Blackwell, Tech Leader, Manufacturing, AWS
  • Case Study: Amazon Prime Air’s Drone Takes Flight with AWS and Siemens
  • Reference Architecture: PTC Windchill Product Lifecycle Management on AWS
  • Blog: How Genie® (a Terex® brand) improved paint quality using AWS IoT SiteWise
  • Solution: Amazon Virtual Andon
  • Whitepaper: Achieve Production Optimization with AWS Machine Learning
  • Reference Architecture: OSIsoft PI System Enterprise Data Infrastructure on AWS
  • Blog: AWS for Industrial – Making it easy for customers to scale their industrial
    workloads on AWS
  • Case Study: Arneg Predicts Customer Maintenance Needs Worldwide Using Amazon
    Forecast and Amazon SageMaker
  • Related Videos: Driving Predictive Quality with Machine Connectivity, AWS Knows
    Industrial Operations

Download the Magazine

How to access the magazine

View and download past issues as PDFs on the AWS Architecture Monthly webpage.
Readers in the US, UK, Germany, and France can subscribe to the Kindle version of the magazine at Kindle Newsstand.
Visit Flipboard, a personalized mobile magazine app that you can also read on your computer.
We hope you’re enjoying Architecture Monthly, and we’d like to hear from you—leave us star rating and comment on the Amazon Kindle Newsstand page or contact us anytime at [email protected].

Amazon Lookout for Vision – New ML Service Simplifies Defect Detection for Manufacturing

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/amazon-lookout-for-vision-new-machine-learning-service-that-simplifies-defect-detection-for-manufacturing/

Today, I’m excited to announce Amazon Lookout for Vision, a new machine learning (ML) service that helps customers in industrial environments to detect visual defects on production units and equipment in an easy and cost-effective way.

Can you spot the circuit board with the defect in these images?

Image of 3 circuit boards - one is faulty

Maybe you can if you are familiar with circuit boards, but I have to say that it took me a while to discover the error. Humans, when properly trained and are well rested, are great at finding anomalies in a set of objects. However, when they are tired or not properly trained – like me in this example – they can be slow, prone to errors and inconsistent.

That’s why many companies use machine vision technologies to detect anomalies. However, these technologies need to be calibrated with controlled lighting and camera viewpoints. In addition, you need to specify hard-coded rules that define what is a defect and what is not, making the technologies very specialized and complex to build.

Lookout for Vision is a new machine learning service that helps increase industrial product quality and reduce operational costs by automating visual inspection of product defects across production processes. Lookout for Vision uses deep learning models to replace hard-coded rules and handles the differences in camera angle, lighting and other challenges that arise from the operational environment. With Lookout for Vision, you can reduce the need for carefully controlled environments.

Using Lookout for Vision, you can detect damages to manufactured parts, identify missing components or parts, and uncover underlying process-related issues in your manufacturing lines.

How to Get Started With Lookout for Vision
The first thing I want to mention is that to use Lookout for Vision, you don’t need to be a machine learning expert. Lookout for Vision is a fully managed service and comes with anomaly detection models that can be optimized for your use case and your data.

There are several steps for using Lookout for Vision. The first is preparing the dataset, which includes creating a dataset of images and adding labels to the images. Then, Lookout for Vision uses this dataset to automatically train the ML model that learns to detect anomalies in your product. The final part is using the model in production. You can keep evaluating the performance of your trained model and improve it at any time using tools that Lookout for Vision provides.

Service console tutorial for getting started

Preparing the Data
To get started with the model, you first need a set of images of your product. For better results, include images with normal (no defects) and anomalous content (includes defects). To get started with training, you will need at least 20 normal images and 10 anomalous images.

There are many ways of importing images into Lookout for Vision from the AWS Management Console: You can provide manifests for annotated images using the Amazon SageMaker Ground Truth service, provide images from an S3 bucket or upload directly from your computer.

Different ways to import your images.

After you upload the images, you need to add labels to classify the images in your dataset as normal or anomalous. Labeling is a very important step, as this is the key information that Lookout for Vision uses to train the model for your use case.

For this demo, I import the images from an S3 bucket. If you’ve organized the images in your S3 bucket by folder name (/anomaly/01.jpeg), Lookout for Vision will automatically import the folder structure into corresponding labels.

Training the Model
When our dataset is ready, we need to train our model with it. The training button is enabled once you have the minimum number of labeled images: 20 normal and 10 anomalous.

Depending on the size of the dataset, training may take a while to complete: for me, it took around an hour to train the model with 100 images. Note that you will begin incurring costs when Lookout for Vision starts to actually train the model. After training is complete, your model is ready to detect anomalies in new images.

Screenshot of a model in training.

Evaluating the Model
There are a couple of ways to evaluate whether your model is ready to be deployed to production. The first is to review the performance metrics of the model and the second is to run some productionlike tests that will help you to verify if the model is ready to be deployed.

There are three main performance metrics: precision, recall and the F1 score. Precision measures the percentage of times the model prediction is correct and recall measures the percentage of true defects the model identified. F1 score is used to determine the model performance metric.

Screenshot of model performance metrics

If you want to run some production-like tests to verify if your model is ready, use the run trial detection feature. This will enable you to run your Lookout for Vision model and predict anomalies on new images. You can further improve the model by manually verifying the results and adding new training images.

Create a new job to predict anomalies.

I used the three images that appear at the beginning of this post for my trial detection. The trial detection job ran for 15-20 minutes, and after that Lookout for Vision used the trained model to classify the images into “Normal” and “Anomaly.” When Lookout for Vision finalizes the trial detection job, you can verify the results as correct or incorrect, and add this images to the dataset.

Screenshot verifying the results of the trial

Using the Model in Production
To use Lookout for Vision, you need to integrate the AWS SDKs or CLI in the systems that are processing the images of the products in the manufacturing line, and internet connectivity is required for this to work. The first thing you need to do is to start the model. When using Lookout for Vision, you are billed for the time your model is running and making inferences. For example, if you start your model at 8 a.m. and stop it at 5 p.m., you will be billed for 9 hours.

# Example CLI
aws lookoutvision start-model 
--project-name circuitBoard 
--model-version 1
--additional-output-config "Bucket=<OUTPUT_BUCKET>,Prefix=<PREFIX_KEY>" 
--min-anomaly-detection-units 10 

# Example response
{ "status" : "STARTING_HOSTING" }

When your model is ready, you can call the detect-anomalies API from Lookout for Vision.

# Example CLI
aws lookoutvision detect-anomalies 
--project-name circuitBoard 
--model-version 1 

And this API will return a JSON response that shows if the image is an anomaly or not, along with the confidence level of that prediction.

{
    "DetectAnomalyResult": {
        "Source": {
            "Type": "direct"
        },
        "IsAnomalous": true,
        "Confidence": 0.97
    }
}

When you are done with detecting anomalies for the day, use the stop-model API. In the Lookout for Vision service console you can find code snippets on how to use these APIs.

When you are using Lookout for Vision in production, you’ll find a dashboard that helps you to sort and track the production lines by most defective line, line with the most recent defects, and the line with the highest anomaly ratio.

Available Today
Lookout for Vision is available in all AWS Regions.

To get started with Amazon Lookout for Vision, visit the service page today.

Marcia