All posts by Ramakant Joshi

Accelerate data integration with Salesforce and AWS using AWS Glue

Post Syndicated from Ramakant Joshi original https://aws.amazon.com/blogs/big-data/accelerate-data-integration-with-salesforce-and-aws-using-aws-glue/

The rapid adoption of software as a service (SaaS) solutions has led to data silos across various platforms, presenting challenges in consolidating insights from diverse sources. Effective data analytics relies on seamlessly integrating data from disparate systems through identifying, gathering, cleansing, and combining relevant data into a unified format. AWS Glue, a serverless data integration service, has simplified this process by offering scalable, efficient, and cost-effective solutions for integrating data from various sources. With AWS Glue, you can streamline data integration, reduce data silos and complexities, and gain agility in managing data pipelines, ultimately unlocking the true potential of your data assets for analytics, data-driven decision-making, and innovation.

This post explores the new Salesforce connector for AWS Glue and demonstrates how to build a modern extract, transform, and load (ETL) pipeline with AWS Glue ETL scripts.

Introducing the Salesforce connector for AWS Glue

To meet the demands of diverse data integration use cases, AWS Glue now supports SaaS connectivity for Salesforce. This enables users to quickly preview and transfer their customer relationship management (CRM) data, fetch the schema dynamically on request, and query the data. With the AWS Glue Salesforce connector, you can ingest and transform your CRM data to any of the AWS Glue supported destinations, including Amazon Simple Storage Service (Amazon S3), in your preferred format, including Apache Iceberg, Apache Hudi, and Linux Foundation Delta Lake; data warehouses such as Amazon Redshift and Snowflake; and many more. Reverse ETL use cases are also supported, allowing you to write data back to Salesforce.

The following are key benefits of the Salesforce connector for AWS Glue:

  •  You can use AWS Glue native capabilities
  •  It is well tested with AWS Glue capabilities and is production ready for any data integration workload
  •  It works seamlessly on top of AWS Glue and Apache Spark in a distributed fashion for efficient data processing

Solution overview

For our use case, we want to retrieve the full load of a Salesforce account object in a data lake on Amazon S3 and capture the incremental changes. This solution also allows you to update certain fields of the account object in the data lake and push it back to Salesforce. To achieve this, you create two ETL jobs using AWS Glue with the Salesforce connector, and create a transactional data lake on Amazon S3 using Apache Iceberg.

In the first job, you configure AWS Glue to ingest the account object from Salesforce and save it into a transactional data lake on Amazon S3 in Apache Iceberg format. Then you update the account object data that is extracted from the first job in the transactional data lake in Amazon S3. Lastly, you run the second job to send that change back to Salesforce.

Prerequisites

Complete the following prerequisite steps:

  1. Create an S3 bucket to store the results.
  2. Sign up for a Salesforce account, if you don’t already have one.
  3. Create an AWS Identity and Access Management (IAM) role for the AWS Glue ETL job to use. The role must grant access to all resources used by the job, including Amazon S3 and AWS Secrets Manager. For this post, we name the role AWSGlueServiceRole-SalesforceConnectorJob. Use the following policies:
    • AWS managed policies:
    • Inline policy:
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "VisualEditor0",
                  "Effect": "Allow",
                  "Action": [
                      "s3:PutObject",
                      "s3:GetObjectAcl",
                      "s3:GetObject",
                      "s3:GetObjectAttributes",
                      "s3:ListBucket",
                      "s3:DeleteObject",
                      "s3:PutObjectAcl"
                  ],
                  "Resource": [
                      "arn:aws:s3:::<S3-BUCKET-NAME>",
                      "arn:aws:s3:::<S3-BUCKET-NAME>/*"
                  ]
              }
          ]
      }

  4. Create the AWS Glue connection for Salesforce:
    1. The Salesforce connector supports two OAuth2 grant types: JWT_BEARER and AUTHORIZATION_CODE. For this post, we use the AUTHORIZATION_CODE grant type.
    2. On the Secrets Manager console, create a new secret. Add two keys, ACCESS_TOKEN and REFRESH_TOKEN, and keep their values blank. These will be populated after you enter your Salesforce credentials.
    3. Configure the Salesforce connection in AWS Glue. Use AWSGlueServiceRole-SalesforceConnectorJob while creating the Salesforce connection. For this post, we name the connection Salesforce_Connection.
    4. In the Authorization section, choose Authorization Code and the secret you created in the previous step.
    5. Provide your Salesforce credentials when prompted. The ACCESS_TOKEN and REFRESH_TOKEN keys will be populated after you enter your Salesforce credentials.
  5. Create an AWS Glue database. For this post, we name it glue_etl_salesforce_db.

Create an ETL job to ingest the account object from Salesforce

Complete the following steps to create a new ETL job in AWS Glue Studio to transfer data from Salesforce to Amazon S3:

  1. On the AWS Glue console, create a new job (with the Script editor option). For this post, we name the job Salesforce_to_S3_Account_Ingestion.
  2. On the Script tab, enter the Salesforce_to_S3_Account_Ingestion script.

Make sure that the name you used to create the Salesforce connection is passed as the connectionName parameter value in the script, as shown in the following code example:

# Script generated for node Salesforce

input_Salesforce_Dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="salesforce",
    connection_options={
        "entityName": "Account",
        "apiVersion": "v60.0",
        "connectionName": "Salesforce_Connection",
    },
    transformation_ctx="inputSalesforceDyf",
)

The script fetches records from the Salesforce account object. Then it checks if the account table exists in the transactional data lake. If the table doesn’t exist, it creates a new table and inserts the records. If the table exists, it performs an upsert operation.
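The full script isn't reproduced here, but the create-or-upsert logic can be sketched as follows. This is a minimal PySpark illustration that assumes the catalog, database, and table names used in this post (glue_catalog, glue_etl_salesforce_db, account); the actual Salesforce_to_S3_Account_Ingestion script may differ in its details.

# Minimal sketch of the create-or-upsert logic (assumes glueContext and spark are
# initialized and the Iceberg extensions are enabled through the --conf job parameter)
input_df = input_Salesforce_Dyf.toDF()
input_df.createOrReplaceTempView("salesforce_account_updates")

table = "glue_catalog.glue_etl_salesforce_db.account"
existing_tables = spark.sql("SHOW TABLES IN glue_catalog.glue_etl_salesforce_db")
table_exists = existing_tables.filter("tableName = 'account'").count() > 0

if not table_exists:
    # First run: create the Iceberg table and load all records
    input_df.writeTo(table).using("iceberg").create()
else:
    # Subsequent runs: upsert on the Salesforce record ID
    spark.sql(f"""
        MERGE INTO {table} AS target
        USING salesforce_account_updates AS source
        ON target.id = source.id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)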

  3. On the Job details tab, for IAM role, choose AWSGlueServiceRole-SalesforceConnectorJob.
  4. Under Advanced properties, for Additional network connection, choose the Salesforce connection.
  5. Set up the job parameters:
    1. --conf: spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.warehouse=file:///tmp/spark-warehouse
    2. --datalake-formats: iceberg
    3. --db_name: glue_etl_salesforce_db
    4. --s3_bucket_name: your S3 bucket
    5. --table_name: account

  6. Save the job and run it.

Depending on the size of the data in your account object in Salesforce, the job will take a few minutes to complete. After a successful job run, a new table called account is created and populated with Salesforce account information.

  7. You can use Amazon Athena to query the data:
    SELECT id, name, type, active__c, upsellopportunity__c, lastmodifieddate
    FROM "glue_etl_salesforce_db"."account"

Validate transactional capabilities

You can validate the transactional capabilities supported by Apache Iceberg. For testing, try three operations: insert, update, and delete:

  1. Create a new account object in Salesforce, rerun the AWS Glue job, then run the query in Athena to validate the new account is created.
  2. Delete an account in Salesforce, rerun the AWS Glue job, and validate the deletion using Athena.
  3. Update an account in Salesforce, rerun the AWS Glue job, and validate the update operation using Athena.
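You can also run the same validation query programmatically after each job run. The following is a small sketch using boto3; the Athena results location is an assumption and should point to a bucket you own.

import boto3
import time

athena = boto3.client("athena")

query = """
    SELECT id, name, type, active__c, upsellopportunity__c, lastmodifieddate
    FROM "glue_etl_salesforce_db"."account"
"""

# Start the query and wait for it to finish
response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "glue_etl_salesforce_db"},
    ResultConfiguration={"OutputLocation": "s3://<S3-BUCKET-NAME>/athena-results/"},
)
query_id = response["QueryExecutionId"]

while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

# Print the result rows (the first row contains the column headers)
if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])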

Create an ETL job to send updates back to Salesforce

AWS Glue also allows you to write data back to Salesforce. Complete the following steps to create an ETL job in AWS Glue to get updates from the transactional data lake and write them to Salesforce. In this scenario, you update an account record and push it back to Salesforce.

  1. On the AWS Glue console, create a new job (with the Script editor option). For this post, we name the job S3_to_Salesforce_Account_Writeback.
  2. On the Script tab, enter the S3_to_Salesforce_Account_Writeback script.

Make sure that the name you used to create the Salesforce connection is passed as the connectionName parameter value in the script:

# Script generated for node Salesforce

Salesforce_node = glueContext.write_dynamic_frame.from_options(
    frame=SelectFields_dyf,
    connection_type="salesforce",
    connection_options={
        "apiVersion": "v60.0",
        "connectionName": "Salesforce_Connection",
        "entityName": "Account",
        "writeOperation": "UPDATE",
        "idFieldNames": "Id",
    },
    transformation_ctx="Salesforce_node",
)

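In this snippet, SelectFields_dyf holds the updated account records read from the Iceberg table. A minimal sketch of how it could be produced is shown below; the field names follow the earlier examples, and the actual S3_to_Salesforce_Account_Writeback script may differ.

from awsglue.dynamicframe import DynamicFrame

# Read the account table from the Iceberg data lake
# (assumes spark = glueContext.spark_session, as in a standard AWS Glue script)
account_df = spark.table("glue_catalog.glue_etl_salesforce_db.account")

# Keep only the fields to push back; Salesforce requires the record Id for an UPDATE
updates_df = (
    account_df.select("id", "upsellopportunity__c")
    .withColumnRenamed("id", "Id")
    .withColumnRenamed("upsellopportunity__c", "UpsellOpportunity__c")
)

# Convert back to a DynamicFrame for the Salesforce writer shown above
SelectFields_dyf = DynamicFrame.fromDF(updates_df, glueContext, "SelectFields_dyf")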
  3. On the Job details tab, for IAM role, choose AWSGlueServiceRole-SalesforceConnectorJob.
  4. Configure the job parameters:
    1. --conf:
      spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.warehouse=file:///tmp/spark-warehouse
    2. --datalake-formats: iceberg
    3. --db_name: glue_etl_salesforce_db
    4. --table_name: account

  5. Run the update query in Athena to change the value of UpsellOpportunity__c for a Salesforce account to "Yes":
    UPDATE "glue_etl_salesforce_db"."account"
    SET upsellopportunity__c = 'Yes'
    WHERE name = '<SF Account>'

  6. Run the S3_to_Salesforce_Account_Writeback AWS Glue job.

Depending on the size of the data in your account object in Salesforce, the job will take a few minutes to complete.

  7. Validate the object in Salesforce. The value of the UpsellOpportunity field should now be "Yes".

You have now successfully validated the Salesforce connector.

Considerations

You can set up AWS Glue job triggers to run the ETL jobs on a schedule, so that the data is regularly synchronized between Salesforce and Amazon S3. You can also integrate the ETL jobs with other AWS services, such as AWS Step Functions, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), AWS Lambda, or Amazon EventBridge, to create a more advanced data processing pipeline.
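For example, a scheduled trigger for the ingestion job could be created along the following lines. This is a sketch using boto3 with an assumed hourly schedule and trigger name; adjust the job name and cron expression for your environment.

import boto3

glue = boto3.client("glue")

# Run the Salesforce ingestion job every hour (cron expression is in UTC)
glue.create_trigger(
    Name="salesforce-ingestion-hourly",
    Type="SCHEDULED",
    Schedule="cron(0 * * * ? *)",
    Actions=[{"JobName": "Salesforce_to_S3_Account_Ingestion"}],
    StartOnCreation=True,
)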

By default, the Salesforce connector doesn’t import deleted records from Salesforce objects. However, you can set the IMPORT_DELETED_RECORDS option to "true" to import all records, including deleted ones. Refer to Salesforce connection options for the full list of connection options.

# Script generated for node Salesforce

input_Salesforce_Dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="salesforce",
    connection_options={
        "entityName": "Account",
        "apiVersion": "v60.0",
        "connectionName": "Salesforce_Connection",
        "IMPORT_DELETED_RECORDS": "true",
    },
    transformation_ctx="inputSalesforceDyf",
)

Clean up

To avoid incurring charges, clean up the resources used in this post from your AWS account, including the AWS Glue jobs, Salesforce connection, Secrets Manager secret, IAM role, and S3 bucket.

Conclusion

The AWS Glue connector for Salesforce simplifies the analytics pipeline, reduces time to insights, and facilitates data-driven decision-making. It empowers organizations to streamline data integration and analytics. The serverless nature of AWS Glue means there is no infrastructure management, and you pay only for the resources consumed while your jobs are running. As organizations increasingly rely on data for decision-making, this Salesforce connector provides an efficient, cost-effective, and agile solution to swiftly meet data analytics needs.

To learn more about the AWS Glue connector for Salesforce, refer to Connecting to Salesforce in AWS Glue Studio. In this user guide, we walk through the entire process, from setting up the connection to running the data transfer flow. For more information on AWS Glue, visit AWS Glue.


About the authors

Ramakant Joshi is an AWS Solutions Architect, specializing in the analytics and serverless domain. He has a background in software development and hybrid architectures, and is passionate about helping customers modernize their cloud architecture.

Kamen Sharlandjiev is a Sr. Big Data and ETL Solutions Architect, Amazon MWAA and AWS Glue ETL expert. He’s on a mission to make life easier for customers who are facing complex data integration and orchestration challenges. His secret weapon? Fully managed AWS services that can get the job done with minimal effort. Follow Kamen on LinkedIn to keep up to date with the latest Amazon MWAA and AWS Glue features and news!

Debaprasun Chakraborty is an AWS Solutions Architect, specializing in the analytics domain. He has around 20 years of software development and architecture experience. He is passionate about helping customers in cloud adoption, migration and strategy.

Cross-account integration between SaaS platforms using Amazon AppFlow

Post Syndicated from Ramakant Joshi original https://aws.amazon.com/blogs/big-data/cross-account-integration-between-saas-platforms-using-amazon-appflow/

Implementing an effective data sharing strategy that satisfies compliance and regulatory requirements is complex. Customers often need to share data between disparate software as a service (SaaS) platforms within their organization or across organizations. On many occasions, they need to apply business logic to the data received from the source SaaS platform before pushing it to the target SaaS platform.

Let’s take an example. AnyCompany’s marketing team hosted an event at the Anaheim Convention Center, CA. The marketing team created leads based on the event in Adobe Marketo. An automated process downloaded the leads from Marketo in the marketing AWS account. These leads are then pushed to the sales AWS account. A business process picks up those leads, filters them based on a “Do Not Call” criteria, and creates entries in the Salesforce system. Now, the sales team can pursue those leads and continue to track the opportunities in Salesforce.

In this post, we show how to share your data across SaaS platforms in a cross-account structure using fully managed, low-code AWS services such as Amazon AppFlow, Amazon EventBridge, AWS Step Functions, and AWS Glue.

Solution overview

Considering our example of AnyCompany, let’s look at the data flow. AnyCompany’s Marketo instance is integrated with the producer AWS account. As the leads from Marketo land in the producer AWS account, they’re pushed to the consumer AWS account, which is integrated to Salesforce. Business logic is applied to the leads data in the consumer AWS account, and then the curated data is loaded into Salesforce.

We have used a serverless architecture to implement this use case. The following AWS services are used for data ingestion, processing, and load:

  • Amazon AppFlow is a fully managed integration service that enables you to securely transfer data between SaaS applications like Salesforce, SAP, Marketo, Slack, and ServiceNow, and AWS services like Amazon S3 and Amazon Redshift, in just a few clicks. With AppFlow, you can run data flows at nearly any scale at the frequency you choose—on a schedule, in response to a business event, or on demand. You can configure data transformation capabilities like filtering and validation to generate rich, ready-to-use data as part of the flow itself, without additional steps. Amazon AppFlow is used to download leads data from Marketo and upload the curated leads data into Salesforce.
  • Amazon EventBridge is a serverless event bus that lets you receive, filter, transform, route, and deliver events. EventBridge is used to track the events like receiving the leads data in the producer or consumer AWS accounts and then triggering a workflow.
  • AWS Step Functions is a visual workflow service that helps developers use AWS services to build distributed applications, automate processes, orchestrate microservices, and create data and machine learning (ML) pipelines. Step Functions is used to orchestrate the data processing.
  • AWS Glue is a serverless data preparation service that makes it easy to run extract, transform, and load (ETL) jobs. An AWS Glue job encapsulates a script that reads, processes, and then writes data to a new schema. This solution uses Python 3.6 AWS Glue jobs for data filtration and processing.
  • Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading scalability, data availability, security, and performance. Amazon S3 is used to store the leads data.

Let’s review the architecture in detail. The following diagram shows a visual representation of how this integration works.

The following steps outline the process for transferring and processing leads data using Amazon AppFlow, Amazon S3, EventBridge, Step Functions, AWS Glue, and Salesforce:

  1. Amazon AppFlow runs on a daily schedule and retrieves any new leads created within the last 24 hours (incremental changes) from Marketo.
  2. The leads are saved as Parquet format files in an S3 bucket in the producer account.
  3. When the daily flow is complete, Amazon AppFlow emits events to EventBridge.
  4. EventBridge triggers Step Functions.
  5. Step Functions copies the Parquet format files containing the leads from the producer account’s S3 bucket to the consumer account’s S3 bucket.
  6. Upon a successful file transfer, Step Functions publishes an event in the consumer account’s EventBridge.
  7. An EventBridge rule intercepts this event and triggers Step Functions in the consumer account.
  8. Step Functions calls an AWS Glue crawler, which scans the leads Parquet files and creates a table in the AWS Glue Data Catalog.
  9. The AWS Glue job is called, which selects records with the Do Not Call field set to false from the leads files, and creates a new set of curated Parquet files (see the sketch after this list). We have used an AWS Glue job for the ETL pipeline to showcase how you can use a purpose-built analytics service for complex ETL needs. However, for simple filtering requirements like Do Not Call, you can use the existing filtering feature of Amazon AppFlow.
  10. Step Functions then calls Amazon AppFlow.
  11. Finally, Amazon AppFlow populates the Salesforce leads based on the data in the curated Parquet files.
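The filtering step in the consumer AWS Glue job (step 9) can be sketched roughly as follows. This is an illustration only: the doNotCall column name and the source and target path parameters are assumptions based on the flow described here, and the deployed glue-job.py script may differ.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import col

args = getResolvedOptions(sys.argv, ["JOB_NAME", "source_path", "target_path"])

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the raw Marketo leads delivered by the producer account
leads_df = spark.read.parquet(args["source_path"])

# Keep only leads that may be contacted (Do Not Call = false)
curated_df = leads_df.filter(col("doNotCall") == False)

# Write curated Parquet files for the Salesforce flow to pick up
curated_df.write.mode("overwrite").parquet(args["target_path"])

job.commit()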

We have provided artifacts in this post to deploy the AWS services in your account and try out the solution.

Prerequisites

To follow the deployment walkthrough, you need two AWS accounts: one for the producer and the other for the consumer. Use us-east-1 or us-west-2 as your AWS Region.

Consumer account setup:

Stage the data

To prepare the data, complete the following steps:

  1. Download the zipped archive file to use for this solution and unzip the files locally.

The AWS Glue job uses the glue-job.py script to perform ETL and populates the curated table in the Data Catalog.

  2. Create an S3 bucket called consumer-configbucket-<ACCOUNT_ID> via the Amazon S3 console in the consumer account, where ACCOUNT_ID is your AWS account ID.
  3. Upload the script to this location.

Create a connection to Salesforce

Follow the connection setup steps outlined here. Make a note of the Salesforce connector name.

Create a connection to Salesforce in the consumer account

Follow the connection setup steps outlined in Create Opportunity Object Flow.

Set up resources with AWS CloudFormation

We provided two AWS CloudFormation templates to create resources: one for the producer account, and one for the consumer account.

Amazon S3 now applies server-side encryption with Amazon S3 managed keys (SSE-S3) as the base level of encryption for every bucket in Amazon S3. Starting January 5, 2023, all new object uploads to Amazon S3 are automatically encrypted at no additional cost and with no impact on performance. We use this default encryption for both producer and consumer S3 buckets. If you choose to bring your own keys with AWS Key Management Service (AWS KMS), we recommend referring to Replicating objects created with server-side encryption (SSE-C, SSE-S3, SSE-KMS) for cross-account replication.

Launch the CloudFormation stack in the consumer account

Let’s start with creating resources in the consumer account. There are a few dependencies on the consumer account resources from the producer account. To launch the CloudFormation stack in the consumer account, complete the following steps:

  1. Sign in to the consumer account’s AWS CloudFormation console in the target Region.
  2. Choose Launch Stack.
  3. Choose Next.
  4. For Stack name, enter a stack name, such as stack-appflow-consumer.
  5. Enter the parameters for the connector name, object, and producer (source) account ID.
  6. Choose Next.
  7. On the next page, choose Next.
  8. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  9. Choose Create stack.

Stack creation takes approximately 5 minutes to complete. It will create the following resources. You can find them on the Outputs tab of the CloudFormation stack.

  • ConsumerS3Bucket: consumer-databucket-<consumer account id>
  • Consumer S3 Target Folder: marketo-leads-source
  • ConsumerEventBusArn: arn:aws:events:<region>:<consumer account id>:event-bus/consumer-custom-event-bus
  • ConsumerEventRuleArn: arn:aws:events:<region>:<consumer account id>:rule/consumer-custom-event-bus/consumer-custom-event-bus-rule
  • ConsumerStepFunction: arn:aws:states:<region>:<consumer account id>:stateMachine:consumer-state-machine
  • ConsumerGlueCrawler: consumer-glue-crawler
  • ConsumerGlueJob: consumer-glue-job
  • ConsumerGlueDatabase: consumer-glue-database
  • ConsumerAppFlow: arn:aws:appflow:<region>:<consumer account id>:flow/consumer-appflow

Producer account setup:

Create a connection to Marketo

Follow the connection setup steps outlined here. Make a note of the Marketo connector name.

Launch the CloudFormation stack in the producer account

Now let’s create resources in the producer account. Complete the following steps:

  1. Sign in to the producer account’s AWS CloudFormation console in the source Region.
  2. Choose Launch Stack.
  3. Choose Next.
  4. For Stack name, enter a stack name, such as stack-appflow-producer.
  5. Enter the following parameters and leave the rest as default:
    • AppFlowMarketoConnectorName: name of the Marketo connector, created above
    • ConsumerAccountBucket: consumer-databucket-<consumer account id>
    • ConsumerAccountBucketTargetFolder: marketo-leads-source
    • ConsumerAccountEventBusArn: arn:aws:events:<region>:<consumer account id>:event-bus/consumer-custom-event-bus
    • DefaultEventBusArn: arn:aws:events:<region>:<producer account id>:event-bus/default


  6. Choose Next.
  7. On the next page, choose Next.
  8. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  9. Choose Create stack.

Stack creation takes approximately 5 minutes to complete. It will create the following resources. You can find them on the Outputs tab of the CloudFormation stack.

  • Producer AppFlow: producer-flow
  • Producer Bucket: arn:aws:s3:::producer-bucket.<region>.<producer account id>
  • Producer Flow Completion Rule: arn:aws:events:<region>:<producer account id>:rule/producer-appflow-completion-event
  • Producer Step Function: arn:aws:states:<region>:<producer account id>:stateMachine:ProducerStateMachine-xxxx
  • Producer Step Function Role: arn:aws:iam::<producer account id>:role/service-role/producer-stepfunction-role
  10. After the resources are created successfully, go to the consumer account S3 bucket, consumer-databucket-<consumer account id>, and update the bucket policy as follows:
{
    "Version": "2008-10-17",
    "Statement": [
        {
            "Sid": "AllowAppFlowDestinationActions",
            "Effect": "Allow",
            "Principal": {"Service": "appflow.amazonaws.com"},
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::consumer-databucket-<consumer-account-id>",
                "arn:aws:s3:::consumer-databucket-<consumer-account-id>/*"
            ]
        }, {
            "Sid": "Producer-stepfunction-role",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<producer-account-id>:role/service-role/producer-stepfunction-role"
            },
            "Action": [
                "s3:ListBucket",
                "s3:GetObject",
                "s3:PutObject",
                "s3:PutObjectAcl"
            ],
            "Resource": [
                "arn:aws:s3:::consumer-databucket-<consumer-account-id>",
                "arn:aws:s3:::consumer-databucket-<consumer-account-id>/*"
            ]
        }
    ]
}

Validate the workflow

Let’s walk through the flow:

  1. Review the Marketo and Salesforce connection setup in the producer and consumer accounts, respectively.

In the architecture section, we suggested scheduling the AppFlow (producer-flow) in the producer account. However, for quick testing purposes, we demonstrate how to manually run the flow on demand.
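You can also start the flow programmatically instead of from the console. The following is a minimal sketch using boto3; the flow name producer-flow comes from the stack outputs.

import boto3

appflow = boto3.client("appflow")

# Trigger an on-demand run of the producer flow
response = appflow.start_flow(flowName="producer-flow")
print(response["flowStatus"], response.get("executionId"))

To run the flow from the console instead, continue with the following steps.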

  2. Go to the AppFlow (producer-flow) in the producer account. On the Filters tab of the flow, choose Edit filters.
  3. Choose the Created At date range for which you have data.
  4. Save the range and choose Run flow.
  5. Review the producer S3 bucket.

AppFlow generates the files in the producer-flow prefix within this bucket. The files are temporarily located in the producer S3 bucket under s3://<producer-bucket>.<region>.<account-id>/producer-flow.

  6. Review the EventBridge rule and Step Functions state machine in the producer account.

The Amazon AppFlow job completion triggers an EventBridge rule (arn:aws:events:<region>:<producer account id>:rule/producer-appflow-completion-event, as noted in the Outputs tab of the CloudFormation stack in the producer account), which triggers the Step Functions state machine (arn:aws:states:<region>:<producer account id>:stateMachine:ProducerStateMachine-xxxx) in the producer account. The state machine copies the files to the consumer S3 bucket from the producer-flow prefix in the producer S3 bucket. After the file copy is complete, the state machine moves the files from the producer-flow prefix to the archive prefix in the producer S3 bucket. You can find the files in s3://<producer-bucket>.<region>.<account-id>/archive.
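Conceptually, the copy-and-archive step performed by the state machine looks like the following sketch, shown here in Python with boto3 purely for illustration; the deployed state machine implements this with its own task states and the producer-stepfunction-role.

import boto3

s3 = boto3.resource("s3")

producer_bucket = "producer-bucket.<region>.<producer account id>"
consumer_bucket = "consumer-databucket-<consumer account id>"

# Copy each leads file to the consumer bucket, then archive it in the producer bucket
for obj in s3.Bucket(producer_bucket).objects.filter(Prefix="producer-flow/"):
    source = {"Bucket": producer_bucket, "Key": obj.key}
    file_name = obj.key.split("/")[-1]
    s3.Object(consumer_bucket, f"marketo-leads-source/{file_name}").copy_from(CopySource=source)
    s3.Object(producer_bucket, obj.key.replace("producer-flow/", "archive/", 1)).copy_from(CopySource=source)
    obj.delete()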

  7. Review the consumer S3 bucket.

The Step Functions state machine in the producer account copies the files to the consumer S3 bucket and sends an event to EventBridge in the consumer account. The files are located in the consumer S3 bucket under s3://consumer-databucket-<account-id>/marketo-leads-source/.

  8. Review the EventBridge rule (arn:aws:events:<region>:<consumer account id>:rule/consumer-custom-event-bus/consumer-custom-event-bus-rule) in the consumer account, which should have triggered the Step Functions workflow (arn:aws:states:<region>:<consumer account id>:stateMachine:consumer-state-machine).

The AWS Glue crawler (consumer-glue-crawler) runs to update the metadata, followed by the AWS Glue job (consumer-glue-job), which curates the data by applying the Do Not Call filter. The curated files are placed in s3://consumer-databucket-<account-id>/marketo-leads-curated/. After data curation, the flow is started as part of the state machine.

  9. Review the Amazon AppFlow job (arn:aws:appflow:<region>:<consumer account id>:flow/consumer-appflow) run status in the consumer account.

Upon a successful run of the Amazon AppFlow job, the curated data files are moved to the s3://consumer-databucket-<account-id>/marketo-leads-processed/ folder and Salesforce is updated with the leads. Additionally, all the original source files are moved from s3://consumer-databucket-<account-id>/marketo-leads-source/ to s3://consumer-databucket-<account-id>/marketo-leads-archive/.

  10. Review the updated data in Salesforce.

You will see the leads newly created or updated by Amazon AppFlow.

Clean up

To clean up the resources created as part of this post, delete the following resources:

  1. Delete the resources in the producer account:
    • Delete the producer S3 bucket content.
    • Delete the CloudFormation stack.
  2. Delete the resources in the consumer account:
    • Delete the consumer S3 bucket content.
    • Delete the CloudFormation stack.

Summary

In this post, we showed how you can support a cross-account model to exchange data between different partners with different SaaS integrations using Amazon AppFlow. You can expand this idea to support multiple target accounts.

For more information, refer to Simplifying cross-account access with Amazon EventBridge resource policies. To learn more about Amazon AppFlow, visit Amazon AppFlow.


About the authors

Ramakant Joshi is an AWS Solutions Architect, specializing in the analytics and serverless domain. He has a background in software development and hybrid architectures, and is passionate about helping customers modernize their cloud architecture.

Debaprasun Chakraborty is an AWS Solutions Architect, specializing in the analytics domain. He has around 20 years of software development and architecture experience. He is passionate about helping customers in cloud adoption, migration and strategy.

Suraj Subramani Vineet is a Senior Cloud Architect at Amazon Web Services (AWS) Professional Services in Sydney, Australia. He specializes in designing and building scalable and cost-effective data platforms and AI/ML solutions in the cloud. Outside of work, he enjoys playing soccer on weekends.

Detecting solar panel damage with Amazon Rekognition Custom Labels

Post Syndicated from Ramakant Joshi original https://aws.amazon.com/blogs/architecture/detecting-solar-panel-damage-with-amazon-rekognition-custom-labels/

Enterprises perform quality control to ensure products meet production standards and avoid potential brand reputation damage. As the cost of sensors decreases and connectivity increases, industries adopt real-time imagery analysis to detect quality issues.

At the same time, artificial intelligence (AI) advancements enable advanced automation, reduce overall cost and project time, and produce accurate defect detection results in manufacturing plants. As these technologies mature, AI-driven inspections are more common outside of the plant environment.

Overview of solution

This post describes our SOLVED (Solar Roving Eye Detector) project, which leverages machine learning (ML) to identify damaged solar panels using Amazon Rekognition Custom Labels and alerts operators to take corrective action.

As solar adoption increases, so does the need to detect panel damage. Applying AWS-managed AI services is a simpler, more cost-effective approach than human solar panel inspection or custom-built production applications.

Customers can capture and process videos from the field and build effective computer vision models without creating a dedicated data science team. This approach can be generalized for use cases across industries to detect defects in wind turbines, cell phone towers, automotive parts, and other field components.

Amazon Rekognition Custom Labels builds on existing service capabilities that are already trained to identify the objects and scenes in millions of cross-category images. You upload a small set of training images (typically a few hundred or fewer) into the console. The solution automatically loads and inspects the training data, selects the right ML algorithms, trains a model, and provides model performance metrics. You can then integrate your custom model into your applications through the Amazon Rekognition Custom Labels API.
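For example, once a trained model version is running, an application can call the API along these lines. This is a minimal sketch; the project version ARN, bucket, and image key are placeholders.

import boto3

rekognition = boto3.client("rekognition")

# Analyze a solar panel image stored in Amazon S3 with the custom model
response = rekognition.detect_custom_labels(
    ProjectVersionArn="arn:aws:rekognition:<region>:<account-id>:project/<project-name>/version/<version-name>/<timestamp>",
    Image={"S3Object": {"Bucket": "<image-bucket>", "Name": "frames/panel-001.jpg"}},
    MinConfidence=80,
)

for label in response["CustomLabels"]:
    print(label["Name"], round(label["Confidence"], 1))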

Walkthrough

This post introduces the SOLVED project featured at the re:Invent 2021 Builders Fair. It will:

  • Review the need for solar panel damage detection
  • Discuss a cloud-based approach to ingest, store, process, analyze, and detect damaged solar panels
  • Present an architecture for streaming videos from a Raspberry Pi, storing them on Amazon Simple Storage Service (Amazon S3), processing them using an AWS video-on-demand solution, and inferring damage using Amazon Rekognition
  • Introduce a console to mimic an operation center for appropriate action
  • Demonstrate the integration of AWS IoT Core with a Philips Hue bulb for operator alerts

Prerequisites

Before getting started, review the following prerequisites for this solution:

The SOLVED project

The SOLVED project leverages ML to identify damaged solar panels using Amazon Rekognition Custom Labels. It involves four steps:

  1. Data ingestion: Live solar panel video ingested from a moving rover into an Amazon S3 bucket
  2. Pre-processing: Captured video split into thumbnail images
  3. Processing and visualization: ML models making real-time inferences to identify defective panels, with a dashboard to review images and prediction scores
  4. Alerting: Defective panels result in a notification sent through MQTT messages to light a smart bulb

Figure 1 shows the SOLVED project system architecture.

Figure 1. The SOLVED project system architecture

Installation steps

Let’s review each of the steps in this use case.

Data ingestion

The data ingestion layer of the SOLVED project consists of a continuous video stream captured as a rover moves through a field of solar panels.

We used a Freenove 4WD Smart Car rover with Raspberry Pi. The mounted camera captures video as it moves through the field. We installed an Amazon Kinesis Video Streams Producer on the Pi and streamed the live video to a Kinesis Video Stream named reinventbuilder2021.

Figure 2 shows the Kinesis Video Stream setup window for reinventbuilder2021.

Figure 2. Kinesis Video Stream setup for reinventbuilder2021

To start streaming, use the following steps.

  1. Create a new Kinesis Video Stream using this Amazon Kinesis Video Streams Developer Guide
  2. Make a note of the Amazon Resource Name (ARN)
  3. On the Pi, access the command prompt and use aws sts get-session-token for temporary credentials. The IAM user should have the permissions for Kinesis Video Streams PutMedia.
  4. Set the following environment variables:
    export AWS_DEFAULT_REGION="us-east-1"
    export AWS_ACCESS_KEY_ID="xxxxx"
    export AWS_SECRET_ACCESS_KEY="yyyyy"
    export AWS_SESSION_TOKEN="zzzzz"
  5. Start the streamer using the following command:
    cd ~/amazon-kinesis-video-streams-producer-sdk-cpp/build
    ./kvs_gstreamer_sample reinventbuilder2021
  6. Validate the captured stream by viewing the Media playback on the console.

Figure 3 shows the video stream console, including the Media playback option.

Figure 3. Video stream console with Media playback option

There are two ways to clip video snippets, which we’ll do next.

You can use the Download clip button on the video stream console as shown in Figure 4.

Figure 4. Choose your video streaming clip duration

Alternately, you can use a script from the following command line:

# KVS_DATA_ENDPOINT must be set to the stream's data endpoint, for example:
# KVS_DATA_ENDPOINT=$(aws kinesisvideo get-data-endpoint --stream-name reinventbuilder2021 --api-name GET_CLIP --query DataEndpoint --output text)
ONE_MIN_AGO=$(date -v -30S -u "+%FT%T+0000")
NOW=$(date -u "+%FT%T+0000")

FILE_NAME=reinventbuilder-solved-$RANDOM.mp4
echo $FILE_NAME
S3_PATH=s3://videoondemandsplitter-source-e6lyof9qjv1j/

aws kinesis-video-archived-media get-clip --endpoint-url $KVS_DATA_ENDPOINT \
--stream-name reinventbuilder2021 \
--clip-fragment-selector "FragmentSelectorType=SERVER_TIMESTAMP,TimestampRange={StartTimestamp=$ONE_MIN_AGO,EndTimestamp=$NOW}" \
$FILE_NAME

echo "Running get-clip for stream"

sleep 45

aws s3 cp $FILE_NAME $S3_PATH
echo "copying file $FILE_NAME TO $S3_PATH"

The clip is available in the Amazon S3 source folder created using AWS CloudFormation, as shown in Figure 5.

Figure 5. Access your clip in the Amazon S3 source folder

Pre-processing

To process the video, we leverage Video on Demand at AWS. This solution encodes video files with AWS Elemental MediaConvert. Out of the box, it:

1. Automatically transcodes videos uploaded to Amazon S3 into formats suitable for playback on a range of devices using MediaConvert
2. Customizes MediaConvert job settings by uploading a custom file and using different settings per input
3. Stores transcoded files in a destination Amazon S3 bucket and uses CloudFront to deliver them to end viewers
4. Provides outputs including input file metadata, job settings, and output details in addition to transcoded video. These outputs are stored in a separate JSON file, available for further processing

For our use case, we used the frame capture feature to create a set of thumbnails from the source videos. The thumbnails are stored in the Amazon S3 bucket with the video output.

To deploy this solution, use the CloudFormation stack.

Processing and visualization

Every trained ML model requires quality training data. We began with publicly available solar panel images that were categorized as “good” or “defective” and uploaded the images to an Amazon S3 bucket into corresponding folders.

Next, we configured Amazon Rekognition Custom Labels with the folders to indicate the labels to use in training and deploying the model. Using the rover images, we tested the model.

We used the rover to record videos of good and damaged solar panels over an extended period and labeled the outcomes accordingly. The video was then split into individual frames using MediaConvert, giving us a well-labeled dataset that we trained our model with using Amazon Rekognition Custom Labels.

We used the model endpoint to infer outcomes on solar panels with varying damage footprints across multiple locations. AWS Elemental MediaConvert expedited the process of curating the training set, and creating the model and endpoint using Amazon Rekognition was straightforward.

As shown in Figure 6, we used a training set of 7,000 images with an even mix of good and damaged panels.

Figure 6. A training set of images

Examples of good panel images are depicted in Figure 7.

Figure 7. Good panel images

Examples of damaged panel images are depicted in Figure 8.

Figure 8. Damaged panel images

In this use case, 90 percent model accuracy was achieved.

To visualize the results, we leveraged AWS Amplify to provide an operator interface to identify the damaged panels.

Figure 9 shows screenshots from the operator dashboard with output from the Amazon Rekognition Custom Labels model for good and defective panels.

Figure 9. Operator dashboard in AWS Amplify

Alerting

Maintenance teams must be notified of defective panels to take corrective action. To create alerts, we configured AWS IoT Core to send MQTT messages to a Philips Hue smart bulb, with red bulbs indicating defective panels. To set up the Philips Hue API, use the How to develop for Hue guide.

For example, here’s the API to change color:

PUT https://192.xx.xx.xx/api/xxxxxxx/lights/1/state

A request body of {"on":true, "sat":254, "bri":254, "hue":20000} turns the bulb green, and {"on":true, "sat":254, "bri":254, "hue":1000} turns it red.

We set up a client on the Pi that listens on an AWS IoT Core MQTT topic and makes an API request to Philips Hue.
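The client code isn't included in the post, but it can be sketched roughly as follows using the paho-mqtt library and the device certificates created in the next section. The topic name, message format, certificate file names, IoT endpoint, Hue bridge IP, and API username are all placeholders.

import json
import ssl

import paho.mqtt.client as mqtt
import requests

HUE_URL = "https://192.xx.xx.xx/api/<hue-username>/lights/1/state"
RED = {"on": True, "sat": 254, "bri": 254, "hue": 1000}

def on_message(client, userdata, msg):
    # Expect a payload such as {"status": "DEFECTIVE"} on the alert topic
    payload = json.loads(msg.payload)
    if payload.get("status") == "DEFECTIVE":
        requests.put(HUE_URL, json=RED, verify=False)  # Hue bridge uses a self-signed certificate

client = mqtt.Client()
client.tls_set(ca_certs="AmazonRootCA1.pem",
               certfile="device.pem.crt",
               keyfile="private.pem.key",
               tls_version=ssl.PROTOCOL_TLSv1_2)
client.on_message = on_message
client.connect("<your-iot-endpoint>-ats.iot.us-east-1.amazonaws.com", 8883)
client.subscribe("solved/panel-status", qos=1)
client.loop_forever()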

To connect a device to AWS IoT, complete these steps:

  1. Create an IoT thing, a device certificate, and an AWS IoT policy. An AWS IoT thing represents a physical device (in this case, Raspberry Pi) and contains static device metadata, as shown in Figure 10.
    Figure 10. AWS IoT Thing

  2. Create a device certificate, required to connect to and authenticate with AWS IoT. An example is shown in Figure 11.

Figure 11. Device certificate

  3. Associate an AWS IoT policy with each device certificate. Policies determine which AWS IoT resources the device can access. In this case, we allowed iot.*, giving the device access to all IoT resources, as shown in Figure 12.

Figure 12. IoT policy

Devices and other clients use an AWS IoT root CA certificate to authenticate the server they’re communicating with. For more on how devices authenticate with AWS IoT Core, see Server authentication in the AWS IoT Core Developer Guide. Copy the certificate chain to the Raspberry Pi.

For communication with the Philips Hue, we used the Qhue wrapper as shown in Figure 13.

Figure 13. Qhue wrapper

The authors presented a demo of this solution at re:Invent 2021 Builder’s Fair.

Figure 14. Author demo at re:Invent 2021 Builder’s Fair

Clean up

If you used the CloudFormation stack, delete it to avoid unexpected future charges. Delete Amazon S3 buckets and terminate Amazon Rekognition jobs to stop accruing charges.

Conclusion

Amazon Rekognition helps customers collect images in the field and apply AI-based analysis to interpret the condition of assets within the images.

In this post, you learned how to configure the Kinesis Video Stream producer on a Raspberry Pi to upload captured videos to Amazon Kinesis Video streams. You also learned how to save video streams to Amazon S3 and leverage the Video on Demand at AWS solution.

Using AWS Elemental MediaConvert, we transcoded the videos and created a set of thumbnails from the source videos. We then used Amazon Rekognition Custom Labels to train and deploy models for solar panel damage detection. Finally, we configured AWS IoT Core to send MQTT messages to a Philips Hue smart bulb for notifications.

In this post, we presented a serverless architecture on AWS to detect defective solar panels. The reference architecture diagram is adaptable to solve inspection and damage detection problems across other industries.

Modernization pathways for a legacy .NET Framework monolithic application on AWS

Post Syndicated from Ramakant Joshi original https://aws.amazon.com/blogs/architecture/modernization-pathways-for-a-legacy-net-framework-monolithic-application-on-aws/

Organizations aim to deliver optimal technological solutions based on their customers’ needs. Although they may be at any stage in their cloud adoption journey, businesses often end up managing and building monolithic applications. However, there are many challenges to this solution. The internal structure of a monolithic application makes it difficult for developers to maintain code. This creates a steep learning curve for new developers and increases costs. Monoliths require multiple teams to coordinate a single large release, which increases the collaboration and knowledge transfer burden. As a business grows, a monolithic application may struggle to meet the demands of an expanding user base. To address these concerns, customers should evaluate their readiness to modernize their applications in the AWS Cloud to meet their business and technical needs.

We will discuss an approach to modernizing a monolithic three-tier application (MVC pattern): a web tier, an application tier using a .NET Framework, and a data tier with a Microsoft SQL (MSSQL) Server relational database. There are three main modernization pathways for .NET applications: rehosting, replatforming, and refactoring. We recommend following this decision matrix to assess and decide on your migration path, based on your specific requirements. For this blog, we will focus on a replatform and refactor strategy to design loosely coupled microservices, packaged as lightweight containers, and backed by a purpose-built database.

Your modernization journey

The outcome of your organization’s approach to modernization gives you the ability to scale optimally with your customers’ demands. Let’s dive into a guided approach that achieves your goals of a modern architecture, and at the same time addresses scalability, ease of maintenance, rapid deployment cycles, and cost optimization.

This involves four steps:

  1. Break down the monolith
  2. Containerize your application
  3. Refactor to .NET 6
  4. Migrate to a purpose-built, lower-cost database engine

1. Break down the monolith

Migration to the Amazon Web Services (AWS) Cloud has many advantages. These can include increased speed to market and business agility, new revenue opportunities, and cost savings. To take full advantage, you should continuously modernize your organization’s applications by refactoring your monolithic applications into microservices.

Decomposing a monolithic application into microservices presents technical challenges that require a solid understanding of the existing code base and context of the business domains. Several patterns are useful to incrementally transform a monolithic application into microservices and other distributed designs. However, the process of refactoring the code base is manual, risky, and time consuming.

To help developers accelerate the transformation, AWS introduced AWS Microservice Extractor for .NET. This helps break down and refactor monolithic applications into smaller code projects. Read how AWS Microservice Extractor for .NET helped our partner, Kloia, accelerate the modernization journey of their customers and decompose a monolith.

The next modernization pathway is to containerize your application.

2. Containerize

Why should you move to containers? Containers offer a way to help you build, test, deploy, and redeploy applications on multiple environments. Specifically, Docker Containers provide you with a reliable way to gather your application components and package them together into one build artifact. This is important because modern applications are often composed of a variety of pieces besides code, such as dependencies, binaries, or system libraries. Moving legacy .NET Framework applications to containers helps to optimize operating system utilization and achieve runtime consistency.

To accelerate this process, containerize these applications to Windows containers with AWS App2Container (A2C). A2C is a command line tool for modernizing .NET and Java applications into containerized applications. A2C analyzes and builds an inventory of all applications running in virtual machines, on-premises, or in the cloud. Select the application that you want to containerize, and A2C packages the application artifact and identified dependencies into container images. Here is a step-by-step article and self-paced workshop to get you started using A2C.

Once your app is containerized, you can choose to self-manage by using Amazon EC2 to host Docker with Windows containers. You can also use Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS). These are fully managed container orchestration services that free you to focus on building and managing applications instead of your underlying infrastructure. Read Amazon ECS vs Amazon EKS: making sense of AWS container services.

In the next section, we’ll discuss two primary aspects to optimizing costs in our modernization scenario:

  1. Licensing costs of running workloads on Windows servers.
  2. SQL Server licensing cost.

3. Refactor to .NET 6

To address Windows licensing costs, consider moving to a Linux environment by adopting .NET Core and using the Dockerfile for a Linux Container. Customers such as GoDataFeed benefit by porting .NET Framework applications to more recent .NET 6 and running them on AWS. The .NET team has significantly improved performance with .NET 6, including a 30–40% socket performance improvement on Linux. They have added ARM64-specific optimizations in the .NET libraries, which enable customers to run on AWS Graviton.

You may also choose to switch to a serverless option using AWS Lambda (which supports .NET 6 runtime), or run your containers on ECS with Fargate, a serverless, pay-as-you-go compute engine. AWS Fargate powered by AWS Graviton2 processors can reduce cost by up to 20%, and increase performance by up to 40% versus x86 Intel-based instances. If you need full control over an application’s underlying virtual machine (VM), operating system, storage, and patching, run .NET 6 applications on Amazon EC2 Linux instances. These are powered by the latest-generation Intel and AMD processors.

To help customers port their application to .NET 6 faster, AWS added .NET 6 support to Porting Assistant for .NET. Porting Assistant is an analysis tool that scans .NET Framework (3.5+) applications to generate a target .NET Core or .NET 6 compatibility assessment. This helps you to prioritize applications for porting based on effort required. It identifies incompatible APIs and packages from your .NET Framework applications, and finds known replacements. You can refer to a demo video that explains this process.

4. Migrate from SQL Server to a lower-cost database engine

AWS advocates that you build use case-driven, highly scalable, distributed applications suited to your specific needs. From a database perspective, AWS offers 15+ purpose-built engines to support diverse data models. Furthermore, microservices architectures employ loose coupling, so each individual microservice can independently store and retrieve information from its own data store. By deploying the database-per-service pattern, you can choose the most optimal data stores (relational or non-relational databases) for your application and business requirements.

For the purpose of this blog, we will focus on a relational database alternate for SQL Server. To address the SQL Server licensing costs, customers can consider a move to an open-source relational database engine. Amazon Relational Database Service (Amazon RDS) supports MySQL, MariaDB, and PostgreSQL. We will focus on PostgreSQL with a well-defined migration path. Amazon RDS supports two types of Postgres databases: Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL-Compatible Edition. To help you choose, read Is Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL a better choice for me?

Once you’ve decided on the Amazon RDS flavor, the next question would be “what’s the right migration strategy for me?” Consider the following:

  1. Convert your schema
  2. Migrate the data
  3. Refactor your application

Schema conversion

AWS Schema Conversion Tool (SCT) is a free tool that can help you convert your existing database from one engine to another. AWS SCT supports a number of source databases, including Microsoft SQL Server, Oracle, and MySQL. You can choose from target database engines such as Amazon Aurora PostgreSQL-Compatible Edition, or choose to set up a data lake using Amazon S3. AWS SCT provides a graphical user interface that directly connects to the source and target databases to fetch the current schema objects. When connected, you can generate a database migration assessment report to get a high-level summary of the conversion effort and action items.

Data migration

When the schema migration is complete, you can move your data from the source database to the target database. Depending on your application availability requirements, you can run a straightforward extraction job that performs a one-time copy of the source data into the new database. Or, you can use a tool that copies the current data and continues to replicate all changes until you are ready to cut over to the new database. One such tool is AWS Database Migration Service (AWS DMS) that helps you migrate relational databases, data warehouses, NoSQL databases, and other types of data stores.

With AWS DMS, you can perform one-time migrations, and you can replicate ongoing changes to keep sources and targets in sync. When the source and target databases are in sync, you can take your database offline and move your operations to the target database. Read Microsoft SQL Server To Amazon Aurora with PostgreSQL Compatibility for a playbook or use this self-guided workshop to migrate to a PostgreSQL compatible database using SCT and DMS.

Application refactoring

Each database engine has its differences and nuances, and moving to a new database engine such as PostgreSQL from MSSQL Server will require code refactoring. After the initial database migration is completed, manually rewriting application code, switching out database drivers, and verifying that the application behavior hasn’t changed requires significant effort. This involves potential risk of errors when making extensive changes to the application code.

AWS built Babelfish for Aurora PostgreSQL to simplify migrating applications from SQL Server to Amazon Aurora PostgreSQL-Compatible Edition. Babelfish for Aurora PostgreSQL is a new capability for Amazon Aurora PostgreSQL-Compatible Edition that enables Aurora to understand commands from applications written for Microsoft SQL Server. With Babelfish, Aurora PostgreSQL now understands T-SQL, Microsoft SQL Server’s proprietary SQL dialect. It supports the same communications protocol, so your apps that were originally written for SQL Server can now work with Aurora. Read about how to migrate from SQL Server to Babelfish for Aurora PostgreSQL. Make sure you run the Babelfish Compass tool to determine whether the application contains any SQL features not currently supported by Babelfish.

Figure 1 shows the before and after state for your application based on the modernization path described in this blog. The application tier consists of microservices running on Amazon ECS Fargate clusters (or AWS Lambda functions), and the data tier runs on Amazon Aurora (PostgreSQL flavor).

Figure 1. A modernized microservices-based rearchitecture

Summary

In this post, we showed a migration path for a monolithic .NET Framework application to a modern microservices-based stack on AWS. We discussed AWS tools to break the monolith into microservices, and containerize the application. We also discussed cost optimization strategies by moving to Linux-based systems, and using open-source database engines. If you’d like to know more about modernization strategies, read this prescriptive guide.