Tag Archives: Architecture

Optimize Cost by Automating the Start/Stop of Resources in Non-Production Environments

2021-12-23 Ashutosh Pateriya

Post Syndicated from Ashutosh Pateriya original https://aws.amazon.com/blogs/architecture/optimize-cost-by-automating-the-start-stop-of-resources-in-non-production-environments/

Co-authored with Nirmal Tomar, Principal Consultant, Infosys Technologies Ltd.

Ease of creating on-demand resources on AWS can sometimes lead to over-provisioning or under-utilization of AWS resources like Amazon EC2 and Amazon RDS. This can lead to higher costs that can often be avoided with proper planning and monitoring. Non-critical environments, like development and test are often not monitored on a regular basis and can result in under-utilization of AWS resources.

In this blog, we discuss a common AWS cost optimization strategy, which is an automated and deployable solution to schedule the start/stop of AWS resources. For this example, we are considering non-production environments because in most scenarios these do not need to be available at all time. By following this solution, cloud architects can automate the start/stop of services per their usage pattern and can save up to 70% of costs while running their non-production environment.

The solution outlined here is designed to automatically stop and start the following AWS services based on your needs; Amazon RDS, Amazon Aurora, Amazon EC2, Auto Scaling groups, AWS Beanstalk, and Amazon EKS. The solution is automated using AWS Step functions, AWS Lambda, AWS CloudFormation Templates, Amazon EventBridge and AWS Identity and Access Management (IAM).

In this solution, we also provide an option to exclude specific Amazon Resource Names (ARNs) of the aforementioned services. This helps cloud architects to exclude the start/stop function for various use cases like in a QA environment when they don’t want to stop Aurora or they want to start RDS in a Development environment. The solution can be used to start/stop the services mentioned previously on a scheduled interval but can also be used for other applicable services like Amazon ECS, Amazon SageMaker Notebook Instances, Amazon Redshift and many more.

Note – Don’t set up this solution in a production or other environment where you require continuous service availability.

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account with permission to spin up required resources.
A running Amazon Aurora instance in the source AWS account.

Walkthrough

To set up this solution, proceed with the following two steps:

Set up the step function workflow to stop services using a CloudFormation template. On a scheduled interval, this workflow will run and stop the chosen services.
Set up the step function workflow to start services using a CloudFormation template. On a scheduled interval, this workflow will run and start services as configured during the CloudFormation setup.

1. Stop Services using the Step Function workflow for a predefined duration

Figure 1 – Architecture showing the AWS Step Functions Workflow to stop services

The AWS Lambda functions involved in this workflow:

StopAuroraCluster: This Lambda function will stop all Aurora Cluster setup across Region including read replica.
StopRDSInstances: This Lambda function will stop all RDS Instances except the Aurora setup across the Region.
ScaleDownEKSNodeGroups: This Lambda function will downsize all nodegroups to zero instance across the Region.
ScaleDownASG: This Lambda function will downsize all Auto Scaling groups including the Elastic Beanstalk Auto Scaling group to zero instance across Region. We can edit CloudFormation templates to include the custom value.
StopEC2Instances: This Lambda function will stop all EC2 instances set up across the Region.

Using the following AWS CloudFormation Template, we set up the required services and workflow:

a. Launch the template in the source account and source Region:

Specify stack details screenshot

b. Fill out the preceding form with the following details and select Next.

Stack name: Stack Name which you want to create.

ExcludeAuroraClusterArnListInCommaSeprated: Comma separated Aurora clusters ARN which enterprises don’t want to stop, keep the default value if there is no exclusion list.

e.g. arn:aws:rds:us-east-1:111111111111:cluster:aurorcluster1, arn:aws:rds:us-east-2:111111111111:cluster:auroracluster2

ExcludeRDSDBInstancesArnListInCommaSeprated: Comma separated DB instances ARN which enterprises don’t want to stop, keep the default value if there is no exclusion list.

e.g. arn:aws:rds:us-east-1:111111111111:db:rds-instance-1, arn:aws:rds:us-east-2:111111111111:db:rds-instance-2

ExcludeEKSClusterNodeGroupsArnListInCommaSeprated: Comma separated EKS Clusters ARN which enterprises don’t want to stop, leave it with default value if there is no exclusion list.

e.g. arn:aws:eks:us-east-2:111111111111:cluster/testcluster

ExcludeAutoScalingGroupIncludingBeanstalkListInCommaSeprated: Comma separated Beanstalk and other Auto Clusters groups ARN (except Amazon EKS) which enterprises don’t want to stop, keep the default value if there is no exclusion list.

e.g. arn:aws:autoscaling:us-east-1:111111111111:autoScalingGroup:6d5af669-eb3b-4530-894b-e314a667f2e7:autoScalingGroupName/test-0-ASG

ExcludeEC2InstancesIdListInCommaSeprated: Comma separated EC2 instance ID’s which you don’t want to stop, keep the default value if there is no exclusion list.

e.g. i-02185df0872f0f852, 0775f7e39513c50dd

ScheduleExpression: Schedule a cron expression when you want to run this workflow. Sample expressions are available in this guide, Schedule expressions using rate or cron.

c. Select IAM role to launch this template. As a best practice, select the AWS CloudFormation service role to manage AWS services and resources available to each user.

Permissions screenshot

d. Acknowledge that you want to create various resources including IAM roles/policies and select Create Stack.

d. Acknowledge to create various resources including IAM roles/policies and select Create Stack.

2. Start Services using the Step Function workflow in pre-configured time

Figure 2 – Architecture showing the AWS Step Functions Workflow to start services

The Lambda functions involved in this workflow:

StartAuroraCluster: This Lambda function will start all Aurora Cluster setup across Region including read-replica.
StartRDSInstances: This Lambda function will start all RDS Instances except for the Aurora setup across the Region.
ScaleUpEKSNodeGroups: This Lambda function will upsize all nodegroups to minimum 2 and maximum 4 instances across Region. We can edit CloudFormation templates for custom value.
ScaleUpASG: This Lambda function will Scale up all Auto Scaling group including Elastic Beanstalk Auto Scaling group to minimum 2 and maximum 4 instances across the Region. We can edit CloudFormation templates for custom value.
StartEC2Instances: This Lambda function will start all EC2 instances setup across the Region.

Using the following AWS CloudFormation template, we set up the required services and workflow:

a. Launch the template in the source account and source Region:

b. Fill out the preceding form with the following details and select Next.

Stack details screenshot

Stack name: Stack Name which you want to create.

ExcludeAuroraClusterArnListInCommaSeprated: Comma separated Aurora clusters the ARN which you don’t want to start, keep the default value if there is no exclusion list.

For example: arn:aws:rds:us-east-1:111111111111:cluster:aurorcluster1, arn:aws:rds:us-east-2:111111111111:cluster:auroracluster2

ExcludeRDSDBInstancesArnListInCommaSeprated: Comma separated databaseinstances ARN which you don’t want to start, keep default value if there is no exclusion list.

For example: arn:aws:rds:us-east-1:111111111111:db:rds-instance-1, arn:aws:rds:us-east-2:111111111111:db:rds-instance-2

ExcludeEKSClusterNodeGroupsArnListInCommaSeprated: Comma separated EKS Clusters ARN which you don’t want to start, keep the default value if there is no exclusion list.

For example: arn:aws:eks:us-east-2:111111111111:cluster/testcluster

ExcludeAutoScalingGroupIncludingBeanstalkListInCommaSeprated: Comma separated Beanstalk and other Auto Clusters groups ARN (except EKS) which you don’t want to start, keep the default value if there is no exclusion list.

For example: arn:aws:autoscaling:us-east-1:111111111111:autoScalingGroup:6d5af669-eb3b-4530-894b-e314a667f2e7:autoScalingGroupName/test-0-ASG

ExcludeEC2InstancesIdListInCommaSeprated: Comma separated EC2 instance ID s you don’t want to start, keep the default value if there is no exclusion list.

For example: i-02185df0872f0f852, 0775f7e39513c50dd

ScheduleExpression: Schedule a cron expression when you want to run this workflow. Sample expressions are available in this guide, Schedule expressions using rate or cron.

c. Select IAM role to launch this template. As a best practice, select the AWS CloudFormation service role to manage AWS services and resources available to each user.

Permissions screenshot

d. Acknowledge that you want to create various resources including IAM roles and policies and select Create Stack.

Acknowledgement screenshot

Cleaning up

Delete any unused resources to avoid incurring future charges.

Conclusion

In this blog post, we outlined a solution to help you optimize cost by automating the stop/start of AWS services in non-production environments. Cost Optimization and Cloud Financial Management are ongoing initiatives. We hope you found this solution helpful and encourage you to explore additional ways to optimize cost on the AWS Architecture Center.

Increasing McGraw-Hill’s Application Throughput with Amazon SQS

2021-12-22 Vikas Panghal

Post Syndicated from Vikas Panghal original https://aws.amazon.com/blogs/architecture/increasing-mcgraw-hills-application-throughput-with-amazon-sqs/

This post was co-authored by Vikas Panghal, Principal Product Mgr – Tech, AWS and Nick Afshartous, Principal Data Engineer at McGraw-Hill

McGraw-Hill’s Open Learning Solutions (OL) allow instructors to create online courses using content from various sources, including digital textbooks, instructor material, open educational resources (OER), national media, YouTube videos, and interactive simulations. The integrated assessment component provides instructors and school administrators with insights into student understanding and performance.

McGraw-Hill measures OL’s performance by observing throughput, which is the amount of work done by an application in a given period. McGraw-Hill worked with AWS to ensure OL continues to run smoothly and to allow it to scale with the organization’s growth. This blog post shows how we reviewed and refined their original architecture by incorporating Amazon Simple Queue Service (Amazon SQS). to achieve better throughput and stability.

Reviewing the original Open Learning Solutions architecture

Figure 1 shows the OL original architecture, which works as follows:

The application makes a REST call to DMAPI. DMAPI is an API layer over the Datamart. The call results in a row being inserted in a job requests table in Postgres.
A monitoring process called Watchdog periodically checks the database for pending requests.
Watchdog spins up an Apache Spark on Databricks (Spark) cluster and passes up to 10 requests.
The report is processed and output to Amazon Simple Storage Service (Amazon S3).
Report status is set to completed.
User can view report.
The Databricks clusters shut down.

Figure 1. Original OL architecture

To help isolate longer running reports, we separated requests that have up to five schools (P1) from those having more than five (P2) by allocating a different pool of clusters. Each of the two groups can have up to 70 clusters running concurrently.

Challenges with original architecture

There are several challenges inherent in this original architecture, and we concluded that this architecture will fail under heavy load.

It takes 5 minutes to spin up a Spark cluster. After processing up to 10 requests, each cluster shuts down. Pending requests are processed by new clusters. This results in many clusters continuously being cycled.

We also identified a database resource contention problem. In testing, we couldn’t process 142 reports out of 2,030 simulated reports within the allotted 4 hours. Furthermore, the architecture cannot be scaled out beyond 70 clusters for the P1 and P2 pools. This is because adding more clusters will increase the number of database connections. Other production workloads on Postgres would also be affected.

Refining the architecture with Amazon SQS

To address the challenges with the existing architecture, we rearchitected the pipeline using Amazon SQS. Figure 2 shows the revised architecture. In addition to inserting a row to the requests table, the API call now inserts the job request Id into one of the SQS queues. The corresponding SQS consumers are embedded in the Spark clusters.

Figure 2. New OL architecture with Amazon SQS

The revised flow is as follows:

An API request results in a job request Id being inserted into one of the queues and a row being inserted into the requests table.
Watchdog monitors SQS queues.
Pending requests prompt Watchdog to spin up a Spark cluster.
SQS consumer consumes the messages.
Report data is processed.
Report files output to Amazon S3
Job status is updated in the requests table.
Report can be viewed in the application.

After deploying the Amazon SQS architecture, we reran the previous load of 2,030 reports with a configuration ceiling of up to five Spark clusters. This time all reports were completed within the 4-hour time limit, including the 142 reports that timed out previously. Not only did we achieve better throughput and stability, but we did so by running far fewer clusters.

Reducing the number of clusters reduced the number of concurrent database connections that access Postgres. Unlike the original architecture, we also now have room to scale by adding more clusters and consumers. Another benefit of using Amazon SQS is a more loosely coupled architecture. The Watchdog process now only prompts Spark clusters to spin up, whereas previously it had to extract and pass job requests Ids to the Spark job.

Consumer code and multi-threading

The following code snippet shows how we consumed the messages via Amazon SQS and performed concurrent processing. Messages are consumed and submitted to a thread pool that utilizes Java’s ThreadPoolExecutor for concurrent processing. The full source is located on GitHub.

/**
  * Main Consumer run loop performs the following steps:
  *   1. Consume messages
  *   2. Convert message to Task objects
  *   3. Submit tasks to the ThreadPool
  *   4. Sleep based on the configured poll interval.
  */
 def run(): Unit = {
   while (!this.shutdownFlag) {
     val receiveMessageResult = sqsClient.receiveMessage(new  
                                           ReceiveMessageRequest(queueURL)
       .withMaxNumberOfMessages(threadPoolSize))
     val messages = receiveMessageResult.getMessages
     val tasks = getTasks(messages.asScala.toList)

     threadPool.submitTasks(tasks, sqsConfig.requestTimeoutMinutes)
     Thread.sleep(sqsConfig.pollIntervalSeconds * 1000)
   }

   threadPool.shutdown()
 }

Kafka versus Amazon SQS

We also considered routing the report requests via Kafka, because Kafka is part of our analytics platform. However, Kafka is not a queue, it is a publish-subscribe streaming system with different operational semantics. Unlike queues, Kafka messages are not removed by the consumer. Publish-subscribe semantics can be useful for data processing scenarios. In other words, it can be used in cases where it’s required to reprocess data or to transform data in different ways using multiple independent consumers.

In contrast, for performing tasks, the intent is to process a message exactly once. There can be multiple consumers, and with queue semantics, the consumers work together to pull messages off the queue. Because report processing is a type of task execution, we decided that SQS queue semantics better fit the use case.

Conclusion and future work

In this blog post, we described how we reviewed and revised a report processing pipeline by incorporating Amazon SQS as a messaging layer. Embedding SQS consumers in the Spark clusters resulted in fewer clusters and more efficient cluster utilization. This, in turn, reduced the number of concurrent database connections accessing Postgres.

There are still some improvements that can be made. The DMAPI call currently inserts the report request into a queue and the database. In case of an error, it’s possible for the two to become out of sync. In the next iteration, we can have the consumer insert the request into the database. Hence, the DMAPI call would only insert the SQS message.

Also, the Java ThreadPoolExecutor API being used in the source code exhibits the slow poke problem. Because the call to submit the tasks is synchronous, it will not return until all tasks have completed. Here, any idle threads will not be utilized until the slowest task has completed. There’s an opportunity for improved throughput by using a thread pool that allows idle threads to pick up new tasks.

Ready to get started? Explore the source code illustrating how to build a multi-threaded AWS SQS consumer.

Looking for more architecture content? AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

Use AWS Step Functions to Monitor Services Choreography

2021-12-22 Vito De Giosa

Post Syndicated from Vito De Giosa original https://aws.amazon.com/blogs/architecture/use-aws-step-functions-to-monitor-services-choreography/

Organizations frequently need access to quick visual insight on the status of complex workflows. This involves collaboration across different systems. If your customer requires assistance on an order, you need an overview of the fulfillment process, including payment, inventory, dispatching, packaging, and delivery. If your products are expensive assets such as cars, you must track each item’s journey instantly.

Modern applications use event-driven architectures to manage the complexity of system integration at scale. These often use choreography for service collaboration. Instead of directly invoking systems to perform tasks, services interact by exchanging events through a centralized broker. Complex workflows are the result of actions each service initiates in response to events produced by other services. Services do not directly depend on each other. This increases flexibility, development speed, and resilience.

However, choreography can introduce two main challenges for the visibility of your workflow.

It obfuscates the workflow definition. The sequence of events emitted by individual services implicitly defines the workflow. There is no formal statement that describes steps, permitted transitions, and possible failures.
It might be harder to understand the status of workflow executions. Services act independently, based on events. You can implement distributed tracing to collect information related to a single execution across services. However, getting visual insights from traces may require custom applications. This increases time to market (TTM) and cost.

To address these challenges, we will show you how to use AWS Step Functions to model choreographies as state machines. The solution enables stakeholders to gain visual insights on workflow executions, identify failures, and troubleshoot directly from the AWS Management Console.

This GitHub repository provides a Quick Start and examples on how to model choreographies.

Modeling choreographies with Step Functions

Monitoring a choreography requires a formal representation of the distributed system behavior, such as state machines. State machines are mathematical models representing the behavior of systems through states and transitions. States model situations in which the system can operate. Transitions define which input causes a change from the current state to the next. They occur when a new event happens. Figure 1 shows a state machine modeling an order workflow.

Figure 1. Order workflow

The solution in this post uses Amazon State Language to describe a choreography as a Step Functions state machine. The state machine pauses, using Task states combined with a callback integration pattern. It then waits for the next event to be published on the broker. Choice states control transitions to the next state by inspecting event payloads. Figure 2 shows how the workflow in Figure 1 translates to a Step Functions state machine.

Figure 2. Order workflow translated into Step Functions state machine

Figure 3 shows the architecture for monitoring choreographies with Step Functions.

Figure 3. Choreography monitoring with AWS Step Functions

Services involved in the choreography publish events to Amazon EventBridge. There are two configured rules. The first rule matches the first event of the choreography sequence, Order Placed in the example. The second rule matches any other event of the sequence. Event payloads contain a correlation id (order_id) to group them by workflow instance.
The first rule invokes an AWS Lambda function, which starts a new execution of the choreography state machine. The correlation id is passed in the name parameter, so you can quickly identify an execution in the AWS Management Console.
The state machine uses Task states with AWS SDK service integrations, to directly call Amazon DynamoDB. Tasks are configured with a callback pattern. They issue a token, which is stored in DynamoDB with the execution name. Then, the workflow pauses.
A service publishes another event on the event bus.
The second rule invokes another Lambda function with the event payload.
The function uses the correlation id to retrieve the task token from DynamoDB.
The function invokes the Step Functions SendTaskSuccess API, with the token and the event payload as parameters.
The state machine resumes the execution and uses Choice states to transition to the next state. If the choreography definition expects the received event payload, it selects the next state and the process will restart from Step # 3. The state machine transitions to a Fail state when it receives an unexpected event.

Increased visibility with Step Functions console

Modeling service choreographies as Step Functions Standard Workflows increases visibility with out-of-the-box features.

1. You can centrally track events produced by distributed components. Step Functions records full execution history for 90 days after the execution completes. You’ll be able to capture detailed information about the input and output of each state, including event payloads. Additionally, state machines integrate with Amazon CloudWatch to publish execution logs and metrics.

2. You can monitor choreographies visually. The Step Functions console displays a list of executions with information such as execution id, status, and start date (see Figure 4).

Figure 4. Step Functions workflow dashboard

After you’ve selected an execution, a graph inspector is displayed (see Figure 5). It shows states, transitions, and marks individual states with colors. This identifies at a glance, successful tasks, failures, and tasks that are still in progress.

Figure 5. Step Functions graph inspector

3. You can implement event-driven automation. Step Functions enables you to capture execution status changes emitting events directly to EventBridge (see Figure 6). Additionally, AWS gives you the ability to emit events by setting alarms on top of metrics. Step Functions publishes these to CloudWatch. You can respond to events by initiating corrective actions, sending notifications, or integrating with third-party solutions, such as issue tracking systems.

Figure 6. Automation with Step Functions, EventBridge, and CloudWatch alarms

Enabling access to AWS Step Functions console

Stakeholders need secure access to the Step Functions console. This requires mechanisms to authenticate users and authorize read-only access to specific Step Functions workflows.

AWS Single Sign-On authenticates users by directly managing identities or through federation. SSO supports federation with Active Directory and SAML 2.0 compliant external identity providers (IdP). Users gain access to Step Functions state machines by assigning a permission set, which is a collection of AWS Identity and Access Management (IAM) policies. Additionally, with permission sets, you can configure a relay state, which is a URL to redirect the user after successful authentication. You can authenticate the user through the selected identity provider and immediately show the AWS Step Functions console with the workflow state machine already displayed. Figure 7 shows this process.

Figure 7. Access to Step Functions state machine with AWS SSO

The user logs in through the selected identity provider.
The SSO user portal uses the SSO endpoint to send the response from the previous step. SSO uses AWS Security Token Service (STS) to get temporary security credentials on behalf of the user. It then creates a console sign-in URL using those credentials and the relay state. Finally, it sends the URL back as a redirect.
The browser redirects the user to the Step Functions console.

When the identity provider does not support SAML 2.0, SSO is not a viable solution. In this case, you can create a URL with a sign-in token for users to securely access the AWS Management Console. This approach uses STS AssumeRole to get temporary security credentials. Then, it uses credentials to obtain a sign-in token from the AWS federation endpoint. Finally, it constructs a URL for the AWS Management Console, which includes the token. It then distributes this to users to grant access. This is similar to the SSO process. However, it requires custom development.

Conclusion

This post shows how you can increase visibility on choreographed business processes using AWS Step Functions. The solution provides detailed visual insights directly from the AWS Management Console, without requiring custom UI development. This reduces TTM and cost.

To learn more:

Find Public IPs of Resources – Use AWS Config for Vulnerability Assessment

2021-12-22 Gurkamal Deep Singh Rakhra

Post Syndicated from Gurkamal Deep Singh Rakhra original https://aws.amazon.com/blogs/architecture/find-public-ips-of-resources-use-aws-config-for-vulnerability-assessment/

Systems vulnerability management is a key component of your enterprise security program. Its goal is to remediate OS, software, and applications vulnerabilities. Scanning tools can help identify and classify these vulnerabilities to keep the environment secure and compliant.

Typically, vulnerability scanning tools operate from internal or external networks to discover and report vulnerabilities. For internal scanning, the tools use private IPs of target systems in scope. For external scans, the public target system’s IP addresses are used. It is important that security teams always maintain an accurate inventory of all deployed resource’s IP addresses. This ensures a comprehensive, consistent, and effective vulnerability assessment.

This blog discusses a scalable, serverless, and automated approach to discover public IP addresses assigned to resources in a single or multi-account environment in AWS, using AWS Config.

Single account is when you have all your resources in a single AWS account. A multi-account environment refers to many accounts under the same AWS Organization.

Understanding scope of solution

You may have good visibility into the private IPs assigned to your resources: Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Kubernetes Service (EKS) clusters, Elastic Load Balancing (ELB), and Amazon Elastic Container Service (Amazon ECS). But it may require some effort to establish a complete view of the existing public IPs. And these IPs can change over time, as new systems join and exit the environment.

An elastic network interface is a logical networking component in a Virtual Private Cloud (VPC) that represents a virtual network card. The elastic network interface routes traffic to other destinations/resources. Usually, you have to make Describe* API calls for the specific resource with an elastic network interface to get information about its configuration and IP address. This may throttle the resource-specific API calls, and result in higher costs. Additionally, if there are tens or hundreds of accounts, it becomes exponentially more difficult to get the information into a single inventory.

AWS Config enables you to assess, audit, and evaluate the configurations of your AWS resources. The advanced query feature provides a single query endpoint and language to get current resource state metadata for a single account and Region, or multiple accounts and Regions. You can use configuration aggregators to run the same queries from a central account across multiple accounts and AWS Regions.

AWS Config supports a subset of structured query language (SQL) SELECT syntax, which enables you to perform property-based queries and aggregations on the current configuration item (CI) data. Advanced query is available at no additional cost to AWS Config customers in all AWS Regions (except China Regions) and AWS GovCloud (US).

AWS Organizations helps you centrally govern your environment. Its integration with other AWS services lets you define central configurations, security mechanisms, audit requirements, and resource sharing across accounts in your organization.

Choosing scope of advanced queries in AWS Config

When running advanced queries in AWS Config, you must choose the scope of the query. The scope defines the accounts you want to run the query against and is configured when you create an aggregator.

Following are the three possible scopes when running advanced queries:

Single account and single Region
Multiple accounts and multiple Regions
AWS Organization accounts

Single account and single Region

Figure 1. AWS Config workflow for single account and single Region

The use case shown in Figure 1 addresses the need of customers operating within a single account and single Region. With AWS Config enabled for the individual account, you will use AWS Config advanced query feature to run SQL queries. These will give you resource metadata about associated public IPs. You do not require an aggregator for single-account and single Region.

In Figure 1.1, the advanced query returned results from a single account and all Availability Zones within the Region in which the query was run.

Figure 1.1 Advanced query returning results for a single account and single Region

Query for reference

SELECT

resouceId,

resourceName,

resourceType,

configuration.association.publicIp,

availabilityZone,

awsRegion

WHERE

resourceType='AWS::EC2::NetworkInterface'

AND configuration.association.publicIp>'0.0.0.0'

This query is fetching the properties of all elastic network interfaces. The WHERE condition is used to list the elastic network interfaces using the resourceType property and find all public IPs greater than 0.0.0.0. This is because elastic network interfaces can exist with a private IP, in which case there will be no public IP assigned to it. For a list of supported resourceType, refer to supported resource types for AWS Config.

Multiple accounts and multiple Regions

Figure 2. AWS Config monitoring workflow for multiple account and multiple Regions. The figure shows EC2, EKS, and Amazon ECS, but it can be any AWS resource having a public elastic network interface.

AWS Config enables you to monitor configuration changes against multiple accounts and multiple Regions via an aggregator, see Figure 2. An aggregator is an AWS Config resource type that collects AWS Config data from multiple accounts and Regions. You can choose the aggregator scope when running advanced queries in AWS Config. Remember to authorize the aggregator accounts to collect AWS Config configuration and compliance data.

Figure 2.1 Advanced query returning results from multiple Regions (awsRegion column) as highlighted in the diagram

This use case applies when you have AWS resources in multiple accounts (or span multiple organizations) and multiple Regions. Figure 2.1 shows the query results being returned from multiple AWS Regions.

Accounts in AWS Organization

Figure 3. The workflow of accounts in an AWS Organization being monitored by AWS Config. This figure shows EC2, EKS, and Amazon ECS but it can be any AWS resource having a public elastic network interface.

An aggregator also enables you to monitor all the accounts in your AWS Organization, see Figure 3. When this option is chosen, AWS Config enables you to run advanced queries against the configuration history in all the accounts in your AWS Organization. Remember that an aggregator will only aggregate data from the accounts and Regions that are specified when the aggregator is created.

Figure 3.1 Advanced query returning results from all accounts (accountId column) under an AWS Organization

In Figure 3.1, the query is run against all accounts in an AWS Organization. This scope of AWS Organization is accomplished by the aggregator and it automatically accumulates data from all accounts under a specific AWS Organization.

Common architecture workflow for discovering public IPs

Figure 4. High-level architecture pattern for discovering public IPs

The workflow shown in Figure 4 starts with Amazon EventBridge triggering an AWS Lambda function. You can configure an Amazon EventBridge schedule via rate or cron expressions, which define the frequency. This AWS Lambda function will host the code to make an API call to AWS Config that will run an advanced query. The advanced query will check for all elastic network interfaces in your account(s). This is because any public resource launched in your account will be assigned an elastic network interface.

When the results are returned, they can be stored on Amazon S3. These result files can be timestamped (via naming or S3 versioning) in order to keep a history of public IPs used in your account. The result set can then be fed into or accessed by the vulnerability scanning tool of your choice.

Note: AWS Config advanced queries can also be used to query IPv6 addresses. You can use the “configuration.ipv6Addresses” AWS Config property to get IPv6 addresses. When querying IPv6 addresses, remove “configuration.association.publicIp > ‘0.0.0.0’” condition from the preceding sample queries. For more information on available AWS Config properties and data types, refer to GitHub.

Conclusion

In this blog, we demonstrated how to extract public IP information from resources deployed in your account(s) using AWS Config and AWS Config advanced query. We discussed how you can support your vulnerability scanning process by identifying public IPs in your account(s) that can be fed into your scanning tool. This solution is serverless, automated, and scalable, which removes the undifferentiated heavy lifting required to manage your resources.

Learn more about AWS Config best practices:

Modernized Database Queuing using Amazon SQS and AWS Services

2021-12-17 Scott Wainner

Post Syndicated from Scott Wainner original https://aws.amazon.com/blogs/architecture/modernized-database-queuing-using-amazon-sqs-and-aws-services/

A queuing system is composed of producers and consumers. A producer enqueues messages (writes messages to a database) and a consumer dequeues messages (reads messages from the database). Business applications requiring asynchronous communications often use the relational database management system (RDBMS) as the default message storage mechanism. But the increased message volume, complexity, and size, competes with the inherent functionality of the database. The RDBMS becomes a bottleneck for message delivery, while also impacting other traditional enterprise uses of the database.

In this blog, we will show how you can mitigate the RDBMS performance constraints by using Amazon Simple Queue Service (Amazon SQS), while retaining the intrinsic value of the stored relational data.

Problems with legacy queuing methods

Commercial databases such as Oracle offer Advanced Queuing (AQ) mechanisms, while SQL Server supports Service Broker for queuing. The database acts as a message queue system when incoming messages are captured along with metadata. A message stored in a database is often processed multiple times using a sequence of message extraction, transformation, and loading (ETL). The message is then routed for distribution to a set of recipients based on logic that is often also stored in the database.

The repetitive manipulation of messages and iterative attempts at distributing pending messages may create a backlog that interferes with the primary function of the database. This backpressure can propagate to other systems that are trying to store and retrieve data from the database and cause a performance issue (see Figure 1).

Figure 1. A relational database serving as a message queue.

There are several scenarios where the database can become a bottleneck for message processing:

Message metadata. Messages consist of the payload (the content of the message) and metadata that describes the attributes of the message. The metadata often includes routing instructions, message disposition, message state, and payload attributes.

The message metadata may require iterative transformation during the message processing. This creates an inefficient sequence of read, transform, and write processes. This is especially inefficient if the message attributes undergo multiple transformations that must be reflected in the metadata. The iterative read/write process of metadata consumes the database IOPS, and forces the database to scale vertically (add more CPU and more memory).
A new paradigm emerges when message management processes exist outside of the database. Here, the metadata is manipulated without interacting with the database, except to write the final message disposition. Application logic can be applied through functions such as AWS Lambda to transform the message metadata.

Message large object (LOB). A message may contain a large binary object that must be stored in the payload.

Storing large binary objects in the RDBMS is expensive. Manipulating them consumes the throughput of the database with iterative read/write operations. If the LOB must be transformed, then it becomes wasteful to store the object in the database.
An alternative approach offers a more efficient message processing sequence. The large object is stored external to the database in universally addressable object storage, such as Amazon Simple Storage Service (Amazon S3). There is only a pointer to the object that is stored in the database. Smaller elements of the message can be read from or written to the database, while large objects can be manipulated more efficiently in object storage resources.

Message fan-out. A message can be loaded into the database and analyzed for routing, where the same message must be distributed to multiple recipients.

Messages that require multiple recipients may require a copy of the message replicated for each recipient. The replication creates multiple writes and reads from the database, which is inefficient.
A new method captures only the routing logic and target recipients in the database. The message replication then occurs outside of the database in distributed messaging systems, such as Amazon Simple Notification Service (Amazon SNS).

Message queuing. Messages are often kept in the database until they are successfully processed for delivery. If a message is read from the database and determined to be undeliverable, then the message is kept there until a later attempt is successful.

An inoperable message delivery process can create backpressure on the database where iterative message reads are processed for the same message with unsuccessful delivery. This creates a feedback loop causing even more unsuccessful work for the database.
Try a message queuing system such as Amazon MQ or Amazon SQS, which offloads the message queuing from the database. These services offer efficient message retry mechanisms, and reduce iterative reads from the database.

Sequenced message delivery. Messages may require ordered delivery where the delivery sequence is crucial for maintaining application integrity.

The application may capture the message order within database tables, but the sorting function still consumes processing capabilities. The order sequence must be sorted and maintained for each attempted message delivery.
Message order can be maintained outside of the database using a queue system, such as Amazon SQS, with first-in/first-out (FIFO) delivery.

Message scheduling. Messages may also be queued with a scheduled delivery attribute. These messages require an event driven architecture with initiated scheduled message delivery.

The database often uses trigger mechanisms to initiate message delivery. Message delivery may require a synchronized point in time for delivery (many messages at once), which can cause a spike in work at the scheduled interval. This impacts the database performance with artificially induced peak load intervals.
Event signals can be generated in systems such as Amazon EventBridge, which can coordinate the transmission of messages.

Message disposition. Each message maintains a message disposition state that describes the delivery state.

The database is often used as a logging system for message transmission status. The message metadata is updated with the disposition of the message, while the message remains in the database as an artifact.
An optimized technique is available using Amazon CloudWatch as a record of message disposition.

Modernized queuing architecture

Decoupling message queuing from the database improves database availability and enables greater message queue scalability. It also provides a more cost-effective use of the database, and mitigates backpressure created when database performance is constrained by message management.

The modernized architecture uses loosely coupled services, such as Amazon S3, AWS Lambda, Amazon Message Queue, Amazon SQS, Amazon SNS, Amazon EventBridge, and Amazon CloudWatch. This loosely coupled architecture lets each of the functional components scale vertically and horizontally independent of the other functions required for message queue management.

Figure 2 depicts a message queuing architecture that uses Amazon SQS for message queuing and AWS Lambda for message routing, transformation, and disposition management. An RDBMS is still leveraged to retain metadata profiles, routing logic, and message disposition. The ETL processes are handled by AWS Lambda, while large objects are stored in Amazon S3. Finally, message fan-out distribution is handled by Amazon SNS, and the queue state is monitored and managed by Amazon CloudWatch and Amazon EventBridge.

Figure 2. Modernized queuing architecture using Amazon SQS

Conclusion

In this blog, we show how queuing functionality can be migrated from the RDMBS while minimizing changes to the business application. The RDBMS continues to play a central role in sourcing the message metadata, running routing logic, and storing message disposition. However, AWS services such as Amazon SQS offload queue management tasks related to the messages. AWS Lambda performs message transformation, queues the message, and transmits the message with massive scale, fault-tolerance, and efficient message distribution.

Read more about the diverse capabilities of AWS messaging services:

By using AWS services, the RDBMS is no longer a performance bottleneck in your business applications. This improves scalability, and provides resilient, fault-tolerant, and efficient message delivery.

Read our blog on modernization of common database functions:

Migrating a Database Workflow to Modernized AWS Workflow Services

Ingesting Automotive Sensor Data using DXC RoboticDrive Ingestor on AWS

2021-12-17 Bryan Berezdivin

Post Syndicated from Bryan Berezdivin original https://aws.amazon.com/blogs/architecture/ingesting-automotive-sensor-data-using-dxc-roboticdrive-ingestor-on-aws/

This post was co-written by Pawel Kowalski, a Technical Product Manager for DXC RoboticDrive and Dr. Max Böhm, a software and systems architect and DXC Distinguished Engineer.

To build the first fully autonomous vehicle, L5 standard per SAE, auto-manufacturers collected sensor data from test vehicle fleets across the globe in their testing facilities and driving circuits. It wasn’t easy collecting, ingesting and managing massive amounts of data, and that’s before it’s used for any meaningful processing, and training the AI self-driving algorithms. In this blog post, we outline a solution using DCX RoboticDrive Ingestor to address the following challenges faced by automotive original equipment manufacturers (OEMs) when ingesting sensor data from the car to the data lake:

Each test drive alone can gather hundreds of terabytes of data per fleet. A few vehicles doing test fleets can produce petabyte(s) worth of data in a day.
Vehicles collect data from all the sensors into local storage with fast disks. Disks must offload data to a buffer station and then to a data lake as soon as possible to be reused in the test drives.
Data is stored in binary file formats like ROSbag, ADTF or MDF4, that are difficult to read for any pre-ingest checks and security scans.
On-premises (ingest location) to Cloud data ingestion requires a hybrid solution with reliable connectivity to the cloud.
Ingesting process depositing files into a data lake without organizing them first can cause discovery and read performance issues later in the analytics phase.
Orchestrating and scaling an ingest solution can involve several steps, such as copy, security scans, metadata extraction, anonymization, annotation, and more.

Overview of the RoboticDrive Ingestor solution

The DXC RoboticDrive Ingestor (RDI) solution addresses the core requirements as follows:

Cloud ready: Implements a recommended architectural pattern where data from globally located logger copy stations are ingested into the cloud data lake in one or more AWS Regions over AWS Direct Connect or Site-to-Site VPN connections.
Scalable: Runs multiple steps as defined in an Ingest Pipeline, containerized on Amazon EKS based Amazon EC2 compute nodes that can auto-scale horizontally and vertically on AWS. Steps can be for example, Virus Scan, Copy to Amazon S3, Quality Check, Metadata Extraction, and others.
Performant: Uses the Apache Spark-based DXC RoboticDrive Analyzer component to process large amounts of sensor data files efficiently in parallel, which will be described in a future blog post. Analyzer can read and processes native automotive file formats, extract and prepare metadata, and populate metadata into the DXC RoboticDrive Meta Data Store. The maximum throughput of network and disk is leveraged and data files are placed on the data lake following a hierarchical prefix domain model, that aligns with the best practices around S3 performance.
Operations ready: Containerized deployment, custom monitoring UI, visual workflow management with MWAA, Amazon CloudWatch based monitoring for virtual infrastructure and cloud native services.
Modular: Programmatically add or customize steps into the managed Ingest Pipeline DAG, defined in Amazon Managed Workflows for Apache Airflow (MWAA).
Security: IAM based access policy and technical roles, Amazon Cognito based UI authentication, virus scan, and an optional data anonymization step.

The ingest pipeline copies new files to S3 followed by user configurable steps, such as Virus Scan or Integrity Checks. When new files have been uploaded, cloud native S3 event notifications generate messages in an Amazon SQS queue. These are then consumed by serverless AWS Lambda functions or by containerized workloads in the EKS cluster, to asynchronously perform metadata extraction steps.

Walkthrough

There are two main areas of the ingest solution:

On-Premises devices – these are dedicated copy stations in the automotive OEM’s data center where data cartridges are physically connected. Copy stations are easily configurable devices that automatically mount inserted data cartridges and share its content via NFS. At the end of the mounting process, they call an HTTP post API to trigger the ingest process.
Managed Ingest Pipeline – this is a Managed service in the cloud to orchestrate the ingest process. As soon as a cartridge is inserted into the Copy Station, mounted and shared via NFS, the ingest process can be started.

Figure 1 – Architecture showing the DXC RoboticDrive Ingestor (RDI) solution

Depending on the customer’s requirements, the ingest process may differ. It can be easily modified using predefined Docker images. A configuration file defines what features from the Ingest stack should be used. It can be a simple copy from the source to the cloud but we can easily extend it with additional steps, such as a virus scan, special data partitioning, data quality checks on even scripts provided by the automotive OEM.

The ingest process is orchestrated by a dedicated MWAA pipeline built dynamically from the configuration file. We can easily track the progress, see the logs and, if required, stop or restart the process. Progress is tracked at a detailed level, including the number and volume of data uploaded, remaining files, estimated finish time, and ingest summary, all visualized in a dedicated ingest UI.

Prerequisites

AWS Site-To-Site VPN (in production an AWS Direct Connect connection is used instead)
Amazon Managed Workflows for Apache Airflow (MWAA)
Managed Kubernetes Service (Amazon EKS)
Data provided on source side, shared via a Network File System (NFS)

Ingest pipeline

To build an MWAA pipeline, we need to choose from prebuilt components of the ingestor and add them into the JSON configuration, which is stored in an Airflow variable:

{
    "tasks" : [
        {
            "name" : "Virus_Scan",
            "type" : "copy",
            "srcDir" : "file:///mnt/ingestor/CopyStation1/",
            "quarantineDir" : "s3a://k8s-ingestor/CopyStation1/quarantine/”
        },
        {
            "name" : "Copy_to_S3",
            "type" : "copy",
            "srcDir" : "file:///mnt/ingestor/car01/test.txt",
            "dstDir" : "s3a://k8s-ingestor/copy_dest/",
            "integrity_check" : false
        },
        {
            "name" : "Quality_Check",
            "type" : "copy",
            "srcDir" : "file:///mnt/ingestor/car01/test.txt",
            "InvalidFilesDir" : "s3a://k8s-ingestor/CopyStation1/invalid/",
        }
    ]
}

The final ingest pipeline, a Directed Acyclic Graph (DAG), is built automatically from the configuration and changes are visible within seconds:

Figure 2 – Screenshot showing the final ingest pipeline, a Directed Acyclic Graph (DAG)

If required, external tools or processes can also call ingest components using the Ingest API, which is shown in the following screenshot:

Figure 3 – Screenshot showing a tool being used to call ingest components using the Ingest API

The usual ingest process consists of the following steps:

Ingest data source (cartridge) into the Copy Station

2. Wait for Copy Station to report proper cartridge mounting. In case a dedicated device (Copy Station) is not used, and you want to use an external disc as your data source, it can be mounted manually:

mount -o ro /dev/sda /media/disc

Here, /dev/sda is the disc containing your data and /media/disc is the mountpoint, which must also be added to /etc/exports.

3. Every time the MWAA ingest pipeline (DAG) is triggered, a Kubernetes job is created through a Helm template and Helm values which add the above data source as a persistent volume (PV) mount inside the job.

Helm template is as follows:

apiVersion: v1

kind: PersistentVolume

metadata:

name: {{ .name }}

namespace: {{ .Values.namespace }}

spec:

capacity:

storage: {{ .storage }}

accessModes:

- {{ .accessModes }}

nfs:

server: {{ .server }}

path: {{ .path }}

Helm values:

name: "YourResourceName"

server: 10.1.1.44

path: "/cardata"

mountPath: "/media/disc"

storage: 100Mi

accessModes: ReadOnlyMany

4. Trigger ingest manually from MWAA UI – this can also be automatically triggered from a Copy Station by calling MWAA API:

curl -X POST "https://roboticdrive/api/v1/dags/ingest_dag/dagRuns" -H "accept: application/json" -H "Content-Type: application/json" -d "{\"dag_run_id\":\"IngestDAG\"}"

5. Watch ingest progress – this can be omitted as the ingest results, such as success, failure, can be notified by email or reported to a ticketing system.

6. Once successfully uploaded, turn off the Copy Station manually or it can be the last task of a flow in the following ingest DAG:

(…)

def ssh_operator(task_identifier: str, cmd: str):

return SSHOperator(task_id=task_identifier,

command=cmd + ' ',

remote_host=CopyStation,

ssh_hook=sshHook,

trigger_rule=TriggerRule.NONE_FAILED,

do_xcom_push=True,

dag= IngestDAG)

host_shutdown = ssh_operator(task_identifier='host_shutdown', \

cmd='sudo shutdown –h +5 &')

When launched, you can track the progress and basic KPIs on the Monitoring UI:

Figure 5 - Screenshot basic KPIs on the Monitoring UI

Figure 4 – Screenshot basic KPIs on the Monitoring UI

Conclusion

In this blog post, we showed how to set up the DXC RoboticDrive Ingestor. This solution was designed to overcome several data ingestion challenges and resulted in the following positive results:

Time-to-Analyze KPI decreased by 30% in our tests, local offload steps become obsolete, and data offloaded from cars were accessible on a globally accessible data-lake (S3) for further processing and analysis.
MWAA integrated with EKS made the solution flexible, fault tolerant, as well as easy to manage and maintain.
Autoscaling the compute capacity as needed for any pre-processing steps within ingestion, helped with boosting performance and productivity.
On-premises to AWS cloud connectivity options provided flexibility in configuring and scaling (up and down), and provides optimal performance and bandwidth.

We hope you found this solution insightful and look forward to hearing your feedback in the comments!

Insights for CTOs: Part 2 – Enable Good Decisions at Scale with Robust Security

2021-12-17 Syed Jaffry

Post Syndicated from Syed Jaffry original https://aws.amazon.com/blogs/architecture/insights-for-ctos-part-2-enable-good-decisions-at-scale-with-robust-security/

In my role as a Senior Solutions Architect, I have spoken to chief technology officers (CTOs) and executive leadership of large enterprises like big banks, software as a service (SaaS) businesses, mid-sized enterprises, and startups.

In this 6-part series, I share insights gained from various CTOs during their cloud adoption journeys at their respective organizations. I have taken these lessons and summarized architecture best practices to help you build and operate applications successfully in the cloud. This series will also cover topics on building and operating cloud applications, security, cloud financial management, modern data and artificial intelligence (AI), cloud operating models, and strategies for cloud migration.

In part 2, my colleague Paul Hawkins and I will show you how to effectively communicate organization-wide security processes. This will ensure you can make informed decisions to scale effectively. We also describe how to establish robust security controls using best practices from the Security Pillar of the Well-Architected Framework.

Effectively establish and communicate security processes

To ensure your employees, customers, contractors, etc., understand your organization’s security goals, make sure that people know the what, how, and why behind your security objectives:

What are the overall objectives they need to meet?
How do you intend for the organization and your customers to work together to meet these goals?
Why are meeting these goals important to your organization and customers?

Having well communicated security principles gives a common understanding of overall objectives. Once you communicate these goals, you can get more specific in terms of how those objectives can be achieved.

The next sections discuss best practices to establish your organization’s security processes.

Create a “path to production” process

A “path to production” process is a set of consistent and reusable engineering standards and steps that each new cloud workload must adhere to prior to production deployment. Using this process will increase delivery velocity while reducing business risk by ensuring strong compliance to standards.

Classify your data for better access control

Understanding the type of data that you are handling and where it is being handled is critical to understanding what you need to do to appropriately protect it. For example, the requirements for a public website are different than a payment processing workload. By knowing where and when sensitive data is being accessed or used, you can more easily assess and establish the appropriate controls.

Figure 1 shows a scale that will help you determine when and how to protect sensitive data. It shows that you would apply stricter access controls for more sensitive data to reduce the risk of inappropriate access. Detective controls allow you to audit and respond to unexpected access.

By simplifying the baseline control posture across all environments and layering on stricter controls where appropriate, you will make it easier to deliver change more swiftly while maintaining the right level of security.

Figure 1. Data classification and control scale

Identify and prioritize how to address risks using a threat model

As shown in the How to approach threat modeling blog post, threat modeling helps workload teams identify potential threats and develop or implement security controls to address those threats.

Threat modeling is most effective when it’s done at the workload (or workload feature) level. We recommend creating reusable threat modeling templates. This will help ensure quicker time to production and a consistent security control posture for your systems.

Create feedback cycles

Security, like other areas of architecture and design, is not static. You don’t implement security processes and walk away, just like you wouldn’t ship an application and never improve its availability, performance, or ease of operation.

Implementation of feedback cycles will vary depending on your organizational structure and processes. However, one common way we have seen feedback cycles being implemented is with a collaborative, blame-free root cause analysis (RCA) process. It allows you to understand how many issues you have been able to prevent or effectively respond to and apply that knowledge to make your systems more secure. It also demonstrates organizational support for an objective discussion where people are not penalized for asking questions.

Security controls

Protect your applications and infrastructure

To secure your organization, build automation that delivers robust answers to the following questions:

Preventative controls – how well can you block unauthorized access?
Detective controls – how well can you identify unexpected activity or unwanted configuration?
Incident response – how quickly and effectively can you respond and recover from issues?
Data protection – how well is the data protected while being used and stored?

Preventative controls

Start with robust identity and access management (IAM). For human access, avoid having to maintain separate credentials between cloud and on-premises systems. It does not scale and creates threat vectors such as long-lived credentials and credential leaks.

Instead, use federated authentication within a centralized system for provisioning and deprovisioning organization-wide access to all your systems, including the cloud. For AWS access, you can do this with AWS Single Sign-On (AWS SSO), direct federation to IAM, or integration with partner solutions, such as Okta or Active Directory.

Enhance your trust boundary with the principles of “zero trust.” Traditionally, organizations tend to rely on the network as the primary point of control. This can create a “hard shell, soft core” model, which doesn’t consider context for access decisions. Zero trust is about increasing your use of identity as a means to grant access in addition to traditional controls that rely on network being private.

Apply “defense in depth” to your application infrastructure with a layered security architecture. The sequence in which you layer the controls together can depend on your use case. For example, you can apply IAM controls either at the database layer or at the start of user activity—or both. Figure 2 shows a conceptual view of layering controls to help secure access to your data. Figure 3 shows the implementation view for a web-facing application.

Figure 2. Defense in depth

Figure 3. Defense in depth applied to a web application

Detective controls

Detective controls allow you to get the information you need to respond to unexpected changes and incidents. Tools like Amazon GuardDuty and AWS Config can integrate with your security information and event monitoring (SIEM) system so you can respond to incidents using human and automated intervention.

Incident response

When security incidents are detected, timely and appropriate response is critical to minimize business impact. A robust incident response process is a combination of human intervention steps and automation. The AWS Security Hub Automated Response and Remediation solution provides an example of how you can build incident response automation.

Protect data with robust controls

Restrict access to your databases with private networking and strong identity and access control. Apply data encryption in transit (TLS) and at rest. A common mistake that organizations make is not enabling encryption at rest in databases at the time of initial deployment.

It is difficult to enable database encryption after the fact without time-consuming data migration. Therefore, enable database encryption from the start and minimize direct human access to data by applying principles of least privilege. This reduces the likelihood of accidental disclosure of information or misconfiguration of systems.

Ready to get started?

As a CTO, understanding the overall posture of your security processes against the foundational security controls is beneficial. Tracking key metrics on the effectiveness of the decision-making process, overall security objectives, and the improvement in posture over time should be regularly evaluated by the CTO and CISO organizations.

Embedding the principles of robust security processes and controls into the way your organization designs, develops, and operates workloads makes it easier to consistently make good decisions quickly.

To get started, look at workloads where engineering and security are already working together or bootstrap an initiative for this. Use the Well Architected Tool’s Security Pillar to create and communicate a set of objectives that demonstrate value.

Other blogs in this series

Insights for CTOs: Part 1 – Building and Operating Cloud Applications

Looking for more architecture content? AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

Serverless Scheduling with Amazon EventBridge, AWS Lambda, and Amazon DynamoDB

2021-12-16 Peter Grman

Post Syndicated from Peter Grman original https://aws.amazon.com/blogs/architecture/serverless-scheduling-with-amazon-eventbridge-aws-lambda-and-amazon-dynamodb/

Many applications perform scheduled tasks. For instance, you might want to automatically publish an article at a given time, change prices for offers which were defined weeks in advance, or notify customers 8 hours before a flight. These might be one-off tasks, or recurring ones.

On Unix-like operating systems, you might have opted for the cron utility. There are also similar alternatives for many web application frameworks, as well as advanced libraries, to schedule future one-off tasks. In a single server environment, this might seem like a simple solution. However, when you run dozens of instances of your application server, it gets harder to rely on those libraries to schedule tasks reliably at least once, without taking up too many resources. If you decide to build a serverless application, you need a new approach all together.

This post shows how you can build a scalable serverless job scheduler. You can use this method to scale to thousands, or even millions, of distributed jobs per minute. Because you are using serverless technologies, the underlying infrastructure is fully managed by AWS and you only pay for what you use. You can use this solution as an addition to your existing applications, regardless if they already use serverless technologies.

Similarly to a cron job running on a single instance, this solution uses an Amazon EventBridge rule, which starts new events periodically on a schedule. For recurring jobs, you would use this capability to start specific actions directly. This will work if you have only a few dozen periodic tasks, whose execution cycle can be defined as a cron expression. However, remember that there are limits to how many rules can be defined per event bus, and rules with a scheduled expression can only be defined on the default event bus. This post describes a method to multiplex a single Amazon EventBridge rule via an AWS Lambda function and Amazon DynamoDB, to scale beyond thousands of jobs. While this example focuses on one-off tasks, you can use the same approach for recurring jobs as well.

Overview of solution

The following diagram shows the architecture of the serverless scheduling solution.

Figure 1 – Architecture diagram showing Serverless Scheduling with Amazon EventBridge, AWS Lambda, and Amazon DynamoDB

Amazon EventBridge with scheduled expressions periodically starts an AWS Lambda function. An Amazon DynamoDB table stores the future jobs. The Lambda function queries the table for due jobs and distributes them via Amazon EventBridge to the workers.

The following services are used:

Amazon EventBridge: to initiate the serverless scheduling solution. Amazon EventBridge is a serverless event bus that makes it easier to build event-driven applications at scale. It can also schedule events based on time intervals or cron expressions.

In this solution, you’ll use EventBridge for two things:

to periodically start the AWS Lambda function, which checks for new jobs to be executed, and
to distribute those jobs to the workers.

Here, you can control the granularity of your job executions. The fastest rate possible is once every minute. But if you don’t need a 1-minute precision, you can also opt for once every 5 minutes, or even once every hour. Remember that you cannot control at which second the event is started. It might be at the beginning of the minute, in the middle, or at the end.

AWS Lambda: to execute the scheduler logic. AWS Lambda is a serverless, event-driven compute service that lets you run code without provisioning or managing servers. The Lambda function queries the jobs from DynamoDB and distributes them via EventBridge. Based on your requirements, you can adjust this to use different mechanisms to notify the workers about the jobs, such as HTTP APIs, gRPC calls, or AWS services like Amazon Simple Notification Service (SNS) or Amazon Simple Queue Service (SQS).

Amazon DynamoDB: to store scheduled jobs. Amazon DynamoDB is a fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale. Defining the right data model is important to be able to scale to thousands or even millions of scheduled and processed jobs per minute. The DynamoDB table in this solution has a partition key “pk” and a sort key “sk”. For the Lambda function, to be able to query all due jobs quickly and efficiently, jobs must be partitioned. For this, they are grouped together based on their scheduled times in intervals of 5 minutes. This value is the partition key “pk”. How to calculate this value is explained in detail, when you will test the solution.

The sort key “sk” contains the precise execution time concatenated with a unique identifier, such as a job ID, because the combination of “pk” and “sk” must be unique. To schedule a job in this example, you write it manually into the DynamoDB table. In your production code you can abstract the synchronous DynamoDB access, by implementing it in a shared library, or using Amazon API Gateway. You could also schedule jobs from a Lambda function reacting to events in your system.

Amazon EventBridge: to distribute the jobs. The Lambda function uses Amazon EventBridge as an example to distribute the jobs. The workers which should receive the jobs, must configure the corresponding rules upfront. For testing purposes, this solution comes with a rule which logs all events from the Lambda function into Amazon CloudWatch Logs.

Walkthrough

In this section, you will deploy the solution and test it.

An AWS account
An AWS user, which has access to the AWS Management Console and has the IAM permissions to launch the AWS CloudFormation stack and create the aforementioned resources.

Deploying the solution

To deploy it in your account:

1. Select Launch Stack.

2. Select the Region where you want to launch your serverless scheduler.

3. Define a name for your stack. Leave the parameters with the default values for now and select Next.

parameters table

4. At the bottom of the page, acknowledge the required Capabilities and select Create stack.

5. Wait until the status of the stack is CREATE_COMPLETE, this can take a minute or two.

Testing the solution

In this section, you test the serverless scheduler. First, you’ll schedule a job for some time in the near future. Afterwards you will check that the job has been logged in CloudWatch Logs at the time, it was scheduled.

1. In the AWS Management Console, navigate to the DynamoDB service and select the Items sub-menu on the left side, between Tables and PartiQL editor.

2. Select the JobsTable which you created via the CloudFormation Stack; it should be empty for now:

Jobstable

3. Select Create item. Make sure you switch to the JSON editor at the top, and disable View DynamoDB JSON. Now copy this item into the editor:

{
  "pk": "j#2015-03-20T09:45",
  "sk": "2015-03-20T09:46:47.123Z#564ade05-efda-4a2e-a7db-933ad3c89a83",
  "detail": {
    "action": "send-reminder",
    "userId": "16f3a019-e3a5-47ed-8c46-f668347503d1",
    "taskId": "6d2f710d-99d8-49d8-9f52-92a56d0c6b81",
    "params": {
      "can_skip": false,
      "reminder_volume": 0.5
    }
  },
  "detail_type": "job-reminder"
}

Create table DynamoDB table

This is a sample job definition. You will need to adjust it, to be started a few minutes from now. For this you need to adjust the first 2 attributes, the partition key “pk” and the sort key “sk”. Start with “sk”, this is the UTC timestamp for the due date of the job in ISO 8601 format (YYYY-MM-DDTHH:MM:SS), followed by a separator (“#”) and a unique identifier, to make sure that multiple jobs can have the same due timestamp.

Afterwards adjust “pk”. The “pk” looks like the ISO 8601 timestamp in the “sk” reduced to date and time in hours and minutes. The minutes for the partition key must be an integer multiple of 5. This value represents the grouping of the jobs, so they can be queried quickly and efficiently by the Lambda function. For instance, for me 2021-11-26T13:31:55.000Z is in the future and the corresponding partition would be 2021-11-26T13:30.

Note: your local time zone might not be UTC. You can get the current UTC time on timeanddate.com.

You can find in the following table for every “sk” minute the corresponding “pk” minute:

SK and PK table

The corresponding python code would be:

f'{(sk_minutes – sk_minutes % 5):02d}'

Create table DynamoDB table

4. Now that you defined your event in the near future, you can optionally adjust the content of the “detail” and “detail_type” attributes. These are forwarded to EventBridge as “detail” and “detail-type” and should be used by your workers to understand which task they are supposed to perform. You can find more details on EventBridge event structure in our documentation. After you configured the job correctly, select Create item.

5. It is time to navigate to CloudWatch Log groups and wait for the item to be due and to show up in the debug logs.

CloudWatch log groups

For now, the log streams should be empty:

Log streams sceenshot

After the item was due, you should see a new log stream with the item “detail” and “detail_type” attributes logged.

If you don’t see a new log stream with the item, check back in your DynamoDB table, if the “sk” is in the UTC time zone and the minutes of the “pk” are a multiple of 5. You can consult the table at the end of step 3, to check for the correct “pk” minutes based on your “sk” minutes.

Log events screenshot

You might notice that the timestamp of the message is within a minute after the job was scheduled. In my example, I scheduled the job for 2021-11-26T13:31:55.000Z and it was put into EventBridge at 2021-11-26T13:32:33Z. The delay comes from the Lambda function only starting once per minute. As I mentioned in the beginning, the function also isn’t started at second 00 but at a random second within that minute.

Exploring the Lambda function

Now, let’s have a look at the core logic. For this, navigate to AWS Lambda in the AWS Management console and open the SchedulerFunction.

AWS Lambda screenshot

In the function configuration, you can see that it is triggered by EventBridge via a scheduled expression at the rate, which was defined in the CloudFormation Stack.

Function Configuration in Cloudformation Stack

When you open the Code tab, you can see that it is less than 100 lines of python code. The main part is the lambda_handler function:

def lambda_handler(event, context):
    event_time_in_utc = event['time']
    previous_partition, current_partition = get_partitions(event_time_in_utc)

    previous_jobs = query_jobs(previous_partition, event_time_in_utc)
    current_jobs = query_jobs(current_partition, event_time_in_utc)
    all_jobs = previous_jobs + current_jobs

    print('dispatching {} jobs'.format(len(all_jobs)))

    put_all_jobs_into_event_bridge(all_jobs)
    delete_all_jobs(all_jobs)

    print('dispatched and deleted {} jobs'.format(len(all_jobs)))

The function starts by calculating the current and previous partitions. This is done to ensure that no jobs stay unprocessed in the old partition, when a new one starts. Afterwards, jobs from these partitions are queried up to the current time, so no future jobs will be fetched from the current partition. Lastly, all jobs are put into EventBridge and deleted from the table.

Instead of pushing the jobs into EventBridge, they could be started via HTTP(S), gRPC, or pushed into other AWS services, like Amazon Simple Notification Service (SNS) or Amazon Simple Queue Service (SQS). Also remember that the communication with other AWS services is synchronous and does not use batching options when putting jobs into EventBridge or deleting them from the DynamoDB table. This is to keep the function simpler and easier to understand. When you plan to distribute thousands of jobs per minute, you’d want to adjust this, to improve the throughput of the Lambda function.

Cleaning up

To avoid incurring future charges, delete the CloudFormation Stack and all resources you created.

Conclusion

In this post, you learned how to build a serverless scheduling solution. Using only serverless technologies which scale automatically, don’t require maintenance, and offer a pay as you go pricing model, this scheduler solution can be implemented for use cases with varying throughput requirements for their scheduled jobs. These could range from publishing articles at a scheduled time to notifying hundreds of passengers per minute about their upcoming flight.

You can adjust the Lambda function to distribute the jobs with a technology more fitting to your application, as well as to handle recurring tasks. The grouping interval of 5 minutes for the partition key, can be also adjusted based on your throughput requirements. It’s important to note that for this solution to work, the interval by which the jobs are grouped must be longer than the rate at which the Lambda function is started.

Give it a try and let us know your thoughts in the comments!

Unify log aggregation and analytics across compute platforms

2021-12-15 Hari Ohm Prasath

Post Syndicated from Hari Ohm Prasath original https://aws.amazon.com/blogs/big-data/unify-log-aggregation-and-analytics-across-compute-platforms/

Our customers want to make sure their users have the best experience running their application on AWS. To make this happen, you need to monitor and fix software problems as quickly as possible. Doing this gets challenging with the growing volume of data needing to be quickly detected, analyzed, and stored. In this post, we walk you through an automated process to aggregate and monitor logging-application data in near-real time, so you can remediate application issues faster.

This post shows how to unify and centralize logs across different computing platforms. With this solution, you can unify logs from Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Kinesis Data Firehose, and AWS Lambda using agents, log routers, and extensions. We use Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) with OpenSearch Dashboards to visualize and analyze the logs, collected across different computing platforms to get application insights. You can deploy the solution using the AWS Cloud Development Kit (AWS CDK) scripts provided as part of the solution.

Customer benefits

A unified aggregated log system provides the following benefits:

A single point of access to all the logs across different computing platforms
Help defining and standardizing the transformations of logs before they get delivered to downstream systems like Amazon Simple Storage Service (Amazon S3), Amazon OpenSearch Service, Amazon Redshift, and other services
The ability to use Amazon OpenSearch Service to quickly index, and OpenSearch Dashboards to search and visualize logs from its routers, applications, and other devices

Solution overview

In this post, we use the following services to demonstrate log aggregation across different compute platforms:

Amazon EC2 – A web service that provides secure, resizable compute capacity in the cloud. It’s designed to make web-scale cloud computing easier for developers.
Amazon ECS – A web service that makes it easy to run, scale, and manage Docker containers on AWS, designed to make the Docker experience easier for developers.
Amazon EKS – A web service that makes it easy to run, scale, and manage Docker containers on AWS.
Kinesis Data Firehose – A fully managed service that makes it easy to stream data to Amazon S3, Amazon Redshift, or Amazon OpenSearch Service.
Lambda – A compute service that lets you run code without provisioning or managing servers. It’s designed to make web-scale cloud computing easier for developers.
Amazon OpenSearch Service – A fully managed service that makes it easy for you to perform interactive log analytics, real-time application monitoring, website search, and more.

The following diagram shows the architecture of our solution.

The architecture uses various log aggregation tools such as log agents, log routers, and Lambda extensions to collect logs from multiple compute platforms and deliver them to Kinesis Data Firehose. Kinesis Data Firehose streams the logs to Amazon OpenSearch Service. Log records that fail to get persisted in Amazon OpenSearch service will get written to AWS S3. To scale this architecture, each of these compute platforms streams the logs to a different Firehose delivery stream, added as a separate index, and rotated every 24 hours.

The following sections demonstrate how the solution is implemented on each of these computing platforms.

Amazon EC2

The Kinesis agent collects and streams logs from the applications running on EC2 instances to Kinesis Data Firehose. The agent is a standalone Java software application that offers an easy way to collect and send data to Kinesis Data Firehose. The agent continuously monitors files and sends logs to the Firehose delivery stream.

The AWS CDK script provided as part of this solution deploys a simple PHP application that generates logs under the /etc/httpd/logs directory on the EC2 instance. The Kinesis agent is configured via /etc/aws-kinesis/agent.json to collect data from access_logs and error_logs, and stream them periodically to Kinesis Data Firehose (ec2-logs-delivery-stream).

Because Amazon OpenSearch Service expects data in JSON format, you can add a call to a Lambda function to transform the log data to JSON format within Kinesis Data Firehose before streaming to Amazon OpenSearch Service. The following is a sample input for the data transformer:

46.99.153.40 - - [29/Jul/2021:15:32:33 +0000] "GET / HTTP/1.1" 200 173 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"

The following is our output:

{
    "logs" : "46.99.153.40 - - [29/Jul/2021:15:32:33 +0000] \"GET / HTTP/1.1\" 200 173 \"-\" \"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36\"",
}

We can enhance the Lambda function to extract the timestamp, HTTP, and browser information from the log data, and store them as separate attributes in the JSON document.

Amazon ECS

In the case of Amazon ECS, we use FireLens to send logs directly to Kinesis Data Firehose. FireLens is a container log router for Amazon ECS and AWS Fargate that gives you the extensibility to use the breadth of services at AWS or partner solutions for log analytics and storage.

The architecture hosts FireLens as a sidecar, which collects logs from the main container running an httpd application and sends them to Kinesis Data Firehose and streams to Amazon OpenSearch Service. The AWS CDK script provided as part of this solution deploys a httpd container hosted behind an Application Load Balancer. The httpd logs are pushed to Kinesis Data Firehose (ecs-logs-delivery-stream) through the FireLens log router.

Amazon EKS

With the recent announcement of Fluent Bit support for Amazon EKS, you no longer need to run a sidecar to route container logs from Amazon EKS pods running on Fargate. With the new built-in logging support, you can select a destination of your choice to send the records to. Amazon EKS on Fargate uses a version of Fluent Bit for AWS, an upstream conformant distribution of Fluent Bit managed by AWS.

The AWS CDK script provided as part of this solution deploys an NGINX container hosted behind an internal Application Load Balancer. The NGINX container logs are pushed to Kinesis Data Firehose (eks-logs-delivery-stream) through the Fluent Bit plugin.

Lambda

For Lambda functions, you can send logs directly to Kinesis Data Firehose using the Lambda extension. You can deny the records being written to Amazon CloudWatch.

After deployment, the workflow is as follows:

On startup, the extension subscribes to receive logs for the platform and function events. A local HTTP server is started inside the external extension, which receives the logs.
The extension buffers the log events in a synchronized queue and writes them to Kinesis Data Firehose via PUT records.
The logs are sent to downstream systems.
The logs are sent to Amazon OpenSearch Service.

The Firehose delivery stream name gets specified as an environment variable (AWS_KINESIS_STREAM_NAME).

For this solution, because we’re only focusing on collecting the run logs of the Lambda function, the data transformer of the Kinesis Data Firehose delivery stream filters out the records of type function ("type":"function") before sending it to Amazon OpenSearch Service.

The following is a sample input for the data transformer:

[
   {
      "time":"2021-07-29T19:54:08.949Z",
      "type":"platform.start",
      "record":{
         "requestId":"024ae572-72c7-44e0-90f5-3f002a1df3f2",
         "version":"$LATEST"
      }
   },
   {
      "time":"2021-07-29T19:54:09.094Z",
      "type":"platform.logsSubscription",
      "record":{
         "name":"kinesisfirehose-logs-extension-demo",
         "state":"Subscribed",
         "types":[
            "platform",
            "function"
         ]
      }
   },
   {
      "time":"2021-07-29T19:54:09.096Z",
      "type":"function",
      "record":"2021-07-29T19:54:09.094Z\tundefined\tINFO\tLoading function\n"
   },
   {
      "time":"2021-07-29T19:54:09.096Z",
      "type":"platform.extension",
      "record":{
         "name":"kinesisfirehose-logs-extension-demo",
         "state":"Ready",
         "events":[
            "INVOKE",
            "SHUTDOWN"
         ]
      }
   },
   {
      "time":"2021-07-29T19:54:09.097Z",
      "type":"function",
      "record":"2021-07-29T19:54:09.097Z\t024ae572-72c7-44e0-90f5-3f002a1df3f2\tINFO\tvalue1 = value1\n"
   },   
   {
      "time":"2021-07-29T19:54:09.098Z",
      "type":"platform.runtimeDone",
      "record":{
         "requestId":"024ae572-72c7-44e0-90f5-3f002a1df3f2",
         "status":"success"
      }
   }
]

Prerequisites

To implement this solution, you need the following prerequisites:

The AWS Command Line Interface (AWS CLI) installed. The AWS CLI is a unified tool to manage your AWS services.
The AWS CDK installed on your local laptop.
Git installed and configured on your machine.
The Lambda extension for Kinesis Data Firehose, which is packaged as part of this solution.

Build the code

Check out the AWS CDK code by running the following command:

mkdir unified-logs && cd unified-logs
git clone https://github.com/aws-samples/unified-log-aggregation-and-analytics .

Build the lambda extension by running the following command:

cd lib/computes/lambda/extensions
chmod +x extension.sh
./extension.sh
cd ../../../../

Make sure to replace default AWS region specified under the value of firehose.endpoint attribute inside lib/computes/ec2/ec2-startup.sh.

Build the code by running the following command:

yarn install && npm run build

Deploy the code

If you’re running AWS CDK for the first time, run the following command to bootstrap the AWS CDK environment (provide your AWS account ID and AWS Region):

cdk bootstrap \
    --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess \
    aws://<AWS Account Id>/<AWS_REGION>

You only need to bootstrap the AWS CDK one time (skip this step if you have already done this).

Run the following command to deploy the code:

cdk deploy --requires-approval

You get the following output:

 ✅  CdkUnifiedLogStack

Outputs:
CdkUnifiedLogStack.ec2ipaddress = xx.xx.xx.xx
CdkUnifiedLogStack.ecsloadbalancerurl = CdkUn-ecsse-PY4D8DVQLK5H-xxxxx.us-east-1.elb.amazonaws.com
CdkUnifiedLogStack.ecsserviceLoadBalancerDNS570CB744 = CdkUn-ecsse-PY4D8DVQLK5H-xxxx.us-east-1.elb.amazonaws.com
CdkUnifiedLogStack.ecsserviceServiceURL88A7B1EE = http://CdkUn-ecsse-PY4D8DVQLK5H-xxxx.us-east-1.elb.amazonaws.com
CdkUnifiedLogStack.eksclusterClusterNameCE21A0DB = ekscluster92983EFB-d29892f99efc4419bc08534a3d253160
CdkUnifiedLogStack.eksclusterConfigCommand515C0544 = aws eks update-kubeconfig --name ekscluster92983EFB-d29892f99efc4419bc08534a3d253160 --region us-east-1 --role-arn arn:aws:iam::xxx:role/CdkUnifiedLogStack-clustermasterroleCD184EDB-12U2TZHS28DW4
CdkUnifiedLogStack.eksclusterGetTokenCommand3C33A2A5 = aws eks get-token --cluster-name ekscluster92983EFB-d29892f99efc4419bc08534a3d253160 --region us-east-1 --role-arn arn:aws:iam::xxx:role/CdkUnifiedLogStack-clustermasterroleCD184EDB-12U2TZHS28DW4
CdkUnifiedLogStack.elasticdomainarn = arn:aws:es:us-east-1:xxx:domain/cdkunif-elasti-rkiuv6bc52rp
CdkUnifiedLogStack.s3bucketname = cdkunifiedlogstack-logsfailederrcapturebucket0bcc-xxxxx
CdkUnifiedLogStack.samplelambdafunction = CdkUnifiedLogStack-LambdatransformerfunctionFA3659-c8u392491FrW

Stack ARN:
arn:aws:cloudformation:us-east-1:xxxx:stack/CdkUnifiedLogStack/6d53ef40-efd2-11eb-9a9d-1230a5204572

AWS CDK takes care of building the required infrastructure, deploying the sample application, and collecting logs from different sources to Amazon OpenSearch Service.

The following is some of the key information about the stack:

ec2ipaddress – The public IP address of the EC2 instance, deployed with the sample PHP application
ecsloadbalancerurl – The URL of the Amazon ECS Load Balancer, deployed with the httpd application
eksclusterClusterNameCE21A0DB – The Amazon EKS cluster name, deployed with the NGINX application
samplelambdafunction – The sample Lambda function using the Lambda extension to send logs to Kinesis Data Firehose
opensearch-domain-arn – The ARN of the Amazon OpenSearch Service domain

Generate logs

To visualize the logs, you first need to generate some sample logs.

To generate Lambda logs, invoke the function using the following AWS CLI command (run it a few times):

aws lambda invoke \
--function-name "<<samplelambdafunction>>" \
--payload '{"payload": "hello"}' /tmp/invoke-result \
--cli-binary-format raw-in-base64-out \
--log-type Tail

Make sure to replace samplelambdafunction with the actual Lambda function name. The file path needs to be updated based on the underlying operating system.

The function should return "StatusCode": 200, with the following output:

{
    "StatusCode": 200,
    "LogResult": "<<Encoded>>",
    "ExecutedVersion": "$LATEST"
}

Run the following command a couple of times to generate Amazon EC2 logs:

curl http://ec2ipaddress:80

Make sure to replace ec2ipaddress with the public IP address of the EC2 instance.

Run the following command a couple of times to generate Amazon ECS logs:

curl http://ecsloadbalancerurl:80

Make sure to replace ecsloadbalancerurl with the public ARN of the AWS Application Load Balancer.

We deployed the NGINX application with an internal load balancer, so the load balancer hits the health checkpoint of the application, which is sufficient to generate the Amazon EKS access logs.

Visualize the logs

To visualize the logs, complete the following steps:

On the Amazon OpenSearch Service console, choose the hyperlink provided for the OpenSearch Dashboard 7URL.
Configure access to the OpenSearch Dashboard.
Under OpenSearch Dashboard, on the Discover menu, start creating a new index pattern for each compute log.

We can see separate indexes for each compute log partitioned by date, as in the following screenshot.

The following screenshot shows the process to create index patterns for Amazon EC2 logs.

After you create the index pattern, we can start analyzing the logs using the Discover menu under OpenSearch Dashboard in the navigation pane. This tool provides a single searchable and unified interface for all the records with various compute platforms. We can switch between different logs using the Change index pattern submenu.

Clean up

Run the following command from the root directory to delete the stack:

cdk destroy

Conclusion

In this post, we showed how to unify and centralize logs across different compute platforms using Kinesis Data Firehose and Amazon OpenSearch Service. This approach allows you to analyze logs quickly and the root cause of failures, using a single platform rather than different platforms for different services.

If you have feedback about this post, submit your comments in the comments section.

Resources

For more information, see the following resources:

About the author

Hari Hari Ohm Prasath is a Senior Modernization Architect at AWS, helping customers with their modernization journey to become cloud native. Hari loves to code and actively contributes to the open source initiatives. You can find him in Medium, Github & Twitter @hariohmprasath.

ballu Ballu Singh is a Principal Solutions Architect at AWS. He lives in the San Francisco Bay area and helps customers architect and optimize applications on AWS. In his spare time, he enjoys reading and spending time with his family.

Modernize your Penetration Testing Architecture on AWS Fargate

2021-12-14 Conor Walsh

Post Syndicated from Conor Walsh original https://aws.amazon.com/blogs/architecture/modernize-your-penetration-testing-architecture-on-aws-fargate/

Organizations in all industries are innovating their application stack through modernization. Developers have found that modular architecture patterns, serverless operational models, and agile development processes provide great benefits. They offer faster innovation, reduced risk, and reduction in total cost of ownership.

Security organizations must evolve and innovate as well. But security practitioners often find themselves stuck between using powerful yet inflexible open-source tools with little support, and monolithic software with expensive and restrictive licenses.

This post describes how you can use modern cloud technologies to build a scalable penetration testing platform, with no infrastructure to manage.

The penetration testing monolith

AWS operates under the shared responsibility model, where AWS is responsible for the security of the cloud, and the customer is responsible for securing workloads in the cloud. This includes validating the security of your internal and external attack surface. Following the AWS penetration testing policy, customers can run tests against their AWS accounts, except for denial of service (DoS).

A legacy model commonly involves a central server for running a scanning application among the team. The server must be powerful enough for peak load and likely runs 24/7. Common licensing for scanner software is capped on the number of targets you can scan. This model does not scale, and incurs cost when no assessments are being performed.

Penetration testers must constantly reinvent their toolkit. Many one-off tools or scripts are built during engagements when encountering a unique problem. These tools and their environments are often customized, making standardization between machines and software difficult. Building, maintaining, and testing UI/UX and platform compatibility can be expensive and difficult to scale. This often leads to these tools being discarded and the value lost when the analyst moves on to the next engagement. Later, other analysts may run into the same scenario and need to rebuild the tool all over again, resulting in duplicated effort.

Network security scanning using modern cloud infrastructure

By using modern cloud container technologies, we can redesign this monolithic architecture to one that scales to meet increased demand, yet incurs no cost when idle. Containerization provides flexibility and secure isolation.

Figure 1. Overview of the serverless security scanning architecture

Scanning task flow

This workflow is based on the architecture shown in Figure 1:

User authenticates to Amazon Cognito with their organization’s SSO.
User makes authorized request to Amazon API Gateway.
Request is forwarded to an AWS Lambda function that pulls configuration from Amazon Simple Storage Service (S3).
Lambda function validates parameters, incorporates them into the task definition, and calls Amazon Elastic Container Service (ECS).
ECS orchestrates worker nodes using AWS Fargate compute engine and initiates task.
ECS asynchronously returns the task configuration to Lambda, which sanitizes sensitive data and sends response through API Gateway.
The ECS task launches one or more containers, which run the tool.
Scan results are stored in the ephemeral storage provided by Fargate.
Final container in the ECS task copies the scan report to S3.

Now we’ll describe the different components of the architecture shown in Figure 1. Start by packaging one’s favorite tool into a container, and publish it to Amazon Elastic Container Registry (ECR). ECR provides your containers additional layers of security assurance with built-in dependency vulnerability scans.

AWS Fargate is a serverless compute engine powering Amazon ECS to orchestrate container tasks. Fargate scales up capacity to support the current load, and scales down once complete to reduce cost. By default, Fargate offers 20 GB of ephemeral storage to each ECS task for shared storage between containers as volume mounts.

Task input and output can be processed with custom code running on the serverless computing service AWS Lambda. For multi-stage Lambda functionality, you can use AWS Step Functions.

Amazon API Gateway can forward incoming requests to these Lambda functions. API Gateway provides serverless REST endpoints to handle requests processed by Lambda functions. Amazon Cognito authorizes users through API Gateway or your organization’s single-sign on (SSO) provider.

The final step of the ECS task can upload any resulting files to an Amazon S3 bucket. Amazon S3 offers industry-leading scalability, data availability, security, and performance with integration into other AWS services. This means that the results of your data can be consumed by other AWS services for processing, analytics, machine learning, and security controls.

Amazon CloudWatch Events are used to build an event-based workflow. The S3 upload initiates a CloudWatch Event, which can then invoke a Lambda function to process the file, or launch another ECS task.

This solution is completely serverless. It will scale on demand, yet cost nothing when not in use. This architecture can support anything that can be run in a container, regardless of tool function.

Network Mapper workflow

Figure 2. Network Mapper scanner task workflow

The example in Figure 2 was based on using a tool called Network Mapper, or Nmap. However, a variety of tools can be used, including nslookup/dig, Selenium, Nikto, recon-ng, SpiderFoot, Greenbone Vulnerability Manager (GVM), or OWASP ZAP. You can use anything that runs in a container! With some additional work, findings could be fed into AWS security services like AWS Security Hub, or Amazon GuardDuty. You can also use AWS Partner Network services like Splunk and Datadog, or open source frameworks like Metasploit and DefectDojo. The flexibility to add additional applications that integrate with AWS services means that this architecture can be easily deployed into a variety of AWS environments.

Remember, installation and use of software not included in an AWS-supported Amazon Machine Image (AMI) or container, falls into the customer side of the shared responsibility model. Make sure to do your due diligence in securing any software you decide to use in this or any workload. To reduce blast radius, run this in an isolated account and only provide least privilege access to targets.

Conclusion

In this blog post, we showed how to run a penetration testing workload on a modern platform, powered with serverless, and container-based services. Amazon API Gateway is the entry point for your architecture, which calls on AWS Lambda. Lambda builds a task definition to launch a fully orchestrated, on-demand container workload using AWS Fargate and Amazon ECS. The final stage of the ECS task copies the results of the scan to Amazon S3. This can be accessed by security analysts or other downstream containers, tools, or services.

We encourage you to go build this architecture in your own environment, and begin conducting your own tests! Construct your Nmap container and store it in Amazon ECR or use securecodebox/nmap, a Docker container built for the Open Web Application Security Project® (OWASP) SecureCodeBox project. Make sure to spend time securing this workload, especially when using open-source software you’re not familiar with. Now go get scanning!

Architecture Monthly Magazine: Agriculture

2021-12-14 Bonnie McClure

Post Syndicated from Bonnie McClure original https://aws.amazon.com/blogs/architecture/architecture-monthly-magazine-agriculture/

Amazon Web Services (AWS) helps agriculture customers forecast supply and demand and create and maintain responsive, resilient food systems. This edition of Architecture Monthly focuses on the agriculture industry and their role in providing products to the world that are nutritious, healthy, accessible, affordable, and sustainable.

We’d like to thank our expert, Karen Hildebrand, Worldwide Tech Lead for Agriculture at AWS, for her contributions to the Ask an Expert column.

Please give us your feedback! Include your comments on the Amazon Kindle page. You can view past issues and reach out to [email protected] any time with your questions and comments.

In this month’s issue:

Ask an Expert: Karen Hildebrand, PhD, Worldwide Tech Lead for Agriculture at AWS
Case Study: AGCO Lowers Costs, Boosts Speed, and Increases Retention Using Amazon Kinesis Services
Quick Start: Machine to Cloud Connectivity Framework
Blog: Building a Controlled Environment Agriculture Platform
Case Study: ProRefrigeration Extends Their AWS IoT Capabilities to the Edge with CEI America
Quick Start: Genomics Tertiary Analysis and Machine Learning Using Amazon SageMaker
Solution: Improving Forecast Accuracy with Machine Learning
Blog: Building Machine Learning at the Edge Applications using Amazon SageMaker Edge Manager and IoT Greengrass v2
Blog: Processing satellite imagery with serverless architecture
Videos:
- The Future of Fresh Fruit Harvest: A Talk with FFRobotics
- Seeing is Believing – the Rise of Data & Analytics in Agriculture
- The Season for Vintner Innovation: Viticulture & More with E.J. Gallo Winery
- Farming Innovation in Ireland: A Talk with Herdwatch
- Voice Activated Agriculture
- Amazon Prime – Clarkson’s Farm Season 1

How to access the magazine

View and download past issues as PDFs on the AWS Architecture Monthly webpage.
Readers in the US, UK, Germany, and France can subscribe to the Kindle version of the magazine at Kindle Newsstand.
Visit Flipboard, a personalized mobile magazine app that you can also read on your computer.

We hope you’re enjoying Architecture Monthly, and we’d like to hear from you—leave us a star rating and comment on the Kindle Newsstand page or contact us anytime at [email protected].

This month, we’re also asking you to take a 10-question survey about your experiences with this magazine. Please take a few moments to give us your opinions.

Use Amazon EKS and Argo Rollouts for Progressive Delivery

2021-12-13 Srinivasa Shaik

Post Syndicated from Srinivasa Shaik original https://aws.amazon.com/blogs/architecture/use-amazon-eks-and-argo-rollouts-for-progressive-delivery/

A common hurdle to DevOps strategies is the manual testing, sign-off, and deployment steps required to deliver new or enhanced feature sets. If an application is updated frequently, these actions can be time-consuming and error prone. You can address these challenges by incorporating progressive delivery concepts along with the Amazon Elastic Kubernetes Service (Amazon EKS) container platform and Argo Rollouts.

Progressive delivery deployments

Progressive delivery is a deployment paradigm in which new features are gradually rolled out to an expanding set of users. Real-time measurements of key performance indicators (KPIs) enable deployment teams to measure customer experience. These measurements can detect any negative impact during deployment and perform an automated rollback before it impacts a larger group of users. Since predefined KPIs are being measured, the rollout can continue autonomously and alleviate the bottleneck of manual approval or rollbacks.

These progressive delivery concepts can be applied to common deployment strategies such as blue/green and canary deployments. A blue/green deployment is a strategy where separate environments are created that are identical to one another. One environment (blue) runs the current application version, while the other environment (green) runs the new version. This enables teams to test on the new environment and move application traffic to the green environment when validated. Canary deployments slowly release your new application to the production environment so that you can build confidence while it is being deployed. Gradually, the new version will replace the current version in its entirety.

Using Kubernetes, you already can perform a rolling update, which can incrementally replace your resource’s Pods with new ones. However, you have limited control of the speed of the rollout, and can’t automatically revert a deployment. KPIs are also difficult to measure in this scenario, resulting in more manual work validating the integrity of the deployment.

To exert more granular control over the deployments, a progressive delivery controller such as Argo Rollouts can be implemented. By using a progressive delivery controller in conjunction with AWS services, you can tune the speed of your deployments and measure your success with KPIs. During the deployment, Argo Rollouts will query metric providers such as Prometheus to perform analysis. (You can find the complete list of the supported metric providers at Argo Rollouts.) If there is an issue with the deployment, automatic rollback actions can be taken to minimize any type of disruption.

Using blue/green deployments for progressive delivery

Blue/green deployments provide zero downtime during deployments and an ability to test your application in production without impacting the stable version. In a typical blue/green deployment on EKS using Kubernetes native resources, a new deployment will be spun up. This includes the new feature version in parallel with the stable deployment (see Figure 1). The new deployment will be tested by a QA team.

Figure 1. Blue/green deployment in progress

Once all the tests have been successfully conducted, the traffic must be directed from the live version to the new version (Figure 2). At this point, all live traffic is funneled to the new version. If there are any issues, a rollback can be conducted by swapping the pointer back to the previous stable version.

Figure 2. Blue/green deployment post-promotion

Keeping this process in mind, there are several manual interactions and decisions involved during a blue/green deployment. Using Argo Rollouts you can replace these manual steps with automation. It automatically creates a preview service for testing out a green service. With Argo Rollouts, test metrics can be captured by using a monitoring service, such as Amazon Managed Service for Prometheus (Figure 3).

Prometheus is a monitoring software that can be used to collect metrics from your application. With PromQL (Prometheus Query Language), you can write queries to obtain KPIs. These KPIs can then be used to define the success or failure of the deployment. Argo Rollout deployment includes stages to analyze the KPIs before and after promoting the new version. During the prePromotionAnalysis stage, you can validate the new version using preview endpoint. This stage can verify smoke tests or integration tests. Upon meeting the desired success (KPIs), the live traffic will be routed to the new version. In the postPromotionAnalysis stage, you can verify KPIs from the production environment. After promoting the new version, failure of the KPIs during any analysis stage will automatically shut down the deployment and revert to the previous stable version.

Figure 3. Blue/green deployment using KPIs

Using canary deployment for progressive delivery

Unlike in a blue/green deployment strategy, in a canary deployment a subset of the traffic is gradually shifted to the new version in your production environment. Since the new version is being deployed in a live environment, feedback can be obtained in real-time, and adjustments can be made accordingly (Figure 4).

Figure 4. An example of a canary deployment

Argo Rollouts supports integration with an Application Load Balancer to manipulate the flow of traffic to different versions of an application. Argo Rollouts can gradually and automatically increase the amount of traffic to the canary service at specific intervals. You can also fully automate the promotion process by using KPIs and metrics from Prometheus as discussed in the blue/green strategy. The analysis will run while the canary deployment is progressing. The success of the KPIs will gradually increase the traffic on the canary service. Any failure will stop the deployment and stop routing live traffic.

Conclusion

Implementing progressive delivery in your application can help you deploy new versions of applications with confidence. This approach mitigates the risk of rolling out new application versions by providing visibility into live error rate and performance. You can measure KPIs and automate rollout and rollback procedures. By leveraging Argo Rollouts, you can have more granular control over how your application is deployed in an EKS cluster.

For additional information on progressive delivery or Argo Rollouts:

Use Amazon Managed Service for Prometheus (AMP) or install Prometheus on your EKS cluster to incorporate Prometheus
Setting up Prometheus

Migrating a Database Workflow to Modernized AWS Workflow Services

2021-12-11 Scott Wainner

Post Syndicated from Scott Wainner original https://aws.amazon.com/blogs/architecture/migrating-a-database-workflow-to-modernized-aws-workflow-services/

The relational database is a critical resource in application architecture. Enterprise organizations often use relational database management systems (RDBMS) to provide embedded workflow state management. But this can present problems, such as inefficient use of data storage and compute resources, performance issues, and decreased agility. Add to this the responsibility of managing workflow states through custom triggers and job-based algorithms, which further exacerbate the performance constraints of the database. The complexity of modern workflows, frequency of runtime, and external dependencies encourages us to seek alternatives to using these database mechanisms.

This blog describes how to use modernized workflow methods that will mitigate database scalability constraints. We’ll show how transitioning your workflow state management from a legacy database workflow to AWS services enables new capabilities with scale.

A workflow system is composed of an ordered set of tasks. Jobs are submitted to the workflow where tasks are initiated in the proper sequence to achieve consistent results. Each task is defined with a task input criterion, task action, task output, and task disposition, see Figure 1.

Figure 1. Task with input criteria, an action, task output, and task disposition

Embedded Workflow

Figure 2 depicts the database serving as the workflow state manager where an external entity submits a job for execution into the database workflow. This can be challenging, as the embedded workflow definition requires the use of well-defined database primitives. In addition, any external tasks require tight coupling with database primitives that constrains workflow agility.

Figure 2. Embedded database workflow mechanisms with internal and external task entities

Externalized workflow

A paradigm change is made with use of a modernized workflow management system, where the workflow state exists external to the relational database. A workflow management system is essentially a modernized database specifically designed to manage the workflow state (depicted in Figure 3.)

Figure 3. External task manager extracting workflow state, job data, performing the task, and re-inserting the job data back into the database

AWS offers two workflow state management services: Amazon Simple Workflow Service (Amazon SWF) and AWS Step Functions. The workflow definition and workflow state are no longer stored in a relational database; these workflow attributes are incorporated into the AWS service. The AWS services are highly scalable, enable flexible workflow definition, and integrate tasks from many other systems, including relational databases. These capabilities vastly expand the types of tasks available in a workflow. Migrating the workflow management to an AWS service reduces demand placed upon the database. In this way, the database’s primary value of representing structured and relational data is preserved. AWS Step Functions offers a well-defined set of task primitives for the workflow designer. The designer can still incorporate tasks that leverage the inherent relational database capabilities.

Pull and push workflow models

First, we must differentiate between Amazon SWF and AWS Step Functions to determine which service is optimal for your workflow. Amazon SWF uses an HTTPS API pull model where external Workers and Deciders execute Tasks and assert the Next-Step, respectively. The workflow state is captured in the Amazon SWF history table. This table tracks the state of jobs and tasks so a common reference exists for all the candidate Workers and Deciders.

Amazon SWF does require development of external entities that make the appropriate API calls into Amazon SWF. It inherently supports external tasks that require human intervention. This workflow can tolerate long lead times for task execution. The Amazon SWF pull model is represented in the Figure 4.

Figure 4. ‘Pull model’ for workflow definition when using Amazon SWF

In contrast, AWS Step Functions uses a push model, shown in Figure 5, that initiates workflow tasks and integrates seamlessly with other AWS services. AWS Step Functions may also incorporate mechanisms that enable long-running tasks that require human intervention. AWS Step Functions provides the workflow state management, requires minimal coding, and provides traceability of all transactions.

Figure 5. ‘Push model’ for workflow definition when using AWS Step Functions

Workflow optimizations

The introduction of an external workflow manager such as AWS Step Functions or Amazon SWF, can effectively handle long-running tasks, computationally complex processes, or large media files. AWS workflow managers support asynchronous call-back mechanisms to track task completion. The state of the workflow is intrinsically captured in the service, and the logging of state transitions is automatically captured. Computationally expensive tasks are addressed by invoking high-performance computational resources.

Finally, the AWS workflow manager also improves the handling of large data objects. Previously, jobs would transfer large data objects (images, videos, or audio) into a database’s embedded workflow manager. But this impacts the throughput capacity and consumes database storage.

In the new paradigm, large data objects are no longer transferred to the workflow as jobs, but as job pointers. These are transferred to the workflow whenever tasks must reference external object storage systems. The sequence of state transitions can be traced through CloudWatch Events. This verifies workflow completion, diagnostics of task execution (start, duration, and stop) and metrics on the number of jobs entering the various workflows.

Large data objects are best captured in more cost-effective object storage solutions such as Amazon Simple Storage Service (Amazon S3). Data records may be conveyed via a variety of NoSQL storage mechanisms including:

Amazon DynamoDB: Scalable and fast data retrieval of key-value task datasets
Amazon Simple Notification Service (SNS): Scalable distribution mechanism for tasks
Amazon Simple Queue Service (SQS): Asynchronous processing of data for task

The workflow manager stores pointer references so tasks can directly access these data objects and perform transformation on the data. It provides pointers to the results without transferring the data objects to the workflow. Transferring pointers in the workflow as opposed to transferring large data objects significantly improves the performance, reduces costs, and dramatically improves scalability. You may continue to use the RDBMS for the storage of structured data and use its SQL capabilities with structured tables, joins, and stored procedures. AWS Step Functions enable indirect integration with relational databases using tools such as the following:

AWS Lambda: Short-lived execution of custom code to handle tasks
AWS Glue: Data integration enabling combination and preparation of data including SQL

AWS Step Functions can be coupled with AWS Lambda, a serverless compute capability. Lambda code can manipulate the job data and incorporate many other AWS services. AWS Lambda can also interact with any relational database including Amazon Relational Database Service (RDS) or Amazon Aurora as the executor of a task.

The modernized architecture shown in Figure 6 offers more flexibility in creating new workflows that can evolve with your business requirements.

Figure 6. Using Step Functions as workflow state manager

Summary

Several key advantages are highlighted with this modernized architecture using either Amazon SWF or AWS Step Functions:

You can manage multiple versions of a workflow. Backwards compatibility is maintained as capability expands. Previous business requirements using metadata interpretation on job submission is preserved.
Tasks leverage loose coupling of external systems. This provides far more data processing and data manipulation capabilities in a workflow.
Upgrades can happen independently. A loosely coupled system enables independent upgrade capabilities of the workflow or the external system executing the task.
Automatic scaling. Serverless architecture scales automatically with the growth in job submissions.
Managed services. AWS provides highly resilient and fault tolerant managed services
Recovery. Instance recovery mechanisms can manage workflow state machines.

The modernized workflow using Amazon SWF or AWS Step Functions offers many key advantages. It enables application agility to adapt to changing business requirements. By using a managed service, the enterprise architect can focus on the workflow requirements and task actions, rather than building out a workflow management system. Finally, critical intellectual property developed in the RDBMS system can be preserved as tasks in the modernized workflow using AWS services.

Further reading:

Applying Federated Learning for ML at the Edge

2021-12-09 Randy DeFauw

Post Syndicated from Randy DeFauw original https://aws.amazon.com/blogs/architecture/applying-federated-learning-for-ml-at-the-edge/

Federated Learning (FL) is an emerging approach to machine learning (ML) where model training data is not stored in a central location. During ML training, we typically need to access the entire training dataset on a single machine. For purposes of performance scaling, we divide the training data between multiple CPUs, multiple GPUs, or a cluster of machines. But in actuality, the training data is available in a single place. This poses challenges when we gather data from many distributed devices, like mobile phones or industrial sensors. For privacy reasons, we may not be able to collect training data from mobile phones. In an industrial setting, we may not have the network connectivity to push a large volume of sensor data to a central location. In addition, raw sensor data may contain sensitive information about a plant’s operation.

In this blog, we will demonstrate a working example of federated learning on AWS. We’ll also discuss some design challenges you may face when implementing FL for your own use case.

Federated learning challenges

In FL, all data stays on the devices. A training coordinator process invokes a training round or epoch, and the devices update a local copy of the model using local data. When complete, the devices each send their updated copy of the model weights to the coordinator. The coordinator combines these weights using an approach like federated averaging, and sends the new weights to each device. Figures 1 and 2 compare traditional ML with federated learning.

Traditional ML training:

Figure 1. Traditional ML training

Federated Learning:

Figure 2. Federated learning. Note that in federated learning, all data stays on the devices, as compared to traditional ML.

Compared to traditional ML, federated learning poses several challenges.

Unpredictable training participants. Mobile and industrial devices may have intermittent connectivity. We may have thousands of devices, and the communication with these devices may be asynchronous. The training coordinator needs a way to discover and communicate with these devices.

Asynchronous communication. As previously noted, devices like IoT sensors may use communication protocols like MQTT, and rely on IoT frameworks with asynchronous pub/sub messaging. For example, the training coordinator cannot simply make a synchronous call to a device to send updated weights.

Passing large messages. The model weights can be several MB in size, making them too large for typical MQTT payloads. This can be problematic when network connections are not high bandwidth.

Emerging frameworks. Deep learning frameworks like TensorFlow and PyTorch do not yet fully support FL. Each has emerging options for edge-oriented ML, but they are not yet production-ready, and are intended mostly for simulation work.

Solving for federated learning challenges

Let’s begin by discussing the framework used for FL. This framework should work with any of the major deep learning systems like PyTorch and TensorFlow. It should clearly separate what happens on the training coordinator versus what happens on the devices. The Flower framework satisfies these requirements. In the simplest cases, Flower only requires that a device or client implement four methods: get model weights, set model weights, run a training round, and return metrics (like accuracy). Flower provides a default weight combination strategy, federated averaging, although we could design our own.

Next, let’s consider the challenges of devices and coordinator-to-device communication. Here it helps to have a specific use case in mind. Let’s focus on an industrial use case using the AWS IoT services. The IoT services will provide device discovery and asynchronous messaging tools. Running Flower code in tasks running on Amazon ECS on AWS Fargate containers orchestrated by an AWS Step Functions workflow, bridges the gap between the synchronous training process and the asynchronous device communication.

Devices: Our devices are registered with the AWS IoT Core, which gives access to MQTT communication. We use an AWS IoT Greengrass device, which lets us push AWS Lambda functions on the core device to run the actual ML training code. Greengrass also provides access to other AWS services and provides a streaming capability for pushing large payloads to the AWS Lambdacloud. Devices publish heartbeat messages to an MQTT topic to facilitate device discovery by the training coordinator.

Flower processes: The Flower framework requires a server and one or more clients. The clients communicate with the server synchronously over gRPC, so we cannot directly run the Flower client code on the devices. Rather, we use Amazon ECS Fargate containers to run the Flower server and client code. The client container acts as a proxy and communicates with the devices using MQTT messaging, Greengrass streaming, and direct Amazon S3 file transfer. Since the devices cannot send responses directly to the proxy containers, we use IoT rules to push responses to an Amazon DynamoDB bookkeeping table.

Orchestration: We use a Step Functions workflow to launch a training run. It performs device discovery, launches the server container, and then launches one proxy client container for each Greengrass core.

Metrics: The Flower server and proxy containers emit useful data to logs. We use Amazon CloudWatch log metric filters to collect this information on a CloudWatch dashboard.

Figure 3 shows the high-level design of the prototype.

Figure 3. FL prototype deployed on Amazon ECS Fargate containers and AWS IoT Greengrass cores

Production considerations

As you move into production with FL, you must design for a new set of challenges compared to traditional ML.

What type of devices do you have? Mobile and IoT devices are the most common candidates for FL.

How do the devices communicate with the coordinator? Devices come and go, and don’t always support synchronous communication. How will you discover devices that are available for FL and manage asynchronous communication? IoT frameworks are designed to work with large fleets of devices with intermittent connectivity. But mobile devices will find it easier to push results back to a training coordinator using regular API endpoints.

How will you integrate FL into your ML practice? As you saw in this example, we built a working FL prototype using AWS IoT, container, database, and application integration services. A data science team is likely to work in Amazon SageMaker or an equivalent ML environment that provides model registries, experiment tracking, and other features. How will you bridge this gap? (We may need to invent a new term – “MLEdgeOps.”)

Conclusion

In this blog, we gave a working example of federated learning in an IoT scenario using the Flower framework. We discussed the challenges involved in FL compared to traditional ML when building your own FL solution. As FL is an important and emerging topic in edge ML scenarios, we invite you to try our GitHub sample code. Please give us feedback as GitHub issues, particularly if you see additional federated learning challenges that we haven’t covered yet.

Optimize your IoT Services for Scale with IoT Device Simulator

2021-12-08 Ajay Swamy

Post Syndicated from Ajay Swamy original https://aws.amazon.com/blogs/architecture/optimize-your-iot-services-for-scale-with-iot-device-simulator/

The IoT (Internet of Things) has accelerated digital transformation for many industries. Companies can now offer smarter home devices, remote patient monitoring, connected and autonomous vehicles, smart consumer devices, and many more products. The enormous volume of data emitted from IoT devices can be used to improve performance, efficiency, and develop new service and business models. This can help you build better relationships with your end consumers. But you’ll need an efficient and affordable way to test your IoT backend services without incurring significant capex by deploying test devices to generate this data.

IoT Device Simulator (IDS) is an AWS Solution that manufacturing companies can use to simulate data, test device integration, and improve the performance of their IoT backend services. The solution enables you to create hundreds of IoT devices with unique attributes and properties. You can simulate data without configuring and managing physical devices.

An intuitive UI to create and manage devices and simulations

IoT Device Simulator comes with an intuitive user interface that enables you to create and manage device types for data simulation. The solution also provides you with a pre-built autonomous car device type to simulate a fleet of connected vehicles. Once you create devices, you can create simulations and generate data (see Figure 1.)

Figure 1. The landing page UI enables you to create devices and simulation

Create devices and simulate data

With IDS, you can create multiple device types with varying properties and data attributes (see Figure 2.) Each device type has a topic where simulation data is sent. The supported data types are object, array, sinusoidal, location, Boolean, integer, float, and more. Refer to this full list of data types. Additionally, you can import device types via a specific JSON format or use the existing automotive demo to pre-populate connected vehicles.

Figure 2. Create multiple device types and their data attributes

Create and manage simulations

With IDS, you can create simulations with one device or multiple device types (see Figure 3.) In addition, you can specify the number of devices to simulate for each device type and how often data is generated and sent.

Figure 3. Create simulations for multiple devices

You can then run multiple simulations (see Figure 4) and use the data generated to test your IoT backend services and infrastructure. In addition, you have the flexibility to stop and restart the simulation as needed.

Figure 4. Run and stop multiple simulations

You can view the simulation in real time and observe the data messages flowing through. This way you can ensure that the simulation is working as expected (see Figure 5.) You can stop the simulation or add a new simulation to the mix at any time.

Figure 5. Observe your simulation in real time

IoT Device Simulator architecture

Figure 6. IoT Device Simulator architecture

The AWS CloudFormation template for this solution deploys the following architecture, shown in Figure 6:

Amazon CloudFront serves the web interface content from an Amazon Simple Storage Service (Amazon S3) bucket.
The Amazon S3 bucket hosts the web interface.
Amazon Cognito user pool authenticates the API requests.
An Amazon API Gateway API provides the solution’s API layer.
AWS Lambda serves as the solution’s microservices and routes API requests.
Amazon DynamoDB stores simulation and device type information.
AWS Step Functions include an AWS Lambda simulator function to simulate devices and send messages.
An Amazon S3 bucket stores pre-defined routes that are used for the automotive demo (which is a pre-built example in the solution).
AWS IoT Core serves as the endpoint to which messages are sent.
Amazon Location Service provides the map display showing the location of automotive devices for the automotive demo.

The IoT Device Simulator console is hosted on an Amazon S3 bucket, which is accessed via Amazon CloudFront. It uses Amazon Cognito to manage access. API calls, such as retrieving or manipulating information from the databases or running simulations, are routed through API Gateway. API Gateway calls the microservices, which will call the relevant service.

For example, when creating a new device type, the request is sent to API Gateway, which then routes the request to the microservices Lambda function. Based on the request, the microservices Lambda function recognizes that it is a request to create a device type and saves the device type to DynamoDB.

Running a simulation

When running a simulation, the microservices Lambda starts a Step Functions workflow. First, the request contains information about the simulation to be run, including the unique device type ID. Then, using the unique device type ID, Step Functions retrieves all the necessary information about each device type to run the simulation. Once all the information has been retrieved, the simulator Lambda function is run. The simulator Lambda function uses the device type information, including the message payload template. The Lambda function uses this template to build the message sent to the IoT topic specified for the device type.

When running a custom device type, the simulator generates random information based on the values provided for each attribute. For example, when the automotive simulation is run, the simulation runs a series of calculations to simulate an automobile moving along a series of pre-defined routes. Pre-defined routes are created and stored in an S3 bucket, when the solution is launched. The simulation retrieves the routes at random each time the Lambda function runs. Automotive demo simulations also show a map generated from Amazon Location Service and display the device locations as they move.

The simulator exits once the Lambda function has completed or has reached the fifteen-minute execution limit. It then passes all the necessary information back to the Step Function. Step Functions then enters a choice state and restarts the Lambda function if it has not yet surpassed the duration specified for the simulation. It then passes all the pertinent information back to the Lambda function so that it can resume where it left off. The simulator Lambda function also checks DynamoDB every thirty seconds to see if the user has manually stopped the simulation. If it has, it will end the simulation early. Once the simulation is complete, the Step Function updates the DynamoDB table.

The solution enables you to launch hundreds of devices to test backend infrastructure in an IoT workflow. The solution contains an Import/Export feature to share device types. Exporting a device type generates a JSON file that represents the device type. The JSON file can then be imported to create the same device type automatically. The solution allows the viewing of up to 100 messages while the solution is running. You can also filter the messages by topic and device and see what data each device emits.

Conclusion

IoT Device Simulator is designed to help customers test device integration and IoT backend services more efficiently without incurring capex for physical devices. This solution provides an intuitive web-based graphic user interface (GUI) that enables customers to create and simulate hundreds of connected devices. It is not necessary to configure and manage physical devices or develop time-consuming scripts. Although we’ve illustrated an automotive application in this post, this simulator can be used for many different industries, such as consumer electronics, healthcare equipment, utilities, manufacturing, and more.

Get started with IoT Device Simulator today.

Use a City Planning Analogy to Visualize and Create your Cloud Architecture

2021-12-06 Marwan Al Shawi

Post Syndicated from Marwan Al Shawi original https://aws.amazon.com/blogs/architecture/use-a-city-planning-analogy-to-visualize-and-create-your-cloud-architecture/

If you are new to creating cloud architectures, you might find it a daunting undertaking. However, there is an approach that can help you define a cloud architecture pattern by using a similar construct. In this blog post, I will show you how to envision your cloud architecture using this structured and simplified approach.

Such an approach helps you to envision the architecture as a whole. You can then create reusable architecture patterns that can be used for scenarios with similar requirements. It also will help you define the more detailed technological requirements and interdependencies of the different architecture components.

First, I will briefly define what is meant by an architecture pattern and an architecture component.

Architecture pattern and components

An architecture pattern can be defined as a mechanism used to structure multiple functional components of a software or a technology solution to address predefined requirements. It can be characterized by use case and requirements, and should be tested and reusable whenever possible.

Architecture patterns can be composed of three main elements: the architecture components, the specific functions or capabilities of each component, and the connectivity among those components.

A component in the context of a technology solution architecture is a building block. Modular architecture is composed of a collection of these building blocks.

To think modularly, you must look at the overall technology solution. What is its intended function as a complete system? Then, break it down into smaller parts or components. Think about how each component communicates with others. Identify and define each block or component and its specific roles and function. Consider the technical operational responsibilities each is expected to deliver.

Cloud architecture patterns and the city planning analogy

Let’s assume a content marketing company wants to provide marketing analytics to its partners. It proposes a SaaS solution, by offering an analytics dashboard on Amazon Web Services (AWS). This company may offer the same solution in other locations in the future.

How would you create a reusable architecture pattern for such a solution? To simplify the concept of a component and the architecture pattern, let’s use city planning as a frame of reference.

Subarchitectures or components

A city can be imagined as consisting of three organizing contexts or components:

Overall City Architecture (the big picture)
District Architecture
Building Architecture

Let’s define each of these components or subarchitectures, and see how they correlate to an enterprise cloud architecture.

I. City Architecture consists of the city structures and the integrations of services required by the population, see Figure 1.

Figure 1. Oversimplified city layout

The overall anticipated capacity within a certain period must be calculatedfor roads, sewage, water, electricity grids, and overall city layout. Typically, this structure should be built from the intended purpose or vision of the city. This can be the type of services it will offer, and the function of each district.

Think of City Architecture as the overall cloud architecture for your enterprise. Include the anticipated capacity, the layout (single Region, multi-Region), type, and number of Amazon Virtual Private Cloud (VPC)s. Decide how you will connect and integrate all these different architecture components.

The initial workflow that can be used to define the high-level architecture pattern layout of the SaaS solution example is analogous to the overall city architecture. We can define its three primary elements: architecture components, specific functions of each component, and the connectivity among those components.

Production environment. The front and backend of your application. It provides the marketing data analytics dashboard.
Testing and development environment. A replica of, but isolated from the Production app. Users’ traffic doesn’t pass through security inspection layer.
Security layer. Provides perimeter security inspection. Users’ traffic passes through security inspection layer.

Translating this workflow into an AWS architecture, Figure 2 shows the analogous structure.

Single AWS Region (to be offered in a specific geographical area)
Amazon VPC to host the production application
Amazon VPC to host the test/dev application
Separate VPC (or a layer within a VPC) to provide security services for perimeter security inspection
Customer’s connectivity (for example, over public internet, or VPN)
AWS Transit Gateway (TGW) to connect and isolate the different components (VPCs and VPN)

Figure 2. Architecture pattern (high-level layout)

Domain-driven design

At this stage, you may also consider a domain-driven design (DDD). This is an approach to software development that centers on a domain model. With your DDD, you can break the solution into different bounded contexts. You can translate the business functions/capabilities into logical domains, and then define how they communicate.

Let’s use the same SaaS example and further analyze the requirements of the solution with the DDD approach in mind. The SaaS solution is offered based on two types of industries: regulated with specific security compliance, and non-regulated. By translating this into logical domains, we can optimize the design to offer a more modular architecture. This will minimize the blast radius of the solution, as illustrated in Figure 3. Watch How AWS Minimizes the Blast Radius of Failures.

Figure 3. DDD-based architecture pattern (high-level layout)

Now let’s think of governmental boundaries within a city and among its districts. This can be analogous to AWS accounts structures and the trust boundaries among them. By applying this to the example preceding, the VPC with the security compliance requirements can be placed in a separate AWS account. Read Design principles for organizing your AWS accounts.

II. District Architecture consists of the structures and integrations required within a district to manage its buildings, see Figure 4.

Figure 4. City structure with districts

It illustrates how to connect/integrate back to the city-wide architecture. It should consider the overall anticipated capacity within each district.

For instance, a district can be designed based on the type of function/service it provides, such as residential district, leisure district, or business district.

Mapping this to cloud architecture, you can envision it as the more specific functions/services you are expecting from a certain block, component, or domain. Your architecture can be within one or multiple VPCs, as shown in Figure 5. The structure of a domain or block can vary by number of Availability Zones and VPCs, type of external access, compliance requirements, and the hosted application requirements. Each of these blocks serves a different function and requires different specifications. However, they all need to integrate back to the overall cloud and network architecture to provide a cohesive design.

The architect must define and specify clearly the communication model among the architecture components. You may further break the application architecture at the module level into microservices using the DDD approach. An example is the use of Micro-frontend Architectures on AWS.

Figure 5. Architecture module structure

III. Building Architecture refers to the buildings’ structures and standards required to deliver the specific properties/services within a district. It also must integrate back with the district architecture.

To apply this to your architecture, envision the specialized functions/capabilities you are expecting from your application within a module (subcomponents). What are the requirements needed for the application tiers? In this example, let’s assume that the VPC without security compliance requirements will use a frontend web tier on Amazon EC2. Its backend database will be Amazon Relational Database Service (RDS).

Each of these subcomponents must integrate with other components and modules, as well as to the public internet. For example, an AWS Application Load Balancer could handle connections requests from external users, and AWS Web Application Firewall (AWS WAF) used as the perimeter security layer. AWS Transit Gateway could connect to other modules (VPCs). NAT gateways could provide connectivity to the internet for the internal systems in a VPC (shown in Figure 6.)

Figure 6. Architecture module and its subcomponents structure

Conclusion

The vision and goal of a city architecture can set the basis for districts’ architectures. In turn, the district architecture sets the basis of the building architecture within a district. Similarly, the targeted enterprise cloud architecture goal should set the key requirements of the building blocks (or functional components) of the architecture.

Each architecture block sets the requirements of the subcomponents. They collectively construct a system or module of a system, as illustrated in Figure 7.

Figure 7. Structure of cloud architecture requirements and interdependencies

As a next step, assess your architecture from both a scale and reliability perspective. Designing for scale alone is not enough. Reliable scalability should be always the targeted architectural attribute. Read Architecting for Reliable Scalability.

Integrate Okta to Extend Active Directory Infrastructure into AWS

2021-12-03 Pavankumar Kasani

Post Syndicated from Pavankumar Kasani original https://aws.amazon.com/blogs/architecture/integrate-okta-to-extend-active-directory-infrastructure-into-aws/

Are you ready to extend your on-premises Active Directory to Amazon Web Services (AWS) to remove undifferentiated heavy lifting? Would you like to maintain a highly available Directory Service for your applications? Companies who have already set up integration with Okta Identity Cloud for external or internal applications require Active Directory objects to be synced to Okta for authentication. To achieve centralized access for on-premises and cloud applications, you can extend your on-premises Active Directory to AWS Managed Microsoft Active Directory (AD) using a trust relationship.

This blog shows an architecture pattern that you can use to synchronize your on-premises AD and AWS Managed AD objects. You can use Okta Identity Cloud using an Okta AD agent for syncing users and groups. The Okta AD agent can be installed and configured on a domain-joined on-premises server or an Amazon EC2 instance on AWS (see Figure 1).

AWS Directory Service lets you run Microsoft Active Directory (AD) as a managed service, and is powered by Windows Server 2012 R2. When you select and launch this directory type, it is created as a highly available pair of domain controllers connected to your Amazon Virtual Private Cloud (VPC). The domain controllers run in different Availability Zones in an AWS Region of your choice.

Okta is an enterprise-grade identity management service, which is compatible with many on-premises and cloud applications. The Okta AD agent enables you to integrate Okta with your on-premises AD. This way you can integrate your SaaS applications and your AD instances with Okta. You can simplify and centralize user management and share user credentials with other integrated cloud and on-premises applications.

Figure 1. Active Directory objects synchronization to Okta identity cloud

Network connectivity between corporate data center and AWS Regions

Before getting started with configuring a trust relationship with on-premises AD and AWS managed AD, be sure you’ve read and understand the prerequisites for setting up trust. For example, it is highly recommended to have a VPN or AWS Direct Connect circuit in place between your VPC and your on-premises environment. To create a resilient VPN connection, refer to the Site-to-Site VPN User Guide.

AWS Site-to-Site VPN is a fully managed service that uses IP security (IPsec) tunnels to create a secure connection between your data center or branch office, and your AWS resources. When using Site-to-Site VPN, you can connect to Amazon VPC and also AWS Transit Gateway. Two tunnels per connection are used for increased redundancy. You can also create a dedicated or a hosted connection using AWS Direct Connect.

Trust relationship between on-premises AD and AWS Managed AD

A trust relationship is a link between two different domains. For example, a one-way trust scenario allows the user accounts from the trusted domain to access resources in the trusting domain. In a two-way trust scenario, user accounts and resources can be passed between the two domains bidirectionally. A two-way trust relationship is commonly set up between on-premises AD and AWS Managed AD to extend authentication. This is used for any directory-aware workloads in the AWS Cloud, providing users and groups access to resources in either domain using single sign-on (SSO).

AWS Managed Microsoft Active Directory (AD) supports external and forest trust relationships with your existing on-premises domain in all three trust relationship directions:

One-way incoming
One-way outgoing
Two-way bidirectional

To create a trust relationship, follow these steps:

Prepare your on-premises domain for the trust relationship. This includes preparing your firewall configuration, enable Kerberos pre-authentication, and configuring conditional forwarders.
Prepare your AWS Managed Microsoft AD for the trust relationship. This includes configuring your VPC subnets, security groups, and enabling Kerberos pre-authentication.
Create the trust relationship between your on-premises AD and your AWS Managed Microsoft Active Directory (AD).

Install and configure Okta agent

Download and install Okta AD agent on your Amazon EC2 instance, which should be domain-joined with AWS Managed AD. One Okta AD agent can associate with multiple domains. Once the trust has been set up between on-premises AD and AWS Managed AD, you can associate multiple domains under the same Okta AD agent on Amazon EC2, instead of hosting and managing separate Okta AD agent servers in your own data center and AWS.

For a highly available architecture, a redundant Okta AD agent running in your corporate data center is recommended. This will help you avoid the impact of network connectivity failure between data centers and AWS Regions. Okta recommends installing multiple Okta AD agents on each domain server to achieve high availability and failover protection.

Read Okta AD integration step-by-step setup for installing and configuring Okta agent.

Validate AD objects

Once the Okta agent is installed and configured on the Amazon EC2 instance, log in to the Okta admin console. Under the provisioning to Okta tab, do a full import of users from AWS Managed AD (see Figure 2, Figure 3). The subsequent objects synchronization will be done through scheduled import with a minimum interval of one hour. After the import is done, if there are any user account overlaps between AWS Managed AD and Okta, manually assign the AD users to Okta users. You can create matching rules to automatically map the users from AD to Okta. Read Import AD users to Okta.

Figure 2. Import users under Okta admin console

Figure 3. Import users results under Okta admin console

Matching rules are used in the import of users from all apps and directories that provide importing. If there is an existing Okta account, AD allows you to import and confirm users automatically (see Figure 4).

Figure 4. User creation and matching under Okta admin console

You can import groups from any forest or domain connected to Okta. The Okta AD Agent detects all groups in the domain or the organizational units (OUs) that you select. If you register an Okta AD Agent for more than one domain and you have the root OU selected for all domains, all groups will be imported. Read Import AD Groups to Okta to synchronize groups from AD to Okta.

Synchronize passwords to Okta

When you sign in to Okta using your organization’s AD credentials, the authentication process is delegated to the connected on-premises AD. Okta does not see or store the credentials.

In some cases, the credentials must be synchronized from a directory across Okta to an application. If a user changes the password stored in Active Directory and then tries to access applications using the same single sign-on session, they will receive a password error message. This is because the new password has not been synchronized to the application, so a new sign-in process is required for password validation.

To avoid a disruptive user experience, use the Okta AD Password Sync Agent to synchronize passwords from AD to Okta and to integrated apps. The Okta AD Password Sync Agent will track password changes in AD and then synchronize to Okta.

For more details on the password synchronization and password reset workflow, you can read step-by-step instructions on Synchronize passwords from Active Directory to Okta.

Summary

In this blog post, we discussed a way for synchronizing users and credentials from on-premises Active Directory and AWS Managed AD to Okta Identity Cloud. With synchronization, you can centralize access of cloud and on-premises applications and provide seamless access to AD-aware services within AWS.

Customers can also migrate on-premises AD to AWS using Active Directory Migration Tool (ADMT) along with the Password Export Server (PES) service.

Read more:

Field Notes: Building an Industrial Data Platform to Gather Insights from Operational Data

2021-12-02 Shailaja Suresh

Post Syndicated from Shailaja Suresh original https://aws.amazon.com/blogs/architecture/field-notes-building-an-industrial-data-platform-to-gather-insights-from-operational-data/

Co-authored with Russell de Pina, former Sr. Partner Solutions Architect at AWS

Manufacturers looking to achieve greater operational efficiency need actionable insights from their operational data. Traditionally, operational technology (OT) and enterprise information technology (IT) existed in silos with various manual processes to clean and process data. Leveraging insights at a process level requires converging data across silos and abstracting the right data for the right use case in a scalable fashion.

The AWS Industrial Data Platform (IDP) facilitates the convergence of enterprise and operational data for gathering insights that lead to overall operational efficiency of the plant. In this blog, we demonstrate a sample AWS IDP architecture along with the steps to gather data across multiple data sources on the shop floor.

Overview of solution

With the IDP Reference Architecture as the backdrop, let us walk through the steps to implement this solution. Let us say we have gas compressors being used in our factory. The Operations Team has identified an anomaly in the wear of the bearings used in some of these compressors. Currently the bearings are supplied by two manufacturers, ABC and XYZ.

The graphs in Figures 1 and 2 show the expected and actual performance data of the bearings with respect to the vibration (actual and expected) against time. The points represent a cluster of plots which are represented as single dots here. The Mean Time Between Maintenance (MTBM) per the manufacturer for these bearings is five years. The factory has detected that for ABC bearings, the actual vibrations detected during its half-life is much more than the normal (expected) as shown. This clearly requires further analysis to identify the root cause for this discrepancy to prevent compressor breakdowns, unplanned downtime and cost.

Figure 1 – ABC Bearings Anomaly

Figure 2 – XYZ Bearings Vibration

Although the deviation observed from expected is much less in XYZ bearings, there is still a deviation which needs to be understood. This blog provides an overview of how the various AWS services of the IDP architecture help with this analysis. Figure 3, shows the AWS architecture used for this specific use case.

Figure 3 – IDP AWS Architecture to solve for the Bearings Anomaly

Tracking Anomaly Data

The sensors on the compressor bearings send the vibration/chatter data to AWS through AWS IoT core. Amazon Lookout for Equipment is configured as detailed out in the user guide with the necessary data formatting guidelines.

The raw sensor data from IoT Core which is in a JSON format is converted into a CSV format with the necessary headers to match the schema (one schema for each sensor) in Figure 7 using AWS Lambda. This CSV conversion is needed for the data to be processed by Amazon Lookout for Equipment. A sample sensor data from the ABC sensor, the AWS Lambda code snippet to convert this JSON to CSV and the output CSV which is to be ingested into Lookout for Equipment are shown in Figure 4,5 and 6 respectively. For detailed steps on how a dataset is created to be ingested into Lookout For Equipment, please refer the user guide.

Figure 4: Sample Sensor Data from ABC Bearings’ Sensor

import boto3
import botocore
import csv
import json

def lambda_handler(event, context):
    BUCKET_NAME = 'L4EData' 
    OUTPUT_KEY = 'csv/ABCSensor.csv' # OUTPUT FILE 
    INPUT_KEY = 'ABCSensor.json'# INPUT FILE 
    s3client = boto3.client('s3')
    s3 = boto3.resource('s3') 
    
    obj = s3.Object(BUCKET_NAME, INPUT_KEY)
    data = obj.get()['Body'].read().decode('utf-8')
    json_data = json.loads(data)
    print(json_data)

    
    output_file_path = "/tmp/data.csv"
    
    with open(output_file_path, "w") as file:
        csv_file = csv.writer(file)
        csv_file.writerow(['TIMESTAMP', 'Sensor'])
        for item in json_data:
            csv_file.writerow([item.get('TIMESTAMP'),item.get('Sensor')])
    
    csv_binary = open(output_file_path, 'rb').read()
    
    
    try:
        obj = s3.Object(BUCKET_NAME, OUTPUT_KEY)
        obj.put(Body=csv_binary)
    except botocore.exceptions.ClientError as e:
        if e.response['Error']['Code'] == "404":
            print("The object does not exist.")
        else:
            raise
    
    try:
        download_url = s3client.generate_presigned_url(
                         'get_object',
                          Params={
                              'Bucket': BUCKET_NAME,
                              'Key': OUTPUT_KEY
                              },
                          ExpiresIn=3600
        )
        return json.dumps(download_url)
    except Exception as e:
        raise utils_exception.ErrorResponse(400, e, Log)

Figure 5 – AWS Lambda Python Code snippet to Convert CSV to JSON

Figure 6 – Output CSV from AWS Lambda with Timestamp and Vibration values from Sensor

Figure 7 – Data Schema used for Dataset Ingestion in Amazon Lookout for Equipment for ABC Vibration Sensor

Once the model is trained the anomaly is detected on bearing ABC as already stated.

Analyze Factory Environment Data

For our analysis, it is important to factor in the factory environment conditions. There are variables like factory humidity, machine temperature, bearing lubrication levels that could attribute to increased vibration. We have all the environment data gathered from factory sensors made available through AWS IoT Core as shown in Figure 3. Since the number of parameters measured through the factory sensors can change over time, the architecture uses a no-SQL database, Amazon DynamoDB, to persist this data for downstream analysis.

Since we only need sensor events that exceed a particular threshold, we set rules under AWS IoT Core to capture those events that could potentially cause increased vibration on the bearings. For example, we just want only those events that exceed a temperature threshold, say anything above 75-degree Fahrenheit along with the timestamp in Amazon DynamoDB. This is beyond the normal operating temperature for the bearings and certainly demands our interest.

So, we set a rule trigger as shown in Figure 8 in the Rule query statement while defining an action under IoT Core. The flow of events from an IoT sensor to Amazon DynamoDB is illustrated in Figure 9.

Figure 8 – Define a Rule in IoT Core

The events are received on a minute basis. So, if there are 3 records inserted in Amazon DynamoDB on a day, it means that there has been a temperature threshold breach for a maximum of 3 minutes.

We set up similar rules and actions to DynamoDB for humidity and compressor lubrication levels.

Figure 9: Event from IoT factory sensor to Amazon DynamoDB

1) Factory Data Storage

It is required to factor in both the factory data like machine id, shift number, shop floor and the operator data like the operator name, machine-operator relationships, operator’s shift, etc., for this analysis since we are drilling down to the granular details of which operators where working on the machines which have ABC bearings. Since this data is relational, it is stored in Amazon Aurora. We opt for the serverless option of the Aurora database since the operations team requires a database that is fully managed.

The data of the vendors who supply ABC and XYZ bearings and their contracts are stored in Amazon S3. We also have the operators’ performance scores pre-calculated and stored in Amazon S3.

2) Querying the Data Lake

Now that we have data ingested and stored in various channels – Amazon S3, AWS IoT Core, Amazon DynamoDB and Amazon Aurora, it is required that we collate this data for further analysis and querying. We use Athena Federated Query under Amazon Athena for querying across these multiple data sources and store the results in Amazon S3 as shown in Figure 10. It is required that we create a workgroup and configure connectors to set up Amazon S3, Amazon DynamoDB and Amazon Aurora as the data sources under Amazon Athena. The detailed steps to create and connect the data sources are provided in the user guide.

Figure 10: Athena Federated Query across Data Sources

We are now interested to gather some insights across the three different data sources. Let us find all those machines and their corresponding operators with performance scores in factory shop floor 1 which have the lubrication levels as demonstrated in Figure 11.

Figure 11: Query to Determine Operators and their Performance

From the output in Table 1, we see that operators on these machines which have the machine temperature levels of above the threshold to be having good scores (above 5, which is the average) based on past performance. Hence, we could conclude for now that they have been operating these machines under the normal expected conditions with the right controls and settings.

Table 1: Operators and their Scores

We would then want to find all those machines which have ABC Bearings and the vendors who supplied them as demonstrated in Figure 12.

Figure 12: Vendors Supplying ABC Bearings

Table 2: Names of Vendors Who Supply ABC Bearings

As a next step, we would want to reach out to the vendors, Jane Doe Enterprises and Scott Inc.(see Table 2) to report the issue with ABC Bearings. The next step is to potentially modify the supply contracts and purchase only from XYZ manufacturer and/or to look for other bearing manufacturers in the vendor supply chain.

Conclusion

In this blog, we covered a sample use case to demonstrate how the AWS Industrial Data Platform can help gather actionable insights from factory floor data to gain operation efficiencies. We also did a walkthrough of a sample use case to demonstrate how the various AWS services can be used to build a scalable data platform for seamless querying and processing of data coming in with various formats.

Top 5: Featured Architecture Content for November

2021-12-01 Ellen Crowley

Post Syndicated from Ellen Crowley original https://aws.amazon.com/blogs/architecture/top-5-featured-architecture-content-for-november/

The AWS Architecture Center provides new and notable reference architecture diagrams, vetted architecture solutions, AWS Well-Architected best practices, whitepapers, and more. This blog post features some of our best picks from November’s new and updated content.

1. Game Industry Lens for the AWS Well-Architected Framework

This new Lens specifies best practices for the unique characteristics of building and operating games in the cloud, based on our experience working with game developers and publishers around the world. It provides guidance on how to design and operate your environment so that it is cost-optimized and scalable for fluctuations in global player demand. This Lens also provides guidance for securing your game infrastructure and tuning performance in order to deliver a positive player experience.

GameLift with Serverless backend reference architecture diagram

2. SAP Lens for the Well-Architected Framework

Another new Lens is aimed at SAP technology architects, cloud architects, and others who build, operate, and maintain SAP systems on AWS. This lens provides insights that AWS has gathered from customers, AWS Partners, and our SAP specialist community. You will learn how to adopt a cloud-native approach when running SAP software including SAP Business Suite, SAP S/4HANA, and supporting products. Use it to evaluate SAP on AWS workloads before, during, and after implementation.

3. Data Analytics Lens for the Well-Architected Framework

The AWS Well-Architected Data Analytics Lens is a collection of customer-proven best practices for designing analytics workloads. The newly updated Lens contains insights that AWS has gathered from real-world case studies, and helps you learn the key design elements of analytics workloads along with recommendations for improvement. The document cover 6 scenarios that are common in many analytics applications, including modern data architecture, batch data processing, and data visualization.

4. Availability and Beyond: Understanding and improving the resilience of distributed systems on AWS

What does it mean to build a highly available workload? How do you measure availability? What can I do to increase my workload’s availability? This new whitepaper will help you answer these questions for workloads operating on complex, distributed systems both in the cloud and on-premises. It outlines a common understanding for availability as a measure of resilience, establishes rules for building highly available workloads, and offers guidance on how to improve workload availability.

5. Content Localization on AWS

This Solutions Implementation provides automated video content transcription, subtitling and translation. It helps you create accurate, multi-language subtitles for video-on-demand (VOD) content, which can help all customers better understand the content. However, localizing (translating) video subtitles is both complex and labor-intensive. Learn how AWS artificial intelligence (AI) services can assist with the subtitle creation process to save manual time spent transcribing, subtitling, translating, and reviewing media assets.

Field Notes: Building a Data Service for Autonomous Driving Systems Development using Amazon EKS

2021-11-30 Ajay Vohra

Post Syndicated from Ajay Vohra original https://aws.amazon.com/blogs/architecture/field-notes-building-a-data-service-for-autonomous-driving-systems-development-using-amazon-eks/

Many aspects of autonomous driving (AD) system development are based on data that capture real-life driving scenarios. Therefore, research and development professionals working on AD systems need to handle an ever-changing array of interesting datasets composed from the real-life driving data. In this blog post, we address a key problem in AD system development, which is how to dynamically compose interesting datasets from real-life driving data and serve them at scale in near real-time.

The first challenge in composing large interesting datasets is high latency. If you have to wait for the entire dataset to be composed before you can start consuming the dataset, you may have to wait for several minutes, or even hours. This latency slows down AD system research and development. The second challenge is creating a data service that can cost-efficiently serve the dynamically composed datasets at scale. In this blog post, we propose solutions to both these challenges.

For the challenge of high latency, we propose dynamically composing the data sets as chunked data streams, and serving them using a Amazon FSx for Lustre high-performance file-system. Chunked data streams immediately solve the latency issue, because you do not need to compose the entire stream before it can be consumed. For the challenge of cost-efficiently serving the datasets at scale, we propose using Amazon EKS with auto-scaling features.

Overview of the Data Service Architecture

The data service described in this post dynamically composes and serves data streams of selected sensor modalities for a specified drive scene selected from the A2D2 driving dataset. The data stream is dynamically composed from the extracted A2D2 drive scene data stored in Amazon S3 object data store, and the accompanying meta-data stored in an Amazon Redshift data warehouse. While the data service described in this post uses the Robot Operating System (ROS), the data service can be easily adapted for use with other robotic systems.

The data service runs in Kubernetes Pods in an Amazon EKS cluster configured to use a Horizontal Pod Autoscaler and EKS Cluster Autoscaler. An Amazon Managed Service For Apache Kafka (MSK) cluster provides the communication channel between the data service, and the data clients. The data service implements a request-response paradigm over Apache Kafka topics. However, the response data is not sent back over the Kafka topics. Instead, the data service stages the response data in Amazon S3, Amazon FSx for Lustre, or Amazon EFS, as specified in the data client request, and only the location of the staged response data is sent back to the data client over the Kafka topics. The data client directly reads the response data from its staged location.

The data client running in a ROS enabled Amazon EC2 instance plays back the received data stream into ROS topics, whereby it can be nominally consumed by any ROS node subscribing to the ROS topics. The solution architecture diagram for the data service is shown in Figure 1.

Figure 1 – Data service solution architecture with default configuration

Data Client Request

Imagine the data client wants to request drive scene data in ROS bag file format from A2D2 autonomous driving dataset for vehicle id a2d2, drive scene id 20190401145936, starting at timestamp 1554121593909500 (microseconds) , and stopping at timestamp 1554122334971448 (microseconds). The data client wants the response to include data only from the camera/front_left sensor encoded in sensor_msgs/Image ROS data type, and the lidar/front_left sensor encoded in sensor_msgs/PointCloud2 ROS data type. The data client wants the response data to be streamed back chunked in series of rosbag files, each file spanning 1000000 microseconds of the drive scene. The data client wants the chunked response rosbag files to be staged on a shared Amazon FSx for Lustre file system.

Finally, the data client wants the camera/front_left sensor data to be played back on /a2d2/camera/front_left ROS topic, and the lidar/front_left sensor data to be played back on /a2d2/lidar/front_left ROS topic.

The data client can encode such a data request using the following JSON object, and send it to the Kafka bootstrap servers b-1.msk-cluster-1:9092,b-2.msk-cluster-1:9092 on the Apache Kafka topic named a2d2.

{
 "servers": "b-1.msk-cluster-1:9092,b-2.msk-cluster-1:9092",
 "requests": [{
    "kafka_topic": "a2d2", 
    "vehicle_id": "a2d2",
    "scene_id": "20190401145936",
    "sensor_id": ["lidar/front_left", "camera/front_left"],
    "start_ts": 1554121593909500, 
    "stop_ts": 1554122334971448,
    "ros_topic": {"lidar/front_left": "/a2d2/lidar/front_left", 
    "camera/front_left": "/a2d2/camera/front_left"},
    "data_type": {"lidar/front_left": "sensor_msgs/PointCloud2",
    "camera/front_left": "sensor_msgs/Image"},
    "step": 1000000,
    "accept": "fsx/multipart/rosbag",
    "preview": false
 }]
}

At any given time, one or more EKS pods in the data service are listening for messages on the Kafka topic a2d2. The EKS pod that picks the request message responds to the request by composing the requested data as a series of rosbag files, and staging them on FSx for Lustre, as requested in the "accept": "fsx/multipart/rosbag" field.

Each rosbag in the response is dynamically composed from the drive scene data stored in Amazon S3, using the meta-data stored in Amazon Redshift. Each rosbag contains drive scene data for a single time step. In the preceding example, the time step is specified as "step": 1000000 (microseconds).

Visualizing the Data Service Response

If a human is interested in visualizing the data response, one can use any ROS visualization tool. One such tool is rviz. This tool can be run on the ROS desktop. In the following screenshot, we show the visualization of the response using rviz tool for the example data request shown previously.

Figure 2 – Visualization of response using rviz tool

Dynamically Transforming the Coordinate Frames

The data service supports dynamically transforming the composed data from one coordinate frame to another frame. A typical use case is to transform the data from a sensor specific coordinate frame to AV (ego) coordinate frame. Such transformation request can be included in the data client request.

For example, imagine the data client wants to compose a data stream from all the LiDAR sensors, and transform the point cloud data into the vehicle’s coordinate frame. The example configuration c-config-lidar.json allows you to do that. Following is a visualization of the LiDAR point cloud data transformed to the vehicle coordinate frame and visualized in the rviz tool from a top-down perspective.

Figure 3 – Top-down rviz visualization of point-cloud data transformed to ego vehicle view

Walkthrough

In this walkthrough, we use the A2D2 autonomous driving dataset. The complete code for this walk-through and reference documentation is available in the associated Github repository. So before we get into the walk-through, clone the Github repository on your laptop using the Git clone command. Next, ensure these prerequisites are satisfied.

The approximate cost of the walk-through of this tutorial with default configuration is US $2,000. The actual cost may vary considerably based on actual configuration, and the duration used for the walk-through.

Configure the data service

To configure the data service, we need to create a new AWS CloudFormation stack in the AWS console using the cfn/mozart.yml template from the cloned repository on your laptop.

This template creates AWS Identity and Access Management (IAM) resources, so when you create the CloudFormation Stack using the console, in the review step, you must check I acknowledge that AWS CloudFormation might create IAM resources. The stack input parameters you must specify are the following:

Parameter Name table

For all other stack input parameters, default values are recommended during the first walkthrough. Review the complete list of all the template input parameters in the Github repository reference.

Once the stack status in CloudFormation console is CREATE_COMPLETE, find the ROS desktop instance launched in your stack in the Amazon EC2 console, and connect to the instance using SSH as user ubuntu, using your SSH key pair. The ROS desktop instance is named as <name-of-stack>-desktop.
When you connect to the ROS desktop using SSH, and you see the message "Cloud init in progress. Machine will REBOOT after cloud init is complete!!", disconnect and try later after about 20 minutes. The desktop installs the NICE DCV server on first-time startup, and reboots after the install is complete.
If the message NICE DCV server is enabled!appears, run the command sudo passwd ubuntu to set a new strong password for user ubuntu. Now you are ready to connect to the desktop using the NICE DCV client.
Download and install the NICE DCV client on your laptop.
Use the NICE DCV Client to login to the desktop as user ubuntu
When you first login to the desktop using the NICE DCV client, you may be asked if you would like to upgrade the OS version. Do not upgrade the OS version.

Now you are ready to proceed with the following steps. For all the commands in this blog, we assume the working directory to be ~/amazon-eks-autonomous-driving-data-service on the ROS desktop.

If you used an IAM role to create the stack above, you must manually configure the credentials associated with the IAM role in the ~/.aws/credentials file with the following fields:

aws_access_key_id=

aws_secret_access_key=

aws_session_token=

If you used an IAM user to create the stack, you do not have to manually configure the credentials. In the working directory, run the command:

./scripts/configure-eks-auth.sh

When successfully running this command, the following confirmation appears AWS Credentials Removed.

Configure the EKS cluster environment

In this step, we configure the EKS cluster environment by running the command:

./scripts/setup-dev.sh

This step also builds and pushes the data service container image into Amazon ECR.

Prepare the A2D2 data

Before we can run the A2D2 data service, we need to extract the raw A2D2 data into your S3 bucket, extract the metadata from the raw data, and upload the metadata into the Redshift cluster. We execute these three steps using an AWS Step Functions state machine. To create and run the AWS Step Functions state machine, run the following command in the working directory:

./scripts/a2d2-etl-steps.sh

Note the executionArn of the state machine execution in the output of the previous command. To check the status of the execution, use following command, replacing executionArn below with your value:

The state machine execution time depends on many variable factors, and may take anywhere from 4 – 24 hours, or possibly longer. All the AWS Batch jobs started as part of the state machine automatically reattempt in case of failure.

Run the data service

The data service is deployed using a Helm Chart, and runs as a kubernetes deployment in EKS. To start the data service, execute the following command in the working directory:

kubectl get pods -n a2d2

Run the data service client

To visualize the response data requested by the A2D2 data client, we will use the rviz tool on the ROS desktop. Open a terminal on the desktop, and run rviz.

In the rviz tool, use File>Open Config to select /home/ubuntu/amazon-eks-autonomous-driving-data-service/a2d2/config/a2d2.rviz as the rviz config. You should notice that the rviz tool is now configured with two areas, one for visualizing image data, and the other for visualizing point cloud data.

To run the data client, open a new terminal on the desktop, and execute the following command in the root directory of the cloned Github repository on the ROS desktop:

python ./a2d2/src/data_client.py --config ./a2d2/config/c-config-ex1.json 1>/tmp/a.out 2>&1 &

After a brief delay, you should be able to preview the response data in the rviz tool. You can set "preview": false in the data client config file, ./a2d2/config/c-config-ex1.json, and rerun the preceding command to view the complete response. For maximum performance, pre-load S3 data to FSx for Lustre.

Hard reset of the data service

This step is for reference purposes. If at any time you need to do a hard reset of the data service, you can do so by executing:

helm delete a2d2-data-service

This will delete all data service EKS pods immediately. All in-flight service responses will be aborted. Because the connection between the data client and data service is asynchronous, the data clients may wait indefinitely, and you may need to cleanup the data client processes manually on the ROS desktop using operating system tools. Note, each data client instance spawns multiple Python processes. You may also want to cleanup /fsx/rosbag directory.

Clean Up

When you no longer need the data service, delete the AWS CloudFormation stack from the AWS CloudFormation console. Deleting the stack will shut down the desktop instance, and delete the EFS and FSx for Lustre file-systems created in the stack. The Amazon S3 bucket is not deleted.

Conclusion

In this post, we demonstrated how to build a data service that can dynamically compose near real-time chunked data streams at scale using EKS, Redshift, MSK, and FSx for Lustre. By using a data service, you increase agility, flexibility and cost-efficiency in AD system research and development.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.