Tag Archives: Architecture

Field Notes: How to Identify and Block Fake Crawler Bots Using AWS WAF

Post Syndicated from Vijay Menon original https://aws.amazon.com/blogs/architecture/field-notes-how-to-identify-and-block-fake-crawler-bots-using-aws-waf/

In this blog post, we focus on how to identify fake bots using these AWS services: AWS WAF, Amazon Kinesis Data Firehose, Amazon S3, and AWS Lambda. We use fake Google/Bing bots to demonstrate, but the principles can be applied to other popular crawlers, such as Slurp Bot from Yahoo, DuckDuckBot from DuckDuckGo, and the Alexa crawler from the Alexa internet ranking service.

For industries like media, online retailers, news, or social websites, content is critical and often sets them apart from competitors. These companies put significant effort into making their content as visible and accessible as possible. To do that, they rely on crawler bots so that legitimate users searching for content can find it easily. Crawler bots are useful for indexing site pages, making content more searchable, and improving rankings.

However, this capability can be misused, so it is important to distinguish between genuine crawler bots and fake ones that are doing more than just indexing your site. You need to properly identify good and bad actors so that you can stop the bad ones at scale without impacting the good ones. This helps drive more traffic, more visitors, and more revenue to your websites.

Identifying bots

There are two primary sources of information required to identify a fake bot:

  • HTTP header User-Agent: Fake bots try to present themselves as real bots, for example as Google or Bing, by using the same user agent string used by Google or Bing.
  • IP address: You can look at the source IP address of the incoming request and determine whether it belongs to the search engine provider's network, such as Google or Bing. You can do this by performing a forward and reverse DNS lookup and comparing the results (see the sketch after this list). These methods are well documented by the search engine providers Google and Bing.
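The verification itself is straightforward to script. Below is a minimal Python sketch of the forward and reverse lookup check; the accepted hostname suffixes follow the verification guidance published by Google and Bing, while the function name and example IP addresses are illustrative only.

import socket

# Domain suffixes that genuine crawler hostnames must end with, per the
# verification guidance published by Google and Bing
CRAWLER_DOMAINS = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}

def is_genuine_crawler(ip_address, crawler="googlebot"):
    """Return True if the IP address resolves back to the crawler's network."""
    try:
        # Reverse lookup: IP address -> hostname
        hostname, _, _ = socket.gethostbyaddr(ip_address)
        # Forward lookup: hostname -> IP addresses
        _, _, resolved_ips = socket.gethostbyname_ex(hostname)
    except (socket.herror, socket.gaierror):
        return False
    return hostname.endswith(CRAWLER_DOMAINS[crawler]) and ip_address in resolved_ips

# Illustrative usage: a genuine Googlebot address passes, a spoofed one fails
print(is_genuine_crawler("66.249.66.1"))    # True only if the address really resolves back to Google
print(is_genuine_crawler("203.0.113.10"))   # documentation range, expected to return False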

Solution Overview

The solution leverages the capabilities of AWS WAF. Our demonstration application is a static website hosted on Amazon S3, fronted by Amazon CloudFront. This means we can restrict access to the S3 bucket to CloudFront only, using an origin access identity.

The logs are streamed in near real time using Amazon Kinesis Data Firehose and inspected by a Lambda function to help identify fake bots before the logs are stored on Amazon S3. The Lambda function does two things:

  1. Inspect the traffic using the rules to identify bad or fake bots. In this case, it uses forward and reverse DNS lookup results for the client IP address of requests whose User-Agent string resembles Googlebot or Bingbot.
  2. Once a fake bot is identified, the Lambda function updates the AWS WAF IP set to permanently block requests coming from the IP addresses of fake bots (a sketch of this update follows this list).
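The actual detection and blocking logic ships in the Lambda deployment package referenced later in the walkthrough. Purely as an illustration, the IP set update could look roughly like the following Python (boto3) sketch; the IP set name matches the CloudFormation parameter default used below, while the IP set ID is a placeholder you would take from the stack output or look up at runtime.

import boto3

# CLOUDFRONT-scoped AWS WAF (wafv2) resources are managed through us-east-1
waf = boto3.client("wafv2", region_name="us-east-1")

IP_SET_NAME = "BlockFakeBotIPSet"   # CloudFormation parameter default
IP_SET_ID = "<ip-set-id>"           # placeholder - taken from the stack output

def block_ip(ip_address):
    """Add a /32 entry for a fake bot to the AWS WAF IP set."""
    current = waf.get_ip_set(Name=IP_SET_NAME, Scope="CLOUDFRONT", Id=IP_SET_ID)
    addresses = set(current["IPSet"]["Addresses"])
    addresses.add(ip_address + "/32")

    # update_ip_set replaces the whole address list and requires the current lock token
    waf.update_ip_set(
        Name=IP_SET_NAME,
        Scope="CLOUDFRONT",
        Id=IP_SET_ID,
        Addresses=sorted(addresses),
        LockToken=current["LockToken"],
    )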

Note: For the sake of this demonstration, we are using a static website hosted on Amazon S3 with CloudFront. This requires the AWS WAF web ACL and the IP set it uses to have the scope CLOUDFRONT. You can modify it to use the scope REGIONAL if you choose to protect web properties behind an Application Load Balancer or API Gateway.

Solution Architecture

WAF Solution Architecture

Prerequisites

Readers of this blog post should be familiar with HTTP and the AWS services used in this solution, including AWS WAF, Amazon Kinesis Data Firehose, Amazon S3, AWS Lambda, and Amazon CloudFront.

For this walkthrough, you should have the following:

  • AWS Account
  • AWS Command Line Interface (AWS CLI): You need the AWS CLI installed and configured on the workstation from which you are going to try the steps mentioned below.
  • Credentials configured in the AWS CLI should have the required IAM permissions to spin up and modify the resources mentioned in this post.
  • Make sure that you deploy the solution to the us-east-1 Region and that your AWS CLI default Region is us-east-1. If us-east-1 is not the default Region, reference the Region explicitly while executing AWS CLI commands using the --region us-east-1 switch.
  • An Amazon S3 bucket in the us-east-1 Region

Walkthrough

1.     Create an Amazon S3 static Website and put it behind an Amazon CloudFront distribution. You can follow these steps. Note down the CloudFront distribution ID. We will use it in subsequent steps.

2.     Download and unzip the file containing the CloudFormation template and Lambda function code from here to a folder on your local workstation. You will need to run all of the subsequent commands from this folder.

3.     Zip the Lambda function lambda_function.py and upload it to an Amazon S3 bucket of your choice in us-east-1. Note the bucket name, as it will be used in subsequent steps. The blog uses waf-logs-fake-bots-us-east-1 as the S3 bucket name for reference.

$ zip lambda_function.py.zip lambda_function.py
$ aws s3 cp lambda_function.py.zip s3://waf-logs-fake-bots-us-east-1/

4.     Create the resources required for this blog post by deploying the AWS CloudFormation template and running the below command:

aws cloudformation create-stack \
--stack-name FakeBotBlockBlog \
--template-body file://BotBlog.yml \
--parameters ParameterKey=KinesisBufferIntervalSeconds,ParameterValue=900 ParameterKey=KinesisBufferSizeMB,ParameterValue=3 ParameterKey=IPSetName,ParameterValue=BlockFakeBotIPSet ParameterKey=IPSetScope,ParameterValue=CLOUDFRONT ParameterKey=S3BucketWithDeploymentPackage,ParameterValue=waf-logs-fake-bots-us-east-1 ParameterKey=DeploymentPackageZippedFilename,ParameterValue=lambda_function.py.zip \
--capabilities CAPABILITY_IAM \
--region us-east-1

You need to provide the following information, and you can change the parameters based on your specific needs:

a.     KinesisBufferIntervalSeconds and KinesisBufferSizeMB. These define how often Kinesis Data Firehose ships the logs to Amazon S3. The defaults are 900 seconds and 3 MB respectively; delivery occurs when whichever limit is reached first.

b.     IPSetName. Name of the IP Set which will be used to record the client IP address of fake bots. Default value is BlockFakeBotIPSet.

c.     IPSetScope. Scope of the IP set. I am using CLOUDFRONT and associate it with the CloudFront distribution created in step 1. You can choose to make it REGIONAL, in which case the web ACL association will need to be with an ALB or an API Gateway.

d.     S3BucketWithDeploymentPackage. Name of S3 bucket used in step 3. The blog assumes waf-logs-fake-bots-us-east-1.

e.     DeploymentPackageZippedFilename. Filename of the zipped Lambda function deployment package. For example, the blog assumes lambda_function.py.zip is available in the Amazon S3 bucket and uses this value for this parameter.

Some stack templates might include resources that can affect permissions in your AWS account, for example, by creating a new AWS Identity and Access Management (IAM) role. For those stacks, you must explicitly acknowledge this by specifying the CAPABILITY_IAM or CAPABILITY_NAMED_IAM value for the --capabilities parameter.

Stack creation takes approximately 5-7 minutes. Check the status of the stack by executing the below command every few minutes. You should see a StackStatus value of CREATE_COMPLETE.

Example:

aws cloudformation describe-stacks --stack-name FakeBotBlockBlog | grep StackStatus

The CloudFormation template will create the following resources:

  • IP Set for AWS WAF
  • WebACL with rules to block the client IP addresses of fake bots, and an AWS-managed common rule set.
  • Lambda function to help detect fake bots and modify the AWS WAF IP Set to block them
  • Kinesis Firehose delivery stream, which will use the above Lambda function for processing
  • IAM roles with required permissions for the Lambda function and Kinesis Firehose
  • S3 bucket for AWS WAF logs

5.     Enable logging for the web ACL using the AWS CLI. For this you need the ARNs of the web ACL and the Kinesis Data Firehose delivery stream. You can find this information in the output of the CloudFormation stack created in step 4, using the below AWS CLI command:

aws cloudformation describe-stacks --stack-name FakeBotBlockBlog

Note the two ARNs and run the following command, replacing (1) the ResourceArn value with the web ACL ARN and (2) the LogDestinationConfigs value with the Kinesis Data Firehose delivery stream ARN.

Example:

aws wafv2 put-logging-configuration --logging-configuration ResourceArn=arn:aws:wafv2:us-east-1:123456789012:global/webacl/FakeBotWebACL/259ea98f-24ba-4acd-8803-3e7d02e8d482,LogDestinationConfigs=arn:aws:firehose:us-east-1:123456789012:deliverystream/aws-waf-logs-FakeBotBlockBlog --region us-east-1

6.     Associate the CloudFront distribution with this web ACL: Sign in to the AWS Management Console and open the AWS WAF and Shield console at https://console.aws.amazon.com/wafv2/homev2/web-acls?region=global.

  • Click on the WebACL created earlier in this procedure
  • Navigate to ‘Associated AWS resources’ tab and select Add AWS resources
  • In the subsequent screen, select the CloudFront distribution created in step 1
  • Select Add

Note: If you are using an ALB or API Gateway for your web property, then you need to use a REGIONAL web ACL and IP set. Review the procedure to associate an ALB or API Gateway with the web ACL.

You can monitor the web ACL's performance from the Overview section of the web ACL in the AWS WAF and Shield console.

Testing

To test, you will need to generate traffic that triggers the Lambda function to detect and block fake bots. The web traffic can be generated from your local machine or from an EC2 instance with internet access, using curl. Manually set the user agent to resemble Googlebot by running the following command from a shell:

Replace http://www.awsdemodesign.com/ with the URL of the CloudFront distribution you created in step 1 of the walkthrough.

for i in {1..1000}; do curl -I -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" http://www.awsdemodesign.com/; done

Initially you will see an HTTP/1.1 200 OK response. This triggers the Lambda function, which modifies the IP set to include your public IP address so that it is blocked. You can verify this by inspecting the IP set from the AWS WAF and Shield console.

  • Sign in to the AWS Management Console and open the AWS WAF and Shield console
  • Click on the IP set created earlier in this blog. On the subsequent screen, you can see your public IP address in the list of IP addresses.

If you run the curl command again, you will see that the response now is HTTP/1.1 403 Forbidden.

Clean Up

  • Disassociate the CloudFront distribution from the web ACL
  • Delete the S3 bucket and CloudFront distribution created in Step 1
  • Empty and delete the S3 bucket created by the CloudFormation stack for AWS WAF logs
  • Delete the CloudFormation stack created in Step 4

Conclusion

In this blog post, we demonstrated how you can set up and inspect incoming web traffic using AWS Lambda, AWS WAF native logging capabilities, and Kinesis Data Firehose to help detect and block bad or fake bots at scale. Furthermore, the solution outlined in this post provides a framework that can be extended to identify similar unwanted traffic impersonating other good bots.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

 

Architecting for Reliable Scalability

Post Syndicated from Marwan Al Shawi original https://aws.amazon.com/blogs/architecture/architecting-for-reliable-scalability/

Cloud solutions architects should ideally “build today with tomorrow in mind,” meaning their solutions need to cater to current scale requirements as well as the anticipated growth of the solution. This growth can be either the organic growth of a solution or it could be related to a merger and acquisition type of scenario, where its size is increased dramatically within a short period of time.

Still, when a solution scales, many architects experience added complexity to the overall architecture in terms of its manageability, performance, security, etc. By architecting your solution or application to scale reliably, you can avoid the introduction of additional complexity, degraded performance, or reduced security as a result of scaling.

Generally, a solution or service's reliability is influenced by its uptime, performance, security, manageability, and so on. In order to achieve reliability in the context of scale, take into consideration the following primary design principles.

Modularity

Modularity aims to break a complex component or solution into smaller parts that are less complicated and easier to scale, secure, and manage.


Figure 1: Monolithic architecture vs. modular architecture

Modular design is commonly used in modern application development, where an application's software is constructed of multiple, loosely coupled building blocks (functions). These functions collectively integrate through pre-defined common interfaces or APIs to form the desired application functionality (commonly referred to as a microservices architecture).

 


Figure 2: Scalable modular applications

For more details about building highly scalable and reliable workloads using a microservices architecture, refer to Design Your Workload Service Architecture.

This design principle can also be applied to different components of the solution’s architecture. For example, when building a cloud solution on a single Amazon VPC, it may reach certain scaling limits and make it harder to introduce changes at scale due to the higher level of dependencies. This single complex VPC can be divided into multiple smaller and simpler VPCs. The architecture based on multiple VPCs can vary. For example, the VPCs can be divided based on a service or application building block, a specific function of the application, or on organizational functions like a VPC for various departments. This principle can also be leveraged at a regional level for very high scale global architectures. You can make the architecture modular at a global level by distributing the multiple VPCs across different AWS Regions to achieve global scale (facilitated by AWS Global Infrastructure).

In addition, modularity promotes separation of concerns by having well-defined boundaries among the different components of the architecture. As a result, each component can be managed, secured, and scaled independently. Also, it helps you avoid what is commonly known as “fate sharing,” where a vertically scaled server hosts a monolithic application, and any failure of this server will impact the entire application.

Horizontal scaling

Horizontal scaling, commonly referred to as scale-out, is the capability to automatically add systems/instances in a distributed manner in order to handle an increase in load. An example of such an increase is a growing number of sessions to a web application. With horizontal scaling, the load is distributed across multiple instances. By distributing these instances across Availability Zones, horizontal scaling not only increases performance, but also improves overall reliability.

In order for the application to work seamlessly in a scale-out distributed manner, the application needs to be designed to support a stateless scaling model, where the application’s state information is stored and requested independently from the application’s instances. This makes the on-demand horizontal scaling easier to achieve and manage.
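As an illustration of the stateless model, session or application state can be kept in an external store such as Amazon DynamoDB or Amazon ElastiCache rather than on any individual instance. The following minimal Python (boto3) sketch assumes a hypothetical DynamoDB table named app-sessions with a session_id partition key; because the state lives outside the instances, any instance behind the load balancer can serve any request, which is what makes scaling in and out safe.

import boto3

# Hypothetical session table; any instance behind the load balancer can
# read or write state here, so the instances remain interchangeable
dynamodb = boto3.resource("dynamodb")
sessions = dynamodb.Table("app-sessions")

def save_session(session_id, data):
    # Persist session state outside the instance
    sessions.put_item(Item={"session_id": session_id, **data})

def load_session(session_id):
    # Any instance can retrieve the same state on the next request
    response = sessions.get_item(Key={"session_id": session_id})
    return response.get("Item")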

This principle can be complemented by the modularity design principle, in which the scaling model is applied only to certain components or microservices of the application stack. For example, scale out only the Amazon Elastic Compute Cloud (EC2) front-end web instances that reside behind an Elastic Load Balancing (ELB) layer, using Auto Scaling groups. In contrast, this elastic horizontal scalability can be very difficult to achieve for a monolithic type of application.

Leverage the content delivery network

Leveraging Amazon CloudFront and its edge locations as part of the solution architecture can enable your application or service to scale rapidly and reliably at a global level, without adding any complexity to the solution. The integration of a CDN can take different forms depending on the solution use case.

For example, CloudFront played an important role in enabling the scale required throughout Amazon Prime Day 2020 by serving up web and streamed content to a worldwide audience, handling over 280 million HTTP requests per minute.

Go serverless where possible

As discussed earlier in this post, modular architectures based on microservices reduce the complexity of each individual component or microservice. At scale, however, they may introduce a different type of complexity related to the number of these independent components (microservices). This is where serverless services can help reduce such complexity reliably and at scale. With this design model you no longer have to provision, manually scale, or maintain servers, operating systems, or runtimes to run your applications.

For example, you may consider using a microservices architecture to modernize an application and, at the same time, simplify the architecture at scale using Amazon Elastic Kubernetes Service (EKS) with AWS Fargate.


Figure 3: Example of a serverless microservices architecture

In addition, an event-driven serverless capability like AWS Lambda is key in today’s modern scalable cloud solutions, as it handles running and scaling your code reliably and efficiently. See How to Design Your Serverless Apps for Massive Scale and 10 Things Serverless Architects Should Know for more information.

Secure by design

To avoid major changes at a later stage to accommodate security requirements, it's essential that security is taken into consideration as part of the initial solution design. For example, if a cloud project is new or small and security is not considered properly at the initial stages, then once the solution starts to scale, redesigning the entire cloud project from scratch to accommodate security best practices is usually not a simple option. This may lead to suboptimal security solutions that impact the desired scale. By leveraging a CDN as part of the solution architecture (as discussed above), using Amazon CloudFront, you can minimize the impact of distributed denial of service (DDoS) attacks as well as perform application layer filtering at the edge. Also, when considering serverless services and the Shared Responsibility Model, from a security lens you can delegate a considerable part of the application stack to AWS so that you can focus on building applications. See The Shared Responsibility Model for AWS Lambda.

Design with security in mind by incorporating the necessary security services as part of the initial cloud solution. This will allow you to add more security capabilities and features as the solution grows, without the need to make major changes to the design.

Design for failure

The reliability of a service or solution in the cloud depends on multiple factors, the primary of which is resiliency. This design principle becomes even more critical at scale because the magnitude of the failure impact will typically be higher. Therefore, to achieve reliable scalability, it is essential to design a resilient solution, capable of recovering from infrastructure or service disruptions. This principle involves designing the overall solution in such a way that even if one or more of its components fail, the solution is still capable of providing an acceptable level of its expected function(s). See AWS Well-Architected Framework – Reliability Pillar for more information.

Conclusion

Designing for scale alone is not enough. Reliable scalability should always be the targeted architectural attribute. The design principles discussed in this blog act as the foundational pillars to support it, and ideally should be combined with adopting a DevOps model.

Field Notes: Integrating IoT and ITSM using AWS IoT Greengrass and AWS Secrets Manager – Part 2

Post Syndicated from Gary Emmerton original https://aws.amazon.com/blogs/architecture/field-notes-integrating-iot-and-itsm-using-aws-iot-greengrass-and-aws-secrets-manager-part-2/

In part 1 of this blog I introduced the need for organizations to securely connect thousands of IoT devices with many different systems in the hyperconnected world that exists today, and how that can be addressed using AWS IoT Greengrass and AWS Secrets Manager.  We walked through the creation of ServiceNow credentials in AWS Secrets Manager, the creation of IAM roles and the Lambda functions that will run on our edge device (a Raspberry Pi).

In this second part of the blog, we will set up AWS IoT Greengrass on our Raspberry Pi, along with AWS IoT Core, so that we can run the AWS Lambda functions and access our ServiceNow credentials, retrieved securely from AWS Secrets Manager.

Setting up AWS IoT Core and AWS IoT Greengrass

The overall sequence for configuring AWS IoT Core and AWS IoT Greengrass is:

  • Create a certificate and an IoT Thing, and link them
  • Create AWS IoT Greengrass group
  • Associate IAM role to the AWS IoT Greengrass group
  • Create and attach a policy to the certificate
  • Create an AWS IoT Greengrass Resource Definition for our ‘Secret’
  • Create an AWS IoT Greengrass Function Definition for our Lambda functions
  • Create an AWS IoT Greengrass Subscription Definition for IoT Topics to be used
  • Finally associate our Resource, Function and Subscription Definitions with our AWS IoT Greengrass Core

Steps

For this walkthrough, I have selected the AWS region “eu-west-1”, however, feel free to use other Regions where AWS IoT Core and AWS IoT Greengrass are available.

First, let’s install Greengrass on the Raspberry Pi:

  • Follow the instructions to configure the pre-requisites on the Raspberry Pi
  • Then we download the AWS IoT Greengrass software
  • And then we unzip the AWS IoT Greengrass software using the following command (note, this command is for version 1.10.0 of Greengrass and will change as later versions are released):

sudo tar -xzvf greengrass-linux-armv6l-1.10.0.tar.gz -C /

Note that the AWS IoT Greengrass SDK installed must be compatible with the version of the AWS IoT Greengrass Core software. Check the AWS IoT Greengrass documentation to identify which versions are compatible, and use sudo pip3 install greengrasssdk==<version_number> to install the SDK compatible with the version of AWS IoT Greengrass that we installed.

Our AWS IoT Greengrass core will authenticate with AWS IoT Core in AWS using certificates, so we need to generate these first using the following command:

aws iot create-keys-and-certificate --set-as-active --certificate-pem-outfile "iot-ge.cert.pem" --public-key-outfile "iot-ge.public.key" --private-key-outfile "iot-ge.private.key"

This command will generate three files containing the private key, public key and certificate.  All of these files need to be copied to the /greengrass/certs folder on the Raspberry Pi.  Also, the output of the preceding command will give the ARN of the certificate – we need to make a note of this ARN as we will use it in the next steps.

We also need to download a copy of the Amazon Root CA into the /greengrass/certs folder using the command below:

sudo wget -O root.ca.pem https://www.amazontrust.com/repository/AmazonRootCA1.pem

For the next step we need our AWS account number and IoT Host address unique to our account – we get the IoT Host address using the command:

aws iot describe-endpoint --endpoint-type iot:Data-ATS

Now we need to create a config.json file on the Raspberry Pi in the /greengrass/config folder, with the account number and IoT Host address obtained in the previous step;

{
  "coreThing" : {
    "caPath" : "root.ca.pem",
    "certPath" : "iot-ge.cert.pem",
    "keyPath" : "iot-ge.private.key",
    "thingArn" : "arn:aws:iot:eu-west-1:<aws_account_number>:thing/IoT-blog_Core",
    "iotHost" : "<endpoint_address>",
    "ggHost" : "greengrass-ats.iot.eu-west-1.amazonaws.com",
    "keepAlive" : 600
  },
  "runtime" : {
    "cgroup" : {
      "useSystemd" : "yes"
    },
    "allowFunctionsToRunAsRoot" : "yes"
  },
  "managedRespawn" : false,
  "crypto" : {
    "principals" : {
      "SecretsManager" : {
        "privateKeyPath" : "file:///greengrass/certs/iot-ge.private.key"
      },
      "IoTCertificate" : {
        "privateKeyPath" : "file:///greengrass/certs/iot-ge.private.key",
        "certificatePath" : "file:///greengrass/certs/iot-ge.cert.pem"
      }
    },
    "caPath" : "file:///greengrass/certs/root.ca.pem"
  }
}

Note that the line "allowFunctionsToRunAsRoot" : "yes" allows the Lambda functions to easily access the SenseHat on the Raspberry Pi. This configuration should normally be avoided in Production environments for security reasons but has been used here for simplicity.

Next we create the IoT Thing to represent our Raspberry Pi to match the entry we added into the config.json file previously:

aws iot create-thing --thing-name IoT-blog_Core

Now that our config.json file is in place and our IoT ‘thing’ created we can start the AWS IoT Greengrass software using the following commands:

cd /greengrass/ggc/core/
sudo ./greengrassd start

Then we attach the certificate to our new Thing – we need the ARN of the certificate that was noted in the earlier steps when we created the certificates:

aws iot attach-thing-principal --thing-name "IoT-blog_Core" --principal "<certificate_arn>"

Now we create the AWS IoT Greengrass group – make a note of the Group ID in the output of this command as we use it later:

aws greengrass create-group --name IoT-blog-group

Next we create the AWS IoT Greengrass Core definition file – create this using a text editor and save as core-def.json

{
  "Cores": [
    {
      "CertificateArn": "<certificate_arn>",
      "Id": "<IoT Thing Name>",
      "SyncShadow": true,
      "ThingArn": "<thing_arn>"
    }
  ]
}

Then, using the preceding file we just created, we create the core definition using the following command:

aws greengrass create-core-definition --name "IoT-blog_Core" --initial-version file://core-def.json

Now we associate the AWS IoT Greengrass core with the AWS IoT Greengrass group – we need the LatestVersionARN from the output of the command above and the group ID of your existing AWS IoT Greengrass group (in the output from the command for creation of the group in previous steps):

aws greengrass create-group-version --group-id "<greengrass_group_id>" --core-definition-version-arn "<core_definition_version_arn>"

Then we associate the IAM role (created earlier) to the AWS IoT Greengrass group;

aws greengrass associate-role-to-group --group-id "<greengrass_group_id>" --role-arn "arn:aws:iam::<aws_account_number>:role/IoTGGRole"

We need to create a policy to associate with the certificate so that our AWS IoT Greengrass Core (authenticated/authorized by our certificates) has rights to interact with AWS IoT Core.  To do this we create the policy.json file:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "iot:Publish",
        "iot:Subscribe",
        "iot:Connect",
        "iot:Receive"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "iot:GetThingShadow",
        "iot:UpdateThingShadow",
        "iot:DeleteThingShadow"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "greengrass:*"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}

Then create the policy using the policy file using the command below:

aws iot create-policy --policy-name myGGPolicy --policy-document file://policy.json

And finally attach our new policy to the certificate – as the certificate is attached to our AWS IoT Greengrass Core, this gives the rights defined in the policy to our AWS IoT Greengrass Core;

aws iot attach-policy --target "<certificate_arn>" --policy-name "myGGPolicy"

Now we have the AWS IoT Greengrass Core and permissions in place, it’s time to add our Secret as a resource for AWS IoT Greengrass.

First, we need to create a resource definition that refers to the ARN of the secret we created earlier.  Get the ARN of the secret using the following command:

aws secretsmanager describe-secret --secret-id "greengrass-snow-creds"

And then we create a text file containing the following and save it as resource.json:

{
"Resources": [
    {
      "Id": "SNOW-Credentials",
      "Name": "SNOW-Credentials",
      "ResourceDataContainer": {
        "SecretsManagerSecretResourceData": {
          "ARN": "<secret_arn>"
        }
      }
    }
  ]
}

Now we run the command to create the resource reference in IoT to the secret:

aws greengrass create-resource-definition --name "MySNOWSecret" --initial-version file://resource.json

Note the Resource ID from the output, as it has to be added to the Lambda definition JSON file in the next steps.  The function definition file contains the details of the Lambda function(s) that we will attach to our AWS IoT Greengrass group.  We create a text file with the content below and save it as lambda-def.json.

We also specify a couple of variables in the definition file; these are the same as the environment variables that can be specified for Lambda, but they make the variables available in AWS IoT Greengrass.

Note: if we specify environment variables for the functions in the Lambda console, these will NOT be available when the function is running under AWS IoT Greengrass.  We will need our ServiceNow API URL to add to the configuration below; this will be in the form https://devXXXXX.service-now.com/api/now/table/incident, where XXXXX is the developer instance number assigned by ServiceNow when our instance is created.

We need the ARNs of the Lambda functions that we created in part 1 of the blog – these appear in the output after successfully creating the functions from the command line, or can be obtained using the aws lambda list-functions command. We need to have ‘:1’ at the end of each ARN, as AWS IoT Greengrass needs to reference published function versions.

{
  "DefaultConfig": {
    "Execution": {
      "IsolationMode": "NoContainer",
      "RunAs": {
        "Gid": 0,
        "Uid": 0
      }
    }
  },
  "Functions": [
    {
      "FunctionArn": "<lambda_function1_arn>:1",
      "FunctionConfiguration": {
        "EncodingType": "json",
        "Environment": {
          "Execution": {
            "IsolationMode": "NoContainer"
          },
          "Variables": { 
            "tempLimit": "30",
            "humidLimit": "50"
          }
        },
        "ExecArgs": "string",
        "Executable": "lambda_function.lambda_handler",
        "Pinned": true,
        "Timeout": 10
      },
    "Id": "sensorLambda"
    },
    {
      "FunctionArn": "<lambda_function2_arn>:1",
      "FunctionConfiguration": {
        "EncodingType": "json",
        "Environment": {
          "Execution": {
            "IsolationMode": "NoContainer"
          },
          "ResourceAccessPolicies": [
            {
              "Permission": "ro",
              "ResourceId": "SNOW-Credentials"
            }
          ],
          "Variables": { 
            "snowUrl": "<service_now_api_url>"
          }
        },
        "ExecArgs": "string",
        "Executable": "lambda_function.lambda_handler",
        "Pinned": false,
        "Timeout": 10
      },
    "Id": "anomalyLambda"
    }
  ]
}

The Lambda function now needs to be registered within our AWS IoT Greengrass core using the definition file just created, using the following command:

aws greengrass create-function-definition --name "IoT-blog-lambda" --initial-version file://lambda-def.json

Create Subscriptions

We now need to create some IoT Topics to pass data between the two Lambda functions and also to submit all sensor data to AWS IoT Core, which gives us visibility of the successful collection of sensor data.

First, let’s create a subscription configuration file (subscriptions.json) for sensor data and anomaly data:

{
  "Subscriptions": [
    {
      "Id": "SensorData",
      "Source": "<lambda_function1_arn>:1",
      "Subject": "IoTBlog/sensorData",
      "Target": "cloud"
    },
    {
      "Id": "AnomalyData",
      "Source": "<lambda_function1_arn>:1",
      "Subject": "IoTBlog/anomaly",
      "Target": "<lambda_function2_arn>:1"
    },
    {
      "Id": "AnomalyDataB",
      "Source": "<lambda_function1_arn>:1",
      "Subject": "IoTBlog/anomaly",
      "Target": "cloud"
    }
  ]
}

And next, we run the command to create the subscription from this configuration:

aws greengrass create-subscription-definition --name "IoT-sensor-subs" --initial-version file://subscriptions.json

Update AWS IoT Greengrass Group Associations and Deploy

Now that the functions, subscriptions and resources have been defined, we run the following command to update our AWS IoT Greengrass group to the new version with those components included:

aws greengrass create-group-version --group-id <gg_group_id> --core-definition-version-arn "<core_def_version_arn>" --function-definition-version-arn "<function_def_version_arn>" --resource-definition-version-arn "<resource_def_version_arn>" --subscription-definition-version-arn "<subscription_def_version_arn>"

And finally, we can deploy our configuration.  Use the following command to deploy the Greengrass group to our device, using the group-version-id from the output of the previous command and also the group-id:

aws greengrass create-deployment --deployment-type NewDeployment --group-id <gg_group_id> --group-version-id <gg_group_version_id>

Summarized below is the integration between the different functions and components that we have now deployed to get from our sensor data through to an incident being raised in ServiceNow:

Raspberry PI

Create an Incident

Everything is now set up from an IoT perspective, so we can attempt to breach a sensor threshold and trigger the creation of an incident in ServiceNow.  To do this, let’s raise the humidity around the sensor so that it breaches the threshold defined in the environment variables of the Lambda function.

Under normal conditions we will just see the data published by the first Lambda function in the IoTBlog/sensorData topic:

IoTblog sensordata

However, when a threshold is breached (in our example, humidity above 50%), the data is published to the IoTBlog/anomaly topic as shown below:

ioTblog Anomaly

Via the AWS IoT Greengrass subscriptions created earlier, this message arriving in the anomaly topic also triggers the second Lambda function to create the ticket in ServiceNow.

The log for the second Lambda function on AWS IoT Greengrass (stored in /greengrass/ggc/var/log/user/<region>/<aws_account_number>/ on the Raspberry Pi) will show a ‘201’ return code if the incident is successfully created in ServiceNow.

201 response

Now let’s log on to ServiceNow and check out our new incident.  Good news, our new incident appears correctly:

And when we click on our incident we can see the detail, including the full data from the IoT topic in the Activities section;

This is only a basic use of the ServiceNow API, and there are many other parameters that you can use to increase the richness of the incident. Refer to the ServiceNow API documentation for more details.

Cleaning up

To avoid incurring future charges, delete the resources that you created in the walkthrough.

Conclusion

We have built an IoT device (a Raspberry Pi) running AWS IoT Greengrass and AWS Lambda, using ServiceNow credentials managed in AWS Secrets Manager.  Using this, we triggered an anomaly event that automatically created an incident in ServiceNow, directly from the Lambda function running on our Pi.  You can use this architecture as the foundation to integrate your edge devices and ITSM solution to automate ticket generation in your organization.

Look out for follow-up blogs that will extend this solution to provide a real-time dashboard for the sensor data and store the sensor data in a Data Lake for historical visualization.

Find out more about deploying Secrets to AWS IoT Greengrass Core.

Check out the AWS IoT Blog for more examples of how to use AWS to integrate your edge devices with the AWS Cloud.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

 

Field Notes: Integrating IoT and ITSM using AWS IoT Greengrass and AWS Secrets Manager – Part 1

Post Syndicated from Gary Emmerton original https://aws.amazon.com/blogs/architecture/field-notes-integrating-iot-and-itsm-using-aws-iot-greengrass-and-aws-secrets-manager-part-1/

IT Security is a hot topic in every organization, and in a hyper connected world the need to integrate thousands of IoT devices securely with many different systems at scale is critical.

AWS Secrets Manager helps customers manage their system credentials securely in the AWS Cloud, and with its integration with AWS IoT Greengrass, that capability now extends out to your edge-connected IoT devices.

In this two part blog post, I will walk through the steps to use this integration to give edge devices the capability to connect to and log incidents directly into ServiceNow. The credentials for connecting to ServiceNow are created in AWS Secrets Manager, and deployed locally (encrypted) to the edge device via AWS IoT Greengrass.

Part 1 (this post) gives an overview of the whole solution and the steps for setting up AWS Secrets Manager, creating the required IAM roles and AWS Lambda functions.  In part 2 of the blog we will then set up AWS IoT Greengrass and AWS IoT Core so that we can run the functions and access the secret on our edge device (Raspberry Pi).

Enabling edge devices to automatically raise incidents in your organization’s ITSM toolset ensures that you can use existing workflows and incident escalation paths for your edge devices.  Previously this would have been challenging to integrate. Additionally, by running this capability at the edge, it enables quicker responses and reduces the need to make calls back to the AWS Cloud.

Overview of solution

The solution makes use of a Raspberry Pi running AWS IoT Greengrass. AWS Lambda functions on AWS IoT Greengrass capture temperature and humidity sensor data and make calls directly to ServiceNow when thresholds are breached.  The integration of AWS Secrets Manager with AWS IoT Greengrass makes the credentials required for the ServiceNow API calls available locally; they are deployed encrypted and are available on the Pi for the Lambda function to use.

The sensors for temperature and humidity in this example are on a Raspberry Pi Sense Hat, which illustrates sensors that could be used in an industrial or manufacturing use case.  You can use any type of sensor such as vibration, strain gauges, or other electro-mechanical sensors.

One of the Lambda functions running on AWS IoT Greengrass on the RPi captures the sensor readings, and should a threshold for either reading be exceeded, it triggers a second Lambda function (again running on AWS IoT Greengrass). This then makes a ‘create incident’ API call to ServiceNow, using the credentials stored in AWS Secrets Manager.

In order to have visibility of the sensor data, and to manage communications between the first and second Lambda functions, all data from the sensors is published to one IoT Topic. Data related to any threshold breaches is published to another IoT Topic.

Following is a high-level diagram for the architecture used in this blog.

ServiceNow RA

Prerequisites

To complete the steps in this blog, you need:

  • An AWS Account
  • A ServiceNow developer instance or other test ServiceNow instance that you can access – You can sign-up for a free ServiceNow developer account 
  • A Raspberry Pi (I used a Pi 3B with Raspbian Buster)
  • A Raspberry Pi SenseHat
  • A workstation with the latest AWS Command Line Interface (CLI) installed

Additionally, ensure that you have Python 3.7.x installed with the following Python modules on the Raspberry Pi (Raspbian Buster includes Python 3.7 by default). Install the following packages as the root user (sudo pip):

  • greengrasssdk
  • boto3
  • requests
  • sense_hat
  • datetime

I have taken the approach of having these modules installed on the Pi in order to simplify the creation of the Lambda function.  This ensures the function will run locally on the Pi rather than having to build all of the Python modules on the Pi and then zip them to run the Lambda in an AWS IoT Greengrass container.

Walkthrough

The steps in the walkthrough can be achieved from the AWS console. I have focused on the command line approach, using the AWS CLI, as this will give a more detailed view of what is happening and the dependencies between the different components.  The overall sequence for the steps is:

  • Create secret in AWS Secrets Manager – this will contain the credentials required to access ServiceNow for Lambda running on AWS IoT Greengrass
  • Create IAM role – provides permissions for AWS IoT Greengrass to other AWS services, including AWS Secrets Manager
  • Create Lambda functions – the functions that will capture sensor data and create the Service Now ticket
  • Configure and Deploy IoT Core – deploy our configuration to the Raspberry Pi, covered in Part 2 of this blog

I’ve structured the order of the steps in a logical sequence so that any dependencies of later steps are created first.  There are a number of places where a value in the output of one command (such as an ARN) needs to be noted as required for subsequent commands. At times the steps may seem counter-intuitive, but whilst developing this blog, I found this sequence has proven to be the most effective.

Create Secret

First, we create our ‘Secret’ in AWS Secrets Manager.  This consists of a secret string containing a JSON object for the username and password required for authentication to the API of my ServiceNow developer instance.  For the purposes of this example, we will use the default encryption key for the AWS Secrets Manager service.

The following command, from the AWS CLI, is used to create the new secret, entering the relevant username and password for the ServiceNow instance.

IMPORTANT: The name of the secret must start with “greengrass-” (specified in the command after the --name parameter), because the IAM Greengrass managed service role (which we will use later) has permission by default to access secrets that start with this text.

aws secretsmanager create-secret --name "greengrass-snow-creds" --description "Credentials for ServiceNow API access" --secret-string '{"username":"<username>","password":"<password>"}'

The successful completion of the preceding command results in output on your terminal screen containing the ARN (Amazon Resource Name) of the new secret.

Create IAM roles

In order for AWS IoT Greengrass to access our new secret and download it securely to AWS IoT Greengrass running on the Pi, we need to give it an IAM role. If we do not set this up correctly then there will be an error when trying to deploy the AWS IoT Greengrass group later in the walkthrough.

Our new IAM role needs a policy document that describes the permissions that we will give the role.  The first step is to create the IAM trust policy document that allows only the Greengrass service to assume this new role – to do this we create a text file called assume-role.json containing the following:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "greengrass.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Then we run the following command, referencing the file just created:

aws iam create-role --role-name "IoTGGRole" --assume-role-policy-document file://assume-role.json

And finally we need to attach the IAM managed AWS IoT Greengrass service role to our new custom role:

aws iam attach-role-policy --role-name "IoTGGRole" --policy-arn arn:aws:iam::aws:policy/service-role/AWSGreengrassResourceAccessRolePolicy

We also need a role for our Lambda functions with basic execution permissions; create a text file called lambda-role.json containing the following:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Then we run the following command, referencing the file just created:

aws iam create-role --role-name "IoTLambdaRole" --assume-role-policy-document file://lambda-role.json

And finally we need to attach the IAM managed Lambda basic execution role to our new custom role:

aws iam attach-role-policy --role-name "IoTLambdaRole" --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

Create Lambda Functions

We now create the Lambda functions that will be deployed to Greengrass – these functions manage the interactions between the sensors and ServiceNow.

Lambda Function 1 – Get/Publish Sensor Data

The first Lambda function reads the sensor data and publishes it to an IoT Topic (IoTBlog/sensorData) every 5 seconds, which can then be used by downstream services for analytics.  This function also determines whether a threshold has been breached, and if so, it publishes the data to a separate IoT Topic (IoTBlog/anomaly) to which our second Lambda function is subscribed.

import os
import json
from datetime import datetime
import time
import sys
from sense_hat import SenseHat
import greengrasssdk
import boto3

client = greengrasssdk.client('iot-data')
secClient = greengrasssdk.client('secretsmanager')
sense = SenseHat()
sense.clear()

t_threshold = int(os.environ['tempLimit'])
h_threshold = int(os.environ['humidLimit'])

def lambda_handler(event, context):
    return
 
# Get sensor data and check against thresholds
def getSensorData():
    while True:
        eventTitle = "no event"
        anomaly = False
        ts = time.time()
        dt = datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
        
        temp     = round(sense.get_temperature(),2)
        humidity = round(sense.get_humidity(),2)
        
        time.sleep(5)

        if (temp > int(t_threshold)):
            anomaly = True
            eventTitle = "Temperature breach"
            
        if (humidity > int(h_threshold)):
            anomaly = True
            eventTitle = "Humidity breach"

        sensorData = { 'title': eventTitle,'dt':dt,'ts':ts,'t':temp,'h':humidity }
        publishData(anomaly,sensorData)

# Publish sensor data to IoT topic(s)
def publishData(anomaly,myData):
    response = client.publish(
        topic = 'IoTBlog/sensorData',
        payload = json.dumps(myData) )

    # Threshold breaches are also published to the anomaly topic, to which
    # the second Lambda function is subscribed
    if anomaly:
        response = client.publish(
            topic = 'IoTBlog/anomaly',
            payload = json.dumps(myData) )

# The function is deployed as 'pinned' (long-lived), so start the sensor
# loop as soon as the AWS IoT Greengrass core loads this module
getSensorData()

We create a text file lambda_function.py containing the code above and then compress it into a zip file named lambda_function_1.zip.  Once this is done, we can create the function in AWS using the following command:

aws lambda create-function --function-name "1-IoT-GetSensorData" --runtime python3.7 --zip-file fileb://lambda_function_1.zip --handler lambda_function.lambda_handler –role arn:aws:iam::&lt;aws_account_number&gt;:role/IoTLambdaRole
In order to make use of Lambda functions in AWS IoT Greengrass we then need to publish a version of the function using the following command:
aws lambda publish-version --function-name "1-IoT-GetSensorData"

Lambda Function 2 – Create Anomaly Ticket

The second Lambda function is triggered through its subscription to the anomaly data published to the anomaly IoT Topic by the first Lambda function. It then makes a call to the ServiceNow API to create an incident.  Prior to making the API call, the function obtains the ServiceNow credentials from the secret that has been made available to AWS IoT Greengrass.

As this is a Resource within the AWS IoT Greengrass Core, it is automatically downloaded to the Raspberry Pi as part of the deployment of the AWS IoT Greengrass Core.

import os
import json
import sys
import greengrasssdk
import requests

client = greengrasssdk.client('iot-data')
secClient = greengrasssdk.client('secretsmanager')
secret_name = "greengrass-snow-creds"
snow_instance = os.environ['snowUrl']
msg = ""

def publishAnomaly(msg,title):
    auth = getLocalSecret()
    createTicket(auth,msg,title)

# Get secret from GG
def getLocalSecret():
    secret = secClient.get_secret_value(SecretId=secret_name)
    rawSecret = secret.get('SecretString')
    return json.loads(str(rawSecret))

# Create ticket in ServiceNow            
def createTicket(auth,eventData,title):
    API_ENDPOINT = snow_instance
    HEADERS = {"Content-Type":"application/json","Accept":"application/json"}
    PARAMS = { 
        "short_description":title,
        "assignment_group":"sensor_team",
        "urgency":"2",
        "impact":"2",
        "comments":eventData
    } 

    request = requests.post(url = API_ENDPOINT, auth=(str(auth["username"]),str(auth["password"])), headers=HEADERS, data = json.dumps(PARAMS))
    print("Response:",request)

def lambda_handler(event, context):
    msg = json.dumps(event)
    msg = json.loads(msg)
    title = "Sensor Threshold - " + msg["title"]
    
    publishAnomaly(msg,title)
    return

We create another text file, lambda_function.py, containing the code above and then compress it into a zip file named lambda_function_2.zip.  Once this is done, we can create the function in AWS using the following command:

aws lambda create-function --function-name "2-IoT-ServiceNow" --runtime python3.7 --zip-file fileb://lambda_function_2.zip --handler lambda_function.lambda_handler –role arn:aws:iam::&lt;account_number&gt;:role/IoTLambdaRole

In order to make use of Lambda functions in Greengrass we then need to publish a version of the function using the following command:

aws lambda publish-version --function-name "2-IoT-ServiceNow"

Conclusion

In this post, I showed you the steps for integrating IoT and ITSM by setting up AWS Secrets Manager, creating the required IAM roles and AWS Lambda functions.  Now you can proceed to part 2 of the blog to set up AWS IoT-Core and AWS IoT Greengrass to make use of the secret and functions that you created in this post.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Mercado Libre: How to Block Malicious Traffic in a Dynamic Environment

Post Syndicated from Gaston Ansaldo original https://aws.amazon.com/blogs/architecture/mercado-libre-how-to-block-malicious-traffic-in-a-dynamic-environment/

Blog post contributors: Pablo Garbossa and Federico Alliani of Mercado Libre

Introduction

Mercado Libre (MELI) is the leading e-commerce and FinTech company in Latin America. We have a presence in 18 countries across Latin America, and our mission is to democratize commerce and payments to impact the development of the region.

We manage an ecosystem of more than 8,000 custom-built applications that process an average of 2.2 million requests per second. To support the demand, we run between 50,000 and 80,000 Amazon Elastic Compute Cloud (EC2) instances, and our infrastructure scales in and out according to the time of day, thanks to the elasticity of the AWS Cloud and its auto scaling features.

Mercado Libre

As a company, we expect our developers to devote their time and energy to building the apps and features that our customers demand, without having to worry about the underlying infrastructure that the apps are built upon. To achieve this separation of concerns, we built Fury, our platform as a service (PaaS) that provides an abstraction layer between our developers and the infrastructure. Each time a developer deploys a brand new application or a new version of an existing one, Fury takes care of creating all the required components such as an Amazon Virtual Private Cloud (VPC), Amazon Elastic Load Balancing (ELB), an Amazon EC2 Auto Scaling group (ASG), and EC2 instances. Fury also manages a per-application Git repository; a CI/CD pipeline with different deployment strategies, such as blue-green and rolling upgrades; and transparent application log and metrics collection.

Fury- MELI PaaS

For those of us on the Cloud Security team, Fury represents an opportunity to enforce critical security controls across our stack in a way that’s transparent to our developers. For instance, we can dictate which Amazon Machine Images (AMIs) are vetted for use in production (such as those that align with the Center for Internet Security benchmarks). If needed, we can apply security patches across our entire fleet from a centralized location in a very scalable fashion.

But there are also other attack vectors to which every organization with a presence on the public internet is exposed. The recent AWS Threat Landscape Report shows a 23% YoY increase in the total number of Denial of Service (DoS) events. It’s evident that organizations need to be prepared to react quickly under these circumstances.

The variety and the number of attacks are increasing, testing the resilience of all types of organizations. This is why we started working on a solution that allows us to contain application DoS attacks, and complements our perimeter security strategy, which is based on services such as AWS Shield and AWS Web Application Firewall (WAF). In this article, we will walk you through the solution we built to automatically detect and block these events.

The strategy we implemented for our solution, Network Behavior Anomaly Detection (NBAD), consists of four stages that we repeatedly execute:

  1. Analyze the execution context of our applications, like CPU and memory usage
  2. Learn their behavior
  3. Detect anomalies, gather relevant information and process it
  4. Respond automatically

Step 1: Establish a baseline for each application

End user traffic enters through different AWS CloudFront distributions that route to multiple Elastic Load Balancers (ELBs). Behind the ELBs, we operate a fleet of NGINX servers from where we connect back to the myriad of applications that our developers create via Fury.


Step 1: MELI Architecture – Anomaly detection project

We collect logs and metrics for each application that we ship to Amazon Simple Storage Service (S3) and Datadog. We then partition these logs using AWS Glue to make them available for consumption via Amazon Athena. On average, we send 3 terabytes (TB) of log files in parquet format to S3.

Based on this information, we developed processes that we complement with commercial solutions, such as Datadog’s Anomaly Detection, which allows us to learn the normal behavior or baseline of our applications and project expected adaptive growth thresholds for each one of them.


Step 2: Anomaly detection

When any of our apps receives a number of requests that fall outside the limits set by our anomaly detection algorithms, an Amazon Simple Notification Service (SNS) event is emitted, which triggers a workflow in the Anomaly Analyzer, a custom-built component of this solution.

Upon receiving such an event, the Anomaly Analyzer starts composing the so-called event context. In parallel, the Data Extractor retrieves vital insights via Athena from the log files stored in S3.
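As an illustration of that extraction step, request details for the affected application can be pulled from the partitioned logs with a query submitted through the Athena API. The following Python (boto3) sketch uses hypothetical database, table, column, and bucket names; the real ones depend on how AWS Glue catalogs the log files.

import boto3

athena = boto3.client("athena")

# Hypothetical query: top talkers for the affected application during the
# hour in which the anomaly was detected
QUERY = """
SELECT client_ip, user_agent, count(*) AS requests
FROM waf_logs.requests
WHERE application = 'checkout-api'
  AND hour = '2020-11-01-10'
GROUP BY client_ip, user_agent
ORDER BY requests DESC
LIMIT 100
"""

def start_extraction():
    response = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": "waf_logs"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    # The query runs asynchronously; poll get_query_execution or read the
    # results from the output location once it completes
    return response["QueryExecutionId"]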

The output of this process is used as the input for the data enrichment process. This is responsible for consulting different threat intelligence sources that are used to further augment the analysis and determine if the event is an actual incident or not.

At this point, we build the context that not only gives us greater certainty in calculating the score, but also helps us validate and act more quickly. This context includes:

  • Application’s owner
  • Affected business metrics
  • Error handling statistics of our applications
  • Reputation of IP addresses and associated users
  • Use of unexpected URL parameters
  • Distribution by origin of the traffic that generated the event (cloud providers, geolocation, etc.)
  • Known behavior patterns of vulnerability discovery or exploitation

Step 2: MELI Architecture – Anomaly detection project

Step 3: Incident response

Once we reconstruct the context of the event, we calculate a score for each “suspicious actor” involved.


Step 3: MELI Architecture – Anomaly detection project

Based on these analysis results, we carry out a series of verifications to rule out false positives. Finally, we execute different actions based on the following criteria:

Manual review

If the automatic analysis results in a medium risk score, we activate a manual review process:

  1. We send a report to the application’s owners with a summary of the context. Based on their understanding of the business, they can activate the Incident Response Team (IRT) on-call and/or provide feedback that allows us to improve our automatic rules.
  2. In parallel, our threat analysis team receives and processes the event. They are equipped with tools that allow them to add IP addresses, user-agents, referrers, or regular expressions into AWS WAF to carry out temporary blocking of “bad actors” while the attack is in progress.

Automatic response

If the analysis results in a high risk score, an automatic containment process is triggered. The event is sent to our block API, which is responsible for adding a temporary rule designed to mitigate the attack in progress. Behind the scenes, our block API leverages AWS WAF to create IP sets. We reference these IP sets from custom rule groups in our web ACLs in order to block the IPs that source the malicious traffic. We found many benefits in the new release of AWS WAF, like support for AWS Managed Rules, larger capacity units per web ACL, as well as an easier-to-use API.
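To make the automatic containment step concrete, here is a minimal boto3 sketch of how a block API could append offending addresses to an AWS WAF IP set that is already referenced by a web ACL rule. The IP set name, ID, and CLOUDFRONT scope are placeholder assumptions for illustration, not Mercado Libre's actual implementation.

import boto3

# IP sets used by CloudFront web ACLs must be managed in us-east-1 with scope CLOUDFRONT.
wafv2 = boto3.client("wafv2", region_name="us-east-1")

IP_SET_NAME = "blocked-actors"                      # placeholder name
IP_SET_ID = "a1b2c3d4-5678-90ab-cdef-EXAMPLE11111"  # placeholder ID

def block_addresses(new_cidrs):
    """Append offending CIDRs to an existing IP set referenced by a web ACL rule."""
    current = wafv2.get_ip_set(Name=IP_SET_NAME, Scope="CLOUDFRONT", Id=IP_SET_ID)
    addresses = set(current["IPSet"]["Addresses"]) | set(new_cidrs)
    wafv2.update_ip_set(
        Name=IP_SET_NAME,
        Scope="CLOUDFRONT",
        Id=IP_SET_ID,
        Addresses=sorted(addresses),
        LockToken=current["LockToken"],  # optimistic locking token returned by get_ip_set
    )

# Example: block two addresses flagged by the scoring process
block_addresses(["192.0.2.10/32", "198.51.100.0/24"])

Because update_ip_set replaces the full address list, the sketch merges the new CIDRs with the existing ones before writing them back.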

Conclusion

By leveraging the AWS platform and its powerful APIs, and together with the AWS WAF service team and solutions architects, we were able to build an automated incident response solution that identifies and blocks malicious actors with minimal operator intervention. Since launching the solution, we have reduced YoY application downtime by over 92%, even though the time under attack increased more than 10x. This has had a positive impact on our users and, therefore, on our business.

Not only was our downtime drastically reduced, but we also cut the number of manual interventions during this type of incident by 65%.

We plan to iterate over this solution to further reduce false positives in our detection mechanisms as well as the time to respond to external threats.

About the authors

Pablo Garbossa is an Information Security Manager at Mercado Libre. His main duties include ensuring security in the software development life cycle and managing security in MELI’s cloud environment. Pablo is also an active member of the Open Web Application Security Project® (OWASP) Buenos Aires chapter, a nonprofit foundation that works to improve the security of software.

Federico Alliani is a Security Engineer on the Mercado Libre Monitoring team. Federico and his team are in charge of protecting the site against different types of attacks. He loves to dive deep into big architectures to drive performance, scale operational efficiency, and increase the speed of detection and response to security events.

Field Notes: Customizing the AWS Control Tower Account Factory with AWS Service Catalog

Post Syndicated from Remek Hetman original https://aws.amazon.com/blogs/architecture/field-notes-customizing-the-aws-control-tower-account-factory-with-aws-service-catalog/

Many AWS customers who are managing hundreds or thousands of accounts know how complex and time-consuming this process can be. To reduce the burden and simplify the process of creating new accounts, last year AWS released a new service, AWS Control Tower.

AWS Control Tower helps you automate the process of setting up a multi-account AWS environment (AWS Landing Zone) that is secure, well-architected, and ready to use. This Landing Zone is created following best practices established through AWS’ experience working with thousands of enterprises as they move to the cloud. This includes the configuration of AWS Organizations, centralized logging, federated access, mandatory guardrails, and networking.

Those elements are a good starting point to cover the initial configuration of the new account. For some organizations, the next step is to baseline a newly vended account to align it with the company policies and compliance requirements. This means creating or deploying the necessary roles, policies, governance controls, security groups, and so on.

In this blog post, I describe a solution that helps achieve consistent governance and compliance requirements across accounts created by AWS Control Tower. The solution uses the AWS Service Catalog as the repository of products that will be deployed into the new accounts.

Prerequisites and assumptions

Before I get into how it works, let’s first review a few key AWS Service Catalog concepts:

  • An AWS Service Catalog product is a blueprint for building your AWS resources that you want to make available for deployment on AWS along with the configuration information.
  • A portfolio is a collection of products, together with the configuration information.
  • A provisioned product is an AWS CloudFormation stack.
  • Constraints control the way users can deploy a product. With launch constraints, you can specify a role that the AWS Service Catalog can assume to launch a product.
  • Review AWS Service Catalog reference blueprints for a quick way to set up and configure AWS Service Catalog portfolios and products.

You need an AWS Service Catalog portfolio with the products that you plan to deploy to the accounts. The portfolio has to be created in the AWS Control Tower primary account. If you don't have a portfolio yet, the AWS Service Catalog reference blueprints mentioned above are a good starting point.

Solution Overview

This solution supports the following scenarios:

  • Deployment of products to the newly vended AWS Control Tower account (Figure 1)
  • Update and deployment of products to existing accounts (Figure 2)

Figure 1 – Deployment to new account

The architecture diagram in Figure 1 shows the process of deploying products to the new account; a minimal sketch of the triggering Lambda function follows the numbered steps.

  1. AWS Control Tower creates a new account
  2. Once an account is created successfully, Amazon CloudWatch Events triggers an AWS Lambda function
  3. The Lambda function pulls the configuration from an Amazon S3 bucket and:
    • Validates the configuration
    • Grants the Lambda role access to portfolio(s)
    • Creates StackSet constraints for the products
  4. The Lambda function calls the AWS Step Function
  5. The AWS Step Function orchestrates deployment of the AWS Service Catalog products and monitors progress
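The following is a minimal sketch of how such a Lambda function could react to the account creation lifecycle event, pull the configuration file from S3, and start the Step Functions state machine. The event parsing, environment variable names, and configuration keys are illustrative assumptions; the actual implementation lives in the solution's GitHub repository.

import json
import os

import boto3
import yaml  # PyYAML, bundled with the Lambda deployment package via requirements.txt

s3 = boto3.client("s3")
sfn = boto3.client("stepfunctions")

CONFIG_BUCKET = os.environ["CONFIGURATION_BUCKET"]              # e.g. ConfigurationBucketName
CONFIG_KEY = os.environ.get("CONFIGURATION_FILE", "config.yml")
STATE_MACHINE_ARN = os.environ["STATE_MACHINE_ARN"]

def handler(event, context):
    # Illustrative only: the exact shape of the Control Tower lifecycle event and of
    # config.yml is defined by the solution's repository, not by this sketch.
    account_id = (
        event["detail"]["serviceEventDetails"]
        ["createManagedAccountStatus"]["account"]["accountId"]
    )

    body = s3.get_object(Bucket=CONFIG_BUCKET, Key=CONFIG_KEY)["Body"].read()
    config = yaml.safe_load(body)  # validate the configuration before use in a real deployment

    sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        input=json.dumps({"account_id": account_id, "config": config}),
    )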

Figure 2 – Update or deployment to an existing account

The architecture diagram in Figure 2 shows the process of updating or deploying products to an existing account.

  1. The user uploads the update configuration to an Amazon S3 bucket
  2. The upload triggers the AWS Lambda function
  3. The Lambda function reads the uploaded configuration and:
    • Validates the configuration
    • Grants the Lambda role access to portfolio(s)
    • Creates StackSet constraints for the products
  4. Calls the AWS Step Function
  5. The AWS Step Function orchestrates the update/deployment of the AWS Service Catalog products and monitors progress

Deployment and configuration

Before proceeding with the deployment, you must install Python 3.

Then, follow these steps:

  1. Download the solution from GitHub
  2. Go to the src folder and run the following command: pip3 install -r requirements.txt -t .
  3. Zip the content of the src folder into a file named "control-tower-account-factory-solution.zip"
  4. Upload the zip file to your Amazon S3 bucket created in the AWS Region where you want to deploy the solution
  5. Launch the AWS CloudFormation template

AWS CloudFormation Template Parameters:

  • ConfigurationBucketName – name of the Amazon S3 bucket from where AWS Lambda function should pull the configuration file. You can provide the name of an existing bucket.
  • CreateConfigurationBucket – set to true if you want to create the bucket specified in the previous parameter. If the bucket already exists, set the value to false
  • ConfigurationFileName – name of the configuration file for a new account. Default value: config.yml
  • UpdateFileName – name of the configuration file for updates. Default value: update.yml
  • SourceCodeBucketName – name of the bucket where you uploaded the zipped Lambda code
  • SourceCodePackageName – name of the zipped file containing the Lambda code. If the file was uploaded to a folder, include the folder name(s) as well. Example: my_folder/control-tower-account-factory-solution.zip
  • BaselineFunctionName – name of the AWS Lambda function. Default value: control-tower-account-factory-lambda
  • BaselinetLambdaRoleName – name of the AWS Lambda function IAM role. Default value: control-tower-account-factory-lambda-role
  • StateMachineName – name of the AWS Step Function. Default value: control-tower-account-factory-state-machine
  • StateMachineRoleName – name of the AWS Step Function IAM role. Default value: control-tower-account-factory-state-machine-role
  • TrackingTableName – name of the DynamoDB table used to track all deployments and updates. Default value: control-tower-account-factory-tracking-table
  • MaxIterations – maximum number of iterations of the AWS Step Function before it reports a timeout. Default value: 30. It can be overridden in the configuration file. For more information, see the Configuration section.
  • TopicName – (optional) name of the SNS topic that will be used for sending notifications. Default value: control-tower-account-factory-account-notification
  • NotificationEmail – (optional) the email address where to send the notification. If this isn’t provided, the SNS topic won’t be created.

Configuration Files

The configuration files have to be in YAML format. You can use the following schemas for new or existing accounts:

Configuration schema for new account:

The configuration can apply to all new accounts or can be divided based on the organization unit associated with the new account. Also, you can mix deployments where some products are always deployed and some only to specific organization units.

Example configuration file

Configuration schema for existing accounts

This configuration can be used to update provisioned products or deploy products into existing accounts. You can specify the target location as account(s), organization unit(s), or both. If you specify an organization unit as a target, the product(s) will be updated/deployed to all accounts associated with that organization unit.

If the product doesn't exist in the account, the Lambda function can deploy it. You manage this behavior by setting the parameter 'deployifnotexist' to true. If it is omitted or set to false, Lambda won't provision the product into an existing account.

Example configuration file

Each product deployed by AWS Service Catalog is provisioned under its own name, which has to be unique across all provisioned products. In this example, products are provisioned under the following naming convention:

<account id where product is provisioned>-<"provision_name" from configuration file>

Example:  123456789-my-product

In the configuration file, you can specify dependencies between products. Dependencies need to be provided as a list of provisioned product names, not product names. If provided, deployment of the product is suspended until all dependencies have been successfully deployed.

The Step Function runs in a loop with a one-minute interval, checking whether the dependencies have been deployed. In the event of an error in the dependency naming or configuration, the Step Function iterates only until it reaches the maximum number of iterations defined in the configuration file. If the maximum is reached, the Step Function reports a timeout and interrupts the products' deployment.
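For illustration, the dependency check inside that loop could look something like the following boto3 sketch, which treats a dependency as satisfied once the provisioned product reaches the AVAILABLE status. The function and filter shown here are an assumption about how such a check could be written, not the actual code from the repository.

import boto3

sc = boto3.client("servicecatalog")

def dependencies_ready(provisioned_names):
    """Return True when every named provisioned product is in AVAILABLE status."""
    for name in provisioned_names:
        result = sc.search_provisioned_products(
            Filters={"SearchQuery": [f"name:{name}"]}
        )
        products = result.get("ProvisionedProducts", [])
        if not products or products[0]["Status"] != "AVAILABLE":
            return False
    return True

# Example: names follow the <account id>-<provision_name> convention described above
print(dependencies_ready(["123456789-my-product"]))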

Important: if you are updating a product by adding a new AWS Region, you cannot specify the dependency or run updates in existing Regions at the same time. In this scenario, you should break the update into two steps:

  • Upload configuration to create dependencies in the new region
  • Upload configuration to create products only in the new region

You can find different examples of configuration files in GitHub.

Note: The name of the configuration file and Amazon S3 location must match the values provided in the AWS CloudFormation template during solution deployment.

AWS Lambda functions deployment considerations

When the product being deployed is a Lambda function, you need to consider two requirements:

  1. The source code of the Lambda function needs to be in an Amazon S3 bucket created in the same AWS Region where you are planning to deploy the Lambda function
  2. The destination account needs to have permission to the source Amazon S3 bucket

To accommodate both requirements, the approach is to create Amazon S3 buckets under the AWS Control Tower primary account, one in each Region where the functions will be deployed.

Each deployment bucket should have the following policy attached:

{
    "Version": "2008-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            },
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::<S3 bucket name>/*",
            "Condition": {
                "StringEquals": {
                    "aws:PrincipalOrgID": "<AWS Organizations Id>"
                },
                "StringLike": {
                    "aws:PrincipalArn": [
                        "arn:aws:iam::*:role/AWSControlTowerExecution",
                        "arn:aws:iam::*:role/<name of the Lambda IAM role – default: control-tower-account-factory-lambda-role>"
                    ]
                }
            }
        }
    ]
}

This allows the deployment role to access Lambda’s source code from any account in your AWS Organizations.

Conclusion

In this blog post, I've outlined a solution to help you drive consistent governance and compliance requirements across accounts vended through AWS Control Tower. This solution provides enterprises with mechanisms to manage all deployments from a centralized account and reduces the need to maintain multiple separate CI/CD pipelines. Therefore, you can simplify deployments and reduce deployment time in a multi-account environment.

This solution allows you to keep provisioned products up to date by updating existing products or bringing older accounts to the same compliance level as the new accounts.

Since this solution supports deployment to existing accounts and can be run without AWS Control Tower, it can be used to deploy any AWS Service Catalog product in either single- or multi-account environments. The solution then becomes an integral part of a CI/CD pipeline.

For more information, review the AWS Control Tower documentation and AWS Service Catalog documentation. Also, review the links listed in the “Prerequisites and assumptions” section of this post.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Architecting a Data Lake for Higher Education Student Analytics

Post Syndicated from Craig Jordan original https://aws.amazon.com/blogs/architecture/architecting-data-lake-for-higher-education-student-analytics/

One of the keys to identifying timely and impactful actions is having enough raw material to work with. However, this up-to-date information typically lives in the databases that sit behind several different applications. One of the first steps to finding data-driven insights is gathering that information into a single store that an analyst can use without interfering with those applications.

For years, reporting environments have relied on a data warehouse stored in a single, separate relational database management system (RDBMS). But now, due to the growing use of Software as a service (SaaS) applications and NoSQL database options, data may be stored outside the data center and in formats other than tables of rows and columns. It’s increasingly difficult to access the data these applications maintain, and a data warehouse may not be flexible enough to house the gathered information.

For these reasons, reporting teams are building data lakes, and those responsible for using data analytics at universities and colleges are no different. However, it can be challenging to know exactly how to start building this expanded data repository so it can be ready to use quickly and still expandable as future requirements are uncovered. Helping higher education institutions address these challenges is the topic of this post.

About Maryville University

Maryville University is a nationally recognized private institution located in St. Louis, Missouri, and was recently named the second fastest growing private university by The Chronicle of Higher Education. Even with its enrollment growth, the university is committed to a highly personalized education for each student, which requires reliable data that is readily available to multiple departments. University leaders want to offer the right help at the right time to students who may be having difficulty completing the first semester of their course of study. To get started, the data experts in the Office of Strategic Information and members of the IT Department needed to create a data environment to identify students needing assistance.

Critical data sources

Like most universities, Maryville’s student-related data centers around two significant sources: the student information system (SIS), which houses student profiles, course completion, and financial aid information; and the learning management system (LMS) in which students review course materials, complete assignments, and engage in online discussions with faculty and fellow students.

The first of these, the SIS, stores its data in an on-premises relational database, and for several years, a significant subset of its contents had been incorporated into the university’s data warehouse. The LMS, however, contains data that the team had not tried to bring into their data warehouse. Moreover, that data is managed by a SaaS application from Instructure, called “Canvas,” and is not directly accessible for traditional extract, transform, and load (ETL) processing. The team recognized they needed a new approach and began down the path of creating a data lake in AWS to support their analysis goals.

Getting started on the data lake

The first step the team took in building their data lake made use of an open source solution that Harvard’s IT department developed. The solution, comprised of AWS Lambda functions and Amazon Simple Storage Service (S3) buckets, is deployed using AWS CloudFormation. It enables any university that uses Canvas for their LMS to implement a solution that moves LMS data into an S3 data lake on a daily basis. The following diagram illustrates this portion of Maryville’s data lake architecture:


Diagram 1: The data lake for the Learning Management System data

The AWS Lambda functions invoke the LMS REST API on a daily schedule, and Maryville's data, which has previously been unloaded and compressed by Canvas, is securely stored as S3 objects. AWS Glue tables are defined to provide access to these S3 objects. Amazon Simple Notification Service (SNS) informs stakeholders of the status of the data loads.
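As a rough illustration of this flow, the following Lambda sketch copies one compressed Canvas extract into the data lake bucket and publishes a status notification. The event fields, environment variables, and object key layout are hypothetical; the Harvard-developed solution handles discovery of the extract URLs, scheduling, and error handling.

import os
import urllib.request

import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")

DATA_LAKE_BUCKET = os.environ["DATA_LAKE_BUCKET"]
STATUS_TOPIC_ARN = os.environ["STATUS_TOPIC_ARN"]

def handler(event, context):
    # Hypothetical inputs: a pre-signed URL for a nightly Canvas extract and the
    # key under which it should be stored in the data lake.
    dump_url = event["dump_url"]
    object_key = event["object_key"]      # e.g. "canvas/2020/06/01/requests.gz"

    with urllib.request.urlopen(dump_url) as response:
        s3.put_object(Bucket=DATA_LAKE_BUCKET, Key=object_key, Body=response.read())

    sns.publish(
        TopicArn=STATUS_TOPIC_ARN,
        Subject="LMS data load",
        Message=f"Stored {object_key} in {DATA_LAKE_BUCKET}",
    )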

Expanding the data lake

The next step was deciding how to copy the SIS data into S3. The team decided to use the AWS Database Migration Service (DMS) to create daily snapshots of more than 2,500 tables from this database. DMS uses a source endpoint for secure access to the on-premises database instance over VPN. A target endpoint determines the specific S3 bucket into which the data should be written. A migration task defines which tables to copy from the source database along with other migration options. Finally, a replication instance, a fully managed virtual machine, runs the migration task to copy the data. With this configuration in place, the data lake architecture for SIS data looks like this:


Diagram 2: Migrating data from the Student Information System
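For reference, the following boto3 sketch shows the two DMS pieces that change most often in this kind of setup: an S3 target endpoint and a full-load migration task. All identifiers, ARNs, bucket names, and the schema name are placeholders, and the source endpoint and replication instance are assumed to exist already.

import json

import boto3

dms = boto3.client("dms")

# Target endpoint: the S3 bucket and prefix where DMS writes the table snapshots.
target = dms.create_endpoint(
    EndpointIdentifier="sis-data-lake-target",        # placeholder identifier
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "BucketName": "university-data-lake",          # placeholder bucket
        "BucketFolder": "sis",
        "ServiceAccessRoleArn": "arn:aws:iam::111122223333:role/dms-s3-access",
    },
)

# Table mapping: select every table in the SIS schema (schema name is a placeholder).
table_mappings = json.dumps({
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-sis-schema",
        "object-locator": {"schema-name": "SIS", "table-name": "%"},
        "rule-action": "include",
    }]
})

# Migration task: full-load copy over the existing source endpoint and replication instance.
dms.create_replication_task(
    ReplicationTaskIdentifier="sis-daily-snapshot",
    SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint/source-sis",
    TargetEndpointArn=target["Endpoint"]["EndpointArn"],
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep/sis-instance",
    MigrationType="full-load",
    TableMappings=table_mappings,
)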

Handling sensitive data

In building a data lake, you have several options for handling sensitive data, including:

  • Leaving it behind in the source system and avoiding copying it through the data replication process
  • Copying it into the data lake, but taking precautions to ensure that access to it is limited to authorized staff
  • Copying it into the data lake, but applying processes to eliminate, mask, or otherwise obfuscate the data before it is made accessible to analysts and data scientists

The Maryville team decided to take the first of these approaches. Building the data lake gave them a natural opportunity to assess where this data was stored in the source system and then make changes to the source database itself to limit the number of highly sensitive data fields.

Validating the data lake

With these steps completed, the team turned to the final task, which was to validate the data lake. For this process they chose to make use of Amazon Athena, AWS Glue, and Amazon Redshift. AWS Glue provided multiple capabilities including metadata extraction, ETL, and data orchestration. Metadata extraction, completed by Glue crawlers, quickly converted the information that DMS wrote to S3 into metadata defined in the Glue data catalog. This enabled the data in S3 to be accessed using standard SQL statements interactively in Athena. Without the added cost and complexity of a database, Maryville’s data analyst was able to confirm that the data loads were completing successfully. He was also able to resolve specific issues encountered on particular tables. The SQL queries, written in Athena, could later be converted to ETL jobs in AWS Glue, where they could be triggered on a schedule to create additional data in S3. Athena and Glue enabled the ETL that was needed to transform the raw data delivered to S3 into prepared datasets necessary for existing dashboards.

Once curated datasets were created and stored in S3, the data was loaded into an Amazon Redshift data warehouse, which supported direct access by tools outside of AWS using ODBC/JDBC drivers. This capability enabled Maryville’s team to further validate the data by attaching the data in Redshift to existing dashboards that were running in Maryville’s own data center. Redshift’s stored procedure language allowed the team to port some key ETL logic so that the engineering of these datasets could follow a process similar to approaches used in Maryville’s on-premises data warehouse environment.

Conclusion

The overall data lake/data warehouse architecture that the Maryville team constructed currently looks like this:


Diagram 3: The complete architecture

Through this approach, Maryville’s two-person team has moved key data into position for use in a variety of workloads. The data in S3 is now readily accessible for ad hoc interactive SQL workloads in Athena, ETL jobs in Glue, and ultimately for machine learning workloads running in EC2, Lambda, or Amazon SageMaker. In addition, the S3 storage layer is easy to expand without interrupting prior workloads. At the time of this writing, the Maryville team is both beginning to use this environment for machine learning models described earlier as well as adding other data sources into the S3 layer.

Acknowledgements

The solution described in this post resulted from the collaborative effort of Christine McQuie, Data Engineer, and Josh Tepen, Cloud Engineer, at Maryville University, with guidance from Travis Berkley and Craig Jordan, AWS Solutions Architects.

Field Notes: Migrating a Self-managed Kubernetes Cluster on Amazon EC2 to Amazon EKS

Post Syndicated from Ahmed Bham original https://aws.amazon.com/blogs/architecture/field-notes-migrating-a-self-managed-kubernetes-cluster-on-ec2-to-amazon-eks/

AWS customers from startups to enterprises have been successfully running Kubernetes clusters on Amazon EC2 instances since 2015, well before Amazon Elastic Kubernetes Service (Amazon EKS) was launched in 2018. As a fully managed Kubernetes service, Amazon EKS customers can run Kubernetes on AWS without needing to install, operate, and maintain their own Kubernetes control plane. Since its launch, many existing and new customers are building and running their Kubernetes clusters on Amazon EKS.

At re:Invent 2019, AWS announced AWS Fargate for Amazon EKS, which is serverless compute for containers. Then, in January 2020, AWS announced a 50% price reduction for Amazon EKS, to $0.10 per hour, per cluster. These developments, coupled with the realization of the management and cost overhead of Kubernetes control plane operations at scale, made more customers look into migrating their self-managed Kubernetes clusters to Amazon EKS.

The “how” of this migration is the focus of this blog.

Overview of Solution

For most customers, migrating from self-managed Kubernetes clusters on Amazon EC2 to Amazon EKS can usually be accomplished with minimal or no downtime. However, for large clusters involving hundreds of nodes and thousands of pods, this requires more planning and testing, and it is recommended to engage AWS Support for guidance.

Kubernetes control plane

Considerations

There are certain considerations to ensure a successful Amazon EKS migration and operational excellence.

Security

  • Access Control
    • Amazon EKS uses AWS Identity and Access Management (IAM) to provide authentication to your Kubernetes cluster, but it still relies on native Kubernetes Role Based Access Control (RBAC) for authorization. It’s important to plan for the creation and governance of IAM users, roles, or groups, for Kubernetes cluster administration.
    • You can enable private access to the Kubernetes API server so that all communication between your nodes and the API server stays within your VPC. You can limit the IP addresses that can access your API server from the internet, or completely disable internet access to the API server (a minimal sketch of this configuration follows this list).
  • IAM Role for Service Account
    • With IAM roles for service accounts on Amazon EKS clusters, you can associate an IAM role with a Kubernetes service account. This service account can then provide AWS permissions to the containers in any pod that uses that service account. With this feature, you no longer need to provide extended permissions to the node IAM role so that pods on that node can call AWS APIs.
  • Security groups for pods
    • Security groups for pods integrate Amazon EC2 security groups with Kubernetes pods. You can use Amazon EC2 security groups to define rules that allow inbound and outbound network traffic to and from pods. These pods are deployed to nodes running on many Amazon EC2 instance types.
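As a minimal illustration of the API endpoint controls mentioned above, the following boto3 sketch keeps private access enabled and restricts public access to a single CIDR range. The cluster name and CIDR are placeholders.

import boto3

eks = boto3.client("eks")

# Restrict the Kubernetes API server endpoint: keep private access inside the VPC
# and limit public access to a known CIDR range.
eks.update_cluster_config(
    name="my-eks-cluster",
    resourcesVpcConfig={
        "endpointPrivateAccess": True,
        "endpointPublicAccess": True,
        "publicAccessCidrs": ["203.0.113.0/24"],
    },
)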

Networking

  • Amazon EKS supports native VPC networking with the Amazon VPC Container Network Interface (CNI) plugin for Kubernetes. Using this plugin allows Kubernetes pods to have the same IP address inside the pod as they do on the VPC network.
  • The VPC CNI plugin uses IP addresses for the pods from the VPC CIDR ranges, and specifically from the subnet where the worker node is hosted. Therefore, customers must ensure the VPC and subnets that will host their Amazon EKS cluster have sufficient IP addresses available for the expected number of pods running at a time. Additionally, IP addresses are allocated to the Elastic Network Interfaces (ENIs) attached to the EC2 instances. The EC2 instance selection for the worker nodes should take into account the number of ENI attachments supported by the instance type.

Compute Options

Kubernetes Versions

  • Amazon EKS supports four major Kubernetes versions at a time,  which you can review in the available AWS documentation, along with a calendar for future Amazon EKS releases.
  • If you are currently running a non-supported Kubernetes cluster, or would like to migrate to a newer version on Amazon EKS, consider the following:
    1. Review release notes for specific Kubernetes version you want to migrate to, and make necessary updates to Kubernetes manifest files.
    2. Update your Kubernetes add-ons (CNI plugin, CoreDNS, Kube-Proxy) to compatible versions, as listed in the Updating an Amazon EKS Cluster guide.

Prerequisites

  1. Create an IAM Role for the creator and/or administrator of the Amazon EKS cluster. Specify this role when creating the Amazon EKS cluster.
  2. If using an existing VPC and subnets to host Amazon EKS cluster:
    • You will need subnets in at least two Availability Zones
    • All public subnets should have the property MapPublicIpOnLaunch enabled (that is, Auto-assign public IPv4 address in the AWS Management Console) to host self-managed and managed node groups.

  3. If your pods are currently accessing AWS resources, and if you would like to use IAM roles for service accounts, then:
    • Create service accounts in Kubernetes to be used by your pods.
    • Follow these steps to create IAM roles, and assign them to the service accounts created.
    • Update your pod manifest files to specify the newly created service account and role ARN annotation. Remove any existing code for storing or passing IAM credentials.
  4. If you are planning to use AWS Fargate to run your pods, you need to create the appropriate Fargate Profile and pod execution role.

Application and Data Migration

  • For stateless workloads, apply your resource definitions (YAML or Helm) to the new cluster, and make sure everything works as expected. This includes the connection to resources external to the cluster.
  • For stateful workloads:
    1. You will need to carefully plan your migration to avoid data loss or unexpected downtime.
    2. If you are currently using shared persistent file storage based on Amazon EFS or Amazon FSx for Lustre, they can be mounted to Amazon EKS pods concurrently. Just make sure that pods don’t write to the same files concurrently.
    3. For pods using EBS volumes, and for other persistent storage types, you can use a Kubernetes backup and restore tool, Velero.

Traffic Migration

If you have an entire domain that you would like to migrate smoothly, you can take advantage of Amazon Route 53 weighted routing (as shown in the following diagram). With weighted routing, you can transition progressively from your existing cluster to the new one with zero downtime by splitting the traffic at the DNS level.

Your customers are slowly being transferred to your new cluster as their cached TTL expires. The split could start with a small share of your customers, for example, 10% being pointed to the new Amazon EKS cluster and 90% still on the old one. As soon as traffic is confirmed to be working as expected on the new cluster, that percentage of clients pointed to the new one can be increased.

Diagram: Route 53 weighted routing for mydomain.com between the existing cluster and the new Amazon EKS cluster

This implementation is flexible: it can be tied to load balancers, EC2 instances, and even to external on-premises infrastructure.
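A hedged boto3 sketch of how the weighted records could be adjusted is shown below. The hosted zone ID, record name, and load balancer DNS names are placeholders, and the example uses a subdomain because a CNAME record cannot sit at the zone apex; an alias record would be used for the apex itself.

import boto3

route53 = boto3.client("route53")

def shift_traffic(old_weight, new_weight):
    """Adjust the weighted records that split traffic between the two clusters."""
    route53.change_resource_record_sets(
        HostedZoneId="Z0HOSTEDZONEID",  # placeholder hosted zone
        ChangeBatch={
            "Changes": [
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "app.mydomain.com",
                        "Type": "CNAME",
                        "SetIdentifier": "existing-cluster",
                        "Weight": old_weight,
                        "TTL": 60,
                        "ResourceRecords": [{"Value": "old-cluster-lb.example.com"}],
                    },
                },
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "app.mydomain.com",
                        "Type": "CNAME",
                        "SetIdentifier": "eks-cluster",
                        "Weight": new_weight,
                        "TTL": 60,
                        "ResourceRecords": [{"Value": "eks-cluster-lb.example.com"}],
                    },
                },
            ]
        },
    )

# Start with a 90/10 split, then increase the new cluster's share once it is validated.
shift_traffic(old_weight=90, new_weight=10)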

Conclusion

In this blog post, we showed how to migrate your live-traffic serving self-hosted Kubernetes Cluster to Amazon EKS. Amazon EKS offers a cost-effective and highly available option for running Kubernetes clusters in the cloud. Since Amazon EKS is upstream Kubernetes compliant, you can migrate existing self-managed Kubernetes workloads to Amazon EKS, with multiple options to minimize or avoid service disruption. To create your first Amazon EKS cluster, visit Getting started with Amazon EKS.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

 

 

Unlocking Data from Existing Systems with a Serverless API Facade

Post Syndicated from Santiago Freitas original https://aws.amazon.com/blogs/architecture/unlocking-data-from-existing-systems-with-serverless-api-facade/

In today’s modern world, it’s not enough to produce a good product; it’s critical that your products and services are well integrated into the surrounding business ecosystem. Companies lose market share when valuable data about their products or services is locked inside their systems. Business partners and internal teams use data from multiple sources to enhance their customers’ experience.

This blog post explains an architecture pattern for providing access to data and functionalities from existing systems in a consistent way using well-defined APIs. It then covers what the API Facade architecture pattern looks like when implemented on AWS using serverless services for the API management and mediation layers.

Background

Modern applications are often developed with an application programming interface (API)-first approach. This significantly eases integrations with internal and third-party applications by exposing data and functionalities via well-documented APIs.

On the other hand, applications built several years ago have multiple interfaces and data formats which creates a challenge for integrating their data and functionalities into new applications. Those existing applications store vast amounts of historical data. Integrating their data to build new customer experiences can be very valuable.

Figure 1: Existing applications use a broad range of integration methods and data formats

API Facade pattern

When building modern APIs for existing systems, you can use an architecture pattern called API Facade. This pattern creates a layer that exposes well-structured and well-documented APIs northbound, and it integrates southbound with the required interfaces and protocols that existing applications use. This pattern is about creating a facade, which creates a consistent view from the perspective of the API consumer—usually an application developer, and ultimately another application.

In addition to providing a simple interface for complex existing systems, an API Facade allows you to protect future compatibility of your solution. This is because if the underlying systems are modified or replaced, the facade layer will remain the same. From the API consumer perspective, nothing will have changed.

The API facade consists of two layers: 1) API management layer; and 2) mediation layer.


Figure 2: Conceptual representation of API facade pattern.

The API management layer exposes a set of well-designed, well-documented APIs with associated URLs, request parameters and responses, a list of supported headers and query parameters, and possible error codes and descriptions. A developer portal is used to help API consumers discover which APIs are available, browse the API documentation, and register for—and immediately receive—an API key to build applications. The APIs exposed by this layer can be used by external as well as internal consumers and enables them to build applications faster.

The mediation layer is responsible for the integration between the API and the underlying systems. It transforms API requests into formats acceptable to the different systems and then processes and transforms the underlying systems’ responses into the response and data formats the API has promised to return to the API consumers. This layer can perform tasks ranging from simple data manipulations, such as converting a response from XML to JSON, to much more complex operations where an application-specific client is required to run in order to connect to existing systems.
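To make the simple end of that spectrum concrete, here is a minimal Lambda-style mediation sketch that reshapes an XML backend response into the JSON the API exposes. The element and field names are illustrative, and how the XML is obtained (HTTP call, queue, and so on) is left out of scope.

import json
import xml.etree.ElementTree as ET

def handler(event, context):
    """Reshape an XML response from an existing system into the promised JSON format."""
    # In a real mediation function this XML would come from the backend system,
    # for example via an HTTP call; here it is passed in for illustration.
    xml_response = event["backend_response"]

    root = ET.fromstring(xml_response)
    orders = [
        {
            "id": order.findtext("OrderId"),
            "status": order.findtext("Status"),
            "total": float(order.findtext("Total", default="0")),
        }
        for order in root.findall(".//Order")
    ]

    # Response shape expected by an Amazon API Gateway Lambda proxy integration
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"orders": orders}),
    }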

API Facade pattern on AWS serverless platform

To build the API management and the mediation layer, you can leverage services from the AWS serverless platform.

Amazon API Gateway allows you to build the API management layer. With API Gateway you can create RESTful APIs and WebSocket APIs. It supports integration with the mediation layer running on containers on Amazon Elastic Container Service (ECS) or Amazon Elastic Kubernetes Service (EKS), and also integration with serverless compute using AWS Lambda. API Gateway allows you to make your APIs available on the Internet for your business partners and third-party developers or keep them private. Private APIs hosted within your VPC can be accessed by resources inside your VPC, or those connected to your VPC via AWS Direct Connect or Site-to-Site VPN. This allows you to leverage API Gateway to build the API management layer of the API facade pattern for internal and external API consumers.

When it comes to building the Mediation layer, AWS Lambda is a great choice as it runs your mediation code without requiring you to provision or manage servers. AWS Lambda hosts the code that ingests the request coming from the API management layer, processes it, and makes the required format and protocols transformations. It can connect to the existing systems, and then return the response to the API management layer to send it back to the system which originated the request. AWS Lambda functions run outside your VPC or they can be configured to access systems in your VPC or those running in your own data centers connected to AWS via Direct Connect or Site-to-Site VPN.

However, some of the most complex mediations may require a custom client or have the need to maintain a persistent connection to the backend system. In those cases, using containers, and specifically AWS Fargate, would be more suitable. AWS Fargate is a serverless compute engine for containers with support for Amazon ECS and Amazon EKS. Containers running on AWS Fargate can access systems in your VPC or those running in your own data centers via Direct Connect or Site-to-Site VPN.

When building the API Facade pattern using AWS Serverless, you can focus most of your resources writing the API definition and mediation logic instead of managing infrastructure. This makes it easier for the teams who own the existing applications that need to expose data and functionality to own the API management and mediation layer implementations. A team that runs an existing application usually knows the best way to integrate with it. This team is also better equipped to handle changes to the mediation layer, which may be required as a result of changes to the existing application. Those teams will then publish the API information into a developer portal, which could be made available as a central API repository provided by a company’s tools team.

The following figure shows the API Facade pattern built on AWS Serverless using API Gateway for the API management layer and AWS Lambda and Fargate for the mediation layer. It functions as a facade for the existing systems running on-premises connected to AWS via Direct Connect and Site-to-Site VPN. The APIs are also exposed to external consumers via a public API endpoint as well as to internal consumers within a VPC. API Gateway supports multiple mechanisms for controlling and managing access to your API.


Figure 3: API Facade pattern built on AWS Serverless

To provide an example of a practical implementation of this pattern, we can look at UK Open Banking. The Open Banking standard set the API specifications for delivering account information and payment initiation services that banks such as HSBC had to implement. HSBC’s internal landscape is hugely varied, and they needed to harness the power of multiple disparate on-premises systems while providing a uniform API to the outside world. HSBC shared how they met these requirements in this re:Invent 2019 session.

Conclusion

You can build differentiated customer experiences and bring services to market faster when you integrate your products and services into the surrounding business ecosystem. Your systems can participate in a business ecosystem more effectively when they expose their data and capabilities via well-established APIs. The API Facade pattern enables existing systems that don’t offer well-established APIs natively to participate in this well-integrated business ecosystem. By building the API Facade pattern on the AWS serverless platform, you can focus on defining the APIs and the mediation layer code instead of spending resources on managing the infrastructure required to implement this pattern. This allows you to implement this pattern faster.

Field Notes: Building an Autonomous Driving and ADAS Data Lake on AWS

Post Syndicated from Junjie Tang original https://aws.amazon.com/blogs/architecture/field-notes-building-an-autonomous-driving-and-adas-data-lake-on-aws/

Customers developing self-driving car technology are continuously challenged by the amount of data captured and created during the development lifecycle. This is accelerated by the need to design and launch incremental feature improvements on advanced driver-assistance systems (ADAS). Efforts to advance ADAS functionality have led to new approaches for storing, cataloging, and analyzing driving data captured from vehicles on the road today. Combining data from connected vehicle fleets transmitted over cellular networks in combination with manually ingesting data from vehicle data loggers requires a complex architecture and elastic data lake capability that only AWS provides.

This blog explains how to build an Autonomous Driving Data Lake using this Reference Architecture. We cover the workflow from how to ingest the data, prepare it for machine learning, catalog the output from ADAS systems and vehicle sensors, label it, automatically detect scenarios, and manage the various workflows required for moving it into an organized data lake construct. The AWS Autonomous Driving and ADAS Data Lake Reference Architecture was developed after working with numerous customers on the challenges they faced in achieving this. Using multiple AWS services and solution best practices, we outlined an approach we found helpful to others.

In the blog post, Autonomous Vehicle and ADAS development on AWS Part 1: Achieving Scale, we outlined the benefits of having a data lake in the cloud, as opposed to building your own on-premises data lake solution.

Before we dive into the details of the reference architecture, let’s review the typical workflow of autonomous and ADAS development. The diagram below shows the following steps, many of which are common to any machine learning project:

  • data acquisition and ingest,
  • data processing and analytics,
  • labeling, map development,
  • model and algorithm development,
  • simulation and validation, and
  • orchestration and deployment.

This blog focuses on data ingest, data processing and analytics, labeling, and the data lake itself, as shown in the following workflow diagram:

Figure: Autonomous driving development workflow

Why build a data lake for ADAS and Autonomous Driving System development?

At AWS re:Invent 2019, BMW spoke at the session (AUT306): Creating a data-driven, cloud-native ecosystem at BMW Group, where they explained the motivation for developing an enterprise-wide cloud data lake, or the Cloud Data Hub as they call it. The Cloud Data Hub ingests information from multiple lines of business, including sources like manufacturing systems, logistics, customer service, after-sales, and connected vehicle sensor telemetry data.

BMW created a global organization to drive DevOps culture with a strong emphasis on tooling, data quality, and end-to-end data lineage. Their journey began in the Hadoop eco-system (Hive, HBase) with an on-premises, heterogeneous, and difficult-to-scale environment, then moved to cloud-native building blocks with a focus on data quality. Examples of benefits derived from the AWS Cloud implementation are multi-region deployments, high security standards, and compliance with local regulations. The goal is to democratize the most valuable information assets so they can be used across a broad community globally.

The BMW Cloud Data Hub is just one example of how customers manage information assets for sharing across the enterprise. Using a similar approach, the AWS Autonomous Driving and ADAS Data Lake Reference Architecture extends the data lake pattern to address specific challenges around:

  1. metadata and the data catalog including automated scenario detection;
  2. data lineage from source to semantic layer;
  3. data sharing with external consumers and third party service providers via Amazon SageMaker Ground Truth.

 

Now let’s review the details of the Autonomous Driving Data Lake Reference Architecture:

1. Ingest data from autonomous fleet with AWS Outposts for local data processing. 

Sensor data is captured and written to data loggers containing multiple SSD hard drives. Once at the garage or customer facility, the hard drives are removed and inserted into copy stations. From there, the data is copied to Amazon S3 or to a local storage system and AWS Outposts for pre-processing. AWS Outposts is a fully managed service that extends AWS infrastructure and services like Amazon Elastic Kubernetes Service, Amazon Relational Database Service, and Amazon EMR to on-premises locations (EMR supports EMRFS and HDFS, which both natively support S3). These services are used to run data integrity checks, compress the data to remove redundant information, and prepare the data for downstream AD workloads.

Using AWS DataSync, the data is synchronized securely and at high data rates between on-premises network attached storage (NAS) sources and Amazon S3. This is done over a high-bandwidth connection provided by AWS Direct Connect.

2. Ingest vehicle telemetry data in real time using AWS IoT Core and Amazon Kinesis Data Firehose.

Vehicle telemetry is captured and published to the cloud using a number of different technologies, typically over HTTPS or MQTT. In this architecture, AWS IoT Greengrass provides an intelligent edge runtime in the vehicle with application logic running in Lambda functions deployed locally to filter vehicle network signals like CAN data, GPS location, ADAS system output, road condition metadata derived from cameras, and other vehicle sensor information.

AWS IoT Greengrass allows customers to deploy containers and machine learning inference models, create multiple data streams, and prioritize them based on business logic you define. Eventually, the data ends up in S3, where it is combined with the sensor data captured from the data logger process defined in step one above.
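As an illustration of the edge filtering described above, the following Greengrass Lambda sketch suppresses low-value speed samples and forwards everything else to a cloud-bound topic. The signal format, threshold, and topic name are assumptions made for the example only.

import json

import greengrasssdk  # available in the AWS IoT Greengrass core runtime

iot = greengrasssdk.client("iot-data")

SPEED_THRESHOLD_KPH = 3  # placeholder rule: drop samples while the vehicle is parked

def handler(event, context):
    """Filter decoded vehicle signals at the edge and forward only relevant samples."""
    signal = event  # e.g. {"signal": "vehicle_speed", "value": 42.5, "ts": 1591000000}

    if signal.get("signal") == "vehicle_speed" and signal.get("value", 0) < SPEED_THRESHOLD_KPH:
        return  # suppress low-value telemetry before it leaves the vehicle

    iot.publish(
        topic="vehicle/telemetry/filtered",  # placeholder topic routed to the cloud
        payload=json.dumps(signal),
    )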

3. Remove and transform low quality data.

Autonomous vehicles produce terabytes of data per hour. In this trove of information, there may be redundant as well as corrupted data coming from the vehicle telemetry stream and raw sensor stack. This data needs to be normalized for optimal downstream processing. Customers use a number of technologies to do this. For example, Amazon EMR provides a runtime for high-volume, complex data processing using open source Apache big data processing engines like Spark. There are a few common steps in the data transformation (a minimal PySpark sketch follows the list), including:

  • checking if the driving is complete by combining batch files and streaming data;
  • parsing the log files based on recording formats (rosbag, mdf4, etc.);
  • decoding the signals from binary formats to readable text;
  • filtering inconsistent data files; and
  • synchronizing the timestamps of the signals
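The following PySpark sketch, runnable on an EMR cluster, illustrates the filtering and timestamp alignment steps. The S3 paths, column names, and validity bounds are placeholder assumptions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("drive-log-normalization").getOrCreate()

# Assumes the decoded signals have already been landed as Parquet; paths are placeholders.
signals = spark.read.parquet("s3://ad-data-lake/decoded/drive_id=2020-06-01/")

cleaned = (
    signals
    # drop obviously inconsistent rows, e.g. missing timestamps or out-of-range values
    .filter(F.col("timestamp").isNotNull())
    .filter((F.col("vehicle_speed") >= 0) & (F.col("vehicle_speed") < 400))
    # align all signals to a common millisecond-resolution timestamp
    .withColumn("ts_ms", (F.col("timestamp") * 1000).cast("long"))
    .dropDuplicates(["signal_name", "ts_ms"])
)

cleaned.write.mode("overwrite").parquet("s3://ad-data-lake/normalized/drive_id=2020-06-01/")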

EMR Launch is an open source framework developed by AWS Labs available for customers to accelerate and simplify defining, deploying, managing, and using Amazon EMR clusters with the following features:

  • separating the definition of cluster security configurations (EMR Profile) and cluster resource configurations (Cluster Configuration) into reusable and shareable constructs; and
  • providing a suite of tools to simplify the construction of orchestration pipelines using AWS Step Functions (EMR Launch Function).

4. Schedule the extract, transform, load (ETL) jobs using Apache Airflow.

To create trustworthy insights and empower your next machine learning use case with a reliable data foundation you need to bring the data creation process under design control. Fundamental to this approach is a centralized, governed workflow system powered by Apache Airflow. With Airflow you can establish trust in your data processing pipelines by making the workflow part of your code base to enable transparent, repeatable, pipeline executions.

The following solution diagram shows how radar and video data processing in MDF4 format achieves the highest scalability by leveraging AWS Fargate for Amazon ECS.

  • To ensure data pipeline integrity, the solution is deployed securely in an Amazon Virtual Private Cloud with end-to-end TLS and is only accessed from a private bastion host.
  • The containers for Airflow Webserver, Scheduler and Worker are deployed in multiple Availability Zones for high availability.
  • The communication between the components is decoupled via Amazon ElastiCache for Redis.
  • The status of running jobs is stored in Amazon Aurora.
  • To learn more about how to leverage complex workflows and model training jobs on AWS, a future blog post will describe the detailed architecture of Apache Airflow.

Figure: Radar and video data processing in MDF4 format

5. Enrich data with map information and weather conditions based on GPS location and timestamp.

Datasets are enriched using Amazon EMR with map or weather information from external geospatial and weather service providers and stored in Amazon S3 or in database services like Amazon DynamoDB. Sensors like cameras and LiDAR could malfunction or even fail in adverse weather conditions. Advanced sensor fusion could perform weather perception in real time, and the result of that perception could be verified against the real weather conditions.

6. Extract metadata into Amazon DynamoDB and Amazon Elasticsearch Service. 

Using drive logs that contain the telemetry, telematics, perception, and sensor data, a catalog is built to create searchable quantitative metadata that includes speed, turning angles, and location, as well as simple and complex semantic descriptions of scene snippets such as “high velocity,” “left turn,” or “pedestrian.” The data lake catalog is updated with scenario data and indexed in Amazon Elasticsearch Service for discovery by analysts and ADS engineers. The extraction process is ideally fully automated, but many higher order behavioral descriptions may require human annotations from later processing steps.
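For illustration, a single catalog entry for a detected scene snippet could be written to DynamoDB as in the sketch below; the same document would typically also be indexed into Elasticsearch for free-text search. The table and attribute names are hypothetical.

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("drive-scenario-catalog")  # placeholder table name

# One catalog entry per detected scene snippet; attribute names are illustrative.
table.put_item(
    Item={
        "drive_id": "2020-06-01-vehicle-042",
        "scene_id": "000123",
        "start_ts_ms": 1591000123000,
        "end_ts_ms": 1591000131000,
        "max_speed_kph": 87,
        "tags": ["left turn", "pedestrian", "high velocity"],
        "s3_uri": "s3://ad-data-lake/normalized/2020-06-01-vehicle-042/scene-000123/",
    }
)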

The majority of systems leverage quantitative and simple semantic descriptions for the drive data, with a clear trend towards needing higher order behaviors extracted into the data lake catalog. These more complex behaviors are ideal for enhanced search capabilities and for better validation of coverage mapping. ASAM OpenSCENARIO defines a scenario description language that provides a common ontology and hierarchy for detailing the behavior of the vehicle and its surroundings.

This standard provides an open approach for describing complex, synchronized maneuvers that involve multiple entities including other vehicles, vulnerable road users (VRUs) like pedestrians, bikers, construction workers, and other traffic participants. The description of a maneuver may be based on driver actions like performing a lane change by the ego, or based on the actions of others in the scenario such as a cut in from another driver. OpenSCENARIO also accounts for the appearance/description of the participant in the scene.

7. Store data lineage in Amazon Neptune and catalog data using AWS Glue Data Catalog.

Amazon Neptune is a fully managed graph database service. It’s useful to catalog data lineage in a graph model to visualize file and object dependencies. AWS Glue is a fully managed service that provides a data catalog making assets in the data lake discoverable. Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.

The following diagram shows how we parse data from the Ford Autonomous Vehicle Dataset in Rosbag format using Amazon EMR. We store it in Amazon S3 in Parquet format, use an AWS Glue crawler to read the file schema and create the tables in the AWS Glue Data Catalog, and finally use Amazon Athena to query the velocity data.

Figure: Parsing the Ford Autonomous Vehicle Dataset
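The cataloging and query steps of that flow could look like the following boto3 sketch. The crawler, database, table, and column names are placeholders that would match however the Parquet output is organized.

import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# Crawl the Parquet output so the tables appear in the AWS Glue Data Catalog
# (crawler, database, and table names are placeholders).
glue.start_crawler(Name="ford-avdata-crawler")

# Once the crawler has finished, query the velocity data with Amazon Athena.
athena.start_query_execution(
    QueryString="""
        SELECT timestamp, linear_velocity
        FROM ford_avdata.pose_ground_truth
        ORDER BY timestamp
        LIMIT 100
    """,
    QueryExecutionContext={"Database": "ford_avdata"},
    ResultConfiguration={"OutputLocation": "s3://ad-data-lake/athena-results/"},
)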

8. Process drive data and perform deep signal validation. 

Deploy your drive data signal validation code in Amazon Elastic Kubernetes Service (Amazon EKS). EKS is a managed service that makes it easy for you to run Kubernetes without needing to stand up or maintain your own Kubernetes control plane. Only a subset of the signals from the Rosbag or MDF4 files will be extracted for KPI calculation and aggregation, which potentially reduces the stored data volume from gigabytes to megabytes.

9. Perform automated labeling using Amazon SageMaker Ground Truth.

Amazon SageMaker Ground Truth is a fully managed data labeling service that makes it easy to build highly accurate training datasets for machine learning. Ground Truth offers automatic data labeling and/or annotation which uses a machine learning model to label your data.

In addition, the service helps you create custom workflows for data labeling that leverage human workers from Amazon Mechanical Turk, the AWS Partner Network, or your own private workforce to improve the automated labeling accuracy. Ground Truth now supports 3D Point Cloud Labeling for task types like Object Detection, Object Tracking and Semantic Segmentation. This blog shows how to use the service with open data sets from Audi A2D2 and KITTI. Alternatively, customers can run a custom container on Amazon EKS for Ground Truth generation and labeling.

10. Provide a search function for particular scenarios using AWS AppSync.

Developers and data scientists can search for a particular scenario and all of the associated metadata related to it. AWS AppSync is a managed service that uses GraphQL to make it easy for applications to get data from a range of data sources such as Amazon DynamoDB, Amazon ES and AWS Lambda.
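A search request against such an API could look like the following sketch, which posts a GraphQL query to an AppSync endpoint using API-key authentication. The endpoint, key, and schema (query name and fields) are hypothetical and depend on how the scenario catalog is modeled.

import json
import urllib.request

# Placeholder endpoint and API key; the GraphQL schema below is purely illustrative.
APPSYNC_URL = "https://example123.appsync-api.us-east-1.amazonaws.com/graphql"
API_KEY = "da2-exampleapikey"

query = """
query SearchScenarios($tag: String!) {
  searchScenarios(tag: $tag) {
    driveId
    sceneId
    s3Uri
  }
}
"""

payload = json.dumps({"query": query, "variables": {"tag": "left turn"}}).encode()
request = urllib.request.Request(
    APPSYNC_URL,
    data=payload,
    headers={"Content-Type": "application/json", "x-api-key": API_KEY},
)

with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))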

Additional Aspects to consider

  • China: The collection of raw data that includes video, lidar, radar, and GPS data is defined as a controlled activity by the government (geographic information surveying and mapping). This is a regulated activity and must be done under the governance of local certified map providers with navigation surveying licenses.
  • Data Encryption and Anonymization: Some ADS/ADAS use cases include sensitive or personal information. AWS Key Management Service (KMS) supports Customer Master Keys (CMKs). Vehicle Identification Numbers (VIN) can be anonymized by Amazon EMR jobs. This blog shows how to anonymize personal data like faces from the video using Amazon Rekognition.
  • Exchange Data with partners: AWS Data Exchange is a service that makes it easy for AWS customers to securely exchange file-based data sets in the AWS Cloud. Providers in AWS Data Exchange have a secure, transparent, and reliable channel to reach AWS customers and grant existing customers their subscriptions more efficiently.
  • Data Lake as code: AWS provides a full stack of DevOps tooling, including AWS CodeCommit, AWS CodeBuild, and AWS CodePipeline, to simplify the provisioning and management of infrastructure, deploy application code, automate software release processes, and monitor application and infrastructure performance. Third-party CI/CD tools like Jenkins and Zuul can be integrated as well.

Conclusion

In this post, we discussed the steps outlined in this Reference architecture to build an Autonomous Driving and ADAS Data Lake on AWS. We hope you found this interesting and helpful and invite your comments on the architecture.

Also, check out the Automotive issue of the AWS Architecture Monthly Magazine.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Field Notes: Implementing Hardware-in-the-Loop for Autonomous Driving Development on AWS

Post Syndicated from Bryan Berezdivin original https://aws.amazon.com/blogs/architecture/field-notes-implementing-hardware-in-the-loop-for-autonomous-driving-development-on-aws/

Automotive customers use AWS as their platform for advanced driving assistance systems (ADAS) and autonomous driving (AD) development to accelerate their development cycles and experience faster time-to-market.  In the blog post, Autonomous Vehicle and ADAS development on AWS Part 1: Achieving Scale, we illustrated how software in the loop (SiL) and hardware in the loop (HiL) simulations are part of the workflow used to develop and validate safe AD and ADAS functionality. In this post, I run through some of the more common questions and patterns for implementing HiL on AWS, while looking at some of the differences from running your development all on-premises.

HiL simulations leverage test drive data and derived synthetic data to develop and validate various functions in the AD software stack.  Test drive log data is ingested and stored in Amazon S3 for use for HiL simulations in parallel with other AD development workloads including visualization, processing, labeling, analysis, and model and algorithm development.

As such, we see customers operating in a hybrid context, with their HiL workloads running on-premises to support customized equipment. For ADAS and AD customers, this poses a few questions and considerations:

  • What are the recommendations to deploy HiL for AD on AWS?
  • How is this different from what customers were used to on-premises?

For hybrid customers, there are assumptions and misconceptions:

  • Do I need to replicate the test drive data locally, and if so, what are the considerations and consequences?

For the purposes of brevity, the remainder of this blog post will use the term AD development to encompass ADAS and AD unless specifically called out.

HiL Building Blocks

Simulations and validations make up an important aspect of AD development. According to RAND's analysis, there is a need to demonstrate safe driving over billions of miles for an autonomous vehicle to have a lower failure rate than a human driver. While this analysis is statistically derived for fully autonomous driving (SAE Level 5), it demonstrates the need to validate ADAS and AD functions over many millions of miles. This pattern is reflected in most ADAS and AD development projects, where software in the loop (SiL) and HiL are used for verification and validation.

The following diagram is an illustration of the ISO 26262 V-Model, a product development approach for matching requirements with corresponding tests, where HiL simulation is required for much of system level testing and validation phases on the right side of the overlaid V-Model.

Figure 1: V-Model as defined by ISO-26262


HiL simulations require a few key elements. The main component is the device under test (DUT), such as one or more electronic control units (ECUs) running the AD software stack. HiL simulations allow customers to put the DUT under the rigor of real-world signals found in a vehicle. By providing more accurate environments and scenarios for the DUT, it can be fine-tuned for key performance indicators (KPIs) such as power utilization, response time, and accuracy.

The DUT is connected to a "HiL rig," a high-performance server with multiple expansion boards to connect to various components of the AD system. The interfaces are identified in Figure 2: High Level Hardware-in-the-loop (HiL) System and Interfaces and include Controller Area Network (CAN), Automotive Ethernet, Low-Voltage Differential Signaling (LVDS), and PCI Express. These interfaces emulate the vehicle topology for testing purposes and allow system-level validation of the DUT.

The HiL rigs and the corresponding software tooling are offered by companies like Elektrobit, dSPACE, National Instruments, and OPAL-RT. The HiL systems facilitate time-synchronized inputs and outputs to the DUT and measure system performance. These solutions are optimized for large-scale operation and faster validation cycles, which matters when a project requires validation across a large number of miles. Elektrobit, for example, provides the ability to orchestrate and deploy large HiL server farms that work in parallel. Larger server farms allow parallel HiL simulations that reduce the time to validate thousands of miles of drive time and assess the key performance indicators (KPIs) for the feature sets, so results can be acted on more quickly and the overall development time is reduced.

Figure 2: High Level Hardware-in-the-loop (HiL) System and Interfaces


The HiL system loads sensor data derived from test drive logs. These logs vary in format, but are often captured and stored as MDF4, ADTF, rosbag, or other data-logger proprietary formats. They are then processed for HiL simulations to implement open-loop and closed-loop simulations.

  • Open loop simulations refer to replay of log data from test drives.
  • Closed-loop simulations rely on the behavior of the system as inputs vary based on new outputs of the simulation.

Both open-loop and closed-loop simulations are part of autonomous driving development, but open-loop simulations require the largest datasets (multiple petabytes on average) because they rely on log data from test drives, making them the primary concern when deploying HiL in a hybrid manner.

Overview of Solution

Architectures for supporting HIL simulations with AWS for AD development vary based primarily on the networking available at HiL locations. A common pattern for AWS customers is to have the HiL systems directly interfacing with Amazon S3 over high-bandwidth network links leveraging AWS Direct Connect. This is the simplest approach to deploying HiL and avoids hybrid data management of the petabytes of data in Amazon S3 to a local storage system.

AWS Direct Connect gives customers options to deploy their HiL rigs in their own data center or in AWS colocation facilities with low-latency connections. AWS has a large number of Direct Connect locations and points of presence (PoPs) to enable low-latency connectivity to any of the more than 24 AWS Regions. The following diagram illustrates a reference architecture leveraging a direct interface from the HiL systems to Amazon S3.

Figure 3: Reference Architecture for Hardware-in-the-Loop (HiL) Direct to Amazon S3

As shown in Figure 3, we illustrate the common interfaces, topology, and AWS services used for autonomous driving customers.

  • Amazon S3 is used to store and analyze the test drive logs used by the HiL simulations and also the results from the simulation runs for further analysis.
  • Metadata from the test drive data is populated into various database and analytics services, referred to as the data catalogues, by metadata crawlers and processing pipelines that extract it from the drive-log and test-result data on Amazon S3.
  • The data catalogues provide flexible search interfaces for developers, validation engineers, or advanced analytics tools. These systems provide keyword search in Amazon Elasticsearch Service, SQL queries in Amazon Redshift or Amazon RDS, and NoSQL interfaces using Amazon DynamoDB. AWS Partner Network solutions for these database and analysis tools are common as well, such as those in AWS Marketplace.
  • Validation engineers, data scientists, or developers use these data catalogues to find scenarios for testing. These personas also use the HiL management interfaces to configure and orchestrate the HiL simulation runs on the identified scenarios and to ensure traceability.
  • HiL management systems control the HiL Rigs that interface to the DUT and implement the HiL simulations using the test drive logs. The HiL management system then writes results back to S3 for further analysis via various tool chains.

A common question AWS customers have is how to determine an optimal hybrid architecture using this approach. The primary factors are properly sized network links to accommodate data sets used by the HiL simulations as well as low latency network links between Amazon S3 and the HiL rigs. As a result, a key factor is ensuring use of an AWS region for your AWS storage that is in close proximity to your HiL testing site(s).

Based on current HiL implementations, open-loop simulations can sustain latencies of 30-50 ms RTT. AWS has numerous AWS Direct Connect locations in colocation facilities with latencies under 5 ms RTT. Sizing for these network links can be calculated from the expected dataset sizes and the time interval targeted for a simulation run. The following is a basic formula used for network sizing.

Average_Throughput (Gbps) = Average_Dataset_Size(GB)*8 / Time_Interval (seconds)

As an example, for a scenario where an average of 20 PB is needed by the HiL rig every 2 weeks, the formula gives roughly 130 Gbps of sustained throughput, so provisioning on the order of 200 Gbps of AWS Direct Connect bandwidth leaves headroom for bursts and other traffic.
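To make the sizing repeatable, the formula can be run in a few lines of Python. This is only a sketch of the arithmetic above using the 20 PB / 2-week example; adjust the inputs to your own dataset size and refresh interval.

# Sketch of the network sizing formula: 20 PB of drive data needed every 2 weeks.
dataset_size_gb = 20 * 1000 * 1000      # 20 PB expressed in GB (decimal units)
time_interval_s = 14 * 24 * 3600        # 2 weeks in seconds

average_throughput_gbps = dataset_size_gb * 8 / time_interval_s
print(f"Required sustained throughput: {average_throughput_gbps:.0f} Gbps")  # ~132 Gbps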

Figure 4 shows an example of a high-level architecture supported by Elektrobit with multiple EB 9101 test racks grouped together. This architecture supports multiple ECUs to be tested at once, leveraging drive log data in Amazon S3. This system is controlled with a central management software that allows optimal orchestration to keep the Elektrobit HiL system running optimally.

Use cases include:

  • The automated replay of all relevant sensor data with high time precision to ECU
  • Capture of ECU responses including debug data
  • Integration of customer components inside the HiL rack for visualization or post-processing.

Another common question from AWS customers is whether this architecture is supported by their HiL implementation. Many HiL providers are adding AWS functionality to their software and hardware stacks as customers transition to the cloud for their development platforms. Some vendors still need to add Amazon S3 as a supported interface in their HiL rigs. The work needed to accommodate Amazon S3 is usually a small level of effort for any developer using the AWS SDKs on the HiL rig software stack. If you have a project where this is needed, contact the AWS account teams and your HiL vendors to ensure a successful and cost-efficient project implementation.
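As an illustration of how small that effort can be, the following is a minimal sketch of a HiL rig software stack pulling a drive-log object directly from Amazon S3 with the AWS SDK for Python (boto3). The bucket name, object key, and local path are placeholders.

# Minimal sketch: download a drive-log object from Amazon S3 to the HiL rig.
import boto3

s3 = boto3.client("s3")
s3.download_file(
    Bucket="my-drive-log-bucket",                # placeholder bucket
    Key="campaign-42/2020-06-01/run-0001.mf4",   # placeholder drive-log key
    Filename="/hil/cache/run-0001.mf4",          # local path on the HiL rig
)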

Figure 4: Elektrobit HiL Architecture with AWS


An alternative HiL solution, shown in Figure 5, keeps Amazon S3 as the primary storage for drive-log data while a scale-out NAS storage system located on-premises operates as a cache for the HiL rigs. This is common when the networking options at the HiL site are limited in bandwidth or latency relative to the target datasets and time windows.

AWS customers calculate the size of the cache needed to transfer the entire dataset over the intended time interval. The following simple calculation demonstrates this.

Cache_Size(GB) = Average_Dataset_Size(GB) - Average_Throughput (Gbps) /8 * Time_Interval (seconds)

In this example, a customer has 40 Gbps of AWS Direct Connect available and a 10 PB dataset needed for HiL simulations every 2 weeks. Using the preceding formula, there is a need for a local cache of about 4 PB capable of high read rates.
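The same arithmetic can be checked in Python. This is a sketch using the 40 Gbps / 10 PB example above.

# Sketch of the cache sizing formula: 10 PB needed every 2 weeks over a 40 Gbps link.
dataset_size_gb = 10 * 1000 * 1000    # 10 PB in GB (decimal units)
throughput_gbps = 40
time_interval_s = 14 * 24 * 3600      # 2 weeks in seconds

cache_size_gb = dataset_size_gb - throughput_gbps / 8 * time_interval_s
print(f"Local cache required: {cache_size_gb / 1e6:.1f} PB")  # ~4 PB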

Figure 5: Reference Architecture for Hardware-in-the Loop (HiL) with Local Cache to Amazon S3


In this hybrid architecture there is a need to orchestrate the data movement in line with the needs of the HiL simulation dataset. This generally requires third-party software or functionality built into workflow orchestration tools like Apache Airflow. At CES 2019, Dell EMC and AWS illustrated a solution for this hybrid architecture, documented in this short solution brief, using Isilon as the scale-out NAS storage system and DataIQ as the data movement and orchestration mechanism.

Any of these architectures can be cost-optimized, and AWS has programs and pricing options for AWS Direct Connect as well as the other AWS services involved. There are Enterprise Agreements and Migration Acceleration Programs (MAP) aligned with the holistic AD development platform needs that reduce the costs of the hybrid architecture functionality needed in the HiL solutions. One common need is the AWS Direct Connect "flat rate" pricing option to accommodate the data transfer out (DTO) needs of the HiL workload. If you need details on these programs for your AD development project, contact your AWS account team.

Conclusion

In this blog post, we discussed two common architectural patterns for supporting HiL simulations for ADAS and Autonomous Driving development. These help customers decide on the right networking, storage, and hybrid topologies for these systems.

HiL systems directly interfacing with Amazon S3 is the most common pattern, as seen with the Elektrobit HiL solutions, but for customers with limited network links the use of a local cache is an option. Autonomous driving customers looking to increase velocity in their SAE Level 2-5 development programs with HiL simulations have achieved success with AWS as the development platform using these patterns. AWS has a team dedicated to autonomous driving, so contact your AWS account team to get a more prescriptive solution for your HiL or related ADAS and AD development needs.

Also, check out the Automotive issue of the AWS Architecture Monthly Magazine.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Architecture Monthly Magazine: AWS Solutions

Post Syndicated from Annik Stahl original https://aws.amazon.com/blogs/architecture/architecture-monthly-magazine-aws-solutions/

Architecture Monthly - October 2020 - AWS Solutions

For October's issue of AWS Architecture Monthly Magazine, we decided to do a deep dive into the AWS Solutions Library, a virtual treasure trove of cloud-based solutions for dozens of technical and business problems. Whether you want to combine pre-built, well-architected multi-service patterns to create your own solution, deploy vetted architecture directly into your AWS account, or get help deploying vetted architecture from AWS Competency Partners, we can help. Our expert runs us through the various offerings you can take advantage of, and some of our other guest writers will go more deeply into the individual options.

In this month’s AWS Solutions issue

  • Ask an Expert: Tom Begley, Manager, AWS Solutions Builder
  • Customer Success Story: App8: Helping Restaurants Succeed during COVID-19
  • AWS Solutions Implementations: Detailed architectures, a deployment guide, and instructions for both automated and manual deployment
  • AWS Solutions Constructs: Building faster and more confidently with vetted architecture patterns
  • AWS Solutions Consulting Offers: Enhancing the AWS Solutions Library to address customer needs
  • Related Videos: Watch what AWS Solutions can do for you

How to access the magazine

We hope you're enjoying Architecture Monthly, and we'd like to hear from you. Leave us a star rating and a comment on the Amazon Kindle Newsstand page or contact us anytime at [email protected].

Nielsen: Processing 55TB of Data Per Day with AWS Lambda

Post Syndicated from Boaz Ziniman original https://aws.amazon.com/blogs/architecture/nielsen-processing-55tb-of-data-per-day-with-aws-lambda/

Earlier this year, I went into the studio with Opher Dubrovsky from Nielsen Marketing Cloud (a data management platform) to record an episode of This is My Architecture about Big Data architecture.

In preparation for the recording and during my initial conversations with Opher, I realized that there is an amazing story here that can help a lot of developers and architects who are building, or thinking of building, serverless architecture.

Many of these builders aren't sure whether serverless can withstand the load they need, what the costs are going to look like, or how complicated and different it is from what they are familiar with, and they have countless more questions. Nielsen's story is a great example of how you can take this technology, which I like so much, to the extreme.

Serverless Big Data?

There are many use cases for serverless architecture, such as web systems, use of IoT, connection to microservices via API, automation, and more. Using serverless for data processing is one of the earliest architectures, and serverless Big Data, whether it is for processing large-volume data streams or file processing, is a very common practice.

What surprised me about Nielsen's story was the size and complexity of the solution. Nielsen's system, called "DataOut," processes 250 billion events per day, which translates to 55 terabytes (TB) of data. The system can automatically scale up and down, thanks to the capabilities of serverless architecture, processing from 1 TB to 6 TB of data per hour and costing "only" $1,000 per day. (If this sounds like a lot to you, think again about how much computing power it takes to process this amount of data and how much workforce time the system saves.)

What did we learn?

More and more organizations (small and large) are adopting serverless architecture-based solutions for their core systems. The technology has the ability to scale up (and down), reach a very large scale, and provide flexibility at a reasonable price. This is what makes this architecture, even in the world of Big Data, something worth considering when planning a new system or improving an existing one.

True, there are still challenges to be solved and knowledge to be gained as more and more developers use these solutions, but Nielsen's story is just one of a series of solutions built for better scalability that are mostly simpler to operate and maintain.

Spend 10 minutes with Opher and me in the video below to learn how Nielsen built a data processing machine without machines (kind of).

For more content like this, subscribe to our YouTube channels This is My Architecture, This is My Code, and This is My Model, or visit This is My Architecture on AWS, which has search functionality and the ability to filter by industry, language, and service.

New Whitepaper: Selecting & Designing Your Hybrid Connectivity Model

Post Syndicated from Santiago Freitas original https://aws.amazon.com/blogs/architecture/new-whitepaper-selecting-designing-your-hybrid-connectivity-model/

Introduction

Many organizations need to connect their on-premises data centers, remote sites, and the cloud. A hybrid network connects these different environments.

A modern organization uses an extensive array of IT resources. In the past, it was common to host these resources in an on-premises data center or a colocation facility. With the increased adoption of cloud computing, IT resources are delivered and consumed from cloud service providers over a network connection. In some cases, organizations have opted to migrate all existing IT resources to the cloud. In other cases, organizations maintain IT resources both on premises and in the cloud. In both cases, a common network is required to connect on-premises and cloud resources. Coexistence of on-premises and cloud resources is called "hybrid cloud" and the common network connecting them is referred to as a "hybrid network." Even if your organization keeps all of its IT resources in the cloud, it may still require hybrid connectivity to remote sites.

There are several connectivity models to choose from. Although having options adds flexibility, selecting the best option requires analysis of the business and technical requirements and the elimination of options that are not suitable. Requirements can be grouped together across considerations, such as: security, time to deploy, performance, reliability, communication model, scalability, and more. Once requirements are carefully collected, analyzed, and considered, network and cloud architects identify applicable AWS hybrid network building blocks and solutions. To identify and select the optimal model(s), architects must understand advantages and disadvantages of each model. There are also technical limitations that might cause an otherwise good model to be excluded.


Figure 1 – Considerations covered in the whitepaper

A new whitepaper on Hybrid Connectivity describes AWS building blocks and the key things to consider when deciding which hybrid connectivity model is right for you. To help you determine the best solution for your business and technical requirements, we provide decision trees to guide you through the logical selection process as well as a customer use case to show how to apply the considerations and decision trees in practice.

Decision tree applied to Example Corp. Automotive use case

Figure 2: Example Corp. Automotive connection type decision tree

Contributors

Contributors to this new whitepaper on Hybrid Connectivity are: Marwan Al Shawi, AWS Solutions Architect; Santiago Freitas, AWS Head of Technology; Evgeny Vaganov, AWS Specialist Solutions Architect – Networking; and Tom Adamski, AWS Specialist Solutions Architect – Networking. Special thanks to Stephen Bird, AWS Senior Program Manager – Content.

Field Notes: Powering the Connected Vehicle with Amazon Alexa

Post Syndicated from Amit Kumar original https://aws.amazon.com/blogs/architecture/field-notes-powering-the-connected-vehicle-with-amazon-alexa/

Alexa has improved the in-home experience and has the potential to greatly enhance the in-car experience. This blog is a continuation of my previous blog: Field Notes: Implementing a Digital Shadow of a Connected Vehicle with AWS IoT. Multiple OEMs (Original Equipment Manufacturers) showcased this capability during CES 2020. Use cases include a person sitting in the rear seat who can play a song, control HVAC (heating, ventilation, and air conditioning), or pay for gas or coffee, all while using Alexa. In this blog, I cover how you create a connected vehicle using Alexa to initiate a command, such as 'Alexa, open my trunk'.

Solution Architecture

“Alexa, open my trunk”

The preceding architecture shows a message flowing in the following example:

  1. A user of a connected vehicle wants to open their trunk using an Alexa voice command. Alexa will identify the right intent based on utterances and invoke a Lambda function. The Lambda function updates the device shadow with the desired state ({"trunk": "open"}).
  2. The vehicle TCU has registered the callback function shadowRegisterDeltaCallback() and listens on the delta topics for the device shadow by subscribing to them. Whenever there is a difference between the desired and reported state, the registered callback is called and the delta payload is available in the callback. The update performed in #1 is received in the delta callback.
  3. Now, the vehicle must act on the desired state. In this case, it acts on the trunk status change. After performing the required action for the trunk change, the vehicle TCU updates the device shadow with the reported state (reported: {"trunk": "open"}).
  4. The web/mobile app has subscribed to the topic "$aws/things/tcu/shadow/update/accepted". Therefore, as soon as the vehicle TCU updates the shadow, the web/mobile app receives the update and synchronizes the UI state.

As part of the previous blog, we implemented #2, #3 and #4. Let's implement #1 and incorporate it into the solution.

The source code (vehicle-command) of this blog is available in this code repository.

The Alexa voice command required the implementation of three key areas:

  1. Configure Alexa – which will listen to utterances, identify the right intent, and invoke a Lambda function.
  2. Set up the Lambda function – which will interpret the command and invoke the AWS IoT Core Device Shadow API.
  3. Handle the command at the vehicle TCU and app – the vehicle TCU must register shadowRegisterDeltaCallback() so that any update to the device shadow triggers a callback message, allowing the vehicle to perform the actual command and synchronize the state with the web/mobile app.

Let’s ‘Open a trunk’ using Alexa voice command. First set up the environment:

  • Open AWS Cloud9 IDE created in an earlier lab and run the following command:

Set up permanent credentials. Note: Alexa doesn't work with temporary credentials. Configure the ASK command line interface (CLI) with permanent credentials.

  1. Open Cloud9 Preferences by clicking AWS Cloud9 > Preferences or by clicking the "gear" icon in the upper right corner of the Cloud9 window
  2. Select "AWS Settings"
  3. Disable "AWS managed temporary credentials"
  4. Run $ aws configure
  5. Enter the Access Key and Secret Access Key of a user that has the required access
  6. Use us-east-1 as the region. It will be stored in ~/.aws/config

Verify that everything worked by examining the file ~/.aws/credentials. It should resemble the following:

[default]
 aws_access_key_id = <access_key>
 aws_secret_access_key = <secrect_key>
 aws_session_token=

*Remove the aws_session_token line from the credentials file.

Next, install the Alexa CLI:

$ npm install ask-cli --global

Initialize ASK CLI by issuing the following command. This will initialize the ASK CLI with a profile associated with your Amazon developer credentials.

$ ask configure --no-browser

Confirm that you are linking your AWS account with Alexa:

Do you want to link your AWS account in order to host your Alexa skills? Yes

# At the end, the output should look as follows:

------------------------- Initialization Complete -------------------------
Here is the summary for the profile setup:
ASK Profile: default
AWS Profile: default
Vendor ID: MXXXXXXXXXX

As part of the previous blog, you have already cloned the following git repository in the AWS Cloud9 IDE. It contains baseline code to jump-start the implementation.

$ git clone

Configure Alexa Skills

The Alexa Developer Console GUI could be used, but we are doing this programmatically so it can be done at scale and allows versioning.

1. Open connected-vehicle-lab/vehicle-command/skill-package/skill.json. Two locales, en-US and en-IN, are defined in the base code for the Alexa command. Let's add the en-GB locale in the JSON file under "manifest"/"publishingInformation"/"locales". Similarly, you can add a locale for your preferred language:

"en-GB": {
"name": "vehicle-command",
"summary": "Control Vehicle using voice command",
"description": "Allow you to control vehicle using voice command",
"examplePhrases": [
    "Alexa open genie",
    "ask genie to lower window",
    "window up"
    ],
"keywords": []
}

If you are inserting it into the middle, make sure it is separated by a comma.

2. Let's create a copy of the model connected-vehicle-lab/vehicle-command/skill-package/interactionModels/custom/en-US.json, rename it to en-GB.json, and add our intent.

  • We have "invocationName": "genie". Here, we are using "genie" as the command to invoke our Alexa skill. You can change it if needed.
  • The key elements in this JSON file are intents, slots, sample utterances, and slot types. Let's define the slot type t_action_type with the values 'open', 'close', 'lock', and 'unlock' under "types": [].
        {
        "name": "t_action_type",
        "values": [
            {
                "name": {
                "value": "unlock"
                }
            },
            {
                "name": {
                "value": "lock"
                }
            },
            {
                "name": {
                "value": "close"
                }
            },
            {
                "name": {
                "value": "open"
                }
            }
          ]
        }
  • Let's add an intent for the trunk, 'TrunkCommandIntent', under "intents": [] and define sample utterances like 'lock my trunk' and 'open trunk'. We are using slot types to simplify the utterances and understand the operation requested by the user.
        {
            "name": "TrunkCommandIntent",
            "slots": [
            {
                "name": "t_action",
                "type": "t_action_type"
            }
            ],
            "samples": [
                "{t_action} trunk",
                "trunk {t_action}",
                "{t_action} my trunk",
                "{t_action} trunk"
            ]
}
  • Now add the same intent, slots, slot type and sample utterances to the other locale files (en-US.json and en-IN.json) as well.

3. Let's add the response messages in languageString.js (available at /connected-vehicle-lab/vehicle-command/lambda/custom).

TRUNK_OPEN: 'Trunk Open',
TRUNK_CLOSE: 'Trunk Close' 

If you are inserting them into the middle, make sure they are separated by a comma.

Set up the Lambda function

1. Add a Lambda function which will be invoked by Alexa. This Lambda function will handle the intent, invoke the AWS IoT Core Device Shadow API, and execute the actual command of trunk open/unlock or lock/close.

  • Open /connected-vehicle-lab/vehicle-command/lambda/custom/index.js and add our TrunkCommandIntent handler:
const TrunkCommandIntentHandler = {
                canHandle(handlerInput) {
                return Alexa.getRequestType(handlerInput.requestEnvelope) === 'IntentRequest'
                && Alexa.getIntentName(handlerInput.requestEnvelope) === 'TrunkCommandIntent';
                },
                    handle(handlerInput) {
                    var t_action_value = handlerInput.requestEnvelope.request.intent.slots.t_action.value;
                    console.log(t_action_value);
                    var speakOutput;
                    const obj = "trunk";
                    if (t_action_value == "unlock" || t_action_value == "open")
                    {
                        updateDeviceShadow(obj, "open");
                        speakOutput = handlerInput.t('TRUNK_OPEN')
                    }
                    else 
                    {
                        updateDeviceShadow(obj, "close");
                        speakOutput = handlerInput.t('TRUNK_CLOSE')
                    } 
                    console.log(speakOutput);
                    return handlerInput.responseBuilder
                    .speak(speakOutput)
                    //.reprompt('add a reprompt if you want to keep the session open for the user to respond')
                    .getResponse();
                }
            };
  • We have the updateDeviceShadow("vehicle_part", "command") function, which actually invokes the AWS IoT Core Device Shadow API:
 function updateDeviceShadow (obj, command)
    {
        shadowMessage.state.desired[obj] = command;
        var iotdata = new AWS.IotData({endpoint: ioT_EndPoint});
        var params = {
        payload: JSON.stringify(shadowMessage) , /* required */
        thingName: deviceName /* required */ 
        };
        iotdata.updateThingShadow(params, function(err, data) {
            if (err) 
            console.log(err, err.stack); // an error occurred
            else 
            console.log(data); 
            //reset the shadow 
            shadowMessage.state.desired = {}
        });
} 

2. Update the value of ioT_EndPoint from AWS IoT Core > Settings > Custom Endpoint
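If you prefer to look the endpoint up programmatically rather than copying it from the console, a short boto3 sketch like the following works; the printed address is the value to place in ioT_EndPoint.

# Sketch: look up the custom IoT data endpoint for this account and region.
import boto3

iot = boto3.client("iot")
endpoint = iot.describe_endpoint(endpointType="iot:Data-ATS")["endpointAddress"]
print(endpoint)  # e.g. xxxxxxxxxxxxxx-ats.iot.us-east-1.amazonaws.com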

3. Add TrunkCommandIntentHandler to the request handlers:

exports.handler = Alexa.SkillBuilders.custom()
    .addRequestHandlers(
        LaunchRequestHandler,
        WindowCommandIntentHandler,
        DoorCommandIntentHandler,
        TrunkCommandIntentHandler,

4. Deploy Alexa Skills

$ cd ~/environment/connected-vehicle-lab/vehicle-command
$ ask deploy 

Handle Command at Vehicle tcu and App

For more detail on this section, refer to part 1 of this blog: Field Notes: Implementing a Digital Shadow of a Connected Vehicle with AWS IoT.

@ Vehicle TCU – tcuShadowRead.py has a trunk_handle() function to receive messages from the device shadow:

def trunk_handle(status):
  if status is not None:
    shadowClient.reportedShadowMessage['state']['reported']['trunk'] = status
    print ('Perform action on trunk status change : ' + str(status))

@ Web app – demo-car/js/websocket.js has a handleTrunkCommand function that receives a callback message as soon as any update happens on the device shadow:

//this function will be called by onMessageArrive
function handleTrunkCommand(trunkStatus) {
    obj = document.getElementsByClassName("action trunk")[0];
    obj.checked = trunkStatus == "open" ? true : false;
    console.log(obj.getAttribute("data-text") + " : " + obj.checked);
}

demo-car/js/demo-car.js has a handleTrunkCommand function to handle UI input and invoke the AWS IoT Core Device Shadow API to update the desired state.

//this function will be called when user will click on trunk checkbox
    handleTrunkCommand: function(obj) {
        obj.checked ? demoCar.shadowMessage.state.desired.trunk = "open" : demoCar.shadowMessage.state.desired.trunk = "close";
        console.log(obj.getAttribute("data-text") + " : " + demoCar.shadowMessage.state.desired.trunk);
        demoCar.accessIoTDevice();
    },

Use Alexa skill to invoke a command

Let's test our command 'Alexa, open my trunk'. We can use the command line and execute:

$ ask dialog --locale "en-GB"

Using the Alexa GUI provides an interesting visualization, as shown in the following screenshot.

  1. Open the Alexa GUI, select the 'vehicle command' skill, and select the Test tab. Allow "developer.amazon.com" to use your microphone when prompted.
  2. Open the demo.html web app side by side with the Alexa GUI to check that the actual operation happened at the vehicle TCU and that the status is synchronized with the virtual car model.
  3. Now test the Alexa skill. You can use an audio command as well. You can say or write 'ask genie'.

Alexa developer console

Clean Up

What a fun exploration this has been! Now clean up AWS resources created for this and the previous post to avoid incurring any future AWS services costs. Resources created by CDK can be deleted by deleting the stack on the CloudFormation console. Resources created manually need to be deleted individually.

Conclusion

In this blog post, I showed how you can enable voice commands for a connected vehicle and enhance the in-vehicle user experience. Similarly, you can extend this solution for use cases like 'Alexa, open my garage'. The AWS IoT Core Device Shadow API does all the heavy lifting in this case. Any update to the device shadow allows both the device and the user application to act. The Alexa skill acts as an interface to capture the user command and invoke the Lambda function.

Since these are all serverless services, this implementation can scale without any change to the application, and you only pay when someone invokes a command. Creating an engaging, high-quality interaction with Alexa in the vehicle is critical. You can refer to the Alexa Automotive Documentation for an Alexa Built-in automotive experience.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

 

Field Notes: Implementing a Digital Shadow of a Connected Vehicle with AWS IoT

Post Syndicated from Amit Sinha original https://aws.amazon.com/blogs/architecture/field-notes-implementing-a-digital-shadow-of-a-connected-vehicle-with-aws-iot/

Innovations in connected vehicle technology are expected to improve the quality and speed of vehicle communications and create a safer driving experience. As connected vehicles are becoming part of the mainstream, OEMs (Original Equipment Manufacturers) are broadening the capabilities of their products and dramatically improving the in-vehicle experience for customers.

An important feature in a connected vehicle is its ability to execute a remote command and synchronize the state of the vehicle between a web/mobile app in real time.

This blog demonstrates how to:

  • secure two-way communication between a device (vehicle telematics control unit) and the AWS Cloud using AWS IoT
  • execute command at vehicle
  • execute a remote command
  • and test with a vehicle virtual model

You can watch a quick animation of a remote command execution in the following GIF:

Animated car GIF

Solution Overview

In a traditional connected vehicle approach, there are many processes running on multiple servers. These processes are subscribing to one another, coordinating with each other, and polling for an update. This makes scalability and availability a challenge. We use AWS IoT Core and AWS IoT Device Shadow service as primary components for this solution.

This solution has three building blocks:

  1. a vehicle TCU (telematics control unit),
  2. the AWS Cloud (with connection via AWS IoT Core) and
  3. a virtual model (for example, a web/mobile app to send/receive commands to the TCU). These three building blocks together reflect the current state of a vehicle.

Alexa Solution Overview

The previous diagram shows a message flowing in the following example:

  1. A user of a connected vehicle wants to open their door using a web/mobile app. The app updates the device shadow with the desired state ({"door": "open"}). The app will always request the vehicle to execute the command; therefore, it will always update the device shadow with the desired state.
  2. The vehicle TCU has registered the callback function shadowRegisterDeltaCallback() and listens on the delta topics for the device shadow by subscribing to them. Whenever there is a difference between the desired and reported state, the registered callback is called and the delta payload is available in the callback. The update performed in #1 is received in the delta callback.
  3. Now, the vehicle needs to act on the desired state; in this case, the door status change. After performing the required action for the door change, the vehicle TCU updates the device shadow with the reported state (reported: {"door": "open"}).
  4. Now, the vehicle is closing the door. The vehicle will always perform the action; therefore, it will always update the device shadow with the reported state (reported: {"door": "close"}).
  5. The web/mobile app has subscribed to the topic "$aws/things/tcu/shadow/update/accepted". Therefore, as soon as the vehicle TCU updates the shadow, the web/mobile app receives the update and synchronizes the UI state.
  6. You can also build an Amazon Alexa skill to control your vehicle (“Alexa, raise my window”). After identifying the utterance, Alexa can invoke the Lambda function to update the device shadow and perform the requested action.

Note: For the Web/Mobile app developments for production, it is recommended to use AWS AppSync and AWS Amplify SDK for building a flexible and decoupled application from the API. Refer to this code sample for more detail.

Implementation

First, you need to set up the code. Refer to the directions in this code sample.

Create device

In AWS IoT Core, a device named 'tcu' has been created by the connected-vehicle-app-cdk-stack. Create a new certificate (download the files) and attach the policy generated by the CDK.
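If you would rather script this step than use the console, the following boto3 sketch creates the certificate, attaches it to the 'tcu' thing, and attaches the policy. The policy name is a placeholder; use the name generated by the CDK stack in your account.

# Sketch: create and activate a certificate for the 'tcu' thing and attach the policy.
import boto3

iot = boto3.client("iot")

cert = iot.create_keys_and_certificate(setAsActive=True)
iot.attach_thing_principal(thingName="tcu", principal=cert["certificateArn"])
iot.attach_policy(policyName="connected-vehicle-policy", target=cert["certificateArn"])  # placeholder name

# Save the credentials; the private key is only returned once.
with open("tcu.cert.pem", "w") as f:
    f.write(cert["certificatePem"])
with open("tcu.private.key", "w") as f:
    f.write(cert["keyPair"]["PrivateKey"])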

create a certificate

Next, deploy the certificate key and pem file on your device so it can connect with the AWS Cloud using the X.509 certificate. For more detail, refer to the directions in the code sample.

Execute Command at Vehicle

AWS IoT Device Shadow is an important feature of AWS IoT Core for remote command execution because it allows you to decouple the vehicle and the app which controls and commands the vehicle. A device's shadow is a JSON document that is used to store and retrieve current state information for a device. Primarily, we use the state.desired and state.reported properties of a device's shadow document.

The device shadow (Device SDK and APIs) enables applications to interact with devices even when they are offline and allows:

  • Cloud representation of device state
  • Query last known state for offline devices
  • Real-time state changes
  • Track last known device state
  • Control devices via change of state
  • Automatic synchronization once devices connect to the cloud
  • APIs for applications to discover and interact with devices

The rich features of a device shadow allow the app to interact with the vehicle TCU even when there is no connectivity. Once connectivity is established, the device gateway pushes the changes to the device and vice versa.

We need to deploy a program (tcuShadowWrite.py) on the vehicle TCU device to update the device shadow and send the update to the AWS Cloud. This program is available in this code repository.

Let’s assume that after reaching their home, the vehicle’s user closes the door, switches off the headlights, and rolls up the windows. The same state of the vehicle should be reflected on their web/mobile app in real time. The vehicle TCU has to update the “reported” state in the device shadow JSON document.

shadow message

The AWSIoTMQTTShadowClient library has a method called shadowUpdate that needs to be called from the vehicle TCU to update the device shadow. Essentially, it publishes the shadow reported state on the topic $aws/things/<thingName>/shadow/update.
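For reference, the following is a minimal sketch of how a script like tcuShadowWrite.py can report state with AWSIoTMQTTShadowClient. The endpoint, credential file paths, and payload values are placeholders; the actual script in the repository may differ.

# Minimal sketch of reporting vehicle state with AWSIoTMQTTShadowClient.
import json
from AWSIoTPythonSDK.MQTTLib import AWSIoTMQTTShadowClient

shadow_client = AWSIoTMQTTShadowClient("tcu-writer")
shadow_client.configureEndpoint("<your-ats-endpoint>.iot.us-east-1.amazonaws.com", 8883)
shadow_client.configureCredentials("AmazonRootCA1.pem", "tcu.private.key", "tcu.cert.pem")
shadow_client.connect()

shadow = shadow_client.createShadowHandlerWithName("tcu", True)

def on_update(payload, response_status, token):
    # Called when AWS IoT accepts or rejects the shadow update.
    print("shadow update:", response_status)

reported_state = {"state": {"reported": {"door": "close", "window": "up", "trunk": "close"}}}
shadow.shadowUpdate(json.dumps(reported_state), on_update, 5)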

If you run the tcuShadowWrite.py script, you should see output like that shown in the following image.

tcushadowscript

  • Open the AWS IoT Core console.
  • Select Manage -> Things, select tcu, and then choose Shadow. You should be able to see the shadow message sent from the device, as shown in the following image.

shadow document

Execute Remote Command

We need to deploy a program (tcuShadowRead.py) on the vehicle TCU to receive updates from the AWS Cloud. It is available in this code sample.

Let's assume the vehicle owner uses the mobile app to open the door, switch on the headlights and roll down the windows. The vehicle TCU should receive this command and instruct the Electronic Control Unit (ECU) to execute it. The web/mobile app will update the "desired" state in the device shadow JSON document.

shadow message2

In tcuShadowRead.py, AWSIoTMQTTShadowClient has a method shadowRegisterDeltaCallback. It listens on delta topics for this device shadow by subscribing to delta topics. Whenever there is a difference between the desired and reported state, the registered callback is called and the delta payload will be available in the callback.

callback

The callback function has code to handle the state change request. In an actual implementation, a function like door_handle() would call the ECU to execute the door-open command.
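For illustration, a delta callback along these lines can dispatch each changed property to its handler. This is a sketch only; the endpoint and credential paths are placeholders, and the real tcuShadowRead.py in the repository may differ.

# Sketch: subscribe to shadow deltas and dispatch changes to a handler such as door_handle().
import json
from AWSIoTPythonSDK.MQTTLib import AWSIoTMQTTShadowClient

shadow_client = AWSIoTMQTTShadowClient("tcu-reader")
shadow_client.configureEndpoint("<your-ats-endpoint>.iot.us-east-1.amazonaws.com", 8883)
shadow_client.configureCredentials("AmazonRootCA1.pem", "tcu.private.key", "tcu.cert.pem")
shadow_client.connect()
shadow = shadow_client.createShadowHandlerWithName("tcu", True)

def door_handle(status):
    # In a real vehicle this would instruct the ECU; here we only acknowledge the change.
    print("Perform action on door status change:", status)

def delta_callback(payload, response_status, token):
    # The delta document's "state" holds only the properties that differ from the reported state.
    changed = json.loads(payload)["state"]
    if "door" in changed:
        door_handle(changed["door"])

shadow.shadowRegisterDeltaCallback(delta_callback)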

door open command

If you make changes to the device shadow in AWS IoT for the tcu device, you should see output like that in the following image.

Device shadow

Test with a Virtual Vehicle Model

To help you test this solution, you can deploy the virtual vehicle model shown in the following image. Detailed steps for the deployment of the virtual vehicle is available in this code sample.

virtual vehicle model

Any changes in the model state should be reflected on the virtual demo vehicle and vice versa.

Here, we use the open-source Paho MQTT library. Developers can use it to write JavaScript applications that access AWS IoT using MQTT or MQTT over the WebSocket protocol without using the AWS IoT SDK. This implementation can be made simpler by using the AWS IoT Device SDK for JavaScript v2.

Review the JavaScript file named webSocketApp.js:

websocket app

  • The onMessageArrived() function will be invoked whenever the device changes the shadow state.
  • The handle<object>Command functions (such as handleDoorCommand) are called with the current state whenever the device reports a status change.

We have another JavaScript file, demo-car.js, in the demo-car folder. This includes the functions that our simulated vehicle uses to change the device shadow.

Let’s review the following code:

democar javascript

  • We have three handle command functions defined (for example, handleDoorCommand) to take the user's input and access AWS IoT Core services.
  • connectDevice is the function that invokes updateThingShadow to send the desired state.
  • accessIoTDevice uses Amazon Cognito identities to access AWS IoT Core securely without exposing an access key or secret key.

Now, keep demo.html side by side with your code and run the tcuShadowRead.py script. Any change made in the virtual model will be reflected in the command output. Similarly, any change made by tcuShadowWrite.py will be reflected as a state update on the virtual model.

Conclusion

In this blog, we showed how to implement a digital shadow of a connected vehicle using AWS IoT. This solution removes the complexity of running multiple processes in parallel and ensures a successful outcome. AWS IoT Core enables scalable, secure, low-latency, low-overhead, bi-directional communication between connected devices, the AWS Cloud, and customer-facing applications, and it can tolerate and recover from slow or brittle connections.

The Device Shadow in AWS IoT Core enables the AWS Cloud and applications to easily and accurately receive data from connected vehicles and send commands to the vehicles. The Device Shadow's uniform and always-available interface simplifies the implementation of time-sensitive use cases. These include remote command execution and two-way state synchronization between a device and an app, where the cloud acts as a broker. This solution enables you to shift operational responsibilities of a connected vehicle infrastructure to the AWS Cloud while paying only for what you use, with no minimum fees or mandatory service usage.

For more information about how AWS can help you build connected vehicle solutions, refer to the AWS Connected Vehicle solution page.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Field Notes: Gaining Insights into Labeling Jobs for Machine Learning

Post Syndicated from Michael Graumann original https://aws.amazon.com/blogs/architecture/field-notes-gaining-insights-into-labeling-jobs-for-machine-learning/

In an era where more and more data is generated, it becomes critical for businesses to derive value from it. With the help of supervised learning, it is possible to generate models to automatically make predictions or decisions by leveraging historical data. For example, image recognition for self-driving cars, predicting anomalies on X-rays, fraud detection in finance and more. With supervised learning, these models learn from labeled data. The success of those models is highly dependent on readily available, high quality labeled data.

However, you might encounter cases where a high percentage of your pre-existing data is unlabeled. In these situations, providing correct labeling to previously unlabeled data points would directly translate to higher model accuracy.

Amazon SageMaker Ground Truth helps you with exactly that. It lets you build highly accurate training datasets for machine learning quickly. SageMaker Ground Truth provides your labelers with built-in workflows and interfaces for common labeling tasks. This process could take several hours or more depending on the size of your unlabeled dataset, and you might have a need to track the progress easily, preferably in the form of a dashboard.

In this blog post we show how to gain deep insights into the progress of labeling and the performance of the workers by using Amazon Athena and Amazon QuickSight. We use Amazon Athena to set up several views with specific insights into the labeling progress. Finally, we reference these views in Amazon QuickSight to visualize the data in a dashboard.

This approach also works for combining multiple AWS services in general. AWS provides many building blocks that you can mix and match to create a unique, integrated solution with cohesive insights. In this blog post we use data produced by one service (Ground Truth), prepare it with another (Athena), and visualize it with a third (QuickSight). The following diagram shows this architecture.

Solution Architecture

ML Solution Architecture

Mapping a JSON structure to a table structure

Ground Truth creates several directories in your Amazon S3 output path. These directories contain the results of your labeling job and other artifacts of the job. The top-level directory for a labeling job has the same name as your labeling job, while the output directories are placed inside it. We will create all insights from what SageMaker Ground Truth calls worker responses.

All respective JSON files reside in the path s3://bucket/<job-name>/annotations/worker-response/.

To analyze the labeling data with Amazon Athena we need to understand the structure of the underlying JSON files. Let’s review the example below. For each item that was labeled, we see the label itself, followed by the submission time and a workerId pointing to an identity. This identity lives in Amazon Cognito, a fully managed service that provides the user directory for our labelers.

{
    "answers": [
        {
            "answerContent": {
                "crowd-classifier": {
                    "label": "Compute"
                }
            },
            "submissionTime": "2020-03-27T10:31:04.210Z",
            "workerId": "private.eu-west-1.1111111111111111",
            "workerMetadata": {
                "identityData": {
                    "identityProviderType": "Cognito",
                    "issuer": "https://cognito-idp.eu-west-1.amazonaws.com/eu-west-1_111111111",
                    "sub": "11111111-1111-1111-1111-111111111111"
                }
            }
        },
        ...
    ]
}

Although the data is stored in Amazon S3 object storage, we are able to use SQL to access it with Amazon Athena. Since we now understand the JSON structure shown in the preceding code, we use Athena to define how to interpret the data that is relevant to us. We do so by first creating a database using the Athena Query Editor:

CREATE DATABASE analyze_labels_db;

Once inside the database, we add the table schema. The actual files remain on Amazon S3, but using the metadata catalog, Athena knows where the data lies and how to interpret it. The AWS Glue Data Catalog is a central repository to store structural and operational metadata for all your data assets. For a given dataset, you can store its table definition and physical location, add business-relevant attributes, and track how the data has changed over time. Besides Athena, the AWS Glue Data Catalog also provides out-of-the-box integration with Amazon EMR and Amazon Redshift Spectrum. Once you add your table definitions to the Glue Data Catalog, they are available for ETL. They are also readily available for querying in Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum, so that you can have a common view of your data between these services.

When going from JSON to SQL, we are crossing format boundaries. To further facilitate reading the JSON-formatted data, we use SerDe properties to replace the hyphen in crowd-classifier with an underscore due to DDL constraints. Finally, we point the location to our Amazon S3 bucket containing the individual worker responses. Notice in the following script that we translate the nested structure of the JSON file into a hierarchical, nested data structure in the schema definition. We could also leave out the workerMetadata if we did not need it; the data would still stay in the files on Amazon S3, so we could later add the workerMetadata STRUCT to the table definition for our analysis.

CREATE EXTERNAL TABLE annotations_raw (
  answers array<
    struct<answercontent: 
      struct<crowd_classifier: 
        struct<label: string>
      >,
      submissionTime: string,
      workerId: string,
      workerMetadata: 
        struct<identityData: 
          struct<identityProviderType: string, issuer: string, sub: string>
        >
    >
  >
) 
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  "mapping.crowd_classifier"="crowd-classifier" 
) 
LOCATION 's3://<YOUR_BUCKET>/<JOB_NAME>/annotations/worker-response/'

Creating Views in Athena

Now, we have nested data in our annotations_raw table. For many use cases, especially for analytical uses, representing data in a tabular fashion—as rows—is more natural. This is also the standard way when using SQL and business intelligence tools. To unnest the hierarchical data into flattened rows, we create the following view which will serve as foundation for the other views we create. For an in-depth look into unnesting data with Amazon Athena, read this blog post.

Some of the information we’re interested in might not be part of the document, but is encoded in the path. We use a trick in Athena by using the $path variable from the Presto Hive Connector. This determines which Amazon S3 file contains data that is returned by a specific row in an Athena table. This way we can find out which data object an annotation belongs to. Since Athena is built on top of Presto, we are able to use Presto’s built-in regexp_extract function to find out the iteration as well as the data object id per labeling result. We also cast the submission time in date format to later determine the labeling progress per day.

CREATE OR REPLACE VIEW annotations_view AS
SELECT 
  regexp_extract("$path", 'iteration-[0-9]*') as iteration,
  regexp_extract("$path", '(iteration-[0-9]*\/([0-9]*))',2) as dataRecord,
  answer.answercontent.crowd_classifier.label,
  cast(from_iso8601_timestamp(answer.submissionTime) as timestamp) as submissionTime,
  cast(from_iso8601_timestamp(answer.submissionTime) as date) as submissionDay,
  answer.workerId,
  answer.workerMetadata.identityData.identityProviderType,
  answer.workerMetadata.identityData.issuer,
  answer.workerMetadata.identityData.sub,
  "$path" path
FROM 
  annotations_raw
CROSS JOIN UNNEST(answers) AS t(answer)

This view, annotations_view, will be the starting point for the other views we create further on in this post.

Visualizing with QuickSight

In this section, we explore a way to visualize the views we build in Athena by pointing Amazon QuickSight to the respective view. Amazon QuickSight lets you create and publish interactive dashboards that include ML Insights. Dashboards can then be accessed from any device, and embedded into your applications, portals, and websites.

Thanks to the tight integration between Athena and QuickSight, we are able to map one dataset in QuickSight to one Athena view. To further optimize the performance of the dashboard, we can optionally import the datasets into SPICE, the in-memory optimized calculation engine for Amazon QuickSight. With the datasets in place we can now create an analysis in order to interact with the visuals we're going to add. You can think of an analysis as a container for a set of related visuals. You can use multiple datasets in an analysis, although any given visual can only use one of those datasets. After you create an analysis and an initial visual, you can expand the analysis, for example by adding datasets and visuals.
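Before pointing a QuickSight dataset at a view, it can be handy to sanity-check the view from code. The following boto3 sketch runs a query against annotations_view with Athena and prints the rows; the S3 output location is a placeholder.

# Sketch: run an Athena query against one of the views and print the result rows.
import time
import boto3

athena = boto3.client("athena")

execution = athena.start_query_execution(
    QueryString="SELECT * FROM annotations_view LIMIT 10",
    QueryExecutionContext={"Database": "analyze_labels_db"},
    ResultConfiguration={"OutputLocation": "s3://<your-athena-results-bucket>/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])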

Let’s start with our first insight.

Annotations per worker

We'd like to gain insights not only into the total number of labeled items but also into the level of contribution of each individual worker. This could give us an indication of whether the labels were created by a diverse crowd of labelers or by a few productive ones. A largely disproportionate amount of contributions from a handful of workers could introduce whatever biases those workers bring along.

SageMaker Ground Truth calls labeled data objects annotations; an annotation is the result of a single worker's labeling task.

Luckily, we encapsulated all the heavy lifting of format conversion in annotations_view, so it is now easy to create a view for the annotations per user:

CREATE OR REPLACE VIEW annotations_per_user AS
SELECT COUNT(sub) AS LabeledItems,
sub AS User
FROM annotations_view
GROUP BY sub
ORDER BY LabeledItems DESC

Next we visualize this view in QuickSight. We add a visual to our analysis, select the respective dataset for the view and use the AutoGraph feature, which chooses the most appropriate visual type. Since we already arranged our view in Athena by the number of labeled items in descending order, there is no need now to sort the data in QuickSight. In the following screenshot, worker c4ef78e4... contributed more labels compared to their peers.

Annotations per worker

This view gives you an indicator to check for a bias that the leading worker might have brought along.

Annotations per label

One thing we want to be aware of is potential imbalances between classes in our dataset. This is especially a concern for simple machine learning models, which may learn to frequently predict a label that is massively overrepresented in the dataset. If we can identify an imbalance, we can apply mitigation actions such as upsampling data of underrepresented classes. With the following view we list the total number of annotations per label.

CREATE OR REPLACE VIEW annotations_per_label AS
SELECT Count(dataRecord) AS TotalLabels, label As Label 
FROM annotations_view
GROUP BY label
ORDER BY TotalLabels DESC, Label;

As before, we create a dataset in QuickSight pointing to the annotations_per_label view, open the analysis, add a new visual and leverage the AutoGraph functionality. The result is the following visual representation:

Annotations per label

One can clearly see that the Analytics & AI/ML class is massively underrepresented. At this point, you might want to try getting more data or think about upsampling data for that class.

Annotations per day

Seeing the total number of annotations per label and per worker is good, but we are also interested in how the labeling progress changes over time. This way we might see spikes related to labeler activations, and we can estimate how long it will take to reach a certain number of annotations at the current pace. For this purpose, we create the following view aggregating the total annotations per day.

CREATE OR REPLACE VIEW annotations_per_day AS
SELECT COUNT(datarecord) AS LabeledItems,
submissionDay
FROM annotations_view
GROUP BY submissionDay
ORDER BY submissionDay, LabeledItems DESC

This time the QuickSight AutoGraph provides us with the following line chart. You might have noticed that the axis labels do not match the column names in Athena. That is because we renamed them in QuickSight for better readability.

Total annotations per day

In the preceding chart we see that there is no consistent pace of labeling, which makes it hard to predict when a certain amount of labeled data will be reached. In this example, after starting strong, the progress immediately went down. Knowing this, we might want to take action to motivate our workers to contribute more, and we can validate the effectiveness of those actions with the help of this chart. The spikes indicate an effective short-term action.

Distribution of total annotations by user

We already have insights into annotations per worker, per label, and per day. Let us now see what insights we can get from aggregating some of this information.

The bigger your labeling workforce gets, the harder it can become to see the whole picture. For that reason we will now create a histogram consisting of five buckets. Each bucket represents an interval of total annotations (for example, 0-25 annotations) mapped to the number of users whose amount of total annotations lies in that interval. This allows us to get a sense of what kind of bias might be introduced by the majority of annotations being contributed by a small number of workers.

To do that, we use the Presto function width_bucket, which assigns each worker’s total number of labeled data objects to one of the five buckets of size 25 each. We define these buckets by creating an array with five elements that specify the bucket boundaries.

CREATE OR REPLACE VIEW users_per_bucket_annotations AS
SELECT 
bucket,numberOfUsers,
CASE
   WHEN bucket=5 THEN 'B' || cast(bucket AS VARCHAR(10)) || ': ' || cast(((bucket-1) * 25) AS VARCHAR(10)) || '+'
   ELSE 'B' || cast(bucket AS VARCHAR(10)) || ': ' || cast(((bucket-1) * 25) AS VARCHAR(10)) || '-' || cast((bucket * 25) AS VARCHAR(10))
END AS NumberOfAnnotations
FROM
(SELECT width_bucket(labeleditems,ARRAY[0,25,50,75,100]) AS bucket,
 count(user) AS numberOfUsers
FROM annotations_per_user
GROUP BY 1
ORDER BY bucket)

A SELECT * FROM users_per_bucket_annotations produces the following result:

A SELECT FROM users_per_bucket_annotations

Let’s now investigate the same data via QuickSight:

Annotations per User in buckets of Size 25

Now that we can look at the data visually, it becomes clear that we have a bimodal distribution: many labelers have done very little, and many labelers have done quite a lot. This may warrant interviewing some labelers to find out whether something is holding them back from progressing, and how we can keep engagement high over time.

Putting it all together in QuickSight

Since we created all of the previous visuals in a single analysis, we can now use it as a central place to consume our insights in a user-friendly way. Moreover, we can share our insights with others as a read-only snapshot, which QuickSight calls a dashboard. Users who are dashboard viewers can view and filter the dashboard data as shown below:

Groundtruth dashboard

Furthermore, you can generate a report and let QuickSight send it either once or on a schedule (daily, weekly or monthly) to your peers. This way users do not have to sign in and they can get reminders to check the progress of the labeling job. Lastly, sending out those reports is an opportunity to stay in touch with the labelers and keep the engagement high.

Conclusion

In this blog post, we have shown one example of combining multiple AWS services in order to build a solution tailored to your needs. We took the Amazon S3 output generated by SageMaker Ground Truth and showed how it can be further processed and analyzed with Athena. Finally, we created a central place to consume our insights in a user-friendly way with QuickSight. By putting it all together in a dashboard, we were able to share our insights with our peers.

You can take the same pattern and apply it to other situations: take some of the many building blocks AWS provides and mix-and-match them to create a unique, integrated solution with cohesive insights just as we did with Ground Truth, Athena, and QuickSight.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Why Deployment Requirements are Important When Making Architectural Choices

Post Syndicated from Yusuf Mayet original https://aws.amazon.com/blogs/architecture/why-deployment-requirements-are-important-when-making-architectural-choices/

Introduction

Too often, architects fall into the trap of thinking the architecture of an application is restricted to just the runtime part of the architecture. By doing this we focus on only a single customer (such as the application’s users and how they interact with the system) and we forget about other important customers like developers and DevOps teams. This means that requirements regarding deployment ease, deployment frequency, and observability are delegated to the back burner during design time and tacked on after the runtime architecture is built. This leads to increased costs and reduced ability to innovate.

In this post, I discuss the importance of key non-functional requirements, and how they can and should influence the target architecture at design time.

Architectural patterns

When building and designing new applications, we usually start by looking at the functional requirements, which will define the functionality and objective of the application. These are all the things that the users of the application expect, such as shopping online, searching for products, and ordering. We also consider aspects such as usability to ensure a great user experience (UX).

We then consider the non-functional requirements, the so-called “ilities,” which typically include requirements regarding scalability, availability, latency, etc. These are constraints around the functional requirements, like response times for placing orders or searching for products, which will define the expected latency of the system.

These requirements—both functional and non-functional together—dictate the architectural pattern we choose to build the application. These patterns include Multi-tierevent-driven architecturemicroservices, and others, and each one has benefits and limitations. For example, a microservices architecture allows for a system where services can be deployed and scaled independently, but this also introduces complexity around service discovery.

Aligning the architecture to technical users’ requirements

Amazon is a customer-obsessed organization, so it’s important for us to first identify who the main customers are at each point so that we can meet their needs. The customers of the functional requirements are the application users, so we need to ensure the application meets their needs. For the most part, we will ensure that the desired product features are supported by the architecture.

But who are the users of the architecture? Not the applications’ users—they don’t care if it’s monolithic or microservices based, as long as they can shop and search for products. The main customers of the architecture are the technical teams: the developers, architects, and operations teams that build and support the application. We need to work backwards from the customers’ needs (in this case the technical team), and make sure that the architecture meets their requirements. We have therefore identified three non-functional requirements that are important to consider when designing an architecture that can equally meet the needs of the technical users:

  1. Deployability: Flow and agility to consistently deploy new features
  2. Observability: Feedback about the state of the application
  3. Disposability: The ability to throw away resources and provision new ones quickly

Together these form part of the Developer Experience (DX), which is focused on providing developers with the APIs, documentation, and other technologies they need to easily understand and use the system. This ensures that we design with Day 2 operations in mind.

Deployability: Flow

There are many reasons that organizations embark on digital transformation journeys, which usually involve moving to the cloud and adopting DevOps. According to Stephen Orban, GM of AWS Data Exchange, in his book Ahead in the Cloud, faster product development is often a key motivator, meaning the most important non-functional requirement is achieving flow: the speed at which you can consistently deploy new applications, respond to competitors, and test and roll out new features. The architecture therefore needs to be designed upfront to support deployability. If the architectural pattern is a monolithic application, this will hamper the developers’ ability to quickly roll out new features to production. So we need to choose and design the architecture to support easy and automated deployments. Results from years of research show that leaders use DevOps to achieve high levels of throughput:

Graphic - Using DevOps to achieve high levels of throughput

Decisions on the pace and frequency of deployments will dictate whether to use rolling, blue/green, or canary deployment methodologies. This will then inform the architectural pattern chosen for the application.

On AWS, to achieve this flow of deployability, we can use services such as AWS CodePipeline, AWS CodeBuild, AWS CodeDeploy, and AWS CodeStar.

Observability: feedback

Once you have achieved a rapid and repeatable flow of features into production, you need a constant feedback loop of logs and metrics in order to detect and avoid problems. Observability is a property of the architecture that will allow us to better understand the application across the delivery pipeline and into production. This requires that we design the architecture to ensure that health reports are generated to analyze and spot trends. This includes error rates and stats from each stage of the development process, how many commits were made, build duration, and frequency of deployments. This not only allows us to measure code characteristics such as test coverage, but also developer productivity.

On AWS, we can leverage Amazon CloudWatch to gather and search through logs and metrics, AWS X-Ray for tracing, and Amazon QuickSight as an analytics tool to measure CI/CD metrics.

Disposability: automation

In his book, Cloud Strategy: A Decision-based Approach to a Successful Cloud Journey, Gregor Hohpe, Enterprise Strategist at AWS, notes that cloud and automation add a new “-ility”: disposability, which is the ability to set up and dispose of new servers in an automated and pain-free manner. Having immutable, disposable infrastructure greatly enhances your ability to achieve high levels of deployability and flow, especially when used in a CI/CD pipeline, which can create new resources and kill off the old ones.

At AWS, we can achieve disposability with serverless using AWS Lambda, or with containers running on Amazon Elastic Container Service (ECS) or Amazon Elastic Kubernetes Service (EKS), or using AWS Auto Scaling with Amazon Elastic Compute Cloud (EC2).

Three different views of the architecture

Once we have designed an architecture that caters for deployability, observability, and disposability, it exposes three lenses across which we can view the architecture:

3 views of the architecture

  1. Build lens: the focus of this part of the architecture is on achieving deployability, with the objective to give the developers an easy-to-use, automated platform that builds, tests, and pushes their code into the different environments, in a repeatable way. Developers can push code changes more reliably and frequently, and the operations team can see greater stability because environments have standard configurations and rollback procedures are automated.
  2. Runtime lens: the focus is on the users of the application and on maximizing their experience by making the application responsive and highly available.
  3. Operate lens: the focus is on achieving observability for the DevOps teams, allowing them to have complete visibility into each part of the architecture.

Summary

When building and designing new applications, the functional requirements (such as UX) are usually the primary drivers for choosing and defining the architecture to support those requirements. In this post I have discussed how DX characteristics like deployability, observability, and disposability are not just operational concerns that get tacked on after the architecture is chosen. Rather, they should be as important as the functional requirements when choosing the architectural pattern. This ensures that the architecture can support the needs of both the developers and users, increasing quality and our ability to innovate.

Field Notes: Monitoring the Java Virtual Machine Garbage Collection on AWS Lambda

Post Syndicated from Steffen Grunwald original https://aws.amazon.com/blogs/architecture/field-notes-monitoring-the-java-virtual-machine-garbage-collection-on-aws-lambda/

When you want to optimize your Java application on AWS Lambda for performance and cost, the general steps are: build, measure, then optimize! To accomplish this, you need a solid monitoring mechanism. Amazon CloudWatch and AWS X-Ray are well suited for this task since they already provide lots of data about your AWS Lambda function. This includes overall memory consumption, initialization time, and duration of your invocations. To examine the Java Virtual Machine (JVM) memory, you require garbage collection logs from your functions. Instances of an AWS Lambda function have a short lifecycle compared to a long-running Java application server. It can be challenging to process the logs from tens or hundreds of these instances.

In this post, you learn how to emit and collect data to monitor the JVM garbage collector activity. Having this data, you can visualize out-of-memory situations of your applications in a Kibana dashboard like in the following screenshot. You gain actionable insights into your application’s memory consumption on AWS Lambda for troubleshooting and optimization.

The lifecycle of a JVM application on AWS Lambda

Let’s first revisit the lifecycle of the AWS Lambda Java runtime and its JVM:

  1. A Lambda function is invoked.
  2. AWS Lambda launches an execution context. This is a temporary runtime environment based on the configuration settings you provide, like permissions, memory size, and environment variables.
  3. AWS Lambda creates a new log stream in Amazon CloudWatch Logs for each instance of the execution context.
  4. The execution context initializes the JVM and your handler’s code.

You typically see the initialization of a fresh execution context when a Lambda function is invoked for the first time, after it has been updated, or it scales up in response to more incoming events.

AWS Lambda maintains the execution context for some time in anticipation of another Lambda function invocation. In effect, the service freezes the execution context after a Lambda function completes. It thaws the execution context when the Lambda function is invoked again if AWS Lambda chooses to reuse it.

During invocations, the JVM also maintains garbage collection as usual. Outside of invocations, the JVM and its maintenance processes like garbage collection are also frozen.

Garbage collection and indicators for your application’s health

The purpose of JVM garbage collection is to clean up objects in the JVM heap, which is the space for an application’s objects. It finds objects which are unreachable and deletes them. This frees heap space for other objects.

You can make the JVM log garbage collection activities to get insights into the health of your application. One example of this is the amount of free heap after each garbage collection. If this metric keeps shrinking, it is an indicator of a memory leak that will eventually turn into an OutOfMemoryError. If there is not enough free heap, the JVM might be too busy with garbage collection instead of running your application code. Conversely, a heap that is far larger than needed indicates that there is potential to decrease the memory configuration of your AWS Lambda function. This keeps garbage collection pauses low and provides a consistent response time.

The garbage collection logging can be configured via an environment variable as part of the AWS Lambda function configuration. The environment variable JAVA_TOOL_OPTIONS is considered by both the Java 8 and 11 JVMs. You use it to pass options that you would usually add to the command line when launching the JVM. The options to configure garbage collection logging and the output is specific to the Java version.

Java 11 uses the Unified Logging System (JEP 158 and JEP 271) which has been introduced in Java 9. Logging can be configured with the environment variable:

JAVA_TOOL_OPTIONS=-Xlog:gc+metaspace,gc+heap,gc:stdout:time,tags

The Serial Garbage Collector will output the logs:

[<TIMESTAMP>][gc] GC(4) Pause Full (Allocation Failure) 9M->9M(11M) 3.941ms (D)
[<TIMESTAMP>][gc,heap] GC(3) DefNew: 3063K->234K(3072K) (A)
[<TIMESTAMP>][gc,heap] GC(3) Tenured: 6313K->9127K(9152K) (B)
[<TIMESTAMP>][gc,metaspace] GC(3) Metaspace: 762K->762K(52428K) (C)
[<TIMESTAMP>][gc] GC(3) Pause Young (Allocation Failure) 9M->9M(21M) 23.559ms (D)

Prior to Java 9, including Java 8, you configure the garbage collection logging as follows:

JAVA_TOOL_OPTIONS=-XX:+PrintGCDetails -XX:+PrintGCDateStamps

The Serial garbage collector output in Java 8 is structured differently:

<TIMESTAMP>: [GC (Allocation Failure)
    <TIMESTAMP>: [DefNew: 131042K->131042K(131072K), 0.0000216 secs] (A)
    <TIMESTAMP>: [Tenured: 235683K->291057K(291076K), 0.2213687 secs] (B)
    366725K->365266K(422148K), (D)
    [Metaspace: 3943K->3943K(1056768K)], (C)
    0.2215370 secs]
    [Times: user=0.04 sys=0.02, real=0.22 secs]
<TIMESTAMP>: [Full GC (Allocation Failure)
    <TIMESTAMP>: [Tenured: 297661K->36658K(297664K), 0.0434012 secs] (B)
    431575K->36658K(431616K), (D)
    [Metaspace: 3943K->3943K(1056768K)], 0.0434680 secs] (C)
    [Times: user=0.02 sys=0.00, real=0.05 secs]

Independent of the Java version, the garbage collection activities are logged to standard out (stdout) or standard error (stderr). Logs appear in the AWS Lambda function’s log stream of Amazon CloudWatch Logs. The log contains the size of memory used for:

  • A: the young generation
  • B: the old generation
  • C: the metaspace
  • D: the entire heap

The notation is before-gc -> after-gc (committed heap). Read the JVM Garbage Collection Tuning Guide for more details.

Visualizing the logs in Amazon Elasticsearch Service

It is hard to fully understand the garbage collection log by just reading it in Amazon CloudWatch Logs. You must visualize it to gain more insight. This section describes the solution to achieve this.

Solution Overview

Java Solution Overview

Amazon CloudWatch Logs has a feature to stream CloudWatch Logs data to Amazon Elasticsearch Service via an AWS Lambda function. The AWS Lambda function for log transformation is subscribed to the log group of your application’s AWS Lambda function. The subscription filters for a pattern that matches the garbage collection log entries. The log transformation function processes the log messages and puts them into a search cluster. To make the data easy to digest for the search cluster, you add code to transform and convert the messages to JSON. Having the data in a search cluster, you can visualize it with Kibana dashboards.
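
To give an idea of what this transformation can look like, here is a minimal, hedged sketch in Python. It is not the actual code from the aws-samples repository; the field names simply mirror the document shape shown later in the Kibana example. It decodes a CloudWatch Logs subscription event and parses Java 11 unified GC log lines into JSON documents.

import base64
import gzip
import json
import re

# Matches Java 11 unified logging lines such as:
# [2020-09-03T15:12:34.567+0000][gc] GC(4) Pause Full (Allocation Failure) 9M->9M(11M) 3.941ms
GC_LINE = re.compile(
    r"\[(?P<timestamp>[^\]]+)\]\[gc\] GC\(\d+\) "
    r"(?P<gc_type>Pause \w+)(?: \((?P<gc_cause>[^)]+)\))? "
    r"(?P<before>\d+)M->(?P<after>\d+)M\((?P<size>\d+)M\) "
    r"(?P<duration>[\d.]+)ms"
)

def handler(event, context):
    """Sketch of a log-transformation Lambda handler; names are illustrative."""
    # CloudWatch Logs delivers subscribed log events gzipped and base64-encoded.
    payload = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))

    documents = []
    for log_event in payload["logEvents"]:
        match = GC_LINE.search(log_event["message"])
        if not match:
            continue
        documents.append({
            "@timestamp": match.group("timestamp"),
            "@gc_type": match.group("gc_type"),
            "@gc_cause": match.group("gc_cause"),
            "@heap_before_gc": match.group("before"),
            "@heap_after_gc": match.group("after"),
            "@heap_size_gc": match.group("size"),
            "@gc_duration": match.group("duration"),
            "@owner": payload["owner"],
            "@log_group": payload["logGroup"],
            "@log_stream": payload["logStream"],
        })

    # The real solution would index these documents into the search cluster.
    return documents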

Get Started

To start, launch the solution architecture described above as a prepackaged application from the AWS Serverless Application Repository. It contains all resources ready to visualize the garbage collection logs for your Java 11 AWS Lambda functions in a Kibana dashboard. The search cluster consists of a single t2.small.elasticsearch instance with 10GB of EBS storage. It is protected with Amazon Cognito User Pools so you only need to add your user(s). The T2 instance types do not support encryption of data at rest.

Read the source code for the application in the aws-samples repository.

1. Spin up the application from the AWS Serverless Application Repository:

launch stack button

2. As soon as the application is deployed completely, the outputs of the AWS CloudFormation stack provide the links for the next steps. You will find two URLs in the AWS CloudFormation console called createUserUrl and kibanaUrl.

search stack

3. Use the createUserUrl link from the outputs, or navigate to the Amazon Cognito user pool in the console to create a new user in the pool.

a. Enter an email address as username and email. Enter a temporary password of your choice with at least 8 characters.

b. Leave the phone number empty and uncheck the checkbox to mark the phone number as verified.

c. If necessary, you can check the checkboxes to send an invitation to the new user or to make the user verify the email address.

d. Choose Create user.

create user dialog of Amazon Cognito User Pools

4. Access the Kibana dashboard with the kibanaUrl link from the AWS CloudFormation stack outputs, or navigate to the Kibana link displayed in the Amazon Elasticsearch Service console.

a. In Kibana, choose the Dashboard icon in the left menu bar

b. Open the Lambda GC Activity dashboard.

You can test that new events appear by using the Kibana Developer Console:

POST gc-logs-2020.09.03/_doc
{
  "@timestamp": "2020-09-03T15:12:34.567+0000",
  "@gc_type": "Pause Young",
  "@gc_cause": "Allocation Failure",
  "@heap_before_gc": "2",
  "@heap_after_gc": "1",
  "@heap_size_gc": "9",
  "@gc_duration": "5.432",
  "@owner": "123456789012",
  "@log_group": "/aws/lambda/myfunction",
  "@log_stream": "2020/09/03/[$LATEST]123456"
}

5. When you go to the Lambda GC Activity dashboard you can see the new event. You must select the right timeframe with the Show dates link.

Lambda GC activity

The dashboard consists of six tiles:

  • In the Filters you optionally select the log group and filter for a specific AWS Lambda function execution context by the name of its log stream.
  • In the GC Activity Count by Execution Context you see a heatmap of all filtered execution contexts by garbage collection activity count.
  • The GC Activity Metrics display a graph for the metrics for all filtered execution contexts.
  • The GC Activity Count shows the amount of garbage collection activities that are currently displayed.
  • The GC Duration shows the sum of the durations of all displayed garbage collection activities.
  • The GC Activity Raw Data at the bottom displays the raw items as ingested into the search cluster for a further drill down.

Configure your AWS Lambda function for garbage collection logging

1. The application that you want to monitor needs to log garbage collection activities. Currently the solution supports logs from Java 11. Add the following environment variable to your AWS Lambda function to activate the logging.

JAVA_TOOL_OPTIONS=-Xlog:gc:stderr:time,tags

The environment variables must reflect this parameter, as shown in the following screenshot:

environment variables

2. In the AWS Lambda console, go to the streamLogs function that has been created by the stack and subscribe it to the log group of the function you want to monitor. (A scripted alternative is sketched after these steps.)

streamlogs function

3. Select Add Trigger.

4. Select CloudWatch Logs as Trigger Configuration.

5. Input a Filter name of your choice.

6. Input "[gc" (including quotes) as the Filter pattern to match all garbage collection log entries.

7. Select the Log Group of the function you want to monitor. The following screenshot subscribes to the logs of the application’s function resize-lambda-ResizeFn-[...].

add trigger

8. Select Add.

9. Execute the AWS Lambda function you want to monitor.

10. Refresh the dashboard in Amazon Elasticsearch Service and see the datapoint added manually before appearing in the graph.
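
If you prefer to create the subscription from steps 2 through 8 programmatically instead of through the console, a minimal boto3 sketch might look like the following. The function name, log group, and filter name are assumptions; note that CloudWatch Logs also needs a resource-based permission to invoke the streamLogs function, which the console’s Add trigger flow creates for you automatically.

import boto3

logs = boto3.client("logs")
lam = boto3.client("lambda")

# Assumed names: replace with the streamLogs function created by the stack
# and the log group of the function you want to monitor.
stream_logs_arn = lam.get_function(
    FunctionName="streamLogs")["Configuration"]["FunctionArn"]

logs.put_subscription_filter(
    logGroupName="/aws/lambda/resize-lambda-ResizeFn-EXAMPLE",
    filterName="gc-logs",
    filterPattern='"[gc"',  # match all garbage collection log entries
    destinationArn=stream_logs_arn,
)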

Troubleshooting examples

Let’s look at an example function and draw some useful insights from the Java garbage collection log. The following diagrams show the Sample Amazon S3 function code for Java from the AWS Lambda documentation running in a Java 11 function with 512 MB of memory.

  • An S3 event from a new uploaded image triggers this function.
  • The function loads the image from S3, resizes it, and puts the resized version to S3.
  • The file size of the example image is close to 2.8MB.
  • The application is called 100 times with a pause of 1 second.

Memory leak

For the demonstration of a memory leak, the function has been changed to keep all source images in memory as a class variable. Hence the memory of the function keeps growing when processing more images:
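
The demonstration function is written in Java, but the mechanism is language-agnostic: any state kept outside the handler lives in the execution context and survives warm invocations. A minimal Python sketch of the same anti-pattern, with illustrative values, looks like this:

# Module-level state lives in the execution context and survives warm invocations.
cached_images = []  # grows on every call; this is the (intentional) leak

def handler(event, context):
    # Stand-in for downloading the roughly 2.8 MB source image from S3.
    image_bytes = b"\x00" * (3 * 1024 * 1024)
    cached_images.append(image_bytes)  # never released, so memory keeps growing
    return {"images_held": len(cached_images)}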

GC activity metrics

In the diagram, the heap size drops to zero at timestamp 12:34:00. The Amazon CloudWatch Logs of the function reveal an error before the next call to your code in the same AWS Lambda execution context with a fresh JVM:

Java heap space: java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space
 at java.desktop/java.awt.image.DataBufferByte.<init>(Unknown Source)
[...]

The JVM crashed and was restarted because of the error. You primarily use the Amazon CloudWatch Logs of your function to detect errors. The garbage collection log and its visualization provide additional information for root cause analysis:

Did the JVM run out of memory because a single image to resize was too large?

Or was the memory issue growing over time?

The latter could be an indication that you have a memory leak in your code.

The heap size is too small

For the demonstration of a heap that was chosen too small, the memory leak from the preceding example has been resolved, but the function was configured with 128 MB of memory. From the baseline of the heap to the maximum heap size, only approximately 5 MB are used.

GC activity metrics

This results in a high memory-management overhead in your JVM. You should experiment with a higher memory configuration to find the optimal performance, while also taking cost into account. Check out the AWS Lambda Power Tuning open source tool to do this in an automated fashion.

Finetuning the initial heap size

If you review the development of the heap size at the start of an execution context, you can see that the heap size is continuously increased. Each heap size change is an expensive operation that consumes time of your function, and the heap continues to be resized over the lifetime of the execution context. In this example, the garbage collector logs 502 activities, which take almost 17 seconds overall.

GC activity metrics

This on-demand scaling is useful on a local workstation where the physical memory is shared with other applications. On AWS Lambda, the configured memory is dedicated to your function, so you can use it to its full extent.

You can do so by setting the minimum and maximum heap size to a fixed value by appending the -Xms and -Xmx parameters to the environment variable we introduced before.

The heap is not the only part of the JVM that consumes memory, so you must experiment with this setting and closely monitor the performance.

Start with the heap size that you observe to be working from the garbage collection log. If you set the heap size too large, your function will not initialize at all or break unexpectedly. Remember that the ability to tweak JVM parameters might change with future service features.

Let’s set the heap to 400 MB of the 512 MB memory and examine the results:

JAVA_TOOL_OPTIONS=-Xlog:gc:stderr:time,tags -Xms400m -Xmx400m

GC activity metrics

The preceding dashboard shows that the overall garbage collection duration was reduced by about 95%. The garbage collector had 80% fewer activities.

The garbage collection log entries displayed in the dashboard reveal that exclusively minor garbage collection (Pause Young) activities were triggered instead of major garbage collections (Pause Full). This is expected, as the images are immediately discarded after the download, resize, and upload operations. The effect on the overall function duration over 100 invocations is a 5% decrease on average in this specific case.

Lambda duration

Cost estimation and clean up

You incur cost for the processing and transformation of your function’s Amazon CloudWatch Logs whenever your function is called. This cost depends on your application and how often garbage collection activities are triggered. Read an estimate of the monthly cost for the search cluster. If you do not need the garbage collection monitoring anymore, delete the subscription filter from the log group of your AWS Lambda function(s). Also, delete the stack of the solution above in the AWS CloudFormation console to clean up resources.
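
If you created the subscription with an SDK or the AWS CLI rather than the console, removing it again is a single call. The following boto3 sketch assumes the same illustrative log group and filter name used earlier:

import boto3

boto3.client("logs").delete_subscription_filter(
    logGroupName="/aws/lambda/resize-lambda-ResizeFn-EXAMPLE",
    filterName="gc-logs",
)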

Conclusion

In this post, we examined further sources of data to gain insights about the health of your Java application. We also demonstrated a pipeline to ingest, transform, and visualize this information continuously in a Kibana dashboard. As a next step, launch the application from the AWS Serverless Application Repository and subscribe it to your applications’ logs. Feel free to submit enhancements to the application in the aws-samples repository or provide feedback in the comments.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Architecture Patterns for Red Hat OpenShift on AWS

Post Syndicated from Ryan Niksch original https://aws.amazon.com/blogs/architecture/architecture-patterns-for-red-hat-openshift-on-aws/

Editor’s note: Although this blog post and its accompanying code make use of the word “Master,” Red Hat is making open source code more inclusive by eradicating “problematic language.” Read more about this.

Introduction

Red Hat OpenShift is an application platform that provides customers with a turnkey application platform that is much more than simple Kubernetes orchestration.

OpenShift customers choose AWS as their cloud of choice because of the efficiency, security, reliability, scalability, and elasticity it provides. Customers seeking to modernize their business, process, and application stacks are drawn to the rich AWS service and feature sets.

As such, we see some customers migrate from on-premises to AWS or exist in a hybrid context with application workloads running in various locations. For OpenShift customers, this poses a few questions and considerations:

  • What are the recommendations for the best way to deploy OpenShift on AWS?
  • How is this different from what customers were used to on-premises?
  • How does this ensure resilience and availability?
  • Do customers need a multi-region, multi-account approach?

For hybrid customers, there are assumptions and misconceptions:

  • Where does the control plane exist?
  • Is there replication, and if so, what are the considerations and ramifications?

In this post I will run through some of the more common questions and patterns for OpenShift on AWS, while looking at some of the terminology and conceptual differences of AWS. I’ll explore migration and hybrid use cases and address some misconceptions.

OpenShift building blocks

On AWS, OpenShift 4.x is the norm. To that effect, I will focus on OpenShift 4, but many of the considerations will apply to both OpenShift 3 and OpenShift 4.

Let’s unpack some of the OpenShift building blocks. An OpenShift cluster consists of Master, infrastructure, and worker nodes. The Master nodes form the control plane, and the infrastructure nodes cater to a routing layer and additional functions, such as logging and monitoring. Worker nodes are the nodes on which customer application container workloads run.

When deployed on-premises, OpenShift nodes will be placed in separate network subnets. Depending on distance, latency, and other factors, a single OpenShift cluster may span two data centers, with some nodes in a subnet in one data center and others in a subnet in a different data center. This applies to customers with data centers within a few miles of each other with high-speed connectivity. An alternative would be an OpenShift cluster in each data center.

AWS concepts and terminology

At AWS, the concept of “region” is a geolocation, such as EMEA (Europe, Middle East, and Africa) or APAC (Asia Pacific), rather than a data center or specific building. An Availability Zone (AZ) is the closest construct on AWS that maps to a physical data center. Within each region you will find multiple (typically three or more) AZs. Note that a single AZ will contain multiple physical data centers, but we treat it as a single point of failure. For example, an event that impacts an AZ would be expected to impact all the data centers within that AZ. To this effect, customers should deploy workloads spanning multiple AZs to protect against any event that would impact a single AZ.

Read more about Regions, Availability Zones, and Edge Locations.

Deploying OpenShift

When deploying an OpenShift cluster on AWS, we recommend starting with three Master nodes spread across three AWS AZs and three worker nodes spread across three AZs. This allows for the combination of resilience and availability constructs provided by AWS as well as Red Hat OpenShift. The OpenShift installer provides a means of deploying the underlying AWS infrastructure in two ways: IPI (installer-provisioned infrastructure) and UPI (user-provisioned infrastructure). Both Red Hat and AWS collect customer feedback and use this to drive recommended patterns that are then included in the OpenShift installer. As such, the OpenShift installer IPI mode becomes a living reference architecture for deploying OpenShift on AWS.

Deploying OpenShift

The installer will require inputs for the environment on which it’s being deployed. In this case, since I am deploying on AWS, I will need to provide the AWS region, the AZs or the subnets that relate to the AZs, as well as the EC2 instance types. These inputs are captured in an install-config.yaml similar to the following, from which the installer then generates the ignition files used during the deployment of OpenShift:

apiVersion: v1
baseDomain: example.com 
controlPlane: 
  hyperthreading: Enabled   
  name: master
  platform:
    aws:
      zones:
      - us-west-2a
      - us-west-2b
      - us-west-2c
      rootVolume:
        iops: 4000
        size: 500
        type: io1
      type: m5.xlarge 
  replicas: 3
compute: 
- hyperthreading: Enabled 
  name: worker
  platform:
    aws:
      rootVolume:
        iops: 2000
        size: 500
        type: io1 
      type: m5.xlarge
      zones:
      - us-west-2a
      - us-west-2b
      - us-west-2c
  replicas: 3
metadata:
  name: test-cluster 
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  aws:
    region: us-west-2 
    userTags:
      adminContact: jdoe
      costCenter: 7536
pullSecret: '{"auths": ...}' 
fips: false 
sshKey: ssh-ed25519 AAAA... 

What does this look like at scale?

For larger implementations, we would see additional worker nodes spread across three or more AZs. As more worker nodes are added, use of the control plane increases. Initially, scaling up the Amazon Elastic Compute Cloud (EC2) instance type of the Master nodes to a larger instance type is an effective way of addressing this. It’s possible to add more Master nodes, and we recommend that an odd number of nodes is maintained. It is more common to see scaling out of the infrastructure nodes before there is a need to scale Masters. For large-scale implementations, infrastructure functions such as the router, monitoring, and logging functions can be moved to EC2 instances separate from the Master nodes, as well as from each other. It is important to spread the routing layer across multiple AZs, which is critical to maintaining availability and resilience.

The process of resource separation is now controlled by infrastructure machine sets within OpenShift. An infrastructure machine set would need to be defined, then the infrastructure role edited to be moved from the default to this new infrastructure machine set. Read about this in greater detail.

OpenShift in a multi-account context

Using AWS accounts as a means of separation is a common well-architected pattern. AWS Organizations and AWS Control Tower are services that are commonly adopted as part of a multi-account strategy. This is very much the case when looking to enable teams to use their own accounts and when an account vending process is needed to cater for self-service account provisioning.

OpenShift in a multi-account context

OpenShift clusters are deployed into multiple accounts. An OpenShift dev cluster is deployed into an AWS Dev account. This account would typically have AWS Developer Support associated with it. A separate production OpenShift cluster would be provisioned into an AWS production account with AWS Enterprise Support. Enterprise support provides for faster support case response times, and you get the benefit of dedicated resources such as a technical account manager and solutions architect.

CI/CD pipelines and processes are then used to control the application life cycle from code to dev to production. The pipelines would push the code to different OpenShift cluster end points at different stages of the life cycle.

Hybrid use case implementation

A common misconception of hybrid implementations is that there is a single cluster or control plane that has worker nodes in various locations. For example, there could be a cluster where the Master and infrastructure nodes are deployed in one location, but also worker nodes registered with this cluster that exist on-premises as well as in the cloud.

Having a single control plane for a hybrid implementation, even if technically possible, introduces undesired risks.

There is the potential to take multiple environments with very different resilience characteristics and make them interdependent. This can result in performance and reliability issues, which increase not only the likelihood of the risk manifesting, but also its impact or blast radius.

Instead, hybrid implementations will see separate OpenShift clusters deployed into various locations. A customer may deploy clusters on-premises to cater for a workload that can’t be migrated to the cloud in the short term. Separate OpenShift clusters can then be deployed into accounts in AWS for workloads on the cloud. Customers can also deploy separate OpenShift clusters in different AWS regions to cater for proximity to the consuming customer.

Though adding multiple clusters doesn’t add significant administrative overhead, there is a desire to gain visibility and telemetry into all the deployed clusters from a central location. This may see the OpenShift clusters registered with Red Hat Advanced Cluster Manager for Kubernetes.

Summary

Take advantage of the IPI model, not only as a guide but also to save time. Make AWS Organizations, AWS Control Tower, and the AWS Service Catalog part of your cloud and hybrid strategies. These will not only speed up migrations but also form building blocks for a modernized business with a focus on enabling prescriptive self-service. Consider Red Hat Advanced Cluster Manager for Kubernetes for multi-cluster management.