How to Accelerate Building a Lake House Architecture with AWS Glue

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/architecture/how-to-accelerate-building-a-lake-house-architecture-with-aws-glue/

Customers are building databases, data warehouses, and data lake solutions in isolation from each other, each having its own separate data ingestion, storage, management, and governance layers. Often these disjointed efforts to build separate data stores end up creating data silos, data integration complexities, excessive data movement, and data consistency issues. These issues are preventing customers from getting deeper insights. To overcome these issues and easily move data around, a Lake House approach on AWS was introduced.

In this blog post, we illustrate the AWS Glue integration components that you can use to accelerate building a Lake House architecture on AWS. We will also discuss how to derive persona-centric insights from your Lake House using AWS Glue.

Components of the AWS Glue integration system

AWS Glue is a serverless data integration service that facilitates the discovery, preparation, and combination of data. It can be used for analytics, machine learning, and application development. AWS Glue provides all of the capabilities needed for data integration, so you can start analyzing your data and putting it to use in minutes rather than months.

The following diagram illustrates the various components of the AWS Glue integration system.

Figure 1. AWS Glue integration components

Connect – AWS Glue allows you to connect to various data sources anywhere

Glue connector: AWS Glue provides built-in support for the most commonly used data stores. You can use Amazon Redshift, Amazon RDS, Amazon Aurora, Microsoft SQL Server, MySQL, MongoDB, or PostgreSQL using JDBC connections. AWS Glue also allows you to use custom JDBC drivers in your extract, transform, and load (ETL) jobs. For data stores that are not natively supported such as SaaS applications, you can use connectors. You can also subscribe to several connectors offered in the AWS Marketplace.

Glue crawlers: You can use a crawler to populate the AWS Glue Data Catalog with tables. A crawler can crawl multiple data stores in a single pass. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. Extract, transform, and load (ETL) jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.
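
As a rough illustration of how crawlers and the Data Catalog fit together, the following boto3 sketch starts an existing crawler and then lists the tables it registered. The crawler name, database name, and Region are placeholders, not values from this post.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Kick off an existing crawler that scans the raw zone of the data lake.
glue.start_crawler(Name="lakehouse-raw-crawler")

# A real workflow would poll with backoff or react to the crawler's completion event;
# here we simply read the current state once.
state = glue.get_crawler(Name="lakehouse-raw-crawler")["Crawler"]["State"]
print("Crawler state:", state)

# Once the crawl completes, the Data Catalog holds one table per discovered dataset.
for table in glue.get_tables(DatabaseName="lakehouse_raw")["TableList"]:
    print(table["Name"], table["StorageDescriptor"]["Location"])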

Catalog – AWS Glue simplifies data discovery and governance

Glue Data Catalog: The Data Catalog serves as the central metadata catalog for the entire data landscape.

Glue Schema Registry: The AWS Glue Schema Registry allows you to centrally discover, control, and evolve data stream schemas. With AWS Glue Schema Registry, you can manage and enforce schemas on your data streaming applications.
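
For instance, a hedged boto3 sketch of registering a stream schema might look like the following; the registry name, schema name, and Avro fields are placeholders invented for this example.

import json

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a registry, then register an Avro schema for a clickstream topic.
glue.create_registry(RegistryName="lakehouse-streams")

avro_schema = json.dumps({
    "type": "record",
    "name": "ClickEvent",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "url", "type": "string"},
        {"name": "event_time", "type": "long"},
    ],
})

glue.create_schema(
    RegistryId={"RegistryName": "lakehouse-streams"},
    SchemaName="click-events",
    DataFormat="AVRO",
    Compatibility="BACKWARD",  # reject producer changes that would break existing consumers
    SchemaDefinition=avro_schema,
)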

Data quality – AWS Glue helps you author and monitor data quality rules

Glue DataBrew: AWS Glue DataBrew allows data scientists and data analysts to clean and normalize data. You can use a visual interface, reducing the time it takes to prepare data by up to 80%. With Glue DataBrew, you can visualize, clean, and normalize data directly from your data lake, data warehouses, and databases.

Curate data: You can use either a Glue development endpoint or AWS Glue Studio to curate your data.

An AWS Glue development endpoint is an environment that you can use to develop and test your AWS Glue scripts. You can choose either an Amazon SageMaker notebook or an Apache Zeppelin notebook as the environment.

AWS Glue Studio is a visual interface for AWS Glue that supports ETL developers. With it, you can author, run, and monitor AWS Glue ETL jobs, composing jobs that move and transform data and running them on AWS Glue.
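
Under the hood, Glue Studio generates a PySpark script. The sketch below shows the general shape of such a job, reading a Data Catalog table and writing curated Parquet back to Amazon S3; the database, table, and bucket names are placeholders, not resources referenced in this post.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import DropNullFields
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw table that a crawler registered in the Data Catalog.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="lakehouse_raw", table_name="orders"
)

# Apply a simple transformation before landing the data in the curated zone.
curated = DropNullFields.apply(frame=raw)

glue_context.write_dynamic_frame.from_options(
    frame=curated,
    connection_type="s3",
    connection_options={"path": "s3://example-curated-bucket/orders/"},
    format="parquet",
)

job.commit()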

AWS Data Exchange makes it easy for AWS customers to securely exchange and use third-party data in AWS. This is for data providers who want to structure their data across multiple datasets or enrich their products with additional data. You can publish additional datasets to your products using the AWS Data Exchange.

Deequ is an open-source data quality library developed at Amazon. It provides features such as automatic constraint suggestion and verification, metrics computation, and data profiling.
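
Deequ itself is a Scala library that runs on Apache Spark; PyDeequ exposes Python bindings for it. The following is a minimal sketch, assuming a Spark session with the Deequ jar on its classpath and a dataset with the made-up columns shown; it is not code from this post.

from pyspark.sql import SparkSession

import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationResult, VerificationSuite

# PyDeequ publishes the Maven coordinates of the Deequ jar it expects.
spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

df = spark.read.parquet("s3://example-curated-bucket/orders/")

check = Check(spark, CheckLevel.Error, "orders data quality")
result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check.isComplete("order_id")     # no missing order IDs
                         .isUnique("order_id")       # order IDs are unique
                         .isNonNegative("amount"))   # amounts are never negative
          .run())

VerificationResult.checkResultsAsDataFrame(spark, result).show()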

Build a Lake House architecture faster, using AWS Glue

Figure 2 illustrates how you can build a Lake House using AWS Glue components.

Figure 2. Building Lake House architectures with AWS Glue

The architecture flow follows these general steps:

  1. Glue crawlers scan the data from various data sources and populate the Data Catalog for your Lake House.
  2. The Data Catalog serves as the central metadata catalog for the entire data landscape.
  3. Once data is cataloged, fine-grained access control is applied to the tables through AWS Lake Formation (a hedged example of granting table permissions follows this list).
  4. Curate your data with business and data quality rules by using Glue Studio, Glue development endpoints, or Glue DataBrew. Place the transformed data in a curated Amazon S3 zone for purpose-built analytics downstream.
  5. Facilitate data movement with AWS Glue to and from your data lake, databases, and data warehouse by using Glue connections. Use AWS Glue Elastic Views to replicate the data across the Lake House.
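
As referenced in step 3, here is a hedged boto3 sketch of a fine-grained Lake Formation grant; the role ARN, database, table, and column names are placeholders chosen for illustration.

import boto3

lakeformation = boto3.client("lakeformation", region_name="us-east-1")

# Grant an analyst role SELECT access to a subset of columns in a cataloged table.
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/analyst-role"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "lakehouse_curated",
            "Name": "orders",
            "ColumnNames": ["order_id", "order_date", "amount"],
        }
    },
    Permissions=["SELECT"],
)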

Derive persona-centric insights from your Lake House using AWS Glue

Many organizations want to gather observations from increasingly larger volumes of acquired data. These insights help them make data-driven decisions with speed and agility. To do so, they can combine a central data lake, a ring of purpose-built data services, and data warehouses, organized by persona or job function.

Figure 3 illustrates the Lake House inside-out data movement with AWS Glue DataBrew, Amazon Athena, Amazon Redshift, and Amazon QuickSight to perform persona-centric data analytics.

Figure 3. Lake House persona-centric data analytics using AWS Glue

This shows how Lake House components serve various personas in an organization:

  1. Data ingestion: Data is ingested to Amazon Simple Storage Service (S3) from different sources.
  2. Data processing: Data curators and data scientists use DataBrew to validate, clean, and enrich the data. Amazon Athena is also used to run ad hoc queries to analyze the data in the lake. The transformations are then shared with data engineers, who set up batch processing.
  3. Batch data processing: Data engineers or developers set up batch jobs in AWS Glue and AWS Glue DataBrew. Jobs can be initiated by an event, or can be scheduled to run periodically.
  4. Data analytics: Data and business analysts can now analyze the prepared datasets in Amazon Redshift or in Amazon S3 using Athena (see the Athena sketch after this list).
  5. Data visualizations: Business analysts can create visuals in QuickSight. Data curators can enrich data from multiple sources. Admins can enforce security and data governance. Developers can embed QuickSight dashboards in applications.
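
For step 4, a minimal boto3 sketch of an Athena query against the curated zone could look like this; the database, table, and result bucket are placeholder names.

import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Run an ad hoc aggregation over a curated table.
query = athena.start_query_execution(
    QueryString=(
        "SELECT order_date, SUM(amount) AS revenue "
        "FROM orders GROUP BY order_date ORDER BY order_date"
    ),
    QueryExecutionContext={"Database": "lakehouse_curated"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = query["QueryExecutionId"]

# Poll until the query finishes (a production job would add a timeout).
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])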

Conclusion

Using a Lake House architecture will help you get persona-centric insights quickly from all of your data, based on user role or job function. In this blog post, we described several AWS Glue components and AWS purpose-built services that you can use to build Lake House architectures on AWS. We also presented a persona-centric Lake House analytics architecture using AWS Glue to help you derive insights from your Lake House.

Read more and get started on building Lake House Architectures on AWS.

How to automate forensic disk collection in AWS

Post Syndicated from Matt Duda original https://aws.amazon.com/blogs/security/how-to-automate-forensic-disk-collection-in-aws/

In this blog post you’ll learn about a hands-on solution you can use for automated disk collection across multiple AWS accounts. This solution will help your incident response team set up an automation workflow to capture the disk evidence they need to analyze to determine scope and impact of potential security incidents. This post includes AWS CloudFormation templates and all of the required AWS Lambda functions, so you can deploy this solution in your own environment. This post focuses primarily on two sources as the origination of the evidence collection workflow: AWS Security Hub and Amazon GuardDuty.

Why is automating forensic disk collection important?

AWS offers unique scaling capabilities in our compute environments. As you begin to increase your number of compute instances across multiple AWS accounts or organizations, you will find operational aspects of your business that must also scale. One of these critical operational tasks is the ability to quickly gather forensically sound disk and memory evidence during a security event.

During a security event, your incident response (IR) team must be able to collect and analyze evidence quickly while maintaining accuracy for the time period surrounding the event. It is both challenging and time consuming for the IR team to manually collect all of the relevant evidence in a cloud environment, across a large number of instances and accounts. Additionally, manual collection requires time that could otherwise be spent analyzing and responding to an event. Every role assumption, every console click, and every manual trigger required by the IR team, adds time for an attacker to continue to work through systems to meet their objectives.

Indicators of compromise (IoCs) are pieces of data that IR teams often use to identify potential suspicious activity within networks that might need further investigation. These IoCs can include file hashes, domains, IP addresses, or user agent strings. IoCs are used by services such as GuardDuty to help you discover potentially malicious activity in your accounts. For example, when you are alerted that an Amazon Elastic Compute Cloud (Amazon EC2) instance contains one or more IoCs, your IR team must gather a point-in-time copy of relevant forensic data to determine the root cause, and evaluate the likelihood that the finding requires action. This process involves gathering snapshots of any and all attached volumes, a live dump of the system’s memory, a capture of the instance metadata, and any logs that relate to the instance. These sources help your IR team to identify next steps and work towards a root cause.

It is important to take a point-in-time snapshot of an instance as close in time to the incident as possible. If there is a delay in capturing the snapshot, evidence can be altered or rendered unusable because data has changed or been deleted. To take this snapshot quickly, you need a way to automate the collection and delivery of potentially hundreds of disk images, while ensuring that each snapshot is collected in the same way and without creating a bottleneck in the pipeline that could reduce the integrity of the evidence. In this blog post, I explain the details of the automated disk collection workflow, and explain why you might make different design decisions. You can download the solution in CloudFormation, so that you can deploy it and get started on your own forensic automation workflows.

AWS Security Hub provides an aggregated view of security findings across AWS accounts, including findings produced by GuardDuty, when enabled. Security Hub also provides you with the ability to ingest custom or third-party findings, which makes it an excellent starting place for automation. This blog post uses EC2 GuardDuty findings collected into Security Hub as the example, but you can also use the same process to include custom detection events, or alerts from partner solutions such as CrowdStrike, McAfee, Sophos, Symantec, or others.

Infrastructure overview

The workflow described in this post automates the tasks that an IR team commonly takes during the course of an investigation.

Overview of disk collection workflow

The high-level disk collection workflow steps are as follows:

  1. Create a snapshot of each Amazon Elastic Block Store (Amazon EBS) volume attached to suspected instances.
  2. Create a folder in the Amazon Simple Storage Service (Amazon S3) evidence bucket with the original event data.
  3. Launch one Amazon EC2 instance per EBS volume, to be used in streaming a bit-for-bit copy of the EBS snapshot volume. These EC2 instances are launched without SSH key pairs, to help prevent any unintentional evidence corruption and to ensure consistent processing without user interaction. The EC2 instances use third-party tools dc3dd and incrond to trigger and process volumes.
  4. Write all logs from the workflow and instances to Amazon CloudWatch Logs log groups, for audit purposes.
  5. Store each EBS volume in the S3 evidence bucket as a raw image file (.dd), along with the metadata from the automated capture process and hashes for validation and verification.

Overview of AWS services used in the workflow

Another way of looking at this high-level workflow is from the service perspective, as shown in Figure 1.

Figure 1: Service workflow for forensic disk collection

The workflow in Figure 1 shows the following steps:

  1. A GuardDuty finding is triggered for an instance in a monitored account. This example focuses on a GuardDuty finding, but the initial detection source can also be a custom event, or an event from a third party.
  2. The Security Hub service in the monitored account receives the GuardDuty finding, and forwards it to the Security Hub service in the security account.
  3. The Security Hub service in the security account receives the monitored account’s finding.
  4. The Security Hub service emits an Amazon EventBridge event for the GuardDuty finding, which is matched by an EventBridge rule that forwards it to the DiskForensicsInvoke Lambda function. The following is the example event rule, which is included in the deployment. This example can be expanded or reduced to fit your use case. By default, the example rule is set to disabled in CloudFormation. When you are ready to use the automation, you will need to enable it.
    {
      "detail-type": [
        "Security Hub Findings - Imported"
      ],
      "source": [
        "aws.securityhub"
      ],
      "detail": {
        "findings": {
          "ProductFields": {
            "aws/securityhub/SeverityLabel": [
              "CRITICAL",
              "HIGH",
              "MEDIUM"
            ],
            "aws/securityhub/ProductName": [
              "GuardDuty"
            ]
          }
        }
      }
    }
    

  5. The DiskForensicsInvoke Lambda function receives the event from EventBridge, formats the event, and provides the formatted event as input to the AWS Step Functions workflow.
  6. The DiskForensicStepFunction workflow includes ten Lambda functions, from initial snapshot to streaming the evidence to the S3 bucket. After the Step Functions workflow enters the CopySnapshot state, it converts to a map state. This allows the workflow to have one thread per volume submitted, and ensures that each volume will be placed in the evidence bucket as quickly as possible without needing to wait for other steps to complete.
    Figure 2: Forensic disk collection Step Function workflow

    As shown in Figure 2, the following are the embedded Lambda functions in the DiskForensicStepFunction workflow:

    1. CreateSnapshot – This function creates the initial snapshots for each EBS volume attached to the instance in question. It also records instance metadata that is included with the snapshot data for each EBS volume. (A rough sketch of this step appears after this list.)
      Required Environmental Variables: ROLE_NAME, EVIDENCE_BUCKET, LOG_GROUP
    2. CheckSnapshot – This function checks to see if the snapshots from the previous step are completed. If not, the function retries with an exponential backoff.
      Required Environmental Variable: ROLE_NAME
    3. CopySnapshot – This function copies the initial snapshot and ensures that it is using the forensics AWS Key Management Service (AWS KMS) key. This key is stored in the security account and will be used throughout the remainder of the process.
      Required Environmental Variables: ROLE_NAME, KMS_KEY
    4. CheckCopySnapshot – This function checks to see if the snapshot from the previous step is completed. If not, the function retries with exponential backoff.
      Required Environmental Variable: ROLE_NAME
    5. ShareSnapshot – This function takes the copied snapshot using the forensics KMS key, and shares it with the security account.
      Required Environmental Variables: ROLE_NAME, SECURITY_ACCOUNT
    6. FinalCopySnapshot – This function copies the shared snapshot into the security account, as the original shared snapshot is still owned by the monitored account. This ensures that a copy is available, in case it has to be referenced for additional processing later.
      Required Environmental Variable: KMS_KEY
    7. FinalCheckSnapshot – This function checks to see if the snapshot from the previous step is completed. If not, the function retries with an exponential backoff.
    8. CreateVolume – This function creates an EBS Magnetic volume from the snapshot in the previous step. These volumes use magnetic disks because they are required for consistent hash results from the dc3dd process. The volume cannot use a solid state drive (SSD), because the hash would be different each time. If the volume size is greater than or equal to 500 GB, Amazon EBS switches from standard EBS Magnetic volumes to Throughput Optimized HDD (st1) volumes.
      Required Environmental Variables: KMS_KEY, SUPPORTED_AZS
    9. RunInstance – This function launches one EC2 instance per volume, to be used in streaming the volume to the S3 bucket. The AMI passed by the environmental variable needs to be created using the provided Amazon EC2 Image Builder pipeline before deploying the environment. This function also passes some user data to the instance, artifact bucket, source volume name, and the incidentID. This information is used by the instance when placing the evidence into the S3 bucket.
      Required Environmental Variables: AMI_ID, INSTANCE_PROFILE_NAME, VPC_ID, SECURITY_GROUP
    10. CreateInstanceWait – This function creates a 30-second wait, to allow the instance some additional time to spin up.
    11. MountForensicVolume – This function checks the CloudWatch log group ForensicDiskReadiness, to see that the incrond service is running on the instance. If the incrond service is running, the function attaches the volume to the instance and then writes the final logs to the S3 bucket and CloudWatch Logs.
      Required Environmental Variable: LOG_GROUP
  7. The instance that is created has pre-built tools and scripts on it from the template below, built using Image Builder. This instance uses the incrond tool to monitor /dev/disk/by-label for new devices being attached to the instance. After the MountForensicVolume Lambda function attaches the volume to the instance, a file is created in the /dev/disk/by-label directory for the attached volume. The incrond daemon starts the orchestrator script, which calls the collector script. The collector script uses the dc3dd tool to stream the bit-for-bit copy of the volume to S3. After the copy has completed, the instance shuts down and is terminated. All logs from the process are sent to the S3 bucket and CloudWatch Logs.
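
The actual Lambda code ships with the CloudFormation templates linked in this post. As a rough, hedged illustration of what the CreateSnapshot step does, the sketch below assumes a cross-account role name (IR-Automation-Role is a placeholder) deployed to the monitored account, and snapshots every EBS volume attached to the suspect instance.

import boto3

def snapshot_instance_volumes(account_id, instance_id, incident_id,
                              role_name="IR-Automation-Role", region="us-east-1"):
    """Create a snapshot of every EBS volume attached to a suspect instance."""
    # Assume the member-account role so the snapshots are taken in the monitored account.
    creds = boto3.client("sts").assume_role(
        RoleArn=f"arn:aws:iam::{account_id}:role/{role_name}",
        RoleSessionName="forensic-disk-capture",
    )["Credentials"]

    ec2 = boto3.client(
        "ec2",
        region_name=region,
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

    instance = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"][0]["Instances"][0]

    snapshots = []
    for mapping in instance.get("BlockDeviceMappings", []):
        volume_id = mapping["Ebs"]["VolumeId"]
        snap = ec2.create_snapshot(
            VolumeId=volume_id,
            Description=f"Forensic capture {incident_id} for {instance_id}",
        )
        snapshots.append({
            "SourceVolumeID": volume_id,
            "SourceDeviceName": mapping["DeviceName"],
            "SourceSnapshotID": snap["SnapshotId"],
        })
    return snapshots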

The solution provided in this post includes the CloudFormation templates you need to get started, except for creating the initial EventBridge rule (which is provided in step 4 of the previous section). The solution includes an isolated VPC, subnets, security groups, roles, and more. The VPC does not allow any egress through an internet gateway or NAT gateway, which is the recommended configuration. The only connectivity provided is through the S3 gateway VPC endpoint and the CloudWatch Logs interface VPC endpoint (also deployed in the template).

Deploy the CloudFormation templates

To implement the solution outlined in this post, you need to deploy three separate AWS CloudFormation templates in the order described in this section.

diskForensicImageBuilder (security account)

First, you deploy diskForensicImageBuilder in the security account. This template contains the resources and AMIs needed to create and run the Image Builder pipeline that is required to build the collector VM. The pipeline installs the required binaries and scripts, and updates the system.

Note: diskForensicImageBuilder is configured to use the default VPC and security group. If you have added restrictions or deleted your default VPC, you will need to modify the template.

To deploy the diskForensicImageBuilder template

  1. To open the AWS CloudFormation console pre-loaded with the template, choose the following Launch Stack button.
    Select the Launch Stack button to launch the template
  2. In the AWS CloudFormation console, on the Specify Details page, enter a name for the stack.
  3. Leave all default settings in place, and choose Next to configure the stack options.
  4. Choose Next to review and scroll to the bottom of the page. Select the check box under the Capabilities section, next to the acknowledgement:
    • I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  5. Choose Create Stack.
  6. After the Image Builder pipeline has been created, on the Image pipelines page, choose Actions and select Run pipeline to manually run the pipeline to create the base AMI.

    Figure 3: Run the new Image Builder pipeline

diskForensics (security account)

Second, you deploy diskForensics in the security account. This is a nested CloudFormation stack made up of four templates:

  1. forensicResources – This stack holds all of the foundation for the solution, including the VPC and networking components, the S3 evidence bucket, CloudWatch log groups, and collectorVM instance profile.

    Figure 4: Forensics VPC

  2. forensicFunctions – This stack contains all of the Lambda functions referenced in the Step Functions workflow as well as the role used by the Lambda functions.
  3. forensicStepFunction – This stack contains the Step Functions code, the role used by the Step Functions service, and the CloudWatch log group used by the service. It also contains an Amazon Simple Notification Service (SNS) topic used to alert on pipeline failure.
  4. forensicStepFunctionInvoke – This stack contains the DiskForensicsInvoke Lambda function and the role used by that Lambda function that allows it to call the Step Function workflow.

Note: You need to have the following required variables to continue:

  • ArtifactBucketName
  • ORGID
  • ForensicsAMI

If your accounts are not using AWS Organizations, you can use a dummy string for now. The value adds a condition statement to the forensics KMS key policy that you can update or remove later.

To deploy the diskForensics stack

  1. To open the AWS CloudFormation console pre-loaded with the template, choose the following Launch Stack button.
    Select the Launch Stack button to launch the template
  2. In the AWS CloudFormation console, on the Specify Details page, enter a name for the stack.
  3. For the ORGID field, enter the AWS Organizations ID.

    Note: If you are not using AWS Organizations, leave the default string. If you are deploying across multiple accounts without AWS Organizations, you will need to update the KMS key policy to remove the PrincipalOrgID condition statements and add the correct principals.

  4. For the ArtifactBucketName field, enter the S3 bucket name you would like to use for your forensic artifacts.

    Important: The ArtifactBucketName must be a globally unique name.

  5. For the ForensicsAMI field, enter the AMI ID for the image that was created by Image Builder.
  6. For the example in this post, leave the default values for all other fields. Customizing these fields allows you to customize this code example for your own purposes.
  7. Choose Next to configure the stack options and leave all default settings in place.
  8. Choose Next to review and scroll to the bottom of the page. Select the two check boxes under the Capabilities section, next to each of the acknowledgements:
    • I acknowledge that AWS CloudFormation might create IAM resources with custom names.
    • I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND.
  9. Choose Create Stack.
  10. After the stack has completed provisioning, subscribe to the Amazon SNS topic to receive pipeline alerts.

diskMember (each monitored account)

Third, you deploy diskMember in each monitored account. This stack contains the role and policy that the automation workflow needs to assume, so that it can create the initial snapshots and share the snapshot with the security account. If you are deploying this solution in a single account, you deploy diskMember in the security account.

Important: Ensure that all KMS keys that could be used to encrypt EBS volumes in each monitored account grant this role the ability to CreateGrant, Encrypt, Decrypt, ReEncrypt*, GenerateDataKey*, and DescribeKey. The default policy grants the permissions in AWS Identity and Access Management (IAM), but any restrictive resource policies could block the ability to create the initial snapshot and to decrypt the snapshot when making the copy.

To deploy the diskMember stack

  1. To open the AWS CloudFormation console pre-loaded with the template, choose the following Launch Stack button.
    Select the Launch Stack button to launch the template
    If deploying across multiple accounts, consider using AWS CloudFormation StackSets for simplified multi-account deployment.
  2. In the AWS CloudFormation console, on the Specify Details page, enter a name for the stack.
  3. For the MasterAccountNum field, enter the account number for your security administrator account.
  4. Choose Next to configure the stack options and leave all default settings in place.
  5. Choose Next to review and scroll to the bottom of the page. Select the check box under the Capabilities section, next to the acknowledgement:
    • I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  6. Choose Create Stack.

Test the solution

Next, you can try this solution with an event sample to start the workflow.

To initiate a test run

  1. Copy the following example GuardDuty event. The example uses the AWS Region us-east-1, but you can update the example to use another Region. Be sure to replace the account ID 0123456789012 with the account number of your monitored account, and replace the instance ID i-99999999 with the instance ID you would like to capture.
    {
      "SchemaVersion": "2018-10-08",
      "Id": "arn:aws:guardduty:us-east-1:0123456789012:detector/f2b82a2b2d8d8541b8c6d2c7d9148e14/finding/b0baa737c3bf7309db2a396651fdb500",
      "ProductArn": "arn:aws:securityhub:us-east-1::product/aws/guardduty",
      "GeneratorId": "arn:aws:guardduty:us-east-1:0123456789012:detector/f2b82a2b2d8d8541b8c6d2c7d9148e14",
      "AwsAccountId": "0123456789012",
      "Types": [
        "Effects/Resource Consumption/UnauthorizedAccess:EC2-TorClient"
      ],
      "FirstObservedAt": "2020-10-22T03:52:13.438Z",
      "LastObservedAt": "2020-10-22T03:52:13.438Z",
      "CreatedAt": "2020-10-22T03:52:13.438Z",
      "UpdatedAt": "2020-10-22T03:52:13.438Z",
      "Severity": {
        "Product": 8,
        "Label": "HIGH",
        "Normalized": 60
      },
      "Title": "EC2 instance i-99999999 is communicating with Tor Entry node.",
      "Description": "EC2 instance i-99999999 is communicating with IP address 198.51.100.0 on the Tor Anonymizing Proxy network marked as an Entry node.",
      "SourceUrl": "https://us-east-1.console.aws.amazon.com/guardduty/home?region=us-east-1#/findings?macros=current&fId=b0baa737c3bf7309db2a396651fdb500",
      "ProductFields": {
        "aws/guardduty/service/action/networkConnectionAction/remotePortDetails/portName": "HTTP",
        "aws/guardduty/service/archived": "false",
        "aws/guardduty/service/action/networkConnectionAction/remoteIpDetails/organization/asnOrg": "GeneratedFindingASNOrg",
        "aws/guardduty/service/action/networkConnectionAction/remoteIpDetails/Geolocation/lat": "0",
        "aws/guardduty/service/action/networkConnectionAction/remoteIpDetails/ipAddressV4": "198.51.100.0",
        "aws/guardduty/service/action/networkConnectionAction/remoteIpDetails/Geolocation/lon": "0",
        "aws/guardduty/service/action/networkConnectionAction/blocked": "false",
        "aws/guardduty/service/action/networkConnectionAction/remotePortDetails/port": "80",
        "aws/guardduty/service/action/networkConnectionAction/remoteIpDetails/country/countryName": "GeneratedFindingCountryName",
        "aws/guardduty/service/serviceName": "guardduty",
        "aws/guardduty/service/action/networkConnectionAction/localIpDetails/ipAddressV4": "10.0.0.23",
        "aws/guardduty/service/detectorId": "f2b82a2b2d8d8541b8c6d2c7d9148e14",
        "aws/guardduty/service/action/networkConnectionAction/remoteIpDetails/organization/org": "GeneratedFindingORG",
        "aws/guardduty/service/action/networkConnectionAction/connectionDirection": "OUTBOUND",
        "aws/guardduty/service/eventFirstSeen": "2020-10-22T03:52:13.438Z",
        "aws/guardduty/service/eventLastSeen": "2020-10-22T03:52:13.438Z",
        "aws/guardduty/service/evidence/threatIntelligenceDetails.0_/threatListName": "GeneratedFindingThreatListName",
        "aws/guardduty/service/action/networkConnectionAction/localPortDetails/portName": "Unknown",
        "aws/guardduty/service/action/actionType": "NETWORK_CONNECTION",
        "aws/guardduty/service/action/networkConnectionAction/remoteIpDetails/city/cityName": "GeneratedFindingCityName",
        "aws/guardduty/service/resourceRole": "TARGET",
        "aws/guardduty/service/action/networkConnectionAction/localPortDetails/port": "39677",
        "aws/guardduty/service/action/networkConnectionAction/protocol": "TCP",
        "aws/guardduty/service/count": "1",
        "aws/guardduty/service/additionalInfo/sample": "true",
        "aws/guardduty/service/action/networkConnectionAction/remoteIpDetails/organization/asn": "-1",
        "aws/guardduty/service/action/networkConnectionAction/remoteIpDetails/organization/isp": "GeneratedFindingISP",
        "aws/guardduty/service/evidence/threatIntelligenceDetails.0_/threatNames.0_": "GeneratedFindingThreatName",
        "aws/securityhub/FindingId": "arn:aws:securityhub:us-east-1::product/aws/guardduty/arn:aws:guardduty:us-east-1:0123456789012:detector/f2b82a2b2d8d8541b8c6d2c7d9148e14/finding/b0baa737c3bf7309db2a396651fdb500",
        "aws/securityhub/ProductName": "GuardDuty",
        "aws/securityhub/CompanyName": "Amazon"
      },
      "Resources": [
        {
          "Type": "AwsEc2Instance",
          "Id": "arn:aws:ec2:us-east-1:0123456789012:instance/i-99999999",
          "Partition": "aws",
          "Region": "us-east-1",
          "Tags": {
            "GeneratedFindingInstaceTag7": "GeneratedFindingInstaceTagValue7",
            "GeneratedFindingInstaceTag8": "GeneratedFindingInstaceTagValue8",
            "GeneratedFindingInstaceTag9": "GeneratedFindingInstaceTagValue9",
            "GeneratedFindingInstaceTag1": "GeneratedFindingInstaceValue1",
            "GeneratedFindingInstaceTag2": "GeneratedFindingInstaceTagValue2",
            "GeneratedFindingInstaceTag3": "GeneratedFindingInstaceTagValue3",
            "GeneratedFindingInstaceTag4": "GeneratedFindingInstaceTagValue4",
            "GeneratedFindingInstaceTag5": "GeneratedFindingInstaceTagValue5",
            "GeneratedFindingInstaceTag6": "GeneratedFindingInstaceTagValue6"
          },
          "Details": {
            "AwsEc2Instance": {
              "Type": "m3.xlarge",
              "ImageId": "ami-99999999",
              "IpV4Addresses": [
                "10.0.0.1",
                "198.51.100.0"
              ],
              "IamInstanceProfileArn": "arn:aws:iam::0123456789012:example/instance/profile",
              "VpcId": "GeneratedFindingVPCId",
              "SubnetId": "GeneratedFindingSubnetId",
              "LaunchedAt": "2016-08-02T02:05:06Z"
            }
          }
        }
      ],
      "WorkflowState": "NEW",
      "Workflow": {
        "Status": "NEW"
      },
      "RecordState": "ACTIVE"
    }
    

  2. Navigate to the DiskForensicsInvoke Lambda function and add the GuardDuty event as a test event.
  3. Choose Test. You should see a success for the invocation.
  4. Navigate to the Step Functions workflow to monitor its progress. When the instances have terminated, all of the artifacts should be in the S3 bucket with additional logs in CloudWatch Logs.

Expected outputs

The forensic disk collection pipeline maintains logs of the actions throughout the process, and uploads the final artifacts to the S3 artifact bucket and CloudWatch Logs. This enables security teams to send forensic collection logs to log aggregation tools or service management tools for additional integrations. The expected outputs of the solution are detailed in the following sections, organized by destination.

S3 artifact outputs

The S3 artifact bucket is the final destination for all logs and the raw disk images. For each security incident that triggers the Step Functions workflow, a new folder will be created with the name of the IncidentID. Included in this folder will be the JSON file that triggered the capture operation, the image (dd) files for the volumes, the capture log, and the resources associated with the capture operation, as shown in Figure 5.

Figure 5: Forensic artifacts in the S3 bucket
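
If you need to pull these artifacts programmatically, a short boto3 sketch like the following lists everything stored under an incident's folder; the bucket name and IncidentID mirror the placeholder values used elsewhere in this post.

import boto3

s3 = boto3.client("s3")
bucket = "forensic-artifact-bucket"                 # placeholder evidence bucket
incident_id = "b0baa737c3bf7309db2a396651fdb500"    # placeholder IncidentID folder

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=f"{incident_id}/"):
    for obj in page.get("Contents", []):
        print(f'{obj["Size"]:>15}  {obj["Key"]}')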

Forensic Disk Audit log group

The Forensic Disk Audit CloudWatch log group contains a record of the Step Functions workflow state after the initial snapshots are created in the CreateSnapshot Lambda function. This includes the high-level finding information, as well as the metadata for each snapshot. Also included in this log group is the completion data for each disk collection operation, including all associated resources and the location of the forensic evidence in the S3 bucket. The following events are example logs demonstrating a completed capture; notice all of the metadata provided under CapturedSnapshots. The account ID 0123456789012, instance ID i-99999999, and Region shown here are placeholders; your logs will reflect the monitored account.

{
  "AwsAccountId": "0123456789012",
  "Types": [
    "Effects/Resource Consumption/UnauthorizedAccess:EC2-TorClient"
  ],
  "FirstObservedAt": "2020-10-22T03:52:13.438Z",
  "LastObservedAt": "2020-10-22T03:52:13.438Z",
  "CreatedAt": "2020-10-22T03:52:13.438Z",
  "UpdatedAt": "2020-10-22T03:52:13.438Z",
  "Severity": {
    "Product": 8,
    "Label": "HIGH",
    "Normalized": 60
  },
  "Title": "EC2 instance i-99999999 is communicating with Tor Entry node.",
  "Description": "EC2 instance i-99999999 is communicating with IP address 198.51.100.0 on the Tor Anonymizing Proxy network marked as an Entry node.",
  "FindingId": "arn:aws:securityhub:us-east-1::product/aws/guardduty/arn:aws:guardduty:us-east-1:0123456789012:detector/f2b82a2b2d8d8541b8c6d2c7d9148e14/finding/b0baa737c3bf7309db2a396651fdb500",
  "Resource": {
    "Type": "AwsEc2Instance",
    "Arn": "arn:aws:ec2:us-east-1:0123456789012:instance/i-99999999",
    "Id": "i-99999999",
    "Partition": "aws",
    "Region": "us-east-1",
    "Details": {
      "AwsEc2Instance": {
        "Type": "m3.xlarge",
        "ImageId": "ami-99999999",
        "IpV4Addresses": [
          "10.0.0.1",
          "198.51.100.0"
        ],
        "IamInstanceProfileArn": "arn:aws:iam::0123456789012:example/instance/profile",
        "VpcId": "GeneratedFindingVPCId",
        "SubnetId": "GeneratedFindingSubnetId",
        "LaunchedAt": "2016-08-02T02:05:06Z"
      }
    }
  },
  "EvidenceBucket": "forensic-artifact-bucket",
  "IncidentID": "b0baa737c3bf7309db2a396651fdb500",
  "CapturedSnapshots": [
    {
      "SourceSnapshotID": "snap-99999999",
      "SourceVolumeID": "vol-99999999",
      "SourceDeviceName": "/dev/xvda",
      "VolumeSize": 100,
      "InstanceID": "i-99999999",
      "FindingID": "arn:aws:securityhub:us-east-1::product/aws/guardduty/arn:aws:guardduty:us-east-1:0123456789012:detector/f2b82a2b2d8d8541b8c6d2c7d9148e14/finding/b0baa737c3bf7309db2a396651fdb500",
      "IncidentID": "b0baa737c3bf7309db2a396651fdb500",
      "AccountID": "0123456789012",
      "Region": "us-east-1",
      "EvidenceBucket": "forensic-artifact-bucket"
    }
  ]
}
{
  "SourceSnapshotID": "snap-99999999",
  "SourceVolumeID": "vol-99999999",
  "SourceDeviceName": "/dev/sdd",
  "VolumeSize": 100,
  "InstanceID": "i-99999999",
  "FindingID": "arn:aws:securityhub:us-east-1::product/aws/guardduty/arn:aws:guardduty:us-east-1:0123456789012:detector/f2b82a2b2d8d8541b8c6d2c7d9148e14/finding/b0baa737c3bf7309db2a396651fdb500",
  "IncidentID": "b0baa737c3bf7309db2a396651fdb500",
  "AccountID": "0123456789012",
  "Region": "us-east-1",
  "EvidenceBucket": "forensic-artifact-bucket",
  "CopiedSnapshotID": "snap-99999998",
  "EncryptionKey": "arn:aws:kms:us-east-1:0123456789012:key/e793cbd3-ce6a-4b17-a48f-7e78984346f2",
  "FinalCopiedSnapshotID": "snap-99999997",
  "ForensicVolumeID": "vol-99999998",
  "VolumeAZ": "us-east-1a",
  "ForensicInstances": [
    "i-99999998"
  ],
  "DiskImageLocation": "s3://forensic-artifact-bucket/b0baa737c3bf7309db2a396651fdb500/disk_evidence/vol-99999999.image.dd"
}

Forensic Disk Capture log group

The Forensic Disk Capture CloudWatch log group contains the logs from the EC2Collector VM. These logs detail the operations taken by the instance, which include when the dc3dd command was executed, what the transfer speed was to the S3 bucket, what the hash of the volume was, and how long the total operation took to complete. The log example in Figure 6 shows the output of the disk capture on the collector instance.

Figure 6: Forensic Disk Capture logs
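
The capture log records the hash that dc3dd computed for each volume. If you later download an image for analysis, you may want to confirm it still matches. A minimal sketch, assuming the recorded hash was MD5 (dc3dd can be configured for other algorithms):

import hashlib

import boto3

def hash_disk_image(bucket, key, algorithm="md5", chunk_size=8 * 1024 * 1024):
    """Stream a .dd image from S3 and compute its hash for evidence verification."""
    digest = hashlib.new(algorithm)
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    for chunk in iter(lambda: body.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

# Compare this value with the hash recorded in the Forensic Disk Capture log group.
print(hash_disk_image(
    "forensic-artifact-bucket",
    "b0baa737c3bf7309db2a396651fdb500/disk_evidence/vol-99999999.image.dd",
))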

Cost and capture times

This solution may save you money over a traditional system that requires bastion hosts (jump boxes) and forensic instances to be readily available. With AWS, you pay only for the individual services you need, for as long as you use them. The cost of this solution is minimal, because charges are only incurred based on the logs or artifacts that you store in CloudWatch or Amazon S3, and the invocation of the Step Functions workflow. Additionally, resources such as the collectorVM are only created and used when needed.

This solution can also save you time. If an analyst was manually working through this workflow, it could take significantly more time than the automated solution. The following are some examples of collection times. You can see that even when the manual workflow time increases, the automated workflow time stays the same, because of how the solution scales.

Scenario 1: EC2 instance with one 8GB volume

  • Automated workflow: 11 minutes
  • Manual workflow: 15 minutes

Scenario 2: EC2 instance with four 8GB volumes

  • Automated workflow: 11 minutes
  • Manual workflow: 1 hour 10 minutes

Scenario 3: Four EC2 instances with one 8GB volume each

  • Automated workflow: 11 minutes
  • Manual workflow: 1 hour 20 minutes

Clean up and delete artifacts

To clean up the artifacts from the solution in this post, first delete all information in your artifact S3 bucket. Then delete the diskForensics stack, followed by the diskForensicImageBuilder stack, and finally the diskMember stack. You must also manually delete any EBS volumes or EBS snapshots created by the pipeline; these are not deleted automatically. Finally, manually delete the AMIs and images that are created and published by Image Builder.

Considerations

This solution covers EBS volume storage as the target for forensic disk capture. If your instances use Amazon EC2 instance store volumes, you cannot snapshot and copy those volumes, because that data is not included in an EBS snapshot operation. Instead, you should consider running the commands that are included in the collector.sh script with AWS Systems Manager. The collector.sh script is included in the Image Builder recipe and uses dc3dd to stream a copy of the volume to Amazon S3.

Conclusion

Having this solution in place across your AWS accounts will enable fast response times to security events, because it helps ensure that forensic artifacts are collected and stored as quickly as possible. Download the .zip file for the solution in CloudFormation, so that you can deploy it and get started on your own forensic automation workflows. For talks describing this solution, see the video of SEC306 from re:Invent 2020 and the AWS Online Tech Talk AWS Digital Forensics Automation at Goldman Sachs.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon GuardDuty forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Matt Duda

Matt is a Senior Cloud Security Architect with AWS Professional Services. He has an extensive background in Cyber Security in the Financial Services Sector. He is obsessed with helping customers improve their ability to prevent, prepare, and respond to potential security events in AWS and utilizing automation wherever possible.

Contributor

Special thanks to Logan Bair who made significant contributions to this post.

Optimizing Website Performance With a CDN

Post Syndicated from Molly Clancy original https://www.backblaze.com/blog/optimizing-website-performance-with-a-cdn/

If you’ve ever wondered how a content delivery network (CDN) works, here’s a decent analogy… For most of the year, I keep one, maybe two boxes of tissues in the house. But, during allergy season, there’s a box in every room. When pollen counts are up, you need zero latency between sneeze and tissue deployment.

Instead of tissues in every room of the house, a CDN has servers in every corner of the globe, and they help reduce latency between a user’s request and when the website loads. If you want to make sure your website loads quickly no matter who accesses it, a CDN can help. Today, we’ll dig into the benefits of CDNs, how they work, and some common use cases with real-world examples.

What Is a CDN?

According to Cloudflare, one of the market leaders in CDN services, a CDN is “a geographically distributed group of servers which work together to provide fast delivery of internet content.” A CDN speeds up your website performance by temporarily keeping your website content on servers that are closer to end users. This is known as caching.

When someone in Australia visits your website that’s hosted in New York City, instead of fetching content like images, video, HTML pages, JavaScript files, etc. all the way from the “origin store” (the server where the main, original website lives in the Big Apple), the CDN fetches content from an “edge server” that’s geographically closer to the end user, at the edge of the network. Your website loads much faster when the content doesn’t have to travel halfway around the world to reach your website visitors.

How Do CDNs Work?

While a CDN does consist of servers that host website content, a CDN cannot serve as a web host itself; you still need traditional web hosting to operate your website. The CDN just holds your website content on servers closer to your end users. It refers back to the main, original website content that’s stored on your origin store in case you make any changes or updates.

Your origin store could be an actual, on-premises server located wherever your business is headquartered, but many growing businesses opt to use cloud storage providers to serve as their origin store. With cloud storage, they can scale up or down as website content grows and only pay for what they need rather than investing in expensive on-premises servers and networking equipment.

The CDN provider sets up its edge servers at internet exchange points, or IXPs. IXPs are points where traffic flows between different internet service providers, much like a highway interchange, so your data can reach end users faster.

Source: Cloudflare.

Not all of your website content will be stored on IXPs all of the time. A user must first request that website content. After the CDN retrieves it from the origin store to whatever server is nearest to the end user, it keeps it on that server as long as the content continues to be requested. The content has a specific “time to live,” or TTL, on the server. The TTL specifies how long the edge server keeps the content. At a certain point, if the content has not been requested within the TTL, the server will stop storing the content.

When a user pulls up website content from the cache on the edge server, it’s known as a cache hit. When the content is not in the cache and must be fetched from the origin store, it’s known as a cache miss. The ratio of hits to misses is known as the cache hit ratio, and it’s an important metric for website owners who use cloud storage as their origin and are trying to optimize their egress fees (the fees cloud storage providers charge to send data out of their systems). The better the cache hit ratio, the less they’ll be charged for egress out of their origin store.
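
As a rough back-of-the-envelope sketch (the per-gigabyte egress rate below is a made-up placeholder, not any provider’s actual pricing), the effect of the cache hit ratio on origin egress looks like this:

def monthly_egress_cost(total_gb_served, cache_hit_ratio, egress_rate_per_gb=0.01):
    """Estimate origin egress charges: only cache misses are fetched from the origin."""
    origin_gb = total_gb_served * (1 - cache_hit_ratio)
    return origin_gb * egress_rate_per_gb

# Serving 50 TB a month: compare a 70% hit ratio with a 95% hit ratio.
for ratio in (0.70, 0.95):
    print(f"hit ratio {ratio:.0%}: ~${monthly_egress_cost(50_000, ratio):,.2f} per month in origin egress")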

Another important metric for CDN users is round trip time, or RTT. RTT is the time it takes for a request from a user to travel to its destination and back again. RTT metrics help website owners understand the health of a network and the speed of network connections. A CDN’s primary purpose is to reduce RTT as much as possible.

Key Terms: Demystifying Acronym Soup

  • Origin Store: The main server or cloud storage provider where your website content lives.
  • CDN: Content delivery network, a geographically distributed group of servers that work to deliver internet content fast.
  • Edge Server: Servers in a CDN network that are located at the edge of the network.
  • IXP: Internet exchange point, a point where traffic flows between different internet service providers.
  • TTL: Time to live, the time content has to live on edge servers.
  • RTT: Round trip time, the time it takes for a request from a user to travel to its destination and back.
  • Cache Hit Ratio: The ratio of times content is retrieved from edge servers in the CDN network vs. the times content must be retrieved from the origin store.

Do I Need a CDN?

CDNs are a necessity for companies with a global presence or with particularly complex websites that deliver a lot of content, but you don’t have to be a huge enterprise to benefit from a CDN. You might be surprised to know that more than half of the world’s website content is delivered by CDN, according to Akamai, one of the first providers to offer CDN services.

What Are the Benefits of a CDN?

A CDN offers a few specific benefits for companies, including:

  • Faster website load times.
  • Lower bandwidth costs.
  • Redundancy and scalability during high traffic times.
  • Improved security.

Faster website load times: Content is distributed closer to visitors, which is incredibly important for improving bounce rates. Website visitors are orders of magnitude more likely to click away from a site the longer it takes to load. The probability of a bounce increases 90% as the page load time goes from one second to five on mobile devices, and website conversion rates drop by an average of 4.42% with each additional second of load time. If an e-commerce company makes $50 per conversion and does about $150,000 per month in business, a drop in conversion of 4.42% would equate to a loss of almost $80,000 per year.

If you still think seconds can’t make that much of a difference, think again. Amazon calculated that a page load slowdown of just one second could cost it $1.6 billion in sales each year. With website content distributed closer to website users via a CDN, pages load faster, reducing bounce rates.

Image credit: HubSpot. Data credit: Portent.

Lower bandwidth costs: Bandwidth costs are the costs companies and website owners pay to move their data around telecommunications networks. The farther your data has to go and the faster it needs to get there, the more you’re going to pay in bandwidth costs. The caching that a CDN provides reduces the need for content to travel as far, thus reducing bandwidth costs.

Redundancy and scalability during high traffic times: With multiple servers, a CDN can handle hardware failures better than relying on one origin server alone. If one goes down, another server can pick up the slack. Also, when traffic spikes, a single origin server may not be able to handle the load. Since CDNs are geographically distributed, they spread traffic out over more servers during high traffic times and can handle more traffic than just an origin server.

Improved security: In a DDoS, or distributed denial-of-service attack, malicious actors will try to flood a server or network with traffic to overwhelm it. Most CDNs offer security measures like DDoS mitigation, the ability to block content at the edge, and other enhanced security features.

CDN Cost and Complexity

CDN costs vary by the use case, but getting started can be relatively low or no-cost. Some CDN providers like Cloudflare offer a free tier if you’re just starting a business or for personal or hobby projects, and upgrading to Cloudflare’s Pro tier is just $20 a month for added security features and accelerated mobile load speeds. Other providers, like Fastly, offer a free trial.

Beyond the free tier or trial, pricing for most CDN providers is dynamic. For Amazon CloudFront, for example, you’ll pay different rates for different volumes of data in different regions. It can get complicated quickly, and some CDNs will want to work directly with you on a quote.

At an enterprise scale, understanding if CDN pricing is worth it is a matter of comparing the cost of the CDN to the cost of what you would have paid in egress fees. Some cloud providers and CDNs like those in the Bandwidth Alliance have also teamed up to pass egress savings on to shared users, which can substantially reduce costs related to content storage and delivery. Look into discounts like this when searching for a CDN.

Another way to evaluate if a CDN is right for your business is to look at the opportunity cost of not having one. Using the example above, an e-commerce company that makes $50 per conversion and does $150,000 of business per month stands to lose $80,000 per year due to latency issues. While CDN costs can reach into the thousands per month, the exercise of researching CDN providers and pricing out what your particular spend might be is definitely worth it when you stand to save that much in lost opportunities.
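
Here is the arithmetic behind that opportunity-cost figure, written as a quick sketch so you can plug in your own numbers:

def yearly_loss_from_latency(monthly_revenue, conversion_drop_per_second, extra_seconds=1):
    """Estimate revenue lost per year from slower page loads."""
    return monthly_revenue * conversion_drop_per_second * extra_seconds * 12

# $150,000/month in sales, conversions dropping 4.42% per extra second of load time.
print(f"${yearly_loss_from_latency(150_000, 0.0442):,.0f} lost per year")  # about $79,560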

Setting up a CDN is relatively easy. You just need to create an account and connect it to your origin server. Each provider will have documentation to walk you through how to configure their service. Beyond the basic setup, CDNs offer additional features and services like health checks, streaming logs, and security features you can configure to get the most out of your CDN instance. Fastly, for example, allows you to create custom configurations using their Varnish Configuration Language, or VCL. If you’re just starting out, setting up a CDN can be very simple, but if you need or want more bells and whistles, the capabilities are definitely out there.

Who Benefits Most From a CDN?

While a CDN is beneficial for any company with broad geographic reach or a content-heavy site, some specific industries see more benefits from a CDN than others, including e-commerce, streaming media, and gaming.

E-commerce and CDN: Most e-commerce companies also host lots of images and videos to best display their products to customers, so they have lots of content that needs to be delivered. They also stand to lose the most business from slow loading websites, so implementing a CDN is a natural fit for them.

E-commerce Hosting Provider Delivers One Million Websites

Big Cartel is an e-commerce platform that makes it easy for artists, musicians, and independent business owners to build unique online stores. They’ve long used a CDN to make sure they can deliver more than one million websites around the globe at speed on behalf of their customers.

They switched from Amazon CloudFront to Fastly in 2015. The team felt that Fastly, an API-first, edge cloud platform designed for programmability, gave Big Cartel more functionality and control than CloudFront. With the Fastly VCL, Big Cartel can detect patterns of abusive behavior, block content at the edge, and optimize images for different browsers on the fly. “Fastly has really been a force multiplier for us. They came into the space with published, open, transparent pricing and the configurability of VCL won us over,” said Lee Jensen, Big Cartel’s Technical Director.

Streaming Media and CDN: Like e-commerce sites, streaming media sites host a lot of content, and need to deliver that content with speed and reliability. Anyone who’s lost service in the middle of a Netflix binge knows: buffering and dropped shows won’t cut it.

Movie Streaming Platform Triples Redundancy

Kanopy is a video streaming platform serving more than 4,000 libraries and 45 million patrons worldwide. In order for a film to be streamed without delays or buffering, it must first be transcoded, or broken up into smaller, compressed files known as “chunks.” A feature-length film may translate to thousands of five to 10-second chunks, and losing just one can cause playback issues that disrupt the customer viewing experience.

Kanopy used a provider that offered a CDN, origin storage, and transcoding all in one, but the provider lost chunks, disrupting the viewing experience. One thing their legacy CDN didn’t provide was backups. If the file couldn’t be located in their primary storage, it was gone.

They switched to a multi-cloud stack, engaging Cloudflare as a CDN and tripled their redundancy by using a cold archive, an origin store, and backup storage.

Gaming and CDN: Gaming platforms, too, have a heavy burden of graphics, images, and video to manage. They also need to deliver content quickly and reliably, or they risk games glitching in the middle of a deciding moment.

Gaming Platform Wins When Games Go Viral

SIMMER.io is a community site that makes sharing Unity WebGL games easy for indie game developers. Whenever a game would go viral, their egress costs boiled over, hindering growth. SIMMER.io mirrored their data from Amazon S3 to Backblaze B2 and reduced egress to $0 as a result of the Bandwidth Alliance. They can now grow their site without having to worry about increasing egress costs over time or usage spikes when games go viral.

In addition to the types of companies listed above, financial institutions, media properties, mobile apps, and government entities can benefit from a CDN as well. However, a CDN is not going to be right for everyone. If your audience is hyper-targeted in a specific geographic location, you likely don’t need a CDN and can simply use a geolocated web host.

Pairing CDN With Cloud Storage

A CDN doesn’t cache every single piece of data; there will be times when a user’s request is pulled directly from the origin store. Reliable, affordable, and performant origin storage becomes critical when the cache misses content. By pairing a CDN with origin storage in the cloud, companies can benefit from the elasticity and scalability of the cloud and the performance and speed of a CDN’s edge network.

Still wondering if a CDN is right for you? Let us know your questions in the comments.

The post Optimizing Website Performance With a CDN appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Security updates for Tuesday

Post Syndicated from original https://lwn.net/Articles/867247/rss

Security updates have been issued by Debian (ledgersmb, tnef, and tor), Fedora (nodejs-underscore and tor), openSUSE (aws-cli, python-boto3, python-botocore, fetchmail, firefox, and isync), SUSE (aws-cli, python-boto3, python-botocore, python-service_identity, python-trustme, python-urllib3 and python-PyYAML), and Ubuntu (linux-aws-5.8, linux-azure-5.8, linux-gcp-5.8, linux-oracle-5.8).

Introducing logs from the dashboard for Cloudflare Workers

Post Syndicated from Ashcon Partovi original https://blog.cloudflare.com/workers-dashboard-logs/

If you’re writing code: what can go wrong, will go wrong.

Many developers know the feeling: “It worked in the local testing suite, it worked in our staging environment, but… it’s broken in production?” Testing can reduce mistakes and debugging can help find them, but logs give us the tools to understand and improve what we are creating.

if (this === undefined) {
  console.log("there’s no way… right?") // Narrator: there was.
}

While logging can help you understand when the seemingly impossible is actually possible, it’s something that no developer really wants to set up or maintain on their own. That’s why we’re excited to launch a new addition to the Cloudflare Workers platform: logs and exceptions from the dashboard.

Starting today, you can view and filter the console.log output and exceptions from a Worker… at no additional cost with no configuration needed!

View logs, just a click away

When you view a Worker in the dashboard, you’ll now see a “Logs” tab which you can click on to view a detailed stream of logs and exceptions. Here’s what it looks like in action:

Each log entry contains an event with a list of logs, exceptions, and request headers if it was triggered by an HTTP request. We also automatically redact sensitive URLs and headers such as Authorization, Cookie, or anything else that appears to have a sensitive name.

If you are in the Durable Objects open beta, you will also be able to view the logs and requests sent to each Durable Object. This is a great tool to help you understand and debug the interactions between your Worker and a Durable Object.

For now, the dashboard supports filtering by event status and type, and you can expect more filters to be added very soon. Advanced filtering is already available through the wrangler CLI, which is discussed later in this post.

console.log(), and you’re all set

Getting started with logging for Workers is simple: invoke one of the standard console APIs, such as console.log(), and we handle the rest. That’s it! There’s no extra setup, no configuration needed, and no hidden logging fees.

function logRequest (request) {
  const { cf, headers } = request
  const { city, region, country, colo, clientTcpRtt  } = cf
  
  console.log("Detected location:", [city, region, country].filter(Boolean).join(", "))
  if (clientTcpRtt) {
     console.debug("Round-trip time from client to", colo, "is", clientTcpRtt, "ms")
  }

  // You can also pass an object, which will be interpreted as JSON.
  // This is great if you want to define your own structured log schema.
  console.log({ headers })
}

In fact, you don’t even need to use console.log to view an event from the dashboard. If your Worker doesn’t generate any logs or exceptions, you will still be able to see the request headers from the event.

Advanced filters, from your terminal

If you need more advanced filters you can use wrangler, our command-line tool for deploying Workers. We’ve updated the wrangler tail command to support sampling and a new set of advanced filters. You also no longer need to install or configure cloudflared to use the command, and it’s much faster, so there’s no more waiting around for logs to appear. Here are a few examples:

# Filter by your own IP address, and if there was an uncaught exception.
wrangler tail --format=pretty --ip-address=self --status=error

# Filter by HTTP method, then apply a 10% sampling rate.
wrangler tail --format=pretty --method=GET --sampling-rate=0.1

# Filter using a generic search query.
wrangler tail --format=pretty --search="TypeError"

We recommend using the “pretty” format, since wrangler will output your logs in a colored, human-readable format. (We’re also working on a similar display for the dashboard.)

However, if you want to access structured logs, you can use the “json” format. This is great if you want to pipe your logs to another tool, such as jq, or save them to a file. Here are a few more examples:

# Parses each log event, but only outputs the url.
wrangler tail --format=json | jq .event.request?.url

# You can also specify --once to disconnect the tail after receiving the first log.
# This is useful if you want to run tests in a CI/CD environment.
wrangler tail --format=json --once > event.json

Try it out!

Both logs from the dashboard and wrangler tail are available and free for existing Workers customers. If you would like more information or a step-by-step guide, check out any of the resources below.

Monitor, Evaluate, and Demonstrate Backup Compliance with AWS Backup Audit Manager

Post Syndicated from Steve Roberts original https://aws.amazon.com/blogs/aws/monitor-evaluate-and-demonstrate-backup-compliance-with-aws-backup-audit-manager/

Today, I’m happy to announce the availability of AWS Backup Audit Manager, a new feature of AWS Backup that helps you monitor and evaluate the compliance status of your backups to meet business and regulatory requirements, and enables you to generate reports that help demonstrate compliance to auditors and regulators.

AWS Backup is a fully managed service that provides the ability to initiate policy-driven backups and restores of AWS applications, simplifying the process of protecting data at scale by removing the need for custom scripts and manual processes. However, customers still needed to use their own tooling for verifying that backup policies were being enforced and, as part of proving adherence to auditors, parsing backup transcripts to convert them into auditable reports.

With AWS Backup Audit Manager, you can now continuously and automatically track your backup activity, such as changes to a backup plan or backup vault, and generate automatic daily reports. AWS Backup Audit Manager provides built-in, customizable compliance controls. Simply put, controls are procedures with backup policy parameters, for example the backup frequency or the retention period, that align with your business compliance and regulatory requirements.

You create a framework, scoped to an account and Region, and add the controls you need to it. Backup activities are tracked against the controls, automatically detecting violations of your defined data protection policies, enabling you to take quick corrective actions. To enable tracking of backup activities, AWS Backup Audit Manager requires you to enable monitoring through AWS Config for your backup plans (AWS::Backup::BackupPlan resource type), backup selection (AWS::Backup::BackupSelection), vaults (AWS::Backup::BackupVault), recovery points (AWS::Backup::RecoveryPoint), and AWS Config resource compliance (AWS::Config::ResourceCompliance). You can check the recording status of these resources in the AWS Backup console, using the Resource Tracking section of the Frameworks page.

Once you’ve added the controls you need to your framework, you deploy it. If you have different internal or regulatory standards to meet, you can create and deploy additional frameworks of controls. Once the framework is deployed, you can set up automatic daily reports of your backup activity. These are displayed in a dashboard, and you can also request on-demand reports at any time. You can also import your findings into AWS Audit Manager, a service I wrote about during AWS re:Invent 2020 in this news blog post.

This short video gives a brief overview of the new AWS Backup Audit Manager feature.

Available Controls and Backup Reports
AWS Backup Audit Manager provides five backup governance control templates and backup activity reporting on your backup jobs, copy jobs, and restore jobs. These reports improve visibility into backup activities for a single account and Region, helping you monitor your operational posture and identify failures that may need further action.

When creating a framework, you provide a name, an optional description, and you select whether to use the provided AWS Backup framework type, which includes five pre-defined controls, or you can opt to customize your framework.

Creating an AWS Backup Audit Manager framework

Choosing Custom framework expands the panel to show the available controls, their parameters, and the option to include or exclude them from your framework. The five available controls are titled Backup resources protected by backup plan, Backup plan minimum frequency and minimum retention, Backup prevent recovery point manual deletion, Backup recovery point encrypted, and Backup recovery point minimum retention. To the right of each control’s title you’ll find an info link that describes what the control evaluates, how frequently, and what it means for a resource to be compliant with the control.

Let’s examine a couple of controls. The Backup resources protected by backup plan control enables you to select all supported resources, or those identified by a tag, by type, or a particular resource. This control helps identify gaps in your backup coverage.

Backup resources protected by backup plan control resource selection

The Backup plan minimum frequency and minimum retention control has parameters governing how frequently the backup plan should take backups, and for how long recovery points should be retained. The default settings require backups to occur every hour and recovery points to be retained for a month, but you can customize the settings to meet your business compliance requirements.

Setting backup frequency and retention period controls

You complete your selections for the remaining controls, either including them with parameter values appropriate to your needs or excluding them from the framework, and then click Create framework to complete the process. The new framework will be created and deployed, which will take a few minutes. If needed, you can go back and edit the controls and parameters in a framework at any time.
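
If you prefer to automate this step rather than click through the console, the same framework can be created through the AWS Backup API. The following boto3 sketch is illustrative only; the control name and parameter names shown here are assumptions that you should verify against the AWS Backup Audit Manager documentation before relying on them.

import boto3

backup = boto3.client("backup")

# Hedged sketch: the control and parameter names below are illustrative examples,
# not an authoritative list; verify them in the AWS Backup documentation.
response = backup.create_framework(
    FrameworkName="daily_backup_compliance",
    FrameworkDescription="Minimum frequency and retention checks",
    FrameworkControls=[
        {
            "ControlName": "BACKUP_PLAN_MIN_FREQUENCY_AND_MIN_RETENTION_CHECK",
            "ControlInputParameters": [
                {"ParameterName": "requiredFrequencyUnit", "ParameterValue": "hours"},
                {"ParameterName": "requiredFrequencyValue", "ParameterValue": "24"},
                {"ParameterName": "requiredRetentionDays", "ParameterValue": "35"},
            ],
        }
    ],
)
print(response["FrameworkArn"])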

Once deployed, the controls in the framework will start to evaluate compliance and you can inspect compliance status in the console by selecting the framework. The summary section reports the overall compliance status of the framework and the number of controls in the framework that are compliant or non-compliant, based on your deployed control definitions.

Summary of framework and control compliance

Below the summary, you’ll find a list containing compliance details for each of the controls in the framework, which can be filtered by status. Each control details whether it’s compliant or non-compliant, and how many resources monitored by the control are non-compliant. Clicking a control title will take you directly to the AWS Config dashboard, where you can view more details on the resources identified by the control.

Control compliance detail

Automated reports on backup activity can be used to demonstrate compliance to auditors and regulators. To set up reports, first click the Reports entry in the navigation toolbar, and then click Create report plan. You’ll be asked to select a report template.

Selecting audit report template

With the template selected (I chose Backup jobs report), you fill in a name and optional description, choose the Amazon Simple Storage Service (Amazon S3) bucket where you want the report delivered and the report file formats, and then click Create report plan. Your report will update every 24 hours, and you can run an on-demand report at any time.
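
You can also create the report plan programmatically. Here is a hedged boto3 sketch of that call; the template name, bucket, and delivery settings are placeholders to check against the AWS Backup documentation rather than values from this walkthrough.

import boto3

backup = boto3.client("backup")

# Hedged sketch: the report template name and S3 settings below are illustrative.
backup.create_report_plan(
    ReportPlanName="daily_backup_jobs_report",
    ReportPlanDescription="Daily report of backup job activity",
    ReportDeliveryChannel={
        "S3BucketName": "example-backup-reports-bucket",  # hypothetical bucket
        "S3KeyPrefix": "backup-audit",
        "Formats": ["CSV", "JSON"],
    },
    ReportSetting={"ReportTemplate": "BACKUP_JOB_REPORT"},
)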

Editing report settings

Once a report has been run, either automatically or on-demand, you can view the report data by first selecting the report in your Report plans list, followed by clicking View report. You’ll be taken directly to the chosen S3 location of the report files, where you’ll see one object (report) per chosen file type.

Selecting a report to view

Report files in Amazon S3

Downloading the file shows you the time period in which the resources were evaluated, the backup job details, failure or completion status, status messages, the resource type and backup plan, and more. Here I’ve opened the CSV format file in a spreadsheet.

Backup report data

Open Raven Launch Partnership
With this launch, we’re excited to have Open Raven join us as an AWS Backup partner. Open Raven is a cloud-native data security platform purpose-built for protecting modern data lakes and warehouses. From finding all data locations to proactively identifying exposure, their platform solves a broad spectrum of problems that organizations commonly face when living with large amounts of cloud-based data.

Open Raven Chief Technology Officer Mark Curphey had this to say about the new AWS Backup feature: “To successfully recover from a ransomware attack, organizations need to plan ahead by completing two foundational tasks, identifying critical data and systems and backing them up as per organizational requirements so that they can be protected and recovered. The combination of AWS Backup Audit Manager and Open Raven streamlines this effort, eliminating guesswork and hours of manual toil.”

Start Using AWS Backup Audit Manager Today
AWS Backup Audit Manager is available today in the US East (N. Virginia, Ohio), US West (N. California, Oregon), Canada (Central), EU (Frankfurt, Ireland, London, Paris, Stockholm), South America (Sao Paulo), Asia Pacific (Hong Kong, Mumbai, Seoul, Singapore, Sydney, Tokyo), and Middle East (Bahrain) Regions.

For more information about Backup Audit Manager, refer to this section in the AWS Backup Developer Guide. To get started, visit the AWS Backup console.

— Steve

Cybercriminals Selling Access to Compromised Networks: 3 Surprising Research Findings

Post Syndicated from Paul Prudhomme original https://blog.rapid7.com/2021/08/24/cybercriminals-selling-access-to-compromised-networks-3-surprising-research-findings/

Cybercriminals are innovative, always finding ways to adapt to new circumstances and opportunities. The proof of this can be seen in the rise of a certain variety of activity on the dark web: the sale of access to compromised networks.

This type of dark web activity has existed for decades, but it matured and began to truly thrive amid the COVID-19 global pandemic. The worldwide shift to a remote workforce gave cybercriminals more attack surface to exploit, which fueled sales on underground criminal websites, where buyers and sellers transfer network access to compromised enterprises and organizations to turn a profit.

Having witnessed this sharp rise in breach sales in the cybercriminal ecosystem, IntSights, a Rapid7 company, decided to analyze why and how criminals sell their network access, with an eye toward understanding how to prevent these network compromise events from happening in the first place.

We have compiled our network compromise research, as well as our prevention and mitigation best practices, in the brand-new white paper “Selling Breaches: The Transfer of Enterprise Network Access on Criminal Forums.”

During the process of researching and analyzing, we came across three surprising findings we thought worth highlighting. For a deeper dive, we recommend reading the full white paper, but let’s take a quick look at these discoveries here.

1. The massive gap between average and median breach sales prices

As part of our research, we took a close look at the pricing characteristics of breach sales in the criminal-to-criminal marketplace. Unsurprisingly, pricing varied considerably from one sale to another. A number of factors can influence pricing, including everything from the level of access provided to the value of the victim as a source of criminal revenue.

That said, we found an unexpectedly significant discrepancy between the average price and the median price across the 40 sales we analyzed. The average price came out to approximately $9,640 USD, while the median price was $3,000 USD.

In part, this gap can be attributed to a few unusually high prices among the most expensive offerings. The lowest price in our dataset was $240 USD for access to a healthcare organization in Colombia, but healthcare pricing tends to trend lower than other industries, with a median price of $700 in this sample. On the other end of the spectrum, the highest price was for a telecommunications service provider that came in at about $95,000 USD worth of Bitcoin.

Because of this discrepancy, IntSights researchers view the average price of $9,640 USD as a better indicator of the higher end of the price range, while the median price is more representative of typical pricing for these sales — $3,000 USD was also the single most common price. Nonetheless, it was fascinating to discover this difference and dig into the reasons behind it.
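
To illustrate how a few expensive listings pull the mean far above the median, here is a small Python calculation with invented prices. These numbers are not the actual research dataset, though they echo the $240 floor, $3,000 typical price, and $95,000 ceiling cited above.

from statistics import mean, median

# Invented prices for illustration only; not the actual research data.
prices = [240, 700, 1500, 3000, 3000, 3000, 4500, 8000, 20000, 95000]

print(f"average: ${mean(prices):,.0f}")   # the high outliers drag this upward
print(f"median:  ${median(prices):,.0f}") # this stays near typical pricing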

2. The numerical dominance of tech and telecoms victims

While the sales of network access are a cross-industry phenomenon, technology and telecommunications companies are the most common victims. Not only are they frequent targets, but their compromised access also commands some of the highest prices on the market.

In our sample, tech and telecoms represented 10 of the 46 victims, or 22% of those affected by industry. Of the 10 most expensive offerings we analyzed, four were for tech and telecommunications organizations, and only two had prices under $10,000 USD. A telecommunications service provider located in an unspecified Asian country also had the single most expensive offering in this sample at approximately $95,000 USD.

After investigating the reasoning behind this numerical dominance, IntSights researchers believe that the high value and high number of tech and telecommunications companies as breach victims stem from their usefulness in enabling further attacks on other targets. For example, a cybercriminal who gains access to a mobile service provider could conduct SIM swapping attacks on digital banking customers who use two-factor authentication via SMS.

These prices were surprisingly high compared to other industries, but for good reason: the access may cost more upfront but prove more lucrative in the long run.

3. The low proportion of retail and hospitality victims

As previously mentioned, we broke down the sales of network access based on the industries affected, and to our surprise, only 6.5% of victims were in retail and hospitality. This seemed odd, considering the popularity of the industry as a target for cybercrime. Think of all the headlines in the news about large retail companies falling victim to a breach that exposed millions of customer credentials.

We explored the reasoning behind this low proportion of victims in the space and came to a few conclusions. For example, we theorized that the main customers for these network access sales are ransomware operators, not payment card data collectors. Payment card data collection is likely a more optimal way to monetize access to a retail or hospitality business, whereas putting ransomware on a retail and hospitality network would actually “kill the goose that lays the golden eggs.”

We also found that the second-most expensive offering in this sample was for access to an organization supporting retail and hospitality businesses. The victim was a third party managing customer loyalty and rewards programs, and the seller highlighted how a buyer could monetize this indirect access to its retail and hospitality customer base. This victim may have been more valuable because, among other things, loyalty and rewards programs are softer targets with weaker security than credit cards and bank accounts; thus, they’re easier to defraud.

Learn more about compromised network access sales

Curious to learn more about the how and why of cybercriminals selling compromised network access? Read our white paper, Selling Breaches: The Transfer of Enterprise Network Access on Criminal Forums, for the full story behind this research and how it can inform your security efforts.

Behind the scenes at Atari

Post Syndicated from Ashley Whittaker original https://www.raspberrypi.org/blog/behind-the-scenes-at-atari/

We love Wireframe magazine’s regular feature ‘The principles of game design’. They’re written by video game pioneer Howard Scott Warshaw, who authored several of Atari’s most famous and infamous titles. In the latest issue of Wireframe, he provides a snapshot of the hell-raising that went on behind the scenes at Atari…

A moment of relative calm in Atari’s offices, circa the early 1980s. There’s Howard nearest the camera on the right

Video game creation is unusual in that developers need to be focused intently on achieving design goals while simultaneously battling tunnel vision and re-evaluating those goals. It’s a demanding and frustrating predicament. Therefore, a solid video game creator needs two things: a way to let ideas simmer (since rumination is how games grow from mediocre to fabulous) and a way to blow off steam (since frustration abounds while trying to achieve fabulous). At Atari, there was one place where things both simmered and got steamy… the hot tub. The only thing we couldn’t do was keep a lid on the antics cooked up inside.

The hot tub was situated in the two-storey engineering building. This was ironic, because the hot tub generated way more than two stories in that building. The VCS/2600 and Home Computer development groups were upstairs. The first floor held coin-op development, a kitchen/cafeteria, and an extremely well-appointed gym. The gym featured two appendages: a locker area and the hot tub room. Many shenanigans were hatched and/or executed in the hot tub. One from the more epic end of the spectrum comes to mind: the executive birthday surprise.

Those prizes look pretty impressive

It was during the birthday celebration of a VP who shall remain nameless, but it might have been the one who used to keep a canister of nitrous oxide and another of pure oxygen in his office. The nitrous oxide was for getting high and laughing some time away, while the oxygen was used for rapid sobering up in the event a spontaneous meeting was called (which happened regularly at Atari). As the party raged on, a small crew of revellers migrated to the small but accommodating hot tub room. Various intoxicants (well beyond the scope of nitrous) were being consumed in celebration of the special event (although by this standard, nearly every day was a special event at Atari).

As the party rolled on, inhibitions were shed along with numerous articles of clothing. At one point, the birthday boy was adjudged to be in dire need of a proper tubbing as he hadn’t lost sufficient layers to keep pace with the party at large. The birthday boy disagreed, and the ensuing negotiation took the form of a lively chase around the area. The VP ran out of the hot tub room and headed for the workout area with a wet posse in hot pursuit, all in varying stages of undress. 

Refreshments were readily available at Atari in its heyday

It’s important to note here that although refreshments and revelry were widely available at Atari, one item in short supply was conference rooms. Consequently, meetings could pop up in odd locales. Any place an aggregation could be achieved was a potential meeting spot. The sensitivity of the subject matter would determine the level of privacy required on a case-by-case basis. Since people weren’t always working out, the gym had enough places to sit that it could serve as a decent host for gatherings. And as for sensitivity, the hot tub room was well sound-proofed, so intruding ears weren’t a concern.

As the crew of rowdy revellers followed the VP into the workout area, they were confronted by just such a collection of executives who happened to be meeting at the time. I don’t think the birthday party was on the agenda. However, they may have been pleased that the absentee VP had ultimately decided to join their number. It was embarrassing for some, entertaining for others, and nearly career-ending for a couple. The moral of this story being that Atari executives should never go anywhere without their oxygen tanks in tow.

Between developing games, Howard and Atari’s other programmers found time to play a bit of Frisbee

But morals aside, there was work to be done at Atari. In a place where work can lead to antics and antics can lead to work breakthroughs, it’s difficult at times to suss out the precise boundary between work and antics. It takes passion and commitment to pursue side quests productively and yet remain on task when necessary.

The main reason this was a challenge comes down to the fact there are so many distractions constantly going on. Creative people tend to be creative frequently and spontaneously. Also, their creativity is much more motivated by fascination and interest than it is by task lists or project plans. Fun can break out at any moment, and answering the call isn’t always the right choice, no matter how compelling the siren.

Nice hat

Rob Fulop, creator of Missile Command and Demon Attack for the Atari 2600 (among many other hits) isn’t only a great game maker, he’s also a keen observer of human nature. We used to chat about just where the edge is between work and play at Atari. Those who misjudge it can easily fall off the cliff.

Likewise, we explored the concept of what makes a good game designer. Rob said it’s just the right combination of silly and anal. He believed that the people who did well at Atari (and as game makers in general) were the people who could be silly enough to recognise fun, and anal enough to get all the minutia and details aligned correctly in order to deliver the fun. Of course, Rob (being the poet he is) created a wonderful phrasing to describe those with the right stuff. He put it like this: the people who did well at Atari were the people who could goof around as much as possible but still go to heaven.

Get your copy of Wireframe issue 53

You can read more features like this one in Wireframe issue 53, available directly from Raspberry Pi Press — we deliver worldwide.

Wireframe 53 store cover

And if you’d like a handy digital version of the magazine, you can also download issue 53 for free in PDF format.

The post Behind the scenes at Atari appeared first on Raspberry Pi.

Convert and Watermark Documents Automatically with Amazon S3 Object Lambda

Post Syndicated from Joseph Simon original https://aws.amazon.com/blogs/architecture/convert-and-watermark-documents-automatically-with-amazon-s3-object-lambda/

When you provide access to a sensitive document to someone outside of your organization, you likely need to ensure that the document is read-only. You may also want the document to be associated with a specific user, so that you can trace it if it is shared.

For example, authors often embed user-specific watermarks into their ebooks. This way, if their ebook gets posted to a file-sharing site, they can prevent the purchaser from downloading copies of the ebook in the future.

In this blog post, we provide you a cost-efficient, scalable, and secure solution to efficiently generate user-specific versions of sensitive documents. This solution helps users track who their documents are shared with. This helps prevent fraud and ensure that private information isn’t leaked. Our solution uses a RESTful API, which uses Amazon S3 Object Lambda to convert documents to PDF and apply a watermark based on the requesting user. It also provides a method for authentication and tracks access to the original document.

Architectural overview

S3 Object Lambda processes and transforms data that is requested from Amazon Simple Storage Service (Amazon S3) before it’s sent back to a client. The AWS Lambda function is invoked inline via a standard S3 GET request. It can return different results from the same document based on parameters, such as who is requesting the document. Figure 1 provides a high-level view of the different components that make up the solution.

Document processing architectural diagram

Figure 1. Document processing architectural diagram

Authenticating users with Amazon Cognito

This architecture defines a RESTful API, but users will likely be using a mobile or web application that calls the API. Thus, the application will first need to authenticate users. We do this via Amazon Cognito, which functions as its own identity provider (IdP). You could also use an external IdP, including those that support OpenID Connect and SAML.

Validating the JSON Web Token with API Gateway

Once the user is successfully authenticated with Amazon Cognito, the application will be sent a JSON Web Token (JWT). This JWT contains information about the user and will be used in subsequent requests to the API.

Now that the application has a token, it will make a request to the API, which is provided by Amazon API Gateway. API Gateway provides a secure, scalable entryway into your application. The API Gateway validates the JWT sent from the client with Amazon Cognito to make sure it is valid. If it is validated, the request is accepted and sent on to the Lambda API Handler. If it’s not, the client gets rejected and sent an error code.
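
As an illustration of wiring up that JWT check, here is a hedged boto3 sketch that attaches an Amazon Cognito user pool authorizer to a REST API in API Gateway. The API ID and user pool ARN are placeholders, not values from this solution.

import boto3

apigw = boto3.client("apigateway")

# Placeholders: the REST API ID and user pool ARN below are hypothetical.
authorizer = apigw.create_authorizer(
    restApiId="abc123restapi",
    name="CognitoJwtAuthorizer",
    type="COGNITO_USER_POOLS",
    providerARNs=["arn:aws:cognito-idp:us-east-1:123456789012:userpool/us-east-1_EXAMPLE"],
    identitySource="method.request.header.Authorization",
)
print(authorizer["id"])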

Storing user data with DynamoDB

When the Lambda API Handler receives the request, it parses the JWT to extract the user making the request. It then logs that user, file, and access time into Amazon DynamoDB. Optionally, you may use DynamoDB to store an encoded string that will be used as the watermark, rather than something in plaintext, like user name or email.
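
A minimal sketch of that access-log write might look like the following in Python; the table name and attribute names are assumptions for illustration, not the post’s actual schema.

import time
import boto3

dynamodb = boto3.resource("dynamodb")
access_log = dynamodb.Table("document-access-log")  # hypothetical table name

def log_access(user_id: str, file_key: str) -> None:
    """Record who requested which document, and when."""
    access_log.put_item(
        Item={
            "user_id": user_id,          # parsed from the validated JWT
            "file_key": file_key,        # the requested document
            "accessed_at": int(time.time()),
        }
    )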

Generating the PDF and user-specific watermark

At this point, the Lambda API Handler sends an S3 GET request. However, instead of going to Amazon S3 directly, it goes to a different endpoint that invokes the S3 Object Lambda function. This endpoint is called an S3 Object Lambda Access Point. The S3 GET request contains the original file name and the string that will be used for the watermark.

The S3 Object Lambda function transforms the original file that it downloads from its source S3 bucket. It uses the open-source office suite LibreOffice (and specifically this Lambda layer) to convert the source document to PDF. Once it is converted, a JavaScript library (PDF-Lib) embeds the watermark into the PDF before it’s sent back to the Lambda API Handler function.
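
The general shape of an S3 Object Lambda function is worth seeing. The sketch below is a hedged Python outline of the handler pattern: fetch the original object through the presigned URL supplied in the event, transform it, and return the result with WriteGetObjectResponse. The actual function in this solution performs the conversion with LibreOffice and PDF-Lib; the transform step here is a hypothetical stub.

import boto3
import urllib.request

s3 = boto3.client("s3")

def convert_and_watermark(document_bytes, event):
    # Hypothetical stub; the real implementation converts the document to PDF
    # and embeds a user-specific watermark before returning the bytes.
    return document_bytes

def handler(event, context):
    # S3 Object Lambda passes a presigned URL for the original object,
    # plus a route and token used to return the transformed bytes.
    ctx = event["getObjectContext"]
    original = urllib.request.urlopen(ctx["inputS3Url"]).read()

    transformed = convert_and_watermark(original, event)

    s3.write_get_object_response(
        RequestRoute=ctx["outputRoute"],
        RequestToken=ctx["outputToken"],
        Body=transformed,
    )
    return {"statusCode": 200}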

The Lambda API Handler stores the converted file in a temporary S3 bucket, generates a presigned URL, and sends that URL back to the client as a 302 redirect. Then the client sends a request to that presigned URL to get the converted file.

To keep the temporary S3 bucket tidy, we use an S3 lifecycle configuration with an expiration policy.
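
The presigned-URL hand-off and the lifecycle cleanup described above might look roughly like this sketch; the bucket and key names are placeholders rather than the solution’s actual resources.

import boto3

s3 = boto3.client("s3")

# Generate a short-lived presigned URL for the converted file
# (returned to the client as a 302 redirect by the Lambda API Handler).
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "example-temp-documents", "Key": "converted/report.pdf"},
    ExpiresIn=300,  # five minutes
)

# Expire temporary objects automatically so the bucket stays tidy.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-temp-documents",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-temporary-copies",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Expiration": {"Days": 1},
            }
        ]
    },
)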

Figure 2. Process workflow for document transformation

Figure 2. Process workflow for document transformation

Alternate approach

Before S3 Object Lambda was available, Lambda@Edge was used. However, there are three main issues with using Lambda@Edge instead of S3 Object Lambda:

  1. It is designed to run code closer to the end user to decrease latency, but in this case, latency is not a major concern.
  2. It requires using an Amazon CloudFront distribution, and the single-download pattern described here will not take advantage of Lambda@Edge’s caching.
  3. It has quotas on memory that don’t lend themselves to complex libraries like LibreOffice.

Extending this solution

This blog post describes the basic building blocks for the solution, but it can be extended relatively easily. For example, you could add another function to the API that would convert, resize, and watermark images. To do this, create an S3 Object Lambda function to perform those tasks. Then, add an S3 Object Lambda Access Point to invoke it based on a different API call.

API Gateway has many built-in security features, but you may want to enhance the security of your RESTful API. To do this, add enhanced security rules via AWS WAF. Integrating your IdP into Amazon Cognito can give you a single place to manage your users.

Monitoring any solution is critical, and understanding how an application is behaving end to end can greatly benefit optimization and troubleshooting. Adding AWS X-Ray and Amazon CloudWatch Lambda Insights will show you how functions and their interactions are performing.

Should you decide to extend this architecture, follow the architectural principles defined in AWS Well-Architected, and pay particular attention to the Serverless Application Lens.

Example expanded document processing architecture

Figure 3. Example expanded document processing architecture

Conclusion

You can implement this solution in a number of ways. However, by using S3 Object Lambda, you can transform documents without needing intermediary storage. S3 Object Lambda will also decouple your file logic from the rest of the application.

The Serverless on AWS components mentioned in this post allow you to reduce administrative overhead, saving you time and money.

Finally, the extensible nature of this architecture allows you to add functionality easily as your organization’s needs grow and change.

The following links provide more information on how to use S3 Object Lambda in your architectures:

The Five Ws episode 1: Accreditation models for secure cloud adoption whitepaper

Post Syndicated from Jana Kay original https://aws.amazon.com/blogs/security/the-five-ws-episode-1-accreditation-models-for-secure-cloud-adoption-whitepaper/

AWS whitepapers are a great way to expand your knowledge of the cloud. Authored by Amazon Web Services (AWS) and the AWS community, they provide in-depth content that often addresses specific customer situations.

We’re featuring some of our whitepapers in a new video series, The Five Ws. These short videos outline the who, what, when, where, and why of each whitepaper so you can decide whether to dig into it further.

The first whitepaper we’re featuring is Accreditation Models for Secure Cloud Adoption. This whitepaper provides cloud accreditation best practices to help you capitalize on the security benefits of commercial cloud computing while maximizing efficiency, scalability, and cost reduction. The paper includes a comparative analysis of different accreditation models in use today. Although the paper highlights public sector examples, the best practices also apply to private sector organizations considering cloud adoption.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Jana Kay

Since 2018, Jana Kay has been a cloud security strategist with the AWS Security Growth Strategies team. She develops innovative ways to help AWS customers achieve their objectives, such as security table top exercises and other strategic initiatives. Previously, she was a cyber, counter-terrorism, and Middle East expert for 16 years in the Pentagon’s Office of the Secretary of Defense.

Build Next-Generation Microservices with .NET 5 and gRPC on AWS

Post Syndicated from Matt Cline original https://aws.amazon.com/blogs/devops/next-generation-microservices-dotnet-grpc/

Modern architectures use multiple microservices in conjunction to drive customer experiences. At re:Invent 2015, AWS senior project manager Rob Brigham described Amazon’s architecture of many single-purpose microservices – including ones that render the “Buy” button, calculate tax at checkout, and hundreds more.

Microservices commonly communicate with JSON over HTTP/1.1. These technologies are ubiquitous and human-readable, but they aren’t optimized for communication between dozens or hundreds of microservices.

Next-generation Web technologies, including gRPC and HTTP/2, significantly improve communication speed and efficiency between microservices. AWS offers the most compelling experience for builders implementing microservices. Moreover, the addition of HTTP/2 and gRPC support in Application Load Balancer (ALB) provides an end-to-end solution for next-generation microservices. ALBs can inspect and route gRPC calls, enabling features like health checks, access logs, and gRPC-specific metrics.

This post demonstrates .NET microservices communicating with gRPC via Application Load Balancers. The microservices run on AWS Graviton2 instances, utilizing a custom-built 64-bit Arm processor to deliver up to 40% better price/performance than x86.

Architecture Overview

Modern Tacos is a new restaurant offering delivery. Customers place orders via mobile app, then they receive real-time status updates as their order is prepared and delivered.

The tutorial includes two microservices: “Submit Order” and “Track Order”. The Submit Order service receives orders from the app, then it calls the Track Order service to initiate order tracking. The Track Order service provides streaming updates to the app as the order is prepared and delivered.

Each microservice is deployed in an Amazon EC2 Auto Scaling group. Each group is behind an ALB that routes gRPC traffic to instances in the group.

Shows the communication flow of gRPC traffic from users through an ALB to EC2 instances.
This architecture is simplified to focus on ALB and gRPC functionality. Microservices are often deployed in containers for elastic scaling, improved reliability, and efficient resource utilization. ALB, gRPC, and .NET all work equally effectively in these architectures.

Comparing gRPC and JSON for microservices

Microservices typically communicate by sending JSON data over HTTP. As a text-based format, JSON is readable, flexible, and widely compatible. However, JSON also has significant weaknesses as a data interchange format. JSON’s flexibility makes enforcing a strict API specification difficult — clients can send arbitrary or invalid data, so developers must write rigorous data validation code. Additionally, performance can suffer at scale due to JSON’s relatively high bandwidth and parsing requirements. These factors also impact performance in constrained environments, such as smartphones and IoT devices. gRPC addresses all of these issues.

gRPC is an open-source framework designed to efficiently connect services. Instead of JSON, gRPC sends messages via a compact binary format called Protocol Buffers, or protobuf. Although protobuf messages are not human-readable, they utilize less network bandwidth and are faster to encode and decode. Operating at scale, these small differences multiply to a significant performance gain.

gRPC APIs define a strict contract that is automatically enforced for all messages. Based on this contract, gRPC implementations generate client and server code libraries in multiple programming languages. This allows developers to use higher-level constructs to call services, rather than programming against “raw” HTTP requests.

gRPC also benefits from being built on HTTP/2, a major revision of the HTTP protocol. In addition to the foundational performance and efficiency improvements from HTTP/2, gRPC utilizes the new protocol to support bi-directional streaming data. Implementing real-time streaming prior to gRPC typically required a completely separate protocol (such as WebSockets) that might not be supported by every client.

gRPC for .NET developers

Several recent updates have made gRPC more useful to .NET developers. .NET 5 includes significant performance improvements to gRPC, and AWS has broad support for .NET 5. In May 2021, the .NET team announced their focus on a gRPC implementation written entirely in C#, called “grpc-dotnet”, which follows C# conventions very closely.

Instead of working with JSON, dynamic objects, or strings, C# developers calling a gRPC service use a strongly-typed client, automatically generated from the protobuf specification. This obviates much of the boilerplate validation required by JSON APIs, and it enables developers to use rich data structures. Additionally, the generated code enables full IntelliSense support in Visual Studio.

For example, the “Submit Order” microservice executes this code in order to call the “Track Order” microservice:

using var channel = GrpcChannel.ForAddress("https://track-order.example.com");

var trackOrderClient = new TrackOrder.Protos.TrackOrder.TrackOrderClient(channel);

var reply = await trackOrderClient.StartTrackingOrderAsync(new TrackOrder.Protos.Order
{
    DeliverTo = "Address",
    LastUpdated = Timestamp.FromDateTime(DateTime.UtcNow),
    OrderId = order.OrderId,
    PlacedOn = order.PlacedOn,
    Status = TrackOrder.Protos.OrderStatus.Placed
});

This code calls the StartTrackingOrderAsync method on the Track Order client, which looks just like a local method call. The method intakes a data structure that supports rich data types like DateTime and enumerations, instead of the loosely-typed JSON. The methods and data structures are defined by the Track Order service’s protobuf specification, and the .NET gRPC tools automatically generate the client and data structure classes without requiring any developer effort.

Configuring ALB for gRPC

To make gRPC calls to targets behind an ALB, create a load balancer target group and select gRPC as the protocol version. You can do this through the AWS Management Console, AWS Command Line Interface (CLI), AWS CloudFormation, or AWS Cloud Development Kit (CDK).

Screenshot of the AWS Management Console, showing how to configure a load balancer's target group for gRPC communication.

This CDK code creates a gRPC target group:

var targetGroup = new ApplicationTargetGroup(this, "TargetGroup", new ApplicationTargetGroupProps
{
    Protocol = ApplicationProtocol.HTTPS,
    ProtocolVersion = ApplicationProtocolVersion.GRPC,
    Vpc = vpc,
    Targets = new IApplicationLoadBalancerTarget[] {...}
});

gRPC requests work with target groups utilizing HTTP/2, but the gRPC protocol enables additional features including health checks, request count metrics, access logs that differentiate gRPC requests, and gRPC-specific response headers. gRPC also works with native ALB features like stickiness, multiple load balancing algorithms, and TLS termination.

Deploy the Tutorial

The sample provisions AWS resources via the AWS Cloud Development Kit (CDK). The CDK code is provided in C# so that .NET developers can use a familiar language.

The solution deployment steps include:

  • Configuring a domain name in Route 53.
  • Deploying the microservices.
  • Running the mobile app on AWS Device Farm.

The source code is available on GitHub.

Prerequisites

For this tutorial, you should have these prerequisites:

Configure the environment variables needed by the CDK. In the sample commands below, replace AWS_ACCOUNT_ID with your numeric AWS account ID. Replace AWS_REGION with the name of the region where you will deploy the sample, such as us-east-1 or us-west-2.

If you’re using a *nix shell such as Bash, run these commands:

export CDK_DEFAULT_ACCOUNT=AWS_ACCOUNT_ID
export CDK_DEFAULT_REGION=AWS_REGION

If you’re using PowerShell, run these commands:

$Env:CDK_DEFAULT_ACCOUNT="AWS_ACCOUNT_ID"
$Env:CDK_DEFAULT_REGION="AWS_REGION"
Set-DefaultAWSRegion -Region AWS_REGION

Throughout this tutorial, replace the placeholder values shown in capital letters (such as AWS_ACCOUNT_ID) with values appropriate for your environment.

Save the directory path where you cloned the GitHub repository. In the sample commands below, replace EXAMPLE_DIRECTORY with this path.

In your terminal or PowerShell, run these commands:

cd EXAMPLE_DIRECTORY/src/ModernTacoShop/Common/cdk
cdk bootstrap --context domain-name=PARENT_DOMAIN_NAME
cdk deploy --context domain-name=PARENT_DOMAIN_NAME

The CDK output includes the name of the S3 bucket that will store deployment packages. Save the name of this bucket. In the sample commands below, replace SHARED_BUCKET_NAME with this name.

Deploy the Track Order microservice

Compile the Track Order microservice for the Arm microarchitecture utilized by AWS Graviton2 processors. The TrackOrder.csproj file includes a target that automatically packages the compiled microservice into a ZIP file. You will upload this ZIP file to S3 for use by CodeDeploy. Next, you will utilize the CDK to deploy the microservice’s AWS infrastructure, and then install the microservice on the EC2 instance via CodeDeploy.

The CDK stack deploys these resources:

  • An Amazon EC2 Auto Scaling group.
  • An Application Load Balancer (ALB) using gRPC, targeting the Auto Scaling group and configured with microservice health checks.
  • A subdomain for the microservice, targeting the ALB.
  • A DynamoDB table used by the microservice.
  • CodeDeploy infrastructure to deploy the microservice to the Auto Scaling group.

If you’re using the AWS CLI, run these commands:

cd EXAMPLE_DIRECTORY/src/ModernTacoShop/TrackOrder/src/
dotnet publish --runtime linux-arm64 --self-contained
aws s3 cp ./bin/TrackOrder.zip s3://SHARED_BUCKET_NAME
etag=$(aws s3api head-object --bucket SHARED_BUCKET_NAME \
    --key TrackOrder.zip --query ETag --output text)
cd ../cdk
cdk deploy

The CDK output includes the name of the CodeDeploy deployment group. Use this name to run the next command:

aws deploy create-deployment --application-name ModernTacoShop-TrackOrder \
    --deployment-group-name TRACK_ORDER_DEPLOYMENT_GROUP_NAME \
    --s3-location bucket=SHARED_BUCKET_NAME,bundleType=zip,key=TrackOrder.zip,etag=$etag \
    --file-exists-behavior OVERWRITE

If you’re using PowerShell, run these commands:

cd EXAMPLE_DIRECTORY/src/ModernTacoShop/TrackOrder/src/
dotnet publish --runtime linux-arm64 --self-contained
Write-S3Object -BucketName SHARED_BUCKET_NAME `
    -Key TrackOrder.zip `
    -File ./bin/TrackOrder.zip
Get-S3ObjectMetadata -BucketName SHARED_BUCKET_NAME `
    -Key TrackOrder.zip `
    -Select ETag `
    -OutVariable etag
cd ../cdk
cdk deploy

The CDK output includes the name of the CodeDeploy deployment group. Use this name to run the next command:

New-CDDeployment -ApplicationName ModernTacoShop-TrackOrder `
    -DeploymentGroupName TRACK_ORDER_DEPLOYMENT_GROUP_NAME `
    -S3Location_Bucket SHARED_BUCKET_NAME `
    -S3Location_BundleType zip `
    -S3Location_Key TrackOrder.zip `
    -S3Location_ETag $etag[0] `
    -RevisionType S3 `
    -FileExistsBehavior OVERWRITE

Deploy the Submit Order microservice

The steps to deploy the Submit Order microservice are identical to the Track Order microservice. See that section for details.

If you’re using the AWS CLI, run these commands:

cd EXAMPLE_DIRECTORY/src/ModernTacoShop/SubmitOrder/src/
dotnet publish --runtime linux-arm64 --self-contained
aws s3 cp ./bin/SubmitOrder.zip s3://SHARED_BUCKET_NAME
etag=$(aws s3api head-object --bucket SHARED_BUCKET_NAME \
    --key SubmitOrder.zip --query ETag --output text)
cd ../cdk
cdk deploy

The CDK output includes the name of the CodeDeploy deployment group. Use this name to run the next command:

aws deploy create-deployment --application-name ModernTacoShop-SubmitOrder \
    --deployment-group-name SUBMIT_ORDER_DEPLOYMENT_GROUP_NAME \
    --s3-location bucket=SHARED_BUCKET_NAME,bundleType=zip,key=SubmitOrder.zip,etag=$etag \
    --file-exists-behavior OVERWRITE

If you’re using PowerShell, run these commands:

cd EXAMPLE_DIRECTORY/src/ModernTacoShop/SubmitOrder/src/
dotnet publish --runtime linux-arm64 --self-contained
Write-S3Object -BucketName SHARED_BUCKET_NAME `
    -Key SubmitOrder.zip `
    -File ./bin/SubmitOrder.zip
Get-S3ObjectMetadata -BucketName SHARED_BUCKET_NAME `
    -Key SubmitOrder.zip `
    -Select ETag `
    -OutVariable etag
cd ../cdk
cdk deploy

The CDK output includes the name of the CodeDeploy deployment group. Use this name to run the next command:

New-CDDeployment -ApplicationName ModernTacoShop-SubmitOrder `
    -DeploymentGroupName SUBMIT_ORDER_DEPLOYMENT_GROUP_NAME `
    -S3Location_Bucket SHARED_BUCKET_NAME `
    -S3Location_BundleType zip `
    -S3Location_Key SubmitOrder.zip `
    -S3Location_ETag $etag[0] `
    -RevisionType S3 `
    -FileExistsBehavior OVERWRITE

Data flow diagram

Architecture diagram showing the complete data flow of the sample gRPC microservices application.
  1. The app submits an order via gRPC.
  2. The Submit Order ALB routes the gRPC call to an instance.
  3. The Submit Order instance stores order data.
  4. The Submit Order instance calls the Track Order service via gRPC.
  5. The Track Order ALB routes the gRPC call to an instance.
  6. The Track Order instance stores tracking data.
  7. The app calls the Track Order service, which streams the order’s location during delivery.

Test the microservices

Once the CodeDeploy deployments have completed, test both microservices.

First, check the load balancers’ status. Go to Target Groups in the AWS Management Console, which will list one target group for each microservice. Click each target group, then click “Targets” in the lower details pane. Every EC2 instance in the target group should have a “healthy” status.

Next, verify each microservice via gRPCurl. This tool lets you invoke gRPC services from the command line. Install gRPCurl using the instructions in its project documentation, and then test each microservice:

grpcurl submit-order.PARENT_DOMAIN_NAME:443 modern_taco_shop.SubmitOrder/HealthCheck
grpcurl track-order.PARENT_DOMAIN_NAME:443 modern_taco_shop.TrackOrder/HealthCheck

If a service is healthy, it will return an empty JSON object.

Run the mobile app

You will run a pre-compiled version of the app on AWS Device Farm, which lets you test on a real device without managing any infrastructure. Alternatively, compile your own version via the AndroidApp.FrontEnd project within the solution located at EXAMPLE_DIRECTORY/src/ModernTacoShop/AndroidApp/AndroidApp.sln.

Go to Device Farm in the AWS Management Console. Under “Mobile device testing projects”, click “Create a new project”. Enter “ModernTacoShop” as the project name, and click “Create Project”. In the ModernTacoShop project, click the “Remote access” tab, then click “Start a new session”. Under “Choose a device”, select the Google Pixel 3a running OS version 10, and click “Confirm and start session”.

Screenshot of the AWS Device Farm showing a Google Pixel 3a.

Once the session begins, click “Upload” in the “Install applications” section. Unzip and upload the APK file located at EXAMPLE_DIRECTORY/src/ModernTacoShop/AndroidApp/com.example.modern_tacos.grpc_tacos.apk.zip, or upload an APK that you created.

Screenshot of the gRPC microservices demo Android app, showing the map that displays streaming location data.

Screenshot of the gRPC microservices demo Android app, on the order preparation screen.

Once the app has uploaded, drag up from the bottom of the device screen in order to reach the “All apps” screen. Click the ModernTacos app to launch it.

Once the app launches, enter the parent domain name in the “Domain Name” field. Click the “+” and “-“ buttons next to each type of taco in order to create your order, then click “Submit Order”. The order status will initially display as “Preparing”, and will switch to “InTransit” after about 30 seconds. The Track Order service will stream a random route to the app, updating with new position data every 5 seconds. After approximately 2 minutes, the order status will change to “Delivered” and the streaming updates will stop.

Once you’ve run a successful test, click “Stop session” in the console.

Cleaning up

To avoid incurring charges, use the cdk destroy command to delete the stacks in the reverse order that you deployed them.

You can also delete the resources via CloudFormation in the AWS Management Console.

In addition to deleting the stacks, you must delete the Route 53 hosted zone and the Device Farm project.

Conclusion

This post demonstrated multiple next-generation technologies for microservices, including end-to-end HTTP/2 and gRPC communication over Application Load Balancer, AWS Graviton2 processors, and .NET 5. These technologies enable builders to create microservices applications with new levels of performance and efficiency.

Matt Cline

Matt Cline is a Solutions Architect at Amazon Web Services, supporting customers in his home city of Pittsburgh PA. With a background as a full-stack developer and architect, Matt is passionate about helping customers deliver top-quality applications on AWS. Outside of work, Matt builds (and occasionally finishes) scale models and enjoys running a tabletop role-playing game for his friends.

Ulili Nhaga

Ulili Nhaga is a Cloud Application Architect at Amazon Web Services in San Diego, California. He helps customers modernize, architect, and build highly scalable cloud-native applications on AWS. Outside of work, Ulili loves playing soccer, cycling, Brazilian BBQ, and enjoying time on the beach.

How MEDHOST’s cardiac risk prediction successfully leveraged AWS analytic services

Post Syndicated from Pandian Velayutham original https://aws.amazon.com/blogs/big-data/how-medhosts-cardiac-risk-prediction-successfully-leveraged-aws-analytic-services/

MEDHOST has been providing products and services to healthcare facilities of all types and sizes for over 35 years. Today, more than 1,000 healthcare facilities are partnering with MEDHOST and enhancing their patient care and operational excellence with its integrated clinical and financial EHR solutions. MEDHOST also offers a comprehensive Emergency Department Information System with business and reporting tools. Since 2013, MEDHOST’s cloud solutions have been utilizing Amazon Web Services (AWS) infrastructure, data storage, and computing power to solve complex healthcare business cases.

MEDHOST can utilize the data available in the cloud to provide value-added solutions for hospitals solving complex problems, like predicting sepsis, cardiac risk, and length of stay (LOS), as well as reducing re-admission rates. This requires a solid data lake foundation and an elastic data pipeline to keep up with multi-terabyte data from thousands of hospitals. MEDHOST has invested a significant amount of time evaluating numerous vendors to determine the best solution for its data needs. Ultimately, MEDHOST designed and implemented machine learning/artificial intelligence capabilities by leveraging AWS Data Lab and an end-to-end data lake platform that enables a variety of use cases such as data warehousing for analytics and reporting.

Getting started

MEDHOST’s initial objectives in evaluating vendors were to:

  • Build a low-cost data lake solution to provide cardiac risk prediction for patients based on health records
  • Provide an analytical solution for hospital staff to improve operational efficiency
  • Implement a proof of concept to extend to other machine learning/artificial intelligence solutions

The AWS team proposed AWS Data Lab to architect, develop, and test a solution to meet these objectives. The collaborative relationship between AWS and MEDHOST, AWS’s continuous innovation, excellent support, and technical solution architects helped MEDHOST select AWS over other vendors and products. AWS Data Lab’s well-structured engagement helped MEDHOST define clear, measurable success criteria that drove the implementation of the cardiac risk prediction and analytical solution platform. The MEDHOST team consisted of architects, builders, and subject matter experts (SMEs). By connecting MEDHOST experts directly to AWS technical experts, the MEDHOST team gained a quick understanding of industry best practices and available services, allowing the team to achieve most of the success criteria at the end of a four-day design session. MEDHOST is now in the process of moving this work from its lower to its upper environment to make the solution available for its customers.

Solution

For this solution, MEDHOST and AWS built a layered pipeline consisting of ingestion, processing, storage, analytics, machine learning, and reinforcement components. The following diagram illustrates the Proof of Concept (POC) that was implemented during the four-day AWS Data Lab engagement.

Ingestion layer

The ingestion layer is responsible for moving data from hospital production databases to the landing zone of the pipeline.

The hospital data was stored in an Amazon RDS for PostgreSQL instance and moved to the landing zone of the data lake using AWS Database Migration Service (DMS). DMS made migrating databases to the cloud simple and secure. Using its ongoing replication feature, MEDHOST and AWS implemented change data capture (CDC) quickly and efficiently, so the MEDHOST team could spend more time focusing on the most interesting parts of the pipeline.
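
As a rough illustration of that ongoing-replication piece (not MEDHOST’s actual configuration), a DMS task with CDC could be defined with boto3 along these lines; the endpoint and replication instance ARNs and the table mapping are placeholders.

import json
import boto3

dms = boto3.client("dms")

# Placeholder ARNs; the real values come from the DMS endpoints and
# replication instance created for the pipeline.
task = dms.create_replication_task(
    ReplicationTaskIdentifier="hospital-to-landing-zone-cdc",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="full-load-and-cdc",  # initial load plus ongoing change data capture
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)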

Processing layer

The processing layer was responsible for performing extract, transform, load (ETL) on the data to curate it for subsequent uses.

MEDHOST used AWS Glue within its data pipeline for crawling its data layers and performing ETL tasks. The hospital data copied from RDS to Amazon S3 was cleaned, curated, enriched, denormalized, and stored in parquet format to act as the heart of the MEDHOST data lake and a single source of truth to serve any further data needs. During the four-day Data Lab, MEDHOST and AWS targeted two needs: powering MEDHOST’s data warehouse used for analytics and feeding training data to the machine learning prediction model. Data curation is a critical task that requires an SME, and there were multiple challenges along the way. AWS Glue’s serverless nature, along with the SME’s support during the Data Lab, made developing the required transformations cost efficient and uncomplicated. Scaling and cluster management were handled by the service, which allowed the developers to focus on cleaning data coming from homogenous hospital sources and translating the business logic into code.
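A condensed AWS Glue PySpark sketch of this kind of curation job is shown below. The catalog database, table, bucket, partition key, and field names are assumptions for illustration; MEDHOST’s actual transformations were more extensive.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the crawled landing zone table (CSV files written by DMS)
landing = glue_context.create_dynamic_frame.from_catalog(
    database="landing_db", table_name="hospital_records"  # placeholder names
)

# Minimal clean-up: drop an unused column and fix an ambiguous type
curated = landing.drop_fields(["unused_col"]).resolveChoice(
    specs=[("length_of_stay", "cast:int")]
)

# Write partitioned parquet into the curated zone of the data lake
glue_context.write_dynamic_frame.from_options(
    frame=curated,
    connection_type="s3",
    connection_options={
        "path": "s3://example-curated-zone/hospital_records/",
        "partitionKeys": ["hospital_id"],
    },
    format="parquet",
)

job.commit()
```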

Storage layer

The storage layer provided low-cost, secure, and efficient storage infrastructure.

MEDHOST used Amazon S3 as a core component of its data lake. AWS DMS migration tasks saved data to S3 in .CSV format. Crawling the data with AWS Glue made this landing zone data queryable and available for further processing. The initial AWS Glue ETL job stored the parquet formatted data in the data lake’s curated zone bucket. MEDHOST also used S3 to store the .CSV formatted data set that would be used to train, test, and validate its machine learning prediction model.
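Registering the landing zone in the Data Catalog can be scripted as in the following sketch; the crawler name, IAM role, database, and S3 path are placeholders rather than the actual resources.

```python
import boto3

glue = boto3.client("glue")

# Crawl the CSV files written by DMS so they become queryable Data Catalog tables
glue.create_crawler(
    Name="landing-zone-crawler",                               # placeholder
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",     # placeholder
    DatabaseName="landing_db",
    Targets={"S3Targets": [{"Path": "s3://example-landing-zone/hospital_records/"}]},
)
glue.start_crawler(Name="landing-zone-crawler")
```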

Analytics layer

The analytics layer gave MEDHOST pipeline reporting and dashboarding capabilities.

The data in the curated zone bucket populated by the processing layer was stored in parquet format and partitioned. This made querying with Amazon Athena or Amazon Redshift Spectrum fast and cost efficient.

From the Amazon Redshift cluster, MEDHOST created external tables that were used as staging tables for the MEDHOST data warehouse and implemented UPSERT logic to merge new data into its production tables. To showcase the reporting potential unlocked by the MEDHOST analytics layer, the Redshift cluster was connected to Amazon QuickSight. Within minutes, MEDHOST was able to create interactive analytics dashboards with filtering and drill-down capabilities, such as a chart showing the number of confirmed disease cases per US state.
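The external-table-plus-UPSERT pattern can be expressed roughly as in the sketch below, submitted here through the Redshift Data API. The cluster identifier, database user, schema, table, and role names are illustrative assumptions, not MEDHOST’s actual objects.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# One-time setup: expose the curated zone (Glue Data Catalog) as an external schema
create_external_schema = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS lake
FROM DATA CATALOG DATABASE 'curated_db'
IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftSpectrumRole';
"""
redshift_data.execute_statement(
    ClusterIdentifier="example-dw", Database="prod", DbUser="etl_user",
    Sql=create_external_schema,
)

# Simple UPSERT: delete rows that will be replaced, then insert the new versions.
# batch_execute_statement runs both statements in a single transaction.
redshift_data.batch_execute_statement(
    ClusterIdentifier="example-dw", Database="prod", DbUser="etl_user",
    Sqls=[
        "DELETE FROM prod.patient_metrics USING lake.patient_metrics s "
        "WHERE prod.patient_metrics.patient_id = s.patient_id;",
        "INSERT INTO prod.patient_metrics SELECT * FROM lake.patient_metrics;",
    ],
)
```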

Machine learning layer

The machine learning layer used MEDHOST’s existing data sets to train its cardiac risk prediction model and make it accessible via an endpoint.

Before the Data Lab engagement, the MEDHOST team was not intimately familiar with machine learning. AWS Data Lab architects helped MEDHOST quickly understand machine learning concepts and select a model appropriate for its use case. MEDHOST selected XGBoost as its model because cardiac risk prediction can be framed as a regression problem. MEDHOST’s well-architected data lake enabled it to quickly generate training, testing, and validation data sets using AWS Glue.

Amazon SageMaker abstracted the underlying complexity of setting up infrastructure for machine learning. With a few clicks, MEDHOST started a Jupyter notebook and coded the components leading to fitting and deploying its machine learning prediction model. Finally, MEDHOST created the endpoint for the model and ran REST calls to validate the endpoint and trained model. As a result, MEDHOST achieved the goal of predicting cardiac risk. Additionally, with Amazon QuickSight’s SageMaker integration, AWS made it easy to use SageMaker models directly in visualizations. QuickSight can call the model’s endpoint, send the input data to it, and put the inference results into the existing QuickSight data sets. This capability made it easy to display the results of the models directly in the dashboards. Read more about QuickSight’s SageMaker integration here.
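The path from notebook code to a live endpoint can be sketched with the SageMaker Python SDK as follows. The S3 locations, execution role ARN, and hyperparameters are illustrative assumptions rather than MEDHOST’s actual training configuration.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

# Built-in XGBoost container for the current region
container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

xgb = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-ml-bucket/models/",
    sagemaker_session=session,
)
xgb.set_hyperparameters(objective="reg:squarederror", num_round=200, max_depth=5, eta=0.2)

# Train on the CSV data sets generated by the Glue jobs, then deploy an endpoint
xgb.fit({
    "train": TrainingInput("s3://example-ml-bucket/train/", content_type="text/csv"),
    "validation": TrainingInput("s3://example-ml-bucket/validation/", content_type="text/csv"),
})
predictor = xgb.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```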

Reinforcement layer

Finally, the reinforcement layer guaranteed that the results of the MEDHOST model were captured and processed to improve performance of the model.

The MEDHOST team went beyond the original goal and created an inference microservice to interact with the endpoint for prediction, abstracted the machine learning endpoint behind a well-defined domain REST endpoint, and added a standard security layer to the MEDHOST application.

When there is a real-time call from the facility, the inference microservice gets an inference from the SageMaker endpoint. Records containing the input and inference data are fed back into the data pipeline. MEDHOST used Amazon Kinesis Data Streams to push records in real time. However, since retraining the machine learning model does not need to happen in real time, Amazon Kinesis Data Firehose enabled MEDHOST to micro-batch records and efficiently save them to the landing zone bucket so that the data could be reprocessed.
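The feedback path from the inference microservice back into the pipeline can be as small as the following sketch; the stream name and record fields are hypothetical.

```python
import json
import boto3

kinesis = boto3.client("kinesis")


def publish_inference(record: dict) -> None:
    """Push the model input and its inference result back into the data pipeline."""
    kinesis.put_record(
        StreamName="inference-feedback",                          # placeholder stream name
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=str(record.get("encounter_id", "unknown")),  # placeholder key
    )
```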

Conclusion

Collaborating with AWS Data Lab enabled MEDHOST to:

  • Store a single source of truth with a low-cost storage solution (data lake)
  • Build a complete data pipeline for a low-cost data analytics solution
  • Create almost production-ready code for cardiac risk prediction

The MEDHOST team learned many concepts related to data analytics and machine learning within four days. AWS Data Lab truly helped MEDHOST deliver results in an accelerated manner.


About the Authors

Pandian Velayutham is the Director of Engineering at MEDHOST. His team is responsible for delivering cloud solutions, integration and interoperability, and business analytics solutions. MEDHOST utilizes a modern technology stack to provide innovative solutions to its customers. Pandian Velayutham is a technology evangelist and public cloud technology speaker.

George Komninos is a Data Lab Solutions Architect at AWS. He helps customers convert their ideas to a production-ready data product. Before AWS, he spent 3 years at Alexa Information domain as a data engineer. Outside of work, George is a football fan and supports the greatest team in the world, Olympiacos Piraeus.

Queue Integration with Third-party Services on AWS

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/queue-integration-with-third-party-services-on-aws/

Commercial off-the-shelf software and third-party services can present an integration challenge in event-driven workflows when they do not natively support AWS APIs. This is even more impactful when a workflow is subject to unpredicted usage spikes, and you want to increase decoupling and fault tolerance. Given the third-party nature of services, polling an Amazon Simple Queue Service (SQS) queue and having built-in AWS API handling logic may not be an immediate option.

In such cases, AWS Lambda helps offload the Amazon SQS queue integration and AWS API handling to an additional layer. The success of this approach depends on how well exception handling is implemented across the different interacting services. In this blog post, we outline issues to consider when adopting this design pattern. We also share a reusable solution.

Design pattern for third-party integration with SQS

With this design pattern, one or more services (producers) asynchronously invoke other third-party downstream consumer services. They publish messages to an Amazon SQS queue, which acts as buffer for requests. Producers provide all commands and other parameters required for consumer service execution with the message.

As messages are written to the queue, the queue is configured to invoke a message broker (implemented as AWS Lambda) for each message. AWS Lambda can interact natively with target AWS services such as Amazon EC2, Amazon Elastic Container Service (ECS), or Amazon Elastic Kubernetes Service (EKS). It can also be configured to use an Amazon Virtual Private Cloud (VPC) interface endpoint to establish a connection to VPC resources without traversing the internet. The message broker assigns the tasks to consumer services by invoking the RunTask API of Amazon ECS and AWS Fargate (see Figure 1.)

Figure 1. On-premises and AWS queue integration for third-party services using AWS Lambda
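A stripped-down version of such a message broker might look like the following Lambda handler. The message fields, cluster, task definition, and networking values are assumptions for illustration, not a definitive implementation.

```python
import json

import boto3

ecs = boto3.client("ecs")


def handler(event, context):
    # SQS-triggered Lambda: each record carries the parameters for one consumer task
    for record in event["Records"]:
        message = json.loads(record["body"])
        ecs.run_task(
            cluster=message["cluster"],
            launchType="FARGATE",
            taskDefinition=message["task_definition"],
            networkConfiguration={
                "awsvpcConfiguration": {
                    "subnets": message["subnets"],
                    "securityGroups": message["security_groups"],
                    "assignPublicIp": "DISABLED",
                }
            },
            overrides={
                "containerOverrides": [
                    {"name": message["container_name"], "command": message["command"]}
                ]
            },
        )
```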

The message broker asynchronously invokes the API in ‘fire-and-forget’ mode. Therefore, error handling must be built in to respond to API invocation errors. In an event-driven scenario, errors will occur if you asynchronously call the third-party service hundreds or thousands of times and reach Service Quotas. This is a potential issue with RunTask API actions, or with a large volume of concurrent tasks running on AWS Fargate. Two mechanisms can help troubleshoot API request errors.

  1. API retries with exponential backoff. The message broker retries a configurable number of times, with sleep intervals and exponential backoff in between. This enforces progressively longer waits between retries for consecutive error responses. If the RunTask API fails to process the request and initiate the third-party service, the message remains in the queue for a subsequent retry. The AWS General Reference provides further guidance.
  2. API error handling. Error handling and consequent logging should be implemented at every step. Since several services work in tandem, crucial debugging information from errors may otherwise be lost. Error handling also provides an opportunity to define automated corrective actions or notifications when an event occurs. The message broker can publish failure notifications, including the root cause, to an Amazon Simple Notification Service (SNS) topic.

SNS topic subscription can be configured via different protocols. You can email a distribution group for active monitoring and processing of errors. If persistence is required for messages that failed to process, error handling can be associated directly with SQS by configuring a dead letter queue.
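Both mechanisms can be combined in the broker, as in this sketch. The retryable error codes, retry limit, and SNS topic ARN are assumptions; raising after the final attempt lets the SQS event source redeliver the message.

```python
import time

import boto3
from botocore.exceptions import ClientError

ecs = boto3.client("ecs")
sns = boto3.client("sns")

MAX_RETRIES = 5
FAILURE_TOPIC_ARN = "arn:aws:sns:us-east-1:111122223333:broker-failures"  # placeholder


def run_task_with_backoff(run_task_kwargs: dict) -> dict:
    for attempt in range(MAX_RETRIES):
        try:
            return ecs.run_task(**run_task_kwargs)
        except ClientError as error:
            code = error.response["Error"]["Code"]
            # Retry only transient, throttling-style errors (example codes)
            if code in ("ThrottlingException", "LimitExceededException", "ServerException"):
                time.sleep(2 ** attempt)  # exponential backoff: 1, 2, 4, 8, 16 seconds
                continue
            raise
    # Retries exhausted: notify operators, then fail so SQS retries the message later
    sns.publish(TopicArn=FAILURE_TOPIC_ARN, Message="RunTask retry limit reached")
    raise RuntimeError("RunTask retry limit reached")
```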

Reference implementation for third-party integration with SQS

We implemented the design pattern in Figure 1 with Broad Institute’s Cell Painting application workflow, which performs morphological profiling from microscopy cell images and runs on Amazon EC2. It interacts with CellProfiler version 3.0 cell image analysis software as the downstream consumer hosted on ECS/Fargate. Every invocation of CellProfiler required approximately 1,500 tasks for a single processing step.

Resource constraints determined the rate of scale-out; in this case, the constraint was Amazon ECS task creation. Address space for Amazon ECS subnets should be large enough to prevent running out of available IPs within your VPC. If Amazon ECS Service Quotas impose further constraints, a quota increase can be requested.

Exceptions must be handled both when validating and initiating requests. As part of the validation workflow, exceptions are captured as follows, also shown in Figure 2.

1. Invalid arguments exception. The message broker validates that the SQS message contains all the information needed to initiate the ECS task, including the subnets, security groups, and container names required to start it; if any are missing, the broker raises an exception.

2. Retry limit exception. On each iteration, the message broker evaluates whether the SQS retry limit has been reached before invoking the RunTask API. When the retry limit is reached, it sends a failure notification to SNS and exits.

Figure 2. Exception handling flow during request validation
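The two validation checks can be sketched as follows. The required field names and the retry limit are illustrative; the receive count is taken from the SQS record’s ApproximateReceiveCount attribute.

```python
def validate_message(message: dict, receive_count: int, retry_limit: int = 5) -> None:
    # 1. Invalid arguments: every field needed to call RunTask must be present
    required = ("cluster", "task_definition", "subnets", "security_groups", "container_name")
    missing = [field for field in required if field not in message]
    if missing:
        raise ValueError(f"Missing required fields: {missing}")

    # 2. Retry limit: stop once the SQS receive count exceeds the configured limit
    #    (receive_count comes from record["attributes"]["ApproximateReceiveCount"])
    if receive_count > retry_limit:
        raise RuntimeError("SQS retry limit reached")
```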

As part of the initiation workflow, exceptions are handled as follows, shown in Figure 3:

1. ECS/Fargate API and concurrent execution limitations. The message broker catches API exceptions when calling the RunTask API operation. These exceptions can include:

    • When the call to launch tasks exceeds the maximum allowed API request limit for your AWS account
    • When failing to retrieve security group information
    • When you have reached the limit on the number of tasks you can run concurrently

With each of the preceding exceptions, the broker will increase the retry count.

2. Networking and IP space limitations. Network interface timeouts received after initiating the ECS task set off an Amazon CloudWatch Events rule, causing the message broker to re-initiate the ECS task.

Figure 3. Exception handling flow during request initiation
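The rule that reacts to stopped tasks can be created as in the sketch below. The rule name, cluster ARN, and target Lambda ARN are placeholders, and a real deployment would also filter on the specific stop reasons observed (for example, ENI or IP exhaustion) and grant the rule permission to invoke the function.

```python
import json

import boto3

events = boto3.client("events")

# Re-drive tasks that stopped shortly after launch (for example, due to networking limits)
events.put_rule(
    Name="retry-stopped-consumer-tasks",  # placeholder
    EventPattern=json.dumps({
        "source": ["aws.ecs"],
        "detail-type": ["ECS Task State Change"],
        "detail": {
            "lastStatus": ["STOPPED"],
            "clusterArn": ["arn:aws:ecs:us-east-1:111122223333:cluster/consumer"],  # placeholder
        },
    }),
)

# Send matching events to the message broker so it can re-initiate the task.
# The broker Lambda also needs a resource-based permission allowing events.amazonaws.com.
events.put_targets(
    Rule="retry-stopped-consumer-tasks",
    Targets=[{
        "Id": "message-broker",
        "Arn": "arn:aws:lambda:us-east-1:111122223333:function:message-broker",  # placeholder
    }],
)
```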

While we specifically address downstream consumer services running on ECS/Fargate, this solution can be adjusted for third-party services running on Amazon EC2 or EKS. With EC2, the message broker must be adjusted to interact with the RunInstances API and to handle its API request errors. Integration with downstream consumers on Amazon EKS requires that the AWS Lambda function is associated, via its IAM role, with a Kubernetes service account. A Python client for Kubernetes can be used to simplify interaction with the Kubernetes REST API, and AWS Lambda would invoke the corresponding run API.

Conclusion

This pattern is useful when queue polling is not an immediate option. This is typical with event-driven workflows involving third-party services and vendor applications subject to unpredictable, intermittent load spikes. Exception handling is essential for these types of workflows. Offloading AWS API handling to a separate layer orchestrated by AWS Lambda can improve the resiliency of such third-party services on AWS. This pattern represents an incremental optimization until the third party provides native SQS integration. It can be achieved with the initial move to AWS, for example as part of the V1 AWS design strategy for third-party services.

Some limitations should be acknowledged. While the pattern enables graceful failure, it does not prevent the overloading of the ECS RunTask API. Because it invokes the Amazon ECS RunTask API in ‘fire-and-forget’ mode, it does not monitor service execution once a task has been successfully invoked. Therefore, it should be adopted when direct queue polling is not an option. In our example, Broad Institute’s CellProfiler application enabled direct queue polling with its subsequent product version of Distributed CellProfiler.

Further reading

The referenced deployment with consumer services on Amazon ECS can be accessed via AWSLabs.
