
How Roche democratized access to data with Google Sheets and Amazon Redshift Data API

Post Syndicated from Dr. Yannick Misteli original https://aws.amazon.com/blogs/big-data/how-roche-democratized-access-to-data-with-google-sheets-and-amazon-redshift-data-api/

This post was co-written with Dr. Yannick Misteli, João Antunes, and Krzysztof Wisniewski from the Roche global Platform and ML engineering team as the lead authors.

Roche is a Swiss multinational healthcare company that operates worldwide. Roche is the largest pharmaceutical company in the world and the leading provider of cancer treatments globally.

In this post, Roche’s global Platform and machine learning (ML) engineering team discusses how they used the Amazon Redshift Data API to democratize access to the data in their Amazon Redshift data warehouse with Google Sheets (gSheets).

Business needs

Go-To-Market (GTM) is the domain that lets Roche understand customers and create and deliver valuable services that meet their needs. This lets them get a better understanding of the health ecosystem and provide better services for patients, doctors, and hospitals. It extends beyond health care professionals (HCPs) to a larger Healthcare ecosystem consisting of patients, communities, health authorities, payers, providers, academia, competitors, etc. Data and analytics are essential to supporting our internal and external stakeholders in their decision-making processes through actionable insights.

For this mission, Roche embraced the modern data stack and built a scalable solution in the cloud.

Driving true data democratization requires not only providing business leaders with polished dashboards or data scientists with SQL access, but also addressing the requirements of the business users who need the data. For this purpose, most business users (such as analysts) rely on Excel (or gSheets in the case of Roche) for data analysis.

Providing access to data in Amazon Redshift to these gSheets users is a non-trivial problem. Without a powerful and flexible tool that lets data consumers use self-service analytics, most organizations will not realize the promise of the modern data stack. To solve this problem, we want to empower every data analyst who doesn’t have an SQL skillset with a means by which they can easily access and manipulate data in the applications that they are most familiar with.

The Roche GTM organization uses the Amazon Redshift Data API to simplify the integration between gSheets and Amazon Redshift, and thus meet the analytical processing and querying needs of their business users. The Amazon Redshift Data API lets you painlessly access data in Amazon Redshift from all types of traditional, cloud-native, containerized, and serverless web service-based and event-driven applications. The Data API simplifies data access, ingest, and egress from the languages supported by the AWS SDK, such as Python, Go, Java, Node.js, PHP, Ruby, and C++, so that you can focus on building applications instead of managing infrastructure. The process they developed with the Amazon Redshift Data API has significantly lowered the barrier to entry for new users who don't have any data warehousing experience.
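
As an illustration of the call pattern that the gSheets integration described below relies on, here is a minimal Python (boto3) sketch that submits a statement, polls for completion, and fetches the result set. The cluster, database, secret, and table names are placeholders, not Roche's actual configuration.

import time
import boto3

# Placeholder identifiers -- replace with your own cluster, database, and secret.
CLUSTER_ID = "my-redshift-cluster"
DATABASE = "dev"
SECRET_ARN = "arn:aws:secretsmanager:eu-west-1:123456789012:secret:redshift-user"

client = boto3.client("redshift-data")

# Submit the query; the call returns immediately with a statement ID.
statement = client.execute_statement(
    ClusterIdentifier=CLUSTER_ID,
    Database=DATABASE,
    SecretArn=SECRET_ARN,
    Sql="SELECT product, SUM(sales) FROM gtm.sales GROUP BY product",  # hypothetical table
)

# Poll until the statement reaches a terminal state.
while True:
    status = client.describe_statement(Id=statement["Id"])
    if status["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

# Fetch the result set once the statement has finished successfully.
if status["Status"] == "FINISHED":
    result = client.get_statement_result(Id=statement["Id"])
    for record in result["Records"]:
        print([list(field.values())[0] for field in record])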

Use-Case

In this post, you will learn how to integrate Amazon Redshift with gSheets to pull data sets directly back into gSheets. These mechanisms are facilitated through the use of the Amazon Redshift Data API and Google Apps Script. Google Apps Script is a programmatic way of manipulating and extending gSheets and the data that they contain.

Architecture

Because Apps Script is natively a cloud-based JavaScript platform, it is possible to include publicly available JavaScript libraries such as jQuery QueryBuilder.

The jQuery QueryBuilder library facilitates the creation of standard SQL queries via a simple-to-use graphical user interface. Once a query is in place, the Redshift Data API can be used to retrieve the data directly into gSheets. The following diagram illustrates the overall process from a technical standpoint:

Even though Apps Script is, in fact, JavaScript, the AWS-provided SDKs for the browser (Node.js and React) cannot be used on the Google platform, because they require specific properties that are native to the underlying infrastructure. It is still possible, however, to authenticate and access AWS resources through the underlying API calls. Here is an example of how to achieve that.

You can use an access key ID and a secret access key to authenticate requests to AWS by using the code in the example linked above. We recommend following the principle of least privilege when granting access to this programmatic user, or assuming a role with temporary credentials instead. Because each user requires a different set of permissions on the Redshift objects (database, schema, and table), each user has their own access credentials. These credentials are stored securely in AWS Secrets Manager. Therefore, the programmatic user only needs a set of permissions that enables it to retrieve secrets from AWS Secrets Manager and execute queries against the Redshift Data API.

Code example for Apps Script to use the Data API

In this section, you will learn how to pull existing data back into a new gSheets document. This section will not cover how to parse the data from the jQuery QueryBuilder library, as that is not within the main scope of the article.

The jQuery QueryBuilder library can be included with the following script tag:

    <script src="https://cdn.jsdelivr.net/npm/jQuery-QueryBuilder/dist/js/query-builder.standalone.min.js"></script>
  1. In the AWS console, go to Secrets Manager and create a new secret to store the database credentials to access the Redshift Cluster: username and password. These will be used to grant Redshift access to the gSheets user.
  2. In the AWS console, create a new IAM user with programmatic access, and generate the corresponding Access Key credentials. The only set of policies required for this user is to be able to read the secret created in the previous step from the AWS Secrets Manager service and to query the Redshift Data API.

    Below is the policy document:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "VisualEditor0",
          "Effect": "Allow",
          "Action": [
            "secretsmanager:GetSecretValue",
            "secretsmanager:DescribeSecret"
          ],
          "Resource": "arn:aws:secretsmanager:*::secret:*"
        },
        {
          "Sid": "VisualEditor1",
          "Effect": "Allow",
          "Action": "secretsmanager:ListSecrets",
          "Resource": "*"
        },
        {
          "Sid": "VisualEditor2",
          "Effect": "Allow",
          "Action": "redshift-data:*",
          "Resource": "arn:aws:redshift:*::cluster:*"
        }
      ]
    }

  3. Access the Google Apps Script console. Create an aws.gs file with the code available here. This will let you perform authenticated requests to the AWS services by providing an access key and a secret access key.
  4. Initialize the AWS variable by providing the access key and secret access key created in step 2.
    AWS.init("<ACCESS_KEY>", "<SECRET_KEY>");

  5. Request the Redshift username and password from the AWS Secrets Manager:
    function runGetSecretValue_(secretId) {
      // Call the Secrets Manager GetSecretValue API through the aws.gs request helper.
      var resultJson = AWS.request(
        getSecretsManagerTypeAWS_(),
        getLocationAWS_(),
        'secretsmanager.GetSecretValue',
        {"Version": getVersionAWS_()},
        method='POST',
        payload={
          "SecretId": secretId
        },
        headers={
          "X-Amz-Target": "secretsmanager.GetSecretValue",
          "Content-Type": "application/x-amz-json-1.1"
        }
      );

      Logger.log("Get Secret Value result: " + resultJson);
      return JSON.parse(resultJson);
    }

  6. Query a table using the Amazon Redshift Data API:
    function runExecuteStatement_(sql) {
      // Submit the SQL statement to the Redshift Data API through the aws.gs request helper.
      var resultJson = AWS.request(
        getTypeAWS_(),
        getLocationAWS_(),
        'RedshiftData.ExecuteStatement',
        {"Version": getVersionAWS_()},
        method='POST',
        payload={
          "ClusterIdentifier": getClusterIdentifierReshift_(),
          "Database": getDataBaseRedshift_(),
          "DbUser": getDbUserRedshift_(),
          "Sql": sql
        },
        headers={
          "X-Amz-Target": "RedshiftData.ExecuteStatement",
          "Content-Type": "application/x-amz-json-1.1"
        }
      );

      Logger.log("Execute Statement result: " + resultJson);
      return JSON.parse(resultJson);
    }

  7. The result can then be displayed as a table in gSheets:
    function fillGsheet_(recordArray) {
      adjustRowsCount_(recordArray);

      var rowIndex = 1;
      for (var i = 0; i < recordArray.length; i++) {
        var rows = recordArray[i];
        for (var j = 0; j < rows.length; j++) {
          var columns = rows[j];
          rowIndex++;
          var columnIndex = 'A';

          for (var k = 0; k < columns.length; k++) {
            var field = columns[k];
            var value = getFieldValue_(field);
            var range = columnIndex + rowIndex;
            addToCell_(range, value);

            columnIndex = nextChar_(columnIndex);
          }
        }
      }
    }

  8. Once finished, the Apps Script can be deployed as an Addon that enables end-users from an entire organization to leverage the capabilities of retrieving data from Amazon Redshift directly into their spreadsheets. Details on how Apps Script code can be deployed as an Addon can be found here.

How users access Google Sheets

  1. Open a gSheet, and go to manage addons -> Install addon:
  2. Once the Addon is successfully installed, select the Addon menu and select Redshift Synchronization. A dialog will appear prompting the user to select the combination of database, schema, and table from which to load the data.
  3. After choosing the intended table, a new panel will appear on the right side of the screen. Then, the user is prompted to select which columns to retrieve from the table, apply any filtering operation, and/or apply any aggregations to the data.
  4. Upon submitting the query, Apps Script translates the user selection into a query that is sent to the Amazon Redshift Data API. Then, the returned data is transformed and displayed as a regular gSheet table:

Security and Access Management

In the scripts above, there is a direct integration between AWS Secrets Manager and Google Apps Script. The scripts above can extract the currently-authenticated user’s Google email address. Using this value and a set of annotated tags, the script can appropriately pull the user’s credentials securely to authenticate the requests made to the Amazon Redshift cluster. Follow these steps to set up a new user in an existing Amazon Redshift cluster. Once the user has been created, follow these steps for creating a new AWS Secrets Manager secret for your cluster. Make sure that the appropriate tag is applied with the key of “email” along with the corresponding user’s Google email address. Here is a sample configuration that is used for creating Redshift groups, users, and data shares via the Redshift Data API:

connection:
  redshift_super_user_database: dev
  redshift_secret_name: dev_
  redshift_cluster_identifier: dev-cluster
  redshift_secrets_stack_name: dev-cluster-secrets
  environment: dev
  aws_region: eu-west-1
  tags:
    - key: "Environment"
      value: "dev"
users:
  - name: user1
    email: [email protected]
data_shares:
  - name: test_data_share
    schemas:
      - schema1
    redshift_namespaces:
      - USDFJIL234234WE
group:
  - name: readonly
    users:
      - user1
    databases:
      - database: database1
        exclude-schemas:
          - public
          - pg_toast
          - catalog_history
        include-schemas:
          - schema1
        grant:
          - select
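
To illustrate the tag-based credential lookup described in this section, the following Python (boto3) sketch finds the secret whose email tag matches the signed-in user and reads the stored Redshift credentials. The tag key and the JSON layout of the secret are assumptions based on the description above, not Roche's actual implementation.

import json
import boto3

secretsmanager = boto3.client("secretsmanager")

def get_redshift_credentials_for(email):
    """Return the Redshift credentials stored for a given Google email address.

    Assumes each per-user secret is tagged with the key "email", as described above.
    """
    response = secretsmanager.list_secrets(
        Filters=[
            {"Key": "tag-key", "Values": ["email"]},
            {"Key": "tag-value", "Values": [email]},
        ]
    )
    if not response["SecretList"]:
        raise ValueError("No secret tagged with email " + email)

    secret_arn = response["SecretList"][0]["ARN"]
    secret_value = secretsmanager.get_secret_value(SecretId=secret_arn)
    # Assumed layout: {"username": "...", "password": "..."}
    return json.loads(secret_value["SecretString"])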

Operational Metrics and Improvement

Providing business users with direct access to live data hosted in Amazon Redshift, and enabling true self-service, decreases the burden on platform teams to provide data extracts or other mechanisms to deliver up-to-date information. Additionally, because different files and versions of the data are no longer in circulation, the business risk of reporting different key figures or KPIs is reduced, and overall process efficiency is improved.

The initial success of this add-on in GTM led to the extension of this to a broader audience, where we are hoping to serve hundreds of users with all of the internal and public data in the future.

Conclusion

In this post, you learned how to create new Amazon Redshift tables and pull existing Redshift tables into a Google Sheet so that business users can easily integrate with and manipulate the data. The integration is seamless and demonstrates how easy the Amazon Redshift Data API makes it to integrate external applications, such as Google Sheets, with Amazon Redshift. The use cases outlined above are just a few examples of how the Amazon Redshift Data API can be applied to simplify interactions between users and Amazon Redshift clusters.


About the Authors

Dr. Yannick Misteli is leading cloud platform and ML engineering teams in global product strategy (GPS) at Roche. He is passionate about infrastructure and operationalizing data-driven solutions, and he has broad experience in driving business value creation through data analytics.

João Antunes is a Data Engineer in the Global Product Strategy (GPS) team at Roche. He has a track record of deploying Big Data batch and streaming solutions for the telco, finance, and pharma industries.

Krzysztof Wisniewski is a back-end JavaScript developer in the Global Product Strategy (GPS) team at Roche. He is passionate about full-stack development from the front-end through the back-end to databases.

Matt Noyce is a Senior Cloud Application Architect at AWS. He works together primarily with Life Sciences and Healthcare customers to architect and build solutions on AWS for their business needs.

Debu Panda, a Principal Product Manager at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences such as re:Invent, Oracle OpenWorld, and JavaOne. He is the lead author of EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

How Viasat scaled their big data applications by migrating to Amazon EMR

Post Syndicated from Manoj Gundawar original https://aws.amazon.com/blogs/big-data/how-viasat-scaled-their-big-data-applications-by-migrating-to-amazon-emr/

This post is co-written with Manoj Gundawar from Viasat.

Viasat is a satellite internet service provider based in Carlsbad, CA, with operations across the United States and worldwide. Viasat’s ambition is to be the first truly global, scalable, broadband service provider with a mission to deliver connections that can change the world. Viasat operates across three main business segments: satellite services, commercial networks, and government systems, providing high-speed satellite broadband services and secure networking systems.

In this post, we discuss how migrating our on-premises big data workloads to Amazon EMR helped us achieve a fully managed cloud-native solution and freed us from the constraints of our legacy on-premises solution, so we can focus on business innovation with a lower TCO.

Challenge with the legacy big data environment

Viasat’s big data application, Usage Data Mart, is a high-volume, low-latency Hadoop-based solution that ingests internet usage data from a multitude of source systems. It curates the data, aggregates it with data from other sources, and provides it in an optimized system to support high-volume access for reporting and web interfaces. This critical application processes over 1.3 billion internet usage records per day, sourced from various upstream systems. Viasat customer support representatives use this data through APIs and reports to assist end users with any queries regarding their internet usage. Various internal teams and customers also use the data for usage accounting, tracking, compliance, and billing purposes.

The Usage Data Mart was previously implemented in an on-premises footprint of over 40 nodes with three independent clusters, all running a commercial distribution of Hadoop to process our big data workload. The legacy Hadoop environment required three times the data replication to achieve high availability, which resulted in a large infrastructure footprint. In this architecture, we had a data ingestion cluster to ingest data from a Kafka streaming service, MySQL, and Oracle databases. We had extract, transform, and load (ETL) jobs for each data source and data domain or model, and most of the ETL jobs filter, curate, canonicalize, and aggregate data (by specific keys) using the MapReduce framework and load it into HBase as well as into HDFS or Amazon Simple Storage Service (Amazon S3) in Apache Parquet format. We used HBase to provide on-demand query and aggregation of data via REST APIs that used HBase coprocessors to apply aggregation and other business logic. Our independent reporting cluster queried Parquet files on HDFS and Amazon S3 with Apache Drill to generate a few dozen periodic reports. We separated the HBase and ETL clusters from the reporting cluster to avoid any resource contention.

Our legacy environment had challenges at various levels:

  • Hardware – As the hardware aged, we encountered hardware failures on a regular basis. Expiring warranties on hardware and faulty component replacement increased operational risk. The burden of managing the hardware replacement and servers failing to restart after maintenance imposed a serious risk to the business, which made maintenance unpredictable and time-consuming.
  • Software – Yearly software license renewal costs and engineering efforts involved with software version upgrades to keep it on a supported version added operational complexity. Moreover, the commercial Hadoop distribution that we were running reached end of life in 2020.
  • Scaling – We needed to scale on-premises hardware to meet growing business needs during additional satellite launches that required forecasting and capacity planning. Delays in procurement and shipment of hardware affected project timelines. Finally, increasing data usage and customer adoption of this portal presented serious scaling challenges for Viasat.

How migration to Amazon EMR helped solve this challenge

Viasat evaluated a few alternatives to modernize big data applications and determined that Amazon EMR would be the right platform for our requirements. The separate compute and storage architecture of Amazon EMR helped us address our challenges. The following are some of the key benefits we realized with migration to AWS:

  • Storage – Amazon EMR supports HBase on Amazon S3, where Amazon S3 is used as persistent storage for the HBase cluster and allows us to scale our compute needs independently of storage. It also allows us to easily decommission HBase clusters, test upgrades, and optimize our total cost of ownership. For guidelines and best practices, see Migrating to Apache HBase on Amazon S3 on Amazon EMR.
  • DNS records – Because it’s so easy for us to provision new HBase clusters, we needed a way to maintain DNS records to make the cluster easily accessible. We use a simple AWS Lambda function to update DNS records during EMR cluster boot up. The DNS records are stored in our custom DNS service and we use a custom API to update the records.
  • Customization – Amazon EMR allows you to customize existing software on the cluster as well as install additional software. We use Apache Drill for reporting and, with EMR bootstrap actions, we were able to install Apache Drill on all the nodes of the reporting cluster to provide distributed reporting over Parquet files generated with a MapReduce job. Because our reporting uses different query patterns than the data in the HBase cluster, the Parquet files are written with a different partitioning optimized for reporting. We increased the default bucket size from approximately 8 MB to 16 MB and pointed Amazon EMR to private Amazon S3 endpoints to avoid traffic going through an external firewall, which increased performance. A sketch of how a bootstrap action can be attached when launching a cluster follows this list.
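
The following Python (boto3) sketch shows how a bootstrap action of this kind might be attached when launching an EMR cluster. The script location, release label, roles, and instance sizing are placeholders for illustration, not Viasat's actual configuration.

import boto3

emr = boto3.client("emr")

# Launch a reporting cluster with a bootstrap action that installs Apache Drill
# on every node. The S3 script path and sizing below are illustrative placeholders.
response = emr.run_job_flow(
    Name="reporting-cluster",
    ReleaseLabel="emr-6.4.0",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 5,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "install-apache-drill",
            "ScriptBootstrapAction": {"Path": "s3://my-bootstrap-bucket/install_drill.sh"},
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])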

The following diagram depicts the process we followed for migration to Amazon EMR.

Viasat successfully migrated our on-premises big data applications to Amazon EMR in May 2021, following a lift-and-shift approach to move the big data workload to the cloud with minimal changes. Although we have an experienced big data team, we used AWS Infrastructure Event Management (IEM) to support queries on fine-tuning the Amazon EMR infrastructure within the migration timeline.

The following diagram outlines our new architecture. With this new architecture, we can process the same workloads with 50% of the compute footprint and approximately 50% of the costs compared to our on-premises clusters and still meet our SLAs.

Conclusion

With the new data platform, we can ingest additional data sources so that our analysts and customer service teams can gain insights and improve customer experience. As Viasat is launching new satellites and growing business in multiple new countries, we’re looking to stand up this solution in new AWS Regions (closest to the host country) and scale it as needed.

Questions or feedback? Send an email to [email protected].


About the Authors

Manoj Gundawar is a product owner at Viasat. He builds the product roadmap, provides architecture guidance, and manages the full software development lifecycle to build high-quality products and software with minimal TCO. He is passionate about delighting customers by providing innovative solutions, leveraging technology, agile methodology, and a continuous improvement mindset.

Archana Srinivasan is a Technical Account Manager within Enterprise Support at Amazon Web Services. Archana helps AWS customers leverage Enterprise Support entitlements to solve complex operational challenges and accelerate their cloud adoption.

Kiran Guduguntla is a WW Go-to-Market Specialist for Amazon EMR at AWS. He works with AWS customers across the globe to strategize, build, develop and deploy modern Big Data solutions. He is passionate about working with customers and helping them in their cloud journey. Kiran loves music, travel, food and watching football.

Automated security and compliance remediation at HDI

Post Syndicated from Uladzimir Palkhouski original https://aws.amazon.com/blogs/devops/automated-security-and-compliance-remediation-at-hdi/

with Dr. Malte Polley (HDI Systeme AG – Cloud Solutions Architect)

At HDI, one of the largest European insurance groups, we use AWS to build new services and capabilities and delight our customers. Working in the financial services industry, the company has to comply with numerous regulatory requirements in the areas of data protection and FSI regulations such as GDPR, the German Supervisory Requirements for IT (VAIT), and the Supervision of Insurance Undertakings (VAG). A consistent security and compliance assessment process in the cloud supports development productivity and organizational agility, and helps our teams innovate at a high pace and meet the growing demands of our internal and external customers.

In this post, we explore how HDI adopted AWS security and compliance best practices. We describe implementation of automated security and compliance monitoring of AWS resources using a combination of AWS and open-source solutions. We also go through the steps to implement automated security findings remediation and address continuous deployment of new security controls.

Background

Data analytics is the key capability for understanding our customers’ needs, driving business operations improvement, and developing new services, products, and capabilities for our customers. We needed a cloud-native data platform of virtually unlimited scale that offers descriptive and prescriptive analytics capabilities to internal teams with a high innovation pace and short experimentation cycles. One of the success metrics in our mission is time to market, therefore it’s important to provide flexibility to internal teams to quickly experiment with new use cases. At the same time, we’re vigilant about data privacy. Having a secure and compliant cloud environment is a prerequisite for every new experiment and use case on our data platform.

Security and compliance implementation in the cloud is a shared effort between the Cloud Center of Competence team (C3), the Network Operation Center (NoC), and the product and platform teams. The C3 team is responsible for new AWS account provisioning, account security, and compliance baseline setup. Cross-account networking configuration is established and managed by the NoC team. Product teams are responsible for configuring AWS services to meet their requirements in the most efficient way; typically, they deploy and configure the infrastructure and application stacks required for their use cases.

We were looking for a security controls model that would allow us to continuously monitor the infrastructure and application components set up by all the teams. The model also needed to support guardrails that let product teams focus on new use case implementation while inheriting the security and compliance best practices promoted and enforced within our company.

Security and compliance baseline definition

We started with the AWS Well-Architected Framework Security Pillar whitepaper, which provides implementation guidance on the essential areas of security and compliance in the cloud, including identity and access management, infrastructure security, data protection, detection, and incident response. Although all five areas are equally important for implementing enterprise-grade security and compliance in the cloud, we saw an opportunity to improve on the controls of our on-premises environments by automating the detection and incident response elements. The continuous monitoring of AWS infrastructure and application changes, complemented by automated incident response on the security baseline, helps us foster security best practices and allows for a high innovation pace. Manual security reviews are no longer required to assess security posture.

Our security and compliance controls framework is based on GDPR and several standards and programs, including ISO 27001 and C5. Translating the controls framework into a security and compliance baseline definition in the cloud isn't always straightforward, so we use a number of guidelines. As a starting point, we use the CIS Amazon Web Services benchmarks, because they are prescriptive recommendations and their controls cover multiple AWS security areas, including identity and access management, logging and monitoring configuration, and network configuration. CIS benchmarks are industry-recognized cyber security best practices and recommendations that cover a wide range of technology families, and are used by enterprise organizations around the world. We also apply the GDPR compliance on AWS recommendations and the AWS Foundational Security Best Practices, extending the controls recommended by the CIS AWS Foundations Benchmark in multiple control areas: inventory, logging, data protection, access management, and more.

Security controls implementation

AWS provides multiple services that help implement security and compliance controls:

  • AWS CloudTrail provides a history of events in an AWS account, including those originating from command line tools, AWS SDKs, AWS APIs, or the AWS Management Console. In addition, it allows exporting event history for further analysis and subscribing to specific events to implement automated remediation.
  • AWS Config allows you to monitor AWS resource configuration, and automatically evaluate and remediate incidents related to unexpected resources configuration. AWS Config comes with pre-built conformance pack sample templates designed to help you meet operational best practices and compliance standards.
  • Amazon GuardDuty provides threat detection capabilities that continuously monitor network activity, data access patterns, and account behavior.

With multiple AWS services to use as building blocks for continuous monitoring and automation, there is a strong need for a consolidated findings overview and unified remediation framework. This is where AWS Security Hub comes into play. Security Hub provides built-in security standards and controls that make it easy to enable foundational security controls. Then, Security Hub integrates with CloudTrail, AWS Config, GuardDuty, and other AWS services out of the box, which eliminates the need to develop and maintain integration code. Security Hub also accepts findings from third-party partner products and provides APIs for custom product integration. Security Hub significantly reduces the effort to consolidate audit information coming from multiple AWS-native and third-party channels. Its API and supported partner products ecosystem gave us confidence that we can adhere to changes in security and compliance standards with low effort.

While AWS provides a rich set of services to manage risk at the Three Lines Model, we were looking for wider community support in maintaining and extending security controls beyond those defined by CIS benchmarks and compliance and best practices recommendations on AWS. We came across Prowler, an open-source tool focusing on AWS security assessment and auditing and infrastructure hardening. Prowler implements CIS AWS benchmark controls and has over 100 additional checks. We appreciated Prowler providing checks that helped us meet GDPR and ISO 27001 requirements, specifically. Prowler delivers assessment reports in multiple formats, which makes it easy to implement reporting archival for future auditing needs. In addition, Prowler integrates well with Security Hub, which allows us to use a single service for consolidating security and compliance incidents across a number of channels.

We came up with the solution architecture depicted in the following diagram.

Automated remediation solution architecture HDI

Let’s look closely into the most critical components of this solution.

Prowler is a command line tool that uses the AWS Command Line Interface (AWS CLI) and a bash script. Individual Prowler checks are bash scripts organized into groups by compliance standard or AWS service. By supplying the corresponding command line arguments, we can run Prowler against a specific AWS Region or multiple Regions at the same time. We can run Prowler in multiple ways; we chose to run it as an AWS Fargate task on Amazon Elastic Container Service (Amazon ECS). Fargate is a serverless compute engine that runs Docker-compatible containers. Scheduled Fargate tasks make it easy to perform periodic assessments of an AWS account and export the findings. We configured Prowler to run every 7 days in every account and Region it's deployed into.
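
One possible way to schedule such a Fargate task is an EventBridge rule with a rate expression. The following Python (boto3) sketch illustrates the pattern; the cluster, task definition, role, and subnet identifiers are placeholders, not HDI's actual configuration.

import boto3

events = boto3.client("events")

# Run the Prowler task definition every 7 days on the given ECS cluster.
events.put_rule(
    Name="prowler-weekly-assessment",
    ScheduleExpression="rate(7 days)",
    State="ENABLED",
)

events.put_targets(
    Rule="prowler-weekly-assessment",
    Targets=[
        {
            "Id": "prowler-fargate-task",
            "Arn": "arn:aws:ecs:eu-central-1:123456789012:cluster/security-tools",
            "RoleArn": "arn:aws:iam::123456789012:role/ecs-events-role",
            "EcsParameters": {
                "TaskDefinitionArn": "arn:aws:ecs:eu-central-1:123456789012:task-definition/prowler:1",
                "LaunchType": "FARGATE",
                "NetworkConfiguration": {
                    "awsvpcConfiguration": {
                        "Subnets": ["subnet-0123456789abcdef0"],
                        "AssignPublicIp": "DISABLED",
                    }
                },
            },
        }
    ],
)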

Security Hub acts as a single place for consolidating security findings from multiple sources. When Security Hub is enabled in a given Region, the CIS AWS Foundations Benchmark and Foundational Security Best Practices standards are enabled as well. Enabling these standards also configures integration with AWS Config and GuardDuty. Integration with Prowler requires enabling the product integration on the Security Hub side by calling the EnableImportFindingsForProduct API action for the product. Because Prowler supports integration with Security Hub out of the box, posting security findings is a matter of passing the right command line arguments: -M json-asff to format reports in the AWS Security Finding Format (ASFF) and -S to ship findings to Security Hub.
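
A minimal Python (boto3) sketch of that call might look like the following; the product ARN shown follows the pattern used for the public Prowler partner integration and should be verified for your Region.

import boto3

securityhub = boto3.client("securityhub")

# Allow Security Hub to accept findings posted by Prowler in this Region.
securityhub.enable_import_findings_for_product(
    ProductArn="arn:aws:securityhub:eu-central-1::product/prowler/prowler"
)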

Automated security findings remediation is implemented using AWS Lambda functions and the AWS SDK for Python (Boto3). The remediation function can be triggered in two ways: automatically in response to a new security finding, or by a security engineer from the Security Hub findings page. In both cases, the same Lambda function is used. Remediation functions implement security standards in accordance with recommendations, whether they’re CIS AWS Foundations Benchmark and Foundational Security Best Practices standards, or others.

The exact activities performed depend on the security findings type and its severity. Examples of activities performed include deleting non-rotated AWS Identity and Access Management (IAM) access keys, enabling server-side encryption for S3 buckets, and deleting unencrypted Amazon Elastic Block Store (Amazon EBS) volumes.

To trigger the Lambda function, we use Amazon EventBridge, which makes it easy to build an event-driven remediation engine and allows us to define Lambda functions as targets for Security Hub findings and custom actions. EventBridge allows us to define filters for security findings and therefore map finding types to specific remediation functions. Upon successfully performing security remediation, each function updates one or more Security Hub findings by calling the BatchUpdateFindings API and passing the corresponding finding ID.
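
As a simplified illustration of this mapping (the rule name, the use of GeneratorId as the filter field, and the Lambda ARN are assumptions for the sketch, not HDI's exact rule), an EventBridge rule that routes FAILED findings of one type to a specific remediation function could be defined like this:

import json
import boto3

events = boto3.client("events")

# Match imported Security Hub findings of one specific type with a FAILED
# compliance status, and route them to the corresponding remediation function.
event_pattern = {
    "source": ["aws.securityhub"],
    "detail-type": ["Security Hub Findings - Imported"],
    "detail": {
        "findings": {
            "GeneratorId": ["prowler-extra729"],
            "Compliance": {"Status": ["FAILED"]},
        }
    },
}

events.put_rule(
    Name="remediate-prowler729",
    EventPattern=json.dumps(event_pattern),
    State="ENABLED",
)

events.put_targets(
    Rule="remediate-prowler729",
    Targets=[
        {
            "Id": "prowler729-remediation",
            "Arn": "arn:aws:lambda:eu-central-1:123456789012:function:prowler729-remediation",
        }
    ],
)

The target Lambda function also needs a resource-based permission that allows events.amazonaws.com to invoke it.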

The following example code shows a function enforcing an IAM password policy:

import boto3
import os
import logging
from botocore.exceptions import ClientError

iam = boto3.client("iam")
securityhub = boto3.client("securityhub")

log_level = os.environ.get("LOG_LEVEL", "INFO")
logging.root.setLevel(logging.getLevelName(log_level))
logger = logging.getLogger(__name__)


def lambda_handler(event, context, iam=iam, securityhub=securityhub):
    """Remediate findings related to cis15 and cis11.

    Params:
        event: Lambda event object
        context: Lambda context object
        iam: iam boto3 client
        securityhub: securityhub boto3 client
    Returns:
        No returns
    """
    finding_id = event["detail"]["findings"][0]["Id"]
    product_arn = event["detail"]["findings"][0]["ProductArn"]
    lambda_name = os.environ["AWS_LAMBDA_FUNCTION_NAME"]
    try:
        iam.update_account_password_policy(
            MinimumPasswordLength=14,
            RequireSymbols=True,
            RequireNumbers=True,
            RequireUppercaseCharacters=True,
            RequireLowercaseCharacters=True,
            AllowUsersToChangePassword=True,
            MaxPasswordAge=90,
            PasswordReusePrevention=24,
            HardExpiry=True,
        )
        logger.info("IAM Password Policy Updated")
    except ClientError as e:
        logger.exception(e)
        raise e
    try:
        securityhub.batch_update_findings(
            FindingIdentifiers=[{"Id": finding_id, "ProductArn": product_arn},],
            Note={
                "Text": "Changed non compliant password policy",
                "UpdatedBy": lambda_name,
            },
            Workflow={"Status": "RESOLVED"},
        )
    except ClientError as e:
        logger.exception(e)
        raise e

A key aspect in developing remediation Lambda functions is testability. To quickly iterate through testing cycles, we cover each remediation function with unit tests, in which necessary dependencies are mocked and replaced with stub objects. Because no Lambda deployment is required to check remediation logic, we can test newly developed functions and ensure reliability of existing ones in seconds.

Each Lambda function developed is accompanied with an event.json document containing an example of an EventBridge event for a given security finding. A security finding event allows us to verify remediation logic precisely, including deletion or suspension of non-compliant resources or a finding status update in Security Hub and the response returned. Unit tests cover both successful and erroneous remediation logic. We use pytest to develop unit tests, and botocore.stub and moto to replace runtime dependencies with mocks and stubs.
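
As an illustration, a condensed unit test for the password policy remediation function shown earlier could use botocore's Stubber as follows. The remediation module name and the finding identifiers are assumptions for the sketch.

import os

# Make sure the boto3 clients created at module import time can resolve a Region.
os.environ.setdefault("AWS_DEFAULT_REGION", "eu-central-1")

import boto3
from botocore.stub import Stubber

from remediation import lambda_handler  # module name is an assumption


def test_password_policy_remediation(monkeypatch):
    monkeypatch.setenv("AWS_LAMBDA_FUNCTION_NAME", "remediate-password-policy")

    iam = boto3.client("iam")
    securityhub = boto3.client("securityhub")

    event = {
        "detail": {
            "findings": [
                {
                    "Id": "finding-1",
                    "ProductArn": "arn:aws:securityhub:eu-central-1::product/prowler/prowler",
                }
            ]
        }
    }

    with Stubber(iam) as iam_stub, Stubber(securityhub) as hub_stub:
        # The remediation must update the account password policy ...
        iam_stub.add_response("update_account_password_policy", {})
        # ... and then mark the finding as RESOLVED in Security Hub.
        hub_stub.add_response(
            "batch_update_findings",
            {"ProcessedFindings": [], "UnprocessedFindings": []},
        )

        lambda_handler(event, None, iam=iam, securityhub=securityhub)

        iam_stub.assert_no_pending_responses()
        hub_stub.assert_no_pending_responses()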

Automated security findings remediation

The following diagram illustrates our security assessment and automated remediation process.

Automated remediation flow HDI

The workflow includes the following steps:

  1. An existing Security Hub integration performs periodic resource audits. The integration posts new security findings to Security Hub.
  2. Security Hub reports the security incident to the company’s centralized Service Now instance by using the Service Now ITSM Security Hub integration.
  3. Security Hub triggers automated remediation:
    1. Security Hub triggers the remediation function by sending an event to EventBridge. The event has a source field equal to aws.securityhub, with the filter ID corresponding to the specific finding type and compliance status as FAILED. The combination of these fields allows us to map the event to a particular remediation function.
    2. The remediation function starts processing the security finding event.
    3. The function calls the BatchUpdateFindings Security Hub API to update the security finding status upon completing remediation.
    4. Security Hub updates the corresponding security incident status in Service Now (Step 2)
  4. Alternatively, the security operations engineer resolves the security incident in Service Now:
    1. The engineer reviews the current security incident in Service Now.
    2. The engineer manually resolves the security incident in Service Now.
    3. Service Now updates the finding status by calling the UpdateFindings Security Hub API. Service Now uses the AWS Service Management Connector.
  5. Alternatively, the platform security engineer triggers remediation:
    1. The engineer reviews the currently active security findings on the Security Hub findings page.
    2. The engineer triggers remediation from the security findings page by selecting the appropriate action.
    3. Security Hub triggers the remediation function by sending an event with the source aws.securityhub to EventBridge. The automated remediation flow continues as described in Step 3.

Deployment automation

Due to legal requirements, HDI uses the infrastructure as code (IaC) principle when defining and deploying AWS infrastructure. We started with AWS CloudFormation templates defined in YAML or JSON format. The templates are static by nature and define resources in a declarative way. We found that as our solution complexity grew, the CloudFormation templates also grew in size and complexity, because every resource deployed has to be explicitly defined. We wanted a solution that would increase our development productivity and simplify infrastructure definition.

The AWS Cloud Development Kit (AWS CDK) helped us in two ways:

  • The AWS CDK provides ready-to-use building blocks called constructs. These constructs include pre-configured AWS services following best practices. For example, a Lambda function always gets an IAM role with an IAM policy to be able to write logs to CloudWatch Logs.
  • The AWS CDK allows us to use high-level programming languages to define configuration of all AWS services. Imperative definition allows us to build our own abstractions and reuse them to achieve concise resource definition.

We found that implementing IaC with the AWS CDK is faster and less error-prone. At HDI, we use Python to build application logic and define AWS infrastructure. The imperative nature of the AWS CDK is truly a turning point in fulfilling legal requirements and achieving high developer productivity at the same time.

One of the AWS CDK constructs we use is the AWS CDK pipeline. This construct creates a customizable continuous integration and continuous delivery (CI/CD) pipeline implemented with AWS CodePipeline. The source action is based on AWS CodeCommit. The synth action is responsible for creating a CloudFormation template from the AWS CDK project. The synth action also runs unit tests on the remediation functions. The pipeline actions are connected via artifacts. Lastly, the AWS CDK pipeline construct offers a self-mutating feature, which allows us to maintain the AWS CDK project as well as the pipeline in a single code repository. Changes to the pipeline definition as well as to the automated remediation solution are deployed seamlessly. The actual solution deployment is also implemented as a CI/CD stage. Stages can eventually be deployed in cross-Region and cross-account patterns. To use cross-account deployments, the AWS CDK provides bootstrap functionality to create a trust relationship between AWS accounts.

The AWS CDK project is broken down into multiple stacks. To deploy the CI/CD pipeline, we run the cdk deploy cicd-4-securityhub command. To add a new Lambda remediation function, we add the remediation code, optional unit tests, and finally the Lambda remediation configuration object. This configuration object defines the Lambda function's environment variables, necessary IAM policies, and external dependencies. See the following example code of this configuration:

prowler_729_lambda = {
    "name": "Prowler 7.29",
    "id": "prowler729",
    "description": "Remediates Prowler 7.29 by deleting/terminating unencrypted EC2 instances/EBS volumes",
    "policies": [
        _iam.PolicyStatement(
            effect=_iam.Effect.ALLOW,
            actions=["ec2:TerminateInstances", "ec2:DeleteVolume"],
            resources=["*"])
        ],
    "path": "delete_unencrypted_ebs_volumes",
    "environment_variables": [
        {"key": "ACCOUNT_ID", "value": core.Aws.ACCOUNT_ID}
    ],
    "filter_id": ["prowler-extra729"],
 }

Remediation functions are organized in accordance with the security and compliance frameworks they belong to. The AWS CDK code iterates over remediation definition lists and synthesizes corresponding policies and Lambda functions to be deployed later. Committing Git changes and pushing them triggers the CI/CD pipeline, which deploys the newly defined remediation function and adjusts the configuration of Prowler.
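
The following simplified AWS CDK sketch illustrates that iteration. It uses CDK v1-style imports to match the configuration object above; the construct IDs, runtime, and asset paths are illustrative rather than HDI's actual code.

from aws_cdk import core
from aws_cdk import aws_lambda as _lambda


class RemediationStack(core.Stack):
    def __init__(self, scope, construct_id, remediation_configs, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        for config in remediation_configs:
            # One Lambda function per remediation definition, such as prowler_729_lambda above.
            function = _lambda.Function(
                self,
                config["id"],
                runtime=_lambda.Runtime.PYTHON_3_8,
                handler="handler.lambda_handler",
                code=_lambda.Code.from_asset("remediations/" + config["path"]),
                environment={
                    item["key"]: item["value"]
                    for item in config.get("environment_variables", [])
                },
                description=config["description"],
            )
            # Attach the IAM permissions this remediation needs.
            for statement in config.get("policies", []):
                function.add_to_role_policy(statement)

The EventBridge rules that connect each filter_id to its function, as described earlier, would be synthesized in the same loop; they are omitted here for brevity.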

We are working on publishing the source code discussed in this blog post.

Looking forward

As we keep introducing new use cases in the cloud, we plan to improve our solution in the following ways:

  • Continuously add new controls based on our own experience and improving industry standards
  • Introduce cross-account security and compliance assessment by consolidating findings in a central security account
  • Improve automated remediation resiliency by introducing remediation failure notifications and retry queues
  • Run a Well-Architected review to identify and address possible areas of improvement

Conclusion

Working on the solution described in this post helped us improve our security posture and meet compliance requirements in the cloud. Specifically, we were able to achieve the following:

  • Gain a shared understanding of security and compliance controls implementation as well as shared responsibilities in the cloud between multiple teams
  • Speed up security reviews of cloud environments by implementing continuous assessment and minimizing manual reviews
  • Provide product and platform teams with secure and compliant environments
  • Lay a foundation for future requirements and improvement of security posture in the cloud

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

About the Authors

Dr. Malte Polley – Cloud Solutions Architect

Dr. Malte Polley is a Cloud Solutions Architect for the Modern Data Platform (MDP) at HDI Germany. MDP focuses on DevSecOps practices applied to data analytics and provides a secure and compliant environment for every data product at HDI Germany. As a cloud enthusiast, Malte runs the AWS Hannover user group. When not working, Malte enjoys hiking with his family and improving his backyard vegetable garden.

Uladzimir Palkhouski – Sr. Solutions Architect

Uladzimir Palkhouski is a Sr. Solutions Architect at Amazon Web Services. Uladzimir supports German financial services industry customers on their cloud journey. He helps them find practical, forward-looking solutions to complex technical and business challenges.

How NortonLifelock built a serverless architecture for real-time analysis of their VPN usage metrics

Post Syndicated from Madhu Nunna original https://aws.amazon.com/blogs/big-data/how-nortonlifelock-built-a-serverless-architecture-for-real-time-analysis-of-their-vpn-usage-metrics/

This post presents a reference architecture and optimization strategies for building serverless data analytics solutions on AWS using Amazon Kinesis Data Analytics. In addition, this post shows the design approach that the engineering team at NortonLifeLock took to build out an operational analytics platform that processes usage data for their VPN services, consuming petabytes of data across the globe on a daily basis.

NortonLifeLock is a global cybersecurity and internet privacy company that offers services to millions of customers for device security, and identity and online privacy for home and family. NortonLifeLock believes the digital world is only truly empowering when people are confident in their online security. NortonLifeLock has been an AWS customer since 2014.

For any organization, the value of operational data and metrics decreases with time. This lost value can equate to lost revenue and wasted resources. Real-time streaming analytics helps capture this value and provide new insights that can create new business opportunities.

AWS offers a rich set of services that you can use to provide real-time insights and historical trends. These services include managed Hadoop infrastructure services on Amazon EMR as well as serverless options such as Kinesis Data Analytics and AWS Glue.

Amazon EMR also supports multiple programming options for capturing business logic, such as Spark Streaming, Apache Flink, and SQL.

As a customer, it’s important to understand organizational capabilities, project timelines, business requirements, and AWS service best practices in order to define an optimal architecture from performance, cost, security, reliability, and operational excellence perspectives (the five pillars of the AWS Well-Architected Framework).

NortonLifeLock is taking a methodical approach to real-time analytics on AWS while using serverless technology to deliver on key business drivers such as time to market and total cost of ownership. In addition to NortonLifeLock’s implementation, this post provides key lessons learned and best practices for rapid development of real-time analytics workloads.

Business problem

NortonLifeLock offers a VPN product as a freemium service to users. Therefore, they need to enforce usage limits in real time to stop freemium users from using the service when their usage is over the limit. The challenge for NortonLifeLock is to do this in a reliable and affordable fashion.

NortonLifeLock runs its VPN infrastructure in almost all AWS Regions. Migrating to AWS from smaller hosting vendors has greatly improved user experience and VPN edge server performance, including a reduction in connection latency, time to connect and connection errors, faster upload and download speed, and more stability and uptime for VPN edge servers.

VPN usage data is collected by VPN edge servers and uploaded to backend stats servers every minute and persisted in backend databases. The usage information serves multiple purposes:

  • Displaying how much data a device has consumed for the past 30 days.
  • Enforcing usage limits on freemium accounts. When a user exhausts their free quota, that user is unable to connect through VPN until the next free cycle.
  • Analyzing usage data by the internal business intelligence (BI) team based on time, marketing campaigns, and account types, and using this data to predict future growth, ability to retain users, and more.

Design challenge

NortonLifeLock had the following design challenges:

  • The solution must be able to simultaneously satisfy both real-time and batch analysis.
  • The solution must be economical. NortonLifeLock VPN has hundreds of thousands of concurrent users, and if a user’s usage information is persisted as it comes in, it results in tens of thousands of reads and writes per second and tens of thousands of dollars a month in database costs.

Solution overview

NortonLifeLock decided to split storage into two parts by storing usage data in Amazon DynamoDB for real-time access and in Amazon Simple Storage Service (Amazon S3) for analysis, which addresses real-time enforcement and BI needs. Kinesis Data Analytics aggregates and loads data to Amazon S3 and DynamoDB. With Amazon Kinesis Data Streams and AWS Lambda as consumers of Kinesis Data Analytics, the implementation of user and device-level aggregations was simplified.

To keep costs down, user usage data was aggregated by the hour and persisted in DynamoDB. This spread hundreds of thousands of writes over an hour and reduced DynamoDB cost by 30 times.

Although increasing aggregation might not be an option for other problem domains, it’s acceptable in this case because it’s not necessary to be precise to the minute for user usage, and it’s acceptable to calculate and enforce the usage limit every hour.

The following diagram illustrates the high-level architecture. The solution is broken into three logical parts:

  • End-users – Real-time queries from devices to display current usage information (how much data is used daily)
  • Business analysts – Query historical usage information through Amazon Athena to extract business insights
  • Usage limit enforcement – Usage data ingestion and aggregation in real time

The solution has the following workflow:

  1. A VPN edge server collects usage data and sends it to the backend service through an Application Load Balancer.
  2. A single usage data record sent by the VPN edge server contains usage data for many users. A stats splitter splits the message into individual usage stats per user and forwards the message to Kinesis Data Streams.
  3. Usage data is consumed by both the legacy stats processor and the new Apache Flink application developed and deployed on Kinesis Data Analytics.
  4. The Apache Flink application carries out the following tasks:
    1. Aggregate device usage data hourly and send the aggregated result to Amazon S3 and the outgoing Kinesis data stream, which is picked up by a Lambda function that persists the usage data in DynamoDB.
    2. Aggregate device usage data daily and send the aggregated result to Amazon S3.
    3. Aggregate account usage data hourly and forward the aggregated results to the outgoing data stream, which is picked up by a Lambda function that checks if account usage is over the limit for that account. If account usage is over the limit, the function forwards the account information to another Lambda function, via Amazon Simple Queue Service (Amazon SQS), to cut off access on that account.

Design journey

NortonLifeLock needed a solution that was capable of real-time streaming and batch analytics. Kinesis Data Analytics fits this requirement because of the following key features:

  • Real-time streaming and batch analytics for data aggregation
  • Fully managed with a pay-as-you-go model
  • Auto scaling

NortonLifeLock needed Kinesis Data Analytics to do the following:

  • Aggregate customer usage data per device hourly and send results to Kinesis Data Streams (ultimately to DynamoDB) and the data lake (Amazon S3)
  • Aggregate customer usage data per account hourly and send results to Kinesis Data Streams (ultimately to DynamoDB and Lambda, which enforces usage limit)
  • Aggregate customer usage data per device daily and send results to the data lake (Amazon S3)

The legacy system processes usage data from an incoming Kinesis data stream, and the plan was to use Kinesis Data Analytics to consume and process production data from the same stream. As such, NortonLifeLock started with SQL applications on Kinesis Data Analytics.

First attempt: Kinesis Data Analytics for SQL

Kinesis Data Analytics for SQL provides a high-level, SQL-based abstraction for real-time stream processing and analytics. It's configuration driven and very simple to get started with. NortonLifeLock was able to create a prototype from scratch, get to production, and process the production load in less than 2 weeks. The solution met 90% of the requirements, and there were workable alternatives for the remaining 10%.

However, they started to receive "read limit exceeded" alerts from the source data stream, and the legacy application was read throttled. With AWS Support's help, they traced the issue to drastic reversals of the Kinesis Data Analytics MillisBehindLatest metric during Kinesis record processing. These were correlated with Kinesis Data Analytics auto scaling events and application restarts, as illustrated by the following diagram. The highlighted areas show the correlation between spikes due to auto scaling and reversals of the MillisBehindLatest metric.

Here’s what happened:

  • Kinesis Data Analytics for SQL scaled up KPU due to load automatically, and the Kinesis Data Analytics application was restarted (part of scaling up).
  • Kinesis Data Analytics for SQL supports the at least once delivery model and uses checkpoints to ensure no data loss. But it doesn’t support taking a snapshot and restoring from the snapshot after a restart. For more details, see Delivery Model for Persisting Application Output to an External Destination.
  • When the Kinesis Data Analytics for SQL application was restarted, it needed to reprocess data from the beginning of the aggregation window, resulting in a very large number of duplicate records, which led to a dramatic increase in the Kinesis Data Analytics MillisBehindLatest metric.
  • To catch up with incoming data, Kinesis Data Analytics started re-reading from the Kinesis data stream, which led to over-consumption of read throughput and the legacy application being throttled.

In summary, the issue was caused by Kinesis Data Analytics for SQL reprocessing records on restarts, the lack of any other means to eliminate the resulting duplicates, and the limited ability to control auto scaling.

Although they found Kinesis Data Analytics for SQL easy to get started, these limitations demanded other alternatives. NortonLifeLock reached out to the Kinesis Data Analytics team and discussed the following options:

  • Option 1 – AWS was planning to release a new service, Kinesis Data Analytics Studio for SQL, Python, and Scala, which addresses these limitations. But this service was still a few months away (this service is now available, launched May 27, 2021).
  • Option 2 – The alternative was to switch to Kinesis Data Analytics for Apache Flink, which also provides the necessary tools to address all their requirements.

Second attempt: Kinesis Data Analytics for Apache Flink

Apache Flink has a comparatively steep learning curve (we used Java for streaming analytics instead of SQL), and it took about 4 weeks to build the same prototype, deploy it to Kinesis Data Analytics, and test the application in production. NortonLifeLock had to overcome a few hurdles, which we document in this section along with the lessons learned.

Challenge 1: Too many writes to outgoing Kinesis data stream

The first thing they noticed was that the write threshold on the outgoing Kinesis data stream was greatly exceeded. Kinesis Data Analytics was attempting to write 10 times the amount of expected data to the data stream, with 95% of data throttled.

After a lengthy investigation, it turned out that having too much parallelism in the Kinesis Data Analytics application led to this issue. They had followed default recommendations and set parallelism to 12 and it scaled up to 16. This means that every hour, 16 separate threads were attempting to write to the destination data stream simultaneously, leading to massive contention and writes throttled. These threads attempted to retry continuously, until all records were written to the data stream. This resulted in 10 times the amount of data processing attempted, even though only one tenth of the writes eventually succeeded.

The solution was to reduce parallelism to 4 and disable auto scaling. In the preceding diagram, the percentage of throttled records dropped to 0 from 95% after they reduced parallelism to 4 in the Kinesis Data Analytics application. This also greatly improved KPU utilization and reduced Kinesis Data Analytics cost from $50 a day to $8 a day.

Challenge 2: Use Kinesis Data Analytics sink aggregation

After tuning parallelism, they still noticed occasional throttling by Kinesis Data Streams because of the number of records being written, not record size. To overcome this, they turned on Kinesis Data Analytics sink aggregation to reduce the number of records being written to the data stream, and the result was dramatic. They were able to reduce the number of writes by 1,000 times.

Challenge 3: Handle Kinesis Data Analytics Flink restarts and the resulting duplicate records

Kinesis Data Analytics applications restart because of auto scaling or recovery from application or task manager crashes. When this happens, Kinesis Data Analytics saves a snapshot before shutdown and automatically reloads the latest snapshot and picks up where the work was left off. Kinesis Data Analytics also saves a checkpoint every minute so no data is lost, guaranteeing exactly-once processing.

However, when the Kinesis Data Analytics application shuts down in the middle of sending results to Kinesis Data Streams, it doesn’t guarantee exactly-once data delivery. In fact, Flink only guarantees at-least-once delivery to the Kinesis sink, meaning that Kinesis Data Analytics guarantees to send a record at least once, which leads to duplicate records being sent when Kinesis Data Analytics is restarted.

How were duplicate records handled in the outgoing data stream?

Because duplicate records aren’t handled by Kinesis Data Analytics when sinks don’t have exactly-once semantics, the downstream application must deal with them. The first question you should ask is whether it’s necessary to deal with the duplicate records at all; for some applications, tolerating them may be acceptable. That wasn’t an option for NortonLifeLock, because no user wants to have their available usage taken twice within the same hour, so logic had to be built into the application to handle duplicate usage records.

To deal with duplicate records, you can employ a strategy in which the application saves an update timestamp along with the user’s latest usage. When a record comes in, the application reads the existing daily usage and compares the update timestamp against the current time. If the difference is less than a configured window (50 minutes when the aggregation window is 60 minutes), the application ignores the new record because it’s a duplicate. It’s acceptable for the application to potentially undercount rather than overcount user usage.
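
The following Python sketch illustrates this timestamp-window check. The function and field names are hypothetical illustrations, not NortonLifeLock’s actual code, and the 50-minute window simply mirrors the value described above.

from datetime import datetime, timedelta, timezone

DEDUP_WINDOW = timedelta(minutes=50)  # for a 60-minute aggregation window

def apply_usage_record(usage_store, user_id, usage_bytes, now=None):
    """Apply an hourly usage record unless it looks like a replayed duplicate."""
    now = now or datetime.now(timezone.utc)
    existing = usage_store.get(user_id)  # e.g. {"usage_bytes": ..., "updated_at": datetime}
    if existing and now - existing["updated_at"] < DEDUP_WINDOW:
        return False  # likely a duplicate emitted after an application restart
    usage_store[user_id] = {"usage_bytes": usage_bytes, "updated_at": now}
    return True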

How were duplicate records handled in the outgoing S3 bucket?

Kinesis Data Analytics writes temporary files in Amazon S3 before finalizing and removing them. When Kinesis Data Analytics restarts, it attempts to write new S3 files and can leave behind temporary S3 files because of the restart. Because Athena ignores all temporary S3 files, no further action is needed. If your BI tools take temporary S3 files into consideration, you have to configure an Amazon S3 lifecycle policy to clean up temporary S3 files after a certain time.

Conclusion

NortonLifeLock has been successfully running a Kinesis Data Analytics application in production since May 2021. It provides several key benefits. VPN users can now keep track of their usage in near-real time. BI analysts can get timely insights that are used for targeted sales and marketing campaigns, and upselling features and services. VPN usage limits are enforced in near-real time, thereby optimizing the network resources. NortonLifeLock is saving tens of thousands of dollars each month with this real-time streaming analytics solution. And this telemetry solution is able to keep up with petabytes of data flowing through their global VPN service, which is seeing double-digit monthly growth.

To learn more about Kinesis Data Analytics and getting started with serverless streaming solutions on AWS, see the Developer Guide for Kinesis Data Analytics Studio, the easiest way to build Apache Flink applications in SQL, Python, and Scala in a notebook interface.


About the Authors

Lei Gu has 25 years of software development experience and is the architect for three key Norton products: Norton Secure Backup, VPN, and Norton Family. He is passionate about cloud transformation and most recently spoke about moving from Cassandra to Amazon DynamoDB at AWS re:Invent 2019. Check out his LinkedIn profile at https://www.linkedin.com/in/leigu/.

Madhu Nunna is a Sr. Solutions Architect at AWS with over 20 years of experience in networks and cloud, the last two years focused on the AWS Cloud. He is passionate about analytics and AI/ML. Outside of work, he enjoys hiking and reading books on philosophy, economics, history, astronomy, and biology.

Integral Ad Science secures self-service data lake using AWS Lake Formation

Post Syndicated from Mat Sharpe original https://aws.amazon.com/blogs/big-data/integral-ad-science-secures-self-service-data-lake-using-aws-lake-formation/

This post is co-written with Mat Sharpe, Technical Lead, AWS & Systems Engineering from Integral Ad Science.

Integral Ad Science (IAS) is a global leader in digital media quality. The company’s mission is to be the global benchmark for trust and transparency in digital media quality for the world’s leading brands, publishers, and platforms. IAS does this through data-driven technologies with actionable real-time signals and insight.

In this post, we discuss how IAS uses AWS Lake Formation and Amazon Athena to efficiently manage governance and security of data.

The challenge

IAS processes over 100 billion web transactions per day. With strong growth and changing seasonality, IAS needed a solution to reduce cost, eliminate idle capacity during low utilization periods, and maximize data processing speeds during peaks to ensure timely insights for customers.

In 2020, IAS deployed a data lake in AWS, storing data in Amazon Simple Storage Service (Amazon S3), cataloging its metadata in the AWS Glue Data Catalog, ingesting and processing using Amazon EMR, and using Athena to query and analyze the data. IAS wanted to create a unified data platform to meet its business requirements. Additionally, IAS wanted to enable self-service analytics for customers and users across multiple business units, while maintaining critical controls over data privacy and compliance with regulations such as GDPR and CCPA. To accomplish this, IAS needed to securely ingest and organize real-time and batch datasets, as well as secure and govern sensitive customer data.

To meet the dynamic nature of IAS’s data and use cases, the team needed a solution that could define access controls by attribute, such as data classification and job function. IAS processes significant volumes of data, and this continues to grow. To support this volume, IAS needed the governance solution to scale so that it could create and secure the many new datasets added every day, which in turn would let IAS enable self-service access to data from different tools, such as development notebooks, the AWS Management Console, and business intelligence and query tools.

To address these needs, IAS evaluated several approaches, including a manual ticket-based onboarding process to define permissions on new datasets, many different AWS Identity and Access Management (IAM) policies, and an AWS Lambda-based approach to automate defining Lake Formation table and column permissions, triggered by changes in security requirements and the arrival of new datasets.

Although these approaches worked, they were complex and didn’t support the self-service experience that IAS data analysts required.

Solution overview

IAS selected Lake Formation, Athena, and Okta to solve this challenge. The following architectural diagram shows how the company chose to secure its data lake.

The solution needed to support data producers and consumers in multiple AWS accounts. For brevity, this diagram shows a central data lake producer that includes a set of S3 buckets for raw and processed data. Amazon EMR is used to ingest and process the data, and all metadata is cataloged in the data catalog. The data lake consumer account uses Lake Formation to define fine-grained permissions on datasets shared by the producer account; users logging in through Okta can run queries using Athena and be authorized by Lake Formation.

Lake Formation enables column-level control, and all Amazon S3 access is provisioned via a Lake Formation data access role in the query account, ensuring only that service can access the data. Each business unit with access to the data lake is provisioned with an IAM role that only allows limited access to:

  • That business unit’s Athena workgroup
  • That workgroup’s query output bucket
  • The lakeformation:GetDataAccess API

Because Lake Formation manages all the data access and permissions, the configuration of the user’s role policy in IAM becomes very straightforward. By defining an Athena workgroup per business unit, IAS also takes advantage of assigning per-department billing tags and query limits to help with cost management.
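
As an illustration of this per-business-unit setup, the following boto3 sketch creates an Athena workgroup with its own query output bucket, a per-query scan limit, and a cost-allocation tag. The workgroup name, bucket, limit, and tag values are assumptions for illustration, not IAS’s actual configuration.

import boto3

athena = boto3.client("athena")

# Hypothetical business unit "measurement"; bucket and limits are placeholders.
athena.create_work_group(
    Name="measurement-bu",
    Description="Workgroup for the measurement business unit",
    Configuration={
        "ResultConfiguration": {"OutputLocation": "s3://example-measurement-query-results/"},
        "EnforceWorkGroupConfiguration": True,
        "PublishCloudWatchMetricsEnabled": True,
        "BytesScannedCutoffPerQuery": 1_099_511_627_776,  # 1 TiB per-query scan limit
    },
    Tags=[{"Key": "department", "Value": "measurement"}],
)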

Define a tag strategy

IAS commonly deals with two types of data: data generated by the company and data from third parties. The latter usually includes contractual stipulations on privacy and use.

Some data sets require even tighter controls, and defining a tag strategy is one key way that IAS ensures compliance with data privacy standards. With the tag-based access controls in Lake Formation, IAS can define a set of tags within an ontology that is assigned to tables and columns. This ensures users understand which data is available and whether or not they have access. It also helps IAS manage privacy permissions across numerous tables, with new ones added every day.

At a simple level, we can define a class policy tag with the values private and non-private, and an owner tag with the values internal and partner.

As we progressed, our tagging ontology evolved to include individual data owners and data sources within our product portfolio.
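
A minimal boto3 sketch of defining such a tag ontology with Lake Formation tag-based access control might look like the following. The tag keys and values mirror the simple class/owner example above rather than IAS’s full ontology.

import boto3

lakeformation = boto3.client("lakeformation")

# Define the two policy tags described above: class and owner.
lakeformation.create_lf_tag(TagKey="class", TagValues=["private", "non-private"])
lakeformation.create_lf_tag(TagKey="owner", TagValues=["internal", "partner"])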

Apply tags to data assets

After IAS defined the tag ontology, the team applied tags at the database, table, and column level to manage permissions. Tags are inherited, so they only need to be applied at the highest level. For example, IAS applied the owner and class tags at the database level and relied on inheritance to propagate the tags to all the underlying tables and columns. The following diagram shows how IAS activated a tagging strategy to distinguish between internal and partner datasets, while classifying sensitive information within these datasets.

Only a small number of columns contain sensitive information; IAS relied on inheritance to apply a non-private tag to the majority of the database objects and then overrode it with a private tag on a per-column basis.
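
Sketched with boto3, assigning tags at the database level and overriding a sensitive column could look like this; the database, table, and column names are hypothetical placeholders.

import boto3

lakeformation = boto3.client("lakeformation")

# Tag an entire (hypothetical) partner database; tables and columns inherit these tags.
lakeformation.add_lf_tags_to_resource(
    Resource={"Database": {"Name": "partner_web_events"}},
    LFTags=[
        {"TagKey": "owner", "TagValues": ["partner"]},
        {"TagKey": "class", "TagValues": ["non-private"]},
    ],
)

# Override the inherited class tag on a sensitive column only.
lakeformation.add_lf_tags_to_resource(
    Resource={
        "TableWithColumns": {
            "DatabaseName": "partner_web_events",
            "Name": "impressions",
            "ColumnNames": ["ip_address"],
        }
    },
    LFTags=[{"TagKey": "class", "TagValues": ["private"]}],
)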

The following screenshot shows the tags applied to a database on the Lake Formation console.

With its global scale, IAS needed a way to automate how tags are applied to datasets. The team experimented with various options including string matching on column names, but the results were unpredictable in situations where unexpected column names are used (ipaddress vs. ip_address, for example). Ultimately, IAS incorporated metadata tagging into its existing infrastructure as code (IaC) process, which gets applied as part of infrastructure updates.

Define fine-grained permissions

The final piece of the puzzle was to define permission rules to associate with tagged resources. The initial data lake deployment involved creating permission rules for every database and table, with column exclusions as necessary. Although these were generated programmatically, it added significant complexity when the team needed to troubleshoot access issues. With Lake Formation tag-based access controls, IAS reduced hundreds of permission rules down to precisely two rules, as shown in the following screenshot.

When using multiple tags, the expressions are logically ANDed together. The preceding statements permit access only to data tagged non-private and owned by internal.
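
A boto3 sketch of one such tag-based grant follows. The role ARN is a placeholder, and the expression matches the non-private/internal rule described above; it is not IAS’s exact rule set.

import boto3

lakeformation = boto3.client("lakeformation")

# Grant SELECT on every table tagged class=non-private AND owner=internal.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/example-analyst-role"},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [
                {"TagKey": "class", "TagValues": ["non-private"]},
                {"TagKey": "owner", "TagValues": ["internal"]},
            ],
        }
    },
    Permissions=["SELECT"],
)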

Tags allowed IAS to simplify permission rules, making it easy to understand, troubleshoot, and audit access. The ability to easily audit which datasets include sensitive information and who within the organization has access to them made it easy to comply with data privacy regulations.

Benefits

This solution provides self-service analytics to IAS data engineers, analysts, and data scientists. Internal users can query the data lake with their choice of tools, such as Athena, while maintaining strong governance and auditing. The new approach using Lake Formation tag-based access controls reduces the integration code and manual controls required. The solution provides the following additional benefits:

  • Meets security requirements by providing column-level controls for data
  • Significantly reduces permission complexity
  • Reduces time to audit data security and troubleshoot permissions
  • Deploys data classification using existing IaC processes
  • Reduces the time it takes to onboard data users including engineers, analysts, and scientists

Conclusion

When IAS started this journey, the company was looking for a fully managed solution that would enable self-service analytics while meeting stringent data access policies. Lake Formation provided IAS with the capabilities needed to deliver on this promise for its employees. With tag-based access controls, IAS optimized the solution by reducing the number of permission rules from hundreds down to a few, making it even easier to manage and audit. IAS continues to analyze data using more tools governed by Lake Formation.


About the Authors

Mat Sharpe is the Technical Lead, AWS & Systems Engineering at IAS where he is responsible for the company’s AWS infrastructure and guiding the technical teams in their cloud journey. He is based in New York.

Brian Maguire is a Solution Architect at Amazon Web Services, where he is focused on helping customers build their ideas in the cloud. He is a technologist, writer, teacher, and student who loves learning. Brian is the co-author of the book Scalable Data Streaming with Amazon Kinesis.

Danny Gagne is a Solutions Architect at Amazon Web Services. He has extensive experience in the design and implementation of large-scale high-performance analysis systems, and is the co-author of the book Scalable Data Streaming with Amazon Kinesis. He lives in New York City.

Detect Adversary Behavior in Milliseconds with CrowdStrike and Amazon EventBridge

Post Syndicated from Joby Bett original https://aws.amazon.com/blogs/architecture/detect-adversary-behavior-in-seconds-with-crowdstrike-and-amazon-eventbridge/

By integrating Amazon EventBridge with Falcon Horizon, CrowdStrike has developed a real-time, cloud-based solution that allows you to detect threats in less than a second. This solution uses AWS CloudTrail and EventBridge. CloudTrail allows governance, compliance, operational auditing, and risk auditing of your AWS account. EventBridge is a serverless event bus that makes it easier to build event-driven applications at scale.

In this blog post, we’ll cover the challenges presented by using traditional log file-based security monitoring. We will also discuss how CrowdStrike used EventBridge to create an innovative, real-time cloud security solution that enables high-speed, event-driven alerts that detect malicious actors in milliseconds.

Challenges of log file-based security monitoring

Being able to detect malicious actors in your environment is necessary to stay secure in the cloud. With the growing volume, velocity, and variety of cloud logs, log file-based monitoring makes it difficult to reveal adverse behaviors in time to stop breaches.

When an attack is in progress, a security operations center (SOC) analyst has an average of one minute to detect the threat, ten minutes to understand it, and one hour to contain it. If you cannot meet this 1/10/60 minute rule, a costly breach may move laterally and explode exponentially across the cloud estate.

Let’s look at a real-life scenario. When a malicious actor attempts a ransom attack that targets high-value data in an Amazon Simple Storage Service (Amazon S3) bucket, it can involve activities in various parts of the cloud services in a brief time window.

These example activities can involve:

  • AWS Identity and Access Management (IAM): account enumeration, disabling multi-factor authentication (MFA), account hijacking, privilege escalation, etc.
  • Amazon Elastic Compute Cloud (Amazon EC2): instance profile privilege escalation, file exchange tool installs, etc.
  • Amazon S3: bucket and object enumeration; impair bucket encryption and versioning; bucket policy manipulation; getObject, putObject, and deleteObject APIs, etc.

With siloed log file-based monitoring, detecting, understanding, and containing a ransom attack while still meeting the 1/10/60 rule is difficult. This is because log files are written in batches, and files are typically only created every 5 minutes. Once a log file is written, it still needs to be fetched and processed, which means you lose the ability to dynamically correlate disparate activities.

To summarize, top-level challenges of log file-based monitoring are:

  1. Lag time between the breach and the detection
  2. Inability to correlate disparate activities to reveal sophisticated attack patterns
  3. Frequent false positive alarms that obscure true positives
  4. High operational cost of log file synchronizations and reprocessing
  5. Log analysis tool maintenance for fast growing log volume

Security and compliance is a shared responsibility between AWS and the customer. We protect the infrastructure that runs all of the services offered in the AWS Cloud. For abstracted services, such as  Amazon S3, we operate the infrastructure layer, the operating system, and platforms, and customers access the endpoints to store and retrieve data.

You are responsible for managing your data (including encryption options), classifying assets, and using IAM tools to apply the appropriate permissions.

Indicators of attack by CrowdStrike with Amazon EventBridge

In real-world cloud breach scenarios, timeliness of observation, detection, and remediation is critical. CrowdStrike Falcon Horizon IOA is built on an event-driven architecture based on EventBridge and operates at a velocity that can outpace attackers.

CrowdStrike Falcon Horizon IOA performs the following core actions:

  • Observe: EventBridge streams CloudTrail log files across accounts to the CrowdStrike platform as activity occurs. Parallelism is enabled via event bus rules, which enables CrowdStrike to avoid the five-minute lag in fetching the log files and dynamically correlate disparate activities. The CrowdStrike platform observes end-to-end activities from AWS services and infrastructure hosted in the accounts protected by CrowdStrike.
  • Detect: Falcon Horizon invokes indicators of attack (IOA) detection algorithms that reveal adversarial or anomalous activities from the log file streams. It correlates new and historical events in real time while enriching the events with CrowdStrike threat intelligence data. Each IOA is prioritized with the likelihood of activity being malicious via scoring and mapped to the MITRE ATT&CK framework.
  • Remediate: The detected IOA is presented with remediation steps. Depending on the score, applying the remediations quickly can be critical before the attack spreads.
  • Prevent: Unremediated insecure configurations are revealed via indicators of misconfiguration (IOM) in Falcon Horizon. Applying the remediation steps from IOM can prevent future breaches.

Key differentiators of IOA from Falcon Horizon are:

  • Observability of wider attack surfaces with heterogeneous event sources
  • Detection of sophisticated tactics, techniques, and procedures (TTPs) with dynamic event correlation
  • Event enrichment with threat intelligence that aids prioritization and reduces alert fatigue
  • Low latency between malicious activity occurrence and corresponding detection
  • Insight into attacks for each adversarial event from MITRE ATT&CK framework

High-level architecture

Event-driven architectures provide advantages for integrating varied systems over legacy log file-based approaches. For securing cloud attack surfaces against the ever-evolving TTPs, a robust event-driven architecture at scale is a key differentiator.

CrowdStrike maximizes the advantages of event-driven architecture by integrating with EventBridge, as shown in Figure 1. EventBridge allows observing CloudTrail logs in event streams. It also simplifies log centralization from a number of accounts with its direct source-to-target integration across accounts, as follows:

  1. CrowdStrike hosts an EventBridge with central event buses that consume the stream of CloudTrail log events from a multitude of customer AWS accounts.
  2. Within customer accounts, EventBridge rules listen to the local CloudTrail and stream each activity as an event to the centralized EventBridge hosted by CrowdStrike.
  3. CrowdStrike’s event-driven platform detects adversarial behaviors from the event streams in real time. The detection is performed against incoming events in conjunction with historical events. The context that comes from connecting new and historical events minimizes false positives and improves alert efficacy.
  4. Events are enriched with CrowdStrike threat intelligence data that provides additional insight of the attack to SOC analysts and incident responders.

Figure 1. CrowdStrike Falcon Horizon IOA architecture

As data is received by the centralized EventBridge, CrowdStrike relies on unique customer ID and AWS Region in each event to provide integrity and isolation.

EventBridge allows relatively hassle-free customer onboarding by using cross-account rules to transfer customer CloudTrail data into one common event bus, which can then be used to filter and selectively forward the data into the Falcon Horizon platform for analysis and mitigation.
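
In a customer account, a rule of this kind can be sketched with boto3 as follows. The rule name, event pattern, and ARNs are placeholders for illustration rather than CrowdStrike’s actual configuration.

import json
import boto3

events = boto3.client("events")

# Match CloudTrail management events on the account's default event bus.
events.put_rule(
    Name="forward-cloudtrail-to-central-bus",
    EventPattern=json.dumps({"detail-type": ["AWS API Call via CloudTrail"]}),
    State="ENABLED",
)

# Forward matching events to a central event bus in another account.
events.put_targets(
    Rule="forward-cloudtrail-to-central-bus",
    Targets=[
        {
            "Id": "central-ioa-bus",
            "Arn": "arn:aws:events:us-east-1:999999999999:event-bus/example-central-bus",
            "RoleArn": "arn:aws:iam::111122223333:role/example-eventbridge-forwarding-role",
        }
    ],
)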

Conclusion

As your organization’s cloud footprint grows, visibility into end-to-end activities in a timely manner is critical for maintaining a safe environment for your business to operate. EventBridge allows event-driven monitoring of CloudTrail logs at scale.

CrowdStrike Falcon Horizon IOA, powered by EventBridge, observes end-to-end cloud activities at high speeds at scale. Paired with targeted detection algorithms from in-house threat detection experts and threat intelligence data, Falcon Horizon IOA combats emerging threats against the cloud control plane with its cutting-edge event-driven architecture.


Optimizing Cloud Infrastructure Cost and Performance with Starburst on AWS

Post Syndicated from Kinnar Kumar Sen original https://aws.amazon.com/blogs/architecture/optimizing-cloud-infrastructure-cost-and-performance-with-starburst-on-aws/

Amazon Web Services (AWS) Cloud is elastic, convenient to use, easy to consume, and makes it simple to onboard workloads. Because of this simplicity, the cost associated with onboarding workloads is sometimes overlooked.

There is a notion that when an organization moves its workload to the cloud, agility, scalability, performance, and cost issues will disappear. While this may be true for agility and scalability, you must still optimize your workload. You can do this with services like Amazon EC2 Auto Scaling via Amazon Elastic Compute Cloud (Amazon EC2) and Amazon EC2 Spot Instances to realize the performance and cost benefits the cloud offers.

In this blog post, we show you how Starburst Enterprise (Starburst) addressed a sudden increase in cost for their data analytics platform as they grew their internal teams and scaled out their infrastructure. After reviewing their architecture and deployments with AWS specialist architects, Starburst and AWS concluded that they could take the following steps to greatly reduce costs:

  1. Use Spot Instances to run workloads.
  2. Add Amazon EC2 Auto Scaling into their training and demonstration environments, because the Starburst platform is designed to elastically scale up and down.

For analytics workloads, when you rein in costs, you typically rein in performance. Starburst and AWS worked together to balance the cost and performance of Starburst’s data analytics platform while also harnessing the flexibility, scalability, security, and performance of the cloud.

What is Starburst Enterprise?

Starburst provides a Massively Parallel Processing SQL (MPPSQL) engine based on open source Trino. It is an analytics platform that provides the cornerstone in customers’ intelligent data mesh and offers the following benefits and services:

  • The platform gives you a single point to access, monitor, and secure your data mesh.
  • The platform gives you options for your data compute. You no longer have to wait on data migrations or extract, transform, and load (ETL) jobs; there is no vendor lock-in; and there is no need to swap out your existing analytics tools.
  • Starburst Stargate (Stargate) ensures that large jobs are completed within each data domain of your data mesh. Only the result set is retrieved from the domain.
    • Stargate reduces data output, which reduces costs and increases performance.
    • Data governance policies can also be applied uniquely in each data domain, ensuring security compliance and federation.

As shown in Figure 1, there are many connectors for input and output that ensure you experience improved performance and security.


Figure 1. Starburst platform

Integrating Starburst Enterprise with AWS

As shown in Figure 2, Starburst Enterprise uses AWS services to deliver elastic scaling and optimize cost. The platform is architected with decoupled storage and compute. This allows the platform to scale as needed to analyze petabytes of data.

The platform can be deployed via AWS CloudFormation or Amazon Elastic Kubernetes Service (Amazon EKS). Starburst on AWS allows you to run analytic queries across AWS data sources and on-premises systems such as Teradata and Oracle.


Figure 2. Deployment architecture of Starburst platform on AWS

Amazon EC2 Auto Scaling

Enterprises have diverse analytic workloads; their compute and memory requirements vary with time. Starburst uses Amazon EKS and Amazon EC2 Auto Scaling to elastically scale compute resources to meet the demands of their analytics workloads.

  • Amazon EC2 Auto Scaling ensures that you have the compute capacity for your workloads to handle the load elastically. It is used to architect sophisticated, elastic, and resilient applications on the AWS Cloud.
    • Starburst uses the scheduled scaling feature of Amazon EC2 Auto Scaling to scale the cluster up and down based on time, so they incur no costs when the cluster is not in use (a sketch of such a scheduled action follows this list).
  • Amazon EKS is a fully managed Kubernetes service that allows you to run Kubernetes on AWS without needing to install, operate, and maintain your own Kubernetes control plane.
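
A scheduled action of the kind described in the list above could be sketched with boto3 like this. The Auto Scaling group name, schedule, and sizes are illustrative assumptions, not Starburst’s actual settings.

import boto3

autoscaling = boto3.client("autoscaling")

# Scale the (hypothetical) demo cluster up on weekday mornings...
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="example-demo-workers",
    ScheduledActionName="scale-up-weekday-mornings",
    Recurrence="0 8 * * 1-5",  # 08:00 UTC, Monday through Friday
    MinSize=2,
    MaxSize=10,
    DesiredCapacity=6,
)

# ...and down to zero every evening so no cost accrues while idle.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="example-demo-workers",
    ScheduledActionName="scale-down-evenings",
    Recurrence="0 20 * * *",  # 20:00 UTC daily
    MinSize=0,
    MaxSize=0,
    DesiredCapacity=0,
)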

Scaling cloud resource consumption on demand has a major impact on controlling cloud costs. Starburst supports scaling down elastically, which means removing compute resources doesn’t impact the underlying processes.

Amazon EC2 Spot Instances

Spot Instances let you take advantage of unused EC2 capacity in the AWS Cloud. They are available at up to a 90% discount compared to On-Demand Instance prices. If EC2 needs capacity for On-Demand Instance usage, Spot Instances can be interrupted by Amazon EC2 with a two-minute notification. There are many ways to handle the interruption to ensure that the application is well architected for resilience and fault tolerance.

Starburst has integrated Spot Instances as part of the Amazon EKS managed node groups to cost-optimize the analytics workloads. The best practice of instance diversification is implemented by integrating eksctl with the instance selector using the dry-run flag, which creates a list of instance types of the same size (vCPU-to-memory ratio) for use in the underlying node groups.

Instances of the same size are required to make the best use of the Kubernetes Cluster Autoscaler, which manages the size of the cluster.

Scaling down, handling interruptions, and provisioning compute

“Scaling in” an active application is tricky, but Starburst was built with resiliency in mind and can effectively manage shutdowns.

Spot Instances are an ideal compute option because Starburst can handle potential interruptions natively. Starburst also uses Amazon EKS managed node groups to provision nodes in the cluster. This requires significantly less operational effort compared to using self-managed node groups. It allows Starburst to enforce best practices like capacity optimized allocation strategy, capacity rebalancing, and instance diversification.

When you need to “scale out” the deployment, Amazon EKS and Amazon EC2 Auto Scaling help to provision capacity, as depicted in Figure 3.


Figure 3. Depicting “scale out” in a Starburst deployment

Benefits realized from using AWS services

In a short period, Starburst was able to increase the number of people working on AWS, adding five times the number of Solutions Architects they had previously. Additionally, in initial tests of their new deployment architecture, their Solutions Architects were able to complete up to three times the amount of work they could before. Even after the workload increased more than 15 times, these two simple changes led to only a slight increase in total cost.

This cost and performance optimization allows Starburst to be more productive internally and realize value for each dollar spent, which further justified investing more in building out their infrastructure footprint.

Conclusion

In building their architecture with AWS, Starburst realized the importance of having a robust and comprehensive cloud administration plan that they can implement and manage going forward. They are now able to balance cloud costs with performance and stability, even after considering the SLA requirements. Starburst is planning to teach their customers Spot Instance and Amazon EC2 Auto Scaling best practices to ensure they maintain a cost- and performance-optimized cloud architecture.

If you want to see the Starburst data analytics platform in action, you can get a free trial in the AWS Marketplace: Starburst Data Free Trial.

How Rapid7 built multi-tenant analytics with Amazon Redshift using near-real-time datasets

Post Syndicated from Rahul Monga original https://aws.amazon.com/blogs/big-data/how-rapid7-built-multi-tenant-analytics-with-amazon-redshift-using-near-real-time-datasets/

This is a guest post co-written by Rahul Monga, Principal Software Engineer at Rapid7.

Rapid7 InsightVM is a vulnerability assessment and management product that provides visibility into the risks present across an organization. It equips you with the reporting, automation, and integrations needed to prioritize and fix those vulnerabilities in a fast and efficient manner. InsightVM has more than 5,000 customers across the globe, runs exclusively on AWS, and is available for purchase on AWS Marketplace.

To provide near-real-time insights to InsightVM customers, Rapid7 has recently undertaken a project to enhance the dashboards in their multi-tenant software as a service (SaaS) portal with metrics, trends, and aggregated statistics on vulnerability information identified in their customer assets. They chose Amazon Redshift as the data warehouse to power these dashboards due to its ability to deliver fast query performance on gigabytes to petabytes of data.

In this post, we discuss the design options that Rapid7 evaluated to build a multi-tenant data warehouse and analytics platform for InsightVM. We will deep dive into the challenges and solutions related to ingesting near-real-time datasets and how to create a scalable reporting solution that can efficiently run queries across more than 3 trillion rows. This post also discusses an option to address the scenario where a particular customer outgrows the average data access needs.

This post uses the terms customers, tenants, and organizations interchangeably to represent Rapid7 InsightVM customers.

Background

To collect data for InsightVM, customers can use scan engines or Rapid7’s Insight Agent. Scan engines allow you to collect vulnerability data on every asset connected to a network. This data is only collected when a scan is run. Alternatively, you can install the Insight Agent on individual assets to collect and send asset change information to InsightVM numerous times each day. The agent also ensures that asset data is sent to InsightVM regardless of whether or not the asset is connected to your network.

Data from scans and agents is sent in the form of packed documents, in micro-batches of hundreds of events. Around 500 documents per second are received across customers, and each document is around 2 MB in size. On a typical day, InsightVM processes 2–3 trillion rows of vulnerability data, which translates to around 56 GB of compressed data for a large customer. This data is normalized and processed by InsightVM’s vulnerability management engine and streamed to the data warehouse system for near-real-time availability of data for analytical insights to customers.

Architecture overview

In this section, we discuss the overall architectural setup for the InsightVM system.

Scan engines and agents collect and send asset information to the InsightVM cloud. Asset data is pooled, normalized, and processed to identify vulnerabilities. This is stored in an Amazon ElastiCache for Redis cluster and also pushed to Amazon Kinesis Data Firehose for use in near-real time by InsightVM’s analytics dashboards. Kinesis Data Firehose delivers raw asset data to an Amazon Simple Storage Service (Amazon S3) bucket. The data is transformed using a custom-developed ingestor service and stored in a new S3 bucket. The transformed data is then loaded into the Redshift data warehouse. Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), and AWS Lambda are used to orchestrate this data flow. In addition, to identify the latest timestamp of vulnerability data for assets, an auxiliary table is maintained and updated periodically with the update logic in the Lambda function, which is triggered through an Amazon CloudWatch event rule. Custom-built middleware components interface between the web user interface (UI) and the Amazon Redshift cluster to fetch asset information for display in dashboards.

The following diagram shows the implementation architecture of InsightVM, including the data warehouse system:

Rapid7 multi-tenant architecture

The architecture has built-in tenant isolation because data access is abstracted through the API. The application uses a dimensional model to support low-latency queries and extensibility for future enhancements.

Amazon Redshift data warehouse design: Options evaluated and selection

Considering Rapid7’s need for near-real-time analytics at any scale, the InsightVM data warehouse system is designed to meet the following requirements:

  • Ability to view asset vulnerability data at near-real time, within 5–10 minutes of ingest
  • Less than 5 seconds’ latency for reporting queries when measured at the 95th percentile (p95)
  • Ability to support 15 concurrent queries per second, with the option to support more in the future
  • Simple and easy-to-manage data warehouse infrastructure
  • Data isolation for each customer or tenant

Rapid7 evaluated Amazon Redshift RA3 instances to support these requirements. When designing the Amazon Redshift schema to support these goals, they evaluated the following strategies:

  • Bridge model – Storage and access to data for each tenant is controlled at the individual schema level in the same database. In this approach, multiple schemas are set up, where each schema is associated with a tenant, with the same exact structure of the dimensional model.
  • Pool model – Data is stored in a single database schema for all tenants, and a new column (tenant_id) is used to scope and control access to individual tenant data. Access to the multi-tenant data is controlled using API-level access to the tables. Tenants aren’t aware of the underlying implementation of the analytical system and can’t query them directly.

For more information about multi-tenant models, see Implementing multi-tenant patterns in Amazon Redshift using data sharing.

When initially evaluating the bridge model, it offered the advantage of queries that touch only a single tenant’s data, plus the ability to decouple a tenant to an independent cluster if it outgrows the resources available in the single cluster. When the p95 metrics were evaluated in this setup, query response times were less than 5 seconds, because each tenant’s data is isolated into smaller tables. However, the major concern with this approach was the near-real-time data ingestion into over 50,000 tables (5,000 customer schemas x approximately 10 tables per schema) every 5 minutes. Thousands of commits every minute into an online analytical processing (OLAP) system like Amazon Redshift can exhaust most resources in the ingestion process, and as a result the application suffers query latencies as data grows.

The pool model provides a simpler setup, but the concern was with query latencies when multiple tenants access the application from the same tables. Rapid7 hoped that these concerns would be addressed by using Amazon Redshift’s support for massively parallel processing (MPP) to enable fast execution of most complex queries operating on large amounts of data. With the right table design using the right sort and distribution keys, it’s possible to optimize the setup. Furthermore, with automatic table optimization, the Amazon Redshift cluster can automatically make these determinations without any manual input.

Rapid7 evaluated both the pool and bridge model designs, and decided to implement the pool model. This model provides simplified data ingestion and can support query latencies of under 5 seconds at p95 with the right table design. The following table summarizes the results of p95 tests conducted with the pool model setup.

Query | P95
Large customer: query with multiple joins that lists assets, their vulnerabilities, and all their related attributes, with aggregated metrics for each asset, and filters to scope assets by attributes like location, names, and addresses | Less than 4 seconds
Large customer: query to return vulnerability content information given a list of vulnerability identifiers | Less than 4 seconds

Tenant isolation and security

Tenant isolation is fundamental to the design and development of SaaS systems. It enables SaaS providers to reassure customers that, even in a multi-tenant environment, their resources can’t be accessed by other tenants.

With the Amazon Redshift table design using the pool model, Rapid7 built a separate data access layer in the middleware that templatized queries, augmented with runtime parameter substitution to uniquely filter specific tenant and organization data.

The following is a sample templatized query:

<#if useDefaultVersion()>
currentAssetInstances AS (
SELECT tablename.*
FROM tablename
<#if (applyTags())>
JOIN dim_asset_tag USING (organization_id, attribute2, attribute3)
</#if>
WHERE organization_id ${getOrgIdFilter()}
<#if (applyTags())>
AND tag_id IN ($(tagIds))
</#if>
),
</#if>

The following is a Java interface snippet to populate the template:

import java.util.Set;

public interface TemplateParameters {

    boolean useDefaultVersion();

    boolean useVersion();

    default Set<String> getVersions() {
        return null;
    }

    default String getVersionJoin(String var1) {
        return "";
    }

    String getTemplateName();

    String getOrgIdString();

    default String getOrgIdFilter() {
        return "";
    }
}

Every query uses organization_id and additional parameters to uniquely access tenant data. During runtime, organization_id and other metadata are extracted from the secured JWT token that is passed to middleware components after the user is authenticated in the Rapid7 cloud platform.

Best practices and lessons learned

To fully realize the benefits of the Amazon Redshift architecture for multiple tenants and near-real-time ingestion, careful table design lets you take full advantage of massively parallel processing and columnar data storage. In this section, we discuss the best practices and lessons learned from building this solution.

Sort key for effective data pruning

Sorting a table on an appropriate sort key can accelerate query performance, especially queries with range-restricted predicates, by requiring fewer table blocks to be read from disk. To have Amazon Redshift choose the appropriate sort order, the AUTO option was utilized. Automatic table optimization continuously observes how queries interact with tables and discovers the right sort key for the table. To effectively prune the data by the tenant, organization_id is identified as the sort key to perform the restricted scans. Furthermore, because all queries are routed through the data access layer, organization_id is automatically added in the predicate conditions to ensure effective use of the sort keys.

Micro-batches for data ingestion

Amazon Redshift is designed for large data ingestion, rather than transaction processing. The cost of commits is relatively high, and excessive use of commits can result in queries waiting for access to the commit queue. Data is micro-batched during ingestion as it arrives for multiple organizations. This results in fewer transactions and commits when ingesting the data.

Load data in bulk

If you use multiple concurrent COPY commands to load one table from multiple files, Amazon Redshift is forced to perform a serialized load, and this type of load is much slower.

The Amazon Redshift manifest file is used to ingest the datasets that span multiple files in a single COPY command, which allows fast ingestion of data in each micro-batch.
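
A sketch of issuing such a manifest-based COPY through the Amazon Redshift Data API is shown below. The table, bucket, role, cluster, and secret names are placeholders, not Rapid7’s actual objects.

import boto3

redshift_data = boto3.client("redshift-data")

# COPY one micro-batch, listed in a manifest file, in a single command.
copy_sql = """
    COPY analytics.fact_asset_vulnerability
    FROM 's3://example-transformed-bucket/manifests/batch-0001.manifest'
    IAM_ROLE 'arn:aws:iam::111122223333:role/example-redshift-copy-role'
    FORMAT AS JSON 'auto'
    MANIFEST;
"""

redshift_data.execute_statement(
    ClusterIdentifier="example-insightvm-cluster",
    Database="analytics",
    SecretArn="arn:aws:secretsmanager:us-east-1:111122223333:secret:example-redshift-loader",
    Sql=copy_sql,
)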

RA3 instances for data sharing

Rapid7 uses Amazon Redshift RA3 instances, which enable data sharing so that live data can be securely and easily shared across Amazon Redshift clusters for reads. In this multi-tenant architecture, when a tenant outgrows the average data access needs, it can easily be isolated to a separate cluster and independently scaled using data sharing. This is accomplished by monitoring the STL_SCAN table to identify different tenants and isolate them to allow for independent scalability as needed.

Concurrency scaling for consistently fast query performance

When concurrency scaling is enabled, Amazon Redshift automatically adds additional cluster capacity when you need it to process an increase in concurrent read queries. To meet the uptick in user requests, the concurrency scaling feature is enabled to dynamically bring up additional capacity to provide consistent p95 values that meet Rapid7’s defined requirements for the InsightVM application.

Results and benefits

Rapid7 saw the following results from this architecture:

  • The new architecture has reduced the time required to make data accessible to customers to less than 5 minutes on average. The previous architecture had a higher level of processing-time variance and could sometimes exceed 45 minutes
  • Dashboards load faster and have enhanced drill-down functionality, improving the end-user experience
  • With all data in a single warehouse, InsightVM has a single source of truth, compared to the previous solution where InsightVM had copies of data maintained in different databases and domains, which could occasionally get out of sync
  • The new architecture lowers InsightVM’s reporting infrastructure cost by almost three times, as compared to the previous architecture

Conclusion

With Amazon Redshift, the Rapid7 team has been able to centralize asset and vulnerability information for InsightVM customers. The team has simultaneously met its performance and management objectives with the use of a multi-tenant pool model and optimized table design. In addition, data ingestion via Kinesis Data Firehose and custom-built microservices to load data into Amazon Redshift in near-real time enabled Rapid7 to deliver asset vulnerability information to customers more than nine times faster than before, improving the InsightVM customer experience.


About the Authors

Rahul Monga is a Principal Software Engineer at Rapid7, currently working on the next iteration of InsightVM. Rahul’s focus areas are highly distributed cloud architectures and big data processing. Originally from the Washington DC area, Rahul now resides in Austin, TX with his wife, daughter, and adopted pup.

Sujatha Kuppuraju is a Senior Solutions Architect at Amazon Web Services (AWS). She works with ISV customers to help design secured, scalable and well-architected solutions on the AWS Cloud. She is passionate about solving complex business problems with the ever-growing capabilities of technology.

Thiyagarajan Arumugam is a Principal Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.

How Amazon CodeGuru Reviewer helps Gridium maintain a high quality codebase

Post Syndicated from Aaqib Bickiya original https://aws.amazon.com/blogs/devops/codeguru-gridium-maintain-codebase/

Gridium creates software that lets people run commercial buildings at a lower cost and with less energy. Currently, half of the world lives in cities. Soon, nearly 70% will, while buildings utilize 40% of the world’s electricity. In the U.S. alone, commercial real estate value tops one trillion dollars. Furthermore, much of this asset class is still run with spreadsheets, clipboards, and outdated software. Gridium’s software platform collects large amounts of operational data from commercial buildings like offices, medical providers, and corporate campuses. Then, our analytics identifies energy savings opportunities that we work with customers to actualize.

Gridium’s Challenge

Data streams from utility companies across the U.S. are an essential input for Gridium’s analytics. This data is processed and visualized so that our customers can garner new insights and drive smarter decisions. In order to integrate a new data source into our platform, we often utilize third-party contractors to write tools that ingest data feeds and prepare them for analysis.

Our team firmly emphasizes our codebase quality. We strive for our code to be stylistically consistent, easily testable, swiftly understood, and well-documented. We write code internally by using agreed-upon standards. This makes it easy for us to review and edit internal code using standard collaboration tools.

We work to ensure that code arriving from contractors meets similar standards, but enforcing external code quality can be difficult. Contractor-developed code will be written in the style of each contractor. Furthermore, reviewing and editing code written by contractors introduces new challenges beyond those of working with internal code. We have used tools like PyLint to help stylistically align code, but we wanted a method for uncovering more complex issues without burdening the team and undermining the benefits of outside contractors.

Adopting Amazon CodeGuru Reviewer

We began evaluating Amazon CodeGuru Reviewer in order to provide an additional review layer for code developed by contractors, and to find complex corrections within code that traditional tools might not catch.

Initially, we enabled the service only on external repositories and generated reviews on pull requests. CodeGuru immediately provided us with actionable recommendations for maintaining Gridium’s codebase quality.

CodeGuru exception handling correction

The example above demonstrates a useful recurring recommendation related to exception handling. Strictly speaking, there is nothing wrong with using a general Exception class whenever needed. However, catching a general Exception class in production can complicate error-handling functions and generate ambiguity when trying to debug or understand code.
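
The pattern can be illustrated with a short Python sketch; the function and field names are hypothetical and not taken from Gridium’s codebase.

import json
import logging

logger = logging.getLogger(__name__)

def parse_reading_broad(raw):
    # Too broad: hides whether the payload was malformed, missing a field,
    # or failed for a completely unrelated reason, which complicates debugging.
    try:
        return json.loads(raw)["reading_kwh"]
    except Exception:
        return None

def parse_reading_specific(raw):
    # Narrower handlers keep unexpected failures visible instead of swallowing them.
    try:
        return json.loads(raw)["reading_kwh"]
    except (json.JSONDecodeError, KeyError) as err:
        logger.warning("Skipping malformed reading: %s", err)
        return None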

After a few weeks of utilizing CodeGuru Reviewer and witnessing its benefits, we determined that we wanted to use it for our entire codebase. Moreover, CodeGuru provided us with meaningful recommendations on our internal repositories.

CodeGuru highly coupled functions suggestion

This example once again showcases CodeGuru’s ability to highlight subtle issues that aren’t necessarily bugs. Without diving through related libraries and functions, it would be difficult to find any optimizable areas, but CodeGuru found that this function could become problematic due to its dependency on twenty other functions. If any of the dependencies is updated, then it could break the entire function, thereby making debugging the root cause difficult.

The explanations following each code recommendation were essential in our quick adoption of CodeGuru. Simply showing what code to change would make it tough to follow through with a code correction. CodeGuru provided ample context, reasoning, and even metrics in some cases to explain and justify a correction.

Conclusion

Our development team thoroughly appreciates the extra review layer that CodeGuru provides. CodeGuru indicates areas that internal review might otherwise miss, especially in code written by external contractors.

In some cases, CodeGuru highlighted issues never before considered by the team. Examples of these issues include highly coupled functions, ambiguous exceptions, and outdated API calls. Each suggestion is accompanied with its context and reasoning so that our developers can independently judge if an edit should be made. Furthermore, CodeGuru was easily set up with our GitHub repositories. We enabled it within minutes through the console.

After familiarizing ourselves with the CodeGuru workflow, Gridium treats CodeGuru recommendations like suggestions that an internal reviewer would make. Both internal developers and third parties act on recommendations in order to improve code health and quality. Any CodeGuru suggestions not accepted by contractors are verified and implemented by an internal reviewer if necessary.

Over the twelve weeks that we have been utilizing Amazon CodeGuru, we have had 281 automated pull request reviews. These provided 104 recommendations resulting in fifty code corrections that we may not have made otherwise. Our monthly bill for CodeGuru usage is about $10.00. If we make only a single correction to our code over a whole month that we would have missed otherwise, then we can safely say that it was well worth the price!

About the Authors

Kimberly Nicholls is an Engineering Technical Lead at Gridium who loves to make data useful. She also enjoys reading books and spending time outside.

Adnan Bilwani is a Sr. Specialist-Builder Experience providing fully managed ML-based solutions to enhance your DevOps workflows.

Aaqib Bickiya is a Solutions Architect at Amazon Web Services. He helps customers in the Midwest build and grow their AWS environments.

How The Mill Adventure Implemented Event Sourcing at Scale Using DynamoDB

Post Syndicated from Uri Segev original https://aws.amazon.com/blogs/architecture/how-the-mill-adventure-implemented-event-sourcing-at-scale-using-dynamodb/

This post was co-written by Joao Dias, Chief Architect at The Mill Adventure and Uri Segev, Principal Serverless Solutions Architect at AWS

The Mill Adventure provides a complete gaming platform, including licenses and operations, for rapid deployment and success in online gaming. It underpins every aspect of the process so that you can focus on telling your story to your audience while the team makes everything else work perfectly.

In this blog post, we demonstrate how The Mill Adventure implemented event sourcing at scale using Amazon DynamoDB and Serverless on AWS technologies. By partnering with AWS, The Mill Adventure reduced their costs, and they are able to maintain operations and scale their solution to suit their needs without manual intervention.

What is event sourcing?

Event sourcing captures an entity’s state (such as a transaction or a user) as a sequence of state-changing events. Whenever the state changes, a new event is appended to the sequence of events using an atomic operation.

The system persists these events in an event store, which is a database of events. The store supports adding and retrieving the state events. The system reconstructs the entity’s state by reading the events from the event store and replaying them. Because the store is immutable (meaning these events are saved in the event store forever) the entity’s state can be recreated up to a particular version or date and have accurate historical values.

Why use event sourcing?

Event sourcing provides many advantages, including (but not limited to) the following:

  • Audit trail: Events are immutable and provide a history of what has taken place in the system. This means it’s not only providing the current state, but how it got there.
  • Time travel: By persisting a sequence of events, it is relatively easy to determine the state of the system at any point in time by aggregating the events within that time period. This provides you the ability to answer historical questions about the state of the system.
  • Performance: Events are simple and immutable and only require an append operation. The event store should be optimized to handle high-performance writes.
  • Scalability: Storing events avoids the complications associated with saving complex domain aggregates to relational databases, which allows more flexibility for scaling.

Event-driven architectures

Event sourcing is also related to event-driven architectures. Every event that changes an entity’s state can also be used to notify other components about the change. In event-driven architectures, we use event routers to distribute the events to interested components.

The event router has three main functions:

  1. Decouple the event producers from the event consumers: The producers don’t know who the consumers are, and they do not need to change when new consumers are added or removed.
  2. Fan out: Event routers are capable of distributing events to multiple subscribers.
  3. Filtering: Event routers send each subscriber only the events they are interested in. This saves on the number of events that consumers need to process; therefore, it reduces the cost of the consumers.

How did The Mill Adventure implement event sourcing?

The Mill Adventure uses DynamoDB tables as their event store. Each event is a new item in the table. The DynamoDB table model for an event-sourced system is quite simple, as follows:

Field | Type | Description
id | PK | The object identifier
version | SK | The event sequence number
eventdata | - | The event data itself, in other words, the change to the object’s state

All events for the same object have the same id. Thus, you can retrieve them using a single read request.

When a component modifies the state of an object, it first determines the sequence number for the new event by reading the current state from the table (in other words, the sequence of events for that object). It then attempts to write a new item to the table that represents the change to the object’s state. The item is written using a DynamoDB conditional write, which ensures that no other change to the same object happens at the same time. If the write fails due to a condition-not-met error, the component starts over.
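
A minimal boto3 sketch of this append, assuming the table layout above and a hypothetical table name, could look like the following.

import boto3
from botocore.exceptions import ClientError

# Hypothetical table name; the key schema matches the model above (id = PK, version = SK).
events_table = boto3.resource("dynamodb").Table("example-event-store")

def append_event(object_id, current_version, event_data):
    """Append the next event for an object, failing if another writer appended first."""
    try:
        events_table.put_item(
            Item={"id": object_id, "version": current_version + 1, "eventdata": event_data},
            # Fails if an item with this exact id/version key already exists.
            ConditionExpression="attribute_not_exists(id)",
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # re-read the latest events and retry
        raise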

An additional benefit of using DynamoDB as the event store is DynamoDB Streams, which is used to deliver events about changes in tables. These events can be used by event-driven applications so they will know about the different objects’ change of state.

How does it work?

Let’s use an example of a business entity, such as a user. When a user is created, the system creates a UserCreated event with the initial user data (like user name, address, etc.). The system then persists this event to the DynamoDB event store using a conditional write. This makes sure that the event is only written once and that the version numbers are sequential.

Then the user address gets updated, so again, the system creates a UserUpdated event with the new address and persists it.

When the system needs the user’s current state, for example, to show it in a back-office application, the system loads all the events for the given user identifier from the store. For each of them, it invokes a mutator function that recreates the latest state. Given the following items in the database:

  • Event 1: UserCreated(name: The Mill, address: Malta)
  • Event 2: UserUpdated(address: World)

You can imagine what the mutator function for each of those events looks like; applying them in order produces the latest state:

{ 
"name": "The Mill", 
"address": "World" 
}
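As an illustration (not The Mill Adventure’s actual code), the replay could look like the following Python sketch, assuming each stored event carries a type and a payload:

# A minimal sketch of rebuilding state by replaying events. Event and field
# names are illustrative assumptions.
def apply(state, event):
    # Mutator: returns the new state produced by applying one event.
    if event["type"] == "UserCreated":
        return dict(event["eventdata"])           # initial state
    if event["type"] == "UserUpdated":
        return {**state, **event["eventdata"]}    # merge the changed fields
    return state                                   # ignore unknown events


def rebuild_state(events):
    state = {}
    for event in sorted(events, key=lambda e: e["version"]):
        state = apply(state, event)
    return state


events = [
    {"version": 1, "type": "UserCreated", "eventdata": {"name": "The Mill", "address": "Malta"}},
    {"version": 2, "type": "UserUpdated", "eventdata": {"address": "World"}},
]
print(rebuild_state(events))  # {'name': 'The Mill', 'address': 'World'}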

A business state like a bank statement can have a large number of events. To optimize loading, the system periodically saves a snapshot of the current state. To reconstruct the current state, the application finds the most recent snapshot and the events that have occurred since that snapshot. As a result, there are fewer events to replay.
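Under those assumptions, the snapshot optimization amounts to starting from the most recent saved state and replaying only the newer events, roughly as in this hypothetical sketch (load_latest_snapshot and load_events_since are placeholder helpers, not real APIs):

# A hedged sketch of snapshot-assisted loading.
def load_current_state(object_id):
    snapshot = load_latest_snapshot(object_id)                 # hypothetical helper
    state = snapshot["state"] if snapshot else {}
    from_version = snapshot["version"] + 1 if snapshot else 1
    for event in load_events_since(object_id, from_version):   # hypothetical helper
        state = apply(state, event)                            # mutator from the sketch above
    return state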

Architecture

The Mill Adventure architecture for an event sourcing system using AWS components is straightforward. The architecture is fully serverless; it uses only AWS Lambda functions for compute. Lambda functions produce the state-changing events that are written to the database.

Other Lambda functions, when they retrieve an object’s state, will read the events from the database and calculate the current state by replaying the events.

Finally, interested functions are notified about the changes by subscribing to the event bus. They then perform their business logic, like updating state projections or publishing to WebSocket APIs. These functions use DynamoDB Streams as the event bus to handle messages, as shown in Figure 1.

Figure 1. Event sourcing architecture

Figure 1 is not completely accurate due to a limitation of DynamoDB Streams, which supports no more than two concurrent consumers per shard.

Because The Mill Adventure has many microservices that are interested in these events, a single function is invoked from the stream and forwards the events to other event routers. These routers fan out to a large number of subscribers, such as Amazon EventBridge, Amazon Simple Notification Service (Amazon SNS), or even Amazon Kinesis Data Streams for some use cases.
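A hedged sketch of such a fan-out function, written as a Python Lambda handler that republishes newly appended event items from the DynamoDB stream to Amazon EventBridge, could look like the following. The bus name, source, and detail type are illustrative assumptions.

# A minimal sketch of a fan-out Lambda handler for DynamoDB stream records.
import json
import boto3

events_client = boto3.client("events")


def handler(event, context):
    entries = []
    for record in event.get("Records", []):
        if record["eventName"] != "INSERT":
            continue  # only newly appended events are of interest
        new_image = record["dynamodb"]["NewImage"]  # DynamoDB-typed attribute map
        entries.append({
            "Source": "event-store",            # hypothetical source name
            "DetailType": "StateChanged",       # hypothetical detail type
            "Detail": json.dumps(new_image),
            "EventBusName": "platform-events",  # hypothetical bus name
        })
    # PutEvents accepts at most 10 entries per call.
    for i in range(0, len(entries), 10):
        events_client.put_events(Entries=entries[i:i + 10])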

Any service in the system can listen to these events (created via the DynamoDB stream and distributed via the event routers) and act on them, for example, by publishing a WebSocket API notification or prompting a contact update in a third-party service.

Conclusion

In this blog post, we showed how The Mill Adventure uses serverless technologies like DynamoDB and Lambda functions to implement an event-driven event sourcing system.

An event-sourced system can be difficult to scale, but using DynamoDB as the event store resolved this issue. It can also be difficult to produce consistent snapshots and Command Query Responsibility Segregation (CQRS) views, but using DynamoDB Streams to distribute the events made this relatively easy.

By partnering with AWS, The Mill Adventure created a sports/casino platform to be proud of. It delivers high-quality data and performance without managing servers; The Mill Adventure pays only for what it uses, and the workload scales up and down as needed.

How MOIA built a fully automated GDPR compliant data lake using AWS Lake Formation, AWS Glue, and AWS CodePipeline

Post Syndicated from Leonardo Pêpe original https://aws.amazon.com/blogs/big-data/how-moia-built-a-fully-automated-gdpr-compliant-data-lake-using-aws-lake-formation-aws-glue-and-aws-codepipeline/

This is a guest blog post co-written by Leonardo Pêpe, a Data Engineer at MOIA.

MOIA is an independent company of the Volkswagen Group with locations in Berlin and Hamburg, and operates its own ride pooling services in Hamburg and Hanover. The company was founded in 2016 and develops mobility services independently or in partnership with cities and existing transport systems. MOIA’s focus is on ride pooling and the holistic development of the software and hardware for it. In October 2017, MOIA started a pilot project in Hanover to test a ride pooling service, which was brought into public operation in July 2018. MOIA covers the entire value chain in the area of ride pooling. MOIA has developed a ridesharing system to avoid individual car traffic and use the road infrastructure more efficiently.

In this post, we discuss how MOIA uses AWS Lake Formation, AWS Glue, and AWS CodePipeline to store and process gigabytes of data on a daily basis, serve 20 different teams with individual user data access, and implement fine-grained control of the data lake to comply with General Data Protection Regulation (GDPR) guidelines. This involves controlling access to data at a granular level. The solution supports MOIA’s fast pace of innovation by automatically adapting user permissions as new tables and datasets become available.

Background

Each MOIA vehicle can carry six passengers. Customers interact with the MOIA app to book a trip, cancel a trip, and give feedback. The highly distributed system prepares multiple offers to reach their destination with different pickup points and prices. Customers select an option and are picked up from their chosen location. All interactions between the customers and the app, as well as all the interactions between internal components and systems (the backend’s and vehicle’s IoT components), are sent to MOIA’s data lake.

Data from the vehicles, app, and backend must be centralized to keep planned and completed trips in sync, to collect passenger feedback, and to provide different pricing and routing options. MOIA decided to build and secure its Amazon Simple Storage Service (Amazon S3) based data lake using AWS Lake Formation.

Different MOIA teams, including data analysts, data scientists, and data engineers, need to access centralized data from different sources to develop and operate the application workloads. It’s a legal requirement to control the access to, and format of, the data for these different teams. The app development team needs to understand customer feedback in an anonymized way, pricing-related data must be accessed only by the business analytics team, vehicle data is meant to be used only by the vehicle maintenance team, and the routing team needs access to customer location and destination.

Architecture overview

The following diagram illustrates the MOIA solution architecture.

The solution has the following components:

  1. Data input – The apps, backend, and IoT devices continuously stream event messages via Amazon Kinesis Data Streams as nested JSON.
  2. Data transformation – When the raw data is received, the data pipeline orchestrator (Apache Airflow) starts the data transformation jobs to meet the requirements of the respective data formats. Amazon EMR, AWS Lambda, or AWS Fargate cleans, transforms, and loads the events into the S3 data lake.
  3. Persistence layer – After the data is transformed and normalized, AWS Glue crawlers classify the data to determine the format, schema, and associated properties. They group data into tables or partitions and write metadata to the AWS Glue Data Catalog based on the structured partitions created in Amazon S3. The data is written to Amazon S3 partitioned by its event type and timestamp so it can be easily accessed via Amazon Athena and the correct permissions can be applied using Lake Formation. The crawlers run at a fixed interval in accordance with the batch jobs, so no manual runs are required (a crawler-creation sketch follows this list).
  4. Governance layer – Access to the data lake is managed using the Lake Formation governance layer. MOIA uses Lake Formation to centrally define security, governance, and auditing policies in one place. These policies are consistently enforced, which eliminates the need to manually configure them across security services like AWS Identity and Access Management (IAM) and AWS Key Management Service (AWS KMS), storage services like Amazon S3, or analytics and machine learning (ML) services like Amazon Redshift, Athena, and Amazon EMR for Apache Spark. Lake Formation reduces the effort of configuring policies across services and provides consistent help with GDPR requirements and compliance. When a user requests access to Data Catalog resources or underlying data from the data lake, the request must pass permission checks by both IAM and Lake Formation to succeed. Lake Formation permissions control access to Data Catalog resources, Amazon S3 locations, and the underlying data at those locations. User permissions to the data at the table, row, or column level are defined in Lake Formation, so that users only access the data they’re authorized to see.
  5. Data access layer – After the flattened and structured data is available to data analysts, scientists, and engineers, they can use Amazon Redshift or Athena to perform SQL analysis, or AWS Data Wrangler and Apache Spark to perform programmatic analysis and build ML use cases.
  6. Infrastructure orchestration and data pipeline – After the data is ingested and transformed, the infrastructure orchestration layer (CodePipeline and Lake Formation) creates or updates the governance layer with the proper, predefined permissions every hour (see more about the automation in the governance layer later in this post).
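For illustration, scheduling such a crawler with the Python SDK (Boto3) could look like the following sketch. The crawler name, role, database, S3 path, and schedule are assumptions, not MOIA’s actual configuration.

# A minimal sketch of creating a scheduled AWS Glue crawler with Boto3.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="events-crawler",                                     # hypothetical
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",     # hypothetical
    DatabaseName="data_lake",                                  # hypothetical
    Targets={"S3Targets": [{"Path": "s3://example-data-lake/events/"}]},
    # Run every hour, aligned with the batch transformation jobs.
    Schedule="cron(0 * * * ? *)",
)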

Governance challenge

MOIA wants to continuously evolve their ML models for routing, demand prediction, and business models. This requires MOIA to constantly review and update models, so power users such as data administrators and engineers frequently redesign the table schemas in the AWS Glue Data Catalog as part of the data engineering workflow. This highly dynamic metadata transformation requires an equally dynamic governance layer pipeline that can assign the right user permissions to all tables and adapt to these changes transparently, without disruption to end-users. Due to GDPR requirements, there is no room for error; without automation, assigning GDPR-compliant permissions on the tables of a data lake holding many terabytes of data would require many developers and introduce human error into the workflows, which is not acceptable. Manual administration and access management isn’t a scalable solution. MOIA needed to innovate fast, stay GDPR compliant, and do so with a small team of developers.

Data schema and data structure often changes at MOIA, resulting in new tables being created in the data lake. To guarantee that new tables inherit the permissions granted to the same group of users who already have access to the entire database, MOIA uses an automated process that grants Lake Formation permission to newly created tables, columns, and databases, as described in the next section.

Solution overview

The following diagram illustrates the continuous deployment loop using AWS CloudFormation.

The workflow contains the following steps:

  1. MOIA data scientists and engineers create or modify new tables to evolve the business and ML models. This results in new tables or changes in the existing schema of the table.
  2. The data pipeline is orchestrated via Apache Airflow, which controls the whole cycle of ingestion, cleansing, and data transformation using Spark jobs (as shown in Step 1). When the data is in its desired format and state, Apache Airflow triggers the AWS Glue crawler job to discover the data schema in the S3 bucket and add the new partitions to the Data Catalog. Apache Airflow also triggers CodePipeline to assign the right permissions on all the tables.
  3. Apache Airflow triggers CodePipeline every hour, which uses MOIA’s libraries to discover the new tables, columns, and databases, and grants Lake Formation permissions in the AWS CloudFormation template for all tables, including new and modified. This orchestration layer ensures Lake Formation permissions are granted. Without this step, nobody can access new tables, and new data isn’t visible to data scientists, business analysts, and others who need it.
  4. MOIA uses Stacker and Troposphere to identify all the tables, assign the tables the right permissions, and deploy the CloudFormation stack. Troposphere is the infrastructure as code (IaC) framework that renders the CloudFormation templates, and stacker helps with the parameterization and deployment of the templates on AWS. Stacker also helps produce reusable templates in the form of blueprints as pure Python code, which can be easily parameterized using YAML files. The Python code uses Boto3 clients to look up the Data Catalog and search for databases and tables.
  5. The stacker blueprint libraries developed by MOIA (which internally use Boto3 and troposphere) are used to discover newly created databases or tables in the Data Catalog.
  6. A permissions configuration file allows MOIA to predefine data lake tables (which can be a wildcard indicating all tables available in one database) and the specific personas who have access to these tables. These roles are split into different categories called lake personas, and have specific access levels. Example lake personas include data domain engineers, data domain technical users, data administrators, and data analysts. Users and their permissions are defined in the file. In case access to personally identifiable information (PII) data is granted, the GDPR officer ensures that a record of processing activities is in place and the usage of personal data is documented according to the law.
  7. MOIA uses stacker to read the predefined permissions file, and uses the customized stacker blueprints library containing the logic to assign permissions to lake personas for each of the tables discovered in Step 5.
  8. The discovered and modified tables are matched with predefined permissions in Step 6 and Step 7. Stacker uses YAML format as the input for the CloudFormation template parameters to define users and data access permissions. These parameters are related to the user’s roles and define access to the tables. Based on this, MOIA creates a customized stacker that creates AWS CloudFormation resources dynamically. The following are the example stacks in AWS CloudFormation:
    1. data-lake-domain-database – Holds all resources that are relevant during the creation of the database, which includes lake permissions at the database level that allow the domain database admin personas to perform operations in the database.
    2. data-lake-domain-crawler – Scans Amazon S3 constantly to discover and create new tables and updates.
    3. data-lake-lake-table-permissions – Uses Boto3 combined with troposphere to list the available tables and grant lake permissions to the domain roles that access it.

This way, new CloudFormation templates are created, containing permissions for new or modified tables. This process guarantees a fully automated governance layer for the data lake. The generated CloudFormation template contains Lake Formation permission resources for each table, database, or column. The process of managing Lake Formation permissions on Data Catalog databases, tables, and columns is simplified by granting Data Catalog permissions using the Lake Formation tag-based access control method. The advantage of generating a CloudFormation template is auditability. When the new version of the CloudFormation stack is prepared with access control set on new or modified tables, administrators can compare that stack with the older version to discover newly prepared and modified tables. MOIA can view the differences via the AWS CloudFormation console before new stack deployment.
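Conceptually, the discovery-and-grant logic resembles the following simplified Boto3 sketch. MOIA actually renders these grants as CloudFormation resources with troposphere and stacker rather than calling Lake Formation directly, and the database name, role ARN, and permission set here are illustrative assumptions.

# A simplified sketch: list the tables in a Data Catalog database and grant a
# lake persona SELECT on each one.
import boto3

glue = boto3.client("glue")
lakeformation = boto3.client("lakeformation")

DATABASE = "data_lake"                                          # hypothetical
ANALYST_ROLE = "arn:aws:iam::123456789012:role/data-analyst"    # hypothetical

paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName=DATABASE):
    for table in page["TableList"]:
        lakeformation.grant_permissions(
            Principal={"DataLakePrincipalIdentifier": ANALYST_ROLE},
            Resource={"Table": {"DatabaseName": DATABASE, "Name": table["Name"]}},
            Permissions=["SELECT"],
        )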

  9. Either the CloudFormation stack is deployed in the account with the right permissions for all users, or a stack deployment failure notice is sent to the developers via a Slack channel (Step 10). This ensures visibility into the deployment process and into failed pipelines in permission assignment.
  10. Whenever the pipeline fails, MOIA receives a notification via Slack so the engineers can promptly fix the errors. This loop is very important to guarantee that whenever the schema changes, those changes are reflected in the data lake without manual intervention.
  11. Following this automated pipeline, all the tables are assigned the right permissions.

Benefits

This solution delivers the following benefits:

  • Automated enforcement of data access permissions allows people to access the data without manual interventions in a scalable way. With automation, 1,000 hours of manual work are saved every year.
  • The GDPR team does the assessment internally every month. The GDPR team constantly provides guidance and approves each change in permissions. This audit trail for records of processing activities is automated with the help of Lake Formation (otherwise it would require a dedicated human resource).
  • The automated workflow of permission assignment is integrated into existing CI/CD processes, resulting in faster onboarding of new teams, features, and dataset releases. An average of 48 releases are done each month (including major and minor version releases and new event types). Onboarding new teams and forming internal new teams is very easy now.
  • This solution enables you to create a data lake using IaC processes that are easy and GDPR-compliant.

Conclusion

MOIA has created scalable, automated, and versioned permissions with a GDPR-supported, governed data lake using Lake Formation. This solution helps them bring new features and models to market faster, and reduces administrative and repetitive tasks. MOIA can focus on an average of 48 releases every month, contributing to a great customer experience and new data insights.


About the Authors

Leonardo Pêpe is a Data Engineer at MOIA. With a strong background in infrastructure and application support and operations, he is immersed in the DevOps philosophy. He’s helping MOIA build automated solutions for its data platform and enabling the teams to be more data-driven and agile. Outside of MOIA, Leonardo enjoys nature, Jiu-Jitsu and martial arts, and explores the good of life with his family.

 

Sushant Dhamnekar is a Solutions Architect at AWS. As a trusted advisor, Sushant helps automotive customers to build highly scalable, flexible, and resilient cloud architectures, and helps them follow the best practices around advanced cloud-based solutions. Outside of work, Sushant enjoys hiking, food, travel, and CrossFit workouts.

 

 

Shiv Narayanan is Global Business Development Manager for Data Lakes and Analytics solutions at AWS. He works with AWS customers across the globe to strategize, build, develop and deploy modern data platforms. Shiv loves music, travel, food and trying out new tech.

Toyota Connected and AWS Design and Deliver Collision Assistance Application

Post Syndicated from Srikanth Kodali original https://aws.amazon.com/blogs/architecture/toyota-connected-and-aws-design-and-deliver-collision-assistance-application/

This post was cowritten by Srikanth Kodali, Sr. IoT Data Architect at AWS, and Will Dombrowski, Sr. Data Engineer at Toyota Connected

Toyota Connected North America (TC) is a technology/big data company that partners with Toyota Motor Corporation and Toyota Motor North America to develop products that aim to improve the driving experience for Toyota and Lexus owners.

TC’s Mobility group provides backend cloud services that are built and hosted in AWS. Together, TC and AWS engineers designed, built, and delivered their new Collision Assistance product, which debuted in early August 2021.

In the aftermath of an accident, Collision Assistance offers Toyota and Lexus drivers instructions to help them navigate a post-collision situation. This includes documenting the accident, filing an insurance claim, and transitioning to the repair process.

In this blog post, we’ll talk about how our team designed, built, refined, and deployed the Collision Assistance product with serverless AWS services. We’ll discuss our goals in developing this product and the architecture we developed based on those goals. We’ll also present issues we encountered when testing our initial architecture and how we resolved them to create the final product.

Building a scalable, affordable, secure, and high performing product

We used a serverless architecture because it is often less complex than other architecture types. Our goals in developing this initial architecture were to achieve scalability, affordability, security, and high performance, as described in the following sections.

Scalability and affordability

In our initial architecture, Amazon Simple Queue Service (Amazon SQS) queues, Amazon Kinesis streams, and AWS Lambda functions allow data pipelines to run servers only when they’re needed, which introduces cost savings. They also process data in smaller units and run them in parallel, which allows data pipelines to scale up efficiently to handle peak traffic loads. These services allow for an architecture that can handle non-uniform traffic without needing additional application logic.

Security

Collision Assistance can deliver information to customers via push notifications. This data must be encrypted because many data points the application collects are sensitive, like geolocation.

To secure this data outside our private network, we use Amazon Simple Notification Service (Amazon SNS) as our delivery mechanism. Amazon SNS provides HTTPS endpoint delivery of messages coming to topics and subscriptions. AWS allows us to enable at-rest and/or in-transit encryption for all of our other architectural components as well.
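As an illustration of these settings, the following Python (Boto3) sketch creates an encrypted SNS topic and an HTTPS subscription. The topic name, key alias, and endpoint URL are placeholders rather than values from the actual product.

# A minimal sketch of an encrypted SNS topic with an HTTPS subscriber.
import boto3

sns = boto3.client("sns")

topic = sns.create_topic(
    Name="collision-notifications",                    # hypothetical topic name
    Attributes={"KmsMasterKeyId": "alias/aws/sns"},     # server-side encryption at rest
)

sns.subscribe(
    TopicArn=topic["TopicArn"],
    Protocol="https",                                   # messages delivered over TLS
    Endpoint="https://notifications.example.com/collision",  # hypothetical endpoint
)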

Performance

To quantify our product’s performance, we review the “notification delay.” This metric evaluates the time between the initial collision and when the customer receives a push notification from Collision Assistance. Our ultimate goal is to have the push notification sent within minutes of a crash, so drivers have this information in near real time.

Initial architecture

Figure 1 presents our initial architecture implementation that aims to predict whether a crash has occurred and reduce false positives through the following data pipeline:

  1. The Kinesis stream receives vehicle data from an upstream ingestion service, as discussed in the Enhancing customer safety by leveraging the scalable, secure, and cost-optimized Toyota Connected Data Lake blog.
  2. A Lambda function writes lookup data to Amazon DynamoDB for every Kinesis record.
  3. This Lambda function filters out obvious non-crash data. It sends the current record (X) to Amazon SQS; if X exceeds a certain threshold, it remains a crash candidate.
  4. Amazon SQS sets a delivery delay so that there will be more Kinesis/DynamoDB records available when X is processed later in the pipeline.
  5. A second Lambda function reads the data from the SQS message. It queries DynamoDB to find the Kinesis lookup data for the message before (X-1) and after (X+1) the crash candidate.
  6. Kinesis GetRecords retrieves X-1 and X+1, because X+1 will exist after the SQS delivery delay times out.
  7. The X-1, X, and X+1 messages are sent to the data science (DS) engine.
  8. When a crash is accurately predicted, these results are stored in a DynamoDB table.
  9. The push notification is sent to the vehicle owner. (Note: the push notification is still in a select testing phase.)

Figure 1. Diagram and description of our initial architecture implementation

To be consistent with privacy best practices and reduce server uptime, this architecture uses the minimum amount of data the DS engine needs.

We filter out records that fall below extremely low thresholds. After this filtering, around 40% of the data fits the criteria to be evaluated further, which reduces the server capacity needed by the DS engine by 60%.

To reduce false positives, we gather data before and after the timestamps where the extremely low thresholds are exceeded. We then evaluate the sensor data across this timespan and discard any sets with patterns of abnormal sensor readings or other false positive conditions. Figure 2 shows the time window we initially used.

Figure 2. Longitudinal acceleration versus time

Adjusting our initial architecture for better performance

Our initial design worked well for processing a few sample messages and achieved the desired near real-time delivery of the push notification. However, when the pipeline was enabled for over 1 million vehicles, certain limits were exceeded, particularly for Kinesis and Lambda integrations:

  • Our Kinesis GetRecords API exceeded the allowed five requests per shard per second. With each crash candidate retrieving an X-1 and X+1 message, we could only evaluate two per shard per second, which isn’t cost effective.
  • Additionally, the downstream SQS-reading Lambda function was limited to 10 records per second per invocation. This meant any slowdown that occurs downstream, such as during DS engine processing, could cause the queue to back up significantly.

To improve cost and performance for the Kinesis-related functionality, we abandoned the DynamoDB lookup table and the GetRecord calls in favor of using a Redis cache cluster on Amazon ElastiCache. This allows us to avoid all throughput exceptions from Kinesis and focus on scaling the stream based on the incoming throughput alone. The ElastiCache cluster scales capacity by adding or removing shards, which improves performance and cost efficiency.
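To illustrate the idea (not the actual implementation), recent telemetry could be cached in Redis keyed by vehicle, so the downstream function can fetch the records around a crash candidate (X-1, X+1) without Kinesis GetRecords calls. The key layout, TTL, and endpoint in this Python sketch are assumptions.

# A hedged sketch of caching recent telemetry records in Redis (ElastiCache).
import json
import redis

cache = redis.Redis(host="my-elasticache-endpoint", port=6379)  # hypothetical endpoint


def cache_record(vehicle_id, timestamp, record):
    # One sorted set per vehicle, scored by timestamp, expiring after an hour.
    key = f"telemetry:{vehicle_id}"
    cache.zadd(key, {json.dumps(record): timestamp})
    cache.expire(key, 3600)


def neighbors(vehicle_id, timestamp, window_s=10):
    # Fetch the records just before and after the crash candidate.
    raw = cache.zrangebyscore(f"telemetry:{vehicle_id}", timestamp - window_s, timestamp + window_s)
    return [json.loads(r) for r in raw]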

To solve the Amazon SQS/Lambda integration issue, we funneled messages directly to an additional Kinesis stream. This allows the final Lambda function to use some of the better scaling options provided to Kinesis-Lambda event source integrations, like larger batch sizes and max-parallelism.
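If we assume those options map to the BatchSize and ParallelizationFactor settings of a Kinesis event source mapping, configuring them with the Python SDK looks roughly like the following; the stream and function names are placeholders.

# A minimal sketch of a Kinesis-Lambda event source mapping with tuned scaling.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/crash-candidates",  # hypothetical
    FunctionName="ds-engine-dispatcher",    # hypothetical
    StartingPosition="LATEST",
    BatchSize=500,                           # larger batches per invocation
    ParallelizationFactor=10,                # up to 10 concurrent batches per shard
    MaximumBatchingWindowInSeconds=5,
)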

After making these adjustments, our tests proved we could scale to millions of vehicles as needed. Figure 3 shows a diagram of this final architecture.

Figure 3. Final architecture

Conclusion

Engineers across many functions worked closely to deliver the Collision Assistance product.

Our team of backend Java developers, infrastructure experts, and data scientists from TC and AWS built and deployed a near real-time product that helps Toyota and Lexus drivers document crash damage, file an insurance claim, and get updates on the actual repair process.

The managed services and serverless components available on AWS provided TC with many options to test and refine our team’s architecture. This helped us find the best fit for our use case. Having this flexibility in design was a key factor in designing and delivering the best architecture for our product.

 

How Magellan Rx Management used Amazon Redshift ML to predict drug therapeutic conditions

Post Syndicated from Karim Prasla original https://aws.amazon.com/blogs/big-data/how-magellan-rx-management-used-amazon-redshift-ml-to-predict-drug-therapeutic-conditions/

This post is co-written with Karim Prasla and Deepti Bhanti from Magellan Rx Management as the lead authors.

Amazon Redshift ML makes it easy for data scientists, data analysts, and database developers to create, train, and use machine learning (ML) models using familiar SQL commands in Amazon Redshift data warehouses. The ML feature can be used by various personas; in this post we discuss how data analysts at Magellan Rx Management used Redshift ML to predict and classify drugs into various therapeutic conditions.

About Magellan Rx Management

Magellan Rx Management, a division of Magellan Health, Inc., is shaping the future of pharmacy. As a next-generation pharmacy organization, we deliver meaningful solutions to the people we serve. As pioneers in specialty drug management, industry leaders in Medicaid pharmacy programs, and disruptors in pharmacy benefit management, we partner with our customers and members to deliver a best-in-class healthcare experience.

Use case

Magellan Rx Management uses data and analytics to deliver clinical solutions that improve patient care, contain costs, and improve outcomes. We utilize predictive analytics to forecast future drug costs, identify drugs that will drive future trends (including pipeline medications), and proactively identify patients at risk of becoming non-adherent to their medications.

We develop and deliver these analytics via our MRx Predict solution. This solution utilizes a variety of data, including pharmacy and medical claims, compendia information, census data, and many other sources, to optimize predictive model development and deployment and, most importantly, maximize predictive accuracy.

Part of the process involves stratification of drugs and patient population based on therapeutic conditions. These conditions are derived by utilizing compendia drug attributes to come up with a classification system focused on disease categories, such as diabetes, hypertension, spinal muscular atrophy, and many others, and ensuring drugs are appropriately categorized as opposed to being included in the “miscellaneous” bucket, which is the case for many specialty medications. Also, this broader approach to classifying drugs, beyond just mechanism of action, has allowed us to effectively develop downstream reporting solutions, patient segmentation, and outcomes analytics.

Although we use different technologies for predictive analytics, we wanted to use Redshift ML to predict appropriate drug therapeutic conditions, because it would allow us to use this functionality within our Amazon Redshift data warehouse by making the predictions using standard SQL programming. Prior to Redshift ML, we had data analysts and clinicians manually categorize any new drugs into the appropriate therapeutic conditions. The end goal of using Redshift ML was to assess how well we could improve our operational efficiency while maintaining a high level of clinical accuracy.

Redshift ML allowed us to create models using standard SQL without having to use external systems, technologies, or APIs. With data already in Amazon Redshift, our data scientists and analysts, who are skilled in SQL, seamlessly created ML models to effectively predict therapeutic conditions for individual drugs.

“At Magellan Rx Management, we leverage data, analytics, and proactive insights to help solve complex pharmacy challenges, while focusing on improving clinical and economic outcomes for customers and members,” said Karim Prasla, Vice President of Clinical Outcomes Advanced Analytics and Research at Magellan Rx Management. “We use predictive analytics and machine learning to improve operational and clinical efficiencies and effectiveness. With Redshift ML, we were able to enable our data and outcomes analysts to classify new drugs to market into appropriate therapeutic conditions by creating and utilizing ML models with minimal effort. The efficiency gained through leveraging Redshift ML to support this process improved our productivity and optimized our resources while generating a high degree of predictive accuracy.”

Magellan Rx Management continues to focus on using data analytics to identify opportunities to improve patient care and outcomes, to deliver proactive insights so our customers can make sound data-driven decisions, and to provide end-user tools that give customers and support staff readily available insights. By incorporating predictive analytics along with descriptive and diagnostic insights, we at Magellan Rx Management are able to partner with our customers to evaluate the “what,” the “why,” and the “what will” to more effectively manage pharmacy programs.

Key benefits of using Redshift ML

Redshift ML enables you to train models with a single SQL CREATE MODEL command. The CREATE MODEL command creates a model that Amazon Redshift uses to generate model-based predictions with familiar SQL constructs.

Redshift ML provides simple, optimized, and secure integration between Amazon Redshift and Amazon SageMaker, and enables inference within the Amazon Redshift cluster, making it easy to use predictions generated by ML-based models in queries and applications. An analyst with SQL skills can easily create and train models with no expertise in ML programming languages, algorithms, and APIs.

With Redshift ML, you don’t have to perform any of the undifferentiated heavy lifting required for integrating with an external ML service. Redshift ML saves you the time to format and move data, manage permission controls, or build custom integrations, workflows, and scripts. You can easily use popular ML algorithms and simplify training needs that require frequent iteration from training to prediction. Amazon Redshift automatically discovers the best algorithm and tunes the best model for your problem. You can simply make predictions and predictive analytics from within your Amazon Redshift cluster without the need to move data out of Amazon Redshift.

Simplifying integration between Amazon Redshift and SageMaker

Before Redshift ML, a data scientist had to go through a series of steps with various tools to arrive at a prediction—identify the appropriate ML algorithms in SageMaker or use Amazon SageMaker Autopilot, export the data from the data warehouse, and prepare the training data to work with these models. When the model is deployed, the scientist goes through various iterations with new data for making predictions (also known as inference). This involves moving data back and forth between Amazon Redshift and SageMaker through a series of manual steps:

  1. Export training data to Amazon Simple Storage Service (Amazon S3).
  2. Train the model in SageMaker.
  3. Export prediction input data to Amazon S3.
  4. Generate predictions in SageMaker.
  5. Import targets back into the database.

The following diagram depicts these steps.
Amazon Redshift ML Architecture
This iterative process is time-consuming and prone to errors, and automating the data movement can take weeks or months of custom coding that then needs to be maintained.

Redshift ML enables you to use ML with your data in Amazon Redshift without this complexity. Without movement of data, you don’t have any overhead of additional security and governance of data that you export from your data warehouse.

Prerequisites

To take advantage of Redshift ML, you need an Amazon Redshift cluster with the ML feature enabled.

The cluster needs to have an AWS Identity and Access Management (IAM) role attached with sufficient privileges on Amazon S3 and SageMaker. For more information on setup and an introduction to Redshift ML, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML. The IAM role is required for Amazon Redshift to interact with SageMaker and Amazon S3. An S3 bucket is also required to export the training dataset and store other ML-related intermediate artifacts.

Setup and data preparation

We use a sample dataset for this use case; the last column mrx_therapeutic_condition is the one we’re going to predict using the multi-classification model. We ingest this health care drug dataset into the Amazon Redshift cluster, and use it to train and test the model.

To get randomness in the training dataset, we used a random record number column (record_nbr) to separate the training dataset from the testing dataset. If your dataset already has a unique or primary key column, you can use it as a predicate to separate the training dataset from the testing dataset. The random factor (record_nbr) column was added to this dataset to serve as a WHERE clause predicate for the SELECT statement in the CREATE MODEL command (discussed later in this post). As a classic example, if your prediction column is True or False, you want to make sure that the training dataset has a balanced mix of both values and not just True or False alone. You can achieve this by using a neutral column as a predicate, which in our case is record_nbr. After loading the data, we validated that all column values are uniformly distributed for training the model. A very important factor for getting better prediction accuracy is to use a balanced training dataset in which the distribution of all the values is balanced and covers all characteristics.

In this example, we have kept the entire data in single table but used the WHERE predicate on the record_nbr column to differentiate the training and testing datasets.

The following are the DDL and COPY commands for this use case.

  1. Sign in to Amazon Redshift using the query editor or your preferred SQL client (DBeaver, DbVisualizer, SQL Workbench/J) to run the SQL statements to create the database tables.
  2. Create the schema and tables using the following DDL statements:
    --create schema
    CREATE SCHEMA drug_ml;
    
    --train table
    CREATE TABLE drug_ml.ml_drug_data
    (
    record_nbr int,
    col2 datatype,
    col3 datatype,
    etc..,
    mrx_therapeutic_condition character varying(47) 
    ) 
    DISTSTYLE EVEN;

  3. Use the COPY commands to prepare and load data for model training:
    --Load 
    COPY drug_ml.ml_drug_rs_raw
    FROM 's3://<s3bucket>/poc/drug_data_000.gz'
    IAM_ROLE 'arn:aws:iam::<AWSAccount>:role/RedshiftMLRole' 
    gzip;

Create model

Now that the data setup is done, let’s create the model in Redshift ML. CREATE MODEL runs in the background, and you can track its progress using SHOW MODEL model_name;.

For a data analyst who is not conversant with machine learning, the CREATE MODEL statement is a powerhouse that offers flexibility in the number of options used to create the model. In our current use case, we didn’t pass the model type (multi-class classification) or any other options; based on the input data, the Redshift ML CREATE MODEL statement selects the correct problem type and other associated parameters using Autopilot. Multi-class classification is a problem type that predicts one of many outcomes, such as the rating a customer might give for a product. In our example, we predict therapeutic conditions such as acne, anti-clotting therapy, asthma, or COPD for different drugs. Data scientists and ML experts can use it to perform supervised learning to tackle problems ranging from forecasting and personalization to customer churn prediction. See the following code:

--Create model using 130 K records (80% Training Data)
CREATE MODEL drug_ml.predict_model_drug_therapeutic_condition
FROM (
	SELECT *
	FROM drug_ml.ml_drug_data
	WHERE mrx_therapeutic_condition IS NOT NULL
		AND record_nbr < 130001
	) TARGET mrx_therapeutic_condition FUNCTION predict_drug_therapeutic_condition IAM_ROLE 'Replace with your IAM Role ARN' SETTINGS (S3_BUCKET 'Replace with your bucket name');

The target (label) in the preceding code is the mrx_therapeutic_condition column, which is used for prediction, and the function that is created out of this CREATE MODEL command is named predict_drug_therapeutic_condition.

The training data comes from the table ml_drug_data, and Autopilot automatically detects the multi-class classification model type based on the input data.

Show model

The command SHOW MODEL model_name; is used to track progress and get more details about a model (see the following example code). This section contains representative output from SHOW MODEL after CREATE MODEL is complete. It shows the inputs used in the CREATE MODEL statement along with additional information, such as the model state (TRAINING, READY, FAILED) and the maximum estimated runtime to finish. For better prediction accuracy, we recommend letting the model train to completion by increasing the default maximum runtime if needed. CREATE MODEL also has an option to limit the maximum runtime to a lower number, but the more runtime you allow, the more the model can train and iterate over multiple options before settling on the best one.

--to show the model

SHOW MODEL drug_ml.predict_model_drug_therapeutic_condition;

The following table shows our results.

Inference

For binary and multi-class classification problems, we compute the accuracy as the model metric. We use the following formula to determine the accuracy:

accuracy = (sum(actual == predicted) / total) * 100

We use the newly created function predict_drug_therapeutic_condition for the prediction and use the columns other than the target (label) as the input:

-- check accuracy
WITH infer_data AS (
SELECT mrx_therapeutic_condition AS label
,drug_ml.predict_drug_therapeutic_condition(record_nbr , brand_nm, gnrc_nm, drug_prod_strgth , gnrc_seq_num, lbl_nm , drug_dsge_fmt_typ_desc , cmpd_rte_of_admn_typ_desc , specif_thrputic_ingrd_clss_typ , hicl_seq_num, ndc_id , mfc_obsolete_dt , hcfa_termnatn_dt , specif_thrputic_ingrd_clss_desc, gpi14code, druggroup  , drugclass , drugsubclass, brandname) AS predicted,
CASE
WHEN label IS NULL
THEN 'N/A'
ELSE label
END AS actual,
CASE
WHEN actual = predicted
THEN 1::INT
ELSE 0::INT
END AS correct
FROM DRUG_ML.ML_DRUG_DATA),
aggr_data AS (
SELECT SUM(correct) AS num_correct,
COUNT(*) AS total
FROM infer_data)
SELECT (num_correct::FLOAT / total::FLOAT) AS accuracy FROM aggr_data;

--output of above query
accuracy
-------------------
0.975466558808845
(1 row)

The inference query output (0.9755 * 100 ≈ 97.55%) matches the accuracy reported by the SHOW MODEL command.

Let’s run the prediction query on therapeutic condition to get the count of original vs. ML-predicted values for accuracy comparison:

--check prediction 
WITH infer_data
AS (
	SELECT mrx_therapeutic_condition
		,drug_ml.predict_drug_therapeutic_condition(m.record_nbr, m.brand_nm, m.gnrc_nm, m.drug_prod_strgth, m.gnrc_seq_num, Other Params) AS predicted
	FROM "DRUG_ML"."ML_DRUG_DATA" m
	WHERE mrx_therapeutic_condition IS NOT NULL
		AND record_nbr >= 130001
	)
SELECT CASE 
		WHEN mrx_therapeutic_condition = predicted
			THEN 'Match'
		ELSE 'Un-Match'
		END AS match_ind
	,count(1) AS record_cnt
FROM infer_data
GROUP BY 1;

The following table shows our output.

Conclusion

Redshift ML allows us to develop greater efficiency and enhanced ability to generate MRx therapeutic conditions, especially when new drugs come to market. Redshift ML provides an easy and seamless platform for database users to create, train, and tune models using a SQL interface. ML capabilities empower Amazon Redshift users to create models and deploy them locally on large datasets, which was an arduous task before. Different user personas, from data analysts to advanced data science and ML experts, can take advantage of Redshift ML for machine learning based predictions and gain valuable business insights.


About the Authors

Karim Prasla, Pharm D, MS, BCPS, is a clinician and a health economist focused on optimizing use of data, analytics, and insights to improve patient care and drive positive clinical and economic outcomes. Karim leads a talented team of clinicians, economists, health outcomes scientists, and data analysts at Magellan Rx Management who are passionate about leveraging data analytics to develop products and solutions that provide proactive insights to deliver value to clinicians, customers, and members.

 

Deepti Bhanti, PhD, is an IT leader at Magellan Rx Management with a focus on delivering big data solutions to meet the data and analytics needs for our customers. Deepti leads a full stack team of IT engineers who are passionate about architecting and building scalable solutions while driving business value and market differentiation through technological innovations.

 

Rajesh Francis is a Sr. Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to build scalable analytic solutions. Rajesh worked with Magellan Rx Management to help implement a lake house architecture and leverage Redshift ML for various use cases.

 

 

 

Satish Sathiya is a Senior Product Engineer at Amazon Redshift. He is an avid big data enthusiast who collaborates with customers around the globe to achieve success and meet their data warehousing and data lake architecture needs. Satish worked with Magellan Rx Management to help jump start ML adoption.

 

 

Debu Panda, a principal product manager at AWS, is an industry leader in analytics, application platform, and database technologies and has more than 25 years of experience in the IT world. He was responsible for driving the Amazon Redshift ML feature and working closely with Magellan Rx Management and other customers to incorporate their feedback into the service.

How Comcast uses AWS to rapidly store and analyze large-scale telemetry data

Post Syndicated from Asser Moustafa original https://aws.amazon.com/blogs/big-data/how-comcast-uses-aws-to-rapidly-store-and-analyze-large-scale-telemetry-data/

This blog post is co-written by Russell Harlin from Comcast Corporation.

Comcast Corporation creates incredible technology and entertainment that connects millions of people to the moments and experiences that matter most. At the core of this is Comcast’s high-speed data network, providing tens of millions of customers across the country with reliable internet connectivity. This mission has become more important now than ever.

This post walks through how Comcast used AWS to rapidly store and analyze large-scale telemetry data.

Background

At Comcast, we’re constantly looking for ways to gain new insights into our network and improve the overall quality of service. Doing this effectively can involve scaling solutions to support analytics across our entire network footprint. For this particular project, we wanted an extensible and scalable solution that could process, store, and analyze telemetry reports, one per network device every 5 minutes. This data would then be used to help measure quality of experience and determine where network improvements could be made.

Scaling big data solutions is always challenging, but perhaps the biggest challenge of this project was the accelerated timeline. With 2 weeks to deliver a prototype and an additional month to scale it, we knew we couldn’t go through the traditional bake-off of different technologies, so we had to either go with technologies we were comfortable with or proven managed solutions.

For the data streaming pipeline, we already had the telemetry data coming in on an Apache Kafka topic, and had significant prior experience using Kafka combined with Apache Flink to implement and scale streaming pipelines, so we decided to go with what we knew. For the data storage and analytics, we needed a suite of solutions that could scale quickly, had plenty of support, and had an ecosystem of well-integrated tools to solve any problem that might arise. This is where AWS was able to meet our needs with technologies like Amazon Simple Storage Service (Amazon S3), AWS Glue, Amazon Athena, and Amazon Redshift.

Initial architecture

Our initial prototype architecture for the data store needed to be fast and simple so that we could unblock the development of the other elements of the budding telemetry solution. We needed three key things out of it:

  • The ability to easily fetch raw telemetry records and run more complex analytical queries
  • The capacity to integrate seamlessly with the other pieces of the pipeline
  • The possibility that it could serve as a springboard to a more scalable long-term solution

The first instinct was to explore solutions we used in the past. We had positive experiences using NoSQL databases, like Cassandra, to store and serve raw data records, but it was clear these wouldn’t meet our need for running ad hoc analytical queries. Likewise, we had experience with more flexible RDBMSs, like PostgreSQL, for handling more complicated queries, but we knew those wouldn’t scale to meet our requirement to store tens to hundreds of billions of rows. Therefore, any prototyping with one of these approaches would be considered throwaway work.

After moving on from these solutions, we quickly settled on using Amazon S3 with Athena. Amazon S3 provides low-cost storage with near-limitless scaling, so we knew we could store as much historical data as required and Athena would provide serverless, on-demand querying of our data. Additionally, Amazon S3 is known to be a launching pad to many other data store solutions both inside and outside the AWS ecosystem. This was perfect for the exploratory prototyping phase.

Integrating it into the rest of our pipeline would also prove simple. Writing the data to Amazon S3 from our Flink job was straightforward and could be done using the readily available Flink streaming file sink with an Amazon S3 bucket as the destination. When the data was available in Amazon S3, we ran AWS Glue to index our Parquet-formatted data and generate schemas in the AWS Glue metastore for searching using Athena with standard SQL.

The following diagram illustrates this architecture.

Using Amazon S3 and Athena allowed us to quickly unblock development of our Flink pipeline and ensure that the data being passed through was correct. Additionally, we used the AWS SDK to connect to Athena from our northbound Golang microservice and provide REST API access to our data for our custom web application. This allowed us to prove out an end-to-end solution with almost no upfront cost and very minimal infrastructure.
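The Comcast microservice is written in Go; for illustration, the equivalent Athena calls with the Python SDK (Boto3) look roughly like the following. The database, table, query, and output location are placeholders, not the actual schema.

# A hedged sketch of running an Athena query and reading the results.
import time
import boto3

athena = boto3.client("athena")

execution = athena.start_query_execution(
    QueryString="SELECT * FROM telemetry.device_reports LIMIT 10",   # hypothetical table
    QueryExecutionContext={"Database": "telemetry"},                  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then page through the results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for page in athena.get_paginator("get_query_results").paginate(QueryExecutionId=query_id):
        for row in page["ResultSet"]["Rows"]:
            print([col.get("VarCharValue") for col in row["Data"]])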

Updated architecture

As application and service development proceeded, it became apparent that Amazon Athena performed well for developers running ad hoc queries, but wasn’t going to work as a long-term, responsive backend for our microservices and user interface requirements.

One of the primary use cases of this solution was to look at device-level telemetry reports for a period of time and plot and track different aspects of their quality of experience. Because this most often involves solving problems happening in the now, we needed an improved data store for the most recent hot data.

This led us to Amazon Redshift. Amazon Redshift requires loading the data into a dedicated cluster and formulating a schema tuned for your use cases.

The following diagram illustrates this updated architecture.

Data loading and storage requirements

For loading and storing the data in Amazon Redshift, we had a few fundamental requirements:

  • Because our Amazon Redshift solution would be for querying data to troubleshoot problems happening as recent as the current hour, we needed to minimize the latency of the data load and keep up with our scale while doing it. We couldn’t live with nightly loads.
  • The pipeline had to be robust and recover from failures automatically.

There’s a lot of nuance that goes into making this happen, and we didn’t want to worry about handling these basic things ourselves, because this wasn’t where we were going to add value. Luckily, because we were already loading the data into Amazon S3, AWS Glue ETL satisfied these requirements and provided a fast, reliable, and scalable solution to do periodic loads from our Amazon S3 data store to our Amazon Redshift cluster.

A huge benefit of AWS Glue ETL is that it provides many opportunities to tune your ETL pipeline to meet your scaling needs. One of our biggest challenges was that we write multiple files to Amazon S3 from different regions every 5 minutes, which results in many small files. If you’re doing infrequent nightly loads, this may not pose a problem, but for our specific use case, we wanted to load data at least every hour and multiple times an hour if possible. This required some specific tuning of the default ETL job:

  • Amazon S3 list implementation – This allows the Spark job to handle files in batches and optimizes reads for a large number of files, preventing out of memory issues.
  • Pushdown predicates – This tells the load to skip listing any partitions in Amazon S3 that you know won’t be a part of the current run. For frequent loads, this can mean skipping a lot of unnecessary file listing during each job run.
  • File grouping – This allows the read from Amazon S3 to group files together in batches when reading from Amazon S3. This greatly improves performance when reading from a large number of small files.
  • AWS Glue 2.0 – When we started our development, only AWS Glue 1.0 was available, and we’d frequently see Spark cluster start times of over 10 minutes. This becomes problematic if you want to run the ETL job more frequently, because you have to account for the cluster startup time in your trigger timings. When AWS Glue 2.0 came out, those start times consistently dropped to under 1 minute and became an afterthought.

With these tunings, as well as increasing the parallelism of the job, we could meet our requirement of loading data multiple times an hour. This made relevant data available for analysis sooner.
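For reference, a hedged PySpark sketch (run inside an AWS Glue job) that applies the tuning options described above looks like the following. The database, table, partition values, and group size are illustrative assumptions, not Comcast’s actual job.

# A minimal sketch of reading from the Data Catalog with the S3 list
# implementation, a pushdown predicate, and file grouping enabled.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

frame = glue_context.create_dynamic_frame.from_catalog(
    database="telemetry",                      # hypothetical
    table_name="raw_reports",                  # hypothetical
    # Only list the partitions written in the current hour.
    push_down_predicate="dt = '2021-06-01' and hour = '13'",
    additional_options={
        "useS3ListImplementation": True,       # batch S3 listings for many files
        "groupFiles": "inPartition",           # group small files when reading
        "groupSize": "134217728",              # ~128 MB per group
    },
)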

Modeling, distributing, and sorting the data

Aside from getting the data into the Amazon Redshift cluster in a timely manner, the next consideration was how to model, distribute, and sort the data when it was in the cluster. For our data, we didn’t have a complex setup with tens of tables requiring extensive joins. We simply had two tables: one for the device-level telemetry records and one for records aggregated at a logical grouping.

The bulk of the initial query load would be centered around serving raw records from these tables to our web application. These types of raw record queries aren’t difficult to handle from a query standpoint, but do present challenges when dealing with tens of millions of unique devices and a report granularity of 5 minutes. So we knew we had to tune the database to handle these efficiently. Additionally, we also needed to be able to run more complex ad hoc queries, like getting daily summaries of each table so that higher-level problem areas could be more easily tracked and spotted in the network. These queries, however, were less time sensitive and could be run on an ad hoc, batch-like basis where responsiveness wasn’t as important.

The schema fields themselves were more or less one-to-one mappings from the respective Parquet schemas. The challenge came, however, in picking distribution keys and sort columns. For the distribution key, we identified a logical device grouping column present in both our tables as the one column we were likely to join on. This seemed like a natural fit to distribute on and had good enough cardinality that our distribution would be adequate.

For the sort keys, we knew we’d be searching by the device identifier and the logical grouping for the respective tables, and we knew we’d be searching temporally. So the primary identifier column of each table and the timestamp made sense to sort on. The documented sort key suggestion was to use the timestamp column as the first value in the sort key, because it could filter the dataset to a specific time period. This initially worked well enough and we were able to get a performance improvement over Athena, but as we scaled and added more data, our raw record retrieval queries were rapidly slowing down. To help with this, we made two adjustments.

The first adjustment came with a change to the sort key. The first part of this involved swapping the order of the timestamp and the primary identifier column. This allowed us to filter down to the device and then search through the range of timestamps on just that device, skipping over all irrelevant devices. This provided significant performance gains and cut our raw record query times by several multiples. The second part of the sort key adjustment involved adding another column (a node-level identifier) to the beginning of the sort key. This allowed us to have one more level of data filtering, which further improved raw record query times.

One trade-off made while making these sort key adjustments was that our more complex aggregation queries had a noticeable decline in performance. This was because they were typically run across the entire footprint of devices and could no longer filter as easily based on time being the first column in the sort key. Fortunately, because these were less frequent and could be run offline if necessary, this performance trade-off was considered acceptable.

If the frequency of these workloads increases, we can use materialized views in Amazon Redshift, which can help avoid unnecessary reruns of the complex aggregations if minimal-to-no data changes in the underlying base tables have occurred since the last run.

The final adjustment was cluster scaling. We chose to use the Amazon Redshift next-generation RA3 nodes for a number of benefits, three of which were especially key:

  • RA3 clusters allow for practically unlimited storage in our cluster.
  • The RA3 ability to scale storage and compute independently paired really well with our expectations and use cases. We fully expected our Amazon Redshift storage footprint to continue to grow, as well as the number, shape, and sizes of our use cases and users, but data and workloads wouldn’t necessarily grow in lockstep. Being able to scale the cluster’s compute power independent of storage (or vice versa) was a key technical requirement and cost-optimization for us.
  • RA3 clusters come with Amazon Redshift managed storage, which places the burden on Amazon Redshift to automatically place data based on its temperature for consistent peak performance. With managed storage, hot data is cached on a large local SSD in each node, and cold data is kept in the Amazon Redshift persistent store on Amazon S3.

After conducting performance benchmarks, we determined that our cluster was under-powered for the amount of data and workloads it was serving, and we would benefit from greater distribution and parallelism (compute power). We easily resized our Amazon Redshift cluster to double the number of nodes within minutes, and immediately saw a significant performance boost. With this, we were able to recognize that as our data and workloads scaled, so too should our cluster.
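For reference, an elastic resize of this kind can be performed with a single Amazon Redshift API call. The following is a minimal sketch with illustrative values; the cluster identifier, node type, and node count are assumptions.

import boto3

redshift = boto3.client("redshift")

# Elastic resize to double the node count (values here are illustrative).
redshift.resize_cluster(
    ClusterIdentifier="telemetry-cluster",
    NodeType="ra3.4xlarge",
    NumberOfNodes=8,   # for example, from 4 to 8 nodes
    Classic=False,     # elastic resize typically completes in minutes
)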

Looking forward, we expect that there will be a relatively small population of ad hoc and experimental workloads that will require access to additional datasets sitting in our data lake, outside of Amazon Redshift—workloads similar to the Athena workloads we previously observed. To serve that small customer base, we can leverage Amazon Redshift Spectrum, which empowers users to run SQL queries on external tables in our data lake, just as they would on any other table within Amazon Redshift, while allowing us to keep costs as lean as possible.

This final architecture provided us with the solid foundation of price, performance, and flexibility for our current set of analytical use cases—and, just as important, the future use cases that haven’t shown themselves yet.

Summary

This post details how Comcast leveraged AWS data store technologies to prototype and scale the serving and analysis of large-scale telemetry data. We hope to continue to scale the solution as our customer base grows. We’re currently working on identifying more telemetry-related metrics to give us increased insight into our network and deliver the best quality of experience possible to our customers.


About the Authors

Russell Harlin is a Senior Software Engineer at Comcast based out of the San Francisco Bay Area. He works in the Network and Communications Engineering group designing and implementing data streaming and analytics solutions.

 

 

Asser Moustafa is an Analytics Specialist Solutions Architect at AWS based out of Dallas, Texas. He advises customers in the Americas on their Amazon Redshift and data lake architectures and migrations, starting from the POC stage to actual production deployment and maintenance.

 

Amit Kalawat is a Senior Solutions Architect at Amazon Web Services based out of New York. He works with enterprise customers as they transform their business and journey to the cloud.

Automating Your Home with Grafana and Siemens Controllers

Post Syndicated from Viktoria Semaan original https://aws.amazon.com/blogs/architecture/automating-your-home-with-grafana-and-siemens-controllers/

Imagine that you have access to a digital twin of your house that allows you to remotely monitor and control different devices inside your home. Forgot to turn off the heater or air conditioning? Didn’t close water faucets? Wondering how long your kids have been watching TV? Wouldn’t it be nice to have all the information from multiple devices in a single place?

Nowadays, many of us have smart things at home, such as thermostats, security cameras, wireless sensors, switches, etc. The problem is that most of these smart things come with different mobile applications. To get a full picture, we end up switching between applications that serve limited needs.

In this blog, we explain how to use Siemens controllers, AWS IoT, and the open-source visualization platform Grafana to quickly build a digital twin of any process. This includes home automation, industrial applications, security systems, and others. As an example, we will monitor environmental conditions, including readings from temperature and humidity sensors, thermostat settings, light switches, and monthly water and energy consumption. We will go through the architecture and steps required to integrate different building components to store data for historical analysis, enable voice control, and create interactive near real-time dashboards showing a digital representation of your house. If you would like to learn more about the solution, we will provide links to all the architecture components and detailed configuration steps.

Smart home automation solution with Siemens LOGO! compact controller

Figure 1. Smart home automation solution with Siemens LOGO! compact controller

Architecture

In this solution overview, we are using a low-cost Siemens LOGO! controller (hardware version 8.3 or higher). This controller supports traditional industrial protocols such as Simatic S7 and Modbus TCP/IP as well as MQTT through the native AWS IoT Core interface. Automation controllers are the brains behind smart systems that allow orchestrating all the devices in your home. This reference architecture can be extended to other devices that support MQTT protocol, have programmatic APIs, or Software Development Kits (SDKs). It could serve as a starting point for building a home automation solution using AWS IoT and Grafana and further customized based on customer needs.

Reference architecture for smart home automation solution

Figure 2. Reference architecture for smart home automation solution

The components of this solution are:

  • The LOGO! controller controls home automation equipment and ingests data to AWS IoT Core.
  • AWS IoT Core collects data at scale and routes messages to multiple AWS services.
  • AWS Lambda is called from the AWS IoT Core rule to transform the incoming data prior to ingestion.
  • Amazon Timestream stores time series data and optimizes it for fast analytical queries.
  • AWS IoT SiteWise models and stores data from equipment for large scale deployments.
  • Grafana installed on Amazon Elastic Compute Cloud (Amazon EC2) visualizes data in near real-time using interactive dashboards.
  • The Alexa Skills Kit (ASK) allows interaction with devices using voice commands.
Remote monitoring dashboard allows homeowners to view and control conditions

Figure 3. Remote monitoring dashboard allows homeowners to view and control conditions

Solution overview

Step 1: Ingest data to AWS IoT

The IoT-enabled LOGO! controller provides the out-of-the-box capability to send data to the AWS IoT Core service. In a few clicks, you can configure variables and their update frequency to be published to the AWS Cloud. To get started with the LOGO! controller, refer to the Siemens E-learning portal. AWS IoT Core collects and processes messages from remote devices transmitted over the secured MQTT protocol.

The LOGO! controller publishes data to AWS IoT Core in hexadecimal format. The Lambda function converts the data from hexadecimal to the standard decimal numeric system. If your home automation equipment sends data in the standard decimal format, then AWS IoT Core can directly write data to other AWS services without Lambda.
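The following is a minimal sketch of what such a conversion function might look like, assuming the AWS IoT rule passes the raw message to the Lambda function as its event and that the controller's variables arrive as hexadecimal strings. The payload shape and field names are assumptions; adapt them to the variables you configured on the controller.

def lambda_handler(event, context):
    """Convert hexadecimal string values in the incoming IoT payload to decimals.

    Illustrative sketch: the payload shape and field names are assumptions.
    """
    decoded = {}
    for name, value in event.items():
        if isinstance(value, str):
            try:
                decoded[name] = int(value, 16)  # e.g. "1A" -> 26
            except ValueError:
                decoded[name] = value           # leave non-hex strings untouched
        else:
            decoded[name] = value
    return decoded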

Step 2: Store data in Timestream or AWS IoT SiteWise

The ingested IoT data is saved for historical analysis. Timestream is a serverless time series database service that is optimized for high throughput ingestion and has built-in analytical functions. It is one of the options you can use to store IoT data. Time series is a common data format to observe how things are changing over time and it is suitable for building IoT applications.
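If you choose to write readings from code (for example, from the conversion Lambda function) rather than through an AWS IoT rule action, a hedged sketch of the Timestream call might look like the following. The database and table names mirror the query example later in this post; the dimension and measure values are illustrative.

import time
import boto3

timestream = boto3.client("timestream-write")

# Database and table names mirror the query example below; values are illustrative.
timestream.write_records(
    DatabaseName="myhome_db",
    TableName="livingroom",
    Records=[{
        "Dimensions": [{"Name": "device", "Value": "logo_controller"}],
        "MeasureName": "humidity",
        "MeasureValue": "48.5",
        "MeasureValueType": "DOUBLE",
        "Time": str(int(time.time() * 1000)),
        "TimeUnit": "MILLISECONDS",
    }],
)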

AWS IoT SiteWise is an alternative option to store and organize data at scale. It is beneficial for large-scale commercial building automation and management systems, including offices, hotels, and factories. You can structure data by using built-in asset modeling capabilities.

Step 3: Visualize data in Grafana dashboard

Once data is stored, it can be made available to multiple applications. Grafana is a data visualization platform that you can use to monitor data. It supports near real-time visualization with a refresh rate of 5 seconds or higher. You can visualize data from multiple AWS sources (such as AWS IoT SiteWise, Timestream, and Amazon CloudWatch) and other data sources with a single Grafana dashboard. Grafana can be installed on an Amazon Linux system, Windows, macOS, or deployed on Kubernetes (K8S) or Docker containers. For customers who don’t want to manage infrastructure and are interested in developing a completely serverless solution, AWS offers Amazon Managed Service for Grafana. At the time of writing this post, this service is available in preview with a limited number of supported plugins.

To build a Grafana dashboard and retrieve data from Timestream, you can use SQL queries. The following example Timestream query retrieves humidity values and timestamps for the past 24 hours:

SELECT measure_value::double as humidity, time FROM "myhome_db"."livingroom" WHERE measure_name='humidity' and time >=ago(1d)

To retrieve data from AWS IoT SiteWise, you can select asset properties from the asset navigation tab, which makes it simple for non-technical users to build dashboards.

Grafana dashboard configuration with AWS IoT SiteWise

Figure 4. Grafana dashboard configuration with AWS IoT SiteWise

One of the common issues of operational dashboards is that it’s hard to get a physical representation by looking at a cluster of multiple readings. To reflect conditions of physical assets, the information from sensors must be overlaid on top of original physical objects. ImageIt Panel Plugin for Grafana allows you to overcome this issue. You can upload a picture of your house or a system and drag sensor readings to their exact locations, thus creating digital representations of physical objects.

Step 4: Control using Alexa

Using the Alexa Skills Kit, you can build voice skills to be used on Alexa-enabled devices globally. Alexa and AWS IoT enable you to create an end-to-end voice-controlled experience without using any additional hardware. Instead, your functions run in the cloud only when you invoke Alexa with voice commands.

The easiest way to build a custom Alexa skill is to use a Lambda function. You can upload the code for your Alexa skill to a Lambda function. The code will execute in response to Alexa voice interactions and send commands to the LOGO! controller.
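As a minimal, hedged sketch, the handler below responds to an assumed custom intent and publishes a command to an assumed MQTT topic that the LOGO! controller could subscribe to. The intent name, topic, and payload are placeholders, not part of the original solution.

import json
import boto3

iot = boto3.client("iot-data")

def lambda_handler(event, context):
    """Handle an assumed custom 'TurnOnLightsIntent' and forward it over MQTT."""
    request = event.get("request", {})
    speech = "Sorry, I didn't understand that."

    if request.get("type") == "IntentRequest" and \
            request["intent"]["name"] == "TurnOnLightsIntent":
        iot.publish(
            topic="home/logo/commands",           # assumed topic the controller subscribes to
            qos=1,
            payload=json.dumps({"lights": "on"}),
        )
        speech = "Okay, turning on the lights."

    # Standard Alexa Skills Kit response envelope
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": True,
        },
    }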

Conclusion

In this blog, we reviewed how you can create a digital twin of your home automation or industrial systems using Siemens controllers, AWS IoT, and Grafana dashboards. Connecting the LOGO! controller to AWS gives it access to the Internet of Things (IoT) and opens many potential applications such as anomaly detection, predictive maintenance, intrusion detection, and others.

How GE Healthcare modernized their data platform using a Lake House Architecture

Post Syndicated from Krishna Prakash original https://aws.amazon.com/blogs/big-data/how-ge-healthcare-modernized-their-data-platform-using-a-lake-house-architecture/

GE Healthcare (GEHC) operates as a subsidiary of General Electric. The company is headquartered in the US and serves customers in over 160 countries. As a leading global medical technology, diagnostics, and digital solutions innovator, GE Healthcare enables clinicians to make faster, more informed decisions through intelligent devices, data analytics, applications, and services, supported by its Edison intelligence platform.

GE Healthcare’s legacy enterprise analytics platform used a traditional Postgres-based on-premises data warehouse from a big data vendor to run a significant part of its analytics workloads. The data warehouse is key for GE Healthcare; it enables users across units to gather data and generate the daily reports and insights required to run key business functions. In the last few years, the number of teams accessing the cluster increased by almost three times, with twice the initial number of users running four times the number of daily queries for which the cluster had been designed. The legacy data warehouse wasn’t able to scale to support GE Healthcare’s business needs and would require significant investment to maintain, update, and secure.

Searching for a modern data platform

After working for several years with a database-focused approach, GEHC found that the rapid growth in data made its on-premises system unviable from a cost and maintenance perspective. This presented GE Healthcare with an opportunity to take a holistic look at the emerging and strategic needs for data and analytics. With this in mind, GE Healthcare decided to adopt a Lake House Architecture using AWS services:

  • Use Amazon Simple Storage Service (Amazon S3) to store raw enterprise and event data
  • Use familiar SQL statements to combine and process data across all data stores in Amazon S3 and Amazon Redshift (a brief sketch follows this list)
  • Apply the “best fit” concept of using the appropriate AWS technology to meet specific business needs
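As a hedged illustration of the second point, the following sketch runs a single SQL statement that joins an external table over Amazon S3 (assuming an Amazon Redshift Spectrum external schema has already been defined) with a local Amazon Redshift table, submitted through the Redshift Data API. All schema, table, column, and cluster names are assumptions, not GE Healthcare's actual objects.

import boto3

client = boto3.client("redshift-data")

# Joins an external (Amazon S3) table with a local Amazon Redshift table.
# Schema, table, and column names are illustrative only.
sql = """
SELECT r.region, SUM(e.order_amount) AS total_orders
FROM spectrum_schema.raw_orders e              -- external table over Amazon S3
JOIN dim_region r ON r.region_id = e.region_id -- local Amazon Redshift table
GROUP BY r.region;
"""

client.execute_statement(
    ClusterIdentifier="gehc-analytics",  # assumed cluster name
    Database="analytics",
    DbUser="analyst",
    Sql=sql,
)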

Architecture

The following diagram illustrates GE Healthcare’s architecture.

Choosing Amazon Redshift for the enterprise cloud data warehouse

Choosing the right data store is just as important as how you collect the data for analytics. Amazon Redshift provided the best value because it was easy to launch, access, and store data, and could scale to meet our business needs on demand. The following are a few additional reasons why GE Healthcare made Amazon Redshift our cloud data warehouse:

  • The ability to store petabyte-scale data in Amazon S3 and query the data in Amazon S3 and Amazon Redshift with little preprocessing was important because GE Healthcare’s data needs are expanding at a significant pace.
  • The AWS-based strategy provides financial flexibility. GE Healthcare moved from a fixed cost/fixed asset on-premises model to a consumption-based model. In addition, the total cost of ownership (TCO) of the AWS-based architecture was less than the solution it was replacing.
  • Native integration with the AWS Analytics ecosystem made it easier to handle end-to-end analytics workflows without friction. GE Healthcare took a hybrid approach to extract, transform, and load (ETL) jobs by using a combination of AWS Glue, Amazon Redshift SQL, and stored procedures based on complexity, scale, and cost.
  • The resilient platform makes recovery from failure easier, with errors further down the pipeline less likely to affect production environments because all historical data is on Amazon S3.
  • The idea of a Lake House Architecture is that taking a one-size-fits-all approach to analytics eventually leads to compromises. It’s not simply about integrating a data lake with a data warehouse, but rather about integrating a data lake, a data warehouse, and purpose-built data stores and enabling unified governance and easy data movement. (For more information about the Lake House Architecture, see Harness the power of your data with AWS Analytics.)
  • It’s easy to implement new capabilities to support emerging business needs such as artificial intelligence, machine learning, graph databases, and more because of the extensive product capabilities of AWS and the Amazon Redshift Lake House Architecture.

Implementation steps and best practices

As part of this journey, GE Healthcare partnered with AWS Professional Services to accelerate their momentum. AWS Professional Services was instrumental in following the Working Backwards (WB) process, which is Amazon’s customer-centric product development process.

Here’s how it worked:

  • AWS Professional Services guided GE Healthcare and partner teams through the Lake House Architecture, sharing AWS standards and best practices.
  • The teams accelerated Amazon Redshift stored procedure migration, including data structure selection and table design.
  • The teams delivered performant code at scale to enable a timely go live.
  • They implemented a framework for batch file extract to support downstream data consumption via a data as a service (DaaS) solution.

Conclusion

Modernizing to a Lake House Architecture with Amazon Redshift allowed GEHC to speed up, innovate faster, and better solve customers’ needs. At the time of writing, GE Healthcare workloads are running at full scale in production. We retired our on-premises infrastructure. Amazon Redshift RA3 instances with managed storage enabled us to scale compute and storage separately based on our customers’ needs. Furthermore, with the concurrency scaling feature of Amazon Redshift, we don’t have to worry about peak times affecting user performance any more. Amazon Redshift scales out and in automatically.

We also look forward to realizing the benefits of Amazon Redshift data sharing and AQUA (Advanced Query Accelerator) for Amazon Redshift as we continue to increase the performance and scale of our Amazon Redshift data warehouse. We appreciate AWS’s continual innovation on behalf of its customers.


About the Authors

Krishna Prakash (KP) Bhat is a Sr. Director in Data & Analytics at GE Healthcare. In this role, he’s responsible for architecture and data management of data and analytics solutions within GE Healthcare Digital Technologies Organization. In his spare time, KP enjoys spending time with family. Connect him on LinkedIn.

 

Suresh Patnam is a Solutions Architect at AWS, specialized in big data and AI/ML. He works with customers in their journey to the cloud with a focus on big data, data lakes, and AI/ML. In his spare time, Suresh enjoys playing tennis and spending time with his family. Connect him on LinkedIn.

Synchronize and control your Amazon Redshift clusters maintenance windows

Post Syndicated from Ahmed Gamaleldin original https://aws.amazon.com/blogs/big-data/synchronize-and-control-your-amazon-redshift-clusters-maintenance-windows/

Amazon Redshift is a data warehouse that can expand to exabyte-scale. Today, tens of thousands of AWS customers (including NTT DOCOMO, Finra, and Johnson & Johnson) use Amazon Redshift to run mission-critical business intelligence dashboards, analyze real-time streaming data, and run predictive analytics jobs.

Amazon Redshift powers analytical workloads for Fortune 500 companies, startups, and everything in between. With the constant increase in generated data, Amazon Redshift customers continue to achieve successes in delivering better service to their end-users, improving their products, and running an efficient and effective business. Availability, therefore, is key in continuing to drive customer success, and AWS will use commercially reasonable efforts to make Amazon Redshift available with a Monthly Uptime Percentage for each multi-node cluster, during any monthly billing cycle, of at least 99.9% (our “Service Commitment”).

In this post, we present a solution to help you provide a predictable and repeatable experience to your Amazon Redshift end-users by taking control of recurring Amazon Redshift maintenance windows.

Amazon Redshift maintenance windows

Amazon Redshift periodically performs maintenance to apply fixes, enhancements, and new features to your cluster. This maintenance occurs during a 30-minute maintenance window that is set by default, per Region, from an 8-hour block of time on a random day of the week. You can change the scheduled maintenance window according to your business needs by modifying the cluster, either programmatically or by using the Amazon Redshift console. The window must be at least 30 minutes and not longer than 24 hours. For more information, see Managing clusters using the console.
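If you prefer the programmatic route, a minimal sketch of the call might look like the following; the cluster identifier and window value are example placeholders.

import boto3

redshift = boto3.client("redshift")

# Move the cluster's maintenance window to Saturday 06:00-06:30 UTC (example values).
redshift.modify_cluster(
    ClusterIdentifier="my-redshift-cluster",
    PreferredMaintenanceWindow="sat:06:00-sat:06:30",
)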

If a maintenance event is scheduled for a given week, it starts during the assigned 30-minute maintenance window. While Amazon Redshift performs maintenance, it terminates any queries or other operations that are in progress. If there are no maintenance tasks to perform during the scheduled maintenance window, your cluster continues to operate normally until the next scheduled maintenance window. Amazon Redshift uses Amazon Simple Notification Service (Amazon SNS) to send notifications of Amazon Redshift events. You enable notifications by creating an Amazon Redshift event subscription. You can create an Amazon Redshift event notification subscription so you can be notified when an event occurs for a given cluster.

When an Amazon Redshift cluster is scheduled for maintenance, you receive an Amazon Redshift “Pending” event, as described in the following table.

Amazon Redshift category Event ID Event severity Description
Pending REDSHIFT-EVENT-2025 INFO Your database for cluster <cluster name> will be updated between <start time> and <end time>. Your cluster will not be accessible. Plan accordingly.
Pending REDSHIFT-EVENT-2026 INFO Your cluster <cluster name> will be updated between <start time> and <end time>. Your cluster will not be accessible. Plan accordingly.

Amazon Redshift also gives you the option to reschedule your cluster’s maintenance window by deferring your upcoming maintenance by up to 45 days. This option is particularly helpful if you want to maximize your cluster uptime by deferring a future maintenance window. For example, if your cluster’s maintenance window is set to Wednesday 8:30–9:00 UTC and you need to have nonstop access to your cluster for the next 2 weeks, you can defer maintenance to a date 2 weeks from now. We don’t perform any maintenance on your cluster when you have specified a deferment.

Deferring a maintenance window doesn’t apply to mandatory Amazon Redshift updates, such as vital security patches. Scheduled maintenance is different from Amazon Redshift mandatory maintenance. If Amazon Redshift needs to update hardware or make other mandatory updates during your period of deferment, we notify you and make the required changes. Your cluster isn’t available during these updates and such maintenance can’t be deferred. If a hardware replacement is required, you receive an event notification through the AWS Management Console and your SNS subscription as a “Pending” item, as shown in the following table.

Amazon Redshift category Event ID Event severity Description
Pending REDSHIFT-EVENT-3601 INFO A node on your cluster <cluster name> will be replaced between <start time> and <end time>. You can’t defer this maintenance. Plan accordingly.
Pending REDSHIFT-EVENT-3602 INFO A node on your cluster <cluster name> is scheduled to be replaced between <start time> and <end time>. Your cluster will not be accessible. Plan accordingly.

Challenge

As the number of the Amazon Redshift clusters you manage for an analytics application grows, the end-user experience becomes highly dependent on recurring maintenance. This means that you want to make sure maintenance windows across all clusters happen at the same time and fall on the same day each month. You want to avoid a situation in which, for example, your five Amazon Redshift clusters are each updated on a different day or week. Ultimately, you want to give your Amazon Redshift end-users an uninterrupted number of days where the clusters aren’t subject to any scheduled maintenance. This gives you the opportunity to announce a scheduled maintenance to your users well ahead of the maintenance date.

Amazon Redshift clusters are scheduled for maintenance depending on several factors, including when a cluster was created and its Region. For example, you may have two clusters on the same Amazon Redshift version that are scheduled for different maintenance windows. Regions receive the latest Amazon Redshift versions and patches at different times. A cluster running in us-east-1 might be scheduled for the same maintenance 1 week before another cluster in eu-west-1. This makes it harder for you to provide your end-users with a predictable maintenance schedule.

To solve this issue, you typically need to frequently check and synchronize the maintenance windows across all your clusters to a specific time, for example sat:06:00-sat:06:30. Additionally, you want to avoid having your clusters scheduled for maintenance at different intervals. For that, you need to defer maintenance across all your clusters to fall on the same exact day. For example, you can defer all maintenance across all clusters to happen 1 month from now regardless of when the cluster was last updated. This way you know that your clusters aren’t scheduled for maintenance for the next 30 days. This gives you enough time to announce the maintenance schedule to your Amazon Redshift users.

Solution overview

Having one day per month when Amazon Redshift scheduled maintenance occurs across all your clusters provides your users with a seamless experience. It gives you control and predictability over when clusters aren’t available. You can announce this maintenance window ahead of time to avoid sudden interruptions.

The following solution deploys an AWS Lambda function (RedshiftMaintenanceSynchronizer) to your AWS account. The function runs on a schedule that you can configure through an AWS CloudFormation template parameter (frequency) to run every 6, 12, or 24 hours. This function synchronizes maintenance schedules across all your Amazon Redshift clusters so that they happen at the same time on the same day. It also gives you the option to defer all future maintenance windows across all clusters by a number of days (deferment days), for up to 45 days. This enables you to provide your users with an uninterrupted number of days when your clusters aren’t subject to maintenance.

We use the following input parameters:

  • Deferment days – The number of days to defer all future scheduled maintenance windows. Amazon Redshift can defer maintenance windows by up to 45 days. The solution adds this number to the date of the last successfully completed maintenance across any of your clusters. The resulting date is the new maintenance date for all your clusters.
  • Frequency – The frequency to run this solution. You can configure it to run every 6, 12, or 24 hours.
  • Day – The preferred day of the week to schedule maintenance windows.
  • Hour – The preferred hour of the day to schedule maintenance windows (24-hour format).
  • Minute – The preferred minute of the hour to schedule maintenance windows.

The Lambda function performs the following steps (a simplified sketch follows the list):

  1. Lists all your Amazon Redshift clusters in the Region.
  2. Updates the maintenance window for all clusters to the same value of the day/hour/minute input parameters from the CloudFormation template.
  3. Checks for the last successfully completed maintenance across all clusters.
  4. Calculates the deferment maintenance date by adding the deferment input parameter to the date of the last successfully completed maintenance. For example, if the deferment parameter is 30 days and the last successful maintenance window completed on July 1, 2020, then the next deferment maintenance date is July 31, 2020.
  5. Defers the next maintenance window across all clusters to the deferment date calculated in the previous step.
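The following is a simplified sketch of that logic, not the solution's published source code. It assumes the date of the last successfully completed maintenance has already been determined, and uses example values for the window and deferment.

import datetime
import boto3

redshift = boto3.client("redshift")

# Simplified sketch of the synchronization logic; example values only.
PREFERRED_WINDOW = "sat:06:00-sat:06:30"
DEFERMENT_DAYS = 30

def sync_maintenance(last_completed_maintenance: datetime.datetime) -> None:
    clusters = redshift.describe_clusters()["Clusters"]

    # Exit if any cluster already has a deferment in place (mirrors the solution's behavior).
    if any(c.get("DeferredMaintenanceWindows") for c in clusters):
        return

    defer_start = last_completed_maintenance + datetime.timedelta(days=DEFERMENT_DAYS)
    defer_end = defer_start + datetime.timedelta(hours=1)

    for cluster in clusters:
        cluster_id = cluster["ClusterIdentifier"]
        # 1. Align the preferred maintenance window across all clusters.
        redshift.modify_cluster(
            ClusterIdentifier=cluster_id,
            PreferredMaintenanceWindow=PREFERRED_WINDOW,
        )
        # 2. Defer the next maintenance to the common date.
        redshift.modify_cluster_maintenance(
            ClusterIdentifier=cluster_id,
            DeferMaintenance=True,
            DeferMaintenanceStartTime=defer_start,
            DeferMaintenanceEndTime=defer_end,
        )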

Launch the solution

To get started, deploy the CloudFormation template to your AWS account.

  1. For Stack name, enter a name for your stack for easy reference.
  2. For Day, choose the day of week for the maintenance window to occur.
  3. For Hour, choose the hour for the window to start.
  4. For Minute, choose the minute within the hour for the window to start.

Your window should be a time with the least cluster activity and during off-peak work hours.

  1. For Deferment Days, choose the number of days to defer all future scheduled maintenance windows.
  2. For Solution Run Frequency, choose the frequency of 6, 12, or 24 hours.

CloudFormation template

After the template has successfully deployed, the following resources are available:

  • The RedshiftMaintenanceSync Lambda function. This is a Python 3.8 Lambda function that syncs and defers the maintenance windows across all your Amazon Redshift clusters.

Lambda console

  • The RedshiftMaintenanceSyncEventRule Amazon CloudWatch event rule. This rule triggers on a schedule based on the Frequency input parameter and invokes the RedshiftMaintenanceSync Lambda function to run the solution logic.

When the Lambda function starts and detects an already available deferment on any of your clusters, it doesn’t attempt to modify the existing deferment, and exits instead.

The solution logs any deferment it performs on any cluster in the associated CloudWatch log group.

Synchronize and defer maintenance windows

To demonstrate this solution, I have two Amazon Redshift clusters with different preferred maintenance windows and scheduled intervals.

Cluster A (see the following screenshot) has a preferred maintenance window set to Friday at 4:30–5:00 PM. It’s scheduled for maintenance in 2 days.

Cluster B has a preferred maintenance window set to Tuesday at 9:45–10:15 AM. It’s scheduled for maintenance in 6 days.

This means that my Amazon Redshift users have a 30-minute interruption twice in the next 2 and 6 days. Also, these interruptions happen at completely different times.

Let’s launch the solution and see how the cluster maintenance windows and scheduling intervals change.

The Lambda function does the following every time it runs:

  1. Lists all the clusters.
  2. Checks if any of the clusters have a deferment enabled and exits if it finds any.
  3. Syncs all preferred maintenance windows across all clusters to fall on the same time and day of week.
  4. Checks for the latest successfully completed maintenance date.
  5. Adds the deferment days to the last maintenance date. This becomes the new deferment date.
  6. Applies the new deferment date to all clusters.

Cluster A’s maintenance window was synced to the input parameters I passed to the CloudFormation template. Additionally, the next maintenance window was deferred until June 21, 2021, 8:06 AM (UTC +02:00).

Cluster B’s maintenance window is also synced to the same value from Cluster A, and the next maintenance window was deferred to the same exact day as Cluster A: June 21, 2021, 8:06 AM (UTC +02:00).

Finally, let’s check the CloudWatch log group to understand what the solution did.

Now both clusters have the same maintenance window and next scheduled maintenance, which is a month from now. In a production scenario, this gives you enough lead time to announce the maintenance window to your Amazon Redshift end-users and provide them with the exact day and time when clusters aren’t available.

Summary

This solution can help you provide a predictable and repeatable experience to your Amazon Redshift end-users by taking control of recurring Amazon Redshift maintenance windows. It enables you to provide your users with an uninterrupted number of days where clusters are always available—barring any scheduled mandatory upgrades. Click here to get started with Amazon Redshift today.


About the Author

Ahmed Gamaleldin

Ahmed Gamaleldin is a Senior Technical Account Manager (TAM) at Amazon Web Services. Ahmed helps customers run optimized workloads on AWS and make the best out of their cloud journey.

Field Notes: How Sportradar Accelerated Data Recovery Using AWS Services

Post Syndicated from Mithil Prasad original https://aws.amazon.com/blogs/architecture/field-notes-how-sportradar-accelerated-data-recovery-using-aws-services/

This post was co-written by Mithil Prasad, AWS Senior Customer Solutions Manager, Patrick Gryczkat, AWS Solutions Architect, Ben Burdsall, CTO at Sportradar and Justin Shreve, Director of Engineering at Sportradar. 

Ransomware is a type of malware that encrypts data, effectively locking those affected by it out of their own data and demanding a payment to decrypt it. The frequency of ransomware attacks has increased over the past year, with local governments, hospitals, and private companies experiencing cases of ransomware.

For Sportradar, providing their customers with access to high quality sports data and insights is central to their business. Ensuring that their systems are designed securely and in a way which minimizes the possibility of a ransomware attack is a top priority. While ransomware attacks can occur both on premises and in the cloud, AWS services offer increased visibility and native encryption and backup capabilities. This helps prevent and minimize the likelihood and impact of a ransomware attack.

Recovery, backup, and the ability to go back to a known good state is best practice. To further expand their defense and diminish the value of ransom, the Sportradar architecture team set out to leverage their AWS Step Functions expertise to minimize recovery time. The team’s strategy centered on achieving a short deployment process. This process commoditized their production environment, allowing them to spin up interchangeable environments in new isolated AWS accounts, pulling in data from external and isolated sources, and diminishing the value of a production environment as a ransom target. This also minimized the impact of a potential data destruction event.

By partnering with AWS, Sportradar was able to build a secure and resilient infrastructure to provide timely recovery of their service in the event of data destruction by an unauthorized third party. Sportradar automated the deployment of their application to a new AWS account and established a new isolation boundary from an account with compromised resources. In this blog post, we show how the Sportradar architecture team used a combination of AWS CodePipeline and AWS Step Functions to automate and reduce their deployment time to less than two hours.

Solution Overview

Sportradar’s solution uses AWS Step Functions to orchestrate the deployment of resources, the recovery of data, and the deployment of application code, and to navigate all necessary dependencies for order of deployment. While deployment can be orchestrated through CodePipeline, Sportradar used their familiarity with Step Functions to create a quick and repeatable deployment process for their environment.

Sportradar’s solution to a ransomware Disaster Recovery scenario has also provided them with a reliable and accelerated process for deploying development and testing environments. Developers are now able to scale testing and development environments up and down as needed.  This has allowed their Development and QA teams to follow the pace of feature development, versus weekly or bi-weekly feature release and testing schedules tied to a single testing environment.

Reference Architecture Showing How Sportradar Accelerated Data Recovery

Figure 1 – Reference Architecture Diagram showing Automated Deployment Flow

Prerequisites

The prerequisites for implementing this deployment strategy are:

  • An implemented database backup policy
  • Ideally data should be backed up to a data bunker AWS account outside the scope of the environment you are looking to protect. This is so that in the event of a ransomware attack, your backed up data is isolated from your affected environment and account
  • Application code within a GitHub repository
  • Separation of duties
  • Access and responsibility for the backups and GitHub repository should be separated to different stakeholders in order to reduce the likelihood of both being impacted by a security breach

Step 1: New Account Setup 

Once data destruction is identified, the first step in Sportradar’s process is to use a pre-created runbook to create a new AWS account.  A new account is created in case the malicious actors who have encrypted the application’s data have access to not just the application, but also to the AWS account the application resides in.

The runbook sets up a VPC for a selected Region, as well as spinning up the following resources:

  • Security Groups with network connectivity to their git repository (in this case GitLab), IAM Roles for their resources
  • KMS Keys
  • Amazon S3 buckets with CloudFormation deployment templates
  • CodeBuild, CodeDeploy, and CodePipeline

Step 2: Deploying Secrets

It is a security best practice to ensure that no secrets are hard-coded into your application code. So, after account setup is complete, the new AWS account’s access keys and the selected AWS Region are passed into CodePipeline variables. The application secrets are then deployed to AWS Systems Manager Parameter Store.

Step 3: Deploying Orchestrator Step Function and In-Memory Databases

To optimize deployment time, Sportradar decided to leave the deployment of their in-memory databases running on Amazon EC2 outside of their orchestrator Step Function.  They deployed the database using a CloudFormation template from their CodePipeline. This was in parallel with the deployment of the Step Function, which orchestrates the rest of their deployment.

Step 4: Step Function Orchestrates the Deployment of Microservices and Alarms

The AWS Step Functions orchestrate the deployment of Sportradar’s microservices solutions, deploying 10+ Amazon RDS instances, and restoring each dataset from DB snapshots. Following that, 80+ producer Amazon SQS queues and S3 buckets for data staging were deployed. After the successful deployment of the SQS queues, the Lambda functions for data ingestion and 15+ data processing Step Functions are deployed to begin pulling in data from various sources into the solution.

Then the API Gateways and Lambda functions which provide the API layer for each of the microservices are deployed in front of the restored RDS instances. Finally, 300+ Amazon CloudWatch Alarms are created to monitor the environment and trigger necessary alerts. In total Sportradar’s deployment process brings online: 15+ Step Functions for data processing, 30+ micro-services, 10+ Amazon RDS instances with over 150GB of data, 80+ SQS Queues, 180+ Lambda functions, CDN for UI, Amazon Elasticache, and 300+ CloudWatch alarms to monitor the applications. In all, that is over 600 resources deployed with data restored consistently in less than 2 hours total.

Reference Architecture Diagram for How Sportradar Accelerated Data Recovery Using AWS Services

Figure 2 – Reference Architecture Diagram of the Recovered Application

Conclusion

In this blog, we showed how Sportradar’s team used Step Functions to accelerate their deployments, and a walk-through of an example disaster recovery scenario. Step Functions can be used to orchestrate the deployment and configuration of a new environment, allowing complex environments to be deployed in stages, and for those stages to appropriately wait on their dependencies.

For examples of Step Functions being used in different orchestration scenarios, check out how Step Functions acts as an orchestrator for ETLs in Orchestrate multiple ETL jobs using AWS Step Functions and AWS Lambda and Orchestrate Apache Spark applications using AWS Step Functions and Apache Livy. For migrations of Amazon EC2-based workloads, read more about CloudEndure in Migrating workloads across AWS Regions with CloudEndure Migration.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

 

Ben Burdsall

Ben is currently the chief technology officer of Sportradar, – a data provider to the sporting industry, where he leads a product and engineering team of more than 800. Before that, Ben was part of the global leadership team of Worldpay.

Justin Shreve

Justin is Director of Engineering at Sportradar, leading an international team to build an innovative enterprise sports analytics platform.

Multi-Cloud and Hybrid Threat Protection with Sumo Logic Cloud SIEM Powered by AWS

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/hybrid-threat-protection-with-sumo-logic-cloud-siem-powered-by-aws/

IT security teams need to have a real-time understanding of what’s happening with their infrastructure and applications. They need to be able to find and correlate data in this continuous flood of information to identify unexpected behaviors or patterns that can lead to a security breach.

To simplify and automate this process, many solutions have been implemented over the years. For example:

  • Security information management (SIM) systems collect data such as log files into a central repository for analysis.
  • Security event management (SEM) platforms simplify data inspection and the interpretation of logs or events.

Many years ago these two approaches were merged to address both information analysis and interpretation of events. These security information and event management (SIEM) platforms provide real-time analysis of security alerts generated by applications, network hardware, and domain-specific security tools (such as firewalls and endpoint protection tools).

Today, I’d like to introduce a solution created by one of our partners: Sumo Logic Cloud SIEM powered by AWS. Sumo Logic Cloud SIEM provides deep security analytics and contextualized threat data across multi-cloud and hybrid environments to reduce the time to detect and respond to threats. You can use this solution to quickly detect and respond to the higher-priority issues, including malicious activities that negatively impact your business or brand. The solution comes with more than 300 out-of-the-box integrations, including key AWS services, and can help reduce time and effort required to conduct compliance audits for regulations such as PCI and HIPAA.

Sumo Logic Cloud SIEM is available in AWS Marketplace and you can use a free trial to evaluate the solution. Before having a look at how this works in practice, let’s see how it’s being used.

Customer Case Study – Medidata
I had the chance to meet a very interesting customer to talk about how they use Sumo Logic Cloud SIEM. Scott Sumner is the VP and CISO at Medidata, a company that is redefining what’s possible in clinical trials. Medidata is processing patient data for clinical trials of the Moderna and Johnson & Johnson COVID-19 vaccines, so you can see why security is a priority for them. With such critical workloads, the company must keep the trust of the people participating in those trials.

Scott told me, “There is an old saying: If you can’t measure it, you can’t manage it.” In fact, when he joined Medidata in 2015, one of the first things Scott did was to implement a SIEM. Medidata has been using Sumo Logic for more than five years now. They appreciate that it’s a cloud-native solution, and that has made it easier for them to follow the evolution of the tool over the years.

“Not having transparency in the environment you process data is not good for security professionals.” Scott wanted his team to be able to respond quickly and, to do so, they needed to be able to look at a single screen that displays all IP calls, network flows, and any relevant information. For example, Medidata has very aggressive checks for security scans and any kind of external access. To do so, they have to look not just at the perimeter, but at the entire environment. Sumo Logic Cloud SIEM allows them to react without breaking anything in their corporate environment, including the resources they have in more than 45 AWS accounts.

“One of the metrics that is floated around by security specialists is that you have up to five hours to respond to a tentative intrusion,” Scott says. “With Sumo Logic Cloud SIEM, we can match that time aggressively, and most of the times we respond within five minutes.” The ability to respond quickly allows Medidata to keep patient trust. This is very important for them, because delaying a clinical trial can affect people’s health.

Medidata security response is managed by a global team through three levels. Level 1 is covered by a partner who is empowered to block simple attacks. For the next level of escalation, Level 2, Medidata has a team in each region. Beyond that there is Level 3: a hardcore team of forensics examiners distributed across the US, Europe, and Asia who deal with more complicated attacks.

Availability is also important to Medidata. In addition to providing the Cloud SIEM functionality, Sumo Logic helps them monitor availability issues, such as web server failover, and very quickly figure out possible problems as they happen. Interestingly, they use Sumo Logic to better understand how applications talk with each other. These different use cases don’t complicate their setup because the security and application teams are segregated and use Sumo Logic as a single platform to share information seamlessly between the two teams when needed.

I asked Scott Sumner if Medidata was affected by the move to remote work in 2020 due to the COVID-19 pandemic. It’s an important topic for them because at that time Medidata was already involved in clinical trials to fight the pandemic. “We were a mobile environment anyway. A significant part of the company was mobile before. So, we were ready and not being impacted much by working remotely. All our tools are remote, and that helped a lot. Not sure we’d done it easily with an on-premises solution.”

Now, let’s see how this solution works in practice.

Setting Up Sumo Logic Cloud SIEM
In AWS Marketplace I search for “Sumo Logic Cloud SIEM” and look at the product page. I can either subscribe or start the one-month free trial. The free trial includes 1GB of log ingest for security and observability. There is no automatic conversion to a paid offer when the free trial expires. After the free trial I have the option to either buy Sumo Logic Cloud SIEM from AWS Marketplace or remain as a free user. I create and accept the contract and set up my Sumo Logic account.

Console screenshot.

In the setup, I choose the Sumo Logic deployment region to use. The Sumo Logic documentation provides a table that describes the AWS Regions used by each Sumo Logic deployment. I need this information later when I set up the integration between AWS security services and Sumo Logic Cloud SIEM. For now, I select US2, which corresponds to the US West (Oregon) Region in AWS.

When my Sumo Logic account is ready, I use the Sumo Logic Security Integrations on AWS Quick Start to deploy the required integrations in my AWS account. You’ll find the source files used by this Quick Start in this GitHub repository. I open the deployment guide and follow along.

This architecture diagram shows the environment deployed by this Quick Start:

Architectural diagram.

Following the steps in the deployment guide, I create an access key and access ID in my Sumo Logic account, and write down my organization ID. Then, I launch the Quick Start to deploy the integrations.

The Quick Start uses an AWS CloudFormation template to create a stack. First, I check that I am in the right AWS Region. I use US West (Oregon) because I am using US2 with Sumo Logic. Then, I leave all default values and choose Next. In the parameters, I select US2 as my Sumo Logic deployment region and enter my Sumo Logic access ID, access key, and organization ID.

Console screenshot.

After that, I enable and configure the integrations with AWS security services and tools such as AWS CloudTrail, Amazon GuardDuty, VPC Flow Logs, and AWS Config.

Console screenshot.

If you have a delegated administrator with GuardDuty, you can enable multiple member accounts as long as they are part of the same AWS organization. With that, all findings from member accounts are vended to the delegated administrator and then to Sumo Logic Cloud SIEM through the GuardDuty events processor.

In the next step, I leave the stack options to their default values. I review the configuration and acknowledge the additional capabilities required by this stack (such as the creation of IAM resources) and then choose Create stack.

When the stack creation is complete, the integration with Sumo Logic Cloud SIEM is ready. If I had a hybrid architecture, I could connect those resources for a single point of view and analysis of my security events.

Using Sumo Logic Cloud SIEM
To see how the integration with AWS security services works, and how security events are handled by the SIEM, I use the amazon-guardduty-tester open-source scripts to generate security findings.

First, I use the included CloudFormation template to launch an Amazon Elastic Compute Cloud (Amazon EC2) instance in an Amazon Virtual Private Cloud (VPC) private subnet. The stack also includes a bastion host to provide external access. When the stack has been created, I write down the IP addresses of the two instances from the stack output.

Console screenshot.

Then, I use SSH to connect to the EC2 instance in the private subnet through the bastion host. There are easy-to-follow instructions in the README file. I use the guardduty_tester.sh script, installed in the instance by CloudFormation, to generate security findings for my AWS account.

$ ./guardduty_tester.sh

SSH screenshot.

GuardDuty processes these findings and the events are sent to Sumo Logic through the integration I just set up. In the Sumo Logic GuardDuty dashboard, I see the threats ready to be analyzed and addressed.

Console screenshot.

Availability and Pricing
Sumo Logic Cloud SIEM powered by AWS is a multi-tenant Software as a Service (SaaS) available in AWS Marketplace that ingests data over HTTPS/TLS 1.2 on the public internet. You can connect data from any AWS Region and from multi-cloud and hybrid architectures for a single point of view of your security events.

Start a free trial of Sumo Logic Cloud SIEM and see how it can help your security team.

Read the Sumo Logic team’s blog post here for more information on the service. 

Danilo

Field Notes: Orchestrating and Monitoring Complex, Long-running Workflows Using AWS Step Functions

Post Syndicated from Max Winter original https://aws.amazon.com/blogs/architecture/field-notes-orchestrating-and-monitoring-complex-long-running-workflows-using-aws-step-functions/

Situation:

IHS Markit’s Wall Street Office (WSO) offers financial reports to hundreds of clients worldwide. When IHS Markit completed the migration of WSO’s SaaS software to AWS, it unlocked the power and agility to deliver new product features monthly, as opposed to a multi-year release cycle. This migration also presented a great opportunity to further enhance the customer experience by automating the WSO reporting team’s own Continuous Integration and Continuous Deployment (CI/CD) workflow. WSO then offered the same migration workflow to its on-prem clients, who needed the ability to upgrade quickly in order to meet a regulatory LIBOR reporting deadline. This rapid upgrade was enabled by fully automating regression testing of new software versions.

In this blog post, I outline the architectures created in collaboration with WSO to orchestrate and monitor the complex, long-running reconciliation workflows in their environment by leveraging the power of AWS Step Functions. To enable each client’s migration to AWS, WSO needed to ensure that the new, AWS version of the reporting application produced identical outputs to the previous, on-premises version. For a single migration, the process is as follows:

  • spin up the old version of the SQL Server and reporting engine on Windows servers,
  • run reports,
  • repeat the process with the new version,
  • compare the outputs and review the differences.

The problem came with scaling this process.  IHS Markit provides financial solutions and tools to numerous clients. To enable these clients to transition away from LIBOR, the WSO team was tasked with migrating over 80 instances of the application, and reconciling hundreds of reports for each migration. During upgrades, customers must manually validate custom extracts created in the WSO Reporting application against current and next version, which limits upgrade frequency and increases the resourcing cost of these validations. Without automation, upgrading all clients would have taken an entire new Operations team, and cost the firm over 700 developer-hours to meet the regulatory LIBOR cessation deadline.

WSO was able to save over 4,000 developer-hours by making this process repeatable so it can be used as an automated regression test as part of the regular Systems Development Lifecycle process. The following diagram shows the reconciliation workflow steps enabled as part of this automated process.

Reconciliation Workflow Steps

Figure 1 – Reconciliation Workflow Steps

Complication:

The team quickly realized that a Serverless and event-driven solution would be required to make this process manageable. The initial approach was to use AWS Lambda functions to call PowerShell scripts to perform each step in the reconciliation process. They also used Amazon SNS to invoke the next Lambda function when the previous step completed.

The problem came when the Operations team tried to monitor these Lambda functions, with multiple parallel reconciliations running concurrently. The Lambda outputs became mixed together in shared Amazon CloudWatch log groups, and there was no way to quickly see the overall progress of any given reconciliation workflow. It was also difficult to figure out how to recover from errors.

Furthermore, the team found that some steps in this process, such as database restoration, ran longer than the 15 minute Lambda timeout limit. As a result, they were forced to look for alternatives to manage these long-running steps. Following is an architecture diagram showing the serverless component used to automate and scale the process.

Initial Orchestration Architecture

Figure 2 – Initial Orchestration Architecture

Solution:

Enter Step Functions and AWS Systems Manager (formerly known as SSM) Automation. To address the problem of orchestrating the many sequential and parallel steps in our workflow, AWS Solutions Architects suggested replacing Amazon SNS with AWS Step Functions.

The Step Functions state machine controls the order in which the steps are invoked, including successful and error state transitions. The service is integrated with 14 other AWS services (Lambda functions, SSM Automation, Amazon ECS, and more), and can invoke them as well as manual actions. These calls can be synchronous or run via steps that wait for an event. A state machine instance is long-lived and can support processes that take up to a year to complete.

Step Function Designer UI

Figure 3 – Step Function Designer UI

This immediately gave the Development team a holistic, visual way to design our workflow, and offered Operations a graphical user interface (UI) to monitor ongoing reconciliations in real time. The Step Functions console lists out all running and past reconciliations, including their status, and allows the operator to drill down into the detailed state diagram of any given reconciliation. The operator can then see how far it’s progressed or where it encountered an error.

The UI also provides Amazon CloudWatch links for any given step, isolating the logs of that particular Lambda execution, eliminating the need to search through the CloudWatch log group manually. The screenshot below illustrates what an in-progress Step Function looks like to an operator, with each step listed out with its own status and a link to its log.

Step Function Execution Monitor

Figure 4 – Step Function Execution Monitor

 

Figure 5 – Step Function Execution Detail Viewer

 

Figure 6 – Step Function Log Group in Amazon CloudWatch

The team also used the Step Function state machine as a container for metadata about each particular reconciliation process instance (like the environment ID and the database and Amazon EC2 instances associated with that environment), reducing the need to pass this data between Lambda functions.

To solve the problem of long-running PowerShell scripts, AWS Solutions Architects suggested using SSM Automation. Unlike Lambda functions, SSM Automation is meant to run operational scripts, with no 15-minute time limit. It also has native PowerShell integration, which you can use to call the existing scripts and capture their output.

 

SSM Automation UI for Monitoring Long-running Tasks and Manual Approval Steps

Figure 7 – SSM Automation UI for Monitoring Long-running Tasks and Manual Approval Steps

To save time running hundreds of reports, the team looked into the Map state feature of Step Functions. A Map state takes an array of input data, creates an instance of the inner step (in this case, a Lambda call) for each item in the array, and waits for them all to complete before proceeding.

This implements a fan-out pattern with almost no orchestration code. The Map state also gives Operations the option to limit the level of parallelism, in this case letting only five reports run simultaneously, which prevents overloading the reporting applications and databases.
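Expressed in Amazon States Language (shown here as a Python dictionary), a Map state of this shape might look like the following sketch; the field paths and function name are illustrative, not the actual WSO definition.

```python
# Hypothetical fan-out step: run one CompareReport task per report, at most
# five at a time, and collect the results before moving on.
compare_reports_state = {
    "CompareReports": {
        "Type": "Map",
        "ItemsPath": "$.reports",          # array of report identifiers in the input
        "MaxConcurrency": 5,               # run at most five comparisons at once
        "Iterator": {
            "StartAt": "CompareReport",
            "States": {
                "CompareReport": {
                    "Type": "Task",
                    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:CompareReport",
                    "End": True,
                }
            },
        },
        "ResultPath": "$.report_results",  # collected outputs, one per report
        "Next": "ReviewResults",
    }
}
```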

Figure 8 – Map State is used to Fan Out the “CompareReport” Step as Multiple Parallel Steps

To deal with errors in any of the workflow steps, the Development team introduced a manual review step, which you can model in Step Functions. The manual step notifies a mailing list of the error, then waits for a reply to tell it whether to retry or abort the workflow.
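One common way to model this pause, assuming the notification goes out through Amazon SNS, is the .waitForTaskToken callback integration: the state publishes the error together with a task token, and the execution waits until that token is returned with a success (retry) or failure (abort) signal. The sketch below is illustrative; the topic name, message fields, and state names are hypothetical.

```python
import boto3

# Hypothetical manual review state: publish the error and a task token to an
# SNS topic backed by the mailing list, then pause until the token is returned.
notify_and_wait_state = {
    "WaitForHumanReview": {
        "Type": "Task",
        "Resource": "arn:aws:states:::sns:publish.waitForTaskToken",
        "Parameters": {
            "TopicArn": "arn:aws:sns:us-east-1:123456789012:reconciliation-errors",
            "Message": {
                "error.$": "$.error",            # assumes the Catch stored error details here
                "taskToken.$": "$$.Task.Token",  # token the reviewer sends back
            },
        },
        "Next": "RestartWorkflow",  # loops back to the top of the state machine
    }
}

# An operator (or a small tool they run) later resumes or aborts the workflow:
sfn = boto3.client("stepfunctions")

def resolve(task_token: str, retry: bool) -> None:
    if retry:
        sfn.send_task_success(taskToken=task_token, output='{"decision": "retry"}')
    else:
        sfn.send_task_failure(taskToken=task_token, error="Aborted",
                              cause="Operator chose to abort the reconciliation")
```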

The only challenge the Development team found was the mechanism for re-running an individual failed step. At this time, any failure needs an explicit state transition within the state machine's diagram. While Step Functions can automatically retry a step, the team wanted to insert a wait-for-human-investigation step before retrying the more expensive and complex steps.

This presented two options:

  1. Add wait-and-loop-back steps around every step that might need a retry.
  2. Route all failures to a single wait-for-investigation step.

The former added significant complexity to the state machine, so AWS Solutions Architects raised this as a product feature that should be added to the Step Functions UI.

The proposed enhancement would allow any failed step to be manually rerun or skipped via UI controls, without adding explicit steps to each state machine to model this. In the meantime, the Development team went with the latter approach and had the human error review step loop back to the top of the state machine to retry the entire workflow. To avoid re-running long steps, they added a check within each step's Lambda function that queries the Step Functions API to determine whether that step had already succeeded before the loop-back, and returns immediately if it had.
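A minimal sketch of such an idempotency check, assuming the execution ARN and state name are passed into the Lambda through the Step Functions context object ($$.Execution.Id and $$.State.Name), could look like the following; it illustrates the approach rather than reproducing the team's exact code.

```python
import boto3

sfn = boto3.client("stepfunctions")

def step_already_succeeded(execution_arn: str, step_name: str) -> bool:
    """Return True if the named state already exited successfully earlier in
    this execution, so the step Lambda can return immediately after a loop-back."""
    paginator = sfn.get_paginator("get_execution_history")
    for page in paginator.paginate(executionArn=execution_arn):
        for event in page["events"]:
            if (
                event["type"] == "TaskStateExited"
                and event["stateExitedEventDetails"]["name"] == step_name
            ):
                return True
    return False
```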

Figure 9 – Human Intervention Steps Used to Allow Time to Review Results or Resolve Errors Before Retrying Failed Steps

Conclusion:

Within six weeks, WSO was able to run the first reconciliations and begin the LIBOR migration on time. The Step Functions designer instantly gave the Development team an operator UI and a workflow orchestration engine. Normally, this would have required building an entire three-tier stack, a scheduler, and logging infrastructure.

Instead, using Step Functions allowed the developers to spend their time on the reconciliation logic that makes their application unique. The report compare tool developed by the WSO team provides clients with automated artifacts confirming that customer report data remained identical between the current and next versions of WSO. These testing artifacts give clients robust and comprehensive testing of critical data extracts.

Figure 10 – The Completed Step Function which Orchestrates an Entire Reconciliation Process


We hope that this blog post provided useful insights to help you determine whether AWS Step Functions is a good fit for your workloads.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.