Secure connectivity patterns to access Amazon MSK across AWS Regions

Post Syndicated from Sam Mokhtari original https://aws.amazon.com/blogs/big-data/secure-connectivity-patterns-to-access-amazon-msk-across-aws-regions/

AWS customers often segment their workloads across accounts and Amazon Virtual Private Cloud (Amazon VPC) to streamline access management while being able to expand their footprint. As a result, in some scenarios you, as an AWS customer, need to make an Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster accessible to Apache Kafka clients not only in the same Amazon VPC as the cluster but also in a remote Amazon VPC. A guest post by Goldman Sachs presented cross-account connectivity patterns to an MSK cluster using AWS PrivateLink. Inspired by that work, this post demonstrates additional connectivity patterns that support both cross-account and cross-Region connectivity to an MSK cluster. We also developed sample code that automates the creation of the resources needed for the AWS PrivateLink-based connectivity pattern.

Overview

Amazon MSK makes it easy to run Apache Kafka clusters on AWS. It’s a fully managed streaming service that automatically configures and maintains Apache Kafka clusters and Apache ZooKeeper nodes for you. Amazon MSK lets you focus on building your streaming solutions, supports familiar Apache Kafka ecosystem tools (such as MirrorMaker, Kafka Connect, and Kafka Streams), and helps you avoid the challenges of managing the Apache Kafka infrastructure and operations.

If you have workloads segmented across several VPCs and AWS accounts, there may be scenarios in which you need to make an Amazon MSK cluster accessible to Apache Kafka clients across VPCs. To provide secure connections between resources across multiple VPCs, AWS provides several networking constructs. Let’s get familiar with these before discussing the different connectivity patterns:

  • Amazon VPC peering is the simplest networking construct that enables bidirectional connectivity between two VPCs. You can use this connection type to enable VPCs across accounts and AWS Regions to communicate with each other using private IP addresses.
  • AWS Transit Gateway provides a highly available and scalable design for connecting VPCs. Unlike VPC peering, which can span Regions, AWS Transit Gateway is a Regional service, but you can use inter-Region peering between transit gateways to route traffic across Regions.

AWS PrivateLink is an AWS networking service that provides private access to a specific service instead of all resources within a VPC and without traversing the public internet. You can use this service to expose your own application in a VPC to other users or applications in another VPC via an AWS PrivateLink-powered service (referred to as an endpoint service). Other AWS principals can then create a connection from their VPC to your endpoint service using an interface VPC endpoint.

Amazon MSK networking

When you create an MSK cluster, either via the AWS Management Console or AWS Command Line Interface (AWS CLI), it’s deployed into a managed VPC with brokers in private subnets (one per Availability Zone) as shown in the following diagram. Amazon MSK also creates the Apache ZooKeeper nodes in the same private subnets.

The brokers in the cluster are made accessible to clients in the customer VPC through elastic network interfaces (ENIs) that appear in the customer account. The security groups on the ENIs dictate the source and type of ingress and egress traffic allowed on the brokers.

IP addresses from the customer VPC are attached to the ENIs, and all network traffic stays within the AWS network and is not accessible to the internet.

Connections between clients and an MSK cluster are always private.

This blog demonstrates four connectivity patterns to securely access an MSK cluster from a remote VPC. The following table lists these patterns and their key characteristics. Each pattern aligns with the networking constructs discussed earlier.

| | VPC peering | AWS Transit Gateway | AWS PrivateLink with a single NLB | AWS PrivateLink with multiple NLBs |
| --- | --- | --- | --- | --- |
| Bandwidth | Limited by instance network performance and flow limits | Up to 50 Gbps | 10 Gbps per AZ | 10 Gbps per AZ |
| Pricing | Data transfer charge (free if data transfer stays within the same AZ) | Data transfer charge + hourly charge per attachment | Data transfer charge + interface endpoint charge + Network Load Balancer charge | Data transfer charge + interface endpoint charge + Network Load Balancer charge |
| Scalability | Recommended for a smaller number of VPCs | No limit on the number of VPCs | No limit on the number of VPCs | No limit on the number of VPCs |

Let’s explore these connectivity options in more detail.

VPC peering

To access an MSK cluster from a remote VPC, the first option is to create a peering connection between the two VPCs.

Let’s say you use Account A to provision an MSK cluster in the us-east-1 Region, as shown in the following diagram. Now, you have an Apache Kafka client in the customer VPC in Account B that needs to access this MSK cluster. To enable this connectivity, you just need to create a peering connection between the VPC in Account A and the VPC in Account B. You should also consider implementing fine-grained network access controls with security groups to make sure that only specific resources are accessible between the peered VPCs.
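As a rough illustration, the following boto3 sketch wires up this peering connection. The VPC IDs, account ID, route table ID, CIDR blocks, and security group ID are hypothetical placeholders, and a return route in Account A’s route tables (not shown) is also required.

import boto3

ec2_a = boto3.client("ec2", region_name="us-east-1")  # assumes Account A credentials

# Request a peering connection from Account A's MSK VPC to Account B's client VPC.
pcx = ec2_a.create_vpc_peering_connection(
    VpcId="vpc-0aaaaaaaaaaaaaaaa",         # hypothetical MSK cluster VPC (Account A)
    PeerVpcId="vpc-0bbbbbbbbbbbbbbbb",     # hypothetical client VPC (Account B)
    PeerOwnerId="222222222222",            # hypothetical Account B ID
    PeerRegion="us-east-1",
)["VpcPeeringConnection"]["VpcPeeringConnectionId"]

# Account B accepts the request (run with Account B credentials).
ec2_b = boto3.client("ec2", region_name="us-east-1")
ec2_b.accept_vpc_peering_connection(VpcPeeringConnectionId=pcx)

# Route the MSK VPC's CIDR through the peering connection from the client subnets.
ec2_b.create_route(
    RouteTableId="rtb-0cccccccccccccccc",  # hypothetical client route table
    DestinationCidrBlock="10.0.0.0/16",    # hypothetical CIDR of the MSK cluster VPC
    VpcPeeringConnectionId=pcx,
)

# Allow only the Kafka TLS port from the client VPC's CIDR on the broker security group.
ec2_a.authorize_security_group_ingress(
    GroupId="sg-0aaaaaaaaaaaaaaaa",        # hypothetical security group on the broker ENIs
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 9094, "ToPort": 9094,
        "IpRanges": [{"CidrIp": "10.1.0.0/16"}],  # hypothetical client VPC CIDR
    }],
)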

Because VPC peering works across Regions, you can extend this architecture to provide access to Apache Kafka clients in another Region. As shown in the following diagram, to provide access to Kafka clients in the VPC of Account C, you just need to create another peering connection between the VPC in Account C with the VPC in Account A. The same networking principles apply to make sure only specific resources are reachable. In the following diagram, a solid line indicates a direct connection from the Kafka client to MSK cluster, whereas a dotted line indicates a connection flowing via VPC peering.

VPC peering has the following benefits:

  • Simplest connectivity option.
  • Low latency.
  • No bandwidth limits beyond instance network performance and flow limits.
  • Lower overall cost compared to other VPC-to-VPC connectivity options.

However, it has some drawbacks:

  • VPC peering doesn’t support transitive peering, which means that only directly peered VPCs can communicate with each other.
  • You can’t use this connectivity pattern when there are overlapping IPv4 or IPv6 CIDR blocks in the VPCs.
  • Managing access can become challenging as the number of peered VPCs grows.

You can use VPC peering when the number of VPCs to be peered is less than 10.

AWS Transit Gateway

AWS Transit Gateway can provide scalable connectivity to MSK clusters. The following diagram demonstrates how to use this service to provide connectivity to an MSK cluster. Let’s again consider a VPC in Account A running an MSK cluster, and an Apache Kafka client in a remote VPC in Account B that needs to connect to this MSK cluster. You set up AWS Transit Gateway to connect these VPCs and use route tables on the transit gateway to control the routing.

To extend this architecture to support access from a VPC in another Region, you need to use another transit gateway because this service can’t span Regions. In other words, for the Apache Kafka client in Account C in us-west-2 to connect to the MSK cluster, you need to peer another transit gateway in us-west-2 with the transit gateway in us-east-1 and work with the route tables to manage access to the MSK cluster. If you need to connect another account in us-west-2, you don’t need an additional transit gateway. The Apache Kafka clients in the new account (Account D) simply require a connection to the existing transit gateway in us-west-2 and the appropriate route tables.
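The cross-Region portion of this setup can be sketched with boto3 as follows. The VPC, subnet, and account identifiers are hypothetical, and because inter-Region peering doesn’t propagate routes dynamically, static routes still need to be added to each transit gateway route table afterward.

import boto3

ec2_use1 = boto3.client("ec2", region_name="us-east-1")  # assumes Account A credentials
ec2_usw2 = boto3.client("ec2", region_name="us-west-2")  # assumes credentials for the remote account

# Transit gateway in us-east-1, attached to the MSK cluster VPC in Account A.
# (In practice, wait for each resource to become available before attaching or peering.)
tgw_use1_id = ec2_use1.create_transit_gateway(Description="MSK hub us-east-1")[
    "TransitGateway"]["TransitGatewayId"]
ec2_use1.create_transit_gateway_vpc_attachment(
    TransitGatewayId=tgw_use1_id,
    VpcId="vpc-0aaaaaaaaaaaaaaaa",                                # hypothetical MSK cluster VPC
    SubnetIds=["subnet-0aaa1", "subnet-0aaa2", "subnet-0aaa3"],   # hypothetical subnets, one per AZ
)

# Transit gateway in us-west-2 for the remote Apache Kafka clients.
tgw_usw2_id = ec2_usw2.create_transit_gateway(Description="MSK hub us-west-2")[
    "TransitGateway"]["TransitGatewayId"]

# Peer the two transit gateways across Regions; the peer side must accept the attachment.
peering_id = ec2_usw2.create_transit_gateway_peering_attachment(
    TransitGatewayId=tgw_usw2_id,
    PeerTransitGatewayId=tgw_use1_id,
    PeerAccountId="111111111111",                                 # hypothetical Account A ID
    PeerRegion="us-east-1",
)["TransitGatewayPeeringAttachment"]["TransitGatewayPeeringAttachmentId"]
ec2_use1.accept_transit_gateway_peering_attachment(
    TransitGatewayPeeringAttachmentId=peering_id
)

# Static routes pointing each transit gateway at the other side's CIDR ranges go into the
# transit gateway route tables (create_transit_gateway_route), not shown here.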

The hub and spoke model for AWS Transit Gateway simplifies management at scale because VPCs only need to connect to one transit gateway per Region to gain access to the MSK cluster in the attached VPCs. However, this setup has some drawbacks:

  • Unlike VPC peering, in which you only pay for data transfer charges, Transit Gateway has an hourly charge per attachment in addition to the data transfer fee.
  • This connectivity pattern doesn’t support transitive routing.
  • Unlike VPC peering, Transit Gateway adds an extra hop between VPCs, which results in higher latency.
  • The maximum bandwidth (burst) per Availability Zone per VPC connection is 50 Gbps.

You can use AWS Transit Gateway when you need to provide scalable access to the MSK cluster.

AWS PrivateLink

To provide private, unidirectional access from an Apache Kafka client to an MSK cluster across VPCs, you can use AWS PrivateLink. This also eliminates the need to expose the entire VPC or subnet and prevents issues like having to deal with overlapping CIDR blocks between the VPC that hosts the MSK cluster ENIs and the remote Apache Kafka client VPC.

Let’s do a quick recap of the architecture as explained in blog post How Goldman Sachs builds cross-account connectivity to their Amazon MSK clusters with AWS PrivateLink.

Let’s assume Account A has a VPC with three private subnets and an MSK cluster with three broker nodes in a 3-AZ deployment. You have three ENIs, one for each broker node in each subnet, and each ENI gets a private IPv4 address from its subnet’s CIDR block and an MSK broker DNS endpoint. To expose the MSK cluster in Account A to other accounts via AWS PrivateLink, you have to create a VPC endpoint service in Account A. The VPC endpoint service requires the entity, in this case the MSK cluster, to be fronted by a Network Load Balancer (NLB).

You can choose from two patterns using AWS PrivateLink to provide cross-account access to Amazon MSK: with a single NLB or multiple NLBs.

AWS PrivateLink connectivity pattern with a single NLB

The following diagram illustrates access to an MSK cluster via an AWS PrivateLink connectivity pattern with a single NLB.

In this pattern, you have a single dedicated internal NLB in Account A. The NLB has a separate listener for each MSK broker. Because this pattern has a single NLB endpoint, each listener needs to listen on a unique port. In the preceding diagram, the ports are depicted as 8443, 8444, and 8445. Correspondingly, for each listener, you have a unique target group, each of which has a single registered target: the IP address of an MSK broker ENI. Because these ports differ from the advertised listeners defined in the MSK cluster for each of the broker nodes, the advertised listeners configuration for each of the broker nodes in the cluster must be updated. Additionally, one target group has all the broker ENI IPs as targets and a corresponding listener (on port 9094), which means a request coming to the NLB on port 9094 can be routed to any of the MSK brokers.
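A hedged boto3 sketch of that listener and target group layout follows. It assumes an existing internal NLB, hypothetical broker ENI IP addresses, and that the brokers keep listening on TLS port 9094 behind the per-broker front-end ports; all ARNs and IDs are placeholders.

import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

nlb_arn = "arn:aws:elasticloadbalancing:us-east-1:111111111111:loadbalancer/net/msk-nlb/abc"  # hypothetical
vpc_id = "vpc-0aaaaaaaaaaaaaaaa"                                                              # hypothetical
brokers = {"10.0.1.10": 8443, "10.0.2.10": 8444, "10.0.3.10": 8445}  # hypothetical ENI IP -> front-end port

# One target group and one listener per broker: unique front-end port, broker TLS port behind it.
for ip, port in brokers.items():
    tg_arn = elbv2.create_target_group(
        Name=f"msk-broker-{port}",
        Protocol="TCP",
        Port=9094,
        VpcId=vpc_id,
        TargetType="ip",
    )["TargetGroups"][0]["TargetGroupArn"]
    elbv2.register_targets(TargetGroupArn=tg_arn, Targets=[{"Id": ip, "Port": 9094}])
    elbv2.create_listener(
        LoadBalancerArn=nlb_arn,
        Protocol="TCP",
        Port=port,
        DefaultActions=[{"Type": "forward", "TargetGroupArn": tg_arn}],
    )

# Catch-all: one target group with every broker IP behind a listener on 9094,
# so a bootstrap request on 9094 can land on any broker.
tg_all = elbv2.create_target_group(
    Name="msk-all-brokers", Protocol="TCP", Port=9094, VpcId=vpc_id, TargetType="ip"
)["TargetGroups"][0]["TargetGroupArn"]
elbv2.register_targets(
    TargetGroupArn=tg_all, Targets=[{"Id": ip, "Port": 9094} for ip in brokers]
)
elbv2.create_listener(
    LoadBalancerArn=nlb_arn,
    Protocol="TCP",
    Port=9094,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": tg_all}],
)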

In Account B, you need to create a corresponding VPC endpoint for the VPC endpoint service in Account A. Apache Kafka clients in Account B can then connect to the MSK cluster in Account A by directing their requests to the VPC endpoint. For Transport Layer Security (TLS) to work, you also need an Amazon Route 53 private hosted zone with the domain name kafka.<region of the amazon msk cluster>.amazonaws.com, with alias resource record sets for each of the broker endpoints pointing to the VPC endpoint in Account B.

In this pattern, for the Apache Kafka clients local to the VPC with the Amazon MSK broker ENIs in Account A to connect to the MSK cluster, you need to set up a Route 53 private hosted zone, similar to Account B, with alias resource record sets for each of the broker endpoints pointing to the NLB endpoint. This is because the ports in the advertised.listeners configuration have been changed for the brokers and the default Amazon MSK broker endpoints won’t work.
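The Route 53 piece in Account B might look roughly like the boto3 sketch below. The broker endpoint names, VPC endpoint DNS name, and hosted zone IDs are hypothetical placeholders to be replaced with the values from your own cluster and interface endpoint.

import boto3

r53 = boto3.client("route53")

# Private hosted zone in Account B covering the MSK broker domain.
zone_id = r53.create_hosted_zone(
    Name="kafka.us-east-1.amazonaws.com",
    CallerReference="msk-privatelink-example-1",   # any unique string
    HostedZoneConfig={"PrivateZone": True},
    VPC={"VPCRegion": "us-east-1", "VPCId": "vpc-0bbbbbbbbbbbbbbbb"},  # hypothetical client VPC
)["HostedZone"]["Id"]

# Alias each broker endpoint to the interface VPC endpoint's DNS name.
broker_names = [
    "b-1.mycluster.abc123.c2.kafka.us-east-1.amazonaws.com",  # hypothetical broker endpoints
    "b-2.mycluster.abc123.c2.kafka.us-east-1.amazonaws.com",
    "b-3.mycluster.abc123.c2.kafka.us-east-1.amazonaws.com",
]
endpoint_dns = "vpce-0dddddddddddddddd-abcdefgh.vpce-svc-0eeeeeeeeeeeeeeee.us-east-1.vpce.amazonaws.com"  # hypothetical
endpoint_zone_id = "Z00000000000000000000"  # hosted zone ID of the interface endpoint (hypothetical)

r53.change_resource_record_sets(
    HostedZoneId=zone_id,
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": name,
                    "Type": "A",
                    "AliasTarget": {
                        "HostedZoneId": endpoint_zone_id,
                        "DNSName": endpoint_dns,
                        "EvaluateTargetHealth": False,
                    },
                },
            }
            for name in broker_names
        ]
    },
)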

To extend this connectivity pattern and provide access to Apache Kafka clients in a remote Region, you need to create a peering connection (which can be via VPC peering or AWS Transit Gateway) between the VPC in Account B and the VPC in the remote Region. The same networking principles apply to make sure only specific intended resources are reachable.

AWS PrivateLink connectivity pattern with multiple NLBs

In the second pattern, you don’t share one VPC endpoint service or NLB across multiple MSK brokers. Instead, you have an independent set for each broker. Each NLB has only one listener, listening on the same port (9094), for requests to its Amazon MSK broker. Correspondingly, you have a separate VPC endpoint service for each NLB and each broker. Just like in the first pattern, in Account B you need a Route 53 private hosted zone to alias broker DNS endpoints to VPC endpoints; in this case, each broker endpoint is aliased to its own specific VPC endpoint.
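To give a feel for the per-broker wiring, here is a hedged boto3 sketch that creates one endpoint service per NLB and the matching interface endpoints in Account B. The NLB ARNs, account IDs, VPC, subnet, and security group IDs are hypothetical, and this is a simplification of the automation in the sample code linked below.

import boto3

ec2_a = boto3.client("ec2", region_name="us-east-1")  # assumes Account A credentials

# Hypothetical per-broker NLBs, each fronting a single broker ENI on port 9094.
nlb_arns = [
    "arn:aws:elasticloadbalancing:us-east-1:111111111111:loadbalancer/net/msk-b1/abc",
    "arn:aws:elasticloadbalancing:us-east-1:111111111111:loadbalancer/net/msk-b2/def",
    "arn:aws:elasticloadbalancing:us-east-1:111111111111:loadbalancer/net/msk-b3/ghi",
]

service_names = []
for arn in nlb_arns:
    svc = ec2_a.create_vpc_endpoint_service_configuration(
        NetworkLoadBalancerArns=[arn],
        AcceptanceRequired=False,   # or True if each consumer connection should be approved
    )["ServiceConfiguration"]
    # Allow Account B to create endpoints against this service.
    ec2_a.modify_vpc_endpoint_service_permissions(
        ServiceId=svc["ServiceId"],
        AddAllowedPrincipals=["arn:aws:iam::222222222222:root"],  # hypothetical Account B
    )
    service_names.append(svc["ServiceName"])

# In Account B, create one interface endpoint per endpoint service (one per broker).
ec2_b = boto3.client("ec2", region_name="us-east-1")  # assumes Account B credentials
for name in service_names:
    ec2_b.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId="vpc-0bbbbbbbbbbbbbbbb",              # hypothetical client VPC
        ServiceName=name,
        SubnetIds=["subnet-0bbb1"],                 # hypothetical client subnet
        SecurityGroupIds=["sg-0bbbbbbbbbbbbbbbb"],  # hypothetical security group allowing 9094
    )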

This pattern has the advantage of not having to modify the advertised listeners configuration in the MSK cluster. However, there is an additional cost of deploying more NLBs, one for each broker. Furthermore, in this pattern, Apache Kafka clients that are local to the VPC with the MSK broker ENIs in Account A can connect to the cluster as usual with no additional setup needed. The following diagram illustrates this setup.

To extend this connectivity pattern and provide access to Apache Kafka clients in a remote Region, you need to create a peering connection between the VPC in Account B and the VPC in the remote Region.

You can use the sample code provided on GitHub to set up the AWS PrivateLink connectivity pattern with multiple NLBs for an MSK cluster. The intent of the code is to automate the creation of the multiple resources involved instead of wiring them up manually.

These patterns have the following benefits:

  • They are scalable solutions and do not limit the number of consumer VPCs.
  • AWS PrivateLink allows for VPC CIDR ranges to overlap.
  • You don’t need path definitions or route tables (access is granted only to the MSK cluster), so it’s easier to manage.

 The drawbacks are as follows:

  • The VPC endpoint and service must be in the same Region.
  • The VPC endpoints support IPv4 traffic only.
  • The endpoints can’t be transferred from one VPC to another.

You can use either connectivity pattern when you need your solution to scale to a large number of Amazon VPCs that can consume each service. You can also use either pattern when the cluster and client VPCs have overlapping IP addresses and when you want to restrict access to only the MSK cluster instead of the VPC itself. The single NLB pattern adds complexity to the architecture because you need to maintain an additional target group and listener that has all brokers registered, as well as keep the advertised.listeners property up to date. You can offset that complexity with the multiple NLB pattern, but at an additional cost for the increased number of NLBs.

Conclusion

In this post, we explored different secure connectivity patterns to access an MSK cluster from a remote VPC. We also discussed the advantages, challenges, and limitations of each connectivity pattern. You can use this post as guidance to help you identify an appropriate connectivity pattern to address your requirements for accessing an MSK cluster. You can also use a combination of connectivity patterns to address your use case.

References

To read more about the solutions that inspired this post, see How Goldman Sachs builds cross-account connectivity to their Amazon MSK clusters with AWS PrivateLink and the webinar Cross-Account Connectivity Options for Amazon MSK.


About the Authors

Dr. Sam Mokhtari is a Senior Solutions Architect in AWS. His main area of depth is data and analytics, and he has published more than 30 influential articles in this field. He is also a respected data and analytics advisor who led several large-scale implementation projects across different industries including energy, health, telecom, and transport.


Pooja Chikkala is a Solutions Architect in AWS. Big data and analytics is her area of interest. She has 13 years of experience leading large-scale engineering projects with expertise in designing and managing both on-premises and cloud-based infrastructures.


Rajeev Chakrabarti is a Principal Developer Advocate with the Amazon MSK team. He has worked for many years in the big data and data streaming space. Before joining the Amazon MSK team, he was a Streaming Specialist SA helping customers build streaming pipelines.


Imtiaz (Taz) Sayed is the WW Tech Leader for Analytics at AWS. He enjoys engaging with the community on all things data and analytics.


Effective data lakes using AWS Lake Formation, Part 5: Securing data lakes with row-level access control

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/effective-data-lakes-using-aws-lake-formation-part-5-secure-data-lakes-with-row-level-access-control/

Increasingly, customers are looking at data lakes as a core part of their strategy to democratize data access across the organization. Data lakes enable you to handle petabytes and exabytes of data coming from a multitude of sources in varying formats, and give users the ability to access it from their choice of analytics and machine learning tools. Fine-grained access controls are needed to ensure data is protected and access is granted to only those who require it.

AWS Lake Formation is a fully managed service that helps you build, secure, and manage data lakes, and provide access control for data in the data lake. Lake Formation row-level permissions allow you to restrict access to specific rows based on data compliance and governance policies. Lake Formation also provides centralized auditing and compliance reporting by identifying which principals accessed what data, when, and through which services.

Effective data lakes using AWS Lake Formation

This post demonstrates how row-level access controls work in Lake Formation, and how to set them up.

If you have large fact tables storing billions of records, you need a way to enable different users and teams to access only the data they’re allowed to see. Row-level access control is a simple and performant way to protect data, while giving users access to the data they need to perform their job. In the retail industry for instance, you may want individual departments to only see their own transactions, but allow regional managers access to transactions from every department.

Traditionally, you could achieve row-level access control in a data lake through one of two common approaches:

  • Duplicate the data, redact sensitive information, and grant coarse-grained permissions on the redacted dataset
  • Load data into a database or a data warehouse, create a view with a WHERE clause to select only specific records, and grant permission on the resulting view

These solutions work well when dealing with a small number of tables, principals, and permissions. However, they make it difficult to audit and maintain because access controls are spread across multiple systems and methods. To make it easier to manage and enforce fine-grained access controls in a data lake, we announced a preview of Lake Formation row-level access controls. With this preview feature, you can create row-level filters and attach them to tables to restrict access to data for AWS Identity and Access Management (IAM) and SAMLv2 federated identities.

How data filters work for row-level security

Granting permissions on a table with row-level security (row filtering) restricts access to only specific rows in the table. The filtering is based on the values of one or more columns. For example, a salesperson analyzing sales opportunities should only be allowed to see those opportunities in their assigned territory and not others. We can define row-level filters to restrict access where the value of the territory column matches the assigned territory of the user.

With row-level security, we introduced the concept of data filters. Data filters make it simpler to manage and assign a large number of fine-grained permissions. You can specify the row filter expression using the WHERE clause syntax described in the PartiQL dialect.
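For readers who prefer to script this, the two marketplace filters created later in this post can also be defined programmatically with a sufficiently recent version of boto3. The following sketch assumes the database and table created by the CloudFormation template in this post; the catalog (account) ID is a hypothetical placeholder.

import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# One row-level filter per marketplace; an empty ColumnWildcard exclusion list keeps every column visible.
for name, marketplace in [("amazon_reviews_US", "US"), ("amazon_reviews_JP", "JP")]:
    lf.create_data_cells_filter(
        TableData={
            "TableCatalogId": "111122223333",   # hypothetical AWS account ID
            "DatabaseName": "lakeformation_tutorial_row_security",
            "TableName": "amazon_reviews",
            "Name": name,
            "RowFilter": {"FilterExpression": f"marketplace='{marketplace}'"},
            "ColumnWildcard": {"ExcludedColumnNames": []},
        }
    )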

Example use case

In this post, a fictional ecommerce company sells many different products, like books, videos, and toys. Customers can leave reviews and star ratings for each product, so other customers can make informed decisions about what they should buy. We use the Amazon Customer Reviews Dataset, which includes different products and customer reviews.

To illustrate the different roles and responsibilities of a data owner and a data consumer, we assume two personas: a data lake administrator and a data analyst. The administrator is responsible for setting up the data lake, creating data filters, and granting permissions to data analysts. Data analysts residing in different countries (for our use case, the US and Japan) can only analyze product reviews for customers located in their own country and, for compliance reasons, shouldn’t be able to see data for customers located in other countries. We have two data analysts: one responsible for the US marketplace and another for the Japanese marketplace. Each analyst uses Amazon Athena to analyze customer reviews for their specific marketplace only.

Set up resources with AWS CloudFormation

This post includes an AWS CloudFormation template for a quick setup. You can review and customize it to suit your needs.

The CloudFormation template generates the following resources:

  • An AWS Lambda function (for Lambda-backed AWS CloudFormation custom resources). We use the function to copy sample data files from the public S3 bucket to your Amazon Simple Storage Service (Amazon S3) bucket.
  • An S3 bucket to serve as our data lake.
  • IAM users and policies:
    • DataLakeAdmin
    • DataAnalystUS
    • DataAnalystJP
  • An AWS Glue Data Catalog database, table, and partition.
  • Lake Formation data lake settings and permissions.

When following the steps in this section, use either us-east-1 or us-west-2 Regions (where the preview functionality is currently available).

Before launching the CloudFormation template, you need to ensure that you have disabled Use only IAM access control for new databases/tables by following these steps:

  1. Sign in to the Lake Formation console in the us-east-1 or us-west-2 Region.
  2. Under Data catalog, choose Settings.
  3. Deselect Use only IAM access control for new databases and Use only IAM access control for new tables in new databases.
  4. Choose Save.

To launch the CloudFormation stack, complete the following steps:

  1. Sign in to the CloudFormation console in the same Region.
  2. Choose Launch Stack:
  3. Choose Next.
  4. For DatalakeAdminUserName and DatalakeAdminUserPassword, enter the user name and password you want for the data lake admin IAM user.
  5. For DataAnalystUsUserName and DataAnalystUsUserPassword, enter the user name and password you want for the data analyst user who is responsible for the US marketplace.
  6. For DataAnalystJpUserName and DataAnalystJpUserPassword, enter the user name and password you want for the data analyst user who is responsible for the Japanese marketplace.
  7. For DataLakeBucketName, enter the name of your data lake bucket.
  8. For DatabaseName and TableName, leave as the default.
  9. Choose Next.
  10. On the next page, choose Next.
  11. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  12. Choose Create.

Stack creation can take about 1 minute.

Query without data filters

After you set up the environment, you can query the product reviews table. Let’s first query the table without row-level access controls to make sure we can see the data. If you’re running queries in Athena for the first time, you need to configure the query result location.

Sign in to the Athena console using the DatalakeAdmin user, and run the following query:

SELECT * 
FROM lakeformation_tutorial_row_security.amazon_reviews
LIMIT 10

The following screenshot shows the query result. This table has only one partition, product_category=Video, so each record is a review comment for a video product.

Let’s run an aggregation query to retrieve the total number of records per marketplace:

SELECT marketplace, count(*) as total_count
FROM lakeformation_tutorial_row_security.amazon_reviews
GROUP BY marketplace

The following screenshot shows the query result. The marketplace column has five different values. In the subsequent steps, we set up row-based filters using the marketplace column.

Set up data filters

Let’s start by creating two different data filters, one for the analyst responsible for the US marketplace, and another for the one responsible for the Japanese marketplace. Then we grant the users their respective permissions.

Create a filter for the US marketplace data

Let’s first set up a filter for the US marketplace data.

  1. As the DatalakeAdmin user, open the Lake Formation console.
  2. Choose Data filters.
  3. Choose Create new filter.
  4. For Data filter name, enter amazon_reviews_US.
  5. For Target database, choose the database lakeformation_tutorial_row_security.
  6. For Target table, choose the table amazon_reviews.
  7. For Column-level access, leave as the default.
  8. For Row filter expression, enter marketplace='US'.
  9. Choose Create filter.

Create a filter for the Japanese marketplace data

Let’s create another data filter to restrict access to the Japanese marketplace data.

  1. On the Data filters page, choose Create new filter.
  2. For Data filter name, enter amazon_reviews_JP.
  3. For Target database, choose the database lakeformation_tutorial_row_security.
  4. For Target table, choose the table amazon_reviews.
  5. For Column-level access, leave as the default.
  6. For Row filter expression, enter marketplace='JP'.
  7. Choose Create filter.

Grant permissions to the US data analyst

Now we have two data filters. Next, we need to grant permissions using these data filters to our analysts. We start by granting permissions to the DataAnalystUS user.

  1. On the Data permissions page, choose Grant.
  2. For Principals, choose IAM users and roles, and choose the user DataAnalystUS.
  3. For Policy tags or catalog resources, choose Named data catalog resources.
  4. For Database, choose the database lakeformation_tutorial_row_security.
  5. For Table, choose the table amazon_reviews.
  6. For Table permissions, select Select.
  7. For Data permissions, select Advanced cell-level filters.
  8. Select the filter amazon_reviews_US.
  9. Choose Grant.

The following screenshot shows the available data filters you can attach to a table when configuring permissions.
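If you prefer to script the grant instead of using the console, the equivalent Lake Formation API call might look like the following boto3 sketch. The IAM user ARN and catalog ID are hypothetical placeholders, and the same call with the amazon_reviews_JP filter covers the Japanese analyst in the next section.

import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# Grant SELECT on the amazon_reviews table, scoped by the amazon_reviews_US data filter.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:user/DataAnalystUS"  # hypothetical ARN
    },
    Resource={
        "DataCellsFilter": {
            "TableCatalogId": "111122223333",   # hypothetical AWS account ID
            "DatabaseName": "lakeformation_tutorial_row_security",
            "TableName": "amazon_reviews",
            "Name": "amazon_reviews_US",
        }
    },
    Permissions=["SELECT"],
)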

Grant permissions to the Japanese data analyst

Next, complete the following steps to configure permissions for the user DataAnalystJP:

  1. On the Data permissions page, choose Grant.
  2. For Principals, choose IAM users and roles, and choose the user DataAnalystJP.
  3. For Policy tags or catalog resources, choose Named data catalog resources.
  4. For Database, choose the database lakeformation_tutorial_row_security.
  5. For Table, choose the table amazon_reviews.
  6. For Table permissions, select Select.
  7. For Data permissions, select Advanced cell-level filters.
  8. Select the filter amazon_reviews_JP.
  9. Choose Grant.

Query with data filters

With the data filters attached to the product reviews table, we’re ready to run some queries and see how permissions are enforced by Lake Formation. Because row-level security is in preview as of this writing, we need to create a special Athena workgroup named AmazonAthenaLakeFormationPreview, and switch to using it. For more information, see Managing Workgroups.
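You can also submit these queries programmatically, as long as they go through that preview workgroup. The following boto3 sketch is a minimal example; the query results bucket is a hypothetical placeholder, and the lakeformation. catalog prefix it uses is explained after the console walkthrough below.

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit the query through the preview workgroup required for row-level security.
resp = athena.start_query_execution(
    QueryString=(
        "SELECT marketplace, count(*) AS total_count "
        "FROM lakeformation.lakeformation_tutorial_row_security.amazon_reviews "
        "GROUP BY marketplace"
    ),
    WorkGroup="AmazonAthenaLakeFormationPreview",
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/"},  # hypothetical bucket
)
print(resp["QueryExecutionId"])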

Sign in to the Athena console using the DataAnalystUS user and switch to the AmazonAthenaLakeFormationPreview workgroup. Run the following query to retrieve a few records, which are filtered based on the row-level permissions we defined:

SELECT * 
FROM lakeformation.lakeformation_tutorial_row_security.amazon_reviews
LIMIT 10

Note the prefix of lakeformation. before the database name; this is required for the preview only.

The following screenshot shows the query result.

Similarly, run a query to count the total number of records per marketplace:

SELECT marketplace, count(*) as total_count
FROM lakeformation.lakeformation_tutorial_row_security.amazon_reviews
GROUP BY marketplace 

The following screenshot shows the query result. Only the marketplace US shows in the results. This is because our user is only allowed to see rows where the marketplace column value is equal to US.

Switch to the DataAnalystJP user and run the same query:

SELECT * 
FROM lakeformation.lakeformation_tutorial_row_security.amazon_reviews
LIMIT 10

The following screenshot shows the query result. All of the records belong to the JP marketplace.

Run the query to count the total number of records per marketplace:

SELECT marketplace, count(*) as total_count
FROM lakeformation.lakeformation_tutorial_row_security.amazon_reviews
GROUP BY marketplace

The following screenshot shows the query result. Again, only the row belonging to the JP marketplace is returned.

Clean up

Now to the final step, cleaning up the resources.

  1. Delete the CloudFormation stack.
  2. Delete the Athena workgroup AmazonAthenaLakeFormationPreview.

Conclusion

In this post, we covered how row-level security in Lake Formation enables you to control data access without needing to duplicate it or manage complicated alternatives such as views. We demonstrated how Lake Formation data filters can make creating, managing, and enforcing row-level permissions simple and easy.

When you want to grant permissions on specific cells, you can include or exclude columns in the data filters in addition to the row filter expression. You can learn more about cell filters in Part 4: Implementing cell-level and row-level security.

You can get started with Lake Formation today by visiting the AWS Lake Formation product page. If you want to try out row-level security, as well as the other exciting new features like ACID transactions and acceleration currently available for preview in the US East (N. Virginia) and the US West (Oregon) Regions, sign up for the preview.


About the Authors

Noritaka Sekiyama is a Senior Big Data Architect on the AWS Glue and AWS Lake Formation team. He has 11 years of experience working in the software industry. Based in Tokyo, Japan, he is responsible for implementing software artifacts, building libraries, troubleshooting complex issues and helping guide customer architectures.


Sanjay Srivastava is a Principal Product Manager for AWS Lake Formation. He is passionate about building products, in particular products that help customers get more out of their data. During his spare time, he loves to spend time with his family and engage in outdoor activities including hiking, running, and gardening.

Expiring Amazon S3 Objects Based on Last Accessed Date to Decrease Costs

Post Syndicated from Hareesh Singireddy original https://aws.amazon.com/blogs/architecture/expiring-amazon-s3-objects-based-on-last-accessed-date-to-decrease-costs/

Organizations are using Amazon Simple Storage Service (S3) for building their data lakes, websites, mobile applications, and enterprise applications. As the number of objects within your S3 bucket increases, you may want to move older objects into lower-cost tiers of Amazon S3. In some cases you may want to delete the objects altogether to further reduce S3 storage costs. A common practice is to use S3 Lifecycle rules to achieve this. These rules can be applied to objects based on their creation date. In certain situations, you may want to keep objects available that are still being accessed, but transition or delete objects that are no longer in use.

In this post, we will demonstrate how you can create custom object expiry rules for Amazon S3 based on the last accessed date of the object. We will first walk through the various features used within the workflow, followed by an architecture diagram outlining the process flow.

Amazon S3 server access logging

S3 Server access logging provides detailed records of the requests that are made to objects in Amazon S3 buckets. Amazon S3 periodically collects access log records, consolidates the records in log files, and then uploads log files to your target bucket as log objects. Each log record consists of information such as bucket name, the operation in the request, and the time at which the request was received. S3 Server Access Log Format provides more details about the format of the log file.

Amazon S3 inventory

Amazon S3 inventory provides a list of your objects and the corresponding metadata on a daily or weekly basis, for an S3 bucket or a shared prefix. The inventory lists are stored as a comma-separated value (CSV) file compressed with GZIP, as an Apache optimized row columnar (ORC) file compressed with ZLIB, or as an Apache Parquet file compressed with Snappy.

Amazon S3 Lifecycle

Amazon S3 Lifecycle policies help you manage your objects through two types of actions: Transition and Expiration. In the architecture shown in Figure 1, we create an S3 Lifecycle configuration rule that expires objects after ‘x’ days. It has a filter for an object tag of “delete=True”. You can configure the value of ‘x’ based on your requirements.
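As a rough sketch, such a rule can be applied with boto3 as follows; the bucket name is a hypothetical placeholder, and ‘x’ is set to 30 days here purely as an example.

import boto3

s3 = boto3.client("s3")

# Expire objects 'x' days after creation, but only those tagged delete=True.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-source-bucket",   # hypothetical source bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-tagged-objects",
                "Status": "Enabled",
                "Filter": {"Tag": {"Key": "delete", "Value": "True"}},
                "Expiration": {"Days": 30},   # 'x' = 30 here; tune to your requirements
            }
        ]
    },
)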

If you are using an S3 bucket to store short lived objects with unknown access patterns, you might want to keep the objects that are still being accessed, but delete the rest. This will let you retain objects in your S3 bucket even after their expiry date as per the S3 lifecycle rules, while saving you costs by deleting objects that are not needed anymore. The following diagram shows an architecture that considers the last accessed date of the object before deleting S3 objects.

Figure 1. Object expiry architecture flow


This architecture uses native S3 features mentioned earlier in combination with other AWS services to achieve the desired outcome.

Here is the architecture flow:

  1. The S3 server access logs capture S3 object requests. These are generated and stored in the target S3 bucket.
  2. An S3 inventory report is generated for the source bucket daily. It is written to the S3 inventory target bucket.
  3. An Amazon EventBridge rule is configured that will initiate an AWS Lambda function once a day, or as desired.
  4. The Lambda function initiates an S3 Batch Operations job to tag objects in the source bucket that should be expired, using the following logic (a sketch of the tagging call follows this list):
    • Capture the number of days (x) configured in the S3 Lifecycle configuration.
    • Run an Amazon Athena query that gets the list of objects from the S3 inventory report and server access logs, and create a delta list of objects that were created more than ‘x’ days ago but not accessed during that time.
    • Write a manifest file with the list of these objects to an S3 bucket.
    • Create an S3 Batch Operations job that tags all objects in the manifest file with a tag of “delete=True”.
  5. The Lifecycle rule on the source S3 bucket expires all objects that were created more than ‘x’ days ago and that carry the “delete=True” tag applied by the S3 Batch Operations job.
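A hedged boto3 sketch of the tagging job from step 4 is shown below. The account ID, IAM role, manifest location, ETag, and report bucket are hypothetical placeholders you would substitute with values from your own environment.

import boto3

s3control = boto3.client("s3control", region_name="us-east-1")

# Tag every object listed in the manifest with delete=True so the Lifecycle rule can expire it.
s3control.create_job(
    AccountId="111122223333",                                    # hypothetical account ID
    ConfirmationRequired=False,
    Priority=10,
    RoleArn="arn:aws:iam::111122223333:role/s3-batch-tagging",   # hypothetical Batch Operations role
    Operation={
        "S3PutObjectTagging": {"TagSet": [{"Key": "delete", "Value": "True"}]}
    },
    Manifest={
        "Spec": {
            "Format": "S3BatchOperations_CSV_20180820",
            "Fields": ["Bucket", "Key"],
        },
        "Location": {
            "ObjectArn": "arn:aws:s3:::my-manifest-bucket/expiry-manifest.csv",  # hypothetical manifest
            "ETag": "replace-with-the-manifest-object-etag",
        },
    },
    Report={
        "Bucket": "arn:aws:s3:::my-report-bucket",               # hypothetical report bucket
        "Format": "Report_CSV_20180820",
        "Enabled": True,
        "Prefix": "batch-tagging-reports",
        "ReportScope": "FailedTasksOnly",
    },
)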

The preceding architecture is built for fault tolerance. If a particular run fails, all the objects that must be expired will be picked up during the next run. You can configure error handling and automatic retries in your Lambda function. An Amazon Simple Notification Service (SNS) topic will send out a notification in the event of a failure.

Cost considerations

S3 server access logs, S3 inventory lists, and manifest files can accumulate many objects over time. We recommend you configure an S3 Lifecycle policy on the target bucket to periodically delete older objects. Although following the guidelines in this post can decrease some of your costs, S3 requests, S3 inventory, S3 Object Tagging, and Lifecycle transitions also have costs associated with them. Additional details can be found on the S3 pricing page.

Amazon Athena charges you based on the amount of data scanned by each query. However, Amazon S3 inventory can also output files in Apache ORC or Apache Parquet format, which can reduce the amount of data scanned by Athena. The Athena pricing page is worth reviewing.

AWS Lambda has a free usage tier of 1M free requests per month and 400,000 GB-seconds of compute time per month. However, you are charged based on the number of requests, the amount of memory allocated, and the runtime duration of the function. See more at the Lambda pricing page.

Conclusion

In this blog post, we showed how you can create a custom process to delete objects from your S3 bucket based on the last time the object was accessed. You can use this architecture to customize your object transitions, clean up your S3 buckets for any unnecessary objects, and keep your S3 buckets cost-effective. This architecture can also be used on versioned S3 buckets with some minor modifications.

We hope you found this blog post useful and welcome your feedback!


Backblaze Drive Stats for Q2 2021

Post Syndicated from original https://www.backblaze.com/blog/backblaze-drive-stats-for-q2-2021/

As of June 30, 2021, Backblaze had 181,464 drives spread across four data centers on two continents. Of that number, there were 3,298 boot drives and 178,166 data drives. The boot drives consisted of 1,607 hard drives and 1,691 SSDs. This report will review the quarterly and lifetime failure rates for our data drives, and we’ll compare the failure rates of our HDD and SSD boot drives. Along the way, we’ll share our observations of and insights into the data presented and, as always, we look forward to your comments below.

Q2 2021 Hard Drive Failure Rates

At the end of June 2021, Backblaze was monitoring 178,166 hard drives used to store data. For our evaluation, we removed from consideration 231 drives which were used for either testing purposes or as drive models for which we did not have at least 60 drives. This leaves us with 177,935 hard drives for the Q2 2021 quarterly report, as shown below.

Notes and Observations on the Q2 2021 Stats

The data for all of the drives in our data centers, including the 231 drives not included in the list above, is available for download on the Hard Drive Test Data webpage.

Zero Failures

Three drive models recorded zero failures during Q2; let’s take a look at each.

  • 6TB Seagate (ST6000DX000): The average age of these drives is over six years (74 months) and with one failure over the last year, this drive is aging quite well. The low number of drives (886) and drive days (80,626) means there is some variability in the failure rate, but the lifetime failure rate of 0.92% is solid.
  • 12TB HGST (HUH721212ALE600): These drives reside in our Dell storage servers in our Amsterdam data center. After recording a quarterly high of five failures last quarter, they are back on track with zero failures this quarter and a lifetime failure rate of 0.41%.
  • 16TB Western Digital (WUH721816ALE6L0): These drives have only been installed for three months, but no failures in 624 drives is a great start.

Honorable Mention

Three drive models recorded one drive failure during the quarter. They vary widely in age.

  • On the young side, with an average age of five months, the 16TB Toshiba (MG08ACA16TEY) had its first drive failure out of 1,430 drives installed.
  • At the other end of the age spectrum, one of our 4TB Toshiba (MD04ABA400V) drives finally failed, the first failure since Q4 of 2018.
  • In the middle of the age spectrum with an average of 40.7 months, the 8TB HGST drives (HUH728080ALE600) also had just one failure this past quarter.

Outliers

Two drive models had an annualized failure rate (AFR) above 4%; let’s take a closer look.

  • The 4TB Toshiba (MD04ABA400V) had an AFR of 4.07% for Q2 2021, but as noted above, that was with one drive failure. Drive models with low drive days in a given period are subject to wide swings in the AFR. In this case, one less failure during the quarter would result in an AFR of 0% and one more failure would result in an AFR of over 8.1% (see the short calculation after this list).
  • The 14TB Seagate (ST14000NM0138) drives have an AFR of 5.55% for Q2 2021. These Seagate drives along with 14TB Toshiba drives (MG07ACA14TEY) were installed in Dell storage servers deployed in our U.S. West region about six months ago. We are actively working with Dell to determine the root cause of this elevated failure rate and expect to follow up on this topic in the next quarterly drive stats report.
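To make that swing concrete, here is a small Python sketch of the AFR arithmetic. The drive-day count is inferred from the published 4.07% figure rather than taken from the report, so treat it as an approximation.

# Annualized failure rate (AFR) = failures / drive days * 365, expressed as a percentage.
def afr(failures: int, drive_days: int) -> float:
    return failures / drive_days * 365 * 100

# Roughly 9,000 drive days reproduces the published 4.07% figure for one failure.
drive_days = 8_968
for failures in (0, 1, 2):
    print(f"{failures} failure(s): AFR = {afr(failures, drive_days):.2f}%")
# 0 failure(s): AFR = 0.00%
# 1 failure(s): AFR = 4.07%
# 2 failure(s): AFR = 8.14%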

Overall AFR

The quarterly AFR for all the drives jumped up to 1.01% from 0.85% in Q1 2021 and 0.81% one year ago in Q2 2020. This jump ended a downward trend over the past year. The increase is within our confidence interval, but bears watching going forward.

HDDs vs. SSDs, a Follow-up

In our Q1 2021 report, we took an initial look at comparing our HDD and SSD boot drives, both for Q1 and lifetime timeframes. As we stated at the time, a numbers-to-numbers comparison was suspect as each type of drive was at a different point in its life cycle. The average age of the HDD drives was 49.63 months, while the SSDs’ average age was 12.66 months. As a reminder, the HDD and SSD boot drives perform the same functions, which include booting the storage servers and performing reads, writes, and deletes of daily log files and other temporary files.

To create a more accurate comparison, we took the HDD boot drives that were in use at the end of Q4 2020 and went back in time to see where their average age and cumulative drive days would be similar to those same attributes for the SSDs at the end of Q4 2020. We found that at the end of Q4 2015 the attributes were the closest.

Let’s start with the HDD boot drives that were active at the end of Q4 2020.

Next, we’ll look at the SSD boot drives that were active at the end of Q4 2020.

Finally, let’s look at the lifetime attributes of the HDD drives active in Q4 2020 as they were back in Q4 2015.

To summarize, when we control using the same drive models, the same average drive age, and a similar number of drive days, the HDD and SSD drive failure rates compare as follows:

While the failure rate for our HDD boot drives is nearly two times higher than the SSD boot drives, it is not the nearly 10 times failure rate we saw in the Q1 2021 report when we compared the two types of drives at different points in their lifecycle.

Predicting the Future?

What happened to the HDD boot drives from 2016 to 2020 as their lifetime AFR rose from 1.54% in Q4 2015 to 6.26% in Q4 2020? The chart below shows the lifetime AFR for the HDD boot drives from 2014 through 2020.

As the graph shows, beginning in 2018 the HDD boot drive failures accelerated. This continued in 2019 and 2020 even as the number of HDD boot drives started to decrease when failed HDD boot drives were replaced with SSD boot drives. As the average age of the HDD boot drive fleet increased, so did the failure rate. This makes sense and is borne out by the data. This raises a couple of questions:

  • Will the SSD drives begin failing at higher rates as they get older?
  • How will the SSD failure rates going forward compare to what we have observed with the HDD boot drives?

We’ll continue to track and report on SSDs versus HDDs based on our data.

Lifetime Hard Drive Stats

The chart below shows the lifetime AFR of all the hard drive models in production as of June 30, 2021.

Notes and Observations on the Lifetime Stats

The lifetime AFR for all of the drives in our farm continues to decrease. The 1.45% AFR is the lowest recorded value since we started back in 2013. The drive population spans drive models from 4TB to 16TB and varies in average age from three months (WDC 16TB) to over six years (Seagate 6TB).

Our best performing drive models in our environment by drive size are listed in the table below.

Notes:

  1. The WDC 16TB drive, model: WUH721816ALE6L0, does not appear to be available in the U.S. through retail channels at this time.
  2. Status is based on what is stated on the website. Further investigation may be required to ensure you are purchasing a new drive versus a refurbished drive marked as new.
  3. The source and price were as of 7/30/2021.
  4. In searching for the Toshiba 16TB drive, model: MG08ACA16TEY, you may find model: MG08ACA16TE for much less ($399.00 or less). These are not the same drive and we have no information on the latter model. The MG08ACA16TEY includes the Sanitize Instant Erase feature.

The Drive Stats Data

The complete data set used to create the information used in this review is available on our Hard Drive Test Data page. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone; it is free.

If you just want the summarized data used to create the tables and charts in this blog post, you can download the ZIP file containing the CSV files for each chart.

Good luck and let us know if you find anything interesting.

The post Backblaze Drive Stats for Q2 2021 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Security updates for Tuesday

Post Syndicated from original https://lwn.net/Articles/865029/rss

Security updates have been issued by Arch Linux (chromium, nodejs, nodejs-lts-erbium, and nodejs-lts-fermium), Debian (pyxdg, shiro, and vlc), openSUSE (qemu), Oracle (lasso), Red Hat (glibc, lasso, rh-php73-php, rh-varnish6-varnish, and varnish:6), Scientific Linux (lasso), SUSE (dbus-1, lasso, python-Pillow, and qemu), and Ubuntu (exiv2, gnutls28, and qpdf).

The Ransomware Task Force: A New Approach to Fighting Ransomware

Post Syndicated from Jen Ellis original https://blog.rapid7.com/2021/08/03/the-ransomware-task-force-a-new-approach-to-fighting-ransomware/

The Ransomware Task Force: A New Approach to Fighting Ransomware

In the past few months, we’ve seen ransomware attacks shut down healthcare across Ireland, fuel delivery across parts of the US, and meat processing across Australia, Canada, and the US. We’ve seen demands of payments in the tens of millions of dollars. We’re also seeing trends around ransomware-as-a-service and double or triple extortion continue to rise. It’s clear that ransomware attacks are increasing in frequency, breadth, sophistication, scale, and impact.

Recognizing this, the Institute for Security and Technology put together a comprehensive Ransomware Task Force (RTF) to identify new approaches to shift the dynamics of ransomware and reduce opportunities for attackers. The Ransomware Task Force involved more than 60 participants representing a wide range of expertise and experience, including from multiple governments, law enforcement, civil society and public policy nonprofits, and security advancement groups. From the private sector, organizations of all sizes participated, including many that have experienced ransomware attacks firsthand or that are involved in dealing with the fallout, such as cybersecurity companies, law firms, and cyber insurers. Rapid7 was among those that participated — I was one of the co-chairs, and my amazing colleagues, Bob Rudis, Tod Beardsley, and Scott King participated as well.

From the outset, the intent of the Task Force was to look at the issue holistically and come up with a comprehensive set of recommendations to deter and disrupt ransomware attackers, thereby helping organizations prepare for and respond to attacks at scale. Recognizing the scale and severity of the issue — and the need for systemic and societal responses — our target audience was policymakers and government leaders.

The Task Force recognized that ransomware is not a new topic, and we had no desire to rehash previous efforts. Instead, we sought to learn from them and, where appropriate, amplify and extend them, supporting the next period of growth on this thorny issue. Ransomware’s reach and impact are increasing, which has a serious impact on society. The effects are only likely to worsen without significant action from governments and other leaders.

Key recommendations

The final report issued by the Task Force makes 48 recommendations, broken into actions to deter, disrupt, prepare for, and respond to ransomware attacks. The recommendations are designed to work in concert with each other, though we recognize there are a large number of them, and many will take time to implement. In reality, though, there truly is no silver bullet for addressing ransomware, no one thing that will magically solve this problem. If we want to shift the dynamics in a meaningful way that makes it harder for attackers to succeed, we need to make adjustments in a range of areas. It’s also worth noting that the Task Force’s goal was to provide recommendations to government and other leaders, not to provide tactical, technical guidance.

Given there are 48 recommendations, and they are well set out in the report, I won’t go over them now. I’ll just highlight a few of the big themes and, where relevant, what’s happened since the launch of the report.

Make it a top priority

One of the biggest challenges we face with any discussion around cybercrime is that it’s often viewed as a niche technical problem, not as a broad societal issue. This has made it harder to get the required attention and investment in solutions. The Task Force called for senior political leaders to recognize ransomware for what it is: a national security issue and a major threat to our ways of life (Action 1.2.5, page 26). We also called for a whole-of-government approach whereby leaders would engage various stakeholders across the government to help ensure necessary action is taking place collaboratively across the board (Actions 1.2.1 and 1.2.2, page 23).

One possible silver lining of the recent attacks against critical infrastructure is that they’ve helped establish this level of priority. In the US, we’ve seen various parts of the government start to take action: Congress has held hearings and proposed legislation; the Department of Justice has given ransomware investigations similar status to those for terrorism; the Department of Homeland Security has issued new cybersecurity guidelines for pipelines; the White House issued a memo to urge the private sector to take steps to protect against ransomware; and even President Biden has talked about ransomware in press conferences and with other world leaders.

Global action for a global problem

To take meaningful action to reduce ransomware attacks, we must acknowledge the geopolitical aspects. Firstly, the issue affects countries all around the world. Governments taking action should do so in coordination and cooperation in order to amplify the impact and hit attackers on multiple fronts at once (Actions 1.1.1 – 1.1.4, 1.2.6, pages 21-22, 26).

Secondly, and perhaps more crucially, one of the main advantages for attackers is the existence of nations that provide safe havens, because they’re either unwilling or unable to prosecute cybercriminals. This also makes it much harder for other countries to prosecute these criminals, and as such, ransomware attackers rarely seem to fear consequences for their actions.

The Task Force recommended that governments work together to tackle the issue of safe havens and adopt key practices to protect their citizens — or help them better protect themselves (Actions 1.3.1 and 1.3.2, page 27).

We’ve already seen some progress in this regard, as ransomware was raised at the recent G7 Summit, and the resulting communique included the following commitment from members:

“We also commit to work together to urgently address the escalating shared threat from criminal ransomware networks. We call on all states to urgently identify and disrupt ransomware criminal networks operating from within their borders, and hold those networks accountable for their actions.”

It will be interesting to see whether and how the G7 members will follow through on this commitment. I hope they’ll take action, build momentum, and recruit participation from other nations.

Reducing paths to revenue

As mentioned above, we’re seeing attackers demand higher and higher ransoms, which likely attracts other criminals to enter the market. Hopefully, the opposite is also true; if we reduce the opportunity to make money from ransomware, the number of attacks will decrease.

This rationale, coupled with discomfort over the idea of ransom payments being used to fund other types of organized crime — including human trafficking, child exploitation, and weapons trafficking — resulted in a great deal of discussion around the notion of banning ransom payments.

While the Task Force agreed that payments should be discouraged, the idea of a legal prohibition was challenging. Given the lack of real risk or friction for attackers, it’s likely that if payments were outlawed, attackers wouldn’t simply give up. Rather, they’d first play a game of chicken against victims, focusing on the organizations least likely to resist paying — namely providers of critical functions that can’t be disrupted without profound impact on society, or small-to-medium businesses that aren’t financially able to prepare for and weather an attack.

Given the concerns over these practicalities, the Task Force did not recommend banning payments. Rather, we looked at alternative ways of reducing the ease with which attackers realize a profit. There are two main paths to this: reducing the likelihood of victims making a payment, and making it technically harder for attackers to get their payment.

In terms of making victims think twice before making a payment, the RTF recommended a few measures:

  • Requiring the disclosure of payments (Action 4.2.4, page 46): This will help to build greater understanding of what is happening in the attack landscape and may enable law enforcement to build more information on attackers, or even recapture payments.
  • Requiring organizations to conduct cost-benefit analysis prior to making payments (Action 4.3.1 and 4.3.2, pages 47 and 48): This will encourage organizations to look into alternative options for resolution — for example, turning to the No More Ransom Project to seek decryption keys.
  • Creating a fund to assist certain organizations in recovery (Action 4.1.2, page 43): Often, organizations say the cost of recovery significantly outsizes that of the ransom, leaving them no choice but to give into their attacker’s demands. For qualifying organizations, this fund would rebalance the scales and give them a pragmatic alternative to paying the ransom.

On the other track — disrupting the system that facilitates the payment of ransoms — the RTF recommended that cryptocurrency exchanges, kiosks, and over-the-counter trading desks be required to comply with existing laws, such as Know Your Customer (KYC), Anti-Money Laundering (AML), and Combatting Financing of Terrorism (CFT) (Action 2.1.2, pages 29 and 30).

Better preparation, better response

During the explorations of the Task Force, it became apparent that part of the reason ransomware attacks are so successful is that many organizations don’t truly understand the threat, believe it’s relevant to them, or understand how to protect themselves. We repeatedly heard that, while there is a lot of information on ransomware, it’s overwhelming and often unhelpful. Many organizations don’t know what to focus on, and guidance may be oversimplified, overcomplicated, or insufficient.

With this in mind, one of our top recommendations was for the development of a ransomware framework that would cover measures for both preparing for and responding to attacks (Action 3.1.1, pages 35 and 36). The framework would need to be pragmatic, actionable, and address varying levels of sophistication and capability (Action 3.1.2, page 36). And because one of our main themes was around international cooperation, we also recommended there be a single source of truth adopted and promoted by multiple governments around the world. In fact, we recommended the framework be developed through both international and public-private collaboration. It should also be kept up to date to react to evolving ransomware attack trends.

Creating the framework is a lift, but it’s only part of the battle — you can’t drive adoption if you don’t also tackle the lack of awareness and understanding. As such, we also recommend that governments run high-profile awareness campaigns, partnering with organizations with reach into audiences that aren’t being well addressed today (Actions 3.2.1 and 3.2.2, pages 37 and 38). For example, many governments have toolkits or content aimed at small-to-medium businesses, but most leaders of these organizations seem largely unaware of the risk — until someone they know personally is hit by an attack.

The path forward

Unfortunately, ransomware continues to dominate headlines and harm organizations around the world. As a result, many governments are paying a great deal of attention to this issue and looking for solutions. I’m relieved to say the Ransomware Task Force’s report and recommendations have seen a fair bit of interest and support. For us, the next challenge is to keep the momentum going and help governments translate interest into action.

In the meantime, my colleagues at Rapid7 and I will continue to try to help our customers and community prepare for and respond to attacks. We’re working on some other content to help people better understand the dynamics of the issue, as well as the steps they can take to protect themselves or get involved in broader response efforts.

Look out for our series of blogs on different aspects of ransomware, and in the meantime, check out our interviews with ransomware experts on our Security Nation podcast. You can also check out my talk and Q&A on the Ransomware Task Force at Black Hat, or as part of Rapid7’s Virtual Vegas, which includes a Ransomware (un)Happy Hour — bring your ransomware war stories, lessons learned, or questions.

Durable Objects: Easy, Fast, Correct — Choose three.

Post Syndicated from Kenton Varda original https://blog.cloudflare.com/durable-objects-easy-fast-correct-choose-three/

Storage in distributed systems is surprisingly hard to get right. Distributed databases and consensus are well-known to be extremely hard to build. But, application code isn’t necessarily easy either. There are many ways in which apps that use databases can have subtle timing bugs that could result in inconsistent results, or even data loss. Worse, these problems can be very hard to test for, as they’ll often manifest only under heavy load, or only after a sudden machine failure.

Up until recently, Durable Objects were no exception. A Durable Object is a special kind of Cloudflare Worker that has access to persistent storage and processes requests in one of Cloudflare’s points of presence. Each Object has its own private storage, accessible through a classical key/value storage API. Like any classical database API, this storage API had to be used carefully to avoid possible race conditions and data loss, especially when performance mattered. And like any classical database API, many apps got it wrong.

However, rather than fix the apps, we decided to fix the model. Last month, we rolled out deep changes to the Durable Objects runtime such that many applications which previously contained subtle race conditions are now correct by default, and many that were previously slow are now fast. Developers can now write their code in an intuitive way, and have it work. No changes at all are needed to your code in order to take advantage of these new features.

So, let me tell you about what changed…

Background: Durable Objects are Single-Threaded

To understand what changed, it’s necessary to first understand Durable Objects. For a full introduction, see the Durable Objects announcement blog post.

The most important point is: Each Durable Object runs in exactly one location, in one single thread, at a time. Each object has its own private on-disk storage. This is a very different situation from a typical database, where many clients may be accessing the same data. In Durable Objects, any particular piece of data belongs to exactly one thread at a time.

Because a single Durable Object is single-threaded, it’s possible, and even encouraged, to keep state and perform synchronization in memory. This is, indeed, the killer feature of Durable Objects. With classical databases, in-memory state is extremely difficult to keep synchronized between all database clients. But with Durable Objects, since each piece of data belongs to a specific thread, this synchronization is easy.

However, interacting with the disk is still an I/O (input/output) operation, which means that each operation returns a Promise which you must await. As we’ll see, this re-introduces some of the synchronization difficulties that we were trying to avoid. However, it turns out, we can solve these difficulties within the system itself, without bothering application developers.

An Example

Consider this code:

// Used to be slow and racy -- but not anymore!
async function getUniqueNumber() {
  let val = await this.storage.get("counter");
  await this.storage.put("counter", val + 1);
  return val;
}

At first glance, this seems like reasonable code that returns a unique number each time it is called (incrementing each time).

Unfortunately, before now, this code had two problems:

  1. It had a subtle race condition (even though Durable Objects are single-threaded!).
  2. It was kind of slow.

The Race Condition

A race condition occurs when two operations running concurrently might interfere with each other in a way that makes them behave incorrectly. Race conditions are commonly associated with code that uses multiple threads.

JavaScript, however, famously does not use threads. Instead, it uses event-driven programming, with callbacks. It’s not possible for two pieces of JavaScript code to be running "at the same time" in the same isolate (and Durable Objects promises that no other isolate could possibly be accessing the same storage). Does that mean that race conditions aren’t a problem in JavaScript, the way they are in multi-threaded apps?

Unfortunately, it does not. The problem is, the code above is an async function, containing two await statements. Each time await is used, execution pauses, waiting for the specified Promise to complete.

In the meantime, though, other code can run! For example, the Durable Object might receive two requests at the same time. If each of them calls getUniqueNumber(), then the two calls might be interleaved. Each time one call performs an await, execution may switch to the other call. So, the two calls might end up looking like this:

Request 1 timeline                                  Request 2 timeline

async function getUniqueNumber() {
  let val = await this.storage.get("counter");
                                                    async function getUniqueNumber() {
                                                      let val = await this.storage.get("counter");
                                                      await this.storage.put("counter", val + 1);
  await this.storage.put("counter", val + 1);
  return val;
}
                                                      return val;
                                                    }

There’s a big problem here: Both of these two calls will call get("counter") before either of them calls put("counter", val + 1). That means, both of them will return the same value!

This problem is especially bad because it only happens when multiple requests are being handled at the same time — and even then, only sometimes. It is very hard to test for this kind of problem, and everything might seem just fine when the application is deployed, as long as it isn’t getting too much traffic. But one day, when a lot of visitors try to use the same object at the same time, all of a sudden getUniqueNumber() starts returning duplicates!

The Slowness

To add insult to injury, getUniqueNumber() was (until recently) pretty slow. The problem is, it has to do two round trips to storage — a get() and a put(). The get() might typically take a couple milliseconds. The put(), however, will take much longer, probably tens of milliseconds.

Why is put() so slow? Because we don’t want to lose data. The worst thing an application can do is tell the user that their action was successful when it wasn’t. If, for some reason, a write cannot be completed, then it’s imperative that the application presents an error to the user, so that the user knows that something is wrong and they’ll have to try again or look for a fix.

In order to make sure an application does not prematurely report success to the user, await put() has to make sure it doesn’t return until the data is actually safe on disk. Disks are slow, so this might take a while.

But that’s not all. Disks can fail. In order for the data to be really safe, we have to write the same data on multiple disks, in multiple machines. That means we have to wait for some network traffic.

But that’s still not all. What if a meteor were to come out of the sky and land on a Cloudflare data center, completely destroying it? Or, more likely, what if the power or network connection failed? We don’t want a user’s data to be lost in this case, or even temporarily become unavailable. Therefore, Durable Object data is replicated to multiple Cloudflare locations. This requires communicating across long distances before any write can be confirmed. There is little we can do to make this faster, the speed of light being what it is.

A call to getUniqueNumber() will therefore always take tens of milliseconds. If an application calls it multiple times, awaiting each call before beginning the next, it can easily become very slow very quickly. Or, at least, that was the case before our recent changes.

The Wrong Fixes

There are several ways that an application could fix these problems, but all of them have their own issues.

Transactions?

Many databases offer "transactions". A transaction allows an application to make sure some operation completes "atomically", with no interference from concurrent operations.

The Durable Objects storage API has always supported transactions. We could use them to fix our getUniqueNumber() implementation like so:

// No more race condition... but slow and complicated.
async function getUniqueNumber() {
  let val;
  await this.storage.transaction(async (txn) => {
    val = await txn.get("counter");
    await txn.put("counter", val + 1);
  });
  return val;
}

This fixes our race condition. Now, if getUniqueNumber() is called multiple times concurrently such that the storage operations interleave, the system will detect the problem. One of the concurrent calls will be chosen to be the "winner", and will complete normally. The other calls will be canceled and retried, so that they can see the value written by the first call.

This fixes our problems! But, at some cost:

  • getUniqueNumber() is now even slower than it was before. The difference typically won’t be huge, but setting up a transaction does require some additional coordination in the database. Of course, if the transaction needs to be retried, then it may end up being much slower. And retries will tend to happen more when load gets high… the worst possible time.
  • Speaking of retries, many developers might not realize that the transaction callback can be called multiple times. It’s difficult to test for this, since retries will only happen when concurrent operations cause conflicts. The problem is especially acute when the application is trying to synchronize not just on-disk state, but also in-memory state — if the transaction callback modifies in-memory state, it must be careful to ensure that its changes are idempotent. The need for idempotency may not be top of mind for most developers, and tests won’t catch the problem, making it very easy to end up deploying buggy code.

So we solved our problem, but we did it with a foot-gun. If we keep using the foot-gun, we’re probably going to shoot our own feet eventually.

Is there another way?

In-memory caching?

Durable Objects’ superpower is their in-memory state. Each object has only one active instance at any particular time. All requests sent to that object are handled by that same instance. That means, you can store some state in memory.

// Much faster! But (used to be) wrong.
async function getUniqueNumber() {
  if (this.val === undefined) {
    this.val = await this.storage.get("counter");
  }

  let result = this.val;
  ++this.val;
  this.storage.put("counter", this.val);
  return result;
}

This code is MUCH faster than the previous implementation, because it stores the value in memory. In fact, after the function runs once, further calls won’t wait for any I/O at all — they will return immediately. This is because by caching the value in memory, we avoid waiting for a get() (except for the first time), and we don’t wait for the put() either, trusting that it will complete asynchronously later on.

Returning immediately also means that there’s no opportunity for concurrency, so the calls that return immediately will always return unique numbers! This means that not only is this implementation faster than our original implementation, it is also more correct. This is only possible because the Durable Objects platform guarantees that there will only be one instance, and therefore only one copy of this.val.

Unfortunately, there are two problems with this code:

  • We still have a race condition on initialization. If the first two calls to getUniqueNumber() happen to occur at about the same time, then initialization will be performed multiple times. The second call will likely clobber what the first call did, and the two calls will end up returning the same number. We could solve this problem by making initialization more complicated — the first call could create an initialization promise, and other concurrent calls could wait on it, so that initialization really only happens once. But this creates even deeper complexity: What if initialization fails for some reason? The object could be placed in a permanently broken state. It’s possible to get this right, but it’s surprisingly tricky.
  • Because we don’t wait for the put() to report success, it’s possible that it could be silently lost. For example, if the machine hosting the Durable Object suffered a sudden power failure, then the Durable Object would be transferred to some other machine. When it starts up there, calls to getUniqueNumber() might return numbers that had already been returned under the old instance before it failed, because the put()s hadn’t actually completed before the failure occurred. But if we await the put(), then our function becomes slow again, and creates more opportunities for race conditions (e.g. in the calling code).

Our answer: Make it automatic

When looking at this, we had two options:

  1. Try to carefully document these problems and educate developers about them, so that they could write code that does the right thing.
  2. Change the system so that naturally-written code just does the right thing by default — and runs quickly.

We chose option 2. We accomplished this in three parts.

Part 1: Input Gates

Let’s go back to our original example. Can we make this example "just work", even in the face of concurrent requests?

// Can this "just work" please?
async function getUniqueNumber() {
  let val = await this.storage.get("counter");
  await this.storage.put("counter", val + 1);
  return val;
}

It turns out we can! We create a new rule:

Input gates: While a storage operation is executing, no events shall be delivered to the object except for storage completion events. Any other events will be deferred until such a time as the object is no longer executing JavaScript code and is no longer waiting for any storage operations. We say that these events are waiting for the "input gate" to open.

If we do this, then our storage operations above are no longer an opportunity for concurrency. Our concurrent requests now look like this:

Request 1 timeline                                  Request 2 timeline

async function getUniqueNumber() {
  let val = await this.storage.get("counter");
                                                    // Request 2 delivery is blocked because
                                                    // request 1 is waiting for storage.
  await this.storage.put("counter", val + 1);
                                                    // Request 2 delivery is blocked because
                                                    // request 1 is waiting for storage.
  return val;
}
                                                    async function getUniqueNumber() {
                                                      let val = await this.storage.get("counter");
                                                      await this.storage.put("counter", val + 1);
                                                      return val;
                                                    }

The two calls return unique numbers, as expected. Hooray! (Unfortunately, we did it by delaying the second request, creating latency and reducing throughput — but we’ll address that in part 3, below.)

Note that our rule does not preclude making multiple concurrent requests to storage at the same time. You can still say:

let promise1 = this.storage.get("foo");
let promise2 = this.storage.put("bar", 123);
await promise1;
frob();
await promise2;

Here, the get() and put() execute concurrently. Moreover, the call to frob() may execute before the put() has completed (but strictly after the get() completes, since we awaited that promise). However, no other event — such as receiving a new request — can unexpectedly happen in the meantime.

On the other hand, the rule protects you not just against concurrent incoming requests, but also concurrent responses to outgoing requests. For example, say you have:

async function task1() {
  await fetch("https://example.com/api1");
  return await this.getUniqueNumber();
}
async function task2() {
  await fetch("https://example.com/api2");
  return await this.getUniqueNumber();
}
let promise1 = task1();
let promise2 = task2();
let val1 = await promise1;
let val2 = await promise2;

This code launches two fetch() calls concurrently. After each fetch completes, getUniqueNumber() is invoked. Could the two calls interfere with each other?

No, they will not. The completion of a fetch() is itself a kind of event. Our rule states that such events cannot be delivered while storage events are in progress. When the first of the two fetches returns, the app calls getUniqueNumber(), which starts performing some storage operations. If the second fetch() also returns while these storage operations are still outstanding, that return will be deferred until after the storage operations are done. Once again, our code ends up correct!

At this point, the async programming experts in the audience are probably starting to feel like something is fishy here. Indeed, there is a catch. What if we do:

// Still a problem even with input gates.
let promise1 = getUniqueNumber();
let promise2 = getUniqueNumber();
let val1 = await promise1;
let val2 = await promise2;

In this case, there is, in fact, a problem. Two calls to getUniqueNumber() are initiated by the same event. The application does not await the first call before starting the second, so the two calls end up running concurrently. Our special rule doesn’t protect us here, because there is no incoming event that can be deferred between when the two calls are made. From the system’s point of view, there’s no way to distinguish this code from code which legitimately decided to perform two storage operations in parallel.

As such, in this case, the two calls to getUniqueNumber() will interfere with each other. However, this problem is far less likely to come about by accident, and is far easier to catch in testing. This bug is deterministic, not caused by the unpredictable timing of network events. We consider this an acceptable caveat in order to solve the larger problem posed by concurrent requests.

Part 2: Output Gates

Let’s go back to our in-memory caching example. Can we make it work?

// Can we make this "just work"?
async function getUniqueNumber() {
  if (this.val === undefined) {
    this.val = await this.storage.get("counter");
  }

  let result = this.val;
  ++this.val;
  this.storage.put("counter", this.val);
  return result;
}

With input gates (part 1), we’ve solved one of the two problems this code had: the race condition of initialization. We no longer need to worry that two requests will call this at the same time, leading this.val to be initialized twice.

However, the problem with not awaiting the put() is still there. If we don’t await it, then we could lose data. If we do await it, then the call is slow.

We make another new rule:

Output gates: When a storage write operation is in progress, any new outgoing network messages will be held back until the write has completed. We say that these messages are waiting for the "output gate" to open. If the write ultimately fails, the outgoing network messages will be discarded and replaced with errors, while the Durable Object will be shut down and restarted from scratch.

With this rule, we no longer have to await the result of put(). Our code can happily continue executing and just assume the put() will succeed. If the put() doesn’t succeed, then anything the application does here will never be observable to the rest of the world anyway. For example, if the app prematurely sends a response to the user saying that the operation succeeded, this response will not actually be delivered until after the put() completes successfully. So, by the time the user receives the message, it is no longer "premature"! In the very rare event that the write operation fails, the user will not receive the premature confirmation at all.

Note that output gates apply not only to responses sent back to a client, but also to new outgoing requests made with fetch() — those requests will be delayed from being sent until all prior writes are confirmed. So, once again, it is impossible for anything else in the world to observe a premature confirmation.

With this change, our getUniqueNumber() implementation with in-memory caching is now fully correct, while retaining most of its speed advantage over the non-caching implementation. Except for the very first call, the application will never be blocked waiting for getUniqueNumber() to finish. The final response from the app to the client will be delayed pending write confirmation, but that write can be performed in parallel with any writes the application performs after getUniqueNumber() completes.

Part 3: Automatic in-memory caching

Our in-memory caching example now works great. But, it’s still a little bit complicated and unnatural to write. Let’s go back to our original, simple code one more time… can we make it fast by default?

// Can we make this not just work, but just work FAST?
async function getUniqueNumber() {
  let val = await this.storage.get("counter");
  await this.storage.put("counter", val + 1);
  return val;
}

The answer to this part is a classic one: we can add automatic caching to the storage layer, just like most operating systems do for disk storage.

We have rolled out an in-memory caching layer for Durable Objects. This layer keeps up to several megabytes worth of data directly in memory in the process where the object runs.

When a get() requests a key that is in cache, the operation returns immediately, without even context-switching out of the thread and isolate where the object is hosted. If the key is not in cache, then a storage request will still be needed, but reads complete relatively quickly.

Better yet, put() requests now always complete "instantaneously". A put() simply writes to cache. We rely on output gates ("part 2", above) to prevent the premature confirmation of writes to any external party. Writes will be coalesced (even if you await them), so that the output gate waits only for O(1) network round trips of latency, not O(n).

Moreover, because get() and put() now complete instantly in most or all cases, the negative impact of input gates on throughput is largely mitigated, because the gate now spends relatively little time blocked.

With Durable Objects’ built-in caching, our simple code is now just as fast as our code that manually implemented in-memory caching. Combined with input and output gates, our code is now simple, fast, and correct, all at the same time.

Bonus Correctness

Our caching layer provides some bonus consistency guarantees, in addition to performance.

First, writes are automatically coalesced. That is, if you perform multiple put() or delete() operations without awaiting them or anything else in between, then the operations are automatically grouped together and stored atomically. In the case of a sudden power failure, after coming back up, either all of the writes will have been stored, or none of them will. For example:

// Move a value from "foo" to "bar".
let val = await this.storage.get("foo");

this.storage.delete("foo");
this.storage.put("bar", val);
// There's no possibility of data loss, because the delete() and the
// following put() are automatically coalesced into one atomic
// operation. This is true as long as you do not `await` anything
// in between.

Second, the API is also able to provide stronger ordering guarantees for reads. Previously, overlapping storage operations did not have guaranteed ordering. For example, if you issued a get() and a put() on the same key at the same time (without awaiting one before starting the other), then it was not deterministic whether the get() might return the value written by the put() — regardless of the ordering of the statements in your code. The caching layer fixes this. Now, operations are performed in exactly the order in which they were initiated, regardless of when they complete.

These two features eliminate more subtle bugs that might otherwise be hard to catch in testing, so that you don’t have to be a database expert to write code that works.

Optional Bypass

We expect gates and caching will be a win in the vast majority of use cases, but not always. In some use cases, concurrency won’t lead to any problems, and so blocking it may be a loss. Sometimes, the application is OK with prematurely confirming writes in order to minimize latency. And sometimes, caching may just waste memory because the same keys are not frequently accessed.

For those cases, we offer explicit bypasses:

this.storage.get("foo", {allowConcurrency: true, noCache: true});
this.storage.put("foo", "bar", {allowUnconfirmed: true, noCache: true});

Developers who have taken the time to think carefully about these issues can use these flags to tune performance to their specific needs. For those who don’t want to think about it, the defaults should work well.

Conclusion

Concurrency is hard. It doesn’t matter if you’re a novice or an expert: even experts regularly get it wrong. It’s difficult to think about all the ways that concurrent operations might overlap to corrupt your application state.

The traditional answer has been to make applications stateless, and defer all concurrency control to the database layer using transactions. However, transactions are slow, which is a big reason why so many web applications today take hundreds of milliseconds or more to respond to basic actions.

Durable Objects are all about state. By keeping state in memory in addition to on disk, and directing requests for the same data to be coordinated through the same instance, we can make applications much faster. But until recently, this was extremely tricky to get right.

With input gates, output gates, and caching, code written in the most intuitive way now "just works", and runs fast. This means you can focus on building your application, without wasting time optimizing I/O performance and debugging obscure race conditions.

Paragon: Yet Another Cyberweapons Arms Manufacturer

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2021/08/paragon-yet-another-cyberweapons-arms-manufacturer.html

Forbes has the story:

Paragon’s product will also likely get spyware critics and surveillance experts alike rubbernecking: It claims to give police the power to remotely break into encrypted instant messaging communications, whether that’s WhatsApp, Signal, Facebook Messenger or Gmail, the industry sources said. One other spyware industry executive said it also promises to get longer-lasting access to a device, even when it’s rebooted.

[…]

Two industry sources said they believed Paragon was trying to set itself apart further by promising to get access to the instant messaging applications on a device, rather than taking complete control of everything on a phone. One of the sources said they understood that Paragon’s spyware exploits the protocols of end-to-end encrypted apps, meaning it would hack into messages via vulnerabilities in the core ways in which the software operates.

Read that last sentence again: Paragon uses unpatched zero-day exploits in the software to hack messaging apps.

Exploring how culture and computing intersect

Post Syndicated from Oliver Quinlan original https://www.raspberrypi.org/blog/culture-computing-stem-education-diversity-research-seminar/

It can be easy to think of science, technology, engineering, and maths (STEM) as fields that develop in a linear way, always progressing towards ever better solutions and approaches. Of course, alternative solutions are posed to all sorts of problems, but in western culture, those solutions that did not take hold are sometimes seen as the approaches that were ‘wrong’ or mistaken, and that eventually gave way to the ‘right’ approaches. A culture that includes the belief that there is only one ‘right’ way can be alienating to anyone who sees the world in a different way.

Dr Ron Eglash, University of Michigan

Dr Ron Eglash from the University of Michigan explored the intersections of diverse cultural ideas and computing in his talk at the final research seminar in our series about diversity and inclusion (see below for the recorded video). His work and insights show us how we might think about diversity in computing as being dependent on the diversity of cultural concepts and beliefs that can underpin the subject. Ron also shared free resources for educators who want to help their students learn about STEM while exploring cultural ideas.

Where do our ideas about computing and STEM come from?

Ron’s talk explored the overlaps of technology, culture, and society. In his research work, Ron has facilitated collaborations across the world between STEM students and people from indigenous cultures, opening up computing to people who have different backgrounds and different ways of seeing the world and, in the process, revealing many complex assumptions that different cultures have about computing and technology.

Ron’s work challenges some of the assumptions in western culture about technological knowledge. He started his talk by showing the evolution of knowledge as a branching set of possibilities and ideas that societies choose to move forward with or leave behind. To illustrate, he gave examples of different concepts of mathematics that western society has taken on board, refined, or discarded throughout its history, demonstrating that there are different versions of mathematics we could have had but chose not to.

A branching diagram showing a very simplified historical relationship of the knowledge systems of Native American, Asian, African, and European people. Created by Ron Eglash.
A simplified view of the relationships of knowledge systems across the world, as shown by Ron in his talk.

These different choices in adoption and exploration of ideas, Ron continued, are more evident when one looks at the knowledge systems of different cultures side by side: different knowledge systems represent different paths that groups of people have chosen — not in totality but as the result of smaller decisions that select which ideas will be influential and which will be eliminated.

What ideas pattern our cultures?

One idea that western society has chosen, and that Ron highlighted for us, is the extraction of value. This is something we can see across this society, and it’s a powerful idea that fundamentally shapes how many of us think about the world. We extract value from the natural world in the way we exploit raw materials. We extract value from labour through the organisation of working arrangements that we have made the norm. And we extract value from social relationships through the online social media platforms, online games, and other digital tools that have so quickly become a central part of billions of people’s lives.

Examples of indigenous visual art patterned by circular and bottom-up principles, as shown by Ron in his talk.

But western culture, with its particular knowledge system and core tenet of value extraction, represents just one possible way of social and technical development. In nature, systems do not extract value, they circulate it: value moves in a recursive loop as organisms grow, die, and are subsumed back into the ecosystem. Many indigenous cultures have developed within this framework of circulating value. The possible benefits of a circular economy are becoming a topic of discussion in western society, and we would do well to remember that this concept is not western in origin: other cultures have been practicing it for a long time, a point Ron made clear in his talk. And as Ron showed us through his research, the framework of circulating value permeates various indigenous cultures in ways that go beyond approaches such as sustainable agriculture, and thereby creates repeating, fractal patterns in cultural artefacts at different scales, from artworks, to the way settlements are organised, to philosophical ideas.

Close-up photo of an Angelica flowerhead.
Many natural phenomena show fractal patterns, for example this Angelica flowerhead, a sphere of spheres. (Photo by Chiswick Chap – Own work, CC BY-SA 3.0)

In nature, there are many examples of fractal geometry because of biological and chemical phenomena of bottom-up growth and replication. Ron shared images gathered during his research that highlight that fractal patterns are also clearly visible in, for example, traditional African art: by using visual patterns of recursive and non-linear scaling, artists intentionally symbolised the bottom-up and circular ideas permeating their culture. African cultural concepts of recursion and non-linearity, which were also brought to the Americas during the transatlantic slave trade, can be seen today in, for example, cornrow hair braiding, quilting, growing traditions, and spiritual practices.

Examples of hair braiding patterns informed by African cultural traditions, as shown by Ron in his talk.

Computing activities based on circulation of value

The links between indigenous cultural concepts and computing algorithms are many. To explore these in the context of education, Ron and his team have worked in collaboration with members of indigenous communities to develop Culturally Situated Design Tools (CSDT), a suite of computing and STEM activities and learning resources that allow young people of a range of ages to discover the relationship between computing and programming concepts and cultural ideas that trace back to indigenous cultures. The CSDT development process Ron described involved genuine collaboration: seeking ‘cultural permission’ from communities; deeply understanding the cultural concepts behind the artefacts that were being developed; and creating tools that not only allow students to explore traditional designs and artefacts but also give them the scope to design their own original artefacts and to actively contribute to communities’ cultural practices.

Screenshot from the Culturally Situated Design Tools website showing Cornrow Curves Tutorials.

Ron underlined in his talk how important it is not to see activities like CSDT as a lure to ‘trick’ young people into engaging with STEM classes; the intention is not to use them as a veneer for drawing more young people into industries underpinned by an extractive world view. Instead, circular and bottom-up concepts offer an alternative way of seeing how technology can be used to influence and construct the world.

Returning creative contributions

As such, an important aspect of the pedagogy of Culturally Situated Design Tools is returning creative contributions to the community whose concepts or artefacts are being explored in each activity. The aim is to create a generative cycle of STEM engagement, and Ron demonstrated how this can work by sharing more about a project he conducted with STEM students in Albany, NY. Students began the project by exploring cornrow design simulations. They brought these out of the computer, out of their schools, and into local braiding shops by producing 3D-printed mannequins featuring their cornrow designs. Through engaging with the braiding shop owners, the students learned that the owners had challenges to do with the pH level of hair products, and this led to the students producing pH testing kits for them. The practical applications benefitted the communities connected to the braiding shops and inspired more student interest in the project — thus, a circular, mutually beneficial process of engagement emerged.

A generative cycle of STEM education, in which students learn with activities based on cultural artefacts and then use their learning to give back to the community the artefacts came from. As shown by Ron in his talk.

Importantly, the STEM activities that Ron and his collaborators have developed cannot be separated from their cultural context. This way of teaching STEM is not about recruiting young people to become software developers or other tech professionals, but instead about giving them the skills to be creative contributors and problem solvers within communities so that they can help promote the circulation of value.

Rethinking diversity

I have long been enthusiastic about the potential of computing and digital making as a tool for many disciplines, and Ron’s talk made me consider what this might mean at a much deeper level than providing different routes into computing. There is a lot of discussion about how we need to increase diversity in the STEM field to make the field more equitable and able to positively contribute to society, but Ron’s presentation challenged me to think about the cultural assumptions that shape the nature of STEM, and how these influence who engages with the field. Increasing diversity and inclusion in computing and STEM is not just a case of making opportunities open to everyone, but about actually re-shaping the nature of the field so it can be equitable in its interactions with ecological systems, cultures, and human experiences.

Do watch the video of Ron’s presentation and the following Q&A for more on these concepts, examples of the computing activities and how to use them, and discussion of these fundamental ideas. You’ll find his presentation slides on our ‘previous seminars’ page.

You can find the resources Ron shared at csdt.org and generativejustice.org/projects.

Join us at our next online seminar

We are taking a break from our monthly research seminars in August! In the meantime, you can revisit our previous seminars about diversity and inclusion. On 7 September, we’ll be back to start our new seminar series focusing on AI, machine learning, and data science education, in partnership with The Alan Turing Institute. At these seminars, you’ll hear from a range of international speakers about current best practices in teaching young people the technical concepts and ethical considerations involved in these technologies. Do sign up and put the dates in your calendar!

Field Notes: Building an Automated Image Processing and Model Training Pipeline for Autonomous Driving

Post Syndicated from Antonia Schulze original https://aws.amazon.com/blogs/architecture/field-notes-building-an-automated-image-processing-and-model-training-pipeline-for-autonomous-driving/

In this blog post, we demonstrate how to build an automated and scalable data pipeline for autonomous driving. This solution was built with the goal of accelerating the process of analyzing recorded footage and training a model to improve the experience of autonomous driving.

We demonstrate how to extract images from ROS bag files, use Amazon Rekognition to label the images for cataloging, and build a searchable database using Amazon DynamoDB. This allows us to find relevant images for training computer vision machine learning (ML) algorithms. Next, we show you how to use the database to find suitable images, create a labeling job with Amazon SageMaker Ground Truth, and train a machine learning model to detect cars. The following diagram shows the architecture for this solution.

Overview of the solution

Figure 1 – Architecture showing how to build an automated image processing and model training pipeline

Prerequisites

This post uses an AWS Cloud Development Kit (AWS CDK) stack written in Python. Follow the instructions in the CDK Getting Started guide to set up your environment.

Deployment

The full pipeline can be deployed with one command: `bash deploy.sh deploy true`. We can follow the progress of the deployment on the command line, as well as in the CloudFormation section of the AWS console. Once the stack is deployed, we upload bag files to the rosbag-ingest bucket to launch the pipeline. Once the pipeline has finished, we can clone the repository to the SageMaker notebook instance ros-bag-demo-notebook.
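
As a small illustration (not part of the deployment script), uploading a recording with boto3 might look like this; the local file name is a placeholder, and the actual ingest bucket name may carry a stack-specific suffix in your account.

import boto3

s3 = boto3.client("s3")

# Uploading a bag file to the ingest bucket triggers the S3 event that
# starts the extraction pipeline.
s3.upload_file(
    Filename="recording_001.bag",   # placeholder local file
    Bucket="rosbag-ingest",         # ingest bucket created by the stack
    Key="recording_001.bag",
)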

Walkthrough

  • The Robot Operating System (ROS) is open source middleware that provides tools and libraries for building robotic systems. It uses a Publish/Subscribe (pub/sub) architecture to transport sensor data to any software modules that need to operate on that data.
    • Each sensor publishes its data as a topic, and any module that needs that data subscribes to that topic.
  • This pub/sub architecture lends itself well to recording data from multiple sensors of varying modalities (camera, LIDAR, RADAR) into a single file that can be replayed for testing and diagnostic purposes. ROS supports this capability with its ROS bag module, which stores data in a ROS bag format file.
    • A ROS bag file contains a collection of topics, each with a set of time-stamped messages. These files can be replayed on a ROS system, with the timestamps ensuring that messages are published to the topics in real time and in the order in which they were recorded.
  • The input for this example is a set of ROS bag files, each approximately 10 GB.
    • To extract the image data from the input ROS bag files, you create a Docker container based on a ROS image.
    • You then create a ROS launch configuration file that extracts images to .png files, based on the ROS bag tutorial instructions. The Docker container is stored in Amazon Elastic Container Registry (Amazon ECR), ready to run as an AWS Fargate task.

AWS Fargate is a serverless compute engine for containers that works with both Amazon Elastic Container Service (Amazon ECS) and Amazon Elastic Kubernetes Service (Amazon EKS). By using Fargate, we can create and run the Docker containers with our ROS environment, and have many containers running in parallel, each processing a single ROS bag file.

When you have the individual images, you need a way to assess their contents to build a searchable image catalog. Such a catalog allows ML data scientists to search through the recorded images to find, for example, images containing pedestrians. The catalog can also be extended with data from other sources, such as weather data, location data, and so forth. You use Amazon Rekognition to process the images; it adds image and video analysis to your applications. When you provide an image or video to the Amazon Rekognition API, the service identifies objects, people, text, scenes, and activities. By requesting that Amazon Rekognition label each image, you receive a large amount of information with which to catalog the image.

The image ingestion pipeline is largely event driven. Many of the AWS services you use have limits on job concurrency and API access rates. To work within these limits, you place all events into an Amazon Simple Queue Service (Amazon SQS) queue, invoke a Lambda function from the queue, and make the appropriate API call (for example, Amazon Rekognition DetectLabels). If the API call is successful, you delete the message from the queue; otherwise (for example, if the rate is exceeded), you exit the Lambda function and the message is returned to the queue. One benefit is that when service limits change, depending on the account configuration or Region, the pipeline automatically scales to accommodate these changes.
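
As a rough sketch of this pattern (not the code from the sample repository), a queue-driven Lambda handler could look like the following. The message format, table name, and key attribute are assumptions made for illustration; the handler relies on the SQS event source mapping to delete messages on success and redeliver them on failure.

import json
import boto3

rekognition = boto3.client("rekognition")
dynamodb = boto3.resource("dynamodb")
catalog_table = dynamodb.Table("image-catalog")  # hypothetical table name

def handler(event, context):
    # One SQS record per extracted image; the body carries the S3 location.
    for record in event["Records"]:
        body = json.loads(record["body"])
        bucket, key = body["bucket"], body["key"]

        # Label the image. If Rekognition throttles the call, the exception
        # propagates, the invocation fails, and SQS redelivers the message.
        response = rekognition.detect_labels(
            Image={"S3Object": {"Bucket": bucket, "Name": key}},
            MaxLabels=20,
            MinConfidence=70,
        )

        # Store one catalog item per image with its labels and confidence scores.
        catalog_table.put_item(
            Item={
                "image_key": key,
                "labels": {
                    label["Name"]: str(round(label["Confidence"], 2))
                    for label in response["Labels"]
                },
            }
        )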

  • The pipeline is launched when a ROS bag file is uploaded to the Amazon Simple Storage Service (Amazon S3) bucket, which has been configured to post an object creation event to an SQS queue.
  • A Lambda function is invoked from the SQS queue and starts a Step Functions workflow, which runs our dockerized container on a Fargate cluster. Each extracted image is stored in an S3 bucket, which posts an event to a second SQS queue that starts another Lambda function. That Lambda function calls the DetectLabels operation of the Amazon Rekognition API, which returns labels for everything that Amazon Rekognition can detect in the scene.
  • The response also includes a confidence level for each label. The labels and confidence scores are stored in a DynamoDB data catalog table, which you can query for specific objects and filter to create subsets of interest.

Figure 2 – DynamoDB table containing detected objects and confidence scores

Because you will use a public workforce for labeling in the next section, you need to create anonymized versions of the images in which faces and license plates are blurred out. Amazon Rekognition has a DetectFaces API call to find any faces in an image. There is no corresponding call for detecting license plates, so you detect all text in the image with the DetectText API. The write of the .json output file invokes a Lambda function, which calls these Amazon Rekognition APIs and blurs the relevant regions of each image before saving it to S3.
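
As an illustration of this anonymization step (again a sketch rather than the repository code), faces and detected text regions could be blurred with Amazon Rekognition and Pillow as follows; the destination bucket and blur radius are placeholders.

import io
import boto3
from PIL import Image, ImageFilter

rekognition = boto3.client("rekognition")
s3 = boto3.client("s3")

def anonymize(bucket, key, dest_bucket):
    # Fetch the raw frame from S3.
    obj = s3.get_object(Bucket=bucket, Key=key)
    image = Image.open(io.BytesIO(obj["Body"].read()))
    width, height = image.size

    # Collect bounding boxes for faces and for any detected text
    # (used here as a stand-in for license plates).
    boxes = []
    faces = rekognition.detect_faces(
        Image={"S3Object": {"Bucket": bucket, "Name": key}})
    boxes += [f["BoundingBox"] for f in faces["FaceDetails"]]
    text = rekognition.detect_text(
        Image={"S3Object": {"Bucket": bucket, "Name": key}})
    boxes += [t["Geometry"]["BoundingBox"] for t in text["TextDetections"]]

    # Rekognition returns boxes as ratios of the image size; blur each region.
    for box in boxes:
        left = int(box["Left"] * width)
        top = int(box["Top"] * height)
        right = left + int(box["Width"] * width)
        bottom = top + int(box["Height"] * height)
        region = image.crop((left, top, right, bottom))
        image.paste(region.filter(ImageFilter.GaussianBlur(radius=12)),
                    (left, top))

    # Write the anonymized copy to the destination bucket.
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    s3.put_object(Bucket=dest_bucket, Key=key, Body=buffer.getvalue())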

Image labeling with Amazon SageMaker Ground Truth

Since the images are now stored in both their raw and anonymized forms, we can start the data labeling step by sampling the images we want to label. The data catalog in DynamoDB lets you query the table for the scenarios and sub-areas you want to optimize your model on. For example, you could query the DynamoDB table for images containing a crowd of pedestrians, label specifically those images, and allow your model to improve in these particular circumstances. Once we have identified the images of interest, we can copy them to a specific S3 folder and start a SageMaker Ground Truth job for an object detection task. You can find a detailed blog post on streamlining data for object detection in Amazon SageMaker Ground Truth.
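
The following is a minimal sketch of such a query, assuming each catalog item stores a map of label names to confidence scores under a labels attribute, as in the earlier Lambda sketch; the table and attribute names are illustrative rather than those used by the sample code.

import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
catalog_table = dynamodb.Table("image-catalog")  # hypothetical table name

def find_images_with_label(label_name):
    # Scan the catalog for items whose label map contains the requested label.
    # A scan is fine for a demo-sized catalog; a production catalog would
    # typically add a secondary index for frequently queried labels.
    items, start_key = [], None
    while True:
        kwargs = {"FilterExpression": Attr(f"labels.{label_name}").exists()}
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        page = catalog_table.scan(**kwargs)
        items.extend(page["Items"])
        start_key = page.get("LastEvaluatedKey")
        if not start_key:
            return items

# Example: find frames that contain pedestrians, then copy them to a
# dedicated S3 prefix for the Ground Truth labeling job.
pedestrian_frames = find_images_with_label("Person")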

The result of a SageMaker Ground Truth job is a manifest file containing the S3 path, bounding box coordinates, and class labels for each image. It is automatically uploaded to S3. We need to replace the anonymized image paths with the raw image S3 paths, because we want to train the model on the raw images. We have provided a sample manifest file in the repository, and you can follow along with the blog post using the Jupyter notebook provided in `object-detection/Transfer-Learning.iypnb`. First, we can verify that the annotations are of high quality by viewing the following sample image.

Figure 3 – Visualization of annotations from the SageMaker Ground Truth job

Fine-tune a GluonCV model with SageMaker Script Mode

The ML technique of transfer learning allows us to use neural networks that have previously been trained on large datasets for similar applications, and fine-tune them on a smaller set of custom annotated data. Frameworks such as GluonCV provide a model zoo for object detection that gives us quick access to these pre-trained models. In this case, we have selected a YOLOv3 model that has been pre-trained on the COCO dataset. Based on empirical analysis, other networks such as Faster-RCNN outperform YOLOv3 in accuracy, but tend to have slower inference times as measured in frames per second, which is a key aspect for real-time applications.

The preferred object detection format for GluonCV is based on the .lst file format, which is then converted to RecordIO, providing faster disk access and more compact storage. GluonCV provides a tutorial on how to convert a .lst file into a RecordIO file.

To train a customized neural network, we use Amazon SageMaker Script Mode, which lets us bring our own training script while still using SageMaker’s managed training infrastructure.

import sagemaker
from sagemaker import get_execution_role
from sagemaker.mxnet import MXNet

sagemaker_session = sagemaker.Session()
role = get_execution_role()
s3_output_path = "s3://<path to bucket where model weights will be saved>/"

model_estimator = MXNet(
    entry_point="train_yolov3.py",
    role=role,
    instance_count=1,
    instance_type="ml.p3.8xlarge",
    framework_version="1.8.0",
    py_version="py37",
    output_path=s3_output_path,
    sagemaker_session=sagemaker_session,
)

model_estimator.fit("s3://<bucket path for train and validation record-io files>/")

Hyperparameter optimization on SageMaker

While training neural networks, there are many parameters that can be tuned to the use case and the custom dataset. In SageMaker, this is called automatic model tuning, or hyperparameter optimization. SageMaker launches multiple training jobs, each with a unique combination of hyperparameters, and searches for the configuration that achieves the highest mean average precision (mAP) on our held-out test data.

from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, CategoricalParameter

hyperparameter_ranges = {
    "lr": ContinuousParameter(0.001, 0.1),
    "wd": ContinuousParameter(0.0001, 0.001),
    "batch-size": CategoricalParameter([8, 16]),
}
metric_definitions = [
    {"Name": "val:car mAP", "Regex": "val:mAP=(.*?),"},
    {"Name": "test:car mAP", "Regex": "test:mAP=(.*?),"},
    {"Name": "BoxCenterLoss", "Regex": "BoxCenterLoss=(.*?),"},
    {"Name": "ObjLoss", "Regex": "ObjLoss=(.*?),"},
    {"Name": "BoxScaleLoss", "Regex": "BoxScaleLoss=(.*?),"},
    {"Name": "ClassLoss", "Regex": "ClassLoss=(.*?),"},
]
objective_metric_name = "val:car mAP"

hpo_tuner = HyperparameterTuner(
    model_estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=10,  # maximum jobs that should be ran
    max_parallel_jobs=2
)

hpo_tuner.fit("s3://<bucket path for train and validation record-io files>/")

Model compilation

Although we don’t have hard constraints for a model environment when training in the cloud, we should mind the production environment when running inference with trained models: no powerful GPUs and limited storage are common challenges. Fortunately, Amazon SageMaker Neo allows you to train once and run anywhere in the cloud and at the edge, while reducing the memory footprint of your model.

best_estimator = hpo_tuner.best_estimator()
compiled_model = best_estimator.compile_model(
    target_instance_family="ml_c4",
    role=role,
    input_shape={"data": [1, 3, 512, 512]},
    output_path=s3_output_path,
    framework="mxnet",
    framework_version="1.8",
    env={"MMS_DEFAULT_RESPONSE_TIMEOUT": "500"}
)

Deploying the model

Deploying a model requires a few additional lines of code for hosting.

from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor = compiled_model.deploy(
    initial_instance_count=1,
    instance_type="ml.c4.xlarge",
    endpoint_name="YOLO-DEMO-endpoint",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

Run inference

Once the model is deployed to an endpoint, we can run some test inferences. Because the model has been trained on 512×512 pixel images, we need to resize inference images accordingly before serializing the data and making a prediction request to the SageMaker endpoint.

import PIL.Image
import numpy as np
test_image = PIL.Image.open("test.png")
test_image = np.asarray(test_image.resize((512, 512))) 
endpoint_response = predictor.predict(test_image)

We can then visualize the response and show the confidence score associated with the prediction on the test image.

Figure 4 – Visualization of the response and confidence score associated with the prediction on the test image

Clean Up

To clean up the deployment, run `bash deploy.sh destroy false`. In addition, you need to delete the SageMaker endpoint. Some resources, like S3 buckets and DynamoDB tables, must be manually emptied and deleted through the console to be fully removed.
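
For example, the endpoint and model deployed earlier in the notebook can be removed with the SageMaker Python SDK:

# Remove the real-time endpoint created earlier so it stops accruing charges,
# then delete the model registered for that endpoint.
predictor.delete_endpoint()
predictor.delete_model()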

Conclusion

This post described how to extract images at large scale from ROS bag files and label a subset of them with SageMaker Ground Truth. With this labeled training dataset, we fine-tuned an object detection neural network using SageMaker Script Mode. To deploy the model in the autonomous driving vehicle, we compiled the model with SageMaker Neo, reducing the storage size and optimizing the model graph on the specific hardware. Finally, you ran some test inference predictions and visualized them in a SageMaker Notebook. You can find the code for this blog post in this GitHub repository.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Benchmark the performance of the new Auto WLM with adaptive concurrency in Amazon Redshift

Post Syndicated from Raj Sett original https://aws.amazon.com/blogs/big-data/benchmarking-the-performance-of-the-new-auto-wlm-with-adaptive-concurrency-in-amazon-redshift/

With Amazon Redshift, you can run a complex mix of workloads on your data warehouse clusters. For example, frequent data loads run alongside business-critical dashboard queries and complex transformation jobs. We also see more and more data science and machine learning (ML) workloads. Each workload type has different resource needs and different service level agreements. How does Amazon Redshift give you a consistent experience for each of your workloads? Amazon Redshift workload management (WLM) helps you maximize query throughput and get consistent performance for the most demanding analytics workloads, all while optimally using the resources of your existing cluster.

Amazon Redshift has recently made significant improvements to automatic WLM (Auto WLM) to optimize performance for the most demanding analytics workloads. With the release of Amazon Redshift Auto WLM with adaptive concurrency, Amazon Redshift can now dynamically predict and allocate the amount of memory that queries need to run optimally. Amazon Redshift dynamically schedules queries for best performance based on their run characteristics to maximize cluster resource utilization.

In this post, we discuss what’s new with WLM and the benefits of adaptive concurrency in a typical environment. We synthesized a mixed read/write workload based on TPC-H to show the performance characteristics of a workload with a highly tuned manual WLM configuration versus one with Auto WLM. In this experiment, Auto WLM configuration outperformed manual configuration by a great margin. From a throughput standpoint (queries per hour), Auto WLM was 15% better than the manual workload configuration. Overall, we observed 26% lower average response times (runtime + queue wait) with Auto WLM.

What’s new with Amazon Redshift WLM?

Workload management allows you to route queries to a set of defined queues to manage the concurrency and resource utilization of the cluster. Today, Amazon Redshift has both automatic and manual configuration types. With manual WLM configurations, you're responsible for defining the amount of memory allocated to each queue and the maximum number of queries that can run concurrently in each queue, each of which gets a fraction of that memory. Manual WLM configurations don't adapt to changes in your workload and require intimate knowledge of your queries' resource utilization to get right. Amazon Redshift Auto WLM doesn't require you to define the memory utilization or concurrency for queues. Auto WLM adjusts the concurrency dynamically to optimize for throughput. Optionally, you can define queue priorities to give queries preferential resource allocation based on your business priorities.
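For reference, switching between configuration types is a change to the cluster's parameter group. The following is a hedged boto3 sketch (not code from this post); the parameter group name is a placeholder, and depending on the change the cluster may need a reboot before the new configuration applies:

    # A minimal sketch: enable Auto WLM by setting the wlm_json_configuration
    # parameter on the cluster's parameter group with boto3.
    import json
    import boto3

    redshift = boto3.client("redshift", region_name="us-east-1")

    auto_wlm = json.dumps([{"auto_wlm": True}])  # minimal Auto WLM configuration

    redshift.modify_cluster_parameter_group(
        ParameterGroupName="benchmark-wlm-params",  # placeholder name
        Parameters=[{
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": auto_wlm,
        }],
    )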

Auto WLM also provides powerful tools to let you manage your workload. Query priorities let you define priorities for workloads so they can get preferential treatment in Amazon Redshift, including more resources during busy times for consistent query performance. Query monitoring rules offer ways to manage unexpected situations, such as detecting and preventing runaway or expensive queries from consuming system resources.

Our initial release of Auto WLM in 2019 greatly improved the out-of-the-box experience and throughput for the majority of customers. However, in a small number of situations, some customers with highly demanding workloads had developed highly tuned manual WLM configurations for which Auto WLM didn’t demonstrate a significant improvement. Over the past 12 months, we worked closely with those customers to enhance Auto WLM technology with the goal of improving performance beyond the highly tuned manual configuration. One of our main innovations is adaptive concurrency. With adaptive concurrency, Amazon Redshift uses ML to predict and assign memory to the queries on demand, which improves the overall throughput of the system by maximizing resource utilization and reducing waste.

Electronic Arts, Inc. is a global leader in digital interactive entertainment. EA develops and delivers games, content, and online services for internet-connected consoles, mobile devices, and personal computers. EA has more than 300 million registered players around the world. Electronic Arts uses Amazon Redshift to gather player insights and has immediately benefited from the new Amazon Redshift Auto WLM.

By adopting Auto WLM, our Amazon Redshift cluster throughput increased by at least 15% on the same hardware footprint. Our average concurrency increased by 20%, allowing approximately 15,000 more queries per week now. All this with marginal impact to the rest of the query buckets or customers. Because Auto WLM removed hard walled resource partitions, we realized higher throughput during peak periods, delivering data sooner to our game studios.

– Alex Ignatius, Director of Analytics Engineering and Architecture for the EA Digital Platform.

Benefits of Amazon Redshift Auto WLM with adaptive concurrency

Amazon Redshift has implemented an advanced ML predictor to predict the resource utilization and runtime for each query. The model continuously receives feedback about prediction accuracy and adapts for future runs. Higher prediction accuracy means resources are allocated based on query needs. This allows for higher concurrency of light queries and more resources for intensive queries. The latter leads to improved query and cluster performance because less temporary data is written to storage during a complex query's processing. A unit of concurrency (slot) is created on the fly by the predictor with the estimated amount of memory required, and the query is scheduled to run. If there is a backlog of queued queries, Amazon Redshift can reorder them across queues to minimize the queue time of short, less resource-intensive queries while also ensuring that long-running queries aren't being starved. We also make sure that queries across WLM queues are scheduled to run both fairly and based on their priorities.

The following are key areas of Auto WLM with adaptive concurrency performance improvements:

  • Proper allocation of memory – Reduction of over-allocation of memory creates more room for other queries to run and increases concurrency. Additionally, reduction of under-allocation reduces spill to disk and therefore improves query performance.
  • Elimination of static partitioning of memory between queues – This frees up the entire available memory, which is then available for queries.
  • Improved throughput – You can pack more queries into the system due to more efficient memory utilization.

The following diagram shows how a query moves through the Amazon Redshift query run path to take advantage of the improvements of Auto WLM with adaptive concurrency.

Benchmark test

To assess the efficiency of Auto WLM, we designed the following benchmark test. It’s a synthetic read/write mixed workload using TPC-H 3T and TPC-H 100 GB datasets to mimic real-world workloads like ad hoc queries for business analysis.

In this modified benchmark test, the set of 22 TPC-H queries was broken down into three categories based on their run times. The shortest queries were categorized as DASHBOARD, medium ones as REPORT, and the longest-running queries were marked as the DATASCIENCE group. The DASHBOARD queries were pointed to a smaller TPC-H 100 GB dataset to mimic a datamart set of tables. The COPY jobs loaded a TPC-H 100 GB dataset on top of the existing TPC-H 3T dataset tables. The REPORT and DATASCIENCE queries were run against the larger TPC-H 3T dataset as if those were ad hoc and analyst-generated workloads against a larger dataset. Also, the TPC-H 3T dataset was constantly getting larger through the hourly COPY jobs, as if extract, transform, and load (ETL) were running against this dataset.

The following table summarizes the synthesized workload components.

Schema     Dataset        Workload Type   Description
tpch100g   TPC-H 100 GB   DASH            16 dashboard queries running every 2 seconds
tpch3t     TPC-H 3T       REPORT          6 report queries running every 15 minutes
tpch3t     TPC-H 3T       DATASCIENCE     4 data science queries running every 30 minutes
tpch3t     TPC-H 3T       COPY            3 COPY jobs every hour loading TPC-H 100 GB data onto the TPC-H 3T tables
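The workload scripts used for the benchmark are linked at the end of this post. Purely as an illustration of how such interval-driven traffic could be generated, the following sketch submits one query group on a fixed cadence; the connection details, SQL file names, and query_group label are placeholder assumptions, not the benchmark scripts themselves:

    # Hedged sketch of an interval-driven load generator; not the benchmark scripts.
    import time
    import redshift_connector

    def run_on_interval(sql_statements, interval_seconds, query_group):
        conn = redshift_connector.connect(
            host="benchmark-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder
            database="tpch", user="awsuser", password="********",
        )
        conn.autocommit = True
        cursor = conn.cursor()
        cursor.execute("set query_group to '%s';" % query_group)  # route to a WLM queue
        while True:
            started = time.time()
            for sql in sql_statements:
                cursor.execute(sql)
                cursor.fetchall()
            # Sleep out the remainder of the interval before the next round.
            time.sleep(max(0.0, interval_seconds - (time.time() - started)))

    # Example: 16 dashboard queries against the tpch100g schema every 2 seconds.
    dashboard_queries = [open("dashboard_q%02d.sql" % i).read() for i in range(1, 17)]
    run_on_interval(dashboard_queries, interval_seconds=2, query_group="dashboard")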

The following table summarizes the manual and Auto WLM configurations we used.

Each configuration is shown as Memory % / Max Concurrency / Concurrency Scaling / Priority.

Queue/Query Group   Manual Configuration    Auto Configuration
Dashboard           24 / 5 / Off / NA       Auto / Auto / Off / Normal
Report              25 / 6 / Off / NA       Auto / Auto / Off / Normal
DataScience         25 / 4 / Off / NA       Auto / Auto / Off / Normal
COPY                25 / 3 / Off / NA       Auto / Auto / Off / Normal
Default             1 / 1 / Off / NA        Auto / Auto / Off / Normal
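For readers who want to reproduce the manual side, the table above corresponds roughly to a wlm_json_configuration value like the following sketch; the query_group routing names are illustrative assumptions, and the exact configuration used in the benchmark is in the linked workload scripts:

    # A hedged reconstruction of the manual WLM configuration in the table above.
    import json

    manual_wlm = json.dumps([
        {"query_group": ["dashboard"],   "memory_percent_to_use": 24, "query_concurrency": 5, "concurrency_scaling": "off"},
        {"query_group": ["report"],      "memory_percent_to_use": 25, "query_concurrency": 6, "concurrency_scaling": "off"},
        {"query_group": ["datascience"], "memory_percent_to_use": 25, "query_concurrency": 4, "concurrency_scaling": "off"},
        {"query_group": ["copy"],        "memory_percent_to_use": 25, "query_concurrency": 3, "concurrency_scaling": "off"},
        {"memory_percent_to_use": 1,     "query_concurrency": 1},  # default queue
    ])
    print(manual_wlm)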

We ran the benchmark test using two 8-node ra3.4xlarge clusters, one for each configuration. The exact same workload ran on both clusters for 12 hours.

Summary of results

We noted that manual and Auto WLM had similar response times for COPY, but Auto WLM significantly improved the DATASCIENCE, REPORT, and DASHBOARD query response times, which resulted in higher throughput for DASHBOARD queries (frequent short queries).

Given the same controlled environment (cluster, dataset, queries, concurrency), Auto WLM with adaptive concurrency managed the workload more efficiently and provided higher throughput than the manual WLM configuration. Better, more efficient memory management enabled Auto WLM with adaptive concurrency to improve the overall throughput. Elimination of the static memory partition created an opportunity for higher parallelism. More short queries were processed through Auto WLM, whereas longer-running queries had similar throughput. To optimize the overall throughput, adaptive concurrency control kept the number of longer-running queries at the same level but allowed more short-running queries to run in parallel.

Detailed results

In this section, we review the results in more detail.

Throughput and average response times

The following table summarizes the throughput and average response times, over a runtime of 12 hours. Response time is runtime + queue wait time.

WLM Configuration      Query Type    Count of Queries   Total Response Time (Secs)   Average Response Time (Secs)
Auto                   COPY          72                 1329                         18.46
Manual                 COPY          72                 1271                         17.65
Auto                   DASH          126102             271691                       2.15
Manual                 DASH          109774             304551                       2.77
Auto                   DATASCIENCE   166                20768                        125.11
Manual                 DATASCIENCE   160                32603                        203.77
Auto                   REPORT        247                38986                        157.84
Manual                 REPORT        230                55693                        242.14
Auto                   Total         126587             332774                       2.63
Manual                 Total         110236             394118                       3.58
Auto over Manual (%)                 14.83%                                          -26.47%
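The headline percentages follow directly from the totals in this table. A quick check (using the rounded averages shown above, so the response-time figure differs from the reported -26.47% only by rounding):

    # Recomputing the summary percentages from the table totals.
    auto_queries, manual_queries = 126587, 110236
    auto_avg_resp, manual_avg_resp = 2.63, 3.58

    throughput_gain = (auto_queries - manual_queries) / manual_queries * 100
    response_time_change = (auto_avg_resp - manual_avg_resp) / manual_avg_resp * 100

    print(round(throughput_gain, 2))       # ~14.83 (% more queries with Auto WLM)
    print(round(response_time_change, 2))  # ~-26.54 (% lower average response time)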

The following chart shows the throughput (queries per hour) gain of Auto WLM over manual WLM (higher is better).

The following chart shows the average response time of each query (lower is better).

Bucket by query completion times

The following results show a clear shift to the left for Auto WLM: more queries completed in a shorter amount of time.

Percentage of queries completed in each time bucket:

WLM Configuration   Total Queries   0-5 s   6-30 s   31-60 s   61-120 s   121-300 s   301-900 s   Over 900 s
Manual              110155          87.14   11.37    1.2       0.09       0.1         0.09        0.01
Auto                126477          92.82   6.06     0.85      0.13       0.09        0.03        0.01

The following chart visualizes these results.

Query latency and count over time

As we can see from the following charts, Auto WLM significantly reduces the queue wait times on the cluster.

The following chart shows the count of queries processed per hour (higher is better).

The following chart shows the count of queued queries (lower is better).

The following chart shows the total queue wait time per hour (lower is better).

Temporary data spill to disk

Because Auto WLM correctly estimated the memory that queries require at runtime, it was able to reduce the spill of temporary blocks to disk. Basically, a larger portion of the queries had enough memory while running that they didn't have to write temporary blocks to disk, which is a good thing. This in turn improves query performance.

The following chart shows that DASHBOARD queries had no spill, and COPY queries had a little spill.
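If you want to check for this kind of spill on your own cluster, disk-based steps are visible in the system view svl_query_summary. The following is a hedged sketch; the connection details are placeholders:

    # Hedged sketch: count queries per hour that spilled intermediate results to disk.
    # svl_query_summary.is_diskbased = 't' marks steps that wrote temporary blocks.
    import redshift_connector

    conn = redshift_connector.connect(
        host="benchmark-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder
        database="tpch", user="awsuser", password="********",
    )
    cursor = conn.cursor()
    cursor.execute("""
        select date_trunc('hour', q.starttime) as hour,
               count(distinct s.query)         as queries_that_spilled
        from svl_query_summary s
        join stl_query q on q.query = s.query
        where s.is_diskbased = 't'
        group by 1
        order by 1;
    """)
    for hour, queries_that_spilled in cursor.fetchall():
        print(hour, queries_that_spilled)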

Auto WLM outperforms the manual configuration

Based on these tests, Auto WLM was a better choice than manual configuration. Mixed workloads (manual WLM with multiple queues) reap the most benefit from Auto WLM. Most large data warehouse workloads consist of a well-defined mixture of short, medium, and long queries, with some ETL processes on top, so large data warehouse systems typically have multiple queues to streamline resources for those specific workloads. These workloads can also overlap throughout a typical day. If the Amazon Redshift cluster has a good mixture of workloads and they don't overlap with each other 100% of the time, Auto WLM can use those underutilized resources and provide better performance for other queues.

Conclusion

Our test demonstrated that Auto WLM with adaptive concurrency outperforms well-tuned manual WLM for mixed workloads. If you’re using manual WLM with your Amazon Redshift clusters, we recommend using Auto WLM to take advantage of its benefits. Moreover, Auto WLM provides the query priorities feature, which aligns the workload schedule with your business-critical needs.

For more information about Auto WLM, see Implementing automatic WLM and the definition and workload scripts for the benchmark.


About the Authors

Raj Sett is a Database Engineer at Amazon Redshift. He is passionate about optimizing workloads and collaborating with customers to get the best out of Amazon Redshift. Outside of work, he loves to drive and explore new places.

Paul Lappas is a Principal Product Manager at Amazon Redshift. Paul is passionate about helping customers leverage their data to gain insights and make critical business decisions. In his spare time, Paul enjoys playing tennis, cooking, and spending time with his wife and two boys.

Gaurav Saxena is a software engineer on the Amazon Redshift query processing team. He works on several aspects of workload management and performance improvements for Amazon Redshift. In his spare time, he loves to play games on his PlayStation.

Mohammad Rezaur Rahman is a software engineer on the Amazon Redshift query processing team. He focuses on workload management and query scheduling. In his spare time, he loves to spend time outdoors with his family.

How GE Healthcare modernized their data platform using a Lake House Architecture

Post Syndicated from Krishna Prakash original https://aws.amazon.com/blogs/big-data/how-ge-healthcare-modernized-their-data-platform-using-a-lake-house-architecture/

GE Healthcare (GEHC) operates as a subsidiary of General Electric. The company is headquartered in the US and serves customers in over 160 countries. As a leading global medical technology, diagnostics, and digital solutions innovator, GE Healthcare enables clinicians to make faster, more informed decisions through intelligent devices, data analytics, applications, and services, supported by its Edison intelligence platform.

GE Healthcare's legacy enterprise analytics platform used a traditional Postgres-based on-premises data warehouse from a big data vendor to run a significant part of its analytics workloads. The data warehouse is key for GE Healthcare; it enables users across units to gather data and generate the daily reports and insights required to run key business functions. In the last few years, the number of teams accessing the cluster almost tripled, with twice the initial number of users running four times the number of daily queries the cluster had been designed for. The legacy data warehouse wasn't able to scale to support GE Healthcare's business needs and would have required significant investment to maintain, update, and secure.

Searching for a modern data platform

After several years of this database-focused approach, the rapid growth in data made GEHC's on-premises system unviable from a cost and maintenance perspective. This presented GE Healthcare with an opportunity to take a holistic look at its emerging and strategic needs for data and analytics. With this in mind, GE Healthcare decided to adopt a Lake House Architecture using AWS services:

  • Use Amazon Simple Storage Service (Amazon S3) to store raw enterprise and event data
  • Use familiar SQL statements to combine and process data across all data stores in Amazon S3 and Amazon Redshift (see the sketch after this list)
  • Apply the “best fit” concept of using the appropriate AWS technology to meet specific business needs
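The following sketch illustrates the second point above: querying raw data in Amazon S3 alongside local Amazon Redshift tables through a Redshift Spectrum external schema. The schema, AWS Glue Data Catalog database, IAM role, table names, and connection details are illustrative assumptions, not GE Healthcare's actual objects:

    # Hedged sketch: one external schema over the AWS Glue Data Catalog exposes S3
    # data to SQL, which can then be joined with local Redshift tables.
    import redshift_connector

    conn = redshift_connector.connect(
        host="example-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder
        database="analytics", user="awsuser", password="********",
    )
    conn.autocommit = True
    cursor = conn.cursor()

    cursor.execute("""
        create external schema if not exists raw_events
        from data catalog database 'raw_events_db'
        iam_role 'arn:aws:iam::123456789012:role/ExampleSpectrumRole';
    """)

    cursor.execute("""
        select d.device_family, count(*) as event_count
        from raw_events.device_telemetry e             -- external table over S3
        join dim_device d on d.device_id = e.device_id -- local Redshift table
        group by d.device_family;
    """)
    print(cursor.fetchall())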

Architecture

The following diagram illustrates GE Healthcare’s architecture.

Choosing Amazon Redshift for the enterprise cloud data warehouse

Choosing the right data store is just as important as how you collect the data for analytics. Amazon Redshift provided the best value because it made it easy to launch clusters, access and store data, and scale on demand to meet our business needs. The following are a few additional reasons why GE Healthcare chose Amazon Redshift as our cloud data warehouse:

  • The ability to store petabyte-scale data in Amazon S3 and query the data in Amazon S3 and Amazon Redshift with little preprocessing was important because GE Healthcare’s data needs are expanding at a significant pace.
  • The AWS-based strategy provides financial flexibility. GE Healthcare moved from a fixed cost/fixed asset on-premises model to a consumption-based model. In addition, the total cost of ownership (TCO) of the AWS-based architecture was less than the solution it was replacing.
  • Native integration with the AWS Analytics ecosystem made it easier to handle end-to-end analytics workflows without friction. GE Healthcare took a hybrid approach to extract, transform, and load (ETL) jobs by using a combination of AWS Glue, Amazon Redshift SQL, and stored procedures based on complexity, scale, and cost.
  • The resilient platform makes recovery from failure easier, with errors further down the pipeline less likely to affect production environments because all historical data is on Amazon S3.
  • The idea of a Lake House Architecture is that taking a one-size-fits-all approach to analytics eventually leads to compromises. It’s not simply about integrating a data lake with a data warehouse, but rather about integrating a data lake, a data warehouse, and purpose-built data stores and enabling unified governance and easy data movement. (For more information about the Lake House Architecture, see Harness the power of your data with AWS Analytics.)
  • It’s easy to implement new capabilities to support emerging business needs such as artificial intelligence, machine learning, graph databases, and more because of the extensive product capabilities of AWS and the Amazon Redshift Lake House Architecture.

Implementation steps and best practices

As part of this journey, GE Healthcare partnered with AWS Professional Services to accelerate their momentum. AWS Professional Services was instrumental in following the Working Backwards (WB) process, which is Amazon’s customer-centric product development process.

Here’s how it worked:

  • AWS Professional Services guided GE Healthcare and partner teams through the Lake House Architecture, sharing AWS standards and best practices.
  • The teams accelerated Amazon Redshift stored procedure migration, including data structure selection and table design.
  • The teams delivered performant code at scale to enable a timely go live.
  • They implemented a framework for batch file extract to support downstream data consumption via a data as a service (DaaS) solution.
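As an illustration of the batch-extract step in the last item, an UNLOAD statement can write query results from Amazon Redshift to Amazon S3 for downstream consumers. The bucket, prefix, IAM role, source table, and connection details below are placeholder assumptions, not the DaaS framework itself:

    # Hedged sketch: a batch file extract with UNLOAD for downstream (DaaS) consumers.
    import redshift_connector

    conn = redshift_connector.connect(
        host="example-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder
        database="analytics", user="awsuser", password="********",
    )
    conn.autocommit = True
    cursor = conn.cursor()
    cursor.execute("""
        unload ('select * from curated.daily_kpis where report_date = current_date')
        to 's3://example-daas-bucket/extracts/daily_kpis/'
        iam_role 'arn:aws:iam::123456789012:role/ExampleUnloadRole'
        format as parquet
        allowoverwrite;
    """)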

Conclusion

Modernizing to a Lake House Architecture with Amazon Redshift allowed GEHC to move faster, innovate, and better solve customers' needs. At the time of writing, GE Healthcare workloads are running at full scale in production, and we have retired our on-premises infrastructure. Amazon Redshift RA3 instances with managed storage enable us to scale compute and storage separately based on our customers' needs. Furthermore, with the concurrency scaling feature of Amazon Redshift, we don't have to worry about peak times affecting user performance anymore; Amazon Redshift scales out and in automatically.

We also look forward to realizing the benefits of Amazon Redshift data sharing and AQUA (Advanced Query Accelerator) for Amazon Redshift as we continue to increase the performance and scale of our Amazon Redshift data warehouse. We appreciate AWS’s continual innovation on behalf of its customers.


About the Authors

Krishna Prakash (KP) Bhat is a Sr. Director in Data & Analytics at GE Healthcare. In this role, he's responsible for architecture and data management of data and analytics solutions within the GE Healthcare Digital Technologies Organization. In his spare time, KP enjoys spending time with family. Connect with him on LinkedIn.

Suresh Patnam is a Solutions Architect at AWS, specializing in big data and AI/ML. He works with customers on their journey to the cloud with a focus on big data, data lakes, and AI/ML. In his spare time, Suresh enjoys playing tennis and spending time with his family. Connect with him on LinkedIn.
