Kinesis Data Firehose now supports dynamic partitioning to Amazon S3

Post Syndicated from Jeremy Ber original https://aws.amazon.com/blogs/big-data/kinesis-data-firehose-now-supports-dynamic-partitioning-to-amazon-s3/

Amazon Kinesis Data Firehose provides a convenient way to reliably load streaming data into data lakes, data stores, and analytics services. It can capture, transform, and deliver streaming data to Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Elasticsearch Service, generic HTTP endpoints, and service providers like Datadog, New Relic, MongoDB, and Splunk. It is a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration. It can also batch, compress, transform, and encrypt your data streams before loading, which minimizes the amount of storage used and increases security.

Customers who use Amazon Kinesis Data Firehose often want to partition their incoming data dynamically based on information that is contained within each record, before sending the data to a destination for analysis. An example of this would be segmenting incoming Internet of Things (IoT) data based on what type of device generated it: Android, iOS, FireTV, and so on. Previously, customers would need to run an entirely separate job to repartition their data after it lands in Amazon S3 to achieve this functionality.

Kinesis Data Firehose data partitioning simplifies the ingestion of streaming data into Amazon S3 data lakes by automatically partitioning data in transit, before it’s delivered to Amazon S3. This makes the datasets immediately available for analytics tools to query efficiently and enables fine-grained access control for data. For example, marketing automation customers can partition data on the fly by customer ID, so customer-specific queries run against smaller datasets and return results faster. IT operations or security monitoring customers can create groupings based on event timestamps that are embedded in logs, again querying smaller datasets and getting results faster.

In this post, we’ll discuss the new Kinesis Data Firehose dynamic partitioning feature, show how to create a delivery stream that uses dynamic partitioning, and walk through a real-world scenario where dynamically partitioning data delivered into Amazon S3 can improve the performance and scalability of the overall system. We’ll then discuss best practices around what makes a good partition key, how to handle nested fields, and how to integrate with Lambda for preprocessing and error handling. Finally, we’ll cover the limits and quotas of Kinesis Data Firehose dynamic partitioning and some pricing scenarios.

Data partitioning with Kinesis Data Firehose

First, let’s discuss why you might want to use dynamic partitioning instead of Kinesis Data Firehose’s standard timestamp-based data partitioning. Consider a scenario where your analytical data lake in Amazon S3 needs to be filtered on a specific field, such as a customer identifier (customer_id). Using the standard timestamp-based strategy, your data will look something like this, where <DOC-EXAMPLE-BUCKET> stands for your bucket name.

s3://<DOC-EXAMPLE-BUCKET>/year=2021/month=06/day=01/part-0000.parquet
s3://<DOC-EXAMPLE-BUCKET>/year=2021/month=06/day=01/part-0001.parquet
s3://<DOC-EXAMPLE-BUCKET>/year=2021/month=06/day=02/part-0002.parquet
s3://<DOC-EXAMPLE-BUCKET>/year=2021/month=06/day=02/part-0003.parquet
s3://<DOC-EXAMPLE-BUCKET>/year=2021/month=06/day=03/part-0004.parquet
s3://<DOC-EXAMPLE-BUCKET>/year=2021/month=06/day=03/part-0005.parquet
s3://<DOC-EXAMPLE-BUCKET>/year=2021/month=06/day=04/part-0006.parquet
s3://<DOC-EXAMPLE-BUCKET>/year=2021/month=06/day=04/part-0007.parquet
s3://<DOC-EXAMPLE-BUCKET>/year=2021/month=06/day=05/part-0008.parquet
s3://<DOC-EXAMPLE-BUCKET>/year=2021/month=06/day=05/part-0009.parquet
s3://<DOC-EXAMPLE-BUCKET>/year=2021/month=06/day=06/part-0010.parquet

The difficulty with this layout is that locating any individual customer requires a full scan of every file. Now consider the same data partitioned by the identifying field, customer_id.

s3://<DOC-EXAMPLE-BUCKET>/customer_id=customer-000/part-0000.parquet
s3://<DOC-EXAMPLE-BUCKET>/customer_id=customer-001/part-0001.parquet
s3://<DOC-EXAMPLE-BUCKET>/customer_id=customer-002/part-0002.parquet
s3://<DOC-EXAMPLE-BUCKET>/customer_id=customer-003/part-0003.parquet
s3://<DOC-EXAMPLE-BUCKET>/customer_id=customer-004/part-0004.parquet
s3://<DOC-EXAMPLE-BUCKET>/customer_id=customer-005/part-0005.parquet
s3://<DOC-EXAMPLE-BUCKET>/customer_id=customer-006/part-0006.parquet
s3://<DOC-EXAMPLE-BUCKET>/customer_id=customer-007/part-0007.parquet
s3://<DOC-EXAMPLE-BUCKET>/customer_id=customer-008/part-0008.parquet
s3://<DOC-EXAMPLE-BUCKET>/customer_id=customer-009/part-0009.parquet
s3://<DOC-EXAMPLE-BUCKET>/customer_id=customer-010/part-0010.parquet

In this data partitioning scheme, you only need to scan one folder to find data related to a particular customer. This is how analytics query engines like Amazon Athena, Amazon Redshift Spectrum, or Presto are designed to work: they prune unneeded partitions during query execution, thereby reducing the amount of data that is scanned and transferred. Partitioning data like this results in less data scanned overall.
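To see partition pruning in practice, you can filter on the partition column from a query engine. The following is a minimal sketch that runs such a query with Athena through boto3; the database, table, and results location are hypothetical names used for illustration.

import boto3

athena = boto3.client('athena')

# Filtering on the partition column lets Athena skip every prefix except
# customer_id=customer-005, so only that partition's objects are scanned.
response = athena.start_query_execution(
    QueryString=(
        "SELECT count(*) FROM clickstream_events "
        "WHERE customer_id = 'customer-005'"
    ),
    QueryExecutionContext={'Database': 'my_data_lake'},  # hypothetical database
    ResultConfiguration={'OutputLocation': 's3://DOC-EXAMPLE-BUCKET/athena-results/'},
)
print(response['QueryExecutionId'])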

Key features

With the launch of Kinesis Data Firehose Dynamic Partitioning, you can enable partitioning that is dynamic, based on data content, using the AWS Management Console, AWS Command Line Interface (AWS CLI), or AWS SDK when you create a new Kinesis Data Firehose delivery stream or update an existing one.

At a high level, Kinesis Data Firehose Dynamic Partitioning lets you extract partition keys from the records entering your delivery stream by selecting JSON data fields with an easy-to-use query engine.

Kinesis Data Firehose Dynamic Partitioning and key extraction also result in larger objects landing in Amazon S3, and work with columnar data formats, like Apache Parquet, that query engines prefer.

With Kinesis Data Firehose Dynamic Partitioning, you have the ability to specify delimiters to detect or add on to your incoming records. This makes it possible to clean and organize data in a way that a query engine like Amazon Athena or AWS Glue would expect. This not only saves time but also cuts down on additional processes after the fact, potentially reducing costs in the process.

Kinesis Data Firehose has built-in support for extracting the keys for partitioning records that are in JSON format. You can select and extract the JSON data fields to be used in partitioning by using JSONPath syntax. These fields will then dictate how the data is partitioned when it’s delivered to Amazon S3. As we’ll discuss in the walkthrough later in this post, extracting a well-distributed set of partition keys is critical to optimizing your Kinesis Data Firehose delivery stream that uses dynamic partitioning.
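To give a sense of how these pieces fit together programmatically, the following boto3 sketch creates a delivery stream with dynamic partitioning enabled and an inline parsing query that extracts customer_id. The stream name, role and bucket ARNs, and buffering values are placeholders, and the parameter shapes should be checked against the current Kinesis Data Firehose API reference.

import boto3

firehose = boto3.client('firehose')

firehose.create_delivery_stream(
    DeliveryStreamName='clickstream-dynamic-partitioning',   # placeholder name
    DeliveryStreamType='DirectPut',
    ExtendedS3DestinationConfiguration={
        'RoleARN': 'arn:aws:iam::111122223333:role/firehose-delivery-role',
        'BucketARN': 'arn:aws:s3:::DOC-EXAMPLE-BUCKET',
        # The evaluated partition key is referenced in the S3 prefix.
        'Prefix': 'customer_id=!{partitionKeyFromQuery:customer_id}/',
        'ErrorOutputPrefix': 'errors/!{firehose:error-output-type}/',
        'BufferingHints': {'SizeInMBs': 64, 'IntervalInSeconds': 60},
        'DynamicPartitioningConfiguration': {'Enabled': True},
        'ProcessingConfiguration': {
            'Enabled': True,
            'Processors': [{
                'Type': 'MetadataExtraction',
                'Parameters': [
                    # Query that pulls customer_id out of each JSON record.
                    {'ParameterName': 'MetadataExtractionQuery',
                     'ParameterValue': '{customer_id: .customer_id}'},
                    {'ParameterName': 'JsonParsingEngine',
                     'ParameterValue': 'JQ-1.6'},
                ],
            }],
        },
    },
)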

If the incoming data is compressed, encrypted, or in any other file format, you can include the data fields for partitioning in your PutRecord or PutRecords API calls. You can also use the integrated Lambda function with your own custom code to decompress, decrypt, or transform the records to extract and return the data fields that are needed for partitioning. This is an expansion of the existing transform Lambda function that is available today with Kinesis Data Firehose. You can transform, parse, and return the data fields by using the same Lambda function.

To achieve larger file sizes when it delivers data to Amazon S3, Kinesis Data Firehose buffers incoming streaming data to a specified size or time period before delivery. Buffer sizes range from 1 MB to 4 GB when data is delivered to Amazon S3, and the buffering interval ranges from 1 minute to 1 hour.

Example of a partitioning structure

Consider the following clickstream event.

{
    "type": {
        "device": "mobile",
        "event": "user_clicked_submit_button"
    },
    "customer_id": "1234",
    "event_timestamp":1630442383,
    "geo": "US"
}

You can now partition your data by customer_id, and Kinesis Data Firehose will automatically group all events with the same customer_id and deliver them to separate folders in your S3 destination bucket. The new folders are created dynamically; you only specify which JSON field acts as the dynamic partitioning key.

Assume that you want to have the following folder structure in your S3 data lake.

s3://<DOC-EXAMPLE-BUCKET>/customer_id=1234/part-0000.parquet
s3://<DOC-EXAMPLE-BUCKET>/customer_id=1234/part-0001.parquet

s3://<DOC-EXAMPLE-BUCKET>/customer_id=4567/part-0000.parquet
s3://<DOC-EXAMPLE-BUCKET>/customer_id=4567/part-0001.parquet 

The Kinesis Data Firehose configuration for the preceding example will look like the one shown in the following screenshot.

Kinesis Data Firehose evaluates the prefix expression at runtime. It groups records that match the same evaluated S3 prefix expression into a single dataset. Kinesis Data Firehose then delivers each dataset to the evaluated S3 prefix. The frequency of dataset delivery to S3 is determined by the delivery stream buffer setting.

You can do even more with the jq JSON processor, including accessing nested fields and creating complex queries to identify specific keys among the data.

In the following example, I decide to store events in a way that will allow me to scan events from the mobile devices of a particular customer.

{
    "type": {
        "device": "mobile",
        "event": "user_clicked_submit_button"
    },
    "customer_id": "1234",
    "event_timestamp": 1630442383
    "geo": "US"
}

Given the same event, I’ll use both the device and customer_id fields in the Kinesis Data Firehose prefix expression, as shown in the following screenshot. Notice that the device is a nested JSON field.
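As a sketch of what that configuration might contain, the inline parsing query and S3 prefix could look like the following; the field names come from the event shown above, and the exact console values are assumptions.

# Hypothetical inline parsing query and S3 prefix for the nested-field example.
# The device key is read from the nested type.device field.
metadata_extraction_query = '{customer_id: .customer_id, device: .type.device}'

s3_prefix = (
    'customer_id=!{partitionKeyFromQuery:customer_id}/'
    'device=!{partitionKeyFromQuery:device}/'
)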

The generated S3 folder structure will be as follows, where <DOC-EXAMPLE-BUCKET> is your bucket name.

s3://<DOC-EXAMPLE-BUCKET>/customer_id=1234/device=mobile/part-0000.parquet
s3://<DOC-EXAMPLE-BUCKET>/customer_id=1234/device=mobile/part-0001.parquet

s3://<DOC-EXAMPLE-BUCKET>/customer_id=1234/device=desktop/part-0000.parquet
s3://<DOC-EXAMPLE-BUCKET>/customer_id=1234/device=desktop/part-0001.parquet

Now assume that you want to partition your data based on the time when the event actually was sent, as opposed to using Kinesis Data Firehose native support for ApproximateArrivalTimestamp, which represents the time in UTC when the record was successfully received and stored in the stream. The time in the event_timestamp field might be in a different time zone.

With Kinesis Data Firehose Dynamic Partitioning, you can extract and transform field values on the fly. I’ll use the event_timestamp field to partition the events by year, month, and day, as shown in the following screenshot.
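As a sketch of what such a configuration could look like, the query below derives year, month, and day from the epoch-seconds event_timestamp field. In standard jq the timestamp is piped through gmtime before strftime; verify the exact expression against the parsing engine you configure, because builds differ in what strftime accepts.

# Hypothetical inline parsing query deriving date parts from event_timestamp (epoch seconds).
metadata_extraction_query = (
    '{customer_id: .customer_id, '
    'year: (.event_timestamp | gmtime | strftime("%Y")), '
    'month: (.event_timestamp | gmtime | strftime("%m")), '
    'day: (.event_timestamp | gmtime | strftime("%d"))}'
)

s3_prefix = (
    'customer_id=!{partitionKeyFromQuery:customer_id}/'
    'year=!{partitionKeyFromQuery:year}/'
    'month=!{partitionKeyFromQuery:month}/'
    'day=!{partitionKeyFromQuery:day}/'
)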

The preceding expression will produce the following S3 folder structure, where <DOC-EXAMPLE-BUCKET> is your bucket name.

s3://<DOC-EXAMPLE-BUCKET>/customer_id=1234/year=2021/month=05/day=21/part-0000.parquet
s3://<DOC-EXAMPLE-BUCKET>/customer_id=1234/year=2021/month=05/day=21/part-0001.parquet

s3://<DOC-EXAMPLE-BUCKET>/customer_id=1234/year=2021/month=05/day=22/part-0000.parquet
s3://<DOC-EXAMPLE-BUCKET>/customer_id=1234/year=2021/month=05/day=22/part-0001.parquet

Create a dynamically partitioned delivery stream

To begin delivering dynamically partitioned data into Amazon S3, navigate to the Amazon Kinesis console page by searching for or selecting Kinesis.

From there, choose Create Delivery Stream, and then select your source and sink.

For this example, you will receive data from a Kinesis Data Stream, but you can also choose Direct PUT or other sources as the source of your delivery stream.

For the destination, choose Amazon S3.

Next, choose the Kinesis Data Stream source to read from. If you have a Kinesis Data Stream previously created, simply choose Browse and select from the list. If not, follow this guide on how to create a Kinesis Data Stream.

Give your delivery stream a name and continue on to the Transform and convert records section of the create wizard.

In order to transform your source records with AWS Lambda, you can enable data transformation. This process will be covered in the next section, and we’ll leave both the AWS Lambda transformation as well as the record format conversion disabled for simplicity.

For your S3 destination, select or create an S3 bucket that your delivery stream has permissions to write to.

Below that setting, you can now enable dynamic partitioning on the incoming data in order to deliver data to different S3 bucket prefixes based on your specified JSONPath query.

You now have the option to enable the following features, shown in the screenshot below:

  • Multi record deaggregation – Separate records that enter your delivery stream based on valid JSON criteria or a new line delimiter such as \n. This can be useful for data that comes into your delivery stream in a specific format but needs to be reformatted for the downstream analysis engine.
  • New line delimiter – Configure your delivery stream to add a new line delimiter between records within objects delivered to Amazon S3, such as \n.
  • Inline parsing for JSON – Specify data record parameters to be used as dynamic partitioning keys, and provide the expression that extracts each key’s value.

You can simply add your key/value pairs, then choose Apply dynamic partitioning keys to apply the partitioning scheme to your S3 bucket prefix. Keep in mind that you will also need to supply an error prefix for your S3 bucket before continuing.

Set your S3 buffering conditions appropriately for your use case. In my example, I’ve lowered the buffering to the minimum of 1 MiB of data delivered, or 60 seconds before delivering data to Amazon S3.

Keep the defaults for the remaining settings, and then choose Create delivery stream.

After data begins flowing through your pipeline, you will see data appear in S3 within the buffer interval, partitioned according to the configuration of your Kinesis Data Firehose delivery stream.

For delivery streams without the dynamic partitioning feature enabled, there will be one buffer across all incoming data. When data partitioning is enabled, Kinesis Data Firehose will have a buffer per partition, based on incoming records. The delivery stream will deliver each buffer of data as a single object when the size or interval limit has been reached, independent of other data partitions.

Lambda transformation of non-JSON records

If the data flowing through a Kinesis Data Firehose delivery stream is compressed, encrypted, or in any non-JSON file format, the dynamic partitioning feature won’t be able to parse individual fields by using the JSONPath syntax specified previously. To use the dynamic partitioning feature with non-JSON records, use the integrated Lambda function with Kinesis Data Firehose to transform the records and extract the fields needed to properly partition the data.

The following Lambda function will decode a user payload, extract the necessary fields for the Kinesis Data Firehose dynamic partitioning keys, and return a proper Kinesis Data Firehose record, with the partition keys encapsulated in the outer payload.

# This is an Amazon Kinesis Data Firehose stream processing Lambda function that  
# replays every read record from input to output  
# and extracts partition keys from the records.  
  
from __future__ import print_function  
import base64  
import json  
import datetime  
  
# Signature for all Lambda functions that user must implement  
def lambda_handler(firehose_records_input, context):  
      
    # Create return value.  
    firehose_records_output = {}  
    # Create result object.  
    firehose_records_output['records'] = []  
  
    print("\n")  
    # Go through records and process them.  
    for firehose_record_input in firehose_records_input['records']:  
  
        # Get user payload.  
        payload = base64.b64decode(firehose_record_input['data'])  
        jsonVal = json.loads(payload)  
          
        print("Record that was received")  
        print(jsonVal)  
        print("\n")  
        # Create output Firehose record and add modified payload and record ID to it.  
        firehose_record_output = {}  
        eventTimestamp = datetime.datetime.fromtimestamp(jsonVal['eventTimestamp'])  
        partitionKeys = {}  
        partitionKeys["customerId"] = jsonVal['customerId']  
        partitionKeys["year"] = eventTimestamp.strftime('%Y')  
        partitionKeys["month"] = eventTimestamp.strftime('%m')  
        partitionKeys["date"] = eventTimestamp.strftime('%d')  
        partitionKeys["hour"] = eventTimestamp.strftime('%H')  
        partitionKeys["minute"] = eventTimestamp.strftime('%M')  
  
          
        # Must set proper record ID.  
        firehose_record_output['recordId'] = firehose_record_input['recordId']  
        firehose_record_output['data'] = firehose_record_input['data']  
        firehose_record_output['result'] =  'Ok'  
        firehose_record_output['partitionKeys'] =  partitionKeys  
  
        # Add the record to the list of output records.  
        firehose_records_output['records'].append(firehose_record_output)  
  
    # At the end return processed records.  
    return firehose_records_output  

Using Lambda to extract the necessary fields for dynamic partitioning lets you keep your data compressed or encrypted while still dynamically partitioning it based on record fields.

Limits and quotas

Kinesis Data Firehose Dynamic Partitioning has a limit of 500 active partitions per delivery stream while it is actively buffering data; in other words, this is the number of partitions for which the delivery stream holds a buffer under the configured buffering hints. This limit is adjustable; if you want to increase it, submit a quota increase request through AWS Support.

Each new value that is determined by the JSONPath select query will result in a new partition in the Kinesis Data Firehose delivery stream. The partition has an associated buffer of data that will be delivered to Amazon S3 in the evaluated partition prefix. Upon delivery to Amazon S3, the buffer that previously held that data and the associated partition will be deleted and deducted from the active partitions count in Kinesis Data Firehose.

Consider the following records that were ingested to my delivery stream.

{"customer_id": "123"}
{"customer_id": "124"}
{"customer_id": "125"}

If I decide to use customer_id for my dynamic data partitioning and deliver records to different prefixes, I’ll have three active partitions if the records keep ingesting for all of my customers. When there are no more records for "customer_id": "123", Kinesis Data Firehose will delete the buffer and will keep only two active partitions.

If you exceed the maximum number of active partitions, the rest of the records in the delivery stream will be delivered to the S3 error prefix. For more information, see the Error Handling section of this blog post.

A maximum throughput of 25 MB per second is supported for each active partition. This limit is not adjustable. You can monitor the throughput with the new metric called PerPartitionThroughput.

Best practices

The right partitioning can help you save costs related to the amount of data scanned by analytics services like Amazon Athena. On the other hand, over-partitioning may lead to many smaller objects and erode the initial cost and performance benefits. See Top 10 Performance Tuning Tips for Amazon Athena.

We advise you to align your partitioning keys with your analysis queries downstream to promote compatibility between the two systems. At the same time, take into consideration how high cardinality can impact the dynamic partitioning active partition limit.

When you decide which fields to use for the dynamic data partitioning, it’s a fine balance between picking fields that match your business case and taking into consideration the partition count limits. You can monitor the number of active partitions with the new metric PartitionCount, as well as the number of partitions that have exceeded the limit with the metric PartitionCountExceeded.
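If you want to watch these metrics programmatically rather than in the console, a sketch like the following polls CloudWatch; it assumes the metrics are published in the AWS/Firehose namespace with the delivery stream name as a dimension, which you should confirm against the Firehose monitoring documentation.

import datetime
import boto3

cloudwatch = boto3.client('cloudwatch')

def max_metric_last_hour(delivery_stream_name, metric_name='PartitionCount'):
    # Return the maximum value of a dynamic partitioning metric over the last hour.
    now = datetime.datetime.utcnow()
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/Firehose',      # assumed namespace for these metrics
        MetricName=metric_name,        # e.g. PartitionCount or PartitionCountExceeded
        Dimensions=[{'Name': 'DeliveryStreamName', 'Value': delivery_stream_name}],
        StartTime=now - datetime.timedelta(hours=1),
        EndTime=now,
        Period=300,
        Statistics=['Maximum'],
    )
    return max((p['Maximum'] for p in response['Datapoints']), default=0)

print(max_metric_last_hour('clickstream-dynamic-partitioning'))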

Another way to optimize cost is to aggregate your events into a single PutRecord or PutRecordBatch API call. Kinesis Data Firehose is billed per GB of data ingested, calculated as the number of data records you send to the service multiplied by the size of each record rounded up to the nearest 5 KB, so putting more data into each ingestion call reduces cost.

Data partition functionality is run after data is de-aggregated, so each event will be sent to the corresponding Amazon S3 prefix based on the partitionKey field within each event.
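The following is a minimal sketch of this batching pattern with boto3; the stream name and events are placeholders, and each event keeps the customer_id field that dynamic partitioning relies on after deaggregation.

import json
import boto3

firehose = boto3.client('firehose')

# Placeholder events; each one carries the customer_id field used for partitioning.
events = [
    {'customer_id': '123', 'event': 'user_clicked_submit_button'},
    {'customer_id': '124', 'event': 'user_clicked_submit_button'},
    {'customer_id': '125', 'event': 'user_clicked_submit_button'},
]

# Each record is billed rounded up to the nearest 5 KB, so packing several small,
# newline-delimited events into one record reduces the billed volume. With multi
# record deaggregation enabled, Firehose splits them apart before partitioning.
aggregated_record = '\n'.join(json.dumps(e) for e in events) + '\n'

response = firehose.put_record_batch(
    DeliveryStreamName='clickstream-dynamic-partitioning',
    Records=[{'Data': aggregated_record.encode('utf-8')}],
)
print('Failed records:', response['FailedPutCount'])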

Error handling

Imagine that the following record enters your Kinesis Data Firehose delivery stream.

{"customerID": 1000}

When your dynamic partition query scans over this record, it will be unable to locate the specified key of customer_id, and will therefore result in an error. In this scenario, we suggest configuring an S3 error prefix when you create or modify your Kinesis Data Firehose delivery stream.

All failed records will be delivered to the error prefix. The records you might find there are the events without the field you specified as your partition key.

Cost and pricing examples

Kinesis Data Firehose Dynamic Partitioning is billed per GB of partitioned data delivered to S3, per object, and optionally per jq processing hour for data parsing. The cost can vary based on the AWS Region where you decide to create your stream.

For more information, see our pricing page.

Summary

In this post, we discussed the Kinesis Data Firehose Dynamic Partitioning feature, and explored the use cases where this feature can help improve pipeline performance. We also covered how to develop and optimize a Kinesis Data Firehose pipeline by using dynamic partitioning and the best practices around building a reliable delivery stream.

Kinesis Data Firehose dynamic partitioning will be available in all Regions at launch, and we urge you to try the new feature to see how it can simplify your delivery stream and query engine performance. Be sure to provide us with any feedback you have about the new feature.


About the Authors

Jeremy Ber has been working in the telemetry data space for the past 5 years as a Software Engineer, Machine Learning Engineer, and most recently a Data Engineer. In the past, Jeremy has supported and built systems that stream in terabytes of data per day and process complex machine learning algorithms in real time. At AWS, he is a Solutions Architect Streaming Specialist, supporting both Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon Kinesis.

Michael Greenshtein’s career started in software development and shifted to DevOps. Michael worked with AWS services to build complex data projects involving real-time, ETLs, and batch processing. Now he works in AWS as Solutions Architect in the Europe, Middle East, and Africa (EMEA) region, supporting a variety of customer use cases.

How US federal agencies can authenticate to AWS with multi-factor authentication

Post Syndicated from Kyle Hart original https://aws.amazon.com/blogs/security/how-us-federal-agencies-can-authenticate-to-aws-with-multi-factor-authentication/

This post is part of a series about how AWS can help your US federal agency meet the requirements of the President’s Executive Order on Improving the Nation’s Cybersecurity. We recognize that government agencies have varying degrees of identity management and cloud maturity and that the requirement to implement multi-factor, risk-based authentication across an entire enterprise is a vast undertaking. This post specifically focuses on how you can use AWS information security practices to help meet the requirement to “establish multi-factor, risk-based authentication and conditional access across the enterprise” as it applies to your AWS environment.

This post focuses on best practices for enterprise authentication to AWS, specifically federated access via an existing enterprise identity provider (IdP).

Many federal customers use authentication factors on their Personal Identity Verification (PIV) or Common Access Cards (CAC) to authenticate to an existing enterprise identity service that supports Security Assertion Markup Language (SAML); the SAML assertion is then used to grant user access to AWS. SAML is an industry-standard protocol, and most IdPs support a range of authentication methods, so if you’re not using a PIV or CAC, the concepts will still work for your organization’s multi-factor authentication (MFA) requirements.

Accessing AWS with MFA

There are two categories we want to look at for authentication to AWS services:

  1. AWS APIs, which include access through the AWS Management Console, the AWS Command Line Interface (AWS CLI), and the AWS SDKs.
  2. Resources you launch that are running within your AWS VPC, which can include database engines or operating system environments such as Amazon Elastic Compute Cloud (Amazon EC2) instances, Amazon WorkSpaces, or Amazon AppStream 2.0.

There is also a third category of services where authentication occurs in AWS that is beyond the scope of this post: applications that you build on AWS that authenticate internal or external end users to those applications. For this category, multi-factor authentication is still important, but will vary based on the specifics of the application architecture. Workloads that sit behind an AWS Application Load Balancer can use the ALB to authenticate users with either OpenID Connect or a SAML IdP that enforces MFA upstream.

MFA for the AWS APIs

AWS recommends that you use SAML and an IdP that enforces MFA as your means of granting users access to AWS. Many government customers achieve AWS federated authentication with Active Directory Federation Services (AD FS). The IdP used by our federal government customers should enforce usage of CAC/PIV to achieve MFA and be the sole means of access to AWS.

Federated authentication uses SAML to assume an AWS Identity and Access Management (IAM) role for access to AWS resources. A role doesn’t have standard long-term credentials such as a password or access keys associated with it. Instead, when you assume a role, it provides you with temporary security credentials for your role session.
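To illustrate the mechanics, the following boto3 sketch exchanges a SAML assertion, obtained from your MFA-enforcing IdP, for temporary role credentials; the role and provider ARNs are placeholders.

import boto3

# Placeholder ARNs; the SAML assertion is the base64-encoded response returned
# by your IdP after the user completes multi-factor authentication.
ROLE_ARN = 'arn:aws:iam::111122223333:role/FederatedAnalystRole'
SAML_PROVIDER_ARN = 'arn:aws:iam::111122223333:saml-provider/EnterpriseIdP'

sts = boto3.client('sts')

def credentials_from_saml(saml_assertion_b64):
    # Exchange the SAML assertion for temporary credentials for the federated role.
    response = sts.assume_role_with_saml(
        RoleArn=ROLE_ARN,
        PrincipalArn=SAML_PROVIDER_ARN,
        SAMLAssertion=saml_assertion_b64,
        DurationSeconds=3600,
    )
    return response['Credentials']  # AccessKeyId, SecretAccessKey, SessionToken, Expiration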

AWS accounts in all AWS Regions, including AWS GovCloud (US) Regions, have the same authentication options for IAM roles through identity federation with a SAML IdP. The AWS Single Sign-On (AWS SSO) service is another way to implement federated authentication to the AWS APIs in Regions where it is available.

MFA for AWS CLI access

In AWS Regions excluding AWS GovCloud (US), you can consider using the AWS CloudShell service, which is an interactive shell environment that runs in your web browser and uses the same authentication pipeline that you use to access the AWS Management Console—thus inheriting MFA enforcement from your SAML IdP.

If you need to use federated authentication with MFA for the CLI on your own workstation, you’ll need to retrieve and present the SAML assertion token. For information about how you can do this in Windows environments, see the blog post How to Set Up Federated API Access to AWS by Using Windows PowerShell. For information about how to do this with Python, see How to Implement a General Solution for Federated API/CLI Access Using SAML 2.0.

Conditional access

IAM permissions policies support conditional access. Common use cases include allowing certain actions only from a specified, trusted range of IP addresses; granting access only to specified AWS Regions; and granting access only to resources with specific tags. You should create your IAM policies to provide least-privilege access across a number of attributes. For example, you can grant an administrator access to launch or terminate an EC2 instance only if the request originates from a certain IP address and is tagged with an appropriate tag.

You can also implement conditional access controls using SAML session tags provided by their IdP and passed through the SAML assertion to be consumed by AWS. This means two separate users from separate departments can assume the same IAM role but have tailored, dynamic permissions. As an example, the SAML IdP can provide each individual’s cost center as a session tag on the role assertion. IAM policy statements can be written to allow the user from cost center A the ability to administer resources from cost center A, but not resources from cost center B.
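The following sketch shows what such a conditional policy might look like, combining a trusted IP range with a cost-center session tag; the tag key, CIDR range, account ID, and actions are assumptions for illustration, not a recommended policy.

import json
import boto3

# EC2 start/stop is allowed only from a trusted CIDR range and only on instances
# whose cost-center tag matches the caller's session tag (values are placeholders).
policy_document = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Action': ['ec2:StartInstances', 'ec2:StopInstances'],
        'Resource': 'arn:aws:ec2:us-east-1:111122223333:instance/*',
        'Condition': {
            'IpAddress': {'aws:SourceIp': '203.0.113.0/24'},
            'StringEquals': {
                'ec2:ResourceTag/cost-center': '${aws:PrincipalTag/cost-center}'
            },
        },
    }],
}

iam = boto3.client('iam')
iam.create_policy(
    PolicyName='ConditionalEc2Access',
    PolicyDocument=json.dumps(policy_document),
)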

Many customers ask about how to limit control plane access to certain IP addresses. AWS supports this, but there is an important caveat to highlight. Some AWS services, such as AWS CloudFormation, perform actions on behalf of an authorized user or role, and execute from within the AWS cloud’s own IP address ranges. See this document for an example of a policy statement using the aws:ViaAWSService condition key to exclude AWS services from your IP address restrictions to avoid unexpected authorization failures.

Multi-factor authentication to resources you launch

You can launch resources such as Amazon WorkSpaces, AppStream 2.0, Amazon Redshift, and EC2 instances that you configure to require MFA. The Amazon WorkSpaces Streaming Protocol (WSP) supports CAC/PIV authentication for pre-authentication, and in-session access to the smart card. For more information, see Use smart cards for authentication. To see a short video of it in action, see the blog post Amazon WorkSpaces supports CAC/PIV smart card authentication. Amazon Redshift and AppStream 2.0 support SAML 2.0 natively, so you can configure those services to work with your SAML IdP similarly to how you configure AWS Management Console access and inherit the MFA enforced by the upstream IdP.

MFA access to EC2 instances can occur via the existing methods and enterprise directories used in your on-premises environments. You can also implement other systems that enforce MFA access to an operating system, such as RADIUS or other third-party directory or MFA token solutions.

Shell access with Systems Manager Session Manager

An alternative method for MFA for shell access to EC2 instances is to use the Session Manager feature of AWS Systems Manager. Session Manager uses the Systems Manager management agent to provide role-based access to a shell (PowerShell on Windows) on an instance. Users can access Session Manager from the AWS Console or from the command line with the Session Manager AWS CLI plugin. Similar to using CloudShell for CLI access, accessing EC2 hosts via Session Manager uses the same authentication pipeline you use for accessing the AWS control plane. Your interactive session on that host can be configured for audit logging.

Security best practices in IAM

The focus of this blog is on integrating an agency’s existing MFA-enabled enterprise authentication service; but to make it easier for you to view the entire security picture, you might be interested in IAM security best practices. You can enforce these best-practice security configurations with AWS Organizations Service Control Policies.

Conclusion

This post covered methods your federal agency should consider in your efforts to apply the multi-factor authentication (MFA) requirements in the Executive Order on Improving the Nation’s Cybersecurity to your AWS environment. To learn more about how AWS can help you meet the requirements of the executive order, see the other posts in this series:

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Kyle Hart

Kyle is a Principal Solutions Architect supporting US federal government customers in the Washington, D.C. area.

How MOIA built a fully automated GDPR compliant data lake using AWS Lake Formation, AWS Glue, and AWS CodePipeline

Post Syndicated from Leonardo Pêpe original https://aws.amazon.com/blogs/big-data/how-moia-built-a-fully-automated-gdpr-compliant-data-lake-using-aws-lake-formation-aws-glue-and-aws-codepipeline/

This is a guest blog post co-written by Leonardo Pêpe, a Data Engineer at MOIA.

MOIA is an independent company of the Volkswagen Group with locations in Berlin and Hamburg, and operates its own ride pooling services in Hamburg and Hanover. The company was founded in 2016 and develops mobility services independently or in partnership with cities and existing transport systems. MOIA’s focus is on ride pooling and the holistic development of the software and hardware for it. In October 2017, MOIA started a pilot project in Hanover to test a ride pooling service, which was brought into public operation in July 2018. MOIA covers the entire value chain in the area of ride pooling. MOIA has developed a ridesharing system to avoid individual car traffic and use the road infrastructure more efficiently.

In this post, we discuss how MOIA uses AWS Lake Formation, AWS Glue, and AWS CodePipeline to store and process gigabytes of data on a daily basis to serve 20 different teams with individual user data access and implement fine-grained control of the data lake to comply with General Data Protection Regulation (GDPR) guidelines. This involves controlling access to data at a granular level. The solution enables MOIA’s fast pace of innovation to automatically adapt user permissions to new tables and datasets as they become available.

Background

Each MOIA vehicle can carry six passengers. Customers interact with the MOIA app to book a trip, cancel a trip, and give feedback. The highly distributed system prepares multiple offers to reach their destination with different pickup points and prices. Customers select an option and are picked up from their chosen location. All interactions between the customers and the app, as well as all the interactions between internal components and systems (the backend’s and vehicle’s IoT components), are sent to MOIA’s data lake.

Data from the vehicle, app, and backend must be centralized to have the synchronization between trips planned and implemented, and then to collect passenger feedback. To provide different pricing and routing options, MOIA needed centralized data. MOIA decided to build and secure its Amazon Simple Storage Service (Amazon S3) based Data Lake using AWS Lake Formation.

Different MOIA teams, including data analysts, data scientists, and data engineers, need to access centralized data from different sources for the development and operations of the application workloads. It’s a legal requirement to control the access to and the format of the data for these different teams. The app development team needs to understand customer feedback in an anonymized way, pricing-related data must be accessed only by the business analytics team, vehicle data is meant to be used only by the vehicle maintenance team, and the routing team needs access to customer location and destination.

Architecture overview

The following diagram illustrates MOIA solution architecture.

The solution has the following components:

  1. Data input – The apps, backend, and IoT devices are continuously streaming event messages via Amazon Kinesis Data Streams in a nested information in a JSON format.
  2. Data transformation – When the raw data is received, the data pipeline orchestrator (Apache Airflow) starts the data transformation jobs to meet requirements for respective data formats. Amazon EMR, AWS Lambda, or AWS Fargate cleans, transforms, and loads the events in the S3 data lake.
  3. Persistence layer – After the data is transformed and normalized, AWS Glue crawlers classify the data to determine the format, schema, and associated properties. They group data into tables or partitions and write metadata to the AWS Glue Data Catalog based on the structured partitions created in Amazon S3. The data is written on Amazon S3 partitioned by its event type and timestamp so it can be easily accessed via Amazon Athena and the correct permissions can be applied using Lake Formation. The crawlers are run at a fixed interval in accordance with the batch jobs, so no manual runs are required.
  4. Governance layer – Access on the data lake is managed using the Lake Formation governance layer. MOIA uses Lake Formation to centrally define security, governance, and auditing policies in one place. These policies are consistently implemented, which eliminates the need to manually configure them across security services like AWS Identity and Access Management (IAM), AWS Key Management Service (AWS KMS), storage services like Amazon S3, or analytics and machine learning (ML) services like Amazon Redshift, Athena, Amazon EMR for Apache Spark. Lake Formation reduces the effort in configuring policies across services and provides consistent assistance for GDPR requirements and compliance. When a user makes a request to access Data Catalog resources or underlying data from the data lake, for the request to succeed, it must pass permission checks by both IAM and Lake Formation. Lake Formation permissions control access to Data Catalog resources, Amazon S3 locations, and the underlying data at those locations. User permissions to the data on the table, row, or column level are defined in Lake Formation, so that users only access the data if they’re authorized.
  5. Data access layer – After the flattened and structured data is available to be accessed by the data analysts, scientists, and engineers they can use Amazon Redshift or Athena to perform SQL analysis, or AWS Data Wrangler and Apache Spark to perform programmatic analysis and build ML use cases.
  6. Infrastructure orchestration and data pipeline – After the data is ingested and transformed, the infrastructure orchestration layer (CodePipeline and Lake Formation) creates or updates the governance layer with the proper and predefined permissions periodically every hour (see more about the automation in the governance layer later in this post).

Governance challenge

MOIA wants to continuously evolve its ML models for routing, demand prediction, and business models. This requires MOIA to constantly review and update the models; therefore, power users such as data administrators and engineers frequently redesign the table schemas in the AWS Glue Data Catalog as part of the data engineering workflow. This highly dynamic metadata transformation requires an equally dynamic governance layer pipeline that can assign the right user permissions to all tables and adapt to these changes transparently, without disruptions to end users. Because of GDPR requirements there is no room for error, yet assigning GDPR-compliant permissions on the tables of a data lake holding many terabytes of data would otherwise require manual work. Without automation, many developers are needed for administration; this adds human error into the workflows, which is not acceptable. Manual administration and access management isn’t a scalable solution. MOIA needed to innovate faster while staying GDPR compliant with a small team of developers.

Data schema and data structure often changes at MOIA, resulting in new tables being created in the data lake. To guarantee that new tables inherit the permissions granted to the same group of users who already have access to the entire database, MOIA uses an automated process that grants Lake Formation permission to newly created tables, columns, and databases, as described in the next section.

Solution overview

The following diagram illustrates the continuous deployment loop using AWS CloudFormation.

The workflow contains the following steps:

  1. MOIA data scientists and engineers create or modify new tables to evolve the business and ML models. This results in new tables or changes in the existing schema of the table.
  2. The data pipeline is orchestrated via Apache Airflow, which controls the whole cycle of ingestion, cleansing, and data transformation using Spark jobs (as shown in Step 1). When the data is in its desired format and state, Apache Airflow triggers the AWS Glue crawler job to discover the data schema in the S3 bucket and add the new partitions to the Data Catalog. Apache Airflow also triggers CodePipeline to assign the right permissions on all the tables.
  3. Apache Airflow triggers CodePipeline every hour, which uses MOIA’s libraries to discover the new tables, columns, and databases, and grants Lake Formation permissions in the AWS CloudFormation template for all tables, including new and modified. This orchestration layer ensures Lake Formation permissions are granted. Without this step, nobody can access new tables, and new data isn’t visible to data scientists, business analysts, and others who need it.
  4. MOIA uses stacker and troposphere to identify all the tables, assign the right permissions to them, and deploy the CloudFormation stack. Troposphere is the infrastructure as code (IaC) framework that renders the CloudFormation templates, and stacker helps with the parameterization and deployment of the templates on AWS. Stacker also helps produce reusable templates in the form of blueprints as pure Python code, which can be easily parameterized using YAML files. The Python code uses Boto3 clients to look up the Data Catalog and search for databases and tables.
  5. The stacker blueprint libraries developed by MOIA (which internally use Boto3 and troposphere) are used to discover newly created databases or tables in the Data Catalog.
  6. A permissions configuration file allows MOIA to predefine data lake tables (this can be a wildcard indicating all tables available in one database) and the specific personas who have access to these tables. These roles are split into different categories called lake personas, and have specific access levels. Example lake personas include data domain engineers, data domain technical users, data administrators, and data analysts. Users and their permissions are defined in the file. In case access to personally identifiable information (PII) data is granted, the GDPR officer ensures that a record of processing activities is in place and the usage of personal data is documented according to the law.
  7. MOIA uses stacker to read the predefined permissions file, and uses the customized stacker blueprints library containing the logic to assign permissions to lake personas for each of the tables discovered in Step 5.
  8. The discovered and modified tables are matched with the permissions predefined in Step 6 and Step 7. Stacker uses YAML format as the input for the CloudFormation template parameters to define users and data access permissions. These parameters are related to the users’ roles and define access to the tables. Based on this, MOIA’s customized stacker blueprints create AWS CloudFormation resources dynamically. The following are example stacks in AWS CloudFormation:
    1. data-lake-domain-database – Holds all resources that are relevant during the creation of the database, which includes lake permissions at the database level that allow the domain database admin personas to perform operations in the database.
    2. data-lake-domain-crawler – Scans Amazon S3 constantly to discover and create new tables and updates.
    3. data-lake-lake-table-permissions – Uses Boto3 combined with troposphere to list the available tables and grant lake permissions to the domain roles that access it.

This way, new CloudFormation templates are created, containing permissions for new or modified tables. This process guarantees a fully automated governance layer for the data lake. The generated CloudFormation template contains Lake Formation permission resources for each table, database, or column. The process of managing Lake Formation permissions on Data Catalog databases, tables, and columns is simplified by granting Data Catalog permissions using the Lake Formation tag-based access control method. The advantage of generating a CloudFormation template is auditability. When the new version of the CloudFormation stack is prepared with an access control set on new or modified tables, administrators can compare that stack with the older version to discover newly prepared and modified tables. MOIA can view the differences via the AWS CloudFormation console before new stack deployment. (A simplified sketch of the discover-and-grant idea, using the AWS APIs directly, follows the remaining steps below.)

  9. Either the CloudFormation stack is deployed in the account with the right permissions for all users, or a stack deployment failure notice is sent to the developers via a Slack channel (Step 10). This ensures visibility in the deployment process and in failed permission-assignment pipelines.
  10. Whenever the pipeline fails, MOIA receives the notification via Slack so the engineers can promptly fix the errors. This loop is very important to guarantee that whenever the schema changes, those changes are reflected in the data lake without manual intervention.
  11. Following this automated pipeline, all the tables are assigned the right permissions.
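MOIA renders these grants as CloudFormation resources with troposphere and stacker; as a simplified sketch of the same discover-and-grant idea, the following uses the AWS Glue and Lake Formation APIs directly. The database name, principal ARN, and permission set are placeholders.

import boto3

glue = boto3.client('glue')
lakeformation = boto3.client('lakeformation')

DATABASE = 'ride_pooling_events'                                   # placeholder database
ANALYST_ROLE_ARN = 'arn:aws:iam::111122223333:role/data-analyst'   # placeholder lake persona

def grant_select_on_all_tables(database, principal_arn):
    # Discover every table in the database and grant SELECT to the given principal.
    paginator = glue.get_paginator('get_tables')
    for page in paginator.paginate(DatabaseName=database):
        for table in page['TableList']:
            lakeformation.grant_permissions(
                Principal={'DataLakePrincipalIdentifier': principal_arn},
                Resource={'Table': {'DatabaseName': database, 'Name': table['Name']}},
                Permissions=['SELECT'],
            )

grant_select_on_all_tables(DATABASE, ANALYST_ROLE_ARN)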

Benefits

This solution delivers the following benefits:

  • Automated enforcement of data access permissions allows people to access the data without manual interventions in a scalable way. With automation, 1,000 hours of manual work are saved every year.
  • The GDPR team does the assessment internally every month. The GDPR team constantly provides guidance and approves each change in permissions. This audit trail for records of processing activities is automated with the help of Lake Formation (otherwise it needs a dedicated human resource).
  • The automated workflow of permission assignment is integrated into existing CI/CD processes, resulting in faster onboarding of new teams, features, and dataset releases. An average of 48 releases are done each month (including major and minor version releases and new event types). Onboarding new teams and forming internal new teams is very easy now.
  • This solution enables you to create a data lake using IaC processes that are easy and GDPR-compliant.

Conclusion

MOIA has created scalable, automated, and versioned permissions with a GDPR-supported, governed data lake using Lake Formation. This solution helps them bring new features and models to market faster, and reduces administrative and repetitive tasks. MOIA can focus on 48 average releases every month, contributing to a great customer experience and new data insights.


About the Authors

Leonardo Pêpe is a Data Engineer at MOIA. With a strong background in infrastructure and application support and operations, he is immersed in the DevOps philosophy. He’s helping MOIA build automated solutions for its data platform and enabling the teams to be more data-driven and agile. Outside of MOIA, Leonardo enjoys nature, Jiu-Jitsu and martial arts, and explores the good of life with his family.

 

Sushant Dhamnekar is a Solutions Architect at AWS. As a trusted advisor, Sushant helps automotive customers to build highly scalable, flexible, and resilient cloud architectures, and helps them follow the best practices around advanced cloud-based solutions. Outside of work, Sushant enjoys hiking, food, travel, and CrossFit workouts.

 

 

Shiv Narayanan is Global Business Development Manager for Data Lakes and Analytics solutions at AWS. He works with AWS customers across the globe to strategize, build, develop and deploy modern data platforms. Shiv loves music, travel, food and trying out new tech.

Practical Entity Resolution on AWS to Reconcile Data in the Real World

Post Syndicated from David Amatulli original https://aws.amazon.com/blogs/architecture/practical-entity-resolution-on-aws-to-reconcile-data-in-the-real-world/

This post was co-written with Mamoon Chowdry, Solutions Architect, previously at AWS.

Businesses and organizations from many industries often struggle to ensure that their data is accurate. Data often has to match people or things exactly in the real world, such as a customer name, an address, or a company. Matching our data is important to validate it, de-duplicate it, or link records in different systems together. Know Your Customer (KYC) regulations also mean that we must be confident in who or what our data is referring to. We must match millions of records from different data sources. Some of that data may have been entered manually and contain inconsistencies.

It can often be hard to match data with the entity it is supposed to represent. For example, if a customer enters their details as, “Mr. John Doe, #1a 123 Main St.“ and you have a prior record in your customer database for ”J. Doe, Apt 1A, 123 Main Street“, are they referring to the same or a different person?

In cases like this, we often have to manually update our data to make sure it accurately and consistently matches a real-world entity. You may want to have consistent company names across a list of business contacts. When there isn’t an exact match, we have to reconcile our data with the available facts we know about that entity. This reconciliation is commonly referred to as entity resolution (ER). This process can be labor-intensive and error-prone.

This blog will explore some of the common types of ER. We will share a basic architectural pattern for near real-time ER processing. You will see how ER using fuzzy text matching can reconcile manually entered names with reference data.

Multiple ways to do entity resolution

Entity resolution is a broad and deep topic, and a complete discussion would be beyond the scope of this blog. However, at a high level there are four common approaches to matching ambiguous fields or records, to known entities.

    1. Fuzzy text matching. We might normally compare two strings to see if they are identical. If they don’t exactly match, it is often helpful to find the nearest match. We do this by calculating a similarity score. For example, “John Doe” and “J Doe” may have a similarity score of 80%. A common way to compare the similarity of two strings is to use the Levenshtein distance, which measures the edit distance between two sequences. (A short Python sketch of this approach follows this list.)

We may also examine more than one field. For example, we may compare a name and address. Is “Mr. J Doe, 123 Main St” likely to be the same person as “Mr John Doe, 123 Main Street”? If we compare multiple fields in a record and analyze all of their similarity scores, this is commonly called Pairwise comparison.

2. Clustering. We can plot records in an n-dimensional space based on values computed from their fields. Their similarity to other reference records is then measured by calculating how close they are to each other in that space. Those that are clustered together are likely to refer to the same entity. Clustering is an effective method for grouping or segmenting data for computer vision, astronomy, or market segmentation. An example of this method is K-means clustering.

3. Graph networks. Graph networks are commonly used to store relationships between entities, such as people who are friends with each other, or residents of a particular address. When we need to resolve an ambiguous record, we can use a graph database to identify potential relationships to other records. For example, “J Doe, 123 Main St,” may be the same as “John Doe, 123 Main St,” because they have the same address and similar names.

Graph networks are especially helpful when dealing with complex relationships over millions of entities. For example, you can build a customer profile using web server logs and other data.

4. Commercial off-the-shelf (COTS) software. Enterprises can also deploy ER software, such as these offerings from the AWS Marketplace and Senzing entity resolution. This is helpful when companies may not have the skill or experience to implement a solution themselves. It is important to mention the role of Master Data Management (MDM) with ER. MDM involves having a single trusted source for your data. Tools, such as Informatica, can help ER with their MDM features.
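To make the fuzzy text matching approach concrete, here is a minimal sketch that uses only the Python standard library; SequenceMatcher's ratio is a simple similarity score rather than a true Levenshtein distance, but it illustrates picking the nearest match above a threshold.

from difflib import SequenceMatcher

def similarity(a, b):
    # Return a similarity score between 0 and 1 (case-insensitive).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(candidate, reference_names, threshold=0.6):
    # Return the closest reference name, or None if nothing clears the threshold.
    score, match = max((similarity(candidate, ref), ref) for ref in reference_names)
    return (match, score) if score >= threshold else (None, score)

reference = ['John Doe', 'Mary Major', 'Richard Roe']
print(best_match('J Doe', reference))   # roughly ('John Doe', 0.77)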

Our solution (shown in Figure 1) allows us to build a low-cost, streamlined solution using AWS serverless technology. The architecture uses AWS Lambda, which allows you to run code without having to provision or manage servers. This code will be invoked through an API, which is created with Amazon API Gateway. API Gateway is a fully managed service used by developers to create, publish, maintain, monitor, and secure API operations at any scale. Finally, we will store our reference data in Amazon Simple Storage Service (S3).

Entity resolution solution using AWS serverless services

We initially match manually entered strings to a list of reference strings. The strings we will try to match will be names of companies.

Figure 1. Example request dataflow through AWS

  1. Our API takes a string as input
  2. It then invokes the ER Lambda function
  3. This loads the index and data files of our reference dataset
  4. The ER finds the closest match in the list of real-world companies
  5. The closest match is returned

The reference data and index files were created ahead of time by the fuzzy match algorithm described in the next section and exported for the Lambda function to load.

The fuzzy match algorithm in detail

The algorithm in the AWS Lambda function works by converting each string to a collection of n-grams. N-grams are smaller substrings that are commonly used for analyzing free-form text.

The n-grams are then converted to a simple vector. Each vector contains numerical statistics that represent the term frequency–inverse document frequency (TF-IDF) of the n-grams. Both TF-IDF and n-grams are used to prepare text for searching. N-grams of strings that are similar in nature tend to have similar TF-IDF vectors. We can plot these vectors in a chart, which helps us find similar strings because they are grouped or clustered together.

Comparing vectors to find similar strings can be fairly straightforward. But if you have numerous records, it can be computationally expensive and slow. To solve this, we use the NMSLIB library. This library indexes the vectors for faster similarity searching. It also gives us the degree of similarity between two strings. This is important because we may want to know the accuracy of a match we have found. For example, it can be helpful to filter out weak matches.
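The Lambda code that follows loads a prebuilt index and reuses a fitted vectorizer. A minimal sketch of how those artifacts might be produced offline is shown here; it assumes scikit-learn for the character n-gram TF-IDF vectors and the nmslib Python package, and the file names and reference list are placeholders.

import pickle

import nmslib
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder reference data; in practice this comes from the reference dataset in S3.
orgNames = ['Example Corp', 'Example Corporation', 'AnyCompany Ltd', 'AnyCompany Limited']

# Character n-grams (2-3 characters) turn each name into a sparse TF-IDF vector.
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 3))
tfidfMatrix = vectorizer.fit_transform(orgNames)

# Index the sparse vectors with NMSLIB's NAPP method for fast similarity search.
index = nmslib.init(method='napp', space='negdotprod_sparse_fast',
                    data_type=nmslib.DataType.SPARSE_VECTOR)
index.addDataPointBatch(tfidfMatrix)
index.createIndex()   # NAPP-specific parameters (for example numPivot) can be tuned here

# Persist the index and the fitted vectorizer so the Lambda function can load them.
index.saveIndex('index_company_names.bin', save_data=True)
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)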

The entity resolution Lambda

Using the NMSLIB library, which is loaded using Lambda layers, we initialize an index using Neighborhood APProximation (NAPP).

import nmslib

# initialize the index
newIndex = nmslib.init(method='napp', space='negdotprod_sparse_fast',
                       data_type=nmslib.DataType.SPARSE_VECTOR)

Next we imported the index and data files that were created from our reference data.

# load the index file (DATA_DIR points at the bundled reference data)
newIndex.loadIndex(DATA_DIR + 'index_company_names.bin',
                   load_data=True)

The input parameter companyName is then used to query the index to find the approximate nearest neighbor. By using the knnQueryBatch method, we distribute the work over a thread pool, which provides faster querying.

# set the input variable and empty output list
inputString = companyName
outputList = []

# Find the nearest neighbor for our company name
# (K is the number of matches, set to 1)
K = 1
newQueryMatrix = vectorizer.transform([inputString])
newNbrs = newIndex.knnQueryBatch(newQueryMatrix, k=K, num_threads=numThreads)

The best match is then returned as a JSON response.

# return the match
for i in range(K):
    outputList.append(orgNames[newNbrs[0][0][i]])

return {
    'statusCode': '200',
    'body': json.dumps(outputList),
}

Cost estimate for solution

Our solution is a combination of Amazon API Gateway, AWS Lambda, and Amazon S3 (hyperlinks are to pricing pages). As an example, let’s assume that the API will receive 10 million requests per month. We can estimate the costs of running the solution as:

Service            | Description                                                                                               | Cost
AWS Lambda         | 10 million requests and associated compute costs                                                          | $161.80
Amazon API Gateway | HTTP API requests, average request size (34 KB), average message size (32 KB), 10 million requests/month | $10.00
Amazon S3          | S3 Standard storage (including data transfer costs)                                                       | $7.61
Total              |                                                                                                           | $179.41

Table 1. Example monthly cost estimate (USD)

Conclusion

Using AWS services to reconcile your data with real-world entities helps make your data more accurate and consistent. You can automate a manual task that could have been laborious, expensive, and error-prone.

Where can you use ER in your organization? Do you have manually entered or inaccurate data? Have you struggled to match it with real-world entities? You can experiment with this architecture to continue to improve the accuracy of your own data.

Further reading:

Welcome to AWS Storage Day 2021

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/welcome-to-aws-storage-day-2021/

Welcome to the third annual AWS Storage Day 2021! During Storage Day 2020 and the first-ever Storage Day 2019, we made many impactful announcements for our customers, and this year will be no different. The one-day, free AWS Storage Day 2021 virtual event will be hosted on the AWS channel on Twitch. You’ll hear from experts about announcements, leadership insights, and educational content related to AWS Storage services.

The first part of the day is the leadership track. Wayne Duso, VP of Storage, Edge, and Data Governance, will be presenting a live keynote. He’ll share information about what’s new in AWS Cloud Storage and how these services can help businesses increase agility and accelerate innovation. The keynote will be followed by live interviews with the AWS Storage leadership team, including Mai-Lan Tomsen Bukovec, VP of AWS Block and Object Storage.

The second part of the day is a technical track in which you’ll learn more about Amazon Simple Storage Service (Amazon S3), Amazon Elastic Block Store (EBS), Amazon Elastic File System (Amazon EFS), AWS Backup, Cloud Data Migration, AWS Transfer Family and Amazon FSx.

To register for the event, visit the AWS Storage Day 2021 event page.

Now as Jeff Barr likes to say, let’s get into the announcements.

Amazon FSx for NetApp ONTAP
Today, we are pleased to announce Amazon FSx for NetApp ONTAP, a new storage service that allows you to launch and run fully managed NetApp ONTAP file systems in the cloud. Amazon FSx for NetApp ONTAP joins Amazon FSx for Lustre and Amazon FSx for Windows File Server as the newest file system offered by Amazon FSx.

Amazon FSx for NetApp ONTAP provides the full ONTAP experience with capabilities and APIs that make it easy to run applications that rely on NetApp or network-attached storage (NAS) appliances on AWS without changing your application code or how you manage your data. To learn more, read New – Amazon FSx for NetApp ONTAP.

Amazon S3
Amazon S3 Multi-Region Access Points is a new S3 feature that allows you to define global endpoints that span buckets in multiple AWS Regions. Using this feature, you can now build multi-region applications without adding complexity to your applications, with the same system architecture as if you were using a single AWS Region.

S3 Multi-Region Access Points is built on top of AWS Global Accelerator and routes S3 requests over the global AWS network. S3 Multi-Region Access Points dynamically routes your requests to the lowest latency copy of your data, so the upload and download performance can increase by 60 percent. It’s a great solution for applications that rely on reading files from S3 and also for applications like autonomous vehicles that need to write a lot of data to S3. To learn more about this new launch, read How to Accelerate Performance and Availability of Multi-Region Applications with Amazon S3 Multi-Region Access Points.

Creating a multi-region access point

There’s also great news about the Amazon S3 Intelligent-Tiering storage class! The conditions of usage have been updated. There is no longer a minimum storage duration for all objects stored in S3 Intelligent-Tiering, and monitoring and automation charges for objects smaller than 128 KB have been removed. Smaller objects (128 KB or less) are not eligible for auto-tiering when stored in S3 Intelligent-Tiering. Now that there is no monitoring and automation charge for small objects and no minimum storage duration, you can use the S3 Intelligent-Tiering storage class by default for all your workloads with unknown or changing access patterns. To learn more about this announcement, read Amazon S3 Intelligent-Tiering – Improved Cost Optimizations for Short-Lived and Small Objects.

Amazon EFS
Amazon EFS Intelligent-Tiering is a new capability that makes it easier to optimize costs for shared file storage when access patterns change. When you enable Amazon EFS Intelligent-Tiering, it will store the files in the appropriate storage class at the right time. For example, if you have a file that is not used for a period of time, EFS Intelligent-Tiering will move the file to the Infrequent Access (IA) storage class. If the file is accessed again, Intelligent-Tiering will automatically move it back to the Standard storage class.

To get started with Intelligent-Tiering, enable lifecycle management in a new or existing file system and choose a lifecycle policy to automatically transition files between different storage classes. Amazon EFS Intelligent-Tiering is perfect for workloads with changing or unknown access patterns, such as machine learning inference and training, analytics, content management and media assets. To learn more about this launch, read Amazon EFS Intelligent-Tiering Optimizes Costs for Workloads with Changing Access Patterns.

AWS Backup
AWS Backup Audit Manager allows you to simplify data governance and compliance management of your backups across supported AWS services. It provides customizable controls and parameters, like backup frequency or retention period. You can also audit your backups to see if they satisfy your organizational and regulatory requirements. If one of your monitored backups drifts from your predefined parameters, AWS Backup Audit Manager will let you know so you can take corrective action. This new feature also enables you to generate reports to share with auditors and regulators. To learn more, read How to Monitor, Evaluate, and Demonstrate Backup Compliance with AWS Backup Audit Manager.

Amazon EBS
Amazon EBS direct APIs now support creating 64 TB EBS Snapshots directly from any block storage data, including on-premises. This was increased from 16 TB to 64 TB, allowing customers to create the largest snapshots and recover them to Amazon EBS io2 Block Express Volumes. To learn more, read Amazon EBS direct API documentation.

AWS Transfer Family
AWS Transfer Family Managed Workflows is a new feature that allows you to reduce the manual tasks of preprocessing your data. Managed Workflows does a lot of the heavy lifting for you, like setting up the infrastructure to run your code upon file arrival, continuously monitoring for errors, and verifying that all the changes to the data are logged. Managed Workflows helps you handle error scenarios so that failsafe modes trigger when needed.

AWS Transfer Family Managed Workflows allows you to configure all the necessary tasks at once so that tasks can automatically run in the background. Managed Workflows is available today in the AWS Transfer Family Management Console. To learn more, read Transfer Family FAQ.

Storage Day 2021: Join us online for more!
Don’t forget to register and join us for the AWS Storage Day 2021 virtual event. The event will be live at 8:30 AM Pacific Time (11:30 AM Eastern Time) on September 2. The event will immediately re-stream for the Asia-Pacific audience with live Q&A moderators on Friday, September 3, at 8:30 AM Singapore Time. All sessions will be available on demand next week.

We look forward to seeing you there!

Marcia

Introducing the Ransomware Economy

Post Syndicated from Molly Clancy original https://www.backblaze.com/blog/ransomware-economy/

Ransomware skull and code symbols

Ransomware continues to proliferate for a simple reason—it’s profitable. And it’s profitable not just for the ransomware developers themselves—they’re just one part of the equation—but for a whole ecosystem of players who make up the ransomware economy. To understand the threats to small and medium-sized businesses (SMBs) and organizations today, it’s important to understand the scope and scale of what you’re up against.

Today, we’re digging into how the ransomware economy operates, including the broader ecosystem and the players involved, emerging threats to SMBs, and the overall financial footprint of ransomware worldwide.

This post is a part of our ongoing series on ransomware. Take a look at our other posts for more information on how businesses can defend themselves against a ransomware attack, and more.

Top Ransomware Syndicates in Operation Today

Cybercriminals have long been described as operating in “gangs.” The label conjures images of hackers furiously tapping away at glowing workstations in a shadowy warehouse. But the work of the ransomware economy today is more likely to take place in a boardroom than a back alley. Cybercriminals have graduated from gangs to highly complex organized crime syndicates that operate ransomware brands as part of a sophisticated business model.

Operators of these syndicates are just as likely to be worrying about user experience and customer service as they are with building malicious code. A look at the branding on display on some syndicates’ leak sites makes the case plain that these groups are more than a collective of expert coders—they’re savvy businesspeople.

images of ransomware gang marketing
Source: Bleepingcomputer.com.

Ransomware operators are often synonymous with the software variant they brand, deploy, and sell. Many have rebranded over the years or splintered into affiliated organizations. Some of the top ransomware brands operating today, along with high profile attacks they have carried out, are shown in the infographic below:

infographic of top ransomware brands

The groups shown above do not constitute an exhaustive list. In June 2021, FBI Director Christopher Wray stated that the FBI was investigating 100 different ransomware variants, and new ones pop up every day. While some brands have existed for years (Ryuk, for example), the list is also likely obsolete as soon as it’s published. Ransomware brands bubble up, go bust, and reorganize, changing with the cybersecurity tides.

Chainalysis, a blockchain data platform, published their Ransomware 2021: Critical Mid-year Update that shows just how much brands fluctuate year to year and, they note, even month to month:

Top 10 ransomware strains by revenue by year, 2014-2021 Q1
Source: Chainalysis.

How Ransomware Syndicates Operate

Ransomware operators may appear to be single entities, but there is a complex ecosystem of suppliers and ancillary providers behind them that exchange services with each other on the dark web. The flowchart below illustrates all the players and how they interact:

diagram of ransomware syndicate workflow

Dark Web Service Providers

Cybercrime “gangs” could once be tracked down and caught like the David Levi Phishing Gang that was investigated and prosecuted in 2005. Today’s decentralized ecosystem, however, makes going after ransomware operators all the more difficult. These independent entities may never interact with each other outside of the dark web where they exchange services for cryptocurrency:

  • Botmasters: Create networks of infected computers and sell access to those compromised devices to threat actors.
  • Access Sellers: Take advantage of publicly disclosed vulnerabilities to infect servers before the vulnerabilities are remedied, then advertise and sell that access to threat actors.

ad for ransomware syndicate
Advertisement from an access seller for access to an organization’s RDP. Source: Threatpost.

  • Operators: The entity that actually carries out the attack with access purchased from botmasters or access sellers and software purchased from developers or developed in-house. May employ a full staff, including customer service, IT support, marketing, etc., depending on how sophisticated the syndicate is.
  • Developers: Write the ransomware software and sell it to threat actors for a cut of the ransom.
  • Packer Developers: Add protection layers to the software, making it harder to detect.
  • Analysts: Evaluate the victim’s financial health to advise on ransom amounts that they’re most likely to pay.
  • Affiliates: Purchase ransomware as a service from operators/developers, who get a cut of the ransom.
  • Negotiating Agents: Handle interactions with victims.
  • Laundering Services: Exchange cryptocurrency for fiat currency on exchanges or otherwise transform ransom payments into usable assets.

Victim-side Service Providers

Beyond the collection of entities directly involved in the deployment of ransomware, the broader ecosystem includes other players on the victim’s side, who, for better or worse, stand to profit off of ransomware attacks. These include:

  • Incident response firms: Consultants who assist victims in response and recovery.
  • Ransomware brokers: Brought in to negotiate and handle payment on behalf of the victim and act as intermediaries between the victim and operators.
  • Insurance providers: Cover victims’ damages in the event of an attack.
  • Legal counsel: Often manage the relationship between the broker, insurance provider, and victim, and advise on ransom payment decision-making.

Are Victim-side Providers Complicit?

While these providers work on behalf of victims, they also perpetuate the cycle of ransomware. For example, insurance providers that cover businesses in the event of a ransomware attack often advise their customers to pay the ransom if they think it will minimize downtime as the cost of extended downtime can far exceed the cost of a ransom payment. This becomes problematic for a few reasons:

  • First, paying the ransom incentivizes cybercriminals to continue plying their trade.
  • Second, as Colonial Pipeline discovered, the decryption tools provided by cybercriminals in exchange for ransom payments aren’t to be trusted. More than a month after Colonial paid the $4.4 million ransom and received a decryption tool from the hackers, CEO Joseph Blount testified before Congress that recovery from the attack was still not complete. After all that, they had to rely on recovering from their backups anyway.

The Emergence of Ransomware as a Service

In the ransomware economy, operators and their affiliates are the threat actors that carry out attacks. This affiliate model where operators sell ransomware as a service (RaaS) represents one of the biggest threats to SMBs and organizations today.

Cybercrime syndicates realized they could essentially license and sell their tech to affiliates who then carry out their own misdeeds empowered by another criminal’s software. The syndicates, affiliates, and other entities each take a portion of the ransom.

Operators advertise these partner programs on the dark web and thoroughly vet affiliates before bringing them on to filter out law enforcement posing as low-level criminals. One advertisement by the REvil syndicate noted, “No doubt, in the FBI and other special services, there are people who speak Russian perfectly, but their level is certainly not the one native speakers have. Check these people by asking them questions about the history of Ukraine, Belarus, Kazakhstan or Russia, which cannot be googled. Authentic proverbs, expressions, etc.”

Ransomware as a service ad
Advertisement for ransomware affiliates. Source: Kaspersky.

Though less sophisticated than some of the more notorious viruses, these “as a service” variants enable even amateur cybercriminals to carry out attacks. And they’re likely to carry those attacks out on the easiest prey—small businesses who don’t have the resources to implement adequate protections or weather extended downtime.

Hoping to increase their chances of being paid, low-level threat actors using RaaS have typically demanded smaller ransoms of under $100,000, but that trend is changing. Coveware reported in August 2020 that affiliates are getting bolder in their demands. They reported the first six-figure payments to the Dharma ransomware group, an affiliate syndicate, in Q2 2020.

The one advantage savvy business owners have when it comes to RaaS: attacks are high volume (carried out against many thousands of targets) but low quality and easily identifiable by the time they are widely distributed. By staying on top of antivirus protections and detection, business owners can increase their chances of catching the attacks before it’s too late.

The Financial Side of the Ransomware Economy

So, how much money do ransomware crime syndicates actually make? The short answer is that it’s difficult to know because so many ransomware attacks go unreported. To get some idea of the size of the ransomware economy, analysts have to do some sleuthing.

Chainalysis tracks transactions to blockchain addresses linked to ransomware attacks in order to capture the size of ransomware revenues. In their regular reporting on the cybercrime cryptocurrency landscape, they showed that the total amount paid by ransomware victims increased by 311% in 2020 to reach nearly $350 million worth of cryptocurrency. In May, they published an update after identifying new ransomware addresses that put the number over $406 million. They expect the number will only continue to grow.

Total cryptocurrency value received by ransomware addresses, 2016-2021 (YTD)
Source: Chainalysis.

Similarly, threat intel company Advanced Intelligence and cybersecurity firm HYAS tracked Bitcoin transactions to 61 addresses associated with the Ryuk syndicate. They estimate that the operator may be worth upwards of $150 million alone. Their analysis sheds some light on how ransomware operators turn their exploits and the ransoms paid into usable cash.

Extorted funds are gathered in holding accounts, passed to money laundering services, then either funneled back into the criminal market and used to pay for other criminal services or cashed out at real cryptocurrency exchanges. The process follows these steps, as illustrated below:

  • The victim pays a broker.
  • The broker converts the cash into cryptocurrency.
  • The broker pays the ransomware operator in cryptocurrency.
  • The ransomware operator sends the cryptocurrency to a laundering service.
  • The laundering service exchanges the coins for fiat currency on cryptocurrency exchanges like Binance and Huobi.
diagram of ransomware payment flow
Source: AdvIntel.

In an interesting development, the report found that Ryuk actually bypassed laundering services and cashed out some of their own cryptocurrency directly on exchanges using stolen identities—a brash move for any organized crime operation.

Protecting Your Company From Ransomware

Even though the ransomware economy is ever-changing, having an awareness of where attacks come from and the threats you’re facing can prepare you if you ever face one yourself. To summarize:

  • Ransomware operators may seem to be single entities, but there’s a broad ecosystem of players behind them that trade services on the dark web.
  • Ransomware operators are sophisticated business entities.
  • RaaS enables even low-level criminals to get in the game.
  • Ransomware operators raked in at least $406 million in 2020, and likely more than that, as many ransomware attacks and payments go unreported.

We put this post together not to trade in fear, but to prepare SMBs and organizations with information in the fight against ransomware. And, you don’t have to fight it alone. Download our Complete Guide to Ransomware E-book and Guide for even more intel on ransomware today, plus steps to take to defend against ransomware, and how to respond if you do fall victim to an attack.

The post Introducing the Ransomware Economy appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Top 10 security best practices for securing data in Amazon S3

Post Syndicated from Megan O'Neil original https://aws.amazon.com/blogs/security/top-10-security-best-practices-for-securing-data-in-amazon-s3/

With more than 100 trillion objects in Amazon Simple Storage Service (Amazon S3) and an almost unimaginably broad set of use cases, securing data stored in Amazon S3 is important for every organization. So, we’ve curated the top 10 controls for securing your data in S3. By default, all S3 buckets are private and can be accessed only by users who are explicitly granted access through ACLs, S3 bucket policies, and identity-based policies. In this post, we review the latest S3 features and Amazon Web Services (AWS) services that you can use to help secure your data in S3, including organization-wide preventative controls such as AWS Organizations service control policies (SCPs). We also provide recommendations for S3 detective controls, such as Amazon GuardDuty for S3, AWS CloudTrail object-level logging, AWS Security Hub S3 controls, and CloudTrail configuration specific to S3 data events. In addition, we provide data protection options and considerations for encrypting data in S3. Finally, we review backup and recovery recommendations for data stored in S3. Given the broad set of use cases that S3 supports, you should determine the priority of controls applied in accordance with your specific use case and associated details.

Block public S3 buckets at the organization level

Designate AWS accounts for public S3 use and prevent all other S3 buckets from inadvertently becoming public by enabling S3 Block Public Access. Use Organizations SCPs to confirm that the S3 Block Public Access setting cannot be changed. S3 Block Public Access provides a level of protection that works at the account level and also on individual buckets, including those that you create in the future. You have the ability to block existing public access—whether it was specified by an ACL or a policy—and to establish that public access isn’t granted to newly created items. This allows only designated AWS accounts to have public S3 buckets while blocking all other AWS accounts. To learn more about Organizations SCPs, see Service control policies.
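
As a sketch, an SCP along the following lines can help prevent the account-level Block Public Access configuration from being changed (this is an illustrative assumption, not a drop-in policy; scope and test it against your own organization structure first):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyChangesToS3BlockPublicAccess",
      "Effect": "Deny",
      "Action": "s3:PutAccountPublicAccessBlock",
      "Resource": "*"
    }
  ]
}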

Use bucket policies to verify all access granted is restricted and specific

Check that the access granted in the Amazon S3 bucket policy is restricted to specific AWS principals, federated users, service principals, IP addresses, or VPCs that you provide. A bucket policy that allows a wildcard identity such as Principal “*” can potentially be accessed by anyone. A bucket policy that allows a wildcard action “*” can potentially allow a user to perform any action in the bucket. For more information, see Using bucket policies.
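
For illustration, a bucket policy like the following grants access only to a specific role and only through a specific VPC endpoint (the role ARN, bucket name, and endpoint ID are hypothetical placeholders):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAppRoleFromVpcEndpointOnly",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:role/app-role" },
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::DOC-EXAMPLE-BUCKET/*",
      "Condition": {
        "StringEquals": { "aws:SourceVpce": "vpce-0abc1234567890def" }
      }
    }
  ]
}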

Ensure that any identity-based policies don’t use wildcard actions

Identity policies are policies assigned to AWS Identity and Access Management (IAM) users and roles and should follow the principle of least privilege to help prevent inadvertent access or changes to resources. Establishing least privilege identity policies includes defining specific actions such as S3:GetObject or S3:PutObject instead of S3:*. In addition, you can use predefined AWS-wide condition keys and S3‐specific condition keys to specify additional controls on specific actions. An example of an AWS-wide condition key commonly used for S3 is IpAddress: { aws:SourceIp: “10.10.10.10”}, where you can specify your organization’s internal IP space for specific actions in S3. See IAM.1 in Monitor S3 using Security Hub and CloudWatch Logs for detecting policies with wildcard actions and wildcard resources present in your accounts with Security Hub.

Consider splitting read, write, and delete access. Allow only write access to users or services that generate and write data to S3 but don’t need to read or delete objects. Define an S3 lifecycle policy to remove objects on a schedule instead of through manual intervention (see Managing your storage lifecycle). This allows you to remove delete actions from your identity-based policies. Verify your policies with the IAM policy simulator. Use IAM Access Analyzer to help you identify, review, and design S3 bucket policies or IAM policies that grant access to your S3 resources from outside of your AWS account.
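
Putting those ideas together, a write-only identity policy scoped to a prefix and your internal IP space might look like the following sketch (the bucket name, prefix, and CIDR range are illustrative placeholders):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "WriteOnlyFromCorporateNetwork",
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::DOC-EXAMPLE-BUCKET/uploads/*",
      "Condition": {
        "IpAddress": { "aws:SourceIp": "10.10.0.0/16" }
      }
    }
  ]
}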

Enable S3 protection in GuardDuty to detect suspicious activities

In 2020, GuardDuty announced coverage for S3. Turning this on enables GuardDuty to continuously monitor and profile S3 data access events (data plane operations) and S3 configuration (control plane APIs) to detect suspicious activities, such as requests coming from unusual geolocations, disabling of preventative controls, and API call patterns consistent with an attempt to discover misconfigured bucket permissions. To achieve this, GuardDuty uses a combination of anomaly detection, machine learning, and continuously updated threat intelligence. To learn more, including how to enable GuardDuty for S3, see Amazon S3 protection in Amazon GuardDuty.
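
If you prefer the CLI over the console, enabling the S3 data source on an existing detector looks roughly like this (the detector ID is a placeholder; verify the current syntax against the GuardDuty documentation):

# find the detector ID for the current Region
aws guardduty list-detectors

# enable S3 protection on that detector
aws guardduty update-detector \
    --detector-id 12abc34d567e8fa901bc2d34e56789f0 \
    --data-sources '{"S3Logs":{"Enable":true}}'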

Use Macie to scan for sensitive data outside of designated areas

In May of 2020, AWS re-launched Amazon Macie. Macie is a fully managed service that helps you discover and protect your sensitive data by using machine learning to automatically review and classify your data in S3. Enabling Macie organization wide is a straightforward and cost-efficient method for you to get a central, continuously updated view of your entire organization’s S3 environment and monitor your adherence to security best practices through a central console. Macie continually evaluates all buckets for encryption and access control, alerting you of buckets that are public, unencrypted, or shared or replicated outside of your organization. Macie evaluates sensitive data using a fully-managed list of common sensitive data types and custom data types you create, and then issues findings for any object where sensitive data is found.
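
A minimal CLI sketch for turning Macie on (the delegated administrator account ID is a placeholder; the second command applies only if you use AWS Organizations):

# enable Macie in the current account and Region
aws macie2 enable-macie

# optionally delegate a Macie administrator account for the organization
aws macie2 enable-organization-admin-account --admin-account-id 111122223333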

Encrypt your data in S3

There are four options for encrypting data in S3, including client-side and server-side options. With server-side encryption, S3 encrypts your data at the object level as it writes it to disks in AWS data centers and decrypts it when you access it. As long as you authenticate your request and you have access permissions, there is no difference in the way you access encrypted or unencrypted objects.

Two of these options use AWS Key Management Service (AWS KMS). AWS KMS lets you create and manage cryptographic keys and control their use across a wide range of AWS services and their applications. There are options for managing which encryption key AWS uses to encrypt your S3 data.

  • Server-side encryption with Amazon S3-managed encryption keys (SSE-S3). When you use SSE-S3, each object is encrypted with a unique key that’s managed by AWS. This option enables you to encrypt your data by checking a box with no additional steps. The encryption and decryption are handled for you transparently. SSE-S3 is a convenient and cost-effective option.
  • Server-side encryption with customer master keys (CMKs) stored in AWS KMS (SSE-KMS) is similar to SSE-S3, but with some additional benefits and costs compared to SSE-S3. There are separate permissions for the use of a CMK that provide added protection against unauthorized access of your objects in S3. SSE-KMS also provides you with an audit trail that shows when your CMK was used and by whom. SSE-KMS gives you control of the key access policy, which might provide you with more granular control depending on your use case (a CLI sketch for setting SSE-KMS as the bucket default follows this list).
  • In server-side encryption with customer-provided keys (SSE-C), you manage the encryption keys and S3 manages the encryption as it writes to disks and decryption when you access your objects. This option is useful if you need to provide and manage your own encryption keys. Keep in mind that you are responsible for the creation, storage, and tracking of the keys used to encrypt each object, and AWS has no ability to recover customer-provided keys if they’re lost. The major thing to account for with SSE-C is that you must provide the key every time you PUT or GET an object.
  • Client-side encryption is another option to encrypt your data in S3. You can use a CMK stored in AWS KMS or use a master key that you store within your application. Client-side encryption means that you encrypt the data before you send it to AWS and that you decrypt it after you retrieve it from AWS. AWS doesn’t manage your keys and isn’t responsible for encryption or decryption. Usually, client-side encryption needs to be deeply embedded into your application to work.
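
As referenced in the SSE-KMS item above, here is a hedged CLI sketch that sets SSE-KMS as the default encryption for a bucket (the bucket name and key ARN are placeholders):

aws s3api put-bucket-encryption \
    --bucket DOC-EXAMPLE-BUCKET \
    --server-side-encryption-configuration '{
      "Rules": [{
        "ApplyServerSideEncryptionByDefault": {
          "SSEAlgorithm": "aws:kms",
          "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"
        },
        "BucketKeyEnabled": true
      }]
    }'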

Protect data in S3 from accidental deletion using S3 Versioning and S3 Object Lock

Amazon S3 is designed for durability of 99.999999999 percent of objects across multiple Availability Zones, is resilient against events that impact an entire zone, and is designed for 99.99 percent availability over a given year. In many cases, when it comes to strategies to back up your data in S3, it’s about protecting buckets and objects from accidental deletion, in which case S3 Versioning can be used to preserve, retrieve, and restore every version of every object stored in your buckets. S3 Versioning lets you keep multiple versions of an object in the same bucket and can help you recover objects from accidental deletion or overwrite. Keep in mind that this feature has associated costs. You may consider S3 Versioning in selective scenarios such as S3 buckets that store critical backup data or sensitive data.
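
A hedged CLI sketch for enabling S3 Versioning on an existing bucket (the bucket name is a placeholder):

aws s3api put-bucket-versioning \
    --bucket DOC-EXAMPLE-BUCKET \
    --versioning-configuration Status=Enabled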

With S3 Versioning enabled on your S3 buckets, you can optionally add another layer of security by configuring a bucket to enable multi-factor authentication (MFA) delete. With this configuration, the bucket owner must include two forms of authentication in any request to delete a version or to change the versioning state of the bucket.

S3 Object Lock is a feature that helps you mitigate data loss by storing objects using a write-once-read-many (WORM) model. By using Object Lock, you can prevent an object from being overwritten or deleted for a fixed time or indefinitely. Keep in mind that there are specific use cases for Object Lock, including scenarios where it is imperative that data is not changed or deleted after it has been written.
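
An illustrative sketch of enabling Object Lock (it must be enabled at bucket creation; the names and the 30-day COMPLIANCE retention below are assumptions for the example):

# Object Lock must be enabled when the bucket is created (this also enables versioning);
# outside us-east-1, also pass --create-bucket-configuration LocationConstraint=<region>
aws s3api create-bucket \
    --bucket DOC-EXAMPLE-LOCKED-BUCKET \
    --object-lock-enabled-for-bucket

# apply a default retention rule to new objects
aws s3api put-object-lock-configuration \
    --bucket DOC-EXAMPLE-LOCKED-BUCKET \
    --object-lock-configuration '{"ObjectLockEnabled":"Enabled","Rule":{"DefaultRetention":{"Mode":"COMPLIANCE","Days":30}}}'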

Enable logging for S3 using CloudTrail and S3 server access logging

Amazon S3 is integrated with CloudTrail. CloudTrail captures a subset of API calls, including calls from the S3 console and code calls to the S3 APIs. In addition, you can enable CloudTrail data events for all your buckets or for a list of specific buckets. Keep in mind that a very active S3 bucket can generate a large amount of log data and increase CloudTrail costs. If cost is a concern, consider enabling this additional logging only for S3 buckets with critical data.

Server access logging provides detailed records of the requests that are made to a bucket. Server access logs can assist you in security and access audits.
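
A hedged sketch of both logging options from the CLI (the trail, bucket, and prefix names are placeholders; the data-event selector shown logs object-level activity for a single bucket, and the target bucket for access logs needs the appropriate log delivery permissions):

# enable CloudTrail data events for one bucket on an existing trail
aws cloudtrail put-event-selectors \
    --trail-name my-trail \
    --event-selectors '[{"ReadWriteType":"All","IncludeManagementEvents":true,"DataResources":[{"Type":"AWS::S3::Object","Values":["arn:aws:s3:::DOC-EXAMPLE-BUCKET/"]}]}]'

# enable S3 server access logging to a separate log bucket
aws s3api put-bucket-logging \
    --bucket DOC-EXAMPLE-BUCKET \
    --bucket-logging-status '{"LoggingEnabled":{"TargetBucket":"DOC-EXAMPLE-LOGGING-BUCKET","TargetPrefix":"access-logs/"}}'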

Backup your data in S3

Although S3 stores your data across multiple geographically diverse Availability Zones by default, your compliance requirements might dictate that you store data at even greater distances. Cross-region replication (CRR) allows you to replicate data between distant AWS Regions to help satisfy these requirements. CRR enables automatic, asynchronous copying of objects across buckets in different AWS Regions. For more information on object replication, see Replicating objects. Keep in mind that this feature has associated costs; you might consider CRR in selective scenarios such as S3 buckets that store critical backup data or sensitive data.

Monitor S3 using Security Hub and CloudWatch Logs

Security Hub provides you with a comprehensive view of your security state in AWS and helps you check your environment against security industry standards and best practices. Security Hub collects security data from across AWS accounts, services, and supported third-party partner products and helps you analyze your security trends and identify the highest priority security issues.

The AWS Foundational Security Best Practices standard is a set of controls that detect when your deployed accounts and resources deviate from security best practices, and provides clear remediation steps. The controls contain best practices from across multiple AWS services, including S3. We recommend you enable the AWS Foundational Security Best Practices as it includes the following detective controls for S3 and IAM:

IAM.1: IAM policies should not allow full “*” administrative privileges.
S3.1: Block Public Access setting should be enabled
S3.2: S3 buckets should prohibit public read access
S3.3: S3 buckets should prohibit public write access
S3.4: S3 buckets should have server-side encryption enabled
S3.5: S3 buckets should require requests to use Secure Socket layer
S3.6: Amazon S3 permissions granted to other AWS accounts in bucket policies should be restricted
S3.8: S3 Block Public Access setting should be enabled at the bucket level

For details of each control, including remediation steps, please review the AWS Foundational Security Best Practices controls.

If there is a specific S3 API activity not covered above that you’d like to be alerted on, you can use CloudTrail Logs together with Amazon CloudWatch for S3 to do so. CloudTrail integration with CloudWatch Logs delivers S3 bucket-level API activity captured by CloudTrail to a CloudWatch log stream in the CloudWatch log group that you specify. You create CloudWatch alarms for monitoring specific API activity and receive email notifications when the specific API activity occurs.
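
As an illustrative sketch (the log group, filter pattern, and SNS topic are assumptions; adapt them to the API activity you care about), alerting on DeleteBucket calls could look like this:

# create a metric filter over the CloudTrail log group
aws logs put-metric-filter \
    --log-group-name CloudTrail/logs \
    --filter-name S3DeleteBucket \
    --filter-pattern '{ ($.eventSource = "s3.amazonaws.com") && ($.eventName = "DeleteBucket") }' \
    --metric-transformations metricName=S3DeleteBucketCount,metricNamespace=CloudTrailMetrics,metricValue=1

# alarm on the new metric and notify an SNS topic
aws cloudwatch put-metric-alarm \
    --alarm-name s3-delete-bucket-alarm \
    --metric-name S3DeleteBucketCount \
    --namespace CloudTrailMetrics \
    --statistic Sum \
    --period 300 \
    --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --evaluation-periods 1 \
    --alarm-actions arn:aws:sns:us-east-1:111122223333:security-alerts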

Conclusion

By using the ten practices described in this blog post, you can build strong protection mechanisms for your data in Amazon S3, including least privilege access, encryption of data at rest, blocking public access, logging, monitoring, and configuration checks.

Depending on your use case, you should consider additional protection mechanisms. For example, there are security-related controls available for large shared datasets in S3 such as Access Points, which you can use to decompose one large bucket policy into separate, discrete access point policies for each application that needs to access the shared data set. To learn more about S3 security, see Amazon S3 Security documentation.

Now that you’ve reviewed the top 10 security best practices to make your data in S3 more secure, make sure you have these controls set up in your AWS accounts—and go build securely!

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon S3 forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Megan O’Neil

Megan is a Senior Specialist Solutions Architect focused on threat detection and incident response. Megan and her team enable AWS customers to implement sophisticated, scalable, and secure solutions that solve their business challenges.

Author

Temi Adebambo

Temi is the Senior Manager for the America’s Security and Network Solutions Architect team. His team is focused on working with customers on cloud migration and modernization, cybersecurity strategy, architecture best practices, and innovation in the cloud. Before AWS, he spent over 14 years as a consultant, advising CISOs and security leaders at some of the largest global enterprises.

Active Exploitation of Confluence Server CVE-2021-26084

Post Syndicated from Caitlin Condon original https://blog.rapid7.com/2021/09/02/active-exploitation-of-confluence-server-cve-2021-26084/

Active Exploitation of Confluence Server CVE-2021-26084

On August 25, 2021, Atlassian published details on CVE-2021-26084, a critical remote code execution vulnerability in Confluence Server and Confluence Data Center. The vulnerability arises from an OGNL injection flaw and allows authenticated attackers, “and in some instances an unauthenticated user,” to execute arbitrary code on Confluence Server or Data Center instances.

The vulnerable endpoints can be accessed by a non-administrator user or unauthenticated user if “Allow people to sign up to create their account” is enabled. To check whether this is enabled, go to COG > User Management > User Signup Options. The affected versions are before version 6.13.23, from version 6.14.0 before 7.4.11, from version 7.5.0 before 7.11.6, and from version 7.12.0 before 7.12.5.

Proof-of-concept exploit code has been publicly available since August 31, 2021, and active exploitation has been reported as of September 2. Confluence Server and Data Center customers who have not already done so should update to a fixed version immediately, without waiting for their typical patch cycles. For a complete list of fixed versions, see Atlassian’s advisory here.

For full vulnerability analysis, including triggers and check information, see Rapid7’s analysis in AttackerKB.

Rapid7 customers

InsightVM and Nexpose customers can assess their exposure to CVE-2021-26084 with remote vulnerability checks as of the August 26, 2021 content release.


New – Amazon EFS Intelligent-Tiering Optimizes Costs for Workloads with Changing Access Patterns

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/new-amazon-efs-intelligent-tiering-optimizes-costs-for-workloads-with-changing-access-patterns/

Amazon Elastic File System (Amazon EFS) offers four storage classes: two Standard storage classes, Amazon EFS Standard and Amazon EFS Standard-Infrequent Access (EFS Standard-IA), and two One Zone storage classes, Amazon EFS One Zone and Amazon EFS One Zone-Infrequent Access (EFS One Zone-IA). Standard storage classes store data within and across multiple Availability Zones (AZs). One Zone storage classes store data redundantly within a single AZ, at a 47 percent lower price compared to file systems using Standard storage classes, for workloads that don’t require multi-AZ resilience.

The EFS Standard and EFS One Zone storage classes are performance-optimized to deliver lower latency. The Infrequent Access (IA) storage classes are cost-optimized for files that are not accessed every day. With EFS lifecycle management, you can move files that have not been accessed for the duration of the lifecycle policy (7, 14, 30, 60, or 90 days) to the IA storage classes. This will reduce the cost of your storage by up to 92 percent compared to EFS Standard and EFS One Zone storage classes respectively.

Customers love the cost savings provided by the IA storage classes, but they also want to ensure that they won’t get unexpected data access charges if access patterns change and files that have transitioned to IA are accessed frequently. Reading from or writing data to the IA storage classes incurs a data access charge for every access.

Today, we are launching Amazon EFS Intelligent-Tiering, a new EFS lifecycle management feature that automatically optimizes costs for shared file storage when data access patterns change, without operational overhead.

With EFS Intelligent-Tiering, lifecycle management monitors the access patterns of your file system and moves files that have not been accessed for the duration of the lifecycle policy from EFS Standard or EFS One Zone to EFS Standard-IA or EFS One Zone-IA, depending on whether your file system uses EFS Standard or EFS One Zone storage classes. If the file is accessed again, it is moved back to EFS Standard or EFS One Zone storage classes.

EFS Intelligent-Tiering optimizes your costs even if your workload file data access patterns change. You’ll never have to worry about unbounded data access charges because you only pay for data access charges for transitions between storage classes.

Getting started with EFS Intelligent-Tiering
To get started with EFS Intelligent-Tiering, create a file system using the AWS Management Console, enable lifecycle management, and set two lifecycle policies.

Choose a Transition into IA option to move infrequently accessed files to the IA storage classes. From the drop-down list, you can choose lifecycle policies of 7, 14, 30, 60, or 90 days. Additionally, choose a Transition out of IA option and select On first access to move files back to EFS Standard or EFS One Zone storage classes on access.

For an existing file system, you can click the Edit button on your file system to enable or change lifecycle management and EFS Intelligent-Tiering.

Also, you can use the PutLifecycleConfiguration API action or put-lifecycle-configuration command specifying the file system ID of the file system for which you are enabling lifecycle management and the two policies for EFS Intelligent-Tiering.

$ aws efs put-lifecycle-configuration \
    --file-system-id File-System-ID \
    --lifecycle-policies '[{"TransitionToIA":"AFTER_30_DAYS"},{"TransitionToPrimaryStorageClass":"AFTER_1_ACCESS"}]' \
    --region us-west-2 \
    --profile adminuser

You get the following response:

{
  "LifecyclePolicies": [
    {
        "TransitionToIA": "AFTER_30_DAYS"
    },
    {
        "TransitionToPrimaryStorageClass": "AFTER_1_ACCESS"
    }
  ]
}

To disable EFS Intelligent-Tiering, set both the Transition into IA and Transition out of IA options to None. This will disable lifecycle management, and your files will remain on the storage class they’re on.

Any files that have already started to move between storage classes at the time that you disabled EFS Intelligent-Tiering will complete moving to their new storage class. You can disable transition policies independently of each other.
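
If you manage the file system from the CLI instead, a hedged sketch of disabling both policies is a PutLifecycleConfiguration call with an empty policy list (the file system ID is a placeholder):

$ aws efs put-lifecycle-configuration \
    --file-system-id File-System-ID \
    --lifecycle-policies "[]"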

For more information, see Amazon EFS lifecycle management in the Amazon EFS User Guide.

Now Available
Amazon EFS Intelligent-Tiering is available in all AWS Regions where Amazon EFS is available. To learn more, join us for the third annual and completely free-to-attend AWS Storage Day 2021 and tune in to our livestream on the AWS Twitch channel today.

You can send feedback to the AWS forum for Amazon EFS or through your usual AWS Support contacts.

Channy

How to Accelerate Performance and Availability of Multi-region Applications with Amazon S3 Multi-Region Access Points

Post Syndicated from Alex Casalboni original https://aws.amazon.com/blogs/aws/s3-multi-region-access-points-accelerate-performance-availability/

Building multi-region applications allows you to improve latency for end users, achieve higher availability and resiliency in case of unexpected disasters, and adhere to business requirements related to data durability and data residency. For example, you might want to reduce the overall latency of dynamic API calls to your backend services. Or you might want to extend a single-region deployment to handle internet routing issues, failures of submarine cables, or regional connectivity issues – and therefore avoid costly downtime. Today, thanks to multi-region data replication functions such as Amazon DynamoDB global tables, Amazon Aurora global database, Amazon ElastiCache global datastore, and Amazon Simple Storage Service (Amazon S3) cross-region replication, you can build multi-region applications across 25 AWS Regions worldwide.

Yet, when it comes to implementing multi-region applications, you often have to make your code region-aware and take care of the heavy lifting of interacting with the correct regional resources, whether it’s the closest or the most available. For example, you might have three S3 buckets with object replication across three AWS Regions. Your application code needs to be aware of how many copies of the bucket exist and where they are located, which bucket is the closest to the caller, and how to fall back to other buckets in case of issues. The complexity grows when you add new regions to your multi-region architecture and redeploy your stack in each region whenever a global configuration changes.

Today, I’m happy to announce the general availability of Amazon S3 Multi-Region Access Points, a new S3 feature that allows you to define global endpoints that span buckets in multiple AWS Regions. With S3 Multi-Region Access Points, you can build multi-region applications with the same simple architecture used in a single region.

S3 Multi-Region Access Points deliver built-in network resilience, building on top of AWS Global Accelerator to route S3 requests over the AWS global network. This is especially important to minimize network congestion and overall latency, while maintaining a simple application architecture. AWS Global Accelerator constantly monitors for regional availability and can shift requests to another region within seconds. By dynamically routing your requests to the lowest latency copy of your data, S3 Multi-Region Access Points increase upload and download performance by up to 60%. This is great not just for server-side applications that rely on S3 for reading configuration files or application data, but also for edge applications that need a performant and reliable write-only endpoint, such as IoT devices or autonomous vehicles.

S3 Multi-Region Access Points in Action
To get started, you create an S3 Multi-Region Access Point in the S3 console, via API, or with AWS CloudFormation.

Let me show you how to create one using the S3 console. Each access point needs a name, unique at the account level.

After it’s created, you can access it through its alias, which is generated automatically and globally unique. The alias will look like a random string ending with .mrap – for example, mmqdt41e4bf6x.mrap. It can also be accessed over the internet via https://mmqdt41e4bf6x.mrap.s3-global.amazonaws.com, via VPC, or on-premises using AWS PrivateLink.

Then, you associate multiple buckets (new or existing) to the access point, one per Region. If you need data replication, you’ll need to enable bucket versioning too.

Finally, you configure the Block Public Access settings for the access point. By default, all public access is blocked, which works fine for most cases.

The creation process is asynchronous; you can view the creation status in the Console or by listing the S3 Multi-Region Access Points from the CLI. When it becomes Ready, you can configure optional settings for the access point policy and object replication.
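
For example, a hedged CLI sketch for checking status (the account ID and access point name are placeholders):

# list all Multi-Region Access Points in the account and their status
aws s3control list-multi-region-access-points --account-id 111122223333

# or inspect a single access point by name
aws s3control get-multi-region-access-point --account-id 111122223333 --name my-access-point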

Similar to regular access points, you can customize the access control policy to limit the use of the access point with respect to the bucket’s permission. Keep in mind that both the access point and the underlying buckets must permit a request. S3 Multi-Region Access Points cannot extend the permissions, just limit (or equal) them. You can also use IAM Access Analyzer to verify public and cross-account access for buckets that use S3 Multi-Region Access Points and preview access to your buckets before deploying permissions changes.

Your S3 Multi-Region Access Point access policy might look like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Default",
      "Effect": "Allow",
      "Principal": {
        "AWS": "YOUR_ACCOUNT_ID" 
      },
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3::YOUR_ACCOUNT_ID:accesspoint/YOUR_ALIAS/object/*"
    }
   ]
}

To replicate data between buckets used with your S3 Multi-Region Access Point, you configure S3 Replication. In some cases, you might want to store different content in each bucket, or have a portion of a regional bucket for use with a global endpoint and other portions that aren’t replicated and used only with a regional access point or direct bucket access. For example, an IoT device configuration might include references to other regional API endpoints or regional resources that will be different for each bucket.

The new S3 console provides two basic templates that you can use to easily and centrally create replication rules:

  • Replicate objects from one or more source buckets to one or more destination buckets: This is ideal for read-only use cases where data is always generated in a specific AWS Region and you want it to be available in all other Regions, too.
  • Replicate objects among all specified buckets: This is ideal for the IoT scenario I mentioned, where you’d define a write-only access point that devices use to upload data to the closest region, and you need this data to be available in all regions.

Of course, thanks to filters and conditions, you can create more sophisticated replication setups. For example, you might want to replicate only certain objects based on a prefix or tags.

Keep in mind that bucket versioning must be enabled for cross-region replication.

The console will take care of creating and configuring the replication rules and IAM roles. Note that to add or remove buckets, you would create a new S3 Multi-Region Access Point with the revised list.

In addition to the replication rules, here is where you configure replication options such as Replication Time Control (RTC), replication metrics and notifications, and bidirectional sync. RTC allows you to replicate most new objects in seconds, and 99.99% of those objects within 15 minutes, for use cases where replication speed is important; replication metrics allow you to monitor how synchronized your buckets are in terms of object and byte count; bidirectional sync allows you to achieve an active-active configuration for put-heavy use cases in which object metadata needs to be replicated across buckets too.

After replication is configured, you get a very useful visual and interactive summary that allows you to verify which AWS Regions are enabled. You’ll see where they are on the map, the name of the regional buckets, and which replication rules are being applied.

After the S3 Multi-Region Access Point is defined and correctly configured, you can start interacting with it through the S3 API, AWS CLI, or the AWS SDKs. For example, this is how you’d write and read a new object using the CLI (don’t forget to upgrade to the latest CLI version):

# create a new object
aws s3api put-object --bucket arn:aws:s3::YOUR_ACCOUNT_ID:accesspoint/YOUR_ALIAS --key test.png --body test.png
# retrieve the same object
aws s3api get-object --bucket arn:aws:s3::YOUR_ACCOUNT_ID:accesspoint/YOUR_ALIAS --key test.png test.png

Last but not least, you can use bucket metrics in Amazon CloudWatch to keep track of how user requests are distributed across buckets in multiple AWS Regions.

CloudFormation Support at Launch
Today, you can start using two new CloudFormation resources to easily define an S3 Multi-Region Access Point: AWS::S3::MultiRegionAccessPoint and AWS::S3::MultiRegionAccessPointPolicy.

Here is an example:

Resources:
  MyS3MultiRegionAccessPoint:
    Type: AWS::S3::MultiRegionAccessPoint
    Properties:
      Regions:
        - Bucket: regional-bucket-ireland
        - Bucket: regional-bucket-australia
        - Bucket: regional-bucket-us-east
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        IgnorePublicAcls: true
        BlockPublicPolicy: true
        RestrictPublicBuckets: true
  MyMultiRegionAccessPointPolicy:
    Type: AWS::S3::MultiRegionAccessPointPolicy
    Properties:
      MrapName: !Ref MyS3MultiRegionAccessPoint
      Policy:
        Version: '2012-10-17'
        Statement:
          - Action: '*'
            Effect: Allow
            Resource: !Sub
              - 'arn:aws:s3::${AWS::AccountId}:accesspoint/${mrapalias}/object/*'
              - mrapalias: !GetAtt
                  - MyS3MultiRegionAccessPoint
                  - Alias
            Principal: {"AWS": !Ref "AWS::AccountId"}

The AWS::S3::MultiRegionAccessPoint resource depends only on the S3 bucket names. You don’t need to reference other regional stacks and you can easily centralize the S3 Multi-Region Access Point definition into its own stack. On the other hand, cross-region replication needs to be configured on each S3 bucket.

Cost considerations
When you use an S3 Multi-Region Access Point to route requests within the AWS global network, you pay a data routing cost of $0.0033 per GB processed, in addition to the standard charges for S3 requests, storage, data transfer, and replication. If your applications access the S3 Multi-Region Access Point over the internet, you’re also charged an internet acceleration cost per GB. This cost depends on the transfer type (upload or download) and whether the client and the bucket are in the same or different locations. For details, visit the S3 pricing page and select the data transfer tab.

Let me share a few practical examples:

  • All traffic within an AWS Region: In this simple case, your application runs in US East (N. Virginia) and you configure two S3 buckets in US East (N. Virginia) and US West (Oregon). The application uploads 100GB of data and the lowest latency bucket is in US East(N. Virginia). All the data is routed by your S3 Multi-Region Access Point in the same region and the total cost is $0.33.
  • All traffic across two AWS Regions: In this case, your application runs in US East (N. Virginia) and you configure two S3 buckets in US East (Ohio) and US West (Oregon). The application uploads 100GB of data and the lowest latency bucket is in US East (Ohio). All the data is routed by your S3 Multi-Region Access Point across two AWS Regions. The data routing cost for 100GB is the same as in the previous example ($0.33), plus the S3 data transfer cost of $0.01 per GB, resulting in a total cost of $1.33.
  • All traffic over the internet across North America, Europe, and Asia Pacific (download and upload): In this case, your application runs on customer devices in North America, Europe, and Asia, and you configure two S3 buckets in US East (N. Virginia) and Europe (Ireland). One customer in North America uploads 50GB of data, which is routed to the bucket in US East (N. Virginia); a second customer in Europe downloads 50GB of data from the bucket in Europe (Ireland); a third customer in Asia downloads 50GB of data from the bucket in Europe (Ireland). The data routing cost for 150GB is $0.495. Plus the data transfer out from S3 to Europe of $0.09 per GB ($9), the internet acceleration cost from North America to the S3 bucket in US East (N. Virginia) of $0.0025 per GB ($0.125), the internet acceleration cost from the S3 bucket in Europe (Ireland) to Europe of $0.005 per GB ($0.25), and the internet acceleration cost from the S3 bucket in Europe (Ireland) to Asia of $0.05 per GB ($2.5). The total cost is $12.37. Please note that this example is intended to demonstrate how the internet acceleration cost works across continents. Also note that the internet acceleration cost to Asia might be reduced by an order of magnitude with an additional S3 bucket in Asia (see next example).
  • All the traffic over the internet across North America, Europe, and Asia Pacific (only upload): In this case, we consider the same conditions as the previous example. The only difference is that all customers only upload data and that you configure an additional bucket in Asia Pacific (Singapore). The data routing cost is the same ($0.495). Plus the internet acceleration cost from North America to the S3 bucket in US East (N. Virginia) of $0.0025 per GB ($0.125), the internet acceleration cost from Europe to the S3 bucket in Europe (Ireland) of $0.0025 per GB ($0.125), and the internet acceleration cost from Asia to the S3 bucket in Asia Pacific (Singapore) of $0.01 per GB ($0.5). The total cost is $1.24.

In other words, the routing cost is easy to estimate and doesn’t depend on the application type or data access pattern. The internet acceleration cost depends on the access pattern (downloads are more expensive than uploads) and on the client location with respect to the closest AWS Region. For global applications that upload or download data over the internet, you can minimize the internet acceleration cost by configuring at least one S3 bucket in each continent.

Available Today
Amazon S3 Multi-Region Access Points allow you to increase resiliency and accelerate application performance up to 60% when accessing data across multiple AWS Regions. We look forward to feedback about your use cases so that we can iterate quickly and simplify how you design and implement multi-region applications.

You can get started using the S3 API, CLI, SDKs, AWS CloudFormation or the S3 Console. This new functionality is available in 17 AWS Regions worldwide (see the full list of supported AWS Regions).

Learn More

Watch this video to hear more about S3 Multi-Region Access Points and see a short demo.

Check out the technical documentation for S3 Multi-Region Access Points.

Alex

Amazon S3 Intelligent-Tiering – Improved Cost Optimizations for Short-Lived and Small Objects

Post Syndicated from Sean M. Tracey original https://aws.amazon.com/blogs/aws/amazon-s3-intelligent-tiering-further-automating-cost-savings-for-short-lived-and-small-objects/

In 2018, we first launched Amazon S3 Intelligent-Tiering (S3 Intelligent-Tiering). For customers managing data across business units, teams, and products, unpredictable access patterns are often the norm. With the S3 Intelligent-Tiering storage class, S3 automatically optimizes costs by moving data between access tiers as access patterns change.

Today, we’re pleased to announce two updates to further enhance savings.

  • S3 Intelligent-Tiering now has no minimum storage duration period for all objects.
  • Monitoring and automation charges are no longer collected for objects smaller than 128 KB.

How Does this Benefit Customers?
Amazon S3 Intelligent-Tiering can be used to store shared datasets, where data is aggregated and accessed by different applications, teams, and individuals, whether for analytics, machine learning, real-time monitoring, or other data lake use cases.

An image showing how S3 Intelligent-Tiering optimizes costs by moving objects between access tiers

With these use cases, it's common for many users within an organization to store objects with a wide range of sizes and to delete subsets of that data in less than 30 days.

Until now, S3 Intelligent-Tiering was intended for objects larger than 128 KB that were stored for a minimum of 30 days. As of today, monitoring and automation charges are no longer collected for objects smaller than 128 KB; this applies to both new and existing objects in the S3 Intelligent-Tiering storage class. Additionally, objects that are deleted, transitioned, or overwritten within 30 days will no longer accrue prorated charges.

With these changes, S3 Intelligent-Tiering is the ideal storage class for data with unknown, changing, or unpredictable access patterns, independent of object size or retention period.

How Can I Use This Now?
S3 Intelligent-Tiering can be applied to objects individually as they are written to S3, by setting the Intelligent-Tiering storage class on the PUT request for your object, or through the creation of a lifecycle rule.
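
Here is a short boto3 sketch of both approaches; the bucket name, key, and prefix are placeholders.

import boto3

s3 = boto3.client("s3")

# Option 1: store a single object directly in S3 Intelligent-Tiering by
# setting the storage class on the PUT request.
s3.put_object(
    Bucket="DOC-EXAMPLE-BUCKET",
    Key="logs/2021/09/app.log",
    Body=b"example payload",
    StorageClass="INTELLIGENT_TIERING",
)

# Option 2: a lifecycle rule that transitions objects under a prefix into
# S3 Intelligent-Tiering as soon as they are written.
s3.put_bucket_lifecycle_configuration(
    Bucket="DOC-EXAMPLE-BUCKET",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "to-intelligent-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"},
                ],
            },
        ],
    },
)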

One way you can explore the benefits of S3 Intelligent-Tiering is through the Amazon S3 Console.

Once there, select a bucket you wish to upload an object to and store with the S3 Intelligent-Tiering class, then select the Upload button on the object display view. This will take you to a page where you can upload files or folders to S3.

You can drag and drop or use either the Add Files or Add Folders button to upload objects to your bucket. Once selected, you will see a view like the following image.

A screenshot showing the upload of a file into an S3 bucket.

Next, scroll down the page and expand the Properties section. Here, we can choose the storage class in which we want our object (or objects) to be stored. Select Intelligent-Tiering from the list of storage class options, and then select the Upload button at the bottom of the page.

Screenshot showing the properties section, choosing Intelligent-Tiering as the storage class

Your objects will now be stored in your S3 bucket utilizing the S3 Intelligent-Tiering storage class, further optimizing costs by moving data between access tiers as access patterns change.

S3 Intelligent-Tiering is available in all AWS Regions, including the AWS GovCloud (US) Regions, the AWS China (Beijing) Region, operated by Sinnet, and the AWS China (Ningxia) Region, operated by NWCD. To learn more, visit the S3 Intelligent-Tiering page.

New – Amazon FSx for NetApp ONTAP

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-amazon-fsx-for-netapp-ontap/

Back in 2018 I wrote about the first two members of the Amazon FSx family of fully-managed, highly-reliable, and highly-performant file systems, Amazon FSx for Lustre and Amazon FSx for Windows File Server. Both of these services give you the ability to use popular open source and commercially-licensed file systems without having to deal with hardware provisioning, software configuration, patching, backups, and so forth. Since those launches, we have added many new features to both services in response to your requests:

Amazon FSx for Lustre now supports Persistent file systems with SSD- and HDD-based storage for longer-term storage and workloads, storage capacity scaling, crash-consistent backups, data compression, and storage quotas.

Amazon FSx for Windows File Server now supports many enterprise-ready features including Multi-AZ file systems, self-managed Active Directories, fine-grained file restoration, file access auditing, storage size and capacity throughput scaling, and a low cost HDD storage option.

Because these services support the file access and storage paradigms that are already well understood by Lustre and Windows File Server users, it is easy to migrate existing applications and to fine-tune existing operational regimens when you put them to use. While migration is important, so are new applications! All of the Amazon FSx systems make it easy for you to build applications that need high-performance fully managed storage along with the rich set of features provided by the file systems.

Amazon FSx for NetApp ONTAP
As I often tell you, we are always looking for more ways to meet the needs of our customers. To this end, we are launching Amazon FSx for NetApp ONTAP today. You get the popular features, performance, and APIs of ONTAP file systems with the agility, scalability, security, and resiliency of AWS, making it easier for you to migrate on-premises applications that rely on network-attached storage (NAS) appliances to AWS.

ONTAP (a NetApp product) is an enterprise data management offering designed to provide high-performance storage suitable for use with Oracle, SAP, VMware, Microsoft SQL Server, and so forth. ONTAP is flexible and scalable, with support for multi-protocol access and file systems that can scale up to 176 PiB. It supports a wide variety of features designed to make data management cheaper and easier, including inline data compression, deduplication, compaction, thin provisioning, replication (SnapMirror), and point-in-time cloning (FlexClone).

FSx for ONTAP is fully managed so you can start to enjoy all of these features in minutes. AWS provisions the file servers and storage volumes, manages replication, installs software updates & patches, replaces misbehaving infrastructure components, manages failover, and much more. Whether you are migrating data from your on-premises NAS environment or building brand-new cloud native applications, you will find a lot to like! If you are migrating, you can enjoy all of the benefits of a fully-managed file system while taking advantage of your existing tools, workflows, processes, and operational expertise. If you are building brand-new applications, you can create a cloud-native experience that makes use of ONTAP’s rich feature set. Either way, you can scale to support hundreds of thousands of IOPS and benefit from the continued, behind-the-scenes evolution of the compute, storage, and networking components.

There are two storage tiers, and you can enable intelligent tiering to move data back and forth between them on an as-needed basis:

Primary Storage is built on high performance solid state drives (SSD), and is designed to hold the part of your data set that is active and/or sensitive to latency. You can provision up to 192 TiB of primary storage per file system.

Capacity Pool Storage grows and shrinks as needed, and can scale to pebibytes. It is cost-optimized and designed to hold data that is accessed infrequently.

Within each Amazon FSx for NetApp ONTAP file system you can create one or more Storage Virtual Machines (SVMs), each of which supports one or more Volumes. Volumes can be accessed via NFS, SMB, or as iSCSI LUNs for shared block storage. As you can see from this diagram, you can access each volume from AWS compute services, VMware Cloud on AWS, and from your on-premises applications:

If your on-premises applications are already making use of ONTAP in your own data center, you can easily create an ONTAP file system in the cloud, replicate your data using NetApp SnapMirror, and take advantage of all that Amazon FSx for NetApp ONTAP has to offer.

Getting Started with Amazon FSx for NetApp ONTAP
I can create my first file system from the command line, AWS Management Console, or the NetApp Cloud Manager. I can also make an API call or use a CloudFormation template. I’ll use the Management Console.
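
If you would rather script this step, here is a minimal boto3 sketch of the equivalent CreateFileSystem call; the subnet IDs, storage capacity, and throughput values below are placeholders, and the Quick create flow in the console chooses sensible defaults for most of them.

import boto3

fsx = boto3.client("fsx")

response = fsx.create_file_system(
    FileSystemType="ONTAP",
    StorageCapacity=1024,                 # GiB of SSD primary storage
    StorageType="SSD",
    SubnetIds=[
        "subnet-0123456789abcdef0",       # preferred subnet (placeholder)
        "subnet-0fedcba9876543210",       # standby subnet (placeholder)
    ],
    OntapConfiguration={
        "DeploymentType": "MULTI_AZ_1",   # active and standby across two AZs
        "PreferredSubnetId": "subnet-0123456789abcdef0",
        "ThroughputCapacity": 512,        # MB/s
    },
    Tags=[{"Key": "Name", "Value": "jb-fsx-ontap-1"}],
)
print(response["FileSystem"]["FileSystemId"])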

Each file system runs within a Virtual Private Cloud (VPC), so I start by choosing a VPC and a pair of subnets (preferred and standby). Every SVM has an endpoint in the Availability Zones associated with both of the subnets, with continuous monitoring, automated failover, and automated failback to ensure high availability.

I open the Amazon FSx Console, click Create file system, select Amazon FSx for NetApp ONTAP, and click Next:

I can choose Quick create and use a set of best practices, or Standard create and set all of the options myself. I’ll go for the first option, since I can change all of the configuration options later if necessary. I select Quick create, enter a name for my file system (jb-fsx-ontap-1), and set the storage capacity in GiB. I also choose the VPC, and enable ONTAP’s storage efficiency features:

I confirm all of my choices, and note that this option will also create a Storage Virtual Machine (fsx) and a volume (vol1) for me. Then I click Create file system to “make it so”:

The file system Status starts out as Creating, then transitions to Available within 20 minutes or so:

My first SVM transitions from Pending to Created shortly thereafter, and my first volume transitions from Pending to Created as well. I can click on the SVM to learn more about it and to see the full set of management and access endpoints that it provides:

I can click Volumes in the left-side navigation and see all of my volumes. The root volume (fsx_root) is created automatically and represents all of the storage on the SVM:

I can select a volume and click Attach to get customized instructions for attaching it to an EC2 instance running Linux or Windows:

I can select a volume and then choose Update volume from the Action menu to change the volume’s path, size, storage efficiency, or tiering policy:

To learn more about the tiering policy, read about Amazon FSx for NetApp ONTAP Storage.

I can click Create volume and create additional volumes within any of my file systems:
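
The same thing can be done through the API. Here is a minimal boto3 sketch of creating an additional volume; the Storage Virtual Machine ID, size, and junction path are placeholders.

import boto3

fsx = boto3.client("fsx")

response = fsx.create_volume(
    VolumeType="ONTAP",
    Name="vol2",
    OntapConfiguration={
        "StorageVirtualMachineId": "svm-0123456789abcdef0",  # placeholder SVM ID
        "JunctionPath": "/vol2",            # path clients use to mount the volume
        "SizeInMegabytes": 102400,          # 100 GiB
        "StorageEfficiencyEnabled": True,   # compression, deduplication, compaction
        "TieringPolicy": {"Name": "AUTO"},  # let cold data move to capacity pool storage
    },
)
print(response["Volume"]["VolumeId"])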

There’s a lot more than I have space to show you, so be sure to open up the Console and try it out for yourself.

Things to Know
Here are a couple of things to know about Amazon FSx for NetApp ONTAP:

Regions – The new file system is available in most AWS Regions and in AWS GovCloud (US); check out the AWS Regional Services List for more information.

Pricing – Pricing is based on multiple usage dimensions including the Primary Storage, Capacity Pool Storage, throughput capacity, additional SSD IOPS, and backup storage consumption; consult the Amazon FSx for NetApp ONTAP Pricing page for more information.

Connectivity – You can use AWS Direct Connect to connect your on-premises applications to your new file systems. You can use Transit Gateway to connect to VPCs in other accounts and/or regions.

Availability – As I mentioned earlier, each file system is powered by AWS infrastructure in a pair of Availability Zones. Amazon FSx for NetApp ONTAP automatically replicates data between the zones and monitors the AWS infrastructure, initiating a failover (typically within 60 seconds), and then replacing infrastructure components as necessary. There’s a 99.99% availability SLA for each file system.

Jeff;

[$] 5.15 Merge window, part 1

Post Syndicated from original https://lwn.net/Articles/867821/rss

As of this writing, 3,440 non-merge changesets have been pulled into the mainline repository for the 5.15 development cycle. A mere 3,440 patches may seem like a slow start, but those patches are densely populated with significant new features. Read on for a look at what the first part of the 5.15 merge window has brought.

Fuzzing 100+ open source projects with OSS-Fuzz – lessons learned (ADA Logics blog)

Post Syndicated from original https://lwn.net/Articles/868165/rss

On the ADA Logics blog, David Korczynski and Adam Korczynski write about their work integrating 115 open-source projects with Google’s OSS-Fuzz project for doing continuous fuzz testing. They describe the process of integrating a project into OSS-Fuzz, and discuss their findings, which include more than 2000 bugs (500+ security relevant), of which 1300+ have been fixed at this point:

Throughout the process we integrated projects written in C, C++, Python, Go and Rust and the types of bugs we found across the projects are a reflection of the language the project was written in. Typically, for managed languages the bugs are within the umbrella term of uncaught exceptions and denial of service bugs, whereas in native languages the bugs are mostly split between assert violations, NULL-dereferences, heap-out-of-bounds, stack-out-of-bounds, stack overflows, integer arithmetic, memory leaks, out-of-memory and timeout bugs.
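
To give a concrete sense of what an OSS-Fuzz integration looks like for a managed language, here is a minimal Python harness in the style OSS-Fuzz uses (built on Google's Atheris engine); the parse_config function it exercises is hypothetical.

import sys

import atheris

def parse_config(data: bytes) -> dict:
    """Hypothetical target: a parser that should never raise on arbitrary input."""
    text = data.decode("utf-8", errors="replace")
    return dict(
        line.split("=", 1) for line in text.splitlines() if "=" in line
    )

def TestOneInput(data: bytes) -> None:
    # The fuzzer calls this entry point with mutated byte strings; any
    # uncaught exception is reported as a finding.
    parse_config(data)

def main() -> None:
    atheris.Setup(sys.argv, TestOneInput)
    atheris.Fuzz()

if __name__ == "__main__":
    main()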
